Activation Functions
Smooth ReLU approximation. Shown here with $\beta = 1$.
Again, shown with $\beta = 1$.
Rectified Linear Unit.
(likely 0 or 1 chosen as a subgradient at $x=0$)
Parametric ReLU. $\alpha$ learnable; here shown $\alpha = 0.25$
Again, shown $\alpha = 0.25$
Exponential Linear Unit. $\alpha$ is a fixed hyperparameter; here shown with typical $\alpha = 1$.
Again, plotted with $\alpha = 1$.
Gaussian Error Linear Unit. $\Phi$ is the Gaussian CDF.
$\phi$ is the Gaussian PDF.
Generally $\beta$ may be fixed or trainable, here $\beta = 1$, which is equivalent to the Sigmoid Linear Unit.
$\frac{d}{dx}\text{SiLU}(x) = \sigma(x) + x\sigma'(x)$
The next series of activations are Gated Linear Unit (GLU) variants. GLUs are really layers more than activations; they project an input vector $x$ out into two spaces, $xW + b$ and $xV + c$, using learned matrices $W,V$ and bias vectors $b,c$. Then, one space is passed through some kind of activation $a(xW + b)$, and both spaces are finally element-wise multiplied together: $a(xW + b) \otimes (xV + c)$.
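As a concrete sketch (the class name and dimensions are illustrative, not from any particular library), a GLU layer with the original sigmoid gate might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """Minimal sketch of a GLU layer: sigmoid(xW + b) * (xV + c)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.w = nn.Linear(d_in, d_out)  # xW + b
        self.v = nn.Linear(d_in, d_out)  # xV + c

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.w(x)) * self.v(x)  # element-wise gate
```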
This general formulation leads to variants for the common activations, which all end up with funny names ending in GLU.[1]
We'll continue considering a single weight. All of $W, V, b, c$ below are scalars, with slight abuse of notation. Those four parameters are adjustable below, and the plots that follow react to updates.
Gated Linear Unit.
Comments:
- Vanishing gradients: Nearly-zero gradients, like at either end of a sigmoid or tanh, can cause the gradient to become too small. This effect compounds across layers via the chain rule, so the gradient may vanish by the time it reaches earlier network layers. Remedies: ReLU, batch norm, Xavier initialization.
- Dead ReLUs: If a neuron with ReLU always outputs 0 (for every datum in the dataset), its gradient is always 0, and it can't recover. Remedies: GELU/Swish, batch norm, He initialization. Perhaps surprisingly, leaky variants (leaky ReLU, PReLU, ELU) never really caught on. The reasons seem to be (a) batch norm and He init work well, (b) given that, the extra params (PReLU), hyperparams (Leaky), or computational cost (ELU) aren't worth it, (c) if neurons aren't fully dead, ReLU's 0s are nice because we get sparsity, (d) some dead neurons (~5%) are fine if we're massively over-parameterized. That said, GELU/Swish, and their GLU variants, have replaced ReLU for modern transformers.
General notes on the evolution of activation functions:
| Activation | Used | Notes |
|---|---|---|
| Sigmoid | | OG, but since replaced. Vanishing gradients, not 0-centered, computation ($e^x$). |
| tanh | | Now centered at 0, stronger gradient, but still vanishing gradients. |
| softplus | | Might as well just use ReLU. |
| ReLU | MLP | Gold standard. Simple, keeps gradient when positive, helps with sparsity. Risk: "dead." |
| Leaky ReLU | GANs | Prevents dead neurons, but practitioners prefer to just init correctly rather than fuss with leakiness. |
| PReLU | | Extra parameter(s) not worth the marginal gains. |
| ELU | | Zero-mean outputs. Smooth transition from dead to alive. Again, not worth the effort (here, computational). |
| GELU | | Works well (BERT, GPT-2), though complex. Smooth. Non-monotonic. Nice interpretation: scale $x$ by $P(X \leq x)$ for a standard normal $X$. Got us off ReLUs in NLP. |
| Swish | LLMs | Like GELU, but simpler. Modern simple choice. |
| GLU | | More parameters, learnable gate. Core idea nice, but it'd be even better with… |
| SwiGLU | LLMs | Trainable gating, nice shape, easy to compute. Top modern NLP choice. (Though in Shazeer (2020)'s results, various GLUs win at different tasks.) |
Combinations and Permutations
Derivatives
| $f(x)$ | $f'(x) = \frac{d}{dx}f(x) = \frac{df}{dx}$ |
|---|---|
| Constants, power | |
| $a$ | $0$ |
| $x$ | $1$ |
| $ax$ | $a$ |
| $x^a$ | $ax^{a-1}$ |
| $a f(x)$ | $a f'(x)$ |
| Sum | |
| $af(x) + bg(x)$ | $af'(x) + bg'(x)$ |
| $af(x) - bg(x)$ | $af'(x) - bg'(x)$ |
| Product | |
| $f(x)g(x)$ | $f'(x)g(x) + f(x)g'(x)$ |
| $f(x)g(x)h(x)$ | $f'(x)g(x)h(x) + f(x)g'(x)h(x) + f(x)g(x)h'(x)$ |
| Chain | |
| $f(g(x))$ | $f'(g(x)) g'(x)$ |
| Reciprocal | |
| $\frac{1}{f(x)}$ | $-\frac{f'(x)}{f(x)^2}$ |
| Quotient | |
| $\frac{f(x)}{g(x)}$ | $\frac{f'(x)g(x) - f(x)g'(x)}{g(x)^2}$ |
| Applications | |
| $f(x)^a$ | $af(x)^{a-1}f'(x)$ |
| $\ln x$ | $\frac{1}{x}$ |
| $\ln f(x)$ | $\frac{1}{f(x)}f'(x)$ |
| $e^x$ | $e^x$ |
| $e^{f(x)}$ | $e^{f(x)} f'(x)$ |
| $a^x$ | $(\ln a) a^x$ |
| $a^{f(x)}$ | $(\ln a) a^{f(x)} f'(x)$ |
| $\log_a(x)$ | $\frac{1}{(\ln a)x}$ |
| $\log_a(f(x))$ | $\frac{1}{(\ln a)f(x)}f'(x)$ |
| $\sin(x)$ | $\cos(x)$ |
| $\sin(f(x))$ | $\cos(f(x))f'(x)$ |
Reciprocal as power and chain rules:
$$ \frac{d}{dx}\frac{1}{f(x)} = \frac{d}{dx} f(x)^{-1} = -1 f(x)^{-2} f'(x) = -\frac{f'(x)}{f(x)^2} $$
Quotient as product, power, and chain rules:
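$$ \frac{d}{dx}\frac{f(x)}{g(x)} = \frac{d}{dx} f(x)g(x)^{-1} = f'(x)g(x)^{-1} - f(x)g(x)^{-2}g'(x) = \frac{f'(x)g(x) - f(x)g'(x)}{g(x)^2} $$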
Using change of base for $\frac{d}{dx}\log_a(x)$:
$$ \log_a(x) = \frac{\ln(x)}{\ln(a)} = \frac{1}{\ln(a)}\ln(x) $$
$$ \frac{d}{dx}\log_a(x) = \frac{1}{\ln(a)}\frac{d}{dx}\ln(x) = \frac{1}{\ln(a)}\frac{1}{x} = \frac{1}{\ln(a)x} $$
Expectation
Inequalities
Layer Norm
```python
from typing import Sequence

import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    def __init__(self, dims: Sequence[int]):
        super().__init__()
        self.dims = tuple(dims)
        # Normalize over the trailing "item" dims (all but batch).
        self.stats_dims = tuple(range(-len(dims), 0))
        self.gamma = nn.Parameter(torch.ones(self.dims))
        self.beta = nn.Parameter(torch.zeros(self.dims))

    def forward(self, x: torch.Tensor):
        assert x.shape[-len(self.dims):] == self.dims
        # Biased (1/N) variance, per standard layer norm.
        var = x.var(dim=self.stats_dims, correction=0, keepdim=True)
        mean = x.mean(dim=self.stats_dims, keepdim=True)
        x_norm = (x - mean) * (var + 1e-5).rsqrt()
        return (x_norm * self.gamma) + self.beta
```
Notes:
- normalize over all of a conceptual "item's" dimensions at once.
  - images: `(b, h, w, c)`, item = image = `(h, w, c)`, produces just `b` vars & means
    - why? this preserves relationships across dimensions.
    - e.g., if one row is large (101, 102, 103) and one is small (1, 2, 3), we don't want to remove that information when normalizing
  - language: `(b, l, d)`, item = token = `(d)`, produces `(b, l)` vars & means
    - why? keep token representations independent; adding a word to a sequence shouldn't normalize other tokens
- normalizing by 1/sqrt(var + eps) is important vs. 1/(std + eps)
  - why? std + eps prevents divide-by-zero in the forward pass, but problems remain in the backward pass
    - the derivative of std dev contains a 1/std term which creates a divide by zero. PyTorch prevents this, presumably by having a guard rail that says the derivative at std=0 is just 0, not NaN.
    - even when std is not zero, the gradient of 1/(std + eps) explodes and stays high (plateaus) as std shrinks, whereas the gradient of 1/sqrt(var + eps) does peak extremely high when var is small, but then shrinks back down towards 0 as var keeps shrinking.
- for $d$ input features we introduce $2d$ new weights (gamma scales and beta offsets)
  - note that this is fine because $2d \ll d^2$ (a single linear layer)
Implementation notes:
- typically epsilon (`1e-5`) is passed as a class argument so it can be set correctly for models trained with different values
- machine learning typically uses the biased (population) estimate of var & std (1/N) rather than the unbiased (sample) one (1/(N-1))
- at a lower level, we'd want to do this with a "fused" kernel for efficiency. sketch:
  - we can consider a single input (e.g., a $d$-vector $\mathbf{x}$) as a unit of parallelism
  - we need to compute $\mu$ and $\sigma^2$ before centering
  - these both rely on computing the sums $\sum x$ and $\sum x^2$
  - but we can't naively compute $E[x^2]$ and $E[x]^2$ and subtract, else we'll hit cAtAsTrOpHiC cAnCeLlAtIoN
  - also, we don't want to compute the mean, compute the variance, and compute the centering separately
  - instead, we want to load the input into fast memory and keep it there during the whole computation
  - so, we: (1) load $\mathbf{x}$, (2) compute all the statistics online, (3) center, (4) write
  - the way we compute mean and variance with numerical stability in a single pass is with an online algorithm (Welford's); a minimal sketch follows this list
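To make that last point concrete, here's a minimal, non-fused Python sketch of Welford's single-pass mean/variance update (the function name and plain-list input are just for illustration; a real fused kernel would do this per row, in registers):

```python
def welford_mean_var(xs):
    """One-pass running mean and biased (1/N) variance, avoiding the
    catastrophic cancellation of computing E[x^2] - E[x]^2 directly.
    Assumes xs is non-empty."""
    mean, m2 = 0.0, 0.0  # m2 = running sum of squared deviations from the mean
    n = 0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    return mean, m2 / n

# e.g., welford_mean_var([1.0, 2.0, 3.0, 4.0]) -> (2.5, 1.25)
```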
Log laws
Change of base
$$ \log_b(a) = \frac{\log_x(a)}{\log_x(b)} $$
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is a beautiful, unifying framing of problem setups and assumptions from which the loss functions we use in everyday ML fall out.
Here's our initial setup:
Inputs $x$ may be an image, text, or some set of features. Outputs $y$ may be a number (regression) or class label (classification). Our final term of interest $p(y|\hat{y})$ is the likelihood of the true output $y$ under our model $f_\theta$.
Given data pairs $(x,y)$, our model $f_\theta$ will produce $\hat{y} = f_\theta(x)$. Informally, $\hat{y}$ is a prediction of $y$. But crucially, our model output $\hat{y}$ is actually the parameter of a probability distribution $p(y|\hat{y})$ that tries to model the likelihood of $y$ (true label) occurring. In other words: under our model $f_\theta$, how likely is the true data $y$?
The beginnings of maximum likelihood estimation are already emerging: we probably want an estimator (a model $f_\theta$) which maximizes the likelihood that we'd observe the data $p(y|f_\theta(x))$.
To measure this over a corpus of data, weâll expand our definitions
$$ \begin{aligned} \textbf{x} = (x_1 \ldots x_n) \quad &\text{dataset of } n \text{ inputs} \\ \textbf{y} = (y_1 \ldots y_n) \quad &\text{dataset of } n \text{ outputs} \\ \end{aligned} $$
Now, we can write down the likelihood $L$ of our full dataset as an enormous, unwieldy joint distribution:
$$ \begin{aligned} L(\theta) = p(\mathbf{y}|\mathbf{x},\theta) = p(y_1, \ldots, y_n | x_1, \ldots x_n, \theta)\\ \end{aligned} $$
This is where weâll introduce the i.i.d. assumption: that our data are independent, and all drawn from the same distribution (identically distributed). This allows us to break up the huge joint distribution above into a bunch of individual terms that only consider $(x_i,y_i)$ pairs.
$$ \begin{aligned} L(\theta) = \prod_{i}^{n}{p(y_i| x_i,\theta)} = \prod_{i}^{n}{p(y_i| f_\theta(x_i))}\\ \end{aligned} $$
Now we can go wild and define lots of terms that start with the letter "L." We've already got the likelihood $L$. We'll also introduce the log-likelihood $\ell$, and the negative log-likelihood $\mathcal{L}$, by first hitting $L$ with a $\log$ and then a $-1$. The negative log-likelihood $\mathcal{L}$ is going to become our loss function(s)!
We start taking the logarithm of things simply because it's easier to work with sums than products and will allow us to pull exponents down soon. Since log is monotonically increasing, maximizing the log of a function also maximizes that function. We also multiply by -1 so that we can call the term a loss and instead talk about minimizing it, i.e., it's best when it's zero.
MLE maximizes $L$ or $\ell$, or minimizes $\mathcal{L}$.
Now we start specifying details of $p(y_i| f_\theta(x_i))$. First, we ask: what does our $y$ look like? Then, we have a choice: what kind of noise would we like to assume?
Why do we have to assume noise? By assuming noise, we allow for the fact that additional data we haven't observed (say, more train or test data) could exist, and we want some way of measuring the error of our model in predicting it. It's exactly wrapped up in our formulation of $y$ as a random variable, whose parameter we try to model with $\hat{y}$. In other words, assuming randomness allows us to generalize. If we assumed there was no noise, we'd be saying our training data was all that should ever exist, and our likelihood $p(y|\hat{y})$ should be 1 when exactly $y=\hat{y}$ and $0$ elsewhere. This degenerate distribution gives us no way to learn or measure error, i.e., no way to generalize.
The standard formulations are below. Note that we can assume different kinds of noise (e.g., other than Gaussian for regression) and end up with different loss functions!
| | Regression | Classification | Multi-class Classification |
|---|---|---|---|
| Question: What does our data $y$ look like? | $y \in \mathbb{R}$ | $y \in \{0, 1\}$ | $y \in \{1, \dots, k\}$ |
| Choice: What kind of randomness do we assume? | Gaussian Noise $y \sim \mathcal{N}(\hat{y}, \sigma^2)$ | Bernoulli $y \sim \hat{y}^y (1 - \hat{y})^{1-y}$ | Categorical $y \sim \prod_{j}^{k}{\hat{y}_j^{[y=j]}}$ |
| Result: What kind of loss function emerges? | Mean squared error $(y - \hat{y})^2$ | Binary Cross-Entropy $-y\log\hat{y} -(1-y)\log(1-\hat{y})$ | Cross-Entropy $-\log \hat{y}_y$ |
Below I'll show the derivations.
Mean Squared Error = Regression + Gaussian Noise
Say $y$ is continuous ($y \in \mathbb{R}$), so we're doing regression. We're treating $y$ as a random variable. Let's assume $y$ is distributed around $\hat{y}$ with Gaussian noise $\mathcal{N}(0,\sigma^2)$.[2] Put another way, our output $\hat{y} = f_\theta(x)$ predicts the mean of a Gaussian that aims to capture $y$.
$$ y = \hat{y} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2) $$
$$ y \sim \mathcal{N}(\hat{y}, \sigma^2) $$
Here's our likelihood:
$$ p(y|\hat{y}) = \mathcal{N}(y|\hat{y}, \sigma^2) $$
Let's write out our full negative log likelihood $\mathcal{L}$ from above and use this term:
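$$ \begin{aligned} \mathcal{L}(\theta) &= -\log \prod_{i}^{n}{\mathcal{N}(y_i|\hat{y}_i, \sigma^2)} \\ &= -\sum_{i}^{n}{\log\left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \hat{y}_i)^2}{2\sigma^2} \right) \right]} \\ &= \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i}^{n}{(y_i - \hat{y}_i)^2} \end{aligned} $$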
Because we're interested in maximizing our likelihood (i.e., minimizing the negative (log) likelihood), we can ignore all of the terms that do not depend on $\theta$ (our model parameters), because they'll be the same for any choice of model we have. Here, we won't estimate $\sigma$. Thus, our loss is proportional to the final sum, which is the total squared error. We can multiply by $\frac{1}{n}$ to get the average loss per data point, which is invariant to the length of our dataset, and is also exactly the mean squared error:
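$$ \text{MSE} = \frac{1}{n}\sum_{i}^{n}{(y_i - \hat{y}_i)^2} $$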
Binary Cross-Entropy = Binary Classification + Bernoulli
Say $y$ is binary ($y \in \{0, 1\}$), so we're doing binary classification. We're treating $y$ as a random variable. A Bernoulli distribution represents a binary random variable with probability $p$ of being true (1), and $1-p$ of being false (0). Our output $\hat{y} = f_\theta(x)$ is the parameter of this distribution ($p = \hat{y}$).
$$ y \sim \text{Bernoulli}(\hat{y}) $$
Here's our likelihood:
$$ p(y | \hat{y}) = \text{Bernoulli}(y|\hat{y}) = \begin{cases} \hat{y} & \text{if } y=1 \\ 1-\hat{y} & \text{if } y=0 \\ \end{cases} $$
We can use an exponent trick to write these cases as one equation, where the inactive term just becomes 1:
$$ \text{Bernoulli}(y|\hat{y}) = \hat{y}^{y}(1 - \hat{y})^{1 - y} $$
Let's write out our full negative log likelihood $\mathcal{L}$ from above and use this term:
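$$ \begin{aligned} \mathcal{L}(\theta) &= -\log \prod_{i}^{n}{\hat{y}_i^{y_i}(1 - \hat{y}_i)^{1 - y_i}} \\ &= \sum_{i}^{n}{\left[ -y_i\log\hat{y}_i - (1 - y_i)\log(1 - \hat{y}_i) \right]} \end{aligned} $$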
Some notes for intuitively reading this. (1) Only one of the two terms will be nonzero. (2) Our outputs $\hat{y}_i$ should be in $(0, 1)$, and the log of a number in $(0, 1)$ is negative, so the preceding minus sign turns these positive, into a loss.
This is the total loss for our dataset. We can multiply by $\frac{1}{n}$ to get the average loss per data point, which is invariant to the length of our dataset, and is also exactly the binary cross-entropy:
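$$ \frac{1}{n}\mathcal{L}(\theta) = \frac{1}{n}\sum_{i}^{n}{\left[ -y_i\log\hat{y}_i - (1 - y_i)\log(1 - \hat{y}_i) \right]} $$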
Remark: While this formulation is clever because treating our labels as the numbers 0 and 1 lets us always cancel out one of the terms, I don't like it because it obscures how simple the loss is. I actually prefer the formulation for multi-class classification (below), because it makes it clear that our loss per datum is simply the (negative log) probability we assigned to the correct class.
Cross-Entropy = Multi-Class Classification + Categorical
Say $y$ is one of $k$ categories ($y \in \{1, \ldots, k\}$), so we're doing multi-class classification. We're treating $y$ as a random variable. A categorical distribution represents a discrete random variable that can take on $k$ classes, each with corresponding probability $(p_1, \ldots, p_k)$ of occurring, where $0 \leq p_j \leq 1$ and $\sum_{j}{p_j} = 1$. Our output $\mathbf{\hat{y}} = f_\theta(x)$ is now a vector of length $k$, the parameters of this distribution:
$$ y \sim \text{Categorical}(\mathbf{\hat{y}}) = \text{Categorical}(\hat{y}_1, \ldots, \hat{y}_k) $$
Here's our likelihood:
$$ \begin{aligned} p(y=c|\mathbf{\hat{y}}) &= \text{Categorical}(y=c|\mathbf{\hat{y}}) \\ &= \hat{y}_c \end{aligned} $$
It's a bit awkward to write out. The probability of $y$ being some class $c \in \{1, \ldots, k\}$ is just our model's output for that class, $\hat{y}_c \in [0, 1]$. We can write this another way using Iverson brackets: a superscript $^{[\text{condition}]}$ which evaluates to 1 if the condition is true and 0 if not.
$$ \begin{aligned} p(y|\mathbf{\hat{y}}) &= \text{Categorical}(y|\mathbf{\hat{y}}) \\ &= \prod_{j}^{k}{\hat{y}_j^{[y=j]}} \end{aligned} $$
Let's write out our full negative log likelihood $\mathcal{L}$ from above and use this term, with apologies that $\hat{y}_{i,j}$ now means our prediction of the $j$th class of the $i$th datum.
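$$ \begin{aligned} \mathcal{L}(\theta) &= -\log \prod_{i}^{n}{\prod_{j}^{k}{\hat{y}_{i,j}^{[y_i = j]}}} \\ &= -\sum_{i}^{n}{\sum_{j}^{k}{[y_i = j]\log\hat{y}_{i,j}}} \\ &= -\sum_{i}^{n}{\log\hat{y}_{i,y_i}} \end{aligned} $$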
Intuition for removing the inner sum: for all classes $j \neq y_i$, the Iverson bracket takes $\hat{y}_{i,j}$ to the 0th power, turning it into 1, and $\log(1) = 0$, so it disappears from the sum. All that's left of the inner sum is the term where $j = y_i$.
This is the total loss for our dataset. We can multiply by $\frac{1}{n}$ to get the average loss per data point, which is invariant to the length of our dataset, and is also exactly the cross-entropy loss:
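$$ \frac{1}{n}\mathcal{L}(\theta) = -\frac{1}{n}\sum_{i}^{n}{\log\hat{y}_{i,y_i}} $$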
Remark: While the term $\hat{y}_{i,y_i}$ is gross, I like this formulation because it makes it clear what our loss is per datum: it's just the (negative log) probability we predicted for the correct class $y_i$. Our loss is not affected by the probabilities we assigned to all the incorrect classes.
Addendum: MLE of Known Distributions
If we assume our data $x$ comes from a known distribution, we can often use the maximum likelihood estimate to directly derive the parameter(s) $\theta$ of the distribution. (I think this is probably taught first, but it's less exciting for general machine learning, so I put it here at the end.)
For example, say our data $x_1, \ldots x_n$ are all binary ($x_i \in \{0, 1\}$), and we want to model them as if drawn (i.i.d.) from a Bernoulli distribution with parameter $\theta$. What's the best value for $\theta$?
We can take the maximum likelihood estimate of $\theta$. First, let's write down the likelihood $L$ of the data under a choice of $\theta$:
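$$ L(\theta) = p(\mathbf{x}|\theta) = \prod_{i}^{n}{p(x_i|\theta)} = \prod_{i}^{n}{\theta^{x_i}(1 - \theta)^{1 - x_i}} $$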
This is nice, but we can do better. We've observed $\mathbf{x}$. Let's define $s$ to be the number of successes, i.e., the count of $x = 1$. Then, $n-s$ will naturally be the count of $x = 0$. This lets us rewrite $L$ without a product:
$$ L(\theta) = \theta^s (1 - \theta)^{n-s} $$
Now for the log-likelihood $\ell(\theta)$:
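$$ \ell(\theta) = \log L(\theta) = s\log\theta + (n - s)\log(1 - \theta) $$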
In the above sections, after we wrote down $\ell$, we defined the negative log-likelihood $\mathcal{L}$, got rid of constants, called it a loss function, and called it a day. But here, we can analytically derive the maximum likelihood estimator $\theta_{\text{\small MLE}}$! We'll take the derivative of the above $\ell(\theta)$ and set it to 0 to try to solve for a maximum.[3]
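$$ \begin{aligned} \frac{d\ell}{d\theta} = \frac{s}{\theta} - \frac{n - s}{1 - \theta} &= 0 \\ s(1 - \theta) &= (n - s)\theta \\ s - s\theta &= n\theta - s\theta \\ \theta_{\text{\small MLE}} &= \frac{s}{n} \end{aligned} $$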
Well, would you look at that. The Bernoulli parameter $\theta$ which maximizes the likelihood of the data is $\theta_{\text{\small MLE}} = \frac{s}{n}$, the proportion of samples that were 1.[4]
Mean
Drag the red data points to see the computed values change. The data mean is shown as the dashed vertical orange line. The sum of squared differences to all the data points is plotted in yellow. Takeaway: the mean is the value which minimizes the sum of squared differences to all points.
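A quick check of that takeaway: differentiate the sum of squared differences with respect to a candidate value $m$ and set it to zero.

$$ \frac{d}{dm}\sum_{i}^{n}{(x_i - m)^2} = -2\sum_{i}^{n}{(x_i - m)} = 0 \quad\Rightarrow\quad m = \frac{1}{n}\sum_{i}^{n}{x_i} $$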
Median
Drag the red data points to see the computed values change. The data median is shown as the orange area. The sum of absolute differences to all the data points is plotted in yellow. Takeaway: the median is the range of values which minimizes the sum of absolute differences to all points.
RMSNorm
RMSNorm is like Layer Norm, but it just divides by the RMS = $\sqrt{\frac{1}{n}\sum{x_i^2}}$. It leaves out the mean-centering and bias addition.
```python
from functools import reduce
from typing import Sequence

import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dims: Sequence[int]):
        super().__init__()
        self.dims = tuple(dims)
        # Total number of elements normalized over (the n in the RMS formula).
        self.n = reduce(lambda x, y: x * y, dims)
        self.stats_dims = tuple(range(-len(dims), 0))
        self.gamma = nn.Parameter(torch.ones(self.dims))

    def forward(self, x: torch.Tensor):
        assert x.shape[-len(self.dims):] == self.dims
        # Mean of squares; no mean-centering, unlike Layer Norm.
        ms = (x**2).sum(dim=self.stats_dims, keepdim=True) / self.n
        x_scaled = x * (ms + 1e-5).rsqrt()
        return x_scaled * self.gamma
```
RMSNorm's hypothesis is that the mean-shifting part of Layer Norm is unnecessary.
When the inputs' mean $\mu = 0$ (and Layer Norm's $\beta = 0$), RMSNorm is exactly equal to Layer Norm.
- RMSNorm forces the summed inputs into a √n-scaled unit sphere
  - Because we divide by the RMS, which has a 1/√n term, we end up scaling by √n
  - The Euclidean norm lacks this √n term, which the authors claim makes it not work well as a layer normalization
- Most public LLMs have switched to RMSNorm
- Why is RMSNorm faster than LayerNorm, even with fused kernels?
  - Register pressure: both compute normalization statistics, but Layer Norm's are more involved, tracking a running mean and variance
    - this takes ~3x as many operations
    - the extra register pressure reduces the max parallelism
  - Slightly more FLOPs: Layer Norm also has to subtract the mean and add the bias weights $\mathbf{\beta}$
- Skipped implementation details typically done:
  - cast to FP32 to prevent $\sum x^2$ overflow in, e.g., FP16, then cast back at the end
  - pass in a customizable $\epsilon$ (for loading models)
See also:
- Layer Norm, esp. for discussion of dimensions normalized over
Variance
Notebooks
Notebooks and scripts for many practical notes are over at mbforbes/ml.
(It'd be nice to incorporate them here sometime.)
Footnotes
1. SwiGLU is obviously the funniest to say, with GEGLU coming in a close second.
2. Recall the standard form of a Gaussian is written $x \sim \mathcal{N}(\mu,\sigma^2)$ or $\mathcal{N}(x|\mu,\sigma^2)$.
3. A second derivative check or other analysis would be needed to be thorough here and verify it's a maximum and not a minimum.
4. The end result is so unsurprising (if I saw 739 out of 1000 successes, I'll guess $\theta = \frac{739}{1000}$) that I think seeing these examples first kind of waters down the magic of MLE. But wow is it a useful concept.