Activation Functions
Smooth ReLU approximation. Shown here with $\beta = 1$.
Again, shown with $\beta = 1$.
Rectified Linear Unit.
(likely 0 or 1 chosen as a subgradient at $x=0$)
Parametric ReLU. $\alpha$ learnable; here shown $\alpha = 0.25$
Again, shown $\alpha = 0.25$
Exponential Linear Unit. $\alpha$ is a fixed hyperparameter; here shown with typical $\alpha = 1$.
Again, plotted with $\alpha = 1$.
Gaussian Error Linear Unit. $\Phi$ is the Gaussian CDF.
$\phi$ is the Gaussian PDF.
Generally $\beta$ may be fixed or trainable, here $\beta = 1$, which is equivalent to the Sigmoid Linear Unit.
$\frac{d}{dx}\text{SiLU}(x) = \sigma(x) + x\sigma'(x)$
The next series of activations are Gated Linear Unit (GLU) variants. GLUs are really layers more than activations; they project an input vector $x$ out into two spaces, $xW + b$ and $xV + c$, using learned matrices $W,V$ and bias vectors $b,c$. Then, one space is passed through some kind of activation $a(xW + b)$, and both spaces are finally element-wise multiplied together: $a(xW + b) \otimes (xV + c)$.
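As a concrete sketch (the class name and dimensions are illustrative, not from any particular library), a GLU layer with the original sigmoid gate might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """Minimal sketch of a GLU layer: sigmoid(xW + b) * (xV + c)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.w = nn.Linear(d_in, d_out)  # xW + b
        self.v = nn.Linear(d_in, d_out)  # xV + c

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.w(x)) * self.v(x)  # element-wise gate
```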
This general formulation leads to variants for the common activations, which all end up with funny names ending in GLU.[1]
We'll continue considering a single weight. All of $W, V, b, c$ below are scalars, with slight abuse of notation. Those four parameters are adjustable below, and the plots that follow react to updates.
Gated Linear Unit.
Comments:
- Vanishing gradients: Nearly-zero gradients, like at either end of a sigmoid or tanh, can cause the gradient to become too small. This effect compounds across layers via the chain rule, so the gradient may vanish by the time it reaches earlier network layers. Remedies: ReLU, batch norm, Xavier initialization.
- Dead ReLUs: If a neuron with ReLU always outputs 0 (for every datum in the dataset), its gradient is always 0, and it can't recover. Remedies: GELU/Swish, batch norm, He initialization. Perhaps surprisingly, leaky variants (leaky ReLU, PReLU, ELU) never really caught on. The reasons seem to be (a) batch norm and He init work well, (b) given that, the extra params (PReLU), hyperparams (Leaky), or computational cost (ELU) aren't worth it, (c) if neurons aren't fully dead, ReLU's 0s are nice because we get sparsity, (d) some dead neurons (~5%) are fine if we're massively over-parameterized. That said, GELU/Swish, and their GLU variants, have replaced ReLU for modern transformers.
General notes on the evolution of activation functions:
| Activation | Used | Notes |
|---|---|---|
| Sigmoid | | OG, but since replaced. Vanishing gradients, not 0-centered, computation ($e^x$). |
| tanh | | Now centered at 0, stronger gradient, but still vanishing gradients. |
| softplus | | Might as well just use ReLU. |
| ReLU | MLP | Gold standard. Simple, keeps gradient when positive, helps with sparsity. Risk: "dead." |
| Leaky ReLU | GANs | Prevents dead neurons, but practitioners prefer to just init correctly rather than fuss with leakiness. |
| PReLU | | Extra parameter(s) not worth the marginal gains. |
| ELU | | Zero-mean outputs. Smooth transition from dead to alive. Again, not worth the effort (here, computational). |
| GELU | | Works well (BERT, GPT-2), though complex. Smooth. Non-monotonic. Nice interpretation: scale $x$ by $P(X \leq x)$ for a standard normal $X$. Got us off ReLUs in NLP. |
| Swish | LLMs | Like GELU, but simpler. Modern simple choice. |
| GLU | | More parameters, learnable gate. Core idea nice, but it'd be even better with… |
| SwiGLU | LLMs | Trainable gating, nice shape, easy to compute. Top modern NLP choice. (Though in Shazeer (2020)'s results, various GLUs win at different tasks.) |
Combinations and Permutations
Derivatives
| $f(x)$ | $f'(x) = \frac{d}{dx}f(x) = \frac{df}{dx}$ |
|---|---|
| Constants, power | |
| $a$ | $0$ |
| $x$ | $1$ |
| $ax$ | $a$ |
| $x^a$ | $ax^{a-1}$ |
| $a f(x)$ | $a f'(x)$ |
| Sum | |
| $af(x) + bg(x)$ | $af'(x) + bg'(x)$ |
| $af(x) - bg(x)$ | $af'(x) - bg'(x)$ |
| Product | |
| $f(x)g(x)$ | $f'(x)g(x) + f(x)g'(x)$ |
| $f(x)g(x)h(x)$ | $f'(x)g(x)h(x) + f(x)g'(x)h(x) + f(x)g(x)h'(x)$ |
| Chain | |
| $f(g(x))$ | $f'(g(x)) g'(x)$ |
| Reciprocal | |
| $\frac{1}{f(x)}$ | $-\frac{f'(x)}{f(x)^2}$ |
| Quotient | |
| $\frac{f(x)}{g(x)}$ | $\frac{f'(x)g(x) - f(x)g'(x)}{g(x)^2}$ |
| Applications | |
| $f(x)^a$ | $af(x)^{a-1}f'(x)$ |
| $\ln x$ | $\frac{1}{x}$ |
| $\ln f(x)$ | $\frac{1}{f(x)}f'(x)$ |
| $e^x$ | $e^x$ |
| $e^{f(x)}$ | $e^{f(x)} f'(x)$ |
| $a^x$ | $(\ln a) a^x$ |
| $a^{f(x)}$ | $(\ln a) a^{f(x)} f'(x)$ |
| $\log_a(x)$ | $\frac{1}{(\ln a)x}$ |
| $\log_a(f(x))$ | $\frac{1}{(\ln a)f(x)}f'(x)$ |
| $\sin(x)$ | $\cos(x)$ |
| $\sin(f(x))$ | $\cos(f(x))f'(x)$ |
Reciprocal as power and chain rules:
$$ \frac{d}{dx}\frac{1}{f(x)} = \frac{d}{dx} f(x)^{-1} = -1 f(x)^{-2} f'(x) = -\frac{f'(x)}{f(x)^2} $$
Quotient as product, power, and chain rules:
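$$ \frac{d}{dx}\frac{f(x)}{g(x)} = \frac{d}{dx} f(x)g(x)^{-1} = f'(x)g(x)^{-1} - f(x)g(x)^{-2}g'(x) = \frac{f'(x)g(x) - f(x)g'(x)}{g(x)^2} $$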
Using change of base for $\frac{d}{dx}\log_a(x)$:
$$ \log_a(x) = \frac{\ln(x)}{\ln(a)} = \frac{1}{\ln(a)}\ln(x) $$
$$ \frac{d}{dx}\log_a(x) = \frac{1}{\ln(a)}\frac{d}{dx}\ln(x) = \frac{1}{\ln(a)}\frac{1}{x} = \frac{1}{\ln(a)x} $$
Expectation
Inequalities
Layer Norm
```python
from typing import Sequence

import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    def __init__(self, dims: Sequence[int]):
        super().__init__()
        self.dims = tuple(dims)
        # Normalize over the trailing "item" dims (all but batch).
        self.stats_dims = tuple(range(-len(dims), 0))
        self.gamma = nn.Parameter(torch.ones(self.dims))
        self.beta = nn.Parameter(torch.zeros(self.dims))

    def forward(self, x: torch.Tensor):
        assert x.shape[-len(self.dims):] == self.dims
        # Biased (1/N) variance, per standard layer norm.
        var = x.var(dim=self.stats_dims, correction=0, keepdim=True)
        mean = x.mean(dim=self.stats_dims, keepdim=True)
        x_norm = (x - mean) * (var + 1e-5).rsqrt()
        return (x_norm * self.gamma) + self.beta
```
Notes:
- normalize over all of a conceptual "item's" dimensions at once.
  - images: `(b, h, w, c)`, item = image = `(h, w, c)`, produces just `b` vars & means
    - why? this preserves relationships across dimensions.
    - e.g., if one row is large (101, 102, 103) and one is small (1, 2, 3), we don't want to remove that information when normalizing
  - language: `(b, l, d)`, item = token = `(d)`, produces `(b, l)` vars & means
    - why? keep token representations independent; adding a word to a sequence shouldn't normalize other tokens
- normalizing by 1/sqrt(var + eps) is important vs. 1/(std + eps)
  - why? std + eps prevents divide-by-zero in the forward pass, but problems remain in the backward pass
    - the derivative of std dev contains a 1/std term which creates a divide by zero. PyTorch prevents this, presumably by having a guard rail that says the derivative at std=0 is just 0, not NaN.
    - even when std is not zero, the gradient of 1/(std + eps) explodes and stays high (plateaus) as std shrinks, whereas the gradient of 1/sqrt(var + eps) does peak extremely high when var is small, but then shrinks back down towards 0 as var keeps shrinking.
- for $d$ input features we introduce $2d$ new weights (gamma scales and beta offsets)
  - note that this is fine because $2d \ll d^2$ (a single linear layer)
Implementation notes:
- typically epsilon (`1e-5`) is passed as a class argument so it can be set correctly for models trained with different values
- machine learning typically uses the biased (population) estimate of var & std (1/N) rather than the unbiased (sample) one (1/(N-1))
- at a lower level, we'd want to do this with a "fused" kernel for efficiency. sketch:
  - we can consider a single input (e.g., a $d$-vector $\mathbf{x}$) as a unit of parallelism
  - we need to compute $\mu$ and $\sigma^2$ before centering
  - these both rely on computing the sums $\sum x$ and $\sum x^2$
  - but we can't naively compute $E[x^2]$ and $E[x]^2$ and subtract, else we'll hit cAtAsTrOpHiC cAnCeLlAtIoN
  - also, we don't want to compute the mean, compute the variance, and compute the centering separately
  - instead, we want to load the input into fast memory and keep it there during the whole computation
  - so, we: (1) load $\mathbf{x}$, (2) compute all the statistics online, (3) center, (4) write
  - the way we compute mean and variance with numerical stability in a single pass is with an online algorithm (Welford's); a minimal sketch follows this list
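To make that last point concrete, here's a minimal, non-fused Python sketch of Welford's single-pass mean/variance update (the function name and plain-list input are just for illustration; a real fused kernel would do this per row, in registers):

```python
def welford_mean_var(xs):
    """One-pass running mean and biased (1/N) variance, avoiding the
    catastrophic cancellation of computing E[x^2] - E[x]^2 directly.
    Assumes xs is non-empty."""
    mean, m2 = 0.0, 0.0  # m2 = running sum of squared deviations from the mean
    n = 0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    return mean, m2 / n

# e.g., welford_mean_var([1.0, 2.0, 3.0, 4.0]) -> (2.5, 1.25)
```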
Log laws
Change of base
$$ \log_b(a) = \frac{\log_x(a)}{\log_x(b)} $$
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is a beautiful, unifying framing of problem setups and assumptions from which the loss functions we use in everyday ML fall out.
Here's our initial setup:
Inputs $x$ may be an image, text, or some set of features. Outputs $y$ may be a number (regression) or class label (classification). Our final term of interest $p(y|\hat{y})$ is the likelihood of the true output $y$ under our model $f_\theta$.
Given data pairs $(x,y)$, our model $f_\theta$ will produce $\hat{y} = f_\theta(x)$. Informally, $\hat{y}$ is a prediction of $y$. But crucially, our model output $\hat{y}$ is actually the parameter of a probability distribution $p(y|\hat{y})$ that tries to model the likelihood of $y$ (true label) occurring. In other words: under our model $f_\theta$, how likely is the true data $y$?
The beginnings of maximum likelihood estimation are already emerging: we probably want an estimator (a model $f_\theta$) which maximizes the likelihood that we'd observe the data $p(y|f_\theta(x))$.
To measure this over a corpus of data, weâll expand our definitions
$$ \begin{aligned} \textbf{x} = (x_1 \ldots x_n) \quad &\text{dataset of } n \text{ inputs} \\ \textbf{y} = (y_1 \ldots y_n) \quad &\text{dataset of } n \text{ outputs} \\ \end{aligned} $$
Now, we can write down the likelihood $L$ of our full dataset as an enormous, unwieldy joint distribution:
$$ \begin{aligned} L(\theta) = p(\mathbf{y}|\mathbf{x},\theta) = p(y_1, \ldots, y_n | x_1, \ldots x_n, \theta)\\ \end{aligned} $$
This is where weâll introduce the i.i.d. assumption: that our data are independent, and all drawn from the same distribution (identically distributed). This allows us to break up the huge joint distribution above into a bunch of individual terms that only consider $(x_i,y_i)$ pairs.
$$ \begin{aligned} L(\theta) = \prod_{i}^{n}{p(y_i| x_i,\theta)} = \prod_{i}^{n}{p(y_i| f_\theta(x_i))}\\ \end{aligned} $$
Now we can go wild and define lots of terms that start with the letter "L." We've already got the likelihood $L$. We'll also introduce the log-likelihood $\ell$, and the negative log-likelihood $\mathcal{L}$, by first hitting $L$ with a $\log$ and then a $-1$. The negative log-likelihood $\mathcal{L}$ is going to become our loss function(s)!
We start taking the logarithm of things simply because it's easier to work with sums than products and will allow us to pull exponents down soon. Since log is monotonically increasing, maximizing the log of a function also maximizes that function. We also multiply by -1 so that we can call the term a loss and instead talk about minimizing it, i.e., it's best when it's zero.
MLE maximizes $L$ or $\ell$, or minimizes $\mathcal{L}$.
Now we start specifying details of $p(y_i| f_\theta(x_i))$. First, we ask: what does our $y$ look like? Then, we have a choice: what kind of noise would we like to assume?
Why do we have to assume noise? By assuming noise, we allow for the fact that additional data we haven't observed (say, more train or test data) could exist, and we want some way of measuring the error of our model in predicting it. It's exactly wrapped up in our formulation of $y$ as a random variable, whose parameter we try to model with $\hat{y}$. In other words, assuming randomness allows us to generalize. If we assumed there was no noise, we'd be saying our training data was all that should ever exist, and our likelihood $p(y|\hat{y})$ should be 1 when exactly $y=\hat{y}$ and $0$ elsewhere. This degenerate distribution gives us no way to learn or measure error, i.e., no way to generalize.
The standard formulations are below. Note that we can assume different kinds of noise (e.g., other than Gaussian for regression) and end up with different loss functions!
| | Regression | Classification | Multi-class Classification |
|---|---|---|---|
| Question: What does our data $y$ look like? | $y \in \mathbb{R}$ | $y \in \{0, 1\}$ | $y \in \{1, \dots, k\}$ |
| Choice: What kind of randomness do we assume? | Gaussian Noise $y \sim \mathcal{N}(\hat{y}, \sigma^2)$ | Bernoulli $y \sim \hat{y}^y (1 - \hat{y})^{1-y}$ | Categorical $y \sim \prod_{j}^{k}{\hat{y}_j^{[y=j]}}$ |
| Result: What kind of loss function emerges? | Mean squared error $(y - \hat{y})^2$ | Binary Cross-Entropy $-y\log\hat{y} -(1-y)\log(1-\hat{y})$ | Cross-Entropy $-\log \hat{y}_y$ |
Below I'll show the derivations.
Mean Squared Error = Regression + Gaussian Noise
Say $y$ is continuous ($y \in \mathbb{R}$), so we're doing regression. We're treating $y$ as a random variable. Let's assume $y$ is distributed around $\hat{y}$ with Gaussian noise $\mathcal{N}(0,\sigma^2)$.[2] Put another way, our output $\hat{y} = f_\theta(x)$ predicts the mean of a Gaussian that aims to capture $y$.
$$ y = \hat{y} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2) $$
$$ y \sim \mathcal{N}(\hat{y}, \sigma^2) $$
Here's our likelihood:
$$ p(y|\hat{y}) = \mathcal{N}(y|\hat{y}, \sigma^2) $$
Let's write out our full negative log likelihood $\mathcal{L}$ from above and use this term:
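$$ \begin{aligned} \mathcal{L}(\theta) &= -\log \prod_{i}^{n}{\mathcal{N}(y_i|\hat{y}_i, \sigma^2)} \\ &= -\sum_{i}^{n}{\log\left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \hat{y}_i)^2}{2\sigma^2} \right) \right]} \\ &= \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i}^{n}{(y_i - \hat{y}_i)^2} \end{aligned} $$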
Because we're interested in maximizing our likelihood (i.e., minimizing the negative (log) likelihood), we can ignore all of the terms that do not depend on $\theta$ (our model parameters), because they'll be the same for any choice of model we have. Here, we won't estimate $\sigma$. Thus, our loss is proportional to the final sum, which is the total squared error. We can multiply by $\frac{1}{n}$ to get the average loss per data point, which is invariant to the length of our dataset, and is also exactly the mean squared error:
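$$ \text{MSE} = \frac{1}{n}\sum_{i}^{n}{(y_i - \hat{y}_i)^2} $$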
Binary Cross-Entropy = Binary Classification + Bernoulli
Say $y$ is binary ($y \in \{0, 1\}$), so we're doing binary classification. We're treating $y$ as a random variable. A Bernoulli distribution represents a binary random variable with probability $p$ of being true (1), and $1-p$ of being false (0). Our output $\hat{y} = f_\theta(x)$ is the parameter of this distribution ($p = \hat{y}$).
$$ y \sim \text{Bernoulli}(\hat{y}) $$
Here's our likelihood:
$$ p(y | \hat{y}) = \text{Bernoulli}(y|\hat{y}) = \begin{cases} \hat{y} & \text{if } y=1 \\ 1-\hat{y} & \text{if } y=0 \\ \end{cases} $$
We can use an exponent trick to write these cases as one equation, where the inactive term just becomes 1:
$$ \text{Bernoulli}(y|\hat{y}) = \hat{y}^{y}(1 - \hat{y})^{1 - y} $$
Let's write out our full negative log likelihood $\mathcal{L}$ from above and use this term:
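$$ \begin{aligned} \mathcal{L}(\theta) &= -\log \prod_{i}^{n}{\hat{y}_i^{y_i}(1 - \hat{y}_i)^{1 - y_i}} \\ &= \sum_{i}^{n}{\left[ -y_i\log\hat{y}_i - (1 - y_i)\log(1 - \hat{y}_i) \right]} \end{aligned} $$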
Some notes for intuitively reading this. (1) Only one of the two terms will be nonzero. (2) Our outputs $\hat{y}_i$ should be in $(0, 1)$, and the log of a number in $(0, 1)$ is negative, so the preceding minus sign turns these positive, into a loss.
This is the total loss for our dataset. We can multiply by $\frac{1}{n}$ to get the average loss per data point, which is invariant to the length of our dataset, and is also exactly the binary cross-entropy:
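$$ \frac{1}{n}\mathcal{L}(\theta) = \frac{1}{n}\sum_{i}^{n}{\left[ -y_i\log\hat{y}_i - (1 - y_i)\log(1 - \hat{y}_i) \right]} $$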
Remark: While this formulation is clever because treating our labels as the numbers 0 and 1 lets us always cancel out one of the terms, I don't like it because it obscures how simple the loss is. I actually prefer the formulation for multi-class classification (below), because it makes it clear that our loss per datum is simply the (negative log) probability we assigned to the correct class.
Cross-Entropy = Multi-Class Classification + Categorical
Say $y$ is one of $k$ categories ($y \in \{1, \ldots, k\}$), so we're doing multi-class classification. We're treating $y$ as a random variable. A categorical distribution represents a discrete random variable that can take on $k$ classes, each with corresponding probability $(p_1, \ldots, p_k)$ of occurring, where $0 \leq p_j \leq 1$ and $\sum_{j}{p_j} = 1$. Our output $\mathbf{\hat{y}} = f_\theta(x)$ is now a vector of length $k$, the parameters of this distribution:
$$ y \sim \text{Categorical}(\mathbf{\hat{y}}) = \text{Categorical}(\hat{y}_1, \ldots, \hat{y}_k) $$
Here's our likelihood:
$$ \begin{aligned} p(y=c|\mathbf{\hat{y}}) &= \text{Categorical}(y=c|\mathbf{\hat{y}}) \\ &= \hat{y}_c \end{aligned} $$
It's a bit awkward to write out. The probability of $y$ being some class $c \in \{1, \ldots, k\}$ is just our model's output for that class, $\hat{y}_c \in [0, 1]$. We can write this another way using Iverson brackets: a superscript $^{[\text{condition}]}$ which evaluates to 1 if the condition is true and 0 if not.
$$ \begin{aligned} p(y|\mathbf{\hat{y}}) &= \text{Categorical}(y|\mathbf{\hat{y}}) \\ &= \prod_{j}^{k}{\hat{y}_j^{[y=j]}} \end{aligned} $$
Let's write out our full negative log likelihood $\mathcal{L}$ from above and use this term, with apologies that $\hat{y}_{i,j}$ now means our prediction of the $j$th class of the $i$th datum.
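$$ \begin{aligned} \mathcal{L}(\theta) &= -\log \prod_{i}^{n}{\prod_{j}^{k}{\hat{y}_{i,j}^{[y_i = j]}}} \\ &= -\sum_{i}^{n}{\sum_{j}^{k}{[y_i = j]\log\hat{y}_{i,j}}} \\ &= -\sum_{i}^{n}{\log\hat{y}_{i,y_i}} \end{aligned} $$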
Intuition for removing the inner sum: for all classes $j \neq y_i$, the Iverson bracket takes $\hat{y}_{i,j}$ to the 0th power, turning it into 1, and $\log(1) = 0$, so it disappears from the sum. All that's left of the inner sum is the term where $j = y_i$.
This is the total loss for our dataset. We can multiply by $\frac{1}{n}$ to get the average loss per data point, which is invariant to the length of our dataset, and is also exactly the cross-entropy loss:
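$$ \frac{1}{n}\mathcal{L}(\theta) = -\frac{1}{n}\sum_{i}^{n}{\log\hat{y}_{i,y_i}} $$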
Remark: While the term $\hat{y}_{i,y_i}$ is gross, I like this formulation because it makes it clear what our loss is per datum: it's just the (negative log) probability we predicted for the correct class $y_i$. Our loss is not affected by the probabilities we assigned to all the incorrect classes.
Addendum: MLE of Known Distributions
If we assume our data $x$ comes from a known distribution, we can often use the maximum likelihood estimate to directly derive the parameter(s) $\theta$ of the distribution. (I think this is probably taught first, but it's less exciting for general machine learning, so I put it here at the end.)
For example, say our data $x_1, \ldots x_n$ are all binary ($x_i \in \{0, 1\}$), and we want to model them as if drawn (i.i.d.) from a Bernoulli distribution with parameter $\theta$. What's the best value for $\theta$?
We can take the maximum likelihood estimate of $\theta$. First, let's write down the likelihood $L$ of the data under a choice of $\theta$:
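$$ L(\theta) = p(\mathbf{x}|\theta) = \prod_{i}^{n}{p(x_i|\theta)} = \prod_{i}^{n}{\theta^{x_i}(1 - \theta)^{1 - x_i}} $$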
This is nice, but we can do better. We've observed $\mathbf{x}$. Let's define $s$ to be the number of successes, i.e., the count of $x = 1$. Then, $n-s$ will naturally be the count of $x = 0$. This lets us rewrite $L$ without a product:
$$ L(\theta) = \theta^s (1 - \theta)^{n-s} $$
Now for the log-likelihood $\ell(\theta)$:
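$$ \ell(\theta) = \log L(\theta) = s\log\theta + (n - s)\log(1 - \theta) $$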
In the above sections, after we wrote down $\ell$, we defined the negative log-likelihood $\mathcal{L}$, got rid of constants, called it a loss function, and called it a day. But here, we can analytically derive the maximum likelihood estimator $\theta_{\text{\small MLE}}$! We'll take the derivative of the above $\ell(\theta)$ and set it to 0 to try to solve for a maximum.[3]
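$$ \begin{aligned} \frac{d\ell}{d\theta} = \frac{s}{\theta} - \frac{n - s}{1 - \theta} &= 0 \\ s(1 - \theta) &= (n - s)\theta \\ s - s\theta &= n\theta - s\theta \\ \theta_{\text{\small MLE}} &= \frac{s}{n} \end{aligned} $$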
Well, would you look at that. The Bernoulli parameter $\theta$ which maximizes the likelihood of the data is $\theta_{\text{\small MLE}} = \frac{s}{n}$, the proportion of samples that were 1.[4]
Mean
Drag the red data points to see the computed values change. The data mean is shown as the dashed vertical orange line. The sum of squared differences to all the data points is plotted in yellow. Takeaway: the mean is the value which minimizes the sum of squared differences to all points.
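A quick check of that takeaway: differentiate the sum of squared differences with respect to a candidate value $m$ and set it to zero.

$$ \frac{d}{dm}\sum_{i}^{n}{(x_i - m)^2} = -2\sum_{i}^{n}{(x_i - m)} = 0 \quad\Rightarrow\quad m = \frac{1}{n}\sum_{i}^{n}{x_i} $$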
Median
Drag the red data points to see the computed values change. The data median is shown as the orange area. The sum of absolute differences to all the data points is plotted in yellow. Takeaway: the median is the range of values which minimizes the sum of absolute differences to all points.
RMSNorm
RMSNorm is like Layer Norm, but it just divides by the RMS = $\sqrt{\frac{1}{n}\sum{x_i^2}}$. It leaves out the mean-centering and bias addition.
```python
from functools import reduce
from typing import Sequence

import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dims: Sequence[int]):
        super().__init__()
        self.dims = tuple(dims)
        # Total number of elements normalized over (the n in the RMS formula).
        self.n = reduce(lambda x, y: x * y, dims)
        self.stats_dims = tuple(range(-len(dims), 0))
        self.gamma = nn.Parameter(torch.ones(self.dims))

    def forward(self, x: torch.Tensor):
        assert x.shape[-len(self.dims):] == self.dims
        # Mean of squares; no mean-centering, unlike Layer Norm.
        ms = (x**2).sum(dim=self.stats_dims, keepdim=True) / self.n
        x_scaled = x * (ms + 1e-5).rsqrt()
        return x_scaled * self.gamma
```
RMSNorm's hypothesis is that the mean-shifting part of Layer Norm is unnecessary.
When the inputs' mean $\mu = 0$ (and Layer Norm's $\beta = 0$), RMSNorm is exactly equal to Layer Norm.
- RMSNorm forces the summed inputs into a √n-scaled unit sphere
  - Because we divide by the RMS, which has a 1/√n term, we end up scaling by √n
  - The Euclidean norm lacks this √n term, which the authors claim makes it not work well as a layer normalization
- Most public LLMs have switched to RMSNorm
- Why is RMSNorm faster than LayerNorm, even with fused kernels?
  - Register pressure: both compute normalization statistics, but Layer Norm's are more involved, tracking a running mean and variance
    - this takes ~3x as many operations
    - the extra register pressure reduces the max parallelism
  - Slightly more FLOPs: Layer Norm also has to subtract the mean and add the bias weights $\mathbf{\beta}$
- Skipped implementation details typically done:
  - cast to FP32 to prevent $\sum x^2$ overflow in, e.g., FP16, then cast back at the end
  - pass in a customizable $\epsilon$ (for loading models)
See also:
- Layer Norm, esp. for discussion of dimensions normalized over
Variance
Notebooks
Notebooks and scripts for many practical notes are over at mbforbes/ml.
(It'd be nice to incorporate them here sometime.)
Footnotes
1. SwiGLU is obviously the funniest to say, with GEGLU coming in a close second.
2. Recall the standard form of a Gaussian is written $x \sim \mathcal{N}(\mu,\sigma^2)$ or $\mathcal{N}(x|\mu,\sigma^2)$.
3. A second derivative check or other analysis would be needed to be thorough here and verify it's a maximum and not a minimum.
4. The end result is so unsurprising (if I saw 739 out of 1000 successes, I'll guess $\theta = \frac{739}{1000}$) that I think seeing these examples first kind of waters down the magic of MLE. But wow is it a useful concept.