Maximum Likelihood Estimation
Yesterday and today I got obsessed with maximum likelihood estimation (MLE) as a unifying formulation from which the modern ML loss functions organically emerge. You simply look at (1) the type of your labels $y$, and (2) the kind of noise you want to assume, and then loss functions like mean squared error and cross entropy just fall out from writing down the (negative log) likelihood of your data in that setting.
The most surprising thing was realizing that, while we colloquially talk about models as “predicting values,” we’re actually always predicting the parameters of probability distributions. Even for regression! This is because we treat our gold outputs $y$ as random variables, and try to estimate the parameters of the distributions that generate them. This randomness is vital to the formulation of loss functions.
The following is my current writeup of the topic, excerpted from my machine learning notes.
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is a beautiful, unifying framing of problem setups and assumptions from which the loss functions we use in everyday ML fall out.
Here’s our initial setup:
Inputs $x$ may be an image, text, or some set of features. Outputs $y$ may be a number (regression) or a class label (classification).
Given data pairs $(x,y)$, our model $f_\theta$ produces $\hat{y} = f_\theta(x)$. Informally, $\hat{y}$ is a prediction of $y$. But crucially, our model output $\hat{y}$ is actually the parameter of a probability distribution $p(y|\hat{y})$ that tries to model the likelihood of the true label $y$ occurring. This likelihood $p(y|\hat{y})$ is our final term of interest: under our model $f_\theta$, how likely is the true data $y$?
The beginnings of maximum likelihood estimation are already emerging: we probably want an estimator (a model $f_\theta$) which maximizes the likelihood that we’d observe the data $p(y|f_\theta(x))$.
To measure this over a corpus of data, we’ll expand our definitions:
$$ \begin{aligned} \textbf{x} = (x_1 \ldots x_n) \quad &\text{dataset of } n \text{ inputs} \\ \textbf{y} = (y_1 \ldots y_n) \quad &\text{dataset of } n \text{ outputs} \\ \end{aligned} $$
Now, we can write down the likelihood $L$ of our full dataset as an enormous, unwieldy joint distribution:
$$ \begin{aligned} L(\theta) = p(\mathbf{y}|\mathbf{x},\theta) = p(y_1, \ldots, y_n | x_1, \ldots x_n, \theta)\\ \end{aligned} $$
This is where we’ll introduce the i.i.d. assumption: that our data are independent, and all drawn from the same distribution (identically distributed). This allows us to break up the huge joint distribution above into a bunch of individual terms that only consider $(x_i,y_i)$ pairs.
$$ \begin{aligned} L(\theta) = \prod_{i=1}^{n}{p(y_i| x_i,\theta)} = \prod_{i=1}^{n}{p(y_i| f_\theta(x_i))} \end{aligned} $$
Now we can go wild and define lots of terms that start with the letter “L.” We’ve already got the likelihood $L$. We’ll also introduce the log-likelihood $\ell$, and the negative log-likelihood $\mathcal{L}$, by first hitting $L$ with a $\log$ and then a $-1$. The negative log-likelihood $\mathcal{L}$ is going to become our loss function(s)!
We start taking the logarithm of things simply because it’s easier to work with sums than products and will allow us to pull exponents down soon. Since log is monotonically increasing, maximizing the log of a function also maximizes that function. We also multiply by -1 so that we can call the term a loss and instead talk about minimizing it—i.e., it’s best when it’s zero.
MLE maximizes $L$ or $\ell$, or minimizes $\mathcal{L}$.
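In symbols, all three are the same optimization:

$$ \theta_{\text{\small MLE}} = \arg\max_\theta L(\theta) = \arg\max_\theta \ell(\theta) = \arg\min_\theta \mathcal{L}(\theta) $$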
Now we start specifying details of $p(y_i| f_\theta(x_i))$. First, we ask: what does our $y$ look like? Then, we have a choice: what kind of noise would we like to assume?
Why do we have to assume noise? By assuming noise, we allow for the fact that additional data we haven’t observed (say, more train or test data) could exist, and we want some way of measuring the error of our model in predicting it. It’s exactly wrapped up in our formulation of $y$ as a random variable, whose parameter we try to model with $\hat{y}$. In other words, assuming randomness allows us to generalize. If we assumed there was no noise, we’d be saying our training data was all that should ever exist, and our likelihood $p(y|\hat{y})$ should be 1 when exactly $y=\hat{y}$ and $0$ elsewhere. This degenerate distribution gives us no way to learn or measure error—i.e., no way to generalize.
The standard formulations are below. Note that we can assume different kinds of noise (e.g., other than Gaussian for regression) and end up with different loss functions!
| | Regression | Binary Classification | Multi-class Classification |
|---|---|---|---|
| Question: What does our data $y$ look like? | $y \in \mathbb{R}$ | $y \in \{0, 1\}$ | $y \in \{1, \dots, k\}$ |
| Choice: What kind of randomness do we assume? | Gaussian noise: $y \sim \mathcal{N}(\hat{y}, \sigma^2)$ | Bernoulli: $y \sim \hat{y}^y (1 - \hat{y})^{1-y}$ | Categorical: $y \sim \prod_{j=1}^{k}{\hat{y}_j^{[y=j]}}$ |
| Result: What kind of loss function emerges? | Mean squared error: $(y - \hat{y})^2$ | Binary cross-entropy: $-y\log\hat{y} - (1-y)\log(1-\hat{y})$ | Cross-entropy: $-\log \hat{y}_y$ |
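To make the table concrete, here’s a minimal NumPy sketch of the three per-datum negative log-likelihoods (the function names are my own; the derivations for each follow below):

```python
import numpy as np

# Per-datum negative log-likelihoods for the three settings in the table.
# y is the true label; y_hat is the distribution parameter our model predicts.

def nll_gaussian(y, y_hat, sigma=1.0):
    # -log N(y | y_hat, sigma^2). Dropping the terms that don't depend on
    # y_hat leaves the squared error (y - y_hat)^2.
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - y_hat) ** 2 / (2 * sigma**2)

def nll_bernoulli(y, y_hat):
    # -log Bernoulli(y | y_hat): the binary cross-entropy.
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def nll_categorical(y, y_hat):
    # -log Categorical(y | y_hat): cross-entropy, i.e. the negative log
    # probability the model assigned to the correct class y (an index here).
    return -np.log(y_hat[y])
```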
Below I’ll show the derivations.
Mean Squared Error = Regression + Gaussian Noise
Say $y$ is continuous ($y \in \mathbb{R}$), so we’re doing regression. We’re treating $y$ as a random variable. Let’s assume $y$ is distributed around $\hat{y}$ with Gaussian noise $\mathcal{N}(0,\sigma^2)$.[^1] Put another way, our output $\hat{y} = f_\theta(x)$ predicts the mean of a Gaussian that aims to capture $y$.
$$ y = \hat{y} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2) $$
$$ y \sim \mathcal{N}(\hat{y}, \sigma^2) $$
Here’s our likelihood:
$$ p(y|\hat{y}) = \mathcal{N}(y|\hat{y}, \sigma^2) $$
Let’s write out our full negative log likelihood $\mathcal{L}$ from above and use this term:

$$ \begin{aligned} \mathcal{L}(\theta) &= -\log \prod_{i=1}^{n}{p(y_i|f_\theta(x_i))} \\ &= -\sum_{i=1}^{n}{\log \mathcal{N}(y_i|\hat{y}_i, \sigma^2)} \\ &= -\sum_{i=1}^{n}{\log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \hat{y}_i)^2}{2\sigma^2} \right) \right]} \\ &= -\sum_{i=1}^{n}{\left[ -\log\left(\sqrt{2\pi\sigma^2}\right) - \frac{(y_i - \hat{y}_i)^2}{2\sigma^2} \right]} \\ &= n\log\left(\sqrt{2\pi\sigma^2}\right) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}{(y_i - \hat{y}_i)^2} \end{aligned} $$
Because we’re interested in maximizing our likelihood (i.e., minimizing the negative (log) likelihood), we can ignore all of the terms that do not depend on $\theta$ (our model parameters), because they’ll be the same for any choice of model we have. Here, we won’t estimate $\sigma$. Thus, our loss is proportional to the final sum, which is the total squared error. We can multiply by $\frac{1}{n}$ to get the average loss per data point, which is invariant to the length of our dataset, and is also exactly the mean squared error.
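That is, the loss we actually minimize is:

$$ \frac{1}{n}\sum_{i=1}^{n}{(y_i - \hat{y}_i)^2} $$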
Binary Cross-Entropy = Binary Classification + Bernoulli
Say $y$ is binary ($y \in \{0, 1\}$), so we’re doing binary classification. We’re treating $y$ as a random variable. A Bernoulli distribution represents a binary random variable with probability $p$ of being true (1), and $1-p$ of being false (0). Our output $\hat{y} = f_\theta(x)$ is the parameter of this distribution ($p = \hat{y}$).
$$ y \sim \text{Bernoulli}(\hat{y}) $$
Here’s our likelihood:
$$ p(y | \hat{y}) = \text{Bernoulli}(y|\hat{y}) = \begin{cases} \hat{y} & \text{if } y=1 \\ 1-\hat{y} & \text{if } y=0 \\ \end{cases} $$
We can use an exponent trick to write these cases as one equation, where the inactive term just becomes 1:
$$ \text{Bernoulli}(y|\hat{y}) = \hat{y}^{y}(1 - \hat{y})^{1 - y} $$
Let’s write out our full negative log likelihood $\mathcal{L}$ from above and use this term:

$$ \begin{aligned} \mathcal{L}(\theta) &= -\log \prod_{i=1}^{n}{\hat{y}_i^{y_i}(1 - \hat{y}_i)^{1 - y_i}} \\ &= -\sum_{i=1}^{n}{\left[ y_i\log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]} \end{aligned} $$
Some notes for intuitively reading this. (1) For each datum, only one of the two terms is active; the other is zeroed out by its coefficient. (2) Our outputs $\hat{y}_i$ should be $\in (0, 1)$, and the log of a number in $(0, 1)$ is negative, so the preceding minus sign turns these positive, into a loss.
This is the total loss for our dataset. We can multiply by $\frac{1}{n}$ to get the average loss per data point, which is invariant to the length of our dataset, and is also exactly the binary cross-entropy:

$$ \frac{1}{n}\sum_{i=1}^{n}{-\left[ y_i\log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]} $$
Remark: While this formulation is clever (treating our labels as the numbers 0 and 1 lets us always cancel out one of the terms), I don’t like it because it obscures how simple the loss is. I actually prefer the formulation for multi-class classification (below), because it makes it clear that our loss per datum is simply the (negative log) probability we assigned to the correct class.
Cross-Entropy = Multi-Class Classification + Categorical
Say $y$ is one of $k$ categories ($y \in \{1, \ldots, k\}$), so we’re doing multi-class classification. We’re treating $y$ as a random variable. A categorical distribution represents a discrete random variable that can take on $k$ classes, each with corresponding probability $(p_1, \ldots, p_k)$ of occurring, where $0 \leq p_j \leq 1$ and $\sum_{j=1}^{k}{p_j} = 1$. Our output $\mathbf{\hat{y}} = f_\theta(x)$ is now a vector of length $k$, the parameters of this distribution:
$$ y \sim \text{Categorical}(\mathbf{\hat{y}}) = \text{Categorical}(\hat{y}_1, \ldots, \hat{y}_k) $$
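As an aside, a model typically produces such a probability vector by applying a softmax to its raw scores (logits); a minimal sketch of that standard construction:

```python
import numpy as np

def softmax(z):
    # Map raw scores (logits) to a valid categorical parameter vector:
    # entries in (0, 1) that sum to 1.
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```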
Here’s our likelihood:
$$ \begin{aligned} p(y=c|\mathbf{\hat{y}}) &= \text{Categorical}(y=c|\mathbf{\hat{y}}) \\ &= \hat{y}_c \end{aligned} $$
It’s a bit awkward to write out. The probability of $y$ being some class $c \in \{1, \ldots, k\}$ is just our model’s output for that class, $\hat{y}_c \in [0, 1]$. We can write this another way using an Iverson bracket in the exponent: $[\text{condition}]$ evaluates to 1 if the condition is true and 0 if not.
$$ \begin{aligned} p(y|\mathbf{\hat{y}}) &= \text{Categorical}(y|\mathbf{\hat{y}}) \\ &= \prod_{j=1}^{k}{\hat{y}_j^{[y=j]}} \end{aligned} $$
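For instance, with $k = 3$ classes and true label $y = 2$, the Iverson brackets zero out the exponents of the wrong classes:

$$ p(y=2|\mathbf{\hat{y}}) = \hat{y}_1^{0} \, \hat{y}_2^{1} \, \hat{y}_3^{0} = \hat{y}_2 $$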
Let’s write out our full negative log likelihood $\mathcal{L}$ from above and use this term, with apologies that $\hat{y}_{i,j}$ now means our prediction of the $j$th class of the $i$th datum.

$$ \begin{aligned} \mathcal{L}(\theta) &= -\log \prod_{i=1}^{n}{\prod_{j=1}^{k}{\hat{y}_{i,j}^{[y_i=j]}}} \\ &= -\sum_{i=1}^{n}{\sum_{j=1}^{k}{[y_i=j]\log\hat{y}_{i,j}}} \\ &= -\sum_{i=1}^{n}{\log\hat{y}_{i,y_i}} \end{aligned} $$
Intuition for removing the inner sum: for every class $j \neq y_i$, the Iverson bracket raises $\hat{y}_{i,j}$ to the 0th power, turning that factor into 1, and $\log(1) = 0$, so it disappears from the sum. All that’s left of the inner sum is the term where $j = y_i$.
This is the total loss for our dataset. We can multiply by $\frac{1}{n}$ to get the average loss per data point, which is invariant to the length of our dataset, and is also exactly the cross-entropy loss:

$$ \frac{1}{n}\sum_{i=1}^{n}{-\log\hat{y}_{i,y_i}} $$
Remark: While the term $\hat{y}_{i,y_i}$ is gross, I like this formulation because it makes it clear what our loss is per datum: it’s just the (negative log) probability we predicted for the correct class $y_i$. Our loss is not affected by the probabilities we assigned to all the incorrect classes.
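That per-datum view is literally just an indexing operation; a tiny sketch with made-up numbers (classes zero-indexed here):

```python
import numpy as np

probs = np.array([0.1, 0.7, 0.2])  # model's predicted y_hat for one datum
y = 1                              # true class index
loss = -np.log(probs[y])           # cross-entropy: -log(0.7) ≈ 0.357
```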
Addendum: MLE of Known Distributions
If we assume our data $x$ comes from a known distribution, we can often use the maximum likelihood estimate to directly derive the parameter(s) $\theta$ of the distribution. (I think this is probably taught first, but it’s less exciting for general machine learning, so I put it here at the end.)
For example, say our data $x_1, \ldots x_n$ are all binary ($x_i \in \{0, 1\}$), and we want to model them as if drawn (i.i.d.) from a Bernoulli distribution with parameter $\theta$. What’s the best value for $\theta$?
We can take the maximum likelihood estimate of $\theta$. First, let’s write down the likelihood $L$ of the data under a choice of $\theta$:

$$ L(\theta) = \prod_{i=1}^{n}{p(x_i|\theta)} = \prod_{i=1}^{n}{\theta^{x_i}(1 - \theta)^{1 - x_i}} $$
This is nice, but we can do better. We’ve observed $\mathbf{x}$. Let’s define $s$ to be the number of successes, i.e., the count of $x = 1$. Then, $n-s$ will naturally be the count of $x = 0$. This lets us rewrite $L$ without a product:
$$ L(\theta) = \theta^s (1 - \theta)^{n-s} $$
Now for the log-likelihood $\ell(\theta)$:

$$ \ell(\theta) = \log L(\theta) = s\log\theta + (n - s)\log(1 - \theta) $$
In the above sections, after we wrote down $\ell$, we defined the negative log-likelihood $\mathcal{L}$, got rid of constants, called it a loss function, and called it a day. But here, we can analytically derive the maximum likelihood estimator $\theta_{\text{\small MLE}}$! We’ll take the derivative of the above $\ell(\theta)$ and set it to 0 to try to solve for a maximum.[^2]

$$ \begin{aligned} \frac{d\ell}{d\theta} = \frac{s}{\theta} - \frac{n - s}{1 - \theta} &= 0 \\ s(1 - \theta) &= (n - s)\theta \\ s - s\theta &= n\theta - s\theta \\ \theta &= \frac{s}{n} \end{aligned} $$
Well, would you look at that. The Bernoulli parameter that maximizes the likelihood of the data is $\theta_{\text{\small MLE}} = \frac{s}{n}$, the proportion of samples that were 1.[^3]
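As a quick numerical sanity check (a sketch of my own, reusing the 739-out-of-1000 example from the footnote):

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta_true = 1000, 0.739
x = (rng.random(n) < theta_true).astype(int)  # n i.i.d. Bernoulli draws
theta_mle = x.sum() / n                       # s / n, as derived above
print(theta_mle)                              # close to 0.739
```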
Footnotes
[^1]: Recall the standard form of a Gaussian is written $x \sim \mathcal{N}(\mu,\sigma^2)$ or $\mathcal{N}(x|\mu,\sigma^2)$.

[^2]: A second derivative check or other analysis would be needed to be thorough here and verify it’s a maximum and not a minimum.

[^3]: The end result is so unsurprising (if I saw 739 out of 1000 successes, I’ll guess $\theta = \frac{739}{1000}$) that I think seeing these examples first kind of waters down the magic of MLE. But wow is it a useful concept.