Maximum Likelihood Estimation and Binary Cross-Entropy
Consider the following situations:
- Given the BMI, resting heartrate, and age of an individual, what is the probability that they will have a heart attack this year?
- Given the age and income of an individual, what is the probability that they will vote in an upcoming election?
- Given the number of hours a customer has engaged with the free portion of my website, what is the probability that they will subscribe to the paid portion?
All of these situations have one or more continuous features and a single binary outcome. We want to model the probability of obtaining one outcome over the other as a function of the features.
In each situation we don’t have any way to directly measure the probability. We can only measure the outcome: did the patient have a heart attack or not? Did the citizen vote in the election or not? So we cannot fit the parameters of a model the same way we did for linear regression. Instead we will use maximum likelihood estimation (MLE).
Warmup to MLE
We will introduce the idea of MLE in a simpler discrete case.
You are given a coin and you are told that it is weighted to come up heads with probability \(p\). You toss the coin \(3\) times and obtain the sequence HHT. Based only on this information, what is your best estimate for \(p\)?
One way to address this rigorously is to rephrase the question this way: of all coins with weightings \(p \in [0,1]\), which value of \(p\) gives the maximum likelihood of producing the sequence HHT? First make a guess about what is reasonable!
The probability of a coin weighted \(p\) producing HHT is
\[L(p) = p^2(1-p)\]where I am using \(L\) to stand for “likelihood”.
The maximum value of this function over \([0,1]\) can be found using differential calculus:
\[\begin{align*} \frac{\textrm{d} L}{\textrm{d} p} &= 2p(1-p) - p^2\\ &=2p - 2p^2 - p^2\\ &=2p - 3p^2 \end{align*}\]So the derivative is \(0\) when \(p = 0\) or \(p = \frac{2}{3}\). As \(L(0) = L(1) = 0\) and \(L(\frac{2}{3}) > 0\) we have that the global maximum of \(L\) occurs at \(p = \frac{2}{3}\).
Does this agree with the guess you made before doing the calculus?
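As a quick numerical sanity check on the calculus, the sketch below evaluates \(L(p) = p^2(1-p)\) on an evenly spaced grid over \([0,1]\) and picks the grid point with the largest value. The function name `likelihood_hht` and the grid size are illustrative choices, not part of the derivation.

```python
def likelihood_hht(p):
    """Likelihood of observing the sequence HHT from a coin with P(heads) = p."""
    return p**2 * (1 - p)

# Evaluate L on a grid of 10,001 evenly spaced points in [0, 1].
grid = [i / 10000 for i in range(10001)]
best_p = max(grid, key=likelihood_hht)

print(best_p)                  # a grid point close to 2/3
print(likelihood_hht(best_p))  # close to L(2/3) = 4/27
```

A grid search like this only locates the maximum up to the grid spacing, so it confirms the calculus result rather than replacing it.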
Let’s do the same thing, but imagine that you tossed the coin \(n\) times, got \(m\) heads and \(n-m\) tails.
The probability of a coin weighted \(p\) producing a particular sequence with \(m\) heads and \(n-m\) tails is
\[L(p) = p^m(1-p)^{n-m}\]We could differentiate this directly, but we will get a cleaner result if we use logarithmic differentiation.
\[\begin{align*} \log(L(p)) &= \log(p^m(1-p)^{n-m})\\ &= m\log(p) + (n-m)\log(1-p)\\ \frac{\textrm{d}}{\textrm{d}p} \log(L(p)) &= \frac{m}{p} - \frac{n-m}{1-p} \end{align*}\]Setting this equal to zero we have
\[\begin{align*} \frac{m}{p} - \frac{n-m}{1-p} &= 0\\ m(1-p) - (n-m)p &= 0\\ m - mp - np + mp &= 0\\ p &= \frac{m}{n} \end{align*}\]which should also agree with your intuition!
We will see the logarithm reprise its role in the next section.
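The result \(p = \frac{m}{n}\) can also be checked numerically. The sketch below maximizes the log-likelihood \(m\log(p) + (n-m)\log(1-p)\) over a grid of points in the open interval \((0,1)\). The helper names `log_likelihood` and `grid_mle`, and the example counts, are assumptions made for illustration.

```python
import math

def log_likelihood(p, m, n):
    """Log-likelihood of a particular sequence with m heads in n tosses."""
    return m * math.log(p) + (n - m) * math.log(1 - p)

def grid_mle(m, n, steps=10000):
    """Grid point in the open interval (0, 1) that maximizes the log-likelihood."""
    # Skip the endpoints 0 and 1, where log(0) is undefined.
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: log_likelihood(p, m, n))

print(grid_mle(7, 10))    # expected to be close to 7/10
print(grid_mle(30, 120))  # expected to be close to 30/120 = 0.25
```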
MLE and Binary Cross-Entropy
Now we return to the general situation: we have \(N\) samples which each have \(k\) continuous variables \(x_1, x_2, x_3, ..., x_k\) and one binary variable \(y\) (which is either \(0\) or \(1\) for each sample). We want to model the probability that \(y = 1\) as a function of these variables using a model with parameters \(\theta = (\theta_1, \theta_2, \theta_3, ..., \theta_m)\) where \(\theta \in \Theta \subset \mathbb{R}^m\). In other words, we want to model the probability as some function \(p_\theta: \mathbb{R}^k \to [0,1]\) which depends on these parameters.
For example, in the situation with age and income predicting voting in an upcoming election, we might model the probability of voting as
\[p_\theta (\vec{x}) = \frac{1}{1 + e^{ - (\theta_0 + \theta_1 x_1 + \theta_2 x_2)}}\]where \(x_1\) is age, \(x_2\) is income, and the parameters are \(\theta = (\theta_0, \theta_1, \theta_2)\). This is only an example though: depending on the data, we might prefer a different parametric family of functions as our model. The particular model used here is logistic regression.
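To make the formula concrete, here is a minimal sketch of the logistic model as a Python function. The name `logistic_probability`, the parameter values, and the example individual are all made up for illustration. The function is not hardened against overflow for very large negative values of the linear term.

```python
import math

def logistic_probability(theta, x):
    """p_theta(x) = 1 / (1 + exp(-(theta_0 + theta_1*x_1 + ... + theta_k*x_k)))."""
    theta_0, weights = theta[0], theta[1:]
    z = theta_0 + sum(w * x_j for w, x_j in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters and a hypothetical individual
# (age 40, income 60 in thousands); these numbers are illustrative only.
theta = (-3.0, 0.05, 0.01)
x = (40.0, 60.0)
p = logistic_probability(theta, x)
print(p)  # a number strictly between 0 and 1
```

Whatever values \(\theta\) takes, the output stays strictly between \(0\) and \(1\), which is what lets us read it as a probability.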
The question we have is how to select the parameters \(\theta\) so as to maximize the likelihood that the data in our sample was generated by the model. This is analogous to how we selected the parameter \(p\) in our warmup.
Fix the parameters \(\theta\). For a single observation \((\vec{x},y)\) the model predicts this observation will occur with a probability of
\[\begin{cases} p_\theta(\vec{x}) & \textrm{if } y = 1\\ 1 - p_\theta(\vec{x}) & \textrm{if } y = 0 \end{cases}\]We can write this more compactly as the single expression
\[y p_\theta(\vec{x}) + (1-y)(1 - p_\theta(\vec{x}))\]So our model predicts that the probability that our sample of \(N\) observations occurs is
\[L(\theta) = \prod_{i=1}^{N} \left(y_i p_\theta(\vec{x_i}) + (1-y_i)(1 - p_\theta(\vec{x_i}))\right)\]Our goal is to find the value of \(\theta\) which maximizes this quantity.
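The product above can be computed directly for a small sample. The sketch below does so for a one-feature logistic model and compares two candidate values of \(\theta\). The data (hours of engagement and subscription outcomes) and both parameter vectors are invented for illustration, and `logistic` and `likelihood` are hypothetical helper names.

```python
import math

def logistic(theta, x):
    """One-feature logistic model: 1 / (1 + exp(-(theta_0 + theta_1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(theta[0] + theta[1] * x)))

def likelihood(theta, xs, ys):
    """L(theta): product over observations of y*p + (1 - y)*(1 - p)."""
    return math.prod(
        y * logistic(theta, x) + (1 - y) * (1 - logistic(theta, x))
        for x, y in zip(xs, ys)
    )

# Made-up data: hours of engagement and whether the customer subscribed.
hours = [0.5, 1.0, 2.0, 3.5, 5.0]
subscribed = [0, 0, 1, 1, 1]

rising = likelihood((-2.0, 1.0), hours, subscribed)   # probability rises with hours
falling = likelihood((2.0, -1.0), hours, subscribed)  # probability falls with hours
print(rising, falling)
```

In this made-up sample the subscribers are the customers with more hours, so the parameters whose predicted probability rises with hours give the sample a much higher likelihood than the parameters whose probability falls.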
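We are now going to take the logarithm of the likelihood function. The logarithm turns the product into a sum. Because each \(y_i\) is either \(0\) or \(1\), exactly one of the two terms in each factor is nonzero, so \(\log\left(y_i p + (1-y_i)(1-p)\right) = y_i \log(p) + (1-y_i)\log(1-p)\) with \(p = p_\theta(\vec{x_i})\). This gives
\[\log(L(\theta)) = \sum_{i = 1}^{N} \left[ y_i \log (p_\theta(\vec{x_i})) + (1 - y_i)\log(1 - p_\theta(\vec{x_i})) \right]\]Why did we take the logarithm? Several reasons: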
- The logarithm is monotonically increasing, so the same \(\theta\) maximizes both \(L\) and \(\log L\).
- The values of \(L\) are typically tiny. We are multiplying together many numbers between \(0\) and \(1\), so the product shrinks toward \(0\) as \(N\) grows and can underflow in floating-point arithmetic. Taking the logarithm puts these numbers on a more manageable scale.
- We will eventually want to differentiate to find a maximum value. The derivative of the likelihood uses the product rule, which would involve a product over all observations. This is not very parallelizable. The derivative of the log-likelihood only uses the sum rule, which results in a parallelizable derivative computation.
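The scale issue in the second point is easy to see numerically. In the sketch below, a hypothetical model assigns probability \(0.5\) to each of \(2000\) observations; both numbers are arbitrary choices made for illustration.

```python
import math

# Hypothetical setup: 2000 observations, each given probability 0.5 by the model.
probabilities = [0.5] * 2000

# Product of many numbers in (0, 1).
likelihood = math.prod(probabilities)
# Sum of their logarithms.
log_likelihood = sum(math.log(q) for q in probabilities)

print(likelihood)      # underflows to 0.0 in double precision
print(log_likelihood)  # about -1386.29, i.e. 2000 * log(0.5)
```

The true likelihood \(0.5^{2000}\) is positive but far smaller than the smallest positive double-precision number, so the product is rounded to zero, while the log-likelihood remains an ordinary finite number.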
The log-likelihood is a negative number which we are trying to maximize. By convention we will instead minimize the negative log-likelihood:
\[\ell(\theta) = - \log(L(\theta)) = - \sum_{i = 1}^{N} \left[ y_i \log (p_\theta(\vec{x_i})) + (1 - y_i)\log(1 - p_\theta(\vec{x_i})) \right]\]where I am using \(\ell\) to stand for “loss”.
This expression is called the binary cross-entropy between the model \(p_\theta\) and the empirical probability from our observations \((\vec{x_i}, y_i)\).
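As a small illustration, the sketch below implements \(\ell(\theta)\) for an arbitrary model and applies it to the coin warmup, treating the coin as a "model" that ignores its features and always predicts the same probability \(\theta\). The names `binary_cross_entropy` and `constant_model` are hypothetical, and the function assumes the model never returns exactly \(0\) or \(1\).

```python
import math

def binary_cross_entropy(model, xs, ys):
    """Negative log-likelihood of binary labels ys under model(x) = P(y = 1 | x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = model(x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total

def constant_model(theta):
    """Model that ignores its input and always predicts probability theta."""
    return lambda x: theta

# The sequence HHT encoded as labels 1, 1, 0; the features are unused placeholders.
xs = [None, None, None]
ys = [1, 1, 0]

losses = {t: binary_cross_entropy(constant_model(t), xs, ys) for t in (0.5, 2 / 3, 0.8)}
for t, loss in losses.items():
    print(t, loss)
```

Of the three candidates, \(\theta = \frac{2}{3}\) gives the smallest loss, matching the maximum likelihood estimate found in the warmup.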
Summary
We are attempting to select the value of the parameters \(\theta = (\theta_1, \theta_2, \theta_3, \dots, \theta_m)\) which maximizes the likelihood of our sample \((\vec{x}_i,y_i)\) being generated by our model \(p_\theta\). We showed that the \(\theta\) which maximizes this likelihood also minimizes the binary cross-entropy
\[\ell(\theta) = - \log(L(\theta)) = - \sum_{i = 1}^{N} \left[ y_i \log (p_\theta(\vec{x_i})) + (1 - y_i)\log(1 - p_\theta(\vec{x_i})) \right]\]