4.1. Logistic Regression

Let’s begin with a simple scenario. Users visit a website, and for each user we know a certain amount of anonymized information. We want to predict whether a user will click on an advertisement given that anonymized information. We can train a classification model to address this problem.

Formally, we have a dataset \(\{x_i, y_i\}_{i=1}^N\) where each \(x_i\) is a vector of features and \(y_i \in \{0,1\}\) is a binary label indicating whether the user clicked on the ad (1) or did not (0).
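To make the setup concrete, here is a minimal sketch in NumPy of a synthetic stand-in for such a dataset. The dataset size, feature dimension, and generating weights `true_theta` are all hypothetical choices for illustration, not part of the problem statement. Generating the labels through a sigmoid quietly presupposes the model we derive below, which is convenient for testing it later.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3  # hypothetical dataset size and feature count

X = rng.normal(size=(N, d))               # feature vectors x_i, one per row
true_theta = np.array([1.5, -2.0, 0.5])   # hidden generating weights, illustration only
p_click = 1.0 / (1.0 + np.exp(-X @ true_theta))
y = rng.binomial(1, p_click)              # y_i = 1 if the user clicked, else 0
```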

4.1.1. The Model

Getting our models, which have thus far been linear, to output a discrete zero or one seems difficult. It turns out it’s actually unnecessary for our models to output an exact zero or one. We instead care about the probability that a given data point belongs to the one class. If our model predicts the probability of the input belonging to class one to be greater than 0.5, we can say it has decided that the output is a one. If the predicted probability is less than 0.5, we guess zero.

But how do we connect our linear models to probability? Our linear models \(\theta^\top x\) output values in \((-\infty, \infty)\). With some careful thinking, it’s possible to discover a way. Rather than outputting the probability directly, we can assume that our linear model outputs the log of the likelihood ratio. The likelihood ratio is \(\frac{P(Y=1 | X = x)}{P(Y = 0 | X = x)}\), i.e., how many times more likely it is that the data point belongs to class one instead of class zero. A very large likelihood ratio indicates that the data point is very likely to be a one, whereas a small likelihood ratio indicates that the data point is very unlikely to be a one and is much more likely to be a zero. But our model still outputs values in \((-\infty, \infty)\), while the likelihood ratio is always positive. Taking the log of the likelihood ratio maps every ratio smaller than one (the point where both classes are equally likely) to a negative value, so the model’s range and the target quantity match. Our model is then represented by the following equation:

\[ \ln \frac{P(Y=1 | X = x)}{P(Y = 0 | X = x)} = \theta^\top x\]

Note that when the linear model \(\theta^\top x\) is zero, the likelihood ratio is one. Ultimately, we want our model to predict the probability of the input belonging to class 1. We can solve for this by exponentiating both sides of the equation above and substituting \(P(Y=0 | X = x) = 1 - P(Y=1 | X = x)\). We’ll let \(p = P(Y=1 | X = x)\) for simplicity; this is the quantity we want to solve for.

\[\begin{split}\begin{align*} \frac{P(Y=1 | X = x)}{P(Y = 0 | X = x)} &= e^{\theta^\top x} \\ \frac{p}{1-p} &= e^{\theta^\top x} \\ p &= (1-p) e^{\theta^\top x} \\ p + p\, e^{\theta^\top x} &= e^{\theta^\top x} \\ (1 + e^{\theta^\top x}) p &= e^{\theta^\top x} \\ p &= \frac{e^{\theta^\top x}}{1 + e^{\theta^\top x}} = \frac{1}{1 + e^{-\theta^\top x}} \end{align*}\end{split}\]

The function \(g(z) = \frac{1}{1 + e^{-z}}\) is a special function known as the sigmoid function. It is often used to convert continuous values in \((-\infty, \infty)\) to probabilities in \((0, 1)\), just as we have done here. We have now solved for our model; the next step is to solve for the correct \(\theta\) as we did before. We can interpret our model as predicting the parameter of a Bernoulli random variable, i.e., the probability that the label is one.
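As a quick sketch, the model itself is only a few lines of NumPy. The names `sigmoid` and `predict_proba` are our own hypothetical choices, and we ignore numerical-stability concerns (such as overflow of `np.exp` for very negative inputs) for clarity.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}): maps (-inf, inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """f(x) = g(theta^T x): predicted P(Y = 1 | X = x) for each row of X."""
    return sigmoid(X @ theta)
```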

4.1.2. The Problem

Let’s incorporate our model into an optimization problem. As before, we can apply MLE in order to determine the correct value of \(\theta\) for our model \(f(x) = \frac{1}{1 + e^{-\theta^\top x}}\). The MLE objective is \(\arg \max_\theta P(\{x_i, y_i\}_{i=1}^N | \theta)\). The probability of a data point \(x\) being in class 1 is given by \(f(x)\), and of being in class 0 by \(1 - f(x)\). The respective label \(y\) tells us the true value. Thus, the probability of observing the pair \((x, y)\) in our dataset given \(\theta\) can be expressed as \(f(x)^y (1-f(x))^{1-y}\). If the label \(y = 0\), then only the second term, the model’s probability of predicting zero, matters. If \(y = 1\), then only the first term, the model’s probability of predicting one, matters.

\[\begin{split}\begin{align*} & \arg \max_\theta P(\{x_i, y_i\}_{i=1}^N | \theta) = \arg \max_\theta \prod_{i=1}^N f(x_i)^{y_i} (1-f(x_i))^{(1 - y_i)} \\ &= \arg \max_\theta \ln \prod_{i=1}^N f(x_i)^{y_i} (1-f(x_i))^{(1 - y_i)} = \arg \max_\theta \sum_{i=1}^N \ln f(x_i)^{y_i} (1-f(x_i))^{(1 - y_i)} \\ &= \arg \max_\theta \sum_{i=1}^N y_i \ln f(x_i) + (1 - y_i) \ln (1-f(x_i)) \\ &= \arg \min_\theta \sum_{i=1}^N - y_i \ln f(x_i) - (1 - y_i) \ln (1-f(x_i)) \end{align*}\end{split}\]

This particular expression is known as the binary cross-entropy loss. It gets this name because it measures the cross entropy between the distribution of the true labels and the distribution predicted by the model. If you pay careful attention to the expression, you can see that the loss is high when the prediction disagrees with the label. If the label is a 1, only the first term matters, and the loss is the negative log of the model’s predicted probability of class one: this is zero when that probability is 1 and grows without bound as it approaches 0. The case for a label of 0 is symmetric in the second term.
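The loss translates directly to code. The sketch below follows the sum form of the objective above; the small `eps` clipping is a standard practical guard against `log(0)` that does not appear in the derivation.

```python
import numpy as np

def binary_cross_entropy(theta, X, y, eps=1e-12):
    """sum_i  -y_i ln f(x_i) - (1 - y_i) ln(1 - f(x_i))."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    p = np.clip(p, eps, 1.0 - eps)   # keep the logs finite at saturated predictions
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```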

4.1.3. Optimization

Now our objective is to minimize the (binary) cross-entropy loss. It turns out that logistic regression doesn’t have a closed-form solution, but we can easily apply gradient descent. Let’s take the gradient of the loss to determine the gradient descent update rule. When doing this, we’ll apply a useful identity: \(\frac{\partial}{\partial z} g(z) = g(z) (1 - g(z))\). The proof is a direct computation: \(\frac{\partial}{\partial z} (1 + e^{-z})^{-1} = \frac{e^{-z}}{(1 + e^{-z})^2} = g(z) \cdot \frac{e^{-z}}{1 + e^{-z}} = g(z)(1 - g(z))\). In our derivation, we’ll express our model \(f(x)\) using the sigmoid as \(g(\theta^\top x)\).
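If you want to convince yourself of the identity numerically before using it, a central finite difference agrees with \(g(z)(1-g(z))\) to high precision. This check is just a sanity test, not part of the derivation.

```python
import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (g(z + h) - g(z - h)) / (2.0 * h)   # central-difference approximation of g'(z)
analytic = g(z) * (1.0 - g(z))                # the claimed identity

print(np.max(np.abs(numeric - analytic)))     # agreement to roughly 1e-11
```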

\[\begin{split}\begin{align*} \nabla_\theta J(\theta) &= \nabla_\theta \sum_{i=1}^N - y_i \ln g(\theta^\top x_i) - (1 - y_i) \ln (1-g(\theta^\top x_i)) \\ &= - \sum_{i=1}^N \left( y_i \frac{1}{g(\theta^\top x_i)} - (1- y_i) \frac{1}{1-g(\theta^\top x_i)} \right) \nabla_\theta g(\theta^\top x_i) \\ &= - \sum_{i=1}^N \left( y_i \frac{1}{g(\theta^\top x_i)} - (1- y_i) \frac{1}{1-g(\theta^\top x_i)} \right) g(\theta^\top x_i) \left(1 - g(\theta^\top x_i) \right) \nabla_\theta \theta^\top x_i \\ &= - \sum_{i=1}^N (y_i - g(\theta^\top x_i)) \nabla_\theta \theta^\top x_i \\ &= - \sum_{i=1}^N (y_i - g(\theta^\top x_i)) x_i \\ &= - X^\top (y - g(X\theta)) \\ \end{align*}\end{split}\]

Note that we have assumed that \(g(X\theta)\) applies the sigmoid function element-wise to the vector \(X \theta\). We can then simply run gradient descent to optimize the objective. Our final gradient descent update rule is thus:

\[\theta \leftarrow \theta + \alpha X^\top (y - g(X\theta))\]
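Putting the update rule into a loop gives a complete training procedure. This is a minimal sketch: the step size `alpha` and iteration count `n_steps` are illustrative choices, and because the gradient sums over all \(N\) points, `alpha` must be kept small (dividing the gradient by \(N\) is a common alternative).

```python
import numpy as np

def fit_logistic_regression(X, y, alpha=1e-3, n_steps=5000):
    """Gradient descent on the binary cross-entropy loss:
    theta <- theta + alpha * X^T (y - g(X theta))."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # g(X theta), applied element-wise
        theta = theta + alpha * X.T @ (y - p)    # step opposite the gradient of the loss
    return theta
```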

4.1.4. Inference

Finally, we can run inference with our trained model as follows:

\[\begin{split} h(x) = \begin{cases} 1 & \text{if } g(\theta^\top x) > 0.5 \\ 0 & \text{otherwise} \end{cases} \end{split}\]
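In code, the decision rule is just a threshold on the predicted probability. Continuing the running sketch (and assuming the earlier snippets defining `X`, `y`, and `fit_logistic_regression` have been run), we can fit \(\theta\) on the synthetic data and check the training accuracy; the 0.5 threshold comes straight from the rule above.

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """h(x) = 1 if g(theta^T x) > threshold, else 0."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (p > threshold).astype(int)

theta_hat = fit_logistic_regression(X, y)        # from the training sketch above
accuracy = np.mean(predict(theta_hat, X) == y)   # fraction of correctly classified points
print(f"training accuracy: {accuracy:.3f}")
```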