4.3. Logistic Variants

Just like linear regression, logistic regression has many variants that correspond to slightly different modeling choices. This section goes through those variants.

4.3.1. Regularization

In linear regression, ridge regression and lasso regression introduced regularization to the original learning objective. Regularization techniques are used to reduce the search space of a problem. We’ll delve more into how regularization affects the output of models later, but for now know that the \(l2\) and \(l1\) weight penalties used by ridge and lasso, respectively, are forms of regularization. We can apply these same augmentations to logistic regression.

4.3.1.1. \(l2\) Regularized Logistic Regression

As with ridge regression, if we assume a Gaussian prior over the weights, we can derive the \(l2\) penalty term using MAP. If we do this, the new logistic regression objective becomes:

\[\arg \min_\theta \sum_{i=1}^N - y_i \ln f(x_i) - (1 - y_i) \ln (1-f(x_i)) + \lambda ||\theta||_2^2\]
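To make this concrete, here is a minimal NumPy sketch of the \(l2\)-regularized objective and its gradient, fit with a few steps of gradient descent. The function names and toy data below are illustrative assumptions, not something prescribed by this text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_logistic_loss(theta, X, y, lam):
    """Negative log-likelihood plus the l2 penalty lambda * ||theta||_2^2."""
    p = sigmoid(X @ theta)                      # f(x_i) for every sample
    eps = 1e-12                                 # guard against log(0)
    nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return nll + lam * np.sum(theta ** 2)

def l2_logistic_grad(theta, X, y, lam):
    """Gradient of the objective above with respect to theta."""
    p = sigmoid(X @ theta)
    return X.T @ (p - y) + 2 * lam * theta

# Toy usage: a few gradient-descent steps on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
theta = np.zeros(3)
for _ in range(500):
    theta -= 0.01 * l2_logistic_grad(theta, X, y, lam=0.1)
```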

4.3.1.2. \(l1\) Regularized Logistic Regression

As with lasso regression, if we assume a Laplacian prior over the weights, we can derive the \(l1\) penalty term using MAP. If we do this, the new logistic regression objective becomes:

\[\arg \min_\theta \sum_{i=1}^N - y_i \ln f(x_i) - (1 - y_i) \ln (1-f(x_i)) + \lambda ||\theta||_1\]
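In practice, both penalized variants are available off the shelf; for example, scikit-learn’s LogisticRegression exposes them through its penalty parameter, with C playing the role of \(1/\lambda\). The snippet below is an illustrative sketch rather than something required by this text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# l2 penalty (the default); larger C means weaker regularization.
ridge_like = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# l1 penalty; needs a solver that supports it, e.g. "liblinear" or "saga".
lasso_like = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

print(lasso_like.coef_)  # the l1 penalty tends to drive some coefficients to exactly zero
```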

4.3.1.3. Weighted Logistic Regression

If we believe some sample points are more informative than others, we can also apply a weight \(w_i\) to each sample point’s term in the loss.

\[\arg \min_\theta \sum_{i=1}^N - w_i y_i \ln f(x_i) - w_i (1 - y_i) \ln (1-f(x_i))\]
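A weighted version of the loss is a one-line change to the earlier sketch; the helper below is an illustrative assumption. (Libraries typically support this directly, e.g. scikit-learn’s fit accepts a sample_weight argument.)

```python
import numpy as np

def weighted_logistic_loss(theta, X, y, w):
    """Negative log-likelihood with a per-sample weight w_i on each term."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    eps = 1e-12
    return -np.sum(w * (y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
```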

4.3.2. Multiple Classes

So far, we have assumed that we only care about binary prediction (0 or 1), but you could imagine a scenario where you are presented with a discrete set of categories \(0\) to \(k-1\) and want to select the correct classification. There are a number of ways of doing this, and as the ideas presented by each are useful, we’ll go through them all.

4.3.2.1. One vs. Many

One way to approach the multi-class prediction problem is to transform it into many binary classification problems and apply what we already know. If we have \(k\) classes we want to predict, we can partition the data into two groups: the class we are interested in, class \(i\), and a super class containing the remaining \(k-1\) classes. With this partitioned data, we can train a logistic regression classifier to predict class \(i\) or not class \(i\), or more specifically the probability of being in class \(i\). After training a total of \(k\) classifiers, we select the class corresponding to the model with the highest output value, which can be interpreted as taking the argmax over the per-class probabilities.
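Here is a rough sketch of this scheme (often called one-vs-rest), assuming scikit-learn’s binary LogisticRegression as the base classifier; the helper names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_one_vs_rest(X, y, k):
    """Train one binary classifier per class: class i vs. all remaining classes."""
    return [LogisticRegression().fit(X, (y == i).astype(int)) for i in range(k)]

def predict_one_vs_rest(models, X):
    """Pick the class whose classifier assigns the highest probability of 'class i'."""
    scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    return np.argmax(scores, axis=1)
```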

4.3.2.2. One vs. One

Another way to transform multi-class prediction into binary prediction is to train classifiers that predict class \(i\) vs. class \(j\). For \(k\) classes, there are a total of \(k(k-1)/2\) pairs, meaning for this approach we would have to train that many independent classifiers. After training all of the classifiers, we select the final classification by “voting”: each classifier votes for the class it decides the data point belongs to, and the class with the most votes is chosen as the classification of the data point.
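A sketch of one-vs-one training and voting, under the same assumptions as before (scikit-learn’s binary classifier, illustrative helper names):

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def fit_one_vs_one(X, y, k):
    """Train one classifier per pair (i, j), using only samples from those two classes."""
    models = {}
    for i, j in combinations(range(k), 2):
        mask = (y == i) | (y == j)
        models[(i, j)] = LogisticRegression().fit(X[mask], (y[mask] == j).astype(int))
    return models

def predict_one_vs_one(models, X, k):
    """Each pairwise classifier votes; the class with the most votes wins."""
    votes = np.zeros((X.shape[0], k), dtype=int)
    for (i, j), m in models.items():
        winner = np.where(m.predict(X) == 1, j, i)   # predicted class for this pair
        votes[np.arange(X.shape[0]), winner] += 1
    return np.argmax(votes, axis=1)
```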

4.3.2.3. Multiple Outputs

Rather than transforming the multi-class problem into many binary classification problems, we can directly model multiple outputs. In logistic regression, we used the likelihood ratio between class 1 and class 0, or how much more likely class 1 was than class 0. Rather than modeling the likelihood ratio between two classes, one could imagine directly modeling a categorical distribution with \(k\) bins rather than a Bernoulli distribution (2 bins).

The basic idea is to make the probability of each class directly proportional to a linear model, i.e. \(\ln p_i \propto \theta_i^\top x\), or equivalently \(p_i \propto e^{\theta_i^\top x}\). Note that, as before, exponentiating the linear model transforms its potentially negative outputs into positive values. However, if we directly use the \(e^{\theta_i^\top x}\) terms for each class, the resulting values need not sum to 1. Thus, we compute a normalization constant to make sure the sum of all of the \(p_i\)’s is one. This normalization constant is just the sum over all of the classes, \(\sum_{j=1}^k e^{\theta_j^\top x}\), and we divide each term by it. Thus:

\[ P(Y = i | X = x) = \frac{e^{\theta_i^\top x}}{\sum_{j=1}^k e^{\theta_j^\top x}} \]

This is known as the softmax function. It transforms \(k\) inputs into a categorical distribution. It can be shown that you recover the sigmoid from the softmax function in the two-class case. Instead of having a single weight vector \(\theta\), we have one for each class, \(\theta_i\), which we can stack into a \(k \times d\) matrix \(\Theta\). We want our model to output a prediction probability for each of the \(k\) classes. We can compute the linear component of the model using the matrix multiply \(\Theta x\). We then apply the softmax to the vector \(\Theta x\) and get our final model \(f_\Theta(x)\), which outputs the probabilities for each class.
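As a small sketch (the function names are assumptions, not fixed by the text), the softmax and the resulting multi-class model can be written as:

```python
import numpy as np

def softmax(z):
    """Map a length-k vector of scores to a categorical distribution."""
    z = z - np.max(z)          # subtracting the max improves numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def f_multiclass(Theta, x):
    """Class probabilities from the k x d weight matrix Theta and feature vector x."""
    return softmax(Theta @ x)
```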

\[ P(\{x_i, y_i\}_{i=1}^N | \Theta) = \prod_{i=1}^N P(Y = y_i | X = x_i) = \prod_{i=1}^N \mathbb{1}\{y_i = j\}^\top f_\Theta(x_i) \]

Here \(\mathbb{1}\{y_i = j\}\) is a length-\(k\) vector whose \(j\)-th element is one if \(y_i = j\) and zero otherwise, i.e. a one-hot encoding of \(y_i\). Then, as before, we take the log and optimize.

\[\begin{split}\begin{align*} \arg \max_{\Theta} \ln \prod_{i=1}^N \mathbb{1}\{y_i = j\}^\top f_\Theta(x_i) &= \arg \max_{\Theta} \sum_{i=1}^N \mathbb{1}\{y_i = j\}^\top \ln f_\Theta(x_i) \\ &= \arg \min_{\Theta} \sum_{i=1}^N \mathbb{1}\{y_i = j\}^\top \left(- \ln f_\Theta(x_i)\right) = \arg \min_{\Theta} \sum_{i=1}^N \sum_{j=1}^k - \mathbb{1}\{y_i = j\} \ln f_\Theta(x_i)_j \end{align*}\end{split}\]

This is known as the cross-entropy loss; the binary objective from before is the special case with \(k = 2\).
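A vectorized sketch of this loss, assuming the labels have already been one-hot encoded into an \(N \times k\) matrix (the names below are illustrative):

```python
import numpy as np

def cross_entropy_loss(Theta, X, Y_onehot):
    """Sum over samples of -1{y_i = j}^T ln f_Theta(x_i), with softmax outputs."""
    logits = X @ Theta.T                                        # N x k matrix of scores
    logits = logits - np.max(logits, axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    return -np.sum(Y_onehot * log_probs)
```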

4.3.3. Information Theory & Logistic Regression

So far, we have approached logistic regression and its variants using MLE and MAP. We can, however, use other probabilistic tools to derive these classification models. In fact, this information-theoretic approach is how the cross-entropy loss gets its name.

Let’s consider the data distribution. Drawing a data point \((x, y)\) from the distribution involves some randomness. We measure this randomness using a quantity referred to as entropy. Entropy is also the minimum expected number of bits required to encode a data point, given that you know the distribution it came from. But why are we talking about randomness when we want our classifiers to be precise? We can consider minimizing the randomness of our classifier’s predictions with respect to the underlying data distribution.
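For reference, using the standard definitions (not restated elsewhere in this section), the entropy of a distribution \(p\) and the cross-entropy between distributions \(p\) and \(q\) are

\[ H(p) = -\sum_i p_i \ln p_i, \qquad H(p, q) = -\sum_i p_i \ln q_i \]

and the multi-class loss above is exactly the cross-entropy between the one-hot label distribution and the model’s predicted distribution, summed over the data.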

To be completed.