2.2. Estimation

We’ll now use conditional probability to try to make correct guesses about the real world given some data. This is often referred to as estimation. In general, let’s assume that we observe \(X\), which can take the form of a single observation or a series of observations, and want to guess the value of \(Y\), some unknown property. In the last section, we used a coin tossing example. Here we’ll consider a similar example with some modified parameters to formalize the decision making process. For convenience, the example has been copied into this section.

Example: Potentially Unfair Coins

Let’s say that you have a mystery coin that is either fair or biased, but you don’t know which. You are allowed to flip the coin three times to collect data, and you observe heads all three times.

2.2.1. Maximum a Posteriori Estimation

Let’s take the exact setup from last section, where we know ahead of time that there is a 0.75 probability that the coin is fair and a 0.25 probability that the coin is rigged to always land on heads. Such information known in advance is referred to as a prior: information about the distribution of the variable we are trying to predict, in this case whether the coin is fair or rigged. Maximum a Posteriori (MAP) estimation uses priors in order to weigh different choices.

Our goal is to choose the value \(y\) of the random variable \(Y\) we are trying to infer, given some observations. In essence, we maximize the probability of being correct by choosing the explanation that is most likely given the data and our prior knowledge. We can write this mathematical objective formally and apply Bayes’ rule.

\[\begin{split}\begin{align} y^* &= \arg\max_y P(Y = y | X = x) = \arg\max_y \frac{P(Y=y \cap X=x)}{P(X = x)} = \arg\max_y \frac{P(X =x | Y = y) P(Y = y)}{P(X = x)} \\ &= \arg\max_y P(X =x | Y = y) P(Y = y) \end{align}\end{split}\]

We know that we could compute \(P(X = x)\) because we have assumed that our data \(x\) comes from the data distribution \(p_{\text{data}}\), but we can ignore this value in practice as it is the same for every choice of \(y\) and thus has no effect on the optimization. The result is rather simple: we choose the explanation \(y^*\) that maximizes the likelihood of our data, weighted by the probability of \(y^*\) actually being the cause. The marginal probability \(P(Y)\) is referred to as the prior, while the conditional probability of the explanation given the data, \(P(Y|X)\), is referred to as the posterior. Maximum a Posteriori estimation simply means maximize the posterior.

Let’s apply MAP to the coin problem. We’ll let \(Y = 1\) if the coin is fair and \(Y = 0\) if the coin is unfair, and let \(X\) be the number of heads observed. Then our prior is \(P(Y=1) = 0.75, P(Y=0)=0.25\). Applying Bayes’ rule, we need to compute \(P(X = 3 | Y = 1)\) and \(P(X = 3 | Y = 0)\). If the coin is fair, we know \(P(X = 3 | Y = 1) = (0.5)^3 = 0.125\), and if the coin is unfair and thus only lands on heads, \(P(X = 3 | Y = 0) = 1\). We then compare the value if the coin is fair, \(0.75 \times 0.125\), to the value if the coin is rigged, \(0.25 \times 1\), and notice that the relative value of the posterior for the coin being unfair is higher, and thus that is the more likely explanation.
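The comparison above can be sketched in a few lines of code. The dictionary keys and variable names below are just one illustrative way to organize the computation:

```python
# MAP estimation for the two-hypothesis coin problem.
# Y = 1: fair coin, Y = 0: always-heads coin. X = number of heads in 3 flips.

prior = {1: 0.75, 0: 0.25}          # P(Y = y)
likelihood = {1: 0.5 ** 3, 0: 1.0}  # P(X = 3 | Y = y)

# Unnormalized posterior: P(X = 3 | Y = y) * P(Y = y).
# We can skip dividing by P(X = 3) since it is the same for both hypotheses.
posterior = {y: likelihood[y] * prior[y] for y in prior}

y_map = max(posterior, key=posterior.get)
print(posterior)  # {1: 0.09375, 0: 0.25}
print(y_map)      # 0 -> the unfair coin is the MAP estimate
```

Note that the unnormalized posterior values (0.09375 and 0.25) do not sum to one; only their relative sizes matter for the argmax.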

2.2.1.1. Continuous Priors

The previous example assumed that we were making a discrete decision: fair or unfair. We could have extended this to a set of different choices, such as fair, biased with probability \(p=0.75\), or completely rigged to always land on heads, and still just take the max of the posterior across all possible explanations. However, many problems require making decisions across a continuous set of values, or a continuous prior. The math turns out to be nearly identical, though we need to use continuous pdfs instead of discrete probability values (and perhaps some calculus). Let’s take a variant of the coin estimation example. Now instead of predicting fair or unfair, we want to predict the bias of the coin, specifically the parameter \(y\) between \(0\) and \(1\) that gives the probability that the coin lands on heads. We additionally know that coins are usually fair, and thus have the following prior. There’s an 80% chance that a random coin has a probability of flipping heads uniformly distributed between \(0.4\) and \(0.6\), and a 20% chance it’s uniformly distributed over the remaining values.

First, we need to find the pdf for this prior distribution over \(y\). 80% of the area (0.8) of the pdf has to be within a uniform rectangle from 0.4 to 0.6, which has a base width of 0.2 and thus a height of 4. The remaining 20% of the area (0.2) has to uniformly cover a total length of 0.8 and thus has a height of 0.25. You should verify that this a) makes sense, and b) integrates to one. So our prior is:

\[\begin{split} f_Y(y) = \begin{cases} 4 & \text{if } 0.4 \leq y \leq 0.6\\ 0.25 & \text{otherwise} \end{cases} \end{split}\]
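The normalization of this pdf is easy to verify numerically. Here is a small sketch using a midpoint Riemann sum (the function name and grid size are arbitrary choices):

```python
def prior_pdf(y):
    """Prior density f_Y(y) over the coin's heads probability."""
    return 4.0 if 0.4 <= y <= 0.6 else 0.25

# Midpoint Riemann sum over [0, 1]: each of the n cells has width 1/n,
# so the integral is approximately the average of the density at midpoints.
n = 10_000
total = sum(prior_pdf((i + 0.5) / n) for i in range(n)) / n
print(total)  # 1.0
```

Because the density is piecewise constant with breakpoints at 0.4 and 0.6, this sum is exact: \(0.2 \times 4 + 0.8 \times 0.25 = 0.8 + 0.2 = 1\).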

We have the same objective, but now with a continuous pdf: \(\arg\max_y P(X = x|Y = y) f_Y(y)\). Conditioned on the value of \(y\), the probability of seeing three heads is \(P(X = 3 | Y = y) = y^3\). As the probability of getting three heads is monotonically increasing over \([0,1]\), we need only consider the largest value of \(y\) in each of the two prior cases when taking the argmax. If we are in the region where \(f_Y(y) = 4\), the largest value of \(y\) is 0.6; otherwise the largest value is \(1\). Thus, we compare \(P(Y = 0.6 | X = 3) \propto (0.6)^3 \times 4 = 0.864\) to \(P(Y = 1 | X = 3) \propto (1)^3 \times 0.25 = 0.25\), and conclude that the most likely bias for the coin is 0.6.
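Instead of reasoning about the two cases analytically, we can also approximate the continuous argmax with a simple grid search over candidate biases. This is a numeric sketch, not part of the original derivation:

```python
# Grid search for the MAP estimate of the coin's bias y after seeing 3 heads.
# Unnormalized posterior: P(X = 3 | Y = y) * f_Y(y) = y**3 * f_Y(y).

def prior_pdf(y):
    """Prior density f_Y(y): 4 on [0.4, 0.6], 0.25 elsewhere on [0, 1]."""
    return 4.0 if 0.4 <= y <= 0.6 else 0.25

grid = [i / 10_000 for i in range(10_001)]  # candidate biases in [0, 1]
y_map = max(grid, key=lambda y: y ** 3 * prior_pdf(y))
print(y_map)  # 0.6
```

The grid search agrees with the analysis: the posterior density peaks at \(y = 0.6\), where \((0.6)^3 \times 4 = 0.864\) beats \((1)^3 \times 0.25 = 0.25\).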

2.2.2. Maximum Likelihood Estimation

The critical assumption of MAP is that we have access to a prior, and more so, that our prior is an accurate description of the world. If our prior isn’t accurate, it could bias us into making an inaccurate decision. For many problems, it’s unrealistic to assume that we know any prior. In these scenarios, the best we can do is assume that everything is equally likely. In other words, we assume a uniform prior. This turns out to yield a special case of MAP referred to as Maximum Likelihood Estimation (MLE). Let’s start with MAP, assume a uniform prior (or a constant), and see what happens.

\begin{equation} y^* = \arg\max_y P(X =x | Y = y) P(Y = y) = \arg\max_y P(X =x | Y = y) c = \arg\max_y P(X =x | Y = y) \end{equation}

As the prior is the same for every value of \(y\), we can omit the prior from the optimization equation and thus our MLE estimate is just given by the explanation that yields the highest likelihood of seeing the observed data.

Let’s apply MLE to estimate the bias of a coin given that it landed on heads three times. Note that by applying MLE, we assume that all biases of the coin are equally likely, something we know not to be true based on experience. \begin{equation} y^* = \arg\max_y P(X = 3 | Y = y) = \arg\max_y y^3 = 1 \end{equation} The MLE estimate tells us that the most likely explanation for seeing three heads is having a coin that only ever lands on heads. While perhaps reflective of the data, this does not seem like an accurate depiction of the world.
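The MLE computation can also be checked with the same grid-search idea as before; with the uniform prior dropped, we maximize the likelihood \(y^3\) alone. A minimal sketch:

```python
# MLE for the coin's bias after seeing 3 heads: maximize the likelihood
# P(X = 3 | Y = y) = y**3 over a grid of candidate biases. The uniform
# prior is constant, so it has no effect on the argmax.

grid = [i / 1000 for i in range(1001)]  # candidate biases in [0, 1]
y_mle = max(grid, key=lambda y: y ** 3)
print(y_mle)  # 1.0
```

As the analysis predicts, the likelihood is maximized at the boundary \(y = 1\): a coin that always lands on heads.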