1.3. Data in ML

Let’s go back to our original assumption about data: that it comes from some probability distribution \(p_{\text{data}}\). Though this may seem daunting, it simply means that we can assign some probability or density to any data point we see, for example each natural image or each sentence in the English language. Our goal in machine learning is thus to build a model that performs well on this data distribution. However, we quickly run into problems when trying to calculate the data distribution of objects of interest. It’s impossible to assign a probability to an image without seeing all the natural images in the world. It’s impossible to construct the distribution of sentences without measuring every single sentence that has ever been said in the history of the English language. Even more simply, it’s impossible to know the exact probability that a coin lands on heads when flipped. The data distribution is often intractable, and thus we need to come up with a good estimate.

The best estimate, perhaps, is simply given by the sample of data we collect. Data points \(x_i\) collected from the real world must have come from the actual data distribution, or \(x_i \sim p_{\text{data}}\). Thus, in cases where we don’t know the true data distribution, we can use our sample of data to estimate the true distribution. We use what’s called the empirical distribution, defined below.

Definition: The empirical distribution of \(N\) data points \(x_1, ..., x_N\) is a discrete probability distribution where each data point \(x_i\) has an equal chance of appearing. Namely, \(P(X = x_i) = \frac{1}{N}\).
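In code, the empirical distribution is just the collected sample itself: sampling from it means picking one of the observed points uniformly at random. Here's a minimal sketch using NumPy; the data values are hypothetical and stand in for any real sample.

```python
import numpy as np

# Hypothetical sample of N data points drawn from the (unknown) true distribution.
x = np.array([2.1, -0.3, 1.7, 0.9, -1.2, 0.4])
N = len(x)

# The empirical distribution places probability 1/N on each observed point,
# so sampling from it is just sampling the observed points uniformly with replacement.
samples = np.random.choice(x, size=10, replace=True)  # each x_i has P(X = x_i) = 1/N
print(samples)
```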

Thus, the empirical distribution is a discrete estimate of the true data distribution, and it converges to the true distribution in the limit as \(N \rightarrow \infty\). The formal proof of this is quite involved and invokes something called the Strong Law of Large Numbers. While it's a very interesting topic, it's slightly out of scope for these notes. Instead, we’ll prove a simpler fact that indicates convergence: in expectation, the mean of the empirical distribution is the same as that of the actual data distribution. We’ll denote the empirical distribution formed by our sample of \(N\) data points as \(p_{\text{emp}}\). First, some facts about the empirical distribution.

\[\bar{x} = E_{X \sim p_{\text{emp}}}[X] = \sum_{i=1}^N \frac{1}{N} x_i = \frac{1}{N} \sum_{i=1}^N x_i\]
\[\hat{\sigma}^2 = \text{Var}_{X \sim p_{\text{emp}}}(X) =\frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})^2\]
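These two quantities are straightforward to compute directly from the sample. A small sketch below, again with a hypothetical data array; note that the \(\frac{1}{N}\) normalization matches NumPy's default (`ddof=0`) for the variance.

```python
import numpy as np

x = np.array([2.1, -0.3, 1.7, 0.9, -1.2, 0.4])  # hypothetical sample
N = len(x)

# Empirical mean: (1/N) * sum_i x_i
x_bar = x.sum() / N                            # equivalent to np.mean(x)

# Empirical variance: (1/N) * sum_i (x_i - x_bar)^2
sigma_hat_sq = ((x - x_bar) ** 2).sum() / N    # equivalent to np.var(x, ddof=0)

print(x_bar, sigma_hat_sq)
```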

We want to show that the expected value of the empirical mean \(\bar{x}\) equals the true mean \(E[X]\). The proof is quite simple. We apply linearity of expectation together with the fact that each data point \(x_i \sim p_{\text{data}}\), and thus \(E[x_i] = E[X]\). \begin{equation} E[\bar{x}] = E\left[\frac{1}{N} \sum_{i=1}^N x_i\right] = \frac{1}{N} \sum_{i=1}^N E[x_i] = \frac{1}{N} \sum_{i=1}^N E[X] = \frac{N}{N} E[X] = E[X] \end{equation}
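We can also see this convergence numerically. The sketch below simulates the coin-flip example from earlier: the true distribution is a (hypothetical) biased coin with \(P(\text{heads}) = 0.7\), and the empirical mean of larger and larger samples approaches the true mean \(E[X] = 0.7\).

```python
import numpy as np

rng = np.random.default_rng(0)
p_heads = 0.7  # hypothetical "true" distribution: a biased coin, so E[X] = 0.7

for N in [10, 100, 10_000, 1_000_000]:
    flips = rng.binomial(1, p_heads, size=N)  # N samples x_i ~ p_data
    x_bar = flips.mean()                      # mean of the empirical distribution
    print(f"N = {N:>9}: empirical mean = {x_bar:.4f} (true mean = {p_heads})")
```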

Armed with a framework for representing data, we can now begin to make decisions.