1.1. Probability

1.1.1. Basic Theory

Probability is a mathematical framework used to understand randomness. Our world is random, and thus so is our data. Here are a few examples of randomness present in data:

  • The ordering of the deck in card games.

  • The actions of other players or humans, including mistakes in labeling data.

  • Quantum-level interactions of light that change images.

  • Noise in sensors.

By applying a probabilistic framework to our data, we hope to make sense of all these sources of randomness. Let’s begin with a simple example that we’ll use throughout this note:

Example: Tossing Coins

You have a coin, but you aren’t sure whether it is fair. We define the coin to be fair if it truly lands on heads 50% of the time. How can we figure out if this is true? By collecting and analyzing data. Every time we flip the coin, there are two things that can happen: the coin can land on heads, or it can land on tails. This means our random experiment of flipping the coin has two probabilistic outcomes \(\omega\): either \(\omega = H\) for heads or \(\omega = T\) for tails.

In probability, we denote each possible result of a random trial by an outcome \(\omega\), and the set of all possible outcomes by \(\Omega\). In our coin-flipping example, \(\Omega = \{H,T\}\). If the coin is fair, we believe the probability of getting heads is \(P(H) = 0.5\), or 50%. Using this intuition, we can define the basics of a probability space.

Definition: A probability space \((\Omega, P)\) is a set of collectively exhaustive outcomes \(\Omega\) and a function \(P: \Omega \rightarrow \mathbb{R}\) such that the following are true:

  1. \(P(\omega) \geq 0, \forall \omega \in \Omega\) or every outcome has an assigned non-negative likelihood.

  2. \(\sum_{\omega \in \Omega} P(\omega) = 1\) or the total probability has to sum to one, meaning something has to happen.
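
To make this definition concrete, here is a minimal sketch in Python (the dictionary representation is our own choice, not part of the note) that encodes the fair-coin probability space and checks both conditions:

    # Represent the probability space (Omega, P) for a fair coin as a dictionary
    # mapping each outcome in Omega to its probability.
    P = {"H": 0.5, "T": 0.5}

    # Condition 1: every outcome has non-negative probability.
    assert all(p >= 0 for p in P.values())

    # Condition 2: the probabilities over all outcomes sum to one.
    assert abs(sum(P.values()) - 1.0) < 1e-12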

In our coin-flipping example, we had two outcomes, \(H\) and \(T\). However, this notion of probability generalizes to more scenarios. Let’s say we can collect three data points to figure out whether the coin is fair. Our new outcomes are defined by the eight possible results, including \(HHH, TTT, HTH, THH,\) etc. Assuming the coin is fair, each flip has a \(0.5\) chance of landing on heads, so the probability of each of these outcomes is \(0.5 \times 0.5 \times 0.5 = 0.125\). The probability that some outcome happens is one: after we flip the coin three times, one of the outcomes in our set must have occurred.

We can additionally group multiple outcomes into a set of outcomes, referred to as an event. The probability of an event is simply the sum of the probabilities of the outcomes in its set.
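
As a sketch of both ideas in Python (the event “at least two heads” is our own illustrative choice), we can enumerate the eight three-flip outcomes, assign each a probability of \(0.125\), and compute an event’s probability by summing over its outcomes:

    from itertools import product

    # Enumerate the eight outcomes of three coin flips: HHH, HHT, ..., TTT.
    outcomes = ["".join(flips) for flips in product("HT", repeat=3)]
    P = {omega: 0.5 ** 3 for omega in outcomes}  # each outcome has probability 0.125

    # An event is a set of outcomes; its probability is the sum over that set.
    at_least_two_heads = {omega for omega in outcomes if omega.count("H") >= 2}
    print(sum(P[omega] for omega in at_least_two_heads))  # 0.5 (4 of the 8 outcomes)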

1.1.2. Random Variables

When we have thousands of data points, as is often the case in machine learning, dealing with probability spaces becomes rather cumbersome. Imagine that you want to flip the coin 10,000 times to determine whether or not it’s fair. The outcome space contains \(2^{10000}\) possible outcomes, as there are two possibilities for each of the 10,000 flips of the coin. In order to remedy this predicament, we can define abstractions on top of probability spaces that make dealing with large dimensionality easier. To do so, we define a random variable that converts each outcome to a real value.

Definition: A random variable is a function \(X: \Omega \rightarrow \mathbb{R}^d\).

When flipping a coin, we can assign \(X = 1\) when the result is heads, or \(X = 0\) when the result is tails. Rather than saying “heads” or “tails,” our data is just zeros and ones. What if we flip the coin 10 times? Well, we collect 10 data points \(X_1, X_2, ..., X_{10}\). The total number of heads \(Z\) would be \(Z = \sum_{i=1}^{10} X_i\), which is just an integer between 0 and 10.
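
A minimal sketch of this mapping in Python (the simulation is our own illustration, not part of the note):

    import random

    # The random variable X maps each outcome to a real value.
    X = {"H": 1, "T": 0}

    flips = [random.choice("HT") for _ in range(10)]  # ten simulated coin flips
    data = [X[omega] for omega in flips]              # e.g. [1, 0, 1, 1, 0, ...]
    Z = sum(data)                                     # total heads, an integer in 0..10
    print(flips, data, Z)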

That’s great, but what about probabilities? They simply carry over from the outcome space:

  • \(P(H) = P(X=1) = 0.5\).

  • \(P(HHHHHHHHHH) = P(Z=10) = (0.5)^{10}\), as the coin must have landed on heads ten times.

Random variables just map from the outcome space \(\Omega\) to real values that can be easier to deal with. This mapping still has to obey the laws of probability we previously set up. Namely, the probabilities of all values a random variable can take must still sum to one: \(\sum_{x \in \mathcal{X}} P(X=x) = 1\).
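
We can verify this numerically. The sketch below (our own check, assuming a fair coin) pushes the probabilities of all \(2^{10}\) ten-flip outcomes through \(Z\) and confirms that \(P(Z = 10) = (0.5)^{10}\) and that the probabilities over all values of \(Z\) sum to one:

    from itertools import product
    from collections import Counter

    # Count how many of the 2^10 equally likely outcomes map to each value of Z.
    counts = Counter(sum(flips) for flips in product([0, 1], repeat=10))
    P_Z = {z: n * 0.5 ** 10 for z, n in counts.items()}  # each outcome has prob (0.5)^10

    print(P_Z[10])            # 0.0009765625 == (0.5)**10, the all-heads outcome
    print(sum(P_Z.values()))  # 1.0, as required by the law above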

1.1.3. Continuous Probability

So far, everything we’ve done has worked great for our integer-valued coin-flipping problem! But what if we have a continuous set of outcomes, where the data we collect can take on any value in the interval \([0,1]\)? Now that we have defined random variables, we can extend probability theory to these continuous spaces. Let’s denote the value we collect from the interval \([0,1]\) as \(X\). What’s the probability that \(X\) takes on a given value, say \(0.25\)?

It’s a hard question, and you may not be able to immediately arrive at the answer. For argument’s sake, we could just assign some tiny probability, say \(P(X = 0.25) = 0.000000000000001\). But we defined the random variable \(X\) to have equal probability of being any value between zero and one, and there are infinitely many values between zero and one. If each outcome has non-zero probability, the total probability will sum to more than one! This leads to a rather counter-intuitive, but logically sound result: each individual outcome must have zero probability of occurring.

How could this be? One way to think about it is that continuous real numbers have an infinite amount of precision. If I take a one-foot ruler and measure an object (taking measurements in terms of feet), I can get an approximate length but never an exact one. The exact measurement would continue for an infinite number of decimal places, ultimately reaching atomic scale and differentiating between the widths of individual atoms. As such, any specific value must have zero probability. We get around this by arguing about intervals of probability. While we can’t assign a non-zero value to \(P(X = 0.25)\), we can logically argue that \(X\) must be in \([0,1]\) with probability one (as that’s where it’s defined), and that \(P(X \leq 0.5) = 0.5\): since each value is equally likely, there is a one-half chance of landing in the first half of the interval.
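
To build intuition, here is a quick Monte Carlo sketch (our own illustration): among many uniform draws from \([0,1]\), the empirical frequency of hitting any exact value is essentially zero, while intervals carry real probability mass:

    import random

    random.seed(0)
    samples = [random.random() for _ in range(100_000)]  # uniform draws from [0, 1)

    # Exact values are essentially never hit...
    print(sum(x == 0.25 for x in samples) / len(samples))  # ~0.0
    # ...but intervals have the probability we argued for.
    print(sum(x <= 0.5 for x in samples) / len(samples))   # ~0.5

We formalize this idea below: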

Definition: A continuous random variable \(X\) is defined by a probability density function \(f_X : \mathbb{R} \rightarrow \mathbb{R}\) such that:

  1. \(P(a \leq X \leq b) = \int_a^b f_X(x) dx\)

  2. \(\int_{-\infty}^{\infty} f_X(x) dx = 1\)

When drawing values uniformly from the interval \([0,1]\), the probability density function is \(f_X(x) = 1\) in the region \([0,1]\) and zero elsewhere. Note that the values of the density function \(f_X(x)\) can be larger than one, and even infinite! A simple example of this is drawing values uniformly from \([0, 0.5]\). Here, the density function is \(f_X(x) = 2\) in the region \([0,0.5]\) and zero elsewhere. See if you understand why!
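
As a final sketch (the midpoint-rule integrator below is our own, written to avoid external dependencies), we can check numerically that the density \(f_X(x) = 2\) on \([0, 0.5]\) exceeds one pointwise yet still integrates to one:

    # Density of a uniform draw from [0, 0.5]: larger than one, but integrates to one.
    def f_X(x):
        return 2.0 if 0.0 <= x <= 0.5 else 0.0

    def integrate(f, a, b, n=100_000):
        """Approximate the integral of f from a to b with the midpoint rule."""
        h = (b - a) / n
        return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

    print(integrate(f_X, -1.0, 1.0))  # ~1.0: total probability is one
    print(integrate(f_X, 0.0, 0.25))  # ~0.5: P(0 <= X <= 0.25) = 0.25 * 2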