CommeUnJeu · L1 PCSI

Probabilities on a finite universe

⌚ ~115 min ▢ 14 blocks ✓ 44 exercises ➣ Prerequisites : Counting, Sets

Probability theory is the mathematics of uncertainty: how to describe an experiment whose outcome is not predictable in advance, how to manipulate the « events » associated with it, and how to compute the chance that a given event occurs. The intuitive examples are everywhere --- a coin flip, a die roll, a card drawn from a shuffled deck, a ticket pulled from a hat. The chapter takes that intuition and turns it into a small, tight algebraic theory built on top of set theory and counting: an event is a subset of the universe of outcomes, and a probability is a way to measure such subsets that respects unions and complements.
The plan has three parts. The first sets up the modelling language: what is the universe $\Omega$ of outcomes, what are events, what is a complete system of events, and what is a probability. The keystone theorem of this part is the characterization of a probability by its distribution on the elementary events --- it says that to specify a probability on a finite universe, it is enough to specify the probability of each individual outcome. The uniform probability $P(A) = |A|/|\Omega|$ falls out as a special case, and it is here that the chapter Counting plugs in: every uniform-probability calculation is, ultimately, a counting calculation. The second part introduces the conditional probability $P_B(A) = P(A \cap B)/P(B)$, the workhorse for « step by step » reasoning, with three named theorems --- composed probabilities, total probability, and Bayes --- that organise the whole calculus. The third part introduces independence, the modelling assumption that lets us multiply probabilities, with the essential warning that pairwise independence is not the same as mutual independence.
Three reflexes the reader should leave with: (i) before any calculation, write down $\Omega$ and $P$ explicitly --- the same physical experiment can be modelled by different universes, and the right one is usually the one that makes $P$ uniform; (ii) translate every event description into set operations --- "or" becomes $\cup$, "and" becomes $\cap$, "not" becomes complement, "implies" becomes $\subseteq$; (iii) recognize a complete system of events when it shows up, and apply the total-probability formula whenever the experiment splits into cases depending on whether \dots. The chapter restricts attention strictly to finite universes: infinite universes (an indefinite sequence of coin flips, a point chosen in a disk) are second-year material and lie outside the program. The construction of random variables (their distributions, the binomial distribution, expectation, variance) is the content of the next chapter, Random Variables; here we only introduce the notation $X : \Omega \to E$ and the events of the form $\{X \in A\}$.

I Setting up the probabilistic framework

We start by translating the vocabulary of random experiments into the language of sets: an outcome is a point of a universe, an event is a subset, the complementary event is the set complement, the conjunction « $A$ and $B$ » is an intersection, and so on. Once that bridge is in place, defining a probability becomes a short additivity axiom on subsets, and the rest of the chapter is the systematic exploitation of additivity together with set-theoretic identities. The keystone of the section is the characterization theorem: a probability on a finite universe is entirely determined by its values on the elementary events. Uniform probability is the special case where all those values are equal, and it is the bridge to Counting.

I.1 Random experiment$\virgule$ universe$\virgule$ events

A random experiment is, intuitively, an experiment whose outcome is not predictable in advance: a die roll, a coin flip, a card drawn from a deck. To make this precise we list the possible outcomes and call the resulting set the universe of the experiment. An event is then any subset of the universe --- a property of the outcome that one can check after the experiment has been performed. Throughout the chapter, $\Omega$ denotes a non-empty finite universe; this assumption is implicit unless we say otherwise.

Definition — Random experiment$\virgule$ universe$\virgule$ event

A random experiment is an experiment whose outcome cannot be predicted in advance. The set of all possible outcomes is called the universe of the experiment, denoted $\Omega$; in this chapter $\Omega$ is always a finite non-empty set. Each element $\omega \in \Omega$ is an outcome. An event is any subset $A \subseteq \Omega$; we say that the event $A$ is realized when the outcome $\omega$ of the experiment belongs to $A$.
A singleton $\{\omega\}$ is called an elementary event. The whole universe $\Omega$ is the certain event (always realized), and the empty set $\emptyset$ is the impossible event (never realized). Two events $A$ and $B$ are incompatible when they cannot be realized simultaneously, that is when $A \cap B = \emptyset$.

Method — Translation table: probabilistic vocabulary $\leftrightarrow$ set operations

Every event description in words can be turned mechanically into a set expression on $\Omega$. Memorize the following table: $$ \renewcommand{\arraystretch}{1.3} \begin{array}{|l|c|l|} \hline \text{In words} & \text{Set operation} & \text{Example} \\ \hline \text{« $A$ or $B$ »} & A \cup B & \text{at least one of $A$, $B$ is realized} \\ \text{« $A$ and $B$ »} & A \cap B & \text{both $A$ and $B$ are realized} \\ \text{« not $A$ », « complement of $A$ »} & \overline{A} = \Omega \setminus A & A \text{ is not realized} \\ \text{« $A$ implies $B$ »} & A \subseteq B & A \text{ realized $\Rightarrow$ $B$ realized} \\ \text{« $A$ and $B$ incompatible »} & A \cap B = \emptyset & A \text{ and $B$ cannot both occur} \\ \hline \end{array} $$ The translation table is bidirectional: any set identity (De Morgan, distributivity, \dots) becomes an identity between events, and any reasoning on events can be carried out at the set level. The chapter on Sets provides the algebraic backbone.

Example

Consider the roll of a balanced six-sided die. The natural universe is $\Omega = \{1, 2, 3, 4, 5, 6\}$. Let $A$ = « obtain an even result » and $B$ = « obtain a result $\le 3$ ». Translating into sets: $$ A = \{2, 4, 6\}, \qquad B = \{1, 2, 3\}. $$ The compound events read off the table:

« $A$ or $B$ » $= A \cup B = \{1, 2, 3, 4, 6\}$.
« $A$ and $B$ » $= A \cap B = \{2\}$.
« complement of $A$ » $= \overline{A} = \{1, 3, 5\}$ (« obtain an odd result »).

Example

Consider two successive draws with replacement from an urn containing one black ball $K$ and one white ball $W$. The universe is $\Omega = \{KK, KW, WK, WW\}$. The event « the first ball is white » is $\{WK, WW\}$; the event « the two balls are of the same colour » is $\{KK, WW\}$; their intersection (« first white and two of the same colour ») is $\{WW\}$.

Skills to practice

Building events from set operations

I.2 Complete system of events

A complete system of events is a way to partition $\Omega$ into a list of pairwise disjoint events that together exhaust the universe. It is the case-by-case reflex: whenever an experiment can be analysed by saying « either the first ball is white, or the first ball is black », one is implicitly using the complete system $\{W_1, \overline{W_1}\}$. The total-probability formula will turn this reflex into a quantitative tool.

Definition — Complete system of events

Let $n \ge 1$. A finite family $(A_1, \ldots, A_n)$ of events of $\Omega$ is called a complete system of events (or partition of $\Omega$) when:

each event is non-empty: $A_i \ne \emptyset$ for all $i$;
the events are pairwise incompatible: $A_i \cap A_j = \emptyset$ for all $i \ne j$;
their union is $\Omega$: $\displaystyle\bigcup_{i=1}^n A_i = \Omega$.

Equivalently, $\Omega = \bigsqcup_{i=1}^n A_i$ is a disjoint union into non-empty pieces.
In the conditional formulas to come (total probability, Bayes), we will additionally require that each $A_i$ has strictly positive probability --- the indexing of the conditional sum is then restricted to indices with $P(A_i) > 0$ (see the total-probability Remark), but the complete system itself is unchanged.

Example

The simplest complete system: for any event $A$ with $A \ne \emptyset$ and $A \ne \Omega$, the pair $\{A, \overline{A}\}$ is a complete system. The two events are disjoint by construction and their union is $\Omega$; the conditions $A \ne \emptyset, \Omega$ guarantee that neither piece is empty.
A slightly richer one: given two events $A$ and $B$, consider the three events $$ A \cap B, \qquad A \cap \overline{B}, \qquad \overline{A}. $$ They are pairwise disjoint, and every $\omega \in \Omega$ lies in exactly one of them (in $A \cap B$ or $A \cap \overline{B}$ if $\omega \in A$, depending on whether $\omega \in B$; in $\overline{A}$ otherwise). After removing those that happen to be empty, the non-empty events among them form a complete system of $\Omega$. This three-piece partition is the standard tool to reason on two events at once.

Example

The « finest » complete system of $\Omega$ is the family of elementary events: $\big(\{\omega\}\big)_{\omega \in \Omega}$. Indeed, $\{\omega\} \cap \{\omega'\} = \emptyset$ for $\omega \ne \omega'$, and $\bigcup_{\omega \in \Omega} \{\omega\} = \Omega$. Every other complete system can be seen as a coarsening of this one: a partition of $\Omega$ into bigger blocks, each block being a union of elementary events.

Skills to practice

Identifying complete systems of events

I.3 Random variable (notation only)

Often an event is naturally described by a quantity associated to the outcome --- the value of a die, the number of « heads » in $n$ tosses, the suit of a drawn card. We capture this by a function $X : \Omega \to E$, called a random variable, and we write $\{X \in A\}$ for the event « $X$ takes its value in $A$ ». In this chapter we use only the notation: the full theory of $X$ (its distribution, the binomial distribution, expectation, variance) is the content of the next chapter, Random Variables.

Definition — Random variable$\virgule$ events $\{X \in A\}$

Let $\Omega$ be a finite universe and $E$ an arbitrary set. A random variable on $\Omega$ with values in $E$ is any function $X : \Omega \to E$. When $E = \mathbb{R}$, we say that $X$ is a real-valued random variable.
For any subset $A \subseteq E$, the event $\{X \in A\}$ is defined by $$ \{X \in A\} \ = \ X^{-1}(A) \ = \ \{\omega \in \Omega \ \mid \ X(\omega) \in A\}. $$ In particular, for $x \in E$, we write $\{X = x\}$ for the event $\{\omega \in \Omega \mid X(\omega) = x\}$, and when $X$ is real-valued, $\{X \le x\}$ for $\{\omega \in \Omega \mid X(\omega) \le x\}$.

Example

A balanced die is rolled twice; the universe is $\Omega = \llbracket 1, 6 \rrbracket^2$ and an outcome $\omega = (a, b)$ records the values of the first and second roll. Define $$ X_1(\omega) = a, \qquad X_2(\omega) = b, \qquad S(\omega) = X_1(\omega) + X_2(\omega). $$ Then $X_1, X_2$ and $S$ are real-valued random variables on $\Omega$. The event $\{S = 7\}$ is $$ \{S = 7\} \ = \ \{(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)\}, $$ a subset of cardinal $6$ inside $\Omega$ of cardinal $36$. The event $\{X_1 \le 2\}$ is the strip $\{1, 2\} \times \llbracket 1, 6 \rrbracket$, of cardinal $12$.

Proposition — The events $\{X \equal x\}$ partition $\Omega$

Let $X : \Omega \to E$ be a random variable. The family $\big(\{X = x\}\big)_{x \in X(\Omega)}$ is a complete system of events: $$ \Omega \ = \ \bigsqcup_{x \in X(\Omega)} \{X = x\}. $$

Proof

Each $\omega \in \Omega$ has a unique image $X(\omega) \in X(\Omega)$, so $\omega$ belongs to the event $\{X = X(\omega)\}$ and to no other $\{X = x\}$ with $x \ne X(\omega)$. Hence the events $\{X = x\}$ are pairwise disjoint, and their union covers $\Omega$.

Definition — Complete system associated with a random variable

The partition $\Omega = \bigsqcup_{x \in X(\Omega)} \{X = x\}$ is called the complete system associated with $X$. It is the natural complete system to use when one wants to compute the probability of an event by conditioning on the value of $X$ (see the total-probability formula).

Skills to practice

Translating events of a random variable

I.4 Probability on a finite universe

Up to now we have only spoken of events as set-theoretic objects: subsets of $\Omega$. To do probability, we need to attach to each event a real number in $[0, 1]$ that measures « how plausible » the event is. The minimal axiom: the certain event has probability $1$, and for two incompatible events the probability of their disjunction is the sum of the probabilities --- additivity. Every other property in the « Properties of probabilities » section will be a consequence of this single rule.

Definition — Probability on a finite universe

A probability on a finite universe $\Omega$ is a function $P : \mathcal{P}(\Omega) \to [0, 1]$ such that:

$P(\Omega) = 1$;
$P$ is additive: for all events $A, B \subseteq \Omega$ with $A \cap B = \emptyset$, $$ P(A \cup B) \ = \ P(A) + P(B). $$

The pair $(\Omega, P)$ is called a finite probability space. Two events $A$ and $B$ are said to be equiprobable when $P(A) = P(B)$.

Example

Balanced die. $\Omega = \llbracket 1, 6 \rrbracket$ with $P(\{k\}) = 1/6$ for each $k$. Additivity then forces $P(A) = |A|/6$ for any event $A$ --- this is the uniform probability we will study in the next subsection.
Biased coin. $\Omega = \{\mathrm{H}, \mathrm{T}\}$ with $P(\{\mathrm{H}\}) = p$ and $P(\{\mathrm{T}\}) = 1 - p$, where $p \in [0, 1]$ is the « probability of getting heads ». The case $p = 1/2$ recovers the balanced coin; the case $p = 0$ models a coin that always shows tails.

Example

Two flips of a biased coin --- a non-uniform finite probability space. A coin with $P(\{\mathrm{H}\}) = p$ is flipped twice. The natural universe is $\Omega = \{\mathrm{H}, \mathrm{T}\}^2$, of cardinal $4$. Assuming independence between the two flips (a modelling choice; we will return to it in the Independence section), the distribution on the four elementary events is $$ P(\{\mathrm{HH}\}) = p^2, \quad P(\{\mathrm{HT}\}) = p(1-p), \quad P(\{\mathrm{TH}\}) = p(1-p), \quad P(\{\mathrm{TT}\}) = (1-p)^2. $$ The probabilities sum to $p^2 + 2p(1-p) + (1-p)^2 = (p + (1-p))^2 = 1$, as required. The probability of « exactly one heads » is then $$ P(\{\mathrm{HT}, \mathrm{TH}\}) \ = \ P(\{\mathrm{HT}\}) + P(\{\mathrm{TH}\}) \ = \ 2p(1-p). $$ For $p = 1/3$, this gives $2 \cdot (1/3) \cdot (2/3) = 4/9$. Notice that for $p \ne 1/2$, the universe $\Omega$ is not uniform: the four elementary events have different probabilities. This will be the running theme: not every finite universe carries the uniform probability.

Skills to practice

Verifying the probability axioms

I.5 Distribution of probabilities and characterization

On a finite universe, the values of a probability on the elementary events are the « building bricks » of the whole probability: knowing them, one recovers the probability of every event by additivity. The characterization theorem of this subsection makes the statement precise: any family of non-negative reals summing to $1$, indexed by $\Omega$, is the distribution of a unique probability. In practice, this is how a probability is defined in concrete problems --- give the distribution on the elementary events, and the rest follows.

Definition — Distribution of probabilities

Let $E$ be a finite non-empty set. A probability distribution on $E$ is a family $(p_x)_{x \in E}$ of real numbers with $$ \forall x \in E, \ p_x \in [0, 1] \qquad \text{and} \qquad \sum_{x \in E} p_x = 1. $$

Theorem — Characterization of a probability by its distribution

Let $\Omega$ be a finite non-empty universe. For every distribution $(p_\omega)_{\omega \in \Omega}$ on $\Omega$, there exists a unique probability $P$ on $\Omega$ such that $$ \forall \omega \in \Omega, \quad P(\{\omega\}) \ = \ p_\omega. $$ This probability is given on every event $A \subseteq \Omega$ by $$ \textcolor{colorprop}{P(A) \ = \ \sum_{\omega \in A} p_\omega.} $$ Conversely, for every probability $P$ on $\Omega$, the family $\big(P(\{\omega\})\big)_{\omega \in \Omega}$ is a distribution on $\Omega$.

Proof

We prove existence and uniqueness by an analysis-synthesis argument.

Finite additivity (mini-lemma, used in Analysis below). The 2-event axiom extends to $n \ge 0$ pairwise incompatible events $B_1, \ldots, B_n$: $$ P\Bigl(\bigsqcup_{i=1}^n B_i\Bigr) \ = \ \sum_{i=1}^n P(B_i), $$ with the convention that the empty union is $\emptyset$ and the empty sum is $0$. Proof by induction on $n$. For $n = 0$ both sides are zero by convention (and $P(\emptyset) = 0$ follows from the axiom, by writing $\Omega = \Omega \cup \emptyset$ disjoint, additivity gives $P(\Omega) = P(\Omega) + P(\emptyset)$, hence $P(\emptyset) = 0$). For $n = 1$ the equality is trivial. For $n = 2$ it is the axiom. Assuming the formula at rank $n$, given $B_1, \ldots, B_{n+1}$ pairwise incompatible, the event $\bigsqcup_{i=1}^n B_i$ is incompatible with $B_{n+1}$ (their intersection is empty), so by the 2-event axiom $P(\bigsqcup_{i=1}^{n+1} B_i) = P(\bigsqcup_{i=1}^n B_i) + P(B_{n+1})$, which by the recurrence hypothesis equals $\sum_{i=1}^n P(B_i) + P(B_{n+1}) = \sum_{i=1}^{n+1} P(B_i)$.
Analysis (uniqueness). Suppose $P$ is a probability on $\Omega$ such that $P(\{\omega\}) = p_\omega$ for every $\omega$. If $A = \emptyset$, then $P(A) = 0 = \sum_{\omega \in \emptyset} p_\omega$ (empty sum convention). If $A \ne \emptyset$, the elementary events $\{\omega\}$ for $\omega \in A$ are pairwise disjoint, and $A = \bigsqcup_{\omega \in A} \{\omega\}$, so by finite additivity (the mini-lemma above, with $n = |A|$), $$ P(A) \ = \ \sum_{\omega \in A} P(\{\omega\}) \ = \ \sum_{\omega \in A} p_\omega. $$ In both cases $P$ is forced on every event, which proves uniqueness.
Synthesis (existence). Define $P : \mathcal{P}(\Omega) \to \mathbb{R}$ by $P(A) = \sum_{\omega \in A} p_\omega$. We check that $P$ is a probability:
- $P$ takes values in $[0, 1]$: each $p_\omega \in [0, 1]$, so $P(A) = \sum_{\omega \in A} p_\omega \ge 0$, and $P(A) \le \sum_{\omega \in \Omega} p_\omega = 1$.
- $P(\Omega) = \sum_{\omega \in \Omega} p_\omega = 1$ by definition of a distribution.
- Additivity: for $A \cap B = \emptyset$, the index sets are disjoint, so $$ P(A \cup B) \ = \ \sum_{\omega \in A \cup B} p_\omega \ = \ \sum_{\omega \in A} p_\omega + \sum_{\omega \in B} p_\omega \ = \ P(A) + P(B). $$
- Restriction on elementary events: $P(\{\omega\}) = p_\omega$ by definition.

The converse statement (every probability defines a distribution) is immediate: $P(\{\omega\}) \in [0, 1]$ since $P$ takes values in $[0, 1]$, and $\sum_\omega P(\{\omega\}) = P(\Omega) = 1$ by finite additivity (mini-lemma above) over the partition of $\Omega$ into singletons.

Method — Defining a probability by its distribution

To define a probability $P$ on a finite universe $\Omega$ in a concrete problem, give the distribution $\big(P(\{\omega\})\big)_{\omega \in \Omega}$ on the elementary events. Once the values $p_\omega$ are fixed (and verified to be non-negative and to sum to $1$), the probability of any event $A$ is $$ P(A) \ = \ \sum_{\omega \in A} P(\{\omega\}). $$ This is the working recipe whenever the universe is small or the distribution has an explicit formula.

Example

Uniform distribution on a die. For $\Omega = \llbracket 1, 6 \rrbracket$ and $p_k = 1/6$ for each $k$, the probability of « obtaining an even result » is $$ P(\{2, 4, 6\}) \ = \ \frac{1}{6} + \frac{1}{6} + \frac{1}{6} \ = \ \frac{1}{2}. $$

Example

Non-uniform distribution: the unbalanced die. Suppose a die is weighted so that the face $1$ comes up with probability $1/4$, and each of the other five faces with probability $3/20$. Verifying that this is a valid distribution: $$ \frac{1}{4} + 5 \cdot \frac{3}{20} \ = \ \frac{5}{20} + \frac{15}{20} \ = \ 1. $$ The probability of « obtaining an even result » is then $$ P(\{2, 4, 6\}) \ = \ \frac{3}{20} + \frac{3}{20} + \frac{3}{20} \ = \ \frac{9}{20}, $$ slightly less than $1/2$ --- the bias toward $1$ shifts probability mass away from the even faces.

Skills to practice

Defining a probability by its distribution

I.6 Uniform probability

The most common case --- and the one where the bridge to Counting is most direct --- is when all elementary events are equiprobable. The characterization theorem then says that the only valid distribution is the constant one, $p_\omega = 1/|\Omega|$, and the resulting probability is the "favourable outcomes over possible outcomes" formula $P(A) = |A|/|\Omega|$. Every uniform-probability calculation is a counting calculation.

Proposition — Existence and uniqueness of the uniform probability

Let $\Omega$ be a finite non-empty universe. The constant distribution $$ p_\omega \ = \ \frac{1}{|\Omega|} \quad \text{for every } \omega \in \Omega $$ defines a unique probability $P$ on $\Omega$, given on every event $A \subseteq \Omega$ by $$ \textcolor{colorprop}{P(A) \ = \ \frac{|A|}{|\Omega|}.} $$

Proof

The constant family $p_\omega = 1/|\Omega|$ is a distribution: each value lies in $[0, 1]$, and the sum is $\sum_{\omega \in \Omega} 1/|\Omega| = |\Omega| \cdot 1/|\Omega| = 1$. By the characterization theorem, there is a unique probability $P$ on $\Omega$ with this distribution, and its value on an event $A$ is $$ P(A) \ = \ \sum_{\omega \in A} \frac{1}{|\Omega|} \ = \ \frac{|A|}{|\Omega|}. $$

Definition — Uniform probability

The probability of the previous Proposition is called the uniform probability on $\Omega$. The elementary events of $\Omega$ are then equiprobable: $P(\{\omega\}) = 1/|\Omega|$ for every $\omega \in \Omega$. More generally, any two events of the same cardinal have the same probability, since $P(A) = |A|/|\Omega|$ depends only on $|A|$.

Method — Uniform $\Rightarrow$ count

On a uniform probability space, every probability calculation reduces to a counting calculation: $$ P(A) \ = \ \frac{|\,\text{favourable outcomes}\,|}{|\,\text{possible outcomes}\,|} \ = \ \frac{|A|}{|\Omega|}. $$ Invoke the toolbox of Counting by name: $p$-permutations ($\frac{n!}{(n-p)!}$), combinations ($\binom{n}{p}$), $n^p$ for ordered selections with repetition. The art is in choosing the universe $\Omega$ so that the counting on the numerator and the denominator is straightforward.

Example

Balanced die. On $\Omega = \llbracket 1, 6 \rrbracket$ uniform, the probability of « obtaining an even result » is $$ P(\{2, 4, 6\}) \ = \ \frac{|\{2, 4, 6\}|}{|\Omega|} \ = \ \frac{3}{6} \ = \ \frac{1}{2}. $$ A direct count, no further work needed.

Example

Five cards drawn simultaneously from a deck of $52$. Take $\Omega = \mathcal{P}_5(\text{deck})$, the set of $5$-element subsets of the deck, with the uniform probability. By Counting, $|\Omega| = \binom{52}{5}$. The event « all five cards are clubs » corresponds to choosing $5$ cards among the $13$ clubs, of cardinal $\binom{13}{5}$. Hence $$ P(\text{all clubs}) \ = \ \frac{\binom{13}{5}}{\binom{52}{5}} \ = \ \frac{1287}{2598960} \ \approx \ 4{.}95 \times 10^{-4}. $$ The event « exactly one club » has cardinal $\binom{13}{1} \binom{39}{4}$ (one club among $13$, four non-clubs among $39$), giving $$ P(\text{exactly one club}) \ = \ \frac{\binom{13}{1}\binom{39}{4}}{\binom{52}{5}}. $$ Both calculations are pure counting --- the probability layer adds no new ideas.

Skills to practice

Computing on a uniform space (counting and probability)

I.7 Properties of probabilities

Once a probability is in place, the standard algebraic identities --- value on the empty event, complement, monotonicity, formula for $P(A \cup B)$, sub-additivity --- follow from additivity by short set-theoretic arguments. None of them is conceptually new; what they provide is a fluent toolkit so that the reader does not have to re-prove them every time they show up. The general inclusion-exclusion formula for $n \ge 3$ events is out of program; we mention the $n = 3$ case as a Remark and leave the general statement to a guided exercise.

Theorem — Properties of probabilities

Let $(\Omega, P)$ be a finite probability space. For all events $A, B, A_1, \ldots, A_n$ of $\Omega$:

[(i)] $\textcolor{colorprop}{P(\emptyset) \ = \ 0.}$
[(ii)] Complement and difference. $\textcolor{colorprop}{P(\overline{A}) \ = \ 1 - P(A)}$ \ and \ $\textcolor{colorprop}{P(A \setminus B) \ = \ P(A) - P(A \cap B).}$
[(iii)] Monotonicity. If $A \subseteq B$, then $\textcolor{colorprop}{P(A) \ \le \ P(B).}$
[(iv)] Union of two events. $\textcolor{colorprop}{P(A \cup B) \ = \ P(A) + P(B) - P(A \cap B).}$
[(v)] Finite additivity. If $A_1, \ldots, A_n$ are pairwise incompatible, $\textcolor{colorprop}{P\bigl(\bigsqcup_{i=1}^n A_i\bigr) \ = \ \sum_{i=1}^n P(A_i).}$
[(vi)] Sub-additivity. For any $A_1, \ldots, A_n$, $\textcolor{colorprop}{P\bigl(\bigcup_{i=1}^n A_i\bigr) \ \le \ \sum_{i=1}^n P(A_i).}$

Proof

[(i)] $\Omega$ and $\emptyset$ are incompatible, and $\Omega \cup \emptyset = \Omega$. By additivity, $P(\Omega) = P(\Omega) + P(\emptyset)$, so $P(\emptyset) = 0$.
[(ii)] $A$ and $\overline{A}$ are incompatible and partition $\Omega$, so by additivity $P(A) + P(\overline{A}) = P(\Omega) = 1$, giving $P(\overline{A}) = 1 - P(A)$. For the difference, $A = (A \cap B) \sqcup (A \setminus B)$ is a disjoint union, so $P(A) = P(A \cap B) + P(A \setminus B)$.
[(iii)] If $A \subseteq B$, then $B = A \sqcup (B \setminus A)$ is a disjoint union, so $P(B) = P(A) + P(B \setminus A) \ge P(A)$ since $P(B \setminus A) \ge 0$.
[(iv)] Write $A \cup B = A \sqcup (B \setminus A)$ as a disjoint union. By additivity and (ii), $$ \begin{aligned} P(A \cup B) \ &= \ P(A) + P(B \setminus A) && \text{(additivity)}\\ &= \ P(A) + P(B) - P(A \cap B) && \text{(by (ii))}. \end{aligned} $$
[(v)] By induction on $n$. The case $n = 2$ is the additivity axiom. Suppose the formula holds for $n$, and let $A_1, \ldots, A_{n+1}$ be pairwise incompatible. Then $\bigsqcup_{i=1}^n A_i$ and $A_{n+1}$ are incompatible (their intersection is empty by hypothesis), so by additivity $$ P\Bigl(\bigsqcup_{i=1}^{n+1} A_i\Bigr) \ = \ P\Bigl(\bigsqcup_{i=1}^n A_i\Bigr) + P(A_{n+1}) \ = \ \sum_{i=1}^n P(A_i) + P(A_{n+1}) \ = \ \sum_{i=1}^{n+1} P(A_i). $$
[(vi)] By induction on $n$. For $n = 1$ the inequality is an equality. For $n = 2$, by (iv), $P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2) \le P(A_1) + P(A_2)$ since $P(A_1 \cap A_2) \ge 0$. Suppose the inequality holds for $n$. Then by (iv) and the recurrence hypothesis, $$ \begin{aligned} P\Bigl(\bigcup_{i=1}^{n+1} A_i\Bigr) \ &\le \ P\Bigl(\bigcup_{i=1}^n A_i\Bigr) + P(A_{n+1}) && \text{(case $n = 2$)}\\ &\le \ \sum_{i=1}^n P(A_i) + P(A_{n+1}) && \text{(recurrence)}\\ &= \ \sum_{i=1}^{n+1} P(A_i). \end{aligned} $$

Inclusion-exclusion for $n = 3$. Applying the union formula (iv) twice gives the three-event analogue: $$ \begin{aligned} P(A \cup B \cup C) \ = \ &P(A) + P(B) + P(C) \\ &- P(A \cap B) - P(B \cap C) - P(A \cap C) \\ &+ P(A \cap B \cap C). \end{aligned} $$ There exists a general formula for $n$ events --- the inclusion-exclusion formula --- but it is outside the syllabus at this level. The general statement is therefore not stated here; a guided indicator-based proof is offered as an optional exercise for those interested.

Method — Computing $P(A \cup B)$ and $P(\overline{A})$

When asked to compute $P(A \cup B)$, $P(\overline A)$, or $P(A \setminus B)$, never re-derive the formula from additivity --- pick the right item of the Properties theorem according to the shape of the question:

« At least one of $A$, $B$ » $= A \cup B$. Use item (iv): $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. The intersection $P(A \cap B)$ must be known or computable; in a uniform space, all terms are counts.
« $A$ does not occur » $= \overline{A}$. Use item (ii): $P(\overline A) = 1 - P(A)$. Particularly useful when $A$ is itself an "at least one" event: counting « none » ($\overline A$) is then a single complement, while counting « at least one » directly would require inclusion-exclusion.
« $A$ but not $B$ » $= A \setminus B$. Use item (ii): $P(A \setminus B) = P(A) - P(A \cap B)$.
Bound on a union of $n$ events. Use item (vi): $P(\bigcup_{i=1}^n A_i) \le \sum_{i=1}^n P(A_i)$. Useful as a quick upper bound when the $A_i$ are not pairwise disjoint and the joint probabilities are hard to compute.

The « complement trick » in item~2 is the single most reused move in the chapter; spot it whenever the original event is naturally phrased as « at least one », « at least once », « non-empty ».

Example

Sport, music, both. In a class of $30$ students, $20$ play sport and $15$ play music; among them, $10$ play both. A student is chosen uniformly at random. The probability that the chosen student plays sport or music is, by item (iv) of the Properties: $$ P(\text{sport} \cup \text{music}) \ = \ \frac{20}{30} + \frac{15}{30} - \frac{10}{30} \ = \ \frac{25}{30} \ = \ \frac{5}{6}. $$ The complement « plays neither » has probability $1 - 5/6 = 1/6$.

Skills to practice

Computing probabilities of unions$\virgule$ complements$\virgule$ differences

II Conditional probability

The probabilities computed so far are unconditional: $P(A)$ measures the chance of $A$ in the absence of any further information about the experiment. In real situations one often acquires partial information --- « we know that $B$ has occurred » --- and the relevant question becomes the chance of $A$ given $B$. The conditional probability $P_B(A) = P(A \mid B) := P(A \cap B)/P(B)$ formalises this. The crucial fact is that $P_B$ is itself a probability on $\Omega$, so the entire algebra of the first section transports verbatim. Three named theorems organise the calculus: composed probabilities (multiplicative chain), total probability (case split), and Bayes (inversion).

II.1 Conditional probability

Knowing that $B$ has occurred restricts the universe of plausible outcomes from $\Omega$ to $B$, and renormalises the probabilities so that the new « certain event » is $B$ itself: $P_B(B) = 1$. The definition $P_B(A) = P(A \cap B)/P(B)$ is the unique additive way to do this; the technical fact is that $P_B$ is again a probability, with all the algebraic properties of the « Properties of probabilities » section.

Definition — Conditional probability

Let $(\Omega, P)$ be a finite probability space and $B \subseteq \Omega$ an event with $P(B) > 0$. For every event $A \subseteq \Omega$, the probability of $A$ conditional on $B$ (also called probability of $A$ given $B$) is the real $$ P_B(A) \ = \ P(A \mid B) \ := \ \frac{P(A \cap B)}{P(B)}. $$ The notations $P_B(A)$ and $P(A \mid B)$ are interchangeable.

Proposition — $P_B$ is a probability

With the assumptions of the previous definition, the function $P_B : \mathcal{P}(\Omega) \to [0, 1]$, $A \mapsto P_B(A)$, is a probability on $\Omega$: it satisfies $P_B(\Omega) = 1$ and is additive on incompatible events. Although $P_B$ is formally a probability on $\Omega$, it is concentrated on $B$: events disjoint from $B$ have conditional probability $0$, and $P_B(B) = 1$.

Proof

We check the two axioms of a probability.

$P_B(\Omega) = 1$. By definition, $P_B(\Omega) = P(\Omega \cap B)/P(B) = P(B)/P(B) = 1$.
Additivity. Let $A_1, A_2 \subseteq \Omega$ with $A_1 \cap A_2 = \emptyset$. Then $(A_1 \cap B) \cap (A_2 \cap B) = (A_1 \cap A_2) \cap B = \emptyset$, so $A_1 \cap B$ and $A_2 \cap B$ are incompatible. By additivity of $P$, $$ \begin{aligned} P_B(A_1 \cup A_2) \ &= \ \frac{P((A_1 \cup A_2) \cap B)}{P(B)} && \text{(definition)} \\ &= \ \frac{P((A_1 \cap B) \cup (A_2 \cap B))}{P(B)} && \text{(distributivity of $\cap$ over $\cup$)} \\ &= \ \frac{P(A_1 \cap B) + P(A_2 \cap B)}{P(B)} && \text{(additivity of $P$)} \\ &= \ P_B(A_1) + P_B(A_2). \end{aligned} $$

Finally $P_B$ takes values in $[0, 1]$: the numerator $P(A \cap B) \le P(B)$ by monotonicity (since $A \cap B \subseteq B$), so $P_B(A) \le 1$; moreover $P(A \cap B) \ge 0$ and $P(B) > 0$, so $P_B(A) \ge 0$.

Method — Working with $P_B$

Once $P(B) > 0$, the conditional probability $P_B$ is a fully-fledged probability on $\Omega$ --- the entire toolbox of the « Properties of probabilities » section transports: $P_B(\overline{A}) = 1 - P_B(A)$, $P_B(A \cup A') = P_B(A) + P_B(A') - P_B(A \cap A')$, and so on. Two endpoint consequences are worth stating: $P_B(B) = 1$ and $P_B(\overline{B}) = 0$. Knowing that $B$ has occurred, we now « live in $B$ »: the events disjoint from $B$ have conditional probability zero, and $B$ itself has conditional probability one.

Example

One card from a balanced deck. A card is drawn uniformly at random from a deck of $52$. Let $A$ = « the card is the ace of spades » and $B$ = « the card is a spade ». Without conditioning, $P(A) = 1/52$ and $P(B) = 13/52 = 1/4$. Conditional on $B$: $$ P_B(A) \ = \ \frac{P(A \cap B)}{P(B)} \ = \ \frac{1/52}{13/52} \ = \ \frac{1}{13}. $$ The interpretation is exactly what intuition would suggest: knowing that the card is a spade, the $13$ spades are equiprobable, and the ace of spades is one of them.

Skills to practice

Computing conditional probabilities

II.2 Composed probabilities formula

When an experiment unfolds in successive steps --- first draw, second draw, third draw --- the joint probability of a sequence of events factors into a product of conditional probabilities, each one conditioned on the events realized so far. The formula is a direct consequence of the definition of $P_B$, applied iteratively; the proof is a telescoping product.

Theorem — Composed probabilities formula

Let $(\Omega, P)$ be a finite probability space, $n \ge 2$, and $A_1, \ldots, A_n$ events such that every conditional probability appearing in the formula below is well-defined. This is equivalent to requiring all prefix intersections $P(A_1 \cap \cdots \cap A_k) > 0$ for $1 \le k \le n-1$; it is enough to assume the last one, $P(A_1 \cap \cdots \cap A_{n-1}) > 0$, since by monotonicity it is contained in all previous prefixes. Then $$ \textcolor{colorprop}{P(A_1 \cap A_2 \cap \cdots \cap A_n) \ = \ P(A_1) \cdot P_{A_1}(A_2) \cdot P_{A_1 \cap A_2}(A_3) \cdots P_{A_1 \cap \cdots \cap A_{n-1}}(A_n).} $$

Proof

By definition of conditional probability, for $k \in \llbracket 2, n \rrbracket$, $$ P_{A_1 \cap \cdots \cap A_{k-1}}(A_k) \ = \ \frac{P(A_1 \cap \cdots \cap A_k)}{P(A_1 \cap \cdots \cap A_{k-1})}. $$ The right-hand side of the theorem is then a telescoping product: $$ \begin{aligned} P(A_1) \cdot \prod_{k=2}^n P_{A_1 \cap \cdots \cap A_{k-1}}(A_k) \ &= \ P(A_1) \cdot \prod_{k=2}^n \frac{P(A_1 \cap \cdots \cap A_k)}{P(A_1 \cap \cdots \cap A_{k-1})} \\ &= \ P(A_1) \cdot \frac{P(A_1 \cap A_2)}{P(A_1)} \cdot \frac{P(A_1 \cap A_2 \cap A_3)}{P(A_1 \cap A_2)} \cdots \frac{P(A_1 \cap \cdots \cap A_n)}{P(A_1 \cap \cdots \cap A_{n-1})} \\ &= \ P(A_1 \cap A_2 \cap \cdots \cap A_n). \end{aligned} $$

Method — Step-by-step experiments

When an experiment splits into « step $1$, then step $2$, then \dots, step $n$ », write the joint probability as a chain of conditional probabilities, each one conditioned on the events realized at the previous steps. The formula reads naturally from left to right: « first the probability that step $1$ succeeds, then the conditional probability that step $2$ succeeds knowing that step $1$ did, \dots ». Particularly useful for sequential draws without replacement, sequential coin flips, or any context where the probability at step $k$ depends on what happened earlier.

Example

Three successive draws without replacement from an urn. An urn contains $n$ black balls and $b$ white balls, with $n \ge 2$ and $b \ge 1$. Three balls are drawn one after the other, without replacement. Compute the probability of the sequence black-white-black.

Answer

Let $K_i$ = « the $i$-th ball is black » and $W_i$ = « the $i$-th ball is white ». By the composed-probabilities formula, $$ P(K_1 \cap W_2 \cap K_3) \ = \ P(K_1) \cdot P_{K_1}(W_2) \cdot P_{K_1 \cap W_2}(K_3). $$ At step $1$, the urn has $n + b$ balls of which $n$ are black, so $P(K_1) = n/(n+b)$. Knowing that step $1$ produced a black ball, the urn now has $n - 1$ black and $b$ white, of total $n + b - 1$, so $P_{K_1}(W_2) = b/(n+b-1)$. Knowing the first two steps gave black-white, the urn has $n - 1$ black and $b - 1$ white, of total $n + b - 2$, so $P_{K_1 \cap W_2}(K_3) = (n-1)/(n+b-2)$. Multiplying: $$ P(K_1 \cap W_2 \cap K_3) \ = \ \frac{n}{n+b} \cdot \frac{b}{n+b-1} \cdot \frac{n-1}{n+b-2}. $$ For $n = 3$, $b = 2$: $\frac{3}{5} \cdot \frac{2}{4} \cdot \frac{2}{3} = \frac{12}{60} = \frac{1}{5}$.

Skills to practice

Applying the composed-probabilities formula

II.3 Total probability formula

The total-probability formula is the case-by-case reasoning made quantitative. If the universe partitions into known cases $A_1, \ldots, A_n$, the probability of any event $B$ is the weighted sum of the conditional probabilities $P_{A_i}(B)$, with weights $P(A_i)$. It is the workhorse of probability calculations: any time the experiment naturally splits into "depending on which case occurs", one applies it.

Theorem — Total probability formula

Let $(\Omega, P)$ be a finite probability space and $\{A_1, \ldots, A_n\}$ a complete system of events with $P(A_i) > 0$ for every $i$. Then for any event $B \subseteq \Omega$, $$ \textcolor{colorprop}{P(B) \ = \ \sum_{i=1}^n P(A_i) \cdot P_{A_i}(B) \ = \ \sum_{i=1}^n P(A_i \cap B).} $$

Proof

Since $\{A_1, \ldots, A_n\}$ is a complete system, $\Omega = \bigsqcup_{i=1}^n A_i$, and intersecting with $B$ gives $B = B \cap \Omega = \bigsqcup_{i=1}^n (B \cap A_i)$, where the union is disjoint because $A_i \cap A_j = \emptyset$ for $i \ne j$. By finite additivity (item (v) of the Properties theorem), $$ P(B) \ = \ \sum_{i=1}^n P(B \cap A_i). $$ By definition of conditional probability, $P(B \cap A_i) = P(A_i) \cdot P_{A_i}(B)$ (using $P(A_i) > 0$). Substituting gives the announced formula.

Zero-probability events. If some $A_i$ in the system satisfies $P(A_i) = 0$, then $P(A_i \cap B) = 0$ for every event $B$ (by monotonicity, $A_i \cap B \subseteq A_i$). The conditional probability $P_{A_i}(B)$ is then undefined, so the weighted form $$ P(B) \ = \ \sum_{i=1}^n P(A_i) \cdot P_{A_i}(B) $$ should be written only over the indices for which $P(A_i) > 0$: $$ P(B) \ = \ \sum_{\substack{1 \le i \le n \\ P(A_i) > 0}} P(A_i) \cdot P_{A_i}(B). $$ The zero-probability parts contribute nothing to the sum, although they still belong to the partition of $\Omega$ --- the system is unchanged, only the indexing of the sum is restricted.

Method — Case split

When the experiment can be analysed by saying "depending on whether the first ball is white or black, depending on whether the test is positive or negative, \dots", identify the complete system $(A_i)$ that captures the cases, compute the priors $P(A_i)$ and the conditional probabilities $P_{A_i}(B)$, and apply the formula. The total-probability formula is the workhorse for "probability tree" computations: each branch of the tree contributes one term $P(A_i) \cdot P_{A_i}(B)$ to the sum, as illustrated below.

The probability of $B$ is obtained by following the three « $A_i$-then-$B$ » branches and summing their products: $P(B) = \sum_{i=1}^3 P(A_i) \cdot P_{A_i}(B)$.

Example

Two successive draws without replacement: probability the second ball is white. Same setup as before --- urn with $n$ black and $b$ white balls --- but now we draw two balls and ask: what is the probability that the second ball is white?

Answer

The natural complete system is $\{W_1, \overline{W_1}\}$ where $W_1$ = « first ball is white ». By the total-probability formula, $$ \begin{aligned} P(W_2) \ &= \ P(W_1) \cdot P_{W_1}(W_2) + P(\overline{W_1}) \cdot P_{\overline{W_1}}(W_2) && \text{(total probability)} \\ &= \ \frac{b}{n+b} \cdot \frac{b-1}{n+b-1} + \frac{n}{n+b} \cdot \frac{b}{n+b-1} && \text{(prior + conditionals)} \\ &= \ \frac{b(b-1) + nb}{(n+b)(n+b-1)} && \text{(common denominator)} \\ &= \ \frac{b(n + b - 1)}{(n+b)(n+b-1)} && \text{(factor $b$)} \\ &= \ \frac{b}{n+b}. \end{aligned} $$ A surprising answer: $P(W_2) = b/(n+b) = P(W_1)$. The probability that the second ball is white equals the probability that the first ball is white --- a manifestation of the symmetry of the without-replacement model. The lesson: total probability is a calculation tool, but it can also reveal hidden symmetries of the model.

Skills to practice

Applying the total-probability formula

II.4 Bayes formula

Bayes' formula is the inversion trick: given the conditional probability $P_A(B)$ in one direction, recover the conditional probability $P_B(A)$ in the other. Combined with the total-probability formula in the denominator, it transforms a prior probability $P(A_j)$ and a likelihood $P_{A_j}(B)$ into a posterior probability $P_B(A_j)$. This « update of belief in light of evidence » is the basis of Bayesian inference, machine learning, medical diagnosis, and signal detection.

Theorem — Bayes' formula

Let $(\Omega, P)$ be a finite probability space.

Single-event form. For events $A, B \subseteq \Omega$ with $P(A), P(B) > 0$, $$ \textcolor{colorprop}{P_B(A) \ = \ \frac{P(A) \cdot P_A(B)}{P(B)}.} $$
Complete-system form. Let $\{A_1, \ldots, A_n\}$ be a complete system of events with $P(A_i) > 0$ for every $i$, and let $B$ be an event with $P(B) > 0$. Then for every $j \in \llbracket 1, n \rrbracket$, $$ \textcolor{colorprop}{P_B(A_j) \ = \ \frac{P(A_j) \cdot P_{A_j}(B)}{\displaystyle\sum_{i=1}^n P(A_i) \cdot P_{A_i}(B)}.} $$

Proof

Single-event form. By definition of conditional probability, $$ P_B(A) \ = \ \frac{P(A \cap B)}{P(B)} \quad \text{and} \quad P_A(B) \ = \ \frac{P(A \cap B)}{P(A)}. $$ The second gives $P(A \cap B) = P(A) \cdot P_A(B)$; substituting in the first yields the formula.
Complete-system form. Apply the single-event form with $A = A_j$: $$ P_B(A_j) \ = \ \frac{P(A_j) \cdot P_{A_j}(B)}{P(B)}. $$ The denominator $P(B)$ is computed by the total-probability formula on the complete system: $P(B) = \sum_{i=1}^n P(A_i) P_{A_i}(B)$. Substitute the total-probability denominator into the single-event Bayes form to conclude.

Method — Prior$\virgule$ likelihood$\virgule$ posterior

Bayes' formula is most useful when one knows the priors $P(A_j)$ (the « initial beliefs ») and the likelihoods $P_{A_j}(B)$ (« how likely is the observation $B$ in each scenario $A_j$? »), and one wants the posterior $P_B(A_j)$ (« given that $B$ has been observed, what is the probability of scenario $A_j$? »).
The recipe:

Identify the complete system $(A_j)$ of competing scenarios.
Write down the priors $P(A_j)$.
Write down the likelihoods $P_{A_j}(B)$.
Apply the formula --- the denominator is just the total-probability sum $P(B)$.

The most counter-intuitive feature of Bayes is that even very reliable « evidence » can yield a small posterior probability when the prior is small. The medical-test Example below makes this concrete.

Example

The blue taxis (single-event Bayes). A taxi is involved in a hit-and-run at night in a city where $85\%$ of taxis are green and $15\%$ are blue. A witness identifies the taxi as blue. The witness is reliable in the sense that, in night-time test conditions, they correctly identify the colour of a taxi $80\%$ of the time. What is the probability that the taxi was actually blue?

Answer

Let $G$ = « the taxi is green » and $B$ = « the taxi is blue » ($\{G, B\}$ is a complete system). Let $T$ = « the witness says blue ». The data: $$ P(G) = 0{.}85, \quad P(B) = 0{.}15, \quad P_G(T) = 0{.}20, \quad P_B(T) = 0{.}80. $$ By Bayes (complete-system form), $$ \begin{aligned} P_T(B) \ &= \ \frac{P(B) \cdot P_B(T)}{P(G) \cdot P_G(T) + P(B) \cdot P_B(T)} && \text{(Bayes)} \\ &= \ \frac{0{.}15 \cdot 0{.}80}{0{.}85 \cdot 0{.}20 + 0{.}15 \cdot 0{.}80} && \text{(substitute)} \\ &= \ \frac{0{.}12}{0{.}17 + 0{.}12} \ = \ \frac{0{.}12}{0{.}29} \ \approx \ 0{.}414. \end{aligned} $$ Despite the witness having $80\%$ reliability, the posterior probability that the taxi was blue is only about $41\%$ --- the prior (only $15\%$ of taxis are blue) drags the answer well below the naive $80\%$.

Example

Medical test (counter-intuitive role of the prior). A disease affects $1\%$ of a population. A diagnostic test has $99\%$ sensitivity ($P(\text{test} = + \mid \text{sick}) = 0{.}99$) and $95\%$ specificity ($P(\text{test} = - \mid \text{healthy}) = 0{.}95$). A randomly selected individual tests positive. What is the probability that they are actually sick?

Answer

Let $S$ = « sick », $\overline{S}$ = « healthy » (complete system), and $T$ = « test is positive ». The data: $$ P(S) = 0{.}01, \quad P(\overline{S}) = 0{.}99, \quad P_S(T) = 0{.}99, \quad P_{\overline{S}}(T) = 1 - 0{.}95 = 0{.}05. $$ By Bayes, $$ P_T(S) \ = \ \frac{P(S) \cdot P_S(T)}{P(S) \cdot P_S(T) + P(\overline{S}) \cdot P_{\overline{S}}(T)} \ = \ \frac{0{.}01 \cdot 0{.}99}{0{.}01 \cdot 0{.}99 + 0{.}99 \cdot 0{.}05} \ = \ \frac{0{.}0099}{0{.}0099 + 0{.}0495} \ \approx \ 0{.}167. $$ Despite a $99\%$ sensitivity, the posterior probability of being sick given a positive test is only about $17\%$. The reason: the prior $P(S) = 1\%$ is so small that even a $5\%$ false-positive rate floods the positive-test pool with healthy individuals. This is the classic warning against over-trusting a single test result without accounting for the base rate of the disease.

Example

Three urns, ball revealed by colour. Three urns $U_1, U_2, U_3$ contain respectively $\{2 \text{ white}, 1 \text{ black}\}$, $\{1 \text{ white}, 2 \text{ black}\}$, $\{1 \text{ white}, 1 \text{ black}\}$. An urn is chosen uniformly at random, then one ball is drawn from it. Given that the ball drawn is white, what is the probability that the chosen urn was $U_2$?

Answer

Let $A_i$ = « urn $U_i$ was chosen », a complete system with $P(A_i) = 1/3$ for $i \in \{1, 2, 3\}$. Let $B$ = « ball drawn is white ». The likelihoods are read directly from the urn compositions: $$ P_{A_1}(B) = 2/3, \quad P_{A_2}(B) = 1/3, \quad P_{A_3}(B) = 1/2. $$ Total probability: $P(B) = (1/3)(2/3) + (1/3)(1/3) + (1/3)(1/2) = 2/9 + 1/9 + 1/6 = 4/18 + 2/18 + 3/18 = 9/18 = 1/2$. By Bayes, $$ P_B(A_2) \ = \ \frac{P(A_2) \cdot P_{A_2}(B)}{P(B)} \ = \ \frac{(1/3) \cdot (1/3)}{1/2} \ = \ \frac{1/9}{1/2} \ = \ \frac{2}{9}. $$ The prior was $P(A_2) = 1/3 \approx 0{.}33$; the posterior is $P_B(A_2) = 2/9 \approx 0{.}22$, lower because $U_2$ is the urn least likely to produce a white ball.

Skills to practice

Applying Bayes formula

III Independence of events

Independence is the modelling assumption that lets us multiply probabilities. Two events are independent when the realization of one does not change the probability of the other --- a condition that translates to the factorisation $P(A \cap B) = P(A) P(B)$. Independence is rarely something we prove about a given experiment; it is something we postulate when modelling (« we model the two coin flips as independent », « the draws with replacement are independent »). The mutual independence of $n$ events strengthens this: every sub-family of two or more must factorise. The chapter ends with the warning that pairwise independence is strictly weaker than mutual independence --- a fact that catches every student off guard the first time.

III.1 Independence of two events

Independence of two events is the simplest case: $A$ and $B$ are independent when the joint probability factorises as the product of the marginals, $P(A \cap B) = P(A) P(B)$. When $P(B) > 0$, this is equivalent to saying that $B$ provides no information about $A$: $P_B(A) = P(A)$. The notion is symmetric in $A$ and $B$.

Definition — Independence of two events

Let $(\Omega, P)$ be a finite probability space. Two events $A, B \subseteq \Omega$ are said to be independent when $$ \textcolor{colordef}{P(A \cap B) \ = \ P(A) \cdot P(B).} $$ We write $A \perp B$ for « $A$ and $B$ are independent ».

Proposition — Independence and conditional probability

Let $A, B$ be events with $P(B) > 0$. Then $$ A \perp B \ \iff \ P_B(A) \ = \ P(A). $$ Symmetrically, when $P(A) > 0$, $A \perp B \iff P_A(B) = P(B)$.

Proof

By definition, $P_B(A) = P(A \cap B)/P(B)$. Multiplying both sides by $P(B) > 0$, the equation $P_B(A) = P(A)$ is equivalent to $P(A \cap B) = P(A) P(B)$, which is the definition of $A \perp B$.

Method — Incompatible $\ne$ independent

Distinguish carefully between two often-confused notions:

$A$ and $B$ are incompatible when $A \cap B = \emptyset$, that is when they cannot occur simultaneously.
$A$ and $B$ are independent when $P(A \cap B) = P(A) P(B)$, that is when knowing one does not change the probability of the other.

The two notions are different and, in fact, in tension: two events of strictly positive probability that are incompatible are never independent. Indeed, if $A \cap B = \emptyset$, then $P(A \cap B) = 0$, while $P(A) P(B) > 0$ if both $P(A) > 0$ and $P(B) > 0$. So « $A$ and $B$ never occur together » is a much stronger constraint than « $B$ tells us nothing about $A$ ».

Example

A balanced die: independence depends on the events. On $\Omega = \llbracket 1, 6 \rrbracket$ uniform, let $A$ = « the result is even » $= \{2, 4, 6\}$ and $B$ = « the result is $\ge 4$ » $= \{4, 5, 6\}$. Then $P(A) = P(B) = 1/2$, and $$ P(A \cap B) \ = \ P(\{4, 6\}) \ = \ \frac{2}{6} \ = \ \frac{1}{3}, \quad P(A) \cdot P(B) \ = \ \frac{1}{4}. $$ Since $1/3 \ne 1/4$, the events $A$ and $B$ are not independent. Now let $C$ = « the result is $1$ or $4$ » $= \{1, 4\}$. Then $P(C) = 1/3$ and $$ P(A \cap C) \ = \ P(\{4\}) \ = \ \frac{1}{6} \ = \ \frac{1}{2} \cdot \frac{1}{3} \ = \ P(A) \cdot P(C), $$ so $A$ and $C$ are independent. Independence is a property of the pair $(A, B)$ together with $P$, not of the events in isolation.

Example

Incompatible $\Rightarrow$ not independent. Same die. Let $A$ = « obtain $1$ or $2$ » $= \{1, 2\}$ and $B$ = « obtain $5$ or $6$ » $= \{5, 6\}$. Then $P(A) = P(B) = 1/3$, both strictly positive. The two events are incompatible: $A \cap B = \emptyset$, so $P(A \cap B) = 0$. But the product is $P(A) \cdot P(B) = 1/9 \ne 0$. So $A$ and $B$ are not independent --- learning that $B$ has occurred completely rules out $A$, the strongest possible influence one event can have on another.

Skills to practice

Verifying or refuting independence

III.2 Mutual independence of $n$ events

For three or more events, asking that every pair factorises is not enough --- one must also ask that every triple, every quadruple, \ldots, every sub-family of size $\ge 2$, factorises. This stronger condition is mutual independence, and the difference with pairwise independence is real: there exist families of events that are pairwise independent but not mutually independent. The classical counter-example on $\Omega = \llbracket 1, 4 \rrbracket$ makes this concrete.

Definition — Mutual independence

Let $(\Omega, P)$ be a finite probability space and $A_1, \ldots, A_n$ be $n$ events of $\Omega$. The family $(A_1, \ldots, A_n)$ is said to be mutually independent (or simply independent) when, for every subset $I \subseteq \llbracket 1, n \rrbracket$ with $|I| \ge 2$, $$ \textcolor{colordef}{P\Bigl(\bigcap_{i \in I} A_i\Bigr) \ = \ \prod_{i \in I} P(A_i).} $$ The cases $|I| = 0$ and $|I| = 1$ are excluded for clarity (they are tautological: the empty product is equal to $1$, and a single $P(A_i)$ on each side).

Proposition — Mutual independence implies pairwise independence; the converse is false

If $(A_1, \ldots, A_n)$ is mutually independent, then $A_i$ and $A_j$ are independent for all $i \ne j$ (« pairwise independent »).
The converse is false: there exist three events $A, B, C$ that are pairwise independent but not mutually independent.

Proof

The first assertion is immediate: take $I = \{i, j\}$ in the definition of mutual independence, with $|I| = 2$, to get $P(A_i \cap A_j) = P(A_i) P(A_j)$.
For the converse, we exhibit the classical counter-example. Let $\Omega = \llbracket 1, 4 \rrbracket$ with the uniform probability ($P(\{k\}) = 1/4$ for each $k$). Define $$ A = \{1, 2\}, \qquad B = \{1, 3\}, \qquad C = \{2, 3\}. $$ Each event has cardinal $2$, so $P(A) = P(B) = P(C) = 1/2$. The pairwise intersections each have cardinal $1$: $$ A \cap B = \{1\}, \qquad A \cap C = \{2\}, \qquad B \cap C = \{3\}, $$ giving $P(A \cap B) = P(A \cap C) = P(B \cap C) = 1/4 = (1/2)(1/2)$. So the three events are pairwise independent.
But $A \cap B \cap C = \emptyset$, so $P(A \cap B \cap C) = 0$, while $P(A) P(B) P(C) = (1/2)^3 = 1/8 \ne 0$. The triple does not factorise; the family is not mutually independent.

Method — « Mutual » $\ne$ « pairwise »

When checking the independence of three or more events, verifying every pair is not enough: one must verify the factorisation on every subset $I \subseteq \llbracket 1, n \rrbracket$ with $|I| \ge 2$. When in doubt, compute every sub-intersection and compare with the corresponding product.
For three events, mutual independence is equivalent to the three pairwise factorisations together with the triple factorisation. For $n \ge 4$, one must check every sub-family of size at least $2$: pairs, triples, \dots, up to the full $n$-fold intersection. The total number of conditions is $\sum_{k=2}^n \binom{n}{k} = 2^n - n - 1$.
In modelling practice, mutual independence is usually postulated (the classical models of $n$ independent coin flips, $n$ independent draws with replacement, etc.) rather than verified by hand.

Example

Three independent coin flips: a positive case of mutual independence. A balanced coin is flipped three times. We model this by $\Omega = \{\mathrm{H}, \mathrm{T}\}^3$ uniform, of cardinal $8$. For $i \in \{1, 2, 3\}$, let $A_i$ = « the $i$-th flip is heads »; each $A_i$ has cardinal $4$ (the $i$-th coordinate is fixed at $\mathrm{H}$, the other two are free), so $P(A_i) = 4/8 = 1/2$. Check that $(A_1, A_2, A_3)$ is mutually independent.

Answer

We must check the factorisation on every sub-family of size $\ge 2$. There are $\binom{3}{2} = 3$ pairs and $\binom{3}{3} = 1$ triple --- four conditions to verify.

Pairwise factorisations. For any $i \ne j$, the event $A_i \cap A_j$ corresponds to fixing two coordinates at $\mathrm{H}$ and leaving one free, of cardinal $2$, so $P(A_i \cap A_j) = 2/8 = 1/4 = (1/2)(1/2) = P(A_i) P(A_j)$. The three pairs $(A_1, A_2), (A_1, A_3), (A_2, A_3)$ all factorise.
Triple factorisation. The event $A_1 \cap A_2 \cap A_3$ fixes all three coordinates at $\mathrm{H}$, hence corresponds to the single outcome $\mathrm{HHH}$ and has cardinal $1$. So $P(A_1 \cap A_2 \cap A_3) = 1/8 = (1/2)^3 = P(A_1) P(A_2) P(A_3)$. The triple factorises.

All four conditions hold, so $(A_1, A_2, A_3)$ is mutually independent. The same argument generalises immediately to $n$ independent flips: every sub-intersection $\bigcap_{i \in I} A_i$ has cardinal $2^{n - |I|}$, and the factorisation $1/2^{|I|} = (1/2)^{|I|}$ is automatic. Contrast with the previous counter-example on $\Omega = \llbracket 1, 4 \rrbracket$: the pairwise factorisations there hold, but the triple one fails --- pairwise independence does not imply mutual.

Skills to practice

Verifying mutual independence

III.3 Independence and complements

The last technical fact of the chapter: replacing some of the events in a mutually independent family by their complements preserves independence. In particular, if $A_1, \ldots, A_n$ are mutually independent, then $\overline{A_1}, \overline{A_2}, A_3, \ldots, A_n$ are also mutually independent, and so are $\overline{A_1}, \ldots, \overline{A_n}$. The proof is a direct calculation in the case $n = 2$ (the only one explicitly required by the program); the general statement is admitted as a combinatorial bookkeeping result.

Proposition — Independence and complements

Let $(\Omega, P)$ be a finite probability space.

Case $n = 2$. If $A$ and $B$ are independent, then so are $\overline{A}$ and $B$, $A$ and $\overline{B}$, and $\overline{A}$ and $\overline{B}$.
General case (admitted). If $A_1, \ldots, A_n$ are mutually independent, then for any choice $A_i' \in \{A_i, \overline{A_i}\}$ for $i \in \llbracket 1, n \rrbracket$, the family $(A_1', \ldots, A_n')$ is also mutually independent.

Proof

We prove the case $n = 2$. Let $A, B$ be independent, that is $P(A \cap B) = P(A) P(B)$. Then $$ \begin{aligned} P(\overline{A} \cap B) \ &= \ P(B \setminus (A \cap B)) && \text{(since $\overline{A} \cap B = B \setminus (A \cap B)$)} \\ &= \ P(B) - P(A \cap B) && \text{(item (ii) of Properties)} \\ &= \ P(B) - P(A) P(B) && \text{(independence of $A, B$)} \\ &= \ (1 - P(A)) P(B) \\ &= \ P(\overline{A}) P(B) && \text{(item (ii) of Properties)}. \end{aligned} $$ So $\overline{A}$ and $B$ are independent. Symmetrically (exchange the roles of $A$ and $B$), $A$ and $\overline{B}$ are independent. Applying the result twice, $\overline{A}$ and $\overline{B}$ are also independent.
Sketch of the general case (admitted). Let $J \subseteq \llbracket 1, n \rrbracket$ be the set of indices to complement, with $|J| = k$. We argue by induction on $k$. The case $k = 0$ is the hypothesis. Assuming the family complemented at $k$ chosen indices is mutually independent, pick a $(k+1)$-th index $j \notin J$; for any sub-family $I \subseteq \llbracket 1, n \rrbracket$ with $|I| \ge 2$, the factorisation of $\bigcap_{i \in I} A_i'$ either does not involve $j$ (and is unaffected) or involves $j$ once, in which case --- by mutual independence of the current family, $A_j$ is independent of the intersection of the other events indexed by $I$ --- the calculation of the case $n = 2$ done above on $A_j$ alone (with that intersection playing the role of $B$) replaces $P(A_j)$ by $1 - P(A_j) = P(\overline{A_j})$, preserving the product structure. Hence at rank $k + 1$ the family is still mutually independent. The full inductive proof is admitted; only the rank-$0$-to-$1$ step (the case $n = 2$ above) is required by the program.

Example

$n$ independent coin flips, events only. We model $n$ independent flips of a balanced coin by the uniform universe $\Omega = \{\mathrm{H}, \mathrm{T}\}^n$ (cardinal $2^n$). For $i \in \llbracket 1, n \rrbracket$, let $A_i = \{$ the $i$-th flip shows heads $\}$, an event of cardinal $2^{n-1}$ (the $i$-th coordinate is fixed at $\mathrm{H}$, the others are free), so $P(A_i) = 2^{n-1}/2^n = 1/2$. The events $(A_1, \ldots, A_n)$ are mutually independent: for any $I \subseteq \llbracket 1, n \rrbracket$ with $|I| \ge 2$, the intersection $\bigcap_{i \in I} A_i$ corresponds to fixing the $|I|$ coordinates indexed by $I$ at $\mathrm{H}$, the other $n - |I|$ free, so its cardinal is $2^{n - |I|}$ and its probability is $2^{n - |I|}/2^n = 1/2^{|I|} = \prod_{i \in I} P(A_i)$.
By the Proposition, any family obtained by replacing some of the $A_i$ by $\overline{A_i}$ is also mutually independent. The probability « at least one heads » is computed by complement: $$ P\Bigl(\bigcup_{i=1}^n A_i\Bigr) \ = \ 1 - P\Bigl(\bigcap_{i=1}^n \overline{A_i}\Bigr) \ = \ 1 - \prod_{i=1}^n P(\overline{A_i}) \ = \ 1 - (1/2)^n. $$ For $n = 5$: $P(\text{at least one heads}) = 1 - 1/32 = 31/32$. Note that we have stayed strictly in the world of events associated with the flips; the independence of random variables is the content of the next chapter, Random Variables.

Skills to practice

Applying independence to complements