\( \definecolor{colordef}{RGB}{249,49,84} \definecolor{colorprop}{RGB}{18,102,241} \)
CommeUnJeu · L1 MPSI

Random variables on a finite universe

⌚ ~99 min ▢ 12 blocks ✓ 41 exercises Prerequisites : Probabilities on a finite universe
A random variable is the rigorous version of a phrase like « the number of heads in \(10\) coin flips », « the suit of the card drawn from the deck », « the sum of two dice ». It is a measurement attached to the outcome of a random experiment, and it is encoded as a function \(X : \Omega \to E\) from the universe of outcomes to some target set. Once \(X\) is in place, the events one cares about take the standard form \(\{X = x\}\) or \(\{X \in A\}\), and all their probabilities are repackaged into a single object: the law of \(X\). The chapter is the theory of this object --- how to compute it, how to compare two random variables (equality in law), how to transform it through a function \(f\), how to handle two random variables at once (joint and marginal laws), and how to recognize the standard laws that appear over and over.
The plan has four parts. The first sets up the language: what a random variable is, what its law is, what equality in law means, how the law transports through a function \(f\), and how to condition on an event \(A\). The second part introduces the three named laws --- uniform, Bernoulli, and binomial --- with the binomial getting two named theorems: the constructive derivation \(\mathcal B(n, p) =\) « sum of \(n\) independent Bernoullis \(\mathcal B(p)\) », and the stability of the binomial under independent sums. The third part handles couples \((X, Y)\): joint law, marginal laws, conditional law of \(X\) knowing \(Y = y\), then independence of two variables (three equivalent characterizations), mutual independence of \(n\) variables, and the lemme des coalitions. The fourth part is an enrichment block on two finite counting models (hypergeometric and truncated geometric) that are not named standard laws of the program but are first-class applications of the Counting chapter.
Three reflexes the reader should leave with: (i) before computing anything about \(X\), write down its law \(P_X\) on \(X(\Omega)\), ideally as a clean table or formula --- that is the « identity card » of the variable; (ii) recognize a programme-named law (uniform, Bernoulli, binomial) when it appears in disguise, and apply its formula without re-deriving it from scratch; (iii) on a couple \((X, Y)\), the joint law is the complete data, and independence is exactly the case where the joint law factorises into the product of the two marginals --- a single cell in disagreement is enough to refute it. The chapter inherits from Probabilities on a finite universe the global convention that \(\Omega\) is finite non-empty; expectation, variance, and the moments of standard laws are postponed to the next chapter, Expectation, variance, covariance.
I Random variables and laws
We turn the random-variable vocabulary glimpsed in Probabilities on a finite universe (the notation \(\{X \in A\} = X^{-1}(A)\) and the partition of \(\Omega\) associated with \(X\)) into the full theory: the law of \(X\), equality in law, the law of \(f(X)\), and conditioning on an event. The section opens the chapter and pays its main pedagogical debt: from now on, every probability question about \(X\) alone is a finite sum on \(X(\Omega)\) weighted by the law \(P_X\).
I.1 Definition of a random variable
A random variable is, intuitively, a measurement attached to the outcome of a random experiment. To make the idea formal, we encode the measurement as a function \(X : \Omega \to E\) from the universe to some target set, and we organise the events of interest under the names \(\{X = x\}\) and \(\{X \in A\}\). Throughout the chapter, \(\Omega\) is finite and non-empty (convention inherited from the previous chapter); consequently, the image \(X(\Omega)\) is also a finite set --- there are only finitely many possible values for the measurement.
Definition — Random variable
Let \((\Omega, P)\) be a finite probability space and \(E\) a non-empty set. A random variable on \(\Omega\) with values in \(E\) is any application \(X : \Omega \to E\). For \(A \subseteq E\), the event \(\{X \in A\}\) is the set $$ \{X \in A\} \ = \ X^{-1}(A) \ = \ \{\omega \in \Omega \mid X(\omega) \in A\} \ \subseteq \ \Omega. $$ For \(x \in E\), we write \(\{X = x\}\) for the event \(\{X \in \{x\}\} = X^{-1}(\{x\})\). When \(E \subseteq \mathbb R\), we say that \(X\) is a real-valued random variable, and we additionally use the notation \(\{X \le x\} = X^{-1}(\,]-\infty, x])\).
Definition — Image of a random variable
The image of a random variable \(X : \Omega \to E\) is the set $$ X(\Omega) \ = \ \{X(\omega) \mid \omega \in \Omega\} \ \subseteq \ E. $$ Because \(\Omega\) is finite, \(X(\Omega)\) is also a finite set: it is the set of possible values of \(X\).
The probabilistic support of \(X\) is the smaller set \(\operatorname{Supp}(X) = \{x \in X(\Omega) : P(X = x) > 0\}\). It is included in the image \(X(\Omega)\), with possibly strict inclusion when some outcomes \(\omega \in \Omega\) have probability zero. In most elementary examples the two coincide; in the sequel, formulas are written on \(X(\Omega)\), with zero-probability values causing no difficulty.
Proposition — Complete system associated with \(X\)
Let \(X : \Omega \to E\) be a random variable. The family \(\bigl(\{X = x\}\bigr)_{x \in X(\Omega)}\) is a partition of \(\Omega\) into non-empty events: $$ \Omega \ = \ \bigsqcup_{x \in X(\Omega)} \{X = x\}. $$ This partition is called the complete system of events associated with \(X\).

This is a restatement of the partition-by-fibers fact proved in the previous chapter Probabilities on a finite universe: every \(\omega \in \Omega\) has a unique image \(X(\omega) \in X(\Omega)\), hence belongs to exactly one \(\{X = x\}\). Each \(\{X = x\}\) for \(x \in X(\Omega)\) is non-empty by definition of the image.

Example — Sum of two dice
Two balanced dice are rolled. Take \(\Omega = \llbracket 1, 6 \rrbracket^2\), and define \(X_1, X_2 : \Omega \to \llbracket 1, 6 \rrbracket\) by \(X_1(\omega_1, \omega_2) = \omega_1\) (value of the first die) and \(X_2(\omega_1, \omega_2) = \omega_2\) (value of the second). Their sum \(S = X_1 + X_2\) is a real-valued random variable with image \(S(\Omega) = \llbracket 2, 12 \rrbracket\). The event \(\{S = 7\}\) is the set \(\{(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)\}\), of cardinal \(6\).
Example — Indicator of an event
Let \(A \subseteq \Omega\) be an event. The indicator \(\indicatrice_A : \Omega \to \{0, 1\}\) of \(A\) is the random variable defined by \(\indicatrice_A(\omega) = 1\) if \(\omega \in A\) and \(\indicatrice_A(\omega) = 0\) otherwise. Its image is \(\indicatrice_A(\Omega) = \{0, 1\}\) (assuming \(A \ne \emptyset\) and \(A \ne \Omega\)), and the events \(\{\indicatrice_A = 1\} = A\) and \(\{\indicatrice_A = 0\} = \overline A\) recover the event and its complement. The indicator is the bridge between the event language of the previous chapter and the random-variable language of the present one --- a recurring building block, in particular for the Bernoulli law (introduced below).
Skills to practice
  • Finding the law of a random variable from first principles
I.2 Law of a random variable\(\virgule\) equality in law
The law of \(X\) is the « identity card » of the random variable: it is the probability \(P_X\) on \(E\), concentrated on \(X(\Omega)\), that records for each value \(x \in X(\Omega)\) the probability that \(X\) equals \(x\). The fundamental theorem of the section states that \(P_X\) is indeed a probability and that it is entirely determined by the family \(\bigl(P(X = x)\bigr)_{x \in X(\Omega)}\) --- specifying the law amounts to specifying that family. Two random variables, possibly defined on different universes, are said to have the same law (notation \(X \sim Y\)) when their laws agree as probability measures. This is a strictly weaker notion than pointwise equality, but it is the right notion for almost every result that depends on \(X\) only through its distribution.
Definition — Law of a random variable
Let \(X : \Omega \to E\) be a random variable on the finite probability space \((\Omega, P)\). The law (or distribution) of \(X\) is the probability \(P_X\) on \(E\) defined for every \(B \subseteq E\) by $$ P_X(B) \ = \ P(X \in B) \ = \ P\bigl(X^{-1}(B)\bigr). $$ Because \(\Omega\) is finite, \(P_X\) is concentrated on the finite set \(X(\Omega)\): for every \(B \subseteq E\), \(P_X(B) = \sum_{x \in B \cap X(\Omega)} P(X = x)\), and in particular \(P_X(\{x\}) = P(X = x)\) for every \(x \in X(\Omega)\).
Theorem — Law of a random variable
Let \(X : \Omega \to E\) be a random variable on the finite probability space \((\Omega, P)\).
  • [(i)] \(P_X\) is a probability on \(E\), concentrated on the finite set \(X(\Omega)\).
  • [(ii)] The law \(P_X\) is entirely determined by the family \(\bigl(P(X = x)\bigr)_{x \in X(\Omega)}\), which is a distribution on \(X(\Omega)\) --- a family of elements of \([0, 1]\) summing to \(1\). Explicitly, for every \(B \subseteq E\), $$ P_X(B) \ = \ \sum_{x \in B \cap X(\Omega)} P(X = x). $$

(i) We have \(P_X(E) = P(X \in E) = P(\Omega) = 1\). For two disjoint subsets \(B_1, B_2 \subseteq E\), the events \(X^{-1}(B_1)\) and \(X^{-1}(B_2)\) are also disjoint, hence by additivity of \(P\), $$ P_X(B_1 \sqcup B_2) \ = \ P(X^{-1}(B_1 \sqcup B_2)) \ = \ P(X^{-1}(B_1) \sqcup X^{-1}(B_2)) \ = \ P_X(B_1) + P_X(B_2). $$ So \(P_X\) is a probability on \(E\). Concentration on \(X(\Omega)\) comes from \(P_X(E \setminus X(\Omega)) = P(X \notin X(\Omega)) = P(\emptyset) = 0\).
(ii) The family \(\bigl(P(X = x)\bigr)_{x \in X(\Omega)}\) takes values in \([0, 1]\) and, by (i) and the partition of \(\Omega\) associated with \(X\) (Proposition above), satisfies $$ \sum_{x \in X(\Omega)} P(X = x) \ = \ \sum_{x \in X(\Omega)} P(\{X = x\}) \ = \ P(\Omega) \ = \ 1. $$ It is therefore a distribution on the finite set \(X(\Omega)\). By the characterization theorem for finite probabilities (chap. Probabilities on a finite universe, theorem stating that a probability is determined by its values on singletons), this distribution determines a unique probability on \(X(\Omega)\), given for \(B \subseteq X(\Omega)\) by \(\sum_{x \in B} P(X = x)\), which extends to \(E\) by zero outside \(X(\Omega)\). This probability coincides with \(P_X\) on every singleton \(\{x\}\), hence on every \(B \subseteq E\).

Definition — Equality in law
Let \(X : \Omega \to E\) and \(Y : \Omega' \to E\) be two random variables with values in the same target set \(E\), defined on possibly different probability spaces \((\Omega, P)\) and \((\Omega', P')\). We say that \(X\) and \(Y\) have the same law, and we write \(X \sim Y\), if their laws coincide as probabilities on \(E\): $$ \forall x \in E, \quad P(X = x) \ = \ P'(Y = x). $$
Method — Finding the law of a random variable in three steps
To find the law of a random variable \(X : \Omega \to E\):
  1. List the values. Enumerate \(X(\Omega) = \{x_1, x_2, \ldots, x_n\}\) --- the actual image of \(X\).
  2. Compute each \(P(X = x_i)\). For each \(x_i \in X(\Omega)\), identify the event \(\{X = x_i\}\) as a subset of \(\Omega\) and compute its probability (by counting if \(\Omega\) is uniform, by direct application of the model otherwise).
  3. Verify the total. Check that \(\sum_{i=1}^n P(X = x_i) = 1\). This is a built-in sanity check on the calculations of step 2.
The output is the table of the law \(\bigl(x_i, P(X = x_i)\bigr)_{i=1, \ldots, n}\) --- everything else about \(X\) alone is read off this table.
Example — Law of a balanced die
Take \(\Omega = \llbracket 1, 6 \rrbracket\) uniform (balanced die), and let \(X : \Omega \to \llbracket 1, 6 \rrbracket\) be the value of the die: \(X(\omega) = \omega\). Following the three-step method:
  1. \(X(\Omega) = \llbracket 1, 6 \rrbracket\);
  2. for each \(k \in \llbracket 1, 6 \rrbracket\), \(\{X = k\} = \{k\}\), hence \(P(X = k) = 1/6\);
  3. \(\sum_{k=1}^6 1/6 = 1\). \(\checkmark\)
The law of \(X\) is the constant family \((1/6, 1/6, \ldots, 1/6)\) on \(\llbracket 1, 6 \rrbracket\) --- this is the uniform law on \(\llbracket 1, 6 \rrbracket\), formally introduced in the uniform-law section below.
Example — Equality in law on different universes
Consider two models for « a balanced two-outcome experiment »:
  • Model 1 (coin): \(\Omega = \{H, T\}\) uniform, \(X = \indicatrice_{\{H\}}\) (so \(X = 1\) if heads, \(X = 0\) if tails).
  • Model 2 (die): \(\Omega' = \llbracket 1, 6 \rrbracket\) uniform, \(Y = \indicatrice_{\{1, 2, 3\}}\) (so \(Y = 1\) if the die shows \(1\), \(2\), or \(3\); \(Y = 0\) otherwise).
Both \(X\) and \(Y\) take values in \(\{0, 1\}\). Their laws are $$ P(X = 0) = P(X = 1) = 1/2, \qquad P(Y = 0) = P(Y = 1) = 1/2. $$ Hence \(X \sim Y\). The two variables are defined on different universes and are pointwise unrelated, but they have the same law: when only the law matters (e.g.\ when computing the probability of an event involving \(X\) alone), they are interchangeable.
Skills to practice
  • Recognising equality in law
I.3 Law of \(f(X)\)
When we apply a function \(f\) to a random variable \(X\), the composite \(f(X)\) is itself a random variable, and its law is obtained from the law of \(X\) by a preimage operation: each value \(y\) in the image of \(f(X)\) collects all the antecedents \(x\) with \(f(x) = y\), and we sum the probabilities. As a corollary, if two random variables have the same law, applying the same \(f\) to both produces two new random variables that also have the same law.
Proposition — Law of \(f(X)\)
Let \(X : \Omega \to E\) be a random variable and \(f : E \to F\) an application. Then \(f(X) := f \circ X : \Omega \to F\) is a random variable, with image \(f(X)(\Omega) = f(X(\Omega))\), and for every \(y \in F\), $$ P(f(X) = y) \ = \ \sum_{\substack{x \in X(\Omega) \\ f(x) = y}} P(X = x). $$

The events \(\{X = x\}\) for \(x \in X(\Omega)\) with \(f(x) = y\) are pairwise disjoint (sub-family of the partition associated with \(X\)), and their union is exactly \(\{f(X) = y\}\): $$ \{f(X) = y\} \ = \ \{\omega \in \Omega \mid f(X(\omega)) = y\} \ = \ \bigsqcup_{\substack{x \in X(\Omega) \\ f(x) = y}} \{X = x\}. $$ Apply additivity of \(P\) to conclude.

Proposition — Equality in law transports through a function
Let \(X\) and \(Y\) be two random variables with values in \(E\) such that \(X \sim Y\), and let \(f : E \to F\) be an application. Then \(f(X) \sim f(Y)\).

For every \(y \in F\), the event \(\{f(X) = y\}\) rewrites as \(\{X \in f^{-1}(\{y\})\}\), so $$ P(f(X) = y) \ = \ P_X\bigl(f^{-1}(\{y\})\bigr) \ = \ P_Y\bigl(f^{-1}(\{y\})\bigr) \ = \ P(f(Y) = y), $$ using \(P_X = P_Y\) in the middle equality. Hence \(f(X)\) and \(f(Y)\) have the same law.

Method — Computing the law of \(f(X)\)
To find the law of \(f(X)\) from the law of \(X\):
  1. Image. Compute \(f(X)(\Omega) = f(X(\Omega)) = \{f(x) \mid x \in X(\Omega)\}\).
  2. Group by output. For each \(y \in f(X)(\Omega)\), list the antecedents \(\{x \in X(\Omega) \mid f(x) = y\}\).
  3. Sum. Compute \(P(f(X) = y) = \sum_{f(x) = y} P(X = x)\).
  4. Check. Verify \(\sum_y P(f(X) = y) = 1\).
Example — Squared centred die
\(X\) is the value of a balanced die (so \(X \sim \mathcal U(\llbracket 1, 6 \rrbracket)\)), and we set \(Y = (X - 3)^2\). Compute the law of \(Y\).

With \(f(k) = (k - 3)^2\), the values of \(f\) on \(\llbracket 1, 6 \rrbracket\) are \(f(1) = 4, f(2) = 1, f(3) = 0, f(4) = 1, f(5) = 4, f(6) = 9\). Hence \(Y(\Omega) = \{0, 1, 4, 9\}\), and grouping antecedents: $$ \begin{aligned} P(Y = 0) \ &= \ P(X = 3) \ = \ 1/6, \\ P(Y = 1) \ &= \ P(X = 2) + P(X = 4) \ = \ 2/6 \ = \ 1/3, \\ P(Y = 4) \ &= \ P(X = 1) + P(X = 5) \ = \ 2/6 \ = \ 1/3, \\ P(Y = 9) \ &= \ P(X = 6) \ = \ 1/6. \end{aligned} $$ Check: \(1/6 + 1/3 + 1/3 + 1/6 = 1\). \(\checkmark\)

Skills to practice
  • Computing the law of \(f(X)\)
I.4 Conditional law of \(X\) knowing an event
Conditioning on an event \(A\) (with \(P(A) > 0\)) was introduced in the previous chapter (Probabilities on a finite universe, conditional-probability definition) as a new probability \(P_A\) on \(\Omega\). Applied to a random variable \(X\), this gives the conditional law of \(X\) knowing \(A\): it is simply the law of \(X\) under the new probability \(P_A\). The whole theory of laws (definition, characterization, transport by \(f\)) transports verbatim to \(P_A\). The case \(A = \{Y = y\}\) is the conditional law of \(X\) knowing \(Y = y\), which we will revisit in the joint-and-marginal-laws section in the joint-law context.
Definition — Conditional law of \(X\) knowing an event \(A\)
Let \(X : \Omega \to E\) be a random variable and \(A \subseteq \Omega\) an event with \(P(A) > 0\). The conditional law of \(X\) knowing \(A\) is the family \(\bigl(P_A(X = x)\bigr)_{x \in X(\Omega)}\) defined by $$ P_A(X = x) \ = \ P(X = x \mid A) \ = \ \frac{P(\{X = x\} \cap A)}{P(A)}. $$
Proposition — The conditional law is a probability
With the notation above, the family \(\bigl(P_A(X = x)\bigr)_{x \in X(\Omega)}\) is a distribution on \(X(\Omega)\), and the associated probability on \(X(\Omega)\) is exactly the law of \(X\) under the conditional probability \(P_A\). In particular, $$ \sum_{x \in X(\Omega)} P_A(X = x) \ = \ 1. $$

\(P_A\) is a probability on \(\Omega\) (chapter Probabilities on a finite universe, conditional-probability definition), so \(X\) admits a law under \(P_A\), given by \((P_A(X = x))_{x \in X(\Omega)}\). By the Theorem on the law of a random variable (proved earlier in this chapter), this is a distribution on \(X(\Omega)\) and sums to \(1\).

Method — Conditioning a random variable on an event
To compute \(P_A(X = x)\):
  1. Identify the event \(\{X = x\} \cap A\) as a subset of \(\Omega\).
  2. Compute its probability \(P(\{X = x\} \cap A)\) (often by counting if \(\Omega\) is uniform).
  3. Divide by \(P(A)\): \(P_A(X = x) = P(\{X = x\} \cap A) / P(A)\).
  4. Sanity check: \(\sum_x P_A(X = x) = 1\).
Once the conditional law is in hand, the entire theory of laws transports under \(P_A\) --- equality in law, law of \(f(X)\), etc.
Example — Sum of two dice given that the first is even
Two balanced dice are rolled, \(\Omega = \llbracket 1, 6 \rrbracket^2\) uniform; let \(X_1, X_2\) denote the two values and \(S = X_1 + X_2\). Let \(A = \{X_1 \text{ even}\}\). Compute the conditional law of \(S\) knowing \(A\).

\(|\Omega| = 36\). The event \(A\) corresponds to \(X_1 \in \{2, 4, 6\}\), so \(|A| = 3 \cdot 6 = 18\) and \(P(A) = 18/36 = 1/2\). Knowing \(A\), the first die is in \(\{2, 4, 6\}\) and the second is in \(\llbracket 1, 6 \rrbracket\), giving \(S \in \{3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}\). Counting the number of favourable couples for each value of \(S\): $$ \begin{array}{c|cccccccccc} s & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ \hline |\{X_1 \text{ even}, X_1 + X_2 = s\}| & 1 & 1 & 2 & 2 & 3 & 3 & 2 & 2 & 1 & 1 \\ \end{array} $$ (For \(s = 3\), only \((2, 1)\); for \(s = 7\), \((2, 5), (4, 3), (6, 1)\); etc.) Each cell has probability \(1/36\) under \(P\), so \(P(\{S = s\} \cap A) = (\text{count})/36\), and dividing by \(P(A) = 1/2\): $$ \begin{array}{c|cccccccccc} s & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ \hline P_A(S = s) & 1/18 & 1/18 & 1/9 & 1/9 & 1/6 & 1/6 & 1/9 & 1/9 & 1/18 & 1/18 \\ \end{array} $$ Check: \(4 \cdot 1/18 + 4 \cdot 1/9 + 2 \cdot 1/6 = 2/9 + 4/9 + 3/9 = 9/9 = 1\). \(\checkmark\) The conditional law is supported on \(\llbracket 3, 12 \rrbracket\) (values \(2\) unreachable when \(X_1\) is even) and is symmetric around \(7{,}5\) but with different weights from the unconditional law of \(S\).

Skills to practice
  • Conditioning by an event
II Standard laws in the programme
Three named laws of the program: uniform, Bernoulli, and binomial. Each gets a Definition and a Method (« when to use it »); the binomial gets two named theorems --- a constructive derivation as a sum of \(n\) independent Bernoullis (which explains why \(\binom{n}{k}\) appears in the formula) and a stability theorem \(\mathcal B(n, p) + \mathcal B(m, p) = \mathcal B(n + m, p)\) for independent sums. Two further finite counting models (hypergeometric, truncated geometric) are not named in the program; we treat them in the enrichment section at the end of the chapter as applications of Counting.
II.1 Uniform law
The uniform law is the case where \(X\) takes every value of its image with the same probability. It is the natural model whenever the experiment has « no preferred outcome »: a balanced die, a card drawn at random from a shuffled deck, a uniformly chosen element of a finite list.
Definition — Uniform law on a finite set
Let \(E\) be a non-empty finite set. A random variable \(X\) follows the uniform law on \(E\), written \(X \sim \mathcal U(E)\), if \(X(\Omega) = E\) and $$ \forall x \in E, \quad P(X = x) \ = \ \frac{1}{|E|}. $$ The most common case in practice is \(E = \llbracket a, b \rrbracket\), for which \(P(X = k) = 1/(b - a + 1)\) for every \(k \in \llbracket a, b \rrbracket\).
Method — When to model with a uniform law
Use \(\mathcal U(E)\) to model:
  • the value of a balanced die (\(E = \llbracket 1, 6 \rrbracket\));
  • the choice of one element from a finite set with no preferred element (random card from a deck, random ticket in a hat, uniform integer in \(\llbracket 1, n \rrbracket\));
  • any « symmetric » experiment where all outcomes have the same physical role.
On a uniform space, every probability calculation reduces to counting --- this is the bridge to the chapter Counting.
Example — Balanced die
A balanced six-sided die: \(X = \) value of the die. Then \(X \sim \mathcal U(\llbracket 1, 6 \rrbracket)\) and \(P(X = k) = 1/6\) for every \(k \in \llbracket 1, 6 \rrbracket\).
Example — Rademacher law
A balanced coin gives heads or tails. Encode heads as \(+1\) and tails as \(-1\): \(X \in \{-1, +1\}\) with \(P(X = +1) = P(X = -1) = 1/2\). Then \(X \sim \mathcal U(\{-1, +1\})\). This special uniform law on \(\{-1, +1\}\) is often called the Rademacher law (named after Hans Rademacher, mathematician); it is the symmetric \(\pm 1\) random variable used to model symmetric random walks. It is not a fourth named law of the programme --- it is a particular case of the uniform law.
Skills to practice
  • Recognising a standard law
II.2 Bernoulli law
The Bernoulli law is the elementary brick: a single trial with two possible outcomes (« success » coded \(1\), « failure » coded \(0\)). It is the canonical model for any binary experiment, and it is the building block from which the binomial law is constructed in the next subsection.
Definition — Bernoulli law
Let \(p \in [0, 1]\). A random variable \(X\) valued in \(\{0, 1\}\) follows the Bernoulli law of parameter \(p\), written \(X \sim \mathcal B(p)\), if $$ P(X = 1) \ = \ p, \qquad P(X = 0) \ = \ 1 - p. $$ The value \(1\) is interpreted as a success, the value \(0\) as a failure, and \(p\) as the probability of success.
For every event \(A\) of \((\Omega, P)\), the indicator \(\indicatrice_A\) is a Bernoulli random variable with parameter \(P(A)\): $$ \indicatrice_A \ \sim \ \mathcal B(P(A)). $$ This is the bridge between the event language of Probabilities on a finite universe and the variable language of the present chapter --- and the simplest source of Bernoulli variables.
Example — Card is a heart
A card is drawn uniformly at random from a \(52\)-card deck. Let \(X = 1\) if the card is a heart, \(X = 0\) otherwise. Then \(X = \indicatrice_{\{\text{heart}\}}\), hence \(X \sim \mathcal B(13/52) = \mathcal B(1/4)\).
Skills to practice
  • Identifying and computing with a Bernoulli variable
II.3 Binomial law
The binomial law \(\mathcal B(n, p)\) counts the number of successes in a sequence of \(n\) independent Bernoulli trials of the same parameter \(p\). The constructive theorem below explains why the formula \(P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\) has the binomial coefficient: there are \(\binom{n}{k}\) ways to position the \(k\) successes among the \(n\) trials, and each such positioning has probability \(p^k (1-p)^{n-k}\) by mutual independence. The stability theorem (\(X + Y \sim \mathcal B(n + m, p)\) for independent \(X, Y\) of the same \(p\)) follows by convolution + Vandermonde's identity, and reaffirms the cross-link to Counting.
Definition — Binomial law
Let \(n \in \mathbb N^*\) and \(p \in [0, 1]\). A random variable \(X\) valued in \(\llbracket 0, n \rrbracket\) follows the binomial law of parameters \((n, p)\), written \(X \sim \mathcal B(n, p)\), if $$ \forall k \in \llbracket 0, n \rrbracket, \quad P(X = k) \ = \ \binom{n}{k} \, p^k \, (1 - p)^{n - k}. $$ The integer \(n\) is the number of trials and \(p\) the probability of success at each trial. We adopt the standard convention \(0^0 = 1\) for the boundary cases \(p = 0\) and \(p = 1\): \(\mathcal B(n, 0)\) is concentrated at \(0\), and \(\mathcal B(n, 1)\) is concentrated at \(n\).
Theorem — Construction of the binomial law
Let \(n \in \mathbb N^*\), \(p \in [0, 1]\), and let \(X_1, \ldots, X_n\) be \(n\) mutually independent Bernoulli random variables of the same parameter \(p\) (the precise definition of mutual independence of random variables is given in the mutual-independence subsection below). Then $$ S_n \ := \ X_1 + X_2 + \cdots + X_n \ \sim \ \mathcal B(n, p). $$ That is, \(S_n\) is valued in \(\llbracket 0, n \rrbracket\) and \(P(S_n = k) = \binom{n}{k} p^k (1-p)^{n-k}\) for every \(k \in \llbracket 0, n \rrbracket\).

Reading note. The proof uses the singleton characterization of mutual independence, stated and proved in the mutual-independence subsection below. At first reading, one may simply accept the formula \(P(B_I) = p^k (1-p)^{n-k}\) as the meaning of independence for a fixed configuration of successes and failures.
\(S_n\) takes values in \(\llbracket 0, n \rrbracket\) since each \(X_i \in \{0, 1\}\) and there are \(n\) summands. Fix \(k \in \llbracket 0, n \rrbracket\). The event \(\{S_n = k\}\) is the disjoint union, over all subsets \(I \subseteq \llbracket 1, n \rrbracket\) of cardinal \(k\), of the events $$ B_I \ = \ \Bigl(\bigcap_{i \in I} \{X_i = 1\}\Bigr) \cap \Bigl(\bigcap_{i \notin I} \{X_i = 0\}\Bigr). $$ By the singleton characterization of mutual independence (Proposition of the mutual-independence subsection, anticipated here) applied to the values \(x_i \in \{0, 1\}\) that select between \(\{X_i = 1\}\) and \(\{X_i = 0\}\), and using the values of the Bernoulli laws, $$ P(B_I) \ = \ \prod_{i \in I} P(X_i = 1) \cdot \prod_{i \notin I} P(X_i = 0) \ = \ p^k \, (1 - p)^{n - k}. $$ There are \(\binom{n}{k}\) subsets \(I\) of cardinal \(k\), hence by additivity, $$ P(S_n = k) \ = \ \sum_{|I| = k} P(B_I) \ = \ \binom{n}{k} \, p^k \, (1 - p)^{n - k}. $$ This is the formula defining \(\mathcal B(n, p)\).

Theorem — Stability of the binomial law
Let \(n, m \in \mathbb N^*\) and \(p \in [0, 1]\). If \(X \sim \mathcal B(n, p)\) and \(Y \sim \mathcal B(m, p)\) are independent (definition in the two-variable-independence subsection below), then $$ X + Y \ \sim \ \mathcal B(n + m, p). $$

\(X + Y\) takes values in \(\llbracket 0, n + m \rrbracket\). Fix \(k \in \llbracket 0, n + m \rrbracket\). By the partition \(\{X + Y = k\} = \bigsqcup_{j=0}^{k} \{X = j, Y = k - j\}\) and the independence of \(X\) and \(Y\), $$ P(X + Y = k) \ = \ \sum_{j=0}^{k} P(X = j) P(Y = k - j), $$ where we adopt the convention \(\binom{a}{b} = 0\) for \(b < 0\) or \(b > a\), so that \(P(X = j) = 0\) when \(j > n\) and \(P(Y = k - j) = 0\) when \(k - j > m\). Inserting the binomial formulas, $$ \begin{aligned} P(X + Y = k) \ &= \ \sum_{j=0}^{k} \binom{n}{j} p^j (1 - p)^{n - j} \cdot \binom{m}{k - j} p^{k - j} (1 - p)^{m - k + j} && \text{(binomial formulas)} \\ &= \ p^k (1 - p)^{n + m - k} \sum_{j=0}^{k} \binom{n}{j} \binom{m}{k - j} && \text{(factorise)} \\ &= \ p^k (1 - p)^{n + m - k} \binom{n + m}{k} && \text{(Vandermonde)}. \end{aligned} $$ This is the formula defining \(\mathcal B(n + m, p)\).

Method — Recognising a binomial law
A random variable \(X\) follows \(\mathcal B(n, p)\) when the experiment satisfies the three conditions:
  1. \(n\) trials, identical and independent. The experiment consists of \(n\) trials, mutually independent, each with the same model.
  2. Two outcomes per trial. Each trial has two possible outcomes, success (probability \(p\)) or failure (probability \(1 - p\)), with the same \(p\) for every trial.
  3. Counting successes. \(X\) is the total number of successes among the \(n\) trials.
When the three conditions are satisfied, write \(X \sim \mathcal B(n, p)\) and apply the formula \(P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\) directly --- no need to re-derive it.
Example — Multiple-choice exam
A multiple-choice exam has \(10\) questions, each with \(4\) options of which exactly one is correct. A student answers all questions uniformly at random and independently. Let \(X = \) number of correct answers. Identify the law of \(X\) and compute \(P(X = 3)\).

The three conditions are satisfied: \(n = 10\) trials (the questions), independent and identical, each with two outcomes (correct with probability \(p = 1/4\), incorrect with probability \(3/4\)), and \(X = \) number of successes. Hence \(X \sim \mathcal B(10, 1/4)\). Apply the formula: $$ P(X = 3) \ = \ \binom{10}{3} (1/4)^3 (3/4)^7 \ = \ 120 \cdot \frac{1}{64} \cdot \frac{2187}{16384} \ \approx \ 0{,}2503. $$

Skills to practice
  • Computing with the binomial law
III Couples and independence of random variables
We pass from one variable to two (and to \(n\)). The full information about a couple \((X, Y)\) is encoded in the joint law \(P(X = x, Y = y)\), a two-dimensional table; the laws of \(X\) and \(Y\) alone (the marginals) are read off by summing rows or columns. The conditional law of \(X\) knowing \(Y = y\) is the special case \(A = \{Y = y\}\) of the conditional-law-of-a-variable definition. Independence of \(X\) and \(Y\) is the case where the joint law factorises into the product of marginals --- three equivalent characterizations are gathered in one named theorem. The section closes with mutual independence of \(n\) variables and the lemme des coalitions.
III.1 Joint and marginal laws
The joint law of \((X, Y)\) is a table indexed by \(X(\Omega) \times Y(\Omega)\) that lists every joint probability \(P(X = x, Y = y)\). The marginal laws --- the laws of \(X\) and \(Y\) separately --- are obtained by summing along rows or columns; this is the source of the name « marginal », since the marginal sums are written in the margins of the table. The conditional law of \(X\) knowing \(Y = y\) is, formally, the conditional law of \(X\) under the event \(A = \{Y = y\}\) defined earlier.
Definition — Couple\(\virgule\) joint law
Let \(X : \Omega \to E\) and \(Y : \Omega \to F\) be two random variables on the same probability space \((\Omega, P)\). The pair (or couple) \((X, Y) : \Omega \to E \times F\) is the random variable defined by \((X, Y)(\omega) = (X(\omega), Y(\omega))\). Its joint law is the family $$ \bigl(P(X = x, Y = y)\bigr)_{(x, y) \in X(\Omega) \times Y(\Omega)}, \qquad P(X = x, Y = y) := P(\{X = x\} \cap \{Y = y\}). $$
Definition — Marginal laws
The marginal laws of the couple \((X, Y)\) are the laws of \(X\) and \(Y\) taken separately --- \(P_X\) on \(X(\Omega)\) and \(P_Y\) on \(Y(\Omega)\).
Theorem — Marginal laws from the joint law
Let \((X, Y)\) be a couple with joint law \(\bigl(P(X = x, Y = y)\bigr)\). Then for every \(x \in X(\Omega)\) and every \(y \in Y(\Omega)\), $$ P(X = x) \ = \ \sum_{y \in Y(\Omega)} P(X = x, Y = y), \qquad P(Y = y) \ = \ \sum_{x \in X(\Omega)} P(X = x, Y = y). $$

The complete system associated with \(Y\) partitions \(\Omega = \bigsqcup_{y \in Y(\Omega)} \{Y = y\}\) (complete-system-associated-with-a-variable Proposition above). Intersecting with the event \(\{X = x\}\), $$ \{X = x\} \ = \ \bigsqcup_{y \in Y(\Omega)} \bigl(\{X = x\} \cap \{Y = y\}\bigr). $$ Apply additivity of \(P\) to obtain \(P(X = x) = \sum_y P(X = x, Y = y)\). The formula for \(P(Y = y)\) is symmetric.

Definition — Conditional law of \(X\) knowing \(Y \equal y\)
Let \((X, Y)\) be a couple, and let \(y \in Y(\Omega)\) with \(P(Y = y) > 0\). The conditional law of \(X\) knowing \(Y = y\) is the family \(\bigl(P(X = x \mid Y = y)\bigr)_{x \in X(\Omega)}\) defined by $$ P(X = x \mid Y = y) \ = \ \frac{P(X = x, Y = y)}{P(Y = y)}. $$ This is the special case of the conditional-law-of-a-variable definition with the event \(A = \{Y = y\}\). By the conditional-law-is-a-distribution Proposition, it is a distribution on \(X(\Omega)\).
Method — Reading marginals and conditionals from a joint table
Tabulate the joint law of \((X, Y)\) as a \(|X(\Omega)| \times |Y(\Omega)|\) table whose cell \((x, y)\) contains \(P(X = x, Y = y)\). Then:
  • the marginal law of \(X\) is read in the right margin (sums of rows);
  • the marginal law of \(Y\) is read in the bottom margin (sums of columns);
  • the conditional law of \(X\) knowing \(Y = y\) is read by taking the column \(y\) and renormalising it by its sum \(P(Y = y)\).
This is the source of the names: the marginals are literally written in the margins.
Example — Two draws without replacement
An urn contains \(3\) red, \(2\) white, and \(1\) green ball. Two balls are drawn successively without replacement. Let \(X\) = colour of the first ball (coded \(R = 1, W = 2, G = 3\)) and \(Y\) = colour of the second. The joint law of \((X, Y)\) is the table (each cell is \(P(X = x, Y = y) = (\text{count of (first, second)})/{6 \cdot 5}\), but re-expressed as a fraction over \(30\)): $$ \begin{array}{c|ccc|c} X \backslash Y & 1 (R) & 2 (W) & 3 (G) & P(X = x) \\ \hline 1 (R) & 6/30 & 6/30 & 3/30 & 15/30 = 1/2 \\ 2 (W) & 6/30 & 2/30 & 2/30 & 10/30 = 1/3 \\ 3 (G) & 3/30 & 2/30 & 0/30 & 5/30 = 1/6 \\ \hline P(Y = y) & 15/30 & 10/30 & 5/30 & 1 \end{array} $$ Detail of one cell: \(P(X = 1, Y = 1) = (3/6)(2/5) = 6/30\) (first red, then a red among the remaining \(2\)). The marginal of \(X\) in the right column matches the proportion of each colour in the urn (\(1/2, 1/3, 1/6\)). By symmetry of the model, the marginal of \(Y\) at the bottom is the same --- a non-obvious fact for « without replacement »: the colour of the second draw has the same law as the first. The cells are not products (e.g. \(P(X = 3, Y = 3) = 0 \ne (1/6)(1/6)\)), so \(X\) and \(Y\) are not independent --- which the two-variable-independence subsection will formalise.
Skills to practice
  • Computing joint and marginal laws
III.2 Independence of two random variables
Two random variables are independent when « observing one tells you nothing about the other »: this is the variable-level analogue of the event-level independence of Probabilities on a finite universe. The Theorem below gathers three equivalent characterizations: by sub-events of \(X(\Omega) \times Y(\Omega)\) (the programme definition), by factorisation of the joint law on singletons (the practical test), and by invariance of the conditional law (the « observing \(Y\) doesn't change \(X\) » reading). Stability of independence under functions follows immediately.
Definition — Independence of two random variables
Let \(X : \Omega \to E\) and \(Y : \Omega \to F\) be two random variables. We say that \(X\) and \(Y\) are independent (notation \(X \perp Y\)) if for every \(A \subseteq X(\Omega)\) and every \(B \subseteq Y(\Omega)\), $$ P(X \in A, \ Y \in B) \ = \ P(X \in A) \cdot P(Y \in B). $$
Theorem — Characterization of independence
Let \(X\) and \(Y\) be two random variables on \((\Omega, P)\). The following are equivalent:
  • [(i)] \(X \perp Y\) (independence by sub-events of \(X(\Omega), Y(\Omega)\));
  • [(ii)] for every \((x, y) \in X(\Omega) \times Y(\Omega)\), \(P(X = x, Y = y) = P(X = x) \cdot P(Y = y)\) (factorisation of the joint law on the singletons);
  • [(iii)] for every \(y \in Y(\Omega)\) with \(P(Y = y) > 0\) and every \(x \in X(\Omega)\), \(P(X = x \mid Y = y) = P(X = x)\) (the conditional law of \(X\) knowing \(Y = y\) does not depend on \(y\)).

  • (i) \(\Rightarrow\) (ii): take \(A = \{x\}\) and \(B = \{y\}\) in the definition.
  • (ii) \(\Rightarrow\) (i): for any \(A \subseteq X(\Omega), B \subseteq Y(\Omega)\), write \(\{X \in A, Y \in B\} = \bigsqcup_{(x, y) \in A \times B} \{X = x, Y = y\}\) and apply additivity, factoring out the \(P(X = x) P(Y = y)\) from (ii): $$ \begin{aligned} P(X \in A, Y \in B) \ &= \ \sum_{(x, y) \in A \times B} P(X = x, Y = y) \\ &= \ \sum_{(x, y) \in A \times B} P(X = x) P(Y = y) \\ &= \ \Bigl(\sum_{x \in A} P(X = x)\Bigr) \Bigl(\sum_{y \in B} P(Y = y)\Bigr) \\ &= \ P(X \in A) P(Y \in B). \end{aligned} $$
  • (ii) \(\Rightarrow\) (iii): for \(y\) with \(P(Y = y) > 0\), divide (ii) by \(P(Y = y)\) to obtain \(P(X = x \mid Y = y) = P(X = x)\).
  • (iii) \(\Rightarrow\) (ii): if \(P(Y = y) > 0\), multiply (iii) by \(P(Y = y)\) to recover \(P(X = x, Y = y) = P(X = x) P(Y = y)\). If \(P(Y = y) = 0\), then \(\{X = x, Y = y\} \subseteq \{Y = y\}\) has probability \(0\), and the right-hand side \(P(X = x) P(Y = y) = 0\) as well; (ii) holds trivially. Thus (ii) holds for every \((x, y)\).

Proposition — Stability of independence under functions
Let \(X \perp Y\) and let \(f : X(\Omega) \to F'\) and \(g : Y(\Omega) \to G'\) be two applications. Then \(f(X) \perp g(Y)\).

For any \(C \subseteq f(X)(\Omega)\) and any \(D \subseteq g(Y)(\Omega)\), \(\{f(X) \in C\} = \{X \in f^{-1}(C)\}\) and \(\{g(Y) \in D\} = \{Y \in g^{-1}(D)\}\). By independence of \(X\) and \(Y\) applied to the sub-events \(A = f^{-1}(C)\) and \(B = g^{-1}(D)\), $$ \begin{aligned} P(f(X) \in C, \ g(Y) \in D) \ &= \ P(X \in f^{-1}(C), \ Y \in g^{-1}(D)) \\ &= \ P(X \in f^{-1}(C)) \cdot P(Y \in g^{-1}(D)) \\ &= \ P(f(X) \in C) \cdot P(g(Y) \in D). \end{aligned} $$

Method — Testing independence cell-by-cell
To test \(X \perp Y\):
  1. Compute the joint law \(P(X = x, Y = y)\) for every \((x, y) \in X(\Omega) \times Y(\Omega)\).
  2. Compute the marginals \(P(X = x)\) and \(P(Y = y)\) (sums of rows / columns).
  3. Compare each cell to the product of the corresponding marginals: \(P(X = x, Y = y) \stackrel{?}{=} P(X = x) P(Y = y)\).
A single cell in disagreement is enough to refute independence (characterization (ii) is an « for all \((x, y)\) » statement). Agreement in all cells establishes independence.
Example — Indicators are independent iff events are
Let \(A\) and \(B\) be two events of \((\Omega, P)\). Then the indicator variables \(\indicatrice_A\) and \(\indicatrice_B\) are independent (as random variables) if and only if \(A\) and \(B\) are independent (as events, in the sense of the previous chapter).

By characterization (ii), \(\indicatrice_A \perp \indicatrice_B\) if and only if for all \((a, b) \in \{0, 1\}^2\), \(P(\indicatrice_A = a, \indicatrice_B = b) = P(\indicatrice_A = a) P(\indicatrice_B = b)\). The four cells: $$ \begin{aligned} P(\indicatrice_A = 1, \indicatrice_B = 1) \ &= \ P(A \cap B), && P(\indicatrice_A = 1) P(\indicatrice_B = 1) \ = \ P(A) P(B); \\ P(\indicatrice_A = 1, \indicatrice_B = 0) \ &= \ P(A \cap \overline B), && P(\indicatrice_A = 1) P(\indicatrice_B = 0) \ = \ P(A) (1 - P(B)); \\ P(\indicatrice_A = 0, \indicatrice_B = 1) \ &= \ P(\overline A \cap B), && P(\indicatrice_A = 0) P(\indicatrice_B = 1) \ = \ (1 - P(A)) P(B); \\ P(\indicatrice_A = 0, \indicatrice_B = 0) \ &= \ P(\overline A \cap \overline B), && P(\indicatrice_A = 0) P(\indicatrice_B = 0) \ = \ (1 - P(A))(1 - P(B)). \end{aligned} $$ The first equality \(P(A \cap B) = P(A) P(B)\) is the definition of « \(A\) and \(B\) independent as events ». The other three equalities follow from this one (by the « Independence and complements » Proposition of the previous chapter Probabilities on a finite universe). Conversely, if all four cells agree, the first one is exactly \(A \perp B\). Hence the equivalence.

Skills to practice
  • Verifying or refuting independence of two random variables
III.3 Mutual independence and lemme des coalitions
Mutual independence of \(n\) random variables is the natural generalisation of the two-variable case --- the joint law of the \(n\)-uplet factorises into the product of the \(n\) marginals --- but it is strictly stronger than pairwise independence (a phenomenon already encountered for events in the previous chapter, mutual-versus-pairwise-independence Proposition; the variable-level XOR construction below makes it concrete). The section closes with the lemme des coalitions, an explicit programme result that says: any function of one block of variables is independent of any function of another block, as long as the two blocks come from a mutually independent family.
Definition — Pairwise independence
The random variables \(X_1, \ldots, X_n\) are pairwise independent if for every pair \(i \ne j\) in \(\llbracket 1, n \rrbracket\), \(X_i \perp X_j\).
Definition — Mutual independence
The random variables \(X_1, \ldots, X_n\) (where \(X_i : \Omega \to E_i\)) are mutually independent if for every family of subsets \(A_1 \subseteq X_1(\Omega), \ldots, A_n \subseteq X_n(\Omega)\), $$ P(X_1 \in A_1, \ldots, X_n \in A_n) \ = \ \prod_{i=1}^n P(X_i \in A_i). $$
Proposition — Singleton characterization in the finite case
On a finite probability space, the random variables \(X_1, \ldots, X_n\) are mutually independent if and only if for every \((x_1, \ldots, x_n) \in X_1(\Omega) \times \cdots \times X_n(\Omega)\), $$ P(X_1 = x_1, \ldots, X_n = x_n) \ = \ \prod_{i=1}^n P(X_i = x_i). $$

  • (\(\Rightarrow\)) Take \(A_i = \{x_i\}\) in the Definition.
  • (\(\Leftarrow\)) For arbitrary \(A_i \subseteq X_i(\Omega)\), write \(\{X_i \in A_i\} = \bigsqcup_{x_i \in A_i} \{X_i = x_i\}\) and use additivity to expand the joint probability into a sum over \((x_1, \ldots, x_n) \in A_1 \times \cdots \times A_n\). Each summand factorises by the singleton hypothesis, then the sums separate as a product: $$ \begin{aligned} P(X_1 \in A_1, \ldots, X_n \in A_n) \ &= \ \sum_{(x_1, \ldots, x_n) \in A_1 \times \cdots \times A_n} P(X_1 = x_1, \ldots, X_n = x_n) \\ &= \ \sum_{(x_1, \ldots, x_n) \in A_1 \times \cdots \times A_n} \prod_{i=1}^n P(X_i = x_i) \\ &= \ \prod_{i=1}^n \sum_{x_i \in A_i} P(X_i = x_i) \\ &= \ \prod_{i=1}^n P(X_i \in A_i). \end{aligned} $$

Proposition — Mutual implies pairwise; converse false
Let \(X_1, \ldots, X_n\) be mutually independent. Then they are pairwise independent. The converse is false: there exist random variables that are pairwise independent but not mutually independent.

Direct sense. Pick any pair \(i \ne j\). To show \(X_i \perp X_j\), take \(A_k = X_k(\Omega)\) for \(k \notin \{i, j\}\) in the mutual-independence definition: this gives \(\{X_k \in X_k(\Omega)\} = \Omega\) and \(P(X_k \in X_k(\Omega)) = 1\), so the joint probability collapses to \(P(X_i \in A_i, X_j \in A_j)\) on the left and \(P(X_i \in A_i) P(X_j \in A_j)\) on the right --- which is exactly \(X_i \perp X_j\).
Counter-example. Take \(\Omega = \{0, 1\}^2\) uniform (two independent fair coins). Define \(X_1, X_2 : \Omega \to \{0, 1\}\) by \(X_1(\omega_1, \omega_2) = \omega_1\), \(X_2(\omega_1, \omega_2) = \omega_2\), and \(X_3 = X_1 \oplus X_2\) (XOR, i.e. \(X_3 = X_1 + X_2 \mod 2\)). Each \(X_i \sim \mathcal B(1/2)\). The pairs \((X_1, X_2), (X_1, X_3), (X_2, X_3)\) are each independent (one checks \(P(X_i = a, X_j = b) = 1/4 = (1/2)(1/2)\) for every \((a, b)\)). But the triple is not: \(P(X_1 = 0, X_2 = 0, X_3 = 0) = P(X_1 = 0, X_2 = 0) = 1/4 \ne (1/2)^3 = 1/8\) (since \(X_3 = 0 \oplus 0 = 0\) is forced when \(X_1 = X_2 = 0\)).

Proposition — Sub-family of a mutually independent family
Any sub-family of a mutually independent family of random variables is mutually independent.

Let \(X_1, \ldots, X_n\) be mutually independent and consider the sub-family indexed by \(J \subseteq \llbracket 1, n \rrbracket\) with \(J \ne \emptyset\). To show \((X_j)_{j \in J}\) is mutually independent, fix \(A_j \subseteq X_j(\Omega)\) for \(j \in J\) and take \(A_k = X_k(\Omega)\) for \(k \notin J\) in the mutual-independence definition. Then \(P(X_k \in A_k) = 1\) for \(k \notin J\), so the joint probability and the product both collapse to the indices in \(J\), leaving \(P(\bigcap_{j \in J} \{X_j \in A_j\}) = \prod_{j \in J} P(X_j \in A_j)\).

Proposition — Stability of mutual independence under functions
Let \(X_1, \ldots, X_n\) be mutually independent random variables (with \(X_i : \Omega \to E_i\)), and let \(f_1 : E_1 \to F_1, \ldots, f_n : E_n \to F_n\) be applications. Then \(f_1(X_1), \ldots, f_n(X_n)\) are mutually independent.

For any \(C_1 \subseteq f_1(X_1)(\Omega), \ldots, C_n \subseteq f_n(X_n)(\Omega)\), we have \(\{f_i(X_i) \in C_i\} = \{X_i \in f_i^{-1}(C_i)\}\). Applying mutual independence of \(X_1, \ldots, X_n\) to the pre-images \(f_i^{-1}(C_i)\), $$ \begin{aligned} P\bigl(f_1(X_1) \in C_1, \ldots, f_n(X_n) \in C_n\bigr) \ &= \ P\bigl(X_1 \in f_1^{-1}(C_1), \ldots, X_n \in f_n^{-1}(C_n)\bigr) \\ &= \ \prod_{i=1}^n P\bigl(X_i \in f_i^{-1}(C_i)\bigr) \\ &= \ \prod_{i=1}^n P\bigl(f_i(X_i) \in C_i\bigr). \end{aligned} $$ Hence \(f_1(X_1), \ldots, f_n(X_n)\) are mutually independent.

Proposition — Lemme des coalitions\(\virgule\) two-coalition case
Let \(X_1, \ldots, X_n\) be mutually independent random variables (with \(X_i : \Omega \to E_i\)), and let \(1 \le k < n\). For any applications \(f : E_1 \times \cdots \times E_k \to F\) and \(g : E_{k+1} \times \cdots \times E_n \to G\), the variables $$ U \ := \ f(X_1, \ldots, X_k) \qquad \text{and} \qquad V \ := \ g(X_{k+1}, \ldots, X_n) $$ are independent.

Fix \(u \in U(\Omega)\) and \(v \in V(\Omega)\). The event \(\{U = u, V = v\}\) is the disjoint union, over the \((x_1, \ldots, x_n)\) with \(f(x_1, \ldots, x_k) = u\) and \(g(x_{k+1}, \ldots, x_n) = v\), of the events \(\{X_1 = x_1, \ldots, X_n = x_n\}\). By the singleton characterization of mutual independence, $$ \begin{aligned} P(U = u, V = v) \ &= \ \sum_{\substack{f(x_1, \ldots, x_k) = u \\ g(x_{k+1}, \ldots, x_n) = v}} P(X_1 = x_1, \ldots, X_n = x_n) \\ &= \ \sum_{\substack{f(x_1, \ldots, x_k) = u \\ g(x_{k+1}, \ldots, x_n) = v}} \prod_{i=1}^n P(X_i = x_i) \\ &= \ \Bigl(\sum_{f(x_1, \ldots, x_k) = u} \prod_{i=1}^k P(X_i = x_i)\Bigr) \cdot \Bigl(\sum_{g(x_{k+1}, \ldots, x_n) = v} \prod_{i=k+1}^n P(X_i = x_i)\Bigr) \\ &= \ P(U = u) \cdot P(V = v), \end{aligned} $$ where the third line splits the sum over the two independent groups of indices and the fourth recognises each factor as the law of \(U\) (resp. \(V\)) by the law-of-\(f(X)\) proposition applied to the \(k\)-tuple \((X_1, \ldots, X_k)\) (resp. the \((n-k)\)-tuple). By the singleton characterization of independence of two random variables (Theorem on equivalent forms of two-variable independence, item (ii)), this proves \(U \perp V\).

This chapter also includes the extension of the lemme des coalitions to a partition of \(\llbracket 1, n \rrbracket\) into \(m \ge 2\) non-empty blocks: any \(m\)-tuple of variables, each obtained as a function of one block, is mutually independent. The proof is an iteration of the two-coalition case; we admit it.
Method — Using the lemme des coalitions to split an experiment
Whenever an experiment splits naturally into two (or more) blocks of mutually independent variables --- e.g. the first \(k\) trials versus the last \(n - k\), or the rolls of die A versus the rolls of die B --- the lemme des coalitions allows one to treat each block as a single random variable independent of the others. Concretely: any function of block 1 is independent of any function of block 2.
Example — Application to the binomial proof
In the proof of the construction of \(\mathcal B(n, p)\) (binomial-construction Theorem above), the partial sum \(S_k = X_1 + \cdots + X_k\) and the next variable \(X_{k+1}\) are independent. Indeed, \(S_k\) is a function of the block \((X_1, \ldots, X_k)\) and \(X_{k+1}\) is the sole element of the block \(\{X_{k+1}\}\); the two blocks come from a mutually independent family \(X_1, \ldots, X_{k+1}\), so the lemme des coalitions yields \(S_k \perp X_{k+1}\). This is exactly what was needed to compute \(P(S_n = k)\) by recursion or by the direct combinatorial argument.
Skills to practice
  • Verifying mutual independence and applying the lemme des coalitions
IV Finite counting models
This section is an enrichment block. The two models below --- hypergeometric (drawing without replacement) and truncated geometric (first success in capped trials) --- are not named standard laws of the program, but they are first-class applications of the chapter Counting and they appear naturally in concrete problems (urn draws, password attempts, lotteries, defect detection). We present them as models rather than « lois usuelles » to keep the boundary clear: the program-named laws are only the three of the named-laws section (uniform, Bernoulli, binomial); these two are derived from a counting argument and the definition of independence.
IV.1 Hypergeometric counting model
The hypergeometric model is the law of « number of successes when drawing \(n\) balls without replacement from an urn of \(N\) balls, \(K\) of which are successes ». It is the « cousin without replacement » of the binomial: the binomial models drawing with replacement (\(n\) independent trials of the same Bernoulli), the hypergeometric models drawing without replacement (the trials are no longer independent --- removing a success-ball changes the proportion for the next draw).
Definition — Hypergeometric model
Let \(N \ge 1\), \(K \in \llbracket 0, N \rrbracket\), \(n \in \llbracket 1, N \rrbracket\). A random variable \(X\) valued in \(\llbracket 0, n \rrbracket\) follows the hypergeometric model of parameters \((N, K, n)\), written \(X \sim \mathcal H(N, K, n)\), if for every \(k \in \llbracket 0, n \rrbracket\), $$ P(X = k) \ = \ \frac{\dbinom{K}{k} \dbinom{N - K}{n - k}}{\dbinom{N}{n}}, $$ with the convention \(\binom{a}{b} = 0\) for \(b < 0\) or \(b > a\) (so the formula automatically gives \(P(X = k) = 0\) outside the support \(\max(0, n - (N - K)) \le k \le \min(n, K)\)).
Method — Recognising the hypergeometric model
Use \(\mathcal H(N, K, n)\) to model:
  1. a finite urn (or population) of \(N\) items, of which \(K\) are « marked successes » and \(N - K\) are « non-successes »;
  2. a draw of \(n\) items without replacement (equivalently, a simultaneous draw of \(n\) items);
  3. \(X = \) the number of marked items in the drawn sample.
The formula re-derives in one line by counting (favourable cases over total cases): \(\binom{K}{k}\) ways to choose the \(k\) marked items from the \(K\) available, \(\binom{N-K}{n-k}\) ways to choose the \(n - k\) non-marked items from the \(N - K\) available, all divided by the \(\binom{N}{n}\) ways to choose any \(n\) items from \(N\).
Example — Three balls drawn from an urn
An urn contains \(N = 10\) balls of which \(K = 4\) are red. Draw \(n = 3\) balls simultaneously (without replacement). Let \(X = \) number of red balls drawn. Then \(X \sim \mathcal H(10, 4, 3)\), and $$ P(X = 2) \ = \ \frac{\binom{4}{2} \binom{6}{1}}{\binom{10}{3}} \ = \ \frac{6 \cdot 6}{120} \ = \ \frac{36}{120} \ = \ \frac{3}{10}. $$
Example — Contrast with the binomial --- with vs without replacement
Same urn (\(10\) balls, \(4\) red), but draw \(n = 3\) balls now with replacement. Then each draw is an independent Bernoulli of parameter \(4/10 = 2/5\), so the number of red balls \(X' \sim \mathcal B(3, 2/5)\), and $$ P(X' = 2) \ = \ \binom{3}{2}(2/5)^2(3/5) \ = \ 3 \cdot \frac{4}{25} \cdot \frac{3}{5} \ = \ \frac{36}{125} \ \approx \ 0{,}288. $$ Compared with \(P(X = 2) = 3/10 = 0{,}300\) for the hypergeometric without-replacement model, the difference is small but real. The two models converge when \(n \ll N\) (drawing few balls from a large urn behaves like drawing with replacement), but they diverge when \(n\) is comparable to \(N\).
Skills to practice
  • Drawing without replacement (hypergeometric model)
IV.2 Truncated geometric counting model
The truncated geometric model is the law of the first-success time in a sequence of Bernoulli trials, capped at \(T\): if no success occurs before trial \(T\), the value is set to \(T\). The last value \(T\) collects two distinct situations --- first success at trial \(T\), and no success at all in the first \(T\) trials. The cap is necessary to keep the universe finite: without the cap, one would need to model an infinite sequence of trials, which falls outside the program.
Definition — Truncated geometric model
Let \(T \ge 1\) and \(p \in [0, 1]\). A random variable \(X\) valued in \(\llbracket 1, T \rrbracket\) follows the truncated geometric model of parameters \((T, p)\) if $$ \forall k \in \llbracket 1, T - 1 \rrbracket, \ P(X = k) \ = \ p \, (1 - p)^{k - 1}, \qquad P(X = T) \ = \ (1 - p)^{T - 1}. $$ The interpretation: \(X = k\) with \(k < T\) means « \(k - 1\) failures then a success at trial \(k\) » (probability \((1-p)^{k-1} p\) by independence). The case \(X = T\) collects all outcomes where the first \(T - 1\) trials were failures (probability \((1-p)^{T-1}\)); the result of the \(T\)-th trial does not matter for \(X\) in this convention. We use the standard convention \(0^0 = 1\), so the formula also covers the boundary case \(T = 1\) with \(p = 1\).
Method — Recognising the truncated geometric model
Use the \((T, p)\) truncated geometric model when:
  1. the experiment is a sequence of \(T\) Bernoulli trials of the same parameter \(p\), mutually independent;
  2. the variable of interest is \(X = \) trial number of the first success, conventionally set to \(T\) if no success occurs in the first \(T - 1\) trials.
The formula re-derives directly: \(\{X = k\}\) for \(k < T\) means \(k - 1\) failures followed by a success, of joint probability \((1-p)^{k-1} p\) by mutual independence. \(\{X = T\}\) means the first \(T - 1\) trials are all failures, of probability \((1-p)^{T-1}\).
The non-truncated geometric law (on \(\mathbb N^*\), no cap \(T\)) is hors programme: it requires an infinite universe \(\Omega\), which is excluded by the program (2021, p. 31). Only the truncated version --- with a finite cap \(T\) --- is in scope here.
Example — Coin flips capped at \(T \equal 4\)
A balanced coin is flipped, with the game stopping at the first heads or after \(T = 4\) flips. Let \(X = \) flip number of the first heads (or \(X = 4\) if no heads in the first \(3\) flips). Then \(X\) follows the truncated geometric model with \(T = 4\) and \(p = 1/2\). The law: $$ \begin{aligned} P(X = 1) \ &= \ 1/2, \\ P(X = 2) \ &= \ (1/2) \cdot (1/2) \ = \ 1/4, \\ P(X = 3) \ &= \ (1/2)^2 \cdot (1/2) \ = \ 1/8, \\ P(X = 4) \ &= \ (1/2)^3 \ = \ 1/8. \end{aligned} $$ Check: \(1/2 + 1/4 + 1/8 + 1/8 = 4/8 + 2/8 + 1/8 + 1/8 = 8/8 = 1\). \(\checkmark\)
Skills to practice
  • Modelling first success in capped trials (truncated geometric)