\( \definecolor{colordef}{RGB}{249,49,84} \definecolor{colorprop}{RGB}{18,102,241} \)
CommeUnJeu · L1 MPSI

Expectation, variance, covariance

⌚ ~107 min ▢ 13 blocks ✓ 40 exercises Prerequisites : Random variables on a finite universe
Having defined random variables and their laws in Random variables on a finite universe, we now extract from the law two scalar summaries that capture its essential shape: the expectation \(E(X)\), an indicator of position (the weighted mean of the values of \(X\)), and the variance \(V(X)\), an indicator of dispersion (the mean squared deviation from the expectation). These two numbers are the workhorses of probabilistic computation: most concrete questions about \(X\) are answered by computing \(E(X)\) and \(V(X)\), then plugging into a named formula or inequality.
The plan has three parts plus an enrichment. The first part defines the expectation, computes it for the standard laws (uniform, Bernoulli, binomial), and proves its three pillars: linearity (the most-used property of the chapter --- it requires no independence), the formula of transfer (the way to compute \(E(f(X))\) without computing the law of \(f(X)\)), and the product formula \(E(X Y) = E(X) E(Y)\) for independent variables. The second part defines the variance, the covariance, and the variance of a sum, culminating in the Bienaymé identity: for pairwise uncorrelated variables, the variance of the sum is the sum of the variances --- the formula that makes \(V(\mathcal B(n, p)) = n p(1-p)\) a one-line consequence. The third part proves the three concentration inequalities of the programme: Markov, Bienaymé-Tchebychev (the deviation-from-mean bound), and the weak law of large numbers, which justifies the frequentist interpretation of probability. An optional fourth part introduces the probabilistic method (Erdős, out of program), a non-constructive existence technique built on top of expectation.
Three reflexes the reader should leave with: (i) linearity of expectation requires no independence --- \(E(X + Y) = E(X) + E(Y)\) holds in full generality; computing the law of \(X + Y\) to recover its expectation is a beginner mistake. (ii) For pairwise uncorrelated \((X_i)_i\) --- in particular, if they are pairwise independent (and a fortiori mutually independent) --- \(V(\sum_i X_i) = \sum_i V(X_i)\); this is the Bienaymé identity and the variance of the binomial is its first showcase. (iii) Uncorrelated does not imply independent: a clean counter-example (\(X\) uniform on \(\{-1, 0, +1\}\), \(Y = X^2\)) is part of the active vocabulary. The chapter closes the probability block of this course; subsequent topics (conditional expectation, characteristic functions, central limit theorem) are second-year material.
I Expectation
The expectation of a random variable summarises its law into a single number, the weighted mean of its values. After the definition and the closed-form expectations of the standard laws (constant, uniform, Bernoulli, binomial), we prove the three pillars: properties of expectation including linearity, the formula of transfer to compute \(E(f(X))\), and the product formula for independent variables. The convention \((\Omega, P)\) is a finite probability space with \(\Omega\) non-empty (inherited from the previous two chapters).
I.1 Definition of expectation
The expectation of \(X\) is the average of its values, each value weighted by its probability of occurrence. The more probable a value, the more it pulls the expectation toward itself. The definition allows real- or complex-valued random variables for full generality (the program permits both); the variance and covariance later in the chapter will require real values.
Definition — Expectation
Let \(X\) be a real- or complex-valued random variable on a finite probability space \((\Omega, P)\). The expectation of \(X\) is the number $$ E(X) \ = \ \sum_{x \in X(\Omega)} x \cdot P(X = x). $$ We say that \(X\) is centered (centrée) if \(E(X) = 0\).
Example — Balanced die
A balanced six-sided die is rolled. Let \(X\) be the value shown. Then \(X(\Omega) = \llbracket 1, 6 \rrbracket\) with \(P(X = k) = 1/6\) for every \(k\). By definition, $$ E(X) \ = \ \sum_{k = 1}^6 k \cdot \frac{1}{6} \ = \ \frac{1 + 2 + 3 + 4 + 5 + 6}{6} \ = \ \frac{21}{6} \ = \ \frac{7}{2}. $$ The expectation \(7/2 = 3{.}5\) is the « average value » a fair die would produce over many rolls --- not a value the die can actually show, but the centre of the distribution.
Example — Indicator of an event
Let \(A \in \mathcal P(\Omega)\) be an event and \(X = \indicatrice_A\) its indicator (the variable that takes the value \(1\) on \(A\) and \(0\) on \(\overline A\)). Then \(X(\Omega) = \{0, 1\}\) with \(P(X = 1) = P(A)\) and \(P(X = 0) = P(\overline A) = 1 - P(A)\). By definition, $$ E(\indicatrice_A) \ = \ 1 \cdot P(A) + 0 \cdot (1 - P(A)) \ = \ P(A). $$ This identity --- \(E(\indicatrice_A) = P(A)\) --- is foundational: every linearity-of-expectation argument in this chapter and beyond ultimately rests on it.
Method — Computing \(E(X)\) from the law
To compute \(E(X)\) by the definition:
  1. List \(X(\Omega)\), the set of values \(X\) can take.
  2. Compute the law \((P(X = x))_{x \in X(\Omega)}\) (cross-link to chapter Random variables on a finite universe).
  3. Sum \(x \cdot P(X = x)\) over \(X(\Omega)\).
This is the « brute-force » route. For sums \(X = X_1 + \cdots + X_n\), the linearity of expectation (proved below) is far faster --- never compute the law of the sum to get its expectation.
Skills to practice
  • Computing \(E(X)\) from the law
I.2 Expectation of standard laws
The four standard cases of the program (constant, uniform, Bernoulli, binomial) admit closed-form expectations. We state the four formulas in a single Theorem and prove each. The binomial is the most subtle: we present the direct proof via the « captain formula » \(k \binom{n}{k} = n \binom{n - 1}{k - 1}\); in the linearity section we will recover the same result in one line via linearity applied to the decomposition \(X = X_1 + \cdots + X_n\) in i.i.d. Bernoullis.
Theorem — Expectation of standard laws
Let \(X\) be a random variable on a finite probability space.
  1. Constant. If \(X = m\) is constant of value \(m\), then \(E(X) = m\).
  2. Uniform. Let \(F = \{x_1, \ldots, x_n\} \subset \mathbb C\). If \(X \sim \mathcal U(F)\), then \(E(X) = \dfrac{1}{n} \sum_{k = 1}^n x_k\) (the arithmetic mean of the values).
  3. Bernoulli. Let \(p \in [0, 1]\). If \(X \sim \mathcal B(p)\), then \(E(X) = p\). In particular, \(E(\indicatrice_A) = P(A)\) for every event \(A\).
  4. Binomial. Let \(n \in \mathbb N^*\) and \(p \in [0, 1]\). If \(X \sim \mathcal B(n, p)\), then \(E(X) = n p\).

  • Constant. \(X(\Omega) = \{m\}\) and \(P(X = m) = 1\), so \(E(X) = m \cdot 1 = m\).
  • Uniform. \(P(X = x_k) = 1/n\) for every \(k\), so by definition, $$ E(X) \ = \ \sum_{k = 1}^n x_k \cdot \frac{1}{n} \ = \ \frac{1}{n} \sum_{k = 1}^n x_k. $$
  • Bernoulli. \(X(\Omega) = \{0, 1\}\) with \(P(X = 1) = p\) and \(P(X = 0) = 1 - p\), so \(E(X) = 1 \cdot p + 0 \cdot (1 - p) = p\). The indicator case \(X = \indicatrice_A\) is a Bernoulli of parameter \(P(A)\) (chapter Random variables on a finite universe), hence \(E(\indicatrice_A) = P(A)\).
  • Binomial. \(P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}\) for \(k \in \llbracket 0, n \rrbracket\). By definition, $$ \begin{aligned} E(X) \ &= \ \sum_{k = 0}^n k \cdot \binom{n}{k} p^k (1 - p)^{n - k} && \text{(definition)} \\ &= \ \sum_{k = 1}^n k \cdot \binom{n}{k} p^k (1 - p)^{n - k} && \text{(the } k = 0 \text{ term vanishes)} \\ &= \ \sum_{k = 1}^n n \binom{n - 1}{k - 1} p^k (1 - p)^{n - k} && \text{(captain formula)} \\ &= \ n p \sum_{k = 1}^n \binom{n - 1}{k - 1} p^{k - 1} (1 - p)^{n - k} && \text{(factor } n p \text{)} \\ &= \ n p \sum_{i = 0}^{n - 1} \binom{n - 1}{i} p^i (1 - p)^{(n - 1) - i} && \text{(re-index } i = k - 1 \text{)} \\ &= \ n p \cdot (p + (1 - p))^{n - 1} && \text{(binomial formula)} \\ &= \ n p. \end{aligned} $$

Method — Standard-law expectations to memorise
Closed-form expectations to commit to memory:
  • \(X \sim \mathcal U(\llbracket 1, n \rrbracket)\) : \(E(X) = (n + 1)/2\) ;
  • \(X \sim \mathcal B(p)\) : \(E(X) = p\) ;
  • \(X \sim \mathcal B(n, p)\) : \(E(X) = n p\).
The first formula is the special case of (ii) for the values \(1, 2, \ldots, n\).
Example — QCM with random answers
A multiple-choice exam has \(n = 10\) questions, each with \(4\) options of which exactly one is correct. A student answers all questions uniformly at random and independently. Let \(X\) be the number of correct answers. Then \(X \sim \mathcal B(10, 1/4)\), hence \(E(X) = 10 \cdot 1/4 = 5/2\). The student should expect about \(2{.}5\) correct answers out of \(10\).
Skills to practice
  • Recognising and applying the standard-law expectation
I.3 Properties of expectation
The algebraic block of expectation. Linearity of expectation is the most-used property of the entire chapter --- it requires no independence, only the existence of \(E(X)\) and \(E(Y)\) on a common probability space. The other three properties (alternative formula, triangle inequality, positivity / monotonicity) follow from the same template: a sum manipulation on the values plus the chapter's foundational identity \(E(\indicatrice_A) = P(A)\).
Theorem — Properties of expectation
Let \(X, Y\) be random variables on a finite probability space \((\Omega, P)\).
  1. Alternative formula (any \(X\), real- or complex-valued): $$ E(X) \ = \ \sum_{\omega \in \Omega} P(\{\omega\}) \cdot X(\omega). $$
  2. Linearity (any \(X, Y\), real- or complex-valued): for all \(\lambda, \mu \in \mathbb C\), $$ E(\lambda X + \mu Y) \ = \ \lambda E(X) + \mu E(Y). $$
  3. Triangle inequality (any \(X\), real- or complex-valued): $$ |E(X)| \ \le \ E(|X|). $$
  4. Positivity (real-valued \(X\) only): \(X \ge 0 \Rightarrow E(X) \ge 0\).
  5. Croissance / monotonicity (real-valued \(X, Y\) only): \(X \le Y \Rightarrow E(X) \le E(Y)\).

  • (i) Alternative formula. The events \(\{X = x\}\) for \(x \in X(\Omega)\) form a complete system of events, hence \(\Omega = \bigsqcup_{x \in X(\Omega)} \{X = x\}\). Therefore, $$ \begin{aligned} \sum_{\omega \in \Omega} P(\{\omega\}) X(\omega) \ &= \ \sum_{x \in X(\Omega)} \sum_{\omega \in \{X = x\}} P(\{\omega\}) X(\omega) && \text{(partition)} \\ &= \ \sum_{x \in X(\Omega)} \sum_{\omega \in \{X = x\}} P(\{\omega\}) \cdot x && \text{(}X(\omega) = x \text{ on } \{X = x\}\text{)} \\ &= \ \sum_{x \in X(\Omega)} x \cdot P(X = x) \ = \ E(X) && \text{(} P(X = x) = \sum_{\omega \in \{X = x\}} P(\{\omega\}) \text{)}. \end{aligned} $$
  • (ii) Linearity. By the alternative formula applied to \(\lambda X + \mu Y\), $$ \begin{aligned} E(\lambda X + \mu Y) \ &= \ \sum_{\omega \in \Omega} P(\{\omega\}) (\lambda X(\omega) + \mu Y(\omega)) && \text{(alt. formula)} \\ &= \ \lambda \sum_{\omega \in \Omega} P(\{\omega\}) X(\omega) + \mu \sum_{\omega \in \Omega} P(\{\omega\}) Y(\omega) && \text{(linearity of sums)} \\ &= \ \lambda E(X) + \mu E(Y). \end{aligned} $$
  • (iii) Triangle inequality. By the alternative formula and the triangle inequality on \(\mathbb C\), $$ |E(X)| \ = \ \Bigl| \sum_{\omega \in \Omega} P(\{\omega\}) X(\omega) \Bigr| \ \le \ \sum_{\omega \in \Omega} P(\{\omega\}) |X(\omega)| \ = \ E(|X|). $$
  • (iv) Positivity. If \(X \ge 0\), then \(X(\omega) \ge 0\) for every \(\omega\) and \(P(\{\omega\}) \ge 0\), so the alternative formula expresses \(E(X)\) as a sum of non-negative terms. Hence \(E(X) \ge 0\).
  • (v) Croissance. If \(X \le Y\), apply (iv) to \(Y - X \ge 0\): \(E(Y - X) \ge 0\). By linearity (ii), \(E(Y - X) = E(Y) - E(X)\), hence \(E(X) \le E(Y)\).

Method — Linearity-first
To compute \(E(X + Y)\), \(E(\lambda X)\), \(E(X_1 + \cdots + X_n)\), never compute the law of the sum. Use linearity directly: \(E(X + Y) = E(X) + E(Y)\), etc. The most common applications:
  • Re-derive \(E(\mathcal B(n, p)) = n p\) in one line: write \(X = X_1 + \cdots + X_n\) where \(X_i \sim \mathcal B(p)\) are i.i.d. (chapter Random variables on a finite universe, « binomial \(=\) sum of \(n\) independent Bernoullis »). Then \(E(X) = \sum_i E(X_i) = n \cdot p\).
  • Compute \(E\) of a count : if \(X = \indicatrice_{A_1} + \cdots + \indicatrice_{A_n}\) counts the number of \(A_i\) that occur, then \(E(X) = P(A_1) + \cdots + P(A_n)\). Independence of the \(A_i\) is not required.
Example — Sum of two dice
Roll two balanced dice ; let \(X_1\) and \(X_2\) be the values. Then \(S = X_1 + X_2\) has expectation, by linearity, $$ E(S) \ = \ E(X_1) + E(X_2) \ = \ \frac{7}{2} + \frac{7}{2} \ = \ 7. $$ Independence is not used --- the formula holds for any joint law on \((X_1, X_2)\). Computing the law of \(S\) on \(\llbracket 2, 12 \rrbracket\) and summing \(s P(S = s)\) would also give \(7\), but with much more work.
Skills to practice
  • Computing expectation by linearity
I.4 Formula of transfer
To compute \(E(f(X))\) for a function \(f\), the « brute-force » route is to compute the law of \(f(X)\) on \(f(X)(\Omega)\) and sum. The formula of transfer bypasses that step entirely: \(E(f(X))\) can be expressed directly as a sum on \(X(\Omega)\), weighted by the law of \(X\) and by \(f(x)\). The formula generalises to couples and \(n\)-tuples --- the program-mandated extension that powers many calculations.
Theorem — Formula of transfer
Let \(X\) be a random variable on \((\Omega, P)\) and \(f : X(\Omega) \to \mathbb C\) a function.
  1. One variable. The expectation of \(f(X)\) depends only on \(f\) and the law of \(X\): $$ E(f(X)) \ = \ \sum_{x \in X(\Omega)} f(x) \cdot P(X = x). $$
  2. Couples. For \(f : (X, Y)(\Omega) \to \mathbb C\), $$ E(f(X, Y)) \ = \ \sum_{(x, y) \in (X, Y)(\Omega)} f(x, y) \cdot P(X = x, Y = y). $$
  3. \(n\)-tuples. Analogous formula on \((X_1, \ldots, X_n)(\Omega)\) weighted by the joint law \(P(X_1 = x_1, \ldots, X_n = x_n)\).

We prove (i); (ii) and (iii) follow by the same argument applied to the joint law of \((X, Y)\) or \((X_1, \ldots, X_n)\).
The events \(\{X = x\}\) for \(x \in X(\Omega)\) partition \(\Omega\). By the alternative formula for the expectation of \(f(X)\), $$ \begin{aligned} E(f(X)) \ &= \ \sum_{\omega \in \Omega} P(\{\omega\}) f(X(\omega)) && \text{(alt. formula)} \\ &= \ \sum_{x \in X(\Omega)} \sum_{\omega \in \{X = x\}} P(\{\omega\}) f(x) && \text{(partition; } X(\omega) = x \text{)} \\ &= \ \sum_{x \in X(\Omega)} f(x) \cdot \sum_{\omega \in \{X = x\}} P(\{\omega\}) && \text{(factor } f(x) \text{)} \\ &= \ \sum_{x \in X(\Omega)} f(x) \cdot P(X = x). \end{aligned} $$

Method — Transfer-not-law
To compute \(E(X^2)\), \(E(X(X - 1))\), \(E(\cos(\pi X))\), \(E(X Y)\), etc., use the transfer formula directly --- never compute the law of the function. The cost is one sum on \(X(\Omega)\) (or \((X, Y)(\Omega)\)) ; the « brute-force » via the law of \(f(X)\) is essentially never simpler.
Example — Second moment of a uniform law
Let \(X \sim \mathcal U(\llbracket 1, n \rrbracket)\). By transfer with \(f(k) = k^2\) and the classical sum \(\sum_{k=1}^n k^2 = n(n+1)(2n+1)/6\) from chapter Sums, products and binomial coefficients, $$ E(X^2) \ = \ \sum_{k = 1}^n k^2 \cdot \frac{1}{n} \ = \ \frac{1}{n} \cdot \frac{n(n + 1)(2n + 1)}{6} \ = \ \frac{(n + 1)(2n + 1)}{6}. $$ For \(n = 6\) (balanced die), \(E(X^2) = 7 \cdot 13 / 6 = 91/6\). We use this in the variance section to compute the variance of a die roll.
Skills to practice
  • Applying the formula of transfer
I.5 Expectation of a product of independent variables
A multiplicative analogue of linearity, but conditional on independence: \(E(X Y) = E(X) E(Y)\) holds when \(X\) and \(Y\) are independent, and is generally false otherwise. The formula is the bridge to the variance-of-a-sum formula (the cross terms \(E((X - E(X))(Y - E(Y)))\) vanish under independence). Conversely, \(E(X Y) = E(X) E(Y)\) does not imply independence --- a clean counter-example anchors the « décorrélées \(\ne\) indépendantes » distinction.
Theorem — Expectation of a product of independent variables
Let \(X\) and \(Y\) be random variables on \((\Omega, P)\). If \(X\) and \(Y\) are independent, then $$ E(X Y) \ = \ E(X) \cdot E(Y). $$ The result extends naturally to a finite number of mutually independent variables: if \(X_1, \ldots, X_n\) are mutually independent, then \(E(X_1 X_2 \cdots X_n) = E(X_1) E(X_2) \cdots E(X_n)\).

By the formula of transfer for a couple, applied to \(f(x, y) = x y\), on the full product \(X(\Omega) \times Y(\Omega)\) (allowing zero-probability cells, which contribute nothing), $$ \begin{aligned} E(X Y) \ &= \ \sum_{x \in X(\Omega)} \sum_{y \in Y(\Omega)} x y \cdot P(X = x, Y = y) && \text{(transfer for couples)} \\ &= \ \sum_{x \in X(\Omega)} \sum_{y \in Y(\Omega)} x y \cdot P(X = x) P(Y = y) && \text{(independence: joint factorises)} \\ &= \ \Bigl( \sum_{x \in X(\Omega)} x P(X = x) \Bigr) \cdot \Bigl( \sum_{y \in Y(\Omega)} y P(Y = y) \Bigr) && \text{(separate the double sum)} \\ &= \ E(X) \cdot E(Y) && \text{(definitions)}. \end{aligned} $$ The \(n\)-variable case follows by induction on \(n\), using that \(X_1 \cdots X_{n - 1}\) and \(X_n\) are independent (by the chapter 42 lemme des coalitions: any function of \(X_1, \ldots, X_{n - 1}\) is independent of \(X_n\)).

Product equality of expectations does not imply independence
The converse of the theorem is false. Counter-example. Take \(X \sim \mathcal U(\{-1, 0, +1\})\) and \(Y = X^2\). Then \(E(X) = (-1 + 0 + 1)/3 = 0\) and \(E(X Y) = E(X^3) = ((-1)^3 + 0^3 + 1^3)/3 = 0\), so \(E(X Y) = 0 = 0 \cdot E(X^2) = E(X) E(Y)\). But \(X\) and \(Y\) are not independent: \(P(X = 0, Y = 1) = 0\) since \(\{X = 0\} \cap \{Y = 1\} = \{X = 0\} \cap \{X^2 = 1\} = \emptyset\), whereas \(P(X = 0) \cdot P(Y = 1) = (1/3) \cdot (2/3) = 2/9 \ne 0\). So « \(E(X Y) = E(X) E(Y)\) » is strictly weaker than « \(X \perp Y\) ». We re-encounter this counter-example in the covariance section under the name « décorrélées \(\ne\) indépendantes ».
Example — Two independent dice
Roll two independent balanced dice ; let \(X_1, X_2 \sim \mathcal U(\llbracket 1, 6 \rrbracket)\). Then by the product formula, $$ E(X_1 X_2) \ = \ E(X_1) \cdot E(X_2) \ = \ \frac{7}{2} \cdot \frac{7}{2} \ = \ \frac{49}{4} \ = \ 12{.}25. $$ Computing the law of \(X_1 X_2\) on its \(18\)-element support and summing \(k P(X_1 X_2 = k)\) would also give \(49/4\), but with much more work.
Skills to practice
  • Computing \(E(X Y)\) under independence
II Variance\(\virgule\) standard deviation\(\virgule\) covariance
Variance is the natural quadratic indicator of dispersion : the average squared deviation from the expectation. Standard deviation \(\sigma(X) = \sqrt{V(X)}\) is its homogeneous companion, expressed in the units of \(X\). Covariance generalises variance to two variables: \(\mathrm{Cov}(X, X) = V(X)\). The two named theorems of the section --- Properties of variance (with the practical formula \(V(X) = E(X^2) - E(X)^2\)) and Variance of a sum (with the Bienaymé identity for uncorrelated families) --- are the workhorses. The Bernoulli variance is direct ; the binomial variance is stated here but proved in the variance-of-a-sum section as the headline application of Bienaymé.
II.1 Variance and standard deviation
Variance is the mean of the squared deviation from expectation: \(V(X) = E((X - E(X))^2)\). The squared deviation is non-negative, so \(V(X) \ge 0\), and the practical formula \(V(X) = E(X^2) - E(X)^2\) allows computation without the centering step.
Definition — Variance and standard deviation
Let \(X\) be a real-valued random variable on a finite probability space.
  • The variance of \(X\) is the non-negative real number \(V(X) = E((X - E(X))^2)\).
  • The standard deviation of \(X\) is \(\sigma(X) = \sqrt{V(X)}\).
  • \(X\) is centered if \(E(X) = 0\) ; \(X\) is réduite (reduced) if \(V(X) = 1\).
Theorem — Properties of variance
Let \(X\) be a real-valued random variable.
  1. Practical formula: \(V(X) = E(X^2) - E(X)^2\).
  2. Affine transformation: for \(a, b \in \mathbb R\), \(V(aX + b) = a^2 V(X)\). In particular, if \(\sigma(X) > 0\), the variable \((X - E(X)) / \sigma(X)\) is centered and reduced.
  3. Nullity: \(V(X) = 0\) if and only if \(X\) is presque sûrement constante (i.e. \(P(X = E(X)) = 1\)).

  • (i) Practical formula. Let \(m = E(X)\). By definition, $$ \begin{aligned} V(X) \ &= \ E((X - m)^2) && \text{(definition)} \\ &= \ E(X^2 - 2 m X + m^2) && \text{(expand the square)} \\ &= \ E(X^2) - 2 m E(X) + m^2 && \text{(linearity ; } m \text{ constant)} \\ &= \ E(X^2) - 2 m \cdot m + m^2 \ = \ E(X^2) - m^2 \ = \ E(X^2) - E(X)^2. \end{aligned} $$
  • (ii) Affine transformation. \(E(aX + b) = a E(X) + b\) by linearity. Then $$ \begin{aligned} V(aX + b) \ &= \ E((aX + b - a E(X) - b)^2) && \text{(definition)} \\ &= \ E((a(X - E(X)))^2) && \text{(simplify)} \\ &= \ E(a^2 (X - E(X))^2) && \text{(expand)} \\ &= \ a^2 \cdot E((X - E(X))^2) \ = \ a^2 V(X) && \text{(linearity)}. \end{aligned} $$ For the centred-reduced statement, with \(a = 1/\sigma(X)\) and \(b = -E(X)/\sigma(X)\), \(E((X - E(X))/\sigma(X)) = 0\) and \(V((X - E(X))/\sigma(X)) = V(X)/\sigma(X)^2 = 1\).
  • (iii) Nullity. By transfer, \(V(X) = \sum_{x \in X(\Omega)} P(X = x) (x - E(X))^2\). All summed terms are non-negative, so the sum vanishes if and only if each term vanishes, i.e. for every \(x \in X(\Omega)\), either \(P(X = x) = 0\) or \(x = E(X)\). Equivalently, \(P(X = x) = 0\) for every \(x \ne E(X)\) in \(X(\Omega)\), hence \(P(X = E(X)) = \sum_x P(X = x) - \sum_{x \ne E(X)} P(X = x) = 1\).

Almost-surely constant on a finite universe --- operational reading
On a finite probability space, \(V(X) = 0\) means \(X(\omega) = E(X)\) for every \(\omega \in \Omega\) with \(P(\{\omega\}) > 0\). If all elementary outcomes have positive probability --- the typical setup here --- then \(X\) is genuinely constant on the entire \(\Omega\). The almost-sure / sure distinction becomes substantive only on infinite universes (second-year MP material).
Method — Variance via the practical formula
To compute \(V(X)\), never go through \(E((X - E(X))^2)\) by definition. Use the practical formula in two steps:
  1. Compute \(E(X)\) (definition or linearity).
  2. Compute \(E(X^2)\) via the formula of transfer.
  3. Then \(V(X) = E(X^2) - E(X)^2\).
Example — Variance of a balanced die
Let \(X \sim \mathcal U(\llbracket 1, 6 \rrbracket)\). From the definition of expectation and the formula of transfer, \(E(X) = 7/2\) and \(E(X^2) = 91/6\). Hence $$ V(X) \ = \ E(X^2) - E(X)^2 \ = \ \frac{91}{6} - \frac{49}{4} \ = \ \frac{182}{12} - \frac{147}{12} \ = \ \frac{35}{12}. $$ The standard deviation is \(\sigma(X) = \sqrt{35/12} \approx 1{.}71\), comparable to the « range » \(|6 - 1| = 5\) of the values divided by a small constant.
Skills to practice
  • Computing variance via the practical formula
II.2 Variance of a Bernoulli variable
The Bernoulli variance follows directly from the practical formula : a one-line computation. The variance of the binomial \(\mathcal B(n, p)\) will appear in the variance-of-a-sum section as the headline application of the Bienaymé identity ; it is announced here so the reader can use the formula in the meantime.
Theorem — Variance of a Bernoulli variable
Let \(p \in [0, 1]\). If \(X \sim \mathcal B(p)\), then \(V(X) = p(1 - p)\).

For \(X \sim \mathcal B(p)\), \(X(\Omega) = \{0, 1\}\) and \(X^2 = X\) on \(\{0, 1\}\) (since \(0^2 = 0\) and \(1^2 = 1\)). Therefore \(E(X^2) = E(X) = p\). By the practical formula, $$ V(X) \ = \ E(X^2) - E(X)^2 \ = \ p - p^2 \ = \ p(1 - p). $$

Announcement of the binomial variance
For the binomial law, we will prove below (as the headline application of the Bienaymé identity) that for \(X \sim \mathcal B(n, p)\), $$ V(X) \ = \ n p (1 - p). $$ The reader may use this formula in the meantime ; its proof is genuinely shorter after the variance-of-sum machinery is in place.
Example — Variance of a binomial
For \(X \sim \mathcal B(10, 1/4)\) (the QCM example of the standard-law subsection), the variance is \(V(X) = 10 \cdot (1/4) \cdot (3/4) = 30/16 = 15/8 = 1{.}875\) and the standard deviation is \(\sigma(X) = \sqrt{15/8} \approx 1{.}37\). The student's score is concentrated around its expectation \(5/2 = 2{.}5\) within roughly one or two standard deviations.
Skills to practice
  • Computing the variance of Bernoulli and binomial laws
II.3 Covariance
The covariance generalises the variance to two variables. Two variables of zero covariance are called décorrélées (uncorrelated). Independence implies decorrelation, but the converse is false --- the same counter-example \(X \sim \mathcal U(\{-1, 0, +1\})\) and \(Y = X^2\) from the product-formula subsection illustrates the gap. The covariance feeds the variance-of-a-sum formula proved below.
Definition — Covariance\(\virgule\) décorrélées
Let \(X, Y\) be real-valued random variables. The covariance of \(X\) and \(Y\) is the real number $$ \mathrm{Cov}(X, Y) \ = \ E\bigl((X - E(X))(Y - E(Y))\bigr). $$ \(X\) and \(Y\) are décorrélées (uncorrelated) if \(\mathrm{Cov}(X, Y) = 0\). By construction, \(\mathrm{Cov}(X, X) = V(X)\) and \(\mathrm{Cov}(Y, X) = \mathrm{Cov}(X, Y)\).
Proposition — König-Huygens formula for covariance
For \(X, Y\) real-valued random variables, $$ \mathrm{Cov}(X, Y) \ = \ E(X Y) - E(X) E(Y). $$

Expand the product in the definition: $$ \begin{aligned} \mathrm{Cov}(X, Y) \ &= \ E\bigl((X - E(X))(Y - E(Y))\bigr) && \text{(definition)} \\ &= \ E\bigl(X Y - X E(Y) - E(X) Y + E(X) E(Y)\bigr) && \text{(expand)} \\ &= \ E(X Y) - E(Y) E(X) - E(X) E(Y) + E(X) E(Y) && \text{(linearity ; constants pull out)} \\ &= \ E(X Y) - E(X) E(Y). \end{aligned} $$

Proposition — Independent implies uncorrelated
If \(X\) and \(Y\) are independent real-valued random variables, then \(\mathrm{Cov}(X, Y) = 0\).

By the König-Huygens formula and the product formula for independent variables, \(\mathrm{Cov}(X, Y) = E(X Y) - E(X) E(Y) = E(X) E(Y) - E(X) E(Y) = 0\).

Décorrélation does not imply indépendance
The converse of the previous Proposition is false. Counter-example. Take \(X \sim \mathcal U(\{-1, 0, +1\})\) and \(Y = X^2\). As computed in the product-formula counter-example above, \(E(X Y) = 0 = E(X) E(Y)\), so by König-Huygens \(\mathrm{Cov}(X, Y) = 0\) : \(X\) and \(Y\) are uncorrelated. But \(P(X = 0, Y = 1) = 0 \ne (1/3)(2/3) = P(X = 0) P(Y = 1)\), so \(X\) and \(Y\) are not independent. Décorrélation is strictly weaker than indépendance.
Cauchy-Schwarz inequality for covariance --- out of program
The following inequality is out of program (the Cauchy-Schwarz of the program is in chapter Espaces préhilbertiens, in the inner-product setting). We cite it without proof: $$ |\mathrm{Cov}(X, Y)| \ \le \ \sigma(X) \cdot \sigma(Y). $$ The argument applies the Cauchy-Schwarz inequality to the centered variables \(U = X - E(X)\) and \(V = Y - E(Y)\), with the bilinear form \(\langle U, V \rangle = E(U V)\) ; a fully rigorous proof belongs to the chapter on prehilbertian spaces. Useful in some exercises.
Example — Two independent dice ; one die and its complement
(a) Independent dice. If \(X_1, X_2 \sim \mathcal U(\llbracket 1, 6 \rrbracket)\) are independent, then \(\mathrm{Cov}(X_1, X_2) = 0\) by the previous Proposition.
(b) One die and its complement. Roll one balanced die ; let \(X\) be the value and \(Y = 7 - X\). Then \(X + Y = 7\) identically, so by linearity \(E(X) = E(Y) = 7/2\). By König-Huygens and transfer, $$ \begin{aligned} \mathrm{Cov}(X, Y) \ &= \ E(X(7 - X)) - E(X) E(7 - X) && \text{(König-Huygens with } Y = 7 - X) \\ &= \ 7 E(X) - E(X^2) - E(X)(7 - E(X)) && \text{(linearity)} \\ &= \ 7 E(X) - E(X^2) - 7 E(X) + E(X)^2 && \text{(distribute)} \\ &= \ E(X)^2 - E(X^2) \ = \ -V(X) \ = \ -\frac{35}{12}. \end{aligned} $$ The covariance is strictly negative : when \(X\) is large, \(Y = 7 - X\) is small. We will use this example again in the variance-of-a-sum section to show that \(V(X + Y) = 0\) via the full \(V + 2 \mathrm{Cov} + V\) formula.
Skills to practice
  • Computing covariance and refuting independence
II.4 Variance of a sum\(\virgule\) Bienaymé identity\(\virgule\) application to the binomial
The headline formula of the section. The two-variable version \(V(X + Y) = V(X) + 2 \mathrm{Cov}(X, Y) + V(Y)\) generalises to \(n\) variables, and the special case where the \(X_i\) are pairwise uncorrelated reduces to the Bienaymé identity \(V(\sum_i X_i) = \sum_i V(X_i)\). The first showcase of Bienaymé is the variance of the binomial \(V(\mathcal B(n, p)) = n p (1 - p)\), whose proof was postponed from the variance-properties subsection.
Theorem — Variance of a sum and Bienaymé identity
Let \(X, Y, X_1, \ldots, X_n\) be real-valued random variables on a finite probability space.
  1. Two variables. \(V(X + Y) = V(X) + 2 \mathrm{Cov}(X, Y) + V(Y)\).
  2. \(n\) variables. \(\displaystyle V\bigl(\sum_{i = 1}^n X_i\bigr) = \sum_{i = 1}^n V(X_i) + 2 \sum_{1 \le i < j \le n} \mathrm{Cov}(X_i, X_j)\).
  3. Identité de Bienaymé. If \(X_1, \ldots, X_n\) are pairwise uncorrelated --- in particular, if they are pairwise independent (and a fortiori if they are mutually independent) --- then \(\displaystyle V\bigl(\sum_{i = 1}^n X_i\bigr) = \sum_{i = 1}^n V(X_i)\).

We prove (i); (ii) is the analogous expansion with \(\binom{n}{2}\) cross terms; (iii) is the special case of (ii) where every \(\mathrm{Cov}(X_i, X_j) = 0\).
For (i), let \(m = E(X)\) and \(\mu = E(Y)\), so \(E(X + Y) = m + \mu\). By definition, $$ \begin{aligned} V(X + Y) \ &= \ E((X + Y - m - \mu)^2) && \text{(definition)} \\ &= \ E\bigl(((X - m) + (Y - \mu))^2\bigr) && \text{(regroup)} \\ &= \ E\bigl((X - m)^2 + 2 (X - m)(Y - \mu) + (Y - \mu)^2\bigr) && \text{(expand square)} \\ &= \ E((X - m)^2) + 2 E((X - m)(Y - \mu)) + E((Y - \mu)^2) && \text{(linearity)} \\ &= \ V(X) + 2 \mathrm{Cov}(X, Y) + V(Y) && \text{(definitions)}. \end{aligned} $$

Method — Bienaymé identity in action
When the \(X_i\) are pairwise uncorrelated --- in particular, when they are pairwise independent (and a fortiori mutually independent) --- the variance of the sum is the sum of the variances. The most common applications:
  • \(X = X_1 + \cdots + X_n\) with \(X_i\) i.i.d. \(\mathcal B(p)\) mutually independent gives \(V(X) = n p (1 - p)\) in one line --- this is the binomial variance below.
  • Sum of \(n\) i.i.d. variables of common variance \(\sigma^2\) : \(V(\sum_i X_i) = n \sigma^2\), hence \(V(\bar X_n) = V(\frac{1}{n} \sum_i X_i) = \sigma^2 / n\). Foundation of the weak law of large numbers proved below.
Cross-link to chapter 42's lemme des coalitions when the independence comes from a coalition argument.
Theorem — Variance of the binomial
Let \(n \in \mathbb N^*\) and \(p \in [0, 1]\). If \(X \sim \mathcal B(n, p)\), then \(V(X) = n p (1 - p)\).

By the chapter 42 « binomial \(=\) sum of \(n\) independent Bernoullis » Theorem, \(X\) has the same law as \(X_1 + \cdots + X_n\) where the \(X_i\) are i.i.d. \(\mathcal B(p)\) mutually independent. Mutual independence implies pairwise independence implies pairwise uncorrelation. By the Bienaymé identity, $$ V(X) \ = \ V(X_1 + \cdots + X_n) \ = \ V(X_1) + \cdots + V(X_n) \ = \ n \cdot p(1 - p), $$ where we used \(V(X_i) = p(1 - p)\) from the Bernoulli variance computation above. Closes the proof loop opened in the variance-properties subsection.

Example — Sum of two non-independent dice --- covariance term matters
Continuing the covariance example with the die: roll one die, set \(X\) = the value and \(Y = 7 - X\). Then \(X + Y = 7\) identically, so \(V(X + Y) = 0\). By the full formula, $$ \begin{aligned} V(X + Y) \ &= \ V(X) + 2 \mathrm{Cov}(X, Y) + V(Y) && \text{(theorem (i))} \\ &= \ \frac{35}{12} + 2 \cdot \Bigl(-\frac{35}{12}\Bigr) + \frac{35}{12} && \text{(values from above)} \\ &= \ 0 \ \checkmark. \end{aligned} $$ When the \(X_i\) are not pairwise uncorrelated, you must keep the \(\mathrm{Cov}\) terms --- the Bienaymé identity does not apply.
Skills to practice
  • Applying the Bienaymé identity
III Probabilistic inequalities
The three concentration inequalities of the program. Markov bounds the right tail of a non-negative variable by its expectation. Applied to \((X - E(X))^2\), Markov yields the Bienaymé-Tchebychev inequality, which bounds the deviation from the mean by the variance. Applied to the empirical mean \(S_n / n\) of \(n\) i.i.d. variables, Bienaymé-Tchebychev yields the weak law of large numbers: the empirical mean concentrates around the true mean as \(n \to \infty\), justifying the frequentist interpretation of probability.
III.1 Markov inequality
The simplest concentration inequality : for a non-negative random variable, the probability that \(X\) exceeds a threshold \(a\) is bounded by the ratio \(E(X)/a\). The general \(|X|\) form is an immediate corollary, and applying Markov to \(f(X)\) for a well-chosen \(f\) produces refinements.
Theorem — Markov inequality
  1. Standard form (non-negative variable). Let \(X\) be a non-negative real-valued random variable and \(a > 0\). Then $$ P(X \ge a) \ \le \ \frac{E(X)}{a}. $$
  2. Absolute-value form (any real- or complex-valued \(X\)). For \(a > 0\), $$ P(|X| \ge a) \ \le \ \frac{E(|X|)}{a}. $$

(i) Standard form. Since \(X \ge 0\) and \(a > 0\), $$ a \indicatrice_{\{X \ge a\}} \ \le \ X \indicatrice_{\{X \ge a\}} \ \le \ X \qquad (\text{pointwise on } \Omega). $$ The first inequality holds because \(\indicatrice_{\{X \ge a\}}\) is \(0\) where \(X < a\) (both sides are \(0\)) and \(a \le X\) where \(X \ge a\) ; the second because \(\indicatrice \in \{0, 1\}\) and \(X \ge 0\). Take expectations using positivity / linearity: $$ a P(X \ge a) \ = \ E(a \indicatrice_{\{X \ge a\}}) \ \le \ E(X), \qquad \text{hence} \quad P(X \ge a) \le E(X)/a. $$ (ii) Absolute-value form. Apply (i) to the non-negative variable \(|X|\).

Method — Markov via a monotone function
To bound \(P(X \ge a)\) for \(a \in \mathbb R\), choose a function \(f : \mathbb R \to \mathbb R_+\) with three properties:
  1. increasing;
  2. takes non-negative values (so \(f(X) \ge 0\));
  3. satisfies \(f(a) > 0\).
By increasing-ness, \(\{X \ge a\} \subseteq \{f(X) \ge f(a)\}\) ; apply Markov to the non-negative variable \(f(X)\): $$ P(X \ge a) \ \le \ P(f(X) \ge f(a)) \ \le \ \frac{E(f(X))}{f(a)}. $$ Common choices: \(f(x) = x^2\) on \(\mathbb R_+\) (gives a baby Bienaymé-Tchebychev) ; \(f(x) = e^{tx}\) for \(t > 0\) (the Chernoff bound --- out of program but cited for cultural reference).
Example — Markov on a binomial
Let \(X \sim \mathcal B(100, 1/2)\), so \(E(X) = 50\) and \(X \ge 0\). By Markov, $$ P(X \ge 80) \ \le \ \frac{E(X)}{80} \ = \ \frac{50}{80} \ = \ \frac{5}{8}. $$ This bound is very loose --- the exact binomial tail \(P(X \ge 80)\) is much smaller (of order \(10^{-9}\)). The looseness motivates the refinement via Bienaymé-Tchebychev in the next subsection, which exploits the variance instead of just the expectation.
Skills to practice
  • Applying Markov directly and via \(f(X)\)
III.2 Bienaymé-Tchebychev inequality
Markov applied to the squared deviation \((X - E(X))^2\) gives the canonical deviation-from-mean bound : \(P(|X - E(X)| \ge a) \le V(X)/a^2\). Sharper than Markov when \(V\) is small. The squared structure replaces the linear ratio of Markov by a quadratic one.
Theorem — Bienaymé-Tchebychev inequality
Let \(X\) be a real-valued random variable on a finite probability space and \(a > 0\). Then $$ P(|X - E(X)| \ge a) \ \le \ \frac{V(X)}{a^2}. $$

Squaring is increasing on \(\mathbb R_+\), so the events \(\{|X - E(X)| \ge a\}\) and \(\{(X - E(X))^2 \ge a^2\}\) coincide. Apply Markov (standard form) to the non-negative variable \((X - E(X))^2\): $$ P(|X - E(X)| \ge a) \ = \ P((X - E(X))^2 \ge a^2) \ \le \ \frac{E((X - E(X))^2)}{a^2} \ = \ \frac{V(X)}{a^2}. $$

Method — Bounding deviation from the mean
When bounding \(P(|X - \mu| \ge a)\) with \(\mu = E(X)\), reach for Bienaymé-Tchebychev first. Compute \(V(X)\) (via the practical formula or Bienaymé) and divide by \(a^2\).
Example — Bienaymé-Tchebychev on a binomial
Continuing the previous example, \(X \sim \mathcal B(100, 1/2)\) has \(E(X) = 50\) and \(V(X) = 100 \cdot 1/2 \cdot 1/2 = 25\). The event \(\{X \ge 80\}\) is contained in \(\{|X - 50| \ge 30\}\), so by Bienaymé-Tchebychev, $$ P(X \ge 80) \ \le \ P(|X - 50| \ge 30) \ \le \ \frac{V(X)}{30^2} \ = \ \frac{25}{900} \ \approx \ 0{.}028. $$ A factor of \(\approx 22\) tighter than Markov's \(5/8 = 0{.}625\), achieved by exploiting \(V(X)\) rather than just \(E(X)\).
Skills to practice
  • Bounding deviations from the mean
III.3 Weak law of large numbers and frequentist interpretation
Bienaymé-Tchebychev applied to the empirical mean \(S_n / n\) of \(n\) i.i.d. variables yields the weak law of large numbers: for every \(\varepsilon > 0\), \(P(|S_n/n - m| \ge \varepsilon) \to 0\) as \(n \to \infty\). This is the rigorous form of the « frequencies converge to probabilities » intuition --- the frequentist interpretation of probability that the program explicitly mandates.
Theorem — Weak law of large numbers
For every \(n \ge 1\), let \(X_1, \ldots, X_n\) be real-valued random variables on a finite probability space, mutually independent and identically distributed (i.i.d.), with common expectation \(m = E(X_1)\) and common standard deviation \(\sigma = \sigma(X_1)\). Set \(S_n = X_1 + \cdots + X_n\). Then, for every \(\varepsilon > 0\), $$ P\Bigl(\bigl|\tfrac{S_n}{n} - m\bigr| \ge \varepsilon\Bigr) \ \le \ \frac{\sigma^2}{n \varepsilon^2}. $$ In particular, when this construction is repeated for arbitrarily large \(n\) (each \(n\) on its own finite probability space), the right-hand side tends to \(0\) as \(n \to +\infty\), so the empirical mean \(S_n/n\) concentrates around the true mean \(m\).

Compute the expectation and variance of \(S_n/n\). By linearity, $$ E\Bigl(\frac{S_n}{n}\Bigr) \ = \ \frac{E(X_1) + \cdots + E(X_n)}{n} \ = \ \frac{n m}{n} \ = \ m. $$ For the variance, the \(X_i\) are mutually independent hence pairwise uncorrelated, so Bienaymé applies : $$ \begin{aligned} V\Bigl(\frac{S_n}{n}\Bigr) \ &= \ \frac{V(S_n)}{n^2} && \text{(scaling by \(1/n\))} \\ &= \ \frac{V(X_1) + \cdots + V(X_n)}{n^2} && \text{(Bienaymé identity)} \\ &= \ \frac{n \sigma^2}{n^2} \ = \ \frac{\sigma^2}{n} && \text{(equal variances ; arithmetic)}. \end{aligned} $$ Apply Bienaymé-Tchebychev to \(S_n/n\) at threshold \(\varepsilon\): $$ P\Bigl(\bigl|\tfrac{S_n}{n} - m\bigr| \ge \varepsilon\Bigr) \ \le \ \frac{V(S_n/n)}{\varepsilon^2} \ = \ \frac{\sigma^2}{n \varepsilon^2}. $$

Method — Frequentist interpretation of probability
The empirical frequency of an event \(A\) over \(n\) independent trials of an experiment converges (in probability) to \(P(A)\). This is what makes probability « about the world » --- without this convergence, the formal number \(P(A)\) would have no operational meaning. Concretely, with \(X_i = \indicatrice_{A_i}\) for \(A_i\) = « \(A\) occurred at trial \(i\) » : \(S_n/n\) = empirical frequency, \(m = E(X_1) = P(A)\), and the weak law says \(S_n/n \to P(A)\) in probability.
Example — Empirical mean of 1000 die rolls
Roll a balanced die \(n = 1000\) times independently ; let \(X_i\) be the value of roll \(i\) and \(\bar X_n = S_n/n\) the empirical mean. From the expectation and variance of a uniform die roll, \(m = 7/2\) and \(\sigma^2 = 35/12\). By the weak law at \(\varepsilon = 0{.}1\), $$ P\Bigl(\bigl|\bar X_n - \tfrac{7}{2}\bigr| \ge 0{.}1\Bigr) \ \le \ \frac{\sigma^2}{n \varepsilon^2} \ = \ \frac{35/12}{1000 \cdot 0{.}01} \ = \ \frac{35}{120} \ \approx \ 0{.}29. $$ A bound, not a tight estimate (the true probability is far smaller), but enough to confirm that the empirical mean is within \(0{.}1\) of \(3{.}5\) at least \(71\%\) of the time --- already a useful guarantee.
Skills to practice
  • Applying the weak law of large numbers
IV Going further (out of program)
Out of program. An optional enrichment block on the probabilistic method (Erdős): a non-constructive existence technique that uses expectation to show that some object with a desired property exists, without ever constructing it explicitly. Read for the curious student and for the cross-disciplinary connection to combinatorics and graph theory; not assessed.
IV.1 The probabilistic method
The principle, popularised by Paul Erdős (1913--1996) : to show that some element of a finite set \(E\) has a given property, define a random variable \(X\) on \(E\) and show that the probability of « \(X\) has the property » is positive. If yes, there exists at least one \(\omega\) such that \(X(\omega)\) has the property --- so the desired element exists. The technique is « l'espoir fait vivre » (hope makes alive) : computing an expectation can prove existence.
Theorem — Existence by expectation (l'espoir fait vivre)
Let \(X\) be a real-valued random variable on a finite probability space. Then \(P(X \le E(X)) > 0\) and \(P(X \ge E(X)) > 0\). In particular, the sets \(\{x \in X(\Omega) \mid x \le E(X)\}\) and \(\{x \in X(\Omega) \mid x \ge E(X)\}\) are both non-empty.

Let \(m = E(X)\) ; we show \(P(X \le m) > 0\) (the other inequality follows by replacing \(X\) by \(-X\)). Suppose, by contradiction, that \(P(X \le m) = 0\). Since the total probability mass is \(\sum_{x \in X(\Omega)} P(X = x) = 1\) and the mass on \(\{x \le m\}\) is zero by hypothesis, all of the mass is carried by \(\{x > m\}\) : \(\sum_{x > m} P(X = x) = 1\). Then, $$ \begin{aligned} 0 \ &= \ E(X) - m && \text{(hypothesis \(E(X) = m\))} \\ &= \ E(X - m) && \text{(linearity)} \\ &= \ \sum_{\substack{x \in X(\Omega) \\ x > m}} P(X = x) (x - m) && \text{(definition of \(E(X - m)\) ; mass on \(\{x \le m\}\) is \(0\))}, \end{aligned} $$ which is a sum of strictly positive terms (\(x - m > 0\)) weighted by non-negative probabilities summing to \(1\) : the sum is strictly positive, contradicting the left-hand \(0\).

Example — Graph cut --- Erdős' classical existence proof
Let \(G = (V, E)\) be a graph with \(|V| = 2n\) vertices (\(n \ge 1\)) and at least one edge (\(|E| \ge 1\)). Claim: there exists a partition \(V = V_1 \sqcup V_2\) with \(|V_1| = |V_2| = n\) such that strictly more than \(|E|/2\) edges cross the cut (have one endpoint in \(V_1\) and the other in \(V_2\)).
Proof by the probabilistic method. Pick a balanced partition uniformly at random : \(V_1\) is a uniformly chosen \(n\)-element subset of \(V\), \(V_2 = V \setminus V_1\). For each edge \(a = \{x, y\} \in E\), let \(A_a\) = « \(a\) crosses the cut ». By symmetry, $$ P(A_a) \ = \ \frac{2 \binom{2n - 2}{n - 1}}{\binom{2n}{n}} \ = \ \frac{n}{2n - 1} \ > \ \frac{1}{2}, $$ since \(n / (2n - 1) > 1/2 \iff 2n > 2n - 1\). Let \(N\) = number of crossing edges. By linearity of expectation, $$ E(N) \ = \ \sum_{a \in E} P(A_a) \ > \ \frac{|E|}{2}. $$ By the existence-by-expectation Theorem (applied to \(N\)), there exists at least one outcome with \(N > |E|/2\), i.e. a balanced partition with strictly more than \(|E|/2\) crossing edges. \(\blacksquare\)
This is the famous existence-via-expectation argument : we have proved existence of a good partition without constructing it explicitly.
Skills to practice
  • Proving existence by the probabilistic method