
Law of Total Probability (Discrete, From First Principles)

A short walkthrough of the law of total probability in the discrete setting, using notation common in TCS: a finite domain $\mathcal{X}$, distributions $p(x)$ on $\mathcal{X}$, and events as subsets of $\mathcal{X}$.


1. Discrete distributions on a finite domain

Let $\mathcal{X}$ be a finite domain of size $k$, say $\mathcal{X} = [k] = \{1,2,\dots,k\}$.

A discrete distribution on $\mathcal{X}$ is a function

$$p : \mathcal{X} \to [0,1]$$

such that

$$\sum_{x \in \mathcal{X}} p(x) = 1.$$

For an event $A \subseteq \mathcal{X}$, define

$$p(A) = \sum_{x \in A} p(x).$$
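These definitions map directly onto code. Here is a minimal Python sketch, with an illustrative three-point distribution (the numbers are placeholders, not from the text):

```python
# A discrete distribution on a finite domain, as a dict x -> p(x).
# The values are illustrative placeholders.
p = {1: 0.2, 2: 0.5, 3: 0.3}

# Sanity check: p must sum to 1 over the domain.
assert abs(sum(p.values()) - 1.0) < 1e-12

def prob(event, p):
    """p(A) = sum of p(x) over x in the event A (a subset of the domain)."""
    return sum(p[x] for x in event)

print(prob({1, 3}, p))  # 0.2 + 0.3, i.e. approximately 0.5
```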

Two basic rules:

  1. Additivity for disjoint events:
     if $A_1,\dots,A_m \subseteq \mathcal{X}$ are pairwise disjoint, then
     $$p\Big(\bigcup_{i=1}^m A_i\Big) = \sum_{i=1}^m p(A_i).$$
  2. Conditional probability (with $p(B) > 0$):
     $$p(A \mid B) = \frac{p(A \cap B)}{p(B)}.$$

Rewriting:

$$p(A \cap B) = p(A \mid B)\,p(B).$$
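Both the definition and the product-rule rewriting can be checked mechanically. A small Python sketch; the distribution and the events $A$, $B$ below are hypothetical, chosen only for illustration:

```python
def prob(event, p):
    """p(A) = sum of p(x) over x in A."""
    return sum(p[x] for x in event)

def cond_prob(a, b, p):
    """p(A | B) = p(A ∩ B) / p(B), defined only when p(B) > 0."""
    pb = prob(b, p)
    if pb == 0:
        raise ValueError("p(B) must be positive")
    return prob(a & b, p) / pb

# Illustrative distribution and events (placeholder numbers).
p = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
A, B = {1, 2}, {2, 3}

# Product rule: p(A ∩ B) = p(A | B) p(B).
assert abs(prob(A & B, p) - cond_prob(A, B, p) * prob(B, p)) < 1e-12
```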

2. Partitions and the law of total probability

A family of events $B_1, \dots, B_m \subseteq \mathcal{X}$ is a partition of $\mathcal{X}$ if

  1. $B_i \cap B_j = \varnothing$ for all $i \neq j$ (disjoint),
  2. $\bigcup_{i=1}^m B_i = \mathcal{X}$ (they cover the domain).

Equivalently: exactly one $B_i$ holds for each $x \in \mathcal{X}$.
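Both conditions are easy to verify programmatically. A small helper, applied to an illustrative partition of $\{1,\dots,6\}$ (not an example from the text):

```python
from itertools import combinations

def is_partition(blocks, domain):
    """True iff the blocks are pairwise disjoint and cover the domain."""
    disjoint = all(b1.isdisjoint(b2) for b1, b2 in combinations(blocks, 2))
    covers = set().union(*blocks) == set(domain)
    return disjoint and covers

domain = {1, 2, 3, 4, 5, 6}
print(is_partition([{1, 2}, {3, 4}, {5, 6}], domain))  # True
print(is_partition([{1, 2}, {2, 3}, {4, 5, 6}], domain))  # False: 2 appears twice
```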

Take any event $A \subseteq \mathcal{X}$. Because the $B_i$ cover $\mathcal{X}$,

$$A = \bigcup_{i=1}^m (A \cap B_i),$$

and the sets $A \cap B_i$ are pairwise disjoint, because the $B_i$ are.

By additivity,

$$p(A) = \sum_{i=1}^m p(A \cap B_i).$$

Using $p(A \cap B_i) = p(A \mid B_i)\,p(B_i)$ (valid whenever $p(B_i) > 0$; terms with $p(B_i) = 0$ contribute nothing), we get

$$p(A) = \sum_{i=1}^m p(A \mid B_i)\,p(B_i).$$

This is the law of total probability in this notation:

$$\boxed{p(A) = \sum_{i=1}^m p(A \mid B_i)\,p(B_i).}$$
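The boxed identity can be sanity-checked numerically. A short sketch with an illustrative distribution and partition (placeholder numbers, not from the text):

```python
def prob(event, p):
    """p(A) = sum of p(x) over x in A."""
    return sum(p[x] for x in event)

# Illustrative distribution on {1,2,3,4} and a partition of that domain.
p = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
blocks = [{1, 2}, {3, 4}]
A = {2, 3}

# Right-hand side: sum_i p(A | B_i) p(B_i).
rhs = sum((prob(A & b, p) / prob(b, p)) * prob(b, p) for b in blocks)

# Left-hand side computed directly: p(A).
assert abs(prob(A, p) - rhs) < 1e-12
```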

3. Coin example in this notation

We define the randomness of the process as a distribution $p$ over a finite domain $\mathcal{X}$.

Setup

There are two coins:

  • Coin $F$ (fair): $p(\text{H} \mid F) = 0.5$
  • Coin $B$ (biased): $p(\text{H} \mid B) = 0.9$

Step 1: pick a coin.

  • $p(F) = 0.7$
  • $p(B) = 0.3$

Step 2: flip the chosen coin once.

Define the domain

$$\mathcal{X} = \{(F,H), (F,T), (B,H), (B,T)\}.$$

The distribution $p$ on $\mathcal{X}$ is:

  • $p(F,H) = p(F)\,p(H \mid F) = 0.7 \cdot 0.5 = 0.35$
  • $p(F,T) = p(F)\,p(T \mid F) = 0.7 \cdot 0.5 = 0.35$
  • $p(B,H) = p(B)\,p(H \mid B) = 0.3 \cdot 0.9 = 0.27$
  • $p(B,T) = p(B)\,p(T \mid B) = 0.3 \cdot 0.1 = 0.03$

Check:

$$\sum_{x \in \mathcal{X}} p(x) = 0.35 + 0.35 + 0.27 + 0.03 = 1.$$

We care about the event

$$A = \{(F,H), (B,H)\} \quad \text{(the flip is Heads)}.$$

Then

$$p(A) = p(F,H) + p(B,H) = 0.35 + 0.27 = 0.62.$$
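The coin example translates directly into code. This sketch uses exactly the numbers from the setup above:

```python
# The two-coin process: p(coin, outcome) = p(coin) * p(outcome | coin).
p = {
    ("F", "H"): 0.7 * 0.5,  # 0.35
    ("F", "T"): 0.7 * 0.5,  # 0.35
    ("B", "H"): 0.3 * 0.9,  # 0.27
    ("B", "T"): 0.3 * 0.1,  # 0.03
}

# Check that p is a distribution.
assert abs(sum(p.values()) - 1.0) < 1e-12

# Event A: the flip is Heads.
A = {("F", "H"), ("B", "H")}
p_A = sum(p[x] for x in A)
print(round(p_A, 2))  # 0.62
```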

Now let’s express the same calculation using the law of total probability.

Using a partition

Define events in $\mathcal{X}$:

  • $B_1$: “coin is $F$”, i.e. $B_1 = \{(F,H), (F,T)\}$
  • $B_2$: “coin is $B$”, i.e. $B_2 = \{(B,H), (B,T)\}$

Then $\{B_1, B_2\}$ is a partition of $\mathcal{X}$.

We have:

  • $p(B_1) = p(F) = 0.7$
  • $p(B_2) = p(B) = 0.3$
  • $p(A \mid B_1) = p(\text{H} \mid F) = 0.5$
  • $p(A \mid B_2) = p(\text{H} \mid B) = 0.9$

Apply the law:

$$p(A) = p(A \mid B_1)\,p(B_1) + p(A \mid B_2)\,p(B_2) = (0.5)(0.7) + (0.9)(0.3) = 0.35 + 0.27 = 0.62.$$

Same result, but now written as a sum of conditional probabilities over a partition.
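The same 0.62 falls out of the partition form, keeping only $p(B_i)$ and $p(A \mid B_i)$ and never enumerating the domain:

```python
# The law of total probability on the coin example: only the case
# probabilities p(B_i) and the conditionals p(A | B_i) are needed.
p_coin = {"F": 0.7, "B": 0.3}          # p(B_1), p(B_2)
p_heads_given = {"F": 0.5, "B": 0.9}   # p(A | B_1), p(A | B_2)

p_A = sum(p_heads_given[c] * p_coin[c] for c in p_coin)
print(round(p_A, 2))  # 0.62, matching the direct computation
```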


4. Randomized algorithms: how to use this

View the randomness of an algorithm as a distribution $p$ over a finite domain $\mathcal{X}$ (all possible sequences of random choices, random bits, etc.).

Let:

  • B1,,BmXB_1, \dots, B_m \subseteq \mathcal{X} be a partition encoding cases:
    e.g. “pivot is good”, “pivot is bad”, or “hash function is ii”.
  • AXA \subseteq \mathcal{X} be the event that the algorithm succeeds (or runs in time T\le T, etc.).

Then

$$p(A) = \sum_{i=1}^m p(A \mid B_i)\,p(B_i).$$

This decomposes a probability that is hard to compute directly into smaller conditional pieces:

  1. Partition the randomness into cases $B_1,\dots,B_m$: e.g. “Case 1: pivot is good”, “Case 2: pivot is bad”.
  2. For each case $i$, compute or bound $p(A \mid B_i)$.
  3. Compute or bound $p(B_i)$.
  4. Plug into $p(A) = \sum_{i=1}^m p(A \mid B_i)\,p(B_i)$.
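The four-step recipe might look like this in code. The pivot cases and all probabilities below are hypothetical, chosen only to illustrate the bookkeeping:

```python
# Hypothetical case analysis for a randomized algorithm.
# Each case carries p(B_i) and a bound on p(success | B_i).
cases = {
    "good pivot": {"p_case": 0.5, "p_success_given_case": 0.9},
    "bad pivot":  {"p_case": 0.5, "p_success_given_case": 0.4},
}

# The cases must form a partition of the randomness: p(B_i) sums to 1.
assert abs(sum(c["p_case"] for c in cases.values()) - 1.0) < 1e-12

# Law of total probability: p(A) = sum_i p(A | B_i) p(B_i).
p_success = sum(c["p_success_given_case"] * c["p_case"] for c in cases.values())
print(round(p_success, 2))  # 0.45 + 0.20 = 0.65
```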

This is exactly the law of total probability applied to the distribution pp induced by the algorithm’s randomness.


5. Summary

  1. Fix a finite domain $\mathcal{X}$ and a distribution $p$ on it.
  2. Choose a partition $B_1,\dots,B_m$ of $\mathcal{X}$.
  3. For any event $A \subseteq \mathcal{X}$, write $A = \bigcup_{i=1}^m (A \cap B_i)$ with the sets $A \cap B_i$ disjoint.
  4. Use additivity: $p(A) = \sum_{i=1}^m p(A \cap B_i)$.
  5. Use conditional probability: $p(A \cap B_i) = p(A \mid B_i)\,p(B_i)$.
  6. Combine: $p(A) = \sum_{i=1}^m p(A \mid B_i)\,p(B_i)$.