Monte Carlo integration, variance reduction, importance sampling, and random number generation.
Prereqs:Joint Distributions & Random Vectors
Overview
This topic bridges theory and computation. We first establish the four modes of convergence and the key limit theorems (LLN, CLT, Slutsky) that justify Monte Carlo methods. We then cover the main simulation techniques for generating random variables, and finally the Monte Carlo estimator with its variance reduction techniques — all essential tools for quant roles involving numerical methods and pricing.
1. Convergence of Random Variables
1.1 Almost Sure Convergence (p.s.)
Definition
A sequence (Xn) converges almost surely (p.s.) to X if:
P(n→∞limXn=X)=1
i.e. the set of outcomes where Xn(ω)→X(ω) has probability zero.
Theorem— Strong Law of Large Numbers
Let (Xn) be i.i.d. with E[∣X1∣]<∞ and μ=E[X1]. Then:
Xˉn=n1
💡Tip
Almost sure convergence is the strongest mode — it means the sequence converges pathwise, for almost every realization ω. The Strong LLN justifies using the sample mean as an estimator: with probability 1, Xˉn will eventually be arbitrarily close to μ.
1.2 Convergence in Lp
Definition
A sequence (Xn) converges to Xin Lp (p≥1) if:
∥Xn−X∥p=
Properties
Lp convergence ⇒Lq convergence for q≤p
L2 convergence is called convergence in mean square (m.s.)
XnL2 and
Lp convergence ⇒ convergence in probability
Remarque
L2 convergence is the natural mode in statistics and signal processing — it corresponds to minimizing mean squared error. In Monte Carlo, the variance of the estimator going to 0 is exactly L2 convergence.
1.3 Convergence in Probability
Definition
A sequence (Xn) converges in probability to X if for all ε>0:
P(∣Xn−X∣>ε)→0as n→∞
Theorem— Weak Law of Large Numbers
Let (Xn) be i.i.d. with E[∣X1∣]<∞ and μ=E[X1]. Then:
Xˉn=n1
Properties
Almost sure convergence ⇒ convergence in probability
Lp convergence ⇒ convergence in probability
Convergence in probability ⇒ convergence in distribution
If XnPX and g continuous, then g(Xn)Pg(X)
1.4 Convergence in Distribution (en loi)
Definition
A sequence (Xn) converges in distribution to X if for all continuity points x of FX:
FXn(x)→F
Written XndX or .
Theorem— Central Limit Theorem (CLT)
Let (Xn) be i.i.d. with mean μ and variance σ2<∞. Then:
σn
Equivalently: n(Xˉn.
Theorem— Slutsky's Theorem
If XndX and YnPc (a constant), then:
Xn+Yn
Remarque
Convergence in distribution is the weakest mode — it only describes the limiting law, not the pathwise behavior. Note that Xndc (a constant) is equivalent to XnPc.
💡Tip
Slutsky is used constantly in statistics: if you replace an unknown parameter by a consistent estimator (convergence in probability), Slutsky lets you preserve the limiting distribution. Example: replacing σ by σ^ in the CLT still gives N(0,1) in the limit.
1.5 Hierarchy & Key Relationships
Properties— Implication Chain
p.s.⇒P⇒LLp⇒Lq(p≥q)⇒P⇒L
None of the reverse implications hold in general
Exception: Xndc (constant)
Remarque
Summary table of implications:
From \ To
p.s.
Lp
In prob.
In law
p.s.
—
❌
✅
✅
Lp
❌
—
✅
✅
In prob.
❌
❌
—
✅
In law
❌
❌
❌ (unless limit is constant)
—
2. Simulation Methods
2.1 Inverse Transform Method
Theorem— Inverse Transform
Let F be a CDF and U∼U(0,1). Define the generalized inverseF−1(u)=inf{x:F(x)≥u}. Then:
X=F−1(U)∼F
Properties
Works for any distribution with invertible CDF
For discrete distributions: find k such that F(k−1)<U≤F(k)
Computationally efficient when F−1 has a closed form
Example
To simulate X∼Exp(λ): since F(x)=1−e−λx, set U∼U(0,1) and compute X=−λ1log(1−U). Since 1−U∼U(0,1) as well, use X=−λ1logU.
💡Tip
The inverse transform is the foundational simulation method. If asked "how would you simulate distribution X?", first check if F−1 has a closed form — if yes, use this method. If not, use rejection sampling.
2.2 Rejection Sampling
Definition
To simulate X∼f, choose a proposal densityg and constant M≥1 such that f(x)≤Mg(x) for all x. Then:
Sample Y∼g and U∼U(0,1) independently
AcceptY if U≤, else and repeat
P(accept)=M1
The accepted Y has distribution f.
Properties
Number of trials until acceptance ∼Geom(1/M) with mean M
Efficiency decreases as M increases — choose g close to f
Works for any target density f (even unnormalized)
💡Tip
In interviews: rejection sampling is the go-to when F−1 has no closed form. The key design choice is the proposal g — use a distribution that covers f well (same support, similar shape) to keep M small and acceptance rate high.
2.3 Box-Muller Transform
Theorem— Box-Muller
Let U1,U2∼iidU(0,1). Define:
Z1=−2logU
Then Z1 and Z2 are independentN(0,1) random variables.
Remarque
Box-Muller generates pairs of independent standard normals from pairs of uniforms using polar coordinates. It is the standard method when N(0,1) simulation is needed and the inverse CDF Φ−1 is unavailable or slow to compute.
2.4 Simulation of a Gaussian Vector
Theorem— Cholesky Method
To simulate X∼N(μ,Σ):
Compute the Cholesky decompositionΣ=LLT where L is lower triangular
Simulate Z=(Z1,…,Zn)T with
Set:
X=μ+LZ∼N(μ,Σ)
Remarque
The Cholesky decomposition exists and is unique when Σ is positive definite. The construction works because Cov(LZ)=LCov(Z)LT=LILT=Σ.
2.5 Simulation of Discrete Random Variables
Definition
To simulate X with P(X=xk)=pk, k=1,…,n:
Sample U∼U(0,1)
Return X=xk where k is the smallest index such that:
j=1∑kpj≥U
💡Tip
This is the discrete version of the inverse transform. For large n, precompute cumulative probabilities and use binary search to find k in O(logn) rather than O(n).
3. Monte Carlo Method
3.1 Principle & Estimator
Definition
The Monte Carlo estimator of θ=E[h(X)] is:
θ^n=n1i=1∑nh(Xi),Xi∼iidfX
where X1,…,Xn are i.i.d. simulated copies of X.
Remarque
The Monte Carlo principle is a direct application of the Strong LLN: θ^np.s.E[h(X)]=θ. Any expectation — an integral, a price, a probability — can be estimated this way as long as you can simulate X.
3.2 Convergence, Variance & Confidence Interval
Properties
Unbiasedness: E[θ^n]=θ
Variance: Var(θ^n)=nσ2 where σ2=Var(h(X))
Convergence rate: error =O(1/n) — independent of dimension
MSE: E[(θ^n−θ)2]=n
Formula— Confidence Interval
By the CLT, an approximate 95% confidence interval for θ is:
θ^n±1.96nσ^
where σ^2=n−11.
💡Tip
The O(1/n) convergence rate is dimension-free — Monte Carlo does not suffer from the curse of dimensionality. This is why it dominates numerical integration in high dimensions (e.g. pricing basket options with 100 underlyings).
4. Variance Reduction Techniques
4.1 Antithetic Variables
Definition
Instead of n independent samples, use n/2 pairs (Xi,X~i) where X~i is negatively correlated with Xi. The estimator becomes:
θ^nAV=
For U∼U(0,1), use U~=1−U to generate .
Properties
Var(θ^nAV)=nVar(h(X))+nCov(h(X),h(X~))
Effective when Cov(h(X),h(X~))<0, i.e. h is monotone
Same computational cost as standard MC with n samples
💡Tip
Interview framing: "Use U and 1−U to generate two paths — if h is monotone, one path overshoots while the other undershoots, and they cancel out. This reduces variance at no extra cost."
4.2 Control Variates
Definition
Let Z be a random variable with known meanE[Z]=μZ, correlated with h(X). The linear control variate estimator with coefficient c is:
θ^nCV=
The optimal coefficient minimizing variance is:
c∗=Var(Z)Cov(h(X),Z)
giving variance reduction:
Var(θ^nCV)=n
Properties
Unbiased: E[θ^nCV]=θ for any c
Maximum reduction when ∣ρ(h(X),Z)∣ is close to 1
c∗ is estimated in practice from a pilot run
💡Tip
Interview framing: "Find a simpler quantity Z that you can compute analytically and that moves with h(X). Subtract its fluctuations from the estimator — you keep the mean but remove correlated noise. Works best when Z is a simplified version of h(X), e.g. the payoff of a European call used as control for an exotic option."
4.3 Importance Sampling
Definition
To estimate θ=Ef[h(X)]=∫h(x)f(x)dx, sample from an alternative density g (the importance distribution) and reweight:
θ^nIS
where w(x) is the likelihood ratio (importance weight).
Properties
Unbiased: Eg[h(X)w(X)]=∫h(x)g(x)f(x)g(x)dx=θ
Optimalg∗(x)∝∣h(x)∣f(x) gives zero variance
Variance can increase if g is poorly chosen (tails too light)
💡Tip
Interview framing: "Shift the sampling distribution toward the region that matters most — rare events or high-payoff regions. Instead of simulating many paths that contribute nothing, concentrate samples where h(X) is large, then correct for the bias with the likelihood ratio f/g. Essential for rare-event simulation (e.g. default probabilities, deep out-of-the-money options)."
4.4 Stratified Sampling
Definition
Partition the sample space into KstrataA1,…,AK with P(Ak)=pk. Allocate nk=n⋅pk samples to stratum k and estimate within each stratum:
θ^n
Properties
Always reduces variance compared to standard MC: Var(θ^SS)≤Var(θ^)
Variance reduction is largest when stratum means differ greatly
Optimal allocation: allocate more samples to strata with higher variance (nk∝pkσk)
💡Tip
Interview framing: "Instead of randomly covering the space, ensure each region is represented proportionally. This eliminates the sampling variability due to uneven coverage — the estimator benefits from guaranteed representation of all strata rather than hoping random samples cover everything evenly."
5. Quick Reference
Convergence Modes
Mode
Definition
Notation
Almost sure
P(limXn=X)=1
Xnp.s.X
In Lp
E[∥Xn−X∥p]
In probability
P(∥Xn−X∥>ε)→0
X
In distribution
FXn(x)→FX(x)
Implication Chain
p.s.⇒P⇒LandLp⇒Lq(p≥q)⇒P⇒L
Simulation Methods
Method
When to use
Key formula
Inverse transform
F−1 has closed form
X=F−1(U)
Rejection sampling
F−1 unavailable
Accept with prob f(Y)/(Mg(Y))
Box-Muller
Simulate N(0,1) pairs
Z1=−2logU
Cholesky
Gaussian vectors
X=μ+LZ, Σ=LLT
Variance Reduction Summary
Method
Principle
Best when
Antithetic variables
Use U and 1−U
h is monotone
Control variates
Subtract correlated known quantity
∥ρ(h(X),Z)∥ close to 1
Importance sampling
Resample from better distribution
Rare events, tail probabilities
Stratified sampling
Enforce proportional coverage
Strata means differ significantly
Key Formulas
Concept
Formula
MC estimator
θ^n=n1∑h(Xi)
MC variance
Var(θ^n)=σ2/n
95% CI
θ^n±1.96σ^/n
Optimal control variate coeff.
c∗=Cov(h(X),Z)/Var(Z)
Importance weight
w(x)=f(x)/g(x)
i=1∑n
Xi
p.s.
μ
(E[∣Xn−X∣p])1/p
→
0
as
n
→
∞
X
⇒
E[Xn]→
E[X]
Var(Xn)→Var(X)
i=1∑n
Xi
P
μ
X
(
x
)
as
n
→
∞
Xn
L
X
(
Xˉn
−
μ
)
d
N
(
0
,
1
)
−
μ)d
N(0,σ2)
d
X
+
c
,
Xn
⋅
Yn
d
c
⋅
X
⇒
XnP
c
Dominated convergence: if Xnp.s.X and ∣Xn∣≤Y with E[Y]<∞, then XnL1X