Markov chain Monte Carlo, usually shortened to MCMC, is one of the central ideas in modern statistical computing. It is the reason Bayesian inference scales beyond toy models. It is also one of the most misunderstood methods in statistics, because people often learn the recipes before they learn the problem the recipes are solving.
A compact definition is:
MCMC is a way to approximate expectations under a distribution that you can evaluate up to a constant but cannot sample from directly.
Everything else in the article is an expansion of that definition.
The problem MCMC actually solves
Suppose your posterior density is
\[ \pi(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}, \qquad p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta. \]
In Bayesian work, the quantity you usually want is not the posterior density itself but an expectation with respect to that density. For some function \(g\), you want
\[ \mathbb{E}[g(\theta) \mid y] = \int g(\theta)\, \pi(\theta \mid y)\, d\theta. \]
That one formula hides nearly every practical Bayesian task:
- posterior means and medians
- posterior variances and correlations
- credible intervals
- posterior predictive probabilities
- expected losses and decision rules
- model comparison quantities built from predictive densities
In low dimensions, sometimes you can compute these integrals analytically. In very special conjugate models, the posterior belongs to a closed-form family and the algebra is friendly. But the moment you introduce latent variables, hierarchical structure, nonlinear likelihoods, missing data, non-Gaussian priors, or high-dimensional parameter vectors, the integral usually becomes intractable.
The key point is subtle but important: the posterior may be easy to write down and hard to use. You can often evaluate the unnormalized log posterior
\[ \log \tilde{\pi}(\theta) = \log p(y \mid \theta) + \log p(\theta), \]
but that does not mean you can integrate it or sample from it directly.
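As a concrete illustration, here is a minimal sketch of such an unnormalized log posterior for a toy normal-mean model with a normal prior (the model, function name, and numbers are all illustrative, not from the article):

```python
import numpy as np

def log_post_unnorm(theta, y, sigma=1.0, prior_sd=10.0):
    """Unnormalized log posterior: log p(y | theta) + log p(theta).

    Normal likelihood with known sigma, Normal(0, prior_sd) prior.
    Additive constants that do not depend on theta are dropped:
    MCMC only needs the density up to proportionality.
    """
    log_lik = -0.5 * np.sum((y - theta) ** 2) / sigma**2
    log_prior = -0.5 * theta**2 / prior_sd**2
    return log_lik + log_prior

y = np.array([1.8, 2.1, 2.4])
print(log_post_unnorm(2.0, y))  # a single cheap pointwise evaluation
```

Evaluating this at any point is trivial; integrating it over all of \(\theta\), or drawing samples distributed according to it, is the hard part.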
This is why MCMC matters. It turns an impossible integral into a simulation problem.
Monte Carlo before the Markov chain
Before understanding MCMC, it helps to understand ordinary Monte Carlo.
If you could draw i.i.d. samples
\[ \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)} \sim \pi(\theta \mid y), \]
then the posterior expectation could be approximated by
\[ \hat{g}_S = \frac{1}{S} \sum_{s=1}^{S} g\big(\theta^{(s)}\big). \]
By the law of large numbers,
\[ \hat{g}_S \to \mathbb{E}[g(\theta) \mid y] \quad \text{as } S \to \infty. \]
Conceptually, that part is simple. In the settings where Bayesian inference gets interesting, drawing independent samples from the posterior is the hard part.
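Ordinary Monte Carlo is a one-liner when direct sampling is available. A sketch on a target where the truth is known (a standard normal, with \(g(\theta) = \theta^2\), so the true expectation is 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Direct i.i.d. sampling IS available here, so the plain Monte Carlo
# estimator of E[g(theta)] is just the sample mean of g over the draws.
draws = rng.normal(size=100_000)
estimate = (draws**2).mean()   # g(theta) = theta^2, true value is 1
print(estimate)                # close to 1 by the law of large numbers
```

The entire difficulty of MCMC lies in the first line of the computation: for realistic posteriors, there is no `rng.posterior(...)` to call.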
MCMC starts from a practical question:
If direct i.i.d. sampling is impossible, can we at least build a stochastic mechanism that spends the right proportion of time in the right places?
That is where the Markov chain enters.
What the Markov chain contributes
A Markov chain is a sequence
\[ \theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots \]
with the property that the next state depends on the present state, not the full past:
\[ p\big(\theta^{(t+1)} \mid \theta^{(t)}, \theta^{(t-1)}, \ldots, \theta^{(0)}\big) = p\big(\theta^{(t+1)} \mid \theta^{(t)}\big). \]
That property is not a limitation; it is the design space. We choose a transition rule so that the chain has the posterior as its long-run distribution.
If the chain is ergodic and has stationary distribution \(\pi(\theta \mid y)\), then averages along the chain satisfy
\[ \frac{1}{S} \sum_{t=1}^{S} g\big(\theta^{(t)}\big) \to \mathbb{E}[g(\theta) \mid y] \quad \text{as } S \to \infty. \]
This is the core MCMC move:
- Build a Markov transition rule.
- Ensure the posterior is stationary for that rule.
- Run the chain long enough that empirical averages approximate posterior expectations.
The stationary condition means that if the chain were already distributed as \(\pi\), one more transition with kernel \(K\) would leave that distribution unchanged:
\[ \pi(\theta') = \int K(\theta' \mid \theta)\, \pi(\theta)\, d\theta. \]
One common sufficient condition is detailed balance:
\[ \pi(\theta)\, K(\theta' \mid \theta) = \pi(\theta')\, K(\theta \mid \theta'). \]
Intuitively, detailed balance says that, in equilibrium, probability flow from \(\theta\) to \(\theta'\) is exactly matched by probability flow back from \(\theta'\) to \(\theta\). It is not the only route to stationarity, but it is the easiest to verify for many classical samplers.
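On a tiny discrete state space, both conditions can be checked numerically. A minimal sketch (a three-state target with a uniform proposal and Metropolis acceptance; all weights are illustrative):

```python
import numpy as np

# Target distribution on 3 states (unnormalized weights are fine).
w = np.array([1.0, 3.0, 6.0])
pi = w / w.sum()
n = len(pi)

# Metropolis kernel: propose uniformly over states, accept with
# probability min(1, pi[j] / pi[i]).
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            K[i, j] = (1.0 / n) * min(1.0, pi[j] / pi[i])
    K[i, i] = 1.0 - K[i].sum()  # rejected (and self-) proposals stay put

# Detailed balance: pi_i * K[i, j] == pi_j * K[j, i] for every pair.
flow = pi[:, None] * K
assert np.allclose(flow, flow.T)

# Stationarity follows: one transition leaves pi unchanged.
assert np.allclose(pi @ K, pi)
```

The same check generalizes: detailed balance is a statement about pairwise flows, and stationarity about total flows, so the first implies the second.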
Metropolis-Hastings from first principles
The canonical MCMC algorithm is Metropolis-Hastings. It begins with a proposal distribution
\[ q(\theta' \mid \theta), \]
which suggests a candidate next state \(\theta'\) when the current state is \(\theta\).
If we always accepted the proposal, the chain would just follow \(q\) and generally would not target the posterior. So we correct the proposal using an acceptance probability:
\[ \alpha(\theta, \theta') = \min\left(1,\; \frac{\pi(\theta')\, q(\theta \mid \theta')}{\pi(\theta)\, q(\theta' \mid \theta)}\right). \]
The algorithm is:
- Start at some \(\theta^{(0)}\).
- At iteration \(t\), propose \(\theta' \sim q(\cdot \mid \theta^{(t)})\).
- Compute \(\alpha = \alpha(\theta^{(t)}, \theta')\).
- Draw \(u \sim \mathrm{Uniform}(0, 1)\).
- If \(u < \alpha\), accept and set \(\theta^{(t+1)} = \theta'\). Otherwise reject and set \(\theta^{(t+1)} = \theta^{(t)}\).
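The steps above fit in a short function. A sketch of random-walk Metropolis in one dimension, working in log space for numerical stability (the target here, a standard normal, is only a stand-in):

```python
import numpy as np

def metropolis(log_post, theta0, n_iter, step=2.0, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal.

    Accepts with probability min(1, pi(prop)/pi(current)), computed
    in log space as exp(log_post(prop) - log_post(current)).
    """
    rng = np.random.default_rng(seed)
    theta = theta0
    lp = log_post(theta)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + rng.normal(scale=step)     # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept / reject
            theta, lp = prop, lp_prop
        chain[t] = theta                          # rejections repeat theta
    return chain

# Target: standard normal, log density up to an additive constant.
chain = metropolis(lambda th: -0.5 * th**2, theta0=5.0, n_iter=20_000)
kept = chain[2_000:]                              # drop a burn-in segment
print(kept.mean(), kept.std())                    # near 0 and 1
```

Note that only the *difference* of log posteriors enters the acceptance test, which is exactly where the normalizing constant cancels, as the next section spells out.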
For the common symmetric random-walk proposal
\[ \theta' = \theta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I), \]
the proposal terms cancel and the acceptance rule becomes
\[ \alpha(\theta, \theta') = \min\left(1,\; \frac{\pi(\theta')}{\pi(\theta)}\right). \]
This is why Metropolis-Hastings is so useful in Bayesian computation: the unknown normalizing constant in the posterior cancels. If
\[ \pi(\theta \mid y) = \frac{\tilde{\pi}(\theta)}{Z}, \qquad \tilde{\pi}(\theta) = p(y \mid \theta)\, p(\theta), \]
then
\[ \frac{\pi(\theta' \mid y)}{\pi(\theta \mid y)} = \frac{\tilde{\pi}(\theta')}{\tilde{\pi}(\theta)}, \]
and the unknown \(Z\) disappears from the ratio.
You only need the posterior up to proportionality.
Watch the chain move
A good way to understand MCMC is to watch what the chain does on a few difficult targets.
Before the lab, start with the animated walkthrough. It puts occupancy, proposal decisions, diagnostics, and geometry in one sequence.
Animated Walkthrough
How a Markov chain turns motion into inference.
Interactive MCMC Lab
Watch a Metropolis sampler discover where the posterior actually lives.
The canvas shows the target density. The white trail is the chain. Green arrows are accepted proposals; red arrows are rejected ones. Change the landscape and the proposal scale, then compare the trace and empirical marginal below. The same run should make sense in all three views.
Geometry view: the multimodal posterior target.
Trace plot: one coordinate over time.
Flat plateaus mean rejections. Slow drift means strong autocorrelation. Sudden level shifts often mean the chain finally escaped one region for another.
Empirical vs. target: the marginal distribution of x.
Blue bars are the chain's retained draws after burn-in. The pale line is the true target marginal. A biased chain misses shape, mass, or both.
The interface gives you a few linked views of the same computation.
- The density canvas shows where posterior mass lives.
- The white trajectory shows the path of the chain through parameter space.
- The trace plot shows serial dependence in one coordinate.
- The empirical marginal plot shows whether the chain is reproducing the correct distribution after burn-in.
- The decision panel shows a single Metropolis step numerically: current log density, proposal log density, acceptance probability, and the uniform draw that decides the move.
This is why MCMC takes time to master. The algorithm is easy to state; the behavior is governed by geometry.
Burn-in, warm-up, and why early draws are different
The chain needs a starting point. Unless that starting point already looks like a typical draw from the posterior, early iterations are not representative. The chain is still moving from arbitrary initialization into the region where the posterior actually has mass.
This is the intuition behind burn-in or warm-up: discard an initial segment of the chain because it reflects transient behavior rather than equilibrium sampling.
But burn-in is frequently oversimplified. Three distinctions matter:
1. Burn-in is about transient bias
If you start far from equilibrium, early draws are biased by initialization. Discarding them can reduce that bias.
2. Warm-up is often doing more than discarding
In modern samplers, especially HMC and NUTS implementations, warm-up is usually the phase where the algorithm tunes step sizes, mass matrices, or other adaptation parameters. Those iterations are not just "bad early samples"; they are part of algorithm configuration.
3. Burn-in does not fix bad mixing
If the chain remains trapped in one mode, or moves only microscopically along a narrow ridge, discarding the first 200 or 2,000 draws will not solve the problem. Burn-in addresses starting-point bias. It does not cure a poorly exploring transition kernel.
That is why the burn-in slider in the lab is revealing. You can often improve the histogram by excluding the early transient phase, but only if the chain eventually explores the right distribution. If it never does, burn-in becomes cosmetic.
Dependence is the price you pay
Ordinary Monte Carlo uses independent draws. MCMC draws are dependent by construction. Consecutive states of the chain are usually correlated, sometimes strongly.
That dependence means a sample size of 4,000 draws does not generally behave like 4,000 i.i.d. observations. The meaningful quantity is the effective sample size (ESS), which adjusts for serial correlation.
One way to express this is through the integrated autocorrelation time:
\[ \tau = 1 + 2 \sum_{k=1}^{\infty} \rho_k, \]
where \(\rho_k\) is the lag-\(k\) autocorrelation of the sampled quantity.
Then an approximate effective sample size is
\[ \mathrm{ESS} \approx \frac{S}{\tau}. \]
So if your chain has strong persistence and \(\tau\) is large, you may need many iterations to get a modest amount of actual information.
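The textbook formula above can be computed directly. A crude sketch (production libraries such as ArviZ use more careful truncation and windowing rules than the simple "stop at the first non-positive lag" heuristic here):

```python
import numpy as np

def ess(x, max_lag=200):
    """Crude effective sample size: S / (1 + 2 * sum of autocorrelations),
    truncating the sum at the first non-positive empirical lag."""
    x = np.asarray(x) - np.mean(x)
    var = np.mean(x**2)
    tau = 1.0
    for k in range(1, max_lag):
        rho = np.mean(x[:-k] * x[k:]) / var  # lag-k autocorrelation
        if rho <= 0:
            break
        tau += 2.0 * rho
    return len(x) / tau

rng = np.random.default_rng(1)
# An AR(1) chain with strong persistence, mimicking a sticky sampler:
phi, S = 0.95, 10_000
z = np.empty(S)
z[0] = 0.0
for t in range(1, S):
    z[t] = phi * z[t - 1] + rng.normal()
print(ess(z))   # a small fraction of the nominal 10,000 draws
```

For an AR(1) process the theoretical value is \(\tau = (1+\phi)/(1-\phi) = 39\), so 10,000 highly persistent draws carry roughly the information of a few hundred independent ones.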
This explains a common beginner confusion:
"My acceptance rate is 90%, so my chain must be good."
Not necessarily. High acceptance can simply mean your proposals are so small that the chain barely moves. The sampler accepts nearly everything because every proposal is almost identical to the current state. The result is a beautifully smooth trace and a terrible effective sample size.
In the lab, the Sticky regime makes that failure visible. The chain moves often, but it learns slowly.
Geometry is the real story
The dominant factor in MCMC performance is usually not abstract probability theory. It is the geometry of the target distribution.
Multimodality
When the posterior has separated modes, a local proposal mechanism can spend a long time inside one hill and almost never jump across the valley to another. The trace looks stable, but the chain is not learning the full posterior. This is a mixing failure, not a Monte Carlo variance issue.
Strong correlation and curvature
When the posterior mass lies on a thin tilted ridge or curved valley, isotropic random-walk proposals are poorly matched to the geometry. Large steps jump off the ridge and get rejected. Small steps creep along the ridge with heavy autocorrelation.
Funnels and varying scales
In hierarchical models, it is common to have regions where one parameter controls the scale of another. Then the posterior can be extremely narrow in one area and diffuse in another. A single global step size is fundamentally mismatched: too large for the narrow neck, too small for the broad chamber.
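The funnel pathology takes only two lines to write down. A sketch of a 2-D variant of Neal's funnel, where one parameter sets the scale of the other (the exact variances here are illustrative):

```python
import numpy as np

def log_funnel(v, x):
    """2-D Neal-style funnel: v ~ N(0, 3^2), x | v ~ N(0, exp(v)).

    The conditional sd of x is exp(v / 2): tiny when v is very
    negative (the narrow neck), huge when v is large (the broad
    chamber). No single random-walk step size suits both regions.
    """
    return -0.5 * v**2 / 9.0 - 0.5 * x**2 * np.exp(-v) - 0.5 * v

# The conditional sd of x differs by orders of magnitude across v:
print(np.exp(-4 / 2), np.exp(4 / 2))   # sd of x at v = -4 vs v = +4
```

A step size tuned for the chamber overshoots the neck and gets rejected; one tuned for the neck takes forever to traverse the chamber.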
All three pathologies appear in the interactive lab. They are not edge cases. They are the computational reality of many serious Bayesian models.
Why Metropolis-Hastings is both foundational and limited
Random-walk Metropolis-Hastings is pedagogically perfect because it isolates the essential ingredients:
- a target density
- a proposal mechanism
- an acceptance correction
- empirical averages along a chain
But it is often not the best algorithm for modern applied work.
Its limitations are structural:
- local proposals ignore gradients
- one scale parameter rarely matches all directions
- serial dependence can be severe
- multimodal targets remain difficult
- tuning is manual and model-specific
That is why advanced MCMC methods should be understood as repairs to these weaknesses.
Gibbs sampling: exact conditionals instead of accept-reject moves
If you can sample directly from conditional distributions, Gibbs sampling updates one block at a time:
\[ \theta_1^{(t+1)} \sim p\big(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_d^{(t)}, y\big), \qquad \theta_2^{(t+1)} \sim p\big(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_d^{(t)}, y\big), \]
and so on.
The elegant feature of Gibbs is that every update is accepted automatically, because each conditional draw is already exactly consistent with the target distribution.
But Gibbs is not a universal solution. In strongly correlated models, componentwise updates can still mix very slowly, because the chain is forced to move one coordinate or block at a time through a geometry that is intrinsically joint.
So Gibbs often improves implementation simplicity without fully fixing the deeper geometry problem.
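Both properties show up in the classic bivariate-normal example, where each full conditional is exactly normal. A sketch (the correlation value is illustrative):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each full conditional is exactly N(rho * other, 1 - rho^2), so every
    update is an exact conditional draw with no accept/reject step. But
    as |rho| grows, the coordinatewise moves shrink and mixing slows.
    """
    rng = np.random.default_rng(seed)
    x = y = 0.0
    cond_sd = np.sqrt(1.0 - rho**2)
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        x = rng.normal(rho * y, cond_sd)   # draw x | y exactly
        y = rng.normal(rho * x, cond_sd)   # draw y | x exactly
        out[t] = x, y
    return out

draws = gibbs_bivariate_normal(rho=0.9, n_iter=50_000)
print(np.corrcoef(draws.T)[0, 1])   # close to the target correlation 0.9
```

The sampler recovers the target correctly, but as rho approaches 1 the axis-aligned staircase moves become tiny relative to the long diagonal ridge, which is exactly the geometry problem described above.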
Hamiltonian Monte Carlo: use gradients to stop wandering blindly
Hamiltonian Monte Carlo (HMC) attacks the central inefficiency of random walks. Instead of proposing tiny local perturbations, HMC augments the parameter vector \(\theta\) with momentum variables \(r\) and simulates a physical system with Hamiltonian
\[ H(\theta, r) = U(\theta) + \tfrac{1}{2}\, r^\top M^{-1} r, \]
where the potential energy is
\[ U(\theta) = -\log \pi(\theta \mid y). \]
The resulting dynamics satisfy
\[ \frac{d\theta}{dt} = M^{-1} r, \qquad \frac{dr}{dt} = -\nabla U(\theta). \]
Why does this matter? Because gradients tell the sampler which way the density is changing. Instead of meandering via random local nudges, the algorithm can travel long distances through high-density regions while keeping acceptance rates high.
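In practice the dynamics are discretized with the leapfrog integrator and corrected by a Metropolis accept/reject on the total energy. A compact sketch with an identity mass matrix, targeting a standard normal purely for illustration:

```python
import numpy as np

def leapfrog(theta, r, grad_logp, step, n_steps):
    """Leapfrog-integrate Hamilton's equations for U = -log pi."""
    theta = np.array(theta, float)
    r = r + 0.5 * step * grad_logp(theta)      # half-step momentum
    for i in range(n_steps):
        theta = theta + step * r               # full-step position
        if i < n_steps - 1:
            r = r + step * grad_logp(theta)    # full-step momentum
    r = r + 0.5 * step * grad_logp(theta)      # final half-step momentum
    return theta, r

def hmc(logp, grad_logp, theta0, n_iter, step=0.2, n_steps=20, seed=0):
    """Basic HMC with identity mass matrix (a sketch, not tuned code)."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, float)
    chain = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        r0 = rng.normal(size=theta.size)       # fresh momentum draw
        prop, r = leapfrog(theta, r0, grad_logp, step, n_steps)
        h0 = -logp(theta) + 0.5 * r0 @ r0      # energy before
        h1 = -logp(prop) + 0.5 * r @ r         # energy after
        if np.log(rng.uniform()) < h0 - h1:    # correct integration error
            theta = prop
        chain[t] = theta
    return chain

# 2-D standard normal target: logp = -||theta||^2 / 2, grad = -theta.
chain = hmc(lambda th: -0.5 * th @ th, lambda th: -th,
            theta0=np.array([3.0, -3.0]), n_iter=5_000)
print(chain[500:].mean(axis=0))   # near [0, 0]
```

Each proposal here travels a distance of `step * n_steps` through parameter space while the accept/reject step only has to absorb the small discretization error, which is why acceptance stays high even for long moves.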
This fixes, or at least dramatically mitigates, several pathologies visible in random-walk Metropolis:
- along a ridge, HMC can move with the geometry rather than repeatedly stepping off it
- in moderate to high dimensions, one trajectory can cover much more distance than many random-walk proposals
- effective sample size per gradient evaluation is often far larger
That is one reason contemporary Bayesian software so often defaults to HMC-based methods when gradients are available.
NUTS: automate the hardest tuning decision
HMC introduces a new tuning question: how long should each trajectory be? If it is too short, the chain behaves like an expensive random walk. If it is too long, computation is wasted and numerical errors accumulate.
The No-U-Turn Sampler (NUTS) solves that by adaptively extending the trajectory until it begins to turn back on itself. In practice, NUTS makes HMC much easier to use because it removes one of the most frustrating manual tuning decisions while preserving the geometry-aware strengths of HMC.
That is why, if you use Stan, PyMC, NumPyro, or comparable modern Bayesian systems, you typically encounter NUTS as the default workhorse.
Conceptually, though, NUTS is still part of the same MCMC family. It is a Markov transition rule designed so that the posterior is stationary. The sophistication is in how cleverly it moves through the space.
Sampler Comparison
Three samplers on the same target.
This view compares random-walk Metropolis, fixed-length HMC, and a NUTS-style dynamic path length on the same geometry. The point is to see how different transition rules move through the same posterior.
Diagnostics that actually matter
There is no single magic statistic that certifies MCMC correctness. Diagnostics are a collection of partial checks, each ruling out different failure modes.
Trace plots
Trace plots tell you whether the chain is drifting, sticking, oscillating, or switching modes. They are not formal proof, but they are often the fastest way to spot pathologies.
Multiple chains
Running several chains from dispersed initial values is one of the best practical checks. If they settle into materially different regions, that is evidence of poor mixing or multimodality.
\(\hat{R}\) and between-chain consistency
The potential scale reduction statistic, often written \(\hat{R}\), compares within-chain and between-chain variation. Values near 1 are desirable. Values materially above 1 suggest non-convergence or insufficient mixing.
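The classic Gelman-Rubin form of the statistic is short enough to sketch (modern software such as Stan and ArviZ uses a stricter rank-normalized split-\(\hat{R}\); this is only the textbook version):

```python
import numpy as np

def rhat(chains):
    """Potential scale reduction for an array shaped (n_chains, n_draws)."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_plus = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(2)
mixed = rng.normal(size=(4, 1000))               # 4 chains, same target
stuck = mixed + np.array([[0.0], [0.0], [5.0], [5.0]])  # 2 chains offset
print(rhat(mixed), rhat(stuck))                  # near 1 vs clearly above 1
```

The second case mimics chains trapped in separate modes: each chain looks stable on its own, and only the between-chain comparison exposes the problem.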
Effective sample size
ESS translates dependence into a more honest count of information. A huge nominal sample can still have a weak ESS.
Sampler-specific diagnostics
For HMC and NUTS, divergences, energy diagnostics, tree depth saturations, and Bayesian fraction of missing information style checks matter. They often reveal geometry problems even when marginal trace plots look superficially acceptable.
Model-based diagnostics
Posterior predictive checks, sensitivity to priors, and substantive validation still matter. A chain can converge perfectly to the wrong model.
The right mindset is not "Did my diagnostic pass?" but "Which failure modes have I ruled out, and which remain plausible?"
A practical workflow for serious MCMC use
If you are using MCMC on a real model rather than a classroom example, a disciplined workflow matters more than any single algorithmic trick.
- Write the model in terms of an explicit log posterior or log joint density.
- Understand the geometry you expect before sampling: constraints, funnels, correlations, multimodality, weak identification.
- Reparameterize aggressively when the geometry is poor. Centered versus non-centered parameterizations are a canonical example.
- Run multiple chains from different starting points.
- Inspect trace plots, ESS, \(\hat{R}\), and sampler-specific warnings together rather than in isolation.
- Use posterior predictive checks and prior sensitivity analysis so that computational convergence is not mistaken for scientific adequacy.
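The centered-versus-non-centered example from the list above can be sketched in a few lines (with the scale parameter held fixed purely for illustration; the funnel arises when it is itself sampled):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, tau = 0.0, 0.05   # small tau: the regime where funnels appear

# Centered parameterization: sample theta ~ Normal(mu, tau) directly.
# When tau is a parameter near zero, the joint posterior over
# (theta, tau) develops a funnel that local samplers handle badly.
theta_centered = rng.normal(mu, tau, size=100_000)

# Non-centered parameterization: sample a unit-scale "raw" variable
# and rescale. The sampler then explores eta on a fixed scale
# regardless of tau, so the geometry no longer collapses as tau -> 0.
eta = rng.normal(size=100_000)
theta_noncentered = mu + tau * eta

# Same distribution, different geometry for the sampler:
print(theta_centered.std(), theta_noncentered.std())   # both near tau
```

The two parameterizations define the same model; they differ only in which variables the sampler moves through, which is precisely why reparameterization is a geometry fix rather than a model change.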
This sequence keeps computational checks and model checks tied together.
What MCMC is, and what it is not
MCMC is not optimization. The chain does not search for the single best point; it estimates an entire distribution.
MCMC is not exact in finite time. It is an asymptotic method whose practical success depends on mixing, tuning, and diagnostics.
MCMC is not primarily about random numbers. It is about building a transition rule whose long-run occupancy pattern matches the target distribution.
MCMC is also not uniquely Bayesian. The same machinery appears anywhere a difficult distribution must be explored by simulation. Bayesian inference simply gave it one of its most powerful and consequential homes.
The right mental model
If you remember only one thing, remember this:
MCMC is posterior integration by controlled stochastic movement.
The posterior defines a landscape. The transition kernel defines how you move across that landscape. Diagnostics tell you whether the movement is actually revealing the landscape or merely scribbling on one corner of it.
Once that mental model is clear, the major algorithms line up naturally:
- Metropolis-Hastings says: propose, score, accept or reject.
- Gibbs says: move by exact conditional draws.
- HMC says: use gradients and momentum to travel far without random-walk diffusion.
- NUTS says: choose trajectory length adaptively so the travel is efficient in practice.
The interactive Metropolis lab above is deliberately simple, but it makes the main point visible: sampling quality is about geometry, not just formulas.
Conclusion
MCMC replaces difficult integrals with a stochastic process whose stationary distribution is the target posterior. A chain is not useful just because it is valid on paper. It has to explore the posterior you care about on a timescale you can afford, and you have to verify that it did.
If you can read the density canvas, the trace plot, the empirical marginal, and the acceptance rule as different views of the same computation, the method becomes much easier to reason about. At that point MCMC stops feeling opaque and starts looking like what it is: a practical way to turn probability geometry into numerical answers.
