Monte Carlo Methods

Huang, Xuanqiang Angelo

Monte Carlo Methods

January 15, 2025 · Reading Time: 11 minutes · By Xuanqiang Angelo Huang

DI Law of Large Numbers e Central limit theorem ne parliamo in Central Limit Theorem and Law of Large Numbers. Usually these methods are useful when you need to calculate following something similar to Bayes rule, but don't know how to calculate the denominator, often infeasible integral. We estimate this value without explicitly calculating that.

Interested in $P (x) = \frac{1}{z} P^{*} (x) = \frac{1}{Z} e^{- E (x)}$ Can evaluate E(x) at any x.

Problem 1 Make samples x(r) ~ 2 P
Problem 2 Estimate expectations $Φ = \sum_{x} ϕ (x) P (x)$ ) What we're not trying to do:
We're not trying to find the most probable state.
We're not trying to visit all typical states.

Law of large numbers

Data una sequenza di variabili aleatorie, $X_{1}, X_{2}, \dots, X_{n} \dots$ , tali che siano i.i.d tali per cui $E (X_{1}) = E (X_{2}) = \dots = E (X_{n}) = \dots = μ$ tale che sia finito. Consideriamo

S_{n} = i = 1 \sum n x_{i}, :, \overset{x}{ˉ}_{n} = \frac{S _{n}}{n}

Allora questo teorema afferma che:

\overset{x}{ˉ}_{n} \to μ

Ossia il limite converge sul valore atteso di tutte le variabili aleatorie.

Inoltre se abbiamo che la varianza di tutte le variabili aleatorie, abbiamo che

v a r (\overset{x}{^}_{n}) = \frac{σ ^{2}}{n} \to 0

Strong Law of large Numbers

P (n \to \infty lim \overset{x}{ˉ}_{n} = μ) = 1

Ci permette di prendere la sample average per quei valori di mezzo. E posso fare la stima della nostra funzione di interesse $h$ tenendo conto:

\overset{ˉ}{h}_{n} = i = 1 \sum n \frac{h ( x _{i} )}{n}

E questo converge $O (n)$ per la legge dopo.

Central limit theorem

Data una sequenza di variabili aleatorie, $X_{1}, X_{2}, \dots, X_{n} \dots$ , tali che siano i.i.d tali per cui $E (X_{1}) = E (X_{2}) = \dots = E (X_{n}) = \dots = μ$ tale che sia finito, e con varianza tutti uguali $= σ^{2}$ finito.

Abbiamo come risultato che

n (\overset{x}{^}_{n} - μ) \to N (0, σ^{2})

La prima parte è una variabile aleatoria e converge a quel valore. (questo permette di utilizzare la gaussiana quando $n$ è grande abbastanza).

Monte carlo integration

Abbiamo che:

\int_{X} h (x) \cdot f (x) d x = E_{f} [h (x)] = μ

Questo è tutto il significato dell'integrazione di monte carlo (molte variabili aleatorie per stimare il valore di qualcosa). E questo vale sempre, anche se $f$ non è una funzione di densità, basta che sia positiva (basta riscalare).

Il motivo per cui funziona è per LLN, perché abbiamo che convergerà su $μ$ in lungo termine, basta considerare molte variabili aleatorie consecutive.

Cose interessanti:

Posso stimare il valore atteso grazie a LLN
Posso stimare la varianza grazie ad essa
Posso stimare variabili condizionali.

Importance sampling

Deve soddisfare la cosa del supporto (contiene supporto sia di h che di f)
Deve avere varianza finita per i pesi che sono trovati. (ci sono metodi per stimare poi il $g$ ).
1. la frazione $\frac{f}{g}$ deve essere limitata e la varianza di $h$ rispetto alla densità $f$ deve essere finita.

Bound on samples

TODO: this should be interesting and important and should use Hoeffding's inequality.

Markov Chain Monte Carlo

L'idea principale di Markov Chain Monte carlo è di costruire una catena di Markov tale che si possa utilizzare per generare dalla distribuzione difficilmente trattabile. Uno dei metodi principali per fare ciò è utilizzare tecniche si sampling.

Metropolis Hastings

The Sampling Algorithm

Monte Carlo Methods-20241202181900146 — From Bishop Deep Learning Book

We observe that this sampling algorithm takes idea from Markov Chains balanced equation's property of stationary distributions and the Accept Reject algorithm for sampling some distributions. Also in this case, we don't care about the normalization constant for $p$ , the distribution that we would like to sample from. We choose the that acceptance probability as it allows us to satisfy the balance equation and thus, the stationary property of the constructed Chain.

With Metropolis-Hasting we need to

Be able to calculate the density for the target distribution
Be able to sample easily from the $q$ distribution. Then it has some links with Random Walks inside the space we want to sample.

With Metropolis Hastings we accept with probability

α = min (1, \frac{p ( x ^{'} ) q ( x ∣ x ^{'} )}{p ( x ) q ( x ^{'} ∣ x )})

Where $x$ is the current state and $x^{'}$ is the proposed state.

Stationary Distribution

We can prove that with that acceptance rate, the probability of transitioning in a given state satisfies the balance equation for the stationary distribution, which implies we can have a well behaved Markov Chain, and continue to sample from this.

We see that the probability of transitioning from $x$ to $x^{'}$ is the same as transitioning from $x^{'}$ to $x$ .

p (x) q (x^{'} ∣ x) α = p (x^{'}) q (x ∣ x^{'}) α

To prove this, just consider the two possible cases for the min function.

Con distribuzioni Gaussiane

Il caso continuo con distribuzione Gaussiane potrebbe essere di particolare interesse perché semplifica l'analisi del calcolo del coefficiente $α$ : dato che le Gaussians sono simmetriche, abbiamo che il valore di

\frac{r ( x ^{'} ∣ x )}{r ( x ∣ x ^{'} )} = \frac{N ( x ^{'} ; x , τ I )}{N ( x ; x ^{'} , τ I )} = 1

Quindi il coefficiente $α$ si può scrivere anche solo come le distribuzioni $\frac{q ( x )}{q ( x ^{'} )}$ che se si assume una distribuzione di Gibbs, si può intendere come un $exp (f (x) - f (x^{'}))$ , che è di facile interpretazione. Quindi possiamo fare sampling solamente considerando il rapporto fra le due distribuzioni.

Variations and Improvements

Metropolis-Hastings' sample efficiency could be improved by adding information about the gradient of the distribution. This is the idea behind the Langevin algorithm.

Metropolis Adjusted Langevin Algorithm

The Metropolis-Adjusted Langevin Algorithm (MALA) modifies the Metropolis-Hastings method by incorporating gradient information of the target distribution to guide proposals, fixing the mean with a gradient descent step. Now the mean is $x - τ \nabla f (x)$ instead of just $x$ . The proposal function is still a Gaussian. This allows us to accept always, which motivates why it is more efficient compared to the Metropolis Hasting algorithm. Now the proposal distribution is the following

r (x^{'} ∣ x) = N (x^{'}; x + τ \nabla f (x), 2 τ I) \propto exp (- \frac{1}{2} \frac{x ^{'} - x - τ \nabla lo g p ( x )}{G}^{2}),

§ It is possible to show that for log-concave distributions (e.g., Bayesian log. Regression), MALA efficiently converges to the stationary distribution (mixing time is polynomial in the dimension)

Somehow, one can prove that if $τ \to 0$ then the Metropolis Hastings tends to always accept the proposal, which is why it is more efficient. The version of Metropolis Hastings that always accepts is called Unadjusted Langevin Algorithm.

If we model the energy function as:

p (θ ∣ x_{1 : n}, y_{1 : n}) = \frac{1}{Z} exp (lo g p (θ) + i = 1 \sum n lo g p (y_{i} ∣ x_{i}, θ))

Similarly to a Log Linear Models in language processing, then the update function is the following:

θ_{t + 1} = θ_{t} + τ \nabla_{θ} lo g p (θ) + τ i = 1 \sum n \nabla_{θ} lo g p (y_{i} ∣ x_{i}, θ) + ε

Where $ε \sim N (0, 2 τ I)$

The drawback of this method is that computing the energy for the whole dataset is expensive for large datasets. This is why sometimes we use a stochastic version of this algorithm. Which drives to the stochastic gradient Langevin dynamics.

Stochastic Gradient Langevin Dynamics

The double value for the Noise signal is justifiable if you study diffusion processes. So if you want to know how the $2$ factor comes into play when analyzing the process take a look at those courses. One can also estimate the update of the gradient of the likelihood by having $m$ samples and then using that values for the update: So the algorithm for the update becomes:

Sample $m$ points and then update $θ$ as follows:

θ_{t + 1} = θ_{t} + τ \nabla_{θ} lo g p (θ) + τ \frac{n}{m} i = 1 \sum m \nabla_{θ} lo g p (y_{i} ∣ x_{i}, θ) + ε

Langevin Monte Carlo

TODO: this is just an Outlook, and it is left out from here.

Landscape of sampling methods

we can broadly categorize optimization methods based on the use of

Momentum
Stochasticity
Sampling Then we would have this cube.

Gibbs Sampling

These are probably not very important for the exam. We use a collapsed version of Gibbs Sampling for Dirichlet Processes.

The Idea

Gibbs sampling is a special case of Metropolis Hasting, in this case the proposal distribution $r$ is as follows:

r_{i} (x^{'} ∣ x) = {p (x_{i}^{'} ∣ x_{- i}^{'}) 0 if x^{'} differs from x only in i else

We just focus on a single coordinate difference. Similar to the coordinate descent in some manner.

Proof of correctness

This sections proves that Gibbs Sampling indeed gives the original distribution. We will use a similar idea presented for #Metropolis Hastings.

One can observe the following:

α_{i} (x^{'} ∣ x) = min (1, \frac{p ( x ^{'} )}{p ( x )}) = 1

A simple code example

This code example shows how we can sample from a multivariate Gaussian using Gibbs sampling. Note that we just need to be able to compute marginals in order to sample this.

import numpy as np
import matplotlib.pyplot as plt

# Define parameters of the bivariate Gaussian
mu1, mu2 = 0, 0 # Means
sigma1, sigma2 = 1, 1 # Standard deviations
rho = 0.8 # Correlation coefficient

# Derived quantities
cov = rho * sigma1 * sigma2 # Covariance
Sigma = np.array([[sigma1**2, cov], [cov, sigma2**2]]) # Covariance matrix
Sigma_inv = np.linalg.inv(Sigma) # Precision matrix


# Conditional distributions
def sample_x1_given_x2(x2, mu1, mu2, sigma1, sigma2, rho):
	cond_mean = mu1 + rho * (sigma1 / sigma2) * (x2 - mu2)
	cond_var = (1 - rho**2) * sigma1**2
	return np.random.normal(cond_mean, np.sqrt(cond_var))

  
def sample_x2_given_x1(x1, mu1, mu2, sigma1, sigma2, rho):
	cond_mean = mu2 + rho * (sigma2 / sigma1) * (x1 - mu1)
	cond_var = (1 - rho**2) * sigma2**2
	return np.random.normal(cond_mean, np.sqrt(cond_var))

# Gibbs Sampling
def gibbs_sampling(num_samples, burn_in=100):
	samples = []
	x1, x2 = 0, 0 # Initial values
	for _ in range(num_samples + burn_in):
		x1 = sample_x1_given_x2(x2, mu1, mu2, sigma1, sigma2, rho)
		x2 = sample_x2_given_x1(x1, mu1, mu2, sigma1, sigma2, rho)
		samples.append((x1, x2))
	return np.array(samples[burn_in:]) # Discard burn-in samples

# Run Gibbs Sampling
num_samples = 5000
samples = gibbs_sampling(num_samples)
  

# Plot results
plt.figure(figsize=(8, 6))
plt.scatter(samples[:, 0], samples[:, 1], alpha=0.5, s=5, label="Samples")
plt.title("Gibbs Sampling: Bivariate Gaussian", fontsize=14)
plt.xlabel("$X_1$", fontsize=12)
plt.ylabel("$X_2$", fontsize=12)
plt.axhline(0, color='gray', linewidth=0.5)
plt.axvline(0, color='gray', linewidth=0.5)
plt.grid(alpha=0.3)
plt.legend()
plt.show()

Law of large numbers#

Strong Law of large Numbers#

Central limit theorem#

Monte carlo integration#

Importance sampling#

Bound on samples#

Markov Chain Monte Carlo#

Metropolis Hastings#

The Sampling Algorithm#

Stationary Distribution#

Con distribuzioni Gaussiane#

Variations and Improvements#

Metropolis Adjusted Langevin Algorithm#

Stochastic Gradient Langevin Dynamics#

Langevin Monte Carlo#

Landscape of sampling methods#

Gibbs Sampling#

The Idea#

Proof of correctness#

A simple code example#