LMC Bivariate Gaussian Example
Nabarun Deb, Ursula Guo, University of Chicago
05-28-2024
A simple 2d example of Gaussian simulation using the Langevin Monte Carlo (LMC) algorithm
Here is the setup of the multivariate Gaussian for a simple 2d example.
Let $X$ be a bivariate Gaussian random variable.
Let the mean of $X$ be $E[X] = (0, 0)^T$.
Let the covariance matrix of $X$ be $\mathrm{cov}(X) = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}$.
Then the log density function of $X$ is proportional to $-0.5\,(X - E[X])^T \mathrm{cov}(X)^{-1}(X - E[X])$. Note that $(X - E[X])^T \mathrm{cov}(X)^{-1}(X - E[X])$ is the squared Mahalanobis distance between $X$ and its mean $E[X]$. For the derivation of the multivariate normal log density function, see here
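As a quick sketch of this quadratic form in R, the base function mahalanobis() computes $(x - \mu)^T \mathrm{cov}(X)^{-1}(x - \mu)$ directly; mu, Sigma, and x below are illustrative stand-ins mirroring the mean, covariance, and an arbitrary test point (they are not part of the code that follows).
# Squared Mahalanobis distance, two ways (mu, Sigma, x are illustrative values)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.9, 0.9, 1), nrow = 2)
x <- c(0.5, -0.3)
mahalanobis(x, center = mu, cov = Sigma)          # base R helper
drop(t(x - mu) %*% solve(Sigma) %*% (x - mu))     # same quantity computed by hand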
library(MASS)
require(ggplot2)
## Loading required package: ggplot2
library(mvtnorm)
# Define the target distribution: multivariate Gaussian
target_mean <- c(0, 0)
target_cov <- matrix(c(1, 0.9, 0.9, 1), nrow = 2) # covar matrix of the 2d Gaussian
target_log_density <- function(x) {
  -0.5 * t(x - target_mean) %*% solve(target_cov) %*% (x - target_mean)
}
The gradient of the log-density function is $\mathrm{cov}(X)^{-1}(E[X] - X)$.
# Gradient of the log-density function
grad_log_density <- function(x) {
  solve(target_cov) %*% (target_mean - x)
}
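As a hedged sanity check (not part of the original code), the analytic gradient can be compared against a central finite difference of target_log_density at an arbitrary test point; x_test and h below are illustrative values.
# Finite-difference check of grad_log_density (x_test and h are illustrative)
x_test <- c(0.5, -0.3)
h <- 1e-5
fd_grad <- sapply(1:2, function(j) {
  e <- c(0, 0); e[j] <- h
  (target_log_density(x_test + e) - target_log_density(x_test - e)) / (2 * h)
})
cbind(analytic = as.vector(grad_log_density(x_test)), finite_diff = fd_grad)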
Now make use of the Langevin Monte Carlo iteration, which states $$X_{k+1} = X_k + \frac{\eta}{2}\,\nabla \log(\mathrm{pdf}(X_k)) + \sqrt{\eta}\,Z_k,$$ where $\eta$ is the step size and $Z_k$ is an independent draw from a standard bivariate normal distribution.
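Written out as code, a single iteration of this update looks like the following sketch; eta, x_k, and z_k are illustrative names, and the full sampler below wraps exactly this step in a loop.
# One LMC update step (eta, x_k, z_k are illustrative; see langevin_sampling below)
eta <- 0.01
x_k <- c(1, 1)
z_k <- rnorm(2)                                               # standard bivariate normal draw
x_k1 <- drop(x_k + eta / 2 * grad_log_density(x_k) + sqrt(eta) * z_k)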
Question: if we need to make random draws from a normal distribution for this LMC algorithm to work for a normal target, doesn’t that make the procedure circular?
Answer:
Most of the time, LMC is used to sample random variables from a much more complicated distribution than a Gaussian. So the step where we introduce Gaussian noise to perturb the sequence of iterates is not circular in those settings.
Question: what does “thinning” mean in the context of the LMC sampling technique?
Answer:
Thinning refers to keeping only every $k$-th iterate of the chain as a sample and discarding the iterations in between. It is worth noting that thinning reduces the autocorrelation between the retained samples, at the cost of throwing away iterations.
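As a toy sketch of thinning (toy_chain is a made-up autocorrelated AR(1) sequence and the interval of 10 is arbitrary), one simply keeps every 10th draw:
# Toy illustration of thinning: keep every 10th draw of an autocorrelated chain
toy_chain <- as.numeric(arima.sim(model = list(ar = 0.95), n = 10000))
thinned <- toy_chain[seq(10, length(toy_chain), by = 10)]
acf(toy_chain, lag.max = 1, plot = FALSE)   # lag-1 autocorrelation before thinning
acf(thinned, lag.max = 1, plot = FALSE)     # lower lag-1 autocorrelation after thinning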
Question: How does strong convexity come into play when we are using LMC to sample a Gaussian distribution?
Answer:
For a twice-differentiable function on a convex domain, convexity is equivalent to the Hessian (or “second derivative” for a 1d variable) being positive semi-definite (or $\geq 0$ in the 1d case).
Definition: A function $f: \Omega \subset \mathbb{R}^n \to \mathbb{R}$, with $\Omega$ convex, is a convex function iff $f(tx + (1-t)y) \leq t f(x) + (1-t)f(y)$ for all $x, y \in \Omega$ and all $t \in [0, 1]$.
(Second Order Condition for Convexity) For a function $f \in C^2(\Omega)$, where $\Omega \subset \mathbb{R}^n$ is also a convex set, $f$ is convex iff $\nabla^2 f(x) \succeq 0$ for all $x \in \Omega$.
Definition: Let $f \in C^2(\Omega)$ with $\Omega$ convex. $f$ is strictly convex if $\nabla^2 f(x) \succ 0$ for all $x \in \Omega$ (this condition is sufficient but not necessary).
Definition: Let $f \in C^2(\Omega)$ with $\Omega$ convex. $f$ is strongly convex if there exists some $m > 0$ such that $\nabla^2 f(x) \succeq mI$ for all $x \in \Omega$, where $I$ is the identity matrix of the same dimensions as the Hessian $\nabla^2 f(x)$.
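For the 2d target used here, the Hessian of $-\log(\mathrm{pdf}(x))$ is constant and equal to $\mathrm{cov}(X)^{-1}$, so its smallest eigenvalue can serve as a strong-convexity constant $m$. A quick check in R, using the target_cov already defined above:
# The Hessian of -log density is solve(target_cov); its smallest eigenvalue
# gives a strong-convexity constant m for the negative log-density
eigen(solve(target_cov))$values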
So intuitively, the larger the second derivative/Hessian is, the more strongly convex a function is. Therefore, let’s calculate the second derivative of the negative log of the pdf of a normal random variable.
Let $x \sim N(\mu, \sigma^2)$.
The negative log-pdf $-\log(\mathrm{pdf}(x))$ equals the quadratic $\frac{(x-\mu)^2}{2\sigma^2}$ plus a constant, so it is a convex function of $x$. Now let’s calculate how convex it is through the second derivative.
$$-\log(\mathrm{pdf}(x)) = -\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\right) = \frac{x^2 - 2x\mu + \mu^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2) \implies \frac{d^2\left(-\log(\mathrm{pdf}(x))\right)}{dx^2} = \frac{1}{\sigma^2}$$
This shows that the second derivative (or Hessian) equals $\frac{1}{\sigma^2}$, so the smaller the variance $\sigma^2$, the bigger the second derivative, and the more strongly convex $-\log(\mathrm{pdf}(x))$ is.
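A quick numerical check of this second derivative: with the arbitrary illustration values sigma = 2, x0 = 0.7, and step h below, the central second difference of $-\log(\mathrm{pdf}(x))$ should be close to $1/\sigma^2 = 0.25$.
# Central second difference of -log(pdf(x)) for N(mu, sigma^2); values are illustrative
neg_log_pdf <- function(x, mu = 0, sigma = 2) -dnorm(x, mean = mu, sd = sigma, log = TRUE)
h <- 1e-4
x0 <- 0.7
(neg_log_pdf(x0 + h) - 2 * neg_log_pdf(x0) + neg_log_pdf(x0 - h)) / h^2   # ~ 1/sigma^2 = 0.25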
We can relate this to the variance of $X_t$: since the simulated $X_t$ converges in distribution to the target $N(0, \sigma^2)$ as $t \to \infty$, we want the variance $\mathrm{var}(X_t)$ to converge to $\sigma^2$.
Here’s how we can look at the variance in each iteration of the Langevin Monte Carlo algorithm:
$$\begin{aligned}
dX_t &= -\nabla\bigl(-\log(\mathrm{pdf}(X_t))\bigr)\,dt + \sqrt{2}\,dB_t \\
\implies X_{t+\epsilon} - X_t &= -\nabla\bigl(-\log(\mathrm{pdf}(X_t))\bigr)\,(t+\epsilon - t) + \sqrt{2}\,(B_{t+\epsilon} - B_t) \\
\implies X_{t+\epsilon} &= X_t - \frac{X_t}{\sigma^2}\,\epsilon + \sqrt{2}\,(B_{t+\epsilon} - B_t) \\
\implies X_{t+\epsilon} &= X_t\left(1 - \frac{\epsilon}{\sigma^2}\right) + \sqrt{2\epsilon}\,z_t
\end{aligned}$$
where $z_t \sim N(0, 1)$, i.i.d. for each $t$.
If we let the starting point $X_0$ be an $N(0, 1)$ random variable, then we can read off the variance of $X_\epsilon$ from the above expression.
$$X_\epsilon = \left(1 - \frac{\epsilon}{\sigma^2}\right)X_0 + \sqrt{2\epsilon}\,Z \sim N\left(0,\; \left(1 - \frac{\epsilon}{\sigma^2}\right)^2 + 2\epsilon\right)$$
More generally, let $X_0 \sim N(0, \mathrm{var}(0))$, where $\mathrm{var}(0)$ denotes the initial variance. Then the variance at the $\epsilon$ step and at the $t + \epsilon$ step can also be calculated.
$$\begin{aligned}
X_\epsilon &\sim N\left(0,\; \left(1 - \frac{\epsilon}{\sigma^2}\right)^2 \mathrm{var}(0) + 2\epsilon\right) \implies \mathrm{var}(\epsilon) = \left(1 - \frac{\epsilon}{\sigma^2}\right)^2 \mathrm{var}(0) + 2\epsilon \\
X_{t+\epsilon} &\sim N\left(0,\; \left(1 - \frac{\epsilon}{\sigma^2}\right)^2 \mathrm{var}(t) + 2\epsilon\right) \implies \mathrm{var}(t+\epsilon) = \left(1 - \frac{\epsilon}{\sigma^2}\right)^2 \mathrm{var}(t) + 2\epsilon
\end{aligned}$$
Now calculate the derivative of the variance with respect to time to see how the variance changes over time. We want to express $\frac{d\,\mathrm{var}(t)}{dt}$ in terms of $\mathrm{var}(t)$ so that we can solve the resulting ODE and write $\mathrm{var}(t)$ as a function of $t$.
$$\begin{aligned}
\lim_{\epsilon \to 0}\frac{\mathrm{var}(t+\epsilon) - \mathrm{var}(t)}{\epsilon}
&= \lim_{\epsilon \to 0}\frac{\left(1 - \frac{\epsilon}{\sigma^2}\right)^2 \mathrm{var}(t) + 2\epsilon - \mathrm{var}(t)}{\epsilon}
= \lim_{\epsilon \to 0}\frac{\left(-\frac{2\epsilon}{\sigma^2} + \frac{\epsilon^2}{\sigma^4}\right)\mathrm{var}(t) + 2\epsilon}{\epsilon} \\
&= \lim_{\epsilon \to 0}\left(-\frac{2}{\sigma^2} + \frac{\epsilon}{\sigma^4}\right)\mathrm{var}(t) + 2
= -\frac{2}{\sigma^2}\mathrm{var}(t) + 2 \\
\implies \frac{d\,\mathrm{var}(t)}{dt} &= -\frac{2}{\sigma^2}\mathrm{var}(t) + 2
\end{aligned}$$
Solving the ODE (with initial condition $\mathrm{var}(0) = 0$, i.e., a deterministic starting point), we get $\mathrm{var}(t) = \sigma^2\left(1 - \exp\left(-\frac{2t}{\sigma^2}\right)\right)$.
And taking $t$ to infinity, we have $\lim_{t \to \infty}\mathrm{var}(t) = \sigma^2$, the variance of the target distribution. The smaller $\sigma^2$ is, the smaller this limiting variance and the faster the exponential rate $2/\sigma^2$ at which $\mathrm{var}(t)$ approaches it. This is the same conclusion we drew from requiring the function $-\log(\mathrm{pdf}(x))$ to be strongly convex: a larger $1/\sigma^2$ means a more strongly convex negative log-density.
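To check this variance formula numerically, here is a small 1d simulation sketch; sigma2, eps, n_steps, and n_chains are arbitrary illustration values, and every chain starts at 0 so that $\mathrm{var}(0) = 0$ as assumed above.
# Simulate many independent 1d LMC chains and compare the empirical variance
# at time t = n_steps * eps with sigma^2 * (1 - exp(-2 t / sigma^2))
sigma2 <- 0.5; eps <- 0.001; n_steps <- 2000; n_chains <- 10000
X <- rep(0, n_chains)                                  # var(0) = 0
for (k in 1:n_steps) {
  X <- (1 - eps / sigma2) * X + sqrt(2 * eps) * rnorm(n_chains)
}
c(empirical = var(X), theory = sigma2 * (1 - exp(-2 * n_steps * eps / sigma2)))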
# Langevin dynamics sampling without "thinning"
langevin_sampling <- function(n_samples, step_size, initial_value) {
  samples <- matrix(0, n_samples, length(initial_value))   # storage for all iterates
  x <- initial_value
  for (i in 1:n_samples) {
    grad <- grad_log_density(x)
    # LMC update: x_{k+1} = x_k + (eta / 2) * grad log p(x_k) + sqrt(eta) * z_k
    x <- x + step_size / 2 * grad + sqrt(step_size) * rnorm(length(x))
    samples[i, ] <- x
  }
  return(samples)
}
Next, generate samples using the langevin_sampling function defined above.
# Set parameters
n_samples <- 505000        # total number of LMC iterations
bsz <- 5000                # burn-in size
step_size <- 0.01
initial_value <- c(0, 0)
# Generate samples
samples <- langevin_sampling(n_samples, step_size, initial_value)
samplesburn <- samples[-c(1:bsz), ]          # iterates with the burn-in removed
thinsz <- seq(100, n_samples - bsz, 100)     # thinning index: every 100th post-burn-in iterate
colMeans(samples)
## [1] -0.04580787 -0.04383146
cov(samples)
## [,1] [,2]
## [1,] 0.9814081 0.8814491
## [2,] 0.8814491 0.9863900
samplesDf <- data.frame(samples)
library(ggplot2)
ggplot(samplesDf, aes(x = X1, y = X2)) +
  labs(title = "Langevin Sampling for 2 dimensions",
       x = "x1",
       y = "x2") +
  geom_point(color = "deepskyblue2", alpha = 0.5) +
  theme_bw()
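As an optional visual comparison, here is a hedged sketch that draws directly from the same target with rmvnorm() from mvtnorm and plots the draws in the same style; the sample size 5000 and the color are arbitrary choices, not part of the original analysis.
# Direct draws from the target for visual comparison with the LMC samples
directDf <- data.frame(rmvnorm(5000, mean = target_mean, sigma = target_cov))
ggplot(directDf, aes(x = X1, y = X2)) +
  labs(title = "Direct draws from the target Gaussian",
       x = "x1",
       y = "x2") +
  geom_point(color = "grey40", alpha = 0.5) +
  theme_bw()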