LMC Bivariate Gaussian Example
Nabarun Deb, Ursula Guo, University of Chicago
05-28-2024
A simple 2d example of Gaussian simulation using the Langevin Monte Carlo algorithm
Here’s the setup of a multivariate Gaussian for a simple 2d example.
Let X be a bivariate Gaussian random variable.
Let the mean of X be \(\mathbb{E}[X]=(0,0)\).
Let the covariance matrix of X be \[ cov(X)= \left[ {\begin{array}{cc} 1 & 0.9 \\ 0.9 & 1 \\ \end{array} } \right] \]
Then the log density function of X is equal, up to an additive constant, to \(-0.5(X-\mathbb{E}[X])^T cov(X)^{-1} (X-\mathbb{E}[X])\). Note that \((X-\mathbb{E}[X])^T cov(X)^{-1} (X-\mathbb{E}[X])\) is the squared Mahalanobis distance between \(X\) and its mean \(\mathbb{E}[X]\). For a derivation of the multivariate normal log density function, see here
library(MASS)
require(ggplot2)
## Loading required package: ggplot2
library(mvtnorm)
# Define the target distribution: multivariate Gaussian
target_mean <- c(0, 0)
target_cov <- matrix(c(1, 0.9, 0.9, 1), nrow = 2) # covar matrix of the 2d Gaussian
target_log_density <- function(x) {
  -0.5 * t(x - target_mean) %*% solve(target_cov) %*% (x - target_mean)
}
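As a quick sanity check (a sketch added here, not part of the original analysis), target_log_density should agree with mvtnorm::dmvnorm(log = TRUE) — mvtnorm is already loaded above — up to the additive log normalizing constant; x_test is an arbitrary test point introduced only for this check.
# Hedged sanity check: our unnormalized log-density plus the log normalizing
# constant -0.5 * log((2*pi)^2 * det(target_cov)) should match dmvnorm(log = TRUE).
x_test <- c(0.5, -0.3)
unnormalized <- target_log_density(x_test)
full_log_density <- dmvnorm(x_test, mean = target_mean, sigma = target_cov, log = TRUE)
log_normalizer <- -0.5 * log((2 * pi)^2 * det(target_cov))
all.equal(as.numeric(unnormalized) + log_normalizer, full_log_density)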
The gradient of the log-density function is \(cov(X)^{-1}(\mathbb{E}[X]-X)\).
# Gradient of the log-density function
grad_log_density <- function(x) {
  solve(target_cov) %*% (target_mean - x)
}
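As another hedged check (not in the original write-up), a central finite-difference approximation of target_log_density should roughly match grad_log_density; numeric_grad and x_test are helpers introduced only for this illustration.
# Hedged check: central finite differences of target_log_density should
# approximately recover grad_log_density at an arbitrary test point.
numeric_grad <- function(f, x, h = 1e-5) {
  sapply(seq_along(x), function(j) {
    e <- rep(0, length(x)); e[j] <- h
    (f(x + e) - f(x - e)) / (2 * h)
  })
}
x_test <- c(1, -0.5)
cbind(analytic = as.numeric(grad_log_density(x_test)),
      numeric  = numeric_grad(target_log_density, x_test))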
Now we make use of the Langevin Monte Carlo iteration, \(X_{k+1} = X_k + \frac{\eta}{2} \nabla \log(pdf(X_k)) + \sqrt{\eta}\, Z_k\), where \(\eta\) is the step size and \(Z_k\) is a draw from a standard bivariate normal distribution.
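To make the update concrete, here is a single illustrative LMC step using the grad_log_density function defined above (a sketch; the step size eta and the starting point x_k are arbitrary values chosen only for illustration).
# One illustrative LMC step (sketch); eta and x_k are arbitrary illustrative values.
eta <- 0.01
x_k <- c(1, 1)
z_k <- rnorm(2) # draw from a standard bivariate normal
x_next <- x_k + eta / 2 * grad_log_density(x_k) + sqrt(eta) * z_k
x_next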
Question: if we need to make random draws from a normal distribution for this LMC algorithm to sample a normal distribution, doesn’t that make the procedure circular?
Answer:
Most of the time, LMC is used to sample random variables from distributions that are much more complicated than a Gaussian. So the step where we introduce Gaussian noise to perturb the sequence of iterates is not circular in those settings.
Question: what does “thinning” mean in the context of LMC sampling technique?
Answer:
Thinning refers to keeping only every \(k\)-th iterate of the chain and discarding the iterations in between. It is worth noting that thinning reduces the autocorrelation between the retained draws, at the cost of discarding samples.
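For illustration, thinning a chain stored as a matrix of draws could look like the following sketch; thin_chain is a hypothetical helper, and the defaults burn_in = 5000 and thin = 100 simply mirror the bsz and thinsz quantities used later in this document.
# Hypothetical thinning helper (sketch): drop a burn-in, then keep every
# `thin`-th row of the remaining draws.
thin_chain <- function(samples, burn_in = 5000, thin = 100) {
  kept <- samples[-seq_len(burn_in), , drop = FALSE]
  kept[seq(thin, nrow(kept), by = thin), , drop = FALSE]
}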
Question: How does strong convexity come into place when we are using LMC to sample a Gaussian distribution?
Answer:
For a twice-differentiable function on a convex domain, convexity can be checked through the Hessian (the “second derivative” in the 1d case): the function is convex if and only if its Hessian is positive semi-definite (\(\geq 0\) in the 1d case).
Definition: A function \(f: \Omega \subset \mathbb{R}^n \rightarrow \mathbb{R}\), with \(\Omega\) convex, is a convex function iff \(f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)\) for all \(x, y \in \Omega\) and all \(t \in [0,1]\).
(Second Order Condition for Convexity) For a function \(f \in C^2(\Omega)\), where \(\Omega \subset \mathbb{R}^n\) is also a convex set, \(f\) is convex iff \(\nabla^2f(x) \succeq 0\) for all \(x \in \Omega\).
(Second Order Condition for Strict Convexity) Let \(f\in C^2(\Omega)\). If \(\nabla^2 f(x) \succ 0\) for all \(x \in \Omega\), then \(f\) is strictly convex. The converse does not hold in general (e.g., \(f(x) = x^4\) is strictly convex but \(f''(0) = 0\)).
Definition: Let \(f\in C^2(\Omega)\). \(f\) is strongly convex if there exists some \(m > 0\) such that \(\nabla^2 f(x) \succeq mI\) for all \(x \in \Omega\), where \(I\) is the identity matrix of the same dimension as the Hessian \(\nabla^2 f(x)\).
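For the bivariate Gaussian target defined above, the Hessian of \(-\log(pdf(x))\) is the constant matrix \(cov(X)^{-1}\), so the strong convexity constant \(m\) is its smallest eigenvalue. A quick check in R (a sketch added here for illustration, not part of the original analysis):
# The Hessian of the negative log-density is the constant matrix solve(target_cov);
# its smallest eigenvalue is the strong convexity constant m (about 0.53 here).
hess <- solve(target_cov)
eigen(hess)$values
min(eigen(hess)$values)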
So intuitively, the larger the second derivative/Hessian is, the more strongly convex a function is. With that in mind, let’s calculate the second derivative of the negative log of the pdf of a normal random variable.
Let \(x \sim N(\mu,\sigma^2)\).
Since \(-\log(pdf(x))\) is, up to an additive constant, a positive multiple of \((x-\mu)^2\), it is a convex function of \(x\). Now let’s calculate how convex it is through a second derivative calculation.
\[\begin{align*} -\log(pdf(x)) &= -\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\right)\\ &= \frac{(x-\mu)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)\\ \implies \frac{d^2(-\log(pdf(x)))}{dx^2} &= \frac{1}{\sigma^2} \end{align*}\]
This shows that the second derivative (the Hessian in 1d) equals \(\frac{1}{\sigma^2}\): the smaller the variance \(\sigma^2\), the larger the second derivative, and the more strongly convex \(-\log(pdf(x))\) is.
We can relate this to the variance of the iterates \(X_t\): if the target is \(N(0,\sigma^2)\), we want \(var(X_t) \rightarrow \sigma^2\) as \(t \rightarrow \infty\) (so \(var(X_t) \rightarrow 1\) for the standard normal), since the simulated \(X_t\) should converge in distribution to the target.
Here’s how we can look at the variance in each iteration of the Langevin Monte Carlo algorithm:
\[\begin{align*} dX_t &= -\nabla(-\log(pdf(X_t)))\,dt + \sqrt{2}\,dB_t\\ \implies X_{t+\epsilon}-X_t &\approx -\nabla(-\log(pdf(X_t)))\,\epsilon + \sqrt{2}\,(B_{t+\epsilon}-B_{t})\\ \implies X_{t+\epsilon} &= X_t - \frac{X_t}{\sigma^2}\epsilon+\sqrt{2}\,(B_{t+\epsilon}-B_{t})\\ X_{t+\epsilon} &= X_t\left(1-\frac{\epsilon}{\sigma^2}\right) + \sqrt{2\epsilon}\,z_t \text{ where $z_t \sim N(0,1)$, i.i.d. across $t$}\\ \end{align*}\]
If we let the starting value \(X_0\) be an independent \(N(0,1)\) random variable, then we can read off the variance of \(X_\epsilon\) from the above expression.
\[\begin{align*} X_\epsilon &= \left(1-\frac{\epsilon}{\sigma^2}\right)X_0 + \sqrt{2\epsilon}\,Z\\ &\sim N\!\left(0,\left(1-\frac{\epsilon}{\sigma^2}\right)^2 + 2\epsilon\right) \end{align*}\]
More generally, let \(X_0 \sim N(0,var(0))\), where \(var(t)\) denotes the variance of \(X_t\). Then the variance at step \(\epsilon\) and at step \(t+\epsilon\) can be calculated in the same way.
\[\begin{align*} X_\epsilon &\sim N\!\left(0,\left(1-\frac{\epsilon}{\sigma^2}\right)^2 var(0) + 2\epsilon\right)\\ \implies var(\epsilon) &= \left(1-\frac{\epsilon}{\sigma^2}\right)^2 var(0) + 2\epsilon\\ X_{t+\epsilon} &\sim N\!\left(0,\left(1-\frac{\epsilon}{\sigma^2}\right)^2 var(t) + 2\epsilon\right)\\ \implies var(t+\epsilon) &= \left(1-\frac{\epsilon}{\sigma^2}\right)^2 var(t) + 2\epsilon \end{align*}\]
Now calculate the derivative of the variance with respect to time to see how the variance changes over time. We aim to express \(\frac{d\,var(t)}{dt}\) in terms of \(var(t)\) so that we can solve the resulting ODE and express \(var(t)\) in terms of \(t\).
\[\begin{align*} \lim_{\epsilon \rightarrow 0} \frac{var(t+\epsilon)-var(t)}{\epsilon} &= \lim_{\epsilon \rightarrow 0} \frac{\left(1-\frac{\epsilon}{\sigma^2}\right)^2 var(t)+2\epsilon - var(t)}{\epsilon}\\ &= \lim_{\epsilon \rightarrow 0} \frac{\left(-\frac{2\epsilon}{\sigma^2}+\frac{\epsilon^2}{\sigma^4}\right)var(t)+2\epsilon}{\epsilon}\\ &= \lim_{\epsilon \rightarrow 0} \left(-\frac{2}{\sigma^2}+\frac{\epsilon}{\sigma^4}\right)var(t)+2 \\ &= -\frac{2}{\sigma^2}var(t) + 2\\ \implies \frac{d\,var(t)}{dt} &= -\frac{2}{\sigma^2}var(t) + 2 \end{align*}\]
Solving the ODE with initial condition \(var(0)=0\) (i.e., starting the chain at the fixed point \(X_0=0\)), we get \(var(t) = \sigma^2(1-\exp(-\frac{2}{\sigma^2}t))\); for a general starting variance, \(var(t) = \sigma^2 + (var(0)-\sigma^2)\exp(-\frac{2}{\sigma^2}t)\).
And taking \(t\) to infinity, we have \(\lim_{t \rightarrow \infty} var(t) = \sigma^2\), the variance of the target distribution, so the dynamics converge to the right limit. Moreover, the variance approaches \(\sigma^2\) at rate \(\exp(-\frac{2}{\sigma^2}t)\): the smaller \(\sigma^2\), the faster the convergence. This matches the conclusion we drew above, since a smaller \(\sigma^2\) means \(-\log(pdf(x))\) is more strongly convex.
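As a hedged numerical sanity check (not part of the original document), one can iterate the discrete variance recursion from above and compare with the ODE solution; sigma2, eps, and n_steps are arbitrary illustrative values.
# Iterate var(t + eps) = (1 - eps/sigma^2)^2 * var(t) + 2*eps from var(0) = 0
# and compare with the ODE solution sigma^2 * (1 - exp(-2*t/sigma^2)).
sigma2 <- 2
eps <- 0.001
n_steps <- 5000
v <- 0
for (k in 1:n_steps) v <- (1 - eps / sigma2)^2 * v + 2 * eps
t_final <- n_steps * eps
c(recursion = v, ode_solution = sigma2 * (1 - exp(-2 * t_final / sigma2)))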
# Langevin dynamics sampling without "thinning"
langevin_sampling <- function(n_samples, step_size, initial_value) {
  samples <- matrix(0, n_samples, length(initial_value))
  x <- initial_value
  for (i in 1:n_samples) {
    grad <- grad_log_density(x)
    # LMC update: x + (step_size / 2) * grad log pdf(x) + sqrt(step_size) * noise
    x <- x + step_size / 2 * grad + sqrt(step_size) * rnorm(length(x))
    samples[i, ] <- x
  }
  return(samples)
}
Next, generate samples using the langevin_sampling function defined above.
# Set parameters
n_samples <- 505000
bsz <- 5000 # burn-in size: number of initial draws to discard
step_size <- 0.01
initial_value <- c(0, 0)
# Generate samples
samples <- langevin_sampling(n_samples, step_size, initial_value)
samplesburn <- samples[-c(1:bsz), ] # chain with the burn-in removed
thinsz <- seq(100, n_samples - bsz, 100) # thinning indices: every 100th post-burn-in draw
colMeans(samples)
## [1] -0.04580787 -0.04383146
cov(samples)
## [,1] [,2]
## [1,] 0.9814081 0.8814491
## [2,] 0.8814491 0.9863900
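For completeness, here is a sketch (no output shown; the summaries above use the full, unthinned chain) of how the burn-in and thinning indices defined earlier could be applied.
# Apply burn-in removal and thinning using samplesburn and thinsz from above.
samples_thinned <- samplesburn[thinsz, ]
colMeans(samples_thinned)
cov(samples_thinned)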
samplesDf <- data.frame(samples)
library(ggplot2)
ggplot(samplesDf, aes(x = X1, y = X2)) +
  labs(title = "Langevin Sampling for 2 dimensions",
       x = "x1",
       y = "x2") +
  geom_point(color = "deepskyblue2", alpha = 0.5) +
  theme_bw()
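As a final hedged comparison (not in the original analysis), MASS, which is loaded at the top, can draw exact samples from the target; their empirical mean and covariance should be close to the LMC chain’s summaries reported above.
# Draw exact samples from the target with MASS::mvrnorm for comparison.
exact_samples <- mvrnorm(n = 10000, mu = target_mean, Sigma = target_cov)
colMeans(exact_samples)
cov(exact_samples)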