Gaussian Processes

Introduction

In supervised learning, we often use parametric models $p(\mathbf{y} \lvert \mathbf{X},\boldsymbol\theta)$ to explain data and infer optimal values of the parameters $\boldsymbol\theta$ via maximum likelihood or maximum a posteriori estimation. If needed, we can also infer a full posterior distribution $p(\boldsymbol\theta \lvert \mathbf{X},\mathbf{y})$ instead of a point estimate $\boldsymbol{\hat\theta}$. With increasing data complexity, models with a higher number of parameters are usually needed to explain data reasonably well. Methods that use models with a fixed number of parameters are called parametric methods.
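
As a small side example, a two-parameter linear model can be fitted by least squares, which coincides with maximum likelihood estimation under Gaussian noise; this is a minimal sketch and the variable names below are only illustrative:

import numpy as np

# Minimal parametric example: a linear model with a fixed number of parameters,
# fitted by least squares (= maximum likelihood under Gaussian noise).
rng = np.random.default_rng(0)
X_demo = np.linspace(-3, 3, 20)
y_demo = 0.5 * X_demo + 1.0 + 0.1 * rng.standard_normal(20)

theta = np.polyfit(X_demo, y_demo, deg=1)   # two parameters, independent of dataset size
y_pred = np.polyval(theta, 0.5)             # prediction needs only theta, not the training data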

In non-parametric methods, on the other hand, the number of parameters depends on the dataset size. For example, in Nadaraya-Watson kernel regression, a weight $w_i$ is assigned to each observed target $y_i$ and for predicting the target value at a new point $\mathbf{x}$ a weighted average is computed:

$$f(\mathbf{x}) = \sum_{i=1}^{N} w_i(\mathbf{x})\,y_i, \qquad w_i(\mathbf{x}) = \frac{\kappa(\mathbf{x},\mathbf{x}_i)}{\sum_{j=1}^{N} \kappa(\mathbf{x},\mathbf{x}_j)}$$

Observations that are closer to $\mathbf{x}$ have a higher weight than observations that are further away. Weights are computed from $\mathbf{x}$ and observed $\mathbf{x}_i$ with a kernel $\kappa$. A special case is k-nearest neighbors (KNN) where the $k$ closest observations have a weight $1/k$, and all others have weight $0$. Non-parametric methods often need to process all training data for prediction and are therefore slower at inference time than parametric methods. On the other hand, training is usually faster as non-parametric models only need to remember training data.
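
A minimal one-dimensional sketch of Nadaraya-Watson regression with a Gaussian kernel (the bandwidth h and the data below are only illustrative):

import numpy as np

def nadaraya_watson(x, X_obs, y_obs, h=1.0):
    # Kernel weights between the query point x and all observed inputs.
    w = np.exp(-0.5 * ((x - X_obs) / h) ** 2)
    w /= w.sum()
    # The prediction is a weighted average of the observed targets.
    return np.sum(w * y_obs)

X_obs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_obs = np.sin(X_obs)
nadaraya_watson(0.5, X_obs, y_obs, h=0.5)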

Another example of non-parametric methods are Gaussian processes (GPs). Instead of inferring a distribution over the parameters of a parametric function, Gaussian processes can be used to infer a distribution over functions directly. A Gaussian process defines a prior over functions. After having observed some function values, it can be converted into a posterior over functions. Inference of continuous function values in this context is known as GP regression, but GPs can also be used for classification.

Univariate Gaussian distribution

The probability density function of a univariate Gaussian distribution is:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \tag{1}$$

$\mu$ is the mean or expectation of the distribution (and also its median and mode), while the parameter $\sigma$ is its standard deviation. The variance of the distribution is $\sigma^2$.
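
As a quick check, Equation $(1)$ can be evaluated directly and compared with scipy.stats.norm (a small sketch, assuming SciPy is available):

import numpy as np
from scipy.stats import norm

def univariate_gaussian_pdf(x, mu=0.0, sigma=1.0):
    # Equation (1): density of N(mu, sigma^2) evaluated at x.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

np.isclose(univariate_gaussian_pdf(0.3, mu=1.0, sigma=2.0), norm.pdf(0.3, loc=1.0, scale=2.0))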

Multivariate Gaussian distribution

The multivariate Gaussian distribution is a generalization of the univariate Gaussian distribution to higher dimensions. Assuming that the dimensions are independent of each other:

$$p(x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} p(x_i) = \frac{1}{(2\pi)^{N/2}\,\sigma_1\sigma_2\cdots\sigma_N} \exp\left(-\frac{1}{2}\sum_{i=1}^{N}\frac{(x_i - \mu_i)^2}{\sigma_i^2}\right) \tag{2}$$

To write Equation $(2)$ in matrix form, define $\mathbf{x} = [x_1, x_2, \ldots, x_N]^T$, $\boldsymbol{\mu} = [\mu_1, \mu_2, \ldots, \mu_N]^T$ and $\mathbf{K} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2)$.

We have

$$\lvert\mathbf{K}\rvert = \sigma_1^2\sigma_2^2\cdots\sigma_N^2 \qquad \text{and} \qquad (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{K}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = \sum_{i=1}^{N}\frac{(x_i - \mu_i)^2}{\sigma_i^2}$$

Thus

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{N/2}\,\lvert\mathbf{K}\rvert^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{K}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) \tag{3}$$

$\boldsymbol{\mu} \in \mathbb{R}^N$ is the mean vector and $\mathbf{K} \in \mathbb{R}^{N \times N}$ is the covariance matrix. Since we assume that the dimensions are independent of each other, $\mathbf{K}$ is a diagonal matrix. When the variables are correlated, Equation $(3)$ keeps the same form, but $\mathbf{K}$ is no longer diagonal; it is only required to be symmetric and positive semi-definite.

Equation $(3)$ is usually abbreviated as:

$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$$
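
As a small numerical check, Equation $(3)$ with a diagonal $\mathbf{K}$ gives the same density as the product of univariate Gaussians in Equation $(2)$; the sketch below verifies this with SciPy (variable names are illustrative):

import numpy as np
from scipy.stats import multivariate_normal, norm

mu_ex = np.array([0.0, 1.0, -1.0])
sigma_ex = np.array([1.0, 0.5, 2.0])
K_ex = np.diag(sigma_ex ** 2)                # diagonal covariance: independent dimensions
x_ex = np.array([0.2, 0.8, -1.5])

p_joint = multivariate_normal.pdf(x_ex, mean=mu_ex, cov=K_ex)     # Equation (3)
p_product = np.prod(norm.pdf(x_ex, loc=mu_ex, scale=sigma_ex))    # Equation (2)
np.isclose(p_joint, p_product)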

Gaussian Processes

A Gaussian process is a random process where any collection of points $\mathbf{X} = [\mathbf{x}_{1}, \mathbf{x}_{2}, \cdots, \mathbf{x}_{N}]$ is assigned a vector of random variables $\mathbf{f} = [f(\mathbf{x}_{1}), f(\mathbf{x}_{2}), \cdots, f(\mathbf{x}_{N})]$ and where the joint distribution of a finite number of these variables $p(f(\mathbf{x}_1),\ldots,f(\mathbf{x}_N))$ is itself Gaussian:

$$p(\mathbf{f} \lvert \mathbf{X}) = \mathcal{N}(\mathbf{f} \lvert \boldsymbol\mu, \mathbf{K}) \tag{4}$$

In Equation $(4)$, $\boldsymbol\mu = (m(\mathbf{x}_1),\ldots,m(\mathbf{x}_N))$ and $K_{ij} = \kappa(\mathbf{x}_i,\mathbf{x}_j)$. $m$ is the mean function and it is common to use $m(\mathbf{x}) = 0$ as GPs are flexible enough to model the mean arbitrarily well. $\kappa$ is a positive definite kernel function or covariance function. Thus, a Gaussian process is a distribution over functions whose shape (smoothness, …) is defined by $\mathbf{K}$. If points $\mathbf{x}_i$ and $\mathbf{x}_j$ are considered to be similar by the kernel, the function values at these points, $f(\mathbf{x}_i)$ and $f(\mathbf{x}_j)$, can be expected to be similar too.

A GP prior $p(\mathbf{f} \lvert \mathbf{X})$ can be converted into a GP posterior $p(\mathbf{f} \lvert \mathbf{X},\mathbf{y})$ after having observed some data $\mathbf{y}$. The posterior can then be used to make predictions $\mathbf{f}_*$ given new input $\mathbf{X}_*$:

$$p(\mathbf{f}_* \lvert \mathbf{X}_*,\mathbf{X},\mathbf{y}) = \int p(\mathbf{f}_* \lvert \mathbf{X}_*,\mathbf{f}) \, p(\mathbf{f} \lvert \mathbf{X},\mathbf{y}) \, d\mathbf{f} \tag{5}$$

Equation $(5)$ is the posterior predictive distribution which is also a Gaussian with mean $\boldsymbol{\mu}_*$ and covariance $\boldsymbol{\Sigma}_*$. By definition of the GP, the joint distribution of observed data $\mathbf{y}$ and predictions $\mathbf{f}_*$ is

$$\begin{pmatrix}\mathbf{y} \\ \mathbf{f}_*\end{pmatrix} \sim \mathcal{N}\left(\boldsymbol{0}, \begin{pmatrix}\mathbf{K}_y & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**}\end{pmatrix}\right) \tag{6}$$

With $N$ training data and $N_*$ new input data, $\mathbf{K}_y = \kappa(\mathbf{X},\mathbf{X}) + \sigma_y^2\mathbf{I}$ is $N \times N$, $\mathbf{K}_* = \kappa(\mathbf{X},\mathbf{X}_*)$ is $N \times N_*$ and $\mathbf{K}_{**} = \kappa(\mathbf{X}_*,\mathbf{X}_*)$ is $N_* \times N_*$. $\sigma_y^2$ is the noise term in the diagonal of $\mathbf{K}_y$. It is set to zero if training targets are noise-free and to a value greater than zero if observations are noisy. The mean is set to $\boldsymbol{0}$ for notational simplicity. The sufficient statistics of the posterior predictive distribution, $\boldsymbol{\mu}_*$ and $\boldsymbol{\Sigma}_*$, can be computed with[1][3]

$$\boldsymbol{\mu}_* = \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{y} \tag{7}$$

$$\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{K}_* \tag{8}$$

This is the minimum we need to know for implementing Gaussian processes and applying them to regression problems. For further details, please consult the literature in the References section. The next section shows how to implement GPs with plain NumPy from scratch, later sections demonstrate how to use GP implementations from scikit-learn.

Implementation with NumPy

Here, we will use the squared exponential kernel, also known as Gaussian kernel or RBF kernel:

$$\kappa(\mathbf{x}_i,\mathbf{x}_j) = \sigma_f^2 \exp\left(-\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2}{2l^2}\right) \tag{9}$$

The length parameter $l$ controls the smoothness of the function and $\sigma_f$ the vertical variation. For simplicity, we use the same length parameter $l$ for all input dimensions (isotropic kernel).

import numpy as np

def kernel(X1, X2, l=1.0, sigma_f=1.0):
    '''
    Isotropic squared exponential kernel. Computes
    a covariance matrix from points in X1 and X2.

    Args:
        X1: Array of m points (m x d).
        X2: Array of n points (n x d).

    Returns:
        Covariance matrix (m x n).
    '''
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)

There are many other kernels that can be used for Gaussian processes. See [3] for a detailed reference or the scikit-learn documentation for some examples.
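
For example, a Matérn 3/2 kernel can be written in the same style as the squared exponential kernel above (an optional sketch reusing the NumPy import from above; it is not used in the remainder of this article):

def matern32_kernel(X1, X2, l=1.0, sigma_f=1.0):
    '''
    Matern 3/2 kernel: k(r) = sigma_f^2 * (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l),
    where r is the Euclidean distance between points in X1 and X2.
    '''
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    r = np.sqrt(np.maximum(sqdist, 0))  # clip tiny negative values caused by round-off
    return sigma_f**2 * (1 + np.sqrt(3) * r / l) * np.exp(-np.sqrt(3) * r / l)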

Prior

Let’s first define a prior over functions with mean zero and a covariance matrix computed with kernel parameters $l=1$ and $\sigma_f=1$. To draw random functions from that GP, we draw random samples from the corresponding multivariate normal. The following example draws ten random samples and plots them together with the zero mean and the 95% confidence interval (computed from the diagonal of the covariance matrix).

import matplotlib.pyplot as plt

# Finite number of points
n = 100

# points we're going to make predictions at
X = np.linspace(-5, 5, n).reshape(-1, 1)

# Mean and covariance of the prior
mu_prior = np.zeros(X.shape)
cov_prior = kernel(X, X)

# Draw 10 samples from the prior
number_of_samples = 10
samples_prior = np.random.multivariate_normal(mu_prior.ravel(), cov_prior, number_of_samples)
plot_gp(mu_prior, cov_prior, X, samples=samples_prior)
plt.title(f'{number_of_samples} samples from the GP prior (n = 100)')
plt.axis([-5, 5, -3, 3])
plt.legend()
plt.show()


The plot_gp function used above is defined below.

def plot_gp(mu, cov, X, X_train=None, Y_train=None, samples=[]):
    plot_margin_of_error(X, mu, cov)
    plt.plot(X, mu, label='GP mean')
    for sample in samples:
        plt.plot(X, sample, lw=1, ls='--')

    if X_train is not None:
        plt.plot(X_train, Y_train, 'rx', label='Observed Data')


The plot_margin_of_error function is defined below.

def plot_margin_of_error(X, mu, cov):
    X = X.ravel()
    mu = mu.ravel()
    uncertainty = 1.96 * np.sqrt(np.diag(cov))  # 95% confidence interval
    plt.fill_between(X, mu + uncertainty, mu - uncertainty, alpha=0.1,
                     label='Margin of error (95% confidence interval)')

Prediction from noise-free training data

To compute the sufficient statistics, i.e. the mean and covariance of the posterior predictive distribution, we implement Equations $(7)$ and $(8)$

from numpy.linalg import inv

def posterior_predictive(X_s, X_train, Y_train, l=1.0, sigma_f=1.0, sigma_y=1e-8):
    '''
    Computes the sufficient statistics of the GP posterior predictive distribution
    from m training data X_train and Y_train and n new inputs X_s.

    Args:
        X_s: New input locations (n x d).
        X_train: Training locations (m x d).
        Y_train: Training targets (m x 1).
        l: Kernel length parameter.
        sigma_f: Kernel vertical variation parameter.
        sigma_y: Noise parameter.

    Returns:
        Posterior mean vector (n x 1) and covariance matrix (n x n).
    '''
    K = kernel(X_train, X_train, l, sigma_f) + sigma_y ** 2 * np.eye(len(X_train))
    K_s = kernel(X_train, X_s, l, sigma_f)
    K_ss = kernel(X_s, X_s, l, sigma_f) + 1e-8 * np.eye(len(X_s))
    K_inv = inv(K)

    # Equation (7)
    mu_s = K_s.T.dot(K_inv).dot(Y_train)

    # Equation (8)
    cov_s = K_ss - K_s.T.dot(K_inv).dot(K_s)

    return mu_s, cov_s

and apply them to noise-free training data X_train and Y_train_noise_free. The following example draws ten samples from the posterior predictive and plots them along with the mean, confidence interval and training data. In a noise-free model, variance at the training points is zero and all random functions drawn from the posterior go through the training points.

# Noise free training data
X_train = np.array([-3, -2, -1, 1, 2, 3]).reshape(-1, 1)
Y_train_noise_free = np.sin(X_train)

# Compute mean and covariance of the posterior predictive distribution
mu_s, cov_s = posterior_predictive(X, X_train, Y_train_noise_free)

# Draw 10 samples from the posterior
samples_posterior = np.random.multivariate_normal(mu_s.ravel(), cov_s, number_of_samples)

plot_gp(mu_s, cov_s, X, X_train=X_train, Y_train=Y_train_noise_free, samples=samples_posterior)
plt.title(f'{number_of_samples} samples from the GP posterior (n = 100)')
plt.axis([-5, 5, -3, 3])
plt.legend()
plt.show()
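
The interpolation property mentioned above can be verified with a small check (an optional sketch reusing posterior_predictive and the noise-free training data defined above): at the training locations, the posterior mean reproduces the targets and the posterior variance is numerically zero.

mu_train, cov_train = posterior_predictive(X_train, X_train, Y_train_noise_free)
assert np.allclose(mu_train, Y_train_noise_free, atol=1e-3)   # mean interpolates the targets
assert np.allclose(np.diag(cov_train), 0, atol=1e-3)          # variance vanishes at training points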


Comparison of true function and GP posterior.

f_true = np.sin(X).flatten()

# Plot GP mean
plt.plot(X, mu_s, ls='-', label='GP mean')

# Plot data observed
plt.plot(X_train, Y_train_noise_free, 'rx', label='Observed Data')

# Plot True function
plt.plot(X, f_true, label='True function')

# Plot margin of error (95% confidence interval)
plot_margin_of_error(X, mu_s, cov_s)
plt.axis([-5, 5, -3, 3])
plt.legend()
plt.show()

Prediction from noisy training data

If some noise is included in the model, training points are only approximated and the variance at the training points is non-zero.

# Noisy training data
noise = 0.4
X_train = np.array([-3, -2, -1, 1, 2, 3]).reshape(-1, 1)
Y_train_noisy = np.sin(X_train) + noise * np.random.randn(*X_train.shape)

# Compute mean and covariance of the posterior predictive distribution
mu_s, cov_s = posterior_predictive(X, X_train, Y_train_noisy, sigma_y=noise)

# Draw 10 samples from the posterior
samples_posterior_noisy = np.random.multivariate_normal(mu_s.ravel(), cov_s, number_of_samples)
plot_gp(mu_s, cov_s, X, X_train=X_train, Y_train=Y_train_noisy, samples=samples_posterior_noisy)
plt.title(f'{number_of_samples} samples from the GP posterior (n = 100)')
plt.axis([-5, 5, -3, 3])
plt.legend()
plt.show()


Comparison of true function and GP posterior.

f_true = np.sin(X).flatten()

# Plot GP mean
plt.plot(X, mu_s, ls='-', label='GP mean')

# Plot data observed
plt.plot(X_train, Y_train_noisy, 'rx', label='Observed Data')

# Plot True function
plt.plot(X, f_true, label='True function')

# Plot margin of error (95% confidence interval)
plot_margin_of_error(X, mu_s, cov_s)
plt.axis([-5, 5, -3, 3])
plt.legend()
plt.show()

Effect of kernel parameters and noise parameter

The following example shows the effect of kernel parameters $l$ and $\sigma_f$ as well as the noise parameter $\sigma_y$. Higher $l$ values lead to smoother functions and therefore to coarser approximations of the training data. Lower $l$ values make functions more wiggly with wide confidence intervals between training data points. $\sigma_f$ controls the vertical variation of functions drawn from the GP. This can be seen in the plot for $l = 1.0$, $\sigma_f = 3.0$, where the confidence intervals outside the training data region are wide. $\sigma_y$ represents the amount of noise in the training data. Higher $\sigma_y$ values lead to coarser approximations, which avoids overfitting to noisy data.

params = [
    (0.3, 1.0, 0.2),
    (3.0, 1.0, 0.2),
    (1.0, 0.3, 0.2),
    (1.0, 3.0, 0.2),
    (1.0, 1.0, 0.05),
    (1.0, 1.0, 1.5),
]

for l, sigma_f, sigma_y in params:
    mu_s, cov_s = posterior_predictive(X, X_train, Y_train_noisy, l=l,
                                       sigma_f=sigma_f,
                                       sigma_y=sigma_y)
    plt.title(f'l = {l}, sigma_f = {sigma_f}, sigma_y = {sigma_y}')
    plot_gp(mu_s, cov_s, X, X_train=X_train, Y_train=Y_train_noisy)
    plt.axis([-5, 5, -3, 3])
    plt.show()


Optimal values for these parameters can be estimated by maximizing the log marginal likelihood which is given by[1][3]

$$\log p(\mathbf{y} \lvert \mathbf{X}) = \log \mathcal{N}(\mathbf{y} \lvert \boldsymbol{0},\mathbf{K}_y) = -\frac{1}{2} \mathbf{y}^T \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2} \log \lvert\mathbf{K}_y\rvert - \frac{N}{2} \log 2\pi \tag{10}$$

In the following we will minimize the negative log marginal likelihood w.r.t. parameters $l$ and $\sigma_f$; $\sigma_y$ is set to the known noise level of the data. If the noise level is unknown, $\sigma_y$ can be estimated as well along with the other parameters.

from numpy.linalg import cholesky, det, lstsq
from scipy.optimize import minimize

def nll_fn(X_train, Y_train, noise, naive=True):
    '''
    Returns a function that computes the negative log marginal
    likelihood for training data X_train and Y_train and given
    noise level.

    Args:
        X_train: training locations (m x d).
        Y_train: training targets (m x 1).
        noise: known noise level of Y_train.
        naive: if True use a naive implementation of Eq. (10), if
               False use a numerically more stable implementation.

    Returns:
        Minimization objective.
    '''
    Y_train = Y_train.ravel()  # ensure shape (m,) so the objective returns a scalar

    def nll_naive(theta):
        # Naive implementation of Eq. (10). Works well for the examples
        # in this article but is numerically less stable compared to
        # the implementation in nll_stable below.
        K = kernel(X_train, X_train, l=theta[0], sigma_f=theta[1]) + \
            noise ** 2 * np.eye(len(X_train))
        return 0.5 * np.log(det(K)) + \
               0.5 * Y_train.T.dot(inv(K).dot(Y_train)) + \
               0.5 * len(X_train) * np.log(2 * np.pi)

    def nll_stable(theta):
        # Numerically more stable implementation of Eq. (10) as described
        # in http://www.gaussianprocess.org/gpml/chapters/RW2.pdf, Section
        # 2.2, Algorithm 2.1.
        K = kernel(X_train, X_train, l=theta[0], sigma_f=theta[1]) + \
            noise ** 2 * np.eye(len(X_train))
        L = cholesky(K)
        return np.sum(np.log(np.diagonal(L))) + \
               0.5 * Y_train.T.dot(lstsq(L.T, lstsq(L, Y_train)[0])[0]) + \
               0.5 * len(X_train) * np.log(2 * np.pi)

    if naive:
        return nll_naive
    else:
        return nll_stable


# Minimize the negative log-likelihood w.r.t. parameters l and sigma_f.
# We should actually run the minimization several times with different
# initializations to avoid local minima but this is skipped here for
# simplicity.
res = minimize(nll_fn(X_train, Y_train_noisy, noise), [1, 1],
               bounds=((1e-5, None), (1e-5, None)),
               method='L-BFGS-B')

# Store the optimization results in global variables so that we can
# compare them later with the results from other implementations.
l_opt, sigma_f_opt = res.x
l_opt, sigma_f_opt

# Compute the posterior predictive statistics with optimized kernel parameters and plot the results
mu_s, cov_s = posterior_predictive(X, X_train, Y_train_noisy, l=l_opt, sigma_f=sigma_f_opt, sigma_y=noise)
plot_gp(mu_s, cov_s, X, X_train=X_train, Y_train=Y_train_noisy)
plt.axis([-5, 5, -3, 3])
plt.title(f'After parameter optimization: l = {l_opt:.2f}, sigma_f = {sigma_f_opt:.2f}, sigma_y = {noise}')
plt.legend()
plt.show()


With optimized kernel parameters, training data are reasonably covered by the 95% confidence interval and the mean of the posterior predictive is a good approximation.
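
As a quick optional check, the stable objective defined above can be evaluated at the initial and at the optimized parameters; the optimized parameters should give a lower negative log marginal likelihood (a small sketch reusing the variables from the previous block):

nll = nll_fn(X_train, Y_train_noisy, noise, naive=False)
print('NLL at initial parameters (l=1, sigma_f=1):', nll([1.0, 1.0]))
print('NLL at optimized parameters:              ', nll([l_opt, sigma_f_opt]))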

Higher dimensions

The above implementation can also be used for higher-dimensional input data. Here, a GP is used to fit noisy samples from a sine wave originating at $\boldsymbol{0}$ and expanding in the x-y plane. The following plots show the noisy samples and the posterior predictive mean before and after kernel parameter optimization.

from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D

def plot_gp_2D(gx, gy, mu, X_train, Y_train, title, i):
    ax = plt.gcf().add_subplot(1, 2, i, projection='3d')
    ax.plot_surface(gx, gy, mu.reshape(gx.shape), cmap=cm.coolwarm, linewidth=0, alpha=0.2, antialiased=False)
    ax.scatter(X_train[:, 0], X_train[:, 1], Y_train, c=Y_train, cmap=cm.coolwarm)
    z = mu.reshape(gx.shape)
    ax.contourf(gx, gy, z, zdir='z', offset=0, cmap=cm.coolwarm, alpha=0.6)
    ax.set_title(title)

noise_2D = 0.1

rx, ry = np.arange(-5, 5, 0.3), np.arange(-5, 5, 0.3)
gx, gy = np.meshgrid(rx, ry)

X_2D = np.c_[gx.ravel(), gy.ravel()]

X_2D_train = np.random.uniform(-4, 4, (100, 2))
Y_2D_train = np.sin(0.5 * np.linalg.norm(X_2D_train, axis=1)) + \
             noise_2D * np.random.randn(len(X_2D_train))

plt.figure(figsize=(14, 7))

mu_s, _ = posterior_predictive(X_2D, X_2D_train, Y_2D_train, sigma_y=noise_2D)
plot_gp_2D(gx, gy, mu_s, X_2D_train, Y_2D_train,
           f'Before parameter optimization: l={1.00} sigma_f={1.00}', 1)

res = minimize(nll_fn(X_2D_train, Y_2D_train, noise_2D), [1, 1],
               bounds=((1e-5, None), (1e-5, None)),
               method='L-BFGS-B')

mu_s, _ = posterior_predictive(X_2D, X_2D_train, Y_2D_train, *res.x, sigma_y=noise_2D)
plot_gp_2D(gx, gy, mu_s, X_2D_train, Y_2D_train,
           f'After parameter optimization: l={res.x[0]:.2f} sigma_f={res.x[1]:.2f}', 2)
plt.show()


Note how the true sine wave is approximated much better after parameter optimization.
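
To quantify the improvement, the RMSE of the posterior mean against the noise-free sine wave can be compared before and after optimization (an optional sketch reusing the variables from the block above):

f_true_2D = np.sin(0.5 * np.linalg.norm(X_2D, axis=1))

mu_before, _ = posterior_predictive(X_2D, X_2D_train, Y_2D_train, sigma_y=noise_2D)
mu_after, _ = posterior_predictive(X_2D, X_2D_train, Y_2D_train, *res.x, sigma_y=noise_2D)

rmse_before = np.sqrt(np.mean((mu_before.ravel() - f_true_2D) ** 2))
rmse_after = np.sqrt(np.mean((mu_after.ravel() - f_true_2D) ** 2))
print('RMSE before optimization:', rmse_before)
print('RMSE after optimization: ', rmse_after)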

Libraries that implement GPs

This section shows an example of a library that provides an implementation of GPs. Only a minimal setup is provided here, just enough to reproduce the above results. For further details please consult the documentation of the library.

Scikit-learn

Scikit-learn provides a GaussianProcessRegressor for implementing GP regression models. It can be configured with pre-defined kernels and user-defined kernels. Kernels can also be composed. The squared exponential kernel is the RBF kernel in scikit-learn. The RBF kernel only has a length_scale parameter which corresponds to the $l$ parameter above. To have a $\sigma_f$ parameter as well, we have to compose the RBF kernel with a ConstantKernel.

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rbf = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=rbf, alpha=noise**2)

# Reuse the noisy training data from the previous 1D example
gpr.fit(X_train, Y_train_noisy)

# Compute posterior predictive mean and covariance
mu_s, cov_s = gpr.predict(X, return_cov=True)

# Obtain optimized kernel parameters
l = gpr.kernel_.k2.get_params()['length_scale']
sigma_f = np.sqrt(gpr.kernel_.k1.get_params()['constant_value'])

# Compare with previous results
assert(np.isclose(l_opt, l))
assert(np.isclose(sigma_f_opt, sigma_f))

# Plot the results
plot_gp(mu_s, cov_s, X, X_train=X_train, Y_train=Y_train_noisy)
plt.legend()
plt.show()
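
GaussianProcessRegressor also provides a few conveniences worth knowing about; as a brief optional sketch, posterior samples can be drawn directly with sample_y, and the optimized kernel and its log marginal likelihood are available as attributes of the fitted model.

# Draw 10 samples from the posterior predictive at the test inputs X
samples_sklearn = gpr.sample_y(X, n_samples=10)

# Optimized kernel and the corresponding log marginal likelihood
print(gpr.kernel_)
print(gpr.log_marginal_likelihood_value_)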


References

[1] Kevin P. Murphy. Machine Learning, A Probabilistic Perspective, Chapters 4, 14 and 15.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning, Chapter 6.
[3] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning.
[4] Guibo Wang. Gaussian processes.
[5] Martin Krasser. Gaussian processes.