In this post, we’ll elaborate on the two major school of thoughts how we can approximate an intractable function in probabilistic models. In my previous post about inference in deep neural networks has one special method been elaborated, but there are many more which are widely used in other forms of probabilistic models, e.g. probabilistic graphical models (PGM), but not yet in neural networks. David MacKay’s book is the main source for this post; it’s excellent, please consider reading the chapters you’re interested in. The adequate page numbers are given to refer you directly to the points you want to study.
Let’s first take a step back and thoroughly understand why we actually need to approximate some unknown underlying true distribution. Keep in mind that we move in the realm of neural networks which usually have thousand or millions of parameters to tune during the learning process, i.e. backpropagation. Each of these parameters can be understood as one hidden variable in a PGM. So, not only the graph itself will become quite overwhelming, also the calculation of the joint distribution, and, consequently, a function capturing all these latent variables is also quite complex.
If you cannot follow the last paragraph, checkout my previous post , please.
What we mean with complex: it is simply not possible to calculate the underlying true distribution in any foreseeable time, or integrals of smooth functions are in the calculations involved which can numerically never be calculated exactly.
We can visualise that through landscapes of some picturesque nature.
We first go to Ireland, enjoying the smooth and regular hills.
Such a three-dimensional function is a fairly simple multivariate distribution, that can probably be calculated in some time; hence, it doesn’t need to be approximated.
But, we are unfortunately not in Ireland with our deep learning models, we are in Switzerland.
Deep learning models often have thousands or millions of parameters that makes a distribution looking a bit more complex now. Therefore, it can probably not be calculated in any foreseeable time exactly. So, what do we do?
The solution is, we approximate the unknown true function.
For this approximation, we can choose from two major different approaches: Monte Carlo methods and variational methods.
In Monte Carlo methods, we take samples from the unknown true distribution, whereas we only examine variational methods as the representative for deterministic methods, because of their appropriate usage in deep learning. There, we define another, much simpler distribution that should be at the end of its optimisation somehow similar to the unknown true one. The next graph hopefully helps you seeing the difference.
Let us now delve deeper into the two different approaches.
1 Monte Carlo methods
This family of approximation methods is particularly well-suited when we are searching the mean, i.e. the expected value, of some function p(z|x), but we can also use it to generate samples from a given probability distribution, thus seeing some bits and pieces from it that helps us guessing how the overall distribution might look like. (book pp. 357–361)
Bear in mind, while we cannot compute the unknown true distribution exactly, we can still draw random samples from it.
But let us for now only solve the problem of finding the mean of some function p(z|x). Let us call p(z|x) the target density which we want to approximate by drawing samples from it, which define the function f(z). Analytically, we solve this by calculating the integral over all possible states of f(z), as follows:
But, we might not even know all possible states of f(z) and we are acquainted with the fact that an integral can never be calculated exactly and might even not have an analytical solution for complex distributions. Thus, we draw S samples from the unknown true distribution and can also achieve fairly good results with it:
It looks all fairly complicated, I must admit that, but we elaborate this a bit more with an example now.
Imagine you go each year on holidays to Indonesia. Most parts of Indonesia lie quite close to the equator, so the climate is fairly stable with year-round temperatures of 30–40°C. You also keep very well track of the temperatures in each of your holidays. Just a week before you are going to board the plane to Indonesia for this year’s holidays, you ask yourself how warm it might be this year. You’re not happy with the interval 30–40°C, you want to know it more precisely. So, there is some unknown true distribution of the temperatures in which it is exactly prescribed how warm it has been the last years and how warm it’ll be this year, but nobody in the world is smart enough to calculate this distribution exactly. Of course, you know that and what you do instead is looking at your logbook entries of the last years. Can you guess what you actually do with “looking at your logbook entries”? Exactly, you take samples from the unknown true distribution.
So, what happens when the number of samples increases? We become more confident and the function f(z) seems to become closer to p(z|x), i.e. the variance shrinks.
There exist many different methods how we can sample from the unknown true distribution, but we will only elaborate the most often used ones here. With all the subsequent methods, we start with the scenario that the unknown true distribution p(z|x) is intractable and we want to approximate it.
1.1 Importance sampling
(book pp. 361–364)
For purposes of simplicity, we see our target density p(z|x) = p(z) as one-dimensional, but still somehow too complex to draw samples directly from it. In importance sampling, we define first a simpler distribution q(z) and second draw S samples from it. If these S samples came from p(z), we could simply calculate
But, as we outlined, the S samples come from q(z) and not from p(z), i.e. sampled values which are greater than p(z) are over-represented and sampled values which are less than p(z) are under-represented. We solve this dilemma by introducing weights which take care of this unbalance.
Incorporating those into the calculations for the expected value, we arrive at
Graphically, we can illustrate the involved functions as follows:
1.2 Rejection sampling
(book pp. 364–365)
Same as before, our target density p(z) is one-dimensional, but too complex to draw samples directly from and we define a simpler proposal density kq(z); k is a constant, such that it is always greater than the unnormalised target density p̃(z):
This could look then, for instance, like
If you observe the previous graph carefully, you detect an u⁰ at a random point z⁰. Let us call this two-dimensional point (u⁰, z⁰). We generate it by first drawing z⁰ from kq(z) and second drawing u⁰ from a uniform distribution with the interval [0, kq(z⁰)]. That our u⁰ in the above graph is right at the outer boundary of this interval, hence u⁰ = kq(z⁰) is purely random. Then, we simply evaluate if (u⁰, z⁰) lies within p̃(z) and would accept z⁰ as a point of the unknown true distribution p(z), if it did so. If (u⁰, z⁰) did not, we would reject it. This last step might sound a bit like magic, but we’ll carefully elaborate it now
The proposed points (u, z) come with uniform probability from the shaded area underneath the curve kq(z) as shown in the above figure. The rejection rule rejects all the points that lie above the curve p̃(z). So, the points (u, z) that are accepted are uniformly distributed in the unshaded area under p̃(z). This implies that the probability density of the z-coordinates of the accepted points must be proportional to p̃(z), so the samples must be independent samples from p(z).
Obviously, rejection sampling only works well when the proposal density kq(z) is fairly similar to the target density p(z), because this procedure would result in a large frequency of rejection for two very differently defined distributions. We must consider this when selecting an appropriate Monte Carlo method for our given problem.
1.3 Metropolis-Hastings method
(book pp. 365–370)
In contrast to the two previous methods which both only work well if kq(z) is fairly similar to p(z), the Metrolpolis-Hastings algorithm performs also well for complex densities p(z) by defining multiple kq(z) and depending each on the current state z. Now, a perspective similar to rejection sampling comes into play. We start at an initial state z⁰ and only accept a tentative new state z¹ that is generated from a new proposal density kq(z¹|z⁰) if the probability a is greater than 1. If a is lower than 1, we accept the new tentative state z¹ with the probability a. Generically, with
we can write this as
Since we make our new tentative states dependent on the previous state, we deal here for the first time with a Markov chain Monte Carlo method.
A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
Note the contrast to rejection sampling where each sample is independent.
Everything else is very similar to rejection sampling. The accepted samples, here called states, define an approximation of the unknown true distribution p(z). This procedure of learning p(z) is often called convergence.
I abstain from any figures here, because I would like to kindly refer you to Alex Rogozhnikov’s brilliant illustration in his blog post.
1.4 Gibbs sampling
(book pp. 370–371)
Gibbs sampling can be seen as a special case of the aforementioned Metropolis-Hastings algorithm in which we must have at least two dimensions. We have an unknown true distribution p(z) from which we wish to sample and have selected a state to begin z⁰ of the Markov chain. As with all Markov chain methods, we have multiple steps. Each step of the Gibbs sampling procedure involves replacing the value of one of the variables by a value drawn from the distribution of that variable conditioned on the values of the remaining variables. For the first three variables, we can write this as
If we worked with a PGM, we could illustrate this as Bishop (2006) does:
The variables of interest are left the white node in a Markov network, and right the white node in a Bayesian network. Both nodes are the variables for which we draw samples. The distributions from which we draw these are conditioned on all residual nodes, i.e. variables.
A quite interesting fact is that a tentative new state in Gibbs sampling is always accepted, as we derive here. Let us define z’ as a proposed new state, z as the current state, m and i as some indices for the variable vectors z or z’. The probability of acceptance a is then
It might sound confusing that z’ = z, but these components are unchanged by the sampling step, merely
It might sound quite overwhelming, but we now investigate its application to a Gaussian distribution. We take a correlated, hence two-dimensional, Gaussian distribution into account, as you can see in the subsequent figure. Furthermore, we have conditional distributions of width l and marginal distributions of width L. The step size is governed by the standard deviation of the conditional distribution (green curve), and is O(l), leading to slow progress in the direction of elongation of the joint distribution (red ellipse). The number of steps needed to obtain an independent sample from the distribution is O((L/l)²).
For now, please note the limitation of this method being a random walk (blue line) that sometimes finds probable samples, but sometimes not.
1.5 Hamiltonian Monte Carlo
(book pp. 387–390)
Hamiltonian Monte Carlo is another Metropolis method, but benefits, in contrast to Gibbs sampling, from the evaluation of the gradient of some unknown distribution p(z). This leads to finding probable samples much more efficient. This method is only applicable to distributions over continuous variables for which we can readily evaluate the gradient of the log probability with respect to the state variables. It has its origins in physics, precisely in quantum mechanics, so we speak here of energies. It might sound complicated, but it will help you understanding the entire algorithm.
For many systems whose probability p(z) can be written as
where we interpret E(z) as the potential energy of the system when in state z, the gradients of p(z) can be computed straightforwardly. They indicate which direction one should go in to find states that have higher probability. Z is a normalisation variable.
Again, I abstain from any figure here, simply because Alex Rogozhnikov’s brilliant illustration in his blog post are too good to be missed.
As in any system, the Hamiltonian H is the sum of the potential energy E, dependent on the state z, and the kinetic energy K, dependent on the velocity v.
with directions of the velocity i.
With these two components, we draw samples from the joint probability distribution, which is
With a bit of mathematical background knowledge, you can directly see that we can separate this equation. By simply ignoring the momentum variable K(v), we obtain samples for z when we draw samples from the joint probability distribution p(z, v).
Exactly as in any Metropolis method, we have a probability of acceptance a again. A tentative new state z’ is proposed and the current state z is also taken into consideration.
2. Variational methods
Variational methods always define an approximate distribution q(z) which we want to optimse, so it is as similar as possible to the unknown true distribution p(z). While we have in most cases literally no clue how p(z) might look like, we define for purposes of simplicity q(z) very often as a Gaussian or multivariate Gaussian distribution. Let us here assume q(z) is a Gaussian distribution with (variational) parameters θ that want to be learnt, so q(z) becomes as similar as possible to p(z). In a neural network, the latent variables z are the weights w, hence we can write:
The ‘distance’ between these two distributions is estimated by the Kullback-Leibler (KL) divergence which is calculated as follows:
So, we must do nothing more than minimising this KL divergence which is, as a side note, an optimisation problem rather than an approximation problem.
Graves (2011) gives a well-explaining graphical intuition for this procedure.
Recall the definition of the KL divergence and you’ll directly notice that the integral is numerically not solvable, i.e. intractable. So, what are we supposed to do now? Let us have a closer look at the KL divergence:
Placing this in our optimisation function, we arrive at:
We are now left with a constant term that can be omitted and precedent term, widely known as the evidence lower bound, abbreviated as ELBO. Consequently, and due to the minus sign in front of the ELBO, we can say that maximising the ELBO is equivalent to minimising the KL divergence, for maximising the log-likelihood log p(D|θ), the most common cost function forprobabilistic models. Thus, we can also say that
what is shown in the subsequent figure graphically.
Side-note. We find here an important pre-factor β dependent on the batch i = 1, 2, . . . , M that has analytically no derivation, but must be included in the realm of neural networks. It weighs the prior dependent complexity cost, i.e. the KL divergence, relative to the data dependent likelihood cost, i.e. the ELBO. In other words, we want to emphasise the prior when we are early in the training phase, and have not been exposed to much data, whereas we want to rely our inference proportional to the amount of data we have seen, thus weigh the likelihood more when we have been exposed to more data. We consider multiple proposals for β as
We evaluated the different β’s for a Bayesian feed-forward neural network with two hidden layers, each with 400 units and got these results with MNIST.
As a mid-term conclusion, we can record that the KL divergence determines not only the ‘distance’ between the two considered distributions, but also the gap between p(D|θ) and the ELBO. This gap is often referred to as tightness of the bound. (Blei et al., 2017; Kingma, 2017, pp. 14–18)
So, let’s continue and take the log-likelihood, or model evidence, log p(D|θ) into account.
The last step uses Jensen’s inequality. When f is concave,
Finally, we aim to find optimal parameters θ that maximise the ELBO
which can be performed with the coordinate ascent algorithm. For a detailed explanation of this algorithm, please review Blei et al. (2017).
Bishop, C. M. (2006). Pattern recognition and machine learning. New York (NY): Springer.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.
Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
Graves, A. (2011). Practical variational inference for neural networks. In Advances in neural information processing systems (pp. 2348–2356).
Kingma, D. P. (2017). Variational inference & deep learning: A new synthesis.
Special thanks to Kumar Shridhar and Jesper Wohlert for their valuable comments. Follow both of them, they do phenomenal work!