The intersection of probabilistic graphical models (PGMs) and deep learning is a very hot research topic in machine learning at the moment. I drew on several sources for this post, of which Daphne Koller’s Coursera course stands out. All the background you need is covered in my first two posts (first, second). Please revisit them if a point is unclear, or leave a comment at the bottom of this page and I’ll answer.
A probabilistic graphical model is a graph that represents the conditional independence relationships between its nodes.
Nodes are random variables. When we shade a node, it means we have observed it and hence have data for it. Nodes left blank are unobserved; we call them latent or hidden variables.
For the above graph, conditional independence means that once we have observed b, additionally observing a tells us nothing more about c. More formally: c is conditionally independent of a given b.
We very often put a plate around some of the variables. A plate denotes repetition in a very general sense, and the N signalises how often we perform that repetition.
The most important characteristic of PGMs is that a graph can be translated into a joint distribution. Since any PGM is a representation of conditional independences, we can write the joint distribution for the above graph as follows:
The bold b and c signalise that we have vectors here instead of scalars. That’s because we repeat b and c N times, and then stack the scalars we get for each repetition into a vector.
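For concreteness, here is how such a factorisation would look under one assumption: that the graph is the chain a → b → c with the plate drawn around b and c. The figure is not reproduced here, so treat the exact edge structure as an assumption rather than as the original formula.

```latex
p(a, \mathbf{b}, \mathbf{c})
  = p(a) \prod_{n=1}^{N} p(b_n \mid a)\, p(c_n \mid b_n)
```

Each factor is one node conditioned on its parents, and the product over n is what the plate stands for.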
Have another look at my Fundamentals 2 post if you have difficulties understanding this formula. It has a section about independence that explains everything you need to know here.
Now you may ask yourself why the joint distribution is important. Let us rewrite our beloved Bayes’ theorem and you’ll see it directly.
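Written out for a latent variable c and an observed variable b (the roles these two variables play in the rest of this post), the rewritten theorem would read:

```latex
p(c \mid b) = \frac{p(b \mid c)\, p(c)}{p(b)} = \frac{p(b, c)}{p(b)}
```

The second equality uses the product rule p(b, c) = p(b | c) p(c).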
It contains the joint distribution p(b, c), which you can calculate from the PGM.
Another important factor is p(b). We often call it the model evidence or the marginal likelihood. It can be calculated as follows:
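In standard notation, marginalising the latent variable c out of the joint gives the evidence. The integral is for a continuous c; for a discrete c it becomes a sum.

```latex
p(b) = \int p(b, c)\, \mathrm{d}c
\qquad \text{or, for discrete } c, \qquad
p(b) = \sum_{c} p(b, c)
```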
So, why do we do all that?
Remember that we have some observed variables and some latent variables. In many scenarios, the distribution of the latent variables is what we are interested in, so we want a method to calculate it. Bayes’ theorem describes exactly that procedure.
Let us look at the following graph and things might become clearer.
We observe b, but do not know a and c. Take the direction of the edges into account and you’ll notice that a does not matter at all for calculating c if we observe b. What is really of interest is c. So, how do we calculate the probability of c? Exactly, with p(c|b). Calculating this probability is called inference. There are multiple ways to calculate it, but we’ll cover these in the next post.
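To make the inference step tangible, here is a minimal sketch for three binary variables. It assumes the graph is the chain a → b → c, and all probability values are made up for illustration. It computes p(c|b) by summing a out of the joint and checks that additionally knowing a changes nothing.

```python
import numpy as np

# Hypothetical CPTs for a binary chain a -> b -> c (all numbers are made up).
p_a = np.array([0.6, 0.4])                 # p(a)
p_b_given_a = np.array([[0.7, 0.3],        # p(b | a=0)
                        [0.2, 0.8]])       # p(b | a=1)
p_c_given_b = np.array([[0.9, 0.1],        # p(c | b=0)
                        [0.4, 0.6]])       # p(c | b=1)

# Joint p(a, b, c) = p(a) p(b|a) p(c|b), stored with axes (a, b, c).
joint = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_b[None, :, :]


def posterior_c_given_b(b):
    """p(c | b) = p(b, c) / p(b), with a summed out of the joint."""
    p_bc = joint.sum(axis=0)   # p(b, c), axes (b, c)
    p_b = p_bc.sum(axis=1)     # p(b), the model evidence
    return p_bc[b] / p_b[b]


def posterior_c_given_ab(a, b):
    """p(c | a, b) = p(a, b, c) / p(a, b)."""
    return joint[a, b] / joint[a, b].sum()


print(posterior_c_given_b(1))      # distribution over c after observing b=1
print(posterior_c_given_ab(0, 1))  # same distribution: a adds no information
```

Enumerating the full joint like this only works for tiny models, because the table grows exponentially with the number of variables.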
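As a side note, the aforesaid “a does not matter at all” has a name: c is conditionally independent of a given b, because the observed b blocks the path between them. Do not confuse this with explaining away, which is the opposite effect: in a graph a → b ← c, observing the common effect b makes its otherwise independent causes a and c dependent.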
So far, we’ve only taken three variables into account, but you can draw PGMs for problems with many more variables. See, for example, the following.
x5, x6, x7 could be what you’ll eat for breakfast, lunch and dinner respectively, and all z’s are the variables influencing it. For example, z1 could be weekday or weekend, z2 meeting out of office or not, z3 coming home late or not, and so on.
Proving independence here might be a bit of a struggle, but it is actually not that difficult. There is a criterion called d-separation that can be checked algorithmically. Watch Daphne Koller’s lecture about it and you’ll understand it.
What I just described in the above example was a discrete model. Its variables can only take values from a countable set, which we typically encode as integers. We have also discussed that in Fundamentals 1, so you might already know it.
What we can do with a PGM whose variables are all discrete is define a conditional probability table (CPT). For a simple PGM like the one below left, the CPT could look like the one below right. It gives us, for example, the probability that A = 0 given B = 0 and C = 1, which we write as p(A=0|B=0, C=1). It is 0.3.
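In code, a CPT is just a lookup table. The sketch below assumes A, B and C are binary and that A has B and C as parents. Only the entry p(A=0|B=0, C=1) = 0.3 is taken from the text; every other number is a made-up placeholder.

```python
import numpy as np

# cpt[b, c, a] = p(A=a | B=b, C=c).
# Only the value 0.3 for (B=0, C=1, A=0) comes from the text;
# all other entries are hypothetical placeholders.
cpt = np.array([
    [[0.5, 0.5],    # B=0, C=0
     [0.3, 0.7]],   # B=0, C=1
    [[0.8, 0.2],    # B=1, C=0
     [0.1, 0.9]],   # B=1, C=1
])


def p_a_given_bc(a, b, c):
    """Look up p(A=a | B=b, C=c) in the table."""
    return cpt[b, c, a]


print(p_a_given_bc(a=0, b=0, c=1))
```

Each row of the table is a distribution over A, so for every combination of B and C the entries must sum to one.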
As a counterpart to discrete models, there are continuous models. We have already spoken about them in Fundamentals 1, so please go back if you need a refresher.
What the CPT is for discrete models, the conditional probability function (CPF) is for continuous ones. The PGMs then look slightly different, but we can still understand them quickly if we know what all the variables mean.
Let’s assume one of the variables in the above graph of the discrete model, say B, is continuous. A continuous variable is described by a distribution (for the sake of simplicity, let’s assume a normal distribution). What parameters do we need to define it? Exactly, mean μ and standard deviation σ. We simply add these two parameters as additional nodes and we’re all set.
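As a rough sketch of what that means in code: once μ and σ are fixed, the density of B can be evaluated at any value and samples of B can be drawn. The function name and the example values of μ and σ are illustrative assumptions, not part of the original model.

```python
import math
import random


def normal_pdf(b, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at b."""
    z = (b - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Hypothetical parameter values for the two extra nodes.
mu, sigma = 1.0, 2.0

print(normal_pdf(1.0, mu, sigma))   # density at the mean
print(random.gauss(mu, sigma))      # one random draw of B
```

Unlike a CPT entry, the value returned by `normal_pdf` is a density and not a probability, so it can exceed one for small σ.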
That was it. We now understand the basics and can progress to the part that really matters: how can we calculate the latent variables when we deal with huge amounts of data and parameters, as we typically do in deep learning?