By now, most of you have probably followed deep learning research for quite a while. In 1998, LeCun et al. introduced the very simple MNIST dataset of handwritten digits and showed with their LeNet-5 that a high validation accuracy can be achieved on it. The datasets proposed since then have become more complex (e.g., ImageNet or Atari games), and hence the models performing well on them have become more sophisticated, i.e. more complex, as well. Simultaneously, the tasks these models can perform have also become more complex, e.g., Goodfellow et al.’s GANs (2014) or Kingma & Welling’s VAEs (2014). One of my personal highlights is Eslami et al.’s Neural scene representation and rendering (2018), which clearly shows that neural networks can by now perform fairly complex tasks. I see a clear upward trend in the complexity of tasks neural networks can perform and do not believe this trend will stop or slow down soon.
In this post, I will give clear arguments for why Bayesian methods are so widely applicable and why we must apply them when we want to solve more complex tasks.
Notably, I use the term complexity as it is understood in the social sciences, because I am personally interested in very high-level patterns in social systems such as, e.g., students at a university or migrants from low-income countries. Just open a newspaper: these problems are among the most pressing ones of our contemporary era, so I would also like to encourage all of you to attempt to solve them with machine learning, ideally Bayesian deep learning.
What is complexity?
In the context of deep learning, complexity is best understood through complex systems. Systems are ensembles of agents which interact in one way or another; together, these agents form a whole. One of the fundamental characteristics of complex systems is that their agents potentially interact non-linearly. Complexity researchers commonly agree on two disparate levels of complexity: simple or restricted complexity (Byrne, 2005), and complex or general complexity (Morin, 2006). While general complexity can, by definition, not be mathematically modelled in any way, restricted complexity can.
Following Byrne (2005), most social systems fall under the category of general complexity, which is already a fair limitation to our mathematical approach; for now, however, we treat any system we take into account as possessing merely restricted complexity. We do so because, given the pace of progress in deep learning-based solutions for ever more complex tasks, our deep learning tasks of the future might be to investigate patterns in societies. Since the agents of societies, which are all of us, do sometimes interact unpredictably, we can certainly state that the non-linearity property is fulfilled.
Modelling complex systems
Bearing in mind that mathematical descriptions of anything more or less complex are merely models and not fundamental truths (Cohen & Stewart, 1995), we can directly deduce that Bayesian inference is in any complex instance more appropriate to use than frequentist inference. In Bayesian inference, we treat the model parameters θ as random variables, i.e. we want to learn a distribution over them, whereas in frequentist inference we treat the model parameters θ as fixed but unknown descriptions of the truth. In frequentist inference, our conducted experiments are seen as samples from an infinitely large set of repetitions of the exact same experiment (e.g., rolling a fair die, where the true model parameters θ determining the outcome of each roll are 1/6 for each side). Before conducting our series of experiments, we define how often we want to repeat it. After this number of experiments has been conducted, we compute a confidence interval for each model parameter. Note the frequentist interpretation: under repeated sampling, the procedure producing these intervals covers the true parameter with a given probability; any single interval either contains the true parameter or it does not.
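To make the two views concrete, here is a minimal sketch in Python (the 300 rolls, the uniform prior, and all variable names are illustrative assumptions, not from the text above): a frequentist point estimate of a die’s face probabilities next to a Bayesian Dirichlet posterior over them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frequentist view: theta is fixed (1/6 per face); we estimate it from repeated rolls.
true_theta = np.full(6, 1 / 6)
rolls = rng.choice(6, size=300, p=true_theta)
counts = np.bincount(rolls, minlength=6)
mle = counts / counts.sum()          # point estimate of the "fixed but unknown" parameters

# Bayesian view: theta is a random variable with a Dirichlet prior,
# updated to a Dirichlet posterior after observing the counts.
prior_alpha = np.ones(6)             # uniform prior over the probability simplex
posterior_alpha = prior_alpha + counts
posterior_mean = posterior_alpha / posterior_alpha.sum()

print(np.round(mle, 3))
print(np.round(posterior_mean, 3))
```

With this much data the posterior mean and the point estimate nearly coincide; the difference is that the Bayesian output is a full distribution over θ, not a single value.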
So, in frequentist inference, we see the model parameters θ as unchangeable; hence this school of thought argues that an ideal model, i.e. one obtained by conducting our experiments infinitely often, represents the true state of anything more or less complex.
In essence, this means that proponents of frequentist inference claim that we are able to describe the truth with an infinite number of experiments and that these model parameters θ describing the truth will never change.
This might work for very simple experiments, but it runs fundamentally against Cohen & Stewart’s (1995) ideas, which concern natural systems, hence a higher level of complexity than simple experiments comparable to rolling dice. They believe that systems can change over time, regardless of anything that happened in the past, and can develop new phenomena which have not been present before. This line of argument is again very much aligned with the definition of complexity from a social sciences angle (see emergence).
What does this mean for Bayesian inference?
In Bayesian inference, we learn the model parameters θ in the form of probability distributions. Doing so, we keep them flexible and can update their shape whenever new observations of a system arrive. Doesn’t this sound more plausible when we aim to model complex systems?
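As a small illustration of this updating (the binary observation sequence below is made up for the example), a Beta prior over a Bernoulli parameter can have its shape updated one observation at a time, with the posterior after each observation serving as the prior for the next:

```python
# Sequential Bayesian updating: the posterior after each observation
# becomes the prior for the next one. A Beta prior on a Bernoulli
# parameter stays Beta, so each update is just a count increment.
observations = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical binary observations of a system

a, b = 1.0, 1.0                # Beta(1, 1): uniform prior over theta
for x in observations:         # one update per arriving observation
    a, b = a + x, b + (1 - x)  # posterior becomes Beta(a, b)

posterior_mean = a / (a + b)   # E[theta | data]
print(a, b, posterior_mean)    # Beta(7, 3), posterior mean 0.7
```

Crucially, nothing forces us to stop here: if the system drifts and new observations arrive, we simply keep updating the distribution’s shape.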
But, this is not the only reason why Bayesian inference is more appropriate than frequentist inference to use for complex tasks.
Previously, we defined a complex system as one whose components, i.e. its agents, potentially interact non-linearly. No deterministic mathematical formalism for such non-linear complex systems has been developed so far, and arguably never will be, because complex systems are, by definition, non-deterministic. In other words, if we repeat an experiment in a complex system, the outcome of the second experiment won’t be the same as that of the first.
This is still perfectly feasible in both Bayesian and frequentist inference, but what is the obvious path to take whenever we cannot have a deterministic solution for anything?
Exactly, we approximate.
This is precisely what we do in Bayesian methods: the intractable posterior probability distribution p(θ|D) is approximated, either by a variational distribution q(θ), as we like to do it in neural networks, or with Monte Carlo methods, as we often do it in probabilistic graphical models. In frequentist inference, we instead take a somewhat ignorant approach and simply repeat the experiment more often, in the hope of finally obtaining an acceptably narrow confidence interval. This “level of acceptance” is set by the significance level α, which you have probably seen before; it is commonly agreed that the corresponding confidence level 1 − α should be at least 95%, i.e. α at most 5%.
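As a sketch of the Monte Carlo route (the coin-flip posterior, step size, and sample counts below are all illustrative assumptions), a few lines of random-walk Metropolis sampling can approximate a posterior whose normalising constant we never compute:

```python
import math
import random

random.seed(0)

# Target: unnormalised posterior of a coin's bias theta after 7 heads
# in 10 flips under a uniform prior: p(theta | D) ∝ theta^7 * (1 - theta)^3.
def log_post(theta):
    if not 0.0 < theta < 1.0:
        return -math.inf
    return 7 * math.log(theta) + 3 * math.log(1 - theta)

# Random-walk Metropolis: propose a nearby theta, accept with the usual ratio.
samples, theta = [], 0.5
for _ in range(50_000):
    proposal = theta + random.gauss(0.0, 0.1)
    if math.log(random.random()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)

burned = samples[5_000:]                 # discard burn-in
est = sum(burned) / len(burned)
print(round(est, 2))                     # close to 2/3, the analytic Beta(8, 4) mean
```

Because the target here is conjugate, we can check the sampler against the exact Beta(8, 4) posterior mean of 2/3; for the intractable posteriors of real models, such Monte Carlo (or variational) approximations are the only option.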
What does this all mean for the future of deep learning research?
Well, also consider that any deep learning model is actually a complex system by itself: we have neurons, which you can see as agents, and non-linear activation functions between them, which you can see as the agents’ non-linear relations. In a CNN, for example, filters develop out of this setting, which you can see as an emergent phenomenon.
With the increasing complexity of the tasks we want deep learning to solve, while maintaining an understanding of complexity as it is present today in the social sciences, there is probably no way around the application of Bayesian deep learning to solve those tasks. Neural networks are one of the few methods which can find non-linear relations between random variables, one strong desideratum of complex system modelling. Additionally, Bayesian inference is naturally inductive and generally approximates the truth instead of aiming to find it exactly, as frequentist inference does.
Remember that this is just one more argument for utilising Bayesian deep learning, besides the advantages of having a measure of uncertainty and the natural embodiment of Occam’s razor.
I hope you enjoyed reading this not very technical post. Besides the articles I already linked, below are two social sciences books which I can highly recommend.