Chapter 6. Bayesian Approaches#
Contents#
Bayes’ theorem
Bayesian linear regression (Bishop, Chapter 3)
Bayesian model comparison
Bayesian networks (Bishop, Chapter 8)
Dynamic Bayesian inference
Bayesian Brain
Sensory psychophysics
Cortical circuit
Bayes’ Theorem#
Let us recall the two fundamental rules of probability regarding random variables \(X\) and \(Y\):
Sum rule:
$$p(X) = \sum_Y p(X,Y)$$
Product rule:
$$p(X,Y) = p(Y|X)\,p(X)$$
From the symmetry of joint probability \(p(X,Y)=p(Y,X)\),
we have the relationship \(p(X|Y)p(Y) = p(Y|X)p(X)\),
which brings us to Bayes’ theorem:
$$p(X|Y) = \frac{p(Y|X)\,p(X)}{p(Y)}$$
Using Bayes’ theorem, we can convert one conditional probability to the other. This simple formula has turned out to be very insightful in the context of sensory processing and learning.
Suppose \(X\) is the variable of your interest, such as the existence of your target, and \(Y\) is a noisy sensory input. What sensory input \(Y\) you would receive, depending on whether your target exists or not, is represented by a sensory model \(p(Y|X)\).
Your knowledge or assumption about the existence of your target is represented by \(p(X)\), called the prior probability.
For a given sensory input \(y\), \(p(Y=y|X)\) as a function of \(X\) is called the likelihood of the state \(X\).
The probability of the target \(X\) existing after observing \(y\), \(p(X|Y=y)\), is called the posterior probability.
Bayes’ theorem gives a theoretical basis for the intuition that the posterior probability is proportional to the product of the prior probability and the likelihood.
Example: mouse in a bush#
You are a cat chasing a mouse and have heard a rustling sound from a bush. In about half of the cases, a hiding mouse makes a sound, but about 10% of the time you hear rustling caused just by the wind. How would you estimate the probability that a mouse is hiding in the bush?
| \(p(Y \vert X)\) | no mouse \(X=0\) | mouse hiding \(X=1\) |
|---|---|---|
| no sound \(Y=0\) | 0.9 | 0.5 |
| sound \(Y=1\) | 0.1 | 0.5 |
For example, if you heard a sound, \(Y=1\), what is the probability of a mouse hiding behind the bush, \(p(X=1|Y=1)\)? From the above table, 0.5?
No, actually. In the ‘sound’ row, 0.1 and 0.5 do not sum to one. They are likelihoods \(p(Y=1|X)\), not a probability distribution \(p(X|Y=1)\) over whether the mouse is in the bush or not.
The mouse could have run into other nearby bushes, so you assume that the prior probability of the mouse being in this bush is 30%:
| prior probability | no mouse \(X=0\) | mouse hiding \(X=1\) |
|---|---|---|
| \(p(X)\) | 0.7 | 0.3 |
With this prior probability, we can use Bayes’ theorem:
$$p(X=1|Y=1) = \frac{p(Y=1|X=1)\,p(X=1)}{p(Y=1)} = \frac{0.5 \times 0.3}{0.5 \times 0.3 + 0.1 \times 0.7} = \frac{0.15}{0.22} \approx 0.68$$
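To check the numbers, here is a minimal computation in plain Python (the variable names are ours):

```python
# posterior probability of a mouse given a sound, by Bayes' theorem
prior = {0: 0.7, 1: 0.3}   # p(X)
like = {0: 0.1, 1: 0.5}    # p(Y=1|X)
marginal = sum(like[x]*prior[x] for x in (0, 1))  # p(Y=1) = 0.22
posterior = like[1]*prior[1]/marginal             # p(X=1|Y=1)
print(posterior)  # 0.6818...: about a 68% chance of a mouse
```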
Iterative Bayesian Inference#
A useful property of Bayesian inference is that it can be applied iteratively to an incoming data stream.
We denote the sequence of observations up to time \(t\) as
$$y_{1:t} = \{y_1,...,y_t\}$$
and want to estimate the cause \(x\) of these observations
$$p(x|y_{1:t}).$$
If the observations are independent, their joint distribution is a product
$$p(y_{1:t}|x) = \prod_{\tau=1}^{t} p(y_\tau|x)$$
and thus the posterior can be decomposed as
$$p(x|y_{1:t}) \propto p(y_{1:t}|x)\,p(x) = p(y_t|x)\,p(y_{1:t-1}|x)\,p(x) \propto p(y_t|x)\,p(x|y_{1:t-1}).$$
This means that the posterior \(p(x|y_{1:t-1})\) computed by time \(t-1\) serves as the prior, to be combined with the likelihood \(p(y_t|x)\) of the newly arriving data for computing the new posterior \(p(x|y_{1:t})\).
This iterative update of the posterior is practically helpful for online inference using whatever data are available so far.
Coin toss#
Here is a simple example of estimating the parameter \(\mu\), the probability of a coin landing heads up, over multiple tosses.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# take samples
mu = 0.3 # probability of head
N = 6 # number of samples
y = np.random.choice(2, N, p=[1-mu, mu]) # binary observation sequence
y
array([0, 1, 0, 0, 0, 1])
dx = 0.01  # plot step
x = np.arange(0, 1, dx)  # range of the parameter
prior = np.ones(len(x))  # assume a uniform prior
for t in range(N):
    plt.figure()  # a new figure
    # prior
    plt.plot(x, prior, 'b')
    # observation
    plt.plot(y[0:t+1], np.arange(t+1)/N, 'ko')
    # likelihood
    likelihood = x*y[t] + (1-x)*(1-y[t])  # x if head, 1-x if tail
    plt.plot(x, likelihood, 'g')
    # product of prior and likelihood
    prilik = prior*likelihood
    plt.plot(x, prilik, 'c')
    # posterior by normalization
    marginal = sum(prilik)*dx  # integrate over the parameter range
    posterior = prilik/marginal  # normalize
    plt.plot(x, posterior, 'r')
    plt.xlabel(r'$\mu$')  # raw string avoids an invalid escape warning
    plt.legend(('prior', 'observation', 'likelihood', 'prior*like.', 'posterior'))
    plt.title('t = {0}'.format(t+1))
    # posterior as a new prior
    prior = posterior
As more data are collected, the posterior distribution of \(\mu\) becomes sharper and closer to the true value.
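As a side note (not in the original code), with a uniform prior the posterior after observing \(h\) heads in \(N\) tosses is the Beta distribution \(\mathrm{Beta}(h+1, N-h+1)\); assuming SciPy is available, we can check the numerical posterior against it:

```python
# check the final posterior against the analytic Beta posterior
from scipy.stats import beta as beta_dist  # alias avoids clashing with beta used later
h = np.sum(y)  # number of heads in N tosses
analytic = beta_dist.pdf(x, h+1, N-h+1)  # Beta(h+1, N-h+1) density over the grid
print(np.max(np.abs(posterior - analytic)))  # small: the two agree
```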
Bayesian approaches in machine learning#
“Bayesian” is quite popular in machine learning, but the term is used with different meanings:
To combine prior knowledge and the likelihood from observation
To assume a graphical model of data generation for estimation of the causes
To estimate the distribution of a variable, not a single point
In supervised learning:
avoid overfitting by introducing a prior distribution on the parameters
compare models by their probability of producing observed data
In reinforcement learning:
infer the environmental state from incomplete observation
estimate the distribution of reward, not just the expectation
In unsupervised learning:
infer hidden variables behind data
e.g. responsibility in Mixtures of Gaussians
Bayesian Linear Regression#
The standard linear regression (Chapter 3) assumes a linear regression function with additive noise
$$t_n = \b{w}^T\b{x}_n + \epsilon$$
where \(p(\epsilon)=\mathcal{N}(0,\beta^{-1})\).
In Bayesian linear regression, we assume that the weights are sampled from a prior distribution \(p(\b{w})=\mathcal{N}(\b{0},\alpha^{-1}I)\).
The likelihood of the parameter \(\b{w}\) for the target output \(\b{t}=(t_1,...,t_N)^T\) is
$$p(\b{t}|\b{w}) = \prod_{n=1}^{N} \mathcal{N}(t_n|\b{w}^T\b{x}_n, \beta^{-1}).$$
When both the prior and the likelihood are Gaussian, the posterior is also Gaussian and has the form
$$p(\b{w}|\b{t}) = \mathcal{N}(\b{w}|\b{m}, S)$$
where
$$\b{m} = \beta S X^T \b{t}$$
and
$$S^{-1} = \alpha I + \beta X^T X$$
with \(X=(\b{x}_1,...,\b{x}_N)^T\) as the design matrix.
If we let \(\alpha=0\), i.e., an infinitely large variance for the weight prior, this is equivalent to regular linear regression. Introducing a penalty term is commonly done in linear regression as
$$E(\b{w}) = \frac{\beta}{2}\sum_{n=1}^{N} (t_n - \b{w}^T\b{x}_n)^2 + \frac{\alpha}{2}\b{w}^T\b{w}$$
which is known as ridge regression with penalty \(\lambda=\alpha/\beta\).
The Bayesian regression gives a probabilistic interpretation of the role of the regularization parameter \(\alpha\).
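As a quick numerical check (a minimal sketch with hypothetical toy data), the posterior mean \(\b{m}\) coincides with the ridge regression solution with penalty \(\lambda=\alpha/\beta\):

```python
# sketch: the Bayesian posterior mean equals ridge regression with lambda = alpha/beta
import numpy as np
alpha, beta = 1., 10.
X = np.array([[1., -0.5], [1., 0.2], [1., 0.9]])  # toy design matrix (bias and input)
t = np.array([0.3, -0.1, 0.8])                    # toy targets
S = np.linalg.inv(alpha*np.eye(2) + beta*X.T@X)   # posterior covariance
m = beta*S@X.T@t                                  # posterior mean
w_ridge = np.linalg.solve(X.T@X + (alpha/beta)*np.eye(2), X.T@t)
print(np.allclose(m, w_ridge))  # True
```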
# We will use this frequently
def gauss(x, mu=0, sigma=1):
    """Gaussian distribution"""
    return np.exp(-((x-mu)/sigma)**2/2)/(np.sqrt(2*np.pi)*sigma)
# Distributions in the parameter and data spaces
alpha = 1. # inverse variance of weight prior
beta = 10 # inverse variance of observation noise
# prior distribution of weights
W = np.linspace(-3, 3, 30)
W0, W1 = np.meshgrid(W, W)
pw = gauss(W0)*gauss(W1)
# plot the distribution
plt.subplot(1,2,1)
plt.pcolormesh(W, W, pw)
# sample weight from prior distribution
K = 10
wpri = np.random.normal(scale=alpha**-0.5,size=2*K).reshape(K,2)
plt.plot(wpri[:,0], wpri[:,1], "r+")
plt.axis('square');
plt.title("Prior"); plt.xlabel("w0"); plt.ylabel("w1");
# plot model samples
xrange = np.array([-1., 1.]) # range of input x
plt.subplot(1,2,2)
for k in range(K):
    plt.plot(xrange, wpri[k,0]+wpri[k,1]*xrange)
plt.title("Model samples"); plt.xlabel("x"); plt.ylabel("y");
# after observation of data
N = 1
wt = np.array([-1, 1]) # 'true' weights
# sample data
X = np.random.uniform(xrange[0], xrange[1], size=N)
X = np.c_[np.ones(N), X] # prepend 1 in the leftmost column
t = wt@X.T + np.random.normal(size=N)/np.sqrt(beta)
plt.subplot(2,2,2) # start from top right
plt.plot(xrange, wt[0]+wt[1]*xrange, "g")
plt.plot(X[:,1], t, "r.") # training data
plt.xlim(xrange)
plt.title("Data"); plt.xlabel("x"); plt.ylabel("y");
# likelihood
like = 1
for n in range(N):
    like = like*gauss(t[n] - (W0+W1*X[n,1]), sigma=beta**(-0.5))
plt.subplot(2,2,1) # top left
plt.pcolormesh(W, W, like)
plt.axis('square')
plt.title("Likelihood"); plt.xlabel("w0"); plt.ylabel("w1");
# new posterior
plt.subplot(2,2,3)
post = pw*like
plt.pcolormesh(W, W, post)
plt.axis('square')
plt.title("Posterior"); plt.xlabel("w0"); plt.ylabel("w1");
S = np.linalg.inv(alpha*np.eye(2) + beta*X.T@X)
m = beta*S@X.T@t
print('m =', m)
print('S =', S)
# sample weights
wpost = np.random.multivariate_normal(m, S, K)
plt.plot(wpost[:,0], wpost[:,1], "r+")
# plot model samples
xrange = np.array([-1., 1.]) # range of input x
plt.subplot(2,2,4)
for k in range(K):
    plt.plot(xrange, wpost[k,0]+wpost[k,1]*xrange)
plt.plot(X[:,1], t, "r.") # training data
plt.title("Model samples"); plt.xlabel("x"); plt.ylabel("y");
plt.tight_layout() # adjust subplot margins
m = [-1.05040997 0.57394076]
S = [[0.28497351 0.39068827]
[0.39068827 0.78652914]]
Predictive distribution#
In Bayesian regression, the result is not a single weight vector but a distribution in the weight space. It is then reasonable to consider the distribution of the output under such uncertainty in the weights.
The output \(y\) for a new input \(\b{x}\) should have the distribution
$$p(y|\b{x},\b{t}) = \mathcal{N}(y|\b{m}^T\b{x}, \sigma^2(\b{x}))$$
where the variance of the output is given by
$$\sigma^2(\b{x}) = \beta^{-1} + \b{x}^T S\,\b{x}.$$
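As a minimal sketch (our own helper, not from the original notebook), this formula can be vectorized over the rows of a design matrix \(\Phi\):

```python
def predictive(Phi, m, S, beta):
    """Predictive mean and variance for each row (feature vector) of Phi"""
    y = Phi@m                                 # predictive mean m^T phi(x)
    var = 1/beta + np.sum(Phi@S*Phi, axis=1)  # sigma^2(x) = 1/beta + phi^T S phi
    return y, var
```

This is the same computation as `np.sum(Phip@S*Phip, axis=1)` used in the plotting code below.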
Let us see the example of approximating a sine function by Gaussian basis functions.
# 1D Gaussian basis functions
def gbf1(x, xrange=[-1.,1.], M=10):
    """Gaussian basis functions: x can be a 1D array"""
    xc = np.linspace(xrange[0], xrange[1], num=M)  # centers
    xd = xc[1] - xc[0]  # interval
    # x can be an array for N data points
    return np.exp(-((np.tile(x,[M,1]).T - xc)/xd)**2)
# example
x = np.linspace(-1, 1, 100)
plt.plot(x, gbf1(x, M=5));
plt.xlabel("x"); plt.ylabel("phi");
def blr(X, t, alpha=1., beta=10.):
    """Bayesian linear regression
    alpha: inv. variance of weight prior
    beta: inv. variance of observation noise
    """
    N, D = X.shape
    S = np.linalg.inv(alpha*np.eye(D) + beta*X.T@X)  # posterior covariance
    m = beta*S@X.T@t  # posterior mean
    return m, S
def target(x):
    """Target function"""
    return np.sin(x)
# Training data
N = 20
eps = 0.2 # noise size
xr = 4 # range of x
x = np.random.uniform(-xr, xr, size=N)
f = target(x) # target function
t = f + np.random.normal(scale=eps, size=N) # with noise
# data for testing/plotting
Np = 100
xp = np.linspace(-xr, xr, Np)
fp = target(xp)
plt.plot(xp, fp, "g") # target function
plt.plot(x, t, "ro"); # training data
plt.xlabel("x"); plt.ylabel("y");
M = 10 # number of basis functions
Phi = gbf1(x, [-xr,xr], M) # Gaussian basis functions
m, S = blr(Phi, t, alpha=1, beta=25) # Bayesian linear regression
print(m)
# test data
Phip = gbf1(xp, [-xr,xr], M)
yp = Phip@m.T
plt.plot(xp, fp, "g") # target function
plt.plot(x, t, "ro"); # training data
plt.plot(xp, yp, "b"); # MAP estimate
# predictive distribution
sigma = np.sqrt(1/beta + np.sum(Phip@S*Phip, axis=1))
plt.plot(xp, yp+sigma, "c")
plt.plot(xp, yp-sigma, "c")
plt.xlabel("x"); plt.ylabel("y");
[ 0.13214453 0.86043173 -0.86373069 -0.5999962 -0.43421678 0.4235418
0.57504221 0.17307799 0.41306333 -0.91960475]
See how \(N\), \(M\), \(\alpha\) and \(\beta\) affect the performance.
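A minimal sketch of such an exploration (reusing `Phi`, `Phip`, `fp` and `Np` defined above; the settings are arbitrary):

```python
# sketch: test-set error for a few hyperparameter settings
for alpha, beta in [(0.1, 25), (1, 25), (100, 25), (1, 1)]:
    m, S = blr(Phi, t, alpha=alpha, beta=beta)
    err = fp - Phip@m  # error against the true function
    print(f"alpha={alpha}, beta={beta}: mse={np.dot(err,err)/Np:.4f}")
```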
Bayesian model comparison#
We have so far considered Bayesian inference of the parameter \(\b{w}\) for a given model \(\c{M}\), such as a regression model with some input variables. But we can also apply Bayesian inference to a probability over models \(\c{M}_i\), such as regression models with different choices of input variables, given data \(\c{D}\):
$$p(\c{M}_i|\c{D}) \propto p(\c{D}|\c{M}_i)\,p(\c{M}_i).$$
Here \(p(\c{D}|\c{M}_i)\), the likelihood of a model given data, is called the evidence of the model.
If we include a model explicitly in our Bayesian parameter estimation, we have
$$p(\b{w}|\c{D},\c{M}_i) = \frac{p(\c{D}|\b{w},\c{M}_i)\,p(\b{w}|\c{M}_i)}{p(\c{D}|\c{M}_i)}$$
where we have the evidence as the normalizing denominator
$$p(\c{D}|\c{M}_i) = \int p(\c{D}|\b{w},\c{M}_i)\,p(\b{w}|\c{M}_i)\,d\b{w}.$$
This is also called marginal likelihood because it is the likelihood of the model with its parameters marginalized.
Computing model evidence#
In Bayesian linear regression, the model evidence with the hyperparameters \(\alpha\) (weight prior) and \(\beta\) (observation noise) is given by integration over the entire range of the weight parameters \(\b{w}\):
$$p(\b{t}|\alpha,\beta) = \int p(\b{t}|\b{w},\beta)\,p(\b{w}|\alpha)\,d\b{w}$$
By further integrating this over the prior distribution of \(\alpha\) and \(\beta\) we have the evidence for the full model.
A practical approximation is to find \(\alpha\) and \(\beta\) that maximize \(p(\b{t}|\alpha,\beta)\).
The log evidence is given as (Bishop, Chapter 3.5)
$$\ln p(\b{t}|\alpha,\beta) = \frac{D}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\b{m}) + \frac{1}{2}\ln|S| - \frac{N}{2}\ln(2\pi)$$
with
$$E(\b{m}) = \frac{\beta}{2}\|\b{t} - X\b{m}\|^2 + \frac{\alpha}{2}\b{m}^T\b{m}$$
where \(\b{m}\) and \(S\) also depend on \(\alpha\) and \(\beta\).
For \(N \gg D\), the evidence is maximized by
$$\alpha = \frac{D}{\b{m}^T\b{m}}, \qquad \beta = \frac{N}{\|\b{t} - X\b{m}\|^2}$$
def logev(X, t, m, S):
    """Log evidence for Bayesian linear regression
    m: posterior mean
    S: posterior covariance
    returns: log evidence and the estimates of
    alpha (inv. variance of weight prior) and
    beta (inv. variance of observation noise)
    """
    N, D = X.shape
    em = t - X@m.T  # error with the MAP estimate
    alpha = D/np.dot(m,m)
    beta = N/np.dot(em,em)
    # log evidence
    lev = -beta/2*np.dot(em,em) - alpha/2*np.dot(m,m) + np.log(abs(np.linalg.det(S)))/2 + D/2*np.log(alpha) + N/2*np.log(beta/(2*np.pi))
    return lev, alpha, beta
Try computing the log evidence for different \(M\):
# try different values of M
Max = 10 # max number of basis functions
mse = np.zeros(Max) # mean square errors
lev = np.zeros(Max) # log evidences
for M in range(2,Max):
    Phi = gbf1(x, [-xr,xr], M)  # Gaussian basis functions
    m, S = blr(Phi, t, alpha=1, beta=25)
    lev[M], alpha, beta = logev(Phi, t, m, S)
    # test data
    Phip = gbf1(xp, [-xr,xr], M)  # Gaussian basis functions
    err = fp - Phip@m.T  # validation error
    mse[M] = np.dot(err,err)/Np
    print(M, alpha, beta, lev[M], mse[M])
plt.subplot(2,1,1)
plt.plot(np.arange(2,Max), mse[2:], "r"); plt.ylabel("mse");
plt.subplot(2,1,2)
plt.plot(np.arange(2,Max), lev[2:], "b"); plt.ylabel("log evidence");
plt.xlabel("M");
2 24.37152259080127 2.58148716745645 -21.726982321935157 0.44353253637597106
3 30.875544846983832 2.636403069154352 -22.015398044900767 0.43627681395716983
4 0.29795968278069307 23.24062052247909 -10.041203952576923 0.009016298559501685
5 0.7915007277658813 24.328638208684595 -9.885011771865946 0.00604027630768057
6 1.5530818992624351 25.02042443149396 -9.647877004072965 0.004923684275417875
7 1.765595004234003 34.20662360035838 -7.727386487791897 0.019218900194458852
8 2.3488272310324114 39.47731654828744 -6.807844440537249 0.026019025142708308
9 2.182436399913303 49.9188950010986 -5.686545730394201 0.0419360060900722
Bayesian networks#
As we have seen in the example of Bayesian linear regression, statistical machine learning assumes a generative model of the observed data and infers the posterior probability of the variables of interest, after marginalizing other unobserved variables.
In doing so, representing the relationships by graphs, with random variables as nodes (or vertices) and joint or conditional probabilities as links (or edges or arcs), has turned out to be very useful. Such representations are called graphical models.
Directed graphs representing conditional probabilities by arrows (directed edges) are called Bayesian networks.
For example, Bayesian linear regression can be represented as a Bayesian network as below.
Graphical model for the Bayesian linear regression.
Inference in a Bayesian network goes like this: when the values of some variables are observed,
clamp the values of the nodes where observations were made;
compute the posterior distributions of the remaining nodes along the graph by repeating Bayesian inference and marginalization.
In doing this, the conditional independence of the nodes allows efficient computation.
Inference on a chain#
Here we consider the simplest case of a chain of discrete random variables.
Graphical model of a chain of states.
For each node, the variable takes an integer value \(x_n \in \{1,...,K_n\}\) and we consider the joint distribution over all the nodes:
$$p(x_1,...,x_N) = p(x_1)\prod_{n=2}^{N} p(x_n|x_{n-1})$$
When an observation \(x_N=k\) is made at the end node, we would consider the posterior distribution
$$p(x_n|x_N=k).$$
The posterior distribution of each node \(x_n\) is given by marginalization over all the other unobserved nodes:
$$p(x_n|x_N=k) \propto \sum_{x_1}\cdots\sum_{x_{n-1}}\sum_{x_{n+1}}\cdots\sum_{x_{N-1}} p(x_1,...,x_{N-1},x_N=k)$$
This can be computed efficiently by passing two messages:
Forward message \(\alpha_n\) of prior:
$$\alpha_1(x_1) = p(x_1), \qquad \alpha_{n+1}(x_{n+1}) = \sum_{x_n} p(x_{n+1}|x_n)\,\alpha_n(x_n)$$
Backward message \(\beta_n\) of likelihood:
$$\beta_{n-1}(x_{n-1}) = \sum_{x_n} \beta_n(x_n)\,p(x_n|x_{n-1})$$
with \(\beta_N\) having \(1\) at the \(k\)-th component and \(0\) elsewhere.
The posterior distribution for each node is then given by their normalized product
$$p(x_n|x_N=k) \propto \alpha_n(x_n)\,\beta_n(x_n)$$
This is called the forward-backward algorithm.
This can be generalized to tree-like networks and the algorithm using forward and backward message passing is known as belief propagation.
Markov chain#
Here is an example of inference in a Markov chain.
class Markov:
    """Class for a Markov chain"""

    def __init__(self, ptr):
        """Create a new chain with transition matrix ptr"""
        self.ptr = ptr  # transition matrix p(x'|x)
        self.Ns = len(ptr)  # number of states

    def sample(self, x0=0, step=1):
        """generate a sample sequence from x0"""
        seq = np.zeros(step+1, dtype=int)  # sequence buffer
        seq[0] = x0
        for t in range(step):
            pt1 = self.ptr[:, seq[t]]  # prob. of new states
            seq[t+1] = np.random.choice(self.Ns, p=pt1)  # sample
        return seq

    def forward(self, p0, step=1):
        """forward message from initial distribution p0"""
        alpha = np.zeros((step+1, self.Ns))  # priors
        alpha[0] = p0  # initial distribution
        for t in range(step):
            alpha[t+1] = self.ptr @ alpha[t]
        return alpha

    def backward(self, obs, step=1):
        """backward message from terminal observation"""
        beta = np.zeros((step+1, self.Ns))  # likelihoods
        beta[-1] = obs  # observation
        for t in range(step, 0, -1):  # toward 0
            beta[t-1] = beta[t] @ self.ptr
        return beta

    def posterior(self, p0, obs, step):
        """posterior by the forward-backward algorithm"""
        alpha = self.forward(p0, step)
        beta = self.backward(obs, step)
        post = alpha*beta
        for t in range(step+1):
            post[t] = post[t]/sum(post[t])  # normalize
        return post
Here is an example of a directed random walk on a ring.
# stochastic cycling on a ring
ns = 6 # ring size
ps = 0.3 # shift probability
Ptr = np.zeros((ns, ns)) # transition matrix
for i in range(ns):
    Ptr[i,i] = 1 - ps
    Ptr[(i+1)%ns, i] = ps
plt.imshow(Ptr)
# create a Markov chain
ring = Markov(Ptr)
# a sample trajectory
T = 15
ring.sample(1, T)
array([1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5])
# forward message passing
alpha = ring.forward([0,0,1,0,0,0], T)
plt.imshow(alpha.T);
# backward message passing
beta = ring.backward([0,0,0,1,0,0], T)
plt.imshow(beta.T);
# posterior by their products
post = ring.posterior([0,0,1,0,0,0], [0,0,0,1,0,0], T)
plt.imshow(post.T);
# a little shifted observation
post = ring.posterior([0,0,1,0,0,0], [0,0,0,0,1,0], T)
plt.imshow(post.T);
# longer sequence
post = ring.posterior([0,0,1,0,0,0], [0,0,0,0,1,0], 3*T)
plt.imshow(post.T);
Dynamic Bayesian Inference#
Iterative Bayesian inference can be generalized to the case when the hidden variable \(x\) changes dynamically.
We denote the sequence of observations as
$$y_{1:t} = \{y_1,...,y_t\}$$
and the history of the underlying state variable as
$$x_{1:t} = \{x_1,...,x_t\}$$
We assume two conditional probability distributions:
Dynamics model: \(p(x'|x)\)
Observation model: \(p(y|x)\)
Using the posterior \(p(x_t|y_{1:t})\) computed from the data up to time \(t\), we use the dynamics model to compute the predictive prior
$$p(x_{t+1}|y_{1:t}) = \int p(x_{t+1}|x_t)\,p(x_t|y_{1:t})\,dx_t$$
by integrating or summing over the possible range of \(x_t\).
We can combine this prior with the newly arriving data \(y_{t+1}\) to update the posterior as
$$p(x_{t+1}|y_{1:t+1}) \propto p(y_{t+1}|x_{t+1})\,p(x_{t+1}|y_{1:t}).$$
This is called dynamic Bayesian inference and allows real-time tracking of hidden variables from noisy observations.
When \(x\) is discrete, the process is called a hidden Markov model (HMM), which has been used extensively in speech processing.
Another example is the Kalman filter, in which \(x\) and \(y\) are continuous and the dynamics and observation models are linear mappings with Gaussian noise.
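For the discrete case, one step of this prediction-update cycle can be sketched as below (assuming the `Markov` class and the `ring` chain defined above; `bayes_filter` and `pobs` are our own names):

```python
def bayes_filter(chain, post, y, pobs):
    """One step of dynamic Bayesian inference on a discrete chain.
    chain: Markov chain with transition matrix chain.ptr
    post: current posterior p(x_t|y_1:t)
    y: new observation
    pobs: observation matrix, pobs[y,x] = p(y|x)
    """
    prior = chain.ptr @ post  # predictive prior p(x_t+1|y_1:t)
    post = pobs[y]*prior      # multiply by the likelihood p(y_t+1|x_t+1)
    return post/np.sum(post)  # normalize

# example: track the state of the ring from noisy observations
pobs = 0.5*np.eye(ns) + 0.5/ns  # observe the true state half of the time
post = np.ones(ns)/ns           # start from a uniform posterior
for y in ring.sample(0, 10)[1:]:  # here we simply use the sampled states as observations
    post = bayes_filter(ring, post, y, pobs)
print(post)
```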
Bayesian sensorimotor processing#
Our life is full of uncertainty. In sensory perception, we need to cope with noise, delay and occlusion, and also overcome fundamental ill-posedness, such as identifying the 3D location of a target from 2D retinal images or from the sounds reaching the two ears.
To find a practical solution to such ill-posed problems, we need to make use of prior assumptions, such as that light usually comes from above or that objects do not jump abruptly.
Bayesian inference provides a principled way of combining prior knowledge with sensory evidence. Indeed, several lines of psychological evidence suggest that humans and animals integrate knowledge from prior experience or multimodal sensory information as predicted by Bayesian inference (Knill & Pouget 2004; Körding & Wolpert 2004; Doya et al. 2007).
Bayesian computation in the brain#
How is such Bayesian computation realized in the brain? How does the brain represent and manipulate probability distributions?
One possibility is that the receptive field of a neuron represents a basis function in the sensory space and the activities of a population of neurons represent a probability distribution. This idea is called probabilistic population code (Zemel et al. 1998, Ma et al. 2006).
The cerebral cortex has a hierarchical organization, with bidirectional connections between lower and higher areas originating from specific layers. There have been several hypotheses about how such a hierarchical recurrent network can realize Bayesian inference, such as belief propagation (Lochmann & Deneve 2011) and variational free-energy approximation (Friston 2005, 2010; Bogacz 2017).
This tutorial illustrates how variational free-energy approximation of the posterior probability works and how such mechanisms might be mapped onto the cortical circuit (from Bogacz 2017).
There have been only scarce attempts at directly testing these hypotheses, but a recent two-photon imaging experiment showed evidence for dynamic Bayesian inference by populations of neurons in the parietal cortex (Funamizu et al. 2016).
References#
Bishop CM (2006) Pattern Recognition and Machine Learning. Springer. https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/
Chapter 3: Bayesian linear regression
Chapter 8: Graphical models
Bayesian sensorimotor integration#
Knill DC, Pouget A (2004) The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in neurosciences 27:712-719. https://doi.org/10.1016/j.tins.2004.10.007
Körding KP, Wolpert DM (2004) Bayesian integration in sensorimotor learning. Nature 427:244-247. https://doi.org/10.1038/nature02169
Doya K, Ishii S, Pouget A, Rao R (2007) Bayesian Brain: Probabilistic Approach to Neural Coding and Learning. MIT Press.
Probabilistic population codes#
Zemel RS, Dayan P, Pouget A (1998) Probabilistic interpretation of population codes. Neural computation 10:403-430. https://doi.org/10.1162/089976698300017818
Ma WJ, Beck JM, Latham PE, Pouget A (2006) Bayesian inference with probabilistic population codes. Nature neuroscience 9:1432-1438. https://doi.org/10.1038/nn1790
Bayesian inference in the cortical circuit#
Bogacz R (2017) A tutorial on the free-energy framework for modelling perception and learning. Journal of Mathematical Psychology. 76, 198–211. https://doi.org/10.1016/j.jmp.2015.11.003
Friston K (2005). A theory of cortical responses. Philos Trans R Soc Lond B Biol Sci, 360, 815-36. http://doi.org/10.1098/rstb.2005.1622
Friston K (2010). The free-energy principle: a unified brain theory? Nat Rev Neurosci, 11, 127-38. http://doi.org/10.1038/nrn2787
Lochmann T, Deneve S (2011) Neural processing as causal inference. Current opinion in neurobiology 21:774-781. https://doi.org/10.1016/j.conb.2011.05.018
Funamizu A, Kuhn B, Doya K (2016) Neural substrate of dynamic Bayesian inference in the cerebral cortex. Nature Neuroscience 19:1682-1689. https://doi.org/10.1038/nn.4390