Appendix: Bayesian Model Fitting by PyStan#

Stan and PyStan#

Stan is a probabilistic programming language for statistical sampling and inference, including MCMC. By default, Stan uses Hamiltonian Monte Carlo (HMC), an efficient MCMC algorithm.
https://mc-stan.org

Stan is implemented in C++, and PyStan is a Python interface to Stan.
https://pystan.readthedocs.io

Installing PyStan#

PyStan is supported on Linux and macOS with a C++ compiler.

To use PyStan with Jupyter Notebook, you need to install httpstan.
For machines with Intel CPUs, a plain pip install should work.

pip install httpstan

For Apple Silicon, you need to build it from source. Download the source code from GitHub
stan-dev/httpstan
and install it as described in
https://httpstan.readthedocs.io/en/latest/installation.html

Then you can install pystan and the visualization tool arviz with

pip install pystan arviz
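
If the installation succeeds, the packages should import without errors (a quick sanity check):

import httpstan
import stan
import arviz
print('httpstan, pystan and arviz are ready')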

Import libraries#

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import stan
# For multichain MCMC
import multiprocessing
multiprocessing.set_start_method("fork")

import arviz
# For running PyStan via Jupyter Notebook
import nest_asyncio
nest_asyncio.apply()

Data sampling#

Here is a simple example of sampling from a Gaussian distribution with PyStan.

# make a normal distribution model
model = stan.build('parameters {real x;} model {x ~ normal(0,1);}', random_seed=1)
Building: found in cache, done.
# see the organization of the model
model
Model(model_name='models/loo4gezv', program_code='parameters {real x;} model {x ~ normal(0,1);}', data={}, param_names=('x',), constrained_param_names=('x',), dims=([],), random_seed=1)
# take 1000 samples
sample = model.sample(num_chains=1, num_samples=1000, num_warmup=100)
print(sample)
print(sample['x'].shape)
Sampling: 100%, done.
Messages received during sampling:
  Gradient evaluation took 5e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
  Adjust your expectations accordingly!
  WARNING: There aren't enough warmup iterations to fit the
           three stages of adaptation as currently configured.
           Reducing each adaptation stage to 15%/75%/10% of
           the given number of warmup iterations:
             init_buffer = 15
             adapt_window = 75
             term_buffer = 10
<stan.Fit>
Parameters:
    x: ()
Draws: 1000
(1, 1000)
# see the samples
plt.subplot(1,3,(1,2))
plt.plot(sample['x'][0], '.');
plt.ylabel('x')
# show the histogram
plt.subplot(1,3,3)
plt.hist(sample['x'][0], orientation='horizontal');
[Figure: trace of the sampled x values (left) and their histogram (right)]
# plot the sample distribution
arviz.plot_posterior(sample, var_names=['x']);
[Figure: posterior density of x plotted by ArviZ]

General steps for Bayesian modeling with stan#

  1. Describing a probabilistic generative model using the Stan language.

  2. Compiling a Stan file.

  3. Running MCMC to obtain posterior samples.

  4. Diagnosing the validity of MCMC.

  5. Evaluating an obtained posterior.
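
As a minimal sketch, these five steps map onto the following PyStan/ArviZ calls, reusing the simple Gaussian model from the previous section; Example 1 below works through them with a more interesting model.

# 1. Describe a generative model in the Stan language
program_code = 'parameters {real x;} model {x ~ normal(0,1);}'
# 2. Compile the Stan program
posterior = stan.build(program_code, random_seed=1)
# 3. Run MCMC to obtain posterior samples
fit = posterior.sample(num_chains=4, num_samples=1000, num_warmup=500)
# 4. Diagnose the validity of MCMC (Rhat, ESS)
print(arviz.summary(arviz.from_pystan(posterior=fit, posterior_model=posterior)))
# 5. Evaluate the obtained posterior
arviz.plot_posterior(fit, var_names=['x']);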

Example 1: Coin-toss#

Let us first look at a very basic example: a coin toss.

Head (\(x=1\)) and tail (\(x=0\)) follow a Bernoulli distribution:

$$ x \sim \mbox{Bernoulli}(\mu) $$

where \(\mu\) is the mean parameter, \(0 \le \mu \le 1\).

A common prior distribution for \(\mu\) is the Beta distribution \(\mbox{Beta}(\mu; a, b) \propto \mu^{a-1} (1-\mu)^{b-1}\).

For \(n\) coin tosses, the number of heads \(x\) follows a binomial distribution:

$$ x \sim \mbox{Binomial}(n, \mu) $$
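
Before writing the Stan model, it can help to simulate this generative process directly with scipy.stats (a small sketch; the values of a, b and n match those used below):

rng = np.random.default_rng(0)
a, b, n = 2, 4, 100
mu_true = stats.beta(a, b).rvs(random_state=rng)        # draw a mean parameter from the prior
x_sim = stats.binom(n, mu_true).rvs(random_state=rng)   # number of heads in n tosses
print(f'mu_true = {mu_true:.3f}, heads: {x_sim} / {n}')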

Step 1: Describing a probabilistic generative model using the Stan language#

Here we are interested in estimating the mean parameter \(\mu\) from the observed tosses.

stan_coin_toss = """
data {
    int<lower=0> n; // Number of tosses
    int<lower=0> x; // Number of heads
    int<lower=0> a; // Parameter "a" of the prior (Beta Distribution)
    int<lower=0> b; // Parameter "b" of the prior (Beta Distribution)
}

parameters {
    real<lower=0, upper=1> mu;
}

model {
    mu ~ beta(a, b); // Write a prior distribution
    x ~ binomial(n, mu);
}
"""
n = 100    # Number of tosses
x = 60     # Number of heads
a = 2     # Parameter "a" of the prior (Beta Distribution)
b = 4     # Parameter "b" of the prior (Beta Distribution)
data_coin_toss = {
    "n": n,
    "x": x,
    "a": a,
    "b": b
}

Step 2: Compiling a Stan file#

posterior = stan.build(stan_coin_toss, data=data_coin_toss, random_seed=1)
Building: found in cache, done.
Messages from stanc:
Warning: The parameter mu has 2 priors.

Step 3: Running MCMC to obtain posterior samples#

# take samples
fit = posterior.sample(num_chains=4, num_samples=2000, num_warmup=100)
fit
Sampling: 100% (8400/8400), done.
Messages received during sampling:
  Gradient evaluation took 5e-06 seconds
  1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
  Adjust your expectations accordingly!
  WARNING: There aren't enough warmup iterations to fit the
           three stages of adaptation as currently configured.
           Reducing each adaptation stage to 15%/75%/10% of
           the given number of warmup iterations:
             init_buffer = 15
             adapt_window = 75
             term_buffer = 10
  (similar messages were repeated for each of the remaining chains)
<stan.Fit>
Parameters:
    mu: ()
Draws: 8000
# see the sampled parameters
plt.subplot(1,3,(1,2))
plt.plot(fit['mu'][0], '.')
plt.ylabel(r'$\mu$')
# show the histogram
plt.subplot(1,3,3)
plt.hist(fit['mu'][0], orientation='horizontal')
plt.tight_layout()
[Figure: trace of the sampled mu values (left) and their histogram (right)]

Step 4: Diagnosing the validity of MCMC#

The essential point is whether the simulated chains have converged to the stationary distribution. A simple way to diagnose convergence is to compute the Gelman-Rubin statistic (a.k.a. Rhat).

You can easily calculate it (and another statistic called the effective sample size (ESS)) using ArviZ.

The Rhat of converged samples approaches 1. The Stan development team suggests accepting the result if Rhat is between 1 and 1.05, although this criterion is open to debate.

ESS quantifies the autocorrelation within a chain: it estimates how many effectively independent draws the samples contain. If the ESS is comparable to (or larger than) the number of draws per chain, autocorrelation is small enough.
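
Both diagnostics can also be computed individually after converting the fit to an InferenceData object (a small sketch, assuming a recent ArviZ version that provides arviz.rhat and arviz.ess); the arviz.summary call below reports them together with the posterior statistics.

idata = arviz.from_pystan(posterior=fit, posterior_model=posterior)
print(arviz.rhat(idata, var_names=['mu']))   # Gelman-Rubin statistic; close to 1 after convergence
print(arviz.ess(idata, var_names=['mu']))    # bulk effective sample size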

arviz.summary(arviz.from_pystan(posterior=fit, posterior_model=posterior))
      mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
mu   0.584  0.046   0.496     0.67      0.001      0.0    3115.0    4972.0    1.0

It is also good to check convergence visually. If you run multiple chains, it is important to check that the posterior distributions from the different chains do not differ much in shape.

arviz.plot_trace(fit, var_names=('mu'), combined=False);
[Figure: ArviZ trace plot of mu for the four chains]

If MCMC has not converged, the following solutions (and more) can be considered; a sketch of re-running the sampler follows the list.

  • Increasing the number of warmup (burn-in) samples

  • Increasing the number of chains

  • Reconsidering the model structure
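
For the first two solutions, here is a minimal sketch of re-running the sampler with a longer warmup and more chains (the numbers are illustrative, not recommendations):

# Re-run MCMC with a longer warmup and more chains (illustrative values)
fit_retry = posterior.sample(num_chains=8, num_samples=2000, num_warmup=1000)
arviz.summary(arviz.from_pystan(posterior=fit_retry, posterior_model=posterior))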

Step 5: Evaluating an obtained posterior#

Highest density interval (HDI): all points within the HDI have a higher probability density than points outside the interval. The HDI can be used as a credible interval (CI) to characterize the uncertainty of a posterior distribution.
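
The HDI can also be computed numerically with arviz.hdi (a small sketch; 0.94 is ArviZ's default hdi_prob):

idata = arviz.from_pystan(posterior=fit, posterior_model=posterior)
print(arviz.hdi(idata, var_names=['mu'], hdi_prob=0.94))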

arviz.plot_posterior(fit, var_names=['mu']);
[Figure: posterior density of mu with its HDI]

This example is simple enough to compute the posterior analytically: since the Beta prior is conjugate to the binomial likelihood, the posterior is \(\mbox{Beta}(a+x, b+n-x)\). Let’s see how close the MCMC posterior is to the analytical posterior.

mus = np.linspace(0, 1, 200)
prior = stats.beta(a, b)
post = stats.beta(a+x, b+n-x)
post_hist = np.histogram(fit['mu'], bins=mus)
post_dist = stats.rv_histogram(post_hist)

plt.figure(figsize=(8, 6))
plt.plot(mus, prior.pdf(mus), label='Prior', c='blue')
plt.plot(mus, n*stats.binom(n, mus).pmf(x), label='Likelihood', c='green')  # pmf scaled by n to be comparable with the densities
plt.plot(mus, post_dist.pdf(mus), label='MCMC posterior', c='orange')
plt.axvline((x+a-1)/(n+a+b-2), label='MAP', c='red', ls='dashed')  # mode of the Beta(a+x, b+n-x) posterior
plt.plot(mus, post.pdf(mus), label='Analytical posterior', c='red', ls='dotted')
# plt.axvline(mu/n, c='green', linestyle='dashed', alpha=0.4, label='MLE')
plt.xlim([0, 1])
plt.xlabel(r'$\mu$', fontsize=14)
plt.ylabel('Density', fontsize=16)
plt.legend()
plt.show()
[Figure: prior, likelihood, MCMC posterior, analytical posterior, and MAP estimate for mu]

References#