Markov Chain Monte Carlo

Metropolis-Hastings Algorithm


A key ingredient is the proposal distribution \index{proposal density} \(q(\vartheta|\theta^{i-1})\), which potentially depends on the draw \(\theta^{i-1}\) in iteration \(i-1\) of the algorithm. \vspace{0.05in}

\begin{algo}[Generic MH Algorithm] \label{algo_genericmh} For $i=1$ to $N$: Draw $\vartheta$ from a density $q(\vartheta|\theta^{i-1})$. Set $\theta^{i} = \vartheta$ with probability \[ \alpha(\vartheta | \theta^{i-1} ) = \min \left\{ 1, \; \frac{ p(Y| \vartheta )p(\vartheta) / q(\vartheta | \theta^{i-1}) }{ p(Y|\theta^{i-1}) p(\theta^{i-1}) / q(\theta^{i-1} | \vartheta) } \right\} \] and $\theta^{i} = \theta^{i-1}$ otherwise. \end{algo}


Because \(p(\theta|Y) \propto p(Y|\theta)p(\theta)\), we can replace the posterior densities in the calculation of the acceptance probability \(\alpha(\vartheta | \theta^{i-1})\) with the products of likelihood and prior, which do not require the normalization constant \(p(Y)\). The algorithm generates a Markov transition kernel \(K(\theta|\tilde{\theta})\), where the conditioning value \(\tilde{\theta}\) corresponds to the parameter draw from iteration \(i-1\).
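To fix ideas, here is a minimal sketch of the generic MH recursion in Python for a user-supplied log posterior kernel \(\ln[p(Y|\theta)p(\theta)]\); the Gaussian random-walk proposal, the function names, and the toy target are illustrative assumptions, not part of Algorithm \ref{algo_genericmh} itself.

```python
import numpy as np

def metropolis_hastings(log_kernel, theta0, n_draws, proposal_std=0.5, seed=0):
    """Generic MH sketch with a Gaussian random-walk proposal (an illustrative choice).

    log_kernel(theta) returns ln[p(Y|theta) p(theta)], the log of the un-normalized
    posterior, which is all the acceptance ratio requires.
    """
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    log_post = log_kernel(theta)
    draws = np.empty((n_draws, theta.size))
    for i in range(n_draws):
        # Proposal draw; the random walk is symmetric, so the q(.|.) terms cancel
        # in the acceptance ratio and only the posterior kernels remain.
        vartheta = theta + proposal_std * rng.standard_normal(theta.size)
        log_post_prop = log_kernel(vartheta)
        # alpha = min{1, ratio}, evaluated on the log scale for numerical stability.
        if np.log(rng.uniform()) < log_post_prop - log_post:
            theta, log_post = vartheta, log_post_prop
        draws[i] = theta
    return draws

# Toy target (purely illustrative): a standard normal posterior kernel.
draws = metropolis_hastings(lambda th: -0.5 * np.sum(th**2), theta0=[3.0], n_draws=5000)
print(draws.mean(), draws.std())
```

With an asymmetric proposal, the log acceptance ratio would also include \(\ln q(\theta^{i-1}|\vartheta) - \ln q(\vartheta|\theta^{i-1})\), exactly as in Algorithm \ref{algo_genericmh}.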

Convergence

The probability theory for MH is much harder than for importance sampling (IS).

Key property: invariance of the Markov chain, i.e., the posterior \(p(\theta|Y)\) is an invariant distribution of the transition kernel:

\begin{eqnarray} p(\theta|Y) = \int K(\theta | \tilde{\theta}) p(\tilde{\theta} |Y) d \tilde{\theta}. \label{eq_mhinvariance} \end{eqnarray}

This property can be shown using the reversibility of the Markov chain (sketched below).
Invariance alone is not sufficient for a SLLN or CLT; those results also depend on \(q\) and \(\pi\).
\textcolor{red}{Look at a specific example.}
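To sketch the reversibility argument: the MH kernel satisfies the detailed-balance condition \(p(\tilde{\theta}|Y)\, K(\theta|\tilde{\theta}) = p(\theta|Y)\, K(\tilde{\theta}|\theta)\); integrating both sides over \(\tilde{\theta}\) and using \(\int K(\tilde{\theta}|\theta)\, d\tilde{\theta} = 1\) delivers the invariance property (\ref{eq_mhinvariance}):

\[
\int K(\theta|\tilde{\theta})\, p(\tilde{\theta}|Y)\, d\tilde{\theta}
= \int K(\tilde{\theta}|\theta)\, p(\theta|Y)\, d\tilde{\theta}
= p(\theta|Y) \int K(\tilde{\theta}|\theta)\, d\tilde{\theta}
= p(\theta|Y).
\]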

A Specific Example

Deriving the Transition Kernel

Transition Kernel, Continued

Markov Chain

We can represent the Markov chain generated by MH as an AR(1) process. Define \[ \xi^i = \frac{\theta^i - \tau_1}{\tau_2-\tau_1},\quad \xi^i \in \{0, 1\}. \] Then \(\xi^i\) follows the first-order autoregressive process

\begin{equation} \xi^i = (1-k_{11}) + \lambda_2(K) \xi^{i-1} + \nu^i. \label{eq_exKxi} \end{equation}


Conditional on \(\xi^{i-1}=j-1\), \(j=1,2\), the innovation \(\nu^i\) has support on \(k_{jj}\) and \((1-k_{jj})\), its conditional mean is equal to zero, and its conditional variance is equal to \(k_{jj}(1-k_{jj})\).
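These moment claims are easy to verify by simulation. In the sketch below, \(k_{11}\) and \(k_{22}\) denote the diagonal elements of the \(2\times 2\) transition kernel \(K\) behind the AR(1) representation, \(\lambda_2(K)=k_{11}+k_{22}-1\) is its second eigenvalue, and the numerical values of \(k_{11}\) and \(k_{22}\) are arbitrary illustrative choices.

```python
import numpy as np

def simulate_xi(k11, k22, n, seed=0):
    """Simulate xi^i in {0, 1} from the 2x2 kernel K with diagonal (k11, k22)."""
    rng = np.random.default_rng(seed)
    K = np.array([[k11, 1 - k11], [1 - k22, k22]])  # row j: transition probs from state j
    xi = np.empty(n, dtype=int)
    xi[0] = 0
    for i in range(1, n):
        xi[i] = rng.choice(2, p=K[xi[i - 1]])
    return xi

k11, k22 = 0.8, 0.6            # illustrative values, not taken from the example above
lam2 = k11 + k22 - 1           # second eigenvalue lambda_2(K)
xi = simulate_xi(k11, k22, n=200_000)

# Innovation implied by xi^i = (1 - k11) + lambda_2(K) xi^{i-1} + nu^i
nu = xi[1:] - (1 - k11) - lam2 * xi[:-1]
for j, kjj in enumerate([k11, k22]):
    sel = xi[:-1] == j
    print(f"xi^(i-1)={j}: E[nu]={nu[sel].mean():+.4f}, "
          f"V[nu]={nu[sel].var():.4f}, k_jj(1-k_jj)={kjj * (1 - kjj):.4f}")
```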

More on Markov Chain

For the Monte Carlo average \[ \bar{h}_N = \frac{1}{N} \sum_{i=1}^N h(\theta^i), \] we deduce from a central limit theorem for dependent random variables that \[ \sqrt{N} (\bar{h}_N - \mathbb{E}_\pi[h]) \Longrightarrow N \big(0, \Omega(h) \big), \] where \(\Omega(h)\) is now the long-run covariance matrix \[ \Omega(h) = \lim_{L \longrightarrow \infty} \mathbb{V}_\pi[h] \left( 1 + 2 \sum_{l=1}^L \frac{L-l}{L} \left( q - (1-q)\frac{\pi_1}{1-\pi_1} \right)^l \right). \] In turn, the asymptotic inefficiency factor is given by \index{inefficiency factor}

\begin{eqnarray} \mbox{InEff}_\infty &=& \frac{\Omega(h)}{\mathbb{V}_\pi[h]} \\ &=& 1 + 2 \lim_{L \longrightarrow \infty} \; \sum_{l=1}^L \frac{L-l}{L} \left( q - (1-q)\frac{\pi_1}{1-\pi_1} \right)^l. \nonumber \end{eqnarray}
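As a numerical check on these expressions, the sketch below evaluates the geometric term \(\lambda = q - (1-q)\pi_1/(1-\pi_1)\) for hypothetical values of \(q\) and \(\pi_1\) (made up for illustration), approximates \(\mbox{InEff}_\infty\) by truncating the limit, and compares it with the closed form \(1 + 2\lambda/(1-\lambda)\), which follows because the Cesàro-weighted geometric sum converges to \(\lambda/(1-\lambda)\).

```python
import numpy as np

# Hypothetical inputs for the two-point example (illustrative only): pi1 is the
# posterior probability of tau_1 and q governs the proposal distribution.
pi1, q = 0.2, 0.5
lam = q - (1 - q) * pi1 / (1 - pi1)   # geometric term appearing in Omega(h)

# Truncated version of the limit that defines InEff_infty ...
L = 100_000
l = np.arange(1, L + 1)
ineff_truncated = 1 + 2 * np.sum((L - l) / L * lam**l)

# ... and the closed form implied by sum_{l>=1} lam^l = lam / (1 - lam).
ineff_closed = 1 + 2 * lam / (1 - lam)
print(ineff_truncated, ineff_closed)
```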

Numerical Example

Autocorrelation Functions

\includegraphics[width=4.3in]{static/mh_acf}

Log Inefficiency Factor as a function of \(q\)

\includegraphics[width=4.3in]{static/mh_relative_variance}

Convergence: within- vs. across-chain variance estimates

\begin{center} \includegraphics[width=2.1in]{static/mh_hac} \end{center}

Takeaways

How to pick \(q\) for a DSGE model?

Random Walk Metropolis-Hastings

On \(\hat\Sigma\)

Picking Scaling \(c\)

From Prior to Posterior

\vspace{0.15in} Eliciting prior distributions [DelNegro2008a]:

Priors, Continued

\vspace{0.2in} Above all: Generate draws from the prior distribution of \(\theta\); compute important transformations of \(\theta\) such as steady-state ratios and possibly impulse-response functions or variance decompositions.

Try not to set priors based on \(Y\).

\includegraphics[width=4.3in]{static/swpi}

\(\rho = \frac{x^2}{x^2+y^2}, \quad x\sim U[0, 1], \quad y\sim U[0, 1]\)

Density of \(\rho\) \includegraphics[width=4.3in]{static/rho}
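The shape of this induced prior is easy to reproduce by simulation; a minimal numpy sketch (seed and bin count are arbitrary):

```python
import numpy as np

# Prior on rho = x^2 / (x^2 + y^2) induced by independent U[0,1] priors on x and y:
# "flat" priors on x and y are far from flat for rho.
rng = np.random.default_rng(0)
x, y = rng.uniform(size=(2, 1_000_000))
rho = x**2 / (x**2 + y**2)

# The mass piles up near 0 and 1 instead of being uniform on [0, 1].
density, edges = np.histogram(rho, bins=10, range=(0, 1), density=True)
for lo, hi, d in zip(edges[:-1], edges[1:], density):
    print(f"[{lo:.1f}, {hi:.1f}): {d:.2f}")
```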

Bayesian Estimation – Prior

\begin{center}
\scalebox{0.85}{
\begin{tabular}{lllcc}
\hline\hline
Name & Domain & \multicolumn{3}{c}{Prior} \\
 & & Density & Para (1) & Para (2) \\
\hline
\multicolumn{5}{c}{Steady State Related Parameters $\theta_{(ss)}$} \\
\hline
$r^{(A)}$ & $\mathbb{R}^+$ & Gamma & 0.50 & 0.50 \\
$\pi^{(A)}$ & $\mathbb{R}^+$ & Gamma & 7.00 & 2.00 \\
$\gamma^{(Q)}$ & $\mathbb{R}$ & Normal & 0.40 & 0.20 \\
\hline
\multicolumn{5}{c}{Endogenous Propagation Parameters $\theta_{(endo)}$} \\
\hline
$\tau$ & $\mathbb{R}^+$ & Gamma & 2.00 & 0.50 \\
$\kappa$ & $[0,1]$ & Uniform & 0.00 & 1.00 \\
$\psi_1$ & $\mathbb{R}^+$ & Gamma & 1.50 & 0.25 \\
$\psi_2$ & $\mathbb{R}^+$ & Gamma & 0.50 & 0.25 \\
$\rho_R$ & $[0,1)$ & Uniform & 0.00 & 1.00 \\
\hline
\multicolumn{5}{c}{Exogenous Shock Parameters $\theta_{(exo)}$} \\
\hline
$\rho_G$ & $[0,1)$ & Uniform & 0.00 & 1.00 \\
$\rho_Z$ & $[0,1)$ & Uniform & 0.00 & 1.00 \\
$100 \sigma_R$ & $\mathbb{R}^+$ & InvGamma & 0.40 & 4.00 \\
$100 \sigma_G$ & $\mathbb{R}^+$ & InvGamma & 1.00 & 4.00 \\
$100 \sigma_Z$ & $\mathbb{R}^+$ & InvGamma & 0.50 & 4.00 \\
\hline\hline
\end{tabular}
}
\end{center}

Baseline Estimation

\begin{table}[t!]
\caption{Posterior Estimates of DSGE Model Parameters} \label{t_dsge1_posterior}
\begin{center}
\begin{tabular}{p{0.8cm}ccp{0.8cm}cc}
\hline\hline
 & Mean & [0.05, 0.95] & & Mean & [0.05, 0.95] \\
\hline
$\tau$ & 2.83 & [ 1.95, 3.82] & $\rho_r$ & 0.77 & [ 0.71, 0.82] \\
$\kappa$ & 0.78 & [ 0.51, 0.98] & $\rho_g$ & 0.98 & [ 0.96, 1.00] \\
$\psi_1$ & 1.80 & [ 1.43, 2.20] & $\rho_z$ & 0.88 & [ 0.84, 0.92] \\
$\psi_2$ & 0.63 & [ 0.23, 1.21] & $\sigma_r$ & 0.22 & [ 0.18, 0.26] \\
$r^{(A)}$ & 0.42 & [ 0.04, 0.95] & $\sigma_g$ & 0.71 & [ 0.61, 0.84] \\
$\pi^{(A)}$ & 3.30 & [ 2.78, 3.80] & $\sigma_z$ & 0.31 & [ 0.26, 0.36] \\
$\gamma^{(Q)}$ & 0.52 & [ 0.28, 0.74] & & & \\
\hline\hline
\end{tabular}
\end{center}
{\em Notes:} We generated $N=100,000$ draws from the posterior and discarded the first 50,000 draws. Based on the remaining draws we approximated the posterior mean and the 5th and 95th percentiles.
\end{table}

More on \(c\)

Vary \(c\in(0, 2]\) and look at the effect on the acceptance rate and the accuracy of the resulting Monte Carlo approximations.

What is the relationship between acceptance rate and accuracy?

Effects of Scaling

\includegraphics[width=4.3in]{static/dsge1_rwmh_scaling}

Acceptance Rate vs. Accuracy

\includegraphics[width=4.3in]{static/dsge1_rwmh_acceptance_v_accuracy}

Convergence of Monte Carlo Average \(\bar{\tau}_{N|N_0}\)

\begin{center} \includegraphics[width=3.5in]{static/dsge1_recursive_means_tau.pdf} \end{center}

Notes: The $x$-axis indicates the number of draws \(N\). \(N_0\) is set to \(0\), \(25,000\), and \(50,000\), respectively.

Improvements to MCMC: Blocking

Block MH Algorithm

Draw \(\theta^0 \in \Theta\) and then for \(i=1\) to \(N\):

  1. Create a partition \(B^i\) of the parameter vector into \(N_{blocks}\) blocks \(\theta_1, \ldots, \theta_{N_{blocks}}\) via some rule (perhaps probabilistic), unrelated to the current state of the Markov chain.
  2. For \(b = 1, \ldots, N_{blocks}\):
    1. Draw \(\vartheta_b \sim q(\cdot|\left[\theta^{i}_{< b}, \theta_b^{i-1}, \theta^{i-1}_{> b}\right])\).
    2. With probability \[ \alpha = \min\left\{1, \; \frac{p(\left[\theta^{i}_{< b}, \vartheta_{b}, \theta^{i-1}_{>b}\right]|Y)\, q(\theta^{i-1}_b|\theta^{i}_{< b},\vartheta_b,\theta^{i-1}_{> b})} {p(\left[\theta^{i}_{< b}, \theta_b^{i-1}, \theta^{i-1}_{> b}\right]|Y)\, q(\vartheta_b|\theta^{i}_{< b}, \theta_b^{i-1}, \theta^{i-1}_{> b})}\right\}, \] set \(\theta^i_b = \vartheta_{b}\); otherwise set \(\theta_b^{i} = \theta^{i-1}_b\). (A minimal code sketch follows the algorithm.)
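Here is a minimal Python sketch of this loop for a generic log posterior; the random partition rule and the symmetric Gaussian within-block proposal are illustrative assumptions that the algorithm above leaves unspecified.

```python
import numpy as np

def block_mh(log_post, theta0, n_draws, n_blocks=3, proposal_std=0.1, seed=0):
    """Block MH sketch: new random partition each iteration, Gaussian random-walk
    proposal within each block (both illustrative choices)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    d = theta.size
    draws = np.empty((n_draws, d))
    lp = log_post(theta)
    for i in range(n_draws):
        # Step 1: a probabilistic partition B^i, unrelated to the chain's current state.
        blocks = np.array_split(rng.permutation(d), n_blocks)
        # Step 2: update block by block, conditioning on the most recent values
        # of all other blocks.
        for b in blocks:
            prop = theta.copy()
            prop[b] += proposal_std * rng.standard_normal(b.size)
            lp_prop = log_post(prop)
            # Symmetric within-block proposal, so the q(.|.) terms cancel.
            if np.log(rng.uniform()) < lp_prop - lp:
                theta, lp = prop, lp_prop
        draws[i] = theta
    return draws

# Toy target (purely illustrative): independent standard normals.
out = block_mh(lambda th: -0.5 * np.sum(th**2), theta0=np.zeros(6), n_draws=2000)
print(out.mean(axis=0))
```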

Random-Block MH Algorithm

Metropolis-Adjusted Langevin Algorithm

Newton MH Algorithm

Run Times and Tuning Constants for MH Algorithms

\begin{center}
\begin{tabular}{llll}
\hline\hline
Algorithm & Run Time & Accpt. & Tuning \\
 & [hh:mm:ss] & Rate & Constants \\
\hline
1-Block RWMH-I & 00:01:13 & 0.28 & $c = 0.015$ \\
1-Block RWMH-V & 00:01:13 & 0.37 & $c = 0.400$ \\
3-Block RWMH-I & 00:03:38 & 0.40 & $c = 0.070$ \\
3-Block RWMH-V & 00:03:36 & 0.43 & $c = 1.200$ \\
3-Block MAL & 00:54:12 & 0.43 & $c_1 = 0.4, c_2 = 0.750$ \\
3-Block Newton MH & 03:01:40 & 0.53 & $\bar s = 0.7, c_2 = 0.600$ \\
\hline
\end{tabular}
\end{center}

Notes: In each run we generate \(N=100,000\) draws. We report the fastest run time and the average acceptance rate across \(N_{run}=50\) independent Markov chains.

Autocorrelation Function of \(\tau^{i}\)

\begin{center}
\includegraphics[width=3.5in]{static/dsge1_tau_acf.pdf}

\end{center}

Notes: The autocorrelation functions are computed based on a single run of each algorithm.

Inefficiency Factor \(\mbox{InEff}_N[\bar{\tau}]\)

\begin{center} \includegraphics[width=3.5in]{static/dsge1_relative_variance_tau.pdf} \end{center}

Notes: The small-sample inefficiency factors are computed based on \(N_{run}=50\) independent runs of each algorithm.

IID Equivalent Draws Per Second

\[ \mbox{$iid$-equivalent draws per second} = \frac{N}{\mbox{Run Time [seconds]}} \cdot \frac{1}{\mbox{InEff}_N}. \]
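The formula translates directly into a small helper; the inefficiency value in the example call is a placeholder, not one of the estimates behind the table below.

```python
def iid_equivalent_draws_per_second(n_draws, run_time_seconds, ineff):
    """iid-equivalent draws per second = (N / run time in seconds) * (1 / InEff_N)."""
    return n_draws / run_time_seconds / ineff

# Illustrative call: N = 100,000 draws, a 73-second run time, and a made-up
# inefficiency factor of 150.
print(iid_equivalent_draws_per_second(100_000, run_time_seconds=73, ineff=150.0))
```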

\begin{center}
\begin{tabular}{lr}
\hline\hline
Algorithm & Draws Per Second \\
\hline
1-Block RWMH-V & 7.76 \\
3-Block RWMH-V & 5.65 \\
3-Block MAL & 1.24 \\
3-Block RWMH-I & 0.14 \\
3-Block Newton MH & 0.13 \\
1-Block RWMH-I & 0.04 \\
\hline
\end{tabular}
\end{center}

Performance of Different MH Algorithms

\begin{center} \begin{tabular}{cc} RWMH-V (1 Block) & RWMH-V (3 Blocks) \\\ \includegraphics[width=1.35in]{static/dsge1_rwmh_one_block_hac.pdf} & \includegraphics[width=1.35in]{static/dsge1_rwmh_three_block_hac.pdf} \\\ MAL & Newton \\\ \includegraphics[width=1.35in]{static/dsge1_langevin_hac.pdf} & \includegraphics[width=1.35in]{static/dsge1_newton_hac.pdf} \end{tabular} \end{center}

\tiny Notes: Each panel contains scatter plots of the small-sample variance \(\mathbb{V}[\bar{\theta}]\) computed across multiple chains (\(x\)-axis) versus the \(\mbox{HAC}[\bar{\theta}]\) estimates of \(\Omega(\theta)/N\) (\(y\)-axis).

Recall: Posterior Odds and Marginal Data Densities

Computation of Marginal Data Densities

Modified Harmonic Mean

Chib and Jeliazkov


MH-Based Marginal Data Density Estimates

\begin{center}
\begin{tabular}{lcc}
\hline\hline
Model & Mean($\ln \hat p(Y)$) & Std. Dev.($\ln \hat p(Y)$) \\
\hline
Geweke $(\tau = 0.5)$ & -346.17 & 0.03 \\
Geweke $(\tau = 0.9)$ & -346.10 & 0.04 \\
SWZ $(q = 0.5)$ & -346.29 & 0.03 \\
SWZ $(q = 0.9)$ & -346.31 & 0.02 \\
Chib and Jeliazkov & -346.20 & 0.40 \\
\hline\hline
\end{tabular}
\end{center}

Notes: Table shows mean and standard deviation of log marginal data density estimators, computed over \(N_{run} =50\) runs of the RWMH-V sampler using \(N=100,000\) draws, discarding a burn-in sample of \(N_0=50,000\) draws. The SWZ estimator uses \(J=100,000\) draws to compute \(\hat \tau\), while the CJ estimator uses \(J=100,000\) draws to compute the denominator of \(\hat{p}(\tilde{\theta}|Y)\).

References


[Metropolis1953] Metropolis, Rosenbluth, Rosenbluth, Teller & Teller, Equation of State Calculations by Fast Computing Machines, Journal of Chemical Physics, 21, 1087-1091 (1953).

[Hastings1970] Hastings, Monte Carlo Sampling Methods Using Markov Chains and Their Applications, Biometrika, 57, 97-109 (1970).

[Tierney1994] Tierney, Markov Chains for Exploring Posterior Distributions, The Annals of Statistics, 22(4), 1701-1728 (1994).

[Chib1995a] Chib & Greenberg, Understanding the Metropolis-Hastings Algorithm, The American Statistician, 49, 327-335 (1995).

[Robert2004] Robert & Casella, Monte Carlo Statistical Methods, Springer (2004).

[Geweke2005] Geweke, Contemporary Bayesian Econometrics and Statistics, John Wiley & Sons (2005).

[DelNegro2008a] Del Negro & Schorfheide, Forming Priors for DSGE Models (and How it Affects the Assessment of Nominal Rigidities), Journal of Monetary Economics, 55(7), 1191-1208 (2008).