A key ingredient is the proposal distribution \index{proposal density} \(q(\vartheta|\theta^{i-1})\), which potentially depends on the draw \(\theta^{i-1}\) in iteration \(i-1\) of the algorithm. \vspace{0.05in}
\begin{algo}[Generic MH Algorithm] \label{algo_genericmh} For $i=1$ to $N$: Draw $\vartheta$ from a density $q(\vartheta|\theta^{i-1})$. Set $\theta^{i} = \vartheta$ with probability \[ \alpha(\vartheta | \theta^{i-1} ) = \min \left\{ 1, \; \frac{ p(Y| \vartheta )p(\vartheta) / q(\vartheta | \theta^{i-1}) }{ p(Y|\theta^{i-1}) p(\theta^{i-1}) / q(\theta^{i-1} | \vartheta) } \right\} \] and $\theta^{i} = \theta^{i-1}$ otherwise. \end{algo}
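In practice the acceptance step is carried out on the log scale to avoid numerical under- and overflow. Below is a minimal Python sketch of a single MH iteration; the functions \texttt{log\_likelihood}, \texttt{log\_prior}, \texttt{propose}, and \texttt{log\_q} are placeholders for user-supplied objects, not quantities defined in the text.

\begin{verbatim}
import numpy as np

def mh_step(theta_prev, log_likelihood, log_prior, propose, log_q, rng):
    """One iteration of the generic MH algorithm, on the log scale."""
    vartheta = propose(theta_prev)  # draw from q(. | theta_prev)
    log_num = log_likelihood(vartheta) + log_prior(vartheta) \
              - log_q(vartheta, theta_prev)
    log_den = log_likelihood(theta_prev) + log_prior(theta_prev) \
              - log_q(theta_prev, vartheta)
    log_alpha = min(0.0, log_num - log_den)   # log acceptance probability
    if np.log(rng.uniform()) < log_alpha:
        return vartheta                       # accept: theta^i = vartheta
    return theta_prev                         # reject: theta^i = theta^{i-1}
\end{verbatim}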
The probability theory for MH is much harder than for importance sampling (IS).
Key property: the posterior \(p(\theta|Y)\) is an invariant distribution of the Markov chain:
\begin{eqnarray} p(\theta|Y) = \int K(\theta | \tilde{\theta}) p(\tilde{\theta} |Y) d \tilde{\theta}. \label{eq_mhinvariance} \end{eqnarray}
This property can be verified using the reversibility of the Markov chain.
Invariance alone is not sufficient for an SLLN or CLT; whether these hold depends on \(q\) and \(\pi\).
\textcolor{red}{Look at a specific example.}
Suppose the parameter space is discrete and \(\theta\) can only take two values: \(\tau_1\) and \(\tau_2\).
The posterior distribution then simplifies to two probabilities which we denote as \(\pi_l = \mathbb{P}\{ \theta=\tau_l|Y\}\), \(l=1,2\).
The proposal distribution in Algorithm~\ref{algo_genericmh} can be represented as a two-state Markov process with transition matrix
\begin{equation} Q = \left[ \begin{array}{cc} q_{11} & q_{12} \\ q_{21} & q_{22} \end{array} \right], \end{equation}
where \(q_{lk}\) is the probability of drawing \(\vartheta = \tau_k\) conditional on \(\theta^{i-1} = \tau_l\).
Assume that \[ q_{11} = q_{22} = q, \quad q_{12}=q_{21}=1-q \] and that the posterior distribution has the property \[ \pi_2 > \pi_1. \]
Reasoning similar to before yields the transition matrix of the MH chain:
\begin{eqnarray} K = \left[ \begin{array}{cc} k_{11} & k_{12} \\ k_{21} & k_{22} \end{array} \right] \label{eq_exKtransition} = \left[ \begin{array}{cc} q & (1-q) \\ (1-q)\frac{\pi_1}{\pi_2} & q + (1-q)\left( 1- \frac{\pi_1}{\pi_2} \right) \end{array} \right]. \nonumber \end{eqnarray}
\(K\) has two eigenvalues \(\lambda_1\) and \(\lambda_2\):
\begin{equation} \lambda_1(K) = 1, \quad \lambda_2(K) = q - (1-q)\frac{\pi_1}{1-\pi_1}. \end{equation}
The eigenvector associated with \(\lambda_1(K)\) determines the invariant distribution
of the Markov chain (= the posterior). If \(\lambda_2(K)\neq 1\), this invariant distribution is unique.
The persistence of the Markov chain is characterized by the eigenvalue \(\lambda_2(K)\).
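The claims about the invariant distribution and the eigenvalues can be checked numerically. A short sketch (the values of \(q\) and \(\pi_1\) are illustrative assumptions, not taken from the text):

\begin{verbatim}
import numpy as np

q, pi1 = 0.7, 0.2                # illustrative values
pi2 = 1.0 - pi1

# MH transition matrix K of the two-state example
K = np.array([[q,                     1.0 - q],
              [(1.0 - q) * pi1 / pi2, q + (1.0 - q) * (1.0 - pi1 / pi2)]])

eigvals, eigvecs = np.linalg.eig(K.T)        # left eigenvectors of K
j = np.argmax(np.real(eigvals))              # unit eigenvalue
invariant = np.real(eigvecs[:, j])
invariant /= invariant.sum()                 # normalize to a distribution

print(invariant)                             # approx [pi1, pi2]
print(np.sort(np.real(eigvals)))             # [q - (1-q)*pi1/(1-pi1), 1]
\end{verbatim}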
We can represent the Markov Chain generated by MH as an AR(1). Define: \[ \xi^i = \frac{\theta^i - \tau_1}{\tau_2-\tau_1},\quad \xi^i \in \{0, 1\}. \] \(\xi^i\) follows the first-order autoregressive process
\begin{equation} \xi^i = (1-k_{11}) + \lambda_2(K) \xi^{i-1} + \nu^i. \label{eq_exKxi} \end{equation}
For the Monte Carlo average \[ \bar{h}_N = \frac{1}{N} \sum_{i=1}^N h(\theta^i), \] we deduce from a central limit theorem for dependent random variables that \[ \sqrt{N} (\bar{h}_N - \mathbb{E}_\pi[h]) \Longrightarrow N \big(0, \Omega(h) \big), \] where \(\Omega(h)\) is now the long-run variance \[ \Omega(h) = \lim_{L \longrightarrow \infty} \mathbb{V}_\pi[h] \left( 1 + 2 \sum_{l=1}^L \frac{L-l}{L} \left( q - (1-q)\frac{\pi_1}{1-\pi_1} \right)^l \right). \] In turn, the asymptotic inefficiency factor is given by \index{inefficiency factor}
\begin{eqnarray} \mbox{InEff}_\infty &=& \frac{\Omega(h)}{\mathbb{V}_\pi[h]} \\ &=& 1 + 2 \lim_{L \longrightarrow \infty} \; \sum_{l=1}^L \frac{L-l}{L} \left( q - (1-q)\frac{\pi_1}{1-\pi_1} \right)^l. \nonumber \end{eqnarray}
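Because \((L-l)/L \rightarrow 1\), the limit is a geometric series and simplifies to \((1+\lambda_2(K))/(1-\lambda_2(K))\). A small sketch that evaluates this expression (the parameter values are assumptions chosen for illustration):

\begin{verbatim}
def ineff_factor(q, pi1):
    """Asymptotic inefficiency factor of the two-state MH chain."""
    lam = q - (1.0 - q) * pi1 / (1.0 - pi1)   # second eigenvalue of K
    return (1.0 + lam) / (1.0 - lam)          # = 1 + 2*lam/(1 - lam)

# Sticky proposals (large q) inflate the Monte Carlo variance:
for q in (0.0, 0.5, 0.9, 0.99):
    print(q, round(ineff_factor(q, pi1=0.2), 2))
\end{verbatim}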
\includegraphics[width=4.3in]{static/mh_acf}
\includegraphics[width=4.3in]{static/mh_relative_variance}
\begin{center} \includegraphics[width=2.1in]{static/mh_hac} \end{center}
High autocorrelation reflects the fact that it takes a large number of draws to accurately approximate the \index{target distribution} target distribution.
For large values of \(q\), the variance of Monte Carlo estimates of \(h\) based on the MH chain is much larger than the variance of estimates derived from \(iid\) draws.
HAC estimates bracket small-sample estimates, indicating convergence, but they tend to underestimate variance for all \(q\).
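The HAC estimates referred to above can be computed with a standard Newey-West long-run variance estimator. A minimal sketch for a scalar chain \(\{h(\theta^i)\}\) (the bandwidth rule below is an illustrative choice, not necessarily the one used for the figures):

\begin{verbatim}
import numpy as np

def newey_west_lrv(h, L=None):
    """Newey-West (Bartlett kernel) estimate of the long-run variance."""
    h = np.asarray(h, dtype=float)
    N = len(h)
    if L is None:
        L = int(np.floor(4 * (N / 100.0) ** (2.0 / 9.0)))  # rule-of-thumb bandwidth
    h_dm = h - h.mean()
    lrv = h_dm @ h_dm / N                                  # gamma_0
    for l in range(1, L + 1):
        gamma_l = h_dm[l:] @ h_dm[:-l] / N                 # autocovariance at lag l
        lrv += 2.0 * (1.0 - l / (L + 1.0)) * gamma_l       # Bartlett weight
    return lrv

# HAC estimate of V[h_bar]: newey_west_lrv(h) / N
\end{verbatim}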
How should we choose the proposal density \(q(\cdot|\cdot)\) for a DSGE model? A popular choice is the random walk proposal \(\vartheta | \theta^{i-1} \sim N(\theta^{i-1}, c^2 \hat\Sigma)\) (RWMH), where \(c\) is a scaling constant.
We want \(\hat\Sigma\) to incorporate information about the posterior.
One approach, used in \cite{Schorfheide2000}, is to set \(\hat\Sigma\) equal to the
negative of the inverse \index{Hessian matrix} Hessian of the log posterior at its
mode \(\hat\theta\), obtained by running a numerical optimization
\index{numerical optimization} routine.
This choice has appealing large-sample properties, but computing the Hessian can be tedious and inaccurate.
Another (adaptive) approach: use the prior variance for a first sequence of posterior draws, then compute the sample covariance matrix from these draws and use it as \(\hat\Sigma\). The proposal covariance must be fixed eventually so that the standard convergence theory applies.
Here we cheat: \[ \mbox{RWMH-V} : \hat\Sigma =\mathbb{V}_\pi[\theta]. \]
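A minimal sketch of the resulting random walk sampler in Python, assuming a placeholder \texttt{log\_posterior} function (log likelihood plus log prior, up to a constant) and a given covariance matrix \texttt{Sigma\_hat}; this is an illustration, not the exact implementation behind the results below.

\begin{verbatim}
import numpy as np

def rwmh(log_posterior, theta0, Sigma_hat, c=0.4, N=100_000, seed=0):
    """RWMH with proposal N(theta^{i-1}, c^2 * Sigma_hat)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(c**2 * Sigma_hat)
    theta = np.asarray(theta0, dtype=float)
    logp = log_posterior(theta)
    draws, accepted = np.empty((N, len(theta))), 0
    for i in range(N):
        prop = theta + L @ rng.standard_normal(len(theta))  # symmetric proposal
        logp_prop = log_posterior(prop)
        if np.log(rng.uniform()) < logp_prop - logp:         # q terms cancel
            theta, logp, accepted = prop, logp_prop, accepted + 1
        draws[i] = theta
    return draws, accepted / N
\end{verbatim}

Because the proposal is symmetric, the proposal densities cancel in the acceptance probability.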
\vspace{0.15in} Eliciting prior distributions, following \cite{DelNegro2008a}:
Group parameters by categories: \(\theta_{(ss)}\) (related to steady state), \(\theta_{(exo)}\) (related to exogenous processes), \(\theta_{(endo)}\) (affects mechanisms but not steady state).
\begin{eqnarray} \theta_{(ss)} &=& [r^{(A)},\pi^{(A)},\gamma^{(Q)}]' \nonumber\\ \theta_{(exo)} &=& [\rho_g,\rho_z,\sigma_g,\sigma_z,\sigma_R]' \nonumber \\ \theta_{(endo)} &=& [\tau,\kappa,\psi_1,\psi_2,\rho_R]' \nonumber \end{eqnarray}
\vspace{0.2in} Most importantly: generate draws from the prior distribution of \(\theta\) and compute important transformations of \(\theta\), such as steady-state ratios and possibly impulse-response functions or variance decompositions.
\includegraphics[width=4.3in]{static/swpi}
Density of \(\rho\) \includegraphics[width=4.3in]{static/rho}
\begin{center} \scalebox{0.85}{
\begin{tabular}{lllcc}
\hline\hline
Name & Domain & \multicolumn{3}{c}{Prior} \\
 & & Density & Para (1) & Para (2) \\
\hline
\multicolumn{5}{c}{Steady State Related Parameters $\theta_{(ss)}$} \\
\hline
$r^{(A)}$ & $\mathbb{R}^+$ & Gamma & 0.50 & 0.50 \\
$\pi^{(A)}$ & $\mathbb{R}^+$ & Gamma & 7.00 & 2.00 \\
$\gamma^{(Q)}$ & $\mathbb{R}$ & Normal & 0.40 & 0.20 \\
\hline
\multicolumn{5}{c}{Endogenous Propagation Parameters $\theta_{(endo)}$} \\
\hline
$\tau$ & $\mathbb{R}^+$ & Gamma & 2.00 & 0.50 \\
$\kappa$ & $[0,1]$ & Uniform & 0.00 & 1.00 \\
$\psi_1$ & $\mathbb{R}^+$ & Gamma & 1.50 & 0.25 \\
$\psi_2$ & $\mathbb{R}^+$ & Gamma & 0.50 & 0.25 \\
$\rho_R$ & $[0,1)$ & Uniform & 0.00 & 1.00 \\
\hline
\multicolumn{5}{c}{Exogenous Shock Parameters $\theta_{(exo)}$} \\
\hline
$\rho_G$ & $[0,1)$ & Uniform & 0.00 & 1.00 \\
$\rho_Z$ & $[0,1)$ & Uniform & 0.00 & 1.00 \\
$100 \sigma_R$ & $\mathbb{R}^+$ & InvGamma & 0.40 & 4.00 \\
$100 \sigma_G$ & $\mathbb{R}^+$ & InvGamma & 1.00 & 4.00 \\
$100 \sigma_Z$ & $\mathbb{R}^+$ & InvGamma & 0.50 & 4.00 \\
\hline \hline
\end{tabular} }
\end{center}
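Following the suggestion above, a sketch of prior predictive draws for one of the tabled parameters; it assumes that the Gamma prior is parameterized by its mean and standard deviation (Para (1) and Para (2)) and uses the discount-factor mapping \(\beta = 1/(1+r^{(A)}/400)\) as an illustrative transformation.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
Ndraws = 10_000

def gamma_draws(mean, std, size):
    """Gamma draws parameterized by (mean, std); convert to shape/scale."""
    shape = (mean / std) ** 2
    scale = std ** 2 / mean
    return rng.gamma(shape, scale, size)

rA = gamma_draws(0.50, 0.50, Ndraws)     # prior draws for r^(A)

# Illustrative implied transformation: beta = 1 / (1 + r^(A)/400)
beta = 1.0 / (1.0 + rA / 400.0)
print(np.percentile(beta, [5, 50, 95]))  # implied prior for beta
\end{verbatim}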
\begin{table}[t!] \caption{Posterior Estimates of DSGE Model Parameters} \label{t_dsge1_posterior}
\begin{center}
\begin{tabular}{p{0.8cm}ccp{0.8cm}cc}
\hline\hline
 & Mean & [0.05, 0.95] & & Mean & [0.05, 0.95] \\
\hline
$\tau$ & 2.83 & [ 1.95, 3.82] & $\rho_r$ & 0.77 & [ 0.71, 0.82] \\
$\kappa$ & 0.78 & [ 0.51, 0.98] & $\rho_g$ & 0.98 & [ 0.96, 1.00] \\
$\psi_1$ & 1.80 & [ 1.43, 2.20] & $\rho_z$ & 0.88 & [ 0.84, 0.92] \\
$\psi_2$ & 0.63 & [ 0.23, 1.21] & $\sigma_r$ & 0.22 & [ 0.18, 0.26] \\
$r^{(A)}$ & 0.42 & [ 0.04, 0.95] & $\sigma_g$ & 0.71 & [ 0.61, 0.84] \\
$\pi^{(A)}$ & 3.30 & [ 2.78, 3.80] & $\sigma_z$ & 0.31 & [ 0.26, 0.36] \\
$\gamma^{(Q)}$ & 0.52 & [ 0.28, 0.74] & & & \\
\hline \hline
\end{tabular}
\end{center}
{\em Notes:} We generated $N=100,000$ draws from the posterior and discarded the first 50,000 draws. Based on the remaining draws we approximated the posterior mean and the 5th and 95th percentiles.
\end{table}
Vary \(c\in(0, 2]\) and examine the effect on the acceptance rate and on the accuracy of the Monte Carlo approximations.
What is the relationship between acceptance rate and accuracy?
\includegraphics[width=4.3in]{static/dsge1_rwmh_scaling}
\includegraphics[width=4.3in]{static/dsge1_rwmh_acceptance_v_accuracy}
\begin{center} \includegraphics[width=3.5in]{static/dsge1_recursive_means_tau.pdf} \\\ \end{center}
Notes: The $x$-axis indicates the number of draws \(N\). \(N_0\) is set to \(0\), \(25,000\), and \(50,000\), respectively.
In high-dimensional parameter spaces the RWMH algorithm generates highly persistent Markov chains.
What's bad about persistence?
\begin{eqnarray*} \lefteqn{ \sqrt{N}( \bar{h}_N - \mathbb{E}[\bar{h}_N]) } \\ & \Longrightarrow & N \bigg( 0, \, \frac{1}{N} \sum_{i=1}^N \mathbb{V}[h(\theta^i)] + {\color{red} \frac{1}{N} \sum_{i=1}^N \sum_{j \not=i} COV\big[ h(\theta^i),h(\theta^j) \big]} \bigg). \end{eqnarray*}
Potential Remedy:
To reduce persistence of the chain, try to find partitions such that parameters are strongly correlated within blocks and weakly correlated across blocks or use random blocking.
Draw \(\theta^0 \in \Theta\). Generate a sequence of random partitions \(\{ B^i\}_{i=1}^N\) of the parameter vector \(\theta\) into \(N_{blocks}\) equally sized blocks, denoted by \(\theta_b\), \(b=1,\ldots,N_{blocks}\).
Then, for \(i=1\) to \(N\), execute the Block MH Algorithm using partition \(B^i\).
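A sketch of the random partitioning step only; the update within each block is a standard (e.g., random walk) MH step on the sub-vector \(\theta_b\), holding the other blocks fixed. \texttt{np.array\_split} handles the case in which the blocks cannot be exactly equal in size.

\begin{verbatim}
import numpy as np

def random_blocks(n_params, n_blocks, rng):
    """Randomly partition parameter indices 0..n_params-1 into n_blocks blocks."""
    idx = rng.permutation(n_params)
    return np.array_split(idx, n_blocks)

# Example: partition 13 parameters into 3 blocks.
rng = np.random.default_rng(0)
print(random_blocks(13, 3, rng))
\end{verbatim}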
The proposal distribution of the Metropolis-Adjusted Langevin (MAL) algorithm is given by
\begin{align*} \mu(\theta^{i-1}) &= \theta^{i-1} + \frac{c_1}{2} M_1 \frac{\partial}{\partial \theta} \ln p(\theta|Y) \bigg|_{\theta = \theta^{i-1}}, \\ \Sigma(\theta^{i-1}) &= c_2^2 M_2, \end{align*}
that is, the proposal mean shifts \(\theta^{i-1}\) by a step in the direction of the gradient of the log posterior density.
One standard practice is to set \(M_1 = M_2 = M\), with \[ M = -\left[\frac{\partial^2}{\partial \theta \partial \theta'} \ln p(\theta|Y) \bigg|_{\theta = \hat{\theta}}\right]^{-1}, \] where \(\hat\theta\) is the mode of the posterior distribution obtained using a numerical optimization routine.
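A sketch of a draw from the MAL proposal, assuming a placeholder \texttt{log\_posterior} and a precomputed matrix \texttt{M}; the finite-difference gradient is an illustrative shortcut (in practice the gradient may be available analytically or via automatic differentiation).

\begin{verbatim}
import numpy as np

def grad_fd(f, x, eps=1e-5):
    """Central finite-difference gradient of f at x (illustrative shortcut)."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

def mal_proposal(theta_prev, log_posterior, M, c1, c2, rng):
    """Draw from the MAL proposal N(mu(theta^{i-1}), c2^2 * M), with M1 = M2 = M."""
    theta_prev = np.asarray(theta_prev, dtype=float)
    mu = theta_prev + 0.5 * c1 * M @ grad_fd(log_posterior, theta_prev)
    return mu + np.linalg.cholesky(c2**2 * M) @ rng.standard_normal(len(mu))
\end{verbatim}

Because this proposal is not symmetric, the acceptance probability must include the ratio of proposal densities, as in the generic MH algorithm.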
The Newton MH Algorithm replaces the Hessian evaluated at the posterior
mode \(\hat{\theta}\) by the Hessian evaluated at \(\theta^{i-1}\).
The proposal distribution is given by
\begin{align*} \mu(\theta^{i-1}) &= \theta^{i-1} - s \left[\frac{\partial^2}{\partial \theta \partial \theta'} \ln p(\theta|Y) \bigg|_{\theta = \theta^{i-1}}\right]^{-1} \\ & \qquad \times \frac{\partial}{\partial \theta} \ln p(\theta|Y) \bigg|_{\theta = \theta^{i-1}}, \\ \hat\Sigma(\theta^{i-1}) &= - c_2^2 \left[\frac{\partial^2}{\partial \theta \partial \theta'} \ln p(\theta|Y) \bigg|_{\theta = \theta^{i-1}}\right]^{-1}. \end{align*}
It is useful to let \(s\) vary randomly, independently of \(\theta^{i-1}\): \[ c_1 = 2s, \quad s\sim iid\, U[0, \bar s], \] where \(\bar s\) is a tuning parameter.
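A corresponding sketch of the Newton MH proposal, reusing \texttt{grad\_fd} from the MAL sketch above together with a finite-difference Hessian (again an illustrative shortcut); the negative inverse Hessian must be positive definite at \(\theta^{i-1}\) for the Cholesky factorization to exist.

\begin{verbatim}
import numpy as np

def hess_fd(f, x, eps=1e-4):
    """Central finite-difference Hessian of f at x (illustrative shortcut)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        H[:, j] = (grad_fd(f, x + e, eps) - grad_fd(f, x - e, eps)) / (2.0 * eps)
    return 0.5 * (H + H.T)                   # symmetrize

def newton_mh_proposal(theta_prev, log_posterior, s, c2, rng):
    """Newton MH proposal: gradient step scaled by the local inverse Hessian."""
    theta_prev = np.asarray(theta_prev, dtype=float)
    g = grad_fd(log_posterior, theta_prev)
    Hinv = np.linalg.inv(hess_fd(log_posterior, theta_prev))
    mu = theta_prev - s * Hinv @ g
    Sigma = -c2**2 * Hinv                    # negative inverse Hessian
    return mu + np.linalg.cholesky(Sigma) @ rng.standard_normal(len(mu))
\end{verbatim}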
\begin{center} \begin{tabular}{llll} \hline\hline Algorithm & Run Time & Accpt. & Tuning \\ & [hh:mm:ss] & Rate & Constants \\ \hline 1-Block RWMH-I & 00:01:13 & 0.28 & $c = 0.015$ \\ 1-Block RWMH-V & 00:01:13 & 0.37 & $c = 0.400$ \\ 3-Block RWMH-I & 00:03:38 & 0.40 & $c = 0.070$ \\ 3-Block RWMH-V & 00:03:36 & 0.43 & $c = 1.200$ \\ 3-Block MAL & 00:54:12 & 0.43 & $c_1 = 0.4, c_2 = 0.750$ \\ 3-Block Newton MH & 03:01:40 & 0.53 & $\bar s = 0.7, c_2 = 0.600$ \\ \hline \end{tabular} \end{center}
Notes: In each run we generate \(N=100,000\) draws. We report
the fastest run time and the average acceptance rate across \(N_{run}=50\) independent Markov chains.
\begin{center}
\includegraphics[width=3.5in]{static/dsge1_tau_acf.pdf}
\end{center}
Notes: The autocorrelation functions are computed based on a single run of each algorithm.
\begin{center} \includegraphics[width=3.5in]{static/dsge1_relative_variance_tau.pdf} \end{center}
Notes: The small-sample inefficiency factors are computed based on \(N_{run}=50\) independent runs of each algorithm.
\[ \mbox{$iid$-equivalent draws per second} = \frac{N}{\mbox{Run Time [seconds]}} \cdot \frac{1}{\mbox{InEff}_N}. \]
\begin{center} \begin{tabular}{lr} \hline \hline Algorithm & Draws Per Second \\ \hline 1-Block RWMH-V & 7.76 \\ 3-Block RWMH-V & 5.65 \\ 3-Block MAL & 1.24 \\ 3-Block RWMH-I & 0.14 \\ 3-Block Newton MH & 0.13 \\ 1-Block RWMH-I & 0.04 \\ \hline \end{tabular} \end{center}
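For example, using the 1-Block RWMH-V run time from the first table (73 seconds for \(N=100{,}000\) draws) and an assumed inefficiency factor (the actual \(\mbox{InEff}_N\) values are not reproduced here):

\begin{verbatim}
def iid_equiv_draws_per_second(N, run_time_seconds, ineff_N):
    """iid-equivalent draws per second = (N / run time) / InEff_N."""
    return N / run_time_seconds / ineff_N

# run time from the table; the inefficiency factor 180 is an assumption
print(iid_equiv_draws_per_second(N=100_000, run_time_seconds=73.0, ineff_N=180.0))
\end{verbatim}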
\begin{center} \begin{tabular}{cc} RWMH-V (1 Block) & RWMH-V (3 Blocks) \\ \includegraphics[width=1.35in]{static/dsge1_rwmh_one_block_hac.pdf} & \includegraphics[width=1.35in]{static/dsge1_rwmh_three_block_hac.pdf} \\ MAL & Newton \\ \includegraphics[width=1.35in]{static/dsge1_langevin_hac.pdf} & \includegraphics[width=1.35in]{static/dsge1_newton_hac.pdf} \end{tabular} \end{center}
\tiny Notes: Each panel contains scatter plots of the small-sample variance \(\mathbb{V}[\bar{\theta}]\) computed across multiple chains (\(x\)-axis) versus the \(\mbox{HAC}[\bar{\theta}]\) estimates of \(\Omega(\theta)/N\) (\(y\)-axis).
Posterior model probabilities can be computed as follows:
\begin{equation} \pi_{i,T} = \frac{ \pi_{i,0} p(Y|{\cal M}_i) }{ \sum_{j} \pi_{j,0} p(Y|{\cal M}_j) }, \quad i=1,2, \end{equation}
where
\begin{equation} p(Y|{\cal M}) = \int p(Y|\theta,{\cal M}) p(\theta|{\cal M}) d\theta \end{equation}
Note: \[ \ln p(Y_{1:T}|{\cal M}) = \sum_{t=1}^T \ln \int p(y_t |\theta, Y_{1:t-1}, {\cal M}) p(\theta|Y_{1:t-1},{\cal M}) d\theta \]
Posterior odds and Bayes Factor
\begin{equation} \frac{\pi_{1,T}}{\pi_{2,T}} = \underbrace{ \frac{\pi_{1,0}}{\pi_{2,0}} }_{\displaystyle \mbox{Prior Odds}} \times \underbrace{ \frac{p(Y|{\cal M}_1)}{p(Y|{\cal M}_2)} }_{\displaystyle \mbox{Bayes Factor}} \end{equation}
Reciprocal importance sampling:
Chib and Jeliazkov's estimator
For a survey, see Ardia, Hoogerheide, and van Dijk (2009).
Reciprocal importance samplers are based on the following identity:
\begin{eqnarray} \frac{1}{p(Y ) } = \int \frac{ f(\theta) }{ p(Y|\theta) p(\theta ) } p(\theta |Y) d\theta, \end{eqnarray}
where \(\int f(\theta) d\theta = 1\).
Conditional on the choice of \(f(\theta)\), an obvious estimator is
\begin{equation} \hat{p}_{G}(Y) = \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{ f(\theta^{i}) }{ p(Y|\theta^{i}) p(\theta^{i})} \right]^{-1}, \label{eq_harmonicmean} \end{equation}
where \(\theta^{i}\) is drawn from the posterior \(p(\theta|Y)\).
Geweke (1999):
\begin{eqnarray} f(\theta) &=& \tau^{-1} (2\pi)^{-d/2} |V_{\theta}|^{-1/2} \exp \left[ - 0.5(\theta - \bar{\theta})' V^{-1}_{\theta} (\theta - \bar{\theta}) \right] \nonumber \\ && \times \, {\mathbb I}\left\{ (\theta - \bar{\theta})' V^{-1}_{\theta} (\theta - \bar{\theta}) \le F_{\chi^2_{d}}^{-1}(\tau) \right\}. \end{eqnarray}
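A sketch of the resulting modified harmonic mean estimator, assuming an \(N \times d\) array \texttt{draws} of posterior draws and a vector \texttt{log\_kernel} with entries \(\ln p(Y|\theta^i) + \ln p(\theta^i)\) are already available (both placeholders):

\begin{verbatim}
import numpy as np
from scipy import stats

def geweke_log_marglik(draws, log_kernel, tau=0.9):
    """Modified harmonic mean estimate of ln p(Y) with a truncated normal f."""
    draws = np.asarray(draws, dtype=float)
    log_kernel = np.asarray(log_kernel, dtype=float)
    N, d = draws.shape
    theta_bar = draws.mean(axis=0)
    V = np.cov(draws, rowvar=False)
    Vinv = np.linalg.inv(V)
    dev = draws - theta_bar
    quad = np.einsum('ij,jk,ik->i', dev, Vinv, dev)   # (theta-bar)' V^{-1} (theta-bar)
    inside = quad <= stats.chi2.ppf(tau, df=d)        # truncation region
    log_f = (-np.log(tau) - 0.5 * d * np.log(2.0 * np.pi)
             - 0.5 * np.linalg.slogdet(V)[1] - 0.5 * quad)
    # 1/p(Y) = E_post[f / (p(Y|theta) p(theta))]; log-sum-exp for stability
    log_inv_p = np.logaddexp.reduce(log_f[inside] - log_kernel[inside]) - np.log(N)
    return -log_inv_p
\end{verbatim}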
Rewrite Bayes Theorem:
\begin{equation} p(Y) = \frac{ p(Y|\theta) p(\theta)}{p(\theta|Y)}. \label{eq_csequality} \end{equation}
Thus,
\begin{equation} \hat{p}_{CS}(Y) = \frac{ p(Y|\tilde{\theta}) p(\tilde{\theta}) }{ \hat{p}(\tilde{\theta}|Y) }, \end{equation}
where we replaced the generic \(\theta\) in~(\ref{eq_csequality}) by the posterior mode \(\tilde{\theta}\).
Use output of Metropolis-Hastings Algorithm.
Proposal density for transition \(\theta \mapsto \tilde{\theta}\): \(q(\theta,\tilde{\theta}|Y)\).
Probability of accepting the proposed draw: \[ \alpha(\theta,\tilde{\theta}|Y) = \min \left\{ 1, \; \frac{p(\tilde{\theta}|Y)/q(\theta,\tilde{\theta}|Y)}{ p(\theta|Y)/q(\tilde{\theta},\theta|Y)} \right\}. \]
Note that
\begin{eqnarray*} \lefteqn{ \int \alpha(\theta,\tilde{\theta}|Y) q(\theta,\tilde{\theta}|Y) p(\theta|Y) d\theta } \\ &=& \int \min \left\{ 1, \; \frac{p(\tilde{\theta}|Y)/q(\theta,\tilde{\theta}|Y)}{ p(\theta|Y)/q(\tilde{\theta},\theta|Y)} \right\} q(\theta,\tilde{\theta}|Y) p(\theta|Y) d\theta \\ &=& p(\tilde{\theta}|Y) \int \min \left\{ \frac{p(\theta|Y)/q(\tilde{\theta},\theta|Y)}{p(\tilde{\theta}|Y)/q(\theta,\tilde{\theta}|Y)}, \; 1 \right\} q(\tilde{\theta},\theta|Y) d\theta \\ &=& p(\tilde{\theta}|Y) \int \alpha(\tilde{\theta},\theta|Y) q(\tilde{\theta},\theta|Y) d\theta \end{eqnarray*}
The posterior density at the mode can be approximated as follows:
\begin{equation} \hat{p}(\tilde \theta|Y) = \frac{ \frac{1}{N} \sum_{i=1}^{N} \alpha (\theta^{i},\tilde{\theta}|Y) q(\theta^{i},\tilde{\theta}|Y) }{ \frac{1}{J} \sum_{j=1}^J \alpha (\tilde{\theta},\theta^{j}|Y) }, \end{equation}
\(\{ \theta^{i} \}\) are posterior draws obtained with the M-H Algorithm;
\(\{ \theta^{j}\}\) are additional draws from \(q(\tilde{\theta},\theta|Y)\) given the fixed value \(\tilde{\theta}\).
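A sketch of the resulting estimator, assuming placeholder objects: \texttt{log\_kernel(theta)} returning \(\ln p(Y|\theta) + \ln p(\theta)\), a proposal density \texttt{q\_dens(theta\_from, theta\_to)}, a proposal sampler \texttt{q\_draw(theta\_from)}, the posterior draws \texttt{draws}, and the mode \texttt{theta\_tilde}.

\begin{verbatim}
import numpy as np

def alpha(theta, theta_to, log_kernel, q_dens):
    """MH acceptance probability for the move theta -> theta_to."""
    num = log_kernel(theta_to) - np.log(q_dens(theta, theta_to))
    den = log_kernel(theta) - np.log(q_dens(theta_to, theta))
    return min(1.0, np.exp(num - den))

def chib_jeliazkov_log_marglik(draws, theta_tilde, log_kernel, q_dens, q_draw, J):
    """ln p(Y) = ln p(Y|theta~) + ln p(theta~) - ln p_hat(theta~|Y)."""
    # numerator: average of alpha(theta^i, theta~) q(theta^i, theta~) over posterior draws
    num = np.mean([alpha(th, theta_tilde, log_kernel, q_dens)
                   * q_dens(th, theta_tilde) for th in draws])
    # denominator: average acceptance probability of moves away from theta~
    den = np.mean([alpha(theta_tilde, q_draw(theta_tilde), log_kernel, q_dens)
                   for _ in range(J)])
    return log_kernel(theta_tilde) - (np.log(num) - np.log(den))
\end{verbatim}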
\begin{center} \begin{tabular}{lcc} \hline\hline Model & Mean($\ln \hat p(Y)$) & Std. Dev.($\ln \hat p(Y)$) \\ \hline Geweke $(\tau = 0.5)$ & -346.17 & 0.03 \\ Geweke $(\tau = 0.9)$ & -346.10 & 0.04 \\ SWZ $(q = 0.5)$ & -346.29 & 0.03 \\ SWZ $(q = 0.9)$ & -346.31 & 0.02 \\ Chib and Jeliazkov & -346.20 & 0.40 \\ \hline \hline \end{tabular} \end{center}
Notes: The table shows the mean and standard deviation of the log marginal data density estimates, computed over \(N_{run} =50\) runs of the RWMH-V sampler using \(N=100,000\) draws and discarding a burn-in sample of \(N_0=50,000\) draws. The SWZ estimator uses \(J=100,000\) draws to compute \(\hat \tau\), while the CJ estimator uses \(J=100,000\) draws to compute the denominator of \(\hat{p}(\tilde{\theta}|Y)\).
[Metropolis1953] Metropolis, Rosenbluth, Rosenbluth, Teller & Teller, Equations of State Calculations by Fast Computing Machines, Journal of Chemical Physics, 21, 1087-1091 (1953).
[Hastings1970] Hastings, Monte Carlo Sampling Methods Using Markov Chains and Their Applications, Biometrika, 57, 97-109 (1970).
[Tierney1994] Tierney, Markov Chains for Exploring Posterior Distributions, The Annals of Statistics, 22(4), 1701-1728 (1994).
[Chib1995a] Chib & Greenberg, Understanding the Metropolis-Hastings Algorithm, The American Statistician, 49, 327-335 (1995).
[Robert2004] Robert & Casella, Monte Carlo Statistical Methods, Springer (2004).
[Geweke2005] Geweke, Contemporary Bayesian Econometrics and Statistics, John Wiley & Sons, Inc. (2005).
[DelNegro2008a] Del Negro & Schorfheide, Forming Priors for DSGE Models (and How it Affects the Assessment of Nominal Rigidities), Journal of Monetary Economics, 55(7), 1191-1208 (2008).