Introduction Bayes 6: (Linear) State Space Models

Intro

Background

A DSGE Model

Small-Scale DSGE Model

Final Goods Producers

Intermediate Goods Producers

Households

Monetary and Fiscal Policy

Exogenous Processes

Equilibrium Conditions

Equilibrium Conditions – Continued

Steady State

Solving DSGE Models

Solving DSGE Models

What is a Local Approximation?

What is a Log-Linear Approximation?

Loglinearization of New Keynesian Model

Canonical Linear Rational Expectations System

How Can One Solve Linear Rational Expectations Systems? A Simple Example

A Simple Example

Solving a More General System

Solving a More General System

Solving a More General System

Proposition

If there exists a solution to Eq. (\ref{eq_stabcond}) that expresses the forecast errors as function of the fundamental shocks \(\epsilon_t\) and sunspot shocks \(\zeta_t\), it is of the form

\begin{eqnarray} \eta_t &=& \eta_1 \epsilon_t + \eta_2 \zeta_t \label{eq_etasol} \\\ &=& ( - V_{.1}D_{11}^{-1} U_{.1}^{\prime}\Psi_x^J + V_{.2} \widetilde{M}) \epsilon_t + V_{.2} M_\zeta \zeta_t, \notag \end{eqnarray}

where \(\widetilde{M}\) is an \((k-r) \times l\) matrix, \(M_\zeta\) is a \((k-r) \times p\) matrix, and the dimension of \(V_{.2}\) is \(k\times (k-r)\). The solution is unique if \(k = r\) and \(V_{.2}\) is zero.

At the End of the Day…

Measurement Equation

State Space Models and The Kalman Filter

State space models form a very general class of models that encompass many of the specifications that we encountered earlier. VARMA models, linearized DSGE models, and more can be written in state space form. State space models are particularly popular at the FRB. For example, the models in the \(r^*\) suite can all be written in state space form.

A state space model can be described by two different equations: a measurement equation that relates an unobservable state vector \(s_t\) to the observables \(y_t\), and a transition equation that describes the evolution of the state vector \(s_t\). For now, we’ll restrict attention to the case in which both of these equations are linear.

Measurement. The measurement equation is of the form

\begin{eqnarray} \label{eq:obs} y_t = D_{t|t-1} + Z_{t|t-1} s_t + \eta_t , \quad t=1,\ldots,T \end{eqnarray}

where \(y_t\) is an \(n_y \times 1\) vector of observables, \(s_t\) is an \(n_s \times 1\) vector of state variables, \(Z_{t|t-1}\) is an \(n_y \times n_s\) vector, \(D_{t|t-1}\) is a \(n_y\times 1\) vector, and \(\eta_t\) are innovations (or often ``measurement errors’’) with mean zero and \(\mathbb{E}_{t-1}[ \eta_t \eta_t’] = H_{t|t-1}\).

Transition. The transition equation is of the form

\begin{eqnarray} \label{eq:transition} s_t = C_{t|t-1} + T_{t|t-1} s_{t-1} + R_{t|t-1} \epsilon_t \end{eqnarray}

where \(R_t\) is \(n_s \times n_\epsilon\), and \(\epsilon\) is a \(n_\epsilon \times 1\) vector of innovations with mean zero and variance \(\mathbb{E}_{t|t-1}[ \epsilon_t \epsilon_t’] = Q_{t|t-1}\).

The collection of matrices in (eq:obs) and (eq:transition) define the state space system. For that reason, they are often referred to as the “system matrices.”

Example. Consider the ARMA(1,1) model of the form

\begin{eqnarray} y_t = \phi y_{t-1} + \epsilon_t + \theta \epsilon_{t-1} \quad \epsilon_t \sim iid{\cal N}(0,\sigma^2) \end{eqnarray}

The model can be rewritten in state space form

\begin{eqnarray} y_t & = & [ 1 ; \theta] \left[ \begin{array}{c} \epsilon_t \ \epsilon_{t-1} \end{array} \right] + \phi y_{t-1}\\\ \left[ \begin{array}{c} \epsilon_t \ \epsilon_{t-1} \end{array} \right] & = & \left[ \begin{array}{cc} 0 & 0 \ 1 & 0 \end{array} \right] \left[ \begin{array}{c} \epsilon_{t-1} \ \epsilon_{t-2} \end{array} \right] + \left[ \begin{array}{c} \eta_t \ 0 \end{array} \right] \end{eqnarray}

where \(\epsilon_t \sim iid{\cal N}(0,\sigma^2)\). Thus, the state vector is composed of \(s_t = [\epsilon_t, \epsilon_{t-1}]’\) and \(D_{t} = \rho y_{t-1}\). This construction is not unique. We could also write the model as:

\begin{eqnarray} y_t & = & [ 1 ; 0] \left[ \begin{array}{c} y_t \ \epsilon_{t} \end{array} \right] \\\ \left[ \begin{array}{c} y_t \ \epsilon_{t} \end{array} \right] & = & \left[ \begin{array}{cc} \phi & \theta \ 0 & 0 \end{array} \right] \left[ \begin{array}{c} y_{t-1} \ \epsilon_{t-1} \end{array} \right] + \left[ \begin{array}{c} 1\ 1 \end{array} \right]\epsilon_t. \end{eqnarray}

Notice in this formulation the state vector \(s_t = [y_t,\epsilon_t]\) is partially observed. So it’s not true, strictly speaking, that the entire \(s_t\) vector must be unobserved. \(\Box\)

If the system matrices \(D_t, Z_t, H_t, C_t, T_t, R_t, Q_t\) are non-stochastic and predetermined, then the system is linear and \(y_t\) can be expressed as a function of present and past \(\eta_t\)’s and \(\epsilon_t\)’s. We’ve done some work on linear systems previously (VARs), so the natural next step is to expand our toolkit to do the kinds of things we liked to do with VARs:

Now, if the state vector was observed, it would be easy to combine equation (eq:obs) and (eq:transition) to obtain a VAR jointly in \([y_t, s_t]\). Thus, it would be straightforward to obtain the (perhaps conditional) likelihood: \[ p(Y^T,S^T | \{Z_t, d_t, H_t, T_t, c_t, R_t, Q_t \}). \] But life is hard, and we don’t get to observe \(S^T\). We need to compute the likelihood for the data we have, i.e., the likelihood in (eq:likelihood). We have to marginalize out \(S^T\). It turns out that their is an algorithm that does this, and fulfills the three desiderata above. The algorithm is called the Kalman Filter and was originally adopted from the engineering literature.

The Kalman Filter

For this presentation of the Kalman filter, we’re going to assume that the system matrices are time invariant, that is, they do not depend on \(t\). So we drop these subscripts from our notation. Furthermore, we’re going to collect them in the vector \(\theta = [C,T,R,Q,D,Z,H]\), where the \(vec\) operator is being implicitly applied to each matrix.

We’re also going to assume that the innovations \(\eta_t\) and \(\epsilon_t\) are normally distributed. We need to this to obtain an exact likelihood, although the Kalman filter can be used to obtain an optimal—in terms of MSE—predictor \(y_{t+h}\) given \(Y^T\) for \(h \ge 1\) using linear projections, regardless of the parametric distributions for \(\eta_t\) and \(\epsilon_t\). The chapter on state space models in Hamilton derives this. In this case the likelihood calculation delivers a quasi-likelihood.

With our normality assumption, the derivation of the Kalman filter has a natural Bayesian interpretation. Before we proceed, we’re going to state some results about multivariate normal distributions, which will help later on.

Lemma. Let \((x’,y’)’\) be jointly normal with \[ \mu = \left[ \begin{array}{c} \mu_x \ \mu_y \end{array} \right] \quad \mbox{and} \quad \Sigma = \left[ \begin{array}{cc} \Sigma_{xx} & \Sigma_{xy} \\\ \Sigma_{yx} & \Sigma_{yy} \end{array} \right] \] Then the \(pdf(x|y)\) is multivariate normal with

\begin{eqnarray} \mu_{x|y} &=& \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1}(y - \mu_y) \\\ \Sigma_{xx|y} &=& \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} \end{eqnarray}

Note that the converse is not necessarily true. \(\Box\)

In both theory and practice, the Kalman filter proceeds recursively, using the natural prior-posterior sequencing, after an initialization.

Initialization. We’re going to start at period \(t=0\), that is, the period before we first observe \(y\). We assume that \(s_0\) is normally distributed:

\begin{align} s_0 | \theta \sim \mathcal N\left(A_0, P_0\right). \end{align}

Importantly, we conceptualize this distribution as prior distribution. We’ll discuss possible ways to select \(A_0\) and \(P_0\) in a bit.

Prediction. We can combine our prior distribution for \(s_0\) with the state transition equation (eq:transition). Since \(s_0\) is normally distributed and \(\epsilon_1\) is also normally distributed (and independent of \(s_0\)), \(s_1\) is also normally distributed, \[ s_1 | \theta \sim \mathcal N\left(A_{1|0}, P_{1|0}\right) \] where \[ A_{1|0} = C+T A_{0} \mbox{ and } P_{1|0} = T P_0 T’ + RQR’. \] Note that this is the unconditional distribution of \(s_1\), a prior distribution for \(s_1\) before seeing \(y_1\). We write the mean \(A_{1|0}\) and \(P_{1|0}\) as conditional on time \(t=0\).

Next consider the prediction of \(y_1\). The conditional distribution of \(y_1\) is of the form

\begin{eqnarray} y_1|s_1,\theta \sim {\cal N}(D+Z s_1, H) \end{eqnarray}

Since \(s_1 \sim {\cal N}( A_{1|0}, P_{1|0})\), we can deduce that the marginal distribution of \(y_1\) is of the form

\begin{eqnarray} y_1|\theta \sim {\cal N} (\hat{y}_{1|0}, F_{1|0}) \end{eqnarray}

where

\begin{eqnarray*} \hat{y}_{1|0} = D + Z A_{1|0} \mbox{ and } F_{1|0} = Z P_{1|0} Z’ + H. \end{eqnarray*}

Here we’ve been explicit in going \(s_0 \rightarrow s_1 \rightarrow y_1\).

Updating. Another way to see this is to rewrite the observation equation (eq:obs) in terms of \(s_{t-1}\) and \(\epsilon_t\). If \(s_{0}\) is normally distributed as above it’s easy to see that \(s_1\) and \(y_1\) are jointly normally distributed with the marginal and conditional distributions mentioned above. We have:

\begin{eqnarray} s_1 &=& C + T s_0 + R\epsilon_t \\\ y_1 &=& D + Z T s_0 + Z\epsilon_t + \eta_t. \end{eqnarray}

Direct calculation yields:

\begin{eqnarray} \begin{bmatrix}s_1 \ y_1 \end{bmatrix} \bigg| \theta \sim \mathcal N \left( \begin{bmatrix} A_{1|0} \ \hat y_{1|0} \end{bmatrix}, \begin{bmatrix} P_{1|0} & P_{1|0} Z’ \ Z P_{1|0} & F_{1|0} \end{bmatrix}\right). \end{eqnarray}

Consider the third goal of toolbox: delivering \(p(s_1|y_1,\theta)\). Well, we can get that easily using the formula for the conditional normal distribution:

\begin{align} \label{eq:update} s_1 | y_1 \sim N\left( A_{1|0} + P_{1|0} Z’ F_{1|0}^{-1} \left(y_1 - \hat y_{1|0}\right), P_{1|0} - P_{1|0} Z’ F_{1|0}^{-1} Z P_{1|0}\right). \end{align}

Note that we could have instead obtained this using:

\begin{align} \label{eq:bayes} p(s_1 | y_1,\theta) \propto p(y_1|s_1,\theta) p(s_1|\theta), \end{align}

i.e., our good friend Bayes rule! Note the conjugacy (normal-normal) likelihood-prior relationship yields a normally distributed posterior. Finally, let’s call give our updated state mean and variance:

\begin{align} \label{eq:ms} A_1 = A_{1|0} + P_{1|0} Z’ F_{1|0}^{-1} \left(y_1 - \hat y_{1|0}\right) \mbox{ and } P_1 = P_{1|0} - P_{1|0} Z’ F_{1|0}^{-1} Z P_{1|0}. \end{align}

Generalization. Now, with the distribution form \(s_1|y_1,\theta\), we’re back where we started! So all we have to do is construct \(s_2|y_1,\theta\) and \(y_2 | s_2, y_1, \theta\) in an identical fashion as above, and so on for \(t = 2,\ldots,T\). We can summarize the recursions:

  1. Initialization. Set \(s_0 \sim N(A_0,P_0).\)

  2. Recursions. For \(t=1,\ldots,T\):

    \begin{align} \label{eq:state-prediction} \mbox{state prediction}&: A_{t|t-1} = C+T A_{t-1} \mbox{ and } P_{t|t-1} = T P_{t-1} T’ + RQR’.\\\ \label{eq:obs-prediction} \mbox{observation prediction}&: \hat y_{t|t-1} = D + Z A_{t|t-1} \mbox{ and } F_{t|t-1} = Z P_{t|t-1} Z’ + H. \\\ \label{eq:state-update} \mbox{state update}&: A_t = A_{t|t-1} + P_{t|t-1} Z’ F_{t|t-1}^{-1} \left(y_t - \hat y_{t|t-1}\right) \mbox { and } \nonumber \\\ &~~ P_t = P_{t|t-1} - P_{t|t-1} Z’ F_{t|t-1}^{-1} Z P_{t|t-1}. \end{align}

Likelihood function. We can define the one-step ahead forecast error

\begin{eqnarray} \nu_t = y_t - \hat{y}_{t|t-1} = Z (s_t - A_{t|t-1}) + \eta_t. \end{eqnarray}

The likelihood function is given by

\begin{eqnarray} p(Y^T | \theta ) & = & \prod_{t=1}^T p(y_t|Y^{t-1}, \theta) \nonumber \\\ & = & ( 2 \pi)^{-n_yT/2} \left( \prod_{t=1}^T |F_{t|t-1}| \right)^{-1/2} \times \exp \left\{ - \frac{1}{2} \sum_{t=1}^T \nu_t F_{t|t-1}^{-1} \nu_t’ \right\} \end{eqnarray}

This representation of the likelihood function is often called prediction error form, because it is based on the recursive prediction one-step ahead prediction errors \(\nu_t\). \(\Box\)

Discussion

Initialization. First, on the initialization step, if the system-matrices are time-invariant and the process for \(s_t\) is stationary (i.e., all the eigenvalues of \(T\) are less than one in magnitude), it might make sense to initialize the Kalman filter from the invariant distribution, i.e., we have \(A_0\) and \(P_0\) such that \[ A_0 = (I_{n_s} - T)^{-1} C \mbox{ and } P_0 = T P_0 T’ + RQR’. \] If the system is not too big, you can solve for \(P_0\) directly using the \(vec\) operator: \[ vec(P_0) = \left(I_{n_s^2} - (T’ \otimes T)\right)^{-1} RQR’. \] Otherwise, there are algorithms available for computing \(P_0\) reliably and quickly.

If the system is not stationary, it’s common practice to set the variance of \(P_0\) be extremely large, like \(1000\times I_{n_s}\).

Kalman Gain. In (eq:state-update), the matrix that maps the prediction errors, \(\nu_t\), into the state revision is important enough to warrant it’s own name: the Kalman Gain. The Kalman Gain, \[ K_t = P_{t|t-1} Z F_{t|t-1}^{-1}, \] is an \(n_s\times n_y\) matrix that maps the “surprises” (forecast errors) in the observed data to changes in our beliefs about the underlying unobserved states. Essentially, the gain tells us how we learn about the states from the data.

Time-varying system matrices and missing data. The Kalman filter recursions in (eq:state-prediction), (eq:obs-prediction), and (eq:state-update) are valid if the system matrices are time-varying (but pre-determined.) In practice, it is simply a matter of adding the relevant subscripts onto the system matrices. An important case of time-varying system matrices is when they are constant except for the fact that some of the observations are missing; i.e, for some \(t\), at least one element of \(y_t\) is missing. In this case, we simply modify the observation equation (eq:obs)—and hence, (eq:obs-prediction) and (eq:state-update)—in order to account for the fact that we observe fewer series at some periods. Suppose in period \(t\) we observe \(n_{y_t}\), which is less than or equal to \(n_y\). Define the \(n_{y_t} \times n_y\) select matrix \(M_t\), to be the matrix whose columns are comprised of \(\{e_i : i\mbox{th series is observed}\}\), where \(e_i\) is the \(n_y\times 1\) vector with a one in the \(i\)th position and zeros elsewhere. Then,

\begin{align} \label{eq:missing} D_t = M_t D,~~ Z_t = M_t Z, \mbox{ and } H_t = M_t H M_t’. \end{align}

The ability to handle missing data is an extremely powerful feature of the Kalman filter, as it allows us to both handle estimating models with missing data, and make inference about the missing data itself. More on this later. Most programmed Kalman filter routines can handle missing data without an modification of the system matrices on the part of the user. Simply code your missing data as nan. Finally, note that the likelihood calculation in (eq:likelihood) needs to be modified (i.e., \(n_y\) needs to be replaced by \(n_{y_t}\).) Again, preprogrammed routines should handle this without user intervention.

“Steady-state” Kalman filter. Suppose the system matrices are constant. If we combine (eq:state-prediction), (eq:obs-prediction) (eq:state-update) for the state variance, we obtain

\begin{eqnarray} P_{t+1|t} = T P_{t|t-1}T’ + RQR - T P_{t|t-1} Z’ (Z P_{t|t-1} Z’ + H)^{-1} Z P_{t|t-1}T' \end{eqnarray}

with \(P_{0|-1} = P_0.\) This equation is known as the matrix Riccati recursion, a discrete time analogue to the popular set of ODEs. Under some regularity conditions, as \(t\) gets sufficiently large, \(P_{t+1|t} \rightarrow \bar P\), i.e., there is an invariant solution to the Riccati equation. Some people refer to this as the “steady-state” prediction variance (and correspondingly, the “steady-state” Kalman gain.) It can be useful in computation as well: after a sufficiently amount of time, one does not need to continue to update \(P_{t|t-1}\), which is the typically the costliest part of evaluating the Kalman filter. Note this also makes clear that the variances in the Kalman filter to not depend on the observed data.

Innovations representation of the Kalman Filter. TBD

Caution. Some authors adopt a slightly different timing convention with the Kalman Filter; specifically, DurbinKoopman2001. The initialization of the filter changes slightly. It’s all very tedious.

Kalman Smoothing

Note that the Kalman filter is a filter: it delivers the sequence of smoothed distribution \(\{s_t | Y^{t}\}_{t=1}^T\), which since they are normal, are simply described by the sequence \(\{A_t,P_t\}_{t=1}^T\). Sometimes, we interested in the smoothed distributions, \(\{s_t | Y^T\}_{t=1}^T\), that is distributions of the unobserved states conditional on all of the data. These distributions are also normally distributed, and can be found another recursive algorithm known as the Kalman smoother.

Details TBD.

Drawing from the Smoothed Distribution

TBD.

An Example: GDP+

Here I’m going to through a simple state space model described in Aruoba_2016. Since GDP data is inherently noisy, the authors use both income-side \((GDP_{It})\) and expenditure-side \((GDP_{Et})\) data on GDP growth to infer the true (unobserved) growth rate, \(GDP_t\). The authors posit that the true growth rate follows an AR(1):

\begin{align} \label{eq:gdp} GDP_t = \mu (1-\rho) + \rho GDP_{t-1} + \epsilon_t, \quad \epsilon_t \sim IID N(0, \sigma^2). \end{align}

An that both income- and expenditure-side estimates are mismeasured versions of this:

\begin{align} \label{eq:gdp-measure} \begin{array}{c} GDP_{Et} \ GDP_{It} \end{array} \bigg| GDP_t \sim IID N \left( \left[ \begin{array}{c} GDP_t \ GDP_t \end{array} \right], \left[ \begin{array}{cc} \sigma_{E}^2 & 0 \ 0 & \sigma_{I}^2 \end{array} \right]\right) \end{align}

We can cast this into state space form with \(n_y = 2\) and \(n_s = n_\epsilon = 1\). We have

\begin{align} C = \mu(1-\rho),~~ T = \rho,~~ R = 1, \mbox{ and } Q = \sigma^2, \nonumber \\\ D = \begin{bmatrix}0 \ 0 \end{bmatrix}, ~~ Z = \begin{bmatrix}1 \ 1 \end{bmatrix}, \mbox{ and } H = \left[ \begin{array}{cc} \sigma_{E}^2 & 0 \ 0 & \sigma_{I}^2 \end{array} \right]. \end{align}

Here’s a look at the data:

We can use the Kalman filter to maximize the likelihood function, since we haven’t quite worked out how to elicit the posterior of this model just yet.

Initial likelihood:  -4293.835439078496
      fun: 358.8810028467478
 hess_inv: <5x5 LbfgsInvHessProduct with dtype=float64>
      jac: array([-1.08002496e-04, -1.70530257e-05, -3.12638804e-04, -2.95585778e-04,
       -2.16004992e-04])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 108
      nit: 14
   status: 0
  success: True
        x: array([0.50974449, 0.39613152, 0.28275435, 0.40326539, 0.64044755])
Maximized Likelihood:  -358.8810028467478
{'rho': 0.5097444850915837, 'mu': 0.39613152196112617, 'sige': 0.2827543510275178, 'sigi': 0.4032653925782693, 'sig': 0.6404475458159359}

We these point estimates, we can use the kalman filter to extract \(\{ A_t\}_{t=1}^T\), the filtered means of the “true” GDP series. We’ll plot them along with the observables, and the simple average of expenditure-side and income-side GDP estimates.

To make it a bit easier to see, let’s look at the last five or so years.

You can see the average is different form the filtered estimate. Why is that? The Kalman Gain matrix places different weights on the income- and expenditure-side GDP data.

0.4217678091997104 0.42176780919971

Bibliography

References

Bibliography

[woodford_2003] Woodford, Interest and Prices, Princeton University Press (2003).

[Gali2008] Jordi Gal'i, Monetary Policy, Inflation, and the Business Cycle: An Introduction to the New Keynesian Framework, Princeton University Press (2008).

[ireland2004] Ireland, A Method for Taking Models to the Data, Journal of Economic Dynamics and Control, 28(6), 1205-1226 (2004).

[christiano2005] Christiano, Eichenbaum, Evans & , Nominal Rigidities and the Dynamic Effects of a Shock to Monetary Policy, Journal of Political Economy, 113(1), 1-45 (2005).

[Smets2007] Smets & Wouters, Shocks and Frictions in US Business Cycles: A Bayesian DSGE Approach, American Economic Review, 97, 586-608 (2007).

[An2007b] An & Frank Schorfheide, Bayesian Analysis of DSGE Models, Econometric Reviews, 26(2-4), 113-172 (2007). doi.

[Harvey1991] Andrew Harvey, Forecasting, Structural Time Series Models and the Kalman Filter, University of Cambridge Press (1991).

[Hamilton] James Hamilton, Time Series Analysis, Princeton University Press (1994).

[HerbstSchorfheide2015] Edward Herbst & Frank Schorfheide, Bayesian Estimation of DSGE Models, Princeton University Press (2015).

[Fern_ndez_Villaverde_2016] Fernandez-Villaverde, Rubio-Ramirez, & Schorfheide, Solution and Estimation Methods for DSGE Models, Handbook of Macroeconomics, 527 724 (2016). link. doi.

[Sims2002] Sims, Solving Linear Rational Expectations Models, Computational Economics, 20, 1-20 (2002).

[DurbinKoopman2001] Durbin & Koopman, Time Series Analysis by State Space Methods, Oxford University Press (2001).

[Aruoba_2016] Aruoba, Diebold, , Nalewaik, Schorfheide, Song & Dongho, Improving GDP measurement: A measurement-error perspective, Journal of Econometrics, 191(2), 384 397 (2016). link. doi.