ECON 616: Machine Learning

Intro + Definitions

Background

“Machine Learning” definition

A Dictionary

| | Econometrics | ML |
|---|---|---|
| \(\underbrace{y}_{T\times 1} = \{y_{1:T}\}\) | Endogenous | Outcome |
| \(\underbrace{X}_{T\times n} = \{x_{1:T}\}\) | Exogenous | Feature |
| \(1:T\) | “in sample” | “training” |
| \(T:T+V\) | ??? (not enough data!) | “validation” |
| \(T+V:T+V+O\) | “out of sample” | “testing” |

Today I’ll concentrate on prediction (regression) problems. \[ \hat y = f(X;\theta) \] Economists would call this modeling the conditional expectation; ML practitioners would call \(f\) the hypothesis function.

It all starts with a loss function

Generally speaking, we can write (essentially) any estimation problem as a loss-minimization problem.
Let \[ L(\hat y, y) = L(f(X;\theta), y) \] be a loss function (sometimes called a “cost” function).
Estimation in ML: pick \(\theta\) to minimize the loss.
ML is more concerned with minimizing the loss than with inference on \(\theta\) per se.
Forget standard errors…

Gradient Descent
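Gradient descent repeatedly nudges \(\theta\) in the direction that lowers the loss. Here is a minimal sketch for the squared-error loss with a linear hypothesis \(f(X;\theta) = X\theta\); the function name and step size are illustrative, not from the slides.

import numpy as np

def gradient_descent(X, y, step=0.01, n_iter=5000):
    # minimize L(theta) = 1/(2T) * sum_t (x_t' theta - y_t)^2 by gradient descent
    T, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iter):
        resid = X @ theta - y          # current prediction errors
        grad = X.T @ resid / T         # gradient of the loss with respect to theta
        theta -= step * grad           # move a small step downhill
    return theta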

Example: Forecasting Inflation

Let’s consider forecasting (GDP deflator) inflation.

Linear Regression

from sklearn.linear_model import LinearRegression
linear_model_univariate = LinearRegression()

# inf: DataFrame of GDP-deflator inflation data, assumed loaded upstream,
# with columns GDPDEF and inf
train_start, train_end = '1985', '2000'
inf['inf_L1'] = inf.GDPDEF.shift(1)           # one-period lag used as the regressor
inf = inf.dropna(how='any')                   # drop the observation lost to the lag
inftrain = inf[train_start:train_end]         # training sample
Xtrain, ytrain = (inftrain.inf_L1.values.reshape(-1, 1),
                  inftrain.inf)
fitted_ols = linear_model_univariate.fit(Xtrain, ytrain)

Many regressors

Let’s add the individual SPF (Survey of Professional Forecasters) forecasts to our regression.


Estimating this in scikit-learn is easy.


import numpy as np

# spf_flat (the individual SPF forecasts) and train_forecasters (the list of
# forecaster IDs to include) are assumed to be constructed upstream
spf_flatted_zero = spf_flat.fillna(0.)        # replace missing individual forecasts with zeros

spfX = spf_flatted_zero[train_forecasters][train_start:train_end]
spfXtrain = np.c_[Xtrain, spfX]               # lagged inflation plus the individual SPF forecasts

linear_model_spf = LinearRegression()
fitted_ols_spf = linear_model_spf.fit(spfXtrain, ytrain)

Table 1: Mean Squared Errors

| Method | Train | Test |
|---|---|---|
| LS-univariate | 0.59 | 2.28 |
| LS-SPF | 0 | 2.1 |
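The train/test numbers above can be computed along these lines (a sketch; the test window and the names inftest, Xtest, ytest are illustrative and assume the test sample is built exactly like the training sample).

from sklearn.metrics import mean_squared_error

test_start, test_end = '2001', '2019'         # hypothetical test window
inftest = inf[test_start:test_end]
Xtest, ytest = inftest.inf_L1.values.reshape(-1, 1), inftest.inf

mse_train = mean_squared_error(ytrain, fitted_ols.predict(Xtrain))
mse_test = mean_squared_error(ytest, fitted_ols.predict(Xtest))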

Regularization

We’ve got way too many variables – our model does horribly out of sample!
There are many regularization techniques available for variable selection.
Conventional: AIC, BIC.
Alternative approach: penalized regression.
Consider the loss function: \[ L(\hat y,y) = \frac{1}{2T} \sum_{t=1}^T (f(x_t;\theta) - y_t)^2 + \lambda \sum_{i=1}^N \left[(1-\alpha)|\theta_i| + \alpha|\theta_i|^2\right]. \] This is called elastic net regression. When \(\lambda = 0\), we’re back to OLS.
Many special cases.
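The elastic net is available directly in scikit-learn, but note that its parameterization differs from the loss above: alpha plays the role of \(\lambda\), and l1_ratio is the weight on the L1 term, i.e. the \((1-\alpha)\) in the slide’s notation. A minimal sketch, with illustrative parameter values:

from sklearn.linear_model import ElasticNet

# scikit-learn's penalty: alpha * (l1_ratio * ||theta||_1 + 0.5 * (1 - l1_ratio) * ||theta||_2^2)
fitted_enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(spfXtrain, ytrain)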

Ridge Regression

Ridge regression (Hoerl and Kennard, 2004) is the special case where \(\alpha = 1\).
Long used in statistics and econometrics (since the 1940s).
It is sometimes called (or is a special case of) “Tikhonov regularization”.
It’s an L2 penalty, so it won’t force parameters to be exactly zero.
Can be formulated as Bayesian linear regression.
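To see the Bayesian interpretation (a standard result, stated here up to a rescaling of \(\lambda\)): with a linear hypothesis the ridge estimator is \[ \hat\theta_{ridge} = \arg\min_{\theta}\; \|y - X\theta\|^2 + \lambda \|\theta\|^2 = (X'X + \lambda I)^{-1} X'y, \] which is the posterior mean (and mode) in the Gaussian linear regression \(y \mid \theta \sim N(X\theta, \sigma^2 I)\) with prior \(\theta \sim N(0, (\sigma^2/\lambda) I)\).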

from sklearn.linear_model import Ridge

# Ridge's alpha argument is the penalty weight (lambda above); the default is 1.0
fitted_ridge = Ridge().fit(spfXtrain, ytrain)

LS vs Ridge (\(\lambda = 1\)) Coefficients

Lasso Regression

Set \(\alpha = 0\)

from sklearn.linear_model import Lasso

# Lasso's alpha argument is the penalty weight (lambda above); the default is 1.0
fitted_lasso = Lasso().fit(spfXtrain, ytrain)

LS vs Lasso (\(\lambda = 1\)) Coefficients

Picking \(\lambda\)

Table 2: Mean Squared Error, \(\lambda = 1\)

| Method | Train | Test |
|---|---|---|
| Least Squares (Univariate) | 0.35 | 0.71 |
| Least Squares (SPF) | 0.0 | 0.68 |
| Ridge (SPF) | 0.0003 | 0.67 |
| Lasso (SPF) | 0.59 | 0.96 |
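The tables above fix \(\lambda = 1\); in practice \(\lambda\) is usually chosen by cross-validation on the training sample. A sketch with scikit-learn (the grid of candidate values and the number of splits are illustrative; TimeSeriesSplit keeps the validation folds in temporal order):

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 50),            # candidate values of lambda
                   cv=TimeSeriesSplit(n_splits=5)).fit(spfXtrain, ytrain)
lambda_hat = lasso_cv.alpha_                                  # lambda with the best cross-validated fit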

Support Vector Machines

Estimating a Support Vector Machine

from sklearn.svm import SVR

# SVR defaults to an RBF kernel; fit here on the univariate (lagged inflation) regressor
fitted_svm = SVR().fit(Xtrain, ytrain)

Table 3: Mean Squared Error, \(\lambda = 1\)

| Method | Train | Test |
|---|---|---|
| Least Squares (Univariate) | 0.35 | 0.71 |
| SVM (Univariate) | 0.06 | 0.73 |

Other ML Techniques

Neural Networks

Let’s construct a hypothesis function using a neural network.
Suppose that we have \(N\) features in \(x_t\).
(Let \(x_{0,t}\) be the intercept.)
Neural networks are modeled after the way neurons in the brain work as basic computational units.

Neural Networks, continued

Drop the \(t\) subscript. Consider: \[ \left[\begin{array}{c} x_{0} \\ \vdots \\ x_{N} \end{array}\right] \rightarrow \left[\;\cdot\;\right] \rightarrow f(x;\theta) \] \(a_i^j\): the activation of unit \(i\) in layer \(j\).

\(\theta^j\): the matrix of weights controlling the mapping from layer \(j\) to layer \(j+1\). \[ \left[\begin{array}{c} x_{0} \\ \vdots \\ x_{N} \end{array}\right] \rightarrow \left[\begin{array}{c} a_{0}^2 \\ \vdots \\ a_{N}^2 \end{array}\right] \rightarrow f(x;\theta) \]

Neural Networks in a figure

Neural Networks Continued

Suppose \(N = 2\) and our neural network has \(1\) hidden layer:

\begin{eqnarray} a_1^2 &=& g(\theta_{10}^1 x_0 + \theta_{11}^1 x_1 + \theta_{12}^1 x_2) \nonumber \\ a_2^2 &=& g(\theta_{20}^1 x_0 + \theta_{21}^1 x_1 + \theta_{22}^1 x_2) \nonumber \\ f(x;\theta) &=& g(\theta_{10}^2 a_0^2 + \theta_{11}^2 a_1^2 + \theta_{12}^2 a_2^2) \end{eqnarray}

(\(a_0^j\) is always a constant (“bias”) by convention.)
The matrix of coefficients \(\theta^j\) is sometimes called the weights.
Depending on \(g\), \(f\) is highly nonlinear in \(x\)! Good and bad …
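A small numerical sketch of the forward pass above, using the sigmoid as \(g\) (the function names and weight shapes are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, theta1, theta2, g=sigmoid):
    # x: the N = 2 features; theta1: 2 x 3 weights for layer 1; theta2: length-3 weights for layer 2
    a1 = np.concatenate(([1.0], x))      # input layer with the bias x_0 = 1
    a2 = g(theta1 @ a1)                  # hidden activations a_1^2, a_2^2
    a2 = np.concatenate(([1.0], a2))     # prepend the hidden-layer bias a_0^2 = 1
    return g(theta2 @ a2)                # f(x; theta)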

Which activation function?

| Name | \(g(\theta x)\) |
|---|---|
| linear | \(\theta x\) |
| sigmoid | \(1/(1+e^{-\theta x})\) |
| tanh | \(\tanh(\theta x)\) |
| rectified linear unit | \(\max(0,\theta x)\) |

How to pick \(g\)…?

How to estimate this model?

Just like any other ML model: minimize the loss!
Gradient descent needs the derivatives of the loss with respect to \(\theta\).
The back propagation algorithm computes them efficiently (sketched below).
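A sketch of back propagation for the tiny network above; it reuses the sigmoid and the notation from the forward-pass sketch, and takes the single-observation squared-error loss \(\tfrac{1}{2}(f(x;\theta) - y)^2\).

def backprop(x, y, theta1, theta2, g=sigmoid):
    # forward pass, keeping the intermediate activations
    a1 = np.concatenate(([1.0], x))
    h = g(theta1 @ a1)                                # hidden activations (without the bias)
    a2 = np.concatenate(([1.0], h))
    f = g(theta2 @ a2)                                # network output

    # backward pass (chain rule); for the sigmoid, g'(z) = g(z) * (1 - g(z))
    delta3 = (f - y) * f * (1.0 - f)                  # error signal at the output unit
    grad_theta2 = delta3 * a2                         # dL/dtheta^2
    delta2 = theta2[1:] * delta3 * h * (1.0 - h)      # error signals at the hidden units
    grad_theta1 = np.outer(delta2, a1)                # dL/dtheta^1
    return grad_theta1, grad_theta2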

Application: Nakamura (2005)

scikit-learn code

from sklearn.neural_network import MLPRegressor

NN = MLPRegressor(hidden_layer_sizes=(2,),  # one hidden layer with two units
                  activation='tanh',        # activation function g
                  alpha=1e-6,               # (tiny) L2 penalty on the weights
                  max_iter=10000,
                  solver='lbfgs')           # full-batch quasi-Newton optimizer

fitted_NN = NN.fit(Xtrain, ytrain)

Neural Network vs. AR(1): Predicted Values

What is the “right” method to use?

You might have guessed…
Wolpert and Macready (1997): a universal learning algorithm cannot exist.
Need prior knowledge about the problem…
This idea has been present in econometrics for a very long time…
There’s no free lunch!

References

Athey, S. (2018): “The Impact of Machine Learning on Economics,” in The Economics of Artificial Intelligence: An Agenda, National Bureau of Economic Research, Inc.
Bates, J. M., and C. W. J. Granger (1969): “The Combination of Forecasts,” OR, 20, 451.
Chernozhukov, V., W. K. Härdle, C. Huang, and W. Wang (2018): “Lasso-Driven Inference in Time and Space,” SSRN Electronic Journal.
Hastie, T., R. Tibshirani, and J. Friedman (2009): “The Elements of Statistical Learning,” Springer Series in Statistics.
Hoerl, A. E., and R. W. Kennard (2004): “Ridge Regression,” Encyclopedia of Statistical Sciences.
Mullainathan, S., and J. Spiess (2017): “Machine Learning: An Applied Econometric Approach,” Journal of Economic Perspectives, 31, 87–106.
Nakamura, E. (2005): “Inflation Forecasting Using a Neural Network,” Economics Letters, 86, 373–378.
Varian, H. R. (2014): “Big Data: New Tricks for Econometrics,” Journal of Economic Perspectives, 28, 3–28.
Wolpert, D., and W. Macready (1997): “No Free Lunch Theorems for Optimization,” IEEE Transactions on Evolutionary Computation, 1, 67–82.