This note presents the R package bayesGARCH which provides functions for the Bayesian estimation of the parsimonious and effective GARCH(1,1) model with Student-\(t\) innovations. The estimation procedure is fully automatic and thus avoids the tedious task of tuning an MCMC sampling algorithm. The usage of the package is shown in an empirical application to exchange rate log-returns.
Research on changing volatility using time series models has been active since the pioneering paper by Engle (1982). From there, ARCH (AutoRegressive Conditional Heteroscedasticity) and GARCH (Generalized ARCH) type models grew rapidly into a rich family of empirical models for volatility forecasting during the 1980s. These models are widespread and essential tools in financial econometrics.
In the \(\text{GARCH}(p,q)\) model introduced by Bollerslev (1986), the conditional variance at time \(t\) of the log-return \(y_t\) (of a financial asset or a financial index), denoted by \(h_t\), is postulated to be a linear function of the \(q\) past squared log-returns and the \(p\) past conditional variances. More precisely: \[h_t \doteq \alpha_0 + \sum_{i=1}^q \alpha_i \, y_{t-i}^2 + \sum_{j=1}^p \, \beta_j h_{t-j} \,,\] where the parameters satisfy the constraints \(\alpha_i \geq 0\) \((i=0,\ldots,q)\) and \(\beta_j \geq 0\) \((j=1,\ldots,p)\) in order to ensure a positive conditional variance. In most empirical applications it turns out that the simple specification \(p=q=1\) is able to reproduce the volatility dynamics of financial data. This has led the GARCH(1,1) model to become the workhorse model for both academics and practitioners. Given a model specification for \(h_t\), the log-returns are then modelled as \(y_t = \varepsilon_t h_t^{1/2}\), where the \(\varepsilon_t\) are i.i.d. disturbances. Common choices for \(\varepsilon_t\) are Normal and Student-\(t\) disturbances. The Student-\(t\) specification is particularly useful since, unlike the Normal specification, it can provide the excess kurtosis in the conditional distribution that is often found in financial time series.
Until recently, GARCH models have mainly been estimated using the classical maximum likelihood technique. Several R packages provide functions for their estimation; see, e.g., fGarch (Wuertz and Chalabi 2009), rgarch (Ghalanos 2010) and tseries (Trapletti and Hornik 2009). The Bayesian approach offers an attractive alternative which enables small sample results, robust estimation, model discrimination, model combination, and probabilistic statements on (possibly nonlinear) functions of the model parameters.
The package bayesGARCH (Ardia 2007) implements the Bayesian estimation procedure described in Ardia (2008, chapter 5) for the GARCH(1,1) model with Student-\(t\) innovations. The approach, based on the work of Nakatsuma (1998), consists of a Metropolis-Hastings (MH) algorithm where the proposal distributions are constructed from auxiliary ARMA processes on the squared observations. This methodology avoids the time-consuming and difficult task, especially for non-experts, of choosing and tuning a sampling algorithm. The program is written in R with some subroutines implemented in C in order to speed up the simulation procedure. The validity of the algorithm as well as the correctness of the computer code have been verified by the method of Geweke (2004).
A GARCH(1,1) model with Student-\(t\) innovations for the log-returns \(\{ y_t \}\) may be written via data augmentation (see Geweke 1993) as
\[\begin{aligned} \label{eq:model} \begin{split} y_t &= \varepsilon_t \left(\tfrac{\nu - 2}{\nu} \, \varpi_t \, h_t\right)^{1/2} \quad t = 1,\ldots,T\\ \varepsilon_t &\overset{\textit{iid}}{\sim} \mathcal N(0,1)\\ \varpi_t &\overset{\textit{iid}}{\sim} \mathcal{IG} \left( \frac \nu 2, \frac \nu 2 \right)\\ h_t &\doteq \alpha_0 + \alpha_1 y_{t-1}^2 + \beta h_{t-1}\,, \end{split} \end{aligned} \tag{1} \]
where \(\alpha_0 > 0\), \(\alpha_1, \beta \geq 0\) and \(\nu > 2\); \(\mathcal N(0,1)\) denotes the standard normal distribution and \(\mathcal{IG}\) the inverted gamma distribution. The restriction on the degrees of freedom parameter \(\nu\) ensures that the conditional variance is finite and the restrictions on the GARCH parameters \(\alpha_0, \alpha_1\) and \(\beta\) guarantee its positivity. We emphasize the fact that only positivity constraints are implemented in the MH algorithm; no stationarity conditions are imposed in the simulation procedure.
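To fix ideas, the following lines sketch how a sample path could be simulated from model (1) via this data augmentation; the parameter values and the seed are purely illustrative:

> ## illustrative parameter values (hypothetical)
> set.seed(42)
> nobs <- 750                 # T in the notation above
> alpha0 <- 0.03; alpha1 <- 0.2; beta <- 0.7; nu <- 6
> w <- 1 / rgamma(nobs, shape = nu / 2, rate = nu / 2)  # IG(nu/2, nu/2)
> h <- y <- numeric(nobs)
> h[1] <- alpha0 / (1 - alpha1 - beta)  # start at the unconditional variance
> y[1] <- rnorm(1) * sqrt((nu - 2) / nu * w[1] * h[1])
> for (t in 2:nobs) {
+     h[t] <- alpha0 + alpha1 * y[t - 1]^2 + beta * h[t - 1]
+     y[t] <- rnorm(1) * sqrt((nu - 2) / nu * w[t] * h[t])
+ }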
In order to write the likelihood function, we define the vectors \(y \doteq (y_1,\ldots,y_T)'\), \(\varpi \doteq (\varpi_1,\ldots,\varpi_T)'\) and \(\alpha \doteq (\alpha_0, \alpha_1)'\). We regroup the model parameters into the vector \(\psi \doteq (\alpha, \beta, \nu)\). Then, upon defining the \(T \times T\) diagonal matrix \[\Sigma \doteq \Sigma ( \psi, \varpi ) = \text{diag}\left( \{ \varpi_t \tfrac{\nu-2}{\nu} h_t( \alpha, \beta ) \}_{t=1}^{T} \right)\,,\] where \(h_t(\alpha, \beta) \doteq \alpha_0 + \alpha_1 y_{t-1}^2 + \beta h_{t-1}(\alpha, \beta)\), we can express the likelihood of \((\psi, \varpi)\) as
\[\begin{aligned} \label{eq:lik} \mathcal L(\psi, \varpi \,|\, y) \propto ( \det \Sigma )^{-1/2} \exp \left[-\tfrac{1}{2} y' \Sigma^{-1} y \right]\,. \end{aligned} \tag{2} \]
The Bayesian approach considers \((\psi,\varpi)\) as a random variable which is characterized by a prior density denoted by \(p(\psi, \varpi)\). The prior is specified with the help of parameters called hyperparameters which are initially assumed to be known and constant. Moreover, depending on the researcher’s prior information, this density can be more or less informative. Then, by coupling the likelihood function of the model parameters with the prior density, Bayes’ rule yields the posterior density \(p\left(\psi, \varpi \mid y \right)\) as follows:
\[\label{eq:bayes} p\left(\psi, \varpi \mid y \right) = \frac{\mathcal L\left(\psi, \varpi \mid y \right) p\left(\psi, \varpi \right)} {\int \mathcal L\left(\psi, \varpi \mid y \right) p\left(\psi, \varpi \right) d\psi d\varpi}\,. \tag{3} \]
This posterior is a quantitative, probabilistic description of the knowledge about the model parameters after observing the data. For an excellent introduction to Bayesian econometrics we refer the reader to Koop (2003).
We use truncated normal priors on the GARCH parameters \(\alpha\) and \(\beta\) \[\begin{aligned} p\left( \alpha \right) &\propto \phi_{\mathcal N_2}\left( \alpha \mid \mu_\alpha, \Sigma_\alpha \right) \, \text{1} \left\{\alpha \in \mathbb R_+^2 \right\}\\ p\left( \beta \right) &\propto \phi_{\mathcal N_1}\left( \beta \mid \mu_\beta, \Sigma_\beta \right) \, \text{1} \left\{ \beta \in \mathbb R_+ \right\} \,, \end{aligned}\] where \(\mu_\bullet\) and \(\Sigma_\bullet\) are the hyperparameters, \(\text{1} \{\cdot\}\) is the indicator function and \(\phi_{\mathcal N_d}\) is the \(d\)-dimensional normal density.
The prior distribution of vector \(\varpi\) conditional on \(\nu\) is found by noting that the components \(\varpi_t\) are independent and identically distributed from the inverted gamma density, which yields \[\begin{aligned} p\left( \varpi \mid \nu \right) &= \left( \frac{\nu}{2} \right)^{\frac{T \nu} 2} \left[ \Gamma \left( \frac \nu 2 \right) \right]^{-T} \left( \prod_{t=1}^T \varpi_t \right)^{-\frac \nu 2 - 1} \\ &\times \exp \left[ -\frac{1}{2} \sum_{t=1}^T \frac \nu {\varpi_t} \right] \,. \end{aligned}\] We follow Deschamps (2006) in the choice of the prior distribution on the degrees of freedom parameter. The distribution is a translated exponential with parameters \(\lambda > 0\) and \(\delta \geq 2\) \[p\left( \nu \right) = \lambda \exp \left[ -\lambda \left( \nu - \delta \right) \right] \, \text{1} \left\{\nu > \delta\right\}\,.\] For large values of \(\lambda\), the mass of the prior is concentrated in the neighborhood of \(\delta\) and a constraint on the degrees of freedom can be imposed in this manner. Normality of the errors is assumed when \(\delta\) is chosen large. As pointed out by Deschamps (2006), this prior density is useful for two reasons. First, it is potentially important, for numerical reasons, to bound the degrees of freedom parameter away from two to avoid explosion of the conditional variance. Second, we can approximate the normality of the errors while maintaining a reasonably tight prior which can improve the convergence of the sampler.
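Since \(\nu - \delta\) is simply exponential with rate \(\lambda\), draws from this prior are easy to generate; for instance, under the diffuse default values used later by the package:

> lambda <- 0.01; delta <- 2
> nu.prior <- delta + rexp(10000, rate = lambda)
> quantile(nu.prior, probs = c(0.025, 0.5, 0.975))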
The joint prior distribution is then formed by assuming prior independence between the parameters, i.e. \(p(\psi, \varpi) = p(\alpha)p(\beta)p(\varpi\mid\nu)p(\nu)\).
The recursive nature of the GARCH(1,1) variance equation implies that the joint posterior and the full conditional densities cannot be expressed in closed form. There exists no (conjugate) prior that can remedy this property. Therefore, we cannot use the simple Gibbs sampler and need to rely on a more elaborate Markov chain Monte Carlo (MCMC) simulation strategy to approximate the posterior density. The idea of MCMC sampling was first introduced by Metropolis et al. (1953) and was subsequently generalized by Hastings (1970). The sampling strategy relies on the construction of a Markov chain with realizations \((\psi^{[0]}, \varpi^{[0]}), \ldots, (\psi^{[j]}, \varpi^{[j]}), \ldots\) in the parameter space. Under appropriate regularity conditions, asymptotic results guarantee that as \(j\) tends to infinity, \((\psi^{[j]}, \varpi^{[j]})\) tends in distribution to a random variable whose density is (3). Hence, after discarding a certain number of the first draws (the burn-in), the realized values of the chain can be used to make inference about the joint posterior.
The MCMC sampler implemented in the package bayesGARCH is based on the approach of Ardia (2008, chapter 5), inspired from the previous work by Nakatsuma (1998). The algorithm consists of an MH algorithm where the GARCH parameters are updated by blocks (one block for \(\alpha\) and one block for \(\beta\)) while the degrees of freedom parameter is sampled using an optimized rejection technique from a translated exponential source density. This methodology has the advantage of being fully automatic. Moreover, in our experience, the algorithm explores the domain of the joint posterior efficiently compared to naive MH approaches or the Griddy-Gibbs sampler of Ritter and Tanner (1992).
We apply our Bayesian estimation methods to daily observations of the Deutschmark vs British Pound (DEM/GBP) foreign exchange log-returns. The sample period is from January 3, 1985, to December 31, 1991, for a total of \(1\,974\) observations. This data set has been promoted as an informal benchmark for GARCH time series software validation. From this time series, the first 750 observations are used to illustrate the Bayesian approach. The observation window excerpt from our data set is plotted in Figure 1.
We fit the GARCH(1,1) model with Student-\(t\) innovations to the data for this observation window using the bayesGARCH function:
> args(bayesGARCH)
function (y, mu.alpha = c(0, 0),
Sigma.alpha = 1000 * diag(1,2),
mu.beta = 0, Sigma.beta = 1000,
lambda = 0.01, delta = 2,
control = list())
The input arguments of the function are the vector of data, the hyperparameters and the list control, which can supply any of the following elements:

n.chain: number of MCMC chain(s) to be generated; default 1.

l.chain: length of each MCMC chain; default 10000.

start.val: vector of starting values of the chain(s); default c(0.01, 0.1, 0.7, 20). Alternatively, the starting values could be set to the maximum likelihood estimates using the function garchFit available in the package fGarch, for instance.

addPriorConditions: function which allows the user to add any constraint on the model parameters; default NULL, i.e. no additional constraints are imposed.

refresh: frequency of reports; default 10.

digits: number of printed digits in the reports; default 4.
As a prior distribution for the Bayesian estimation we take the default values in bayesGARCH, which are diffuse priors. We generate two chains for \(5\,000\) passes each by setting the control parameter values n.chain = 2 and l.chain = 5000.
> data(dem2gbp)
> y <- dem2gbp[1:750]
> set.seed(1234)
> MCMC <- bayesGARCH(y, control = list(
+     l.chain = 5000, n.chain = 2))
chain:  1  iteration:  10
parameters:  0.0441 0.212 0.656 115
chain:  1  iteration:  20
parameters:  0.0346 0.136 0.747 136
...
chain:  2  iteration:  5000
parameters:  0.0288 0.190 0.754 4.67
The function outputs the MCMC chains as an object of the class "mcmc" from the package coda (Plummer et al. 2010). This package contains functions for post-processing the MCMC output; see Plummer et al. (2006) for an introduction. Note that coda is loaded automatically with bayesGARCH.
A trace plot of the MCMC chains (i.e. a plot of iterations vs. sampled values) can be generated using the function traceplot; the output is displayed in Figure 2.
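For instance, the trace plots can be obtained with:

> traceplot(MCMC)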
Convergence of the sampler (using the diagnostic test of Gelman and Rubin (1992)), acceptance rates and autocorrelations in the chains can be computed as follows:
> gelman.diag(MCMC)

       Point est. 97.5% quantile
alpha0       1.02           1.07
alpha1       1.01           1.05
beta         1.02           1.07
nu           1.02           1.06

Multivariate psrf

1.02
> 1 - rejectionRate(MCMC)

alpha0 alpha1   beta     nu
 0.890  0.890  0.953  1.000
> autocorr.diag(MCMC)

       alpha0 alpha1  beta    nu
Lag 0   1.000  1.000 1.000 1.000
Lag 1   0.914  0.872 0.975 0.984
Lag 5   0.786  0.719 0.901 0.925
Lag 10  0.708  0.644 0.816 0.863
Lag 50  0.304  0.299 0.333 0.558
The convergence diagnostic shows no evidence against convergence for the last \(2\,500\) iterations (only the second half of the chain is used by default in gelman.diag) since the scale reduction factor is smaller than 1.2; see Gelman and Rubin (1992) for details. The MCMC sampling algorithm reaches very high acceptance rates, ranging from 89% for vector \(\alpha\) to 95% for \(\beta\), suggesting that the proposal distributions are close to the full conditionals. The rejection technique used to generate \(\nu\) allows a new value to be drawn at each pass in the MH algorithm.
The one-lag autocorrelations in the chains range from 0.87 for parameter \(\alpha_1\) to 0.98 for parameter \(\nu\). Using the function formSmpl, we discard the first \(2\,500\) draws from the overall MCMC output as a burn-in period, keep only every second draw to diminish the autocorrelation, and merge the two chains to get a final sample length of \(2\,500\).
> smpl <- formSmpl(MCMC, l.bi = 2500,
+                  batch.size = 2)

n.chain    : 2
l.chain    : 5000
l.bi       : 2500
batch.size : 2
smpl size  : 2500
Basic posterior statistics can be easily obtained with the summary method available for mcmc objects.
> summary(smpl)

Iterations = 1:2500
Thinning interval = 1
Number of chains = 1
Sample size per chain = 2500

1. Empirical mean and standard deviation
   for each variable, plus standard error
   of the mean:

         Mean     SD Naive SE Time-series SE
alpha0 0.0345 0.0138 0.000277        0.00173
alpha1 0.2360 0.0647 0.001293        0.00760
beta   0.6832 0.0835 0.001671        0.01156
nu     6.4019 1.5166 0.030333        0.19833

2. Quantiles for each variable:

         2.5%   25%    50%    75%   97.5%
alpha0 0.0126 0.024 0.0328 0.0435  0.0646
alpha1 0.1257 0.189 0.2306 0.2764  0.3826
beta   0.5203 0.624 0.6866 0.7459  0.8343
nu     4.2403 5.297 6.1014 7.2282 10.1204
The marginal distributions of the model parameters can be obtained by first transforming the output into a matrix and then using the function hist. Marginal posterior densities are displayed in Figure 3. We clearly notice the asymmetric shape of the histograms; this is especially true for parameter \(\nu\). This is also reflected by the differences between the posterior means and medians. These results should warn us against the abusive use of asymptotic justifications. In the present case, even 750 observations do not suffice to justify the asymptotic symmetric normal approximation for the parameter estimator’s distribution.
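For instance, the histograms could be produced along the following lines (a minimal sketch; the graphical settings are arbitrary):

> smpl.mat <- as.matrix(smpl)
> par(mfrow = c(2, 2))
> for (name in colnames(smpl.mat))
+     hist(smpl.mat[, name], breaks = 50,
+          main = name, xlab = "")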
Probabilistic statements on nonlinear functions of the model parameters can be straightforwardly obtained by simulation from the joint posterior sample. In particular, we can test the covariance stationarity condition and estimate the density of the unconditional variance when this condition is satisfied. Under the GARCH(1,1) specification, the process is covariance stationary if \(\alpha_1 + \beta < 1\), as shown by Bollerslev (1986, page 310). The term \((\alpha_1 + \beta)\) is the degree of persistence in the autocorrelation of the squares which controls the intensity of the clustering in the variance process. With a value close to one, past shocks and past variances will have a longer impact on the future conditional variance.
To make inference on the persistence of the squared process, we simply use the posterior sample and generate \((\alpha_1^{[j]} + \beta^{[j]})\) for each draw \(\psi^{[j]}\) in the posterior sample. The posterior density of the persistence is plotted in Figure 4. The histogram is left-skewed with a median value of 0.923 and a maximum value of 1.050. In this case, the covariance stationarity of the process is supported by the data. The unconditional variance of the GARCH(1,1) model is \(\alpha_0 / (1 - \alpha_1 - \beta)\) given that \(\alpha_1 + \beta < 1\). Conditionally upon existence, the posterior mean is 0.387 and the 90% credible interval is [0.274,1.378]. The empirical variance is 0.323.
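These quantities can be computed from the joint sample along the following lines (a minimal sketch, reusing the matrix smpl.mat defined above):

> persistence <- smpl.mat[, "alpha1"] + smpl.mat[, "beta"]
> median(persistence)
> ## unconditional variance, conditionally upon existence
> uncvar <- (smpl.mat[, "alpha0"] / (1 - persistence))[persistence < 1]
> mean(uncvar)
> quantile(uncvar, probs = c(0.05, 0.95))  # 90% credible interval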
Other probabilistic statements on interesting functions of the model parameters can be obtained using the joint posterior sample. Under specification (1), the conditional kurtosis is \(3 (\nu - 2) / (\nu - 4)\) provided that \(\nu > 4\). Using the posterior sample, we estimate the posterior probability of existence for the conditional kurtosis to be 0.994. Therefore, the existence is clearly supported by the data. Conditionally upon existence, the posterior mean of the kurtosis is 8.21, the median is 5.84 and the 95% credible interval is [4.12,15.81], indicating heavier tails than for the normal distribution. The positive skewness of the posterior for the conditional kurtosis is caused by a couple of very large values (the maximum simulated value is 404.90). These correspond to draws with \(\nu\) slightly larger than 4. Note that if one desires to rule out large values for the conditional kurtosis beforehand, then one can set \(\delta > 4\) in the prior for \(\nu\). For example, the choice \(\delta = 4.5\) would guarantee that the kurtosis is smaller than 15.
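In the same vein, a sketch of these computations (again using smpl.mat):

> nu.smpl <- smpl.mat[, "nu"]
> mean(nu.smpl > 4)  # posterior probability that the kurtosis exists
> kurt <- 3 * (nu.smpl[nu.smpl > 4] - 2) / (nu.smpl[nu.smpl > 4] - 4)
> mean(kurt); median(kurt)
> quantile(kurt, probs = c(0.025, 0.975))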
The control parameter addPriorConditions can be used to impose any type of constraints on the model parameters \(\psi\) during the estimation. For instance, to ensure the estimation of a covariance stationary GARCH(1,1) model, the function should be defined as
> addPriorConditions <- function(psi)
+ psi[2] + psi[3] < 1
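The function is then supplied to the sampler through the control list, e.g. along the lines of:

> MCMC <- bayesGARCH(y, control = list(
+     l.chain = 5000, n.chain = 2,
+     addPriorConditions = addPriorConditions))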
Finally, we can impose normality of the innovations in a straightforward manner by setting the hyperparameters \(\lambda = 100\) and \(\delta = 500\) in the bayesGARCH function.
The estimation strategy implemented in bayesGARCH is fully automatic and does not require any tuning of the MCMC sampler. This is certainly an appealing feature for practitioners. The generation of the Markov chains is however time-consuming, and estimating the model over several datasets on a daily basis can therefore take a significant amount of time. In this case, the algorithm can be easily parallelized by running a single chain on each of several processors. This can be achieved with the package foreach (REvolution Computing 2009), for instance. Also, when the estimation is repeated over updated time series (i.e. time series with more recent observations), it is wise to start the algorithm using the posterior mean or median of the parameters obtained at the previous estimation step. The impact of the starting values (burn-in phase) is likely to be smaller and thus the convergence faster.
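As an illustration, a run with one chain per worker might look as follows (a minimal sketch; it assumes that a parallel backend, e.g. the package doMC, has been registered for foreach beforehand):

> library(foreach)
> ## one chain per worker; '%dopar%' dispatches the
> ## iterations to the registered parallel backend
> chains <- foreach(i = 1:2) %dopar% {
+     bayesGARCH(y, control = list(
+         l.chain = 5000, n.chain = 1))
+ }

The resulting single-chain outputs can then be combined and post-processed with the tools of coda as before.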
Finally, note that as for any MH algorithm, the sampler can get stuck at a given value, so that the chain does not move anymore. However, the sampler uses tailor-made candidate densities that are especially constructed at each step, so it is almost impossible for this MCMC sampler to get stuck at a given value for many subsequent draws. For example, for our data set we still obtain posterior results that are almost equal to the results that we obtained for the reasonable default initial values c(0.01, 0.1, 0.7, 20), even if we take the very poor initial values c(0.1, 0.01, 0.4, 50). In the unlikely case that such ill behaviour does occur, one could scale the data (to have standard deviation 1), or run the algorithm with different initial values or a different random seed.
This note presented the Bayesian estimation of the GARCH(1,1) model with Student-\(t\) innovations using the R package bayesGARCH. We illustrated the use of the package with an empirical application to foreign exchange rate log-returns.
The authors acknowledge two anonymous reviewers and the associate editor, Martyn Plummer, for helpful comments that have led to improvements of this note. David Ardia is grateful to the Swiss National Science Foundation (under grant #FN PB FR1-121441) for financial support. Any remaining errors or shortcomings are the authors’ responsibility.