Modeling regimes with extremes: the bayesdfa package for identifying and forecasting common trends and anomalies in multivariate time-series data

Eric J. Ward; Sean C. Anderson; Luis A. Damiano; Mary E. Hunsicker; Michael A. Litzow

doi:10.32614/RJ-2019-007

1 Overview

A goal of many multivariate statistical techniques is to reduce dimensionality in observed data to identify shared or latent processes. Factor analysis models represent a general class of models used to relate multiple observations to a lower dimension (factors), while also considering different covariance structures of the observed data. Factors are not directly observed, but represent a hidden, shared process among variables. Though goals of factor analysis are sometimes similar to techniques such as principal component analysis (PCA), factor analysis models explicitly estimate residual error terms, whereas PCA does not (Anderson and Rubin 1956; Jolliffe 1986). These factor models are written as \({ y }_{ i }={ u }_{ i }+\textbf{ Z }{f}_{ i }+{ \varepsilon }_{ i }\), where observed data \({ y }_{ i }\) is a linear combination of an intercept \({ u }_{ i }\) and the product of latent factors \({f}_{ i }\) and loadings \(\textbf{ Z }\) (loadings are sometimes referred to in the literature as \(\textbf{L}\)).

In a time-series setting, factor models may be extended to dynamic factor analysis (DFA) models. DFA models aim to reduce the dimensionality of a collection of time series by estimating a set of shared trends and factors, representing the linear effects of each trend on the observed data (Molenaar 1985; Zuur et al. 2003; Stock and Watson 2005). The number of trends \(m\) is chosen to be less or equal than the number of time series \(n\). The general form of the DFA model can be formulated as a state-space model (Petris 2010). The latent processes (also referred to as ‘trends’) are generally modeled as random walks, so that trend \(i\) is modeled as \({ x }_{ i,t+1 }={ x }_{ i,t }+{ w }_{ i,t }\) where \({ x }_{ i, t }\) is the value of the \(i\)-th latent trend at time \(t\), and the deviations \({ w }_{ i,t }\) are modeled as white noise. Across trends, these deviations are modeled as \(\textbf{w}_{t} \sim \mathrm{MVN}(0, \textbf{Q})\). The latent trends \({ x }_{ i,t }\) are linked to data via a loadings matrix \(\textbf{Z}\) whose values do not evolve through time, \({ y }_{ t }=\textbf{Z}{ x }_{ t } + \textbf{a} + \textbf{B}{ d }_{ t } + \textbf{e}_{t}\). The loadings matrix \(\textbf{Z}\) is dimensioned \(n\: \times\: m\) so that \({Z}_{j,i}\) represents the effect of trend \(i\) on time series \(j\). The parameters \(\textbf{a}\) and \(\textbf{B}\) are optional parameters, representing time-series-specific intercepts and effects of covariates, \({d}_{t}\). Finally, the residual errors are assumed to be \(\textbf{e}_{t} \sim \mathrm{MVN}(0,\textbf{R})\), where \(\textbf{R}\) is an estimated covariance matrix.

Estimation of DFA models is typically done in a maximum likelihood framework, using the expectation-maximization (EM) algorithm or other optimization tools. Implementation of these methods is available in multiple R packages including dlm (Petris 2010), KFAS (Helske 2017), MARSS (Holmes et al. 2012b), and tsfa (Gilbert and Meijer 2005). Challenges in parameter estimation and interpretation for DFA models have been well studied. Without constraints, parameters in the DFA model are not identifiable (Harvey 1990; Zuur et al. 2003). To ensure identifiability of variance parameters, for example, the covariance matrix \(\textbf{Q}\) is generally fixed as an identity matrix (Harvey 1990). To avoid confounding the latent trends and loadings matrix \(\textbf{Z}\), elements of \(\textbf{Z}\) must also be constrained. A common choice of constraints is for the elements in the first \(m-1\) rows of \(\textbf{Z}\) to be set to zero if the column index is greater than the row index, \(j > i\) (Harvey 1990), though other constraints have been proposed (Bai and Wang 2015). For a 3-trend DFA model for instance, these constraints would mean that the \(\textbf{Z}\) matrix parameters would be configured as \[\begin{matrix} \begin{bmatrix} { Z }_{ 1,1 } & 0 & 0 \\ { Z }_{ 2,1 } & { Z }_{ 2,2 } & 0 \\ { Z }_{ 3,1 } & { Z }_{ 3,2 } & { Z }_{ 3,3 } \\ ... & ... & ... \end{bmatrix} \end{matrix}.\]

Several previous approaches to DFA estimation in a maximum likelihood framework also center (subtract the sample means) or standardize (subtract the sample means and divide by the sample standard deviations) data prior to fitting DFA models and set the intercepts \(a\) equal to zero to avoid potential confounding of level parameters (Holmes et al. 2012a). We adopt a similar approach, allowing users to either center or standardize data before estimation, and not including the intercepts as estimated parameters.

1.1 Label switching

We developed our DFA model in a Bayesian framework, using Stan and the package rstan (Stan Development Team 2016), which implements Markov chain Monte Carlo (MCMC) using the No-U Turn Sampling (NUTS) algorithm (Hoffman and Gelman 2014; Carpenter et al. 2017). Although estimation of the DFA model in a Bayesian setting is not new (Aguilar and West 2000; Koop and Korobilis 2010; Stock and Watson 2011), it presents several interesting challenges over the EM algorithm. In addition to the constraints on \(\textbf{Q}\) and \(\textbf{Z}\), Bayesian estimation suffers from a problem of label switching. In particular, elements of \(\textbf{F}\) or \(\textbf{Z}\) may flip sign within an MCMC chain, or multiple chains may converge on parameters that are identical in magnitude but with different signs.

To minimize issues with label switching, previous work on Bayesian factor analysis has proposed additional constraints on the loadings matrix, including setting the elements of \(\textbf{Z}\) to be constrained (-1, 1), or adding a positive constraint to the diagonal, \({ Z }_{ ii }>0\) (Geweke and Zhou 1996; Aguilar and West 2000). Though these constraints generally help, there may be situations where MCMC chains still do not converge. To address this issue, we adopt the parameter-expanded priors for the loadings and trends proposed by Ghosh and Dunson (2009). To ensure that the sign of the estimated quantities is the same across MCMC chains, we created the function flip_trends() to flip the posterior samples of MCMC chains relative to the first chain as needed.

2 The Bayesian dynamic factor model with extremes

There are several approaches for modeling extreme deviations in time series models. Techniques include modeling deviations as a two-component mixture (Ward et al. 2007; Evin et al. 2011), or modeling deviations with non-Gaussian distributions including the Student-t distribution (Praetz 1972; Anderson et al. 2017; Anderson and Ward 2018). There are several existing packages to include Student-t distributions; these include heavy for applications to regression and mixed effects models (Osorio and F. 2018), bsts for univariate time series models (Scott 2018), and stochvol for stochastic volatility models (Kastner 2016). Because switching from a Gaussian to Student-t distribution only introduces a single parameter, \(\nu\), the degrees of freedom, we extend the latter approach to a multivariate setting to model extreme events in the latent trends, so that deviations in the trends are modeled as \({ w }_{ t } \sim \mathrm{MVT}(\nu, 0, \textbf{Q})\). As before, \(\textbf{Q}\) is fixed as an identity matrix \(\textbf{I}\). Our parameterization constrains DFA models to have the same degrees of freedom \(\nu\) in the residuals of the multiple trends, which may be fixed a priori or treated as a free parameter with a gamma(shape = 2, rate = 0.1)[2,\(\infty\)] prior (Juárez and Steel 2010).

2.1 Including autoregressive and moving average components

The trends of the dynamic factor model are most commonly modeled as non-stationary random walks, \({ x }_{ i,t+1 }={ x }_{ i,t }+{ w }_{ i,t }\), where the \({ w }_{ i,t } \sim N(0,1)\) are Gaussian white noise. Like with other vector autoregressive time series models, this framework can be easily extended to include optional autoregressive (AR) or moving average (MA) components (Chow et al. 2011). We allow for AR(1) and MA(1) processes to be specified with boolean arguments to the fit_dfa() function. For both the AR(1) and MA(1) components, we assume separate parameters for each trend. Including the AR(1) component \(\phi_{i}\) makes the trend process become \({ x }_{ i,t+1 }=\phi_{i}{ x }_{ i,t }+{ w }_{ i,t }\), where values of \(\phi_{i}\) close to 1 make the trend behave as a random walk, and small values of \(\phi_{i}\) close to 0 make the trend behave as white noise. Similarly, we model the MA(1) component as an AR(1) process on the error terms \({ w }_{ i,t }\). Instead of being independent at each time step, \({ \theta }_{ i }\) controls the degree of autocorrelation among deviations, \({ w }_{ i,t }\sim \mathrm{N}\left( { { \theta }_{ i }w }_{ i,t-1 }, 1 \right)\). For stationarity and invertability, we constrain \(\left| \phi _{ i } \right|\) < 1 and \(\left| \theta _{ i } \right|\) < 1.

2.2 Rotation of trends and loadings

Like factor analysis models, there are many solutions from a DFA model capable of producing the same fit to the data. Following previous authors, we use a varimax rotation of the loadings matrix \(\textbf{Z}\) to transform the posterior loadings and trends (Kaiser 1958; Harvey 1990; Holmes et al. 2012a). If \(\widehat { \textbf{Z} }\) is the posterior mean of the loadings matrix from a DFA model of 4 time series and 2 trends for example, the rotation matrix \(\textbf{ W }^{ * }=\mathrm{varimax}(\widehat{ \textbf{Z} } )\) is dimensioned \(2\: \times\: 2\). The rotated loadings matrix can then be calculated as \(\widehat{ \textbf{Z} }^{ * } = \widehat{\textbf{Z}}\textbf{ W }^{ * }\) and rotated trends calculated as \(\widehat{ \textbf{x} }^{ * }=\textbf{ W }^{ *-1 }\widehat{ \textbf{x} }\), where \(\widehat{\textbf{x}}\) is the posterior mean of the trends.

2.3 Identifying data support for the number of trends

Since the number of trends in a DFA model is not a parameter, comparing data support across models is often necessary. Using model selection tools to identify data support is available via Akaike’s Information Criterion (AIC) in packages implementing maximum likelihood for estimation of state-space models (Petris 2010; Holmes et al. 2012b). In addition to comparing the relative support of different number of trends, model selection for Bayesian dynamic factor models may be useful for evaluating the error structure for the residual error covariance matrix \(\textbf{R}\), whether covariates should be included, whether latent trends are better modeled with a distribution allowing for extremes (MVT versus MVN), and whether the latent trends support estimation of AR or MA components. For our Bayesian DFA models, we extend the loo package (Vehtari et al. 2016a,b) to generate estimates of LOOIC (Leave-One-Out Information Criterion) for fitted models. To ease the selection process, bayesdfa includes the function find_dfa_trends() to run multiple models specified by the user. It returns a table of LOOIC values (denoting which of those failed convergence criteria) and the model with the lowest LOOIC value.

2.4 Anomalies or black-swan events

As a diagnostic tool, we include the function find_swans() to fitted DFA models. We adopt the same approach and terminology for ‘black-swan events’ as in Anderson et al. (2017), where black-swan events are rare and unexpected extremes. Our find_swans() function first-differences the posterior mean estimates of each DFA trend and evaluates the probability of observing a difference that is more extreme than expected under a normal distribution with the same scale parameter. Events beyond a user-defined threshold (e.g. 1 in 100, or 1 in 10,000) are then classified as outliers and plotted.

2.5 Simulation tests

To evaluate the ability of the Bayesian DFA model to identify anomalies in latent processes, we created simulated data using our sim_dfa() function. We generated simulated multivariate time series (\(n = 4\) time series with \(T = 20\) time steps each) with \(m = 2\) underlying latent trends. Extremes were included as a step-change in the midpoint of the first trend in each simulated dataset. We varied the value of the step from -4 to -8, which represent unlikely events under the assumption that temporal deviations in the latent trends are distributed according to \(N(0,1)\). Because increased observation error may corrupt inference about anomalies in the trends, we considered three levels of observation error \((\sigma = 0.25, 0.75, 1.25)\). We generated 200 simulated samples for each permutation of parameters, resulting in a total of 3000 datasets.

We fit the Bayesian DFA model with Student-t errors to each simulated dataset. As expected, the posterior estimates from these simulations illustrate that the ability to estimate low degrees of freedom is related to the magnitude of extremes (Figure 1). Similarly, higher observation error corrupts the ability to estimate extreme events, even when they are large in magnitude (Figure 1).

graphic without alt text — Figure 1: Results for simulated data illustrating support for the Student-t distribution (low values of nu), varying the magnitude of extremes (standard deviations from the mean) and magnitude of observation error.

3 Using HMMs to classify regimes in latent DFA trends

An alternative approach to DFA for dimension reduction of multivariate time series data are Hidden Markov Models (HMMs). Like DFA models, they model a latent process for a time series (or collection of multivariate time series). Instead of the latent process being modeled continuously (e.g. as a random walk in DFA), HMMs conceive the latent process as a series of discrete-time, discrete-state first-order Markov chains \(s_t \in \left\{ 1, \dots, G \right\}\) with the number of possible states \(G\) specified a priori. State transition is characterized by the \(G\: \times\: G\) transition matrix with simplex rows \(\mathbf{A} = \left\{ a_{ig} \right\}\) where \(a_{ig} = p({ s }_{ t } = g | { s }_{ t-1 } = i)\) represents the probability of transitioning from state \(i\) to \(g\). Useful quantities from HMMs include the transition probabilities between latent states, and the probability of being in a given lantent state at each point in time (Zucchini et al. 2017).

HMMs can be applied to raw multivariate data to identify latent states; however, they may also be linked with DFA to identify regimes and transitions in the latent DFA trends. Similar to DFA, applications of HMMs are widely available in R, including via the packages depmixS4 (Visser and Speekenbrink 2010), HMM (Himmelmann 2010), and msm (Jackson 2011). Consistent with our implementation of the Bayesian DFA model, we include fully Bayesian inference in Stan based on (Damiano et al. 2018). We apply independent HMM models to each DFA trend to identify alternate states or regimes. Like with the estimation of DFA models, we use the LOOIC metric to evaluate the relative support for HMMs with different numbers of underlying states, selecting the converged model with the lowest LOOIC. By default, we assume the observation model of the input time series to be normally distributed with the scale parameter equal to the estimated residual variance. However, for some applications, such as datasets with changing sampling frequencies over time, uncertainty in DFA trends may also vary through time. To propogate this uncertainty forward, we also allow the residual variance to be entered as a known quantity for every data point in our find_regimes() function.

4 Example application: identifying common patterns in sea surface temperatures in the Northeast Pacific Ocean

To illustrate an example application of the bayesdfa package to real data, we use monthly anomalies of sea surface temperature (SST, measured in \(C^\circ\)). SST is observed from satellite and buoy data at fixed locations, and model-based interpolations are used to generate estimates at additional gridded locations¹. We used estimates generated at the locations of 4 observing stations used by the Pacific Fisheries Environmental Laboratory² from the west coast of North America (USA). The four stations have some degree of correlation with one another, and are separated by approximately 6 degrees of latitude from one another. In summary, we work with \(n = 4\) monthly time series with \(T = 167\) observations each (from 2003–01 to 2016–05) and no missing values.

Initially, we fit a DFA model with 2 hidden trends, and will assume the 4 time series to have the same error variances \(\textbf{R}\). We will fit the DFA model with possible extremes, modeling process error with a Student-t distribution by using the argument estimate_nu(). To evaluate whether these data support an extreme DFA with trends modeled as a t-distribution, we will fit two competing forms: one modeling the random walks with a Gaussian distribution, and the other using a Student-t distribution. Generating posterior samples for each model takes approximately 7 minutes per chain, when MCMC chains aren’t run in parallel.

After fitting the models, we confirm whether the MCMC chains are consistent with convergence using a threshold value of \(\hat { R }=1.05\) (Gelman et al. 2014) using our is_converged() function. We also visually inspect chain traceplots (e.g. Figure 3) and check the minimum effective sample size across parameters: NaN.

As a consistency diagnostic, we also retrieve the estimated degrees of freedom from the Student-t model \(\nu\). By visual inspection, Figure 4 shows that the posterior distribution on \(\nu\) is lower than the prior distribution.

4.1 Visualizing the trends and loadings

We will focus the remaining portion of our analysis on the results from the DFA model with Student-t deviations. In Figure 5, we observe that Trend 1 and Trend 2 both support SST anomalies increasing over the latter half of the time series. Both trends appear to have reversed direction (reverting to the mean in the last 2–3 years) and this pattern is more evident in Trend 1. Because we do not model seasonality explicitly, for example by including a covariate effect for the month, each of the estimated trends also includes the within-year variability that describes seasonal patterns in observed sea surface temperature.

In the violin plot of Figure 6, we note that more southern stations (24 and 30N) contribute largely to Trend 1, while the more northern stations appear to load more heavily on Trend 2.

4.2 Identifying regimes in the latent DFA trends with Hidden Markov Models

For each trend, we apply independent HMMs to examine the support for differing numbers of underlying regimes. Both the posterior mean and standard deviation (optional argument) will be the inputs to the HMM.

Using LOOIC as a metric of support for the number of regimes, the estimates reported in Table 1 support the inclusion of 2 regimes for both Trends 1 and 2.

Table 1: LOOIC estimates across different numbers of regimes for each latent DFA trend. LOOIC is calculated using the `loo::loo()` function.
Regimes	LOOIC Trend 1	LOOIC Trend 2
1	855.5	756.7
2	31.0	30.3
3	69.1	99.9
4	139.7	164.6

Our fit_regimes() function computes the probability of each time point being in one of the regime states, which may also be visualized using plot_regime_model(). For example, the output of the 2-regime model for Trend 1 in Figure 7 suggests a change in the middle of the time series, then changing back again to State 1. Similarly, by the end of the series, the HMM assigns Trend 1 to being in State 1.

5 Extensions

There are a number of extensions to our implementation of the Bayesian DFA model with extremes that could make the model more applicable to a wider range of problems. Examples for the process model include adopting a skew-t distribution for asymmetric extremes. For models estimating multiple trends, multiple parameters may be treated hierarchically (e.g. covariate effects, variance parameters). For the observation or data model, our implementation of the Bayesian DFA model only includes data arising from a Gaussian or Student-t distribution, though this could be extended to include discrete or other continuous densities. Finally, spatial dynamic factor models (sDFA) have emerged as a useful tool for complicated multivariate spatial datasets (Lopes et al. 2011; Thorson et al. 2015), and could be similarly implemented in Stan.

6 Conclusion

This paper presents the bayesdfa package for applying Bayesian DFA to multivariate time series as a dimension reduction tool, particularly if extreme events may be present in observed data. In addition to allowing for the inclusion of covariates, we also extend the conventional dynamic factor model to include optionial moving average and autoregressive components in the latent trends. Applying this package to a dataset of sea surface temperature from the Northeast Pacific Ocean, we fit DFA models with Gaussian and Student-t errors. Though the model with Student-t errors has slightly lower LOOIC, the results from the two models are similar. Output from these 2-trend DFA models of sea surface temperature are useful in demonstrating a north-to-south gradient in temperature anomalies (Figure 6). Standardized temperature data from southern stations experience more interannual variability and temperatures that are greater in magnitude compared to northern stations (Figure 5). We also illustrate how latent trends from DFA models can be analyzed in a HMM framework to identify regimes and transitions; applied to the sea surface temperature data, both Trend 1 and Trend 2 support 2-regime models (roughly interpreted as ‘warm’ and ‘cool’ regimes; Figure 7).

6.1 Acknowledgements

This work was funded by NOAA’s Fisheries and the Environment (FATE) Program. Development of this package benefitted from discussions with other members of our working group (Jin Gao, Chris Harvey, Sam McClatchie, Stepahni Zador) and scientists at the Northwest Fisheries Science Center (including Mark Scheuerell, James Thorson, Eli Holmes, and Kelly Andrews). 2 anonymous reviewers helped improve the clarity and plots of this paper.