This paper introduces the R package slm, which stands for Stationary Linear Models. The package contains a set of statistical procedures for linear regression in the general context where the error process is strictly stationary with short memory. We work in the setting of (Hannan 1973), who proved the asymptotic normality of the (normalized) least squares estimators (LSE) under very mild conditions on the error process. We propose different ways to estimate the asymptotic covariance matrix of the LSE, and then to correct the type I error rates of the usual tests on the parameters (as well as the confidence intervals). The procedures are evaluated through several sets of simulations.
We consider the usual linear regression model $Y = X\beta + \varepsilon$. In this paper, we propose to modify the standard statistical procedures (tests, confidence intervals, …) of the linear model in the more general context where the error process $\varepsilon$ is strictly stationary with short memory.
Let us emphasize that Hannan's conditions on the error process are very mild and are satisfied by most short-memory processes (see the discussion of the generative processes in the simulation section below).
Hence, the tools presented in this paper can be seen from two different points of view:

- as appropriate tools for time series regression with a short-memory error process;
- as a way to robustify the usual statistical procedures when the residuals are correlated.
Let us now describe the organization of the paper. In the next section, we recall the mathematical background, the consistent estimator of the asymptotic covariance matrix introduced in (Caron 2019), and the modified statistical procedures implemented in slm, whose output parallels the usual summary output of lm. An extended version of this paper is available as an arXiv preprint (see (Caron et al. 2019)).
We start this section with a short presentation of linear regression with stationary errors; more details can be found for instance in (Caron 2019). Let $Y = X\beta + \varepsilon$, where $Y$ is the $n$-dimensional vector of observations, $X$ is a (possibly random) $n \times p$ design matrix, $\beta$ is the unknown $p$-dimensional parameter, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^{t}$ is a strictly stationary, short-memory error process. Let $\hat{\beta} = (X^{t}X)^{-1}X^{t}Y$ denote the usual least squares estimator of $\beta$.
(Hannan 1973) has shown a Central Limit Theorem for the normalized least squares estimator, under very mild conditions on the design and on the error process.
Let us illustrate this plug-in approach with a kernel estimator which has been proposed in (Caron 2019): for some well-chosen kernel and bandwidth, the empirical autocovariances of the residuals are tapered and then plugged into the asymptotic covariance matrix of the LSE.
More generally, in this paper, we say that an estimator of the asymptotic covariance matrix is consistent if it converges in probability to the true asymptotic covariance matrix.
To conclude this section, let us make some additional remarks. The interest of Caron's recent paper is that the consistency of the estimator of the asymptotic covariance matrix is established under Hannan's very mild conditions on the error process.
We now present tests and confidence regions based on an arbitrary consistent estimator of the asymptotic covariance matrix. We introduce univariate statistics of Student type: each normalized estimated coefficient is divided by its estimated standard deviation, yielding a statistic which is asymptotically standard normal.
A chi-squared statistic is also available for testing several constraints simultaneously. The test can be used to simplify a linear model by testing that several linear combinations of the parameters are zero, as with the usual Fisher tests.
Using the slm package is very intuitive because the arguments and the outputs of slm are similar to those of the standard functions lm, glm, etc. The output of the main function slm is an object of class "slm", a specific class that has been defined for linear regression with stationary processes. The "slm" class has methods plot, summary, confint, and predict; see the extended version (Caron et al. 2019) for more details. Moreover, the class "slm" inherits from the "lm" class and thus provides the output of the classical lm function.
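As a quick illustration, here is a minimal sketch of these methods applied to a fitted object regslm (as built in the next section), assuming the standard arguments of the confint generic:

R> plot(regslm)                   # diagnostic plots for the fitted slm object
R> summary(regslm)                # coefficient table with corrected standard errors
R> confint(regslm, level = 0.95)  # confidence intervals for the coefficients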
The statistical tools available in slm strongly depend on the choice of the plug-in estimator of the asymptotic covariance matrix. All the estimators proposed in slm are residual-based estimators, but they rely on different approaches. In this section, we present the main functionality of slm together with the different covariance plug-in estimators.
To illustrate the package, we simulate synthetic data according to a linear model with two regressors and a stationary, short-memory error process. The functions generative_model and generative_process can be used, respectively, to simulate observations according to this regression design and according to this specific stationary process.
R> library(slm)
R> set.seed(42)
R> n = 500
R> eps = generative_process(n, "Nonmixing")  # stationary, non-mixing error process
R> design = generative_model(n, "mod2")      # design with two regressors X1 and X2
R> design_sim = cbind(rep(1, n), as.matrix(design))
R> beta_vec = c(2, 0.001, 0.5)
R> Y = design_sim %*% beta_vec + eps
A large class of stationary processes with continuous spectral density can be well approximated by AR processes; see for instance Corollary 4.4.2 in (Brockwell and Davis 1991). Since the covariance structure of an AR process has a closed form, it is easy to derive an approximation of the covariance matrix of the error process, and this is the default method of slm. The method proceeds in four main steps:

1. Fit an autoregressive process on the residual process;
2. Compute the theoretical covariances of the fitted AR process;
3. Plug the covariances into a Toeplitz matrix;
4. Compute the resulting plug-in estimator of the asymptotic covariance matrix of the LSE.
The slm function fits a linear regression of the vector Y on the design, and then fits an AR process on the residual process using the ar function from the stats package. The output of the slm function is an object of class "slm". The order of the AR process is set through the argument model_selec:
R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "fitAR",
+ model_selec = 3)
The estimated covariance is recorded as a vector in the attribute cov_st of regslm, which is an object of class "slm". The estimated covariance matrix can be computed by taking the Toeplitz matrix of cov_st, using the toeplitz function.
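For instance, here is a minimal sketch of this reconstruction (cov_st is accessed as a slot of the S4 object, in the same way as model_selec below):

R> Gamma_hat = toeplitz(regslm@cov_st)  # symmetric Toeplitz matrix of the estimated autocovariances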
The order of the AR process can be chosen by hand by setting model_selec = p, or automatically with the AIC criterion by setting model_selec = -1.
R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "fitAR",
+ model_selec = -1)
The order of the fitted AR process is recorded in the model_selec
attribute of regslm
:
R> regslm@model_selec
[1] 2
Here, the AIC criterion suggests fitting an AR(2) process on the residuals.
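As a rough cross-check of this selection, one can fit an ordinary lm and apply the ar function with AIC directly to its residuals; this mimics, without reproducing exactly, what the fitAR method does internally:

R> fitlm = lm(Y ~ X1 + X2, data = design)
R> ar(residuals(fitlm), aic = TRUE, order.max = 20)$order  # AIC-selected AR order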
The second method for estimating the covariance matrix is the kernel method: the residual-based empirical autocovariances are tapered with a kernel before being plugged into the covariance matrix. Three kernels are available in the package; the definitions below are the standard ones associated with these kernel names:

kernel_fonc = | kernel definition
---|---
rectangular | $K(x) = \mathbf{1}_{\{\lvert x \rvert \le 1\}}$
triangle | $K(x) = (1 - \lvert x \rvert)\,\mathbf{1}_{\{\lvert x \rvert \le 1\}}$
trapeze | piecewise linear, equal to 1 in a neighborhood of 0 and decreasing linearly to 0 at $\lvert x \rvert = 1$
It is also possible for the user to define his own kernel and use it in the argument kernel_fonc of the slm function. Below we use the triangle kernel, which ensures that the covariance matrix is positive definite. The bandwidth of the kernel (the number of covariance terms kept in the estimation) is set through the argument model_selec of the slm function:
R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "kernel",
+ model_selec = 5, kernel_fonc = triangle, plot = TRUE)
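A user-defined kernel can be sketched as follows, assuming (as for the built-in triangle kernel) that kernel_fonc expects a vectorized function supported on [-1, 1]:

R> my_triangle = function(x) (1 - abs(x)) * (abs(x) <= 1)  # same shape as the built-in triangle
R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "kernel",
+ model_selec = 5, kernel_fonc = my_triangle)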
The plot output by the slm
function is given in
Figure 1.
The order parameter can be chosen by hand as before, or automatically by setting model_selec = -1. The automatic order selection is based on the bootstrap procedure proposed by (Wu and Pourahmadi 2009) for banded covariance matrix estimation. The block_size argument sets the size of the bootstrap blocks, and the block_n argument sets the number of blocks. The final order is chosen by taking the order which has the minimal estimated risk. Figure 2 gives the plots of the estimated risk for the estimation of the covariance matrix.
R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st ="kernel",
+ model_selec = -1, kernel_fonc = triangle, model_max = 30,
+ block_size = 100, block_n = 100, plot = TRUE)
Figure 2: Plots output by the slm function for the kernel method with bootstrap selection of the order.
The selected order is recorded in the model_selec
attribute of the
slm
object output by the slm
function:
R> regslm@model_selec
[1] 10
An alternative method for choosing the bandwidth in the case of the rectangular kernel has been proposed in (Efromovich 1998). For a large class of stationary processes with exponentially decaying autocovariance function (mainly the ARMA processes), Efromovich proved that the rectangular kernel is asymptotically minimax, and he proposed a data-driven choice of the bandwidth, available in slm as follows:
R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "efromovich",
+ model_selec = -1)
Another method for choosing the bandwidth has been proposed by (Andrews 1991) and implemented in the package sandwich by Zeileis, Lumley, Berger and Graham (see the paper by (Zeileis 2004)). For the slm package, the automatic choice of the bandwidth proposed by Andrews can be obtained as follows:
R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "hac")
The procedure is based on the function kernHAC in the sandwich package. This function computes the covariance matrix estimator directly; it is stored in the Cov_ST attribute of the object returned by the slm function. Here, the quadratic spectral kernel is used.
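As an indication, a minimal sketch of the underlying sandwich call is given below; the exact arguments passed by slm are an assumption here:

R> library(sandwich)
R> fitlm = lm(Y ~ X1 + X2, data = design)
R> V_hac = kernHAC(fitlm, kernel = "Quadratic Spectral")  # Andrews' bandwidth by default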
Depending on the method used, the estimated covariance matrix may not be positive definite; in such a case, a positive definite projection is applied. This issue does not arise with the fitAR and spectralproj methods.
The projection method relies on the ideas of (Comte 2001), where an adaptive nonparametric method has been proposed for estimating the spectral density of a stationary Gaussian process. We use the residual process as a proxy for the error process, and we compute the projection coefficients from the residual-based empirical covariance coefficients; the spectral density is estimated by projection onto a histogram basis. This estimator is available in the slm function by setting method_cov_st = "spectralproj":
R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "spectralproj",
+ model_selec = 10, plot = TRUE)
The graph of the estimated spectral density can be plotted by setting
plot = TRUE
in the slm
function, see Figure 3.
The Gaussian model selection method proposed in (Comte 2001) follows the ideas of Birgé and Massart; see for instance (Massart 2007). It consists of minimizing a penalized least-squares criterion, the dimension of the histogram basis being selected automatically:
R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "spectralproj",
+ model_selec = -1, model_max = 50, plot = TRUE)
The selected dimension is recorded in the model_selec
attribute of the
slm
object output by the slm
function:
R> regslm@model_selec
[1] 8
The slope heuristic algorithm here selects a histogram on a regular partition of size 8.
This method is a fully manual method for estimating the covariance matrix: the user selects which autocovariance coefficients are kept in the estimation, all the others being set to zero. The next instruction selects the coefficients at lags 1, 2, and 4:
R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "select",
+ model_selec = c(1,2,4))
The positive lags of the selected covariances are recorded in the model_selec argument. Let us notice that the variance (the covariance at lag 0) is always selected.
As for the kernel method, the resulting covariance matrix may not be positive definite. If that is the case, the positive definite projection method described before is used.
This last method is a direct plug-in method: the user proposes his own vector estimate of the autocovariances, which is passed through the argument cov_st.
R> v = rep(0,n)
R> v[1:10] = acf(eps, type = "covariance", lag.max = 9)$acf
R> regslm = slm(Y ~ X1 + X2, data = design, cov_st = v)
The user can also propose his own covariance matrix, through the argument Cov_ST:
R> v = rep(0,n)
R> v[1:10] = acf(eps, type = "covariance", lag.max = 9)$acf
R> V = toeplitz(v)
R> regslm = slm(Y ~ X1 + X2, data = design, Cov_ST = V)
Let us notice that the user must verify that the resulting covariance matrix is positive definite. The positive definite projection algorithm is not used with this method.
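A simple way to perform this check (a sketch, not part of the package) is to inspect the smallest eigenvalue of the proposed matrix:

R> min(eigen(V, symmetric = TRUE, only.values = TRUE)$values) > 0  # TRUE if positive definite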
This section summarizes an extensive study which has been carried out to compare the performances of the different approaches presented before in the context of a linear model with short range dependent stationary errors.
We first present the five generative models for the errors that we consider in the paper. We choose different kinds of processes to reflect the diversity of short-memory processes.
AR1 process. The AR1 process is a Gaussian AR(1) process, defined by the recursion $\varepsilon_i = a\,\varepsilon_{i-1} + W_i$, where $\lvert a \rvert < 1$ and $(W_i)$ is an i.i.d. Gaussian sequence (an illustrative simulation is sketched after this list).

AR12 process. The AR12 process is a seasonal AR(12) process, whose autoregressive part acts at lag 12 and thus produces a strong seasonal correlation.

MA12 process. The MA12 is also a seasonal process, defined as a moving average with a nonzero coefficient at lag 12.

Nonmixing process. The three processes described above are basic ARMA processes, whose innovations have absolutely continuous distributions; in particular, they are strongly mixing in the sense of (Rosenblatt 1956), with a geometric decay of the mixing coefficients (in fact, the MA12 process is even 12-dependent). By contrast, the Nonmixing process is a stationary process which is not strongly mixing, because its innovations do not have an absolutely continuous distribution.

Sysdyn process. The four processes described above have the property of "geometric decay of correlations", which means that the autocorrelation function of the process decreases at an exponential rate. The Sysdyn process is derived from a dynamical system and may exhibit a slower, polynomial decay of correlations. It follows from (Liverani et al. 1999) that the underlying map has a unique absolutely continuous invariant probability measure; starting from this measure, the process is strictly stationary.
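For intuition, ARMA-type processes of this kind can be simulated with the standard arima.sim function; the coefficients below are illustrative only and are not assumed to match those used by generative_process:

R> eps_ar1 = arima.sim(model = list(ar = 0.7), n = 500)                  # a Gaussian AR(1)
R> eps_ma12 = arima.sim(model = list(ma = c(rep(0, 11), 0.5)), n = 500)  # an MA(12) with a lag-12 coefficient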
The linear regression models simulated in the experiments all have the same form as the model of the previous section, the errors being simulated with the generative_process function. The design can be simulated with the generative_model function.
It is, of course, of first importance to provide hypothesis tests with correct significance levels, or at least with correct asymptotic significance levels, which is possible if the estimator of the asymptotic covariance matrix is consistent. In practice, however, each method requires the calibration of one or several parameters: order selection for the fitAR method, bandwidth choice for the kernel method, dimension selection for the spectralproj method.
It is a tricky issue to design a data-driven procedure for choosing test
parameters in order to have a correct Type I Error. Note that unlike
with supervised problems and density estimation, it is not possible to
calibrate hypothesis tests in practice using cross-validation
approaches. We thus propose to calibrate the tests using well-founded statistical procedures for risk minimization: the AIC criterion for the fitAR method, bootstrap procedures for the kernel method, and slope heuristics for the spectralproj method. These procedures are
implemented in the slm
function with the model_selec = -1
argument,
as detailed in the previous section.
Let us first illustrate the calibration problem with the AR12 process. When the fitAR method of the slm function is used with a small order, the fitted AR process cannot capture the lag-12 correlation, and the level of the test remains incorrect; a sufficiently large order is needed.
We first study the case of non-seasonal error processes. We simulate linear models with the AR1, NonMixing, and Sysdyn processes as errors, for several sample sizes, and we estimate the level of the tests with each method, using the automatic calibration (model_selec = -1). The simulations are repeated a large number of times; the estimated levels are reported in the two tables below.
| n | Process | Fisher test | fitAR | spectralproj |
|---|---|---|---|---|
| 200 | AR1 process | 0.465 | 0.097 | 0.14 |
| | NonMixing | 0.298 | 0.082 | 0.103 |
| | Sysdyn process | 0.385 | 0.105 | 0.118 |
| 1000 | AR1 process | 0.418 | 0.043 | 0.049 |
| | NonMixing | 0.298 | 0.046 | 0.05 |
| | Sysdyn process | 0.393 | 0.073 | 0.077 |
| 2000 | AR1 process | 0.454 | 0.071 | 0.078 |
| | NonMixing | 0.313 | 0.051 | 0.053 |
| | Sysdyn process | 0.355 | 0.063 | 0.064 |
| 5000 | AR1 process | 0.439 | 0.044 | 0.047 |
| | NonMixing | 0.315 | 0.053 | 0.056 |
| | Sysdyn process | 0.381 | 0.058 | 0.061 |
| n | Process | efromovich | kernel | hac |
|---|---|---|---|---|
| 200 | AR1 process | 0.135 | 0.149 | 0.108 |
| | NonMixing | 0.096 | 0.125 | 0.064 |
| | Sysdyn process | 0.124 | 0.162 | 0.12 |
| 1000 | AR1 process | 0.049 | 0.086 | 0.049 |
| | NonMixing | 0.053 | 0.076 | 0.038 |
| | Sysdyn process | 0.079 | 0.074 | 0.078 |
| 2000 | AR1 process | 0.075 | 0.067 | 0.071 |
| | NonMixing | 0.057 | 0.067 | 0.047 |
| | Sysdyn process | 0.066 | 0.069 | 0.073 |
| 5000 | AR1 process | 0.047 | 0.047 | 0.044 |
| | NonMixing | 0.059 | 0.068 | 0.05 |
| | Sysdyn process | 0.057 | 0.064 | 0.071 |
For these simulations, the fitAR and hac methods show better performances than the others. The kernel method is slightly less effective. With this method, we must choose the size of the bootstrap blocks as well as the number of blocks, and the test results are really sensitive to these parameters. In these simulations, we have chosen block_size = 100 and block_n = 100, as in the code shown above.
Let us notice that for all methods and for all sample sizes, the estimated level is much better than if no correction is made (usual Fisher tests).
We now study the case of linear regression with seasonal errors. The experiment is exactly the same as before, except that we simulate AR12 or MA12 processes. The results of these experiments are summarized in Table 3.
| n | Process | Fisher test | fitAR | spectralproj |
|---|---|---|---|---|
| 200 | AR12 process | 0.436 | 0.178 | 0.203 |
| | MA12 process | 0.228 | 0.113 | 0.113 |
| 1000 | AR12 process | 0.468 | 0.068 | 0.183 |
| | MA12 process | 0.209 | 0.064 | 0.066 |
| 2000 | AR12 process | 0.507 | 0.071 | 0.196 |
| | MA12 process | 0.237 | 0.064 | 0.064 |
| 5000 | AR12 process | 0.47 | 0.062 | 0.183 |
| | MA12 process | 0.242 | 0.044 | 0.048 |
| n | Process | efromovich | kernel | hac |
|---|---|---|---|---|
| 200 | AR12 process | 0.223 | 0.234 | 0.169 |
| | MA12 process | 0.116 | 0.15 | 0.222 |
| 1000 | AR12 process | 0.181 | 0.124 | 0.179 |
| | MA12 process | 0.069 | 0.063 | 0.18 |
| 2000 | AR12 process | 0.153 | 0.104 | 0.192 |
| | MA12 process | 0.058 | 0.068 | 0.173 |
| 5000 | AR12 process | 0.1 | 0.091 | 0.171 |
| | MA12 process | 0.043 | 0.057 | 0.147 |
We directly see that the case of seasonal processes is more difficult than that of the non-seasonal processes, especially for the AR12 process. For small sample sizes, the estimated level is much too high for all methods. The fitAR method is the best method here for the AR12 process: it is the only one whose level approaches the nominal level as the sample size grows, whereas the efromovich and kernel methods only reach a level close to 0.1 for n = 5000. The spectralproj and hac methods do not seem to work well for the AR12 process, although they remain much better than the usual Fisher tests (whose level stays around 0.45 to 0.5 whatever the sample size).
The case of the MA12 process seems easier to deal with. The hac method, however, keeps a level around 0.15 even for n = 5000, contrary to the fitAR, spectralproj, and efromovich methods, whose levels get close to the nominal level as the sample size grows.
To be complete, we consider the case where the errors are i.i.d.; the estimated levels are reported in the two tables below.
| n | Process | Fisher test | fitAR | spectralproj |
|---|---|---|---|---|
| 150 | i.i.d. process | 0.053 | 0.068 | 0.078 |
| 300 | i.i.d. process | 0.052 | 0.051 | 0.06 |
| 500 | i.i.d. process | 0.047 | 0.049 | 0.053 |
| n | Process | efromovich | kernel | hac |
|---|---|---|---|---|
| 150 | i.i.d. process | 0.061 | 0.124 | 0.063 |
| 300 | i.i.d. process | 0.05 | 0.095 | 0.052 |
| 500 | i.i.d. process | 0.049 | 0.082 | 0.056 |
Except for the kernel method, the estimated levels are close to the nominal level 0.05, in particular with the fitAR, efromovich, and hac methods.
As a general conclusion of this section about numerical experiments and
method comparison, we see that the fitAR
method performs quite well in
a wide variety of situations and should therefore be used as soon as the
user suspects that the error process can be modeled by a stationary
short-memory process.
This dataset comes from a study about fine particle pollution in five Chinese cities. The data are available on the following website: https://archive.ics.uci.edu/ml/datasets/PM2.5+Data+of+Five+Chinese+Cities#. Here we are interested in the city of Shanghai. We study the regression of PM2.5 pollution in the Xuhui District on other measurements of pollution in neighboring districts and on meteorological variables. The dataset contains hourly observations of pollution and weather variables between January 2010 and December 2015.
We remove the lines that contain NA observations, and we then extract the first 5000 observations. We only consider the following variables:
- PM_Xuhui: PM2.5 concentration in the Xuhui district (µg/m³)
- PM_Jingan: PM2.5 concentration in the Jing'an district (µg/m³)
- PM_US.Post: PM2.5 concentration at the U.S. diplomatic post (µg/m³)
- DEWP: dew point (degrees Celsius)
- TEMP: temperature (degrees Celsius)
- HUMI: humidity (%)
- PRES: pressure (hPa)
- Iws: cumulated wind speed (m/s)
- precipitation: hourly precipitation (mm)
- Iprec: cumulated precipitation (mm)
R> shan = read.csv("ShanghaiPM20100101_20151231.csv", header = TRUE,
+ sep = ",")
R> shan = na.omit(shan)
R> shan_complete = shan[1:5000,c(7,8,9,10,11,12,13,15,16,17)]
R> shan_complete[1:5,]
PM_Jingan PM_US.Post PM_Xuhui DEWP HUMI PRES TEMP Iws
26305 66 70 71 -5 69.00 1023 0 60
26306 67 76 72 -5 69.00 1023 0 62
26308 73 78 74 -4 74.41 1023 0 65
26309 75 77 77 -4 80.04 1023 -1 68
26310 73 78 80 -4 80.04 1023 -1 70
precipitation Iprec
26305 0 0
26306 0 0
26308 0 0
26309 0 0
26310 0 0
The aim is to study the concentration of particles in Xuhui District
according to the other variables. We first fit a linear regression with
the lm
function.
R> reglm = lm(shan_complete$PM_Xuhui ~ . ,data = shan_complete)
R> summary.lm(reglm)
Call:
lm(formula = shan_complete$PM_Xuhui ~ ., data = shan_complete)
Residuals:
Min 1Q Median 3Q Max
-132.139 -4.256 -0.195 4.279 176.450
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -54.859483 40.975948 -1.339 0.180690
PM_Jingan 0.596490 0.014024 42.533 < 2e-16 ***
PM_US.Post 0.375636 0.015492 24.246 < 2e-16 ***
DEWP -1.038941 0.170144 -6.106 1.10e-09 ***
HUMI 0.291713 0.045799 6.369 2.07e-10 ***
PRES 0.025287 0.038915 0.650 0.515852
TEMP 1.305543 0.168754 7.736 1.23e-14 ***
Iws -0.007650 0.002027 -3.774 0.000163 ***
precipitation 0.462885 0.132139 3.503 0.000464 ***
Iprec -0.125456 0.039025 -3.215 0.001314 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.68 on 4990 degrees of freedom
Multiple R-squared: 0.9409, Adjusted R-squared: 0.9408
F-statistic: 8828 on 9 and 4990 DF, p-value: < 2.2e-16
The variable PRES has no significant effect on the PM_Xuhui variable. We then perform a backward selection procedure, which leads to removing the PRES variable:
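Here the non-significant variable was simply removed by hand; an AIC-based alternative would be the standard step function (a sketch, not the procedure used above):

R> reg_back = step(reglm, direction = "backward")  # AIC-based backward selection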
R> shan_lm = shan[1:5000,c(7,8,9,10,11,13,15,16,17)]
R> reglm = lm(shan_lm$PM_Xuhui ~ . ,data = shan_lm)
R> summary.lm(reglm)
Call:
lm(formula = shan_lm$PM_Xuhui ~ ., data = shan_lm)
Residuals:
Min 1Q Median 3Q Max
-132.122 -4.265 -0.168 4.283 176.560
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -28.365506 4.077590 -6.956 3.94e-12 ***
PM_Jingan 0.595564 0.013951 42.690 < 2e-16 ***
PM_US.Post 0.376486 0.015436 24.390 < 2e-16 ***
DEWP -1.029188 0.169471 -6.073 1.35e-09 ***
HUMI 0.285759 0.044870 6.369 2.08e-10 ***
TEMP 1.275880 0.162453 7.854 4.90e-15 ***
Iws -0.007734 0.002023 -3.824 0.000133 ***
precipitation 0.462137 0.132127 3.498 0.000473 ***
Iprec -0.127162 0.038934 -3.266 0.001098 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.68 on 4991 degrees of freedom
Multiple R-squared: 0.9409, Adjusted R-squared: 0.9408
F-statistic: 9933 on 8 and 4991 DF, p-value: < 2.2e-16
The autocorrelation function of the residual process shows that the errors are clearly not i.i.d.; see Figure 5. We thus suspect the lm procedure to be unreliable in this context. The autocorrelation function decreases pretty fast, and the partial autocorrelation function suggests that fitting an AR process on the residuals should be an appropriate method in this case. The automatic fitAR method of slm selects the order of this AR process with the AIC criterion.
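These diagnostics can be reproduced on the lm residuals with the standard acf and pacf functions (the value of lag.max is an arbitrary choice here):

R> acf(residuals(reglm), lag.max = 50)   # residual autocorrelation
R> pacf(residuals(reglm), lag.max = 50)  # residual partial autocorrelation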
Consequently, we propose to perform the linear regression with the slm function, using the fitAR method on the complete model.
R> regslm = slm(shan_complete$PM_Xuhui ~ . ,data = shan_complete,
+ method_cov_st = "fitAR", model_selec = -1)
R> summary(regslm)
Call:
"slm(formula = myformula, data = data, x = x, y = y)"
Residuals:
Min 1Q Median 3Q Max
-132.139 -4.256 -0.195 4.279 176.450
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -54.859483 143.268399 -0.383 0.701783
PM_Jingan 0.596490 0.028467 20.953 < 2e-16 ***
PM_US.Post 0.375636 0.030869 12.169 < 2e-16 ***
DEWP -1.038941 0.335909 -3.093 0.001982 **
HUMI 0.291713 0.093122 3.133 0.001733 **
PRES 0.025287 0.137533 0.184 0.854123
TEMP 1.305543 0.340999 3.829 0.000129 ***
Iws -0.007650 0.005698 -1.343 0.179399
precipitation 0.462885 0.125641 3.684 0.000229 ***
Iprec -0.125456 0.064652 -1.940 0.052323 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.68
Multiple R-squared: 0.9409
chi2-statistic: 8383 on 9 DF, p-value: < 2.2e-16
Note that the p-values are globally larger than with the lm procedure, and that more variables have no significant effect than with lm. After performing a backward selection, we obtain the following results:
R> shan_slm = shan[1:5000,c(7,8,9,10,11,13)]
R> regslm = slm(shan_slm$PM_Xuhui ~ . , data = shan_slm,
+ method_cov_st = "fitAR", model_selec = -1)
R> summary(regslm)
Call:
"slm(formula = myformula, data = data, x = x, y = y)"
Residuals:
Min 1Q Median 3Q Max
-132.263 -4.341 -0.192 4.315 176.501
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -29.44924 8.38036 -3.514 0.000441 ***
PM_Jingan 0.60063 0.02911 20.636 < 2e-16 ***
PM_US.Post 0.37552 0.03172 11.840 < 2e-16 ***
DEWP -1.05252 0.34131 -3.084 0.002044 **
HUMI 0.28890 0.09191 3.143 0.001671 **
TEMP 1.30069 0.32435 4.010 6.07e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.71
Multiple R-squared: 0.9406
chi2-statistic: 8247 on 5 DF, p-value: < 2.2e-16
The backward selection with slm only keeps five variables: PM_Jingan, PM_US.Post, DEWP, HUMI, and TEMP.
The authors are grateful to Anne Philippe (Nantes University) and Aymeric Stamm (CNRS - Nantes University) for valuable discussions.