This paper is an introduction to the new R package smoots (smoothing time series), developed for data-driven local polynomial smoothing of trend-stationary time series. Functions for data-driven estimation of the first and second derivatives of the trend are also built in. The package is first applied to monthly changes of the global temperature. The example of the quarterly US-GDP series shows that the package can also be applied to a semiparametric multiplicative component model for non-negative time series via the log-transformation. Furthermore, we introduce a semiparametric Log-GARCH and a semiparametric Log-ACD model, which can be easily estimated with the smoots package. Of course, the package applies to suitable time series from any other research area. The smoots package also provides a useful tool for teaching time series analysis, because many practical time series follow an additive or a multiplicative component model.
This paper provides an introduction to a new R package called smoots (version 1.0.1, Feng and Schulz 2019) for data-driven local polynomial smoothing of trend-stationary time series. The package was developed based on the R code for the practical implementation of the proposals in Feng et al. (2020). A detailed discussion of the methodology and the development of the algorithms may be found in that work. Here, the main algorithm is a fully data-driven IPI (iterative plug-in, Gasser et al. 1991) approach for estimating the trend under stationary time series errors, where the variance factor in the asymptotically optimal bandwidth is estimated by another IPI approach for a lag-window estimator of the spectral density following Bühlmann (1996). Numerous sub-algorithms, determined by different options such as the choice of the order of the local polynomial, the choice of the kernel weighting function and the choice of the so-called inflation factors for estimating the bias in the asymptotically optimal bandwidth, are included. Further functions for data-driven estimation of the first and second derivatives of the trend are also developed by adapting the idea of Feng (2007). A related proposal may be found in Francisco-Fernández et al. (2004). The algorithms included in the smoots package differ from their proposal in several ways. (1) The IPI-algorithm is usually superior to the DPI (see Beran et al. 2009 for a detailed discussion in models with i.i.d. errors). (2) In Francisco-Fernández et al. (2004) the variance factor is estimated under an AR(1) model, which is usually a misspecification. (3) Data-driven estimation of the derivatives is also included in the current package. Moreover, the user can obtain any estimate with a fixed bandwidth chosen beforehand. A function for kernel smoothing is also built in for comparison.
The smoots package complements the Comprehensive R Archive Network (CRAN), as existing base R functions and CRAN packages for local polynomial smoothing either do not implement an automated bandwidth selection or, if such bandwidth algorithms do exist, are only suitable for data with i.i.d. (independently and identically distributed) errors, an assumption that is often violated for
time series. In more detail, notwithstanding that the
stats package offers the
functions lowess()
and loess()
for computing the local polynomial
estimates of the regression function given one or multiple predictor
variables, a smoothing bandwidth has to be selected arbitrarily in both
cases. Similar functionality is provided by the locpol (Ojeda Cabrera 2018), KernSmooth (Wand 2021) and lokern (Herrmann and Maechler 2021) packages; in addition, however, derivative estimation approaches and various bandwidth selection algorithms for the regression function, such as the leave-one-out estimator (Allen 1974) and the plug-in algorithm of Ruppert et al. (1995), are built into them.
Moreover, within lokern, the bandwidth for estimating the first or
second derivative of the regression function can be selected
automatically. Nevertheless, these traditional data-driven bandwidth
selection approaches are only suitable in case of data with i.i.d.
errors (Altman 1990; Hart 1991; Opsomer 1997)
and can therefore often not be considered for time series. Dedicated CRAN packages for detrending a time series nonparametrically are rmaf (Qiu 2015) and mFilter (Balcilar 2019). With rmaf, a fully data-driven refined moving average or a cubic smoothing spline, which requires a manually selected smoothness parameter, can be applied, whereas the trend as well as the cyclical component of a time series can be extracted with the filters implemented in the mFilter package, such as the Baxter-King (Baxter and King 1999) and Hodrick-Prescott (Hodrick and Prescott 1997) filters, among others. However, none of the filtering functions in the mFilter package is fully data-driven. Instead, they make use of default values recommended in the scientific literature. Thus, to our knowledge, there are no packages on CRAN that implement a data-driven local polynomial regression for the trend of a time series or its derivatives.
The methodological and computational background for the smoots package
is summarized briefly hereinafter. For further details we refer the
reader to Feng et al. (2020) and references therein. Our focus is to
illustrate the wide application spectrum of the smoots package for
semiparametric modeling of time series. We propose to estimate the
nonparametric trend using the current package in the first stage and to
fit e.g. an ARMA model to the residuals with the standard arima()
function of the stats package in R. This idea is first shown with
monthly Northern Hemisphere temperature changes data obtained from the
website of the National Aeronautics and Space Administration (NASA).
Then the proposal is applied to the log-transformation of the quarterly
US-GDP data, retrieved from the website of the Federal Reserve Bank of
St. Louis, using a semiparametric log-local-linear growth model.
Moreover, two new semiparametric models for financial time series are proposed to show further applications of this package. Firstly, a
Semi-Log-GARCH model is defined by introducing a scale function into the
Log-GARCH (logarithmic GARCH)
(Geweke 1986; Pantula 1986; Milhøj 1987). The latter is a special
extension of the seminal ARCH (autoregressive conditional
heteroskedasticity, Engle 1982) and GARCH (generalized ARCH, Bollerslev 1986) models and is just a slightly restricted ARMA
model for the log-transformation of the squared returns (see e.g. Sucarrat 2019). The usefulness of the Log-GARCH has recently been
rediscovered by Francq et al. (2013). The application of this proposal is
illustrated by the DAX returns collected from Yahoo Finance. Secondly, a
Semi-Log-ACD model is proposed as an extension of the Type I Log-ACD
(Bauwens and Giot 2000), which is closely related to the Semi-Log-GARCH. Like
the ACD (autoregressive conditional duration, Engle and Russell 1998) model, the
Semi-Log-ACD can be applied to different non-negative financial data,
such as realized volatilities and trading volumes. In this paper, this
new model is applied to the CBOE Volatility Index (VIX) obtained from
Yahoo Finance. Datasets for all of those examples are built into the proposed package. The smoots package can be applied to suitable time
series from other research areas. In addition, the smoots package
provides a useful tool for teaching time series analysis, which helps a
lecturer to obtain automatically detrended real data examples to show
the application of parametric time series models.
The methods and IPI-algorithms are summarized in the following two sections. The general usage of the R functions is then described in another section. Three further sections illustrate the applications of those functions in simple cases and with respect to the Semi-Log-GARCH and the Semi-Log-ACD.
The main algorithm of the smoots package is a fully automatic non-parametric procedure for estimating a deterministic trend in an additive time series model with observations y_t = m(x_t) + ξ_t, t = 1, ..., n, where x_t = t/n denotes the rescaled observation time on [0, 1], m is a smooth deterministic trend function and ξ_t is a zero-mean stationary error process with short memory.
The local polynomial estimator of
An IPI-algorithm is developed based on the asymptotically optimal
bandwidth
After estimating and removing the nonparametric trend, any suitable parametric model can be fitted to the residuals for further econometric analysis. For instance, we can assume that the detrended series follows an ARMA process, which leads to the additive Semi-ARMA model applied in the examples below.
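As a minimal sketch of this two-stage approach (using the package functions described in a later section and the built-in tempNH dataset purely for illustration):

# Stage 1: data-driven local linear estimate of the trend
trend_est <- smoots::msmooth(smoots::tempNH$Change, p = 1)
# Stage 2: fit a parametric model, e.g. an ARMA(1, 1), to the detrended series
arma_fit <- stats::arima(trend_est$res, order = c(1, 0, 1), include.mean = FALSE)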
Since both
Let
Let
Global steps: Estimate
Local adaptation at
It is proposed to use the Bartlett-window throughout the whole algorithm
for simplicity. If
In this subsection the data-driven IPI-algorithm for selecting the bandwidth for the local linear and local cubic estimators of the trend is described.
Start with an initial bandwidth
Obtain
Let
Increase
In the developed package the initial bandwidth
The proposed IPI-algorithm can be easily adapted to select bandwidths
for estimating
In the first stage
Then an IPI-procedure as proposed above to select the bandwidth for
estimating
Note that
The smoots package, developed based on the algorithms described in the last section, consists of five directly usable functions, two S3 methods and four example datasets. The first four functions are called msmooth(), tsmooth(), gsmooth() and knsmooth() and are designed for estimating the trend in different ways. Data-driven estimation can be carried out by each of the first two functions, where msmooth() is a user-friendlier, simplified version of tsmooth(). Local polynomial estimation and kernel regression with a fixed bandwidth chosen beforehand can be carried out via gsmooth() and knsmooth(), respectively; in the first case, the order of the polynomial has to be chosen by the user as well. The fifth function, dsmooth(), provides data-driven estimation of the first or second derivative of the trend.
In the following, the functions with data-driven bandwidth selection will be described in more detail. Moreover, for simplicity, arguments shared between functions will only be discussed once. With tsmooth(y, p, mu, Mcf, InfR, bStart, bvc, bb, cb, method), the trend in an equidistant, trend-stationary time series with short memory can be estimated via an automated local polynomial regression. Its arguments are as follows.
y is the input time series.
p reflects the order of the polynomial; currently, only p = 1 (local linear) and p = 3 (local cubic) are selectable.
mu corresponds to the smoothness parameter of the second-order kernel weighting function; mu = 0, 1, 2 and 3 are valid options.
Mcf defines the method for estimating the variance factor; Mcf = "NP" (the default, a nonparametric lag-window estimator), "AR", "MA" or "ARMA" can be selected.
InfR sets the inflation rate used for estimating the bias in the bandwidth selection; InfR = "Opt" (the optimal inflation rate, the default), "Nai" (a naive inflation rate) and "Var" (the stable variance approach) are available.
bStart is the (relative) starting bandwidth for the bandwidth selection algorithm. The default is bStart = 0.15.
bvc indicates whether or not an enlarged bandwidth should be used for the estimation of the variance factor; bvc = "Y" and bvc = "N" are possible.
bb describes the boundary method. If bb = 0 is selected, the number of observations considered for smoothing decreases towards the boundaries, while it is constant throughout for bb = 1.
cb is the proportion of observations at each boundary that is omitted in the bandwidth selection process.
method defines the final smoothing method applied after the bandwidth has been selected. The default method = "lpr" corresponds to local polynomial estimates, whereas with method = "kr" a kernel regression is conducted.
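As a minimal sketch of a call to tsmooth() with the default settings for Mcf, InfR, bStart, bvc and method (the built-in tempNH dataset and the value of mu are used purely for illustration):

# Data-driven local linear trend with mostly default settings
y <- smoots::tempNH$Change
est <- smoots::tsmooth(y, p = 1, mu = 1, Mcf = "NP", InfR = "Opt",
  bStart = 0.15, bvc = "Y", method = "lpr")
trend <- est$ye  # fitted trend values
resid <- est$res # detrended series (residuals)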
A simplified version of tsmooth() for estimating a time series trend with a data-driven local polynomial is msmooth(y, p, mu, bStart, alg, method), which shares most of its arguments with tsmooth(). The only argument not discussed so far is alg.
alg defines specific argument settings of tsmooth() as named subalgorithms. The two algorithms AlgA and AlgB described in the section on the IPI-algorithm for estimating the trend can be selected via alg = "A" and alg = "B", respectively.
In accordance with Feng et al. (2020), we propose the use of AlgA with the optimal inflation factor for local linear regression. For local cubic regression, the inflation rates "Nai" or even "Var" should be employed instead.
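For illustration, the two subalgorithms can be compared in a small sketch (again based on the built-in tempNH dataset; the remaining argument values are examples only):

# Compare the predefined subalgorithms AlgA and AlgB of msmooth()
y <- smoots::tempNH$Change
estA <- smoots::msmooth(y, p = 1, mu = 1, bStart = 0.15, alg = "A", method = "lpr")
estB <- smoots::msmooth(y, p = 1, mu = 1, bStart = 0.15, alg = "B", method = "lpr")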
To estimate the first or second derivative of a time series trend with a data-driven local polynomial, the function dsmooth(y, d, mu, pp, bStart.p, bStart) should be employed. With this function, the first and second derivatives are always estimated using a local quadratic and a local cubic regression, respectively. To obtain the variance factor needed in the bandwidth selection, a data-driven pilot estimate of the trend via tsmooth() is computed beforehand. The arguments not discussed so far are as follows.
d specifies the order of the derivative of the trend that will be estimated. Currently, d = 1 and d = 2 are valid options.
pp corresponds to the order of the polynomial considered for the pilot estimate of the trend function. pp = 1 and pp = 3 are available.
bStart.p is the starting bandwidth for the data-driven pilot estimate of the trend.
Note that here, bStart is the initial bandwidth considered in the bandwidth selection for the derivative series. For further details on the functions we refer the reader to the user's guideline for this package. Besides the above functions, two S3 methods are also built into
the package: a print()
and a plot()
method for objects of class
"smoots"
, a newly introduced class of objects created by the smoots
package. They allow for a quick and detailed overview of the estimation
results. The "smoots"
objects themselves are generally lists
consisting of different components such as input data and estimation
results.
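For example, assuming an object returned by msmooth(), the two methods can be applied as follows (the built-in tempNH dataset is again used purely for illustration):

est <- smoots::msmooth(smoots::tempNH$Change)
print(est) # textual summary of the estimation results
plot(est)  # graphical display of the results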
In this and the next two sections, the wide applicability of the above-mentioned functions will be illustrated by four real data examples. The corresponding datasets are built into the package so that the reader can also use them. They are tempNH
, gdpUS
, dax
and vix
, which contain
observations of the mean monthly temperature changes of the Northern
Hemisphere, the US GDP and daily financial data of the German stock
index (DAX) and the CBOE Volatility Index (VIX), respectively. For
further information see also the documentation on the data within the
smoots package. Since the package is available on CRAN, the commands
install.packages("smoots")
and library(smoots)
can be used to
install it and attach it within the current R environment.
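In code:

# Install the package from CRAN (once) and attach it to the current R session
install.packages("smoots")
library(smoots)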
To show the application of the additive Semi-ARMA model defined by (1) and (6), the time series of the mean monthly Northern Hemisphere temperature changes from 1880 to 2018 (NHTM) is chosen. The data were downloaded from the website of the NASA. For this purpose, the function tsmooth() is applied. The argument settings used for this function are equivalent to those of Algorithm A, as shown in the code below.
tempChange <- smoots::tempNH$Change
# Data-driven local linear estimate of the trend
est_temp <- smoots::tsmooth(tempChange, p = 1, mu = 2, Mcf = "NP", InfR = "Opt",
  bStart = 0.1, bvc = "Y", method = "lpr")
# Data-driven estimates of the first and second derivatives of the trend
d1_temp <- smoots::dsmooth(tempChange, d = 1, mu = 2, pp = 3, bStart.p = 0.2,
  bStart = 0.15)
d2_temp <- smoots::dsmooth(tempChange, d = 2, mu = 3, pp = 1, bStart.p = 0.1,
  bStart = 0.2)
Figure 1(a) shows the observations together with the estimated trend. Here, the selected bandwidth is 0.1221. We see that the estimated trend fits the data very well. In particular, the trend increases steadily after 1970, which might be a signal of possible global warming in the last decades. The residuals, which look quite stationary, are displayed in Figure 1(b). This indicates that Model (1) is a suitable approach for this time series. Moreover, Figures 1(c) and 1(d) illustrate the data-driven estimates of the first and second derivatives of the trend, respectively, which fit the features of the trend function very well and provide further details about the global temperature changes.
arma1 <- stats::arima(est_temp$res, order = c(1, 0, 1), include.mean = FALSE)
Consequently, an ARMA(1, 1) model is fitted to the residual series with the arima() function in R, as shown in the code above.
A well-known approach in development economics is the log-linear growth model. Assuming that the log-transformation of a macroeconomic time series follows Models (1) and (6), we obtain a semiparametric, in particular a local-linear, extension of this theory. To show this application, the series of the quarterly US-GDP from the first quarter of 1947 to the second quarter of 2019, downloaded from the Federal Reserve Bank of St. Louis, is chosen. Data-driven local linear regression is applied to estimate the trend from the log-data using AlgA, with the selected bandwidth 0.1325. A kernel regression estimate using the same bandwidth is also carried out for comparison.
l_gdp <- log(smoots::gdpUS$GDP)
# Data-driven local linear estimate of the trend in the log-GDP
gdp_t1 <- smoots::msmooth(l_gdp, p = 1, mu = 1, bStart = 0.1, alg = "A",
  method = "lpr")
# Kernel regression with the same selected bandwidth, for comparison
gdp_t2 <- smoots::msmooth(l_gdp, p = 1, mu = 1, bStart = 0.1, alg = "A",
  method = "kr")
# Data-driven estimates of the first and second derivatives of the trend
gdp_d1 <- smoots::dsmooth(l_gdp, d = 1, mu = 1, pp = 1, bStart.p = 0.1,
  bStart = 0.15)
gdp_d2 <- smoots::dsmooth(l_gdp, d = 2, mu = 1, pp = 1, bStart.p = 0.1,
  bStart = 0.2)
The results together with the log-data are displayed in Figure 2(a). We see that the two trend estimates in the middle part coincide with each other. They differ from each other only at the boundary points and the kernel estimate is affected by a clear boundary problem. Thus, the local linear method should be used. Residuals of this estimate are shown in Figure 2(b). Again, the estimated first and second derivatives are given in Figures 2(c) and 2(d), respectively, which help us to discover more detailed features of the economic development in the US.
arma2 <- stats::arima(gdp_t1$res, order = c(1, 0, 1), include.mean = FALSE)
Furthermore, an ARMA(1, 1) model is again obtained from the residuals with the code above.
Most of the GARCH extensions, including the Log-GARCH (Geweke 1986; Pantula 1986; Milhøj 1987), are defined for stationary return series. Recently, Francq et al. (2013) rediscovered the usefulness of the Log-GARCH. In practice it is, however, found that the unconditional variance of financial returns may change slowly over time and is hence non-stationary. To overcome this problem, a semiparametric GARCH (Semi-GARCH) approach was proposed by Feng (2004), defining the return series as the product of a (deterministic) smooth scale function and a GARCH model. Another well-known, closely related approach is the Spline-GARCH introduced by Engle and Rangel (2008). In this paper we introduce a Semi-Log-GARCH (semiparametric Log-GARCH), defined as a Log-GARCH with a smooth scale function. We propose to estimate the scale function in the Semi-Log-GARCH model based on the log-transformation of the squared returns as proposed by Engle and Rangel (2008).
Denote the centralized returns by r_t and consider the log-transformed squared returns y_t = log(r_t^2). The model can then be estimated in three steps with the smoots package and the arima() function of the stats package. To obtain the total volatilities:
Step 1: Obtain a data-driven estimate of the nonparametric trend function of y_t with the smoots package and compute the corresponding residuals.
Step 2: Fit an ARMA(p, q) model to these residuals with the arima() function of the stats package, and compute the ARMA residuals.
Step 3: Estimate the correction constant from the ARMA residuals (mu_le in the code below) and compute the total volatilities from the fitted trend and the fitted ARMA values.
Estimates of the conditional volatilities are also calculable similarly by following the same three-step procedure, with the formula for the total volatilities in the last step replaced by its conditional counterpart.
In the following, the DAX series from 1990 to July 2019, downloaded from Yahoo Finance, is chosen to show the application of the Semi-Log-GARCH model. Note that an observed return can sometimes be exactly zero. To overcome this problem, the log-transformation is calculated for the squared centralized returns, which are a.s. non-zero. This would even be a necessary treatment if the returns had a very small, but non-zero, mean.
# Calculate the centralized log-returns
dax_close <- smoots::dax$Close; dax <- diff(log(dax_close))
rt <- dax - mean(dax); yt <- log(rt ^ 2)
Subsequently, the previously described estimation procedure according to
Sucarrat (2019) is implemented by estimating the trend function in the
logarithm of the squared returns yt
with the function msmooth()
of
the smoots package. More specifically, a local cubic trend is fitted,
while employing AlgA.
# Step 1: Estimate the trend in the log-transformed data using 'smoots'
estim3 <- smoots::msmooth(yt, p = 3, alg = "A")
m_xt <- estim3$ye
# Step 2: Fit an ARMA model to the residuals
xi <- estim3$res
arma3 <- arima(xi, order = c(1, 0, 1), include.mean = FALSE)
# Step 3: Estimate further quantities and the total volatilities
mu_le <- -log(mean(exp(arma3$residuals)))
vol <- exp((xi - arma3$residuals + m_xt - mu_le) / 2)
For reference, estim3.2 <- smoots::msmooth(yt, p = 1, alg = "A") is called as well, i.e. a local linear trend is also estimated using AlgA. The centralized log-returns and the log-transformed data with the two estimated trends (local linear: blue; local cubic: red) are displayed in Figures 3(a) and 3(b).
Moreover, the selected bandwidths are 0.0869 and 0.1013, respectively.
Results in Figure 3(b) indicate that the unconditional
variance of the DAX-returns changes slowly over time. Ultimately, the
local cubic trend is chosen for further analysis, because here the
results of the local linear approach are over-smoothed. An ARMA(1, 1) model (see Step 2 in the code) is then obtained from the residuals of the log-data.
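As a small additional sketch (not part of the original example code), the estimated total volatilities from Step 3 can be inspected graphically, for instance by superimposing them on the centralized returns:

# Centralized returns with the estimated total volatilities superimposed
plot(rt, type = "l", xlab = "Observation", ylab = "Return")
lines(vol, col = "red")
lines(-vol, col = "red")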
A more general framework for modeling non-negative financial time series is the ACD (autoregressive conditional duration, Engle and Russell 1998) model, which corresponds to a squared GARCH model and can be applied to both high-frequency and daily financial data. Logarithmic extensions of this approach were introduced by Bauwens and Giot (2000), where the Type I definition (called a Log-ACD) corresponds to a squared form of the Log-GARCH. A semiparametric generalization of the Log-ACD (the Semi-Log-ACD) was defined and applied to different kinds of non-negative financial data by Forstinger (2018). In this paper, the application of the Semi-Log-ACD model is illustrated by the CBOE Volatility Index (VIX) from 1990 to July 2019, denoted by V in the code below.
# Calculate the logarithm of the index
V <- smoots::vix$Close; lnV <- log(V)
# Step 1: Estimate the trend in the log-transformed data using 'smoots'
estim4 <- smoots::msmooth(lnV)
estim4.2 <- smoots::msmooth(lnV, p = 3, alg = "B")
m_xt <- estim4$ye
# Step 2: Fit an ARMA model to the residuals
xi <- estim4$res
arma4 <- arima(xi, order = c(1, 0, 1), include.mean = FALSE)
# Step 3: Estimate further quantities and the total means
mu_le <- -log(mean(exp(arma4$residuals)))
means <- exp(xi - arma4$residuals + m_xt - mu_le)
The original series of the VIX and the estimated means can then be compared graphically.
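For instance, such a comparison could be produced as follows (a small sketch, not part of the original example code):

# Original VIX series with the estimated (total) means superimposed
plot(V, type = "l", xlab = "Observation", ylab = "VIX")
lines(means, col = "red")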
In this paper the methodological background for developing the R package smoots (version 1.0.1) is summarized. The usage of the main functions in this package is explained in detail. To show the wide applicability of this approach, two new semiparametric models for analyzing financial time series are also introduced. Data examples show that the proposed approach can be applied to different kinds of non-stationary time series and that the developed R package works very well for the data-driven implementation of those semiparametric time series models. In particular, non-negative time series following a semiparametric multiplicative model can easily be handled via the log-transformation. It is found that the errors in some examples could exhibit clear long memory. However, the current package is developed under the short-memory assumption. It is hence worthwhile to study a possible extension of the current approach to semiparametric time series models with long-memory errors. Further extensions of the proposals in this paper, such as the development of suitable forecasting procedures and tools for testing the stationarity of the errors or the linearity of the deterministic trend, should also be studied in the future and included in future versions of the smoots package.
The numerical results in this paper were obtained using R 4.1.1 with the smoots 1.0.1 package and the stats 4.1.1 package. R itself and all packages used are available from CRAN at https://CRAN.R-project.org/.
This work was supported by the German DFG project GZ-FE-1500-2-1. The data used were downloaded from different public sources as indicated in the text. We are grateful to the CRAN network for great help during the publication of the R package smoots. We would also like to thank Dr. Marlon Fritz for helpful discussions.