We implement in the R package carx a novel and computationally efficient quasi-likelihood method for estimating a censored autoregressive model with exogenous covariates. The proposed quasi-likelihood method reduces to maximum likelihood estimation in the absence of censoring. The carx package contains many useful functions for practical data analysis with censored stochastic regression, including functions for outlier detection, model diagnostics, and prediction with censored time series data. We illustrate the capabilities of the carx package with simulations and an elaborate real data analysis.
Censored data are frequently encountered in diverse fields including environmental monitoring, medicine, economics, and social sciences. Censoring may arise, for example, when a measuring device is subject to some detection limits beyond which the device cannot yield a reliable measurement. Censoring can also occur due to regulations on price change, e.g., limits on maximal intra-daily price change in a stock market.
There exists an extensive literature on regression analysis with censored responses since the pioneering work of Buckley and James (1979). Considerable effort has also been devoted to implementing methods for estimating various models with censored observations, many of which are available in R. For instance, Henningsen (2010) introduced the censReg package (Henningsen 2013), which covers standard regression models with censored responses, including the standard Tobit model (Tobin 1958), maximum likelihood estimation with cross-sectional data, and a random-effects maximum likelihood procedure for panel data using Gauss-Hermite quadrature. The Tobit model is also implemented in other packages, with possibly different estimation methods, including tobit() in AER (Kleiber and Zeileis 2008), cenmle() in NADA (Lee 2013), tobit() in VGAM (Yee 2015), and MCMCtobit() in MCMCpack (Martin et al. 2011), among others.
While there exists an extensive literature on estimating regression models with censored responses and associated software, there are few studies on censored time series response data. More generally, the problem of stochastic regression with both the response and covariates possibly censored is relatively under-explored. Zeger and Brookmeyer (1986) studied maximum likelihood estimation of a regression model with the errors driven by an autoregressive model of known order.
In terms of publicly available R packages facilitating estimation with censored time series data, we are aware of only three such packages to date, namely cents (McLeod et al. 2014), ARCensReg (Schumacher et al. 2016), and our carx (Wang and Chan 2017a).
The cents package includes the fitcar1() function, for fitting an AR(1) model in the presence of censored and/or missing data, and the cenarma() function which, according to the authors, implements a quasi-EM algorithm whose M-step is carried out by the arima() function and whose E-step uses the Durbin-Levinson recursions. However, there is little documentation for these functions, making the cents package hard to understand and use. The ARCensReg package offers functionality similar to that of our carx package, but its estimation is implemented via a stochastic approximation version of the EM algorithm (SAEM), which differs from our approach. In addition, it appears to have been developed after carx, as a dataset from carx is included in ARCensReg.
Motivated by the need for developing a computationally efficient method
for estimating censored stochastic regression models, Wang and Chan (2017b) have
recently introduced such a method for censored autoregressions with
exogenous covariates (CARX). The basic idea of our new approach assumes
that the score of the complete-data conditional log-likelihood of
In this paper we aim to introduce the R package carx, in which quasi-likelihood estimation of a CARX model is implemented for the important special case of normal innovations. The main functionality of the package is to provide an intuitive interface with comprehensive documentation to enable the user to estimate the parameters of a CARX model. In addition, some utility functions for model summary, model diagnostics, outlier detection, and prediction with censored time series data are also included in the package.
In addition, we have implemented a new object class for censored time series, "cenTS". The "cenTS" class inherits from the extensible time-series class "xts" in the R package xts (Ryan and Ulrich 2017). Some functionality for the "cenTS" class, including plotting and summary, has been implemented. The "cenTS" class is expected to be extended in the future and is intended to serve as a standard data structure for censored time series data.
In the following sections we first elaborate the CARX model and review the quasi-likelihood estimation method, then present the functionality and main functions of the R package carx, and illustrate the package with data analyses using both simulated and real data examples. Simulation studies assessing the empirical performance of model selection by minimizing the AIC and the accuracy of the proposed forecasting method are also reported.
In this section we briefly review the quasi-likelihood method for estimating a CARX model, and refer the reader to Wang and Chan (2017b) for details and some theoretical properties of the estimator. We first formulate the problem by specifying the model, then outline the estimation method and discuss some specific topics including model prediction, model diagnostics, and outlier detection.
Let
In practice,
The Censored Auto-Regressive model with eXogenous variables (CARX)
specifies that the uncensored response
The
Let
The quasi-likelihood method estimates
The following iterative scheme for solving Eq (3) was proposed by Wang and Chan (2017b).
Step (1): Initialize the parameter estimate by some consistent estimate, denoted by
Step (2): For each
Step (3): Iterate Step (2) until
The optimization in Step (2) for the case of normal innovations is
elaborated in Section 2.4 of Wang and Chan (2017b). The value
In the carx package, the initial value for the preceding iterative algorithm is set to the conditional least squares estimate obtained with the censored data replaced by the corresponding censoring limit, which appears to work well in simulation examples reported in Wang and Chan (2017b).
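The iteration can be summarized by the following schematic R code (a minimal sketch; updateEstimate() is a hypothetical placeholder for the optimization in Step (2), and the default values of tol and max.iter match the corresponding arguments of the carx() function introduced later).

quasiLikIterate <- function(data, theta0, tol = 1e-4, max.iter = 500) {
  theta <- theta0                              # Step (1): consistent initial estimate
  for (iter in seq_len(max.iter)) {
    theta.new <- updateEstimate(theta, data)   # Step (2): update given current estimate
    if (max(abs(theta.new - theta)) < tol)     # Step (3): stop upon convergence
      return(list(coef = theta.new, iter = iter, converged = TRUE))
    theta <- theta.new
  }
  list(coef = theta, iter = max.iter, converged = FALSE)
}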
Wang and Chan (2017b) proved the consistency and asymptotic normality of the quasi-likelihood estimator under mild regularity conditions. But the asymptotic covariance matrix of the estimator involves two intractable matrices. Consequently, Wang and Chan (2017b) proposed to use parametric bootstrap for drawing inference, including estimating the asymptotic covariance matrix and constructing confidence intervals of the unknown parameters.
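Schematically, the parametric bootstrap proceeds as in the following sketch, where simulateFromFit() and estimateCARX() are hypothetical placeholders for simulating a censored series from the fitted parameters and re-running the quasi-likelihood estimation; in carx this is triggered by the CI.compute and b arguments described later.

bootstrapCov <- function(fit, B = 1000) {
  boot.est <- t(replicate(B, {
    sim <- simulateFromFit(fit)  # simulate a censored series from the fitted parameters
    estimateCARX(sim)            # re-estimate; returns the vector of parameter estimates
  }))
  cov(boot.est)                  # empirical covariance of the bootstrap estimates
}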
It is of practical interest to predict the future values
There are two cases. Case 1:
We now consider Case 2:
Note that the proposed predictive scheme is conditional on the future
covariate values
A main task in model diagnostics consists of checking whether or not the data are consistent with the model assumption that the innovations are independent and identically normally distributed with zero mean and constant variance. In the presence of censoring, it is unclear how to define the residuals. For the simple case when
A simulation study reported in Wang and Chan (2017b) suggests that the asymptotic null distribution of the Ljung-Box test statistic, for checking residual autocorrelations, based on the simulated residuals is the same as that based on the uncensored data. Thus, standard diagnostic tools for residual analysis may be applicable with the simulated residuals.
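For instance (a sketch, assuming fit is a fitted "carx" object; the residuals() method is described in the next section), the standard Ljung-Box test in base R can be applied directly to the simulated residuals:

res <- residuals(fit)                        # simulated residuals of the fitted model
Box.test(res, lag = 12, type = "Ljung-Box")  # the choice of lag is illustrative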
Real data are often marred by outliers. An outlier in a time series may arise from a perturbation that induces an unknown shift in an observation or an innovation, giving rise to a so-called additive or innovative outlier, respectively (Cryer and Chan 2008). An innovative outlier (IO) may appear as a contiguous block of additive outliers (AOs). Since it is harder to detect IOs in censored time series, we focus on detecting AOs, with a new method for doing so in censored time series.
As the number of outliers and their locations are generally unknown,
outlier detection is carried out one by one and iteratively. The
procedure begins with an outlier-free CARX model. Then we check for the
presence of additive outliers by a method to be described below. If an
outlier is detected at time
More specifically, we describe a method to detect any remaining additive
outliers given the data and a CARX model. For the sake of fast
computation, we consider the predictive distribution of
Let
In this section we present the R package carx, in which the estimation, prediction, and diagnostics procedures discussed in the previous section are implemented, assuming the normality of
First, let us introduce the class "cenTS", designed to encapsulate a censored time series with its observed values as well as the left (lower) and right (upper) censoring limits. The "cenTS" class inherits from the extensible time-series class "xts" in the R package xts. A "cenTS" object can be constructed by the following function call.
cenTS(value, order.by, lcl = NULL, ucl = NULL, value.name = "value", ...)
Note that value (whose name can be set via value.name) and order.by denote the observed values and their corresponding time indices, respectively, while lcl and ucl denote the left (lower) and right (upper) censoring limits, respectively. Other time series variables to be included as covariates in the regression can be supplied via additional arguments.
A "cenTS" object can be inspected by the print() and plot() methods. Any covariate time series can be retrieved by the xreg() method.
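To illustrate (a minimal sketch with made-up numbers), a short daily censored series with constant censoring limits -1 and 1 can be constructed and inspected as follows.

> dates <- seq(as.Date("2015-01-01"), by = "day", length.out = 5)
> y <- c(-1, 0.3, 1, 0.5, -0.2)   # observed values; -1 and 1 hit the limits
> cts <- cenTS(value = y, order.by = dates,
+              lcl = rep(-1, 5), ucl = rep(1, 5))
> print(cts)
> plot(cts)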
The foremost function is carx(), the fitting method for the S3 class "carx", whose signature is the following.
carx(formula, data = list(), p = 1,
     prmtrX = NULL, prmtrAR = NULL, sigma = NULL,
     y.na.action = c("skip", "as.censored"), addMu = TRUE,
     tol = 1e-4, max.iter = 500,
     CI.compute = FALSE, CI.level = 0.95,
     b = 1000, b.robust = FALSE, b.show.progress = FALSE,
     init.method = c("biased", "consistent"),
     cenTS = NULL, verbose = FALSE, ...)
The carx() method provides a simple-to-use interface for the user to input a formula, a data set, and other arguments to estimate a CARX model. It returns a "carx" object which stores the supplied data, the quasi-likelihood coefficient estimates, and other information. It accepts many optional arguments that control its behavior. The main arguments are listed below:
- formula is a formula representing the regression part of the model, such as y ~ x1 + x2.
- data denotes a data.frame which includes the following:
  - the response variable, with variable name identified by the supplied formula;
  - any covariate(s), with variable name(s) identified by the supplied formula;
  - a vector named ci of censoring indicators, whose components take values from -1, 0, and 1, indicating that the corresponding response is left-censored, observed, or right-censored, respectively;
  - lcl, representing the vector of left (lower) censoring limits; if not present, there is no lower limit;
  - ucl, representing the vector of right (upper) censoring limits; if not present, there is no upper limit.
- p denotes the autoregressive order of the regression errors, default = 1.
The above arguments supply the data structure including the censoring information, and specify the CARX model to be estimated. Although the function contains many optional arguments for fine-tuning the fitting algorithm and obtaining more information about the fitted model such as confidence intervals, we merely discuss the following two arguments:
- prmtrX, prmtrAR, and sigma are used to specify the initial values of the regression coefficients, the AR coefficients, and the innovation standard deviation, respectively.
- y.na.action is a string indicating how to handle missing (NA) values in y. If it is set to "skip" (the default), cases containing a missing value are skipped, so that the estimating equations of subsequent cases are conditional on the most recent p complete cases after the skipped case. If it is set to "as.censored", a missing y value is treated as left-censored with the left (lower) censoring limit replaced by positive infinity. The user may choose "skip" if there are a few long gaps in the response, and "as.censored" in the presence of numerous, non-contiguous missing values in y. Note that the presence of any missing values in x automatically forces y.na.action to be "skip". A minimal call using these arguments is sketched after this list.
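As a sketch (dat is a hypothetical data.frame with columns y, x1, x2, ci, lcl, and ucl laid out as described above), a model with AR(2) errors that treats missing responses as censored could be fitted as follows.

> fit <- carx(y ~ x1 + x2, data = dat, p = 2, y.na.action = "as.censored")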
As "carx"
is an S3
class, some generic methods have been implemented
so that the estimation function can be easily called for practical use
and more information about the model fitting can be easily extracted.
The function print()
simply returns a plain output of the fitted
model, while the summary()
function provides a more elaborate summary
of the fitted model including the estimates, their standard errors, 95%
confidence limits and p-values, based on parametric bootstrap, for each
model parameter, if CI.compute = TRUE
. The model parameters can be
conveniently extracted by the function coef()
, which returns all
coefficient estimates except that the error (innovation) standard
deviation is returned as the sigma
component of the list returned by
carx()
. logLik()
returns the maximum (quasi-)log-likelihood
AIC()
computes the AIC of the model with the
maximum log-likelihood replaced by maximum (quasi-)log-likelihood.
There are some other useful functions in the package. The method plot() draws the time plot of the censored response time series, superimposed with the fitted values (1-step-ahead predictors) from the supplied CARX model. The function predict() computes the multi-step-ahead point predictors and their associated prediction limits, based on a given model and future values of the covariates supplied by the user. The function fitted() returns the fitted values by calling the predict() method. The function residuals() returns the simulated residuals of the fitted model. The outlier detection method discussed earlier is implemented by the method outlier(). Model diagnostics based on the simulated residuals are visualized by the method tsdiag(), which consists of four subplots: the time plot of the standardized simulated residuals, the residuals versus fitted values plot, the residual autocorrelation function plot, and the plot of the p-values of the Ljung-Box test statistics for testing the absence of residual autocorrelation.
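As a quick sketch (assuming fit is a fitted "carx" object and newX a data.frame holding future covariate values), these utilities are invoked as follows.

> plot(fit)                                  # censored series with fitted values
> predict(fit, newxreg = newX, n.ahead = 4)  # 4-step-ahead predictors and limits
> head(residuals(fit))                       # simulated residuals
> outlier(fit)                               # detect additive outliers
> tsdiag(fit)                                # 4-panel residual diagnostics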
In this section we illustrate the various functions of the package through two examples, the first based on a simulated data set and the second on a real data set. Note that an extensive simulation study of the performance of the proposed estimation method and some model diagnostics can be found in Wang and Chan (2017b), which shows the robustness of the proposed estimation method to slight departures from the normality assumption of the innovations. We first load the carx package with the following command.
> library(carx)
To begin, we introduce the function carxSimCenTS()
for simulating data
from a CARX model, whose signature and default values of arguments are
shown below.
carxSimCenTS(nObs = 200, prmtrAR = c(-0.28,0.25),
prmtrX = c(0.2,0.4), sigma = 0.60, lcl = -1, ucl = 1, x = NULL,
seed = NULL, value.name = 'y', end.date = Sys.Date())
The carxSimCenTS() function generates a simulated "cenTS" time series of length nObs, with the AR parameters specified through prmtrAR, the regression coefficients through prmtrX, the innovation standard deviation through sigma, and the lower and upper censoring limits through lcl and ucl, respectively. The regressors can be supplied via x; if x is NULL, they are generated as independent standard normal variables. The user can also set the seed of the random number generator via seed to ensure repeatability. Because carxSimCenTS() encapsulates the simulated data in a "cenTS" object, whose construction requires a time/date-based index, the default treats the data as daily observations, with the end date specified by end.date. The user can set the name of the censored time series via value.name, but the names of the regressors are hard-coded as X1, X2, etc. There is another function, carxSim(), which returns a data.frame consisting of y, x, lcl, ucl and ci. We will mainly use the carxSimCenTS() function for simulation, as it encapsulates the data as a "cenTS" object.
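For instance (a minimal sketch), a censored data.frame of 100 observations can be generated and inspected as follows.

> sim <- carxSim(nObs = 100, seed = 1)
> head(sim)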
We first simulate a "cenTS"
series, using the carxSimCenTS()
function with essentially the default setting, i.e., simulate
interval-censored data from a regression model with a 2-dimensional
covariate comprising independent standard normal components whose
regression coefficients are
> datSim <- carxSimCenTS(seed = 0,end.date = as.Date('2015-08-01'))
A glimpse of the last few data cases of the series is instructive.
> tail(datSim)
y lcl ucl ci X1 X2
2015-07-26 1.000 -1 1 1 -0.5466 0.288
2015-07-27 -0.321 -1 1 0 -1.6887 -1.505
2015-07-28 -0.259 -1 1 0 -1.5724 1.519
2015-07-29 0.386 -1 1 0 -0.4050 0.367
2015-07-30 0.282 -1 1 0 0.3193 1.700
2015-07-31 0.181 -1 1 0 0.0404 0.644
Censoring rate: 0.205
The simulated series can be readily visualized using the plot function (see Figure 1).
> plot(datSim)
"cenTS"
series. The observed
responses are connected as a solid black line, and the lower and upper
censoring limits drawn as red dotted and dashed line with censored
observations marked by triangles pointing up and down, respectively.Then the parameters can be estimated by the carx()
method, with the
fitted model saved in the object named modelSim
.
> modelSim <- carx(y ~ X1 + X2 - 1,data = datSim, p = 2, CI.compute = TRUE)
Note that -1
in the formula specifies no intercept in the regression.
Information about the fitted model can be obtained directly by typing
the variable name modelSim
.
> modelSim
Call:
carx.formula(formula = y ~ X1 + X2 - 1, data = datSim, p = 2,
CI.compute = T)
Coefficients:
X1 X2 AR1 AR2
0.203 0.460 -0.234 0.279
Residual (innovation) standard deviation:
[1] 0.548
Censoring rate:
[1] 0.205
Sample size:
[1] 200
Number of parameters:
[1] 5
Quasi-log-likelihood:
[1] 20.1
AIC:
[1] -30.1
Confidence interval:
2.50% 97.50%
X1 0.131 0.2766
X2 0.383 0.5446
AR1 -0.390 -0.0919
AR2 0.123 0.4066
sigma 0.483 0.6122
Variance-covariance matrix:
X1 X2 AR1 AR2 sigma
X1 1.41e-03 1.98e-05 5.86e-05 -1.49e-04 9.95e-05
X2 1.98e-05 1.73e-03 1.28e-04 9.08e-05 2.45e-04
AR1 5.86e-05 1.28e-04 5.76e-03 2.08e-03 1.32e-04
AR2 -1.49e-04 9.08e-05 2.08e-03 5.15e-03 5.88e-05
sigma 9.95e-05 2.45e-04 1.32e-04 5.88e-05 1.05e-03
N.B.: Confidence intervals and variance-covariance matrix
are based on 1000 bootstrap samples.
A summary of the fitted model can be obtained by running the summary()
function.
> summary(modelSim)
Call:
carx.formula(formula = y ~ X1 + X2 - 1, data = datSim, p = 2,
CI.compute = T)
Coefficients:
Estimate StdErr lowerCI upperCI p.value
X1 0.2025 0.0375 0.1314 0.28 <2e-16 ***
X2 0.4602 0.0416 0.3826 0.54 <2e-16 ***
AR1 -0.2336 0.0759 -0.3903 -0.09 0.002 **
AR2 0.2792 0.0717 0.1232 0.41 <2e-16 ***
sigma 0.5480 0.0324 0.4834 0.61 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
AIC:
[1] -30.1
Although it can be shown that the quasi-likelihood estimator is asymptotically normal under some regularity conditions (Wang and Chan 2017b), the asymptotic variance-covariance matrix is intractable, so it is computed via parametric bootstrap. The summary function prints out the coefficient estimates and the innovation standard deviation estimate, together with their estimated (bootstrap) standard errors and lower and upper 95% confidence limits. Note that the bootstrap computation time increases almost linearly with the number of bootstrap replicates; the default is 1000. More specific information can be easily obtained by invoking various methods. For instance, logLik() returns the quasi-log-likelihood of the data, coef() returns the coefficients of the model, and the innovation standard deviation estimate can be accessed as modelSim$sigma.
> logLik(modelSim)
[1] 20.1
attr(,"class")
[1] "logLik.carx"
> coef(modelSim)
X1 X2 AR1 AR2
0.203 0.460 -0.234 0.279
> modelSim$sigma
[1] 0.548
The plot()
function provides a visual check of how the fitted values
track the data, with the censoring limits superimposed on the diagram,
see Figure 2.
> plot(modelSim)
Model diagnostics are facilitated by the tsdiag() function, which is similar to the tsdiag() function in the TSA package (Chan and Ripley 2012). The tsdiag() function generates a plot of 4 sub-figures, namely, the time plot of the simulated residuals, which is useful for visually checking for residual temporal patterns and/or outliers; the simulated residuals versus fitted values plot, which is useful for checking the adequacy of the linear regression model assumption; the residual autocorrelation function (ACF) plot, which quantifies the residual correlations; and the plot of the p-values of the Ljung-Box tests for the presence of residual autocorrelation. The following command generates the diagnostics plot for the model fitted to the simulated data.
> tsdiag(modelSim)
The uppermost diagram in Figure 3 shows no apparent residual temporal patterns. This is confirmed by the fact that none of the examined residual autocorrelations in the second sub-figure from the bottom are significant, and that the bottom sub-figure shows that the p-values of the Ljung-Box test statistics, based on the first several lags, all lie above the 5% significance level.
In this example we utilize the package to analyze changes in total phosphorus (P) concentrations in river water. Phosphorus is a major nutrient in river water, an excessive amount of which can result in environmental problems such as eutrophication. Phosphorus concentrations in many rivers in Iowa have been monitored under the ambient water quality program conducted by the Iowa Department of Natural Resources (Libra et al. 2004). An analysis of the change in P concentration has been reported by Wang et al. (2016). Here we illustrate the analysis for a particular data set from an ambient site located in the West Fork Cedar River at Finchford, with the data available in a "cenTS" object named pts in the carx package.
The P concentrations (in mg/l) were left-censored whenever they fell
below certain time-varying detection limits, resulting in a censoring
rate of
> names(pts)
[1] "logP" "lcl" "ci" "tInMonth" "logQ" "season"
The variable logP consists of the logarithmic P concentrations, lcl the corresponding censoring limits, ci the indicator variable of censoring, tInMonth the time index (in months), and logQ the corresponding logarithmic water discharge (Q), measured in cubic feet per second and obtained from the U.S. Geological Survey; season indicates the season to which each month belongs (see below for further details). P is generally correlated with the water discharge (Schilling et al. 2010), and we will explore the relationship between P and Q. See Figure 4 for the time plots of P, Q, and the historical censoring limits.
It is also conjectured that the association between logP and logQ may be seasonal. The variable season in pts is constructed to denote the quarter to which each month belongs, with Quarter 1 consisting of the first three months (January, February, and March), Quarter 2 comprising the next three months, and so on. Figure 5 illustrates the seasonal relationship between logP and logQ. It is also of interest whether there exists a linear trend in the logarithmic P.
Preliminary analysis (not reported here) suggests the presence of significant autocorrelation in the regression errors. Thus, the general model regresses logP on logQ with autoregressive regression errors, where the intercept and the regression coefficient of logQ may each be seasonal and a linear time trend may be included.

Figure 5: Scatter plot of logP versus logQ. The data are labeled by different numbers (1 for Quarter 1 and so on) and colors (black, red, green, and blue for Quarters 1, 2, 3, and 4, respectively). Least squares quarterly regression lines (solid, dashed, dotted, and dash-dotted) are superimposed with the same color as the data points.

Note that we need to determine whether the intercept and/or the regression coefficient of logQ are seasonal, and whether to include a time trend in the model, resulting in 8 combinations. Moreover, the AR order of the regression errors has to be specified. Given a maximal AR order, model selection over all candidates can be carried out by the function carxSelect(). Here, the maximal AR order is 3.
> arOrder <- 3
The list of models, named M1
to M8
, is specified in the following
code.
> s1 <- logP ~ logQ
> s2 <- logP ~ tInMonth + logQ
> s3 <- logP ~ logQ:as.factor(season)
> s4 <- logP ~ tInMonth + logQ:as.factor(season)
> s5 <- logP ~ as.factor(season) + logQ - 1
> s6 <- logP ~ tInMonth + as.factor(season) + logQ - 1
> s7 <- logP ~ as.factor(season) + logQ:as.factor(season) - 1
> s8 <- logP ~ tInMonth + as.factor(season) + logQ:as.factor(season) - 1
> fmls <- c(s1,s2,s3,s4,s5,s6,s7,s8)
> names(fmls) <- paste0('M',seq(1,8))
The model selection is performed by invoking the function carxSelect(), which has two required arguments, a list of formulas and the maximal AR order, plus an optional argument detect.outlier which, being TRUE by default, indicates whether outlier detection should be carried out when fitting each model. The function carxSelect() returns a "carx" object comprising an additional element selectionInfo, a list containing more information about the selection result, including aicMat, a matrix whose rows correspond to the candidate formulas and whose columns correspond to the AR orders, with each entry being the AIC of the corresponding model.
For the purpose of illustrating the prediction by the predict()
method, we use all data up to the end of 2012 for model selection and
fitting, and then check the model prediction against the observed data
in 2013.
> cs <- carxSelect(fmls, arOrder, data = pts['1998/2012'],
+ detect.outlier = TRUE, CI.compute = TRUE)
> print(round(cs$selectionInfo$aicMat,1))
AR1 AR2 AR3
M1 -41.8 -39.6 -38.2
M2 -39.7 -38.1 -36.8
M3 -51.2 -47.5 -45.1
M4 -49.4 -45.9 -43.6
M5 -53.2 -49.6 -47.1
M6 -51.5 -48.2 -45.8
M7 -53.9 -50.8 -47.6
M8 -52.8 -49.8 -46.8
A summary of the model fit is shown below.
> summary(cs)
Call:
carx.formula(formula = formula(m0), data = m0$data, p = m0$p,
CI.compute = ..1)
Coefficients:
Estimate StdErr lowerCI upperCI p.value
as.factor(season)1 -6.053239 0.633190 -7.325537 -4.9191 <2e-16 ***
as.factor(season)2 -3.455734 0.612764 -4.696565 -2.2607 <2e-16 ***
as.factor(season)3 -4.235837 0.414149 -5.028568 -3.4297 <2e-16 ***
as.factor(season)4 -4.854407 0.476015 -5.764154 -3.9324 <2e-16 ***
as.factor(season)1:logQ 0.633582 0.111002 0.436431 0.8566 <2e-16 ***
as.factor(season)2:logQ 0.248797 0.086893 0.073705 0.4247 0.006 **
as.factor(season)3:logQ 0.373538 0.068555 0.235532 0.5053 <2e-16 ***
as.factor(season)4:logQ 0.389721 0.087467 0.217672 0.5584 <2e-16 ***
AR1 0.075072 0.090671 -0.137839 0.2227 0.596
sigma 0.482897 0.028770 0.412907 0.5265 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
AIC:
[1] -53.85639
The fitted model can be visualized by calling the plot()
function,
which is shown in Figure 6. The fitted values appear to
track the data well.
> plot(cs)
We examine the goodness-of-fit of the fitted model via the tsdiag() function, which generates the 4 diagnostic plots in Figure 7. These plots indicate that the fitted model provides a good fit to the data.
> tsdiag(cs)
The selected model can be interpreted as follows. The linear trend was not selected, suggesting no significant long-term change in the P concentrations. The intercept and the regression coefficient of logQ were seasonal. The regression errors appeared to be mildly autocorrelated and can be modeled as an AR(1) process, although the AR coefficient was not significant.
Finally, we compute the prediction of logP
via the predict()
method
and compare the prediction with the actual data from January to October
2013. Note the prediction makes use of observed discharge data in 2013.
Figure 8 shows the point predictors (blue dashed
line) against the actual values (black solid line) and the 95%
prediction bands (red lines), which indicates that the prediction tracks
the actual data well.
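The prediction step can be sketched as follows (a sketch, assuming the 2013 values of logQ and season are available in pts; the exact plotting commands for Figure 8 are omitted). The returned object contains the point predictors and, in its ci component, the 95% prediction intervals.

> newX <- as.data.frame(pts['2013-01/2013-10'])[, c("logQ", "season")]
> predNew <- predict(cs, newxreg = newX, n.ahead = nrow(newX))
> predNew$ci   # 95% prediction intervals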
In this section, we report a simulation study on the effectiveness of model selection by minimizing the AIC. Recall that this functionality is implemented by the carxSelect() function, which outputs the model with the smallest AIC from a set of models with various AR orders up to some pre-specified maximum order.
We restrict the simulation study to the problem of selecting the AR order with a fixed model specification, i.e., the same set of regressors. The study is conducted as follows. We simulated 1000 series by calling carxSim() with the default setting, so that the true AR order equals 2, and for each simulated series we selected the best model among the models with AR orders from 1 to 6. Since the uncensored data were available in the simulation, we repeated the model selection with the uncensored data, for comparison with the results obtained from the censored data. This simulation study can be reproduced by the following code.
> singleTestSelectAROrder <- function(iter)
+ {
+ seed <- 1375911
+ cts <- carxSim(seed = iter*seed)
+ m0 <- carxSelect(list(f1 = as.formula(y~X1+X2-1)), max.ar = 6,
+ data=cts[,c("y","X1","X2")],detect.outlier = FALSE)
+ m1 <- carxSelect(list(f1 = as.formula(y~X1+X2-1)),max.ar = 6,
+ data = cts, detect.outlier = FALSE)
+ c(m0$fitted$p, m1$fitted$p)
+ }
> nRep <- 1000
> orders <- parallel::mclapply( 1:nRep, singleTestSelectAROrder,
+ mc.cores = parallel::detectCores() - 1)
> orders <- do.call(rbind, lapply(orders, matrix, ncol = 2, byrow = TRUE))
> freqComDat <- table(orders[, 1])  # tabulate selected AR orders, complete data
> freqCenDat <- table(orders[, 2])  # tabulate selected AR orders, censored data
The selected orders are reported in Table 1, which shows that the true order was recovered with an empirical probability of about 63% (626 out of 1000) using the complete data and about 53% (527 out of 1000) using the censored data.
AR order | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Frequency (complete data) | 17 | 626 | 139 | 97 | 69 | 52 |
Frequency (censored data) | 37 | 527 | 160 | 99 | 101 | 76 |
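These recovery proportions can be computed directly from the frequency counts in Table 1, e.g.:

> c(complete = 626, censored = 527) / nRep
complete censored 
   0.626    0.527 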
In this section we report a simulation study of the empirical performance of the model prediction procedure. A series of 210 observations was simulated using the default parameters of carxSim(), with the first 200 observations used to estimate the model and the last 10 used for comparison with the predicted values based on the fitted model and the simulated future covariate values. The above procedure was repeated 500 times. The empirical coverage rates of the 95% prediction intervals, computed by the following code, are reported in the table below.
> nRep = 500; nObs = 200; n.ahead=10
> runSimPredCR <- function()
+ {
+ set.seed(0)
+ crMat = matrix(nrow = n.ahead, ncol = nRep)
+ for(iRep in 1:nRep)
+ {
+ sdata = carxSim(nObs = nObs + n.ahead)
+ trainingData = sdata[1:nObs,]
+ testData = sdata[-(1:nObs),]
+ mdl = carx(y ~ X1 + X2 - 1, data = trainingData, p = 2)
+ newxreg = testData[,c('X1','X2')]
+ predVal = predict(mdl, newxreg = newxreg, n.ahead = n.ahead)
+ crInd = (predVal$ci[,1] <= testData$y) & (predVal$ci[,2] >= testData$y)
+ crMat[,iRep] = crInd
+ }
+ crPred = apply(crMat, 1, mean)
+ crPred  # return the empirical coverage rates so that the call below prints them
+ }
> runSimPredCR()
n.ahead | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Empirical Coverage | 0.954 | 0.932 | 0.946 | 0.952 | 0.946 | 0.946 | 0.938 | 0.946 | 0.948 | 0.952 |
In summary, we have reviewed the quasi-likelihood method for estimating a censored time series regression model and introduced the carx package, in which the quasi-likelihood estimation is implemented, together with other useful functions for model selection, prediction, diagnostics, and outlier detection. We illustrated the carx package with two major examples, and shed light on the effectiveness of model selection via minimizing the AIC and on the prediction accuracy.
Future work includes extending the method to more complex regression error structures than the AR model, for instance the more general ARIMA model, and updating the package in response to user feedback.