General-to-Specific (GETS) modelling provides a comprehensive, systematic and cumulative approach to modelling that is ideally suited for conditional forecasting and counterfactual analysis, whereas Indicator Saturation (ISAT) is a powerful and flexible approach to the detection and estimation of structural breaks (e.g. changes in parameters), and to the detection of outliers. To these ends, multi-path backwards elimination, single and multiple hypothesis tests on the coefficients, diagnostic tests and goodness-of-fit measures are combined to produce a parsimonious final model. In many situations a specific model or estimator is needed, a specific set of diagnostic tests may be required, or a specific fit criterion is preferred. In these situations, if the combination of estimator/model, diagnostic tests and fit criterion is not offered in a pre-programmed way by publicly available software, then the implementation of user-specified GETS and ISAT methods puts a large programming burden on the user. Generic functions and procedures that facilitate the implementation of user-specified GETS and ISAT methods for specific problems can therefore be of great benefit. The R package gets is the first software – both inside and outside the R universe – to provide a complete set of facilities for user-specified GETS and ISAT methods: user-specified models/estimators, user-specified diagnostics and user-specified goodness-of-fit criteria. The aim of this article is to illustrate how user-specified GETS and ISAT methods can be implemented with the R package gets.
General-to-Specific (GETS) modelling provides a comprehensive, systematic and cumulative approach to modelling that is ideally suited for scenario analysis, e.g. conditional forecasting and counterfactual analysis. To this end, well-known ingredients (tests of coefficients, multi-path backwards elimination, diagnostic tests and fit criteria) are combined to produce a parsimonious final model that passes the chosen diagnostics. GETS modelling originated at the London School of Economics (LSE) during the 1960s, and gained widespread acceptance and usage in economics during the 1980s and 1990s. The two-volume article collection by Campos et al. (2005) provides a comprehensive historical overview of key developments in GETS modelling. Software-wise, a milestone was reached in 1999, when the data-mining experiment of Lovell (1983) was revisited by Hoover and Perez (1999). They showed that automated GETS modelling could improve substantially upon the then-prevalent modelling approaches. The study spurred numerous new studies and developments, including Indicator Saturation (ISAT) methods, see Hendry et al. (2008) and Castle et al. (2015). ISAT methods provide a powerful and flexible approach to the detection and estimation of structural breaks (e.g. changes in parameters), and to the detection of outliers.
On CRAN, there are two packages that provide GETS methods. The second, named gets, is the successor of the first, named AutoSEARCH. Since October 2014 the development of AutoSEARCH has been frozen, and all development efforts have been directed towards gets, together with Dr. Felix Pretis and Dr. James Reade. An introduction to the gets package is provided by Pretis et al. (2018). However, it does not cover the user-specification capabilities of the package, some of which were not available at the time.
At the time of writing (September 2020), the publicly available software packages that provide GETS and ISAT methods are listed in Table 1. Although they offer GETS and ISAT methods for some of the most popular models in applications, in many situations a specific model or estimator will be needed, a specific set of diagnostic tests may be required, or a specific fit criterion is preferred. In these situations, if the combination of estimator/model, diagnostic tests and fit criterion is not offered in a pre-programmed way by the publicly available software, then the implementation of user-specified GETS and ISAT methods puts a large programming burden on the user. Generic functions and procedures that facilitate the implementation of user-specified GETS and ISAT methods for specific problems can therefore be of great benefit. The R package gets, since version 0.20 (September 2019), is the first software – both inside and outside the R universe – to provide a complete set of facilities for user-specified GETS and ISAT methods: user-specified models/estimators, user-specified diagnostics and user-specified goodness-of-fit criteria. The aim of this article is to illustrate how user-specified GETS and ISAT methods can be implemented.
Table 1: Publicly available software that provides GETS and ISAT methods.

| | \(\underset{(\text{MATLAB})}{\text{HP1999}}\) | \(\underset{(\text{OxMetrics})}{\text{Autometrics}}\) | \(\underset{(\text{Scilab})}{\text{Grocer}}\) | \(\underset{(\text{STATA})}{\text{genspec}}\) | EViews\(^*\) | \(\underset{(\text{R})}{\text{\emph{gets}}}\) |
|---|---|---|---|---|---|---|
| More than 10 paths | | Yes | Yes | Yes | Yes | Yes |
| GETS of linear regression | Yes | Yes | Yes | Yes | Yes | Yes |
| GETS of variance models | | | | | | Yes |
| GETS of logit/count models | | | Yes | | | |
| GETS of probit models | | Yes | Yes | | | |
| GETS of panel models | | Yes | | Yes | | |
| GETS of MIDAS models | | | | | Yes | |
| ISAT of linear regression | | Yes | Yes | | Yes | Yes |
| User-specified GETS | | | Yes | | | Yes |
| User-specified ISAT | | | | | | Yes |
| User-specified diagnostics | | | Yes | | | Yes |
| User-specified goodness-of-fit | | | | | | Yes |
| Menu-based GUI | | Yes | | | Yes | |
| Free and open source | Yes\(^{**}\) | | Yes | Yes\(^{**}\) | | Yes |
The rest of this article contains four sections. In the next section the model selection properties of GETS and ISAT methods are summarised. This is followed by a section that outlines the general principles of how user-specified estimation, user-specified diagnostics and user-specified goodness-of-fit measures are implemented. Next, a section with four illustrations follows. The final section contains a summary.
It is useful to denote a generic model for observation \(t\) as \[m\left(y_t, \boldsymbol{x}_t, \boldsymbol{\beta}\right), \qquad t=1,2,\ldots,n,\] where \(y_t\) is the dependent variable, \(\boldsymbol{x}_t = (x_{1t}, x_{2t}, \ldots)'\) is a vector of covariates, \(\boldsymbol{\beta}= (\beta_1, \beta_2, \ldots)'\) is a vector of parameters to be estimated and \(n\) is the sample size. Two examples are the linear regression model and the logit-model: \[\begin{aligned} y_t &= \beta_1 x_{1t} + \cdots + \beta_k x_{kt} + \epsilon_t, \label{eq:linear:regression:model} \end{aligned} \tag{1}\] \[\begin{aligned} Pr\left(y_t=1|\boldsymbol{x}_t\right) &= \frac{1}{1+\exp\left(-h_t\right)} \qquad\text{with}\qquad h_t = \beta_1 x_{1t} + \cdots + \beta_k x_{kt}. \label{eq:linearlogitmodel} \end{aligned} \tag{2}\] Note that, in a generic model \(m(y_t, \boldsymbol{x}_t, \boldsymbol{\beta})\), the dimension of \(\boldsymbol{\beta}\) is usually – but not necessarily – equal to the dimension of \(\boldsymbol{x}_t\). Here, for notational convenience, they will both have dimension \(k\) unless otherwise stated.
In (1)–(2), a variable \(x_{jt} \in \boldsymbol{x}_t\) is said to be relevant if \(\beta_j \neq 0\) and irrelevant if \(\beta_j = 0\). Let \(k_{\text{rel}} \geq 0\) and \(k_{\text{irr}} \geq 0\) denote the number of relevant and irrelevant variables, respectively, such that \(k_{\text{rel}} + k_{\text{irr}} = k\). GETS modelling aims at finding a specification that contains as many relevant variables as possible, and a proportion of irrelevant variables that on average equals the significance level \(\alpha\) chosen by the investigator. Put differently, if \(\widehat{k}_{\text{rel}}\) and \(\widehat{k}_{\text{irr}}\) are the retained number of relevant and irrelevant variables in an empirical application, respectively, then GETS modelling aims at satisfying \[\label{eq:asymptotic:gauge:and:potency} \mathbb{E}\left(\widehat{k}_{\text{rel}}/k_{\text{rel}}\right) \rightarrow 1 \quad \textnormal{and} \quad \mathbb{E}\left(\widehat{k}_{\text{irr}}/k_{\text{irr}}\right) \rightarrow \alpha \quad \textnormal{as} \quad n\rightarrow\infty, \tag{3}\] when \(k_{\text{rel}},k_{\text{irr}}>0\). If either \(k_{\text{rel}}=0\) or \(k_{\text{irr}}=0\), then the targets are modified in natural ways: If \(k_{\text{rel}}=0\), then the first target is \(\mathbb{E}\left(\widehat{k}_{\text{rel}}\right) = 0\), and if \(k_{\text{irr}}=0\), then the second target is \(\mathbb{E}\left(\widehat{k}_{\text{irr}}\right) = 0\). Sometimes, the irrelevance proportion \(\widehat{k}_{\text{irr}}/k_{\text{irr}}\) is also referred to as gauge, whereas the relevance proportion \(\widehat{k}_{\text{rel}}/k_{\text{rel}}\) is also referred to as potency.
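To make these targets concrete, here is a small Monte Carlo sketch (an illustration added here, not from the original text; it uses the `getsFun` function introduced below, and assumes its `print.searchinfo` argument to silence the search printout). With \(k_{\text{rel}}=0\), the average proportion of retained irrelevant variables should be close to \(\alpha\):

```r
library(gets)

set.seed(123)            #for reproducibility
n <- 100; k <- 10        #k_rel = 0: all k regressors are irrelevant
nsim <- 100
irrprop <- numeric(nsim) #irrelevance proportions

for(i in 1:nsim){
  y <- rnorm(n)                  #y unrelated to the regressors
  x <- matrix(rnorm(n*k), n, k)
  result <- getsFun(y, x, t.pval = 0.05, print.searchinfo = FALSE)
  irrprop[i] <- length(result$specific.spec)/k
}
mean(irrprop) #should be roughly equal to the significance level 0.05
```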
In targeting a relevance proportion equal to 1 and an irrelevance proportion equal to \(\alpha\), GETS modelling combines well-known ingredients: multi-path backwards elimination, tests on the \(\beta_j\)'s (both single and multiple hypothesis tests), diagnostic tests and fit-measures (e.g. information criteria). Let \(V\left(\widehat{\boldsymbol{\beta}}\right)\) denote the estimated coefficient covariance matrix. GETS modelling in the package gets can be described as proceeding in three steps:
1. Formulate a General Unrestricted Model (GUM), i.e. a starting model, that passes a set of chosen diagnostic tests. A regressor \(x_j\) in the GUM is non-significant if the \(p\)-value of a two-sided \(t\)-test is higher than the chosen significance level \(\alpha\), and each non-significant regressor constitutes the starting point of a backwards elimination path. The test-statistics of the \(t\)-tests are computed as \(\widehat{\beta}_j/se\left(\widehat{\beta}_j\right)\), where \(se\left(\widehat{\beta}_j\right)\) is the square root of the \(j\)th element of the diagonal of \(V\left(\widehat{\boldsymbol{\beta}}\right)\).

2. Undertake backwards elimination along multiple paths by removing, one-by-one, non-significant regressors as determined by the chosen significance level \(\alpha\). Each removal is checked for validity against the chosen set of diagnostic tests, and for parsimonious encompassing (i.e. a multiple hypothesis test) against the GUM. These multiple hypothesis tests on subsets of \(\boldsymbol{\beta}\) are implemented as Wald-tests.

3. Multi-path backwards elimination can result in multiple terminal models. The last step of GETS modelling consists of selecting, among the terminal models, the specification with the best fit according to a fit-criterion, e.g. the Schwarz (1978) information criterion.
In ISAT methods, the vector \(\boldsymbol{x}_t\) contains at least \(n-1\)
indicators in addition to other covariates that are considered.
Accordingly, standard estimation methods are infeasible, since the
number of variables in \(\boldsymbol{x}_t\) is usually larger than the
number of observations \(n\). The solution to this problem provided by
ISAT methods is to first organise \(\boldsymbol{x}_t\) into \(B\) blocks:
\(\boldsymbol{x}_t^{\left(1\right)}, \ldots, \boldsymbol{x}_t^{\left(B\right)}\).
These blocks need not be mutually exclusive, so a variable or subset of
variables can appear in more than one block. Next, GETS modelling is
applied to each block, which leads to \(B\) final models. Note that, in
the `isat` function, the default is that no diagnostic tests are
undertaken. Finally, a new round of GETS modelling is undertaken with
the union of the retained variables from the \(B\) blocks as covariates in
a new starting model (i.e. a new GUM). The model selection properties
targeted by ISAT methods are the same as those of GETS methods. Note,
however, that since the starting model (the GUM) contains at least \(n-1\)
regressors, a tiny significance level – e.g. \(\alpha=0.001\) or smaller
– is usually recommended in ISAT methods.
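As a minimal sketch of how an ISAT analysis can be invoked (an illustration added here; `iis`, `sis` and `t.pval` are standard arguments of the `isat` function), consider a simulated series with a location shift:

```r
library(gets)

set.seed(123)                    #for reproducibility
y <- c(rnorm(30), rnorm(30) + 4) #series with a location shift at t = 31

#step-indicator saturation with a tiny significance level:
result <- isat(y, sis = TRUE, iis = FALSE, t.pval = 0.001)
```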
In the current version of the package gets, version 0.25, the functions that admit user-specified estimation are `arx`, `getsm`, `getsFun` and `isat`. The user-specification principles are the same in all four. However, if the result (i.e. a list) returned from the user-specified estimator does not have the same structure as that returned from the default estimator `ols` (part of the gets package), then `arx`, `getsm` and `isat` may not always work as expected. This is particularly the case with respect to their extraction functions (e.g. `print`, `coef`, `residuals` and `predict`). User-specified diagnostics and goodness-of-fit functions are optional. By default, `getsFun` and `isat` do not perform any diagnostic tests, whereas the default in `arx` and `getsm` is to test the standardised residuals for autocorrelation and Autoregressive Conditional Heteroscedasticity (ARCH). This is implemented via the `diagnostics` function (part of the gets package). Also by default, all four functions use the Schwarz (1978) information criterion as goodness-of-fit measure, which favours parsimony, via the `infocrit` function (part of the gets package).
The recommended, most flexible and computationally most efficient approach to user-specified GETS modelling is via the `getsFun` function. Currently, it accepts up to twenty-five arguments. For the details of all these arguments, the reader is referred to the discussion of the `getsm` function (Section 5) in Pretis et al. (2018), and the help pages of `getsFun` (type `?getsFun`). For the purpose of user-specified estimation, user-specified diagnostics and user-specified goodness-of-fit measures, the most important arguments are:
```r
getsFun(y, x,
  user.estimator = list(name = "ols"),
  user.diagnostics = NULL,
  gof.function = list(name = "infocrit", method = "sc"),
  gof.method = c("min", "max"),
  ...)
```
Here, `y` is the left-hand side variable (the regressand), `x` is the regressor or design matrix, `user.estimator` controls which estimator or model to use and further arguments – if any – to be passed on to the estimator, `user.diagnostics` controls the user-specified diagnostics, if any, and `gof.function` and `gof.method` control the goodness-of-fit measure used. Note that `y` and `x` should satisfy `is.vector(y) == TRUE` and `is.matrix(x) == TRUE`, respectively, and enter in "clean" ways: if either `y` or `x` is an object of class, say, `"ts"` or `"zoo"`, then `getsFun` may not behave as expected. By default, the estimator `ols` is used with its default arguments, which implements OLS estimation via the `qr` function. The value `NULL` on `user.diagnostics` means no diagnostics checks are undertaken by default. The following code illustrates `getsFun` in a linear regression (the default), and reproduces the information printed while searching:
```r
n <- 40 #number of observations
k <- 20 #number of Xs

set.seed(123) #for reproducibility
y <- rnorm(n) #generate Y
x <- matrix(rnorm(n*k), n, k) #create matrix of Xs

#do gets with default estimator (ols), store output in 'result':
result <- getsFun(y, x)
```

The information printed while searching:

```
18 path(s) to search
Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
```
The object named `result` is a list, and the code `summary(result)` returns a summary of its contents. The most important entries are:
- `paths`: A list of vectors containing the searched paths. Each vector (i.e. path) indicates the sequence of deletion of the regressors. In the example above the first path is

  ```
  $paths[[1]]
   [1]  1 15  6  7  3 14 11 16  4  2  8 12  5  9 20 19 13
  ```

  That is, regressor no. 1 was the first to be deleted, regressor no. 15 was the second, regressor no. 6 was the third, and so on. If the regressors in `x` were named, then a name-representation of the first deletion path is obtained with `colnames(x)[ paths[[1]] ]`.

- `terminals`: A list of vectors with the distinct terminal models of the specification search. In the example above it is equal to

  ```
  $terminals[[1]]
  [1] 10 17 18

  $terminals[[2]]
  [1] 10 18
  ```

  That is, two terminal models. The first contains regressors 10, 17 and 18, whereas the second contains regressors 10 and 18.

- `terminals.results`: A data frame with the goodness-of-fit information of the terminal models. In the above example the entry is equal to

  ```
           info(sc)      logl  n k
  spec 1: 2.514707 -44.76081 40 3
  spec 2: 2.529923 -46.90958 40 2
  ```

  `spec 1` is short for specification 1, i.e. terminal model 1, and `spec 2` is short for specification 2, i.e. terminal model 2. `info(sc)` indicates that the Schwarz (1978) criterion (the default) is used as goodness-of-fit measure, whereas `n` and `k` denote the number of observations and parameters, respectively.

- `best.terminal`: An integer that indicates which terminal model is the best according to the goodness-of-fit criterion used. In the example above the value is 1.

- `specific.spec`: A vector of integers that indicates which regressors are contained in the best terminal model. In the above example it is

  ```
  $specific.spec
  [1] 10 17 18
  ```

  That is, the best terminal model contains regressors no. 10, 17 and 18.
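To continue working with the best terminal model, the retained indices can be used to subset the regressor matrix. A minimal usage sketch (an addition for illustration; it re-estimates the final model with `lm`, keeping the no-intercept convention used throughout):

```r
finalX <- x[, result$specific.spec, drop = FALSE] #regressors 10, 17 and 18
summary(lm(y ~ finalX - 1))                       #re-estimate the final model
```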
User-specified estimation is carried out via the `user.estimator` argument. By default, the argument equals `list(name = "ols")`, so the estimator `ols` is used with its default options. To use another estimator, the argument must be specified as a list containing at least one entry – a character – named `name` with the name of the estimator to be invoked. Optionally, the list can also contain an item named `envir`, a character, which indicates the environment in which the user-specified estimator resides. Additional entries in the list, if any, are passed on to the estimator as arguments.
The user-specified estimator must also satisfy the following:

1. It should be of the form `myEstimator(y, x, ...)`, where `y` is a vector and `x` is a matrix. In other words, while the name of the function is arbitrary, the first argument should be the regressand and the second the matrix of covariates.

2. The user-defined estimator should return a list with a minimum of six items:

   - `n` (the number of observations)
   - `k` (the number of coefficients)
   - `df` (degrees of freedom, used in the \(t\)-tests)
   - `coefficients` (a vector with the coefficient estimates)
   - `vcov` (the coefficient covariance matrix)
   - `logl` (a goodness-of-fit value, e.g. the log-likelihood)

   The items need not appear in this order. However, the naming should be exactly as indicated. If also the diagnostics and/or the goodness-of-fit criterion is user-specified, then additional objects may be required, see the subsections below on user-specified diagnostics and goodness-of-fit criteria. Note also that, if the goodness-of-fit criterion is user-specified, then `logl` can in certain situations be replaced by another item (which need not be named `logl`).

3. The user-defined estimator must be able to handle `NULL` regressor-matrices, i.e. situations where either `NCOL(x)` is 0 or `is.null(x)` is `TRUE`. This is needed in situations where a terminal model is empty (i.e. no regressors are retained).
To illustrate how the requirements above can be met in practice, suppose – as an example – that we would like to use the function `lm` for estimation rather than `ols`. The first step is then to make a function that calls `lm` while satisfying requirements 1 to 3:
```r
lmFun <- function(y, x, ...){

  ##create list:
  result <- list()

  ##n, k and df:
  result$n <- length(y)
  if( is.null(x) || NCOL(x) == 0 ){
    result$k <- 0
  }else{
    result$k <- NCOL(x)
  }
  result$df <- result$n - result$k

  ##call lm if k > 0:
  if( result$k > 0 ){
    tmp <- lm(y ~ x - 1)
    result$coefficients <- coef(tmp)
    result$vcov <- vcov(tmp)
    result$logl <- as.numeric(logLik(tmp))
  }else{
    result$coefficients <- NULL
    result$vcov <- NULL
    result$logl <- sum(dnorm(y, sd = sqrt(var(y)), log = TRUE))
  }

  ##return result:
  return(result)
}
```
The code

```r
getsFun(y, x, user.estimator = list(name = "lmFun"))
```

undertakes the same specification search as earlier, but uses `lmFun` rather than `ols`.
User-specified diagnostics are carried out via the `user.diagnostics` argument. The argument must be a list containing at least two entries: a character named `name` containing the name of the diagnostics function to be called, and an entry named `pval` that contains a vector with values between 0 and 1, i.e. the chosen significance level(s) for the diagnostics test(s). If only a single test is undertaken by the diagnostics function, then `pval` should be of length one. If two tests are undertaken, then `pval` should be of length two. And so on. An example of the argument when only a single test is undertaken is:

```r
user.diagnostics = list(name = "myDiagnostics", pval = 0.05)
```

That is, the name of the function is `myDiagnostics`, and the chosen significance level for the single test that is carried out is 5%. Optionally, just as when the estimator is user-specified, the list can contain an item named `envir`, a character, which indicates the environment in which the user-specified diagnostics function resides. Additional items in the list, if any, are passed on to the user-specified function as arguments.
The user-specified diagnostics function must satisfy the following:

1. It should be of the form `myDiagnostics(result, ...)`, where `result` is the list returned from the estimator in question, e.g. that of the user-specified estimator (recall requirement 2 in the previous section above).

2. It should return an \(m \times 3\) matrix that contains the \(p\)-value(s) of the test(s) in the third column, where \(m\geq 1\) is the number of tests carried out. So if only a single test is carried out, then \(m=1\) and the \(p\)-value should be contained in the third column. An example could look like:

   ```
             statistic df   pval
   normality        NA NA 0.0734
   ```

Note that the row-names and column-names in the example are not required. However, they do indicate what kind of information you may wish to put there for reporting purposes, e.g. by using the function `diagnostics` (also part of the gets package).
To illustrate how the requirements can be met in practice, suppose we would like to ensure the residuals are normal by testing for non-normality with the Shapiro-Wilk test function `shapiro.test`. In this context, its main argument is the residuals of the estimated model. The list returned by the user-defined estimator named `lmFun` above, however, does not contain an item with the residuals. The first step, therefore, is to modify the estimator `lmFun` so that the returned list also contains the residuals:
```r
lmFun <- function(y, x, ...){

  ##info needed for estimation:
  result <- list()
  result$n <- length(y)
  if( is.null(x) || NCOL(x)==0 ){
    result$k <- 0
  }else{
    result$k <- NCOL(x)
  }
  result$df <- result$n - result$k
  if( result$k > 0 ){
    tmp <- lm(y ~ x - 1)
    result$coefficients <- coef(tmp)
    result$vcov <- vcov(tmp)
    result$logl <- as.numeric(logLik(tmp))
  }else{
    result$coefficients <- NULL
    result$vcov <- NULL
    result$logl <- sum(dnorm(y, sd=sqrt(var(y)), log=TRUE))
  }

  ##residuals:
  if( result$k > 0 ){
    result$residuals <- residuals(tmp)
  }else{
    result$residuals <- y
  }

  return(result)
}
```
Computationally, the only modification appears under `##residuals`. We can now make the user-specified diagnostics function:

```r
myDiagnostics <- function(x, ...){
  tmp <- shapiro.test(x$residuals) #do the test
  result <- rbind( c(tmp$statistic, NA, tmp$p.value) )
  return(result)
}
```
The following code undertakes GETS modelling with the user-specified estimator defined above, and the user-specified diagnostics function using a 5% significance level for the latter:

```r
getsFun(y, x, user.estimator = list(name = "lmFun"),
  user.diagnostics = list(name = "myDiagnostics", pval = 0.05))
```
Note that if the chosen significance level for the diagnostics is sufficiently high, then no specification search is undertaken, because the starting model does not pass the non-normality test. With the current data, for example, a little trial and error reveals that this is the case for a level of about `pval = 0.35`.
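For instance, the following call (a sketch using the simulated data and the two user-defined functions from above) should terminate without any search, since the normality \(p\)-value of the starting model falls below the chosen level:

```r
getsFun(y, x, user.estimator = list(name = "lmFun"),
  user.diagnostics = list(name = "myDiagnostics", pval = 0.35))
```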
User-specified goodness-of-fit is carried out with the `gof.function` and `gof.method` arguments. The former indicates which Goodness-of-Fit (GOF) function to use, and the latter is a character that indicates whether the best model maximises (`"max"`) or minimises (`"min"`) the GOF criterion in question. The first argument is a list with a structure similar to earlier: it must contain at least one entry, a character named `name`, with the name of the GOF function to call. An example is:

```r
gof.function = list(name = "myGof")
```

Optionally, also here the list can contain an item named `envir`, a character, which indicates the environment in which the user-specified GOF function resides. Also as earlier, additional items in the list are passed on to the user-specified GOF function as arguments. The default value, for example, `gof.function = list(name = "infocrit", method = "sc")`, means the argument `method = "sc"` is passed on to the function `infocrit`; see the help pages of `infocrit` (type `?infocrit`) for information on the methods available via this function. The user-specified GOF function must satisfy the following:
1. It should be of the form `myGof(result, ...)`, where `result` is the list returned from the estimator in question, e.g. that of the user-specified estimator.

2. It should return a single numeric value, i.e. the value of the GOF measure in question.
To illustrate how the requirements can be met in practice, suppose we would like to use the adjusted \(R^2\) as our GOF measure in combination with our user-defined estimator. For the moment, the user-defined estimator `lmFun` does not contain the information necessary to compute the adjusted \(R^2\). In particular, it lacks the regressand `y`. However, this is readily added:
```r
lmFun <- function(y, x, ...){

  ##info needed for estimation:
  result <- list()
  result$n <- length(y)
  if( is.null(x) || NCOL(x)==0 ){
    result$k <- 0
  }else{
    result$k <- NCOL(x)
  }
  result$df <- result$n - result$k
  if( result$k > 0 ){
    tmp <- lm(y ~ x - 1)
    result$coefficients <- coef(tmp)
    result$vcov <- vcov(tmp)
    result$logl <- as.numeric(logLik(tmp))
  }else{
    result$coefficients <- NULL
    result$vcov <- NULL
    result$logl <- sum(dnorm(y, sd=sqrt(var(y)), log=TRUE))
  }

  ##residuals:
  if( result$k > 0 ){
    result$residuals <- residuals(tmp)
  }else{
    result$residuals <- y
  }

  ##info needed for r-squared:
  result$y <- y

  return(result)
}
```
The added part appears under `##info needed for r-squared`. A GOF function that returns the adjusted \(R^2\) is:

```r
myGof <- function(object, ...){
  TSS <- sum((object$y - mean(object$y))^2)
  RSS <- sum(object$residuals^2)
  Rsquared <- 1 - RSS/TSS
  result <- 1 - (1 - Rsquared) * (object$n - 1)/(object$n - object$k)
  return(result)
}
```
The following code undertakes GETS modelling with all three user-specified functions defined so far:

```r
getsFun(y, x, user.estimator = list(name = "lmFun"),
  user.diagnostics = list(name = "myDiagnostics", pval = 0.05),
  gof.function = list(name = "myGof"), gof.method = "max")
```
Incidentally, it leads to the same final model as when the default GOF function is used.
In multi-path backwards elimination search, one may frequently arrive at a specification that has already been estimated and tested. As an example, consider the following two paths:

```
$paths[[1]]
[1] 2 4 3 1 5

$paths[[2]]
[1] 4 2 3 1 5
```
In path 1, i.e. `paths[[1]]`, regressor no. 2 is the first to be deleted, regressor no. 4 is the second, and so on. In path 2, regressor no. 4 is the first to be deleted, regressor no. 2 is the second, and so on. In other words, after the deletion of the first two variables, the set of remaining variables (i.e. 3, 1 and 5) in the two paths is identical. Accordingly, knowing the result from the first path, in path 2 it is unnecessary to proceed further after having deleted the first two regressors. Setting the argument `turbo` equal to `TRUE` turns such a check on, and thus skips searches, estimations and tests that are unnecessary. The turbo comes at a small computational cost (often less than 1 second), since the check is undertaken at each deletion. This is why the default is `turbo = FALSE`. However, if the estimation time is noticeable, then turning the turbo on can reduce the search time substantially. As a rule of thumb, if each estimation takes 1 second or more, then turning the turbo on will (almost) always reduce the total search time.
Searching more paths may increase the relevance proportion or potency. Whether and to what extent this happens depends on the sample size \(n\), and on the degree of multicollinearity among the regressors \(\boldsymbol{x}_t\). If \(n\) is sufficiently large, or if the regressors are sufficiently uncorrelated, then searching fewer paths will not reduce the relevance proportion. In many situations, therefore, one may consider reducing the number of paths to increase the speed. This is achieved via the `max.paths` argument. Setting `max.paths = 10`, for example, means a maximum of 10 paths is searched. The paths that are searched are those of the 10 most insignificant variables (i.e. those with the highest \(p\)-values) in the starting model; a sketch combining this option with the turbo follows below.
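For example, on the data from above (both `max.paths` and `turbo` are ordinary `getsFun` arguments):

```r
getsFun(y, x, max.paths = 10, turbo = TRUE) #at most 10 paths, skip re-visits
```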
When implementing ISAT methods, the function `isat` undertakes a multi-round form of GETS modelling. In the first round the variables are split into \(B\) blocks, and then GETS modelling is undertaken on each block. This is a so-called "embarrassingly parallel" problem. To make `isat` search in parallel during the first round, simply set the argument `parallel.options` equal to an integer greater than 1. The integer determines how many cores/threads to use, and the command `detectCores()` can be used to find out how many cores/threads are available on the current machine. Remember, it is not recommended to use all the cores/threads available. Within `isat`, parallel computing is implemented with the `makeCluster` and `parApply` functions from the package parallel. If the time required by `makeCluster` to set up parallel computing is negligible relative to the total computing time (on an average computer the setup-time is about 1 second), then the total computing time may – in optimal situations – be reduced by a factor of about \(m-0.8\), where \(m>1\) is the number of cores/threads used for parallel computing.
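A minimal sketch (an illustration added here; it assumes at least two free cores/threads are available):

```r
library(parallel)
detectCores() #how many cores/threads are available?

#distribute the first-round block searches across 2 cores/threads:
result <- isat(y, sis = TRUE, t.pval = 0.001, parallel.options = 2)
```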
The function `glm` enables the estimation of a large number of specifications within the class of Generalised Linear Models (GLMs). Here, it is illustrated how GETS modelling can be implemented with GLMs. To fix ideas, the illustration is in terms of the logit-model.
Let \(y_t \in \left\{0,1\right\}\) denote the regressand of the logit-model given by \[Pr\left(y_t=1|\boldsymbol{x}_t\right) = \frac{1}{1+\exp\left(-h_t\right)}, \qquad h_t = \boldsymbol{\beta}'\boldsymbol{x}_t.\] Next, consider the following set of data:
```r
n <- 40 #number of observations
k <- 20 #number of Xs
set.seed(123) #for reproducibility
y <- round(runif(40)) #generate Y
x <- matrix(rnorm(n*k), n, k) #create matrix of Xs
```
In other words, the regressand \(y_t \in \left\{0,1\right\}\) is entirely independent of the 20 regressors in \(\boldsymbol{x}_t\). The following function enables GETS modelling of logit-models:
```r
logitFun <- function(y, x, ...){

  ##create list:
  result <- list()

  ##n, k and df:
  result$n <- length(y)
  if( is.null(x) || NCOL(x)==0 ){
    result$k <- 0
  }else{
    result$k <- NCOL(x)
  }
  result$df <- result$n - result$k

  ##call glm if k > 0:
  if( result$k > 0 ){
    tmp <- glm(y ~ x - 1, family = binomial(link="logit"))
    result$coefficients <- coef(tmp)
    result$vcov <- vcov(tmp)
    result$logl <- as.numeric(logLik(tmp))
  }else{
    result$coefficients <- NULL
    result$vcov <- NULL
    result$logl <- result$n*log(0.5)
  }

  ##return result:
  return(result)
}
```
To undertake the GETS modelling:

```r
getsFun(y, x, user.estimator = list(name = "logitFun"))
```
Two variables are retained, namely \(x_{5t}\) and \(x_{11t}\), at the default significance level of 5% (i.e. `t.pval = 0.05`). To reduce the chance of retaining irrelevant variables, the significance level can be lowered to, say, 1% by setting `t.pval = 0.01`.
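For example, with the data and the user-defined estimator from above:

```r
getsFun(y, x, user.estimator = list(name = "logitFun"), t.pval = 0.01)
```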
To implement GETS modelling for a different GLM, only two lines of code need to be modified in the user-defined function above. The first is the line that specifies the `family`, and the second is the one that contains the log-likelihood associated with the empty model (i.e. the line `result$logl <- result$n*log(0.5)`).
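As a sketch of the first kind of change (an illustration added here, with `probitFun` a hypothetical name), consider a probit analogue. Only the `family` line differs from `logitFun`; the empty-model log-likelihood is unchanged, since \(y_t\) is binary and the empty probit model also implies \(Pr\left(y_t=1\right)=0.5\):

```r
probitFun <- function(y, x, ...){
  result <- list()
  ##n, k and df:
  result$n <- length(y)
  if( is.null(x) || NCOL(x)==0 ){
    result$k <- 0
  }else{
    result$k <- NCOL(x)
  }
  result$df <- result$n - result$k
  ##call glm if k > 0:
  if( result$k > 0 ){
    tmp <- glm(y ~ x - 1, family = binomial(link = "probit")) #changed line
    result$coefficients <- coef(tmp)
    result$vcov <- vcov(tmp)
    result$logl <- as.numeric(logLik(tmp))
  }else{
    result$coefficients <- NULL
    result$vcov <- NULL
    result$logl <- result$n*log(0.5) #same as for the logit
  }
  return(result)
}

getsFun(y, x, user.estimator = list(name = "probitFun"))
```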
A `gets` method (S3) for the `"lm"` class of models

The package gets provides the generic function `gets`. This enables the creation of GETS methods (S3) for models of arbitrary classes (type `?S3Methods` for more info on S3 methods). Here, this is illustrated for models of class `"lm"`. Our aim is to be able to do the following:
```r
mymodel <- lm(y ~ x)
gets(mymodel)
```
That is, to first estimate a model of class `"lm"`, where `x` is a matrix of regressors, and then to conveniently undertake GETS modelling by simply applying the code `gets(.)` to the named object `mymodel`. To this end, a function named `gets.lm` that relies on `getsFun` will be created. In doing so, a practical aspect is how to appropriately deal with the intercept codewise. Indeed, as we will see, a notable part of the code in the user-defined function will be devoted to the intercept. The reason for this is that `lm` includes the intercept by default. Another practical aspect is whether to use `lm` or `ols` whenever a model is estimated by OLS (both employ the QR decomposition). The latter is simpler codewise and computationally faster, so we opt for `ols`. The function is:
```r
gets.lm <- function(object, ...){

  ##make y:
  y <- as.vector(object$model[, 1])
  yName <- names(object$model)[1]

  ##make x:
  x <- as.matrix(object$model[, -1])
  xNames <- colnames(x)
  if(NCOL(x) == 0){
    x <- NULL
    xNames <- NULL
  }else{
    if(is.null(xNames)){
      xNames <- paste0("X", 1:NCOL(x))
      colnames(x) <- xNames
    }
  }

  ##is there an intercept?:
  if(length(coef(object)) > 0){
    cTRUE <- names(coef(object))[1] == "(Intercept)"
    if(cTRUE){
      x <- cbind(rep(1, NROW(y)), x)
      xNames <- c("(Intercept)", xNames)
      colnames(x) <- xNames
    }
  }

  ##do gets:
  myspecific <- getsFun(y, x, ...)

  ##which are the retained regressors?:
  retainedXs <- xNames[myspecific$specific.spec]
  cat("Retained regressors:\n ", retainedXs, "\n")

  ##return result:
  return(myspecific)
}
```
Next, recall the Data Generation Process (DGP) of the first experiment:
```r
n <- 40 #number of observations
k <- 20 #number of Xs
set.seed(123) #for reproducibility
y <- rnorm(n) #generate Y
x <- matrix(rnorm(n*k), n, k) #create matrix of Xs
```
We can now do GETS modelling on models of class `"lm"` by simply applying the code `gets(...)` to the object in question. The following code first stores an estimated model of class `"lm"` in an object named `startmodel`, and then applies the newly defined function `gets.lm` to it:

```r
startmodel <- lm(y ~ x)
finallm <- gets(startmodel)
```
The information from the specification search is stored in the object called `finallm`, and during the search the following is printed:

```
18 path(s) to search
Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Retained regressors:
  X10 X17 X18
```

In other words, the retained regressors are no. 10, 17 and 18. Note that, due to the way the user-defined function has been put together, the intercept is treated as an ordinary regressor, and is therefore also subject to possible deletion.
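If one instead wants to protect the intercept from deletion, the `keep` argument of `getsFun` can be passed through the dots of `gets.lm` (a sketch; `keep` is an ordinary `getsFun` argument that excludes the indicated regressors from the search):

```r
finallm <- gets(startmodel, keep = 1) #regressor no. 1 is the intercept
```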
The function `arima` can be used to estimate a linear regression with deterministic regressors and an error-term that follows an ARMA. An example is
\[y_t = \boldsymbol{\beta}'\boldsymbol{x}_{t} + \epsilon_t, \qquad \epsilon_t = \phi_1\epsilon_{t-1} + \theta_1 u_{t-1} + u_t, \qquad u_t\sim WN\left(0,\sigma_u^2\right),\]
where \(\boldsymbol{x}_t\) is a vector of deterministic regressors and
\(WN\) is short for White Noise. The error \(\epsilon_t\) is thus governed
by an ARMA(1,1). Let \(x_t\) denote a (deterministic) step-shift variable
in which the step-shift occurs at observation 30,
i.e. \(x_t = 1\left(t\geq30\right)\). Next, consider the DGP given by
\[\label{eq:dgp:of:location-shift}
y_t = 4x_t + \epsilon_t, \qquad \epsilon_t = 0.4\epsilon_{t-1} + 0.1 u_{t-1} + u_t, \qquad u_t \sim N\left(0,1\right), \qquad t=1,\ldots,n \tag{4}\]
with \(n=60\). In other words, the series \(y_t\) is non-stationary and
characterised by a large location shift at \(t=30\). Figure
1 illustrates the evolution of \(y_t\), which is
generated with the following code:
```r
set.seed(123) #for reproducibility
eps <- arima.sim(list(ar = 0.4, ma = 0.1), 60) #epsilon
x <- coredata(sim(eps, which.ones = 30)) #step-dummy at t = 30
y <- 4*x + eps #the dgp
plot(y, ylab = "y", xlab = "t", lwd = 2)
```
By just looking at the graph, it seems clear that there is a location shift, but it is not so clear that it in fact occurs at \(t=30\). I now illustrate how the `arima` function can be used in combination with `getsFun` to automatically search for where the break occurs. The idea is to do GETS modelling over a set or block of step-indicators that cover the period in which the break visually appears to lie.
Specifically, the aim is to apply GETS modelling to the following
starting model with 11 regressors:
\[y_t = \sum_{i=1}^{11} \beta_i \cdot 1_{ \left\{ t \geq 24+i \right\} } + \epsilon_t, \qquad \epsilon_t = \phi_1 \epsilon_{t-1} + \theta_1 u_{t-1} + u_t.\]
To this end, we first need to make the user-specified estimator:
```r
myEstimator <- function(y, x){

  ##create list:
  result <- list()

  ##estimate model:
  if( is.null(x) || NCOL(x)==0 ){
    result$k <- 0
    tmp <- arima(y, order = c(1,0,1)) #empty model
  }else{
    result$k <- NCOL(x)
    tmp <- arima(y, order = c(1,0,1), xreg = x)
    result$coefficients <- tmp$coef[-c(1:3)]
    result$vcov <- tmp$var.coef
    result$vcov <- result$vcov[-c(1:3),-c(1:3)]
  }

  ##rename and re-organise things:
  result$n <- tmp$nobs
  result$df <- result$n - result$k
  result$logl <- tmp$loglik

  return(result)
}
```
Note that the estimator has been put together such that the ARMA(1,1) specification of the error \(\epsilon_t\) is fixed. As a consequence, the specification search is only over the regressors. The following code first creates the 11 step dummies, and then undertakes the GETS modelling:
```r
xregs <- coredata(sim(eps, which.ones = 25:35)) #11 step-dummies
getsFun(y, xregs, user.estimator = list(name = "myEstimator"))
```
Two step-dummies are retained, namely those of \(t=30\) and \(t=35\).
ISAT methods are computationally intensive, since at least \(n-1\) indicators are included as regressors. Accordingly, as \(n\) grows large, purpose-specific estimators can greatly reduce the computing time. One way of building such an estimator is by using tools from the package Matrix, see Bates and Maechler (2018). The code below illustrates this. First it loads the library, and then it creates a function named `olsFaster` that reproduces the structure of the estimation result returned by the function `ols` with `method = 3` (i.e. OLS with the ordinary coefficient-covariance), but with functions from Matrix. The code is:
```r
library(Matrix)

olsFaster <- function(y, x){
  out <- list()
  out$n <- length(y)
  if (is.null(x)){ out$k <- 0 }else{ out$k <- NCOL(x) }
  out$df <- out$n - out$k
  if (out$k > 0) {
    x <- as(x, "dgeMatrix")
    out$xpy <- crossprod(x, y)
    out$xtx <- crossprod(x)
    out$coefficients <- as.numeric(solve(out$xtx, out$xpy))
    out$xtxinv <- solve(out$xtx)
    out$fit <- as.vector(x %*% out$coefficients)
  }else{ out$fit <- rep(0, out$n) }
  out$residuals <- y - out$fit
  out$residuals2 <- out$residuals^2
  out$rss <- sum(out$residuals2)
  out$sigma2 <- out$rss/out$df
  if (out$k > 0) { out$vcov <- as.matrix(out$sigma2 * out$xtxinv) }
  out$logl <- -out$n * log(2 * out$sigma2 * pi)/2 - out$rss/(2 * out$sigma2)
  return(out)
}
```
Depending on the data and hardware/software configuration, the estimator may lead to a considerable speed improvement. In the following example, the function `system.time` suggests a speed improvement of about 20% on the current hardware/software configuration:

```r
set.seed(123) #for reproducibility
y <- rnorm(1000)
x <- matrix(rnorm(length(y)*20), length(y), 20)

#with ols:
system.time( finalmodel <- isat(y, mxreg = x, max.paths = 5) )

#with olsFaster:
system.time( finalmodel <- isat(y, mxreg = x, max.paths = 5,
  user.estimator = list(name = "olsFaster")) )
```
In many applications a specific model or estimator is needed, a specific set of diagnostic tests may be required, or a specific fit criterion is preferred. In these situations, if the combination of estimator/model, diagnostic tests and fit criterion is not already offered in a pre-programmed way by publicly available software, the implementation of user-specified GETS and ISAT methods puts a large programming burden on the user. This article has outlined how recent additions to the package gets greatly simplify the development of user-specified GETS and ISAT methods. The package is the first software – both inside and outside the R universe – to provide a complete set of facilities for user-specified GETS and ISAT methods.
I am grateful to the Editor Catherine Hurley, an anonymous reviewer, Damian Clarke, Éric Dubois, Jonas Kurle, Felix Pretis, James Reade, Moritz Schwarz, Gareth Thomas, participants at the UseR! 2019 conference (Toulouse, July), the Statistics Norway seminar (May, 2018) and the Norges Bank seminar (April, 2018) for their helpful comments, suggestions and questions.