Quantiles play a fundamental role in statistics. The quantile function fully characterises the distribution of a random variable and thus provides a description of the data that mirrors, and is equivalent to, that given by the corresponding cumulative distribution function. Working with quantiles offers many advantages, starting from their properties. The renewed interest in their use seen in recent years is due to theoretical, methodological, and software contributions that have broadened their applicability. This paper presents the R package Qtools, a collection of utilities for unconditional and conditional quantiles.
Quantiles have a long history in applied statistics, especially the median. The analysis of astronomical data by Galileo Galilei in 1632 (Hald 2003, 149) and geodetic measurements by Roger Boscovich in 1757 (Koenker 2005, 2) are presumably the earliest examples of application of the least absolute deviation (L1) estimator.
Functions implementing quantile methods can be found in common statistical software. A considerable number of R packages that provide such functions are available on the Comprehensive R Archive Network (CRAN). The base package stats contains basic functions to estimate sample quantiles or compute quantiles of common parametric distributions. The quantreg package (Koenker 2013) is arguably a benchmark for distribution-free estimation of linear quantile regression models, as well as the base for other packages which make use of linear programming (LP) algorithms (Koenker and D’Orey 1987; Koenker and Park 1996). Other contributions to the modelling of conditional quantile functions include packages for Bayesian regression, e.g. bayesQR (Benoit et al. 2014) and BSquare (Smith and Reich 2013), and the lqmm package (Geraci 2014; Geraci and Bottai 2014) for random-effects regression.
The focus of this paper is on the R package Qtools, a collection of models and tools for quantile inference. These include commands for
- quantile-based analysis of the location, scale and shape of a distribution;
- transformation-based quantile regression;
- goodness of fit and restricted quantile regression;
- quantile regression for discrete data;
- quantile-based multiple imputation.
The emphasis is on the first two topics listed above, as they represent the main contribution of the package; the other topics are described briefly for completeness.
Let Y denote a random variable with cumulative distribution function (CDF) F and quantile function (QF) Q(p) = inf{y : F(y) >= p}, for 0 < p < 1.
Quantiles enjoy a number of properties. An excellent overview is given by Gilchrist (2000). In particular, the Q-transformation rule (Gilchrist 2000), or equivariance to monotone transformations, states that if h is a non-decreasing function, then the p-th quantile of h(Y) is equal to h evaluated at the p-th quantile of Y; that is, quantiles are preserved under monotone transformations of the data.
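A quick numerical check of this equivariance property, using simulated normal data and the exponential as the monotone transformation (both arbitrary choices for this sketch):

```r
## Equivariance to monotone transformations: the p-th quantile of h(Y)
## equals h evaluated at the p-th quantile of Y. Sample quantiles satisfy
## this approximately as well, as this small simulation shows.
set.seed(1)
y <- rnorm(10000)
p <- c(0.25, 0.5, 0.75)
h <- exp                          # a non-decreasing transformation
quantile(h(y), probs = p)         # quantiles of the transformed variable
h(quantile(y, probs = p))         # transformation of the quantiles: nearly identical
```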
Sample quantiles for a random variable can be computed in several ways. The function quantile() in the base package stats provides nine different sample quantile estimators, which are based on the sample order statistics or the inverse of the empirical CDF. These estimators are distribution-free as they do not depend on any parametric assumption about the underlying distribution.
Let the mid-distribution function be defined as the CDF minus one half of the probability mass at each observed value, with the mid-quantile function given by its (interpolated) inverse. The discontinuities of the empirical CDF and of the corresponding sample quantile function, which are particularly consequential with discrete data, are thus smoothed out.
The package Qtools provides the functions midecdf() and midquantile(), which return objects of class "midecdf" or "midquantile", respectively, containing: the values or the probabilities at which mid-cumulative probabilities or mid-quantiles are calculated (x), the mid-cumulative probabilities or the mid-quantiles (y), and the functions that linearly interpolate those coordinates (fn). An example is shown below using data simulated from a Poisson distribution.
> library("Qtools")
> set.seed(467)
> y <- rpois(1000, 4)
> pmid <- midecdf(y)
> xmid <- midquantile(y, probs = pmid$y)
> pmid
Empirical mid-ECDF
Call:
midecdf(x = y)
> xmid
Empirical mid-quantile function
Call:
midquantile(x = y, probs = pmid$y)
A confidence interval for sample mid-quantiles can be obtained using confint.midquantile(). This function is applied to the output of midquantile() and returns an object of class "data.frame" containing the sample mid-quantiles and the lower and upper bounds of confidence intervals of a given level; the corresponding standard errors are returned in the attribute stderr. This is shown below using the sample y generated in the previous example.
> xmid <- midquantile(y, probs = 1:3/4)
> x <- confint(xmid, level = 0.95)
> x
midquantile lower upper
25% 2.540000 2.416462 2.663538
50% 3.822816 3.693724 3.951907
75% 5.254902 5.072858 5.436946
> attr(x, "stderr")
[1] 0.06295447 0.06578432 0.09276875
Finally, a plot method is available for both "midecdf" and "midquantile" objects. An illustration is given in Figure 1. The mid-distribution and mid-quantile functions are discrete and their values are marked by filled squares. The piecewise linear functions connecting the filled squares represent continuous versions of the CDF and QF, which interpolate between the steps of, respectively, the ordinary CDF and quantile functions. Note that the argument jumps is a logical value indicating whether values at jumps should be marked.
> par(mfrow = c(1,2))
> plot(pmid, xlab = "y", ylab = "CDF", jumps = TRUE)
> points(pmid$x, pmid$y, pch = 15)
> plot(xmid, xlab = "p", ylab = "Quantile", jumps = TRUE)
> points(xmid$x, xmid$y, pch = 15)
Since the cumulative distribution and quantile functions are two sides of the same coin, the location, scale, and shape (LSS) of a distribution can be examined using one or the other. Well-known quantile-based measures of location and scale are the median and inter-quartile range (IQR), respectively. Similarly, there are also a number of quantile-based measures for skewness and kurtosis (Groeneveld and Meeden 1984; Groeneveld 1998; Jones et al. 2011).
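To fix ideas, the sketch below computes such quantile-based LSS measures directly from sample quantiles; the skewness and shape indices are written so as to match the qlss() output shown later in this section, but the function itself is only an illustration, not the package's internal code.

```r
## Location (median), scale (IQR and IPR), and shape (skewness and shape
## indices) computed from sample quantiles. Illustrative sketch only.
lss_sketch <- function(y, p = 0.1) {
  q <- quantile(y, probs = c(p, 0.25, 0.5, 0.75, 1 - p), names = FALSE)
  ipr <- q[5] - q[1]                            # inter-quantile range at level p
  iqr <- q[4] - q[2]                            # inter-quartile range
  c(median = q[3], IQR = iqr, IPR = ipr,
    skewness = (q[5] + q[1] - 2 * q[3]) / ipr,  # quantile-based skewness index
    shape = ipr / iqr)                          # tail weight relative to the IQR
}
lss_sketch(faithful$waiting, p = 0.25)
#  median      IQR      IPR skewness    shape
#      76       24       24     -0.5        1
```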
Define the central portion of the distribution as that delimited by the quantiles Q(p) and Q(1 - p), for 0 < p < 0.5, with corresponding inter-quantile range IPR(p) = Q(1 - p) - Q(p); the IQR is the special case p = 0.25. It is easy to verify that a quantile function can be written in terms of the median, the IPR, and a quantile-based skewness index, which together give the LSS measures in (3).
The function qlss() provides a quantile-based LSS summary, with the indices defined in (3), of either a theoretical or an empirical distribution. It returns an object of class "qlss", which is a list containing measures of location (median), scale (IQR and IPR), and shape (skewness and shape indices) for each of the probabilities specified in the argument probs (by default, probs = 0.1). The quantile-based LSS summary of the normal distribution is given in the example below. The argument fun can take any quantile function whose probability argument is named p (this is the case for many standard quantile functions in R, e.g. qt(), qchisq(), qf(), etc.).
> qlss(fun = "qnorm", probs = 0.1)
call:
qlss.default(fun = "qnorm", probs = 0.1)
Unconditional Quantile-Based Location, Scale, and Shape
** Location **
Median
[1] 0
** Scale **
Inter-quartile range (IQR)
[1] 1.34898
Inter-quantile range (IPR)
0.1
2.563103
** Shape **
Skewness index
0.1
0
Shape index
0.1
1.900031
An empirical example is now illustrated using the faithful data set, which contains 272 observations of the waiting time between eruptions and the duration of the eruptions of the Old Faithful geyser. The sample quartiles of the two variables are reported in the table below.
|              | Minimum | Q1   | Q2   | Q3   | Maximum |
|--------------|---------|------|------|------|---------|
| Waiting time | 43.0    | 58.0 | 76.0 | 82.0 | 96.0    |
| Duration     | 1.6     | 2.2  | 4.0  | 4.5  | 5.1     |
Suppose the interest is in describing the distribution of waiting times. The density is plotted in Figure 2, along with the mid-quantile function. The distribution is clearly bimodal. Additional arguments for quantile(), including the argument type, can be passed on to qlss():
> y <- faithful$waiting
> par(mfrow = c(1,2))
> plot(density(y))
> plot(midquantile(y, probs = p), jumps = FALSE)
> qlss(y, probs = c(0.05, 0.1, 0.25), type = 7)
call:
qlss.numeric(x = y, probs = c(0.05, 0.1, 0.25), type = 7)
Unconditional Quantile-Based Location, Scale, and Shape
** Location **
Median
[1] 76
** Scale **
Inter-quartile range (IQR)
[1] 24
Inter-quantile range (IPR)
0.05 0.1 0.25
41 35 24
** Shape **
Skewness index
0.05 0.1 0.25
-0.3658537 -0.4285714 -0.5000000
Shape index
0.05 0.1 0.25
1.708333 1.458333 1.000000
At all three values of p considered, the skewness index is negative, indicating a distribution skewed to the left. In general, the shape index increases as p moves further into the tails.
Waiting times between eruptions are plotted against the durations of the
eruptions in Figure 3. Two clusters of observations can be
defined for durations below and above 3 minutes (see also Azzalini and Bowman 1990). The distribution shows a strong bimodality as
already illustrated in Figure 2. A dummy variable for
durations equal to or longer than 3 minutes is therefore created and included as a covariate in a linear quantile regression model, fitted with rq() in the package quantreg for several values of tau:
> require("quantreg")
> y <- faithful$waiting
> x <- as.numeric(faithful$eruptions >= 3)
> fit <- rq(formula = y ~ x, tau = c(0.1, 0.25, 0.5, 0.75, 0.9))
> fit
Call:
rq(formula = y ~ x, tau = c(0.1, 0.25, 0.5, 0.75, 0.9))
Coefficients:
tau= 0.10 tau= 0.25 tau= 0.50 tau= 0.75 tau= 0.90
(Intercept) 47 50 54 59 63
x 26 26 26 25 25
Degrees of freedom: 272 total; 270 residual
From the output above, it is quite evident that the distribution of waiting times is shifted by an approximately constant amount at all considered values of tau. Such a location-shift hypothesis can be tested formally by means of the Khmaladze test (see ?quantreg::KhmaladzeTest for further details). The critical values of the test and corresponding significance levels (Koenker 2005) are not readily available in the same package; these have been hard-coded in the Qtools function KhmaladzeFormat(), which can be applied to "KhmaladzeTest" objects. For the Old Faithful Geyser data, the result of the test is not statistically significant, as shown below.
> kt <- KhmaladzeTest(formula = y ~ x, taus = seq(.05, .95, by = .01))
> KhmaladzeFormat(kt, 0.05)
Khmaladze test for the location-shift hypothesis
Joint test is not significant at 10% level
Test(s) for individual slopes:
not significant at 10% level
Distribution-free quantile regression does not require introducing an assumption on the functional form of the error distribution (Koenker and Bassett 1978), but only weaker quantile restrictions (Powell 1994). Comparatively, the linear specification of the conditional quantile function in Equation (4) is a much stronger assumption and thus plays an important role for inferential purposes.
The problem of assessing the goodness of fit (GOF) is rather neglected
in applications of QR. Although some approaches to GOF have been
proposed
(Zheng 1998; Koenker and Machado 1999; He and Zhu 2003; Khmaladze and Koul 2004),
there is currently a shortage of software code available to users. The
function GOFTest()
implements a test based on the cusum process of the
gradient vector (He and Zhu 2003). Briefly, the test statistic is given by the largest eigenvalue of a matrix derived from the residual cusum (RC) process, and its distribution under the null hypothesis is approximated by resampling.
An example is provided further below using the New York Air Quality data set, which contains daily measurements of ozone concentration (ppb), solar radiation (langleys), wind speed and temperature collected in New York, May to September 1973; only the 111 complete cases are used here. Suppose that the model of interest is a linear specification of the conditional quantiles of ozone as a function of solar radiation:
> dd <- airquality[complete.cases(airquality), ]
> dd <- dd[order(dd$Solar.R), ]
> fit.rq <- rq(Ozone ~ Solar.R, tau = c(.1, .5, .9), data = dd)
> x <- seq(min(dd$Solar.R), max(dd$Solar.R), length = 200)
> yhat <- predict(fit.rq, newdata = data.frame(Solar.R = x))
> plot(Ozone ~ Solar.R, data = dd)
> apply(yhat, 2, function(y, x) lines(x, y), x = x)
As a function of solar radiation, the estimated quantiles of the ozone daily averages increase approximately linearly under this model (Figure 4). The RC test applied to the object fit.rq provides evidence of lack of fit for all quantiles considered, particularly for tau = 0.5:
> gof.rq <- GOFTest(fit.rq, alpha = 0.05, B = 1000, seed = 987)
> gof.rq
Goodness-of-fit test for quantile regression based on the cusum process
Quantile 0.1: Test statistic = 0.1057; p-value = 0.001
Quantile 0.5: Test statistic = 0.2191; p-value = 0
Quantile 0.9: Test statistic = 0.0457; p-value = 0.018
Complex dynamics may result in nonlinear effects in the relationship
between the covariates and the response variable. For instance, in
kinesiology, pharmacokinetics, and enzyme kinetics, the study of the
dynamics of an agent in a system involves the estimation of nonlinear
models; phenomena like human growth, certain disease mechanisms and the
effects of harmful environmental substances such as lead and mercury,
may show strong nonlinearities over time. In this section, the linear model is abandoned in favor of a more general, possibly nonlinear, specification of the conditional quantile function. Broadly, such a specification can be approached in three ways:

- a nonlinear parametric model, which may provide substantive interpretability, possible parsimony (in general, more parsimonious than polynomials), and validity beyond the observed range of the data. A nonlinear model depends on either prior knowledge of the phenomenon or the introduction of new, strong theory to explain the observed relationship, with potential predictive power. Estimation may present challenges;
- nonparametric regression, in which the complexity of the model is approximated by a sequence of locally linear polynomials (a naïve global polynomial trend can be considered a special case). A nonparametric model need not introduce strong assumptions about the relationship and is essentially data-driven. Estimation is based on linear approximations and, typically, requires the introduction of a penalty term to control the degree of smoothing; and
- a flexible, parsimonious family of parametric transformations applied to the response, seeking approximate linearity on the transformed scale. The data provide information about the "best" transformation among the family. Estimation is facilitated by the application of methods for linear models.
The focus of this section is on the third approach. More specifically, the functions available in Qtools implement the methods for transformation-based QR models developed by Powell (1991), Chamberlain (1994), Mu and He (2007), Dehbi et al. (2016) and Geraci and Jones (2015). Examples of approaches to nonlinear QR based on parametric models or splines can be found in Koenker and Park (1996) and Yu and Jones (1998), respectively.
The goal of transformation-based QR is to fit a linear quantile regression model on a suitably transformed scale of the response, as in model (7).
The package Qtools provides several transformation families, namely
the Box–Cox (Box and Cox 1964), Aranda-Ordaz (Aranda-Ordaz 1981), and Jones
(Jones 2007; Geraci and Jones 2015) transformations. A distinction between
these families is made in terms of the support of the response variable
to which the transformation is applied and the number of transformation
parameters. The Box–Cox model is a one-parameter family of
transformations which applies to singly bounded variables,
the Aranda-Ordaz transformations apply to doubly bounded variables, with one parameter and assuming both symmetric and asymmetric forms; and the Geraci and Jones proposals apply to singly or doubly bounded variables, with either one or two parameters, one of which models the symmetry (or lack thereof) of the transformation.
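For reference, here is a rough sketch of two of these transformations, written from their standard definitions in the cited references; treat it as an illustration and use the package's own routines in practice.

```r
## One-parameter transformations (standard forms, for illustration only):
## Box-Cox for a strictly positive response, symmetric Aranda-Ordaz for a
## response rescaled to the unit interval.
bc <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
ao_sym <- function(y, lambda) {
  # the limit lambda -> 0 corresponds to the logit
  if (abs(lambda) < 1e-8) log(y / (1 - y)) else
    (2 / lambda) * (y^lambda - (1 - y)^lambda) / (y^lambda + (1 - y)^lambda)
}
curve(bc(x, 0.5), from = 0.01, to = 5, ylab = "transformed scale")
curve(ao_sym(x, 0.5), from = 0.01, to = 0.99, ylab = "transformed scale")
```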
Originally, Box and Cox (1964) proposed using power transformations to address lack of linearity, homoscedasticity and normality of the residuals in mean regression modelling. Sakia (1992, 175) reported that "seldom does this transformation fulfil the basic assumptions of linearity, normality and homoscedasticity simultaneously as originally suggested by Box & Cox (1964). The Box-Cox transformation has found more practical utility in the empirical determination of functional relationships in a variety of fields, especially in econometrics".
Indeed, the practical utility of power transformations has been long
recognised in QR modelling
(Powell 1991; Chamberlain 1994; Buchinsky 1995; Mu and He 2007).
Model (7) is the Box–Cox QR model if
The symmetric Aranda-Ordaz transformation (and the asymmetric one, for certain values of its parameter) has a restricted range. To overcome the ensuing computational difficulties, Geraci and Jones (2015) proposed to use instead one-parameter transformations whose range is the whole real line.
[Table: symmetric and asymmetric one-parameter transformations, by support of the response variable.]
Since the transformation in ((13)) has range
All transformation models discussed above can be fitted using a two-stage (TS) estimator (Chamberlain 1994; Buchinsky 1995), whereby the transformation parameter is estimated by a grid search and, for each candidate value of the parameter, the linear coefficients are estimated by ordinary linear QR on the transformed response.
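As a concrete illustration of the two-stage idea, the sketch below runs a grid search for a Box–Cox quantile regression using quantreg::rq(); the function name and the choice of objective (average check loss of the back-transformed predictions) are assumptions made for this sketch and need not coincide with what tsrq() does internally.

```r
## Two-stage estimation sketch: for each lambda on a grid, fit a linear QR
## to the Box-Cox-transformed response and keep the lambda that minimises
## the average check loss on the original scale. Illustration only.
library(quantreg)

bc     <- function(y, l) if (abs(l) < 1e-8) log(y) else (y^l - 1) / l
bc_inv <- function(z, l) if (abs(l) < 1e-8) exp(z) else (l * z + 1)^(1 / l)
rho    <- function(u, tau) mean(u * (tau - (u < 0)))   # check (pinball) loss

ts_bc_rq <- function(y, x, tau = 0.5, lambdas = seq(-2, 2, by = 0.1)) {
  loss <- sapply(lambdas, function(l) {
    fit <- rq(bc(y, l) ~ x, tau = tau)
    rho(y - bc_inv(fitted(fit), l), tau)       # objective on the original scale
  })
  l_hat <- lambdas[which.min(loss)]            # NA losses (if any) are ignored
  list(lambda = l_hat, fit = rq(bc(y, l_hat) ~ x, tau = tau))
}
```

With the data frame dd created earlier, a call such as ts_bc_rq(dd$Ozone, dd$Solar.R, tau = 0.5) would return the selected lambda together with the corresponding rq fit on the transformed scale.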
There are several methods for interval estimation, including those based on large-sample approximations and on the bootstrap.
In Qtools, model fitting for one-parameter transformation models can be carried out using the function tsrq(). The formula argument specifies the model for the linear predictor as in (7), while the argument tsf provides the desired transformation: "bc" for the Box–Cox model, "ao" for the Aranda-Ordaz families, and "mcjI" for Proposal I transformations. Additional arguments in the function tsrq() include:

- symm, a logical flag to specify the symmetric or asymmetric version of "ao" and "mcjI";
- dbounded, a logical flag to specify whether the response variable is doubly bounded (the default is strictly positive, i.e. singly bounded);
- lambda, a numerical vector defining the grid of values on which the transformation parameter is estimated;
- conditional, a logical flag indicating whether the transformation parameter is assumed to be known, in which case lambda provides such a known value.
There are other functions to fit transformation models. The function rcrq() fits one-parameter transformation models using the RC estimator. The functions tsrq2() and nlrq2() are specific to Geraci and Jones's (2015) Proposal II transformations. The former employs a two-way grid search while the latter is based on Nelder–Mead optimization as implemented in optim(). Simulation studies in Geraci and Jones (2015) suggest that, although computationally slower, the two-way grid search is numerically more stable than the derivative-free approach.
A summary of the basic differences between all fitting functions is given in Table 3. The table also shows the methods available in summary.rqt() to estimate standard errors and confidence intervals for the model's parameters. Unconditional inference is carried out jointly on the transformation parameter and the linear coefficients. When summary.rqt() is executed with the argument conditional = TRUE, confidence intervals for the linear coefficients are estimated with the transformation parameter held fixed at its estimate, using any of the methods available for "rq" objects (i.e. "rank", "iid", "nid", "ker", and "boot" in quantreg::summary.rq()).
| Function name | Transformation parameters | Estimation | SE or CI (unconditional) | SE or CI (conditional) |
|---------------|---------------------------|------------|--------------------------|------------------------|
| tsrq()        | 1                         | Two-stage  | "iid", "nid", "boot"     | All types              |
| rcrq()        | 1                         | Residual cusum process | "boot"       | All types              |
| tsrq2()       | 2                         | Two-stage  | "boot"                   | All types              |
| nlrq2()       | 2                         | Nelder–Mead | "boot"                  | –                      |
In the New York Air Quality data example, a linear model was found
unsuitable to describe the relationship between ozone and solar
radiation. At closer inspection, Figure 4 reveals that the
conditional distribution of ozone may in fact be nonlinearly associated
with solar radiation, at least for some of the conditional quantiles.
The model fitted below applies the symmetric version of the Proposal I transformation to the ozone levels, with solar radiation entering linearly on the transformed scale:
> system.time(fit.rqt <- tsrq(Ozone ~ Solar.R, data = dd, tsf = "mcjI",
+ symm = TRUE, dbounded = FALSE, lambda = seq(1, 3, by = 0.005),
+ conditional = FALSE, tau = c(.1, .5, .9)))
user system elapsed
0.5 0.0 0.5
> fit.rqt
call:
tsrq(formula = Ozone ~ Solar.R, data = dd, tsf = "mcjI", symm = TRUE,
dbounded = FALSE, lambda = seq(1, 3, by = 0.005), conditional = FALSE,
tau = c(0.1, 0.5, 0.9))
Proposal I symmetric transformation (singly bounded response)
Optimal transformation parameter:
tau = 0.1 tau = 0.5 tau = 0.9
2.210 2.475 1.500
Coefficients linear model (transformed scale):
tau = 0.1 tau = 0.5 tau = 0.9
(Intercept) -3.3357578 -48.737341 16.557327
Solar.R 0.4169697 6.092168 1.443407
Degrees of freedom: 111 total; 109 residual
The TS estimator performs a grid search over the values supplied in lambda. It is advisable to start with a coarse, wide grid (e.g., seq(-5, 5, by = 0.5)), then center the interval about the resulting estimate using a finer grid, and re-fit the model.
The output above reports the estimates of the transformation parameter and of the linear coefficients (on the transformed scale) for each value of tau. Here, the quantities of interest are the predictions on the ozone scale and the marginal effect of solar radiation, which can be obtained using the function predict.rqt().
> x <- seq(9, 334, length = 200)
> qhat <- predict(fit.rqt, newdata = data.frame(Solar.R = x),
+ type = "response")
> dqhat <- predict(fit.rqt, newdata = data.frame(Solar.R = x),
+ type = "maref", namevec = "Solar.R")
The linear component of the marginal effect is calculated as the derivative of Ozone ~ beta1 * Solar.R with respect to Solar.R.
The calculations above are based on a sequence of 200 solar radiation values spanning the observed range (9 to 334 langleys), supplied via the argument newdata (if this argument is missing, the function returns the fitted values). There are three types of predictions available:
- "link": predictions of the conditional quantiles on the transformed scale, as in (7);
- "response": predictions of the conditional quantiles on the original scale, as in (8);
- "maref": predictions of the marginal effect, as in (9).
In the latter case, the argument namevec is used to specify the name of the covariate with respect to which the marginal effect has to be calculated. The function maref.rqt() computes derivatives symbolically using the stats function deriv() and these are subsequently evaluated numerically. While the nonlinear component of the marginal effect in Equation (9) is handled internally, maref.rqt() needs to re-arrange the formula argument in order to obtain an expression suitable for deriv(). The function can handle simple expressions with common functions like log(), exp(), etc., interaction terms, and "AsIs" terms (i.e. I()). However, using functions that are not recognised by deriv() will trigger an error.
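To see what deriv() produces, here is a toy call on a simple expression with made-up coefficient names (b0 and b1 are placeholders, not estimates from any fitted model):

```r
## stats::deriv() returns an expression that, when evaluated, gives the
## value of the function together with a "gradient" attribute holding the
## derivative with respect to the named variable.
d <- deriv(~ b0 + b1 * log(Solar.R), "Solar.R")
eval(d, list(b0 = 1, b1 = 2, Solar.R = 100))   # gradient = b1 / Solar.R = 0.02
```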
The predicted quantiles of ozone and the marginal effects of solar radiation are plotted in Figure 5 using the following code:
> par(mfrow = c(1, 2))
> plot(Ozone ~ Solar.R, data = dd, xlab = "Solar radiation (lang)",
+ ylab = "Ozone (ppb)")
> for(i in 1:3) lines(x, qhat[ ,i], lty = c(1, 2, 4)[i], lwd = 2)
> plot(range(x), range(dqhat), type = "n", xlab = "Solar radiation (lang)",
+ ylab = "Marginal effect")
> for(i in 1:3) lines(x, dqhat[ ,i], lty = c(1, 2, 4)[i], lwd = 2)
The effect of solar radiation on different quantiles of ozone levels shows a nonlinear behavior, especially at lower ranges of radiation.
> GOFTest(fit.rqt, alpha = 0.05, B = 1000, seed = 416)
Goodness-of-fit test for quantile regression based on the cusum process
Quantile 0.1: Test statistic = 0.0393; p-value = 0.025
Quantile 0.5: Test statistic = 0.1465; p-value = 0.005
Quantile 0.9: Test statistic = 0.0212; p-value = 0.127
The TS and RC estimators generally provide similar estimates and
predictions. However, computation based on the cusum process tends to be
somewhat slow, as shown further below. This is also true for the RC test
provided by GOFTest()
.
> system.time(fit.rqt <- rcrq(Ozone ~ Solar.R, data = dd, tsf = "mcjI",
+ symm = TRUE, dbounded = FALSE, lambda = seq(1, 3, by = 0.005),
+ tau = c(.1, .5, .9)))
user system elapsed
36.88 0.03 37.64
An example using doubly bounded transformations is demonstrated with the A-level Chemistry Scores data set. The latter is available from Qtools and consists of 31022 observations of A-level scores in Chemistry for England and Wales students in 1997. The data set also includes information on prior academic achievement as assessed with General Certificate of Secondary Education (GCSE) average scores. The goal is to evaluate the ability of GCSE scores to predict A-level scores. A-level scores are based on national exams in specific subjects (e.g. chemistry) with grades ranging from A to F. For practical purposes, scores are converted numerically as follows: A = 10, B = 8, C = 6, D = 4, E = 2, and F = 0. The response is therefore doubly bounded between 0 and 10. It should be noted that this variable is discrete although, for the sake of simplicity, here it is assumed that the underlying process is continuous.
The model considered here is a linear quantile regression of A-level scores on GCSE scores on the scale of an asymmetric Aranda-Ordaz transformation, fitted at tau = 0.9:
> data(Chemistry)
> fit.rqt <- tsrq(score ~ gcse, data = Chemistry, tsf = "ao", symm = FALSE,
+ lambda = seq(0, 2, by = 0.01), tau = 0.9)
The estimated transformation parameter and linear coefficients, with standard errors obtained from unconditional inference (se = "nid"), are reported below.
> summary(fit.rqt, conditional = FALSE, se = "nid")
call:
summary.rqt(object = fit.rqt, se = "nid", conditional = FALSE)
Aranda-Ordaz asymmetric transformation (doubly bounded response)
Summary for unconditional inference
tau = 0.9
Optimal transformation parameter:
Value Std. Error Lower bound Upper bound
0.000000000 0.001364422 -0.002674218 0.002674218
Coefficients linear model (transformed scale):
Value Std. Error Lower bound Upper bound
(Intercept) -4.3520060 0.015414540 -4.3822179 -4.3217941
gcse 0.8978072 0.002917142 0.8920898 0.9035247
Degrees of freedom: 31022 total; 31020 residual
Alternatively, one can estimate a two-parameter (Proposal II) transformation with tsrq2(), where the additional parameter is intended to capture the asymmetry:
> coef(tsrq2(score ~ gcse, data = chemsub, dbounded = TRUE,
+ lambda = seq(0, 2, by = 0.1), delta = seq(0, 2, by = 0.1),
+ tau = 0.9), all = TRUE)
(Intercept) gcse lambda delta
-4.1442274 0.8681246 0.0000000 0.0000000
These results confirm the asymmetric nature of the relationship. Similar estimates were obtained with nlrq2().
In conclusion, the package Qtools offers several options in terms of transformations and estimation algorithms, the advantages and disadvantages of which are discussed by Geraci and Jones (2015). In particular, they found that the symmetric Proposal I transformation improves considerably on the Box-Cox method and marginally on the Aranda-Ordaz transformation in terms of mean squared error of the predictions. Also, asymmetric transformations do not seem to improve sufficiently often on symmetric transformations to be especially recommendable. However, the Box-Cox and the symmetric Aranda-Ordaz transformations should not be used when individual out-of-range predictions represent a potential inconvenience as, for example, in multiple imputation (see the section further below). Finally, in some situations transformation-based quantile regression may be competitive with methods based on smoothing, as demonstrated by a recent application to anthropometric charts (Boghossian et al. 2016).
Quantile-based measures of location, scale, and shape can be assessed conditionally on covariates. A simple approach is to fit a linear model as in (4) or a transformation-based model as in (7), and then predict the conditional quantiles needed to compute the LSS measures in (3).
Estimation of conditional LSS measures can be carried out using the function qlss.formula(). The conditional model is specified in the argument formula, while the probability p is given in the argument probs. (As seen in Equation (3), the other probabilities needed for the decomposition of the conditional quantiles are 0.25, 0.5, 0.75, and 1 - p.) The argument type specifies the required type of regression model, more specifically "rq" for linear models and "rqt" for transformation-based models. The function qlss.formula() will pass any additional argument on to quantreg::rq() or tsrq() (e.g. subset, weights, etc.).
Let’s consider the New York Air Quality data example discussed in the
previous section and assume that the transformation model ((15))
holds for the quantiles
> fit.qlss <- qlss(formula = Ozone ~ Solar.R, data = airquality, type =
+ "rqt", tsf = "mcjI", symm = TRUE, dbounded = FALSE, lambda =
+ seq(1, 3, by = 0.005), probs = c(0.05, 0.1))
> fit.qlss
call:
qlss.formula(formula = Ozone ~ Solar.R, probs = c(0.05, 0.1),
data = airquality, type = "rqt", tsf = "mcjI", symm = TRUE,
dbounded = FALSE, lambda = seq(1, 3, by = 0.005))
Conditional Quantile-Based Location, Scale, and Shape
-- Values are averaged over observations --
** Location **
Median
[1] 30.2258
** Scale **
Inter-quartile range (IQR)
[1] 43.40648
Inter-quantile range (IPR)
0.05 0.1
88.02909 73.93430
**Shape**
Skewness index
0.05 0.1
0.5497365 0.5180108
Shape index
0.05 0.1
1.960315 1.661648
The output, which is of class "qlss", is a named list with the same LSS measures seen in the case of unconditional quantiles. However, these are now conditional on solar radiation. By default, the predictions are the fitted values, which are averaged over observations for printing purposes. An optional data frame for predictions can be given via the argument newdata in predict.qlss(). If interval = TRUE, the latter computes confidence intervals at the specified level using R bootstrap replications (it is, therefore, advisable to set the seed before calling predict.qlss()). The conditional LSS measures can be conveniently plotted using the plot.qlss() function, as shown in the code below. The argument z is required and specifies the covariate used for plotting. Finally, the argument whichp specifies one probability (and one only) among those given in probs that should be used for plotting (e.g. 0.1 in the example that follows).
> set.seed(567)
> x <- seq(9, 334, length = 200)
> qhat <- predict(fit.qlss, newdata = data.frame(Solar.R = x),
+ interval = TRUE, level = 0.90, R = 500)
> plot(qhat, z = x, whichp = 0.1, interval = TRUE, type = "l",
+ xlab = "Solar radiation (lang)", lwd = 2)
Figure 7 shows that both the median and the IQR of ozone
increase nonlinearly with increasing solar radiation. The distribution
of ozone is skewed to the right and the degree of asymmetry is highest
at low values of solar radiation. This is due to the extreme curvature of the median in that region.
Besides a loss of precision, high sparsity (low density) might also lead to a violation of the basic monotonicity property of quantile functions. Quantile crossing occurs when estimated conditional quantile curves corresponding to different probability levels cross one another, i.e. when the predicted quantile at a lower probability exceeds the one at a higher probability for the same covariate values.
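A simple way to check for crossing is to verify that the fitted values are non-decreasing across probability levels; the sketch below does this for a linear fit to the air quality data used earlier (any multi-tau "rq" fit would do).

```r
## Check for quantile crossing: with fitted values arranged in a matrix
## whose columns correspond to increasing values of tau, crossing occurs
## whenever a row is not non-decreasing. Illustrative sketch.
library(quantreg)
fit  <- rq(Ozone ~ Solar.R, tau = c(0.1, 0.5, 0.9), data = airquality)
yhat <- fitted(fit)                                 # one column per tau
crossed <- apply(yhat, 1, function(q) any(diff(q) < 0))
sum(crossed)                                        # observations affected by crossing
```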
The package Qtools provides the functions rrq(), rrq.fit() and rrq.wfit(), which are, respectively, the restricted analogues of rq(), rq.fit(), and rq.wfit() in quantreg. S3 methods print(), coef(), predict(), fitted(), residuals(), and summary() are available for objects of class "rrq". In particular, confidence intervals are obtained using the functions boot() and boot.ci() from the package boot. Future versions of the package will develop the function summary.rrq() to include asymptotic standard errors (Zhao 2000). An application is shown below using an example discussed by Zhao (2000).
The data set esterase, available from Qtools, consists of observations on the concentration of an esterase and the corresponding count of bindings recorded in an assay experiment.
> data("esterase")
> taus <- c(.1, .25, .5, .75, .9)
> fit.rq <- rq(Count ~ Esterase, data = esterase, tau = taus)
> yhat1 <- fitted(fit.rq)
> fit.rrq <- rrq(Count ~ Esterase, data = esterase, tau = taus)
> yhat2 <- fitted(fit.rrq)
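Confidence intervals for the restricted coefficients rely on the bootstrap, as noted above; a minimal sketch with boot() and boot.ci() follows (summary() on "rrq" objects performs this kind of resampling for you, so the statistic function below is purely illustrative).

```r
## Percentile bootstrap interval for the Esterase slope of the restricted
## fit at the median. Illustration of the boot()/boot.ci() mechanism only.
library(boot)
bfun <- function(d, i) coef(rrq(Count ~ Esterase, data = d[i, ], tau = 0.5))
set.seed(123)
b <- boot(esterase, bfun, R = 200)
boot.ci(b, type = "perc", index = 2)   # index = 2 selects the slope
```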
The predicted quantiles from the unrestricted and restricted fits (yhat1 and yhat2) can then be compared.
As discussed above, the reliability of the results depends on the validity of the restriction carried by RRQ. A quick check can be performed using the location–scale-shift specification of the Khmaladze test.
> kt <- KhmaladzeTest(formula = Count ~ Esterase, data = esterase,
+ taus = seq(.05,.95,by = .01), nullH = "location-scale")
> KhmaladzeFormat(kt, 0.05)
Khmaladze test for the location-shift hypothesis
Joint test is not significant at 10% level
Test(s) for individual slopes:
not significant at 10% level
The quantile crossing problem can also be approached by directly rearranging the fitted values so that they are monotonically ordered across probabilities (see the package Rearrangement). Such rearranged fits, like "rrq" objects, may be of practical utility when summarising the results.
Modelling conditional functions of discrete data is less common and, on a superficial level, might even appear as an unnecessary complication. However, a deeper look at its rationale will reveal that a distribution-free analysis can provide insightful information in the discrete case as it does in the continuous case. Indeed, methods for conditional quantiles of continuous distributions can be—and have been—adapted to discrete responses.
The package Qtools offers some limited functionalities for count and binary data. Further research is needed to develop the theory of QR for discrete data and to improve computational algorithms. Therefore, the user should use these functions with caution.
Let Y denote a count response variable and x a vector of covariates.
Machado and Santos Silva (2005) proposed a model for the conditional quantiles of a count response based on jittering, whereby uniform noise is added to the counts so that linear QR methods can be applied to a suitable transformation of the jittered variable.
Machado and Santos Silva's (2005) methods, including large-sample inference for the regression coefficients, are implemented in the function rq.counts(). The formula argument specifies a linear model as in (17), while the argument tsf provides the desired transformation of the (jittered) response. The function GOFTest() can be applied to "rq.counts" objects as well.
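To convey the jittering idea, here is a bare-bones sketch based on quantreg::rq(), assuming a log-linear specification for the jittered counts; the function name, the small offset and the number of jittered samples are choices made for this illustration and are not those of rq.counts().

```r
## Quantile regression for counts via jittering (Machado and Santos Silva
## 2005): add uniform noise to the counts, regress a log transformation of
## the jittered response on the covariates, and average the estimates over
## M jittered samples. Illustrative sketch only.
library(quantreg)

jitter_rq <- function(y, x, tau = 0.5, M = 50) {
  est <- replicate(M, {
    z <- y + runif(length(y))               # jittered counts
    zstar <- pmax(z - tau, 1e-5)            # keep the argument of log() positive
    coef(rq(log(zstar) ~ x, tau = tau))     # linear QR on the transformed scale
  })
  rowMeans(est)                             # average over jittered samples
}
```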
Qtools provides functions for modelling binary responses as well.
First of all, it is useful to note that the classical GLM for a binary response can be given a latent-variable formulation, in which the observed outcome indicates whether an underlying continuous variable exceeds a threshold. Maximum score estimation, originally developed by Manski (1975, 1985), is equivalent to estimating the conditional quantiles of such a latent variable. In Qtools, rq.bin() is the main function to obtain binary regression quantiles. It is a wrapper for the function rqbin.fit(), which calls Fortran code written for simulated annealing estimation (Goffe et al. 1994). Qtools offers a limited number of functions for objects of class "rq.bin", including coef() and predict(). These methods should still be considered experimental. In particular, the user should be aware that the estimates obtained from the fitting procedure may be sensitive to different settings of the simulated annealing algorithm, which can be controlled using rqbinControl().
Regression models play an important role in conditional imputation of missing values. QR can be used as an effective approach for multiple imputation (MI) when location-shift models are inadequate (Muñoz and Rueda 2009; Bottai and Zhen 2013; Geraci 2016).
In Qtools, mice.impute.rq() and mice.impute.rrq() are auxiliary functions written to be used along with the functions of the R package mice (Buuren and Groothuis-Oudshoorn 2011). The former is based on the standard QR estimator (rq.fit()), while the latter is based on its restricted counterpart (rrq.fit()). Both imputation functions allow for the specification of the transformation-based QR models discussed previously. The equivariance property is useful to achieve linearity of the conditional model and to ensure that imputations lie within some interval when imputed variables are bounded. An example is available from the help file ?mice.impute.rq using the nhanes data set; see also Geraci (2016) for a thorough description of these methods.
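Assuming mice's usual method dispatch (with Qtools loaded, method = "rq" is dispatched to mice.impute.rq()), a typical workflow might look as follows; see ?mice.impute.rq for the authoritative example.

```r
## Multiple imputation of the nhanes data using quantile-regression-based
## imputation, followed by a standard mice analysis-and-pooling step.
## The analysis model and the seed are arbitrary choices for this sketch.
library(mice)
library(Qtools)

data(nhanes, package = "mice")
imp <- mice(nhanes, method = "rq", m = 5, seed = 101, printFlag = FALSE)
fit <- with(imp, lm(chl ~ bmi + age))    # analyse each completed data set
summary(pool(fit))                       # pool results via Rubin's rules
```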
Quantiles have long occupied an important place in statistics. The package Qtools builds on recent methodological and computational developments of quantile functions and related methods to promote their application in statistical data modelling.
This work was partially supported by an ASPIRE grant from the Office of the Vice President for Research at the University of South Carolina. I wish to thank anonymous reviewers for their helpful comments and Alexander McLain for his help with revising the final draft of the manuscript.