In quantile regression, various quantiles of a response variable
In numerous applications, quantile regression is used to evaluate the
impact of a
indeed always yield a complete description of the conditional distribution. For our purposes, it is useful to recall that the conditional quantiles in ((1)) can be equivalently defined as
where
For fixed
Many approaches have been developed to estimate conditional quantiles.
After the seminal paper of Koenker and Bassett (1978) that introduced linear
quantile regression, much effort has been made to consider nonparametric
quantile regression. The most classical procedures in this vein are the
nearest neighbor estimators (Bhattacharya and Gangopadhyay 1990), the (kernel) local linear
estimators (Yu and Jones 1998) or the spline-based estimators
(Koenker et al. 1994; Koenker and Mizera 2004). For related work, we also refer to,
e.g. Fan et al. (1994), Gannoun et al. (2002), Muggeo et al. (2013) and Yu et al. (2003). There also exists
a wide variety of R functions/packages dedicated to the estimation of
conditional quantiles. Among them, let us cite the functions rqss
(only for gcrq
(only for
Recently, Charlier et al. (2015b) proposed a nonparametric quantile regression
method based on the concept of optimal quantization. Optimal
quantization replaces the (typically continuous) covariate
The goal of this paper is to describe an R package, called
QuantifQuantile
(Charlier et al. 2015c), that allows to perform quantization-based quantile
regression. This includes the data-driven selection of the grid size
The paper is organized as follows. The first section briefly recalls the construction of quantization-based quantile regression introduced in Charlier et al. (2015b,a) and explains the various steps needed to obtain the resulting estimators. The second section lists the main functions of QuantifQuantile and describes their inputs and outputs. Finally, the third section provides several examples that illustrate the use of the various functions. We conclude the paper by comparing our method with R alternatives on a real data set. An illustration of the function computing optimal quantization grids is given in the Appendix, which can be of independent interest in numerical probability, finance or numerical integration where quantization is extensively used (Pagès 1998; Pagès et al. 2004).
As mentioned above, the R package we describe in this paper implements the Charlier et al. (2015b,a) quantization-based methodology to perform nonparametric quantile regression. This section describes this methodology.
Let
Obviously, the choice of the grid has a significant impact on the
quality of this approximation. Under the assumption that
Based on optimal quantization of
where
As we will see below, whenever independent copies
First, an initial grid –
Second,
The optimal grid provided by this algorithm is then
Assume now that a sample
First, we compute the optimal grid
Then, the conditional quantiles are estimated by
where
It is shown in Charlier et al. (2015b) that under mild assumptions,
When the sample size
where
Both for the original estimators
Charlier et al. (2015a) proposed the following data-driven method to choose
This, however, is infeasible, since the population
quantiles
where
As shown in Charlier et al. (2015a) through simulations, both
If quantiles are to be estimated at various orders
where the sum is computed over the various
Charlier et al. (2015a) performed extensive comparisons between the
quantization-based estimators in ((5)) – based on the
efficient data-driven selection method for
Unlike the local linear and local constant estimators from Yu and Jones (1998),
that are usually based on a global-in-rqss
and qss
functions) tend to perform poorly for complex link functions,
since they always provide piecewise linear reference curves
(Koenker et al. 1994). Moreover, the current implementation of the rqss
function only supports dimensions 1 and 2, whereas our package allows to
compute quantization-based estimators in any dimension
This section provides a description of the various functions offered in the R package QuantifQuantile. We first detail the three functions that allow to estimate conditional quantiles through quantization. Then we describe a function computing optimal quantization grids.
QuantifQuantile is composed of three main functions that each provide
estimated conditional quantiles
in ((4))-((5)). These functions work in a
similar way but address different values of
The function QuantifQuantile
is suitable for
The function QuantifQuantile.d2
addresses the case
Finally, QuantifQuantile.d
can deal with an arbitrary value
of
Combined with the plot
function, the first two functions provide
reference curves and reference surfaces, respectively. No graphical
outputs can be obtained from the third function if
The three functions share the same arguments, but not necessarily the
same default values. For each function, using args()
displays the
various arguments and corresponding default values :
QuantifQuantile(X, Y, alpha = c(0.05, 0.25, 0.5, 0.75, 0.95), x = seq(min(X), max(X),
length = 100), testN = c(35, 40, 45, 50, 55), p = 2, B = 50, tildeB = 20,
same_N = TRUE, ncores = 1)
QuantifQuantile.d2(X, Y, alpha = c(0.05, 0.25, 0.5, 0.75, 0.95),
x = matrix(c(rep(seq(min(X[1, ]), max(X[1, ]), length = 20), 20),
sort(rep(seq(min(X[2, ]), max(X[2, ]), length = 20), 20))), nrow = 2, byrow = TRUE),
testN = c(110, 120, 130, 140, 150), p = 2, B = 50, tildeB = 20, same_N = TRUE,
ncores = 1)
QuantifQuantile.d(X, Y, x, alpha = c(0.05, 0.25, 0.5, 0.75, 0.95),
testN = c(35, 40, 45, 50, 55), p = 2, B = 50, tildeB = 20, same_N = TRUE, ncores = 1)
We now give more details on these arguments.
X
: a QuantifQuantile
). The columns of this
matrix are the
Y
: an
alpha
: an
x
: a QuantifQuantile
and
QuantifQuantile.d2
, required by QuantifQuantile.d
). The columns
of this matrix are the x
is not
provided when calling QuantifQuantile
, then it is set to a vector
of QuantifQuantile.d2
, then the default for x
is a matrix whose
x
are obtained by considering all
combinations of those 20 values for the first component with the 20
values for the second onex
obtained
in this fashion would increase exponentially with the dimension
testN
: an
p
: a real number larger than or equal to one (optional for all
three functions). This is the parameter
B
: a positive integer (optional for all three functions). This is
the number of bootstrap replications
tildeB
: a positive integer (optional for all three functions).
This is the number of bootstrap replications
same_N
: a boolean variable (optional for all three functions). If
same_N=TRUE
, then a common value of same_N=FALSE
, then optimal values
of
ncores
: number of cores to use. These functions can use parallel
computation to save time by increasing this parameter. Parallel
computation relies on mclapply
from parallel package, hence is
not available on Windows.
All three functions return the following list of objects, which is of
class ‘QuantifQuantile
’ :
hatq_opt
: an same_N=TRUE
, then the
entry same_N=FALSE
, then it is
rather fitted.values
function.
N_opt
: a positive integer (if same_N=TRUE
) or an same_N=FALSE
). Depending on
same_N
, this provides the value
of
hatISE_N
: an testN
, allows to assess the
global convexity of these ISEs. Hence, it can be used to indicate
whether or not the choice of testN
was adequate. This will be
illustrated in the examples below.
hatq_N
: an testN
.
From this output, it is easy by fixing the third entry to get the
matrix of the testN
.
The arguments X
, Y
, x
, alpha
, and testN
are also reported
in this response list.
Moreover, when the optimal value N_opt
selected is on the boundary of
testN
, which means that testN
most likely was not well chosen, a
warning message is printed.
The ‘QuantifQuantile
’ class response can be used as argument of the
functions plot
(only for summary
and print
. The plot
function draws the observations and plots the estimated conditional
quantile curves (plot
also has an
optional argument ise
. Setting this argument to TRUE
(the default is
FALSE
), this function, that can be used for any dimension same_N=FALSE
or
same_N=TRUE
, respectively; see the examples below for details. If
Besides the functions that allow to estimate conditional quantiles and
to plot the corresponding reference curves/surfaces, QuantifQuantile
provides a function that computes optimal quantization grids. This
function, called choice.grid
, admits the following arguments :
X
: a X
is used as a stimulus
in the stochastic gradient algorithm to get an optimal grid.
N
: a positive integer (required). The size of the desired
quantization grid.
ng
: a positive integer (optional). The number of desired
quantization grids. The default is
p
: a real number larger than or equal to one (optional). This is
the parameter
In some cases, it may be necessary to have several quantization grids,
such as in ((5)), where B + tildeB
grids are needed. The
three functions computing quantization-based conditional quantiles then
call the function choice.grid
with ng
X
(the
The output is a list containing the following elements :
init_grid
: a N
ng
real array.
The entry
opti_grid
: a N
ng
real array.
The entry
In this section, we illustrate on several examples the use of the
functions described above. Examples 1–3 restrict to
QuantifQuantile
/QuantifQuantile.d2
and provide graphical
representations in each case. Example 4 deals with a three-dimensional
covariate, without graphical representation. An illustration of the
function choice.grid
is given in the Appendix.
We generate a random sample
set.seed(258164)
n <- 300
X <- runif(n, -2, 2)
Y <- X^2 + rnorm(n)
We test the number QuantifQuantile
with these arguments and stock the
response in res
.
testN <- seq(10, 30, by = 5)
res <- QuantifQuantile(X, Y, testN = testN)
No warning message is printed, which means that this choice of testN
was adequate. To assess this in a graphical way, we use the function
plot
with ise
argument set to TRUE
that plots hatISEmean_N
(obtained by taking the mean of hatISE_N
over alpha
) against the
various testN
.
plot(res, ise = TRUE)
Figure 1A provides the resulting graph, which confirms that
testN
was well chosen since hatISEmean_N
is larger for smaller and
larger values of N_opt
. We then plot the corresponding
estimated conditional quantiles curves in Figure 1B. The default
colors of the points and of the curves are changed by using the
col.plot
argument. This argument is a vector of size
1+length(alpha)
, whose first component fixes the color of the data
points and whose remaining components determine the colors of the
various reference curves.
col.plot <- c("grey", "red", "orange", "green", "orange", "red")
plot(res, col.plot = col.plot, xlab = "X", ylab = "Y")
It is natural to make the grid testN
finer. Of course, the more
N_opt
obtained in
the first stage.
testN <- c(seq(10, 20, by = 1), seq(25, 30, by = 5))
res_step1 <- QuantifQuantile(X, Y, testN = testN)
plot(res_step1, ise = TRUE, col.plot = col.plot, xlab = "X", ylab = "Y")
|
|
(a) | (b) |
|
|
(c) | (d) |
The resulting graphs are provided in Figures 1C–1D,
respectively. We observe that the value of N_opt
is made more precise,
since we now get N_opt
=18 instead of 15. The resulting estimated
conditional quantiles curves in Figure 1B are very similar to
the ones in Figure 1D.
So far, we used the default value same_N=TRUE
, which leads to
selecting an same_N=FALSE
.
|
|
(a) | (b) |
|
|
(c) | (d) |
testN <- c(seq(10, 30, by = 5))
res2 <- QuantifQuantile(X, Y, testN = testN, same_N = FALSE)
plot(res2, ise = TRUE, col.plot = col.plot, xlab = "X", ylab = "Y")
testN <- c(seq(10, 20, by = 1), seq(25, 30, by = 5))
res2_step1 <- QuantifQuantile(X, Y, testN = testN, same_N = FALSE)
plot(res2_step1, ise = TRUE, col.plot = col.plot, xlab = "X", ylab = "Y")
The results are provided in Figure 2. Comparing the left
panels of Figures 1 and 2, we see that when
choosing N_opt = 15
with same_N = TRUE
and N_opt = 15
or 20 (depending on alpha
) for same_N = FALSE
. When
we refine the grid testN
, we find analogously N_opt = 18
for
same_N = TRUE
and N_opt = 14
, 15
, or 16
for same_N = FALSE
. In
the present setup, thus, both methods provide relatively close optimal
The sample considered here is made of
set.seed(642516)
n <- 1000
X <- matrix(runif(n*2, -2, 2), ncol = n)
Y <- apply(X^2, 2, sum) + rnorm(n)
We test B
and tildeB
to reduce the computational burden, which is heavier
for QuantifQuantile.d2
. Here, a
warning message is printed informing us that testN
was not
well-chosen. We confirm it with the function plot
with ise
argument
set to TRUE
.
testN <- seq(40, 90, by = 10)
B <- 20
tildeB <- 15
res <- QuantifQuantile.d2(X, Y, testN = testN, B = B, tildeB = tildeB)
plot(res, ise = TRUE)
Figure 3A provides the resulting graph. The parameter testN
was
not well chosen since hatISEmean_N
becomes smaller and smaller as
N_opt
increases. We then adapt the choice of testN
accordingly and
rerun the procedure, which identifies an optimal
testN <- seq(80, 130, by = 10)
res <- QuantifQuantile.d2(X, Y, testN = testN, B = B, tildeB = tildeB)
plot(res, ise = TRUE)
We then plot the corresponding estimated conditional quantile surfaces in Figure 4.
col.plot <- c("black", "red", "orange", "green", "orange", "red")
plot(res, col.plot = col.plot, xlab = "X_1", ylab = "X_2", zlab = "Y")
|
|
(a) | (b) |
|
|
(a) | (b) |
plot
function for α = 0.05, 0.25, 0.50, 0.75 and 0.95.
This example aims at illustrating the proposed estimated reference
curves on a real data set and at comparing them with some competitors.
In this example, the ncores
parameter of QuantifQuantile
function
was set to the number of cores detected by R minus 1. The data used
here, that are included in the QuantifQuantile package, involves
several variables related to employment, housing and environment
associated with percentage of owners living in their primary residence
,
percentage of buildings area
) and percentage of middle-range employees
, population density
),
respectively. In both cases,
set.seed(644925)
data(gironde)
X <- gironde[[2]]$owners
Y <- gironde[[4]]$building
testN <- seq(5, 15, by = 1)
res <- QuantifQuantile(X, Y, testN = testN, same_N = F, ncores = detectCores() - 1)
col.plot <- c("grey", "red", "orange", "green", "orange", "red")
plot(res, col.plot = col.plot, xlab = "X", ylab = "Y")
The same exercise is repeated with percentage of middle-range employees
, population density
). For
each rqss
in the package quantreg. Since the
parameter rqss
, we
selected it through AIC (via the AIC
function), separately for each
order
rank <- rank(X, ties.method = "random")
X[rank] <- X
Y[rank] <- Y
alpha <- c(0.05, 0.25, 0.5, 0.75, 0.95)
x <- seq(min(X), max(X), length = 100)
n <- length(X)
lambda <- array(0, dim = c(length(alpha), 1))
interval = c(0.2, 10)
for(i in 1:length(alpha)){
AIC_crit <- function(lambda){
AIC(rqss(Y ~ qss(X, lambda = lambda), tau = alpha[i]))[1]
}
select_lambda <- optimize(AIC_crit, interval = interval)
lambda[i] <- select_lambda$min
}
hatq <- array(0, dim = c(length(x), length(alpha)))
fitted_matrix <- array(0, dim = c(n,length(alpha)))
for(l in 1:length(alpha)){
res_rqss <- rqss(Y ~ qss(X, lambda = lambda[l]),tau = alpha[l])
fitted_matrix[,l] <- fitted(res_rqss)
}
plot(X, Y, col = col.plot[1], cex = 0.7);
for(i in 1:length(alpha)){
lines(fitted_matrix[, i] ~ X, type = "l", col = col.plot[i+1])
}
The same exercise is repeated for
Of course, the computational burden is also an important issue. Therefore, Table 1 gathers, for each method and each regression problem, the average and standard deviation of the computing times in a collection of 50 runs (these 50 runs were considered to make results more reliable). In each case, our method is faster than its spline-based competitor.
|
|
(a) | (b) |
|
|
(c) | (d) |
QuantifQuantile
(left) and rqss
(right),
for regression R1
(top) and regression R2 (bottom). In each
case, the quantile orders considered are α = 0.05, 0.25, 0.50, 0.75 and 0.95.
QuantifQuantile |
rqss |
|
---|---|---|
2.83 (0.117) | 4.39 (0.119) | |
2.47 (0.085) | 4.08 (0.115) |
To treat an example with population density
and the three
covariates percentage of farmers
, percentage of unemployed workers
, and percentage of managers
.
In this setup, no graphical output is available. We therefore restrict
to a finite collection of
The function QuantifQuantile.d
is then evaluated for the response and
covariates indicated above, and with the arguments alpha
testN
x
being the
ncores
being the number of cores detected by R minus
1.
data(gironde)
set.seed(729848)
X1 <- gironde[[1]]$farmers
X2 <- gironde[[1]]$unemployed
X3 <- gironde[[1]]$managers
Y <- gironde[[2]]$density
X <- matrix(c(X1, X2, X3), nr = 3, byrow = TRUE)
n <- length(X)/3
d <- 3
alpha <- c(0.25, 0.5, 0.75)
x1 <- round(c(mean(X1), mean(X2), mean(X3)))
x2 <- round(c((mean(X1) + max(X1))/2, mean(X2), mean(X3)))
x3 <- round(c(mean(X1), (mean(X2) + max(X2))/2, mean(X3)))
x4 <- round(c(mean(X1), mean(X2), (mean(X3) + max(X3))/2))
x5 <- round(c((mean(X1) + max(X1))/2, (mean(X2) + max(X2))/2, mean(X3)))
x6 <- round(c((mean(X1) + max(X1))/2, mean(X2), (mean(X3) + max(X3))/2))
x7 <- round(c(mean(X1), (mean(X2) + max(X2))/2, (mean(X3) + max(X3))/2))
x8 <- round(c((mean(X1) + max(X1))/2, (mean(X2) + max(X2))/2, (mean(X3) + max(X3))/2))
x <- matrix(c(x1, x2, x3, x4, x5, x6, x7, x8), nr = d)
res <- QuantifQuantile.d(X, Y, x , alpha = alpha, testN = seq(5, 10, by = 1),
same_N = F, ncores = detectCores() - 1)
round(fitted.values(res), 2)
This provided fitted.values
function then allowed to return the
following matrix hatq_opt
of estimated conditional quantiles :
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 44.30 22.59 39.50 71.59 25.05 22.40 76.37 24.19
[2,] 80.07 32.31 81.24 161.85 35.01 31.92 145.29 38.18
[3,] 139.16 46.50 223.13 344.92 53.73 47.01 402.98 73.19
This collection of (estimated) conditional quartiles allows to
appreciate the impact of a marginal perturbation of the covariates on
In this paper, we described the package QuantifQuantile that allows to
implement the quantization-based quantile regression method introduced
in Charlier et al. (2015b,a). The package is simple to use, as the
function QuantifQuantile
and its multivariate versions essentially
only require providing the covariate and response as arguments. Since
the choice of the tuning parameter plot
can also be
used as guide to change adequately the value of the parameter testN
in
the various functions. Moreover, a graphical illustration is directly
provided by the same function plot
when the dimension of the covariate
is smaller than or equal to 2. Finally, this package also contains a
function that provides optimal quantization grids, which might be useful
in other contexts, too.
Finally, we stress that quantization-based estimators, like most
nonparametric smoothing procedures, are likely to perform poorly in
high-dimensional situations due to the curse of dimensionality. For
large
The authors are grateful to the Editor and two anonymous referees for their careful reading and insightful comments that led to a significant improvement of the original manuscript. The first author’s research is supported by a Bourse FRIA of the Fonds National de la Recherche Scientifique, Communauté française de Belgique. The second author’s research is supported by an A.R.C. contract from the Communauté Française de Belgique and by the IAP research network grant P7/06 of the Belgian government (Belgian Science Policy).
choice.grid
We here put to work the function choice.grid
in the univariate and
bivariate cases. This function provides the “optimal” grid generated by
the stochastic gradient algorithm described earlier. As above mentioned,
quantization was extensively used in many other fields as numerical
integration, cluster analysis, numerical probability or finance
(Pagès 1998; Pagès et al. 2004). Therefore, this function can be of interest
outside the regression setup considered here.
We start with the univariate case and generate a random sample of
size N
ng
set.seed(643625)
n <- 500
X <- runif(n, -2, 2)
N <- 15
ng <- 1
res <- choice.grid(X, N, ng)
# Plots of the initial and optimal grids
plot(X, rep(1, n), col = "grey", cex = 0.5, ylim = c(-0.1, 1.1), yaxt = "n",
ylab = "")
points(res$init_grid, rep(0.5, N), col = "red", pch = 16, cex = 1.2)
points(res$opti_grid, rep(0, N), col = "forestgreen", pch = 16, cex = 1.2)
|
|
(a) | (b) |
choice.grid
(in
green).
|
|
(a) | (b) |
|
|
(c) | (d) |
choice.grid
(bottom).
Since the parent distribution is uniform over choice.grid
function is much closer to the population optimal grid
than the initial one. Recalling that the stochastic gradient algorithm
in choice.grid
performs as many iterations as observations in the
original sample, it is not surprising that the optimal grid associated
with the sample of size
Finally, we turn to the bivariate case and generate two random samples
of size choice.grid
was applied to these
samples with N
ng
choice.grid
(here as well, the population optimal grid should be
uniformly spread over the support of the underlying distribution). Also,
it is still the case that the resulting optimal grid is better when
based on a larger sample size
set.seed(345689)
n <- 2000
X <- matrix(runif(n*2, -2, 2), nc = n)
N <- 30
ng <- 1
res <- choice.grid(X, N, ng)
col <- c("red", "forestgreen")
plot(res$init_grid[1,,1], res$init_grid[2,,1], col = col[1], xlab = "", ylab = "")
plot(res$opti_grid[1,,1], res$opti_grid[2,,1], col = col[2], xlab = "", ylab = "")
quantreg, quantregGrowth, QuantifQuantile, rgl
Econometrics, Environmetrics, Optimization, ReproducibleResearch, Robust, SpatioTemporal, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
x
obtained
in this fashion would increase exponentially with the dimension Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Charlier, et al., "QuantifQuantile: An R Package for Performing Quantile Regression through Optimal Quantization", The R Journal, 2015
BibTeX citation
@article{RJ-2015-021, author = {Charlier, Isabelle and Paindaveine, Davy and Saracco, Jérôme}, title = {QuantifQuantile: An R Package for Performing Quantile Regression through Optimal Quantization}, journal = {The R Journal}, year = {2015}, note = {https://rjournal.github.io/}, volume = {7}, issue = {2}, issn = {2073-4859}, pages = {65-80} }