Sufficient dimension reduction (SDR) turns out to be a useful dimension reduction tool in high-dimensional regression analysis. (Weisberg 2002) developed the dr-package to implement the four most popular SDR methods. However, the package does not provide any clear guidelines as to which method should be used given a data. Since the four methods may provide dramatically different dimension reduction results, the selection in the dr-package is problematic for statistical practitioners. In this paper, a basis-adaptive selection algorithm is developed in order to relieve this issue. The basic idea is to select an SDR method that provides the highest correlation between the basis estimates obtained by the four classical SDR methods. A real data example and numerical studies confirm the practical usefulness of the developed algorithm.
Sufficient dimension reduction (SDR) in the regression of
A space spanned by the columns of
According to (Cook 1998a), for a non-singular transformation of
sliced inverse regression [SIR; Li (1991)]:
sliced average variance estimation [SAVE; Cook and Weisberg (1991)]:
principal Hessian directions [pHd; Li (1992); Cook (1998b)]:
covariance method (cov
These four SDR methods can be implemented in the
dr-package. The package
provides the basis estimates and dimension test results. A detailed
review on the dr-package is given in (Weisberg 2002). The following question is
a critical issue in using dr: which one among the four methods has to
be used? Even with the same data, one can obtain dramatically different
dimension reduction results from the four methods in dr, yet there is
no clear and up-to-date guideline on method selection. It is known that
SIR and
The purpose of the paper is to develop a basis-adaptive selection
algorithm for selecting an SDR method. The basic idea is to select the
one that gives the highest correlation between the basis estimates
obtained by all possible pairs of the four SDR methods. To measure the
correlation, a trace correlation
The organization of this article is as follows. We develop the idea of a basis-adaptive selection algorithm, and discuss its selection criteria. Next, we present a real data example and simulation studies. Finally, we summarize our work.
We begin this section with soil evaporation data in Section 6 of (Yin and Cook 2002). The data contains 46 observations of daily soil evaporation, daily air and soil temperature curves, daily humidity curves and wind speed. For illustration purposes, we consider a regression analysis of daily soil evaporation given the integrated area of and the range of the daily air and soil temperatures. We will revisit the data in a later section.
## loading data
evaporat <- read.table("evaporat.txt", header=T); attach(evaporat)
Rat <- Maxat-Minat; Rvh <- Maxh-Minh; Rst <- Maxst-Minst
w <- c(scale(evaporat$Evap,center=TRUE,scale=TRUE)); detach(evaporat)
evaporat <- data.frame(evaporat, Rat, Rvh, Rst)
w2 <- cbind(w, w^2); w3 <- cbind(w2, w^3); w4 <- cbind(w3, w^4)
## dr-package fitting
library(dr)
sir5 <- dr(Evap~Avat+Rat+Avst+Rst, data=evaporat, method="sir", nslice=5)
save5 <- update(sir5, method="save"); phdres <- update(sir5, method="phdres")
cov2 <- dr(w2~Avat+Rat+Avst+Rst,data=evaporat, method="ols")
cov3 <- dr(w3~Avat+Rat+Avst+Rst,data=evaporat, method="ols")
cov4 <- dr(w4~Avat+Rat+Avst+Rst,data=evaporat, method="ols")
## dimension test
round(dr.test(sir5),3)
round(dr.test(save5),3)
round(dr.test(phdres, numdir=4),3)
set.seed(100);round(dr.permutation.test(cov2,npermute=1000)$summary,3)
set.seed(100);round(dr.permutation.test(cov3,npermute=1000)$summary,3)
set.seed(100);round(dr.permutation.test(cov4,npermute=1000)$summary,3)
SIR with 5 slices | SAVE with 5 slices | pHd | cov2 | cov3 | cov4 | |
---|---|---|---|---|---|---|
0.000 | 0.048 | 0.698 | 0.000 | 0.000 | 0.002 | |
0.022 | 0.261 | 0.723 | 0.007 | 0.001 | 0.001 | |
0.202 | 0.546 | 0.751 | N/A | 0.011 | 0.001 | |
0.814 | 0.755 | 0.452 | N/A | N/A | 0.004 |
The set.seed(100)
was used to have reproducible
results. According to Table 1, with level 5%, the SIR and the
SAVE determine that
Before developing selection criteria among the four SDR methods, there are some aspects that must be considered first. A selection based on the criteria should be data-driven, unless it is purposely pre-selected. In addition, the criteria should be generally extended to the other SDR methods, although the four methods in dr are highlighted in this paper.
In order to have some idea of how to select a method, consider a
simulated example of
sir.cov <- sir.save <- sir.phd <- save.cov <- save.phd <- phd.cov <- NULL
sir.eta <- save.eta <- phd.eta <- cov.eta <- NULL
set.seed(1)
## starting loop
for (i in 1:500){
## model construction
X <- matrix(rnorm(100*10), c(100,10)); y <- X[,1] + rnorm(100)
w <- c(scale(y,center=TRUE,scale=TRUE)); w2 <- cbind(w, w^2)
## obtaining basis estimates from the SDR methods
dir.sir5<-dr.direction(dr(y~X, method="sir", nslice=5))[,1]
dir.cov2<-dr.direction(dr(w2~X, method="ols"))[,1]
dir.save5<-dr.direction(dr(y~X, method="save", nslice=5))[,1]
dir.phd <-dr.direction(dr(y~X, method="phdres"))[,1]
## computing the absolute correlation between the pairs of the estimates from the four methods
sir.cov[i]<-abs(cor(dir.sir5, dir.cov2)); sir.save[i]<-abs(cor(dir.sir5, dir.save5))
sir.phd[i]<-abs(cor(dir.sir5, dir.phd)); save.cov[i]<-abs(cor(dir.save5, dir.cov2))
save.phd[i]<-abs(cor(dir.save5, dir.phd)); phd.cov[i]<-abs(cor(dir.phd, dir.cov2))
## computing the absolute correlation between the estimate from each method and x1
sir.eta[i]<-abs(cor(dir.sir5, X[,1])); cov.eta[i]<-abs(cor(dir.cov2, X[,1]))
save.eta[i]<-abs(cor(dir.save5, X[,1])); phd.eta[i]<-abs(cor(dir.phd, X[,1]))
}
## the averages of the absolute correlations by each pair of the methods
round(apply(cbind(sir.cov, sir.save, sir.phd, save.cov, save.phd, phd.cov), 2,mean),3)
sir.cov sir.save sir.phd save.cov save.phd phd.cov
0.966 0.208 0.257 0.214 0.348 0.288
## the averages of the absolute correlations by each method and x1
round(apply(cbind(sir.eta, cov.eta, save.eta, phd.eta), 2, mean), 3)
sir.eta cov.eta save.eta phd.eta
0.941 0.939 0.215 0.267
The highest average of the six pairwise absolute correlations is highest
between SIR and
Let us think about this in a reverse manner. Suppose that the estimates
of sufficient predictors by a pair of SAVE and pHd are more highly
correlated than those of any other pairs of the SDR methods. Then, it
would be not bad reasoning to believe that SAVE or pHd should be
preferable to the data, although we do not know what the true model is.
That is, the pair of the two methods that gives the highest correlation
would be not a bad choice for estimating
First, we list all six possible pairs of the four methods in the
dr-package: (1) (SIR, SAVE); (2) (SIR, pHd); (3) (SIR,
The trace correlation is expected to be maximized under the true
dimension of
sir.cov <- sir.save <- sir.phd <- save.cov <- save.phd <- phd.cov <- NULL
sir.cov2 <- sir.save2 <- sir.phd2 <- save.cov2 <- save.phd2 <- phd.cov2 <- NULL
sir.cov3 <- sir.save3 <- sir.phd3 <- save.cov3 <- save.phd3 <- phd.cov3 <- NULL
sir.eta2 <- save.eta2 <- phd.eta2 <- cov.eta2 <- NULL
set.seed(1)
## true basis matrix of the central subspace
eta <- cbind(c(1, rep(0,9)), c(0, 1, rep(0,8)) )
## starting loop
for (i in 1:500){
## model construction
X <- matrix(rnorm(100*10), c(100,10)); y <- X[,1] + X[,1]*X[,2]+ rnorm(100)
w <- c(scale(y,center=TRUE, scale=TRUE)); w2 <- cbind(w, w^2); w3<- cbind(w2, w^3)
## obtaining basis estimates from the four methods
sir5b <- dr(y~X, method="sir", nslice=5)$raw.evectors
cov2b <- dr(w2~X, method="ols")$raw.evectors
cov3b <- dr(w3~X, method="ols")$raw.evectors
save5b <- dr(y~X, method="save", nslice=5)$raw.evectors
phdb <- dr(y~X, method="phdres")$raw.evectors
## computing trace correlations under d=1 for the six pairs
sir.cov[i] <- tr.cor(sir5b[,1], cov2b[,1], 1)
sir.save[i] <- tr.cor(sir5b[,1], save5b[,1], 1)
sir.phd[i] <- tr.cor(sir5b[,1], phdb[,1], 1)
save.cov[i] <- tr.cor(save5b[,1], cov2b[,1], 1)
save.phd[i] <- tr.cor(save5b[,1], phdb[,1], 1)
phd.cov[i] <- tr.cor(phdb[,1], cov2b[,1], 1)
## computing trace correlations under d=2 for the six pairs
sir.cov2[i] <- tr.cor(sir5b[,1:2], cov2b[,1:2], 2)
sir.save2[i] <- tr.cor(sir5b[,1:2], save5b[,1:2], 2)
sir.phd2[i] <- tr.cor(sir5b[,1:2], phdb[,1:2], 2)
save.cov2[i] <- tr.cor(save5b[,1:2], cov2b[,1:2], 2)
save.phd2[i] <- tr.cor(save5b[,1:2], phdb[,1:2], 2)
phd.cov2[i] <- tr.cor(phdb[,1:2], cov2b[,1:2], 2)
sir.cov3[i] <- tr.cor(sir5b[,1:3], cov3b[,1:3], 3)
## computing trace correlations under d=3 for the six pairs
sir.save3[i] <- tr.cor(sir5b[,1:3], save5b[,1:3], 3)
sir.phd3[i] <- tr.cor(sir5b[,1:3], phdb[,1:3], 3)
save.cov3[i] <- tr.cor(save5b[,1:3], cov3b[,1:3], 3)
save.phd3[i] <- tr.cor(save5b[,1:3], phdb[,1:3], 3)
phd.cov3[i] <- tr.cor(phdb[,1:3], cov3b[,1:3], 3)
}
## averages of the trace correlations for each pair under d=1,2,3
round(apply(cbind(sir.cov, sir.save, sir.phd, save.cov, save.phd, phd.cov),
2, mean), 3)
round(apply(cbind(sir.cov2, sir.save2, sir.phd2, save.cov2, save.phd2, phd.cov2),
2, mean), 3)
round(apply(cbind(sir.cov3, sir.save3, sir.phd3, save.cov3, save.phd3, phd.cov3),
2, mean), 3)
sir.cov sir.save sir.phd save.cov save.phd phd.cov
0.580 0.224 0.476 0.317 0.451 0.616
sir.cov2 sir.save2 sir.phd2 save.cov2 save.phd2 phd.cov2
0.822 0.397 0.656 0.486 0.606 0.801
sir.cov3 sir.save3 sir.phd3 save.cov3 save.phd3 phd.cov3
0.773 0.517 0.665 0.577 0.679 0.753
In the model, the true structural dimension is equal to two, and
Based on this discussion, the selection algorithm can be developed as follows:
Algorithm
Fix the maximum value
Compute
Choose a pair of the methods to provide the maximum
Remove the initial pair and all of the other pairs not containing
one of the methods of the initial pair for all values of
Search a second pair that has the highest
The common method in the initial and final pairs is the representative SDR method to the data.
Steps 4–5 are required to select one of the methods of the initial pair
in Step 2. This approach of selecting SDR methods through the proposed
algorithm is called basis-adaptive selection (BAS). The BAS is
data-driven and not necessarily limited in the four SDR methods of SIR,
SAVE, pHd and
The bas1
function runs the BAS algorithm for SIR, SAVE, pHd and
bas1(formula, data, nsir=5, nsave=4, k=4, plot=TRUE)
The arguments of nsir
, nsave
and k
determine the numbers of slices
for SIR and SAVE as well as the moment for plot=TRUE
, the
function bas1
returns a scatter plot of the trace correlations against
the various choices of nsir
-1, k
) for the six
pairs of the four methods in dr.
The values returned by the bas1
function are selection
, sir
,
save
, phd
and covk
. Their descriptions are as follows:
selection
: a selected method among SIR, SAVE, pHd and
sir
: the SIR application object with the number of slices equal to
nsir
save
: the SAVE application object with the number of slices equal
to nsave
pHd
: the pHd application object
covk
: a list type object by the
We revisit the soil evaporation data. Here, the application of the BAS
is more extensively studied under the cross-combinations of four or five
numbers of slices for SIR and SAVE and the third or forth order of
polynomials for
## BAS application
BAS342 <- bas1(Evap~Avat+Rat+Avst+Rst, data=evaporat, nsir=3, nsave=4, k=2, plot=FALSE)
BAS343 <- bas1(Evap~Avat+Rat+Avst+Rst, data=evaporat, nsir=3, nsave=4, k=3, plot=FALSE)
BAS352 <- bas1(Evap~Avat+Rat+Avst+Rst, data=evaporat, nsir=3, nsave=5, k=2, plot=FALSE)
BAS353 <- bas1(Evap~Avat+Rat+Avst+Rst, data=evaporat, nsir=3, nsave=5, k=3, plot=FALSE)
BAS443 <- bas1(Evap~Avat+Rat+Avst+Rst, data=evaporat, nsir=4, nsave=4, k=3, plot=FALSE)
BAS444 <- bas1(Evap~Avat+Rat+Avst+Rst, data=evaporat, nsir=4, nsave=4, k=4, plot=FALSE)
BAS453 <- bas1(Evap~Avat+Rat+Avst+Rst, data=evaporat, nsir=4, nsave=5, k=3, plot=FALSE)
BAS454 <- bas1(Evap~Avat+Rat+Avst+Rst, data=evaporat, nsir=4, nsave=5, k=4, plot=FALSE)
BAS544 <- bas1(Evap~Avat+Rat+Avst+Rst, data=evaporat, nsir=5, nsave=4, k=4, plot=FALSE)
BAS545 <- bas1(Evap~Avat+Rat+Avst+Rst, data=evaporat, nsir=5, nsave=4, k=5, plot=FALSE)
BAS554 <- bas1(Evap~Avat+Rat+Avst+Rst, data=evaporat, nsir=5, nsave=5, k=4, plot=FALSE)
BAS555 <- bas1(Evap~Avat+Rat+Avst+Rst, data=evaporat, nsir=5, nsave=5, k=5, plot=FALSE)
## Selection results
BAS342$selection; BAS343$selection; BAS352$selection; BAS353$selection
BAS443$selection; BAS444$selection; BAS453$selection; BAS454$selection
BAS544$selection; BAS545$selection; BAS554$selection; BAS555$selection
## dimension test
set.seed(100); round(dr.permutation.test(BAS342$covk$cov2, npermute=1000)$summary, 3)
set.seed(100); round(dr.permutation.test(BAS343$covk$cov3, npermute=1000)$summary, 3)
round(dr.test(BAS443$phd, numdir=4), 3)
round(dr.test(BAS544$sir), 3)
Both BAS342 and BAS352 recommended
For this purpose, we considered how reasonably the methods recommended
in Table 2 estimated the true dimension of the central
subspace. The dimension estimation results are already summarized in
Table 1, and the nominal level 5% was used. First, the two
methods of cov2 and cov3 were inspected. According to Table 1,
BAS3 |
BAS3 |
BAS4 |
BAS5 |
|
---|---|---|---|---|
Recommendation | cov2 | cov3 | pHd | SIR |
Predictors
Model 1:
Model 2:
Model 3:
Model 4:
In Models 1 and 2, the column of
Each model was iterated 500 times with
As a summary, the selection percentages of the methods by BAS are reported in Table 3.
SIR | SAVE | pHd | SIR | SAVE | pHd | |||
Model 1 | 41.2 | 0.6 | 0.2 | 58.0 | 49.4 | 0.0 | 0.0 | 50.6 |
Model 2 | 1.0 | 17.8 | 78.4 | 2.8 | 0.0 | 22.8 | 77.2 | 0.0 |
Model 3 | 6.4 | 2.0 | 25.6 | 66.0 | 3.6 | 1.2 | 25.0 | 70.2 |
Model 4 | 6.4 | 0.6 | 6.4 | 86.6 | 26.8 | 1.2 | 1.4 | 70.6 |
According to Table 3, the BAS selects SIR and
set.seed(5); sel1 <- sel2 <- sel3 <- sel4 <- sel12 <- sel22 <- sel32 <- sel42 <- NULL
## starting loop
for (i in 1:500){
## model construction for n=100
X <- matrix(rnorm(100 * 10), c(100, 10))
y1 <- X[,1] + rnorm(100)
y2 <- X[,1]^2 + rnorm(100)
y3 <- X[,1] + X[,1] * X[,2] + rnorm(100)
y4 <- 1 + X[,1] + exp(X[,2]) * rnorm(100)
## model construction for n=200
X2 <- matrix(rnorm(200 * 10), c(200,10))
y12 <- X2[,1] + rnorm(200)
y22 <- X2[,1]^2 + rnorm(200)
y32 <- X2[,1] + X2[,1] * X2[,2] + rnorm(200)
y42 <- 1 + X2[,1] + exp(X2[,2]) * rnorm(200)
## selection by BAS
sel1[i] <- bas1(y1~X, plot=FALSE)$selection
sel2[i]<-bas1(y2~X, plot=FALSE)$selection
sel3[i] <- bas1(y3~X, plot=FALSE)$selection
sel4[i]<-bas1(y4~X, plot=FALSE)$selection
sel12[i] <- bas1(y12~X2, plot=FALSE)$selection
sel22[i]<-bas1(y22~X2, plot=FALSE)$selection
sel32[i] <- bas1(y32~X2, plot=FALSE)$selection
sel42[i]<-bas1(y42~X2, plot=FALSE)$selection
}
## computing the selection percentages for model 1
# n=100
c(length(which(sel1=="sir")), length(which(sel1=="save")),
length(which(sel1=="phd")), length(which(sel1=="covk"))) / 500
# n=200
c(length(which(sel12=="sir")), length(which(sel12=="save")),
length(which(sel12=="phd")), length(which(sel12=="covk"))) / 500
## computing the selection percentages for model 2
# n=100
c(length(which(sel2=="sir")), length(which(sel2=="save")),
length(which(sel2=="phd")), length(which(sel2=="covk"))) / 500
# n=200
c(length(which(sel22=="sir")), length(which(sel22=="save")),
length(which(sel22=="phd")), length(which(sel22=="covk"))) / 500
## computing the selection percentages for model 3
# n=100
c(length(which(sel3=="sir")), length(which(sel3=="save")),
length(which(sel3=="phd")), length(which(sel3=="covk"))) / 500
# n=200
c(length(which(sel32=="sir")), length(which(sel32=="save")),
length(which(sel32=="phd")), length(which(sel32=="covk"))) / 500
## computing the selection percentages for model 4
# n=100
c(length(which(sel4=="sir")), length(which(sel4=="save")),
length(which(sel4=="phd")), length(which(sel4=="covk"))) / 500
# n=200
c(length(which(sel42=="sir")), length(which(sel42=="save")),
length(which(sel42=="phd")), length(which(sel42=="covk"))) / 500
Sufficient dimension reduction (SDR) is a useful dimension reduction method in regression. The popularly used SDR methods among the others include sliced inverse regression (Li 1991), sliced average variance estimation (Cook and Weisberg 1991), principal Hessian directions (Li 1992) and covariance method (Yin and Cook 2002). Currently, the dr-package is the only one to cover the four sufficient dimension reduction methods in R. However, users were left without practical guidelines as to which SDR method should be chosen. To remedy this, we developed the basis-adaptive selection (BAS) algorithm to recommend a SDR method in dr by maximizing a trace correlation (Hooper 1959). A real data example and numerical studies confirm its potential usefulness in practice.
The BAS algorithm requires the two parts. The first is the basis estimates of the dimension reduction subspace, and the second is a quantity to measure the distances between the subspaces spanned by the columns of the estimates. For non-linear feature extractions through different kernel methods, there is no reason why the BAS algorithm cannot be applied if some quantity to measure the distance between the non-linear subspaces defined in different kernels is feasible.
If the sliced inverse regression applications with various numbers of slices replaces the other three methods in the BAS, the BAS algorithm can be utilized to find a good number of slices for this method.
The code for BAS is available through the personal webpage of the author: http://home.ewha.ac.kr/~yjkstat/bas.R.
For the corresponding author Jae Keun Yoo, this work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Korean Ministry of Education (NRF-2017R1A2B1004909).
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Yoo, "Basis-Adaptive Selection Algorithm in dr-package", The R Journal, 2018
BibTeX citation
@article{RJ-2018-045, author = {Yoo, Jae Keun}, title = {Basis-Adaptive Selection Algorithm in dr-package}, journal = {The R Journal}, year = {2018}, note = {https://rjournal.github.io/}, volume = {10}, issue = {2}, issn = {2073-4859}, pages = {124-132} }