We introduce a new R package called glmperm for inference in generalized linear models especially for small and moderate-sized data sets. The inference is based on the permutation of regressor residuals test introduced by Potter (2005). The implementation of glmperm outperforms currently available permutation test software as glmperm can be applied in situations where more than one covariate is involved.
A novel permutation test procedure for inference in logistic regression with small- and moderate-sized datasets was introduced by Potter (2005) and showed good performance in comparison to exact conditional methods. This so-called permutation of regressor residuals (PRR) test is implemented in the R package logregperm. However, the application field is limited to logistic regression models. The new glmperm package offers an extension of the PRR test to generalized linear models (GLMs) especially for small and moderate-sized data sets. In contrast to existing permutation test software, the glmperm package provides a permutation test for situations in which more than one covariate is involved, e.g. if established covariates need to be considered together with the new predictor under test.
Let
with
One is now interested in testing the null hypothesis that the regression
coefficient for a covariate of interest, say without loss of generality
The basic concept of the PRR test is that it replaces the covariate of
interest by its residual
The algorithm for the PRR test to obtain a p-value for testing the null
hypothesis
Calculate least squares residuals
For resampling iterations
Randomly draw
Calculate p-values
Calculate a permutation p-value for the PRR test according to
Thus, the permutation p-value
The number of resampling iterations nrep
in the function prr.test
. By default nrep=1000
. However,
depending on the precision that is desired, the number of iterations
could be increased at the cost of longer computation time. We recommend
using nrep=10000
.
In order to provide an estimate of the variance of the result, the
permutation p-value
For binary outcome variables the PRR test is a competitive approach to exact conditional logistic regression described by Hirji, C.R. Mehta and N.R. Patel (1987) and Mehta and N.R. Patel (1995). The commercial LogXact software has implemented this conditional logistic regression approach. At present, statistical tests employing unconditional permutation methods are not commercially available and the package glmperm bridges this gap.
The rationale of exact conditional logistic regression is based on the
exact permutation distribution of the sufficient statistic for the
parameter of interest, conditional on the sufficient statistics for the
specified nuisance parameters. However, its algorithmic constraint is
that the distribution of interest will be degenerate if a conditioning
variable is continuous. The difference between the two procedures lies
in the permutation scheme: While the conditional approach permutes the
outcome variable, the PRR test permutes the residuals from the linear
model. Note that if one considers regressions with only a single
regressor the residuals
An overview of the methodologic differences between these two and other permutation schemes is provided by Kennedy and B.S. Cade (1996). In the context of their work the permutation method used for the PRR test can be viewed as an extension of their shuffle-R permutation scheme for linear models. The variable associated with the parameter under the null hypothesis is regressed on the remaining covariables and replaced by the corresponding residuals; these residuals are then permuted while the response and the covariates are held constant. Freedman and D. Lane (1983) introduced the shuffle-R method in combination with tests of significance and a detailed discussion of this permutation scheme is provided by Braak (1992). The conditional method can be implied with the shuffle-Z method which was first mentioned in the context of multiple regression by Draper and D.M. Stoneman (1966). Here the variables associated with the parameters being tested under the null hypothesis are randomized while the response and the covariates are held constant. Kennedy and B.S. Cade (1996) discuss the potential pitfalls of the shuffle-Z method and point out that this method violates the ancillarity principle by not holding constant the collinearity between the covariables and the variable associated with the parameter under the null hypothesis. Therefore, Kennedy and B.S. Cade (1996) do not recommend the shuffle-Z method unless it employs a pivotal statistic or the hypothesized variable and the remaining covariables are known to be independent.
Three main modifications have been implemented in the glmperm package in comparison to the logregperm package. Note that the new package glmperm includes all features of the package logregperm. At present, the logregperm package is still available on CRAN. In general, the glmperm package could replace the logregperm package.
The extension of the prr.test
function in the
glmperm package
provides several new arguments compared to the version in
logregperm. The
input is now organised as a formula expression equivalent to fitting
a generalized linear model with glm
(R package
stats). Hence, for
easier usage the syntax of prr.test
is adapted from glm
. The
covariable of interest about which inference is to be made is to be
included as argument var=’x1’
. An argument seed
is provided to
allow for reproducibility.
The implicit function glm.perm
is extended to calculate not only
the deviances for the different resampling iterations but also the
dispersion parameter for each permutation separately. For Poisson
and logistic regression models the dispersion parameters are
pre-defined as
A new feature of the package is that it includes a summary function
to view the main results of prr.test
. It summarises the
permutation of regressor residual-based p-value family=quasibinomial()
or quasipoisson()
instead.
To illustrate the usage of the
glmperm package for a
GLM, an example session with simulated data is presented. First, we
simulated data for three independent variables with
# binary response variable
n <- 20
set.seed(4278)
x1 <- rnorm(n)
x0 <- rnorm(n)+x1
y1 <- ifelse(x0+x1+2*rnorm(n)>0,1,0)
test1 <- prr.test(y1~x0+x1,
var="x0", family=binomial())
x2 <- rbinom(n,1,0.6)
y2 <- ifelse(x1+x2+rnorm(n)>0,1,0)
test2 <- prr.test(y2~x1+x2, var="x1",
nrep=10000,family=binomial())
# Poisson response variable
set.seed(4278)
x1 <- rnorm(n)
x0 <- rnorm(n) + x1
nu <- rgamma(n, shape = 2, scale = 1)
y <- rpois(n, lambda = exp(2) * nu)
test3 <- prr.test(y~x0+x1,
var="x0", family=poisson())
test4 <- prr.test(y~x0, var="x0",
nrep=1000,family=poisson())
A condensed version of the displayed result summary of test2
(only
factors
> summary(test2)
-----------------------------------------
Results based on chi-squared distribution
-----------------------------------------
observed p-value: 0.0332
--------------------
Results based on PRR
--------------------
permutation p-value for simulated p-values <=
observed p-value: 0.0522 (Std.err: 0.0018)
permutation p-value for simulated p-values <=
1.02 observed p-value: 0.0531 (Std.err: 0.0018)
For the above example test2
the exact conditional logistic regression
p-value calculated via LogXact-4 is 0.0526, whereas the p-value of the
PRR test is 0.0522 for factor
For the computation of GLM the dispersion parameter family=quasipoisson()
or quasibinomial()
instead of family=poisson()
or binomial().
We experienced a stable performance of the PRR test for overdispersed
data. The following treepipit data (Müller and T. Hothorn 2004) provides an example of
overdispersed data. We show the results of the chi-squared approximation
of the likelihood-ratio test statistic as well as the results of the PRR
test for the usage of family=poisson()
and family=quasipoisson()
,
respectively.
# example with family=poisson()
data(treepipit, package="coin")
test5<-prr.test(counts~cbpiles+coverstorey
+coniferous+coverregen,data=treepipit,
var="cbpiles",family=poisson())
summary(test5)
-----------------------------------------
Results based on chi-squared distribution
-----------------------------------------
observed p-value: 0.0037
--------------------
Results based on PRR
--------------------
permutation p-value for simulated p-values <=
observed p-value: 0.083 (Std.err: 0.0019)
permutation p-value for simulated p-values <=
1.02 observed p-value: 0.084 (Std.err: 0.0019)
# example with family=quasipoisson()
test6<-prr.test(counts~cbpiles+coverstorey
+coniferous+coverregen,data=treepipit,
var="cbpiles",family=quasipoisson())
summary(test6)
-----------------------------------------
Results based on chi-squared distribution
-----------------------------------------
observed p-value: 0.0651
--------------------
Results based on PRR
--------------------
permutation p-value for simulated p-values <=
observed p-value: 0.07 (Std.err: 0.0078)
permutation p-value for simulated p-values <=
1.02 observed p-value: 0.071 (Std.err: 0.0079)
The p-values based on the chi-squared distribution of the
likelihood-ratio test statistic are family=poisson()
and family=quasipoisson()
, respectively. Hence, a
different test decision is made whereas the test decision for the PRR
test is the same for both cases (
The glmperm package provides a permutation of regressor residuals test for generalized linear models. This version of a permutation test for inference in GLMs is especially suitable for situations in which more than one covariate is involved in the model. The key input feature of the PRR test is to use the orthogonal projection of the variable of interest on the space spanned by all other covariates instead of the variable of interest itself. This feature provides a reasonable amendment to existing permutation test software which do not incorporate situations with more than one covariate. Applications to logistic and Poisson models show good performance when compared to gold standards. For the special case of data with overdispersion the PRR test is more robust compared to methods based on approximations of the test statistic distribution.
We would like to acknowledge the many valuable suggestions made by two anonymous referees.
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Werft & Benner, "glmperm: A Permutation of Regressor Residuals Test for Inference in Generalized Linear Models", The R Journal, 2010
BibTeX citation
@article{RJ-2010-007, author = {Werft, Wiebke and Benner, Axel}, title = {glmperm: A Permutation of Regressor Residuals Test for Inference in Generalized Linear Models}, journal = {The R Journal}, year = {2010}, note = {https://rjournal.github.io/}, volume = {2}, issue = {1}, issn = {2073-4859}, pages = {39-43} }