Semiparametric Generalized Linear Models with the gldrm Package
Abstract:
This paper introduces a new algorithm to estimate and perform inferences on a recently proposed and developed semiparametric generalized linear model (glm). Rather than selecting a particular parametric exponential family model, such as the Poisson distribution, this semiparametric glm assumes that the response is drawn from the more general exponential tilt family. The regression coefficients and unspecified reference distribution are estimated by maximizing a semiparametric likelihood. The new algorithm incorporates several computational stability and efficiency improvements over the algorithm originally proposed. In particular, the new algorithm performs well for either small or large support for the nonparametric response distribution. The algorithm is implemented in a new R package called gldrm.
Rathouz and Gao (2009) introduced the generalized linear density
ratio model (gldrm), which is a novel semiparametric formulation of the
classical glm. Although Rathouz and Gao did not use the term gldrm, we
refer to it as such because it is a natural extension of the density
ratio model (see, e.g., Lovric 2011). Like a standard glm, the gldrm
relates the conditional mean of the response to a linear function of the
predictors through a known link function $g$. To be specific, let the
random variable $Y$ be a scalar response, and let $X$ be a $p \times 1$
covariate vector for a particular observation. The model for the mean is

$$E(Y \mid X) = g^{-1}(X^T \beta) . \qquad (1)$$
Because Equation (1) holds for both the gldrm and the glm,
the regression coefficients $\beta$ have the same interpretation for both
models. The gldrm relaxes the standard glm assumption that the
distribution of $Y \mid X$ comes from a particular exponential family model.
Instead, assume that $Y \mid X$ comes from an exponential tilt model of the
form

$$f(y \mid X = x) = f_0(y) \exp\{\theta y - b(\theta)\} , \qquad (2)$$

where

$$b(\theta) = \log \int f_0(y) \exp(\theta y) \, d\lambda(y) ,$$

and $\theta \equiv \theta(x; \beta, f_0)$ is defined implicitly as the solution to

$$g^{-1}(x^T \beta) = \int y \, f_0(y) \exp\{\theta y - b(\theta)\} \, d\lambda(y) .$$

Here $f_0$ is an unspecified probability density with respect to a
measure $\lambda$. We call $f_0$ the reference distribution. The measure $\lambda$
is Lebesgue if $Y$ is continuous, a counting measure if $Y$
is discrete, or a mixture of the two if $Y$ has a mixture
distribution. Note that $E(Y \mid X) = b'(\theta)$ and
$\mathrm{Var}(Y \mid X) = b''(\theta)$, which are standard glm properties.
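To make Equation (2) concrete, the following small R illustration (our own sketch, not code from the paper) tilts an arbitrary discrete reference density and computes the mean and variance of the tilted distribution, which equal $b'(\theta)$ and $b''(\theta)$:

## Exponential tilt of a discrete reference density f0 on support s:
## the tilted density is f0(y) * exp(theta * y - b(theta)).
s  <- 0:5
f0 <- dbinom(s, size = 5, prob = 0.3)        # an arbitrary discrete reference
theta <- 0.8                                 # an arbitrary tilt parameter
b <- log(sum(f0 * exp(theta * s)))           # b(theta)
f_tilt <- f0 * exp(theta * s - b)            # tilted density; sums to one
c(mean = sum(s * f_tilt),                              # equals b'(theta)
  var  = sum((s - sum(s * f_tilt))^2 * f_tilt))        # equals b''(theta)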
The regression coefficients $\beta$ and reference distribution $f_0$ are
estimated by maximizing a semiparametric likelihood function, which
contains a nonparametric representation of $f_0$ that has point mass
only at values of $Y$ observed in the data. The quantity $\theta$ is not
a parameter in this model, but rather a function of the free parameters
$\beta$ and $f_0$, as well as of $X$. This work was fully developed by
Huang and Rathouz (2012), who first focused on the case where the
covariate vector takes one of finitely many values, drawing on
theoretical arguments advanced in the empirical likelihood literature
(e.g., Owen 2001). Drawing on semiparametric profile
likelihood methods, Huang (2014) went on to fully generalize the
asymptotic arguments, proving consistency and asymptotic normality of
the regression coefficient estimators, and deriving the asymptotic
variance matrix using the profile likelihood. Huang also proved
pointwise asymptotic normality of the reference cumulative distribution
function estimator.
Despite these important theoretical advances, computation for this model
has remained a practical challenge. The original algorithm in
Rathouz and Gao (2009) is somewhat rudimentary and applies mostly to
cases with finite support. It does not scale well, and stability and
speed are challenges. This paper proposes a new algorithm to address
these challenges and render the model more widely applicable.
In particular, optimization of the semiparametric likelihood
over the point masses of the reference distribution is improved with application of the
Broyden-Fletcher-Goldfarb-Shanno (BFGS) technique
(Broyden 1970; Fletcher 1970; Goldfarb 1970; Shanno 1970). Because the
number of parameters in the semiparametric likelihood associated with
$f_0$ is equal to the number of unique points in the response vector,
the dimension of the Hessian matrix can become unwieldy, especially for
continuous response data. The BFGS method maintains an approximate Hessian,
built up from easily computed rank-two updates, whose inverse is much
less computationally intensive to calculate.
The optimization techniques in this paper have been incorporated into
the R package gldrm (Wurm and Rathouz 2018), the first CRAN package for gldrm. This package estimates
the gldrm model and provides coefficient standard errors. In addition,
since the publication of Rathouz and Gao (2009), Huang (2014) has
developed likelihood ratio tests and likelihood ratio-based confidence
intervals for regression coefficients and demonstrated their strong
properties. The gldrm package provides utilities for likelihood ratio
tests of nested models, as well as confidence intervals.
We present our optimization algorithm in Section 2. In Section 3, we
discuss methods for inference. In Section 4, we describe a goodness-of-fit
diagnostic based on the randomized probability inverse transform. In
Section 5, we present simulation experiments that check the finite sample
performance of the gldrm estimators and the accuracy of the estimated
standard errors under this algorithm. In Section 6, we present a simulation
that benchmarks the computational time and how it scales with the number of
observations and the number of covariates. In Section 7, we summarize the
functions in the gldrm package. In Section 8, we demonstrate the
functionality of the gldrm package with an example in R. In Section 9, we
make concluding remarks.
2 Optimization algorithm
Suppose we have an observed sample $(x_1, y_1), \ldots, (x_n, y_n)$
generated by the model specified in Equations
(1)-(2). The $x_i$ can be fixed or
random, and the $y_i$ are conditionally independent given the
covariates. Let $S = (s_1, \ldots, s_K)$ be the observed
support, i.e. a vector containing the unique values in the set
$\{y_1, \ldots, y_n\}$. If the response follows a continuous
distribution, then $K$ will equal $n$. Otherwise $K$ may be less than
$n$ due to ties. Let the vector $\tilde f = (\tilde f_1, \ldots, \tilde f_K)$
represent probability mass at the points $(s_1, \ldots, s_K)$. This is a
nonparametric representation of $f_0$ because the true reference density
may be continuous or have probability mass at values not contained in
$S$.

The semiparametric log-likelihood is

$$\ell(\beta, \tilde f) = \sum_{i=1}^n \left\{ \theta_i y_i - \log \sum_{k=1}^K \tilde f_k \exp(\theta_i s_k) + \sum_{k=1}^K 1(y_i = s_k) \log \tilde f_k \right\} , \qquad (3)$$

where each $\theta_i$ is defined implicitly as the solution to

$$g^{-1}(x_i^T \beta) = \frac{\sum_{k=1}^K s_k \tilde f_k \exp(\theta_i s_k)}{\sum_{k=1}^K \tilde f_k \exp(\theta_i s_k)} . \qquad (4)$$

There exists a $\theta_i$ that satisfies Equation
(4) as long as $m < g^{-1}(x_i^T \beta) < M$ for
all $i$, where $m = \min(s_1, \ldots, s_K)$ and
$M = \max(s_1, \ldots, s_K)$.
We obtain parameter estimates $(\hat\beta, \hat{\tilde f})$ from this likelihood
as the values that maximize $\ell(\beta, \tilde f)$,
subject to the following constraints:

C1. $g^{-1}(x_i^T \beta) \in (m, M)$ for all $i$
C2. $\tilde f_k > 0$ for all $k = 1, \ldots, K$
C3. $\sum_{k=1}^K \tilde f_k = 1$
C4. $\sum_{k=1}^K s_k \tilde f_k = \mu^*$ for some chosen $\mu^* \in (m, M)$

Constraint (C4) is necessary for identifiability because for any
nonparametric density $\tilde f$, the
"exponentially tilted" density $\tilde f^{(\alpha)}$, with masses proportional to
$\tilde f_k \exp(\alpha s_k)$, has the same semiparametric log-likelihood for any
$\alpha \in \mathbb{R}$, i.e. $\ell(\beta, \tilde f) = \ell(\beta, \tilde f^{(\alpha)})$
for all $\beta$. We can set $\mu^*$ to be any value within the range of observed
response values, but choosing a value too extreme can lead to numerical
instability in estimating $\tilde f$. It usually works well to choose
$\mu^* = \frac{1}{n} \sum_{i=1}^n y_i$.
To perform the optimization, we take an iterative approach, alternating
between optimizing over $\tilde f$ and over $\beta$ (Rathouz and Gao 2009).
Each optimization step marginally optimizes
the log-likelihood over one set of parameters, holding the other set fixed.
Neither optimization step has a closed form solution, so iterative
procedures must be used. To update $\tilde f$, we propose using the
BFGS technique. To update $\beta$, we use Fisher scoring, which is
equivalent to iteratively re-weighted least squares (IRLS). We propose
iterating BFGS until convergence for each $\tilde f$ update, while
only using a single IRLS iteration to update $\beta$ in between
$\tilde f$ updates; this is tantamount to a profile likelihood
approach to estimation of $\beta$.
Although the log-likelihood in Equation (3) is a
function of $(\beta, \tilde f)$, we have expressed it in terms of
$\theta = (\theta_1, \ldots, \theta_n)$; each $\theta_i$ is implicitly a function
of $(\beta, \tilde f)$. The score function is also best expressed in
terms of the $\theta_i$'s (see Equations (7) and
(8)). Consequently, our optimization algorithm
requires the $\theta_i$'s to be computed each time $\beta$ or $\tilde f$
is updated. To do this, we use an iterative Newton-Raphson
procedure after each iteration of the $\beta$ update and the $\tilde f$
update procedures.

In what follows, we detail updates of the $\theta_i$'s, of
$\tilde f$, and of $\beta$. We use the notation $b(\theta; \tilde f)$ to denote
the function defined in Equation (2) with respect to
the discrete probability distribution specified by $\tilde f$, i.e.
$b(\theta; \tilde f) = \log \sum_{k=1}^K \tilde f_k \exp(\theta s_k)$. We also
define $\mu_i = g^{-1}(x_i^T \beta)$.
$\theta$ update procedure

Each $\theta_i$ is defined implicitly as the solution to Equation
(4), which can be written as

$$b'(\theta_i; \tilde f) = \mu_i . \qquad (5)$$

To calculate $\theta_i$, we use the Newton-Raphson procedure provided in
Appendix C of Rathouz and Gao (2009). To satisfy equation
(5), we need to find the root of $b'(\theta_i; \tilde f) - \mu_i$,
working with a transformed version of this equation.
(The transformation stabilizes the solution.)
We use Newton-Raphson to find the root of the transformed equation. Let
$\theta_i^{(r)}$ denote the approximate solution at the $r$th Newton-Raphson
iteration. The update takes the usual Newton-Raphson form: $\theta_i^{(r)}$
minus the ratio of the transformed objective to its derivative, both
evaluated at $\theta_i^{(r)}$.

We typically initialize $\theta_i$ to be the value obtained from the
previous $\theta$ update procedure. The first time $\theta$ is updated,
we initialize $\theta_i$ to zero for every observation. We declare
convergence when the absolute value of the objective falls below a
small threshold $\epsilon$.

As $\mu_i$ approaches $M$ from the left, $\theta_i \to +\infty$.
Likewise, as $\mu_i$ approaches $m$ from the right,
$\theta_i \to -\infty$. To prevent numerical instability when $\mu_i$
is at or near these boundaries, we cap $|\theta_i|$ at a maximum
value (500 by default). The appropriateness of this threshold
depends on the scale of the response variable. Rather than adjust the
threshold, we center and scale the response variable to the interval
[-1, 1] (see the subsequent section on “response variable
transformation”).
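To make the update concrete, the following R sketch (our own illustration; the function name and arguments are hypothetical, and the package works with a transformed, rescaled version of this equation) solves $b'(\theta_i; \tilde f) = \mu_i$ for a single observation by Newton-Raphson:

## Solve b'(theta) = mu for one observation, given support s and reference
## masses f (summing to one). b(theta) = log(sum(f * exp(theta * s))), so
## b'(theta) and b''(theta) are the mean and variance of the tilted masses.
theta_update <- function(mu, s, f, theta0 = 0, tol = 1e-10, maxit = 100,
                         theta_cap = 500) {
  theta <- theta0
  for (r in seq_len(maxit)) {
    w <- f * exp(theta * s)
    w <- w / sum(w)                     # tilted probability masses
    bp  <- sum(w * s)                   # b'(theta)
    bpp <- sum(w * (s - bp)^2)          # b''(theta)
    theta <- theta - (bp - mu) / bpp    # Newton-Raphson step for b'(theta) - mu = 0
    theta <- max(min(theta, theta_cap), -theta_cap)  # cap to avoid overflow
    if (abs(bp - mu) < tol) break
  }
  theta
}

## Example: tilt a uniform reference on {1, 2, 3, 4} to have mean 3.2
theta_update(3.2, s = 1:4, f = rep(0.25, 4))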
$\tilde f$ optimization procedure

Holding $\beta$ fixed at its current estimate, we need to marginally
optimize the log-likelihood over
$\tilde f = (\tilde f_1, \ldots, \tilde f_K)$, subject to constraints (C2)-(C4).
The linear constraints
(C3) and (C4) could be enforced using constrained optimization
techniques such as Lagrange multipliers or reducing the dimension of the
parameter space by two. Huang and Rathouz (2012) used the former
technique, while Rathouz and Gao (2009) used the latter. We propose a
different method, based on the BFGS technique, that is more
computationally efficient. At each iteration, we apply a BFGS update to
$\tilde f$ to improve the unconstrained log-likelihood and then
transform $\tilde f$ to satisfy constraints (C3) and (C4).
Application of the constraints does not affect the log-likelihood of the
estimate, as the constraints are only required for identifiability (i.e.
uniqueness of the optimal $\tilde f$). For any set of positive $\tilde f$
values, there exists a unique set of $\tilde f$ values with equal
log-likelihood that satisfies both constraints (C3) and (C4).

We work with the log-scale transformation
$\log \tilde f = (\log \tilde f_1, \ldots, \log \tilde f_K)$ and
consider the log-likelihood as a function of $\log \tilde f$ only, with $\beta$
held fixed. Working on the log scale enforces constraint (C2)
and also improves numerical stability. Specifically, numerical stability
is improved by working with the score and Hessian as a function of
$\log \tilde f$ rather than $\tilde f$.
BFGS is a quasi-Newton procedure, which makes iterative updates using an
approximate Hessian matrix along with the exact score function. Let
$\tilde g^{(r)}$ denote the estimate of $\log \tilde f$ at the $r$th iteration. The
updates take the form

$$\tilde g^{(r+1)} = \tilde g^{(r)} + B_r^{-1} S\!\left(\tilde g^{(r)}\right) . \qquad (6)$$

Here, $S(\tilde g)$ is the score as a function of $\log \tilde f$
only, holding $\beta$ fixed. Its $k$th element, for $k = 1, \ldots, K$, is
given by Equation (7), derived in Appendix A. Note that this score
function ignores constraints (C3) and (C4). The matrix $B_r$ is an
approximation to the negative Hessian of the log-likelihood as a function of
$\log \tilde f$ only, holding $\beta$ fixed. This estimate is updated with
each iteration. Letting
$u_r = \tilde g^{(r+1)} - \tilde g^{(r)}$
and $v_r = S\!\left(\tilde g^{(r)}\right) - S\!\left(\tilde g^{(r+1)}\right)$,
we can write the BFGS estimate of the negative Hessian recursively as

$$B_{r+1} = B_r + \frac{v_r v_r^T}{v_r^T u_r} - \frac{B_r u_r u_r^T B_r}{u_r^T B_r u_r} .$$

$B_{r+1}$ is a rank-2 update to $B_r$ that satisfies the secant
condition: $B_{r+1} u_r = v_r$. Furthermore, $B_{r+1}$ is guaranteed to be
symmetric and positive definite, even though the true Hessian is not
full rank. (The true Hessian is not full rank because, without imposing
constraints (C3) and (C4), the log-likelihood does not have a unique
optimum.) The BFGS update in Equation (6) requires
the inverse of $B_{r}$, which can be calculated efficiently and directly,
without calculating $B_{r}$ itself. By the Sherman-Morrison-Woodbury formula,

$$B_{r+1}^{-1} = \left(I - \frac{u_r v_r^T}{v_r^T u_r}\right) B_r^{-1} \left(I - \frac{v_r u_r^T}{v_r^T u_r}\right) + \frac{u_r u_r^T}{v_r^T u_r} .$$

For an initial estimate, we propose $B_0^{-1} = \tau I$, where
$I$ is the $K \times K$ identity matrix. We perform a line search along the
gradient to choose an appropriate value of $\tau$ such that
the log-likelihood increases with the first update.
As previously mentioned, constraints (C3) and (C4) are only required for
identifiability. After each BFGS iteration, we impose these constraints
on the $\tilde f$ estimate, which does not affect the log-likelihood
of the estimate. Specifically, we apply (C3) by scaling our estimate of
$\tilde f$ to sum to one. We then "exponentially tilt" the estimate
to enforce constraint (C4). In other words, we compute the value $\alpha$ such
that

$$\frac{\sum_{k=1}^K s_k \tilde f_k \exp(\alpha s_k)}{\sum_{k=1}^K \tilde f_k \exp(\alpha s_k)} = \mu^* ,$$

and set our final estimate for the iteration to be
$\tilde f_k \exp(\alpha s_k) \big/ \sum_{j=1}^K \tilde f_j \exp(\alpha s_j)$ for all $k$.

We initialize $\log \tilde f$ to the log of the
estimate obtained from the previous $\tilde f$ update procedure. We
suggest using the empirical response distribution as the initial estimate
of $\tilde f$ for the first update. We define convergence using the relative change in the
log-likelihood, with a small default convergence threshold. If the
log-likelihood decreases after any iteration, we backtrack by half
steps, setting

$$\tilde g^{(r+1)} \leftarrow \tfrac{1}{2}\left(\tilde g^{(r+1)} + \tilde g^{(r)}\right)$$

until the log-likelihood improves. In our experience, the log-likelihood
improves after most iterations without taking half steps, but
log-likelihood decreases can occur sporadically (both at
early and late iterations).
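The following R sketch illustrates the generic BFGS machinery described above: a quasi-Newton step using the inverse approximate Hessian, half-step backtracking, and the Sherman-Morrison-Woodbury form of the inverse update. It is our own illustration under the convention that $B_r$ approximates the negative Hessian; the function names are hypothetical and the constraint-enforcing rescaling and tilting step is omitted.

## One BFGS step for maximizing loglik(g) with score function score(g),
## where g plays the role of log f-tilde and Binv approximates the inverse
## of the negative Hessian.
bfgs_step <- function(g, Binv, loglik, score) {
  s_old <- score(g)
  g_new <- g + as.vector(Binv %*% s_old)       # quasi-Newton step
  while (loglik(g_new) < loglik(g)) {          # backtrack by half steps
    g_new <- (g + g_new) / 2
  }
  u <- g_new - g                               # change in parameters
  v <- s_old - score(g_new)                    # change in (negated) score
  rho <- 1 / sum(v * u)
  I_K <- diag(length(g))
  ## Sherman-Morrison-Woodbury form of the BFGS inverse update
  Binv_new <- (I_K - rho * u %*% t(v)) %*% Binv %*% (I_K - rho * v %*% t(u)) +
              rho * u %*% t(u)
  list(g = g_new, Binv = Binv_new)
}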
$\beta$ optimization procedure

Holding $\tilde f$ fixed at its current estimate, we could marginally
optimize the log-likelihood over $\beta$ using iteratively re-weighted
least squares (IRLS). Rather than iterating until convergence, however,
we propose using a single IRLS iteration to update $\beta$ in between
$\tilde f$ updates. The IRLS algorithm is simply the Newton-Raphson
algorithm, but using the Fisher information in place of the negative
Hessian matrix. This technique is commonly referred to as Fisher
scoring. As we now show, the Fisher scoring update minimizes a weighted
least squares expression, which is why we can refer to the algorithm as
IRLS. The score function is given by

$$S(\beta) = X^T W r \qquad (8)$$

(derivation in Appendix C) and the Fisher information is

$$\mathcal{I}(\beta) = X^T W X , \qquad (9)$$

where $X$ is an $n \times p$ matrix with rows $x_i^T$,
$W$ is an $n \times n$ diagonal matrix with entries
$W_{ii} = 1 \big/ \left[ b''(\theta_i; \tilde f) \{g'(\mu_i)\}^2 \right]$, and
$r$ is an $n \times 1$ vector with entries
$r_i = (y_i - \mu_i) \, g'(\mu_i)$.

Let $\hat\beta^{(r)}$ be the estimate obtained from the previous $\beta$
update procedure. The IRLS step between $\hat\beta^{(r)}$ and the updated
estimate $\hat\beta^{(r+1)}$ is

$$\hat\beta^{(r+1)} = \hat\beta^{(r)} + \left(X^T W X\right)^{-1} X^T W r . \qquad (10)$$

This is the solution to a weighted least squares problem, which can
be computed efficiently using a QR decomposition.
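A minimal R sketch of this single Fisher scoring step, written as a weighted least squares fit (our own illustration; bpp stands in for the $b''(\theta_i; \tilde f)$ values and linkobj for a link-glm object from make.link):

## One IRLS / Fisher scoring step for beta, holding the reference
## distribution fixed. X is the n x p design matrix, y the response.
irls_step <- function(beta, X, y, bpp, linkobj) {
  eta <- as.vector(X %*% beta)            # linear predictor
  mu  <- linkobj$linkinv(eta)             # fitted means
  gprime <- 1 / linkobj$mu.eta(eta)       # g'(mu) = 1 / (d mu / d eta)
  w <- 1 / (bpp * gprime^2)               # IRLS weights
  z <- eta + (y - mu) * gprime            # working response
  fit <- lm.wfit(x = X, y = z, w = w)     # weighted least squares via QR
  fit$coefficients
}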
Response variable transformation

Numerical stability issues can occur if the exponential tilting terms
$\exp(\theta_i s_k)$ become too large or small during optimization. To prevent
these terms from exceeding R’s default floating-point number range, we transform the
response vector to the interval [-1, 1]. Specifically, the response
values are linearly rescaled so that the minimum observed value maps to
-1 and the maximum maps to +1. It turns out that the
semiparametric log-likelihood function with the transformed response and
a correspondingly modified link function
is equivalent to the original log-likelihood. The parameters $\beta$ and
$\tilde f$ that optimize the modified log-likelihood also optimize
the original log-likelihood (proof in Appendix D).

As mentioned in the “$\theta$ update procedure” section, we choose to
cap $|\theta_i|$ for each observation at 500 by default, thereby
restricting the rescaled $\theta_i s_k$ terms to the interval [-500,
500]. Note that the optimization function in the gldrm R package
returns the estimated $\theta_i$ values for each observation, and these
values are on the original scale of the response variable.
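A sketch of the rescaling step under the assumption stated above, namely that the transformation is the linear map sending the observed minimum and maximum to -1 and 1; the helper name and stored back-transformation factor are our own illustration, and the package internals may differ in detail:

## Linear rescaling of the response to [-1, 1]; a theta estimate on the
## transformed scale can be mapped back to the original scale by
## multiplying by 2 / (M - m).
rescale_y <- function(y) {
  m <- min(y); M <- max(y)
  ystar <- (2 * y - m - M) / (M - m)         # min -> -1, max -> +1
  attr(ystar, "theta_back") <- 2 / (M - m)   # factor to recover theta on the original scale
  ystar
}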
3 Inference
The gldrm R package (Wurm and Rathouz 2018) can perform inference on the
regression coefficients using likelihood ratio tests for nested models.
It can also calculate likelihood ratio, Wald, or score confidence
intervals. All three confidence intervals should yield similar results
for large samples, but the Wald form may be preferred for its
computational simplicity. Likelihood ratio and score confidence
intervals are more computationally expensive to calculate.
Huang (2014) recommends likelihood ratio tests for small samples.
The Wald asymptotic variance estimate for the $\hat\beta$ estimator can be
obtained from the inverse of the information matrix given in Equation
(9). Recall that this information matrix is calculated with
the $\tilde f$ parameters held fixed. Its inverse is a valid
asymptotic variance matrix because the full information matrix is block
diagonal; i.e., the $\hat\beta$ and $\hat{\tilde f}$ estimators are
asymptotically independent. The gldrm optimization function returns
standard error estimates for each coefficient, which are displayed by
the print method along with p-values. Wald confidence intervals are
straightforward to calculate from the standard errors and can be
obtained from the gldrmCI function. For these single-coefficient
hypothesis tests and confidence intervals, we approximate the null
distribution by a $t$-distribution with $n - p$ degrees of freedom, where
$n$ is the number of observations and $p$ is the rank of the covariate
matrix.
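For illustration, a Wald interval can be computed directly from a coefficient estimate and its standard error; the numbers below are taken from the iris example in Section 8 (Petal.Width estimate -0.035, standard error 0.0248, n = 150, p = 6) and reproduce, up to rounding, the gldrmCI output shown there:

## Wald confidence interval: estimate +/- t-quantile * SE, with n - p df.
wald_ci <- function(est, se, n, p, level = 0.95) {
  tcrit <- qt(1 - (1 - level) / 2, df = n - p)
  c(lower = est - tcrit * se, upper = est + tcrit * se)
}
wald_ci(est = -0.035, se = 0.0248, n = 150, p = 6)  # approximately (-0.084, 0.014)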
Likelihood ratio tests for nested models are based on the usual test
statistic: twice the log-likelihood difference between the full and
reduced models. Following Huang (2014), we approximate the null
distribution by an $F$-distribution with $q$ and $n - p$ degrees of
freedom, where $q$ is the difference in the number of parameters between
the full and reduced models. This test can be performed by the
gldrmLRT function.
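As a concrete check of this approximation, the p-value can be computed directly from the $F$-distribution; the numbers correspond to the gldrmLRT output in the iris example of Section 8 (F-statistic 2.03, with 2 and 144 degrees of freedom):

## P-value under the F(q, n - p) null approximation
pf(2.03, df1 = 2, df2 = 144, lower.tail = FALSE)  # approximately 0.135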
Likelihood ratio and score confidence intervals for a single coefficient
can be obtained from the gldrmCI function. These confidence intervals
are more computationally expensive than Wald confidence intervals
because an iterative method is required to search for the interval
boundaries. We use the following bisection method.

Suppose we want to obtain the lower boundary of a confidence interval with
level $1 - \alpha$ for a single coefficient $\beta_j$, and that the
gldrm estimate of this coefficient is $\hat\beta_j$. For the lower
boundary, we need to find the value $\beta_j^{\mathrm{lo}} < \hat\beta_j$ such that
the one-sided likelihood ratio
test has a p-value equal to $\alpha / 2$ under the null hypothesis
$H_0\!: \beta_j = \beta_j^{\mathrm{lo}}$. As explained in the next paragraph,
we can compute the p-value for any guess of $\beta_j^{\mathrm{lo}}$. We
begin by finding a guess that is too low (p-value less than
$\alpha / 2$) and a guess that is too high (p-value greater than
$\alpha / 2$). We then obtain the p-value of the midpoint. If the
p-value is less than $\alpha / 2$, then the midpoint becomes the new low
guess, and if it is greater than $\alpha / 2$, then the midpoint becomes
the new high guess. We iterate this process, each time halving the
distance between the low guess and high guess, until we have a guess of
$\beta_j^{\mathrm{lo}}$ whose p-value is arbitrarily close to
$\alpha / 2$. The same procedure can be used to solve for the upper
boundary $\beta_j^{\mathrm{hi}}$.
Let $b^*$ be any constant. To calculate the likelihood ratio or
score test p-value of $H_0\!: \beta_j = b^*$, we need to optimize the
log-likelihood over $\tilde f$ and all other $\beta$ values, holding $\beta_j$
fixed at $b^*$. This can be done by multiplying $b^*$
with the $j$th column of the $X$ matrix corresponding to $\beta_j$
and treating this vector as an offset to the linear predictor. In other
words, this vector contains a fixed component of the linear predictor
for each observation. The $j$th column is dropped from $X$,
and the model is optimized with the offset term.
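The following R sketch outlines the bisection search for the lower confidence bound; pval_at is a placeholder for a function that refits the model with the coefficient fixed at a given value (via the offset trick just described) and returns the one-sided test p-value:

## Bisection search for the lower confidence bound of a single coefficient.
## lo: a guess whose p-value is below alpha/2; hi: a guess whose p-value is above it.
ci_lower_bisect <- function(pval_at, lo, hi, alpha = 0.05, tol = 1e-4, maxit = 50) {
  for (i in seq_len(maxit)) {
    mid <- (lo + hi) / 2
    p <- pval_at(mid)
    if (abs(p - alpha / 2) < tol) break
    if (p < alpha / 2) lo <- mid else hi <- mid  # halve the search interval
  }
  mid
}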
4 Goodness of fit
To check the model fit, we calculate the randomized probability inverse
transform (Smith 1985). This is done by evaluating the fitted
conditional cdf (i.e., the cumulative distribution function, conditional
on the covariates) at each observed response value. If the model fit is
good, the probability inverse transform values should roughly follow a
uniform distribution on the interval (0, 1). Because the gldrm fitted
cdf is discrete, we use randomization to make the distribution
continuous, as follows.

Let $\hat F(y \mid x)$ denote the fitted cdf conditional on a covariate
vector $x$, evaluated at $y$, and let $\hat F(y^- \mid x)$ denote its
left-hand limit, i.e. the fitted probability of a response strictly less
than $y$. For each observation, we draw a random value from a uniform
distribution on the interval
$\left(\hat F(y_i^- \mid x_i),\ \hat F(y_i \mid x_i)\right)$.
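A sketch of this computation using the per-observation tilted probabilities returned by the predict method (described in Section 8); in practice the gldrmPIT function does this directly, and we assume here that predict defaults to the training data when newdata is omitted:

## Randomized probability inverse transform values for a fitted gldrm object.
randomized_pit <- function(fit, y) {
  fTilt <- predict(fit, type = "fTilt")          # n x K conditional masses
  spt   <- fit$spt                               # observed support points
  sapply(seq_along(y), function(i) {
    F_upper <- sum(fTilt[i, spt <= y[i]])        # fitted cdf at y_i
    F_lower <- sum(fTilt[i, spt <  y[i]])        # left-hand limit at y_i
    runif(1, min = F_lower, max = F_upper)       # randomized PIT value
  })
}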
To illustrate a good gldrm fit, we generate 1,000 independent pairs
$(x_i, y_i)$, where $x_i \sim$ Normal(mean = 0, sd = 1) and
$y_i \mid x_i \sim$ Normal(mean = $x_i$, sd = 1). The mean of $Y$ is linear in $X$, and the
data generating mechanism is an exponential tilt family, so we expect
the gldrm fit to be good. We fit the model and then do a visual check by
plotting the histogram and uniform QQ plot of the randomized probability
inverse transform values (Figure 1).
Figure 1: Scatterplot and probability inverse transform
histogram and QQ plot for a good gldrm model fit.
To illustrate a poor gldrm fit, we generate 1,000 independent pairs
$(x_i, y_i)$, where $x_i \sim$ Normal(mean = 0, sd = 1) and
$y_i \mid x_i$ follows a normal distribution whose mean is linear in
$x_i$ but whose standard deviation also varies with $x_i$. The mean of $Y$
is again linear in $X$, but this data generating mechanism is not an exponential tilt
family. The diagnostic plots (Figure 2) confirm that the
gldrm fit is not ideal. The probability inverse transform values are
concentrated near the center of the interval (0, 1) rather than being
uniformly distributed. The gldrm still provides a good estimate of the
regression coefficients and the mean of $Y \mid X$, but it is not the correct
model for the cdf of $Y \mid X$.
Figure 2: Scatterplot and probability inverse transform
histogram and QQ plot for a poor gldrm model fit.
5 Simulation experiments
Under four different simulation scenarios, we check the finite sample
performance of the regression coefficient estimator $\hat\beta$, as
well as the reference distribution cdf estimator $\hat F_0$. For
$\hat\beta$, we also check the accuracy of the estimated standard
errors.
For each simulation, the data generating mechanism is a gamma glm. The
covariates are the same under each simulation setting, and their values
are fixed, not random. The covariate vector contains an intercept column of
ones and three variables. The first variable is an indicator with exactly 20% ones in each
sample (observations 5, 10, 15, 20, ... have the value one, and the rest
have the value zero). The second variable is a continuous variable that follows a
uniform sequence between zero and one in each sample (the first
observation has the value zero, and the last observation has the value one). Lastly,
the third variable is the interaction, i.e. the product, of the first two.

The coefficient values were $(\beta_1, \beta_2, \beta_3) = (1, 1, -2)$
for all simulation scenarios. The intercept was $\beta_0 = 0$ for
simulations 1 & 2, and $\beta_0 = 1$ for simulations 3 & 4. These values
were selected so the mean response values were centered close to one
and, for the identity link scenarios, guaranteed to be positive.
The log link was used for simulations 1 & 2, and the identity link was
used for simulations 3 & 4. For all four simulations, the response was
drawn from a gamma distribution with mean $g^{-1}(x_i^T \beta)$ and a
dispersion parameter that varies by simulation to achieve different
$R^2$ values. $R^2$ is defined as the squared correlation between the
response and the linear combination of predictors.

Simulation 1: log link and high $R^2$ ($R^2 \approx 0.13$)
Simulation 2: log link and low $R^2$ ($R^2 \approx 0.019$)
Simulation 3: identity link and high $R^2$ ($R^2 \approx 0.13$)
Simulation 4: identity link and low $R^2$ ($R^2 \approx 0.018$)
A gldrm was fit to each data set with the correct link function and with
$f_0$ constrained to have mean $\mu^* = 1$. If we had followed
our usual practice of choosing $\mu^*$ to be the sample mean of the
observed response values, then the reference distribution would have a
different mean for each simulation replicate. By choosing $\mu^* = 1$
for all replicates, the true $f_0$ always follows a gamma distribution
with mean one and a variance that varies by simulation
scenario. The value $\mu^* = 1$ was chosen because, by design, it fell
within the range of observed response values for all 2,000 replicates of
each simulation scenario. The cumulative reference distribution
estimate, denoted $\hat F_0$, was computed at the 0.1, 0.25, 0.5, 0.75,
and 0.9 percentiles.
For each simulation scenario, we display four tables. The first table
(Tables 1, 5,
9, and 13) contains the
sample mean of each $\hat\beta_j$ estimator. For comparison, we also
calculated the gamma glm coefficient estimates for each simulated data
set and display their sample mean alongside in parentheses.

The second table (Tables 2, 6,
10, and 14) contains the root
mean squared error (rmse) of each $\hat\beta_j$ estimator. The rmse is
calculated as
$\sqrt{\frac{1}{2000} \sum_{r=1}^{2000} \left(\hat\beta_j^{(r)} - \beta_j\right)^2}$,
where $\hat\beta_j^{(r)}$ is the estimator of the $r$th simulation
replicate, and $\beta_j$ is the true parameter value. For comparison, we
also show the relative efficiency compared to the gamma glm estimator.
Relative efficiency is calculated as
$\mathrm{mse}_{\mathrm{glm}} / \mathrm{mse}_{\mathrm{gldrm}}$, where mse is the
mean squared error (not rmse).
The third table (Tables 3, 7,
11, and 15) contains Wald
confidence interval coverage rates for each $\hat\beta_j$ estimator.
Confidence intervals were calculated based on a $t$-distribution with
$n - p$ degrees of freedom.

The fourth table (Tables 4, 8,
12, and 16) contains the mean of
$\hat F_0$, calculated at the true 0.1, 0.25, 0.5, 0.75, and 0.9 percentiles.
To summarize the results, the gldrm estimators perform as expected. In
all four simulation scenarios, the bias of $\hat\beta$ and $\hat F_0$
goes to zero as the sample size increases. For the high $R^2$
scenarios, there is very little bias even at $n = 25$. Also, the
relative efficiency of the $\hat\beta$ estimators compared to the gamma glm
estimators goes to one as the sample size increases. This is expected
because, as Rathouz and Gao (2009) demonstrated, the $\hat\beta$ and $\hat f_0$
estimators are asymptotically orthogonal, so gldrm and glm have equal
asymptotic efficiency when the glm model is correctly specified. For the
high $R^2$ scenarios, the relative efficiency is close to one even at
$n = 25$. For the low $R^2$ scenarios, the gldrm estimator is actually
more efficient than the glm estimator for small sample sizes.

The standard error estimates for the gldrm $\hat\beta$ estimators are
consistently low for small sample sizes, as demonstrated by low Wald
confidence interval coverage rates. The coverage rates improve with
increasing sample size, demonstrating good asymptotic properties of the
standard error estimators. Likelihood ratio confidence intervals can be
calculated with the gldrm R package and may have better small sample
performance, as demonstrated by Huang (2014).
Simulation 1
Simulation 1 uses the log link with intercept $\beta_0 = 0$ and a gamma
dispersion that results in an $R^2$ of approximately 0.13. The
simulation was replicated 2,000 times.
Table 1: Sample mean of $\hat\beta$. For comparison, the gamma glm
sample mean is shown in parentheses.

      mean (gamma glm mean)
      n = 25           n = 100          n = 400
β0    -0.02 (-0.02)     0.00 (0.00)      0.00 (0.00)
β1     0.93 (0.92)      0.98 (0.99)      0.99 (0.99)
β2     0.99 (1.00)      0.99 (1.00)      1.00 (1.00)
β3    -2.00 (-2.00)    -2.00 (-2.01)    -1.99 (-1.99)
Table 2: Root mean squared error (rmse) of $\hat\beta$. In
parentheses is the relative efficiency compared to the gamma glm
estimator.

      rmse (relative efficiency)
      n = 25         n = 100        n = 400
β0    0.31 (0.99)    0.16 (1.00)    0.08 (1.00)
β1    0.81 (0.96)    0.38 (1.01)    0.18 (1.00)
β2    0.54 (1.00)    0.27 (1.00)    0.14 (0.99)
β3    1.27 (0.96)    0.63 (1.00)    0.30 (1.00)
Table 3: Wald confidence interval coverage rate for $\hat\beta$.

          80% C.I.                   90% C.I.                   95% C.I.
      n = 25  n = 100  n = 400   n = 25  n = 100  n = 400   n = 25  n = 100  n = 400
β0    0.755   0.799    0.805     0.870   0.896    0.899     0.920   0.945    0.944
β1    0.677   0.764    0.800     0.782   0.860    0.900     0.837   0.919    0.949
β2    0.742   0.784    0.810     0.853   0.883    0.896     0.910   0.940    0.950
β3    0.681   0.767    0.806     0.800   0.869    0.907     0.867   0.920    0.948
Table 4: Sample mean of $\hat F_0$ at selected true percentiles.

                            mean
Percentile   Quantile   n = 25   n = 100   n = 400
0.10         0.27       0.082    0.097     0.099
0.25         0.48       0.222    0.244     0.249
0.50         0.84       0.487    0.498     0.499
0.75         1.35       0.759    0.751     0.750
0.90         1.94       0.911    0.902     0.900
Simulation 2
Simulation 2 uses the log link with intercept $\beta_0 = 0$ and a gamma
dispersion that results in an $R^2$ of approximately 0.019. The simulation was
replicated 2,000 times.
Table 5: Sample mean of $\hat\beta$. For comparison, the gamma glm
sample mean is shown in parentheses.

      mean (gamma glm mean)
      n = 25           n = 100          n = 400
β0    -0.16 (-0.19)    -0.05 (-0.06)    -0.02 (-0.02)
β1     0.21 (0.29)      0.80 (0.84)      0.96 (0.97)
β2     0.90 (0.96)      0.99 (1.01)      1.00 (1.00)
β3    -1.74 (-1.98)    -1.93 (-2.01)    -1.99 (-2.00)
Table 6: Root mean squared error (rmse) of $\hat\beta$. In
parentheses is the relative efficiency compared to the gamma glm
estimator.

      rmse (relative efficiency)
      n = 25         n = 100        n = 400
β0    0.94 (1.14)    0.45 (1.04)    0.23 (1.00)
β1    2.83 (1.43)    1.12 (1.09)    0.51 (1.02)
β2    1.62 (1.19)    0.78 (1.05)    0.39 (1.01)
β3    4.18 (1.59)    1.88 (1.14)    0.89 (1.02)
Table 7: Wald confidence interval coverage rate for $\hat\beta$.

          80% C.I.                   90% C.I.                   95% C.I.
      n = 25  n = 100  n = 400   n = 25  n = 100  n = 400   n = 25  n = 100  n = 400
β0    0.704   0.770    0.782     0.806   0.877    0.888     0.869   0.933    0.945
β1    0.626   0.724    0.781     0.732   0.832    0.883     0.794   0.897    0.935
β2    0.680   0.755    0.774     0.795   0.862    0.881     0.869   0.925    0.939
β3    0.633   0.722    0.762     0.750   0.829    0.879     0.819   0.895    0.935
Table 8: Sample mean of $\hat F_0$ at selected true percentiles.

                            mean
Percentile   Quantile   n = 25   n = 100   n = 400
0.10         0.00       0.093    0.100     0.101
0.25         0.01       0.230    0.247     0.250
0.50         0.17       0.465    0.496     0.499
0.75         1.04       0.727    0.747     0.749
0.90         3.00       0.903    0.899     0.900
Simulation 3
Simulation 3 uses the identity link with intercept $\beta_0 = 1$ and a gamma
dispersion that results in an $R^2$ of approximately 0.13. The
simulation was replicated 2,000 times.
Table 9: Sample mean of $\hat\beta$. For comparison, the gamma glm
sample mean is shown in parentheses.

      mean (gamma glm mean)
      n = 25           n = 100          n = 400
β0     1.00 (1.00)      0.99 (0.99)      1.00 (1.00)
β1     1.01 (1.01)      1.00 (1.00)      0.99 (0.99)
β2     1.01 (1.01)      1.01 (1.01)      1.00 (1.00)
β3    -2.02 (-2.01)    -2.01 (-2.01)    -1.99 (-1.99)
Table 10: Root mean squared error (rmse) of $\hat\beta$. In
parentheses is the relative efficiency compared to the gamma glm
estimator.

      rmse (relative efficiency)
      n = 25         n = 100        n = 400
β0    0.25 (0.99)    0.13 (0.99)    0.07 (1.00)
β1    0.88 (0.96)    0.40 (0.99)    0.20 (1.00)
β2    0.55 (0.98)    0.28 (0.99)    0.14 (1.00)
β3    1.26 (0.98)    0.63 (0.98)    0.32 (1.00)
Table 11: Wald confidence interval coverage rate for $\hat\beta$.

          80% C.I.                   90% C.I.                   95% C.I.
      n = 25  n = 100  n = 400   n = 25  n = 100  n = 400   n = 25  n = 100  n = 400
β0    0.735   0.793    0.802     0.830   0.888    0.892     0.888   0.939    0.947
β1    0.670   0.789    0.799     0.785   0.875    0.892     0.847   0.931    0.949
β2    0.737   0.782    0.796     0.849   0.885    0.893     0.916   0.941    0.943
β3    0.667   0.773    0.792     0.800   0.876    0.896     0.872   0.937    0.942
Table 12: Sample mean of $\hat F_0$ at selected true percentiles.

                            mean
Percentile   Quantile   n = 25   n = 100   n = 400
0.10         0.44       0.077    0.095     0.098
0.25         0.63       0.219    0.244     0.249
0.50         0.92       0.492    0.498     0.499
0.75         1.28       0.769    0.753     0.751
0.90         1.67       0.913    0.903     0.901
Simulation 4
Simulation 4 uses the identity link with intercept $\beta_0 = 1$ and a gamma
dispersion that results in an $R^2$ of approximately 0.018. The
simulation was replicated 2,000 times.
Table 13: Sample mean of $\hat\beta$. For comparison, the gamma
glm sample mean is shown in parentheses.

      mean (gamma glm mean)
      n = 25           n = 100          n = 400
β0     1.02 (1.02)      1.01 (1.01)      1.00 (1.00)
β1     0.91 (0.89)      0.97 (1.00)      0.99 (0.99)
β2     0.99 (1.02)      0.97 (0.98)      0.99 (1.00)
β3    -1.86 (-1.85)    -1.92 (-1.95)    -1.98 (-1.99)
Table 14: Root mean squared error (rmse) of $\hat\beta$. In
parentheses is the relative efficiency compared to the gamma glm
estimator.

      rmse (relative efficiency)
      n = 25         n = 100        n = 400
β0    0.73 (1.17)    0.39 (1.02)    0.19 (1.00)
β1    2.35 (1.05)    1.16 (1.09)    0.58 (1.02)
β2    1.58 (1.18)    0.81 (1.03)    0.40 (1.00)
β3    3.27 (1.18)    1.83 (1.11)    0.92 (1.02)
Table 15: Wald confidence interval coverage rate for $\hat\beta$.

          80% C.I.                   90% C.I.                   95% C.I.
      n = 25  n = 100  n = 400   n = 25  n = 100  n = 400   n = 25  n = 100  n = 400
β0    0.661   0.761    0.794     0.754   0.845    0.893     0.802   0.893    0.938
β1    0.595   0.718    0.767     0.692   0.822    0.868     0.741   0.870    0.921
β2    0.665   0.769    0.779     0.777   0.864    0.895     0.847   0.910    0.943
β3    0.666   0.711    0.777     0.789   0.824    0.877     0.871   0.892    0.932
Table 16: Sample mean of $\hat F_0$ at selected true percentiles.

                            mean
Percentile   Quantile   n = 25   n = 100   n = 400
0.10         0.02       0.085    0.097     0.100
0.25         0.10       0.223    0.246     0.249
0.50         0.45       0.474    0.493     0.499
0.75         1.32       0.741    0.749     0.750
0.90         2.71       0.907    0.902     0.900
Likelihood ratio and score confidence intervals
To support our claim that likelihood ratio confidence intervals may have
better small sample performance, we compared the Wald, likelihood ratio,
and score 95% confidence interval coverage rates under the settings of
Simulation 1 with $n = 25$. The coverage rates are shown in Table
17. The Wald coverage rates are similar to those
in Table 3, but not identical because this
experiment used a new set of 2,000 replicates.
While the Wald confidence intervals tend to be too narrow, the score
confidence intervals tend to be a bit too wide. The likelihood ratio
intervals have coverage rates much closer to 0.95 than the corresponding
Wald or score intervals. This is consistent with the findings of
Huang (2014).
Table 17: Coverage rates for 95% confidence intervals under the
settings of Simulation 1 with n = 25.

                    β0      β1      β2      β3
Wald                0.918   0.823   0.909   0.859
Likelihood ratio    0.969   0.939   0.958   0.944
Score               0.976   0.966   0.967   0.967
6 Computational time and scaling
The following simulation explores the computation time of gldrm and
how it scales with the number of observations and number of
covariates . For each value of and , we fit gldrm to 100
randomly generated data sets. A gldrm was fit once to each data set.
Covariates were drawn independently from a standard normal distribution,
and coefficients were drawn independently from a uniform distribution on
the interval (-1, 1). Simulation 1 used an identity link function with
response drawn from a normal distribution with variance function
. Simulation 2 used a log link function with response drawn
from an exponential distribution, i.e. a gamma distribution with
variance function .
Iteration limits were set at 100 per update, 1,000 per
update, and 100 for the outer loop (for which each iteration consists of
a single-iteration update, followed by an update).
Convergence thresholds were set at for the update,
update, and outer loop update. This experiment was run using a 2.2
GHz AMD Opteron 6174 processor. Figures 3 and
4 show the average CPU seconds for Simulations 1 and 2,
respectively.
We also repeated this experiment with the number of support points fixed
at 25. To do this, we generated $n$ response values from the model. The
first 25 values were designated as the support, and the remaining
response values were matched to the nearest support value and set to
that value. This discrete data generating model is not actually a gldrm
model, but we only use it to evaluate computation time. Figures
5 and 6 show the average
CPU seconds for Simulations 1 and 2, respectively, with the support size
fixed at 25.
In summary, the computation time scales roughly linearly with $n$ when
the support is fixed at 25. When the support grows with $n$, the
computation time grows faster than linearly. Computation time increases
with $p$, but there is not a consistent pattern, especially with fixed
support.
Figure 3: Mean computation time for Simulation 1. Error bars represent
standard errors over 100 replicates.

Figure 4: Mean computation time for Simulation 2. Error bars represent
standard errors over 100 replicates.

Figure 5: Mean computation time for Simulation 1 with support size
fixed at 25. Error bars represent standard errors over 100 replicates.

Figure 6: Mean computation time for Simulation 2 with support size
fixed at 25. Error bars represent standard errors over 100 replicates.
7 R package: gldrm
The gldrm package (Wurm and Rathouz 2018) was written entirely in the R
programming language. Its primary functions are the following.
gldrm is the main function for fitting gldrm models.
gldrmLRT performs a likelihood ratio test between nested gldrm
models. The test statistic is based on the log-likelihood difference
between the full and null models. Under the
null hypothesis, the test statistic follows an asymptotic F
distribution with degrees of freedom $q$ and $n - p$, where $q$ is the
difference in the number of parameters between the two models, $n$ is the
number of observations, and $p$ is the number of parameters in the
full model.
gldrmCI calculates Wald or likelihood ratio confidence intervals
for a single coefficient.
predict.gldrmFit is a predict method, which is modeled after the
predict.glm method in the stats package. Predictions can be
obtained on the scale of the response or linear predictor, or the
tilted nonparametric density for each observation can be returned.
gldrmPIT returns a set of randomized probability inverse transform
values and plots the histogram and uniform QQ plot.
8 R example: iris data
We demonstrate the gldrm package using the Iris data set from the
datasets package. This is a data set of $n = 150$ observations. We
choose sepal length to be the response variable. This variable has 35
unique values, so the support size is $K = 35$. We first demonstrate how to
fit a gldrm model with the optimization function, which is simply called
gldrm. We then demonstrate how to perform inference on the regression
coefficients. We check the goodness of fit using the randomized
probability inverse transform. Finally, we show how to obtain predictions
for a set of observations, including the predicted mean and the
nonparametric estimate of the conditional distribution function.
Fit gldrm model
The gldrm optimization function takes a formula and data argument,
similar to R’s glm function. Instead of passing both an error
distribution and a link function through a family argument, gldrm
only requires a link function. The link argument will accept the name
of a link function (any function supported by make.link in the stats
package is accepted). Alternatively, a custom link function can be
passed as a list containing three items:
linkfun A vectorized link function.
linkinv The corresponding inverse link function, also vectorized.
mu.eta The derivative of the inverse link function, also
vectorized.
This list structure is the same as that of link-glm class objects,
which are constructed by the make.link function. The custom link
option allows great flexibility, but it is imperative that the user
correctly specifies the inverse link and its derivative.
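For example, a custom link could be supplied as follows; the square-root link here is purely illustrative, and the list mirrors the structure of link-glm objects created by make.link:

## A hypothetical custom square-root link passed as a list
sqrt_link <- list(
  linkfun = function(mu) sqrt(mu),      # g(mu)
  linkinv = function(eta) eta^2,        # g^{-1}(eta)
  mu.eta  = function(eta) 2 * eta       # derivative of g^{-1}(eta)
)
fit_sqrt <- gldrm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
                  data = iris, link = sqrt_link)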
R> ### Load gldrm package and Iris data from datasets package
R> library(gldrm)
R> data(iris, package = "datasets")

R> ### Fit gldrm with all variables
R> fit <- gldrm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
R+             data = iris, link = "log")
R> fit

Summary of gldrm fit

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.1832     0.0369   32.10  < 2e-16 ***
Sepal.Width         0.0788     0.0128    6.17  6.4e-09 ***
Petal.Length        0.1128     0.0102   11.04  < 2e-16 ***
Petal.Width        -0.0350     0.0248   -1.41    0.162
Speciesversicolor  -0.0561     0.0395   -1.42    0.157
Speciesvirginica   -0.0994     0.0557   -1.79    0.076 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Likelihood ratio test against null model:
  F-statistic: 57.4
  Numerator degrees of freedom: 5
  Denominator degrees of freedom: 144
  P-value: < 2e-16
Inference
The gldrmLRT function performs a semiparametric likelihood ratio test
between two nested models. We demonstrate this on the Iris data by
fitting a sub-model that excludes "Species", which is a categorical
variable with three levels and two degrees of freedom. We also obtain
Wald and likelihood ratio confidence intervals for the petal width
coefficient.
R> ### Fit gldrm without the categorical variable "Species"
R> fit0 <- gldrm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
R+              data = iris, link = "log")

R> ### Likelihood ratio test for the nested models
R> gldrmLRT(fit, fit0)

Likelihood ratio test:
  F-statistic: 2.03
  Numerator degrees of freedom: 2
  Denominator degrees of freedom: 144
  P-value: 0.135

R> ### Wald 95% confidence interval for Petal.Width
R> gldrmCI(fit, term = "Petal.Width", test = "Wald", type = "2-sided", level = .95)

95% Wald confidence interval for Petal.Width:
  (-0.084, 0.014)

R> ### Likelihood ratio 95% confidence interval for Petal.Width
R> gldrmCI(fit, term = "Petal.Width", test = "LRT", type = "2-sided", level = .95)

95% likelihood ratio confidence interval for Petal.Width:
  (-0.094, 0.025)
Goodness of fit
The gldrmPIT function produces goodness of fit plots using the
probability inverse transform. In addition to plotting, this function
also returns the inverse transform values as a vector.
R> pit <- gldrmPIT(fit)
Prediction
We obtain predictions for three selected observations in the data set:
one from each Species. These observations were contained in the training
data, but we could obtain predictions for out-of-sample observations in
the same way. Using the predict method, we obtain the fitted mean
response value for each observation, which is the estimate of
$E(Y \mid X = x)$.

We also use the predict method to obtain the nonparametric estimate of
the conditional density of each observation. This is obtained by setting the
argument type = "fTilt", which returns an $n \times K$ matrix whose
$(i, k)$th entry is the fitted probability mass at support point $s_k$
for observation $i$. Each row contains the
nonparametric conditional density estimate for a given observation and
sums to one. We use this matrix to calculate conditional
probabilities of the form $P(a < Y \le b \mid X = x)$ for each
observation. Note that all observed support values (sepal length values)
fall between four and eight, so all probability mass falls within this
interval.
R> ### Select three observations; one from each Species
R> newdata <- iris[c(1, 51, 101), ]

R> ### Fitted mean Sepal.Length
R> fitted_mean <- predict(fit, newdata = newdata, type = "response")
R> fitted_mean <- round(fitted_mean, 2)

R> ### Estimated Sepal.Length distribution of each observation
R> ### Note: all Sepal.Length values are between 4 and 8
R> fTilt <- predict(fit, newdata = newdata, type = "fTilt")
R> spt <- fit$spt
R> F4 <- rowSums(fTilt[ , spt <= 4])
R> F5 <- rowSums(fTilt[ , spt <= 5])
R> F6 <- rowSums(fTilt[ , spt <= 6])
R> F7 <- rowSums(fTilt[ , spt <= 7])
R> F8 <- rowSums(fTilt[ , spt <= 8])
R> Ftilt <- cbind(F5-F4, F6-F5, F7-F6, F8-F7)
R> Ftilt <- round(Ftilt, 3)
R> colnames(Ftilt) <- c("P(4<Y<=5)", "P(5<Y<=6)", "P(6<Y<=7)", "P(7<Y<=8)")
R> cbind(newdata, fitted_mean, Ftilt)
9 Discussion
We introduced a new optimization algorithm for gldrm, which
scales better computationally with the number of unique observed
response values. This is especially useful for estimation with continuous
response data, where the number of parameters in the nonparametric
component $\tilde f$ equals the sample size. In particular, the BFGS technique
dramatically speeds up the $\tilde f$ optimization step, and makes
gldrm much more computationally feasible for either discrete or
continuous response data. The new algorithm is implemented in the
gldrm package. Simulation results show that the algorithm and software
obtain accurate estimates and standard errors. Computational time was
shown to be feasible for support sizes well over one thousand.
Future research directions could include the incorporation of random
effects into the gldrm framework. Optimization techniques for
generalized linear mixed models, such as adaptive Gaussian quadrature,
may be useful for model estimation
(Pinheiro and Bates 1995; Pinheiro and Chao 2006).
10 Acknowledgments
Research reported in this publication was supported by the National
Heart, Lung, and Blood Institute of the National Institutes of Health
under award number T32 HL083806 (for MJW) and R01 HL094786-05A1 (for
PJR). The content is solely the responsibility of the authors and does
not necessarily represent the official views of the National Institutes
of Health.
We thank Dr. Alan Huang, University of Queensland, for suggesting the
use of probability inverse transform diagnostic plots, as well as
providing the corresponding code, which we have incorporated into the
gldrm package.
11 Notation
In this appendix, we use $\partial$ to denote the partial derivative and
$d$ to denote the total derivative. We require this notation for chain
rule derivatives, specifically to derive the score functions with respect
to $\beta$ and $\log \tilde f$. Each $\theta_i$ is implicitly a function
of $(\beta, \tilde f)$ and hence should not be held fixed when
differentiating the log-likelihood with respect to either $\beta$ or
$\tilde f$. We also let $\ell_i$
denote the contribution of the $i$th observation to the
log-likelihood.
12 A. Score function for $\log \tilde f$
We derive Equation (7), which gives the score function
with respect to $\log \tilde f$, holding $\beta$ fixed. Using the
definition of $\ell_i$ in Equation (11) and applying
the chain rule, the $k$th element of the score function is
The first term on the RHS of Equation (12) is
The second term on the RHS of Equation (12) is
Recall that , which does not vary with
. Therefore,
and the third term on the RHS of Equation (12) is
Plugging the results of Equations (13),
(14), and (15) into Equation
(12), we obtain
We obtain the result of Equation (7) by applying the
Jacobian transformation from $\tilde f$ to $\log \tilde f$.
13 B. Information matrix for $\log \tilde f$
The information matrix for $\log \tilde f$ is not used during the BFGS
optimization routine. However, we derive it here to correct a missing
term in the Rathouz and Gao (2009) derivation. Each matrix element is
where
and
From here on, all expectations are conditional on ,
but we suppress the conditional expression. The first term on the RHS of
Equation (16) is
The second term on the RHS of Equation (16) is
where we use the fact that
. The
third term on the RHS of Equation (16) is
where we use the fact that
.
Plugging the results of Equations
(17)-(19) into Equation
(16), we obtain
14 C. Score function for $\beta$
We derive Equation (8), which gives the score
function with respect to $\beta$, holding $\tilde f$ fixed. By the
chain rule, each element is
We already derived the first term on the RHS of Equation
(20) in Equation (14). Because
,
Because ,
Plugging the results of Equations (14),
(21), and (22) into Equation
(20), we obtain the result of Equation
(8).
15 D. Proof that transformed response with modified link function has equivalent log-likelihood
Let and . Let
and . Equation
(11) can be rewritten
where . Equation (4)
can be rewritten as
Hence, if we define a modified link function
with inverse
, we can write
In other words, the transformed-scale tilt parameter has the same implicit
definition as $\theta_i$ when the modified link function and transformed response
variable are used in place of the originals. This shows that, as a function
of $(\beta, \tilde f)$, the log-likelihood with the transformed
response and modified link function is equivalent to the original
log-likelihood. Note that $\theta_i$ must be multiplied by a constant
rescaling factor, determined by the range of the observed response,
if one would like to calculate it on the original scale.
References
C. G. Broyden. The convergence of a class of double-rank minimization algorithms. IMA Journal of Applied Mathematics, 6(3): 222–231, 1970. URL https://doi.org/10.1093/imamat/6.3.222.
A. Huang. Joint estimation of the mean and error distribution in generalized linear models. Journal of the American Statistical Association, 109(505): 186–196, 2014. URL https://doi.org/10.1080/01621459.2013.824892.
A. Huang and P. J. Rathouz. Proportional likelihood ratio models for mean regression. Biometrika, 99(1): 223–229, 2012. URL https://doi.org/10.1093/biomet/asr075.
M. Lovric. International encyclopedia of statistical science. Springer, 2011.
A. B. Owen. Empirical Likelihood. CRC Press, Boca Raton, 2001.
J. C. Pinheiro and D. M. Bates. Approximations to the log-likelihood function in the nonlinear mixed-effects model. Journal of Computational and Graphical Statistics, 4(1): 12–35, 1995. URL https://doi.org/10.1080/10618600.1995.10474663.
J. C. Pinheiro and E. C. Chao. Efficient Laplacian and adaptive Gaussian
P. J. Rathouz and L. Gao. Generalized linear models with unspecified reference distribution. Biostatistics, 10(2): 205–218, 2009. URL https://doi.org/10.1093/biostatistics/kxn030.
J. Smith. Diagnostic checks of non-standard time series models. Journal of Forecasting, 4(3): 283–291, 1985. URL https://doi.org/10.1002/for.3980040305.
Citation
For attribution, please cite this work as
M. J. Wurm & P. J. Rathouz, "Semiparametric Generalized Linear Models with the gldrm Package", The R Journal, 2018
BibTeX citation
@article{RJ-2018-027,
author = {Wurm, Michael J. and Rathouz, Paul J.},
title = {Semiparametric Generalized Linear Models with the gldrm Package},
journal = {The R Journal},
year = {2018},
note = {https://rjournal.github.io/},
volume = {10},
issue = {1},
issn = {2073-4859},
pages = {288-307}
}