The Generalized Hermite distribution (and the Hermite distribution as a particular case) is often used for fitting count data in the presence of overdispersion or multimodality. Despite this, to our knowledge, no standard software packages have implemented specific functions to compute basic probabilities and make simple statistical inference based on these distributions. We present here a set of computational tools that allows the user to face these difficulties by modelling with the Generalized Hermite distribution using the R package hermite. The package can also be used to generate random deviates from a Generalized Hermite distribution and to use basic functions to compute probabilities (density, cumulative density and quantile functions are available), to estimate parameters using the maximum likelihood method and to perform the likelihood ratio test for Poisson assumption against a Generalized Hermite alternative. In order to improve the density and quantile functions performance when the parameters are large, Edgeworth and Cornish-Fisher expansions have been used. Hermite regression is also a useful tool for modeling inflated count data, so its inclusion to a commonly used software like R will make this tool available to a wide range of potential users. Some examples of usage in several fields of application are also given.
The Poisson distribution is without a doubt the most common when dealing with count data. There are several reasons for that, including the fact that the maximum likelihood estimate of the population mean is the sample mean and the property that this distribution is closed under convolutions (see Johnson et al. (2005)). However, it is very common in practice that data presents overdispersion or zero inflation, cases where the Poisson assumption does not hold. In these situations it is reasonable to consider discrete distributions with more than one parameter. The class of all two-parameter discrete distributions closed under convolutions and satisfying that the sample mean is the maximum likelihood estimator of the population mean are characterized in Puig (2003). One of these families is just the Generalized Hermite distribution. Several generalizations of the Poisson distribution have been considered in the literature (see, for instance, Gurland (1957; Kemp and Kemp 1965; Kemp and Kemp 1966; Lukacs 1970)), that are compound-Poisson or contagious distributions. They are families with probability generating function (PGF) defined by
where
After fixing the value of the positive integer
A useful expression for the probability mass function of the Generalized
Hermite distribution in terms of the population mean
where
The probabilities can be also written in terms of the parameters
The case
The probability mass function and the distribution function for some
values of
Gupta and Jain (1974) also develop a recurrence relation that can be used to calculate the probabilities in a numerically efficient way:
where
Like the common distributions in R, the package hermite implements the
probability mass function (dhermite
), the distribution function
(phermite
), the quantile function (qhermite
) and a function for
random generation (rhermite
) for the Generalized Hermite distribution.
It also includes the function glm.hermite
, which allows to calculate,
for a univariate sample of independent draws, the maximum likelihood
estimates for the parameters and to perform the likelihood ratio test
for a Poisson null hypothesis against a Generalized Hermite alternative.
This function can also carry out Hermite regression including covariates
for the population mean, in a very similar way to that of the well known
R function glm
.
The probability mass function of the Generalized Hermite distribution is
implemented in hermite through the function dhermite
. A call to this
function might be
dhermite(x, a, b, m = 2)
The description of these arguments can be summarized as follows:
x
: Vector of non-negative integer values.a
: First parameter for the Hermite distribution.b
: Second parameter for the Hermite distribution.m
: Degree of the Generalized Hermite distribution. Its default
value is 2, corresponding to the classical Hermite distribution
introduced in Kemp and Kemp (1965).The recurrence relation ((4)) is used by dhermite
for the computation of probabilities. For large values of any parameter
a
or b
(above 20), the probability of
The normal approximation is justified taking into account the
representation of any Generalized Hermite random variable
The distribution function of the Generalized Hermite distribution is
implemented in hermite through the function phermite
. A call to this
function might be
phermite(q, a, b, m = 2, lower.tail = TRUE)
The description of these arguments can be summarized as follows:
q
: Vector of non-negative integer quantiles.lower.tail
: Logical; if TRUE
(the default value), the computed
probabilities are All remaining arguments are defined as specified for dhermite
.
If
where
The quantile function of the Generalized Hermite distribution is
implemented in hermite through the function qhermite
. A call to this
function might be
qhermite(p, a, b, m = 2, lower.tail = TRUE)
The description of these arguments can be summarized as follows:
p
: Vector of probabilities.All remaining arguments are defined as specified for phermite
. The
quantile is right continuous: qhermite(p, a, b, m)
is the smallest
integer
When the parameters
The random generation function rhermite
uses the relationship between
Poisson and Hermite distributions detailed in Sections 1
and 2.1. A call to this function might be
rhermite(n, a, b, m = 2)
The description of these arguments can be summarized as follows:
n
: Number of random values to return.All remaining arguments are defined as specified for dhermite
.
Given a sample
where
The maximum likelihood equations do not always have a solution. This is due to the fact that this is not a regular family of distributions because its domain of parameters is not an open set. The following result gives a sufficient and necessary condition for the existence of such a solution (Puig 2003):
Proposition 1. Let
If the likelihood equations do not have a solution, the maximum of the
likelihood function ((6)) is attained at the border of
the domain of parameters, that is,
The package hermite allows to estimate the parameters glm.hermite
:
glm.hermite(formula, data, link = "log", start = NULL, m = NULL)
The description of the arguments can be summarized as follows:
formula
: Symbolic description of the model. A typical predictor
has the form response data
: An optional data frame containing the variables in the
model.link
: Character specification of the population mean link
function: "log"
or "identity
". By default link = "log"
.start
: A vector containing the starting values for the parameters
of the specified model. Its default value is NULL
.m
: Value for parameter m
. Its default value is NULL
, and in
that case it will be estimated as The returned value is an object of class glm.hermite
, which is a list
including the following components:
coefs
: The vector of coefficients.loglik
: Log-likelihood of the fitted model.vcov
: Covariance matrix of all coefficients in the model (derived
from the Hessian returned by the maxLik()
output).hess
: Hessian matrix, returned by the maxLik()
output.fitted.values
: The fitted mean values, obtained by transforming
the linear predictors by the inverse of the link function.w
: Likelihood ratio test statistic.pval
: Likelihood ratio test p-value.If the condition given in Proposition 1 is not met for a sample
x, the glm.hermite
function provides the maximum likelihood
estimates
The function glm.hermite
can also be used for Hermite regression as
described below and as will be shown through practical examples in
Sections 3.3 and 3.4.
Covariates can be incorporated into the model in various ways (see,
e.g., Giles (2007)). In function glm.hermite
, the distribution is
specified in terms of the dispersion index and its mean, which is then
related to explanatory variables as in linear regression or other
generalized linear models. That is, for Hermite regression, we assume
The link function provides the relationship between the linear predictor
and the mean of the distribution function. Although the log is the
canonical link for count data as it ensures that all the fitted values
are positive, the choice of the link function can be somewhat influenced
by the context and data to be treated. For example, the identity link is
the accepted standard in biodosimetry as there is no evidence that the
increase of chromosomal aberration counts with dose is of exponential
shape (IAEA 2011). Therefore, function glm.hermite
allows both link
functions.
A consequence of using the identity–link is that the maximum likelihood
estimate of the parameters obtained by maximizing the log–likelihood
function of the corresponding model may lead to negative values for the
mean. Therefore, in order to avoid negative values for the mean,
constraints in the domain of the parameters must be included when the
model is fitted. In function glm.hermite
, this is carried out by using
the function maxLik
from package
maxLik (Henningsen and Toomet 2011), which
is used internally for maximizing the corresponding log–likelihood
function. This function allows constraints which are needed when the
identity–link is used.
It should also be noted that results may depend of the starting values
provided to the optimization routines. If no starting values are
supplied, the starting values for the coefficients are computed/fixed
internally. Specifically, when the log–link is specified, the starting
values are obtained by fitting a standard Poisson regression model
through a call to the internal function glm.fit
from package stats.
If the link function is the identity and no initial values are provided
by the user, the function takes 1 as initial value for the coefficients.
In both cases, the initial value for the dispersion index
Regarding the order of the Hermite distribution, it can be fixed by the
user. If it is not provided (default option), when the model includes
covariates, the order
When dealing with the Generalized Hermite distribution it seems natural
to wonder if data could be fitted by using a Poisson distribution.
Because the Poisson distribution is included in the Generalized Hermite
family, this is equivalent to test the null hypothesis
Under the null hypothesis glm.hermite
function, using the maximum
likelihood estimates
A summary
method for objects of class glm.hermite
is included in the
hermite package, giving a summary of relevant information, including
the residuals minimum, maximum, median and first and third quartiles,
the table of coefficients including the corresponding standard errors
and significance tests based on the Normal reference distribution for
regression coefficients and the likelihood ratio test against the
Poisson distribution for the dispersion index. The AIC value for the
proposed model is also reported.
Several examples of application of the package hermite in a wide range of contexts are discussed in this section, including classical and recent real datasets and simulated data.
This example by Hartenstein (1961) describes the counts of Collenbola microarthropods in 200 samples of forest soil. The frequency distribution is shown in Table 1.
Microarthropods per sample | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
Frequency | 122 | 40 | 14 | 16 | 6 | 2 |
This dataset was analyzed in Puig (2003) with a Generalized Hermite
distribution of order
Using glm.hermite
we calculate the parameter estimates:
> library("hermite")
> data <- c(rep(0, 122), rep(1, 40), rep(2, 14), rep(3, 16), rep(4, 6), rep(5, 2))
> mle1 <- glm.hermite(data ~ 1, link = "log", start =NULL, m = 3)$coefs
(Intercept) dispersion.index order
-0.2875851 1.8905920 3.0000000
We can see that these parameter estimates are equivalent to those reported in Puig (2003).
The estimated expected frequencies are shown in Table 2.
Microarthropods per sample | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
Frequency | 118.03 | 49.11 | 10.22 | 14.56 | 5.61 | 1.15 |
The frequencies in Table 2 have been obtained running the following code and using the transformation
> a <- -exp(mle1$coefs[1])*(mle1$coefs[2] - mle1$coefs[3])/(mle1$coefs[3] - 1)
> b <- exp(mle1$coefs[1])*(mle1$coefs[2] - 1)/(mle1$coefs[3]*(mle1$coefs[3] - 1))
> exp <- round(dhermite(seq(0,5,1), a, b, m = 3)*200,2)
Note that the null hypothesis of Poisson distributed data is strongly
rejected, with a likelihood ratio test statistic
> mle1$w
[1] 48.66494
> mle1$pval
[1] 1.518232e-12
In Giles (2010), the author explores an interesting application of the
classical Hermite distribution (glm.hermite
, which are
Currency and banking crises | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
Observed | 45 | 44 | 19 | 17 | 19 | 13 | 6 | 4 |
Expected (Hermite) | 38.51 | 35.05 | 37.40 | 24.36 | 15.96 | 8.33 | 4.23 | 1.88 |
Expected (Poisson) | 22.07 | 44.66 | 45.20 | 30.49 | 15.43 | 6.25 | 2.11 | 0.61 |
In this example, the likelihood ratio test clearly rejects the Poisson
assumption in favor of the Hermite distribution (
The expected frequencies of the Hermite distribution shown in Table 3 have been calculated running the code,
> exp2 <- round(dhermite(seq(0, 7, 1), 0.910, 0.557, m = 2)*167, 2)
> exp2
In Giles (2007), the author proposes an application of Hermite regression to the 965 number 1 hits on the Hot 100 chart over the period January 1955 to December 2003. The data were compiled and treated with different approaches by Giles (see Giles (2006) for instance), and is available for download at the author website http://web.uvic.ca/~dgiles/. For all recordings that reach the number one spot, the number of weeks that it stays at number one was recorded. The data also allow for reentry into the number one spot after having being relegated to a lower position in the chart. The actual and predicted counts under Poisson and Hermite distributions are shown in Table 4.
Weeks | Actual | Poisson | Hermite |
---|---|---|---|
0 | 337 | 166.95 | 317.85 |
1 | 249 | 292.91 | 203.58 |
2 | 139 | 256.94 | 214.60 |
3 | 93 | 150.26 | 109.61 |
4 | 47 | 65.90 | 67.99 |
5 | 35 | 23.12 | 29.32 |
6 | 21 | 6.76 | 13.78 |
7 | 13 | 1.69 | 5.20 |
8 | 9 | 0.37 | 2.04 |
9 | 8 | 0.07 | 0.69 |
10 | 5 | 0.01 | 0.24 |
11 | 2 | 0.00 | 0.07 |
12 | 2 | 0.00 | 0.02 |
13 | 4 | 0.00 | 0.01 |
14 | 0 | 0.00 | 0.00 |
15 | 1 | 0.00 | 0.00 |
Several dummy covariates were also recorded, including indicators of whether the recording was by Elvis Presley or not, the artist was a solo female, the recording was purely instrumental and whether the recording topped the charts in nonconsecutive weeks.
The estimates and corresponding standard errors are obtained through the instructions
> data(hot100)
> fit.hot100 <- glm.hermite(Weeks ~ Elvis+Female+Inst+NonCon, data = hot100,
+ start = NULL, m = 2)
> fit.hot100$coefs
(Intercept) Elvis Female Inst NonCon
0.4578140 0.9126080 0.1913968 0.3658427 0.6558621
dispersion.index order
1.5947901 2.0000000
> sqrt(diag(fit.hot100$vcov))
[1] 0.03662962 0.16208682 0.07706250 0.15787552 0.12049736 0.02533045
For instance, we can obtain the predicted value for the average number of weeks that an Elvis record hits the number one spot. According to the model, we obtain a predicted value of 3.9370, whereas the observed corresponding value is 3.9375. The likelihood ratio test result justifies the fitting through a Hermite regression model instead of a Poisson model:
> fit.hot100$w; fit.hot100$pval
[1] 385.7188
[1] 3.53909e-86
In diGiorgio et al. (2004) the authors perform an experimental simulation of in vitro whole body irradiation for high-LET radiation exposure, where peripheral blood samples were exposed to 10 different doses of 1480MeV oxygen ions. For each dose, the number of dicentrics chromosomes per blood cell were scored. The corresponding data is included in the package hermite, and can be loaded into the R session by
> data(hi_let)
In Puig and Barquinero (2011) the authors apply Hermite regression (to contrast the
Poisson assumption) for fitting the dose-response curve, i.e. the yield
of dicentrics per cell as a quadratic function of the absorbed dose
linked by the identity function (which is commonly used in
biodosimetry). This model can be fitted using the glm.hermite
function
in the following way:
> fit.hlet.id <- glm.hermite(Dic ~ Dose+Dose2-1, data = hi_let, link = "identity")
Note that the model defined in fit.hlet.id
has no intercept.
A summary of the most relevant information can be obtained using the summary() method as in
> summary(fit.hlet.id)
Call:
glm.hermite(formula = Dic ~ Dose + Dose2 - 1, data = hi_let,
link = "identity")
Deviance Residuals:
Min 1Q Median 3Q Max
-0.06261842 -0.03536264 0.00000000 0.10373982 1.42251927
Coefficients:
Estimate Std. Error z value p-value
Dose 0.4620671 0.03104362 14.884450 4.158952e-50
Dose2 0.1555365 0.04079798 3.812357 1.376478e-04
dispersion.index 1.2342896 0.02597687 107.824859 1.468052e-25
order 2.0000000 NA NA NA
(Likelihood ratio test against Poisson is reported by *z value* for *dispersion.index*)
AIC: 5592.422
We can see the maximum likelihood estimates and corresponding standard
errors in the output from the summary output. Note also that the
likelihood ratio test rejects the Poisson assumption (
In the first example in Higueras et al. (2015) the Bayesian estimation of the
absorbed dose by Cobalt-60 gamma rays after the in vitro irradiation
of a sample of blood cells is given by a density proportional to the
probability mass function of a Hermite distribution taking 102 counts
whose mean and variance are functions of the dose
The reparametrization in terms of
This density only makes sense when
The following code generates the plot of the resulting density,
> u <- function(x) 45.939*x^2 + 5.661*x
> v <- function(x) 8.913*x^4 - 22.553*x^3 + 69.571*x^2 + 5.661*x
> a <- function(x) 2*u(x) - v(x); b <- function(x) (v(x) - u(x))/2
> dm <- uniroot(function(x) a(x), c(1, 4))$root; dm
> nc <- integrate(Vectorize(function(x) dhermite(102, a(x), b(x))), 0, dm)$value
> cd <- function(x){ vapply(x, function(d) dhermite(102, a(d), b(d)), 1)/nc }
> x <- seq(0, dm, .001)
> plot(x, cd(x), type = "l", ylab = "Probability Density", xlab = "Dose, x, Gy")
Figure 2 shows this resulting density.
A vector of random numbers following a Generalized Hermite distribution
can be obtained by means of the function rhermite
. For instance, the
next code generates 1000 observations according to an Hermite regression
model, including Bernoulli and normal covariates x1
and x2
:
> n <- 1000
> #### Regression coefficients
> b0 <- -2
> b1 <- 1
> b2 <- 2
> #### Covariate values
> set.seed(111111)
> x1 <- rbinom(n, 1, .75)
> x2 <- rnorm(n, 1, .1)
> u <- exp(b0 + b1*x1 + b2*x2)
> d <- 2.5
> m <- 3
> b <- u*(d - 1)/(m*(m - 1))
> a <- u - m*b
> x <- rhermite(n, a, b, m)
This generates a multimodal distribution, as can be seen in
Figure 3. The probability that this distribution take a
value under 5 can be computed using the function phermite
:
> phermite(5, mean(a), mean(b), m = 3)
[1] 0.8734357
Conversely, the value that has an area on the left of 0.8734357 can be
computed using the function qhermite
:
> qhermite(0.8734357, mean(a), mean(b), m = 3)
[1] 5
> mle3 <- glm.hermite(x ~ factor(x1) + x2, m = 3)
> mle3$coefs
(Intercept) factor(x1)1 x2 dispersion.index order
-1.809163 1.017805 1.839606 2.491119 3.000000
> mle3$w;mle3$pval
[1] 771.7146
[1] 3.809989e-170
In order to check the performance of the likelihood ratio test, we can
simulate a Poisson sample and run the funtion glm.hermite
again:
> y <- rpois(n, u)
> mle4 <- glm.hermite(y ~ factor(x1) + x2, m = 3)
> mle4$coefs[4]
dispersion.index
1
> mle4$w; mle4$pval
[1] -5.475874e-06
[1] 0.5
We can see that in this case the maximum likelihood estimate of the
dispersion index glm.hermite
will return a warning:
> z <- rpois(n, 20)
> mle5 <- glm.hermite(z ~ 1, m = 4)
Warning message:
In glm.hermite(z ~ 1, m = 4) : MLE equations have no solution
In this case, we have
Hermite distributions can be useful for modeling count data that presents multi-modality or overdispersion, situations that appear commonly in practice in many fields. In this article we present the computational tools that allow to overcome these difficulties by means of the Generalized Hermite distribution (and the classical Hermite distribution as a particular case) compiled as an R package. The hermite package also allows the user to perform the likelihood ratio test for Poisson assumption and to estimate parameters using the maximum likelihood method. Hermite regression is also a useful tool for modeling inflated count data, and it can be carried out by the hermite package in a flexible framework and including covariates. Currently, the hermite package is also used by the radir package (Moriña et al. 2015) that implements an innovative Bayesian method for radiation biodosimetry introduced in Higueras et al. (2015).
This work was partially funded by the grant MTM2012-31118, by the grant UNAB10-4E-378 co-funded by FEDER “A way to build Europe” and by the grant MTM2013-41383P from the Spanish Ministry of Economy and Competitiveness co-funded by the European Regional Development Fund (EDRF). We would like to thank professor David Giles for kindly providing some of the data sets used as examples in this paper.
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Moriña, et al., "Generalized Hermite Distribution Modelling with the R Package hermite", The R Journal, 2015
BibTeX citation
@article{RJ-2015-035, author = {Moriña, David and Higueras, Manuel and Puig, Pedro and Oliveira, María}, title = {Generalized Hermite Distribution Modelling with the R Package hermite}, journal = {The R Journal}, year = {2015}, note = {https://rjournal.github.io/}, volume = {7}, issue = {2}, issn = {2073-4859}, pages = {263-274} }