Kernel estimation is an important technique in exploratory data analysis. Its utility relies on its ease of interpretation, especially by graphical means. The Ake package is introduced for univariate density or probability mass function estimation and also for continuous and discrete regression functions using associated kernel estimators. These associated kernels have been proposed because they adapt to the specific features of the variables of interest. The package focuses on associated kernel methods appropriate for continuous (bounded, positive) or discrete (count, categorical) data often found in applied settings. Furthermore, optimal bandwidths are selected by cross-validation for any associated kernel and by Bayesian methods for the binomial kernel. Other Bayesian methods for selecting bandwidths with other associated kernels will complete this package in its future versions; in particular, a Bayesian adaptive method for gamma kernel estimation of density functions is developed. Some practical and theoretical aspects of the normalizing constant in both density and probability mass function estimation are given.
Kernel smoothing methods are popular tools for revealing the structure
of data that could be missed by parametric methods. For real datasets,
we often encounter continuous (bounded, positive) or discrete (count,
categorical) data types. The classical kernel methods assume that the
underlying distribution is unbounded and continuous, which is frequently not
the case; see, for example, (Duong 2007) for multivariate kernel density
estimation and discriminant analysis. A solution is provided for
categorical data sets by (Hayfield and Racine 2008). In fact, they used kernels
well adapted to these categorical sets (Aitchison and Aitken 1976). Throughout the present
paper, the unidimensional support
The recently developed Ake package implements associated kernels that seamlessly deal with continuous (bounded, positive) and discrete (categorical, count) data types often found in applied settings; see, for example, (Libengué 2013) and (Kokonendji and Senga Kiessé 2011). These associated kernels are used to smooth probability density functions (p.d.f.), probability mass functions (p.m.f.) or regression functions. Coming versions of this package will contain, among others, p.d.f. estimation of heavy-tailed data (e.g., Ziane et al. 2015) and the estimation of other functionals. Bandwidth selection remains crucial in associated kernel estimation of p.d.f., p.m.f. or regression functions. Several methods have been investigated for selecting bandwidth parameters, but the most commonly used is least squares cross-validation. A Bayesian approach has also recently been introduced by (Zougab et al. 2012) in the case of the binomial kernel. This method can be extended to various associated kernels and other functionals. Despite the great number of packages implemented for nonparametric estimation in continuous cases with unbounded kernels, to the best of our knowledge, R packages to estimate p.m.f.'s with categorical or count variables, p.d.f.'s with bounded or positive datasets, and the corresponding regression functions have been far less investigated.
The rest of the paper is organized as follows. In Section 2, we briefly recall the definition of associated kernels and illustrate them with examples in both continuous and discrete cases. Then, the associated kernel estimator for a p.d.f. or p.m.f. is presented and illustrated with some R codes in Section 3. In particular, three bandwidth selection methods are available: cross-validation for any (continuous or discrete) associated kernel, the Bayesian local method for the binomial kernel, and also a new theoretical Bayesian adaptive method for the gamma kernel. Some practical and theoretical aspects of the normalizing constant in both p.d.f. and p.m.f. estimations are also given. Section 4 investigates the case of regression functions with two bandwidth selection techniques: cross-validation and the Bayesian global method for the binomial kernel. Section 5 concludes.
Recall that the support
(Kokonendji and Senga Kiessé 2011), (Libengué 2013) Let
where
This definition has the following interesting interpretations:
The function
The support
The condition (1) can be viewed as
If
Both conditions (2) and (3) indicate
that the associated kernel is more and more concentrated around
In order to construct an associated kernel
Among the discrete associated kernels found in the literature, we here use
the best in the sense of Definition 2.2.1. Negative binomial and
Poisson kernels are respectively overdispersed (i.e.,
The binomial (bino) kernel is defined on the support
The following class of symmetric discrete triangular kernels has
been proposed in (Kokonendji et al. 2007). The discrete triangular (DTr(a)) kernel is defined on
A discrete kernel estimator for categorical data has been introduced
in (Aitchison and Aitken 1976). Its asymmetric discrete associated kernel version, which
we here label DiracDU (DirDU) for “Dirac Discrete Uniform”, has been
deduced in (Kokonendji and Senga Kiessé 2011) as follows. For fixed
One can find several continuous associated kernels in the literature, such as the Birnbaum-Saunders kernel of (Jin and Kawczak 2003). Here, we present seven associated kernels well adapted for the estimation of density or regression functions on any compact or nonnegative support of datasets. All these associated kernels satisfy Definition 2.2.1.
The extended beta (BE) kernel is defined on
The gamma (GA) kernel is given on
The lognormal (LN) kernel is defined on
The reciprocal inverse Gaussian (RIG) kernel is given on
see (Scaillet 2004), (Libengué 2013) and also (Igarashi and Kakizawa 2015). It is the p.d.f. of the
classical reciprocal inverse Gaussian distribution with mean
The three continuous associated kernels inverse gamma, inverse Gaussian
and Gaussian are not adapted for density estimation on supports
Indeed:
The inverse gamma (IGA) kernel, defined on
Also, the inverse Gaussian (IG) kernel, defined on
From the well known Gaussian kernel (Gaussian) on
Figure 1 shows some forms of the above-mentioned
univariate associated kernels. The plots highlight the importance given
to the target point and around it in discrete (a) and continuous (b)
cases. Furthermore, for a fixed bandwidth
We have implemented in R the function kern.fun
for both discrete and
continuous associated kernels. Seven possibilities are allowed for the
kernel function. We enumerate the arguments and results of the default
kern.fun.default
function in Table 1. The kern.fun
function is
used as follows for the binomial kernel:
R> x <- 5                                   # target point
R> h <- 0.1                                 # bandwidth
R> y <- 0:10                                # grid where the kernel is evaluated
R> k_b <- kern.fun(x, y, h, "discrete", "bino")
Arguments | Description
---|---
x | The target.
t | The single or the grid value where the function is computed.
h | The bandwidth or smoothing parameter.
ker | The associated kernel.
a0,a1 | The left and right bounds of the support for the extended beta kernel.
a | The arm of the discrete triangular kernel. Default value is 1.
c | The number of categories in DiracDU kernel. Default value is 2.
Result | Description
---|---
 | Returns a single value of the associated kernel function.
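As a quick visual check of the values just computed, one can plot them against the grid y. This is a minimal sketch (not taken from the package documentation), assuming k_b contains one kernel value per element of y:

R> plot(y, k_b, type = "h", lwd = 2, xlab = "y",
+      ylab = "Binomial kernel at the target x = 5")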
The p.d.f. or p.m.f. estimation is a common application of the
associated kernels. Let
(Kokonendji and Senga Kiessé 2011), (Libengué 2013) Let
It is easy to see that
R> data("faithful", package = "datasets")
R> x <- faithful$waiting
R> f <- dke.fun(x, ker = "GA", 0.1)
R> f$C_n
[1] 0.9888231
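For intuition, this constant can be reproduced approximately by hand: the non-normalized estimate at a point is the sample mean of the associated kernel values there, and its numerical integral over an evaluation grid approximates C_n. The sketch below builds it from kern.fun for the gamma kernel; it assumes the continuous kernels are called with the same positional arguments as in the binomial example of Table 1, and it is an illustration rather than the package's internal computation.

R> h <- 0.1
R> ftilde <- function(t0) mean(kern.fun(t0, x, h, "continuous", "GA"))
R> u <- seq(min(x), max(x), length.out = 500)   # evaluation grid over the data range
R> sum(sapply(u, ftilde)) * diff(u)[1]          # rough approximation of C_n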
Without loss of generality, we study
In discrete cases, the integrated squared error (ISE) defined by
In this section, we present some theoretical aspects of the normalizing
constant
(Kokonendji and Senga Kiessé 2011), (Libengué 2013) Let
Furthermore, for the continuous case, if
It is noticeable that the bias (9) is larger than that obtained with symmetric kernels and can thus be reduced; see, e.g., (Zhang 2010), (Zhang and Karunamuni 2010) and (Libengué 2013).
Following notations in Lemma 2.3.2, the mean and variance
of
Proof. From Lemma 2.3.2 and the Fubini theorem, we
successively show (11) as follows:
The variance (12) is trivial from Lemma 2.3.2.
Example 1. Let
Example 2. Let
Considering the binomial kernel with
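In the discrete case, the normalizing constant is simply the sum of the non-normalized estimate over the count support. Below is a minimal sketch with the binomial kernel on a simulated count sample (the data, bandwidth and truncation of the support are ours, for illustration only):

R> set.seed(123)
R> xc <- rpois(60, 2)                  # simulated count sample (illustration only)
R> hd <- 0.2
R> S <- 0:(max(xc) + 10)               # truncated support; the remaining tail mass is negligible
R> fd <- sapply(S, function(x0) mean(kern.fun(x0, xc, hd, "discrete", "bino")))
R> sum(fd)                             # discrete normalizing constant C_n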
Now, we consider the bandwidth selection problem, which is generally
crucial in nonparametric estimation. Several methods already existing
for continuous kernels can be adapted to the discrete case, such as the
classical least-squares cross-validation method; see, for example,
(Bowman 1984), (Marron 1987) and references therein. Here, we simply propose three
procedures for the bandwidth selection: cross-validation, Bayesian local
for the binomial kernel and Bayesian adaptive for the gamma kernel. Also, a review of
Bayesian bandwidth selection methods is presented. Each time, the
smoothing parameter selection is done with the non-normalized version
For a given associated kernel
Arguments | Description
---|---
Vec | The positive continuous data sample.
seq.bws | The sequence of bandwidths over which the cross-validation function is computed.
ker | The associated kernel.
a0,a1 | The bounds of the support of the extended beta kernel. Default values are respectively 0 and 1.
a | The arm of the discrete triangular kernel. Default value is 1.
c | The number of categories in DiracDU kernel. Default value is 2.
Results | Description
---|---
hcv | The optimal bandwidth obtained by cross-validation.
seq.h | The sequence of bandwidths used to compute hcv.
CV | The values of the cross-validation function.
Table 2 gives the arguments and results of the
cross-validation function hcvc.fun
defined for continuous data. The hcvd.fun
is the corresponding function for discrete data.
The hcvc.fun
is performed with the Old Faithful geyser data described
in (Azzalini and Bowman 1990) and (Härdle 2012). The dataset concerns waiting time
between eruptions and the duration of the eruption for the Old Faithful
geyser in Yellowstone National Park, Wyoming, USA. The following codes
and Figure 2 give smoothing density estimation
with various associated kernels of the waiting time variable.
R> data("faithful", package = "datasets")
R> x <- faithful$waiting
R> f1 <- dke.fun(x, 0.1, "continuous", ker = "GA")
R> f2 <- dke.fun(x, 0.036, "continuous", ker = "LN")
R> f3 <- dke.fun(x, 0.098, "continuous", ker = "RIG")
R> f4 <- dke.fun(x, 0.01, "continuous", ker = "BE", a0 = 40, a1 = 100)
R> t <- seq(min(x), max(x), length.out = 100)
R> hist(x, probability = TRUE, xlab = "Waiting times (in min.)",
+ ylab = "Frequency", main = "", border = "gray")
R> lines(t, f1$fn, lty = 2, lwd = 2, col = "blue")
R> lines(t, f2$fn, lty = 5, lwd = 2, col = "black")
R> lines(t, f3$fn, lty = 1, lwd = 2, col = "green")
R> lines(t, f4$fn, lty = 4, lwd = 2, col = "grey")
R> lines(density(x, width = 12), lty = 8, lwd = 2, col = "red")
R> legend("topleft", c("Gamma", "Lognormal", "Reciprocal inverse Gaussian",
+ "Extended beta", "Gaussian"), col = c("blue", "black", "green", "grey", "red"),
+ lwd = 2, lty = c(2, 5, 1, 4, 8), inset = .0)
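For orientation, the criterion minimized by cross-validation here is the classical least-squares one: the integral of the squared estimate minus twice the mean of the leave-one-out estimates evaluated at the observations. The sketch below evaluates it directly from kern.fun with the gamma kernel on the waiting times; it assumes the continuous kernels take the same positional arguments as in the binomial example of Table 1, approximates the integral over the data range only, and is an illustration rather than a substitute for hcvc.fun.

R> lscv <- function(h) {
+    u <- seq(min(x), max(x), length.out = 300)   # integration grid (data range only)
+    fhat <- sapply(u, function(t0) mean(kern.fun(t0, x, h, "continuous", "GA")))
+    loo <- sapply(seq_along(x), function(i)
+      mean(kern.fun(x[i], x[-i], h, "continuous", "GA")))
+    sum(fhat^2) * diff(u)[1] - 2 * mean(loo)
+  }
R> hs <- seq(0.05, 0.3, by = 0.01)
R> hs[which.min(sapply(hs, lscv))]     # bandwidth minimizing the criterion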
Bayesian inference grows out of the simple formula known as Bayes' rule.
Assume we have two random variables whose joint distribution factorizes as
Joint probability = Conditional probability × Marginal probability.
Thus we have:
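In generic notation, writing the two variables above as A and B (symbols chosen here only for illustration), Bayes' rule reads

P(A | B) = P(B | A) P(A) / P(B), provided P(B) > 0.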
This expression (Bayes rule) indicates that we can compute the
conditional probability of a variable
A prior probability is the probability available to us beforehand, before making any additional observations. A posterior probability is the probability obtained from the prior probability after updating it with the additional observations.
The Bayesian approach to parameter estimation works as follows:
Formulate our knowledge about a situation.
Gather data.
Obtain posterior knowledge that updates our beliefs.
How do we formulate our knowledge about a situation?
Define a distribution model which expresses qualitative aspects of our knowledge about the situation. This model will have some unknown parameters, which will be dealt with as random variables.
Specify a prior probability distribution which expresses our subjective beliefs and subjective uncertainty about the unknown parameters, before seeing the data.
After gathering the data, how do we obtain posterior knowledge? The prior distribution is updated with the likelihood of the observed data, via Bayes' rule, to yield the posterior distribution of the unknown parameters.
(Zougab et al. 2013) proposed a Bayesian approach based upon a likelihood
cross-validation approximation and a Markov chain Monte Carlo (MCMC)
method for deriving the global optimal bandwidth using the famous
binomial kernel. However, a global bandwidth does not generally provide
a good estimator for complex p.m.f.’s, in particular for small and
moderate sample sizes. Generally, the global discrete associated kernel
estimator tends to simultaneously under- and oversmooth
In order to improve the global discrete associated kernel estimator, in particular for complex count data with small and moderate sample sizes, (Zougab et al. 2012) and (Zougab et al. 2013) adapted two versions of variable bandwidths for the discrete associated kernel estimator and proposed Bayesian approaches for selecting these variable bandwidths. Note that these two versions were originally proposed for kernel density estimation (see, e.g., Sain and Scott (1996); Breiman et al. (1977); Abramson (1982); Brewer (2000); Zhang et al. (2006); Zougab et al. (2014b) and Zhang et al. (2016)).
Recently, (Zougab et al. 2012) have considered the local discrete associated kernel
estimator (balloon estimator in discrete case) and have derived the
closed form of the variable bandwidth at each point
An alternative to the cross-validation for bandwidth selection is by
using Bayesian methods. These methods have been investigated with three
different procedures: local, global and adaptive; see respectively
(Zougab et al. 2012),(Zougab et al. 2013),(Zougab et al. 2014a). In terms of integrated squared error and
execution times, the local Bayesian outperforms the other Bayesian
procedures. In the local Bayesian framework, the variable bandwidth is
treated as parameter with prior
First, as we have mentioned above,
Indeed, let
The bandwidth
For fixed
The posterior density (
The Bayesian estimator
Proof. (i) Let us represent
Following (Chen 2000), we assume that for all
with
Also, for
with
From (21), the denominator is successively computed as follows:
with
(ii) Let us remember that the mean of the inverse gamma distribution
This new method of selecting bandwidth by the Bayesian adaptive procedure will be implemented in a future version of the Ake package.
One of the most often encountered models in nonparametric statistics is
the regression model. The function that provides the best prediction of
a dependent variable
As for density or probability mass functions, kernel estimation of the
regression function is the most widely used approach because of its
good asymptotic properties and practical interest. Introduced
initially for continuous density estimation by (Rosenblatt 1956)
and (Parzen 1962), this method was adopted by (Nadaraya 1964)
and (Watson 1964) for estimating the continuous regression function. It
was also applied to smooth the discrete regression function
Both in continuous and discrete cases, consider the relation between a
response variable
Besides the criterion of kernel support, we retain the root mean squared
error (RMSE) and also the practical coefficient of determination given
respectively by
In discrete cases, the reg.fun
function for (25) is used
with the binomial kernel on the milk data as follows. This dataset gives the
average daily fat yields (kg/day) from the milk of a single cow for each of
the first 35 weeks.
R> data("milk", package = "Ake")
R> x <- milk$week
R> y <- milk$yield
R> h <- reg.fun(x, y, "discrete", "bino", 0.1)
R> h
Bandwidth h:0.1 Coef_det=0.9726
Number of points: 35; Kernel = Binomial
data y
Min. : 1.0 Min. :0.0100
1st Qu.: 9.5 1st Qu.:0.2750
Median :18.0 Median :0.3600
Mean :18.0 Mean :0.3986
3rd Qu.:26.5 3rd Qu.:0.6150
Max. :35.0 Max. :0.7200
eval.points m_n
Min. : 1.0 Min. :0.01542
1st Qu.: 9.5 1st Qu.:0.27681
Median :18.0 Median :0.35065
Mean :18.0 Mean :0.39777
3rd Qu.:26.5 3rd Qu.:0.60942
Max. :35.0 Max. :0.70064
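The fitted values m_n printed above have the familiar Nadaraya-Watson form: a weighted mean of the responses, with weights given by the associated kernel. A minimal sketch built from kern.fun, together with the root mean squared error retained as a criterion in the text, is given below; it is an illustration under the stated assumptions, not the internal code of reg.fun.

R> # Nadaraya-Watson-type fit with the binomial associated kernel,
R> # assuming kern.fun(x, t, h, type, ker) as documented in Table 1.
R> nw.fun <- function(x0, X, Y, h) {
+    w <- kern.fun(x0, X, h, "discrete", "bino")   # kernel weights at target x0
+    sum(w * Y) / sum(w)
+  }
R> m.hat <- sapply(x, function(x0) nw.fun(x0, x, y, h = 0.1))
R> sqrt(mean((y - m.hat)^2))                       # root mean squared error (RMSE)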
The above reg.fun
is also used for continuous cases; see
Figure 3 and Table 4 for the
motorcycle impact data of (Silverman 1985).
We present two bandwidth selection methods for the regression: the well-known cross-validation for any associated kernel and the Bayesian global for the binomial kernel.
For a given associated kernel, the optimal bandwidth parameter is obtained by cross-validation. The
hcvreg.fun
function to compute this
optimal bandwidth is described in Table 3.
Arguments | Description
---|---
Vec | The explanatory data sample; it can be discrete or continuous.
y | The response variable.
ker | The associated kernel.
h | The sequence of bandwidths over which the optimal bandwidth is computed.
a0,a1 | The bounds of the support of the extended beta kernel. Default values are respectively 0 and 1.
a | The arm of the discrete triangular kernel. Default value is 1.
c | The number of categories in DiracDU kernel. Default value is 2.
Results | Description
---|---
kernel | The associated kernel.
hcv | The optimal bandwidth obtained by cross-validation.
CV | The values of the cross-validation function.
seqbws | The sequence of bandwidths used to compute hcv.
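For reference, the quantity minimized in this cross-validation is typically the leave-one-out squared prediction error of the regression estimate. A sketch using the nw.fun helper defined earlier with the binomial kernel on the milk data (an illustration only, not the internal code of hcvreg.fun):

R> cv.reg <- function(h, X, Y) {
+    press <- sapply(seq_along(X), function(i)
+      (Y[i] - nw.fun(X[i], X[-i], Y[-i], h))^2)   # leave-one-out prediction errors
+    mean(press)
+  }
R> hs <- seq(0.05, 1, by = 0.05)                   # candidate binomial kernel bandwidths
R> hs[which.min(sapply(hs, function(h) cv.reg(h, x, y)))]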
The following code helps to compute the bandwidth parameter by
cross-validation on the milk data. The associated kernel used is the
discrete triangular kernel with arm a = 1.
R> data("milk", package = "Ake")
R> x <- milk$week
R> y <- milk$yield
R> f <- hcvreg.fun(x, y, type_data = "discrete", ker = "triang", a = 1)
R> f$hcv
[1] 1.141073
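The selected bandwidth can then be passed to reg.fun. A short follow-up, assuming the same positional arguments as in the earlier milk example and the arm a = 1 used above:

R> h.opt <- reg.fun(x, y, "discrete", "triang", f$hcv, a = 1)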
When we consider the continuous associated kernel, one needs to set the
type of data parameter to “continuous” in the hcvreg.fun
function.
Thus, the hcvreg.fun
and reg.fun
functions are used with the gamma,
lognormal, reciprocal inverse Gaussian and Gaussian kernels on the
motorcycle impact data described in (Silverman 1985). The observations consist
of accelerometer readings taken through time in an experiment on the
efficiency of crash helmets. The results in Table 4 agree
with the shapes of continuous associated kernels of Part (b) of
Figure 1; see also
Figure 3. In fact, since the lognormal
kernel is well concentrated around the target
Table 4: Comparison of the gamma, lognormal, reciprocal inverse Gaussian and Gaussian kernels on the motorcycle impact data.
Using Bayes theorem, the joint posterior distribution of IG
The Ake package offers easy tools for R users whose research involves kernel estimation of density functions and/or regression functions through associated kernels that are capable of handling all categorical, count and real positive datasets. Figure 1 shows the importance of the associated kernel choice as well as of the bandwidth selection. In fact, symmetric (e.g., Gaussian) kernel estimators (respectively empirical estimators) are not suitable for bounded or positive continuous datasets (respectively discrete small samples). We then need an appropriate associated kernel. The binomial kernel is suitable for small size count data while the discrete triangular or the naive kernel is more appropriate for large sample sizes. In continuous cases, the lognormal and gamma kernels give the best estimation for positive data while the extended beta is suitable for any compact support.
This package includes various continuous and discrete associated kernels. It also contains functions to handle the bandwidth selection problems through cross-validation, local and global Bayesian procedures for the binomial kernel, and also the adaptive Bayesian procedure for the gamma kernel. In general, Bayesian choices of smoothing parameters will be better than their cross-validation counterparts. Future versions of the package will contain Bayesian methods with other associated kernels. Also, these associated kernels are useful for the p.d.f. estimation of heavy-tailed data and can be added later to the package; see, e.g., (Ziane et al. 2015). The case of multivariate data also needs to be taken into consideration; see (Kokonendji and Somé 2015) for p.d.f. estimation and (Somé and Kokonendji 2016) for regression. We think that the Ake package can be of interest to nonparametric practitioners in different applied settings.
We sincerely thank the CRAN editors and two anonymous reviewers for their valuable comments. Part of this work was done while the first author was at “Laboratoire de Mathématiques de Besançon” as a Visiting Scientist, with the support of “Agence Universitaire de la Francophonie” (AUF).