Ake : An R Package for Discrete and Continuous Associated Kernel Estimations

Kernel estimation is an important technique in exploratory data analysis. Its utility relies on its ease of interpretation, especially based on graphical means. The Ake package is introduced for univariate density or probability mass function estimation and also for continuous and discrete regression functions using associated kernel estimators. These associated kernels have been proposed due to their specific features of variables of interest. The package focuses on associated kernel methods appropriate for continuous (bounded, positive) or discrete (count, categorical) data often found in applied settings. Furthermore, optimal bandwidths are selected by cross-validation for any associated kernel and by Bayesian methods for the binomial kernel. Other Bayesian methods for selecting bandwidths with other associated kernels will complete this package in its future versions; particularly, a Bayesian adaptive method for gamma kernel estimation of density functions is developed. Some practical and theoretical aspects of the normalizing constant in both density and probability mass functions estimations are given.


Introduction
Kernel smoothing methods are popular tools for revealing the structure of data that could be missed by parametric methods.For real datasets, we often encounter continuous (bounded, positive) or discrete (count, categorical) data types.The classical kernels methods assume that the underlying distribution is unbounded continuous, which is frequently not the case; see, for example, Duong (2007) for multivariate kernel density estimation and discriminant analysis.A solution is provided for categorical data sets by Hayfield and Racine (2008).In fact, they used kernels well adapted for these categorical sets (Aitchison and Aitken, 1976).Throughout the present paper, the unidimensional support T of the variable of interest can be {0, 1, . . ., N}, [a, b] or [0, ∞) for a given integer N and reals a < b.
The recently developed Ake package, implements associated kernels that seamlessly deal with continuous (bounded, positive) and discrete (categorical, count) data types often found in applied settings; see, for example, Libengué (2013) and Kokonendji and Senga Kiessé (2011).These associated kernels are used to smooth probability density functions (p.d.f.), probability mass functions (p.m.f.) or regression functions.The coming versions of this package will contain, among others, p.d.f.estimation of heavy tailed data (e.g., Ziane et al., 2015) and the estimation of other functionals.The bandwidth selection remains crucial in associated kernel estimations of p.d.f., p.m.f. or regression functions.Some methods have been investigated for selecting bandwidth parameters but the commonly used is the least squared cross-validation.A Bayesian approach has been also recently introduced by Zougab et al. (2012) in the case of a binomial kernel.This method can be extended to various associated kernels with other functionals.Despite the great number of packages implemented for nonparametric estimation in continuous cases with unbounded kernels, to the best of our knowledge, the R packages to estimate p.m.f. with categorical or count variables, p.d.f. with bounded or positive datasets, and regression functions have been far less investigated.
The rest of the paper is organized as follows.In Section Non-classical associated kernels, we briefly describe the definition of associated kernels and then illustrate examples in both continuous and discrete cases which are discussed.Then, the associated kernel estimator for p.d.f. or p.m.f. is presented and illustrated with some R codes in Section Density or probability mass function estimations.In particular, three bandwidth selection methods are available: cross-validation for any (continuous or discrete) associated kernel, the Bayesian local method for the binomial kernel and also a new theoretical Bayesian adaptive method for the gamma kernel.Also, some practical and theoretical aspects of the normalizing constant in both p.d.f. and p.m.f.estimations are given.Section Bandwidth selection for kernel regression involving associated kernels investigates the case of regression functions with two bandwidth selection techniques: cross-validation and also the Bayesian global method for the binomial kernel.Section Summary and final remarks concludes.

Non-classical associated kernels
Recall that the support T of the p.m.f., p.d.f. or regression function, to be estimated, is any set {0, 1, . . ., N}, [a, b] or [0, ∞) for a given integer N and reals a < b.The associated kernel in both continuous and discrete cases is defined as follows.
Definition 34.2.1.(Kokonendji and Senga Kiessé, 2011;Libengué, 2013) Let T (⊆ R) be the support of the p.m.f., p.d.f. or regression function, to be estimated, x ∈ T a target and h a bandwidth.A parametrized p.m.f.(respectively p.d.f.)K x,h (•) of support S x,h (⊆ R) is called "associated kernel" if the following conditions are satisfied: x ∈ S x,h , (1) where Z x,h denotes the random variable with p.m.f.(respectively p.d.f.)K x,h and both a(x, h) and b(x, h) tend to 0 as h goes to 0.
Remark 34.2.2.This definition has the following interesting interpretations: (i) The function K x,h (•) is not necessary symmetric and is intrinsically linked to x and h.
(ii) The support S x,h is not necessary symmetric around x; it can depend or not on x and h.
(iii) The condition (1) can be viewed as ∪ x∈T S x,h ⊇ T and it implies that the associated kernel takes into account the support T of the density f , to be estimated.
(iv) If ∪ x∈T S x,h does not contain T then this is the well-known problem of boundary bias.
(v) Both conditions ( 2) and ( 3) indicate that the associated kernel is more and more concentrated around x as h goes to 0. This highlights the peculiarity of the associated kernel which can change its shape according to the target position.
In order to construct an associated kernel K x,h (•) from a parametric (discrete or continuous) probability distribution K θ , θ ∈ Θ ⊂ R d on the support S θ such that S θ ∩ T = ∅, we need to establish a correspondence between (x, h) ∈ T × (0, ∞) and θ ∈ Θ; see Kokonendji and Senga Kiessé (2011).In what follows, we will call K ≡ K θ the type of kernel to make a difference from the classical notion of a continuous symmetric (e.g., Gaussian) kernel.In this context, the choice of the associated kernel becomes important as well as that of the bandwidth.Moreover, we distinguish the associated kernels said sometimes of "second order" of those said of "first order" which verify the two first conditions (1) and (2).The rest of this section is devoted to discuss examples of associated kernels in both discrete and continuous cases.

Discrete associated kernels
Among the discrete associated kernels found in literature, we here use the best in sense of Definition 34.2.1.Negative binomial and Poisson kernels are respectively overdispersed (i.e., V(Z x,h ) > E(Z x,h )) and equisdispersed (i.e., V(Z x,h ) = E(Z x,h )) and thus are not recommended; see Kokonendji and Senga Kiessé (2011) for further details.The first associated kernel listed below, namely the binomial kernel, is the best of the first order or standard kernels which satisfies where V (0) is a neighborhood of 0 which does not depend on x.The two other discrete associated kernels satisfy all conditions of Definition 34.2.1.
• The binomial (bino) kernel is defined on the support S x = {0, 1, . . ., x + 1} with x ∈ T := N = {0, 1, . ..} and then h ∈ (0, 1]: where 1 A denotes the indicator function of any given event A. Note that B x,h is the p.m.f. of the binomial distribution B(x + 1; (x + h)/(x + 1)) with its number of trials x + 1 and its success probability in each trial (x + h)/(x + 1).It is appropriate for count data with small or moderate sample sizes and, also, it satisfies (4) rather than (3); see Kokonendji and Senga Kiessé (2011) and also Zougab et al. (2012) for a bandwidth selection by Bayesian method.
The R Journal Vol.8/2, December 2016 ISSN 2073-4859 • The following class of symmetric discrete triangular kernels has been proposed in Kokonendji et al. (2007).The support T of the p.m.f.f to be estimated, can be unbounded (e.g., N, Z) or finite (e.g., {0, 1, . . ., N}).Then, suppose that h is a given bandwidth parameter and a is an arbitrary and fixed integer.For fixed arm a ∈ N, the discrete triangular (DTra) kernel is defined on S x,a = {x, x ± 1, . . ., x ± a} with x ∈ T = N: where P(a, h) = (2a + 1)(a + 1) − 2 ∑ a k=0 k h is the normalizing constant.It is symmetric around the target x, satisfying Definition 34.2.1 and suitable for count variables; see Kokonendji and Zocchi (2010) for an asymmetric version.Note that h → 0 gives the Dirac kernel.
• A discrete kernel estimator for categorical data has been introduced in Aitchison and Aitken (1976).Its asymmetric discrete associated kernel version that we here label DiracDU (DirDU) as "Dirac Discrete Uniform" has been deduced in Kokonendji and Senga Kiessé (2011) as follows.
For fixed c ∈ {2, 3, . ..} the number of categories, we define S c = {0, 1, . . ., c − 1} and where h ∈ (0, 1] and x ∈ T = S c .In addition, the target x can be considered as the reference point of f to be estimated; and, the smoothing parameter h is such that 1 − h is the success probability of the reference point.This DiracDU kernel is symmetric around the target, satisfying Definition 34.2.1 and appropriated for categorical set T. See, e.g., Racine and Li (2004) for some uses.Note that h = 0 provides the Dirac kernel.

Continuous associated kernels
One can find several continuous associated kernels in literature among the Birnbaum-Saunders of Jin and Kawczak (2003).Here, we present seven associated kernels well adapted for the estimations of density or regression functions on any compact or nonnegative support of datasets.All these associated kernels satisfy Definition 34.2.1.
• The extended beta (BE) kernel is defined on where B(r, s) = 1 0 t r−1 (1 − t) s−1 dt is the usual beta function with r > 0, s > 0; see Libengué (2013).For a = 0 and b = 1, it corresponds to the beta kernel (Chen, 1999) which is the p.d.f. of the beta distribution with shape parameters 1 + x/h and (1 − x)/h.The extended beta kernel is appropriate for any compact support of observations.
• The gamma (GA) kernel is given on S x,h = [0, ∞) = T with x ∈ T and h > 0: where Γ(v) = ∞ 0 s v−1 exp(−s)ds is the classical gamma function with v > 0; see Chen (2000).It is the p.d.f. of the gamma distribution GA(1 + x/h, h) with scale parameter 1 + x/h and shape parameter h.It is suitable for the non-negative real set T = [0, ∞).

• The lognormal (LN) kernel is defined on S
see Libengué (2013) and also Igarashi and Kakizawa (2015).It is the p.d.f. of the classical lognormal distribution with mean log(x) + h 2 and standard deviation h.

√
x 2 + xh and standard deviation 1/h.Remark 34.2.3.The three continuous associated kernels inverse gamma, inverse Gaussian and Gaussian are not adapted for density estimation on supports [0, ∞) and thus are not included in the Ake package; see Part (b) of Figure 1.

Indeed:
• The inverse gamma (IGA) kernel, defined on S x,h = (0, ∞) = T with x ∈ (0, 1/h) and h > 0 such that (Libengué, 2013), is graphically the worst since it does not well concentrate on the target x.Note that it is the p.d.f. of the inverse gamma distribution with scale parameter −1 + 1/(xh) and scale parameter 1/h.
• From ated version (Gaussian) on S x,h = R with x ∈ T := R and h > 0: It has the same shape at any target and thus is well adapted for continuous variables with unbounded supports but not for [0, ∞) or compact set of R; see also Epanechnikov (1969) for another example of a continuous symmetric kernel.
Figure 1 shows some forms of the above-mentioned univariate associated kernels.The plots highlight the importance given to the target point and around it in discrete (a) and continuous (b) cases.Furthermore, for a fixed bandwidth h, the Gaussian keeps its same shape along the support; however, they change according to the target for the other non-classical associated kernels.This The R Journal Vol.8/2, December 2016 ISSN 2073-4859 Arguments Description x The target.t The single or the grid value where the function is computed.h The bandwidth or smoothing parameter.ker The associated kernel.

a0,a1
The left and right bounds of the support for the extended beta kernel.a The arm of the discrete triangular kernel.Default value is 1. c The number of categories in DiracDU kernel.Default value is 2.

Result Description
Returns a single value of the associated kernel function.We have implemented in R the method kern.fun for both discrete and continuous associated kernels.Seven possibilities are allowed for the kernel function.We enumerate the arguments and results of the default kern.fun.defaultfunction in Table 1.The kern.fun is used as follows for the binomial kernel: R> x <-5 R> h <-0.1 R> y <-0:10 R> k_b <-kern.fun(x,y, h, "discrete", "bino")

Density or probability mass function estimations
The p.d.f. or p.m.f.estimation is an usual application of the associated kernels.Let X 1 , . . ., X n be independent and identically distributed (i.i.d.) random variables with an unknown p.d.f.(respectively p.m.f.) f on T.An associated kernel estimator f n of f is simply: Here, we point out some pointwise properties of the estimator (5) in both discrete and continuous cases.
Proposition 34.3.1.(Kokonendji and Senga Kiessé, 2011;Libengué, 2013) Let X 1 , X 2 , . . ., X n be an n random sample i.i.d.from the unknown p.m.f.(respectively p.d.f.) f on T. Let f n = f n,h,K be an estimator (5) of f with an associated kernel.Then, for all x ∈ T and h > 0, we have where Z x,h is the random variable associated to the p.m.f.(respectively p.d.f.)K x,h on S x,h .Furthermore, for a p.m.f.(respectively p.d.f.), we have respectively where C n = C(n; h, K) is a positive and finite constant if T K x,h (t)ν(dx) < ∞ for all t ∈ T, and ν is a count or Lebesgue measure on T.
It is easy to see that C n = 1 for the estimators (5) with DiracDU kernel or any classical (symmetric) The R Journal Vol.8/2, December 2016 ISSN 2073-4859 associated kernel.Indeed, for the DiracDU kernel estimation we have In general we have C n = 1 for other discrete and also continuous associated kernels, but it is always close to 1.In practice, we compute C n depending on observations before normalizing f n to be a p.m.f. or a p.d.f.The following code helps to compute the normalizing constant, e.g., for gamma kernel estimation: R> data("faithful", package = "datasets") R> x <-faithful$waiting R> f <-dke.fun(x,ker = "GA", 0.1) R> f$C_n [1] 0.9888231 Without loss of generality, we study x → f n (x) up to a normalizing constant which is used at the end of the density estimation process.Notice that that non-classical associated kernel estimators f n are improper density estimates or as kind of "balloon estimators"; see Sain (2002).There are two ways to normalize these estimators (5).The first method is the global normalization using C n of (6): Another alternative is to use an adaptive normalization of (5) according to each target x: but this approach, with similar results than (7), is not used here.The representations are done with the global normalization (7).In the package, we also compute the normalizing constant (6) for any data set.
In discrete cases, the integrated squared error (ISE) defined by is the criteria used to measure numerically the discrete smoothness of f n from (7) with the empirical or naive p.m.f.f 0 such that ∑ x∈N f 0 (x) = 1; see, e.g., Kokonendji and Senga Kiessé (2011).Concerning the continuous variables, the histogram gives a graphical measure of comparison with f n ; see, for example, Figure 2.

Some theoretical aspects of the normalizing constant
In this section, we present some theoretical aspects of the normalizing constant C n of ( 6) and two examples in the continuous and discrete cases.We first recall the following result on pointwise properties of the estimator (5).
Lemma 34.3.2.(Kokonendji and Senga Kiessé, 2011;Libengué, 2013) Let x ∈ T be a target and h ≡ h n a bandwidth.Assuming f is in the class C 2 (T) in the continuous case, then Similar expressions (8) hold in the discrete case, except that f and f are finite differences of the first and second order respectively.
Furthermore, for the continuous case, if f is bounded on T then there exists r 2 = r 2 K x,h > 0 the largest The R Journal Vol.8/2, December 2016 ISSN 2073-4859 For discrete situations, the result (9) becomes where Z x,h denotes the discrete random variable with p.m.f.K x,h .
It is noticeable that the bias ( 8) is bigger than the one with symmetric kernels and thus can be reduced; see, e.g., Zhang (2010), Zhang and Karunamuni (2010) and Libengué (2013).
Proposition 34.3.3.Following notations in Lemma 34.3.2, the mean and variance of C n of (6) are respectively: where in f (T) and sup(T) are respectively the infimum and supremum of T, the measure ν is Lesbesgue or count on the support T, and where " " stands for approximation.
Proof.From Lemma 34.3.2 and the Fubini theorem, we successively show (10) as follows: The variance ( 11) is trivial from Lemma 34.3.2.
Example 1.Let f be an exponential density with parameter γ > 0. Thus, one has: Consider the lognormal kernel with Libengué (2013).Then, using the Taylor formula around h, the expressions of E (C n ) and V (C n ) are: 2340 by computation with R. Thus, the quantity C n cannot be equal to 1.

Bandwidth selection
Now, we consider the bandwidth selection problems which are generally crucial in nonparametric estimation.Several methods already existing for continuous kernels can be adapted to the discrete case as the classical least-squares cross-validation method; see, for example, Bowman (1984), Marron (1987) and references therein.Here, we simply propose three procedures for the bandwidth selection: cross-validation, Bayesian local for binomial and adaptive for the gamma kernel.Also, a review of bayesian bandwidth selection methods is presented.Each time, the smoothing parameter selection is done with the non-normalized version f n of the estimator (5) before the global normalization f n of (7).

Cross-validation for any associated kernel
For a given associated kernel K x,h with x ∈ T and h > 0, the optimal bandwidth h cv of h is obtained by cross-validation as h cv = arg min h>0 CV(h) with ) is being computed as f n (X i ) by excluding the observation X i and ν is the Lebesgue or count measure.This method is applied to all estimators (5) with associated kernels cited in this paper, independently on the support T of f to be estimated.
Table 2 gives the arguments and results of the cross-validation function hcvc.fundefined for continuous data are below.The hcvd.fun is the corresponding function for discrete data.The hcvc.fun is performed with the Old Faithful geyser data described in Azzalini and Bowman (1990) and Härdle (2012).The dataset concerns waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.The following codes and Figure 2 give smoothing density estimation with various associated kernels of the waiting time variable.

Vec
The positive continuous data sample.

seq.bws
The sequence of bandwidths where to compute the cross-validation function.ker The associated kernel.

a0,a1
The bounds of the support of extended beta kernel.Default values are respectively 0 and 1. a The arm of the discrete triangular kernel.Default value is 1. c The number of categories in DiracDU kernel.Default value is 2.

Description hcv
The optimal bandwidth obtained by cross-validation.

seq.h
The sequence of bandwidths used to compute hcv.

A review of Bayesian bandwidth selection
Bayesian inference grows out of the simple formula known as Bayes rule.Assume we have two random variables A and B. A principle rule of probability theory known as the chain rule allows us to specify the joint probability of A and B taking on particular values a and b, P(a, b), as the product of the conditional probability that A will take on value a given that B has taken on value b, P(a|b), and the marginal probability that B takes on value b, P(b).Which gives us: Joint probability = Conditional Probability × Marginal Probability.
Thus we have: P(a, b) = P(a|b)P(b).
This expression (Bayes rule) indicates that we can compute the conditional probability of a variable A given the variable B from the conditional probability of B given A. This introduces the notion of prior and posterior knowledge.
Prior and posterior knowledge.A prior probability is the probability available to us beforehand, and before making any additional observations.A posterior probability is the probability obtained from the prior probability after making additional observation to the prior knowledge available.
Summarizing the Bayesian approach.The Bayesian approach to parameter estimation works as follows: 1. Formulate our knowledge about a situation.

Obtain posterior knowledge that updates our beliefs.
How do we formulate our knowledge about a situation?a. Define a distribution model which expresses qualitative aspects of our knowledge about the situation.This model will have some unknown parameters, which will be dealt with as random variables.
b. Specify a prior probability distribution which expresses our subjective beliefs and subjective uncertainty about the unknown parameters, before seeing the data.
After gathering the data, how do we obtain posterior knowledge?c.Compute posterior probability distribution which estimates the unknown parameters using the rules of probability and given the observed data, presenting us with updated beliefs.Zougab et al. (2013) proposed a Bayesian approach based upon a likelihood cross-validation approximation and a Markov chain Monte Carlo (MCMC) method for deriving the global optimal bandwidth using the famous binomial kernel.However, a global bandwidth does not generally provide a good estimator for complex p.m.f.'s, in particular for small and moderate sample sizes.Generally, the global discrete associated kernel estimator tends to simultaneously under-and oversmooth f (x).
In order to improve the global discrete associated kernel estimator, in particular for complex count data with small and moderate sample sizes, Zougab et al. (2012) and Zougab et al. (2013) adapted two versions of variable bandwidths for discrete associated kernel estimator and proposed Bayesian approaches for selecting these variable bandwidths.Note that these two versions are originally proposed for kernel density estimation (see, e.g., Sain and Scott 1996;Breiman et al. 1977;Abramson 1982;Brewer 2000;Zhang et al. 2006;Zougab et al. 2014a andZhang et al. 2016).
Recently, Zougab et al. (2012) have considered the local discrete associated kernel estimator (balloon estimator in discrete case) and have derived the closed form of the variable bandwidth at each point x for which the p.m.f. is estimated by considering the binomial kernel estimator and locally treating the bandwidth as a random quantity with a beta prior distribution.This approach outperforms existing classical global methods, namely, MISE and CV in particular for small and moderate sample sizes.Zougab et al. (2013) have also proposed the adaptive discrete associated kernel estimator (sample-point estimator in discrete situation), which replaces h by h i for each observation x i with i = 1, . . ., n, and then employs the Bayesian approach for estimating the adaptive bandwidths h i .The authors have considered the binomial kernel and the beta prior for each variable bandwidth h i , and have shown that this approach performs better than the popular classical global selectors.

Bayesian estimation of localized bandwidth for the binomial kernel
An alternative to the cross-validation for bandwidth selection is by using Bayesian methods.These methods have been investigated with three different procedures: local, global and adaptive; see respectively Zougab et al. (2012Zougab et al. ( , 2013Zougab et al. ( , 2014b)).In terms of integrated squared error and execution times, the local Bayesian outperforms the other Bayesian procedures.In the local Bayesian framework, the variable bandwidth is treated as parameter with prior π(•).Under squared error loss function, the Bayesian bandwidth selector is the posterior mean; see Zougab et al. (2012).
First, as we have mentioned above, f (x) can be approximated by where B x,h is the binomial kernel and X is a random variable with p.m.f.f .Now, considering h as a scale parameter for f h (x), the local approach consists of using f h (x) and constructing a Bayesian estimator for h(x).
Indeed, let π(h) denote the beta prior density of h with positive parameters α and β.By the Bayes theorem, the posterior of h at the point of estimation x takes the form Since f h is unknown, we use f h as natural estimator of f h , and hence we can estimate the posterior by Under the squared error loss, the Bayes estimator of the smoothing parameter h(x) is the posterior mean and is given by where B(•, •) is the beta function; see Zougab et al. (2012) for more details.

Bayesian estimation of adaptive bandwidth for the gamma kernel
The bandwidth h in the gamma kernel density estimation can be allowed to be adaptive.This approach gives a variable bandwidth h i for each observation X i in place of the initial fixed bandwidth h.Following Zougab et al. (2014a), we suggest using Bayesian methods to estimate such adaptive bandwidths or variable bandwidths h i , i = 1, . . ., n.Thus, we treat h i as a random variable with a prior distribution π(•).The estimator (5) with gamma kernel of Section Continuous associated kernels and variable bandwidths are reformulated as The leave-one-out kernel estimator of f (X i ) deduced from (12) is where {X −i } denotes the set of observations excluding X i .The posterior distribution for each variable bandwidth h i given X i provided from the Bayesian rule is expressed as follow We obtain the Bayesian estimator h i of h i by using the quadratic loss function The R Journal Vol.8/2, December 2016 ISSN 2073-4859 In the following, we assume that each h i = h i (n) has an inverse gamma prior distribution IGA(α, β) with the shape parameter α > 0 and scale parameter β > 0. The density of IGA(a, b) with a, b > 0 is defined as This allows us to obtain the closed form of the posterior density and the Bayesian estimator given by the following result.
Theorem 34.3.4.For fixed i ∈ {1, 2, . . ., n}, consider each observation X i with its corresponding bandwidth h i .Using the gamma kernel estimator (12) and the inverse gamma prior distribution IGA(α, β) given in (16) with α > 1/2 and β > 0 for each h i , then: (i) The posterior density ( 14) is the following weighted sum of inverse gamma (ii) The Bayesian estimator h i of h i , given in (15), is according to the previous notations of A ij , B ij , C j and D ij .13) and (16) the numerator is, first, equal to Following Chen (2000), we assume that for all X i ∈ (0, ∞) one has 1 + (X i /h i ) → ∞ as n → ∞.Using the Stirling formula Γ(z + 1) √ 2πz z+1/2 exp(−z) as z → ∞, the term of the sum in ( 17) can be successively calculated as with The R Journal Vol.8/2, December 2016 ISSN 2073-4859 Also, for X i = 0, the term of the sum ( 17) can be expressed as follows with C j = Γ(α + 1)/[β −α (X j + β) α+1 ] and Φ α+1,X j +β (h i ) is given in (16).Combining ( 18) and ( 19), the expression of N( From (20), the denominator is successively computed as follows: with Finally, the ratio of (20) and (21) leads to the result of Part (i).
(ii) Let us remember that the mean of the inverse gamma distribution IG(α, β) is β/(α − 1).Thus, the expression of π(h i | X i ) in ( 14) is given by and, therefore, This new method of selecting bandwidth by the Bayesian adaptive procedure will be implemented in a future version of the Ake package.

Bandwidth selection for kernel regression involving associated kernels
One of the most often encountered models in nonparametric statistics is the regression model.The function that provides the best prediction of a dependent variable y in terms of an independent variable x is the conditional expectation E(y/x) = m(x).This is called regression function and its estimation from a sequence of n pairs (x i , y i ), i = 1, . . ., n is a problem in statistics.We will consider the case (x, y) ∈ T × R. For simplicity, we take T = R if x is a continuous variable and T = N in the discrete case.The classical non parametric regression model between two variables y and x is where y = (y 1 , . . ., y n ) is a response vector, x = (x 1 , . . ., x n ) is an explanatory vector, = ( 1 , . . ., n ) is the error following a Gaussian distribution with zero mean and finite variance σ 2 , i.e., i ∼ N (0, σ 2 ) and m : T → R is the unknown regression function.Several methods have been proposed to estimate the regression function in the continuous case.We cite for example the histograms introduced by Tukey (1961) and studied by Geoffroy (1980) and Lecoutre (1990), the spline method which can be found in Reinsch (1967), Silverman (1985) and Wahba (1990) and also the regression using partition The R Journal Vol.8/2, December 2016 ISSN 2073-4859 proposed by Breiman et al. (1984).
As for the density or probability mass function, the estimate of the regression function by the kernel method is the most used because of its good asymptotic properties and interest in practice.Introduced initially for continuous density estimation by Rosenblatt (1956) and Parzen (1962), this method was adopted by Nadaraya (1964) and Watson (1964) for estimating the continuous regression function.It was also applied to smooth the discrete regression function m for x ∈ N. Some studies have been done to estimate the discrete regression function, using the Dirac-type kernel (naive estimator) or discrete kernels of Aitchison and Aitken (1976).However, the naive estimator is appropriate only when the sample size is large, and the discrete kernel of Aitchison and Aitken (1976) is only suitable for categorical data; see Hayfield and Racine (2008) and also Hayfield and Racine (2014).Kokonendji et al. (2009) adapted the Nadaraya (1964) and Watson (1964) kernel to the discrete unknown function m, using the discrete associated kernels.In their work, using the integrated mean square error and the coefficient of determination R 2 , they showed that the binomial or discrete triangular kernels are better compared to the optimal Epanechnikov kernel.In this section we present the theoretical foundations of the estimated regression function with continuous and discrete associated kernels.
Both in continuous and discrete cases, consider the relation between a response variable Y and an explanatory variable x given by where m is an unknown regression function from T ⊆ R to R and the disturbance term with null mean and finite variance.Let (X 1 , Y 1 ), . . ., (X n , Y n ) be a sequence of i.i.d.random variables on T × R(⊆ R 2 ) with m(x) = E (Y|X = x) of ( 23).Using (continuous or discrete) associated kernels, the Nadaraya (1964) and Watson (1964) estimator m n of m is where h ≡ h n is the smoothing parameter such that h n → 0 as n → ∞.
Besides the criterion of kernel support, we retain the root mean squared error (RMSE) and also the practical coefficient of determination given respectively by In discrete cases, the reg.fun function for ( 24) is used with the binomial kernel on milk data as follows.This dataset is about average daily fat (kg/day) yields from milk of a single cow for each of the first 35 weeks.

Vec
The explanatory data sample can be discrete or continuous.y The response variable.ker The associated kernel.h The sequence of bandwidths where to compute the optimal bandwidth.a0,a1 The bounds of the support of extended beta kernel.Default values are respectively 0 and 1. a The arm of the discrete triangular kernel.Default value is 1. c The number of categories in DiracDU kernel.Default value is 2.

Description kernel
The associated kernel.hcv The optimal bandwidth obtained by cross-validation.

CV
The values of the cross-validation function.seqbws The sequence of bandwidths used to compute hcv.The above reg.fun is also used for continuous cases; see Figure 3 and Table 4 for the motorcycle impact data of Silverman (1985).

Bandwidth selection
We present two bandwidth selection methods for the regression: the well-known cross-validation for any associated kernel and the Bayesian global for the binomial kernel.

Cross-validation for any associated kernel
For a given associated kernel, the optimal bandwidth parameter is h cv = arg min h>0 LSCV(h) with where m −i (X i ) is computed as m n of (24) excluding X i ; see, e.g., Kokonendji et al. (2009).The hcvreg.funfunction to compute this optimal bandwidth is described in Table 3.
The following code helps to compute the bandwidth parameter by cross-validation on milk data.The associated kernel used is the discrete triangular kernel with arm a = 1.R> data("milk", package = "Ake") R> x <-milk$week R> y <-milk$yield R> f <-hcvreg.fun(x,y, type_data = "discrete", ker = "triang", a = 1) R> f$hcv [1] 1.141073 When we consider the continuous associated kernel, one needs to set the type of data parameter to "continuous" in the hcvreg.funfunction.Thus, the hcvreg.funand reg.fun functions are used with gamma, lognormal, reciprocal inverse Gaussian and Gaussian kernel on the motor cycle impact data described in Silverman (1985).The observations consist of accelerometer reading taken through time in an experimentation on the efficiency of crash helmets.The results in Table 4 agree with the shapes of continuous associated kernels of Part (b) of Figure 1; see also Figure 3.In fact, since the lognormal kernel is well concentrated around the target x, it gives the best R 2 which is 75.9%.The gamma and the reciprocal inverse Gaussian kernels give similar R 2 in the order 73%.Although the Gaussian kernel is well concentrated on the target, it gives the lower result of R 2 = 70.90%.This is mainly due to the symmetry of the kernel which cannot change its shapes according to the target.
Figure 3: Nonparametric regressions of the motors cycle impact data (Silverman, 1985) by some continuous associated kernels.

Bayesian global for binomial kernel
Using Bayes theorem, the joint posterior distribution of h given the observations is π(h|X 1 , X 2 , . . . , , where ∝ denotes proportional, the reals a and b are the parameters of the inverse gamma distribution IG(a, b), and α and β those of the beta distribution Be(α, β).The estimate h bay of the smoothing parameter h is given by Markov chain Monte Carlo (MCMC) techniques with Gibbs sampling: where N 0 is the burn-in period and N the number of iterations; see Zougab et al. (2014b) for further details.It will be implemented in a future version of the Ake package.

Summary and final remarks
The Ake package offers easy tools for R users whose research involves kernel estimation of density functions and/or regression functions through associated kernels that are capable of handling all categorical, count and real positive datasets.Figure 1 shows the importance of the associated kernel choice as well as the bandwidth selection.In fact, symmetric (e.g., Gausssian) kernel estimators (respectively empirical estimators) are not suitable for bounded or positive continuous datasets (respectively discrete small samples).We then need an appropriate associated kernel.The binomial kernel is suitable for small size count data while the discrete triangular or the naive kernel are more indicated for large sample sizes.In continuous cases, the lognormal and gamma kernels give the best The R Journal Vol.8/2, December 2016 ISSN 2073-4859 estimation for positive data while the extended beta is suitable for any compact support.
This package includes various continuous and discrete associated kernels.It also contains functions to handle the bandwidth selection problems through cross-validation, local and global Bayesian procedures for binomial kernel and also the adaptive Bayesian procedure for the gamma kernel.In general, Bayesian choices of smoothing parameters will be better than their cross-validation counterparts.Future versions of the package will contain Bayesian methods with other associated kernels.Also, these associated kernels are useful for heavy tailed data p.d.f.estimation and can be added later in the package; see, e.g., Ziane et al. (2015).The case of multivariate data needs to be taken in consideration; see Kokonendji and Somé (2015) for p.d.f.estimation and Somé and Kokonendji (2016) for regression.We think that the Ake package can be of interest to nonparametric practitioners of different applied settings.

Figure 2 :
Figure 2: Smoothing density estimation of the Old Faithful geyser data(Azzalini and Bowman, 1990) by some continuous associated kernels with the support of observations[43, 96]  = T.

Table 1 :
Summary of arguments and results of kern.fun.explains the inappropriateness of the Gaussian kernel for density or regression estimation in any bounded interval and of the DiracDU kernel for count regression estimation; see Part (b) of Figure1.From Part (v) of Remark 34.2.2, the inverse gamma and inverse Gaussian are the worst since they do not well concentrate on the target x; see Remark 34.2.3.These previous associated kernels can be applied to various functionals.

Table 2 :
Summary of arguments and results of hcvc.fun.

Table 3 :
Summary of arguments and results of hcvreg.fun.