OneStep-Le Cam’s one-step estimation procedure

Abstract The OneStep package proposes principally an eponymic function that numerically computes Le Cam’s one-step estimator which is asymptotically efficient and can be computed faster than the maximum likelihood estimator for large datasets. Monte-Carlo simulations are carried out for several examples (discrete and continuous probability distributions) in order to exhibit the performance of Le Cam’s one-step estimation procedure in terms of efficiency and computational cost on observation samples of finite size.


Introduction
In the statistical experiments generated by i.i.d. observation samples, the sequence of maximum likelihood estimators (MLE) is known to be asymptotically efficient under very general assumptions and consequently presents the fastest convergence rate and the lowest possible asymptotic variance.
Although the sequence of MLE is asymptotically efficient, it is generally not expressed in a closed form and requires time consuming numerical computations. On the other hand, the other generic estimation procedures which can sometimes be computed faster, as the method of moments, do not generally reach the optimal asymptotic variance.
In R, the package fitdistrplus (see Delignette-Muller and Dutang (2015)) is commonly used to infer the parameters of univariate probability distributions. For non-censored datasets, fitdistrplus allows four different estimation methods: maximum likelihood, moment matching, quantile matching, and maximum goodness-of-fit estimation. The unified approach is provided by fitdist() which returns an S3-object having usual generic functions (plot, summary, logLik, . . . ) as well as dedicated fitdist-functions (gofstat, bootdist, . . . ).
Other packages with similar purposes are EstimationTools and DistributionFitR, providing MLE for non-censored data. Many other packages also provide estimation procedure on the basis of one function per probability distribution; see, e.g., packages univariateML, propagate. A final package, which is worth mentioning, is fitter, which fits a set of probability distributions on a given dataset sequentially.
However, as soon as the Fisher information matrix is sufficiently regular with respect to the parameter to be estimated, Le Cam's one-step estimation procedures can be achieved (see Le Cam (1956)). They are based on an initial sequence of guess estimators and a single Newton step or a Fisher scoring step on the loglikelihood function. Namely, they write ϑ n = ϑ * n + I(ϑ * n ) −1 · 1 n n ∑ j=1l (ϑ * n , X j ), n ≥ 1, for the Fisher scoring type procedure, where (ϑ * n , n ≥ 1) is the initial sequence of guess estimators, ℓ(ϑ) is the loglikelihood function, the dot is the notation for the gradient with respect to ϑ, and I(ϑ) = −E ϑ (l(ϑ)) is the Fisher information matrix and for the Newton type procedure, where I n (ϑ) = − 1 n ∑ n j=1l (ϑ, X j ) is the opposite of the Hessian for the loglikelihood function.
The sequence of Le Cam's one-step estimators presents certain advantages over the sequence of MLE and over the initial sequence of estimators (method of moments, quantile matching method, etc.) in terms of computational cost and asymptotic variance. It is much less computationally expensive than the MLE, while it has the same rate and the same asymptotic variance. Since there is no full numerical optimization (but only one computation of the Newton step or the Fisher scoring step), the procedure is faster and appropriate for very large datasets. On the other hand, it is asymptotically optimal in terms of asymptotic variance, which is generally not the case for the initial sequence of guess estimators.
For several probability distributions (Gaussian, Exponential, Lognormal, Poisson, Geometric), a fast computable sequence of estimators is already asymptotically efficient, and no correction is needed. When the Fisher information is given in a closed form (Gamma, Beta, χ 2 , Weibull, Pareto II, Cauchy), Le Cam's one-step estimation procedure (1) is executed. In all the other cases, the estimation procedure (2) is used either with the score function and the Hessian in a closed form (Negative Binomial) or with the numerical approximation of the score function and the Hessian with the package numDeriv (see Gilbert and Varadhan (2019)).
Le Cam's one-step estimation procedures are implemented in the package OneStep, and the eponymic function onestep is described in this paper and used on different examples. Monte Carlo simulations are performed for several examples in order to exhibit the performance of Le Cam's one-step procedure on samples of finite size. They exhibit their asymptotic efficiency simultaneously with their better computational cost for several probability distributions.

The function onestep
Let (X 1 , X 2 , . . . , X n ) be a sample of i.i.d. random variables. The probability density function of X 1 is denoted f and depends on an unknown parameter ϑ ∈ Θ ⊂ R p which is to be estimated.
Le Cam's one-step estimation procedure is based on an initial sequence of guess estimators and a Fisher scoring step or a single Newton step on the loglikelihood function. For the novice user, the function onestep automatically chooses the best procedure to be used. There is, consecutively, a single command for the user, which is onestep(data,distr) The function onestep presents several procedures internally depending on whether the initial sequence of guess estimators is in a closed-form or not and whether the score and the Fisher information matrix can be elicited in a closed-form.
Here is the list of the procedures taken into account in the onestep function: 1. Distributions for which the MLE is already explicit: norm, exp, lnorm, invgauss for continuous probability distributions, and pois, geom for discrete probability distributions. For this class, the explicit MLE is returned.
2. Distributions for which the initial sequence of guess estimators, the score and the Fisher information matrix have been elicited in a closed-form: gamma, beta, chisq with an initial sequence of moment estimators, cauchy with an initial sequence of quantile matching estimators, and weibull with an initial sequence of graphical plot estimators. For this class, Le Cam's one-step procedure (1) is applied.
3. Distributions for which the initial sequence of guess estimators, the score, and the Hessian have been elicited in a closed-form: nbinom. For this class, Le Cam's one-step procedure (2) is applied.
4. Distributions for which the initial sequence is numerically computed on a subsample, but the score and the Fisher information matrix have closed-form: pareto. For this class, Le Cam's one-step procedure (1) is executed.
5. For all other distributions, if the density function is well defined, the numerical computation of the Newton step in Le Cam's one-step procedure (2) is proposed with an initial sequence of guess estimators, which is the sequence of maximum likelihood estimators computed on a subsample.
As is described, all explicit distributions of the mmedist function from fitdistrplus have been corrected in the onestep function, except unif and logis. The example unif is a famous example of the singular behavior of the MLE. The correction of logis is of very small gain from the method of moments estimator to the MLE. For these two distributions, the method of moments is returned.
However, the package also offers several new explicit computations as cauchy, chisq and weibull and applies to distributions coming from the actuar package (see Dutang et al. (2008)) such as invgauss and pareto.
The function onestep allows the user to propose its own initial guess estimation in the one-step procedure by specifying the parameter init in the command onestep(data,distr,init) The user can consequently use different initial guess estimators for the aforementioned classical distributions (moment estimators, estimators based on the characteristic function, quantile matching estimators, Bayesian estimators, mode-type estimators, graphical methods, etc.). Several examples can be found in the documentation of the onestep function.
Monte Carlo simulations are done for several examples (discrete and continuous probability distributions) in order to exhibit the performance of Le Cam's one-step estimation procedure in terms of efficiency and computational cost on observation samples of finite size.
For the assessment of the efficiency, the proximity of the renormalized statistical errors (Y 1 , Y 2 , . . . , Y M ) to the centered Gaussian asymptotic distribution (for a coordinate in the multivariate setting) is evaluated with the Cramer-Von Mises statistic is the theoretical Gaussian asymptotic cumulative distribution function and is the empirical cumulative distribution function. The asymptotic distribution of T is tabulated, for instance, in the goftest package. Note that a value of the statistic below 0.7434 corresponds to accept the null (and the equivalence to theoretical Gaussian asymptotic distribution) with an error of type I equal to 1%.
Timing performance (given in seconds) is done with the proc.time function ("elapsed" time) on a laptop with an Intel Core i7 2.7 GHz processor with 8GB RAM. The onestep function is compared to the MLE computed with mledist of the fitdistrplus package and to the sequence of initial guess estimators (for instance, the moment estimator is computed with mmedist for the gamma, beta, and nbinom distributions).
The functions benchonestep and benchonestep.replicate were used to compare the performance of estimators (see documentation of the OneStep package).

Closed-form sequence of initial guess estimators
For the majority of the "closed formula" cases, the initial sequence of guess estimators of the unknown parameter ϑ is the sequence of moment estimators.
Let (X 1 , X 2 , . . . , X n ) be a sample of i.i.d. random variables. Let us denote the theoretical moments m k (ϑ) = E ϑ (X k 1 ) and the empirical momentsm k = 1 n ∑ n j=1 X k j . The sequence of moment estimators (ME) is generally defined as the solution of the system of equations Indeed, under very mild conditions, the sequence of moment estimators (ϑ * n , n ≥ 1) is asymptotically normal, and therefore, √ n-consistent (see Ibragimov and Has'minskii (1981)). Namely, Here, the notation M T means the transpose matrix of M. Several examples (Gamma, Beta, and Negative Binomial) are given later on in this section.
However, other √ n-consistent sequences of initial guess estimators can be used. For instance, an initial sequence of quantile matching estimators is employed for the Cauchy distribution.
It had been shown in Le Cam (1956) that for a √ n-consistent initial sequence of guess estimators and an uniformly continuous Fisher information matrix, the sequence of Le Cam's one-step estimators defined in (1) or (2) is consistent, asymptotically normal and efficient (in the Fisher sense) with In other words, for an initial sequence which is asymptotically rate but not variance efficient, the new sequence is asymptotically rate and variance efficient.

Gamma
The first example is the joint estimation of the shape parameter α and scale parameter β in the statistical experiment generated by a sample (X 1 , X 2 , . . . , X n ) of i.i.d. Gamma random variables whose probability density function is given by Let us denote ϑ = (α, β). In this statistical experiment, the sequence of maximum likelihood estimators ( ϑ n , n ≥ 1) of ϑ is not in a closed-form. The sequence of MLE satisfies Here, ψ (n) is the polygamma functions (see (Abramowitz and Stegun, 1972, section 6.4.1, page 260)) defined by ψ (n) (α) = ∂ n ∂α n log Γ(α). Consider that Consecultively, the sequence of moment estimators (ϑ * n = (α * n , β * n ), n ≥ 1) given by and consequently does not reach asymptotical efficiency.
We can see in Figure 1 (and in Table 1 with the Cramer-Von Mises statistics) that the sequence of Le Cam's one-step estimators (LCE) reaches efficiency and naturally overperforms the initial sequence of ME in terms of asymptotic variance. Moreover, this sequence is faster to be computed on M = 10000 Monte Carlo simulations than the sequence of MLE, as shown in Table 1 As it was mentioned in the introduction, the major advantage of the sequence of one-step estimators is that it is computed faster than the maximum likelihood estimator for large datasets. We illustrate this fact in Table 2, where the average computation times (over 10 Monte Carlo simulations) are done for different sample sizes n = 10 r , r = 3, . . . , 9 for both MLE and LCE. LCE is between 20 times and 55 times faster than MLE, especially for large sample sizes when there is memory overload.

10^3
10^4  Beta The second example is the joint estimation of the first shape parameter α and the second shape parameter β in the statistical experiment generated by a sample (X 1 , X 2 , . . . , X n ) of i.i.d. Beta random variables whose probability density function is given by Let us denote ϑ = (α, β). In this statistical experiment, the sequence of maximum likelihood estimators ( ϑ n , n ≥ 1) of ϑ is not in a closed-form. The sequence of MLE satisfies It is worth mentioning that for the Beta distribution that allows one to build closed-form moment estimators. Namely, denotingw n =m 1 (1−m 1 ) m 2 −m 2 1 − 1, we obtain α * n =m 1wn and β * n = (1 −m 1 )w n . Asymptotic variance in (3) of the sequence of moment estimators (ϑ * n = (α * n , β * n ), n ≥ 1) can also be computed with and consequently the sequence does not reach asymptotical efficiency.
Here again, the sequence of Le Cam's one-step estimators naturally overperforms the initial sequence of ME in terms of asymptotic variance (see Figure 2 and the next Table for CvM statistics).
It is computed faster than the sequence of MLE, as shown in Table 3.  Cauchy The third example is the joint estimation of the location parameter m ∈ R and the scale parameter d > 0 in the statistical experiment generated by a sample (X 1 , X 2 , . . . , X n ) of i.i.d. Cauchy random variables whose probability density function is given by Let us denote ϑ = (m, d). In this statistical experiment, the sequence of maximum likelihood estimators ( ϑ n , n ≥ 1) of ϑ is not in a closed-form. The sequence of MLE satisfies It is worth mentioning that the Cauchy distribution has no first moment, and consecutively its unknown parameter ϑ cannot be estimated via the classical method of moments. But a quantile matching method allows one to define where the sample quantile is usually computed for p ∈ (0, 1) as with the order statistics denoted X (1) ≤ X (2) ≤ · · · ≤ X (n) . This sequence of estimators (ϑ * n = (m * n , d * n ), n ≥ 1) can be shown to be consistent and asymptotically normal, namely Here q 1 is the theoretical first quartile. Consequently, this estimator does not reach the efficient asymptotic variance.
As mentioned previously, the sequence of Le Cam's one-step estimators overperforms the initial sequence of quantile matching estimators (QME) in terms of asymptotic variance (see Figure 3 and the next Table for CvM statistics). It is computed faster than the sequence of MLE, as shown in  Pólya (negative binomial) The fourth example is the joint estimation of the size parameter r and the mean parameter µ in the statistical experiment generated by a sample (X 1 , X 2 , . . . , X n ) of i.i.d.
negative binomial random variables whose (discrete) probability density function is given by Let us denote ϑ = (r, µ). In this statistical experiment, the sequence of maximum likelihood estimators ( ϑ n , n ≥ 1) of ϑ is not in a closed-form and the Fisher information matrix neither.
In this discrete case, the sequence of Le Cam's one-step estimators (2) defined with the Hessian still overperforms the initial sequence of ME in terms of asymptotic variance (see Figure 4 and the next  It also overperforms the sequence of MLE in terms of computation time, as shown in Table 5  Weibull The fifth example is the joint estimation of the shape parameter τ > 0 and the rate parameter β > 0 in the statistical experiment generated by a sample (X 1 , X 2 , . . . , X n ) of i.i.d. Weibull random variables whose probability density function is given by Let us denote ϑ = (τ, β). In this example, neither the ME nor the MLE is in closed-form. However, the score and the Fisher information matrix can be explicitly computed. For instance, Moreover, it is possible to define a closed-form initial sequence of graphical plot estimators (based on the explicit form of the cumulative distribution function). Let us denote x i = log(− log(1 − i/(n + 1))) and Z i = log X (i) for i = 1, . . . , n. The sequence of ordinary least square (OLS) based estimators (ϑ * n = (τ * n , β * n ), n ≥ 1) is defined by where Z n stands for the average of Z 1 , . . . , Z n .
The sequence of Le Cam's one-step estimators overperforms the initial sequence of OLS-based estimators in terms of asymptotic variance (see Figure 5 and the next Table). Since OLS presents a small bias for samples of finite size, its CvM statistics are bigger in the next Table. MLE, τ It also overperforms the sequence of MLE in terms of computation time for large datasets, as shown in Table 6  Numerically computed initial sequence of guess estimators and numerical computations in the generic case As soon as the ME is not in closed form or no other closed form estimators can be elicited, the use of a numerical ME as an initial sequence of guess estimator in Le Cam's one-step procedure is difficult to justify. Indeed, no gain will be given in terms of time computation and the use of the MLE is finally equivalent. We propose therefore to use in the onestep function an improvement of the classical Le Cam procedure.
Indeed, it can be shown (see Kamatani and Uchida (2015) or Kutoyants and Motrunich (2016)) that for a n δ/2 -consistent initial sequence of guess estimators (with 1 2 < δ ≤ 1) and a Lipshitz Fisher information matrix, the sequence of Le Cam's one-step estimators is also consistent, asymptotically normal and efficient (in the Fisher sense). In this setting, for a initial sequence which is neither asymptotically rate nor variance efficient, the new sequence is asymptotically rate and variance efficient.
Then the sequence of Le Cam's one-step estimators given by (2) is also consistent, asymptotically normal and efficient (in the Fisher sense) with The choice of the exponent δ that measures the size of the subsample 1 2 < δ ≤ 1 is set to 0.9 by default and can be chosen by the user with the parameter control, for instance in the call onestep(data, distr, control=list(delta=0.7)) The example of Pareto II distribution is interesting and shows the gain in terms of variance (with respect to the initial sequence of guess estimators) and in terms of computation time in comparison with the MLE.
This improved method is also used when the distribution does not belong to the closed-formula family. The initial sequence of maximum likelihood estimators is computed on a subsample with the mledist function of the fitdistrplus package. The score and the Hessian in the Newton step of the Le Cam procedure (2) are numerically computed with the functions grad and hessian of the numDeriv package (see Gilbert and Varadhan (2019)).

Pareto II
The sixth example is the joint estimation of the shape parameter α > 0 and the scale parameter σ > 0 in the statistical experiment generated by a sample (X 1 , X 2 , . . . , X n ) of i.i.d. Pareto II (Lomax) random variables whose probability density function is given by In this example neither the ME (when it exists for α > 2) nor the MLE are in closed-form. But the score and the Fisher information matrix can be explicitly computed.
Let ϑ = (α, σ). Then, considering the MLE computed on a subsample of size n δ , 1 2 < δ ≤ 1, as an initial sequence of guess estimators (ϑ * n , n ≥ 1), we get The computation of the Fisher information matrix can be found in Brazauskas (2003).
Here again, the sequence of Le Cam's one-step estimators naturally overperforms the initial sequence of MLE computed on a subsample in terms of asymptotic variance (see Figure 6 and the next Table for the CvM statistics). It is also computed faster than the sequence of MLE, as shown in Table 7  Generic function For the generic example, we study the Weibull distribution again and force it to be numerically computed with the function parameter method="numeric" in order not to use the closed form processing.
We recall that, in the generic procedure, the score and the Hessian in the Newton step of the Le Cam procedure (2) are numerically computed with the functions grad and hessian of the numDeriv package. By default, the initial sequence of maximum likelihood estimators is computed on a subsample if size n δ with δ = 0.9.
The numerically computed sequence of Le Cam's one-step estimators reaches asymptotic efficiency for a simulation of M = 10000 Monte-Carlo replications of samples of size n = 10000 as shown by the CvM statistics summarized in the next table. It is still computed faster than the sequence of MLE, see