investr: An R Package for Inverse Estimation

Inverse estimation is a classical and well-known problem in regression. In simple terms, it involves the use of an observed value of the response to make inference on the corresponding unknown value of the explanatory variable. To our knowledge, however, statistical software is somewhat lacking the capabilities for analyzing these types of problems. In this paper, we introduce investr (which stands for inverse estimation in R), a package for solving inverse estimation problems in both linear and nonlinear regression models.1


Introduction
Consider the regression model Y i = f (x i ; β) + i (i = 1, . . . , n), where f is a known expectation function (called a calibration curve) that is monotonic over the range of interest and i iid ∼ N 0, σ 2 . A common problem in regression is to predict a future response Y 0 from a known value of the explanatory variable x 0 . Often, however, there is a need to do the reverse; that is, given an observed value of the response Y = y 0 , estimate the unknown value of the explanatory variable x 0 . This is known as the calibration problem, though we refer to it more generally as inverse estimation. In this paper, we consider only controlled calibration, where the values of the explanatory variables are fixed by design. A more thorough overview of the calibration problem, including Bayesian approaches and multivariate calibration, is given in Osborne (1991).
There are three main functions available in the investr package: calibrate operates on objects of class lm and can only be used when the expectation function has the form f (x i ; β) = β 0 + β 1 x i (i.e., the simple linear regression model), where closed-form solutions are available for calculating confidence intervals for x 0 . For more complicated models (e.g., polynomial and nonlinear regression), closed-form expressions are usually not available and iterative techniques will be required. This is the purpose of the function invest, which calculates the point estimate and confidence intervals for x 0 by calling the function uniroot from the stats package. The function plotFit produces a scatterplot of the data with fitted regression curve and the option to add confidence/prediction bands for the response (pointwise or adjusted). It can be used with single-predictor objects of class lm or nls; however, for objects of class nls, confidence/prediction bands are based on the linear approximation and can be misleading (Bates and Watts, 1988, pg. 65). The development version of investr can be found on GitHub at https://github.com/w108bmg/investr. To report bugs or issues, contact the main author or submit them to https://github.com/w108bmg/investr/issues.

Calibration for straight line regression
Consider the most common calibration model, Y i = β 0 + β 1 x i + i (i = 1, . . . , n), where x i is fixed and i iid ∼ N 0, σ 2 . Suppose we obtain a series of m observations y 01 , . . . , y 0m which are associated with the same (unknown) x 0 . It can be shown (Graybill, 1976) that the maximum likelihood (ML) estimator of x 0 , also known as the classical estimator, is where β 0 and β 1 are the usual ML estimators of β 0 and β 1 , respectively, andȳ 0 = ∑ m j=1 y 0j /m. The classical estimator in general is computed as x 0 = f −1 ȳ 0 ; β ; that is, by inverting the fitted calibration curve atȳ 0 (Eisenhart, 1939). Since x 0 is a ratio of jointly normal random variables, it is not surprising to learn that it does not have any finite moments (think of a standard Cauchy distribution).
The sampling distribution of x 0 is complicated, but fortunately, not required for setting an exact 100(1 − α)% confidence interval on x 0 . This is discussed in the next section.

Inversion interval
It can be shown (see, for example, Draper and Smith (1998, pp. 47-51)) that an exact 100(1 − α)% confidence interval for x 0 is given by where g = t 2 σ 2 / β 2 1 /S xx and t = t α/2,n+m−3 is the upper 1 − α percentile of a Student's tdistribution with n + m − 3 degrees of freedom. The inversion interval (2) can also be obtained using a fiducial argument, as in Fieller (1954). For the special case m = 1, the inversion interval is equivalent to inverting a 100(1 − α)% prediction interval for the response. In other words, if one draws a horizontal line through a scatterplot of the data at y 0 , then the abscissas of its intersection with the usual (pointwise) prediction band for f correspond to the endpoints of the inversion interval (2). This interval should only be used when the usual F test for testing H 0 : β 1 = 0 versus H 1 : β 1 = 0 can be rejected at the specified α level; in other words, when the regression line is not "too" flat. If H 0 is not rejected, then such a confidence interval for x 0 may result in either the entire real line or two semi-infinite intervals (Graybill and Iyer, 1994, p. 429) -see Figure 2. The plotFit function in the investr package can be used for drawing scatterplots with the fitted model and confidence/prediction bands. To calculate the inversion interval for the linear calibration problem, we use the calibrate function and specify the option interval = "inversion" (the default) as in the following example.
The data frame arsenic contains the true amounts of arsenic present in 32 water samples (Graybill and Iyer, 1994). Also present, is the amount of arsenic measured by some field test, which is subject to error. A new water sample is obtained and subjected the field test producing a reading of 3.0 µg/ml. It is desired to infer the true amount of arsenic present in the sample. The following code fits a simple linear regression model to the arsenic data and then uses the calibrate function to compute the ML estimate and corresponding 90% calibration interval based on Equation (2). library(investr) mod <-lm(measured~actual, data = arsenic) (res <-calibrate(mod, y0 = 3, interval = "inversion", level = 0.9)) ## Figure 1 plotFit(mod, interval = "prediction", level = 0.9, shade = T, col.pred = "skyblue") abline(h = 3, v = c(res$lower, res$estimate, res$upper), lty = 2) Running the above block of code produces Figure 1

Wald interval
Another common approach to computing calibration intervals is to use the delta method (Dorfman, 1938). It is easy to show that an approximate standard error for the ML estimator (1), based on a first-order Taylor series approximation, is given by Assuming large sample normality for x 0 leads to an approximate 100(1 − α)% Wald confidence interval for x 0 of This is equivalent to taking g = 0 in the inversion interval (2). Unlike the inversion interval, though, the Wald interval always exists and is symmetric about x 0 . This symmetry is attractive, but not always realistic, such as when the calibration curve f is nonlinear and has horizontal asymptotes. To obtain a Wald-type interval we specify interval = "Wald" in the call to calibrate. For instance, calibrate(mod, y0 = 3, interval = "Wald", level = 0.9) produces the output estimate lower upper se 2.9314 2.6040 3.2589 0.1929 The point estimate remains unchanged (as expected) but the calibration interval is slightly different. The major benefit of using the delta method to compute calibration intervals is that it always exists and provides us with an asymptotic estimate of the standard error, which in this case se = 0.1929. The bootstrap, as discussed in Section 2.4, also provides calibration intervals and an estimate of the standard error, but does not require large samples or specific distribution assumptions.

Confidence interval for a specified mean response
Instead of inferring the value of x 0 that corresponds to an observed value of the response y 0 , suppose we want to infer the value of x 0 that corresponds to a specified value of the mean response, say f (x 0 ; β) = µ 0 . The obvious ML estimate of x 0 is which is the same as Equation (1) but withȳ 0 replaced with µ 0 , a fixed population parameter. This is analogous to the difference between (i) predicting a future value of the response corresponding to a given value of the explanatory variable and (ii) estimating the mean response that corresponds to a particular value of the explanatory variable. Thus, the point estimates (1) and (5) are the same but the former has greater variability inherited from the variance of Y 0 . This is sometimes referred to as regulation (as opposed to calibration) and is described in more detail in Graybill and Iyer (1994, chap. 6). The confidence interval formulas for x 0 corresponding to a specified mean response (i.e., regulation) are the same as those given in Equations (2) and (4), but with 1/m = 0 and t α/2,n+m−3 replaced with t α/2,n−2 . For instance, the Wald interval for x 0 corresponding to µ 0 is simply where x 0 is calculated as in Equation (5). To obtain calibration intervals corresponding to a specified mean response, use the option mean.response = TRUE.
To illustrate, consider the crystal growth data taken from Graybill and Iyer (1994, pg. 119). These data are from an experiment in which the weight in grams of 14 crystals were recorded after letting the crystals grow for different (predetermined) amounts of time in hours. The weight of each crystal is plotted against time in Figure 3. Suppose we want to estimate the growth time in hours that corresponds to an average weight of 8 grams; that is, we want to estimate x 0 corresponding to µ 0 = 8. The following code chunk fits a simple linear regression model to the crystal growth data and then computes the ML estimate and a 95% calibration interval for x 0 . mod <-lm(weight~time, data = crystal) (res <-calibrate(mod, y0 = 8, mean.response = T)) ## Figure 3 plotFit(mod, interval = "confidence", pch = 19, shade = T, col.conf = "plum", extend.range = T) abline(h = 8, v = c(res$lower, res$estimate, res$upper), lty = 2) The output for this code chunk should be Thus, in order to produce crystals with an average weight of 8 grams, they should be grown for an estimated 15.8882 hours. A 95% confidence interval for the growth time is (14.6590, 17.1596). Obviously, if mean.response = TRUE, then y0 can only take a single value; otherwise, an error will be displayed as in the following: calibrate (mod, y0 = c(8, 9), mean.response = T) which displays the message Error in calibrate.default(cbind(x, y), ...) : Only one mean response value allowed. This type of calibration problem is similar to computing the median effective dose (ED 0.5 ), or more generally ED p where 0 < p < 1, in binary response models. In R, these models are usually fit with the glm function from the stats package. The function dose.p from the MASS package (Venables and Ripley, 2002) can then be used to compute the ML estimate of ED p for specified p. An estimate of the asymptotic standard error based on the delta method is also given which can be used to calculate a Wald-based confidence interval for ED p . A future release of investr will likely make an inversion-type interval available for these models as well (i.e. by inverting a confidence interval for the mean response on the link scale). The package drc (Ritz and Streibig, 2005), for fitting dose-response curves, may also be of interest.

Simultaneous calibration intervals
The calibration intervals discussed so far are one-at-a-time intervals. If k new values of the response are observed that each correspond to a different unknown, say x 01 , . . . x 0k , then we need to adjust the critical value used in the inversion and Wald intervals accordingly. The simplest approaches are of course, the Bonferroni and Scheffé procedures. These can be computed by specifying the adjust option which can take any of the following arguments: "none" (the default), "bonferroni", or "scheffe". The value k also needs to be specified. See Miller (1981, chap. 3) for more details.

Nonlinear and polynomial calibration
In application, many relationships are nonlinear (e.g., dose-response curves). The classical estimator along with the inversion and Wald intervals can easily be extended to such nonlinear calibration curves. However, classical inference in these models (such as prediction intervals) are based on large samples and linear approximations (see Bates and Watts (1988, chap. 2)). Thus, for nonlinear calibration curves, the inversion interval provides only an approximate 100(1 − α)% confidence interval for x 0 as does the Wald interval. Calibration in nonlinear models is discussed in further detail in Schwenke and Milliken (1991), Seber and Wild (2003, pp. 245-250), and Huet (2004, chap. 5). For calibration in polynomial models, see Brown (1993, pp. 47-88) and Seber and Lee (2003, pg. 172).
The invest function can be used for inverse estimation with any univariate regression model in R that inherits from class lm or nls (with the exception of weighted fits). For instance, consider the regression model Y i = f (x i ; β) + i (i = 1, . . . , n), where f may or may not be linear in θ. If we wish to estimate x 0 given an observation y 0 , then the point estimate x 0 is given by solving y 0 = f x; θ for x. The solution will be unique as long as f is monotonic in the region of interest. The invest function computes x 0 by calling the stats functions predict and uniroot to solve

Approximate confidence intervals
Equation (2) gives a closed-form expression for the inversion interval for the case of simple linear regression. In more complicated cases, such an expression is not available and the interval must be found numerically. It can be shown (see, for example, Seber and Wild (2003, pp. 245-246)) that where S 2 x 0 is the estimated variance of f x 0 ; θ . An approximate 100(1 − α)% confidence interval for x 0 is then given by the set Essentially, invest finds the lower and upper limits for this interval by solving the equations T x − t α/2,n−p = 0 and T x − t 1−α/2,n−p = 0 numerically for x 0 . For the special case of the simple linear regression model, These limits coincide with Equation (2) and the coverage probability is exactly 1 − α. The Wald interval for x 0 is also easily extended. It has the basic form: The estimated standard error se ( x 0 ) is based on a first-order Taylor Series approximation. For the special case of the simple linear regression model, this approximation results in Equation (3).
Since the point estimate and confidence intervals for x 0 are obtained numerically using uniroot, invest has the additional options lower, upper, tol, and maxiter. See the reference manual for details.
To get an idea of how invest works, consider the data in Figure 4 which were generated from the model Y = 5 + x − sin(x) + 1.5Z, where Z is a standard normal random variable. Suppose we want to estimate x 0 given a new observation, say , which is certainly invertible, but f −1 cannot be expressed in closed-form using a finite number of terms; thus, x 0 = f −1 (y 0 = 22) must be obtained numerically. The following chunk of code generates the sample data, calculates x 0 and an approximate 95% Wald interval for x 0 using the invest function.
As indicated by the bootstrap distribution obtained in the next section, the symmetric Wald interval obtained here seems unrealistic for this problem. In the next section, we show how to obtain a bias-corrected and accelerated (BC a ) bootstrap confidence interval for x 0 using the boot package (Canty and Ripley, 2013).

Bootstrap calibration intervals
The bootstrap (Efron, 1979) provides an alternative means for computing calibration intervals. This is useful in nonlinear calibration problems where inference traditionally relies on large samples, approximate normality, and linear approximations. A practical guide to the bootstrap is provided by Davison and Hinkley (1997) and the companion R package boot, and bootstrap resampling for controlled calibration is discussed in Jones and Rocke (1999). Although it is likely for a "bootstrap" option to appear in a future release of investr, it is quite simple to set up using the recommended boot package. First, we discuss a naive approach to calculating bootstrap calibration intervals.
To implement this, all we need to do is make the following changes to boot.fun (the changes are colored magenta): boot.fun <-function(data, i) { boot.mod <-nls(fit + res[i]~theta1/(1 + exp(theta2 + theta3*log(conc))), start = list(theta1 = 1000, theta2 = -1, theta3 = 1), data = data) ## Simulate the correct variance Y0 <-y0 + sample(data$res, size = length(y0), replace = T) ## Make sure the original estimate also gets returned if (all(i == 1:n)) x0.fun(mod, y0) else x0.fun(boot.mod, Y0) } Rerunning the previous example with the modified boot.fun function produces a more realistic 95% confidence interval for x 0 of (1.818, 2.950) and an estimated standard error for x 0 of 0.2861 (similar to that obtained by the delta method earlier). Figure 5 shows a histogram and normal Q-Q plot of the 9,999 bootstrap replicates of x 0 . The bootstrap distribution of x 0 is comparable to the approximate posterior density of x 0 given in Racine-Poon (1988, fig. 9). The bootstrap distribution is skewed to the right and clearly not normal, suggesting that the Wald interval may not be realistic for this problem.

Summary
We introduced the investr package for computing point estimates along with inversion and Waldbased confidence intervals for linear and nonlinear calibration problems with constant variance. We also showed how the boot package can be used for constructing approximate 100(1 − α)% calibration intervals, which for convenience, may be incorporated into a future release of investr. The authors are currently working on extending the package to handle the case of heteroscedastic errors, random coefficient models (e.g., objects of class lme from the recommended nlme package (Pinheiro et al., 2013)), and even multivariate calibration problems (e.g., objects of class mlm).