ipwErrorY: An R Package for Estimation of Average Treatment Effect with Misclassified Binary Outcome

Abstract:

It has been well documented that ignoring measurement error may result in severely biased inference results. In recent years, there has been limited but increasing research on causal inference with measurement error. In the presence of misclassified binary outcome variable, considered the inverse probability weighted estimation of the average treatment effect and proposed valid estimation methods to correct for misclassification effects for various settings. To expedite the application of those methods for situations where misclassification in the binary outcome variable is a real concern, we implement correction methods proposed by and develop an R package ipwErrorY for general users. Simulated datasets are used to illustrate the use of the developed package.

Cite PDF Tweet

Published

Aug. 17, 2019

Received

May 29, 2018

Citation

Shu & Yi, 2019

Volume

Pages

11/1

337 - 351


1 Introduction

Causal inference methods have been widely used in empirical research . The propensity score, defined to be the probability of an individual to receive the treatment, plays an important role in conducting causal inference . Many causal inference methods have been developed based on the propensity score . These methods commonly require modeling the treatment assignment, which can be difficult in some applications. To protect against misspecification of the treatment model, various methods have been proposed . Among them, doubly robust methods are often advocated since the resulting estimators are still consistent when either the treatment model or the outcome model (but not both) is misspecified; such an attractive property is referred to as double robustness.

Although many methods are available for causal inference such as for the estimation of average treatment effects (ATE), those methods are vulnerable to poor quality data. Typically when data are error-contaminated, most existing methods would be inapplicable. It has been well documented that measurement error in variables can often lead to seriously biased inference results in many settings .

In the context of causal inference with error-prone data, there has been limited but increasing research on the impact of measurement error on causal inference and the development of correction methods to deal with measurement error. For instances, proposed a correction estimation method when baseline covariates are error-prone. examined the bias arising from ignoring misclassification in the treatment variable. developed a correction method to correct for treatment misclassification using validation data.

In settings with misclassification in the binary outcome variable, explored the estimation of ATE using the inverse probability weighted (IPW) method. They derived the asymptotic bias caused by misclassification and developed consistent estimation methods to eliminate the misclassification effects. Their development covers practical scenarios where (1) the misclassification probabilities are known, or (2) the misclassification probabilities are unknown but validation data or replicates of outcome measurements are available for their estimation. They further propose a doubly robust estimator to provide protection against possible misspecification of the treatment model.

The methods developed by enjoy wide applications, because misclassified binary outcome data arise commonly in practice. For example, the self-reported smoking status without being confirmed by biochemical tests is subject to misclassification; results of screening tests are often subject to false positive error and/or false negative error. For datasets with outcome misclassification, ignoring misclassification effects may lead to severely biased results. To expedite the application of the correction methods for general users, we develop an R package, called ipwErrorY , to implement the methods by for practical settings where the commonly-used logistic regression model is employed for the treatment model and the outcome model. The package focuses on measurement error in the outcome Y only but not on other types of measurement error, such as measurement error in covariates.

The remainder is organized as follows. First, we introduce the notation and framework. Secondly, we describe the methods to be implemented in R. Thirdly, we present the implementation steps and illustrate the use of the package with examples. Finally, a discussion is given. The developed R package ipwErrorY is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=ipwErrorY.

2 Notation and framework

Let Y1 be the binary potential outcome that would have been observed had the individual been treated, and Y0 be the binary potential outcome that would have been observed had the individual been untreated. Let X be the vector of baseline covariates; T be the observed binary treatment variable; and Y be the observed binary outcome.

Being consistent with the usual causal inference framework , we make the following standard assumptions:

Assumption 1 (Consistency): Y=TY1+(1T)Y0;

Assumption 2 (No Unmeasured Confounding): (Y1,Y0)T|X;

Assumption 3 (Positivity): 0<P(T=1|X)<1.

The objective is to estimate the ATE, defined as τ0=E(Y1)E(Y0). Suppose we have a sample of size n. For i=1,,n, we add subscript i to notations X, T and Y to denote the corresponding variable for individual i.

In the presence of outcome misclassification, instead of Y, its surrogate version Y is observed. We assume that (1)P(Y=a|Y=b,X,T)=P(Y=a|Y=b)fora,b=0,1. That is, conditional on the true value of the outcome, the misclassification probability is assumed to be homogeneous for all the individuals, regardless of their covariate information or treatment status. For ease of exposition, write pab=P(Y=a|Y=b). Then the sensitivity and the specificity are expressed by p11 and p00, respectively, and the misclassification probabilities are p01=1p11 and p10=1p00.

3 Estimation methods

In this section we present estimation methods for τ0 under three scenarios. We first start with the case where misclassification probabilities are given, and then consider settings where misclassification probabilities are unknown but can be estimated by using additional data sources.

Estimation with known misclassification probabilities

In this subsection we assume that misclassification probabilities p11 and p10 are known. For i=1,,n, let ei=P(Ti=1|Xi) be the conditional probability for individual i to receive the treatment, also termed as the propensity score , a quantity that plays an important role in causal inference. To correct for outcome misclassification effects, proposed an estimator of τ0 given by (2)τ^=1p11p10{1ni=1nTiYie^i1ni=1n(1Ti)Yi1e^i}, where e^i is an estimate of the propensity score ei=P(Ti=1|Xi) obtained by fitting the treatment model relating T to X.

The estimator τ^, given by ((2)), is a consistent estimator of τ0, provided regularity conditions including Assumptions 1-3. The sandwich variance estimate of τ^ can be obtained using the theory of estimating functions .

To estimate the propensity score ei for i=1,,n, we specifically characterize ei using the widely-used logistic regression model. That is, the treatment model is given by logitP(Ti=1|Xi)=γ0+γXXi, for i=1,,n, where γ=(γ0,γX) is the vector of parameters. As a result, an unbiased estimating function of γ is taken as the score function {Ti11+exp(γ0γXXi)}(1,Xi).

Let θ=(τ,γ). showed that Ψ(Yi,Ti,Xi;θ)=({Ti11+exp(γ0γXXi)}(1,Xi)TiYiei(1Ti)Yi1ei(p11p10)τ)(#eq:) is an unbiased estimating function of θ. Solving i=1nΨ(Yi,Ti,Xi;θ)=0 for θ yields an estimator of θ, denoted by θ^.

Let θ0=(τ0,γ0) be the true value of θ. Define A(θ0)=E{(/θ)Ψ(Y,T,X;θ)|θ=θ0} and B(θ0)=E{Ψ(Y,T,X;θ)Ψ(Y,T,X;θ)|θ=θ0}. Under regularity conditions, we have that n(θ^θ0)dN(0,A(θ0)1B(θ0)A(θ0)1)asn. Consequently, the variance of θ^ can be estimated by the empirical sandwich estimator: Var^(θ^)=1nAn(θ^)1Bn(θ^)An(θ^)1, where An(θ^)=1ni=1nθΨ(Yi,Ti,Xi;θ)|θ=θ^ and Bn(θ^)=1ni=1nΨ(Yi,Ti,Xi;θ)Ψ(Yi,Ti,Xi;θ)|θ=θ^. Then the variance estimate of τ^ in ((2)) is given by Var^(τ^)=v^11, where v^11 is the element of the first row and the first column of Var^(θ^). A (1α)100% confidence interval of τ0 is given by τ^±zα/2×Var^(τ^), where α is a specified value between 0 and 1, and zα/2 is the upper α/2 quantile of the standard normal distribution.

Estimation with validation data

In this subsection we assume that misclassification probabilities p11 and p10 are unknown and that there is an internal validation subsample V of size nV which collects measurements of variables X, T, Y and Y.

With the validation data, p11 and p10 can be estimated as (3)p^11=iVYiYiiVYiandp^10=iV(1Yi)YiiV(1Yi), respectively.

To estimate τ0, one may use error-free outcome data of the validation subsample to construct an estimator (4)τ^V=1nViVTiYie^i1nViV(1Ti)Yi1e^i, of τ0, or alternatively, one may apply ((2)) to estimate τ0 using non-validation data with the resulting estimator given by (5)τ^N=1p^11p^10{1nnViVTiYie^i1nnViV(1Ti)Yi1e^i}, where p^11 and p^10 are given by ((3)).

Although the validation data based estimator τ^V given by ((4)) and the non-validation data based estimator τ^N given by ((5)) are both consistent estimators of τ0, they both incur efficiency loss due to the inability of utilizing all the available data.

considered the linear combination of τ^V and τ^N (6)τ^(c)=cτ^V+(1c)τ^N, where c is a constant between 0 and 1.

For any c, the consistency of τ^(c) is immediate due to the consistency of τ^V and τ^N. However, the efficiency of τ^(c) depends on the choice of c. Typically, Var{τ^(c)} is minimized at (7)cOPT=Var(τ^N)Cov(τ^V,τ^N)Var(τ^V)+Var(τ^N)2Cov(τ^V,τ^N), suggesting that τ^(cOPT)=cOPTτ^V+(1cOPT)τ^N is the optimal estimator among the linear combination estimators formulated as ((6)). Furthermore, cOPT can be estimated by c^OPT=Var^(τ^N)Cov^(τ^V,τ^N)Var^(τ^V)+Var^(τ^N)2Cov^(τ^V,τ^N), where Var^(τ^N), Cov^(τ^V,τ^N) and Var^(τ^V) are the estimates for Var(τ^N), Cov(τ^V,τ^N) and Var(τ^V), respectively.

To obtain Var^(τ^N), Cov^(τ^V,τ^N) and Var^(τ^V), constructed an unbiased estimating function by combining the estimating functions ((4)) and ((5)), where they introduced different symbols, say τV and τN, to denote the parameter τ for which ((4)) and ((5)), respectively, are used to estimate; both τV and τN have the true value τ0. Let θ=(τV,τN,γ,p11,p10). Define Ψc(Yi,Ti,Xi,Yi;θ)=({Ti11+exp(γ0γXXi)}(1,Xi)(YiYip11Yi)I(iV)nnV{(1Yi)Yip10(1Yi)}I(iV)nnV{TiYiei(1Ti)Yi1eiτV}I(iV)nnV{TiYiei(1Ti)Yi1ei(p11p10)τN}I(iV)nnnV), where I() is the indicator function. Then Ψc(Yi,Ti,Xi,Yi;θ) is an unbiased combined estimating function of θ. Solving i=1nΨc(Yi,Ti,Xi,Yi;θ)=0 for θ yields an estimator of θ, denoted by θ^. The variance of θ^ can be estimated by the empirical sandwich estimator, denoted as Var^(θ^). Let v^i,j be the element of the ith row and the jth column of Var^(θ^). Then Var^(τ^V)=v^1,1, Cov^(τ^V,τ^N)=v^1,2, and Var^(τ^N)=v^2,2.

Finally, pointed out the two associated conditions: Var(τ^V)+Var(τ^N)2Cov(τ^V,τ^N)0 and 0c1. If one or both conditions are violated with empirical estimates, c^OPT is then set to be 1 if τ^V has smaller variance than τ^N and 0 otherwise. The resulting optimal linear combination estimator τ^(c^OPT) is (8)τ^OPT=c^OPTτ^V+(1c^OPT)τ^N, with the variance estimate given by Var^(τ^OPT)={Var^(τ^V)+Var^(τ^N)2Cov^(τ^V,τ^N)}c^OPT2{2Var^(τ^N)2Cov^(τ^V,τ^N)}c^OPT+Var^(τ^N). A (1α)100% confidence interval of τ0 is given by τ^OPT±zα/2×Var^(τ^OPT), where α is a specified value between 0 and 1, and zα/2 is the upper α/2 quantile of the standard normal distribution.

Estimation with replicates

In this subsection we assume that misclassification probabilities p11 and p10 are unknown and that two repeated outcome measurements are available for each individual. Suppose Yi(1) and Yi(2) are two independent replicates of Yi. Let η denote the prevalence P(Y=1) and πr be the probability of obtaining r outcome observations equal to 1 among two repeated outcome measurements for r=0,1. Then π0=η(1p11)2+(1η)(1p10)2;

π1=2η(1p11)p11+2(1η)(1p10)p10. Let θ=(τ,γ,η,p11,p10). considered an unbiased estimating function of θ, given by Ψr{Yi(1),Yi(2),Ti,Xi;θ}=({Ti11+exp(γ0γXXi)}(1,Xi){1Yi(1)}{1Yi(2)}π0Yi(1){1Yi(2)}+Yi(2){1Yi(1)}π1TiYiei(1Ti)Yi1ei(p11p10)τ), where Yi={Yi(1)+Yi(2)}/2, together with a constraint imposed for achieving parameters identifiability .

Let τ^R denote the estimator of τ0 obtained by solving (9)i=1nΨr{Yi(1),Yi(2),Ti,Xi;θ}=0 for θ. The variance of τ^R can be estimated by the empirical sandwich estimator Var^(τ^R). A (1α)100% confidence interval of τ0 is given by τ^R±zα/2×Var^(τ^R), where α is a specified value between 0 and 1, and zα/2 is the upper α/2 quantile of the standard normal distribution.

Finally, we comment that when implementing ((9)), one of the following constraints is often used in applications: (C1) sensitivity equals specificity (i.e., p11=p00), (C2) sensitivity p11 is known, (C3) specificity p00 is known, and (C4) prevalence η is known. These four constraints are implemented in our R package.

Choosing a suitable identifiability constraint is primarily driven by the nature of data. When the false positive rate p10 and the false negative rate p01 are close, it is reasonable to impose the constraint that the sensitivity equals the specificity. When there is prior information on the value of the sensitivity, the specificity, or the prevalence, it is plausible to add the identifiability constraint (C2), (C3) or (C4). For example, in smoking cessation studies, patients who quit smoking (with Y=1) are unlikely to report that they still smoke, so it may be reasonable to set the constraint p11=1. Sometimes, researchers may use the disease prevalence reported from another similar study for their own study, when such a prevalence is perceived to be close to that of the target population.

Doubly robust estimation

To protect against model misspecification, proposed a doubly robust estimator of τ0: (10)τ^DR=E^(Y1)E^(Y0), where (11)E^(Y1)=1ni=1n{TiYie^i(p11p10)Tie^ie^iq^i1Tie^i(p10p11p10)},

(12)E^(Y0)=1ni=1n{(1Ti)Yi(1e^i)(p11p10)+Tie^i1e^iq^i01Ti1e^i(p10p11p10)}, q^i1 is an estimate of qi1=P(Yi=1|Ti=1,Xi) and q^i0 is an estimate of qi0=P(Yi=1|Ti=0,Xi).

The estimator τ^DR enjoys the double robustness property in the sense that it is still consistent if one of the treatment model and the outcome model is incorrectly specified. In our developed R package, we particularly implement the following two scenarios.

Scenario 1 (Shared covariate effects for the treated and untreated groups):

Suppose the outcome model is postulated as (13)logitP(Yi=1|Ti,Xi)=β0+βTTi+βXi, where β0, βT and β are the parameters. The model reflects the setting where the treated and untreated groups share the same covariate effect β on the outcome.

By ((1)) and ((13)), the observed likelihood function contributed from individual i is Li(β0,βT,β)=P(Yi|Xi,Ti)=P(Yi=1|Xi,Ti)P(Yi|Xi,Ti,Yi=1)+P(Yi=0|Xi,Ti)P(Yi|Xi,Ti,Yi=0)=11+exp{β0βTTiβXi}{p11Yi+(1p11)(1Yi)}+exp{β0βTTiβXi}1+exp{β0βTTiβXi}{p10Yi+(1p10)(1Yi)}. With regularity conditions, maximizing the observed likelihood i=1nLi(β0,βT,β) with respect to (β0,βT,β) gives a consistent estimator of (β0,βT,β), denoted as (β^0,β^T,β^). It follows that qi1 and qi0, are, respectively, estimated by q^i1=11+exp(β^0β^Tβ^Xi) and q^i0=11+exp(β^0β^Xi).

Scenario 2 (Possibly different covariate effects for the treated and untreated groups):

Suppose that the outcome model is postulated as logitP(Yi=1|Ti=1,Xi)=β01+β1Xi for the treated group and logitP(Yi=1|Ti=0,Xi)=β00+β0Xi for the untreated group, where the parameters (β01,β1) for the treated group may differ from the parameters (β00,β0) for the untreated group.

To obtain a consistent estimator (β^01,β^1) of (β01,β1) and a consistent estimator (β^00,β^0) of (β00,β0), we employ the observed likelihood for the treated group and the untreated group separately. For example, the observed likelihood function contributed from individual l in the treated group (i.e., Tl=1) is L1,l(β01,β1)=P(Yl|Tl=1,Xl)=P(Yl=1|Tl=1,Xl)P(Yl|Yl=1)+P(Yl=0|Tl=1,Xl)P(Yl|Yl=0)=11+exp{β01β1Xl}{p11Yl+(1p11)(1Yl)}+exp{β01β1Xl}1+exp{β01β1Xl}{p10Yl+(1p10)(1Yl)}. Maximizing the observed likelihood l:Tl=1L1,l(β01,β1) with respect to β01 and β1 gives us a consistent estimator (β^01,β^1), provided regularity conditions. Similarly, we calculate the observed likelihood function L0,k(β00,β0) for individual k in the untreated group (i.e., Tk=0), and then obtain the estimator (β^00,β^0) by maximizing the observed likelihood l:Tk=0L0,k(β00,β0) with respect to β00 and β0. Thus, qi1 and qi0 are estimated by q^i1=11+exp(β^01β^1Xi) and q^i0=11+exp(β^00β^0Xi), respectively.

Variance estimator of τ^DR:

Consistency and asymptotic normality of τ^DR can be established using the theory of estimating functions. Below we derive the sandwich variance estimator of τ^DR by constructing an unbiased estimating function using the “delta method” in the M-estimator framework .

Define βF to be the vector of parameters for the outcome models. Under Scenario 1 with shared covariate effects for the treated and untreated groups, βF=(β0,βT,β). Under Scenario 2 with possibly different covariate effects for the treated and untreated groups, βF=(βF1,βF0) with βF1=(β01,β1) and βF0=(β00,β0). Let θ=(γ,βF,μ1,μ0,τ), where μ1 and μ0 represent E(Y1) and E(Y0), respectively.

We construct the following unbiased estimating function for θ: Ψdr{Yi,Ti,Xi;θ}=({Ti11+exp(γ0γXXi)}(1,Xi)ψ(Yi,Ti,Xi;βF){TiYie^i(p11p10)Tie^ie^iq^i1Tie^i(p10p11p10)}μ1{(1Ti)Yi(1e^i)(p11p10)+Tie^i1e^iq^i01Ti1e^i(p10p11p10)}μ0μ1μ0τ), where ψ(Yi,Ti,Xi;βF) is the unbiased estimating equation for βF derived from the observed likelihood. Specifically, under Scenario 1, ψ(Yi,Ti,Xi;βF)=log{Li(β0,βT,β)}/(β0,βT,β), and under Scenario 2, ψ(Yi,Ti,Xi;βF)=(ψ1(Yi,Ti,Xi;βF1),ψ0(Yi,Ti,Xi;βF0)) with ψ1(Yi,Ti,Xi;βF1)=log{L1,i(β01,β1)}/(β01,β1)I(Ti=1)nnT and ψ0(Yi,Ti,Xi;βF0)=log{L0,i(β00,β0)}/(β00,β0)I(Ti=0)nnnT, where nT is the size of the treated group.

By the theory of estimating functions , solving i=1nΨdr{Yi,Ti,Xi;θ}=0 for θ yields an estimator of θ, denoted by θ^. Let θ0 be the true value of θ. Define

A(θ0)=E{(/θ)Ψdr(Y,T,X;θ)|θ=θ0}

and

B(θ0)=E{Ψdr(Y,T,X;θ)Ψdr(Y,T,X;θ)|θ=θ0}.

Under regularity conditions, we have that n(θ^θ0)dN(0,A(θ0)1B(θ0)A(θ0)1)asn. The sandwich variance estimator of θ^ is given by Var^(θ^)=1nAn(θ^)1Bn(θ^)An(θ^)1, where An(θ^)=1ni=1nθΨdr(Yi,Ti,Xi;θ)|θ=θ^ and Bn(θ^)=1ni=1nΨdr(Yi,Ti,Xi;θ)Ψdr(Yi,Ti,Xi;θ)|θ=θ^. Then Var^(τ^DR) is the element of the last row and the last column of Var^(θ^).

Finally, we comment that Scenario 2 allows for different covariates-outcome associations for the treated and untreated groups and provides more flexibility than Scenario 1. However, implementing Scenario 2 involves separately estimating parameters of the outcome model for the treated and untreated groups. When one group has a small size, estimation results may be unsatisfactory. In this case, imposing common covariate effects for the treated and untreated groups as in Scenario 1 can help achieve reasonable estimation results.

4 Implementation in R and Examples

We develop an R package ipwErrorY, which implements the methods described in the previous section. The developed package imports R packages stats and nleqslv . To illustrate the use of ipwErrorY, for each method, we simulate a dataset and then apply a function to analyze the dataset. To make sure users can reproduce the results, we use the function set.seed to generate data. Moreover, the simulated data provide users a clear sense about the data structure.

Implementation and example with known error

The function KnownError produces the ATE estimate using the correction method with known misclassification probabilities along with the sandwich-variance-based standard error and (1α)100% confidence interval. Specifically, KnownError is defined as

KnownError(data, indA, indYerror, indX, sensitivity, specificity, confidence=0.95)

with arguments described in detail in ipwErrorY documentation. Below is an example to illustrate the use of KnownError.

We first load the package in R:

R> library("ipwErrorY")

Using sensitivity 0.95 and specificity 0.85, we create da, a dataset of size 2000 with “X1”, “A” and “Yast” being the column names for the covariate, treatment and misclassified outcome, respectively:

R> set.seed(100)
R> X1 = rnorm(2000) 
R> A = rbinom(2000, 1, 1/(1 + exp(-0.2 - X1)))
R> Y = rbinom(2000, 1, 1/(1 + exp(-0.2 - A - X1)))
R> y1 = which(Y == 1)
R> y0 = which(Y == 0) 
R> Yast = Y
R> Yast[y1] = rbinom(length(y1), 1, 0.95)
R> Yast[y0] = rbinom(length(y0), 1, 0.15)
R> da = data.frame(X1 = X1, A = A,Yast = Yast)

By using the function head, we print the first six observations of dataset da so that the data structure is clearly shown as follows:

R> head(da)
           X1 A Yast
1 -0.50219235 1    1
2  0.13153117 1    1
3 -0.07891709 1    1
4  0.88678481 0    1
5  0.11697127 1    1
6  0.31863009 1    1

We call the developed function KnownError with sensitivity 0.95 and specificity 0.85, and obtain a list of the estimate, the sandwich-variance-based standard error and a 95% confidence interval:

R> KnownError(data = da, indA = "A", indYerror = "Yast", indX = "X1", 
+     sensitivity = 0.95, specificity = 0.85, confidence=0.95)
$Estimate
[1] 0.1702513

$Std.Error
[1] 0.02944824

$`95% Confidence Interval`
[1] 0.1125338 0.2279688

Implementation and example with validation data

The function EstValidation produces the results for the method with validation data; they include the optimal linear combination estimate, the sandwich-variance-based standard error, (1α)100% confidence interval and the estimated sensitivity and specificity. Specifically, EstValidation is defined as

EstValidation(maindata, validationdata, indA, indYerror, indX, indY, confidence=0.95)

with arguments described in detail in ipwErrorY documentation. Below is an example to illustrate the use of EstValidation.

Using sensitivity 0.95 and specificity 0.85, we create mainda which is the non-validation main data of size 1200, and validationda which is the validation data of size 800:

R> set.seed(100)
R> X1= rnorm(1200)   
R> A = rbinom(1200, 1, 1/(1 + exp(-0.2 - X1)))
R> Y= rbinom(1200, 1, 1/(1 + exp(-0.2 - A - X1)))
R> y1 = which(Y == 1)
R> y0 = which(Y==0) 
R> Yast = Y
R> Yast[y1] = rbinom(length(y1), 1, 0.95)
R> Yast[y0] = rbinom(length(y0), 1, 0.15)
R> mainda = data.frame(A = A, X1 = X1, Yast = Yast)
R> X1 = rnorm(800)   
R> A = rbinom(800, 1, 1/(1 + exp(-0.2 - X1)))
R> Y = rbinom(800, 1, 1/(1 + exp(-0.2 - A - X1)))
R> y1 = which(Y == 1)
R> y0 = which(Y == 0) 
R> Yast = Y
R> Yast[y1] = rbinom(length(y1), 1, 0.95)
R> Yast[y0] = rbinom(length(y0), 1, 0.15)
R> validationda = data.frame(A = A, X1 = X1, Y = Y, Yast = Yast)

We print the first six observations of non-validation data mainda and validation data validationda:

R> head(mainda)
  A          X1 Yast
1 1 -0.50219235    0
2 0  0.13153117    0
3 1 -0.07891709    1
4 1  0.88678481    1
5 0  0.11697127    1
6 1  0.31863009    1
R> head(validationda)
  A            X1 Y Yast
1 0 -0.0749961081 0    0
2 1 -0.9470827924 1    1
3 1  0.0003758095 1    1
4 0 -1.5249574007 0    0
5 1  0.0983516474 0    0
6 0 -1.5266078213 1    1

The preceding output clearly reveals that the non-validation data and validation data differ in the data structure. The non-validation data mainda record measurements of the treatment, covariate and misclassified outcome, indicated by the column names “A”, “X1” and “Yast”, respectively. In comparison, the validation data validationda record measurements of the treatment, covariate, misclassified outcome and the true outcome, indicated by the column names “A”, “X1”, “Yast”, and “Y”, respectively.

To apply the optimal linear combination method with validation data, we call the developed function EstValidation and obtain a list of the estimate, the sandwich-variance-based standard error, a 95% confidence interval, and the estimated sensitivity and specificity:

R> EstValidation(maindata = mainda, validationdata = validationda, indA = "A", 
+     indYerror = "Yast",  indX = "X1",  indY = "Y", confidence=0.95)
$Estimate
[1] 0.1714068

$Std.Error
[1] 0.02714957

$`95% Confidence Interval`
[1] 0.1181946 0.2246189

$`estimated sensitivity and estimated specificity`
[1] 0.9482072 0.8557047

Implementation and example with replicates

The function Est2Replicates produces the results for the method with replicates; they include the estimate, the sandwich-variance-based standard error, (1α)100% confidence interval, and the imposed constraint(s), and the information on sensitivity and specificity. Specifically, Est2Replicates is defined as

Est2Replicates(data, indA, indYerror, indX, constraint=c("sensitivity equals 
  specificity", "known sensitivity", "known specificity", "known prevalence"),
  sensitivity = NULL, specificity = NULL, prevalence = NULL, confidence=0.95)

with arguments described in detail in ipwErrorY documentation. Below is an example to illustrate the use of Est2Replicates.

Using sensitivity 0.95 and specificity 0.85, we create da, a dataset of size 2000 with “A”, “X1”, and {“Yast1”, “Yast2”} being the column names for the treatment, covariate, and two replicates of outcome, respectively:

R> set.seed(100) 
R> X1 = rnorm(2000)   
R> A = rbinom(2000, 1, 1/(1 + exp(-0.2 - X1)))
R> Y = rbinom(2000, 1, 1/(1 + exp(-0.2 - A - X1)))
R> y1 = which(Y == 1)
R> y0 = which(Y == 0) 
R> Yast1 = Y
R> Yast1[y1] = rbinom(length(y1), 1, 0.95)
R> Yast1[y0] = rbinom(length(y0), 1, 0.15)
R> Yast2 = Y
R> Yast2[y1] = rbinom(length(y1), 1, 0.95)  
R> Yast2[y0] = rbinom(length(y0), 1, 0.15)
R> da = data.frame(A = A, X1 = X1, Yast1 = Yast1, Yast2 = Yast2)

By using the function head, we print the first six observations of dataset da so that the data structure is clearly shown as follows:

R> head(da)
  A          X1 Yast1 Yast2
1 1 -0.50219235     1     1
2 1  0.13153117     1     1
3 1 -0.07891709     1     1
4 0  0.88678481     1     0
5 1  0.11697127     1     1
6 1  0.31863009     1     1

To apply the method with replicates, we call the developed function Est2Replicates with the imposed constraint that specificity equals 0.85. The following list of the estimate, the sandwich-variance-based standard error, a 95% confidence interval, the imposed constraint and the information on sensitivity and specificity is returned:

R> Est2Replicates(data = da,  indA = "A", indYerror = c("Yast1", "Yast2"),
+     indX = "X1", constraint = "known specificity", sensitivity = NULL, 
+     specificity = 0.85, prevalence = NULL, confidence=0.95)
$Estimate
[1] 0.1908935

$Std.Error
[1] 0.02687287

$`95% Confidence Interval`
[1] 0.1382236 0.2435634

$`imposed constraint`
[1] "known specificity"

$`estimated sensitivity and assumed specificity`
[1] 0.95 0.85

Implementation and example of doubly robust estimation

The function KnownErrorDR produces the ATE estimate using the doubly robust correction method along with the sandwich-variance-based standard error and (1α)100% confidence interval. Specifically, KnownErrorDR is defined as

KnownErrorDR(data, indA, indYerror, indXtrt, indXout, sensitivity, specificity, 
  sharePara=FALSE, confidence=0.95)

with arguments described in detail in ipwErrorY documentation. Below is an example to illustrate the use of KnownErrorDR.

Using sensitivity 0.95 and specificity 0.85, we create da, a dataset of size 2000 with “A”, {“X”, “xx”} and “Yast” being the column names for the treatment, covariates and misclassified outcome, respectively:

R> set.seed(100)
R> X = rnorm(2000)   
R> xx = X^2
R> A = rbinom(2000, 1, 1/(1 + exp(-0.1 - X - 0.2*xx)))
R> Y = rbinom(2000, 1, 1/(1 + exp(1 - A - 0.5*X - xx)))
R> y1 = which(Y == 1)
R> y0 = which(Y == 0) 
R> Y[y1] = rbinom(length(y1), 1, 0.95)
R> Y[y0] = rbinom(length(y0), 1, 0.15)
R> Yast = Y
R> da = data.frame(A = A, X = X, xx = xx, Yast = Yast)

By using the function head, we print the first six observations of dataset da so that the data structure is clearly shown as follows:

R> head(da)
  A           X          xx Yast
1 1 -0.50219235 0.252197157    1
2 1  0.13153117 0.017300447    1
3 1 -0.07891709 0.006227907    1
4 0  0.88678481 0.786387298    0
5 1  0.11697127 0.013682278    1
6 1  0.31863009 0.101525133    0

When applying the doubly robust method with sensitivity 0.95 and specificity 0.85, covariates indicated by column names “X” and “xx” are both included in the treatment model and the outcome model. Let the outcome model be fit for the treated and untreated groups separately. We call the developed function KnownErrorDR and obtain a list of the estimate, the sandwich-variance-based standard error, and a 95% confidence interval:

R> KnownErrorDR(data = da, indA = "A", indYerror = "Yast", indXtrt = c("X", "xx"), 
+    indXout = c("X", "xx"), sensitivity = 0.95, specificity = 0.85,  
+    sharePara = FALSE, confidence=0.95)
$Estimate
[1] 0.2099162

$Std.Error
[1] 0.02811472

$`95% Confidence Interval`
[1] 0.1548124 0.2650201

5 Discussion

Misclassified binary outcome data arise frequently in practice and present a challenge in conducting causal inference. Discussion on addressing this issue is rather limited in the literature. developed the IPW estimation methods for ATE with mismeasured outcome effects incorporated. To expedite the application of these correction methods, we develop an R package ipwErrorY. For practical settings where the treatment model and the outcome model are specified as logistic regression models, we implement the correction methods developed by for settings with known misclassification probabilities, validation data, or replicates of the outcome data as well as the doubly robust method with known misclassification probabilities. Our package offers a useful and convenient tool for data analysts to perform valid inference about ATE when the binary outcome variable is subject to misclassification.

For each function of ipwErrorY, we implement the sandwich variance estimate to construct a normality-based confidence interval. Confidence intervals can also be constructed by bootstrapping , which can be done by leveraging available functions of ipwErrorY. Below we provide example code to produce normality-based and percentile-based bootstrap confidence intervals for a doubly robust estimate with 200 bootstrap replicates.

R> drFUN<-function(dt) {
+   KnownErrorDR(data = dt, indA = "A",  indYerror = "Yast", indXtrt = c("X", "xx"), 
+                indXout = c("X", "xx"), sensitivity = 0.95, specificity = 0.85,  
+                sharePara = FALSE, confidence=0.95)$`Estimate`
+ }
R> EST=drFUN(dt=da)
R> set.seed(100)
R> resultsBoot=replicate(200,drFUN(dt=da[sample(1:nrow(da),replace=TRUE),]))
R> STD=sd(resultsBoot)
R> lowN=EST-qnorm(1-(1-0.95)/2)*STD
R> upN=EST+qnorm(1-(1-0.95)/2)*STD
R> CIn=c(lowN,upN)
R> lowP=as.numeric(quantile(resultsBoot,probs=0.025))
R> upP=as.numeric(quantile(resultsBoot,probs=0.975))
R> CIp=c(lowP,upP)

We print the resultant bootstrap normality-based and percentile-based confidence intervals, respectively, as follows.

R> CIn
[1] 0.1562355 0.2635969
R> CIp
[1] 0.1610038 0.2655065

To make sure the users can reproduce the results, here we call the function set.seed before KnownErrorDR. If set.seed is not used, then the variance estimates generated at different times can differ due to the inner randomness of the bootstrap method.

This example code can be easily modified to produce bootstrap confidence intervals for an estimate obtained from a different method; one needs only to replace KnownErrorDR with the function in ipwErrorY that corresponds to the method.

Package ipwErrorY requires the data be complete (i.e., no missing values). An error message is shown when NAs in the dataset are detected. For example, if we artificially introduce an NA in dataset da and call the developed function KnownErrorDR, an error message is displayed:

R> da[1,1]=NA
R> KnownErrorDR(data = da, indA = "A",  indYerror = "Yast", indXtrt = c("X", "xx"), 
+              indXout = c("X", "xx"), sensitivity = 0.95, specificity = 0.85,  
+              sharePara = FALSE, confidence=0.95)
Error in KnownErrorDR(data = da, indA = "A", indYerror = "Yast", indXtrt = c("X",  : 
  invalid dataset with NAs (missing data detected)

Once seeing this error message, users need to check their dataset to see if the NAs can be replaced with suitable values. If missing values do occur, the easiest way is to take the subsample of complete observations to conduct analysis. The resulting point estimates can be reasonable if the missing data mechanism is missing completely at random (MCAR) ; in this instance, efficiency loss can occur. However, when missing data are not MCAR, this procedure often yields biased results.

Our implementation uses a logistic regression model with a linear function of covariates for both the treatment and the outcome processes, as it is perhaps the most widely-used parametric tool to model a binary variable. Such a logistic regression model can be generalized to include additional terms, such as higher order terms, nonlinear functions, or interactions of the covariates. In this case, the users need only to first create an expanded dataset with those terms included as additional columns of new “covariates", and then use the ipwErrorY package to analyze the expanded dataset.

6 Acknowledgments

The authors thank the editor and two reviewers for their helpful comments and suggestions which improved the manuscript. This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and was partially supported by a Collaborative Research Team Project of Canadian Statistical Sciences Institute (CANSSI).

CRAN packages used

ipwErrorY, nleqslv

CRAN Task Views implied by cited packages

NumericalMathematics

Note

This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.

Footnotes

    References

    M. Babanezhad, S. Vansteelandt and E. Goetghebeur. Comparison of causal effect estimators under exposure misclassification. Journal of Statistical Planning and Inference, 140: 1306–1319, 2010. URL https://doi.org/10.1016/j.jspi.2009.11.015.
    H. Bang and J. M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61: 962–973, 2005. URL https://doi.org/10.1111/j.1541-0420.2005.00377.x.
    D. Braun, C. Zigler, F. Dominici and M. Gorfine. Using validation data to adjust the inverse probability weighting estimator for misclassified treatment. Harvard University Biostatistics Working Paper Series. Working Paper 201, 1–19, 2016.
    J. P. Buonaccorsi. Measurement error: Models, methods, and applications. Chapman & Hall/CRC, London, 2010.
    R. J. Carroll, D. Ruppert, L. A. Stefanski and C. M. Crainiceanu. Measurement Error in Nonlinear Models: A Modern Perspective, 2nd Edition. Chapman & Hall/CRC, Boca Raton, 2006.
    B. Efron. The jackknife, the bootstrap and other resampling plans. SIAM, Philadelphia, 1982.
    B. Efron and R. Tibshirani. An introduction to the bootstrap. London: Chapman & Hall/CRC, 1993.
    W. A. Fuller. Measurement error models. John Wiley & Sons, 1987.
    P. Gustafson. Measurement error and misclassification in statistics and epidemiology. Chapman & Hall/CRC, London, 2003.
    B. Hasselman. Nleqslv: Solve systems of nonlinear equations. 2016. URL https://CRAN.R-project.org/package=nleqslv. R package version 3.0.
    M. A. Hernán and J. M. Robins. Causal inference. Chapman & Hall/CRC, Boca Raton, forthcoming, 2019.
    C. C. Heyde. Quasi-Likelihood and Its Application: A General Approach to Optimal Parameter Estimation. Springer-Verlag, 1997.
    G. W. Imbens and D. B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, New York, 2015.
    R. J. Little and D. B. Rubin. Statistical analysis with missing data, 2nd edition. John Wiley & Sons, 2002.
    J. K. Lunceford and M. Davidian. Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine, 23: 2937–2960, 2004. URL https://doi.org/10.1002/sim.1903.
    D. F. McCaffrey, J. Lockwood and C. M. Setodji. Inverse probability weighting with error-prone covariates. Biometrika, 100: 671–680, 2013. URL https://doi.org/10.1093/biomet/ast022.
    W. K. Newey and D. McFadden. Large sample estimation and hypothesis testing. Handbook of Econometrics, 4: 2111–2245, 1994.
    J. M. Robins, M. A. Hernán and B. Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11: 550–560, 2000.
    J. M. Robins, A. Rotnitzky and L. P. Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89: 846–866, 1994. URL https://doi.org/10.2307/2290910.
    P. R. Rosenbaum. Model-based direct adjustment. Journal of the American Statistical Association, 82: 387–394, 1987. URL https://doi.org/10.2307/2289440.
    P. R. Rosenbaum. Propensity score. In Encyclopedia of Biostatistics, 5: 3551–3555, 1998.
    P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70: 41–55, 1983. URL https://doi.org/10.1093/biomet/70.1.41.
    K. J. Rothman, S. Greenland and T. L. Lash. Modern epidemiology, 3rd edition. Lippincott Williams & Wilkins, Philadelphia, 2008.
    D. O. Scharfstein, A. Rotnitzky and J. M. Robins. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94: 1096–1120, 1999. URL https://doi.org/10.2307/2669930.
    D. Shu and G. Y. Yi. Causal inference with measurement error in outcomes: Bias analysis and estimation methods. Statistical Methods in Medical Research, 2017. URL https://doi.org/10.1177/0962280217743777.
    D. Shu and G. Y. Yi. IpwErrorY: Inverse probability weighted estimation of average treatment effect with misclassified binary outcome. 2019. URL https://CRAN.R-project.org/package=ipwErrorY. R package version 2.1.
    L. A. Stefanski and D. D. Boos. The calculus of m-estimation. The American Statistician, 56: 29–38, 2002. URL https://doi.org/10.1198/000313002753631330.
    I. White, C. Frost and S. Tokunaga. Correcting for measurement error in binary and continuous variables using replicates. Statistics in Medicine, 20: 3441–3457, 2001. URL https://doi.org/10.1002/sim.908.
    G. Y. Yi. Statistical Analysis with Measurement Error or Misclassification: Strategy, Method and Application. Springer-Verlag, 2017.
    G. Y. Yi and W. He. Analysis of case-ontrol data with interacting misclassified covariates. Journal of Statistical Distributions and Applications, 4: 1–16, 2017. URL https://doi.org/10.1186/s40488-017-0069-0.

    Reuse

    Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

    Citation

    For attribution, please cite this work as

    Shu & Yi, "ipwErrorY: An R Package for Estimation of Average Treatment Effect with Misclassified Binary Outcome", The R Journal, 2019

    BibTeX citation

    @article{RJ-2019-029,
      author = {Shu, Di and Yi, Grace Y.},
      title = {ipwErrorY: An R Package for Estimation of Average Treatment Effect with Misclassified Binary Outcome},
      journal = {The R Journal},
      year = {2019},
      note = {https://rjournal.github.io/},
      volume = {11},
      issue = {1},
      issn = {2073-4859},
      pages = {337-351}
    }