There is an inherent relationship between two-sided hypothesis tests and confidence intervals. A series of two-sided hypothesis tests may be inverted to obtain the 100(1 − α)% confidence interval, defined as the set of parameter values not rejected at significance level α.
Applied statisticians are increasingly being encouraged to report
confidence intervals (CI) and parameter estimates along with p-values
from hypothesis tests. The htest
class of the
stats package is ideally
suited to these kinds of analyses, because all the related statistics
may be presented when the results are printed. For exact two-sided tests
applied to discrete data, a test-CI inconsistency may occur: the p-value
may indicate a significant result at level α while the associated
100(1 − α)% confidence interval may contain the null parameter value. The
exactci and exact2x2 packages avoid this problem by reporting matching
confidence intervals.
I was asked to help design a study to determine if adding a new drug
(albendazole) to an existing treatment regimen (ivermectin) for the
treatment of a parasitic disease (lymphatic filariasis) would increase
the incidence of a rare serious adverse event when given in an area
endemic for another parasitic disease (loa loa). There are many
statistical issues related to that design (Fay et al. 2007), but here we consider
a simple scenario to highlight the point of this paper. A previous mass
treatment using the existing treatment had 2 out of 17877 experiencing
the serious adverse event (SAE) giving an observed rate of 11.2 per
100,000. Suppose the new treatment was given to 20,000 new subjects and
suppose that 10 subjects experienced the SAE giving an observed rate of
50 per 100,000. Assuming Poisson rates, an exact test using
poisson.test(c(2,10), c(17877,20000))
from the stats package (throughout
we assume version 2.11.0 for the stats package) gives a p-value of
0.042, suggesting a significant difference in rates at the 0.05 level.
Yet poisson.test
also gives a 95% confidence interval on the rate ratio of
(0.024, 1.050), which contains 1 (i.e., equal rates). This is a test-CI
inconsistency.
We briefly review inferences using the p-value function for discrete
data. For details see Hirji (2006) or Blaker (2000). Suppose you have a
discrete statistic t with associated random variable T, such that larger
values of T suggest larger values of the parameter of interest, θ. Under
the point null hypothesis θ = θ0, let f(x) = Pr(T = x), and define the
one-sided p-values pL = Pr(T ≤ t) and pU = Pr(T ≥ t).
We list 3 ways to define the two-sided p-value for testing θ = θ0,
called the central
, minlike
, and blaker
methods, respectively:

central
: pc = min(1, 2 min(pL, pU)), i.e., twice the smaller one-sided p-value. This is
motivated by the associated inversion confidence intervals which are
central intervals, i.e., they guarantee that the lower (upper) limit
of the 100(1 − α)% interval is greater (less) than the true parameter
with probability no more than α/2.

minlike
: pm = the sum of f(x) over all x with f(x) ≤ f(t), i.e., the
probability of outcomes no more likely than the one observed.

blaker
: pb = Pr(γ(T) ≤ γ(t)), where γ(x) = min(Pr(T ≤ x), Pr(T ≥ x)); in
other words, the probability of the smaller observed tail plus the
largest opposite-tail probability that does not exceed it. This
is motivated by Blaker (2000) which comprehensively
studies the associated method for confidence intervals, although the
method had been mentioned in the literature earlier, see e.g.,
Cox and Hinkley (1974), p. 79. This is called the CT (combined tail) method by
Hirji (2006).
There are other ways to define two-sided p-values, such as defining
extreme values according to the score statistic (see e.g., Hirji (2006), or Agresti and Min (2001)). Note that
pb ≤ pc always holds. If central
p-values are used, then the two-sided test and the associated one-sided
tests are coherent: rejection at level α by the two-sided test implies
rejection at level α/2 by one of the one-sided tests.
If matching confidence intervals are used then test-CI inconsistencies
will not happen for the central
method, and will happen very rarely
for the minlike
and blaker
methods. We discuss those rare test-CI
inconsistencies in the ‘Unavoidable inconsistencies’ section later, but
the main point of this article is that it is not rare for a minlike
p-value to be inconsistent with a central
confidence interval (Fay 2010) and that
particular test-CI combination is the default for many exact tests in
the stats package. We show
some examples of such inconsistencies in the following sections.
If central
p-values are used, then the matching
exact interval is the Clopper-Pearson confidence interval. These are the
intervals given by binom.test
. The p-value given by binom.test
, however, is a minlike
one, so binom.test
pairs a minlike p-value with a central confidence interval, and test-CI
inconsistencies can occur. When binom.exact
from the exactci package is used instead, the chosen tsmethod
determines both the p-value and the matching confidence interval.
Note that there is a theoretically proven set of shortest confidence
intervals for this problem. These are called the Blyth-Still-Casella
intervals in StatXact (StatXact Procs Version 8). The problem with these
shortest intervals is that they are not nested, so that one could have a
parameter value that is included in the 90% confidence interval but not
in the 95% confidence interval (see Theorem 2 of Blaker (2000)). In
contrast, the matching intervals of the binom.exact
function of the
exactci will always give
nested intervals.
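The Clopper-Pearson limits come from inverting the two one-sided exact binomial tests. A minimal Python sketch (an illustration, not the stats implementation; the function names are ours), using bisection:

```python
from math import comb

def binom_cdf(x, n, p):
    # P(X <= x) for X ~ Binomial(n, p)
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x + 1))

def clopper_pearson(x, n, conf=0.95):
    # Invert the two one-sided exact tests by bisection
    alpha = 1 - conf
    def solve(f):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    # lower limit pL solves P(X >= x; pL) = alpha/2
    lower = 0.0 if x == 0 else solve(lambda p: alpha / 2 - (1 - binom_cdf(x - 1, n, p)))
    # upper limit pU solves P(X <= x; pU) = alpha/2
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper
```

For example, clopper_pearson(5, 10) returns approximately (0.187, 0.813), agreeing with binom.test(5, 10).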
The poisson.test
function from
stats gives the exact
central
confidence intervals (Garwood 1936), while the p-value is a minlike
one, so the same kind of test-CI inconsistency can occur. For example, poisson.test(5, r=1.8)
gives a p-value of 0.036, while the 95% central confidence interval is
(1.624, 11.668), which contains the null rate of 1.8.
The exactci package
contains the poisson.exact
function, which has options for each of the
three methods of defining p-values and gives matching confidence
intervals. The code poisson.exact(5, r=1.8, tsmethod="central")
gives
the same confidence interval as above, but a p-value of 0.073; poisson.exact(5, r=1.8, tsmethod="minlike")
returns a p-value
equal to that of poisson.test together with the matching minlike
confidence interval; and with tsmethod="blaker"
we get the same p-value of 0.036 with the matching blaker interval.
For the control group, let the random variable of the counts be X1 with
mean n1·λ1, and for the new treatment group let it be X2 with mean
n2·λ2. Conditional on the total X1 + X2 = N, the count X2 is binomial
with parameters N and θ = n2·λ2/(n1·λ1 + n2·λ2).
Thus, the null hypothesis that the rate ratio λ2/λ1 equals R is
equivalent to the null hypothesis that

θ = n2·R/(n1 + n2·R).  (1)

Internally, the poisson.exact
function when dealing with two-sample tests simply uses the binom.exact
function and transforms the results using Equation (1).
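As a check on Equation (1), the following Python sketch (an illustration; exactci does this in R, and the function names here are ours) reduces the motivating two-sample data to the conditional binomial under the null of equal rates (R = 1) and recovers the three p-values 0.061, 0.042, and 0.042 reported for this example.

```python
from math import comb

def dbinom(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_sided_pvalues(x, N, theta):
    # Central, minlike and blaker p-values for observing X = x,
    # where X ~ Binomial(N, theta) under the null
    f = [dbinom(k, N, theta) for k in range(N + 1)]
    pL, pU = sum(f[:x + 1]), sum(f[x:])
    central = min(1.0, 2.0 * min(pL, pU))
    minlike = sum(fk for fk in f if fk <= f[x] * (1 + 1e-9))
    tails = [min(sum(f[:k + 1]), sum(f[k:])) for k in range(N + 1)]
    blaker = min(1.0, sum(fk for fk, g in zip(f, tails)
                          if g <= tails[x] * (1 + 1e-9)))
    return central, minlike, blaker

# Motivating data: 2 SAEs in 17877 (control) vs 10 in 20000 (new treatment)
x1, x2, n1, n2, R = 2, 10, 17877, 20000, 1.0   # null: rate ratio R = 1
theta = n2 * R / (n1 + n2 * R)                  # Equation (1)
central, minlike, blaker = two_sided_pvalues(x2, x1 + x2, theta)
```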
Let us return to our motivating example (i.e., testing the difference
between observed rates of 2/17877 and 10/20000). The poisson.test
output pairs a minlike p-value with a central confidence interval,
producing the test-CI inconsistency noted earlier. The poisson.exact
function avoids such test-CI
inconsistency in this case by giving the matching confidence interval;
here are the results of the three tsmethod
options:
tsmethod | p-value | 95% confidence interval |
---|---|---|
central | 0.061 | (0.024, 1.050) |
minlike | 0.042 | (0.035, 0.942) |
blaker | 0.042 | (0.035, 0.936) |
The code fisher.test(matrix(c(4,11,50,569), 2, 2))
gives a p-value derived from the minlike
method, but a 95% confidence interval on the odds ratio derived from the
central
method, so once again a test-CI inconsistency is possible.
As with the other examples,
the test-CI inconsistency disappears when we use either the exact2x2
or fisher.exact
function from the
exact2x2 package.
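The two-sided p-value of fisher.test is the minlike rule applied to the conditional hypergeometric distribution of one cell given the margins. A small Python sketch (an illustration; the function name is ours):

```python
from math import comb

def fisher_minlike(a, b, c, d):
    # Conditional distribution of the upper-left cell given the margins
    m, n, k = a + b, c + d, a + c
    denom = comb(m + n, k)
    def f(x):
        return comb(m, x) * comb(n, k - x) / denom
    lo, hi = max(0, k - n), min(k, m)   # support of the cell count
    fa = f(a)
    # minlike rule: sum over tables no more probable than the observed one
    return sum(f(x) for x in range(lo, hi + 1) if f(x) <= fa * (1 + 1e-7))
```

For Fisher's classic tea-tasting table, fisher_minlike(3, 1, 1, 3) returns about 0.486, matching the two-sided p-value of fisher.test.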
The case not studied in Fay (2010) is when the data are paired, the case
which motivates McNemar’s test. For example, suppose you have twins
randomized to two treatment groups (Test and Control) then test on a
binary outcome (pass or fail). There are 4 possible outcomes for each
pair: (a) both twins fail, (b) the twin in the control group fails and
the one in the test group passes, (c) the twin on the test group fails
and the one in the control group passes, or (d) both twins pass. Here is
a table where the numbers of sets of twins falling in each of the four
categories are denoted a, b, c, and d, respectively:
Test | ||
Control | Fail | Pass |
Fail | a | b |
Pass | c | d |

In order to test if the treatment is helpful, we use only the numbers of
discordant pairs of twins, b and c; this is the idea behind McNemar's
test, implemented in mcnemar.test
.
Case-control data may be analyzed this way as well. Suppose you have a set of people with some rare disease (e.g., a certain type of cancer); these are called the cases. For this design you match each case with a control who is as similar as feasible on all important covariates except the exposure of interest. Here is a table:
Case | ||
Control | Exposed | Not Exposed |
Exposed | a | b |
Not Exposed | c | d |
For this case as well we can use the exact McNemar's test, based only on the two discordant counts b and c.
For either design, we can estimate the odds ratio by b/c, the ratio of the discordant counts.
Test | ||
Control | Fail | Pass |
Fail | 21 | 9 |
Pass | 2 | 12 |
When we perform McNemar's test with the continuity correction we get a
p-value of 0.070, while with
mcnemar.exact
we get the exact McNemar's test p-value of 0.065.
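The exact calculation can be sketched in a few lines of Python (an illustration, not the exact2x2 code; the function name is ours): under the null odds ratio of 1, the discordant count b is Binomial(b + c, 1/2).

```python
from math import comb

def mcnemar_exact_p(b, c):
    # Under the null odds ratio of 1, theta = 1/2, so the
    # discordant count b is Binomial(b + c, 1/2)
    nd = b + c
    f = [comb(nd, k) / 2 ** nd for k in range(nd + 1)]
    pL, pU = sum(f[:b + 1]), sum(f[b:])
    # the null distribution is symmetric, so all two-sided methods agree
    return min(1.0, 2.0 * min(pL, pU))

b, c = 9, 2                  # discordant counts from the table above
p = mcnemar_exact_p(b, c)    # about 0.065
odds_ratio = b / c           # conditional estimate, 4.5
```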
After conditioning on the total number of discordant pairs, nd = b + c,
the count b is binomial with parameters nd and

θ = φ/(1 + φ),  (2)

where φ is the odds ratio
(Breslow and Day (1980), p. 166). Since it is easy to perform exact tests on a
binomial parameter, we can perform exact versions of McNemar's test
internally using the binom.exact
function of the package
exactci then transform
the results into odds ratios via Equation (2). This is how the
calculations are done in the exact2x2
function when paired=TRUE
.
The alternative
and the tsmethod
options work in the way one would
expect. So although McNemar's test was developed as a two-sided test
of the null hypothesis that φ = 1, one-sided versions are available
through the alternative
option. There are three tsmethod
options, but
all three are equivalent to the exact version of McNemar's test when
testing the usual null that φ = 1, because the conditional binomial
distribution is symmetric under that null (see vignette("exactMcNemar")
in
exact2x2). If we
narrowly define McNemar's test as only testing the null that φ = 1,
then there is only one exact McNemar's test, and the differences between
the tsmethod
options become apparent in the calculation of the confidence intervals.
The default is to use central
confidence intervals so that they
guarantee that the lower (upper) limit of the 100(1 − α)% interval is
greater (less) than the true odds ratio with probability no more than
α/2. Such guarantees are not made by the minlike
and blaker
two-sided confidence
intervals; however, the latter give generally tighter confidence
intervals.
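As an illustration of the central option for paired data, the following Python sketch (our own code under stated assumptions: Clopper-Pearson limits for θ found by bisection, then the map φ = θ/(1 − θ) inverted from Equation (2)) computes a 95% central confidence interval for the odds ratio from the discordant counts b = 9 and c = 2. The interval contains 1, consistent with the exact p-value of 0.065 exceeding 0.05.

```python
from math import comb

def binom_cdf(x, n, p):
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x + 1))

def cp_limit(target_tail, x, n, lower):
    # Bisection for a Clopper-Pearson limit on the binomial parameter
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        tail = 1 - binom_cdf(x - 1, n, mid) if lower else binom_cdf(x, n, mid)
        # P(X >= x) increases in p (lower limit); P(X <= x) decreases (upper)
        if (tail < target_tail) == lower:
            lo = mid
        else:
            hi = mid
    return lo

b, c, alpha = 9, 2, 0.05
nd = b + c
th_lo = cp_limit(alpha / 2, b, nd, lower=True)
th_hi = cp_limit(alpha / 2, b, nd, lower=False)
# Map theta to the odds ratio phi = theta / (1 - theta)
or_lo, or_hi = th_lo / (1 - th_lo), th_hi / (1 - th_hi)
```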
In order to gain insight as to why test-CI inconsistencies occur, we can
plot the p-value function. This type of plot explores one data
realization and its many associated p-values, shown on the vertical axis,
representing a series of tests modified by changing the point null
hypothesis parameter shown on the horizontal axis. There is a plot
option in binom.exact
, poisson.exact
, and exact2x2
that plots the p-value as a function of the point null hypotheses, draws
vertical lines at the confidence limits, draws a horizontal line at 1
minus the confidence level, and adds a point at the null hypothesis of
interest.
Other plot functions (exactbinomPlot
, exactpoissonPlot
, and
exact2x2Plot
) can be used to add to that plot for comparing different
methods. In Figure 1 we create such a plot for the
motivating example. Here is the code to create that figure:
x <- c(2, 10)
n <- c(17877, 20000)
poisson.exact(x, n, plot = TRUE)
exactpoissonPlot(x, n, tsmethod = "minlike", doci = TRUE,
                 col = "black", cex = 0.25, newplot = FALSE)
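The same curve can be approximated without the packages. A Python sketch (our own illustration) of the central p-value function for the motivating example, using the conditional binomial of Equation (1) with the rate ratio oriented as in the poisson.test output (first rate over second); scanning where the function stays above 0.05 recovers, to grid accuracy, the central 95% interval (0.024, 1.050):

```python
from math import comb

x1, x2, n1, n2 = 2, 10, 17877, 20000   # motivating data
N = x1 + x2

def central_p(R):
    # R is the rate ratio lambda1/lambda2, as reported by poisson.test;
    # conditional on N, x2 is Binomial(N, theta) with theta as below
    theta = n2 / (n1 * R + n2)
    f = [comb(N, k) * theta ** k * (1 - theta) ** (N - k) for k in range(N + 1)]
    pL, pU = sum(f[:x2 + 1]), sum(f[x2:])
    return min(1.0, 2.0 * min(pL, pU))

grid = [i / 1000 for i in range(1, 2001)]          # R from 0.001 to 2.000
inside = [R for R in grid if central_p(R) > 0.05]
ci = (min(inside), max(inside))                    # approx. (0.024, 1.050)
```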
We see from Figure 1 that the central
method has smoothly
changing p-values, while the minlike
method has discontinuous ones.
The usual confidence interval is the inversion of the central
method
(the limits are the vertical gray lines, where the dotted line at the
significance level intersects with the gray p-values), while the usual
p-value at the null that the rate ratio is 1 is where the black line is.
To see this more clearly we plot the lower right hand corner of
Figure 1 in Figure 2.
From Figure 2 we see why the test-CI inconsistencies occur:
the minlike
method is generally more powerful than the central
method, so the p-values from the minlike
method can reject
a specific null even when the confidence interval from the central
method
implies failing to reject that same null. We see that, in general, if you
use the matching confidence interval to the p-value, there will not be
test-CI inconsistencies.
Although the exactci and
exact2x2 packages do
provide a unified report in the sense described in Hirji (2006), it is
still possible in rare instances to obtain test-CI inconsistencies when
using the minlike
or blaker
two-sided methods (Fay 2010). These
rare inconsistencies are unavoidable due to the nature of the problem
rather than any deficit in the packages.
To show the rare inconsistency problem using the motivating example, we
consider the unrealistic situation where we are testing the null
hypothesis that the rate ratio is 0.93 at the 0.0776 level. The
corresponding matching confidence interval would be a
100(1 − 0.0776)% = 92.24% interval. Using
poisson.exact(x, n, r=.93, tsmethod="minlike", conf.level=1-0.0776)
we
reject the null (since the p-value is no larger than 0.0776), and yet
0.93 is contained in the reported 92.24% matching confidence interval.
Additionally, the options tsmethod="minlike"
or "blaker"
can have
other anomalies (see Vos and Hudson (2008) for the single sample binomial case, and
Fay (2010) for the two-sample binomial case). For example, a data set may
reject the null hypothesis, but fail to reject it if an additional
observation is added, regardless of the value of that additional
observation. Thus, although
the power of the blaker
(or minlike
) two-sided method is always
(almost always) at least as large as that of the central
two-sided method, the
central
method does avoid all test-CI inconsistencies and the
previously mentioned anomalies.
We have argued for using a unified report whereby the p-value and the
confidence interval are calculated from the same p-value function (also
called the evidence function or confidence curve). We have provided
several practical examples. Although the theory of these methods has
been extensively studied (Hirji 2006), software has not been readily
available. The exactci
and exact2x2 packages
fill this need. We know of no other software that provides the minlike
and blaker
confidence intervals, except the
PropCIs package which
provides the Blaker confidence interval for the single binomial case
only.
Finally, we briefly consider closely related software. The
rateratio.test
package does the two-sample Poisson exact test with confidence intervals
using the central
method. The
PropCIs package does
several different asymptotic confidence intervals, as well as the
Clopper-Pearson (i.e. central
) and Blaker exact intervals for a single
binomial proportion. The
PropCIs package also
performs the mid-p adjustment to the Clopper-Pearson confidence interval
which is not currently available in
exactci. Other exact
confidence intervals are not covered in the current version of
PropCIs (Version 0.1-6).
The coin and
perm packages give very
general methods for performing exact permutation tests, although neither
perform the exact matching confidence intervals for the cases studied in
this paper.
I did not perform a comprehensive search of commercial statistical
software; however, SAS (Version 9.2) (perhaps the most comprehensive
commercial statistical software) and StatXact (Version 8) (the most
comprehensive software for exact tests) both lack the
blaker
and minlike
confidence intervals for the binomial, Poisson and
2x2 table cases.
Packages used: exactci, exact2x2, stats, PropCIs, rateratio.test, coin, perm
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Fay, "Two-sided Exact Tests and Matching Confidence Intervals for Discrete Data", The R Journal, 2010
BibTeX citation
@article{RJ-2010-008,
  author = {Fay, Michael P.},
  title = {Two-sided Exact Tests and Matching Confidence Intervals for Discrete Data},
  journal = {The R Journal},
  year = {2010},
  note = {https://rjournal.github.io/},
  volume = {2},
  issue = {1},
  issn = {2073-4859},
  pages = {53-58}
}