This article introduces testforDEP, a portmanteau R package implementing for the first time several modern tests and visualization tools for independence between two variables. While classical tests for independence are in the base R packages, there have been several recently developed tests for independence that are not available in R. This new package combines the classical tests including Pearson’s product moment correlation coefficient method, Kendall’s testforDEP
provides a common interface for each of the tests and returns test statistics, corresponding AUK
provides an interface to visualize Kendall plots and computes the area under the Kendall plot similar to computing the area under a receiver operating characteristic (ROC) curve.
In this article, we present the
testforDEP R package,
a package for testing dependence between variables. Since in
nonparametric settings there are no most powerful tests for
independence, it is important to implement various reasonable
independence tests tuned for different classes of alternatives. This
package addresses a need for implementing both classical and novel
tests of independence in a single easy to use function. The function
cor.test
offered in the base R package gives classical tests for
association/correlation between two samples, using the Pearson product
moment correlation coefficient (Pearson 1920), Kendall cor.test
is
helpful to test for independence between two samples when the samples
are linearly dependent or monotonically associated. However the function
cor.test
is less powerful to detect general structures of dependence
between two random variables, including non-linear and/or random-effect
dependence structures. Many modern statistical methodologies have been
proposed to detect these general structures of dependence. These methods
include an empirical likelihood based test (Einmahl and McKeague 2003), a
density-based empirical likelihood ratio test for independence
(Vexler et al. 2014), data-driven rank test for independence
(Kallenberg and Ledwina 1999), maximal information coefficient
(Reshef et al. 2011) and a continuous analysis of variance test
(Wang et al. 2015). These methods are useful to detect complex
structures of dependence and until now there were no R packages
available to implement these modern tests.
We propose the new package testforDEP, combining both classical and modern tests. The package testforDEP also provides dependence visualization tools such as the Kendall plot with the area under Kendall plot (Vexler et al. 2017) which previously has not been available in R packages. Moreover, we develop an exact test based on the maximal information coefficient to detect dependence between two random variable and we perform a power analysis for these different tests. The testforDEP package is available from the Comprehensive R Archive Network and also available for download at the author’s department webpage.
We will focus on tests of bivariate independence of two random variables
In the following subsections we outline the classical tests of independence.
The Pearson product-moment correlation coefficient
The Kendall rank correlation coefficient
Spearman’s rank correlation coefficient
Similar to the Kendall rank correlation coefficient, a weakness of the Spearman rank correlation coefficient test is that it has low power to detect non-monotonic dependency structures.
Hoeffding’s test for dependence is proposed in (Hoeffding 1948) as a
test for two random variables with continuous distribution functions.
Hoeffding’s
We denote
It can be shown that
Due to the limiting computing tools available for the original
publication in 1948, the author only provided cutoff tables for small
sample sizes. With advanced computing power, we compute the
0.2976 | 0.1145 | 0.0377 | 0.0191 | 0.0086 | 0.0037 | |
0.1548 | 0.0589 | 0.0209 | 0.0098 | 0.0045 | 0.0020 | |
0.0952 | 0.0378 | 0.0131 | 0.0061 | 0.0028 | 0.0013 | |
0.0437 | 0.0184 | 0.0060 | 0.0028 | 0.0011 | 0.0006 | |
-0.0635 | -0.0232 | -0.0078 | -0.0037 | -0.0018 | -0.0007 | |
-0.0794 | -0.0294 | -0.0097 | -0.0045 | -0.0021 | -0.0008 |
We note this test is not as powerful as Pearson’s test when the dependency is linear.
In the following subsections we outline the modern tests of independence.
A density-based empirical likelihood ratio test is proposed in
(Vexler et al. 2014) as a nonparametric test of dependence for two
variables. Following (Vexler et al. 2014), to test the null hypothesis we
use the test statistic
The statistic
This test is very powerful for many nonparametric alternatives as it is designed to approximate corresponding nonparametric likelihood ratios, however, when the dependency is linear with normal data other methods such as Pearson and Spearman tests will outperform this method.
Kallenberg and Ledwina (1999) propose two data-driven rank tests for
independence,
In (Kallenberg and Ledwina 1999) the
This test is very powerful when data are from an exponential family, however the statistic is rather complicated and a required assumption is that the data distributions belong to semiparametrically defined families.
The maximal information coefficient (MIC) proposed in
(Reshef et al. 2011) is a measure of strength of the linear and
non-linear association between two variables
It has been noted in (Simon and Tibshirani 2014) that the MIC approach has serious power deficiencies and when used for large-scale exploratory analysis will result in too many false positives.
N | E | RN | N | E | RN | N | E | RN | N | E | RN | |
0.61|0.61|0.61 | 0.50|0.52|0.50 | 0.38|0.37|0.38 | 0.33|0.33|0.33 | |
0.61|0.61|0.61 | 0.43|0.43|0.43 | 0.33|0.33|0.33 | 0.30|0.30|0.30 | |
0.61|0.61|0.61 | 0.41|0.40|0.41 | 0.31|0.31|0.31 | 0.28|0.28|0.28 | |
0.24|0.24|0.24 | 0.31|0.30|0.31 | 0.25|0.25|0.25 | 0.23|0.24|0.23 | |
0.24|0.24|0.24 | 0.27|0.26|0.27 | 0.23|0.23|0.23 | 0.21|0.21|0.21 | |
0.12|0.12|0.12 | 0.24|0.23|0.24 | 0.21|0.21|0.21 | 0.20|0.20|0.20 | |
0.12|0.12|0.12 | 0.23|0.22|0.23 | 0.20|0.20|0.20 | 0.19|0.19|0.19 | |
0.11|0.11|0.11 | 0.20|0.20|0.20 | 0.18|0.18|0.18 | 0.17|0.17|0.17 |
An empirical likelihood based test is proposed by
(Einmahl and McKeague 2003). The approach is via localizing the empirical
likelihood with one or more time variables implicit in the given null
hypothesis and then constructing an omnibus test statistic by
integrating the log-likelihood ratio over those variables. We first
consider a null hypothesis,
This test is characterized by moments and is very powerful for many nonparametric alternatives. However, when the dependency is linear with normally distributed data, this test is outperformed by Pearson and Spearman tests.
The Kendall plot, also called K-plot, is a visualization of dependence in a bivariate random sample. It is proposed in (Genest and Boies 2012). Similar to a Chi-plot which detects association in random samples from continuous bivariate distributions, the K-plot adapts the concept of a probability plot to detect dependence.
The horizontal axis in a K-plot,
Note
Let
when
when
when
The area under the Kendall plot (AUK) is proposed in (Vexler et al. 2017) as an index to independence. It applies area-under-curve analysis and computes the area under the Kendall plot. Some properties are listed below:
when
when
when
Note
The R Package testforDEP
includes two functions, testforDEP()
,
AUK()
and one data frame, LSAT
.
The function testforDEP()
is the interface for the tests. The function
testforDEP()
takes two vectors
test statistic,
bootstrap confidence intervals.
The interface of testforDEP()
is:
testforDEP(x, y, data, test = c("PEARSON", "KENDALL", "SPEARMAN",
"VEXLER", "TS2", "V", "MIC", "HOEFFD", "EL"),
p.opt = c("dist", "MC", "table"), num.MC = 10000,
BS.CI = 0, rm.na = FALSE, set.seed = FALSE)
where x
and y
are two equal-length numeric vectors of input data.
The parameter data
is an alternative input that takes a data frame
with two columns representing x
or y
are not
provided, parameter data
is taken as input. The parameter test
specifies the hypothesis test to implement. A summary of the test
options with the associated statistic returned is given below,
PEARSON
,
KENDALL
,
SPEARMAN
,
VEXLER
,
TS2
,
V
,
MIC
,
HOEFFD
,
EL
,
Parameter p.opt
is the option for computing p.opt = "MC"
. The
parameter num.MC
gives the Monte Carlo simulation number and will only
be used when p.opt = "MC"
. When p.opt = "dist"
or p.opt = "table"
,
num.MC
will be ignored. To balance accuracy and computation time,
num.MC
must num.MC = 10000
as default. The
parameter BS.CI
specifies a BS.CI = 0
, the confidence intervals will not be computed.
Parameter rm.na
is a flag for removing rows with missing data.
Parameter set.seed
is a flag for setting the random number generator
seed.
The function AUK()
provides an interface for Kendall plots and AUK. It
takes two vectors
AUK,
sorted
bootstrap confidence intervals.
The interface of AUK()
is:
AUK(x, y, plot = FALSE, main = "Kendall plot", Auxiliary.line = TRUE,
BS.CI = 0, set.seed = FALSE)
where x
and y
are two equal-length numeric vectors of input data.
Parameter plot
is a flag for drawing the Kendall plot. Parameter
main
determines the title of the plot. If plot = FALSE
, main
will
be ignored. Parameter Auxiliary.line
is a flag for auxiliary line.
Parameter BS.CI
specifies a BS.CI = 0
,
confidence intervals will not be computed. Parameter set.seed
is a
flag for setting seed.
Here we present two data analysis examples to demonstrate the utility of the testforDEP package. In the first example, the data are average law school admission test (LSAT) and grade point average (GPA) from 82 law schools (details described in (Efron and Tibshirani 1994)). The data frame LSAT is available in testforDEP package. The aim is to determine the dependence between LSAT and GPA. Table 3 gives the data and a scatter plot is shown in Figure 1 (Top Left).
School | LSAT | GPA | School | LSAT | GPA | School | LSAT | GPA |
---|---|---|---|---|---|---|---|---|
1 | 622 | 3.23 | 28 | 632 | 3.29 | 56 | 641 | 3.28 |
2 | 542 | 2.83 | 29 | 587 | 3.16 | 57 | 512 | 3.01 |
3 | 579 | 3.24 | 30 | 581 | 3.17 | 58 | 631 | 3.21 |
4 | 653 | 3.12 | 31 | 605 | 3.13 | 59 | 597 | 3.32 |
5 | 606 | 3.09 | 32 | 704 | 3.36 | 60 | 621 | 3.24 |
6 | 576 | 3.39 | 33 | 477 | 2.57 | 61 | 617 | 3.03 |
7 | 620 | 3.10 | 34 | 591 | 3.02 | 62 | 637 | 3.33 |
8 | 615 | 3.40 | 35 | 578 | 3.03 | 63 | 572 | 3.08 |
9 | 553 | 2.97 | 36 | 572 | 2.88 | 64 | 610 | 3.13 |
10 | 607 | 2.91 | 37 | 615 | 3.37 | 65 | 562 | 3.01 |
11 | 558 | 3.11 | 38 | 606 | 3.20 | 66 | 635 | 3.30 |
12 | 596 | 3.24 | 39 | 603 | 3.23 | 67 | 614 | 3.15 |
13 | 635 | 3.30 | 40 | 535 | 2.98 | 68 | 546 | 2.82 |
14 | 581 | 3.22 | 41 | 595 | 3.11 | 69 | 598 | 3.20 |
15 | 661 | 3.43 | 42 | 575 | 2.92 | 70 | 666 | 3.44 |
16 | 547 | 2.91 | 43 | 573 | 2.85 | 71 | 570 | 3.01 |
17 | 599 | 3.23 | 44 | 644 | 3.38 | 72 | 570 | 2.92 |
18 | 646 | 3.47 | 45 | 545 | 2.76 | 73 | 605 | 3.45 |
19 | 622 | 3.15 | 46 | 645 | 3.27 | 74 | 565 | 3.15 |
20 | 611 | 3.33 | 47 | 651 | 3.36 | 75 | 686 | 3.50 |
21 | 546 | 2.99 | 48 | 562 | 3.19 | 76 | 608 | 3.16 |
22 | 614 | 3.19 | 49 | 609 | 3.17 | 77 | 595 | 3.19 |
23 | 628 | 3.03 | 50 | 555 | 3.00 | 78 | 590 | 3.15 |
24 | 575c | 3.01 | 51 | 586 | 3.11 | 79 | 558 | 2.81 |
25 | 662 | 3.39 | 52 | 580 | 3.07 | 80 | 611 | 3.16 |
26 | 627 | 3.41 | 53 | 594 | 2.96 | 81 | 564 | 3.02 |
27 | 608 | 3.04 | 54 | 594 | 3.05 | 82 | 575 | 2.74 |
55 | 560 | 2.93 |
Figure 1 (Top Left) suggests a linear positive relationship between LSAT and GPA. To confirm this, we draw the Kendall plot and compute AUK (see Figure 1 (Top Right)). Figure 1 (Top Right) shows a curve above the diagonal and AUK of 0.665, both of which suggest a positive dependence.
Now consider the dependence tests provided in package testforDEP. Table 4 shows the test statistics and p-values. All tests, classical and modern, suggest dependence between LSAT and GPA. From this analysis, it is clear that LSAT and GPA variables are dependent.
To further examine this data and explore the comparable power of the
modern methods, we also examine a subset of 15 randomly chosen points
from the full dataset as shown in Figure 1 (Bottom Left). An analysis of
this subset reveals significant
Full Data | Subset Data | |||
---|---|---|---|---|
test | statistic | p-value | statistic | p-value |
Pearson | 10.46 | < 0.0001 | 1.77 | 0.1074 |
Kendall | 7.46 | < 0.0001 | 69.00 | 0.0990 |
65.70 | < 0.0001 | 12.05 | 0.0067 | |
80.76 | < 0.0001 | 3.16 | 0.2003 | |
69.04 | < 0.0001 | 8.51 | 0.0146 | |
MIC | 0.53 | < 0.0001 | 0.38 | 0.0993 |
Hoeffding | 0.22 | < 0.0001 | 0.08 | 0.0658 |
13.64 | < 0.0001 | 2.15 | 0.0238 |
test | statistic | p-value |
---|---|---|
Pearson | -1.015 | 0.3156 |
Kendall | -2.321 | 0.0196 |
Spearman | -2.900 | 0.0050 |
140.554 | < 0.0001 | |
13.347 | 0.0042 | |
82.712 | < 0.0001 | |
MIC | 0.666 | < 0.0001 |
Hoeffding | 0.085 | < 0.0001 |
12.254 | < 0.0001 |
|
|
|
|
In the second example, we demonstrate the power of the modern methods to
detect non-linear dependence. This example was originally published in
(Rohrer et al. 2004) and re-examined in
(Chen et al. 2010). It involves analyzing microarray expression
data collected to identify critical genes in photoreceptor degeneration.
The data in Figure 2 are from 2 genes (Pcp2 and Pla2g7) with each point
representing another gene with its x-coordinate being the Euclidean
distance from Pla2g7 and the y-coordinate as the Euclidean distance from
Pcp2. The Euclidean distance is calculated from expression profiles
between wild type (wt) and rod degeneration (rd) mice. The
correlation shown in Figure 2 measures the relationship between how the
Pla2g7 gene interacts with all the other genes vs. how the Pcp2 gene
interacts with all the other genes. From Figure 2 we see the genes are
negatively correlated when their transformed levels are both less than
5. Otherwise, these genes are positively correlated. We proceed to
analyze the data using the testforDEP package. The results are
presented in Table 5. From this analysis, we see a large gain in power
using the modern methods such as
|
|
The R package testforDEP is a new package that allows users for the first time to analyze complex structures of dependence via modern test statistics and visualization tools. Moreover, a novel exact method based on the MIC measurement has been proposed in the package testforDEP. Future work is necessary to further develop a formal testing structure based on the MIC statistic. We note that users of this package should carefully consider the strengths and weaknesses of the proposed tests as outlined in each section when deciding what test to apply. Further, a topic for future work is to develop a complete set of consistent criteria to determine which test is most appropriate for a specific setting. We note extensive simulations comparing each of the tests are provided in (Miecznikowski et al. 2017) which may help guide users to choose among the tests. Ultimately, we believe that the package testforDEP will help investigators identify dependence using the cadre of modern tests implemented to detect dependency.
Albert Vexler’s efforts were supported by the National Institutes of Health (NIH) grant 1G13LM012241-01. We are grateful to the Editor, Associate Editor and the two referees for helpful comments that clearly improved this manuscript.
Bayesian, ClinicalTrials, Databases, Econometrics, MissingData, ReproducibleResearch
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Miecznikowski, et al., "testforDEP: An R Package for Modern Distribution-free Tests and Visualization Tools for Independence", The R Journal, 2018
BibTeX citation
@article{RJ-2018-057, author = {Miecznikowski, Jeffrey and Hsu, En-shuo and Chen, Yanhua and Vexler, Albert}, title = {testforDEP: An R Package for Modern Distribution-free Tests and Visualization Tools for Independence}, journal = {The R Journal}, year = {2018}, note = {https://rjournal.github.io/}, volume = {10}, issue = {2}, issn = {2073-4859}, pages = {282-295} }