testforDEP: An R Package for Modern Distribution-free Tests and Visualization Tools for Independence

Abstract:

This article introduces testforDEP, a portmanteau R package implementing, for the first time in R, several modern tests and visualization tools for independence between two variables. While classical tests for independence are available in base R, several recently developed tests for independence have not been. This new package combines the classical tests, including Pearson's product moment correlation coefficient method, Kendall's τ rank correlation coefficient method and Spearman's ρ rank correlation coefficient method, with modern tests consisting of an empirical likelihood based test, a density-based empirical likelihood ratio test, a Kallenberg data-driven test, a maximal information coefficient test, Hoeffding's independence test and the continuous analysis of variance test. For two input vectors of observations, the function testforDEP provides a common interface for each of the tests and returns test statistics, corresponding p-values and bootstrap confidence intervals as output. The function AUK provides an interface to visualize Kendall plots and computes the area under the Kendall plot, similar to computing the area under a receiver operating characteristic (ROC) curve.


Published: Dec. 8, 2018
Received: Feb 2, 2018
Citation: Miecznikowski, et al., 2018
Volume: 10/2
Pages: 282-295


1 Introduction

In this article, we present the testforDEP R package, a package for testing dependence between variables. Since in nonparametric settings there are no uniformly most powerful tests for independence, it is important to implement various reasonable independence tests tuned for different classes of alternatives. This package addresses the need to implement both classical and novel tests of independence in a single easy-to-use function. The function cor.test offered in base R gives classical tests for association/correlation between two samples, using the Pearson product moment correlation coefficient (Pearson, 1920), the Kendall τ rank correlation coefficient (Kendall, 1938) and the Spearman ρ rank correlation coefficient (Spearman, 1904). The function cor.test is helpful for testing independence between two samples when the samples are linearly dependent or monotonically associated. However, cor.test has less power to detect general structures of dependence between two random variables, including non-linear and/or random-effect dependence structures. Many modern statistical methodologies have been proposed to detect these general structures of dependence. These methods include an empirical likelihood based test (Einmahl and McKeague, 2003), a density-based empirical likelihood ratio test for independence (Vexler, Tsai and Hutson, 2014), a data-driven rank test for independence (Kallenberg and Ledwina, 1999), the maximal information coefficient (Reshef et al., 2011) and a continuous analysis of variance test (Wang et al., 2015). These methods are useful for detecting complex structures of dependence, and until now no R packages were available to implement these modern tests.

We propose the new package testforDEP, combining both classical and modern tests. The package testforDEP also provides dependence visualization tools, such as the Kendall plot with the area under the Kendall plot, which previously have not been available in R packages. Moreover, we develop an exact test based on the maximal information coefficient to detect dependence between two random variables, and we perform a power analysis for these different tests. The testforDEP package is available from the Comprehensive R Archive Network and also for download at the authors' department webpage.

We will focus on tests of bivariate independence of two random variables X and Y, where we have n sample measurements or observations (X_1, X_2, ..., X_n) and (Y_1, Y_2, ..., Y_n), respectively. If X and Y have cumulative distribution functions F_X(x) and F_Y(y) and probability density functions f_X(x) and f_Y(y), we say X and Y are independent if and only if the combined random variable (X, Y) has joint cumulative distribution function F(x,y) = F_X(x)F_Y(y) or, equivalently, if the joint density exists, f(x,y) = f_X(x)f_Y(y) for all x, y ∈ ℝ. We are interested in testing the null hypothesis

$$H_0: X \text{ and } Y \text{ are independent}, \tag{1}$$

which is equivalent to

$$H_0: F(x,y) = F_X(x)F_Y(y) \text{ for all } x, y \in \mathbb{R}. \tag{2}$$

We note there are no uniformly most powerful testing strategies in this setting; thus the choice of test depends on the type of correlation a user is trying to detect. The following sections detail the different tests for independence, outline the components of the testforDEP package and provide real data examples demonstrating the power of the modern methods to detect nonlinear dependence. Finally, we conclude with a brief summary and future directions.

2 Classical tests of independence

In the following subsections we outline the classical tests of independence.

Pearson product moment correlation coefficient

The Pearson product-moment correlation coefficient γ is a well-known measure of linear correlation (or dependence) between two random variables, with full presentations in (Pearson, 1920) and (Hauke and Kossowski, 2011). Using the standard Pearson estimator (denoted γ̂), the statistic T_γ̂ asymptotically follows a t distribution with n − 2 degrees of freedom under independence. Accordingly, a size α rejection rule is:

$$|T_{\hat\gamma}| > t_{1-\alpha/2}, \tag{3}$$

where t_{1−α/2} is the 1 − α/2 quantile of the t distribution with n − 2 degrees of freedom. Advantages of the Pearson test are that it is simple to execute, has high power to detect linear dependency and, when the underlying data are normally distributed, its estimator takes the form of a maximum likelihood estimator. A weakness of the Pearson test is that it has very weak power to detect dependency that is not characterized by first moments and is often not powerful in many nonparametric settings. For example, if X_i is a normally distributed random variable and Y_i = 1/X_i, then the Pearson test will have very low power to detect this structure.
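As a quick illustration (a minimal sketch, not part of the package), the base R function cor.test reproduces the Pearson test and exhibits the weakness noted above; the simulated data and seed are for demonstration only.

# Pearson test detects linear dependence but misses Y = 1/X.
set.seed(1)                            # arbitrary seed for reproducibility
n  <- 100
x  <- rnorm(n)
y1 <- x + rnorm(n, sd = 0.5)           # linear dependence
cor.test(x, y1, method = "pearson")    # small p-value: dependence detected
y2 <- 1 / x                            # strong but non-linear dependence
cor.test(x, y2, method = "pearson")    # typically a large p-value: missed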

Kendall rank correlation coefficient

The Kendall rank correlation coefficient τ, proposed in (Kendall, 1938), is a well-known nonparametric measure of monotonic association between two variables. Using the standard estimator τ̂, the test statistic Z_τ is given by:

$$Z_\tau = \frac{3\hat\tau\sqrt{n(n-1)}}{\sqrt{2(2n+5)}},$$

where Z_τ approximately follows a standard normal distribution under independence. A level α rejection rule for the null hypothesis is:

$$|Z_\tau| > z_{1-\alpha/2}, \tag{4}$$

where z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution. A weakness of the Kendall rank correlation coefficient test is that it has low power to detect non-monotonic dependency structures.
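The normal approximation above can be checked by hand against cor.test; a minimal sketch (assuming no ties in the data) follows.

# Z_tau computed from the formula above, compared with cor.test.
set.seed(2)
n   <- 50
x   <- rnorm(n)
y   <- exp(x)                                  # monotone dependence
tau <- cor(x, y, method = "kendall")
z   <- 3 * tau * sqrt(n * (n - 1)) / sqrt(2 * (2 * n + 5))
2 * pnorm(-abs(z))                             # approximate two-sided p-value
cor.test(x, y, method = "kendall")             # reference result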

Spearman rank correlation coefficient

Spearman's rank correlation coefficient ρ, proposed by (Spearman, 1904), is a nonparametric measure of statistical dependence between two variables; it measures the monotonic association between the two variables (Hauke and Kossowski, 2011). The statistic ρ is defined as:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}, \qquad d_i = R_i - S_i, \tag{5}$$

where R_i and S_i are the ranks of X_i and Y_i, respectively. Note ρ = 0 suggests X and Y are independent. To conduct a test for statistical independence under the null hypothesis, we use the test statistic T_ρ defined as:

$$T_\rho = \rho\sqrt{\frac{n-2}{1-\rho^2}}, \tag{6}$$

which is distributed approximately as a Student's t distribution with n − 2 degrees of freedom under the null hypothesis. Accordingly, a level α rejection rule is:

$$|T_\rho| > t_{1-\alpha/2}. \tag{7}$$

Similar to the Kendall rank correlation coefficient, a weakness of the Spearman rank correlation coefficient test is that it has low power to detect non-monotonic dependency structures.
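A minimal sketch computing ρ and T_ρ directly from (5) and (6) (assuming no ties), with cor.test as a reference:

# Spearman's rho from ranks and the t approximation T_rho.
set.seed(3)
n <- 40
x <- runif(n)
y <- x^3 + rnorm(n, sd = 0.05)            # monotone, non-linear dependence
d     <- rank(x) - rank(y)
rho   <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
t_rho <- rho * sqrt((n - 2) / (1 - rho^2))
2 * pt(-abs(t_rho), df = n - 2)           # approximate two-sided p-value
cor.test(x, y, method = "spearman")       # reference result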

Hoeffding’s test for independence

Hoeffding's test for dependence, proposed in (Hoeffding, 1948), is a test for two random variables with continuous distribution functions. Hoeffding's D is a nonparametric measure of the distance between the joint distribution, F(x,y), and the product of the marginal distributions, F_X(x)F_Y(y). The coefficient D is defined as:

$$D(x,y) = F(x,y) - F_X(x)F_Y(y). \tag{8}$$

We can also define Δ as:

$$\Delta = \Delta(F) = \int D^2(x,y)\, dF(x,y), \tag{9}$$

where F is the joint distribution of X and Y and the integral is over ℝ².

Denote by Ω the class of pairs X, Y whose joint and marginal probability density functions are continuous. It has been shown that if F(x,y) belongs to Ω, then Δ(F) = 0 if and only if D(x,y) ≡ 0, and the random variables X and Y are independent if and only if D(x,y) = 0.

It can be shown that D ranges between −1/60 and 1/30, where larger values of D suggest dependence. An estimator of D, D̂, is defined as:

$$\hat D(x,y) = \hat F(x,y) - \hat F_X(x)\hat F_Y(y). \tag{10}$$

We use D̂ as the test statistic and note that it depends only on the rank order of the observations (Hoeffding, 1948). To compute the test statistic D̂, we use the R package Hmisc (Harrell and Dupont, 2017). Note the returned test statistic is 30 times the original D̂ in (Hoeffding, 1948). A size α test is given by: D̂ > C_α, where C_α is a size α critical value.

Due to the limited computing tools available at the time of the original 1948 publication, the author provided cutoff tables only for small sample sizes. With modern computing power, we compute the C_α cutoffs for n = 10, 20, 50, 100, 200, 500 in Table 1.

Table 1: Hoeffding cutoff values C_α for D̂ based on 5000 Monte Carlo simulations.
         n=10     n=20     n=50     n=100    n=200    n=500
α=0.01   0.2976   0.1145   0.0377   0.0191   0.0086   0.0037
α=0.05   0.1548   0.0589   0.0209   0.0098   0.0045   0.0020
α=0.1    0.0952   0.0378   0.0131   0.0061   0.0028   0.0013
α=0.2    0.0437   0.0184   0.0060   0.0028   0.0011   0.0006
α=0.8    -0.0635  -0.0232  -0.0078  -0.0037  -0.0018  -0.0007
α=0.9    -0.0794  -0.0294  -0.0097  -0.0045  -0.0021  -0.0008

We note this test is not as powerful as Pearson’s test when the dependency is linear.
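A minimal sketch of the Hmisc computation used here; recall that the returned statistic is 30 times Hoeffding's original D̂.

# Hoeffding's D via Hmisc::hoeffd() on non-monotone dependence.
library(Hmisc)
set.seed(4)
n <- 100
x <- rnorm(n)
y <- x^2 + rnorm(n, sd = 0.1)    # non-monotone dependence
h <- hoeffd(x, y)
h$D[1, 2]                        # scaled D-hat statistic
h$P[1, 2]                        # p-value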

3 Modern tests of independence

In the following subsections we outline the modern tests of independence.

Density-based empirical likelihood ratio test for independence

A density-based empirical likelihood ratio test is proposed in (Vexler, Tsai and Hutson, 2014) as a nonparametric test of dependence between two variables. Following (Vexler et al., 2014), to test the null hypothesis we use the test statistic VT_n defined as:

$$VT_n = \max_{0.5n^{\beta_2} \le m \le \gamma_n}\;\max_{0.5n^{\beta_2} \le r \le \gamma_n}\;\prod_{i=1}^{n}\frac{n\,\tilde\Delta_i(m,r)}{2m}, \tag{11}$$

where γ_n = min(n^{0.9}, n/2), 0.75 < β_2 < 0.9, and Δ̃_i(m,r) is defined as

$$\tilde\Delta_i(m,r) = \hat F\big(X_{(s_i+r)}, Y_{(i+m)}\big) - \hat F\big(X_{(s_i-r)}, Y_{(i+m)}\big) - \hat F\big(X_{(s_i+r)}, Y_{(i-m)}\big) + \hat F\big(X_{(s_i-r)}, Y_{(i-m)}\big) + n^{-\beta_1}\big[\hat F_X\big(X_{(s_i+r)}\big) - \hat F_X\big(X_{(s_i-r)}\big)\big],$$

where F̂ is the empirical joint distribution of X and Y; Y_(1) ≤ Y_(2) ≤ ... ≤ Y_(n) are the order statistics of Y_1, Y_2, ..., Y_n; X_t(i) is the concomitant of the ith order statistic as defined in (David and Nagaraja, 1970); F̂_X is the empirical marginal distribution of X; s_i is the integer such that X_(s_i) = X_t(i); and β_1 ∈ (0, 0.5).

The statistic VT_n attains its maximum over m ≥ 0.5n^{β_2} and r ≥ 0.5n^{β_2} at m = 0.5n^{β_2} and r = 0.5n^{β_2} (Vexler et al., 2014). Thus, we simplify (11) to

$$VT_n = \prod_{i=1}^{n} n^{1-\beta_2}\,\tilde\Delta_i\big([0.5n^{\beta_2}], [0.5n^{\beta_2}]\big), \tag{12}$$

where [x] denotes the nearest integer to x. Accordingly, a size α rejection rule of the test is given by:

$$\log(VT_n) > C_\alpha, \tag{13}$$

where C_α is an α-level test threshold. It is established in (Vexler et al., 2014) that VT_n is distribution free under (1), and the critical values C_α are estimated by 50,000 Monte Carlo simulations with X_1, ..., X_n ~ Uniform[0,1] and Y_1, ..., Y_n ~ Uniform[0,1].

This test is very powerful for many nonparametric alternatives, as it is designed to approximate the corresponding nonparametric likelihood ratios; however, when the dependency is linear with normal data, other methods such as the Pearson and Spearman tests will outperform it.
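A minimal sketch calling this test through the package interface (the simulated data are illustrative; num.MC is reduced from the default for speed):

# Density-based empirical likelihood ratio test via testforDEP.
library(testforDEP)
set.seed(5)
x <- rnorm(80)
y <- 1 / abs(x)                  # dependence largely missed by Pearson
testforDEP(x, y, test = "VEXLER", p.opt = "MC", num.MC = 1000)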

Kallenberg data-driven test for independence

Kallenberg and Ledwina (1999) propose two data-driven rank tests for independence, TS2 and V. The TS2 statistic uses the intermediate statistic T_k defined as:

$$T_k = \sum_{j=1}^{k}\left\{\frac{1}{n}\sum_{i=1}^{n} b_j\!\left(\frac{R_i - 1/2}{n}\right) b_j\!\left(\frac{S_i - 1/2}{n}\right)\right\}^2,$$

where b_j denotes the jth orthonormal Legendre polynomial. The order k in T_k is selected by the modified Schwarz rule, given by

$$S2 = \min\left\{k : 1 \le k \le d(n),\; T_k - k\log n \ge T_j - j\log n,\; j = 1, \ldots, d(n)\right\},$$

where d(n) is a sequence of numbers tending to infinity as n → ∞. The test statistic is:

$$TS2 = \sum_{j=1}^{S2}\left\{\frac{1}{n}\sum_{i=1}^{n} b_j\!\left(\frac{R_i - 1/2}{n}\right) b_j\!\left(\frac{S_i - 1/2}{n}\right)\right\}^2. \tag{14}$$

It is reported in (Kallenberg and Ledwina, 1999) that there is almost no change in the critical value of TS2 for d(n) > 2; by default, we choose d(n) = 4. The decision rule to reject the null hypothesis in (2) is

$$TS2 > C_\alpha, \tag{15}$$

where C_α is an α-level test threshold.

In (Kallenberg and Ledwina, 1999), the TS2 test statistic in (14) is called the diagonal test statistic, and the other test statistic, V, is called the mixed statistic due to the fact that it involves mixed products. To derive the V statistic, we consider only the case d(n) = 2 and the sets of indexes {(1,1)} and {(1,1), (i,j)}, where (i,j) ≠ (1,1). Let Λ be one of these sets and define

$$T_\Lambda = \sum_{(r,s)\in\Lambda} V(r,s), \qquad V(r,s) = \left\{\frac{1}{n}\sum_{i=1}^{n} b_r\!\left(\frac{R_i - 1/2}{n}\right) b_s\!\left(\frac{S_i - 1/2}{n}\right)\right\}^2.$$

Letting |Λ| denote the cardinality of Λ, we now search for a model, say Λ(max), for which T_Λ − |Λ| log n is maximized. Having obtained Λ(max), the second test statistic of (Kallenberg and Ledwina, 1999) is V = T_{Λ(max)}. It can easily be seen that the test statistic V can be rewritten (for d(n) = 2) in the simple form:

$$V = \begin{cases} V(1,1), & \text{if } \max\{V(1,2), V(2,1), V(2,2)\} < \log n,\\ V(1,1) + \max\{V(1,2), V(2,1), V(2,2)\}, & \text{otherwise.} \end{cases}$$

A size α rejection rule for the test is given by

$$V > C_\alpha, \tag{16}$$

where C_α is a size α critical value.

This test is very powerful when the data are from an exponential family; however, the statistic is rather complicated, and a required assumption is that the data distributions belong to semiparametrically defined families.
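Both Kallenberg statistics are available through the package interface; a minimal sketch with illustrative data follows.

# Kallenberg data-driven tests TS2 and V via testforDEP.
library(testforDEP)
set.seed(6)
x <- rexp(60)
y <- x * rexp(60)               # multiplicative (random-effect style) dependence
testforDEP(x, y, test = "TS2", p.opt = "MC", num.MC = 1000)
testforDEP(x, y, test = "V",   p.opt = "MC", num.MC = 1000)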

Maximal information coefficient

The maximal information coefficient (MIC), proposed in (Reshef et al., 2011), is a measure of the strength of the linear or non-linear association between two variables X and Y. The MIC uses binning as a means to apply mutual information to continuous random variables. Defining D as a finite set of ordered pairs, we can partition the x-values of D into x bins and the y-values of D into y bins, allowing empty bins. We call such a partition an x-by-y grid, denoted G. For a fixed D, different grids G result in different distributions D|G. The MIC of a set D of two-variable data with sample size n and grid size less than B(n) is defined as:

$$\mathrm{MIC}(D) = \max_{xy < B(n)}\left\{M(D)_{x,y}\right\}, \tag{17}$$

where x and y denote the numbers of bins and ω(1) ≤ B(n) ≤ O(n^{1−ε}) for some 0 < ε < 1. Note M(D)_{x,y} is called the characteristic matrix of a set D of two-variable data and is defined as:

$$M(D)_{x,y} = \frac{I^*(D,x,y)}{\log\min\{x,y\}},$$

where, for a finite set D ⊂ ℝ² and positive integers x, y,

$$I^*(D,x,y) = \max_G I(D|G),$$

and the mutual information of the two discretized random variables is

$$I(D|G) = \sum_{y\in Y}\sum_{x\in X}\hat F(x,y)\log\left(\frac{\hat F(x,y)}{\hat F_X(x)\hat F_Y(y)}\right).$$

The MIC statistic in (17) is computed using the R package minerva (see Albanese et al., 2013). To our knowledge, no hypothesis test based on MIC to detect general structures of dependence has been published. We use a Monte Carlo cutoff approach, similar to that used for the other exact tests in this package, to develop an exact test based on the MIC statistic. A size α MIC test can be given by

$$\mathrm{MIC}(D) > C_\alpha, \tag{18}$$

where C_α is a size α critical value. To evaluate our approach, we simulated 5000 Monte Carlo sets of independent random variables X and Y of size n from the standard normal distribution, the exponential distribution and the reverse of the standard normal distribution. The cutoff results are shown in Table 2. Regardless of the data distribution, the cutoff values for a given sample size are very similar.

It has been noted in (Simon and Tibshirani, 2014) that the MIC approach has serious power deficiencies and, when used for large-scale exploratory analysis, will result in too many false positives.

Table 2: Cutoff values from 5000 Monte Carlo simulations for a normal distribution (N), exponential distribution (E) and reverse normal distribution (RN).
n=10 n=35 n=75 n=100
N | E | RN N | E | RN N | E | RN N | E | RN
α=0.01 0.61|0.61|0.61 0.50|0.52|0.50 0.38|0.37|0.38 0.33|0.33|0.33
α=0.05 0.61|0.61|0.61 0.43|0.43|0.43 0.33|0.33|0.33 0.30|0.30|0.30
α=0.1 0.61|0.61|0.61 0.41|0.40|0.41 0.31|0.31|0.31 0.28|0.28|0.28
α=0.5 0.24|0.24|0.24 0.31|0.30|0.31 0.25|0.25|0.25 0.23|0.24|0.23
α=0.75 0.24|0.24|0.24 0.27|0.26|0.27 0.23|0.23|0.23 0.21|0.21|0.21
α=0.9 0.12|0.12|0.12 0.24|0.23|0.24 0.21|0.21|0.21 0.20|0.20|0.20
α=0.95 0.12|0.12|0.12 0.23|0.22|0.23 0.20|0.20|0.20 0.19|0.19|0.19
α=0.99 0.11|0.11|0.11 0.20|0.20|0.20 0.18|0.18|0.18 0.17|0.17|0.17
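A minimal sketch of the Monte Carlo cutoff idea behind (18), using minerva to compute MIC; 5000 null replicates would match Table 2, but 500 are used here for speed.

# MIC statistic and a simulated null cutoff, in the spirit of Table 2.
library(minerva)
set.seed(7)
n <- 75
x <- rnorm(n)
y <- sin(4 * pi * x) + rnorm(n, sd = 0.2)       # non-linear dependence
mic_obs  <- mine(x, y)$MIC                      # observed MIC
mic_null <- replicate(500, mine(rnorm(n), rnorm(n))$MIC)
quantile(mic_null, 0.95)                        # compare with the alpha = 0.05 cutoff
mean(mic_null >= mic_obs)                       # Monte Carlo p-value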

Empirical likelihood based test for independence

An empirical likelihood based test is proposed by (Einmahl and McKeague, 2003). The approach localizes the empirical likelihood with one or more 'time' variables implicit in the given null hypothesis and then constructs an omnibus test statistic by integrating the log-likelihood ratio over those variables. We first consider a null hypothesis H_0: F_X = F_0, where F_0 is a fully specified distribution function. We define the localized empirical likelihood ratio R(x) as:

$$R(x) = \frac{\sup\{L(\tilde F_X) : \tilde F_X(x) = F_0(x)\}}{\sup\{L(\tilde F_X)\}},$$

where F̃_X is an arbitrary distribution function and

$$L(\tilde F_X) = \prod_{i=1}^{n}\big(\tilde F_X(X_i) - \tilde F_X(X_i^-)\big).$$

The supremum in the denominator is achieved when F̃_X = F̂_X, the empirical distribution function. The supremum in the numerator is attained by putting mass F_0(x)/(n F̂_X(x)) on each observation up to and including x and mass (1 − F_0(x))/(n(1 − F̂_X(x))) on each observation beyond x (Einmahl and McKeague, 2003). This gives the log localized empirical likelihood ratio:

$$\log R(x) = n\hat F_X(x)\log\frac{F_0(x)}{\hat F_X(x)} + n\big(1-\hat F_X(x)\big)\log\frac{1-F_0(x)}{1-\hat F_X(x)}.$$

Note that 0·log(a/0) is defined as 0. For an independence test, the local likelihood ratio test statistic is:

$$R(x,y) = \frac{\sup\{L(\tilde F) : \tilde F(x,y) = \tilde F_X(x)\tilde F_Y(y)\}}{\sup\{L(\tilde F)\}},$$

for (x,y) ∈ ℝ², with L(F̃) = ∏_{i=1}^{n} P̃({(X_i, Y_i)}), where P̃ is the probability measure corresponding to F̃. The log local likelihood ratio test statistic is then:

$$\log R(x,y) = n\hat P(A_{11})\log\frac{\hat F_X(x)\hat F_Y(y)}{\hat P(A_{11})} + n\hat P(A_{12})\log\frac{\hat F_X(x)\big(1-\hat F_Y(y)\big)}{\hat P(A_{12})} + n\hat P(A_{21})\log\frac{\big(1-\hat F_X(x)\big)\hat F_Y(y)}{\hat P(A_{21})} + n\hat P(A_{22})\log\frac{\big(1-\hat F_X(x)\big)\big(1-\hat F_Y(y)\big)}{\hat P(A_{22})},$$

where P̂ is the empirical measure of the joint probability and

$$A_{11} = (-\infty,x]\times(-\infty,y],\quad A_{12} = (-\infty,x]\times(y,\infty),\quad A_{21} = (x,\infty)\times(-\infty,y],\quad A_{22} = (x,\infty)\times(y,\infty).$$

The test statistic T_el is defined as:

$$T_{el} = -2\iint \log R(x,y)\, d\hat F_X(x)\, d\hat F_Y(y),$$

where T_el is distribution-free. We reject H_0 when T_el > C_α, where C_α is a size α critical value.

This test is characterized by moments and is very powerful for many nonparametric alternatives. However, when the dependency is linear with normally distributed data, this test is outperformed by Pearson and Spearman tests.
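A minimal sketch calling the empirical likelihood test through the package interface, with illustrative data:

# Empirical likelihood based test T_el via testforDEP.
library(testforDEP)
set.seed(8)
x <- rnorm(70)
y <- x^2 + rnorm(70)            # non-monotone dependence
testforDEP(x, y, test = "EL", p.opt = "MC", num.MC = 1000)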

Kendall plot and area under Kendall plot

The Kendall plot, also called the K-plot, proposed in (Genest and Boies, 2003), is a visualization of dependence in a bivariate random sample. Similar to a chi-plot, which detects association in random samples from continuous bivariate distributions, the K-plot adapts the concept of a probability plot to detect dependence.

The horizontal axis in a K-plot, W_{i:n}, is the expectation of the ith order statistic in a random sample of size n from the distribution K_0 of the H_i (defined below) under the null hypothesis of independence. W_{i:n} can be computed as:

$$W_{i:n} = n\binom{n-1}{i-1}\int_0^1 w\,\{K_0(w)\}^{i-1}\{1-K_0(w)\}^{n-i}\, dK_0(w),$$

for all 1 ≤ i ≤ n, where K_0 is given by:

$$K_0(w) = w - w\log(w), \qquad 0 \le w \le 1.$$

The vertical axis is the sorted H_i. Unlike W_{i:n}, H_i is computed from the information provided by the data.

Note H_i is defined as:

$$H_i = \frac{1}{n-1}\sum_{j \ne i}\mathrm{Ind}\big(X_j \le X_i,\, Y_j \le Y_i\big),$$

where Ind is the indicator function.
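A minimal from-scratch sketch of the H_i computation (assuming no ties); the helper H_values below is hypothetical and not part of the package, which computes these quantities internally in AUK():

# Leave-one-out proportions H_i for the K-plot's vertical axis.
H_values <- function(x, y) {
  n <- length(x)
  sapply(seq_len(n), function(i)
    (sum(x <= x[i] & y <= y[i]) - 1) / (n - 1))  # -1 excludes the point itself
}
set.seed(9)
x <- rnorm(30)
y <- x + rnorm(30)
sort(H_values(x, y))             # values plotted against W_{i:n}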

Let 0 ≤ p ≤ 1. Some properties of the K-plot are:

  1. when Y = X, all points fall on the curve K^{-1}(p),

  2. when Y = −X, all points fall on a horizontal line,

  3. when X and Y are independent, the graph is linear.

The area under the Kendall plot (AUK) is proposed in (Vexler, Chen and Hutson, 2017) as an index related to independence. It applies area-under-the-curve analysis and computes the area under the Kendall plot. Some properties are listed below:

  1. when Y = X, AUK=0.75,

  2. when Y = −X, AUK = 0,

  3. when X and Y are independent, AUK=0.

Note AUK = 0 does not imply independence. The AUK is similar to the area under the curve (AUC) for a receiver operating characteristic (ROC) curve. In an ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate for different cut-off points. A large AUC (near 1) suggests a good prediction model, whereas a small AUC (near 0.5) suggests a model that is no better than chance. Similarly, a large AUK suggests dependence, while an AUK near 0 may be suggestive of independence and should be further examined using one of the proposed tests. Further, more complicated correlation patterns, such as a mixture of positive and negative correlations as in Example 2 (see Data analysis examples), can be detected with the K-plot.

4 What is the package testforDEP?

The R package testforDEP includes two functions, testforDEP() and AUK(), and one data frame, LSAT.

testforDEP

The function testforDEP() is the interface for the tests. It takes two vectors X, Y as inputs and returns an S4 object containing:

  1. test statistic,

  2. p-value,

  3. bootstrap confidence intervals.

The interface of testforDEP() is:

testforDEP(x, y, data, test = c("PEARSON", "KENDALL", "SPEARMAN",
     "VEXLER", "TS2", "V", "MIC", "HOEFFD", "EL"),
     p.opt = c("dist", "MC", "table"), num.MC = 10000,
     BS.CI = 0, rm.na = FALSE, set.seed = FALSE)

where x and y are two equal-length numeric vectors of input data. The parameter data is an alternative input that takes a data frame with two columns representing X and Y; when x or y is not provided, data is taken as input. The parameter test specifies the hypothesis test to implement; the statistic returned for each option is the one defined in the corresponding section above (for example, log(VT_n) for "VEXLER", D̂ for "HOEFFD" and T_el for "EL").

Parameter p.opt is the option for computing p-values: p-values can be computed from the asymptotic null distribution of the test statistic (applicable to Pearson, Kendall and Spearman only), by a Monte Carlo exact method (applicable to all tests) or from pre-stored simulated tables based on the exact method. By default, p.opt = "MC". The parameter num.MC gives the number of Monte Carlo simulations and is used only when p.opt = "MC"; when p.opt = "dist" or p.opt = "table", num.MC is ignored. To balance accuracy and computation time, num.MC must be in [100, 10000], with num.MC = 10000 as the default. The parameter BS.CI specifies the α for 1 − α bootstrap confidence intervals; when BS.CI = 0, the confidence intervals are not computed. Parameter rm.na is a flag for removing rows with missing data. Parameter set.seed is a flag for setting the random number generator seed.
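A minimal sketch of typical calls illustrating these options (simulated data; reduced num.MC for speed):

# p-value options of testforDEP.
library(testforDEP)
set.seed(10)
x <- rnorm(50)
y <- x + rnorm(50)
testforDEP(x, y, test = "PEARSON",  p.opt = "dist")              # asymptotic
testforDEP(x, y, test = "HOEFFD",   p.opt = "MC", num.MC = 1000) # Monte Carlo
testforDEP(x, y, test = "SPEARMAN", BS.CI = 0.05)                # 95% bootstrap CI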

AUK

The function AUK() provides an interface for Kendall plots and AUK. It takes two vectors X, Y and returns a list containing:

  1. AUK,

  2. Wi:n,

  3. sorted Hi,

  4. bootstrap confidence intervals.

The interface of AUK() is:

AUK(x, y, plot = FALSE, main = "Kendall plot", Auxiliary.line = TRUE,
    BS.CI = 0, set.seed = FALSE)

where x and y are two equal-length numeric vectors of input data. Parameter plot is a flag for drawing the Kendall plot. Parameter main determines the title of the plot; if plot = FALSE, main is ignored. Parameter Auxiliary.line is a flag for drawing an auxiliary line. Parameter BS.CI specifies the α for 1 − α bootstrap confidence intervals. The bootstrap confidence intervals are produced using a non-parametric bootstrap procedure, with normal, pivotal and percentile based confidence intervals obtained as described in (Efron and Tibshirani, 1994). The bootstrap confidence intervals are on the scale of the test statistics returned from each test in testforDEP. When BS.CI = 0, confidence intervals are not computed. Parameter set.seed is a flag for setting the seed.
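A minimal usage sketch (the list element name AUK is assumed from the return description above):

# Kendall plot and AUK for positively dependent data.
library(testforDEP)
set.seed(11)
x <- rnorm(100)
y <- x + rnorm(100)
res <- AUK(x, y, plot = TRUE, main = "Kendall plot", BS.CI = 0.05)
res$AUK                          # area under the Kendall plot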

5 Data analysis examples

Here we present two data analysis examples to demonstrate the utility of the testforDEP package. In the first example, the data are average law school admission test (LSAT) and grade point average (GPA) scores from 82 law schools (details described in Efron and Tibshirani, 1994). The data frame LSAT is available in the testforDEP package. The aim is to determine the dependence between LSAT and GPA. Table 3 gives the data, and a scatter plot is shown in Figure 1 (Top Left).

Table 3: LSAT data from (Efron and Tibshirani, 1994).
School LSAT GPA School LSAT GPA School LSAT GPA
1 622 3.23 28 632 3.29 56 641 3.28
2 542 2.83 29 587 3.16 57 512 3.01
3 579 3.24 30 581 3.17 58 631 3.21
4 653 3.12 31 605 3.13 59 597 3.32
5 606 3.09 32 704 3.36 60 621 3.24
6 576 3.39 33 477 2.57 61 617 3.03
7 620 3.10 34 591 3.02 62 637 3.33
8 615 3.40 35 578 3.03 63 572 3.08
9 553 2.97 36 572 2.88 64 610 3.13
10 607 2.91 37 615 3.37 65 562 3.01
11 558 3.11 38 606 3.20 66 635 3.30
12 596 3.24 39 603 3.23 67 614 3.15
13 635 3.30 40 535 2.98 68 546 2.82
14 581 3.22 41 595 3.11 69 598 3.20
15 661 3.43 42 575 2.92 70 666 3.44
16 547 2.91 43 573 2.85 71 570 3.01
17 599 3.23 44 644 3.38 72 570 2.92
18 646 3.47 45 545 2.76 73 605 3.45
19 622 3.15 46 645 3.27 74 565 3.15
20 611 3.33 47 651 3.36 75 686 3.50
21 546 2.99 48 562 3.19 76 608 3.16
22 614 3.19 49 609 3.17 77 595 3.19
23 628 3.03 50 555 3.00 78 590 3.15
24 575 3.01 51 586 3.11 79 558 2.81
25 662 3.39 52 580 3.07 80 611 3.16
26 627 3.41 53 594 2.96 81 564 3.02
27 608 3.04 54 594 3.05 82 575 2.74
55 560 2.93

Figure 1 (Top Left) suggests a positive linear relationship between LSAT and GPA. To confirm this, we draw the Kendall plot and compute the AUK (see Figure 1 (Top Right)). Figure 1 (Top Right) shows a curve above the diagonal and an AUK of 0.665, both of which suggest a positive dependence.

Now consider the dependence tests provided in package testforDEP. Table 4 shows the test statistics and p-values. All tests, classical and modern, suggest dependence between LSAT and GPA. From this analysis, it is clear that LSAT and GPA variables are dependent.

To further examine these data and explore the comparable power of the modern methods, we also examine a subset of 15 randomly chosen points from the full dataset, as shown in Figure 1 (Bottom Left). An analysis of this subset reveals significant p-values for the empirical likelihood, density-based empirical likelihood and Kallenberg tests (see Table 4). The small p-values, especially for the density-based empirical likelihood method, indicate the power of modern methods to detect nonlinear dependence in small sample sizes. We further explore this point in the next example.
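A minimal sketch reproducing the spirit of this analysis with the bundled data (the column names LSAT and GPA are assumed from the data description):

# Example 1: dependence tests and AUK for the LSAT/GPA data.
library(testforDEP)
data(LSAT)
x <- LSAT$LSAT
y <- LSAT$GPA
testforDEP(x, y, test = "MIC", p.opt = "MC", num.MC = 1000)
AUK(x, y, plot = TRUE)$AUK       # approximately 0.665 for the full data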

Table 4: Example 1: Test results for the full LSAT/GPA data and for a randomly chosen subset of 15 data points.
Full Data Subset Data
test statistic p-value statistic p-value
Pearson 10.46 < 0.0001 1.77 0.1074
Kendall 7.46 < 0.0001 69.00 0.0990
log(VTn) 65.70 < 0.0001 12.05 0.0067
TS2 80.76 < 0.0001 3.16 0.2003
V 69.04 < 0.0001 8.51 0.0146
MIC 0.53 < 0.0001 0.38 0.0993
Hoeffding 0.22 < 0.0001 0.08 0.0658
Tel 13.64 < 0.0001 2.15 0.0238
Table 5: Example 2: Dependence test results for the microarray study with genes Pla2g7 and Pcp2, displayed in Figure 2 and published in (Rohrer et al., 2004) with re-examination in (Chen et al., 2010).
test statistic p-value
Pearson -1.015 0.3156
Kendall -2.321 0.0196
Spearman -2.900 0.0050
log(VTn) 140.554 < 0.0001
TS2 13.347 0.0042
V 82.712 < 0.0001
MIC 0.666 < 0.0001
Hoeffding 0.085 < 0.0001
Tel 12.254 < 0.0001
Figure 1: Top Left: Scatter plot based on data in Table 3. Top Right: Kendall plot of the full dataset. Bottom Left: A scatter plot of a subset of the LSAT/GPA data. Bottom Right: A Kendall plot of the subset of data.

In the second example, we demonstrate the power of the modern methods to detect non-linear dependence. This example was originally published in (Rohrer et al., 2004) and re-examined in (Chen et al., 2010). It involves analyzing microarray expression data collected to identify critical genes in photoreceptor degeneration. The data in Figure 2 are based on two genes (Pcp2 and Pla2g7): each point represents another gene, with its x-coordinate the Euclidean distance of that gene's expression profile from Pla2g7 and its y-coordinate the Euclidean distance from Pcp2. The Euclidean distances are calculated from expression profiles between wild type (wt) and rod degeneration (rd) mice. The correlation shown in Figure 2 measures the relationship between how the Pla2g7 gene interacts with all the other genes versus how the Pcp2 gene interacts with all the other genes. From Figure 2 we see the genes are negatively correlated when their transformed levels are both less than 5; otherwise, these genes are positively correlated. We proceed to analyze the data using the testforDEP package. The results are presented in Table 5. From this analysis, we see a large gain in power using the modern methods (log(VT_n), TS2, V, MIC, Hoeffding, T_el) compared to the traditional methods (Pearson, Kendall, Spearman). Taken together, these examples demonstrate the power advantage of using modern methods to detect nonlinear dependence while still maintaining comparable power to traditional methods in detecting linear dependence.

Figure 2: Left: Scatter plot of the transformed expression levels of the Pla2g7 and Pcp2 genes studied in (Chen et al., 2010). Right: Kendall plot of the dataset highlighting the negative correlation (points below the y = x line) for smaller expression values and positive dependence (points above the y = x line) for larger values.

6 Conclusion

The R package testforDEP is a new package that allows users, for the first time, to analyze complex structures of dependence via modern test statistics and visualization tools. Moreover, a novel exact method based on the MIC measurement has been proposed in the package testforDEP. Future work is necessary to further develop a formal testing structure based on the MIC statistic. We note that users of this package should carefully consider the strengths and weaknesses of the proposed tests, as outlined in each section, when deciding which test to apply. Further, a topic for future work is to develop a complete set of consistent criteria to determine which test is most appropriate for a specific setting. We note that extensive simulations comparing each of the tests are provided in (Miecznikowski et al., 2017), which may help guide users in choosing among the tests. Ultimately, we believe that the package testforDEP will help investigators identify dependence using the suite of modern tests implemented to detect dependency.

7 Acknowledgments

Albert Vexler’s efforts were supported by the National Institutes of Health (NIH) grant 1G13LM012241-01. We are grateful to the Editor, Associate Editor and the two referees for helpful comments that clearly improved this manuscript.

CRAN packages used

testforDEP, Hmisc, minerva

CRAN Task Views implied by cited packages

Bayesian, ClinicalTrials, Databases, Econometrics, MissingData, ReproducibleResearch


References

    D. Albanese, M. Filosi, R. Visintainer, S. Riccadonna, G. Jurman and C. Furlanello. Minerva and minepy: A C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics, 29(3): 407–408, 2013. URL https://doi.org/10.1093/bioinformatics/bts707.
    Y. A. Chen, J. S. Almeida, A. J. Richards, P. Müller, R. J. Carroll and B. Rohrer. A nonparametric approach to detect nonlinear correlation in gene expression. Journal of Computational and Graphical Statistics, 19(3): 552–568, 2010. URL https://doi.org/10.1198/jcgs.2010.08160.
    H. A. David and H. N. Nagaraja. Order statistics. John Wiley & Sons, 1970.
    B. Efron and R. J. Tibshirani. An introduction to the bootstrap. CRC Press, 1994.
    J. H. Einmahl and I. W. McKeague. Empirical likelihood based hypothesis testing. Bernoulli, 267–290, 2003. URL https://doi.org/10.3150/bj/1068128978.
C. Genest and J.-C. Boies. Detecting dependence with Kendall plots. The American Statistician, 57(4): 275–284, 2003. URL https://doi.org/10.1198/0003130032431.
F. E. Harrell Jr and M. C. Dupont. The Hmisc package. R package version 3.0-12, 2006.
    F. E. Harrell Jr, with contributions from Charles Dupont and many others. Hmisc: Harrell miscellaneous. 2017. R package version 4.0-3.
    J. Hauke and T. Kossowski. Comparison of values of Pearson’s and Spearman’s correlation coefficients on the same sets of data. Quaestiones Geographicae, 30(2): 87–93, 2011. URL https://doi.org/10.2478/v10117-011-0021-1.
    W. Hoeffding. A non-parametric test of independence. The Annals of Mathematical Statistics, 546–557, 1948. URL https://doi.org/10.1214/aoms/1177730150.
    W. C. Kallenberg and T. Ledwina. Data-driven rank tests for independence. Journal of the American Statistical Association, 94(445): 285–301, 1999. URL https://doi.org/10.1080/01621459.1999.10473844.
    M. G. Kendall. A new measure of rank correlation. Biometrika, 81–93, 1938. URL https://doi.org/10.1093/biomet/30.1-2.81.
J. Miecznikowski, E. Hsu, Y. Chen and A. Vexler. testforDEP: An R package for distribution-free tests and visualization tools for independence. SUNY University at Buffalo Biostatistics Technical Report 1701, 2017.
    K. Pearson. Notes on the history of correlation. Biometrika, 25–45, 1920. URL https://doi.org/10.1093/biomet/13.1.25.
    D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher and P. C. Sabeti. Detecting novel associations in large data sets. Science, 334(6062): 1518–1524, 2011. URL https://doi.org/10.1126/science.1205438.
    B. Rohrer, F. R. Pinto, K. E. Hulse, H. R. Lohr, L. Zhang and J. S. Almeida. Multidestructive pathways triggered in photoreceptor cell death of the rd mouse as determined through gene expression profiling. Journal of Biological Chemistry, 279(40): 41903–41910, 2004. URL https://doi.org/10.1074/jbc.M405085200.
    N. Simon and R. Tibshirani. Comment on "Detecting Novel Associations in Large Data Sets" by Reshef et al, Science Dec 16, 2011. arXiv Preprint arXiv:1401.7645, 2014.
    C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1): 72–101, 1904. URL https://doi.org/10.2307/1412159.
    A. Vexler, X. Chen and A. D. Hutson. Dependence and independence: Structure and inference. Statistical Methods in Medical Research, 26(5): 2114–2132, 2017. URL https://doi.org/10.1177/0962280215594198.
    A. Vexler, W.-M. Tsai and A. D. Hutson. A simple density-based empirical likelihood ratio test for independence. The American Statistician, 68(3): 158–169, 2014. URL https://doi.org/10.1080/00031305.2014.901922.
    Y. Wang, Y. Li, H. Cao, M. Xiong, Y. Y. Shugart and L. Jin. Efficient test for nonlinear dependence of two continuous variables. BMC Bioinformatics, 16(1): 1, 2015. URL https://doi.org/10.1186/s12859-015-0697-7.
    L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer-Verlag, 2013.

    Reuse

    Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

    Citation

    For attribution, please cite this work as

    Miecznikowski, et al., "testforDEP: An R Package for Modern Distribution-free Tests and Visualization Tools for Independence", The R Journal, 2018

    BibTeX citation

    @article{RJ-2018-057,
      author = {Miecznikowski, Jeffrey and Hsu, En-shuo and Chen, Yanhua and Vexler, Albert},
      title = {testforDEP: An R Package for Modern Distribution-free Tests and Visualization Tools for Independence},
      journal = {The R Journal},
      year = {2018},
      note = {https://rjournal.github.io/},
      volume = {10},
      issue = {2},
      issn = {2073-4859},
      pages = {282-295}
    }