Nonparametric Independence Tests and k-sample Tests for Large Sample Sizes Using Package HHG
Abstract:
Nonparametric tests of independence and K-sample tests are ubiquitous in modern applications, but they are typically computationally expensive. We present a family of nonparametric tests that are computationally efficient and powerful for detecting any type of dependence between a pair of univariate random variables. The computational complexity of the suggested tests is sub-quadratic in sample size, allowing calculation of test statistics for millions of observations. We survey both the algorithms and the HHG package in which they are implemented, with usage examples showing the implementation of the proposed tests for both the independence case and the K-sample problem. The tests are compared to existing nonparametric tests via several simulation studies comparing both runtime and power. Special focus is given to the design of the data structures used in the implementation of the tests. These data structures can be useful for developers of nonparametric distribution-free tests.
A common question that arises in the analysis of data is whether two
random variables, X and Y, are independent. The null hypothesis is

H_0: F_{XY}(x, y) = F_X(x) F_Y(y) for all (x, y),     (1)

where F_X and F_Y are the marginal cumulative distribution functions
of X and Y, and F_{XY} is the joint cumulative distribution
function. The case where Y is categorical with K categories and X is continuous is the
K-sample problem. An omnibus consistent test will reject the null
hypothesis in (1) for any dependence between X and Y,
with probability increasing to one as the sample size tends to infinity.
In recent years, there has been great interest in developing tests of
independence that are able to identify complex dependencies based on N
independent observations from F_{XY}. For univariate random variables,
the first omnibus consistent test was based on summation of a score over
all 2 x 2 partitions of the sample space, where every data
point serves as a partition point (Hoeffding 1948). This test is
available via the function hoeffd from package
Hmisc (Harrell Jr et al. 2018). Another classic approach is based on the measure of
mutual information following partitioning of the data into a
2-dimensional grid (Paninski 2003). This approach is taken in
the R packages
minet (Meyer et al. 2008; Meyer et al. 2017),
infotheo (Meyer 2014), and
entropy (Hausser and Strimmer 2009, 2014), with various extensions to the
partitioning schemes used.
Recently, several nonparametric omnibus consistent tests have been
suggested that have computational complexity at least quadratic in
sample size. Reshef et al. (2011), with CRAN package
minerva (Albanese et al. 2013; Filosi et al. 2017), suggested MIC, which is based
on the maximum of the penalized estimated mutual information over partitions.
Gretton et al. (2008), with CRAN package
dHSIC (Pfister and Peters 2017; Pfister et al. 2018), suggested HSIC, which is a kernel
test based on the empirical estimate of the Hilbert-Schmidt norm of the
cross-covariance operator. Székely et al. (2007, 2009),
with CRAN package
energy (Rizzo and Szekely 2017), suggested dCov, which is based on the joint empirical
characteristic function. Both kernel and characteristic function methods
may be implemented in a scenario where X or Y are multivariate.
Heller et al. (2016), with CRAN package
HHG, suggested tests
which aggregate, by maximization or by summation, the likelihood ratio
test (LRT) scores over all possible m x m partitions of the data, with
m - 1 partition points for each variable. The suggested tests aggregate over
all partitions of the data of size m x m (i.e., having m^2
cells) for a range of m values.
For the K-sample problem, Székely and Rizzo (2004) suggested a test based
on the joint empirical characteristic function, which they implemented in
package energy as well. Gretton et al. (2012a) suggested a family of
consistent two-sample tests based on kernels. The function kmmd from
package kernlab (Zeileis et al. 2004) implements this family of tests for several kernel
choices. Jiang et al. (2015), with CRAN package
dslice, suggested the
dynamic slicing test statistic, which aggregates by maximization the
penalized LRT score, with a penalty for fine partitions.
Heller et al. (2016), with CRAN package HHG, suggested tests
which aggregate, by maximization or by summation, the LRT scores over all
possible partitions of the data.
The potential gain in power as the sample size increases is, for the
state-of-the-art tests listed above, hindered by their computational
cost. In practice, many of the state-of-the-art tests cannot be applied
when sample sizes are in the thousands. The computational problem is
compounded when multiple tests are to be carried out in the same
analysis, e.g., when the aim is to detect all pairwise associations
among the variables in a dataset.
Modifications to some of the tests listed above can be used for large
sample sizes. For the HSIC test statistic, it was suggested to compute
the quadratic time HSIC test statistic for subsets of the data, and then
aggregate the HSIC test statistics towards the test statistic for the
null hypothesis in (1) (Gretton et al. 2012a,b). Approximate HSIC statistics,
among them the random Fourier feature (RFF) HSIC, have been
shown to have reduced computational complexity while enjoying power
comparable to the original HSIC statistic (Zhang et al. 2018). Other
computationally efficient ways to compute HSIC were suggested in
Jitkrittum et al. (2016b,a). These
computationally efficient adaptations of the HSIC test (such as RFF and FSIC) have
an intrinsic trade-off between power and computational complexity, governed
by a resolution parameter of the method. The user is able to ‘pay in
runtime’ for more power. A computationally efficient algorithm for
computing the univariate dCov test statistic in O(N log N) time was
developed in (Huo and Székely 2016). For the suggested test in
Gretton et al. (2012a), computationally efficient modifications have been
considered in (Zhao and Meng 2015) and (Chwialkowski et al. 2015). A
computationally efficient algorithm for computing the univariate energy
test statistic of Székely and Rizzo (2004) in O(N log N) time was developed
in Huang and Huo (2017). Jiang et al. (2015) suggested considering only a
subset of partition locations for large sample sizes for their dynamic
slicing test. In our present work, we suggest a similar modification for
the tests in (Heller et al. 2016).
This paper describes two main contributions. The first is to provide a
method and software for discovering dependence that has reasonable
computational time for any sample size. Specifically, we modify the
algorithms for the tests of Heller et al. (2016), which were shown to
have good power in complex settings, to aggregate over only a representative
subset of all partitions of the data, thus achieving a computational
complexity which is sub-quadratic in sample size. The suggested tests
have power competitive with state-of-the-art methods, but can be
computed at a fraction of the time. Second, we extend the algorithms for
the tests of Heller et al. (2016) to allow partitions with a different
number of partition points on each axis. The LRT scores of all partitions of
size m x l of the sample space are aggregated, where m - 1 and l - 1
are the number of partition points of the x and y variables,
respectively. This generalization does not increase the computational
complexity, yet it can result in better power for alternatives where the
optimal number of partition points on each axis is different.
The paper is organized as follows. We introduce the atom based test
statistics and detail our novel contributions in the following two
sections. Then, in “Usage examples”, we present the work flow of the
package (function calls and outputs). In “K-sample tests” we present
the computationally efficient tests for the K-sample problem and their
work flow. In “Simulations” we compare the novel tests to other
state-of-the-art tests in terms of power and runtime. Finally, in
“Discussion” we provide some final remarks. Other tests available in
HHG are detailed in Appendix B.
2 The atom based test statistics
Suppose we have N sample points, where each sample point is a pair (x_i, y_i).
We split the plane into m parts along the x axis and into l parts
along the y axis. (Note that for now, m and l are fixed; later we
will show how we choose the best m and l automatically so that the
user will not need to fix them.) We consider the set of all m x l
partitions of the data, where a split point is possible only once every
N_A ordered observations in each variable. The indivisible blocks of
observations are called atoms. We assume for simplicity that the sample size N
is an integer multiple of the number of atoms A, so that each atom contains
N_A = N/A observations. Figure
1 shows an example partition of the sample space
based on atoms.
Figure 1: A visualization of a single partition of the atoms plane.
Split points are allowed only on the dashed black grid lines, which mark
the boundaries between atoms. A single m x l partition is depicted with blue
line boundaries. The sample points are shown as black dots, with A atoms
dividing each axis. The cell coloured in blue, of width w atoms and height h
atoms, contains o observed sample points, whereas e = w h N_A^2 / N are
expected under H_0.
A cell is defined by four integers, marking its left, right, top and
bottom boundaries on an equidistant grid of points. A
cell with all four boundaries in the interior of the grid will be
considered a ‘center’ cell (cell type 1). A cell with either its top or
bottom boundary at the edge of the grid will be considered a ‘top’ or
‘bottom’ cell, respectively (cell type 2). A cell with either its left
or right boundary at the edge of the grid will be considered a ‘left’ or
‘right’ cell, respectively (cell type 3). A cell which has two boundaries
at the edge of the grid (e.g., ‘left’ and ‘top’) will be called a
‘corner’ cell (cell type 4). Let o denote the number of samples
observed in a cell, and e the expected number of samples when
H_0 is true. For a cell of width w atoms and height h atoms,
e = (w N_A)(h N_A)/N. Let C be the set of all cells and
C_t(w, h) the set of all cells of size w x h
atoms and type t, where t is in {1, 2, 3, 4}. We define the function
P_{m x l}(t, w, h), returning the number of m x l partitions a cell of
type t and size w x h participates in:

P_{m x l}(t, w, h) =
  C(A - w - 2, m - 3) · C(A - h - 2, l - 3)   if t = 1 (center),
  C(A - w - 2, m - 3) · C(A - h - 1, l - 2)   if t = 2 (top/bottom),
  C(A - w - 1, m - 2) · C(A - h - 2, l - 3)   if t = 3 (left/right),
  C(A - w - 1, m - 2) · C(A - h - 1, l - 2)   if t = 4 (corner),     (2)

where C(n, k) denotes the binomial coefficient, taken to be zero whenever n < 0, k < 0, or k > n.
Let Π_{m x l} be the set of all m x l
partitions of the plane, where a split point is possible only between
atoms. For a given partition size, the aggregated by sum
test statistic is

S^{m x l} = Σ_{P ∈ Π_{m x l}} Σ_{c ∈ C} I(c ∈ P) · o_c log(o_c / e_c)
          = Σ_{t=1}^{4} Σ_{w} Σ_{h} P_{m x l}(t, w, h) Σ_{c ∈ C_t(w,h)} o_c log(o_c / e_c),     (3)

where I(·) is the indicator function. The last equality in
(3) demonstrates that for computing S^{m x l}, we
can iterate over cells instead of partitions, thus achieving a
computational complexity of O(A^4), the order of the number of cells,
even though the number of possible partitions is
C(A - 1, m - 1) · C(A - 1, l - 1).
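To give a sense of the saving, the following short R computation (with arbitrarily chosen illustrative values of A, m and l) compares the number of possible partitions with the number of cells:

A <- 40; m <- 5; l <- 5
n_partitions <- choose(A - 1, m - 1) * choose(A - 1, l - 1)  # all m x l partitions of A atoms
n_cells      <- choose(A + 1, 2)^2   # a cell is defined by 2 x-boundaries and 2 y-boundaries
c(partitions = n_partitions, cells = n_cells)
# about 6.8e9 partitions versus 672,400 cells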
Since the optimal partition size is unknown, we propose
taking the minimum p-value over the plausible range of partition
sizes:

MinP = min { p_{m x l} : 2 ≤ m, l ≤ m.max },     (4)

where p_{m x l} is the p-value of the test statistic
S^{m x l}.
In Appendix A we present the full pseudo-code for the algorithm,
including the case when N is not a multiple of A. The pseudo-code also
shows how the whole set {S^{m x l} : 2 ≤ m, l ≤ m.max} is
computed at the same computational complexity as a single
S^{m x l}. The atom based test in (4) is consistent
as long as both the number of atoms and the number of observations per
atom increase to infinity with the sample size (for a
proof see Appendix C in Brill 2016).
3 The null distribution of S^{m x l} and MinP
In this section, we show how to tabulate the null distribution of our
proposed statistic MinP. This requires tabulating the null
distribution of each S^{m x l}. Fortunately, the test statistics in
(3)-(4) are based on the ranked observations
and therefore are distribution free. Consequently, the null
distributions of the S^{m x l} statistics can be
tabulated off-line (prior to seeing the data) in order to evaluate the
p-value of any test statistic that combines the individual
S^{m x l} statistics.
The tabulation of the null distribution of the S^{m x l} and MinP
test statistics is described in the schematic diagram in
Figure 2. The structure of the generated null
distributions and that of the stored tabulations differs.
While one generates the vector of S^{m x l} statistics from a
single sample, the package data structure for a null table is
constructed from sorted arrays of the marginal distributions. Given a null
table of B repetitions, one can compute all marginal p-values
for the S^{m x l} statistics in O(m.max^2 log B) time, once the
statistics themselves have been computed from the data. The MinP statistic
is then simply the minimum of all these p-values. An additional O(log B)
search is required for computing the true p-value of the achieved MinP
test statistic. Given the null table, calculating the MinP
statistic and its p-value therefore takes O(m.max^2 log B) time beyond the
computation of the S^{m x l} statistics themselves, which is negligible
since m.max is typically small.
Figure 2: Schematic for the computation of the p-value for the MinP
statistic. In step A, sample N pairs under the null hypothesis of
independence (e.g., by permuting the ranks of one variable), B times. In step B, compute all
S^{m x l} test statistics for each sample of N pairs, color coded so that we
have B rows and one column per partition size (m, l). In step C,
compute the within-column rank for each test statistic: the rank is in
the subscript for each test statistic, and its p-value is solely
determined by the rank and B. In step D, compute MinP for each
sample (row), with its rank in the subscript. In step E, each sorted
S^{m x l} column along with the sorted MinP column is stored individually for
fast access and computation of
p-values.
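The following toy sketch illustrates steps C to E on a generic matrix of null statistics; here null_stats stands in for the B x K matrix computed in step B (K columns, one per partition size) and is not an object produced by the package:

B <- 1000; K <- 9
null_stats  <- matrix(rnorm(B * K), B, K)   # stand-in for the S^{m x l} null statistics
sorted_cols <- apply(null_stats, 2, sort)   # steps C/E: sorted marginal null distributions

# marginal p-value of a statistic s via binary search in a sorted column
marginal_p <- function(s, sorted_col) {
  B <- length(sorted_col)
  (B - findInterval(s, sorted_col) + 1) / (B + 1)
}

# step D: the null distribution of MinP itself, one value per simulated sample (row)
null_minp <- sort(apply(null_stats, 1, function(row)
  min(mapply(marginal_p, row, asplit(sorted_cols, 2)))))

# p-value of MinP for a vector of observed statistics (small MinP is extreme)
obs_stats <- rnorm(K)                       # stand-in for statistics computed from data
obs_minp  <- min(mapply(marginal_p, obs_stats, asplit(sorted_cols, 2)))
(sum(null_minp <= obs_minp) + 1) / (B + 1)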
In practice, one does not need to maintain all marginal null
distributions at a fixed resolution. Only the lower p-values (high
S^{m x l} scores) are used for rejections. Thus, when one
simulates a large null table, the marginal ECDFs can be
maintained at 0.001 increments of the cumulative probability
distribution function for p-values bigger than some compression
threshold, and at maximum
resolution for lower p-values. Using these two
different resolutions, a large null table is
compressed to a far smaller number of stored values. This data structure makes
p-value computation via the null table simple and efficient, with null
table sizes remaining maintainable even for large values of B.
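A minimal sketch of such a two-resolution representation for one sorted null column (the 5% threshold below is an arbitrary illustrative choice, not the package default):

compress_column <- function(sorted_col, p_threshold = 0.05) {
  B      <- length(sorted_col)
  cutoff <- floor(B * (1 - p_threshold))
  list(
    body = quantile(sorted_col[1:cutoff],             # coarse part: 0.001 increments of the ECDF
                    probs = seq(0, 1, by = 0.001), names = FALSE),
    tail = sorted_col[(cutoff + 1):B]                 # full resolution for the smallest p-values
  )
}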
In addition, this data structure is utilized as a combination method for
the S^{m x l} statistics with a nominal false positive rate of α. Importantly,
one can utilize this efficient data structure with any set of statistics
and a general combination score which takes into account only marginal
p-values. For example, Heller et al. (2016) propose another type of
combination score for their tests, of the Fisher form
Σ_{m,l} -log(p_{m x l}). This combination
score makes use of this data structure as well.
4 Usage examples
Function calls and input arguments
The test procedure utilizes two function calls. The first function call
carries out the tabulation of the null distributions for both the
S^{m x l} statistics and the MinP combination statistic. The
second function call computes the MinP test statistic and its
p-value, given the look-up tables for the distributions of
S^{m x l} and MinP (see Table 1).
The arguments for the null distribution tabulation are the sample size,
the maximal partition size considered, and the number of atoms. These
are denoted by the parameters n, mmax, and nr.atoms, respectively.
The default parameters for the test are mmax = min(10,n) and
nr.atoms = min(40,n). In Section “Simulations”, we discuss these
defaults in terms of power, runtime, and the trade-off between the two.
Other parameters include the type of combination score used (MinP or
Fisher) and the type of data partition used (m x m or
m x l).
The arguments for computing the MinP test statistic via
Fast.independence.test are two vectors of size n holding the joint
observations of (X, Y), and a null table object produced by
the function Fast.independence.test.nulltable. All test parameters are
kept in the null table object.
Table 1: The novel atom-based tests in the HHG package.
Function Name
Description
Fast.independence.test.nulltable
Function creates null table objects for
the atoms based omnibus
distribution-free test of independence
between two univariate random
variables. Input arguments are the
number of atoms, m.max, and the type
of partitioning (all m x m or all m x l
partitions).
Fast.independence.test
Performs the atoms based
distribution-free test of independence
of two univariate random variables,
which is computationally efficient for
large data sets. Input arguments are
two numeric vectors of data, and
optionally a null table object (if a
null table has not been precomputed,
one is generated on the fly).
Output description
The output of the function Fast.independence.test.nulltable is a null
table data structure. This data structure contains the sorted marginal
distributions of the S^{m x l} statistics, the sorted distribution of
the MinP test score, and the test parameters, as in Figure
2.
The output of Fast.independence.test is an object of class
"UnivariateStatistic" containing the results of the marginal
S^{m x l} tests performed on the data, along with the results of
the MinP test. The fields m.stats and pvalues.of.single.m contain
the test statistics for the tests of fixed partition sizes and their
respective P-values, given by (3). The fields MinP and
MinP.pvalue contain the data adaptive test statistic and its P-value,
given by (4).
Example code
We begin by computing the look-up table of the null distributions.
This is done by calling the first function, with the sample size as a
parameter:
library(HHG)
n = 1200 # sample size, matching the example output shown below

# compute null table, using default m.max,
# number of atoms and number of permutations under the null.
nt = Fast.independence.test.nulltable(n)
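The vectors x and y passed to the test below can be any paired numeric vectors of length n. For a self-contained run, one could generate, for example, a noisy non-monotone relationship (an arbitrary illustrative choice, not prescribed by the package):

# illustrative data: a noisy sine dependence between x and y
x = rnorm(n)
y = sin(8 * x) + 0.3 * rnorm(n)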
The number of atoms for the procedure and the maximal partition size are
set to their default values. The object nt now holds a look-up table
which is specific for the selected test parameters and sample size. We
will use this object to carry out the test using
Fast.independence.test:
# carry out test, parameters set by null table passed.
res = Fast.independence.test(x, y, nt)

# print results and P-value. P-value given under entry 'MinP.pvalue'.
res
The last line prints the output for the test. The output begins by
describing the test parameters: test statistic chosen, partition sizes
considered, and number of atoms:
HHG univariate combined independence statistic
Statistics type combined: sum of ADP-EQP-ML on Likelihood Ratio scores.
Single m statistics are the sum of scores over All Derived Partitions (ADP) of the data.
Statistics are normalized by the number of possible partitions and sample size.
Minimum partition size: 2   Maximum partition size: 10
Sample size: 1200
Equipartition nr.atoms: 40
The output then prints the single partition size test statistics along with
their marginal p-values.
Finally, the output shows the MinP test statistic score along with its
p-value, which is the p-value for the test. The selected partition
size attaining the smallest p-value is also shown:
MinP Statistic - Test statistic is minimum of above single m pvalues: 0.01
Partition size with minimum p-value: 6X6
p-value for MinP test: 0.01
As stated above, the proposed method is distribution free. As such, the
look-up table generated can be used with any data of the same size, as
we show in the next example.
# generate data of size n:
x.2 = rnorm(n)
y.2 = x.2 + rnorm(n)

# carry out test using exactly the same null table as before.
res2 = Fast.independence.test(x.2, y.2, nt)
The main advantage of distribution free tests is that while standard
permutation based tests performed on M null hypotheses with B
permutations (in each test) require M(B + 1) independent calculations
of the test statistics, distribution free tests require only M test
statistics and B permutations in total. For example, for M = 1000
hypotheses and B = 1000 permutations, this amounts to 1,001,000 versus
2,000 test statistic computations. This makes
them ideal for scenarios where a large number of hypotheses are examined
simultaneously.
For large values of n and table size B, the computation of the
look-up table can still be cumbersome. The package vignette shows how
this process can be parallelized.
5 K-sample tests
A special case of the independence problem is the K-sample problem:
independence testing for a continuous X and a categorical Y, where Y
has K different categories. The null hypothesis can be
formulated as:

H_0: F_1 = F_2 = ... = F_K,

where F_k is the distribution of X in category k, for k = 1, ..., K.
We modify the test statistic in (Heller et al. 2016) to
consider only partitions at the atom level. For example, instead of
considering all possible partitions, we consider only partition points
that are found on an equidistant grid of points. Let p_m be the
p-value of the test statistic that aggregates, by summation or
maximization, all the LRT scores with partitions of size m of the real
line, where a split point is possible only every N_A points. The atom
based test statistic is

MinP = min { p_m : 2 ≤ m ≤ m.max }.
The code snippets below show how to analyze an example. The two vectors
X and Y are of size 1000. The entries in Y are the group labels:
500 zeros and 500 ones.
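For a self-contained run, such vectors can be generated, for example, as follows (an illustrative choice only, with a location shift between the two groups):

Y = c(rep(0, 500), rep(1, 500)) # group labels
X = rnorm(1000) + 0.5 * Y       # continuous measurements, shifted in the second group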
# generate null table
nt = hhg.univariate.ks.nulltable(c(500, 500),   # group structure
          variant = 'KSample-Equipartition',    # computationally efficient
                                                # variant of the K-sample test.
          mmax = 10,                            # m.max parameter
          nr.atoms = 30,                        # number of atoms
          nr.replicates = 10^4)                 # number of replicates

# run test
res = hhg.univariate.ks.combined.test(X, Y, nt)

# Shows Average LRT scores: AVG LRT score of
# m = 2,...,10 cell tables
res
As with the independence tests, when many K-sample tests with the same
group sizes are carried out, the same null table can be used. We
demonstrate this in the next example:
# generate sample with the same group structure -
# normal deviates with a shift between groups.
X2 = rnorm(length(Y), 0, 1) + 1 * Y

# perform test using the same null table.
res2 = hhg.univariate.ks.combined.test(X2, Y, nt)

# view results
res2
6 Simulations
We used simulations to assess the performance of the atom based tests in
terms of both run time and power. Full source code for reproducing the
simulation results, graphs and usage examples is found in the
supplementary material, and at the GitHub repository
(https://barakbri.github.io/HHG_large_sample_framework/).
All run times were measured using the CRAN package
rbenchmark (Kusnierczyk 2012). Run times were measured separately from power
estimation, as the simulations for power estimation were parallelized via
the doRNG package
(Gaujoux 2017). All run time experiments were done serially, without
parallelization.
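For reference, the runtime of a single test call can be measured along the following lines (a sketch only, with x, y and nt as in the usage example above):

library(rbenchmark)
# time 50 repetitions of the test, reporting only the elapsed time
benchmark(Fast.independence.test(x, y, nt),
          replications = 50,
          columns = c("test", "replications", "elapsed"))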
The test of independence
In order to assess the power of the test statistics along with
the actual run time required, we present four different scenarios of
dependence between univariate random variables.
Figure 3 shows the bivariate relationship along
with a representative noisy sample: two monotone relationships (left
panels), and two non-monotone relationships (right panels).
The presented methods were run with several choices for the number of atoms,
and with summation over all
m x m or all m x l partitions. This allows one to assess the effect of
the number of atoms on power and run time, along with the possible effect of summation over
tables with a different number of partition points on the two axes.
Reducing the number of atoms allows one to estimate where the
method breaks down in terms of power.
The parameter m.max was set to the package default,
except where it was constrained by the number of
atoms. Heller et al. (2016) and Brill (2016) show by
various simulation studies that the power of the test is quite robust to
the selection of m.max, since the test selects the optimal partition size adaptively;
the test considers all partitions of sizes 2 x 2 to
m.max x m.max and selects the best partition size in the sense that
its p-value is minimal.
The methods compared against in the simulation study are dCov from CRAN package
energy, MIC from CRAN package minerva, and HSIC from CRAN
package dHSIC. We note that the typically faster competitive methods
are somewhat degenerate variations of these methods and would have much
lower power than the originals.
Figure 4 shows the trade-off between runtime
and power. We see that as the number of atoms increases, the run time
increases, but power is almost the same for 30 atoms or more. We set the
default value for the number of atoms in the functions given in Table
1 to be the minimum between the sample size and 40,
promising the user a result in reasonable time regardless of sample
size. Even for a small number of atoms, the MinP test
power is similar to that of the MinP test with a high number of atoms. For the
smallest number of atoms considered, power may drop for
complex signals which require fine partitions of the data. For the
monotone settings considered, the method maintains competitive power
even for very few atoms.
Figure 5 shows the power as a function of
sample size. For the monotone settings, the power is highest
for dCov, with the atom based test a close second, similar to the best
power achieved by the competitors. For the Circles setting, power for the
m x m and m x l variants is similar. The setting is symmetric
in practice, and the m x m variant enjoys higher power since it
needs to account for fewer possible selections of partition size under
the MinP test statistic. Nevertheless, the loss of power is small when
considering all m x l partitions of the data. For the Sine
setting, power differs greatly between the m x m and m x l
variants. The setting is not symmetric in x and y. One can see that
the optimal partition of the plane for capturing the dependence requires
few partitions along one axis but many partitions along the other. See
Brill (2016) for thorough simulations of different types of
bivariate relationships: in the non-symmetric settings, considering
m x l partitions of the data leads to substantial power gain; in
symmetric settings, considering m x l partitions of the data leads
to little power loss over considering only m x m partitions.
This simulation study demonstrated that the presented tests have a power
advantage over competitors in settings where the underlying dependence
is complex, i.e., settings where in multiple regions the joint density
and the product of the marginal densities differ. For those settings, a
fine partition of the sample space is optimal. Competing tests have
tuning parameters, and the choice of tuning parameter can affect the
power of the test. Specifically, tuning parameters in competitor methods
include the degree of the norm in dCOV, the kernel bandwidth in
dHSIC, and the maximum partition size considered in MIC. The low power
achieved by alternative methods in the ‘Circles’ and ‘Sine’ settings
could be partially attributed to the use of the package default setting
for the tuning parameters. The MinP procedure does not have a tuning
parameter that can materially affect its performance, since the single
best partition size is chosen in a data adaptive manner. For many other
scenarios demonstrating this, see Heller et al. (2016) and
Brill (2016).
Figure 3: Four bivariate relationships (in red), along with a sample
of noisy observations in blue.

Figure 4: Power and logarithm of run time for a fixed sample size. Run times for
each method were computed for a full procedure, including tabulation of
the null distribution and parameter selection. Specifically, for dCov
and HSIC, the run time measured was that of a single run of the test
function with 1000 permutations (as computation of the kernel and distance
matrices is done only once for the original data); for MinP, the runtime
consists of 1000 computations for the null table and the test statistic
(computation of the S^{m x l} statistics under the null hypothesis, efficient
tabulation of the marginal distributions, and computation of the statistics
for the tested dataset, which were then used for computing the
p-value); for MIC, run times consist of 1001 computations of the
test statistic, for the dataset tested and for the distribution under the null.
For a pre-computed null table of size 1000, the running times of the two MinP
variants would be one thousandth of the time shown above, since these
two tests are distribution free. Power computation is based on 10000
datasets for the MinP procedure and 2000 datasets for alternative tests.
Error bars show 95% confidence
intervals.

Figure 5: Power versus sample size, in each of the scenarios of Figure
3. Based on 10000 datasets for the MinP
procedure, and 2000 datasets for alternative tests. Error bars show 95%
confidence intervals.
Figure 6 shows the run time as a function
of sample size and the number of atoms. In the left panel, the total time to carry
out a test with 1000 permutations is computed for four methods:
MinP (including null table construction), dCov, HSIC, and MIC (including null table
construction). We see that the part of the computation that depends only on the
number of atoms takes the largest portion of the computation time, being greater than
the part that depends on the sample size. As such, the MinP computation time
is roughly constant over all sample sizes considered. For the other tests considered,
computation time might be shorter for smaller samples, but their
computational complexity is quadratic (or above) in the sample size and hence more
computationally demanding than MinP for larger sample sizes. The right
panel presents the run time versus the number of atoms for the m x m and
m x l variants. The relationship is clearly linear on the log scale.
The linear fit shows that the computational complexity is indeed of
fourth order in the number of atoms. Overall, the algorithm has a
computational complexity which is of fourth order in the number of atoms
and sub-quadratic in the sample size.
Figure 6: A (Left): Runtime analysis on a log scale for the different
methods, over a range of sample sizes. B (Right): Runtime comparison on a log scale
versus the number of atoms, which varied between 100 and 300.
Computations were performed for the two variants of the test (black and red).
The linear fit verifies the known computational complexity, which is of fourth
order in the number of atoms. Time measurements were
performed with 100 and 50 repetitions for the left and right panels,
respectively.
The K-sample test
In order to assess the power of the K-sample test statistics, we present a
simulation study. The methods compared were the energy two-sample test,
given by eqdist.etest from package energy, and the kernel MMD
test, given by kmmd from kernlab.
We examine three settings in which the two groups are sampled from
Gaussian distributions or from mixtures of Gaussian distributions: the
shifted normal setting, in which the two groups are normal distributions
that differ by a location shift; the 2-component normal mixture setting,
in which the two groups are different two-component Gaussian mixtures;
and the 3-component normal mixture setting, in which the two groups are
different three-component Gaussian mixtures.
Table 2: A power comparison of the MinP test with a different
number of atoms, the energy test, and the MMD test. For the three data
generations, the power is given in rows 1-3, and the run time
in row 4. Datasets are samples with two equal-sized groups.
For each setting, we simulate 10000
datasets for the MinP procedure and 2000 datasets for alternative
tests, with 1000 permutations for each test. Run times are measured
over 100 repetitions.
                        10 Atoms   25 Atoms   50 Atoms   ENERGY     KMMD
Normal, Shift               0.65       0.70       0.70     0.72     0.53
Mixture, 2 Components       0.72       0.75       0.74     0.21     0.36
Mixture, 3 Components       0.86       0.94       0.94     0.09     0.07
Run time (seconds)          2.96       3.09       3.23     5.22   146.51
Table 2 provides the power results for the competitor
tests, along with those of the MinP test statistic for 10, 25, and 50 atoms. The MinP
test has excellent power in comparison with energy and MMD, and
its run time is lower. For additional simulations that include many
additional settings, see (Brill 2016). In general, the proposed
K-sample test performs best when the difference between the
distributions is complex, and specifically when the density plots of the
two distributions intersect multiple times; see (Heller et al. 2016)
for details.
7 Discussion
We presented computationally fast and powerful novel tests for detecting
complex dependencies among univariate random variables. These tests can
be used in the framework suggested in Heller and Heller (2016) in order to construct
computationally-fast, distribution-free multivariate tests for powerful
identification of complex multivariate relationships. Briefly, one may
reduce tests of independence between multivariate and to
univariate ones by several methods, such as choosing a reference point
in and and testing whether distances between observations and
reference points are associated in and , or by choosing a
direction and projecting and on it. Heller and Heller (2016) discuss methods
of aggregation over several reference points. If the univariate tests
utilized are computationally efficient for large sample sizes and
consistent, the resulting multivariate tests will also be consistent and
computationally efficient.
Using our approach for obtaining a look-up table for MinP, the null
distribution of any test statistic that combines individual p-values
from distribution-free tests can be efficiently tabulated. Steps A to E
depicted in Figure 2 can be performed with the
S^{m x l} columns replaced by other rank based test statistics,
and the minimum can be replaced by any p-value
combining function. For example, package coin
contains various rank based tests to detect shift or scale differences
across distributions. The null distribution tabulation we presented can
be useful for constructing a look-up table for a distribution-free
combined test for shift and scale.
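As a sketch of this idea, the following toy example tabulates the null distribution of a minimum p-value combination of two base-R rank tests (wilcox.test for shift and ansari.test for scale); the function names and parameters introduced here are purely illustrative and are not part of HHG or coin:

# tabulate the null distribution of min(p_shift, p_scale) for two groups of
# sizes n1 and n2 (steps A-D of Figure 2, with rank tests as the columns)
combined_shift_scale_null <- function(n1, n2, B = 1000) {
  null_minp <- replicate(B, {
    z  <- rnorm(n1 + n2)                 # ranks are distribution free for continuous data
    g1 <- z[1:n1]; g2 <- z[-(1:n1)]
    min(wilcox.test(g1, g2)$p.value,     # shift
        ansari.test(g1, g2)$p.value)     # scale
  })
  sort(null_minp)                        # step E: store the sorted combined null
}

# p-value of the combined test for observed samples x1 and x2
combined_shift_scale_test <- function(x1, x2, sorted_null) {
  obs <- min(wilcox.test(x1, x2)$p.value, ansari.test(x1, x2)$p.value)
  (sum(sorted_null <= obs) + 1) / (length(sorted_null) + 1)
}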
8 Appendix A
Let N be the number of observations, and A the number of atoms,
such that the size of an atom is given by the integer N_A = N/A. The
locations of the splits between atoms are thus given by the ranks
i · N_A for i = 1, ..., A - 1. If N is not a
complete multiple of A, we define the locations of the splits between
atoms so that they are spread as evenly as possible. Step
I (compute ECDF): The empirical CDF is first tabulated at all possible
split locations by executing Algorithm 1, which computes the matrix M of
cumulative counts: M(i, j) is the number of observations whose x rank and
y rank are at most the i-th and j-th split locations, respectively.
Algorithm 1: Compute ECDF at all inter-atoms split
points
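The algorithm floats are not reproduced in this rendering; a minimal R sketch of the computation described by Algorithm 1 might look as follows (assuming, for simplicity, that N is a multiple of A; this is an illustration, not the package's implementation):

# Compute the matrix M of cumulative counts at all inter-atom split points:
# M[i + 1, j + 1] = #{k : rank(x_k) <= i * N_A and rank(y_k) <= j * N_A}
ecdf_at_atoms <- function(x, y, A) {
  N       <- length(x)
  atom_sz <- N / A                                          # N_A, observations per atom
  ax <- ceiling(rank(x, ties.method = "first") / atom_sz)   # atom index along x
  ay <- ceiling(rank(y, ties.method = "first") / atom_sz)   # atom index along y
  counts <- matrix(0, A, A)
  for (k in seq_len(N)) counts[ax[k], ay[k]] <- counts[ax[k], ay[k]] + 1
  M <- matrix(0, A + 1, A + 1)
  M[-1, -1] <- t(apply(apply(counts, 2, cumsum), 1, cumsum))  # 2-d cumulative sums
  M
}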
The empirical cumulative distribution function at the split point given
by the i-th and j-th split locations is M(i, j)/N. For a cell with borders
(x_l, x_r, y_b, y_t), given in atom units, the
expected number of observations is given by:

e = N · ((x_r - x_l) N_A / N) · ((y_t - y_b) N_A / N).

Once the empirical CDF has been tabulated, the observed count of the
cell can be calculated in constant time:

o = M(x_r, y_t) - M(x_l, y_t) - M(x_r, y_b) + M(x_l, y_b).
Step II (Aggregate LRT scores of all cells of a given size): A cell of
the sample space is given
by its boundaries (x_l, x_r, y_b, y_t), in atom units. We define a cell to be a ‘top’ or
‘bottom’ cell if y_t = A or y_b = 0. A cell is considered ‘right’ or
‘left’ if x_r = A or x_l = 0. If a cell is in both categories, it is
called a ‘corner’ cell. If a cell is not ‘top’, ‘bottom’, ‘left’, or ‘right’, it
is called a ‘center’ cell. Let type(x_l, x_r, y_b, y_t) be a function taking
the borders of a cell and returning 1 for ‘center’ cells, 2 for
‘top’ or ‘bottom’, 3 for ‘left’ or ‘right’, and 4 for ‘corner’
cells.
We define the width of the cell to be w = x_r - x_l and the height
of the cell to be h = y_t - y_b. For each size and type, we sum up
the scores of all cells of that size and type.
Algorithm 2 computes the array T, which stores the sum of
scores of the form o log(o/e) for all cells of
type t, width w, and height h.
Algorithm 2: Aggregate contribution to statistic by cell
type and size
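A minimal R sketch of the aggregation described by Algorithm 2, continuing the sketch above (again an illustration, iterating naively over all O(A^4) cells rather than using the package's optimized implementation):

# Aggregate the per-cell scores o*log(o/e) into an array T indexed by
# cell type (1 center, 2 top/bottom, 3 left/right, 4 corner), width and height.
aggregate_cells <- function(M, N) {
  A <- nrow(M) - 1
  T_arr <- array(0, dim = c(4, A, A))
  for (xl in 0:(A - 1)) for (xr in (xl + 1):A)
    for (yb in 0:(A - 1)) for (yt in (yb + 1):A) {
      o <- M[xr + 1, yt + 1] - M[xl + 1, yt + 1] - M[xr + 1, yb + 1] + M[xl + 1, yb + 1]
      w <- xr - xl; h <- yt - yb
      e <- (w * (N / A)) * (h * (N / A)) / N            # expected count under H0
      score  <- if (o > 0) o * log(o / e) else 0
      x_edge <- (xl == 0) || (xr == A)
      y_edge <- (yb == 0) || (yt == A)
      type   <- 1 + 2 * x_edge + y_edge                 # matches the cell-type coding above
      T_arr[type, w, h] <- T_arr[type, w, h] + score
    }
  T_arr
}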
Step III (Computing S^{m x l}): Next, we aggregate over
the array T, over all cell sizes and types, to compute
S^{m x l} for all 2 ≤ m, l ≤ m.max.
The number of partitions in which a cell is found is given by the number
of possible ways to select additional partitioning locations around that
cell, denoted by (2). For example, a
‘center’ cell requires choices of m - 3 additional X partition points
and l - 3 additional Y partition points to fully define a
partition. Hence a ‘center’ cell of size w x h is found in
C(A - w - 2, m - 3) · C(A - h - 2, l - 3) partitions. Note
that the number of tables in which a cell is found does not depend on
its edge locations but only on its size and type. Thus, the contribution
to S^{m x l} of all center cells of size w x h atoms is
given by the sum of their scores,
multiplied by C(A - w - 2, m - 3) · C(A - h - 2, l - 3).
The differentiation by cell type is needed since a cell bordering the
edge of the sample space requires only one partition point on that axis,
thus allowing the selection of an extra partition point elsewhere.
The final step is given in Algorithm 3, which computes S^{m x l}
by summing over all the values of the array T with
weights given by (2).
Algorithm 3: Compute S^{m x l} for all m, l ∈ [2, m.max]
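A minimal R sketch of this weighting step, consistent with the two sketches above and with the counting function of equation (2) (an illustration only; the package also normalizes the statistics):

# Combine the aggregated scores with the number of m x l partitions each cell
# participates in, for all partition sizes m, l in 2:m.max.
compute_S <- function(T_arr, A, m.max) {
  # number of ways to place the remaining split points on one axis
  axis_factor <- function(size, parts, at_edge) {
    n <- if (at_edge) A - size - 1 else A - size - 2
    k <- if (at_edge) parts - 2 else parts - 3
    if (n < 0 || k < 0) 0 else choose(n, k)
  }
  S <- matrix(NA_real_, m.max, m.max)
  for (m in 2:m.max) for (l in 2:m.max) {
    total <- 0
    for (type in 1:4) for (w in 1:A) for (h in 1:A) {
      if (T_arr[type, w, h] == 0) next
      xe <- type %in% c(3, 4)      # cell touches the left/right edge
      ye <- type %in% c(2, 4)      # cell touches the top/bottom edge
      total <- total + axis_factor(w, m, xe) * axis_factor(h, l, ye) * T_arr[type, w, h]
    }
    S[m, l] <- total               # unnormalized S^{m x l}
  }
  S
}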
9 Appendix B
In Table 3 we list the additional available
tests in the HHG package. Specifically, the multivariate tests of
Heller et al. (2013) and the univariate test of Heller et al. (2016).
Table 3: Nonparametric tests previously available in the HHG
package for small to moderate sample sizes
Function Name
Description
hhg.test
Implements the multivariate test of independence described in Heller et al. (2013), testing for the independence of two random vectors X and Y, of dimensionality p and q, respectively.
hhg.test.2.sample, hhg.test.k.sample
Adaptation of the above procedure to the K-sample problem (equality of distributions). Given a multivariate measurement X and a group factor variable Y, test whether the distributions of X in the different groups are equal. This is the nonparametric extension of the MANOVA problem (sensitive not only to shifts in means).
hhg.univariate.ind.combined.test
Univariate test of independence for problem (1), as described in Heller et al. (2016).
hhg.univariate.ks.combined.test
Test for univariate equality of distributions (i.e., the K-sample problem), as described in Heller et al. (2016). This is the nonparametric extension of the ANOVA problem, which is sensitive to any difference in distribution (not only to shifts in means).
References
D. Albanese, M. Filosi, R. Visintainer, S. Riccadonna, G. Jurman and C. Furlanello. Minerva and minepy: A c engine for the MINE suite and its r, python and MATLAB wrappers. Bioinformatics, 29(3): 407–408, 2013. URL https://doi.org/10.1093/bioinformatics/bts707.
K. P. Chwialkowski, A. Ramdas, D. Sejdinovic and A. Gretton. Fast two-sample testing with analytic representations of probability measures. In Advances in neural information processing systems, pages. 1981–1989 2015.
M. Filosi, R. Visintainer and D. Albanese. Minerva: Maximal information-based nonparametric exploration for variable analysis. 2017. URL https://CRAN.R-project.org/package=minerva. R package version 1.4.7.
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar): 723–773, 2012a.
A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf and A. J. Smola. A kernel statistical test of independence. In Advances in neural information processing systems, pages. 585–592 2008.
A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu and B. K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In Advances in neural information processing systems 25, Eds F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger pages. 1205–1213 2012b. Curran Associates, Inc. URL http://papers.nips.cc/paper/4727-optimal-kernel-choice-for-large-scale-two-sample-tests.pdf.
F. E. Harrell Jr, with contributions from Charles Dupont and many others. Hmisc: Harrell miscellaneous. 2018. URL https://CRAN.R-project.org/package=Hmisc. R package version 4.1-1.
J. Hausser and K. Strimmer. Entropy inference and the james-stein estimator, with application to nonlinear gene association networks. Journal of Machine Learning Research, 10(Jul): 1469–1484, 2009.
J. Hausser and K. Strimmer. Entropy: Estimation of entropy, mutual information and related quantities. 2014. URL https://CRAN.R-project.org/package=entropy. R package version 1.2.1.
R. Heller, Y. Heller and M. Gorfine. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2): 503–510, 2013. URL https://doi.org/10.1093/biomet/ass070.
R. Heller, Y. Heller, S. Kaufman, B. Brill and M. Gorfine. Consistent distribution-free k-sample and independence tests for univariate random variables. Journal of Machine Learning Research, 17(29): 1–54, 2016.
C. Huang and X. Huo. An efficient and distribution-free two-sample test based on energy statistics and random projections. arXiv preprint arXiv:1707.04602, 2017.
B. Jiang, C. Ye and J. S. Liu. Non-parametric k-sample tests via dynamic slicing. Journal of the American Statistical Association, 110(510): 642–653, 2015. URL https://doi.org/10.1080/01621459.2014.920257.
P. E. Meyer, F. Lafitte and G. Bontempi. Minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics, 9(1): 461, 2008. URL https://doi.org/10.1186/1471-2105-9-461.
P. E. Meyer, F. Lafitte and G. Bontempi. Minet: Mutual information NETworks. 2017. URL http://minet.meyerp.com. R package version 3.36.0.
N. Pfister, P. Bühlmann, B. Schölkopf and J. Peters. Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1): 5–31, 2018. URL https://doi.org/10.1111/rssb.12235.
N. Pfister and J. Peters. dHSIC: Independence testing via hilbert schmidt independence criterion. 2017. URL https://CRAN.R-project.org/package=dHSIC. R package version 2.0.
D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher and P. C. Sabeti. Detecting novel associations in large data sets. science, 334(6062): 1518–1524, 2011. URL https://doi.org/10.1126/science.1205438.
M. L. Rizzo and G. J. Szekely. Energy: E-statistics: Multivariate inference via the energy of data. 2017. URL https://CRAN.R-project.org/package=energy. R package version 1.7-2.
G. J. Székely, M. L. Rizzo, et al. Brownian distance covariance. The annals of applied statistics, 3(4): 1236–1265, 2009. URL https://doi.org/10.1214/09-aoas312.
G. J. Székely and M. L. Rizzo. Testing for equal distributions in high dimension. InterStat, 5(16.10): 2004.
G. J. Székely, M. L. Rizzo, N. K. Bakirov, et al. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6): 2769–2794, 2007. URL https://doi.org/10.1214/009053607000000505 .
A. Zeileis, K. Hornik, A. Smola and A. Karatzoglou. Kernlab-an S4 package for kernel methods in r. Journal of statistical software, 11(9): 1–20, 2004. URL https://doi.org/10.18637/jss.v011.i09 .
Q. Zhang, S. Filippi, A. Gretton and D. Sejdinovic. Large-scale kernel methods for independence testing. Statistics and Computing, 28(1): 113–130, 2018. URL https://doi.org/10.1007/s11222-016-9721-7.
J. Zhao and D. Meng. Fastmmd: Ensemble of circular discrepancy for efficient two-sample test. Neural computation, 27(6): 1345–1372, 2015. URL https://doi.org/10.1162/neco_a_00732.
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
Citation
For attribution, please cite this work as
Brill, et al., "Nonparametric Independence Tests and $k$-sample Tests for Large Sample Sizes Using Package HHG", The R Journal, 2018
BibTeX citation
@article{RJ-2018-008,
author = {Brill, Barak and Heller, Yair and Heller, Ruth},
title = {Nonparametric Independence Tests and $k$-sample Tests for Large Sample Sizes Using Package HHG},
journal = {The R Journal},
year = {2018},
note = {https://rjournal.github.io/},
volume = {10},
issue = {1},
issn = {2073-4859},
pages = {424-438}
}