High-dimensional low sample size (HDLSS) data sets frequently emerge in many biomedical applications. The direction-projection-permutation (DiProPerm) test is a two-sample hypothesis test for comparing two high-dimensional distributions. The DiProPerm test is exact, i.e., the type I error is guaranteed to be controlled at the nominal level for any sample size, and thus is applicable in the HDLSS setting. This paper discusses the key components of the DiProPerm test, introduces the diproperm R package, and demonstrates the package on a real-world data set.
Advancements in modern technology and computer software have dramatically increased the demand for, and feasibility of, collecting high-dimensional data sets. Such data introduce analytical challenges that require the creation of new statistical methods and the adaptation of existing ones. One such challenge is that we may observe many more covariates (features) \(p\) than observations \(N\), especially in small sample size studies. These data structures are known as high-dimensional, low sample size (HDLSS) data sets, or “small \(N\), big \(p\).”
HDLSS data frequently occur in many health-related fields. For example, in genomic studies, a single microarray experiment can produce tens of thousands of gene expression measurements, compared to the small number of samples studied, often fewer than a hundred (Alag 2019). In medical imaging studies, a single region of interest in an MRI or CT-scan image often contains thousands of features, compared to the small number of samples studied (Limkin et al. 2017). In the pre-clinical evaluation of vaccines and other experimental therapeutic agents, the number of biomarkers measured (e.g., immune responses) may be much greater than the number of samples studied (e.g., mice, rabbits, or non-human primates) (Kimball et al. 2018).
One common task in the HDLSS setting entails comparing the distribution of covariates between two groups (classes). For example, in pre-clinical studies of vaccines, investigators may wish to compare the distribution of biomarkers between animals who survive to a certain time point and animals who do not survive. In the example presented below, the distribution of \(p=112\) covariates is compared between edible and poisonous mushrooms. The covariate distributions may be compared between two samples using the direction-projection-permutation (DiProPerm) test (Wei et al. 2016). This test is well-suited for the HDLSS setting because the test is exact, i.e., the type I error is guaranteed to be controlled at the nominal level for any sample size. Below we discuss the key components of the DiProPerm test, introduce the diproperm R package, and demonstrate the use of the package on a real-world data set.
To compare the distribution of covariates between two groups, the DiProPerm test uses one-dimensional projections of the data, based on a binary linear classifier, to construct a univariate test statistic, and then permutes class labels to determine the sampling distribution of that statistic under the null. The details of the DiProPerm test are as follows.
Let \(X_{1}, \ldots ,X_{n} \sim F_{1}\) and \(Y_{1}, \ldots ,Y_{m} \sim F_{2}\) be independent random samples of \(p\) dimensional random vectors from multivariate distributions \(F_{1}\) and \(F_{2}\) where \(p\) may be larger than \(m\) and \(n\). The DiProPerm test evaluates the hypotheses \[H_{0}:F_{1}=F_{2} \text{ \ versus \ } H_{1}:F_{1} \neq F_{2}\] The test entails three steps:
1. Direction: find the normal vector to the separating hyperplane between the two samples after training a binary linear classifier.
2. Projection: project the data onto the normal vector and calculate a univariate two-sample statistic.
3. Permutation: conduct a permutation test using the univariate statistic as follows:
   a. Permute class membership after pooling the samples.
   b. Re-train the binary classifier and find the normal vector to the separating hyperplane.
   c. Recalculate the univariate two-sample statistic.
   d. Repeat a-c multiple times (say 1000) to determine the sampling distribution of the test statistic under the null \(H_{0}\).
   e. Compute the p-value by comparing the observed statistic to the sampling distribution.
Different binary linear classifiers may be used in the first step of DiProPerm. Linear discriminant analysis (LDA), particularly after conducting principal component analysis (PCA), is one possible classifier for the direction step. However, using LDA with PCA in the HDLSS setting has some disadvantages, including lack of interpretability, sensitivity to outliers, and a tendency to find spurious linear combinations due to a phenomenon known as data piling (Marron et al. 2007; Aoshima et al. 2018). Data piling occurs when data are projected onto a projection direction and many of the projected values are identical, i.e., piled on top of one another. The support vector machine (SVM) is another popular classifier (Hastie et al. 2001). The SVM finds the hyperplane that maximizes the minimum distance between the data points and the separating hyperplane. However, the SVM can also suffer from data piling in the HDLSS setting. To overcome data piling, the distance weighted discrimination (DWD) classifier was developed (Marron et al. 2007). The DWD classifier finds the separating hyperplane that minimizes the average inverse distance between the data points and the hyperplane. DWD performs well in HDLSS settings with good separation and is more robust to data piling.
In the second step of DiProPerm, a univariate statistic is calculated from the values projected onto the normal vector to the separating hyperplane found in the first step. Suppose \(x_{1}, \ldots ,x_{n}\) and \(y_{1}, \ldots ,y_{m}\) are the projected values from samples \(X_{1}, \ldots ,X_{n}\) and \(Y_{1}, \ldots ,Y_{m}\), respectively. One common choice of univariate test statistic for DiProPerm is the difference of means statistic, \(|\bar{x}-\bar{y}|\). Other two-sample univariate statistics, such as the two-sample t-statistic or the difference of medians, may also be used with DiProPerm.
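As a toy illustration, the projected scores below are made up (they are not produced by the package), but they show how these univariate statistics are computed in base R:

x <- c(1.2, 0.8, 1.5, 1.1)   # hypothetical projected scores, sample 1
y <- c(-0.4, 0.1, -0.2)      # hypothetical projected scores, sample 2

abs(mean(x) - mean(y))          # difference of means statistic
abs(median(x) - median(y))      # difference of medians
unname(t.test(x, y)$statistic)  # two-sample (Welch) t-statistic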
The last step of DiProPerm entails determining the distribution of the test statistic under the null. In this step, the two samples are pooled, the class labels are permuted, and the univariate statistic is recalculated from the re-trained classifier. This process is repeated multiple times (say 1000) to determine the sampling distribution of the test statistic under the null \(H_{0}\). The p-value is then calculated as the proportion of permuted statistics larger than the observed value.
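To make the three steps concrete, here is a minimal sketch of the procedure in base R; it is not the package implementation, and the function name diproperm_sketch is purely illustrative. The sketch uses the mean difference direction as the binary linear classifier and the difference-of-means statistic:

# Minimal sketch of DiProPerm with the mean difference direction classifier.
# X and Y are n x p and m x p numeric matrices; B is the number of permutations.
diproperm_sketch <- function(X, Y, B = 1000) {
  # Direction: unit normal vector to the separating hyperplane; for the mean
  # difference classifier this is the normalized difference of the group means.
  direction <- function(X, Y) {
    w <- colMeans(X) - colMeans(Y)
    w / sqrt(sum(w^2))
  }
  # Projection: project both samples onto w and take the difference of means.
  stat <- function(X, Y) {
    w <- direction(X, Y)
    abs(mean(X %*% w) - mean(Y %*% w))
  }
  t.obs <- stat(X, Y)
  Z <- rbind(X, Y)
  n <- nrow(X)
  # Permutation: relabel the pooled sample, re-train, and recompute the statistic.
  t.perm <- replicate(B, {
    idx <- sample(nrow(Z), n)
    stat(Z[idx, , drop = FALSE], Z[-idx, , drop = FALSE])
  })
  mean(t.perm >= t.obs)  # permutation p-value
}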
When the DiProPerm test is implemented using the DWD classifier, it is common practice to examine the loadings of the DWD classifier (An et al. 2016; Nelson et al. 2019). The DWD loadings represent the relative contribution of each variable to the class difference. A higher absolute value of a variable's loading indicates a greater contribution of that variable to the class difference. The loadings vector is a unit vector; thus, the individual loadings range from -1 to 1, and the sum of their squares equals one. The loadings direction vector points from the negative to the positive class. Thus, positive entries correspond to variables that tend to be larger for the positive class. Combining the DiProPerm test with an evaluation of the DWD loadings can provide insights into high-dimensional data and generate rational hypotheses for future research.
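As a small illustration of how loadings can be examined, the vector w below is made up rather than a fitted DWD direction; the idea is simply to rank variables by the absolute value of their loadings:

w <- c(0.70, -0.10, 0.05, -0.68, 0.06)   # hypothetical loadings for five variables
w <- w / sqrt(sum(w^2))                  # rescale so that sum(w^2) equals one
ord <- order(abs(w), decreasing = TRUE)  # variable indices sorted by contribution
data.frame(index = ord, loading = round(w[ord], 3))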
The DiProPerm test has been used in several areas of biomedical research, including osteoarthritis and neuroscience (An et al. 2016; Bendich et al. 2016; Nelson et al. 2019). However, no R package currently implements DiProPerm. Therefore, we developed diproperm, a free, publicly available R package to analyze data from two high-dimensional distributions (Allmon et al. 2020). Functions in the diproperm package allow users to conduct the DiProPerm test and create corresponding diagnostic plots. Loadings for the binary linear classifier can also be displayed, ordered from highest to lowest according to their contribution toward the separation of the two distributions.
The diproperm package comprises three functions:

DiProPerm(): Conducts the DiProPerm test.

plotdpp(): Plots diagnostics from the DiProPerm test.

loadings(): Returns the variable indices with the highest loadings in the binary classification. The absolute values of the loadings indicate a variable's relative contribution toward the separation between the two classes. The loadings vector is a unit vector; thus, the sum of its squares equals one and each individual loading lies between -1 and 1. Also, the loadings direction vector points from the negative to the positive class, so positive entries correspond to variables that tend to be larger for the positive class.

The diproperm package can be installed from CRAN using the code in the ‘Example’ section below. Additionally, the development version of the package can be installed from GitHub ("allmondrew/diproperm").
The example below creates a Gaussian data set containing 100 samples with 2 features, clustered around 2 centers with a standard deviation of 2. The class labels are then re-coded to -1 and 1 to match the input requirements of DiProPerm(). The DiProPerm test is then conducted using the DWD classifier, the mean difference univariate statistic, and 1000 permutations. The results from DiProPerm() are then displayed with plotdpp(). Last, the indices of the five highest absolute loadings are listed.
install.packages("diproperm")
::install_github("elbamos/clusteringdatasets")
remoteslibrary(diproperm)
library(clusteringdatasets)
<- make_blobs(n_samples = 100, n_features = 2, centers = 2, cluster_std = 2)
cluster.data
<- cluster.data[[1]]
X <- cluster.data[[2]]
y ==2] <- -1
y[y
<- DiProPerm(X,y,B=1000,classifier = "dwd",univ.stat = "md")
dpp
plotdpp(dpp)
loadings(dpp,loadnum = 5)
The main function to be called first by the user is DiProPerm(), which takes an \(n \times p\) data matrix and a vector of \(n\) binary class labels, both provided by the user. Factor variables in the data matrix must be coded as 0/1 dummy variables, and the class labels must be coded as -1 and 1. By default, DiProPerm() uses the DWD classifier, the mean difference as the univariate statistic, and 1000 balanced permutations. The permutations are balanced in the sense that, after relabeling, the new -1 group contains \(n/2\) members from the original -1 group and \(n/2\) members not from the original -1 group. Users can also implement an unbalanced, randomized permutation design if desired. DiProPerm() implements DWD via the genDWD function in the DWDLargeR package (Lam et al. 2018a,b). The penalty parameter, C, in the genDWD function is calculated using the penaltyParameter function in DWDLargeR. DWDLargeR has several parameters which are set to recommended default values. More details on the algorithms underlying genDWD and penaltyParameter can be found in Lam et al. (2018b). Other options for the binary linear classifier in DiProPerm() are the mean difference direction "md" and the support vector machine "svm".

DiProPerm() uses parallel processing, delegating computation across the cores of the user's computer for increased efficiency. DiProPerm() returns a list containing the observed data matrix, the vector of observed class labels, the observed test statistic, the projection scores used to compute the observed test statistic, the loadings of the binary classification, the z-score, the cutoff value for an \(\alpha\) level of significance, the p-value for the DiProPerm test, a list of each permutation's projection scores and permuted class labels, and a vector of permuted test statistics of length equal to the number of permutations used.
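The balanced relabeling described above can be sketched as follows; this is an illustration of the idea rather than the package's internal code, and the helper name balanced_relabel is ours:

# One balanced relabeling: the new -1 group keeps half of the original -1 group
# and takes the rest of its members from the original +1 group, preserving the
# original group sizes.
balanced_relabel <- function(y) {
  neg <- which(y == -1)
  pos <- which(y ==  1)
  k <- length(neg) %/% 2
  new.neg <- c(sample(neg, k), sample(pos, length(neg) - k))
  y.new <- rep(1, length(y))
  y.new[new.neg] <- -1
  y.new
}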
After calling DiProPerm(), the function plotdpp() can be used to create a panel plot for assessing the diagnostics of the DiProPerm test. plotdpp() takes a DiProPerm list, and the user may specify which diagnostics to display. By default, plotdpp() displays a facet plot with the observed projection score distribution, the projection score distribution of the permutation with the smallest test statistic value, the projection score distribution of the permutation with the largest test statistic value, and the test statistic permutation distribution. For the permutation distribution plot, the z-score, cutoff value, observed test statistic, and p-value are displayed on the graph. Larger individual graphs may be displayed by using the plots option in plotdpp(). Additional graphs include the projection score distributions of the first and second permutations. The diagnostic plots show the user the characteristics of their data and facilitate the visual assessment of the separation of the two high-dimensional distributions being tested.
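For instance, assuming dpp holds the list returned by DiProPerm(), a single panel can be requested through the plots option (the option values shown here are the ones used in the example later in this article):

plotdpp(dpp, plots = "obs")       # only the observed projection score distribution
plotdpp(dpp, plots = "permdist")  # only the test statistic permutation distribution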
Lastly, after calling DiProPerm(), the user may call the loadings() function. The loadings() function returns the variable indices in the data matrix which have the highest absolute loadings in the binary classification. Because the loadings vector is a unit vector, the sum of the squared loadings is constrained to equal one, with each loading between -1 and 1. The loadings direction vector points from the negative to the positive class. Thus, positive entries correspond to variables that tend to be larger for the positive class. Higher absolute loading values indicate a greater contribution of a particular variable toward the separation between the two classes. By default, loadings() returns the indices of all variables sorted by their absolute loading value. Thus, the top variable index is the variable which contributes the most toward the separation of the two classes, and the last variable is the one which contributes the least. The user may also change the number of loadings displayed.
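Typical calls might look as follows, with loadnum as used in the example below:

loadings(dpp)               # all variable indices, sorted by absolute loading
loadings(dpp, loadnum = 5)  # only the top five contributors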
To illustrate the use of the diproperm package, consider the mushrooms data set, which is freely available from the UCI Machine Learning Repository (Dua and Graff 2019) and within diproperm. This data set includes various characterizations of \(23\) species of gilled mushrooms in the Agaricus and Lepiota families. Each mushroom species is labeled as either definitely edible or poisonous/unknown. There are \(n=8124\) mushrooms in total, and \(p=112\) binary covariates coded as 0/1 corresponding to 22 categorical attributes. Below, we demonstrate the diproperm package functionality using data from the first \(n=50\) mushrooms in the data set.
install.packages("diproperm")
library(diproperm)
data(mushrooms)
The above code installs the diproperm package and loads the mushroom data into R. Now, let us check the structure of the data to make sure it is compatible with DiProPerm().
dim(mushrooms$X)
[1]  112 8124

table(mushrooms$y)
  -1    1
4208 3916
The vector of class labels must be -1 or 1 for DiProPerm(), which is the case for these data. However, the data set is in \(p \times n\) format; for DiProPerm(), the data set must be in \(n \times p\) format. The conversion can be done using the transpose function from the Matrix package in R (Bates and Maechler 2019). After taking the transpose, we subset the data and the vector of class labels to the first 50 observations and store the results.
X <- Matrix::t(mushrooms$X)
X <- X[1:50,]
y <- mushrooms$y[1:50]
Now that the data are in the proper format, the call to DiProPerm() is as follows:
dpp <- DiProPerm(X=X,y=y,B=1000)

Algorithm stopped with error = 2.35e-08
sample size = 50, feature dimension = 112
positve sample = 12, negative sample = 38
number of iterations = 51
time taken = 0.10
Characteristics of the DWD algorithm used to find the solution for the observed data are displayed by DiProPerm(). The algorithm took 51 iterations and 0.10 seconds to converge to the tolerance threshold, with a zero percent classification error on the training data set. The runtime for 1000 permutations was less than 3 minutes on a four-core machine but would be faster on a machine with more cores. The dpp object stores the output list from DiProPerm() described above. Storing this information allows us to plot the diagnostics in the next step.
plotdpp(dpp)
Figure 1 displays the default diagnostics for a DiProPerm list. From the observed projection score distribution, one can see a clear separation between the two classes. Also, for the permutations yielding the smallest and largest test statistics, the projected score distributions overlap substantially, giving some visual justification that the distributions in the observed plot are truly different. Lastly, the bottom plot shows that the sampling distribution under the null is located around 0.4, while the observed test statistic is greater than 2. Each individual plot can also be produced by the following set of commands:
plotdpp(dpp,plots="obs")
plotdpp(dpp,plots="min")
plotdpp(dpp,plots="max")
plotdpp(dpp,plots="permdist")
The permutation p-value in Figure 1 suggests that the two high-dimensional distributions of mushroom attributes are indeed different between the two classes. Also displayed is a z-score, calculated by fitting a Gaussian distribution to the test statistic permutation distribution. The z-score of 12.9 for the mushroom data indicates that the observed test statistic is approximately 13 standard deviations from the expected value of the test statistic under the null. Finally, the cutoff value of 0.697 is displayed, corresponding to the critical value of the test at the 0.05 significance level.
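For intuition, these summaries can be recomputed directly from the permutation output. The element names perm.stats and obs.stat below are placeholders and not necessarily the names used in the list returned by DiProPerm():

perm.stats <- dpp$perm.stats  # placeholder name: vector of permuted test statistics
obs.stat   <- dpp$obs.stat    # placeholder name: observed test statistic

(obs.stat - mean(perm.stats)) / sd(perm.stats)  # z-score from a fitted Gaussian
quantile(perm.stats, probs = 0.95)              # cutoff at the 0.05 level
mean(perm.stats >= obs.stat)                    # permutation p-value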
In order to assess which variables contributed most toward the separation seen in the observed projection score distribution, we can print the top five contributors with the following code:
loadings(dpp,loadnum = 5)
index sorted_loadings
   29       0.5395016
   37       0.3170037
   36      -0.2481763
  111       0.2228389
   20      -0.2087244
The top five contributors toward the separation seen in the observed distribution in Figure 1 are indices 29, 37, 36, 111, and 20. These indices correspond to a pungent odor, narrow gill size, broad gill size, urban habitat, and yellow cap color, respectively. For these data, \(y = 1\) corresponds to poisonous and \(y = -1\) to edible; thus, loadings with positive entries, such as pungent odor, are indicative of poisonous mushrooms. These results are similar to previous analyses, which have also found odor, gill size, habitat, and cap color predictive of mushroom edibility (Wibowo et al. 2018; Pinky et al. 2019).
DiProPerm is an exact test for comparing two high-dimensional distributions. The diproperm package allows the user to conduct a DiProPerm test and visually assess whether there is a difference between the high-dimensional distributions of two classes and, if so, to evaluate the key features contributing to the separation between the classes.
This work was supported by the National Institute of Allergy and Infectious Diseases (NIAID) of the National Institutes of Health (NIH) under award number R37AI054165. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The authors thank two anonymous reviewers for helpful comments.