High-dimensional low sample size (HDLSS) data sets frequently emerge in many biomedical applications. The direction-projection-permutation (DiProPerm) test is a two-sample hypothesis test for comparing two high-dimensional distributions. The DiProPerm test is exact, i.e., the type I error is guaranteed to be controlled at the nominal level for any sample size, and thus is applicable in the HDLSS setting. This paper discusses the key components of the DiProPerm test, introduces the diproperm R package, and demonstrates the package on a real-world data set.
Advancements in modern technology and computer software have
dramatically increased the demand and feasibility to collect
high-dimensional data sets. Such data introduce analytical challenges
which require the creation of new and adaptation of existing statistical
methods. One such challenge is that we may observe many more covariates
(features)
HDLSS data frequently occur in many health-related fields. For example, in genomic studies, a single microarray experiment can produce tens of thousands of gene expressions compared to the few samples studied, often being less than a hundred (Alag 2019). In medical imaging studies, a single region of interest for analysis in an MRI or CT-scan image often contains thousands of features compared to the small number of samples studied (Limkin et al. 2017). In the pre-clinical evaluation of vaccines and other experimental therapeutic agents, the number of biomarkers measured (e.g., immune responses) may be much greater than the number of samples studied (e.g., mice, rabbits, or non-human primates) (Kimball et al. 2018).
One common task in the HDLSS setting entails comparing the distribution
of covariates between two groups (classes). For example, in pre-clinical
studies of vaccines, investigators may wish to compare the distribution
of biomarkers between animals who survive to a certain time point and
animals who do not survive. In the example presented below, the
distribution of
To compare the distribution of covariates between two groups, the DiProPerm tests use one-dimensional projections of the data based on a binary linear classifier to construct a univariate test statistic and then permutes class labels to determine the sampling distribution of the test statistic under the null. The details of the DiProPerm test are as follows.
Let
Direction: find the normal vector to the separating hyperplane between two samples after training a binary linear classifier
Projection: project data on to the normal vector and calculate a univariate two-sample statistic
Permutation: conduct a permutation test using the univariate statistic as follows:
Permute class membership after pooling samples
Re-train binary classifier and find the normal vector to the separating hyperplane
Recalculate the univariate two sample statistic
Repeat a-c multiple times (say 1000) to determine the sampling
distribution of the test statistic under the null
Compute p-value by comparing the observed statistic to the sampling distribution
Different binary linear classifiers may be used in the first step of DiProPerm. Linear discriminant analysis (LDA), particularly after conducting principal component analysis (PCA), is one possible classifier for the direction step. However, using LDA with PCA in the HDLSS setting has some disadvantages, including lack of interpretability, sensitivity to outliers, and tendency to find spurious linear combinations due to a phenomenon known as data piling (Marron et al. 2007; Aoshima et al. 2018). Data piling occurs if data are projected onto some projection direction, and many of the projections are the same or piled on one another. The support vector machine (SVM) is another popular classifier (Hastie et al. 2001). The SVM finds the hyperplane that maximizes the minimum distance between data points and the separating hyperplane. However, the SVM can also suffer from data piling in the HDLSS setting. To overcome data piling, the distance weighted discrimination (DWD) classifier was developed (Marron et al. 2007). The DWD classifier finds the separating hyperplane minimizing the average inverse distance between data points and the hyperplane. The DWD performs well in HDLSS settings with good separation and is more robust to data piling.
In the second step of DiProPerm, a univariate statistic is calculated
using the projected values onto the normal vector to the separating
hyperplane from the first step. Suppose
The last step of DiProPerm entails determining the distribution of the
test statistic under the null. In this step, the two samples are pooled,
class labels are permuted, then a univariate statistic is calculated.
Repeat this process multiple times (say 1000) to determine the sampling
distribution of the test statistic under the null
When the DiProPerm test is implemented using the DWD classifier, it is common practice to look at the loadings of the DWD classifier (An et al. 2016; Nelson et al. 2019). The DWD loadings represent the relative contribution of each variable to the class difference. A higher absolute value of a variable’s loading indicates a greater contribution for that variable to the class difference. The loadings vector is a unit vector; thus, the individual loadings range from -1 to 1, and the sum of the squares of the loadings equals one. The loadings direction vector points from the negative to positive class. Thus, positive entries correspond to variables that tend to be larger for the positive class. Combining the use of the DiProPerm and evaluation of the DWD loadings in applications can provide insights into high-dimensional data and be used to generate rational hypotheses for future research.
The DiProPerm test has been used in several areas of biomedical research, including osteoarthritis and neuroscience (An et al. 2016; Bendich et al. 2016; Nelson et al. 2019). However, currently, there does not exist an R package that implements DiProPerm. Therefore we developed diproperm, a free, publicly available R software package to analyze data from two high-dimensional distributions (Allmon et al. 2020). Functions in the diproperm package allow users to conduct the DiProPerm test and create corresponding diagnostic plots. Loadings for the binary linear classifier are also available for display in order from highest to lowest relative to their contribution toward the separation of the two distributions.
The diproperm package is comprised of three functions:
DiProPerm()
: Conducts the DiProPerm testplotdpp()
: Plots diagnostics from the DiProPerm testloadings()
: Returns the variable indices with the highest loadings
in the binary classification. The absolute values of the loading
values indicate a variable’s relative contribution toward the
separation between the two classes. The loadings vector is a unit
vector, thus, the sum of its squares must be equal to one and range
from -1 to 1. Also, the loadings direction vector points from the
negative to positive class. Thus, positive entries correspond to
variables that are larger for the positive class.The diproperm package can be installed from CRAN using the code in the ‘Example’ section below. Additionally, the development version of the package can be installed from GitHub ("allmondrew/diproperm").
The example below creates a Gaussian data set containing 100 samples, 2
features, clustered around 2 centers with a standard deviation of 2. The
class labels are then re-classified to -1 and 1 to match the input
requirements of DiProPerm()
. The DiProPerm test is then conducted
using the DWD classifier, the mean difference univariate statistic, and
1000 permutations. The results from DiProPerm()
are then displayed
with plotdpp()
. Last, the top five indices of the highest absolute
loadings are listed.
install.packages("diproperm")
remotes::install_github("elbamos/clusteringdatasets")
library(diproperm)
library(clusteringdatasets)
cluster.data <- make_blobs(n_samples = 100, n_features = 2, centers = 2, cluster_std = 2)
X <- cluster.data[[1]]
y <- cluster.data[[2]]
y[y==2] <- -1
dpp <- DiProPerm(X,y,B=1000,classifier = "dwd",univ.stat = "md")
plotdpp(dpp)
loadings(dpp,loadnum = 5)
The main function to be called first by the user is DiProPerm()
, which
takes in an DiProPerm()
uses the DWD classifier, the mean difference as the
univariate statistics, and 1000 balanced permutations. The permutations
are balanced in the sense that after relabeling, the new -1 group
contains DiProPerm()
implements DWD
from the genDWD
function in the
DWDLargeR package
(Lam et al. 2018a,b). The penalty parameter, C
, in the
genDWD
function is calculated using the penaltyParameter
function in
DWDLargeR.
DWDLargeR has several
parameters which are set to recommended default values. More details on
the algorithm used to compute genDWD
and penaltyParameter
can be
found in Lam et al. (2018b). Other options included in DiProPerm()
for the
binary linear classifier are the mean difference direction “md” and
support vector machine “svm”. DiProPerm()
uses parallel processing to
delegate computation to the number of cores on the user’s computer for
increased efficiency. DiProPerm()
returns a list of the observed data
matrix, vector of observed class labels, observed test statistic,
projection scores used to compute the observed test statistic, the
loadings of the binary classification, the z-score, cutoff value for an
After calling DiProPerm()
, the function plotdpp()
can be used to
create a panel plot for assessing the diagnostics of the DiProPerm test.
plotdpp()
takes in a DiProPerm list and the user may specify which
diagnostics they would like to display. By default, plotdpp()
displays
a facet plot with the observed score distribution, the projection score
distribution of the permutation with the smallest test statistic value,
the projection score distribution of the permutation with the largest
test statistic value, and the test statistic permutation distribution.
For the permutation distribution plot, the z-score, cutoff value,
observed test statistic and p-value are displayed on the graph. Larger
individual graphs may be displayed by using the plots option in
plotdpp()
. Additional graphs include the projection score
distributions for the first and second permutations. The diagnostic
plots show the user the characteristics of their data and facilitate the
visual assessment of the separation of the two high-dimensional
distributions being tested.
Lastly, after calling the DiProPerm()
, the user may call the
loadings()
function. The loadings()
function returns the variable
indices in the data matrix which have the highest absolute loadings in
the binary classification. Because the loadings vector is a unit vector,
the sum of the squares of loadings is constrained to equal to one, with
each loading between -1 to 1. The loadings direction vector points from
the negative to the positive class. Thus, positive entries correspond to
variables that tend to be larger for the positive class. Higher absolute
loading values indicate a greater contribution for a particular variable
toward the separation between the two classes. By default, loadings()
returns the indices for all variables sorted by their absolute loading
value. Therefore, the top variable index is the variable which
contributes the most toward the separation of the two classes, and the
last variable is the one which contributes the least. The user may also
change the number of loadings displayed.
To illustrate the use of the
diproperm package,
consider the mushrooms data set, which is freely available from the UCI
Machine Learning Repository (Dua and Graff 2019) and within
diproperm. This data
set includes various characterizations of
install.packages("diproperm")
library(diproperm)
data(mushrooms)
The above code installs the
diproperm package and
loads the mushroom data into R. Now, let us check the structure of the
data to make sure it is compatible with DiProPerm()
.
dim(mushrooms$X)
[1] 112 8124
table(mushrooms$y)
-1 1
4208 3916
The vector of class labels must be -1 or 1 for DiProPerm()
, which is
the case for this data. However, the data set is in DiProPerm()
, the dataset must be in
X <- Matrix::t(mushrooms$X)
X <- X[1:50,]
y <- mushrooms$y[1:50]
Now, that the data is in the proper format the call to DiProPerm()
is
as follows:
dpp <- DiProPerm(X=X,y=y,B=1000)
Algorithm stopped with error 2.35e-08
sample size = 50, feature dimension = 112
positve sample = 12, negative sample = 38
number of iterations = 51
time taken = 0.10
Characteristics of the DWD algorithm used to find the solution for the
observed data are displayed by DiProPerm()
. The algorithm took 51
iterations and 0.10 seconds to converge to the tolerance threshold with
a zero percent classification error on the training data set. The
runtime for 1000 permutations was less than 3 minutes on a four-core
machine but would be faster on a machine with more cores. The dpp
object stores the output list from DiProPerm()
described in the
package. Storing the information allows us to plot the diagnostics in
the next step.
plotdpp(dpp)
plotdpp()
for the mushrooms data
set. The top graph is the observed projection score distribution of the
two classes, the two middle graphs are the projection score
distributions of the permutation with the smallest and largest test
statistic value, and the bottom graph is the test statistic permutation
distribution with the observed statistic value marked by the red dotted
line.Figure 1 displays the default diagnostics for a DiProPerm list. From the observed projection score distribution, one can see clear separation between the two classes. Also, from the projected score distributions of the permutations, which yield the smallest and largest test statistic, we see the score distributions overlap well, so there is some visual justification that the distributions in the observed plot are truly different. Lastly, the bottom plot shows the sampling distribution under the null is located around 0.4 while the observed test statistic is greater than 2. Each individual plot can also be output by the following set of commands:
plotdpp(dpp,plots="obs")
plotdpp(dpp,plots="min")
plotdpp(dpp,plots="max")
plotdpp(dpp,plots="permdist")
The permutation p-value in Figure 1 suggests that the two high-dimensional distributions of mushroom attributes are indeed different between the two classes. Also displayed is a z-score, calculated by fitting a Gaussian distribution to the test statistic permutation distribution. The mushroom data z-score 12.9 indicates the observed test statistic is approximately 13 standard deviations from the expected value of the test statistic under the null. Finally, the cutoff value 0.697 is displayed, corresponding to the critical value for a hypothesis test at the 0.05 significance level.
In order to assess which variables contributed most toward the separation in step 3, we can print the top five contributors with the code
loadings(dpp,loadnum = 5)
index sorted_loadings
29 0.5395016
37 0.3170037
36 -0.2481763
111 0.2228389
20 -0.2087244
The top five contributors toward the separation seen in the observed
distribution in Figure 1 are indices 29, 37, 36, 111,
and 20. These indices correspond to a pungent odor, narrow gill size,
broad gill size, urban habitat, and yellow cap color, respectively. For
these data,
DiProPerm is an exact test for comparing two high-dimensional distributions. The diproperm package allows the user to visually assess and conduct a DiProPerm test to determine if there is a difference between the high-dimensional distributions of two classes and, if so, evaluate the key features contributing to the separation between the classes.
This work was supported by the National Institute of Allergy and Infectious Diseases (NIAID) of the National Institutes of Health (NIH) under award number R37AI054165. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The authors thank two anonymous reviewers for helpful comments.
Econometrics, NumericalMathematics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Allmon, et al., "diproperm: An R Package for the DiProPerm Test", The R Journal, 2021
BibTeX citation
@article{RJ-2021-072, author = {Allmon, Andrew G. and Marron, J. S. and Hudgens, Michael G.}, title = {diproperm: An R Package for the DiProPerm Test}, journal = {The R Journal}, year = {2021}, note = {https://rjournal.github.io/}, volume = {13}, issue = {2}, issn = {2073-4859}, pages = {266-272} }