Weighted Distance Based Discriminant Analysis: The R Package WeDiBaDis

The WeDiBaDis package provides a user-friendly environment to perform discriminant analysis (supervised classification). WeDiBaDis is an easy-to-use package addressed to the biological and medical communities and, in general, to researchers interested in applied studies. It is suitable when the user is interested in constructing a discriminant rule on the basis of distances between a relatively small number of instances or units of known, unbalanced class membership measured on many (possibly thousands of) features of any type. This is a common situation when analyzing genetic biomedical data. The discriminant rule can then be used both as a means of explaining differences among classes and for the important task of assigning the class membership of new unlabeled units. Our package implements two discriminant analysis procedures in an R environment: the well-known distance-based discriminant analysis (DB-discriminant) and a weighted-distance-based discriminant (WDB-discriminant), a novel classification rule that we introduce. The new procedure improves the DB rule by taking into account the statistical depth of the units. This article presents both classification procedures and describes the implementation of each in detail. We illustrate the use of the package with an ecological and a genetic example. Finally, we illustrate the effectiveness of the proposed procedure (WDB) as compared with DB. This comparison is carried out using thirty-eight high-dimensional, class-unbalanced cancer data sets, three of which include clinical features.

Itziar Irigoien (Department of Computation Science and Artificial Intelligence, University of the Basque Country UPV/EHU) , Francesc Mestres (Department of Genetics, Microbiology and Statistics. Genetics Section, University of Barcelona) , Concepcion Arenas (Department of Genetics, Microbiology and Statistics. Statistics Section, University of Barcelona)
2017-01-03

1 Introduction

Discriminant analysis (supervised classification) is used to differentiate between two or more naturally occurring groups based on a suite of discriminating features. This analysis can be used as a means of explaining differences among groups and for classification, that is, to develop a rule based on features measured on a group of units with known membership (the so-called training set), and to use this classification rule to assign a class membership to new unlabeled units. Classification is used by researchers in a wide variety of settings and fields, including the biological and medical sciences. For example, in biology it is used for taxonomic classification, morphometric analysis for species identification, and to study species distribution. Discriminant analysis is applicable to a wide range of ecological problems, e.g., testing for niche separation by sympatric species or for the presence or absence of a particular species. Marine ecologists commonly use discriminant analysis to evaluate the similarity of distinct populations and to classify units of unknown origin to known populations. The discriminant technique is also used in genetic studies in order to summarize the genetic differentiation between groups. In studies with Single Nucleotide Polymorphism (SNP) or re-sequencing data sets, the number of variables (alleles) is usually greater than the number of observations (units), so discriminant methods that can handle data sets with more variables than units are necessary. Furthermore, class prediction is currently one of the most important tasks in biomedical studies. The diagnosis of diseases, such as cancer type or psychiatric disorders, has recently received a great deal of attention. With real data, classification presents serious difficulties, because diagnosis is based on both clinical/pathological features (usually nominal data) and gene expression information (continuous data). For this reason, classification rules that can be applied to all types of data are desirable. The most popular classification rules are the linear (LDA) and quadratic (QDA) discriminant analyses (Fisher 1936), which are easy to use as they are found in most statistical packages. However, they require the assumption of normally distributed data; when this condition is violated, their use may yield poor classification results. Many distinct classifiers exist, differing in the definition of the classification rule and in whether they utilize statistical (Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, and C. D. Bloomfield 1999; Hastie, R. Tibshirani, D. Botstein, and P. Brown 2001) or machine learning (Breiman 2001; Boulesteix, C. Porzelius, and M. Daumer 2008) methods. However, the problem of classification with data obtained from microarrays is challenging because there are a large number of genes and a relatively small number of samples. In this situation, classification methods based on the within-class covariance matrix fail, as its inverse is not defined. This is known as the singularity or under-sample problem (Krzanowski, P. Jonathan, W. V. McCarthy, and M. R. Thomas 1995). The shrunken centroid method can be seen as a modification of the diagonal discriminant analysis (Dudoit, J. Fridlyand, and T. P. Speed 2002) and was developed for continuous high-dimensional data (Tibshirani, T. Hastie, B. Narasimhan, and G. Chu 2002).
Nowadays, another issue that requires attention is the class-unbalanced situation, that is, when the number of units belonging to each class is not the same. When trained on class-unbalanced data, some classifiers tend to assign most of the new units to the majority class. This bias is higher when using high-dimensional data. Recently, a method that improves the shrunken centroid method for class-unbalanced, high-dimensional data was presented (Blagus and L. Lusa 2013). Furthermore, some statistical approaches are characterized by having an explicit underlying probability model, but this assumption cannot always be made. One of the most popular nonparametric, machine-learning classification methods is the \(k\)-nearest neighbor classification (\(k\)-NN) (Cover and P. E. Hart 1967; Duda, P. E. Hart, and D. G. Stork 2000). Given a new unit to be classified, this method finds the \(k\) nearest neighbors and assigns the new unit to the class to which the majority of those neighbors belong. This classification may depend on the selected value of \(k\). As ecologists have repeatedly argued, the Euclidean distance is inappropriate for raw species abundance data involving null abundances (Orloci 1967; Legendre and L. Legendre 1998), and it is necessary to use discriminant analyses that incorporate adequate distances. In this situation, discriminant analysis based on distances (DB-discriminant), where any symmetric distance or dissimilarity function can be used, is a useful alternative (Cuadras 1989, 1992; Cuadras, J. Fortiana, and F. Oliva 1997; Anderson and J. Robinson 2003). To our knowledge, this technique is only included in GINKGO, a suite of programs for multivariate analysis oriented towards ordination and classification of ecological data (De Caceres, F. Oliva, and X. Font 2003; Bouin 2005; Kent 2011). These programs are written in Java, so a Java Virtual Machine is needed to run them. Even though GINKGO is a very useful tool, it does not provide class prediction for new unlabeled units or feature selection. Recently, data depth was proposed as the basis for nonparametric classifiers (Jornstein 2004; Ghosh and P. Chaudhuri 2005; Jin and H. Cui 2010; Hlubinka and O. Vencalek 2013). The depth of a unit is a nonnegative number which measures its centrality; that is, in its sample version, depth reflects the position of the unit with respect to the observed data cloud. The so-called maximal depth classifier is the simple and natural classifier defined from a depth function: it allocates a new observation to the class in which it has maximal depth. There are many possible ways to define the depth of the data (Liu 1990; Vardi and C. Zhang 2000; Zuo and R. Serfling 2000; Serfling 2002); nevertheless, the computation of the most popular depth functions is very slow and, in particular, for high-dimensional data the time needed for classification grows rapidly. A new, less computer-intensive depth function \(I\) (Irigoien, F. Mestres, and C. Arenas 2013) was developed, but the authors did not study its use in relation to the classification problem.

A discriminant method should have several abilities. First, the classification rule has to be able to properly separate the classes. In this sense, classifier evaluation is most often based on the error rate, the number of incorrect predictions divided by the total number of predictions. Second, the rule has to be useful for classifying new unlabeled units; for this, cross-validation is needed. Cross-validation involves a series of sub-experiments, each of which involves the removal of a subset of objects from a data set (the test set), construction of a classifier using the remaining objects in the data set (the model-building set), and subsequent application of the resulting model to the removed objects. The leave-one-out method is a special case of cross-validation; it considers each single object in the data set as a test set. Furthermore, other measures, such as the sensitivity, specificity, and positive predictive value for each class, and the generalized correlation coefficient, are useful for assessing the ability of the rule in the prediction task.
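
To make the leave-one-out scheme concrete, the following is a minimal, generic sketch in R (illustrative only, not the code used by the package); the classify argument stands for any user-supplied rule that is built on the model-building set and then applied to the removed unit:

loo_predictions <- function(Y, labels, classify) {
  # For each unit: remove it, build the rule on the rest, predict the removed unit
  sapply(seq_len(nrow(Y)), function(i) {
    classify(train = Y[-i, , drop = FALSE],      # model-building set
             train.labels = labels[-i],
             test = Y[i, , drop = FALSE])        # the removed unit (test set)
  })
}
# The leave-one-out error rate is the proportion of predictions differing from 'labels'.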

Here we introduce WeDiBaDis, an R package which provides a user-friendly interface to run the DB-discriminant analysis and a new classification procedure, the weighted-distance-based discriminant (WDB-discriminant), which performs well and improves on the DB-discriminant rule. It is based on both the DB-discriminant rule and the depth function \(I\) (Irigoien, F. Mestres, and C. Arenas 2013). First, we describe the DB and WDB discriminant rules. Then, we provide details about the WeDiBaDis package and illustrate its use and its main outputs with an ecological and a genetic data set. To compare the DB and WDB rules, and in order to avoid the criticism that artificial data can favour particular methods, we present a large analysis of thirty-eight high-dimensional, class-unbalanced cancer gene expression data sets, three of which include clinical features. Furthermore, several of the data sets include more than two classes. Finally, we present the main conclusions. WeDiBaDis is available at https://github.com/ItziarI/WeDiBaDis.

2 Discriminant rules and evaluation criteria

Let \(\textbf{y}_i\) (\(i=1,2,\ldots,n\)) be \(m\)-dimensional units measured on any kind of features, with associated class labels \(l_i \in \{1, 2, \ldots, K\}\), where \(n\) and \(K\) denote the number of units and classes, respectively. Let \(\textbf{Y}\) be the matrix of all units and \(d\) a distance defined between any pair of units, \(d_{ij}=d(\textbf{y}_i,\textbf{y}_j)\). Let \(\textbf{y}^*\) be a new unlabeled unit to be classified into one of the given classes \(C_k\), \(k=1,2,\ldots,K\).

DB-discriminant

The distance-based or DB-discriminant rule (Cuadras, J. Fortiana, and F. Oliva 1997) takes as discriminant score \[\begin{aligned} \delta^1_k(\textbf{y}^*)=\hat{\phi}^2(\textbf{y}^*,C_k), \end{aligned}\] where \(\hat{\phi}^2(\textbf{y}^*,C_k)\) is the proximity function, which measures the proximity between \(\textbf{y}^*\) and \(C_k\). This function is defined by \[\begin{aligned} \hat{\phi}^2(\textbf{y}^*,C_k)=\frac{1}{n_k}\sum\limits_{i=1}^{n_k} d^2(\textbf{y}^*,\textbf{y}_i)-\frac{1}{2n_k^2}\sum\limits_{i,j=1}^{n_k} d^2(\textbf{y}_i,\textbf{y}_j), \end{aligned}\] where \(n_k\) indicates the number of units in class \(k\). Note that the second term in this expression, \[\hat{V}(C_k)=\frac{1}{2n_k^2}\sum\limits_{i,j=1}^{n_k} d^2(\textbf{y}_i,\textbf{y}_j),\] called the geometric variability of \(C_k\), measures the dispersion of \(C_k\). When \(d\) is the Euclidean distance, \(\hat{V}(C_k)\) is the trace of the covariance matrix of the units in \(C_k\).

The DB classification rule allocates \(\textbf{y}^*\) to the class which has the minimal proximity value: \[\begin{aligned} C_{DB}(\textbf{y}^*)=l \quad \mbox{where} \quad \delta^1_l(\textbf{y}^*)=\min_{k=1,\ldots,K} \left\{\delta^1_k(\textbf{y}^*)\right\}. \end{aligned}\]

That is, this distance-based rule assigns a unit to the nearest group. Furthermore, using appropriate distances, this rule reduces to some classic and well-studied rules (see Table 1 in Cuadras, J. Fortiana, and F. Oliva (1997)). For example, under the normality assumption, it is equivalent to the linear discriminant if the Mahalanobis distance is selected, or to the quadratic discriminant if the Mahalanobis distance plus a constant is selected.
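
As a minimal sketch of how the DB score can be computed from a distance matrix (illustrative code, not the package's internal implementation), assume D is the matrix of distances among the training units, labels holds their class labels, and d.new is the vector of distances from a new unit to every training unit:

db_scores <- function(D, labels, d.new) {
  classes <- unique(as.character(labels))
  sapply(classes, function(k) {
    idx <- which(labels == k)
    nk  <- length(idx)
    geo.var <- sum(D[idx, idx]^2) / (2 * nk^2)   # geometric variability of class k
    mean(d.new[idx]^2) - geo.var                 # proximity of the new unit to class k
  })
}
# DB allocation: the class with the smallest proximity value
# pred <- names(which.min(db_scores(D, labels, d.new)))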

WDB-discriminant

For any unit \(\textbf{y}\), let \(I_k\) be the depth function in class \(C_k\) defined by (Irigoien, F. Mestres, and C. Arenas 2013), \[\begin{aligned} I_k(\textbf{y})=\left[1+\frac{\hat{{\phi}}^2(\textbf{y}, C_k)}{\hat{{V}}(C_k)}\right]^{-1}. \end{aligned}\]

Function \(I\) takes values in \([0,1]\) and satisfies the following desirable properties. For a distribution having a uniquely defined “center”, \(I\) attains its maximum value at this center (maximality at center). When a unit moves away from the deepest unit (the unit at which the depth function attains its maximum value; in particular, for a symmetric distribution, the center) along any fixed ray through the center, the depth at this unit decreases monotonically (monotonicity relative to the deepest point), and the depth of a unit \(\textbf{y}\) approaches zero as \(||\textbf{y}||\) approaches infinity (vanishing at infinity). Depending on the distance used, the depth of a unit may or may not depend on the underlying coordinate system or, in particular, on the scales of the underlying measurements; in any case, invariance holds under translations and rotations. Thus, according to (Zuo and R. Serfling 2000), \(I\) is a type C depth function. As \(I\) is a depth function, it assigns to any observation a degree of centrality. While most depth functions assign zero depth to units outside a convex hull, so that some training units may have zero depth, the function \(I_k\) attains the value zero only if \(\hat{V}(C_k)=0\), that is, in the presence of a constant distribution.

For each class \(C_k\) we weight the discriminant score \(\delta^1_k\) by \(1-I_k(\textbf{y}^*)\), that is, given a new unit \(\textbf{y}^*\), we define a new discriminant score for class \(k\) by: \[\begin{aligned} \delta^2_k(\textbf{y}^*)=\delta^1_k(\textbf{y}^*)\left(1-I_k(\textbf{y}^*)\right)=\hat{\phi}^2(\textbf{y}^*,C_k)\left(1-I_k(\textbf{y}^*)\right). \end{aligned}\]

This shrinkage reduces the proximity values, and the reduction is greater for deeper units. Thus, this new classification rule, \[\begin{aligned} C_{WDB}(\textbf{y}^*)=l \quad \mbox{where} \quad \delta^2_l(\textbf{y}^*)=\min_{k=1,\ldots,K} \left\{\delta^2_k(\textbf{y}^*)\right\}, \end{aligned}\] allocates a new unit \(\textbf{y}^*\) to the class with minimal proximity and maximal depth.
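
Under the same assumptions as the DB sketch above (illustrative code with hypothetical helper names, not the package's implementation), the depth weighting modifies the score as follows:

wdb_scores <- function(D, labels, d.new) {
  classes <- unique(as.character(labels))
  sapply(classes, function(k) {
    idx <- which(labels == k)
    nk  <- length(idx)
    geo.var <- sum(D[idx, idx]^2) / (2 * nk^2)   # geometric variability V(C_k)
    prox    <- mean(d.new[idx]^2) - geo.var      # proximity phi^2(y*, C_k)
    depth   <- 1 / (1 + prox / geo.var)          # depth I_k(y*)
    prox * (1 - depth)                           # weighted score delta^2_k(y*)
  })
}
# WDB allocation: the class with the smallest weighted score
# pred <- names(which.min(wdb_scores(D, labels, d.new)))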

Evaluation criteria

First consider the case of two classes (\(K=2\)) and the most common measures of performance for a classification rule. As is usual in medical statistics, for a fixed class \(k\), let TP, FN, FP, and TN denote the true positives (number of units of class \(k\) correctly classified in class \(k\)), the false negatives (number of units of class \(k\) misclassified as units in class \(l\), with \(l\neq k\)), the false positives (number of units of class \(l\), with \(l\neq k\), misclassified as units in class \(k\)), and the true negatives (number of units of class \(l\), with \(l\neq k\), correctly classified as units in class \(l\)), respectively. Then (Zhou, N. A. Obuchowski, and D. K. McClish 2002), the sensitivity (recall) for class \(k\) is defined as the ability of a rule to correctly classify units belonging to class \(k\), thus \(Q_k^{se}=\frac{TP}{TP+FN}\). The specificity is the ability of a rule to correctly exclude a unit from class \(k\) when it really belongs to another class, thus \(Q_k^{sp}=\frac{TN}{TN+FP}\). Furthermore, the positive predictive value (precision) is the probability that a classification in class \(k\) is correct, thus \(P_k^{+}=\frac{TP}{TP+FP}\), and the negative predictive value is the probability that a classification in class \(l\) with \(l\neq k\) is correct, thus \(P_k^{-}=\frac{TN}{TN+FN}\). However, these measures do not take into account all the TP, FN, FP and TN values. For this reason, in biomedical applications the Matthews correlation coefficient (Matthews 1975), \(MC\), is often used. It is defined by: \[MC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.\] It ranges from \(-1\) if all the classifications are wrong to \(+1\) for perfect classification. A value equal to zero indicates that the classifications are random or that the classifier always predicts only one of the two classes.
In the general case of \(K\) classes with \(K\geq 2\), one obtains a \(K\times K\) contingency or confusion matrix \(\textbf{Z}=(z_{kl})\), where \(z_{kl}\) is the number of times that units are classified into class \(l\) while belonging in reality to class \(k\). Then, \(z_{k.}=\sum\limits_{l}{z_{kl}}\) and \(z_{.l}=\sum\limits_{k}{z_{kl}}\) represent the number of units belonging to class \(k\) and the number of units predicted to be in class \(l\), respectively. Obviously \(n=\sum\limits_{kl}{z_{kl}}=\sum\limits_{k}{z_{k.}}=\sum\limits_{l}{z_{.l}}\). One standard criterion to evaluate a classification rule is to compute the percentage of all correct predictions, \[Q_t= 100\frac{\sum{z_{kk}}}{n},\] the percentage of units correctly predicted to belong to class \(k\) relative to the total number of units in class \(k\) (sensitivity for class \(k\)), \[Q_k^{se}=100\frac{z_{kk}}{z_{k.}},\] the percentage of units correctly predicted to belong to any class \(l\) with \(l \neq k\) relative to the total number of units in any class \(l\) with \(l \neq k\) (specificity of class \(k\)), \[Q_k^{sp}=100\frac{\sum\limits_{l \neq k}{z_{l.}}-\sum\limits_{l \neq k}{z_{lk}}}{n-z_{k.}},\] and the percentage of units correctly classified into class \(k\) with respect to the total number of units classified into class \(k\) (positive predictive value for class \(k\)), \[P_k^{+}=100\frac{z_{kk}}{z_{.k}}.\] However, we also consider a generalization of the Matthews correlation coefficient, the so-called generalized squared correlation \(GC^2\) (Baldi, S. Brunak, Y. Chauvin, C. A. F. Andersen, and H. Nielsen 2000), which is defined by \[GC^2=\frac{\sum\limits_{k,l}{(z_{kl}-e_{kl})^2}/{e_{kl}}}{n(K-1)},\] where \(e_{kl}=\frac{z_{k.}z_{.l}}{n}\). This coefficient ranges between 0 and 1, and may often provide a much more balanced evaluation of the prediction than, for instance, the above percentages. A value equal to zero indicates that there is at least one class in which no units are classified.

Another interesting coefficient is the \(Kappa\) statistic, which measures the agreement of classification to the true class (Cohen 1960; Landis and G. G. Koch 1977). It can be calculated by: \[Kappa=\frac{\frac{TP+TN}{n}-\frac{(TN+FP) \cdot (TN+FN)+(FN+TP) \cdot (FP+TP)}{n^2}}{ 1-\frac{(TN+FP) \cdot (TN+FN)+(FN+TP)\cdot (FP+TP)}{n^2}},\] and the interpretation is: \(Kappa<0\), less than chance agreement; \(Kappa\) in \(0.01-0.20\), slight agreement; \(Kappa\) in \(0.21-0.40\), fair agreement; \(Kappa\) in \(0.41-0.60\), moderate agreement; \(Kappa\) in \(0.61-0.80\), substantial agreement; and \(Kappa\) in \(0.81-0.99\), almost perfect agreement.

Finally, another measure often reported for a classification is the \(F_1\) statistic (Powers 2011). For each class, it is calculated from the precision \(P_k^{+}\) and the recall \(Q_k^{se}\) as follows: \(F_1=2\cdot \frac{P_k^{+}Q_k^{se}}{P_k^{+}+Q_k^{se}}\). Note, however, that \(F_1\) does not take the true negatives into account.
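
The \(K\)-class measures above can be computed directly from a confusion matrix. The following sketch implements the formulas as defined in this section, together with the usual multi-class form of Cohen's Kappa (the package's own implementation may differ in details); Z is assumed to have the real classes in rows and the predicted classes in columns:

eval_confusion <- function(Z) {
  n   <- sum(Z)
  Qt  <- 100 * sum(diag(Z)) / n                                            # total correct prediction
  se  <- 100 * diag(Z) / rowSums(Z)                                        # sensitivity per class
  pp  <- 100 * diag(Z) / colSums(Z)                                        # positive predictive value per class
  sp  <- 100 * (n - rowSums(Z) - colSums(Z) + diag(Z)) / (n - rowSums(Z))  # specificity per class
  F1  <- 2 * pp * se / (pp + se)                                           # F1-score per class
  E   <- outer(rowSums(Z), colSums(Z)) / n                                 # expected counts e_kl
  GC2 <- sum((Z - E)^2 / E) / (n * (ncol(Z) - 1))                          # generalized squared correlation
  pe  <- sum(rowSums(Z) * colSums(Z)) / n^2                                # chance agreement
  Kappa <- (sum(diag(Z)) / n - pe) / (1 - pe)                              # Cohen's Kappa (multi-class form)
  list(Qt = Qt, sensitivity = se, specificity = sp, ppv = pp,
       F1 = F1, GC2 = GC2, Kappa = Kappa)
}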

3 Distance functions

The DB and WDB procedures require the prior calculation of a distance between units. In biomedical, genetic, and ecological studies, different types of dissimilarities are frequently used. For this reason, WeDiBaDis includes several distance functions. Although these distances can be found in other packages, they were included to ease their use for non-expert R users.

The package contains the usual Euclidean distance, \[\label{Euclidean_distance} d_E(\textbf{y}_i,\textbf{y}_j)=\sqrt{\sum_{k=1}^{m}(y_{ik}-y_{jk})^2}, \tag{1}\] the well known correlation distance, where \(r\) is the Pearson correlation coefficient, \[\label{correlation_distance} d_c(\textbf{y}_i,\textbf{y}_j)=\sqrt{(1-r(\textbf{y}_i,\textbf{y}_j))}, \tag{2}\] and the Mahalanobis distance (Mahalanobis 1936) with \(S\) the variance-covariance matrix, \[\label{Mahalanobis_distance} d_M(\textbf{y}_i,\textbf{y}_j)=\sqrt{(\textbf{y}_i-\textbf{y}_j)'S^{-1}(\textbf{y}_i-\textbf{y}_j)}. \tag{3}\] The function named mahalanobis() that calculates the Mahalanobis distance already exists in the stats package, but it is not suitable in our context. While this function calculates the Mahalanobis distance with respect to a given center, our function is designed to calculate the Mahalanobis distance between each pair of units given a data matrix.
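
For illustration, a pairwise Mahalanobis distance matrix can be obtained as in the following sketch, which uses the overall variance-covariance matrix of the data; it is not necessarily identical to the package's dMahal():

mahal_pairwise <- function(Y) {
  S.inv <- solve(cov(Y))                  # inverse of the variance-covariance matrix
  n <- nrow(Y)
  D <- matrix(0, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d <- Y[i, ] - Y[j, ]
      D[i, j] <- D[j, i] <- sqrt(drop(t(d) %*% S.inv %*% d))
    }
  }
  D
}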

Next, we briefly comment on the other distances included in the package. The Bhattacharyya distance (Bhattacharyya 1946) is a very well-known distance between populations in the genetic context. Each population is characterized by a vector \((p_{i1},\ldots,p_{im})\) whose coordinates are the relative frequencies of the features (usually chromosomal arrangements), with \[p_{ij}>0, j=1,\ldots,m \quad{ } \mbox{and} \quad{ } \sum_{j=1}^{m}p_{ij}=1, i=1,\ldots,n.\] Then, the distance between two units (populations) with frequencies \(\textbf{y}_i=(p_{i1}, \ldots, p_{im})\) and \(\textbf{y}_j=(p_{j1}, \ldots, p_{jm})\) is defined by: \[\label{Bhattacharyya_distance} d_B(\textbf{y}_i,\textbf{y}_j)=\arccos \sum_{l=1}^m{\sqrt{p_{il}p_{jl}}}. \tag{4}\]

The Gower distance (Gower 1971), used for mixed variables, is defined by: \[\label{similaridad_gower} d_G(\textbf{y}_i,\textbf{y}_j)=\sqrt{2(1-s(\textbf{y}_i,\textbf{y}_j))}, \tag{5}\] where \(s(\textbf{y}_i,\textbf{y}_j)\) is the similarity coefficient between unit \(\textbf{y}_i=(\textbf{x}_i, \textbf{q}_i, \textbf{ b}_i)\) and unit \(\textbf{y}_j=(\textbf{x}_j, \textbf{q}_j, \textbf{ b}_j)\), and \(\textbf{x}_., \textbf{q}_., \textbf{ b}_.\) are the values for the \(m_1\) continuous, \(m_2\) binary and \(m_3\) qualitative features, respectively. The coefficient \(s(\textbf{y}_i, \textbf{y}_j)\) is calculated by: \[s(\textbf{y}_i,\textbf{y}_j)=\frac{\sum_{l=1}^{m_1}\left({1-\frac{|x_{il}-x_{jl}|}{R_l}}\right) + a + \alpha}{m_1+(m_2-d)+m_3},\] with \(R_l\) the range of the \(l\)th continuous variable (\(l=1,\ldots,m_1\)); for the \(m_2\) binary variables, \(a\) and \(d\) represent the number of matches presence-presence and absence-absence, respectively; and \(\alpha\) is the number of matches between states for the \(m_3\) qualitative variables. Note that there is also the daisy() function in the cluster package, which can calculate the Gower distance for mixed variables. The difference between this function and dGower() in WeDiBaDis is that in daisy() the distance is calculated as \(d(\textbf{y}_i,\textbf{y}_j) = 1-s(\textbf{y}_i,\textbf{y}_j)\) and in dGower() as \(d(\textbf{y}_i,\textbf{y}_j) = \sqrt{2(1-s(\textbf{y}_i,\textbf{y}_j))}\). Moreover, dGower() allows us to include missing values (such as NA) and therefore calculates distances based on Gower’s weighted similarity coefficients. The dGower() function improves the function dgower() included in the package ICGE (Irigoien, B. Sierra, and C. Arenas 2013).
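
For data without missing values, and under the simplifying assumption that both functions build the same similarity \(s\), the dGower() transformation can be illustrated with daisy() from cluster, which returns \(1-s\) for its Gower metric:

library(cluster)
gower_as_sqrt <- function(x) {
  d.daisy <- daisy(x, metric = "gower")   # Gower dissimilarity, 1 - s(yi, yj)
  sqrt(2 * as.matrix(d.daisy))            # sqrt(2 * (1 - s)), as used by dGower()
}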

The Bray-Curtis distance (Bray and J. T. Curtis 1957) is one of the most well-known ways of quantifying the difference between samples when the information is ecological abundance data collected at different sampling locations. It is defined by: \[\label{similaridad_bray} d_B(\textbf{y}_i,\textbf{y}_j)=\frac{\sum_{l=1}^{m}|y_{il}-y_{jl}|}{y_{i+}+y_{j+}}, \tag{6}\] where \(y_{il}\), \(y_{jl}\) are the abundances of species \(l\) in samples \(i\) and \(j\), respectively, and \(y_{i+}\), \(y_{j+}\) are the total species abundances in samples \(i\) and \(j\), respectively. This distance can also be found in the vegan package.
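
For illustration, a direct translation of this formula for two abundance vectors is:

bray_curtis <- function(y.i, y.j) {
  sum(abs(y.i - y.j)) / (sum(y.i) + sum(y.j))   # Bray-Curtis distance
}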

The Hellinger (Rao 1995) and Orloci (or chord) (Orloci 1967) distances are also recommended for quantifying differences between sampling locations when ecological species-abundance data are collected. The Hellinger distance is given by: \[\label{similaridad_hellinger} d_H(\textbf{y}_i,\textbf{y}_j)=\sqrt{\sum_{l=1}^{m}\left(\sqrt{\frac{y_{il}}{\sum_{k=1}^{m}y_{ik}}}-\sqrt{\frac{y_{jl}}{\sum_{k=1}^{m}y_{jk}}}\right)^2}, \tag{7}\] and the Orloci distance, which is the Euclidean distance computed after scaling the site vectors to length 1, is defined by: \[\label{similaridad_orloci} d_O(\textbf{y}_i,\textbf{y}_j)=\sqrt{\sum_{l=1}^{m}\left(\frac{y_{il}}{\sqrt{\sum_{k=1}^{m}y_{ik}^2}}-\frac{y_{jl}}{\sqrt{\sum_{k=1}^{m}y_{jk}^2}}\right)^2}. \tag{8}\] This distance between two sites is equivalent to the length of a chord joining two points within a segment of a hypersphere of radius 1.
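
Both distances reduce to the Euclidean distance on row-transformed abundance data, so, for illustration, they can be obtained with dist() after rescaling the rows of an abundance matrix Y:

hellinger_dist <- function(Y) dist(sqrt(Y / rowSums(Y)))      # Hellinger distance
orloci_dist    <- function(Y) dist(Y / sqrt(rowSums(Y^2)))    # Orloci (chord) distance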

The Prevosti distance (Prevosti, J. Ocaña, and G. Alonso 1975) is a very useful genetic distance between units representing populations. Here, we consider that the genetic data are stored in a table where the rows represent the populations and the columns represent potential allelic states grouped by loci. The distance between two units (populations), where locus \(k\) has \(m(k)\) allelic states, is: \[\label{similaridad_prevosti} d_P(\textbf{y}_i,\textbf{y}_j)=\frac{1}{2\nu}\sum_{k=1}^{\nu}\sum_{s=1}^{m(k)}|p_{iks}-p_{jks}|, \tag{9}\] where \(\nu\) is the number of loci or chromosomes (in the case of chromosomal polymorphism) considered, and \(p_{iks}\), \(p_{jks}\) are the sample relative frequencies of the allele or chromosomal arrangement \(s\) at locus or chromosome \(k\) in the \(i\)th and \(j\)th populations, respectively. With presence/absence data coded by \(1\) and \(0\), respectively, the term \(\frac{1}{2\nu}\) is omitted.
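
With the frequencies of the two populations stored in vectors grouped by loci, a direct translation of the formula is, for illustration:

prevosti <- function(p.i, p.j, nu) {
  sum(abs(p.i - p.j)) / (2 * nu)   # nu = number of loci (or chromosomes)
}
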
As we explain in the next section, WeDiBaDis allows the user to introduce alternative distances by means of a distance matrix. Therefore, the user can work with any distance matrix that is considered appropriate for their data set and analysis. For this reason, no more distances were included in our package.

4 Using the package

We have developed the WeDiBaDis package to implement both the DB-discriminant and the new WDB-discriminant. It can be used with different distance functions, and NA values are allowed. When a unit has NA values in some features, those features are excluded from the computation of the distances for that unit, and the result is rescaled to the total number \(m\) of features in the data set. Package WeDiBaDis requires R version 3.3.1 or greater.

The principal function is WDBdisc with arguments:

WDBdisc(data, datatype, classcol, new.ind, distance, type, method)

where data is the input data set with the class labels in the column indicated by classcol (by default the first one); datatype indicates whether data contains a raw data matrix ("m") or a distance matrix ("d"); new.ind optionally contains new unlabeled units (or, for datatype = "d", their distances to the labeled units) to be classified; distance selects one of the available distance functions, the Euclidean distance being the default; and method selects the classifier, "WDB" (the default) or "DB".

The function returns an object with associated plot and summary methods offering: the leave-one-out confusion matrix; the percentage of correct predictions; the generalized squared correlation; Cohen's Kappa coefficient; the sensitivity, specificity, positive predictive value and F1-score for each class; and, when new unlabeled units are supplied, their predicted class. The plot method produces barplots of the classification table and of the per-class sensitivity, specificity and positive predictive values.

Moreover, given a data set, the distances described in the previous section can be obtained through the functions: dcor (correlation distance); dMahal (Mahalanobis distance); dBhatta (Bhattacharyya distance); dGower (Gower distance); dBrayCurtis (Bray-Curtis distance); dHellinger (Hellinger distance); dOrloci (Orloci distance); and dPrevosti (Prevosti distance).
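
For instance, one of these helpers can be used to build the distance input for WDBdisc(); the object names below are hypothetical, and the workflow is spelled out in full in Example 2:

# D   <- as.matrix(dBhatta(freqs))       # distances between units (frequency data)
# dat <- data.frame(class = labels, D)   # first column: class labels
# out <- WDBdisc(data = dat, datatype = "d", classcol = 1)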

Example 1: Ecological data

We consider the data from (Fielding 2007), which relate to the core area (the region close to the nest) of the golden eagle Aquila chrysaetos in three regions of Western Scotland. The data consist of eight habitat variables: POST (mature planted conifer forest in which the tree canopy has closed); PRE (pre-canopy closure planted conifer forest); BOG (flat waterlogged land); CALL (Calluna (heather) heath land); WET (wet heath, mainly purple moor grass); STEEP (steeply sloping land); LT200 (land below 200 m), and L4-600 (land between 200 and 400 m). The values are the amounts of each habitat variable, measured as the number of four-hectare grid cells within a region defined as a "core area". To evaluate whether the habitat variables allow discrimination between these three regions, a WDB-discriminant analysis using the Euclidean distance may be performed with the following instructions:

library(WeDiBaDis)
# datafile: habitat data matrix with the region label in the first column
out <- WDBdisc(data = datafile, datatype = "m", classcol = 1)

The summary method displays the complete results:

summary(out)  
 
Discriminant method: WDB
Leave-one-out confusion matrix:
           Predicted
 Real    1     2     3
 1       7     0     0
 2       0    14     2
 3       2     0     15
Total correct prediction:   90%
Generalized squared correlation:  0.7361
Cohen's Kappa coefficient:  0.84375 
Sensitivity for each class:
 1       2      3
100.00 87.50  88.24
Predictive value for each class:
 1      2       3
77.78 100.00  88.24
Specificity for each class:
  1     2     3
87.88 91.67 91.30
F1-score for each class: 
    1     2     3 
87.50 93.33 88.24 
------  ------  ------  ------  ------  ------
No predicted individuals

As we can observe, a perfect classification is obtained for the samples from region 1. For regions 2 and 3, only two samples in each were misclassified.

If we want to obtain the barplot for the classification table (see Figure 1), we use the command

plot(out)
Figure 1: Plot of leave-one-out classification table for ecological data in Example 1.

These commands generate the barplot of the sensitivity, specificity and positive predictive values (see Figure 2):

outplot <- summary(out, show = FALSE)
plot(outplot)
Figure 2: Plot of the sensitivity, specificity, and positive predictive value for each class for the ecological data in Example 1.

Finally, to perform a DB-discriminant analysis using a distance other than the Euclidean, the following commands are used:

library(WeDiBaDis)
out <- WDBdisc(data = datafile, datatype = "m", distance = "name of the distance",
               method = "DB", classcol = 1)
summary(out)
plot(out)
outplot <- summary(out, show = FALSE)
plot(outplot)

Example 2: Population genetics data

The chromosomal polymorphism for inversions is very useful to characterize the natural populations of Drosophila subobscura. Furthermore, lethal genes located in chromosomal inversions allow the understanding of important evolutionary events. We consider the data from a study of 40 samples of this polymorphism for the O chromosome of this species (Solé, F. Mestres, J. Balanyà, C. Arenas, and L. Serra 2000; Balanyà, E. Solé, J. M. Oller, D. Sperlich, and L. Serra 2004; Mestres, J. Balanyà, M. Pascual, C. Arenas, G. W. Gilchrist, R. B. Huey, and L. Serra 2009). Four groups can be considered: NLE with 16 non-lethal European samples, LE with 4 lethal European samples, NLA with 14 non-lethal American samples, and LA with 6 lethal American samples. In this example, two samples, one from group NLA and one from group NLE, were randomly selected and considered as new unlabeled units to be classified. The Bhattacharyya distances between all pairs of units were calculated. Therefore, the input for the WDBdisc function is an \(n \times (n+1)\) matrix dat =\((l_i, \; d_B(\textbf{y}_i, \textbf{y}_1), \ldots, d_B(\textbf{y}_i, \textbf{y}_n))_{i=1, \ldots, n}\), where the first column contains the class labels and the following columns the distance matrix. Furthermore, xnew is a two-row matrix where each row contains the distances between one of the new unlabeled units and the labeled units in the four classes. In this situation, the commands to call the WDB procedure, classify the xnew units, and obtain the available graphics in the package are:

library(WeDiBaDis)
out <- WDBdisc(data = dat, datatype = "d", classcol = 1, new.ind = xnew)
plot(out)
outplot <- summary(out, show = FALSE)
plot(outplot)

The summary method shows the following information. We can see that the xnew units were correctly classified:

summary(out)  

Discriminant method: WDB
Leave-one-out confusion matrix:
        Predicted
 Real LA  LE  NLA  NLE
  LA   6   0   0  0
  LE   0   3   0  1
  NLA  0   0  13  0
  NLE  0   3   0 12
Total correct prediction:   89.47%
Generalized squared correlation:  0.7442
Cohen's Kappa coefficient:  0.8509804 
Sensitivity for each class:
 LA      LE     NLA      NLE
100.00  75.00  100.00   80.00
Predictive value for each class:
 LA      LE     NLA      NLE
100.00  50.00  100.00   92.31
Specificity for each class:
LA      LE     NLA      NLE
87.50 91.18   84.00    95.65
F1-score for each class: 
    LA     LE    NLA    NLE 
100.00  60.00 100.00  85.71 
------ ------ ------ ------ ------ ------
Prediction for new individuals:
Pred. class
1 "NLE"
2 "NLA"

The two new unlabeled units were correctly classified. The corresponding barplots are shown in Figures 3 and 4.

Figure 3: Plot of leave-one-out classification table for population genetics data in Example 2.
Figure 4: Plot of the sensitivity, specificity, and positive predictive value for each class for the population genetics data in Example 2.

Data files

The package contains some example data files, each with a corresponding explanation. The data sets are corearea, containing the data for the example presented in the subsection Example 1: Ecological data; abundances, a simulated abundance data matrix; and microsatt, a data set containing allele frequencies for 18 cattle breeds (bull or zebu) of French and African descent, typed on 9 microsatellites.

Computing time

To illustrate the time consumed by the WDB procedure, which requires more computation than DB, we performed the following simulation with artificial data. We generated multinormal samples containing 50, 100, 200, 300, …, 900, 1000, 2000, and 3000 units. Then, for each sample size we created sets containing 50, 100, 500, 1000, 1500, 2000, 2500, …, 4500, and 5000 features. For each combination of sample size and number of features, we considered 2, 3, 4, and 10 classes. All the computations presented in this paper were performed on a personal computer with an Intel(R) Core(TM) i5-2450M processor and 6 GB of memory, using a single 2.50GHz CPU. The results of the simulation for two classes are displayed in Figure 5, where the elapsed time (the actual elapsed time since the process started) is reported in seconds. We can observe that the runtime is mainly affected by the number of units (Figure 5, top), but very little by the number of features (Figure 5, bottom). This is expected, as the procedure is based on distances and therefore the dimension of the distance matrix (the number of units) determines the runtime required. The number of classes also affects the runtime, although the increase with the number of classes is very slight. For example, with 300 units and 4000 features, the elapsed times for 2, 3, 4, and 10 classes are 3.38, 3.40, 3.62, and 4.82 seconds, respectively.

Figure 5: Artificial data sets with two classes. Top: elapsed time in seconds (y axis) for the WDB procedure with respect to the number of units (x axis); each line (colours in the legend) corresponds to sets with the same number of features. Bottom: elapsed time in seconds (y axis) for the WDB procedure with respect to the number of features (x axis); each line (colours in the legend) corresponds to sets with the same number of units.

5 DB and WDB comparison using cancer data sets

In order to compare the performance of the DB and WDB procedures, thirty-eight cancer data sets were considered in our analysis (Table 1). These are available at http://bioinformatics.rutgers.edu/Static/Supplements/CompCancer/datasets.htm and in (Lê Cao, E. Meugnier, and G. J. McLachlan 2010). As we can observe in Table 1, three of them include clinical features and some of the data sets have unbalanced classes. We evaluated the DB and WDB classifiers using the leave-one-out procedure. We present the total misclassification rate \(MQ_t=100-Q_t\) and the generalized squared correlation coefficient \(GC^2\) (Table 2). For simplicity, the sensitivity \(Q_k^{se}\), the specificity \(Q_k^{sp}\), and the positive predictive value \(P_k^{+}\) for each class, as well as the \(Kappa\) and \(F_1\) statistics, are not presented. For the microarray data sets with only continuous features we used the Euclidean distance, and for those including clinical and genetic data we considered the Gower distance (Gower 1971). As we can observe in Table 2, considering only \(MQ_t\), the total misclassification percentage rate, WDB was the best classifier in 18 data sets and tied with DB in a further 11 data sets (Wilcoxon signed rank test; one-sided p-value = 0.0265). Using the generalized squared correlation coefficient \(GC^2\) (Table 2), WDB was the best rule in 16 data sets and tied with DB in a further 11 data sets (Wilcoxon signed rank test; one-sided p-value = 0.0378). Note that for data sets 30 and 38 the \(GC^2\) value is 0. For example, in the Risinger-2003 case, all units of the second class (a class with 3 units) were misclassified by both the DB and WDB methods. However, while with the DB method 4 units belonging to other classes were misclassified into class 2, with the WDB method no units of other classes were misclassified into class 2, and for this reason the \(GC^2\) value is equal to 0. Something similar happened with the Yeoh-2002-v2 data set. Overall, WDB seems to obtain the best results in general and to be slightly better when the class sizes are unbalanced.
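
The paired one-sided comparisons reported above can be carried out with wilcox.test(), assuming the per-data-set misclassification rates of the two classifiers are stored in vectors mq.db and mq.wdb (hypothetical names, e.g. columns two and three of Table 2):

# wilcox.test(mq.wdb, mq.db, paired = TRUE, alternative = "less")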

Table 1: Cancer data sets (ID = identification number). For each data set, the number of classes (\(K\)), the number of samples (\(n\)), the number of samples in each class (\(n_i\)), the total number of features (\(p\)), the number of continuous features (\(cuant\)), and the number of qualitative features (\(quali\)) are given. The percentage of samples belonging to each class is given in brackets in the fifth column.
ID Data set \(K\) \(n\) \(n_i\) \(p\) \(cuant\) \(quali\)
1 Alizadeh-2000-v1 2 42 21(50%), 21(50%) 1095 1095
2 Alizadeh-2000-v2 3 62 42(67.74%), 9(14.52%), 11(17.74%) 2093 2093
3 Armstrong-2002-v1 2 72 24(33.33%), 48(66.67%) 1081 1081
4 Armstrong-2002-v2 3 72 24(33.33%), 20(27.78%), 28(38.89%) 2194 2194
5 Bhattacharjee-2001 5 203 139(68.47%), 17(8.37%), 6(2.96%), 1543 1543
21(10.34%), 20(9.85%)
6 Bittner-2000-V1 2 38 19(50%), 19(50%) 2201 2201
7 Bittner-2000-V2 3 38 19(50%), 12(31.58%), 7(18.42%) 2201 2201
8 Breast 2 256 75(29.30%), 181(70.70%) 5545 5537 8
9 Bredel-2005 3 50 31(62%), 14(28%), 5(10%) 1739 1739
10 Chen-2002 2 179 104(58.10%), 75(41.90%) 85 85
11 Chowdary-2006 2 104 62(59.62%), 42(38.89%) 182 182
12 CNS 2 60 21(35%), 39(65%) 7134 7128 6
13 Dyrskjot-2003 3 40 9(22.5%), 20(50%), 11(27.5%) 1203 1203
14 Garber-2001 4 66 17(25.76%), 40(60.61%), 4(6.06%), 5(7.58%) 4553 4553
15 Golub-1999-v1 2 72 47(65.28%), 25(34.72%) 1877 1877
16 Golub-1999-v2 3 72 38(52.78%), 9(12.5%), 25(34.72%) 1877 1877
17 Gordon-2002 2 181 31(17.13%), 150(82.87%) 1626 1626
18 Khan-2001 4 83 29(34.94%), 11(13.25%), 18(21.69%), 25(30.12%) 1069 1069
19 Laiho-2007 2 37 8(21.62%), 29(78.38%) 2202 2202
20 Lapointe-2004-v1 3 69 11(15.94%), 39(56.52%), 19(27.54%) 1625 1625
21 Lapointe-2004-v2 4 110 11(10%), 39(35.45%), 19(17.27%), 41(37.27%) 2496 2496
22 Liang-2005 3 37 28(75.67%), 6(16.22%), 3(8.11%) 1411 1411
23 Nutt-2003-v1 4 50 14(50%), 7(14%), 14(28%), 15(30%) 1377 1377
24 Nutt-2003-v2 2 28 14(50%),14(50%) 1070 1070
25 Nutt-2003-v3 2 22 7(31.82%),15(68.18%) 1152 1152
26 Pomeroy-2002-v1 2 34 25(73.53%), 9(26.47%) 857 857
27 Pomeroy-2002-v2 5 42 10(23.81%), 10(23.81%), 10(23.81%), 4(9.52%) 1379 1379
8(19.05%)
28 Prostate 2 79 37(46.84%), 42(53.16%) 7892 7884 8
29 Ramaswamy-2001 14 190 11(5.79%), 11(5.79%), 20(10.53%), 11(5.79%), 1363 1363
30(15.79%), 11(5.79%), 22(11.28%), 11(5.79%),
10(5.26%),11(5.79%), 11(5.79%), 10(5.26%),
11(5.79%), 10(5.26%)
30 Risinger-2003 4 42 13(30.95%), 3(7.14%), 19(45.24%), 7(16.67%) 1771 1771
31 Shipp-2002-v1 2 77 58(75.32%), 19(24.67%) 798 798
32 Singh-2002 2 102 50(49.02%), 52(50.98%) 339 339
33 Su-2001 10 174 8(4.60%), 26(14.94%), 23(13.22%), 12(6.90%), 1571 1571
11(6.32%), 7(4.02%), 28(16.09%), 27(15.52%),
6(3.45%), 26(14.94%)
34 Tomlins-2006-v1 5 104 27(25.96%), 20(19.23%), 32(30.77%), 13(12.5%), 2315 2315
12(11.54%)
35 Tomlins-2006-v2 4 92 27(26.35%), 20(21.74%), 32(34.78%), 13(14.13%) 1288 1288
36 West-2001 2 49 25(51.02%), 24 (48.98%) 1198 1198
37 Yeoh-2002-v1 2 248 43(17.34%), 205(82.66%) 2526 2526
38 Yeoh-2002-v2 6 248 15(6.05%), 27(10.89%), 64(25.81%), 20(8.06%), 2526 2526
43(17.34%), 79(31.85%)
Table 2: In the first column, the identification number of each cancer data set. In the second and third columns, the total leave-one-out misclassification rate \(100-Q_t\) (in percentage) for the DB and WDB classifiers, respectively; the smallest misclassification rate is in bold. In the fourth and fifth columns, the generalized squared correlation coefficient \(GC^2\) for the DB and WDB classifiers, respectively; the larger \(GC^2\) value is in bold.
ID \(100-Q_t\) \(100-Q_t\) \(GC^2\) \(GC^2\)
DB WDB DB WDB
1 7.14 7.14 0.74 0.74
2 1.61 0.00 0.94 1.00
3 8.33 5.56 0.684 0.77
4 4.17 4.17 0.88 0.88
5 19.21 15.27 0.49 0.56
6 13.16 13.16 0.56 0.56
7 36.84 36.84 0.25 0.25
8 32.81 30.47 0.11 0.13
9 18.00 18.00 0.34 0.34
10 11.17 8.94 0.61 0.67
11 18.27 9.62 0.42 0.64
12 41.67 38.33 0.01 0.01
13 15.00 12.50 0.58 0.65
14 21.21 28.79 0.38 0.19
15 6.94 4.17 0.72 0.82
16 6.94 6.94 0.81 0.81
17 12.71 13.26 0.47 0.42
18 1.20 1.20 0.97 0.97
19 21.62 21.62 0.23 0.23
20 31.88 30.43 0.23 0.26
21 30.91 30.91 0.34 0.34
22 13.51 10.81 0.72 0.76
23 32.00 34.00 0.40 0.33
24 17.86 10.71 0.43 0.65
25 4.55 9.09 0.80 0.67
26 29.41 20.59 0.12 0.16
27 16.67 21.43 0.65 0.63
28 34.18 34.18 0.10 0.10
29 36.84 29.47 0.44 0.53
30 28.57 26.19 0.36 0.00
31 29.87 12.99 0.24 0.48
32 30.39 30.39 0.18 0.16
33 20.11 16.67 0.63 0.70
34 17.31 21.15 0.66 0.58
35 23.91 26.09 0.46 0.41
36 20.41 14.29 0.35 0.52
37 1.61 2.02 0.89 0.87
38 21.77 24.60 0.57 0.00

6 Conclusions

The package WeDiBaDis, available at https://github.com/ItziarI/WeDiBaDis, is an implementation of two discriminant analysis procedures in an R environment. The classifiers are the distance-based (DB) rule and the newly proposed weighted-distance-based (WDB) rule. These are useful for solving the classification problem for high-dimensional data sets with mixed features, or when the input information is a distance matrix. This software provides functions to compute both discriminant procedures and to assess the performance of the classification rules. It offers: the leave-one-out classification table; the generalized squared correlation coefficient; the sensitivity, specificity, and positive predictive value for each class; and the \(Kappa\) and \(F_1\) statistics. The package also presents these results in graphical form (barplots for the classification table and for the sensitivity, specificity, and positive predictive values, respectively). Furthermore, it allows the classification of new unlabeled units. WeDiBaDis provides a user-friendly environment, which can be of great utility in biology, ecology, biomedicine, and, in general, any applied study involving discrimination between groups and classification of new unlabeled units. In addition, it can be very useful in multivariate methods courses aimed at biologists, medical researchers, psychologists, etc.

7 Acknowledgements

This research was partially supported as follows: I.I. by the Spanish ‘Ministerio de Economia y Competitividad’ (TIN2015-64395-R), by the Basque Government Research Team Grant (IT313-10), by the SAIOTEK Project SA-2013/00397, and by the University of the Basque Country UPV/EHU (Grant UFI11/45, BAILab); F.M. by the Spanish ‘Ministerio de Economia y Competitividad’ (CTM2013-48163) and by Grant 2014 SGR 336 from the Departament d’Economia i Coneixement de la Generalitat de Catalunya; C.A. by the Spanish ‘Ministerio de Economia y Competitividad’ (SAF2015-68341-R and TIN2015-64395-R) and by Grant 2014 SGR 464 (GRBIO) from the Departament d’Economia i Coneixement de la Generalitat de Catalunya.

CRAN packages used

cluster, ICGE, vegan

CRAN Task Views implied by cited packages

Cluster, Environmetrics, Phylogenetics, Psychometrics, Robust, Spatial


References

M. J. Anderson and J. Robinson. Generalized discriminant analysis based on distances. Australian & New Zealand Journal of Statistics 45:, 2003.
J. Balanyà, E. Solé, J. M. Oller, D. Sperlich, and L. Serra. Long-term changes in chromosomal inversion polymorphism of Drosophila subobscura: II. European populations. Journal of Zoological Systematics and Evolutionary Research 42:, 2004.
P. Baldi, S. Brunak, Y. Chauvin, C. A. F. Andersen, and H. Nielsen. Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics 16:, 2000.
A. Bhattacharyya. On a measure of divergence of two multinominal populations. Sankhya 7:, 1946.
R. Blagus and L. Lusa. Improved shrunken centroid classifiers for high-dimensional class-imbalanced data. BMC Bioinformatics 14: 64, 2013.
G. Bouin. Computer program review: Ginkgo, a multivariate analysis package. Journal of Vegetation Science 16:, 2005.
A. L. Boulesteix, C. Porzelius, and M. Daumer. Microarray-based classification and clinical predictors: On combined classifiers and additional predictive value. Bioinformatics 24:, 2008.
J. R. Bray and J. T. Curtis. An ordination of upland forest communities of southern Wisconsin. Ecological Monographs 27:, 1957.
L. Breiman. Random forests. Machine Learning 45:, 2001.
J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20:, 1960.
T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13:, 1967.
C. M. Cuadras. Some examples of distance based discrimination. Biometrical Letters 29:, 1992.
C. M. Cuadras. Statistical Data Analysis and Inference, chapter Distance Analysis in Discrimination and Classification Using Both Continuous and Categorical Variables, pages 459–473. Elsevier Science Publishers BV, Amsterdam, 1989.
C. M. Cuadras, J. Fortiana, and F. Oliva. The proximity of an individual to a population with applications in discriminant analysis. Journal of Classification 14:, 1997.
M. De Caceres, F. Oliva, and X. Font. GINKGO, a multivariate analysis program oriented towards distance-based classifications. In International Conference on Correspondence Analysis and Related Methods (CARME 2003), 2003.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, John Wiley & Sons, New York, 2000.
S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97:, 2002.
A. H. Fielding. Cluster and Classification Techniques for the Biosciences. Cambridge University Press, 2007.
R. A. Fisher. The use of multiple measurements in taxonomic problems. The Annals of Eugenics 7:, 1936.
A. K. Ghosh and P. Chaudhuri. On data depth and distribution-free discriminant analysis using separating surfaces. Bernoulli 11:, 2005.
T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, and C. D. Bloomfield. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286:, 1999.
J. C. Gower. A general coefficient of similarity and some of its properties. Biometrics 27:, 1971.
T. Hastie, R. Tibshirani, D. Botstein, and P. Brown. Supervised harvesting of expression trees. Genome Biology 2:, 2001.
D. Hlubinka and O. Vencalek. Depth-based classification for distributions with nonconvex support. Journal of Probability and Statistics 28:, 2013.
I. Irigoien, B. Sierra, and C. Arenas. ICGE: An R package for detecting relevant clusters and atypical units in gene expression. BMC Bioinformatics 13:, 2013.
I. Irigoien, F. Mestres, and C. Arenas. The depth problem: Identifying the most representative units in a data group. IEEE/ACM Transactions on Computational Biology and Bioinformatics 10:, 2013.
J. Jin and H. Cui. Discriminant analysis based on statistical depth. Journal of Systems Science and Complexity 23:, 2010.
R. Jornstein. Clustering and classification based on \(L_1\) data depth. Journal of Multivariate Analysis 90:, 2004.
M. Kent. Vegetation Description and Data Analysis: A Practical Approach. Wiley-Blackwell, 2011.
W. I. Krzanowski, P. Jonathan, W. V. McCarthy, and M. R. Thomas. Discriminant analysis with singular covariance matrices: Methods and applications to spectroscopic data. Applied Statistics 44:, 1995.
J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data. Biometrics 33:, 1977.
K. A. Lê Cao, E. Meugnier, and G. J. McLachlan. Integrative mixture of experts to combine clinical factors and gene markers. Bioinformatics 26:, 2010.
P. Legendre and L. Legendre. Numerical Ecology. Elsevier Amsterdam, 1998.
R. Y. Liu. On a notion of data depth based on random simplices. Annals of Statistics 18:, 1990.
P. C. Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India 2:, 1936.
B. W. Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta 405:, 1975.
F. Mestres, J. Balanyà, M. Pascual, C. Arenas, G. W. Gilchrist, R. B. Huey, and L. Serra. Evolution of Chilean colonizing populations of D. subobscura: lethal genes and chromosomal arrangements. Genetica 136:, 2009.
L. Orloci. An agglomerative method for classification of plant communities. Journal of Ecology 55:, 1967.
D. M. W. Powers. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies 2:, 2011.
A. Prevosti, J. Ocaña, and G. Alonso. Distances between populations of D. subobscura, based on chromosome arrangement frequencies. Theoretical and Applied Genetics 45:, 1975.
C. R. Rao. A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Qüestiió 19:, 1995.
R. Serfling. Statistical Data Analysis Based on the \(L_1\)-Norm and Related Methods, chapter A Depth Function and a Scale Curve Based on Spatial Quantiles, pages 25–38. Birkhäuser, Boston, 2002.
E. Solé, F. Mestres, J. Balanyà, C. Arenas, and L. Serra. Colonization of America by D. subobscura: Spatial and temporal lethal-gene allelism. Hereditas 133:, 2000.
R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America 99:, 2002.
Y. Vardi and C. Zhang. The multivariate \(L_1\)-median and associated data depth. Proceedings of the National Academy of Sciences of the United States of America 97:, 2000.
X. H. Zhou, N. A. Obuchowski, and D. K. McClish. Statistical Methods in Diagnostic Medicine. Wiley Series in Probability and Statistics, John Wiley & Sons, New Jersey, 2002.
Y. Zuo and R. Serfling. General notions of statistical depth function. Annals of Statistics 28:, 2000.


Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Irigoien, et al., "Weighted Distance Based Discriminant Analysis: The R Package WeDiBaDis", The R Journal, 2017

BibTeX citation

@article{RJ-2016-057,
  author = {Irigoien, Itziar and Mestres, Francesc and Arenas, Concepcion},
  title = {Weighted Distance Based Discriminant Analysis: The R Package WeDiBaDis},
  journal = {The R Journal},
  year = {2017},
  note = {https://rjournal.github.io/},
  volume = {8},
  issue = {2},
  issn = {2073-4859},
  pages = {434-450}
}