The WeDiBaDis package provides a user-friendly environment to perform discriminant analysis (supervised classification). WeDiBaDis is an easy-to-use package addressed to the biological and medical communities and, in general, to researchers interested in applied studies. It is suitable when the user is interested in constructing a discriminant rule on the basis of distances between a relatively small number of instances or units of known, unbalanced-class membership measured on many (possibly thousands of) features of any type. This is a common situation when analyzing genetic biomedical data. This discriminant rule can then be used both as a means of explaining differences among classes and for the important task of assigning the class membership for new unlabeled units. Our package implements two discriminant analysis procedures in an R environment: the well-known distance-based discriminant analysis (DB-discriminant) and a weighted-distance-based discriminant (WDB-discriminant), a novel classifier rule that we introduce. This new procedure improves on the DB rule by taking into account the statistical depth of the units. This article presents both classifying procedures and describes the implementation of each in detail. We illustrate the use of the package using an ecological and a genetic experimental example. Finally, we illustrate the effectiveness of the new proposed procedure (WDB) as compared with DB. This comparison is carried out using thirty-eight high-dimensional, class-unbalanced cancer data sets, three of which include clinical features.
Discriminant analysis (supervised classification) is used to
differentiate between two or more naturally occurring groups based on a
suite of discriminating features. This analysis can be used as a means
of explaining differences among groups and for classification. That is,
to develop a rule based on features measured on a group of units with
known membership (the so-called training set), and to use this
classification rule to assign a class membership to new unlabeled units.
Classification is used by researchers in a wide variety of settings and
fields including biological and medical sciences. For example, in
biology it is used for taxonomic classification, morphometric analysis
for species identification, and to study species distribution.
Discriminant analysis is applicable to a wide range of ecological
problems, e.g., testing for niche separation by sympatric species or for
the presence or absence of a particular species. Marine ecologists
commonly use discriminant analysis to evaluate the similarity of
distinct populations and to classify units of unknown origin to known
populations. The discriminant technique is also used in genetic studies
in order to summarize the genetic differentiation between groups. In
studies with Single Nucleotide Polymorphism (SNP) or re-sequencing data
sets, the number of variables (alleles) is usually greater than the
number of observations (units), so discriminant methods that can handle
data sets with more variables than units are necessary. Furthermore,
class prediction is currently one of the most important tasks in
biomedical studies. The diagnosis of diseases, such as cancer type or
psychiatric disorder, has recently received a great deal of attention.
With real data, classification presents serious difficulties, because
diagnosis is based on both clinical/pathological features (usually
nominal data) and gene expression information (continuous data). For
this reason, classification rules that could be applied to all types of
data are desirable. The most popular classification rules are the linear
(LDA) and quadratic (QDA) discriminant analyses (Fisher 1936), which are
easy to use as they are found in most statistical packages. However,
they require the assumption of normally distributed data; when this
condition is violated, their use may yield poor classification results.
Many distinct classifiers exist, differing in the definition of the
classification rule and whether they utilize statistical
(Golub et al. 1999; Hastie et al. 2001) or machine learning (Breiman 2001; Boulesteix et al. 2008) methods. However,
the problem of classification with data obtained from microarrays is
challenging because there are a large number of genes and a relatively
small number of samples. In this situation, the classification methods
based on the within-class covariance matrix fail, as an inverse is not
defined. This is known as the singularity or under-sample problem
(Krzanowski et al. 1995). The shrunken centroid method can be seen as a modification of
the diagonal discriminant analysis (Dudoit et al. 2002) and was developed for
continuous high-dimensional data (Tibshirani et al. 2002). Nowadays, another issue that
requires attention is the class-unbalanced situation, that is, the
number of units belonging to each class is not the same. Some
classifiers trained on class-unbalanced data tend to assign most of the
new units to the majority class. This bias is higher when using
high-dimensional data. Recently, a method which improves the shrunken
centroid method when high-dimensional data are class-unbalanced was
presented (Blagus and Lusa 2013). Furthermore, some statistical approaches are
characterized by an explicit underlying probability model, but this
assumption cannot always be made. One of the most
popular nonparametric, machine-learning classification methods is the
A discriminant method should have several abilities. First, the classifier rule has to be able to properly separate the classes. In this sense, classifier evaluation is most often based on the error rate, the number of incorrect predictions divided by the total number of predictions. Second, the rule has to be useful for classifying new unlabeled units, so a cross-validation evaluation is needed. Cross-validation involves a series of sub-experiments, each of which removes a subset of objects from a data set (the test set), constructs a classifier using the remaining objects (the model-building set), and then applies the resulting model to the removed objects. The leave-one-out method is a special case of cross-validation; it considers each single object in the data set as a test set. Furthermore, other measures, such as the sensitivity, specificity, and positive predictive value for each class, and the generalized correlation coefficient, are useful for assessing the ability of the rule in the prediction task.
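The measures above can all be computed directly from a leave-one-out confusion matrix. The following sketch uses the standard textbook definitions (the function name class_measures is illustrative, and details may differ from the definitions implemented in WeDiBaDis):

```r
# Per-class evaluation measures from a confusion matrix
# (rows = real class, columns = predicted class).
class_measures <- function(cm) {
  list(
    error_rate  = 1 - sum(diag(cm)) / sum(cm),  # overall misclassification
    sensitivity = diag(cm) / rowSums(cm),       # TP / (TP + FN), per class
    ppv         = diag(cm) / colSums(cm),       # TP / (TP + FP), per class
    specificity = sapply(seq_len(nrow(cm)), function(k)
      sum(cm[-k, -k]) / sum(cm[-k, ]))          # TN / (TN + FP), per class
  )
}

# Confusion matrix of the ecological example analyzed later in the text:
cm <- matrix(c(7, 0,  0,
               0, 14, 2,
               2, 0, 15), nrow = 3, byrow = TRUE)
round(100 * class_measures(cm)$sensitivity, 2)  # 100.00 87.50 88.24
```

The per-class sensitivity and positive predictive values computed this way match those reported by the package in the ecological example below.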
Here we introduce WeDiBaDis, an R package which provides a
user-friendly interface to run the DB-discriminant analysis and a new
classification procedure, the weighted-distance-based discriminant
(WDB-discriminant), which performs well and improves on the
DB-discriminant rule. It is based on both the DB-discriminant rule and the depth
function
Let
The distance-based or DB-discriminant rule (Cuadras et al. 1997) takes as a
discriminant score
The DB classification rule allocates
That is, this distance-based rule assigns a unit to the nearest group. Furthermore, using appropriate distances, Equation (3) reduces to some classic and well-studied rules (see Table 1 in Cuadras et al. (1997)). For example, under the normality assumption, Equation (3) is equivalent to a linear discriminant if the Mahalanobis distance is selected, or to a quadratic discriminant if the Mahalanobis distance plus a constant is selected.
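In code, the DB allocation rule can be sketched as follows. The function name db_classify is illustrative; the proximity formula is the standard one from Cuadras et al. (1997), in which the mean squared distance from a unit to a class is corrected by half the mean within-class squared distance (the geometric variability):

```r
# Sketch of the DB allocation rule: assign a unit to the class with the
# smallest proximity, where the proximity of x to class C_k is the mean
# squared distance from x to the units of C_k minus half the mean
# squared distance among the units of C_k.
db_classify <- function(d2_new, D2, labels) {
  # d2_new: squared distances from the unit to each training unit
  # D2    : squared-distance matrix among the training units
  # labels: class membership of the training units
  prox <- sapply(unique(labels), function(k) {
    idx <- labels == k
    mean(d2_new[idx]) - mean(D2[idx, idx]) / 2
  })
  names(which.min(prox))
}

# Toy example with the Euclidean distance on one feature:
x <- c(0, 1, 10, 11); labels <- c("A", "A", "B", "B")
D2 <- as.matrix(dist(x))^2
db_classify((0.5 - x)^2, D2, labels)  # "A": 0.5 is nearest to class A
```

Because the rule only needs distances, it works unchanged for any of the dissimilarities discussed later, including those for mixed or high-dimensional data.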
For any unit
Function
For each class
The shrinkage we use reduces the proximity values, this reduction being
greater for deeper units. Thus, this new classification rule,
First consider the case of two classes (
In the general case of
Another interesting coefficient is the
Finally, another measure used as a result of classification is the
The DB and WDB procedures require the prior computation of a distance between units. In biomedical, genetic, and ecological studies, different types of dissimilarities are frequently used. For this reason, WeDiBaDis includes several distance functions. Although these distances can be found in other packages, they were included to ease their use for non-expert R users.
The package contains the usual Euclidean distance,
A function mahalanobis() that calculates the Mahalanobis distance
already exists in the stats package, but it is not suitable in our
context: while that function calculates the Mahalanobis distance with
respect to a given center, our function is designed to calculate the
Mahalanobis distance between each pair of units given a data matrix.
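A minimal sketch of this pairwise computation (illustrative only, not the package's actual dMahal() code) could look like:

```r
# Pairwise Mahalanobis distances between the rows of a data matrix X,
# using the covariance matrix estimated from X itself.
pairwise_mahalanobis <- function(X) {
  S_inv <- solve(cov(X))   # inverse of the feature covariance matrix
  n <- nrow(X)
  D <- matrix(0, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      delta <- X[i, ] - X[j, ]
      D[i, j] <- D[j, i] <- sqrt(drop(t(delta) %*% S_inv %*% delta))
    }
  }
  D
}

set.seed(1)
X <- matrix(rnorm(40), nrow = 10)
D <- pairwise_mahalanobis(X)
# Each squared entry agrees with stats::mahalanobis() with one row as center:
all.equal(D[1, 2]^2, mahalanobis(X[1, ], X[2, ], cov(X)))
```

This also illustrates why Mahalanobis-type distances fail in the under-sample setting: with more features than units, cov(X) is singular and solve() fails.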
Next, we briefly comment on the other distances included in the package.
The Bhattacharyya distance (Bhattacharyya 1946) is a very well-known
distance between populations in the genetic context. Each population is
characterized by a vector
The Gower distance (Gower 1971), used for mixed variables, is defined by:
The daisy() function in the cluster package can also calculate the
Gower distance for mixed variables. The difference between that function
and dGower() in WeDiBaDis is that in daisy() the distance is calculated
as , and in dGower() as . Moreover, dGower() allows us to include
missing values (such as NA) and therefore calculates distances based on
Gower's weighted similarity coefficients. The dGower() function improves
the dgower() function included in the ICGE package (Irigoien et al. 2013).
The Bray-Curtis distance (Bray and Curtis 1957) is one of the most well-known ways of
quantifying the difference between samples when the information is
ecological abundance data collected at different sampling locations. It
is defined by:
The Hellinger (Rao 1995) and Orloci (or chord distance) (Orloci 1967)
distances are also measures recommended for quantifying differences
between sampling locations when the ecological abundance of species is
collected. The Hellinger distance is given by:
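For reference, the textbook forms of these two abundance dissimilarities can be written in a few lines of R (the normalizations used by the package's dBrayCurtis() and dHellinger() implementations may differ):

```r
# Bray-Curtis: sum of absolute differences divided by the total abundance.
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Hellinger: Euclidean distance between square roots of relative abundances.
hellinger <- function(x, y)
  sqrt(sum((sqrt(x / sum(x)) - sqrt(y / sum(y)))^2))

x <- c(10, 0, 5); y <- c(4, 1, 5)
bray_curtis(x, y)              # 7 / 25 = 0.28
hellinger(x, x)                # 0: identical samples
hellinger(c(1, 0), c(0, 1))    # sqrt(2): maximally different samples
```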
The Prevosti distance (Prevosti et al. 1975) is a very useful genetic distance
between units representing populations. In this case, we consider that
genetic data are stored in a table where the rows represent the
populations and the columns represent the potential allelic states
grouped by loci. The distance between two units at a single locus
As we explain in the next section, WeDiBaDis allows the user to
introduce alternative distances by means of a distance matrix.
Therefore, the user can work with any distance matrix that is considered
appropriate for their data set and analysis. For this reason, no more
distances were included in our package.
We have developed the WeDiBaDis package to implement both the
DB-discriminant and the new WDB-discriminant. It can be used with
different distance functions, and NA values are allowed. When a unit
has an NA value in some features, those features are excluded from the
computation of the distances for that unit, and the computation is
scaled up to the number of features used.
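For instance, for the Euclidean distance this NA handling can be sketched as follows (the exact scaling used internally by the package is an assumption here):

```r
# Euclidean distance between two units with possible NAs: features where
# either unit is missing are dropped, and the sum of squared differences
# over the m observed features is scaled up to the full p features.
euclid_na <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)   # features observed in both units
  p <- length(x)
  m <- sum(ok)
  sqrt(sum((x[ok] - y[ok])^2) * p / m)
}

euclid_na(c(1, 2, NA, 4), c(1, 0, 3, 1))  # based on 3 of the 4 features
```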
The principal function is WDBdisc
with arguments:
WDBdisc(data, datatype, classcol, new.ind, distance, type, method)
where:
data: a data matrix or a distance matrix. If the Prevosti distance will be used, data must be a named matrix where the names of the locus and the allele are separated by a dot (LociName.AlleleName).
datatype: if the data is a data matrix, datatype = "m"; if the data is a distance matrix, datatype = "d".
classcol: a number indicating which column in the data contains the class variable. By default the class variable is in the first column.
new.ind: only required if there are new unlabeled units to be classified. If datatype = "m", it is a matrix containing the feature values for the new units to be classified; if datatype = "d", it is a matrix containing the distances between the new units to be classified and the units in the classes.
distance: the distance measure to be used. This must be either "euclidean" (default option), "correlation", "Bhattacharyya", "Gower", "Mahalanobis", "BrayCurtis", "Orloci", "Hellinger", or "Prevosti".
type: only required if distance = "Gower". The value for type is a list (e.g., type = list(cuant, nom, bin)) indicating the positions of the columns for continuous (cuant), nominal (nom), and binary (bin) features, respectively.
method: the discriminant method to be used. This must be either "DB" or "WDB" for the DB-discriminant and WDB-discriminant, respectively. The default method is "WDB".
The function returns an object with associated plot
and summary
methods offering:
The classification table obtained with the leave-one-out cross-validation.
The total correct classification rate in percentage (
The generalized squared correlation (
The sensitivity, specificity, and positive predictive values for
each class (
The
The assigned class for new unlabeled units to be classified.
A barplot for the classification table.
A barplot for the sensitivity, specificity, and positive predictive values for each class.
Moreover, given a data set, the distances commented on in the previous
section can be obtained through the functions dcor (correlation
distance), dMahal (Mahalanobis distance), dBhatta (Bhattacharyya
distance), dGower (Gower distance), dBrayCurtis (Bray-Curtis distance),
dHellinger (Hellinger distance), dOrloci (Orloci distance), and
dPrevosti (Prevosti distance).
We consider the data from Fielding (2007), which relate to the core area (the region close to the nest) of the golden eagle Aquila chrysaetos in three regions of western Scotland. The data consist of eight habitat variables: POST (mature planted conifer forest in which the tree canopy has closed); PRE (pre-canopy closure planted conifer forest); BOG (flat waterlogged land); CALL (Calluna (heather) heath land); WET (wet heath, mainly purple moor grass); STEEP (steeply sloping land); LT200 (land below 200 m), and L4-600 (land between 200 and 400 m). The values are the amounts of each habitat variable, measured as the number of four-hectare grid cells within a region defined as a "core area". In order to evaluate whether the habitat variables allow discrimination between these three regions, a WDB-discriminant analysis using the Euclidean distance may be performed with the following instructions:
library(WeDiBaDis)
out <- WDBdisc(data = datafile, datatype = "m", classcol = 1)
The summary method shows, as usual, more complete information:
summary(out)
Discriminant method: WDB
Leave-one-out confusion matrix:
Predicted
Real 1 2 3
1 7 0 0
2 0 14 2
3 2 0 15
Total correct prediction: 90%
Generalized squared correlation: 0.7361
Cohen's Kappa coefficient: 0.84375
Sensitivity for each class:
1 2 3
100.00 87.50 88.24
Predictive value for each class:
1 2 3
77.78 100.00 88.24
Specificity for each class:
1 2 3
87.88 91.67 91.30
F1-score for each class:
1 2 3
87.50 93.33 88.24
------ ------ ------ ------ ------ ------
No predicted individuals
As we can observe, perfect classification is obtained for the samples from region 1. For regions 2 and 3, only two samples each were not correctly classified.
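The agreement figures in this output can be reproduced from the leave-one-out confusion matrix. For example, Cohen's kappa compares the observed agreement with the agreement expected by chance; a minimal sketch (not the package's code):

```r
# Cohen's kappa from a confusion matrix (rows = real, columns = predicted).
cohen_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                      # observed agreement
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement
  (po - pe) / (1 - pe)
}

cm <- matrix(c(7, 0,  0,
               0, 14, 2,
               2, 0, 15), nrow = 3, byrow = TRUE)
cohen_kappa(cm)  # 0.84375, matching the summary above
```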
If we want to obtain the barplot for the classification table (see Figure 1), we use the command
plot(out)
These commands generate the sensitivity, specificity, and positive predictive values barplot (see Figure 2):
outplot <- summary(out, show = FALSE)
plot(outplot)
Finally, to perform a DB-discriminant analysis using a distance other than the Euclidean, the following commands are used:
library(WeDiBaDis)
out <- WDBdisc(data = datafile, datatype = "m", distance = "name of the distance",
method = "DB", classcol = 1)
summary(out)
plot(out)
outplot <- summary(out, show = FALSE)
plot(outplot)
The chromosomal polymorphism for inversions is very useful to
characterize the natural populations of Drosophila subobscura.
Furthermore, lethal genes located in chromosomal inversions allow the
understanding of important evolutionary events. We consider the data
from a study of 40 samples of this polymorphism for the O chromosome of
this species (Solé et al. 2000; Balanyà et al. 2004; Mestres et al. 2009). Four groups can be considered:
NLE with 16 non-lethal European samples, LE with 4 lethal European
samples, NLA with 14 non-lethal American samples, and LA with 6 lethal
American samples. In this example, two samples, one from group NLA and
one from group NLE, were randomly selected and considered as new
unlabeled units to be classified. The Bhattacharyya distances between
all pairs of units were calculated. Therefore, the input for the
WDBdisc function is a distance matrix dat, with datatype = "d". The
matrix xnew is a two-row matrix where each row contains the distances
between one of the new unlabeled units to be classified and the units
in the four classes. In this situation, the commands to call the WDB
procedure to classify the xnew units and to obtain the available
graphics in the package are:
library(WeDiBaDis)
out <- WDBdisc(data = dat, datatype = "d", classcol = 1, new.ind = xnew)
plot(out)
outplot <- summary(out, show = FALSE)
plot(outplot)
The summary
method shows the following information. We can see that
the xnew
units were correctly classified:
summary(out)
Discriminant method: WDB
Leave-one-out confusion matrix:
Predicted
Real LA LE NLA NLE
LA 6 0 0 0
LE 0 3 0 1
NLA 0 0 13 0
NLE 0 3 0 12
Total correct prediction: 89.47%
Generalized squared correlation: 0.7442
Cohen's Kappa coefficient: 0.8509804
Sensitivity for each class:
LA LE NLA NLE
100.00 75.00 100.00 80.00
Predictive value for each class:
LA LE NLA NLE
100.00 50.00 100.00 92.31
Specificity for each class:
LA LE NLA NLE
87.50 91.18 84.00 95.65
F1-score for each class:
LA LE NLA NLE
100.00 60.00 100.00 85.71
------ ------ ------ ------ ------ ------
Prediction for new individuals:
Pred. class
1 "NLE"
2 "NLA"
Now, the two new unlabeled units were correctly classified. The barplots are in Figure 3 and Figure 4, respectively.
The package contains some example data files, each with a corresponding
explanation. The data sets are corearea, containing the data for the
example presented in the subsection Example 1: Ecological data;
abundances, a simulated abundance data matrix; and microsatt, a data
set containing allele frequencies for 18 cattle breeds (bull or zebu),
of French and African descent, typed on 9 microsatellites.
To illustrate the time consumed by the WDB procedure, which requires more computation than DB, we performed the following simulation with artificial data. We generated multinormal samples containing 50, 100, 200, 300, …, 900, 1000, 2000, and 3000 units, respectively. Then, for each sample size we created sets containing 50, 100, 500, 1000, 1500, 2000, 2500, …, 4500, and 5000 features, respectively. For each combination of sample size and number of features, we considered 2, 3, 4, and 10 classes. All the computations presented in this paper were performed on a personal computer with an Intel(R) Core(TM) i5-2450M processor and 6 GB of memory, using a single 2.50 GHz CPU. The results of the simulation for two classes are displayed in Figure 5, where the elapsed time (the actual elapsed time since the process started) is reported in seconds. We can observe that the runtime is mainly affected by the number of units (Figure 5, top), but very little by the number of variables (Figure 5, bottom). This is expected, as the procedure is based on distances and therefore the dimension of the distance matrix (the number of units) determines the required runtime. The number of classes also affects the runtime, although the increase with the number of classes is very slight. For example, with 300 units and 4000 variables, the elapsed times for 2, 3, 4, and 10 classes are 3.38, 3.40, 3.62, and 4.82 seconds, respectively.
In order to compare the performance of the DB and WDB procedures,
thirty-eight available cancer data sets were considered in our analysis
(Table 1). These are available at
http://bioinformatics.rutgers.edu/Static/Supplements/CompCancer/datasets.htm
and in Lê Cao et al. (2010). As we can observe in Table 1, three of them
include clinical features and some of the data sets have unbalanced
classes. We performed the evaluation for the DB and WDB classifiers using
the leave-one-out procedure. We present the total misclassification rate
ID | Data set | Classes | Units | Class sizes (%) | Features | Continuous | Clinical
---|---|---|---|---|---|---|---
1 | Alizadeh-2000-v1 | 2 | 42 | 21(50%), 21(50%) | 1095 | 1095 | |
2 | Alizadeh-2000-v2 | 3 | 62 | 42(67.74%), 9(14.52%), 11(17.74%) | 2093 | 2093 | |
3 | Armstrong-2002-v1 | 2 | 72 | 24(33.33%), 48(66.67%) | 1081 | 1081 | |
4 | Armstrong-2002-v2 | 3 | 72 | 24(33.33%), 20(27.78%), 28(38.89%) | 2194 | 2194 | |
5 | Bhattacharjee-2001 | 5 | 203 | 139(68.47%), 17(8.37%), 6(2.96%), | 1543 | 1543 | |
21(10.34%), 20(9.85%) | |||||||
6 | Bittner-2000-V1 | 2 | 38 | 19(50%), 19(50%) | 2201 | 2201 | |
7 | Bittner-2000-V2 | 3 | 38 | 19(50%), 12(31.58%), 7(18.42%) | 2201 | 2201 | |
8 | Breast | 2 | 256 | 75(29.30%), 181(70.70%) | 5545 | 5537 | 8 |
9 | Bredel-2005 | 3 | 50 | 31(62%), 14(28%), 5(10%) | 1739 | 1739 | |
10 | Chen-2002 | 2 | 179 | 104(58.10%), 75(41.90%) | 85 | 85 | |
11 | Chowdary-2006 | 2 | 104 | 62(59.62%), 42(40.38%) | 182 | 182 |
12 | CNS | 2 | 60 | 21(35%), 39(65%) | 7134 | 7128 | 6 |
13 | Dyrskjot-2003 | 3 | 40 | 9(22.5%), 20(50%), 11(27.5%) | 1203 | 1203 | |
14 | Garber-2001 | 4 | 66 | 17(25.76%), 40(60.61%), 4(6.06%), 5(7.58%) | 4553 | 4553 | |
15 | Golub-1999-v1 | 2 | 72 | 47(65.28%), 25(34.72%) | 1877 | 1877 | |
16 | Golub-1999-v2 | 3 | 72 | 38(52.78%), 9(12.5%), 25(34.72%) | 1877 | 1877 | |
17 | Gordon-2002 | 2 | 181 | 31(17.13%), 150(82.87%) | 1626 | 1626 | |
18 | Khan-2001 | 4 | 83 | 29(34.94%), 11(13.25%), 18(21.69%), 25(30.12%) | 1069 | 1069 | |
19 | Laiho-2007 | 2 | 37 | 8(21.62%), 29(78.38%) | 2202 | 2202 | |
20 | Lapointe-2004-v1 | 3 | 69 | 11(15.94%), 39(56.52%), 19(27.54%) | 1625 | 1625 | |
21 | Lapointe-2004-v2 | 4 | 110 | 11(10%), 39(35.45%), 19(17.27%), 41(37.27%) | 2496 | 2496 | |
22 | Liang-2005 | 3 | 37 | 28(75.67%), 6(16.22%), 3(8.11%) | 1411 | 1411 | |
23 | Nutt-2003-v1 | 4 | 50 | 14(28%), 7(14%), 14(28%), 15(30%) | 1377 | 1377 |
24 | Nutt-2003-v2 | 2 | 28 | 14(50%),14(50%) | 1070 | 1070 | |
25 | Nutt-2003-v3 | 2 | 22 | 7(31.82%),15(68.18%) | 1152 | 1152 | |
26 | Pomeroy-2002-v1 | 2 | 34 | 25(73.53%), 9(26.47%) | 857 | 857 | |
27 | Pomeroy-2002-v2 | 5 | 42 | 10(23.81%), 10(23.81%), 10(23.81%), 4(9.52%) | 1379 | 1379 | |
8(19.05%) | |||||||
28 | Prostate | 2 | 79 | 37(46.84%), 42(53.16%) | 7892 | 7884 | 8 |
29 | Ramaswamy-2001 | 14 | 190 | 11(5.79%), 11(5.79%), 20(10.53%), 11(5.79%), | 1363 | 1363 | |
30(15.79%), 11(5.79%), 22(11.28%), 11(5.79%), | |||||||
10(5.26%),11(5.79%), 11(5.79%), 10(5.26%), | |||||||
11(5.79%), 10(5.26%) | |||||||
30 | Risinger-2003 | 4 | 42 | 13(30.95%), 3(7.14%), 19(45.24%), 7(16.67%) | 1771 | 1771 | |
31 | Shipp-2002-v1 | 2 | 77 | 58(75.32%), 19(24.67%) | 798 | 798 | |
32 | Singh-2002 | 2 | 102 | 50(49.02%), 52(50.98%) | 339 | 339 | |
33 | Su-2001 | 10 | 174 | 8(4.60%), 26(14.94%), 23(13.22%), 12(6.90%), | 1571 | 1571 | |
11(6.32%), 7(4.02%), 28(16.09%), 27(15.52%), | |||||||
6(3.45%), 26(14.94%) | |||||||
34 | Tomlins-2006-v1 | 5 | 104 | 27(25.96%), 20(19.23%), 32(30.77%), 13(12.5%), | 2315 | 2315 | |
12(11.54%) |||||||
35 | Tomlins-2006-v2 | 4 | 92 | 27(26.35%), 20(21.74%), 32(34.78%), 13(14.13%) | 1288 | 1288 | |
36 | West-2001 | 2 | 49 | 25(51.02%), 24 (48.98%) | 1198 | 1198 | |
37 | Yeoh-2002-v1 | 2 | 248 | 43(17.34%), 205(82.66%) | 2526 | 2526 | |
38 | Yeoh-2002-v2 | 6 | 248 | 15(6.05%), 27(10.89%), 64(25.81%), 20(8.06%), | 2526 | 2526 | |
43(17.34%), 79(31.85%) |
ID | Misclassification rate (%) | | Generalized squared correlation | |
DB | WDB | DB | WDB | ||
1 | 7.14 | 7.14 | 0.74 | 0.74 | |
2 | 1.61 | 0.00 | 0.94 | 1.00 | |
3 | 8.33 | 5.56 | 0.68 | 0.77 |
4 | 4.17 | 4.17 | 0.88 | 0.88 | |
5 | 19.21 | 15.27 | 0.49 | 0.56 | |
6 | 13.16 | 13.16 | 0.56 | 0.56 | |
7 | 36.84 | 36.84 | 0.25 | 0.25 | |
8 | 32.81 | 30.47 | 0.11 | 0.13 | |
9 | 18.00 | 18.00 | 0.34 | 0.34 | |
10 | 11.17 | 8.94 | 0.61 | 0.67 | |
11 | 18.27 | 9.62 | 0.42 | 0.64 | |
12 | 41.67 | 38.33 | 0.01 | 0.01 | |
13 | 15.00 | 12.50 | 0.58 | 0.65 | |
14 | 21.21 | 28.79 | 0.38 | 0.19 | |
15 | 6.94 | 4.17 | 0.72 | 0.82 | |
16 | 6.94 | 6.94 | 0.81 | 0.81 | |
17 | 12.71 | 13.26 | 0.47 | 0.42 | |
18 | 1.20 | 1.20 | 0.97 | 0.97 | |
19 | 21.62 | 21.62 | 0.23 | 0.23 | |
20 | 31.88 | 30.43 | 0.23 | 0.26 | |
21 | 30.91 | 30.91 | 0.34 | 0.34 | |
22 | 13.51 | 10.81 | 0.72 | 0.76 | |
23 | 32.00 | 34.00 | 0.40 | 0.33 | |
24 | 17.86 | 10.71 | 0.43 | 0.65 | |
25 | 4.55 | 9.09 | 0.80 | 0.67 | |
26 | 29.41 | 20.59 | 0.12 | 0.16 | |
27 | 16.67 | 21.43 | 0.65 | 0.63 | |
28 | 34.18 | 34.18 | 0.10 | 0.10 | |
29 | 36.84 | 29.47 | 0.44 | 0.53 | |
30 | 28.57 | 26.19 | 0.36 | 0.00 | |
31 | 29.87 | 12.99 | 0.24 | 0.48 | |
32 | 30.39 | 30.39 | 0.18 | 0.16 | |
33 | 20.11 | 16.67 | 0.63 | 0.70 | |
34 | 17.31 | 21.15 | 0.66 | 0.58 | |
35 | 23.91 | 26.09 | 0.46 | 0.41 | |
36 | 20.41 | 14.29 | 0.35 | 0.52 | |
37 | 1.61 | 2.02 | 0.89 | 0.87 | |
38 | 21.77 | 24.60 | 0.57 | 0.00 |
The package WeDiBaDis, available at
https://github.com/ItziarI/WeDiBaDis, is an implementation of two
discriminant analysis procedures in an R environment. The classifiers
are the Distance-Based (DB) and the new proposed procedure
Weighted-Distance-Based (WDB). These are useful for solving the
classification problem for high-dimensional data sets with mixed
features or when the input information is a distance matrix. This
software provides functions to compute both discriminant procedures and
to assess the performance of the classification rules. It offers: the
leave-one-out classification table; the generalized squared correlation;
the sensitivity, specificity, and positive predictive value for each
class; the
This research was partially supported as follows: I.I. by the Spanish 'Ministerio de Economia y Competitividad' (TIN2015-64395-R), by the Basque Government Research Team Grant (IT313-10), SAIOTEK Project SA-2013/00397, and by the University of the Basque Country UPV/EHU (Grant UFI11/45, BAILab); F.M. by the Spanish 'Ministerio de Economia y Competitividad' (CTM2013-48163) and by Grant 2014 SGR 336 from the Departament d'Economia i Coneixement de la Generalitat de Catalunya; C.A. by the Spanish 'Ministerio de Economia y Competitividad' (SAF2015-68341-R and TIN2015-64395-R) and by Grant 2014 SGR 464 (GRBIO) from the Departament d'Economia i Coneixement de la Generalitat de Catalunya.
Cluster, Environmetrics, Phylogenetics, Psychometrics, Robust, Spatial
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Irigoien, et al., "Weighted Distance Based Discriminant Analysis: The R Package WeDiBaDis", The R Journal, 2017
BibTeX citation
@article{RJ-2016-057, author = {Irigoien, Itziar and Mestres, Francesc and Arenas, Concepcion}, title = {Weighted Distance Based Discriminant Analysis: The R Package WeDiBaDis}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {8}, issue = {2}, issn = {2073-4859}, pages = {434-450} }