Most classification algorithms deal with datasets which have a set of input features, the variables to be used as predictors, and only one output class, the variable to be predicted. However, in recent years many scenarios have emerged in which the classifier has to work with several outputs, among them automatic labeling of text documents, image annotation, and protein classification. Multilabel datasets are the product of these new needs, and they have many specific traits. The mldr package allows the user to load datasets of this kind, obtain their characteristics, produce specialized plots, and manipulate them. The goal is to provide the exploratory tools needed to analyze multilabel datasets, as well as the transformation and manipulation functions that make it possible to apply binary and multiclass classification models to this data or to develop new multilabel classifiers. Thanks to its integrated user interface, the exploratory functions are available even to non-specialized R users.
Pattern classification is an important task nowadays, in use everywhere from our e-mail client, which is able to separate spam from legitimate messages, to credit institutions, which rely on it to detect fraud and to grant or deny loans. All these cases operate with binary datasets (a message is either spam or legitimate) or multiclass datasets (a loan is safe, medium, risky or highly risky, for instance). In both cases the user expects only one output.
The huge growth in the amount of information stored on the web in recent years, such as blog posts, pictures taken from cameras and phones, videos hosted on YouTube, and messages on social networks, has led to more complex classification work. A blog post can be classified into several non-exclusive categories, for instance news, economy and politics simultaneously. A picture can be assigned a set of labels, such as landscape, sky and forest. A video can be labeled with several music genres at once, etc. All of these are examples of problems in need of multilabel classification.
Binary and multiclass datasets can be managed in R by using data frames. Usually the last attribute (column of the “data.frame”) is the output class, which might contain only TRUE/FALSE values or values belonging to a finite set (a factor). Multilabel datasets (MLDs) can also be stored in an R “data.frame”, but an additional structure indicating which attributes are output labels is needed. Moreover, this kind of dataset has many specific characteristics that do not exist in traditional ones. The average number of labels per instance, the imbalance ratio for each label, the number of labelsets (sets of labels assigned to each row) and their frequencies, and the level of concurrence among imbalanced labels are some of the traits that differentiate an MLD from other datasets.
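As a simple illustration (all names here are hypothetical, not part of any package), an MLD stored as a plain “data.frame” just needs an accompanying record of which columns act as labels:

> # Two numeric input features plus two 0/1 label columns
> toy <- data.frame(f1 = rnorm(4), f2 = rnorm(4),
+                   news = c(1, 0, 1, 0), sports = c(0, 0, 1, 1))
> labelIndices <- c(3, 4)  # the additional structure: which columns are labels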
Until now, most of the software to work with MLDs has been written in Java. The two best known frameworks are MULAN (Tsoumakas et al. 2011) and MEKA (Read and Reutemann 2012). Both implementations rely on WEKA (Hall et al. 2009), which offers a large variety of binary and multiclass classifiers, as well as functions needed to deal with ARFF (Attribute-Relation File Format) files. Most of the existing MLDs are stored in ARFF format. MULAN and MEKA provide the specialized tools needed to deal with multilabel ARFFs, and the infrastructure to build multilabel classifiers (MLCs). Although R can access WEKA functionality through the RWeka (Hornik et al. 2009) package, handling MLDs is far from an easy task in R. This has been the main motivation behind the development of the mldr package. To the best of our knowledge, mldr is the first R package aimed at easing the work with multilabel data.
The mldr package aims to provide the user with the functions needed to perform exploratory analysis of MLDs, determining their main traits both statistically and visually. Moreover, it also brings the proper tools to manipulate this kind of datasets, including the application of the most common transformation methods, BR (Binary Relevance) and LP (Label Powerset), that will be described in the following section. These would be the foundation for processing MLDs with traditional classifiers, as well as for developing new multilabel algorithms.
The mldr package does not depend on the RWeka package, and it is not linked to MULAN nor MEKA. It has been designed to allow reading both MULAN and MEKA MLDs, but without any external dependencies. In fact, it would be possible to load MLDs stored in other file formats, as well as creating them from scratch. When loaded, MLDs are wrapped in an S3 type object with class “mldr”, which allows for the use of methods. The object will contain the data in the MLD and also a large set of measures obtained from it. The functions provided by the package ease the access to this information, produce some specific plots, and make possible the manipulation of its content. A web-based graphical user interface, developed using the shiny (Chang et al. 2015) package, puts the exploratory analysis tools of the mldr package at the fingertips of all users, even those who have little experience using R.
In the following section the foundations related to MLDs and MLCs will be briefly introduced. After that, the structure of the mldr package and the operations it provides will be explained. Finally, the user interface provided by mldr to ease exploratory analysis tasks over MLDs will be shown. All code displayed in this paper is available in a vignette, accessible by entering vignette("mldr", package = "mldr").
MLDs are generated from text documents (Klimt and Yang 2004), sets of images (Duygulu et al. 2002), music collections, and protein attributes (Diplaris et al. 2005), among other sources. For each sample a set of features (input attributes) is collected, and a set of labels (the output labelset) is assigned. Usually there are several hundreds or even thousands of attributes, and it is not rare for an MLD to have more labels than features. Some MLDs have only a few labels per instance, while others have dozens of them. In some MLDs the number of label combinations (labelsets) is quite small, whereas in others it can be very large. Most MLDs are imbalanced, which means that some labels are very frequent while others are scarcely represented. The labels in an MLD can be correlated or not. Moreover, frequent labels and rare labels can appear together in the same instances.
As can be seen, a lot of different scenarios can be found depending on the MLD characteristics. This is the reason why several specific measures have been designed to assess MLD traits (Tsoumakas et al. 2010), since they can have a serious impact on the MLC’s performance. The following two subsections introduce several of these measures and some of the approaches pursued to face multilabel classification.
The most common characterization measures for MLDs can be grouped into four categories, as depicted in Figure 1.
The most basic information that can be obtained from an MLD is the number of instances, attributes and labels. For any MLD containing \(\lvert D\rvert\) instances, any instance \(D_i, i \in \{1..\lvert D\rvert \}\) will be the union of a set of attributes and a set of labels (\(X_i\), \(Y_i\)), \(X_i \in X^1\times X^2\times \dots\times X^f, Y_i \subseteq L\), where \(f\) is the number of input features and \(X^j\) is the space of possible values for the \(j\)-th attribute, \(j \in \{1..f\}\). \(L\) being the full set of labels used in \(D\), \(Y_i\) could be any subset of items in \(L\). Therefore, theoretically the number of potential labelsets could be \(2^{\lvert L\rvert}\). In practice this number tends to be limited by \(\lvert D\rvert\).
Each instance \(D_i\) has an associated labelset, whose length (number of active labels) can be in the range \(\{0..\lvert L\rvert\}\). The average number of active labels per instance is the most basic measure of any MLD, usually known as Card (standing for cardinality). It is calculated as shown in Equation (1). Dividing this measure by the number of labels in \(L\), as shown in Equation (2), results in a dimensionless measure, known as Dens (standing for label density).
\[\begin{aligned} \label{eq:Card} Card\left(D\right) &= \frac{1}{\lvert D\rvert} \displaystyle\sum\limits_{i=1}^{\lvert D\rvert} \lvert Y_i\rvert, \end{aligned} \tag{1} \]
\[\begin{aligned} \label{eq:Dens} Dens\left(D\right) &= \frac{1}{\lvert D\rvert} \displaystyle\sum\limits_{i=1}^{\lvert D\rvert} \frac{\lvert Y_i\rvert}{\lvert L\rvert}. \end{aligned} \tag{2} \]
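Both measures are straightforward to obtain from a 0/1 label matrix. Below is a minimal sketch, independent of the mldr package, in which Y is a hypothetical \(\lvert D\rvert \times \lvert L\rvert\) indicator matrix:

> Y <- matrix(c(1, 0, 1, 1, 1, 0), nrow = 3)  # hypothetical label matrix
> card <- mean(rowSums(Y))  # Equation (1): mean number of active labels
> dens <- card / ncol(Y)    # Equation (2): Card divided by |L|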
Most multilabel datasets are imbalanced, meaning that some of the labels are very frequent whereas others are quite rare. The level of imbalance of a label can be measured by the imbalance ratio, IRLbl, defined in Equation (3). To know how much imbalance there is in \(D\), the MeanIR measure (Charte et al. 2015) is calculated as the mean imbalance ratio among all labels, as shown in Equation (4). In order to know the significance of this last measure, the standard CV (Coefficient of Variation, Equation (5)) can be used.
\[\begin{aligned} \label{eq:IRLbl} \textit{IRLbl}\left(y\right) &= \frac{ \max\limits_{y'\in L} \left(\displaystyle\sum\limits_{i=1}^{\lvert D\rvert}{h\left(y', Y_i\right)}\right) } { \displaystyle\sum\limits_{i=1}^{\lvert D\rvert}{h\left(y, Y_i\right)}} \quad h\left(y, Y_i\right) = \begin{cases} 1 & y \in Y_i \\ 0 & y \notin Y_i \end{cases}, \end{aligned} \tag{3} \]
\[\begin{aligned} \label{eq:MeanIR} \textit{MeanIR} &= \frac{1}{\lvert L\rvert} \displaystyle\sum\limits_{y\in L}\textit{IRLbl}\left(y\right), \end{aligned} \tag{4} \]
\[\begin{aligned} \label{eq:CV} \textit{CV} &= \frac{\textit{IRLbl}\sigma}{\textit{MeanIR}}\quad \textit{IRLbl}\sigma = \sqrt{ \displaystyle\sum\limits_{y\in L}^{}{ \frac{\left(\mathit{IRLbl\left(y\right) - MeanIR}\right)^2}{\lvert L\rvert-1} } }. \end{aligned} \tag{5} \]
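Keeping the same hypothetical indicator matrix Y, these imbalance measures could be sketched as follows (the sketch assumes every label appears at least once, and relies on R's sd() using the \(\lvert L\rvert-1\) denominator of Equation (5)):

> counts <- colSums(Y)           # occurrences of each label
> IRLbl <- max(counts) / counts  # Equation (3), one value per label
> MeanIR <- mean(IRLbl)          # Equation (4)
> CV <- sd(IRLbl) / MeanIR       # Equation (5)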
The number of different labelsets, as well as the amount of them being unique labelsets (i.e., appearing only once in \(D\)), give us an idea of how sparsely the labels are distributed. The labelsets by themselves indicate how the labels in \(L\) are related. A very frequent labelset implies that the labels in it tend to appear jointly in \(D\). The SCUMBLE measure, introduced in Charte et al. (2014) and shown in Equation (7), is used to assess the concurrence level among frequent and infrequent labels.
\[\begin{aligned} \label{eq:SCUMBLEIns} \textit{SCUMBLE}_{ins}\left(i\right) &= 1 - \frac{1}{\overline{\textit{IRLbl}_i}}\left(\prod\limits_{l=1}^{\lvert L\rvert} \textit{IRLbl}_{il}\right)^{\left(1/\lvert L\rvert\right)}, \end{aligned} \tag{6} \]
\[\begin{aligned} \label{eq:SCUMBLE} \textit{SCUMBLE}\left(D\right) &= \frac{1}{\lvert D\rvert} \displaystyle\sum\limits_{i=1}^{\lvert D\rvert} \textit{SCUMBLE}_{ins}\left(i\right). \end{aligned} \tag{7} \]
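A minimal sketch of this measure follows, under one reading of Equation (6) in which the mean and product run over the labels active in each instance (with \(\lvert L\rvert\) in the exponent acting as the number of such labels); the function name is hypothetical:

> scumble <- function(Y, IRLbl) {
+   per_instance <- apply(Y, 1, function(yi) {
+     ir <- IRLbl[yi == 1]               # IRLbl values of the active labels
+     if (length(ir) == 0) return(0)     # no active labels, no concurrence
+     1 - (1 / mean(ir)) * prod(ir)^(1 / length(ir))  # Equation (6)
+   })
+   mean(per_instance)                   # Equation (7)
+ }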
Besides the aforementioned insights, some other interesting traits can be obtained indirectly from the previous measures, such as the ratio between input features and output labels, the maximum IRLbl, or the coefficient of variation of the imbalance levels, among others.
Although the raw numbers given by these calculations describe the nature of any multilabel dataset to a good level, in general a visualization of its characteristics is desirable to ease its interpretation by researchers.
The information obtained from the previous measures depicts the characteristics of the dataset. These insights, along with other factors such as the loss function used by the classifier, help in choosing the most appropriate algorithm to learn from it and, in the future, make predictions on new data. Traditional classification models, such as trees and support vector machines, are designed to give only one output as result. Multilabel classification can mainly be faced through two different approaches, discussed in the following.
Algorithm adaptation: The goal is to modify existing algorithms taking into account the multilabel nature of the samples, for instance hosting more than one class in the leaves of a tree instead of only one.
Problem transformation: This approach transforms the original data to make it suitable for traditional classification algorithms, then combines the obtained predictions to build the labelsets given as output result.
Although several transformation methods have been defined in the specialized literature, there are two among them that stand out because they are the foundation for many others:
Binary Relevance (BR): Introduced by Godbole and Sarawagi (2004) as an adaptation of OVA (One-vs-All) to the multilabel scenario, this method transforms the original multilabel dataset into several binary datasets, as many as there are different labels. In this way any binary classifier can be used by joining the individual predictions to generate the final output.
Label Powerset (LP): Introduced by Boutell et al. (2004), this method transforms the multilabel dataset into a multiclass dataset by using the labelset of each instance as class identifier. Any multiclass classifier can be used, transforming back the predicted class into a labelset.
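As a toy illustration of the LP idea (a conceptual sketch only, not the package's own implementation, which is presented later), the 0/1 label columns can be collapsed into a single multiclass factor:

> Y <- data.frame(l1 = c(1, 0, 1), l2 = c(0, 0, 1))      # hypothetical labels
> lp_class <- factor(apply(Y, 1, paste, collapse = ""))  # classes "10", "00", "11"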
BR and LP have been used not only as a direct technique to implement multilabel classifiers, but also as a base method to build more sophisticated algorithms. Several ensembles of binary classifiers relying on BR have been proposed, such as CC (Classifier Chains) or ECC (Ensemble of Classifier Chains), both by Read et al. (2011). The same is applicable to the LP transformation, the foundation of ensemble multilabel classifiers such as RAkEL (Random k-Labelsets for Multi-Label Classification, Tsoumakas and Vlahavas (2007)) and EPS (Ensemble of Pruned Sets, Read et al. (2008)).
For the readers interested in more details, a recent review on multilabel classification has been published by Zhang and Zhou (2014).
R is among the most used tools when it comes to performing data mining tasks, including binary and multiclass classification. However, the work with MLDs in R is not as easy as it is with classic datasets. This is the main motivation behind the development of the mldr package, whose goals and functionality are described in this section.
When we planned the development of this package, our main objective was to ease the exploration of MLDs in R. This included loading existing MLDs in different formats, as well as obtaining from them all the available information. These functions should be accessible to everyone, even to users more accustomed to GUIs (Graphical User Interfaces), such as those provided by the packages Rcmdr (aka R Commander, Fox (2005)) or rattle (Williams 2011), than to the R command line.
At the same time, we aimed to include the tools needed to manipulate MLDs, to apply filters and transformations, as well as to create MLDs from scratch. This functionality, directed at more experienced R users, opens the door to implementing other algorithms on top of mldr, for instance preprocessing methods or multilabel classifiers.
The mldr package is available from the Comprehensive R Archive Network (CRAN), therefore it can be installed as any other package, by simply typing:
> install.packages("mldr")
mldr depends on three R packages: XML (Lang and the CRAN Team 2015), circlize (Gu et al. 2014) and shiny. The first one allows reading XML (eXtensible Markup Language) files, the second one is used to generate a specific type of plot (described below), and the third one is the base of its user interface.
Older releases of mldr, as well as the development version, are available at http://github.com/fcharte/mldr. It is possible to install the development version using the install_github() function from devtools (Wickham and Chang 2015).
Once installed, the package has to be loaded before it can be used. This can be done through the library() or require() functions, as usual. After loading the package three sample MLDs will be available: birds, emotions and genbase. These are contained in the birds.rda, emotions.rda and genbase.rda files, which are lazily loaded along with the package.
The mldr package uses its own internal representation for MLDs, which are assigned the “mldr” class. Inside an “mldr” object, such as the previously mentioned emotions or birds, both the data in the MLD and all the information obtained from this data can be found.
Besides the three sample MLDs included in the package, the mldr() function allows loading any MLD stored in MULAN or MEKA file formats. Assuming that the files corel5k.arff and corel5k.xml, which hold the Corel5k (Duygulu et al. 2002) MLD in MULAN format, are in the current directory, the loading is done as follows:
> corel5k <- mldr("corel5k")
If the XML file is not available, it is possible to indicate just the number of labels in the MLD instead. In this case, the function assumes that the labels are at the end of the list of features. For instance:
> corel5k <- mldr("corel5k", label_amount = 374)
Loading an MLD in MEKA file format is equally easy. In this case there is no XML file with label information; instead, a special header inside the ARFF file provides it, a fact that is indicated to mldr() with the use_xml argument:
> imdb <- mldr("imdb", use_xml = FALSE)
In all cases the result, as long as the MLD can be correctly loaded and parsed, will be a new “mldr” object ready to use.
If the MLD we are interested in is not in MULAN or MEKA format, it first has to be loaded into a “data.frame”, for instance using functions such as read.csv(), read.table() or a more specialized reader; then this “data.frame”, along with an integer vector stating the indices of the labels inside it, is given to the mldr_from_dataframe() function. This is a general function for creating an “mldr” object from any “data.frame”, so it can also be used to generate new MLDs on the fly, as shown in the following example:
> df <- data.frame(matrix(rnorm(1000), ncol = 10))
> df$Label1 <- c(sample(c(0,1), 100, replace = TRUE))
> df$Label2 <- c(sample(c(0,1), 100, replace = TRUE))
> mymldr <- mldr_from_dataframe(df, labelIndices = c(11, 12), name = "testMLDR")
This will assign to mymldr an MLD, named testMLDR, with 10 input attributes and 2 labels.
After loading any MLD, a quick summary of its main characteristics can be obtained by means of the usual summary() function, as shown below:
> summary(birds)
 num.attributes num.instances num.labels num.labelsets num.single.labelsets
            279           645         19           133                   73
  max.frequency cardinality    density   meanIR    scumble num.inputs
            294    1.013953 0.05336597 5.406996 0.03302765        260
Any of these measures can be individually obtained through the measures element of the “mldr” class, like this:
> emotions$measures$num.attributes
[1] 78

> genbase$measures$scumble
[1] 0.0287591
Full information about the labels in the MLD, including the number of times they appear, their IRLbl and SCUMBLE measures, can be retrieved by using the labels element of the “mldr” class:
> birds$labels
                          index count        freq     IRLbl    SCUMBLE
Brown Creeper               261    14 0.021705426  7.357143 0.12484341
Pacific Wren                262    81 0.125581395  1.271605 0.05232609
Pacific-slope Flycatcher    263    46 0.071317829  2.239130 0.06361470
Red-breasted Nuthatch       264     9 0.013953488 11.444444 0.15744451
Dark-eyed Junco             265    20 0.031007752  5.150000 0.10248336
Olive-sided Flycatcher      266    14 0.021705426  7.357143 0.18493760
Hermit Thrush               267    47 0.072868217  2.191489 0.06777263
Chestnut-backed Chickadee   268    40 0.062015504  2.575000 0.06807452
Varied Thrush               269    61 0.094573643  1.688525 0.07940806
Hermit Warbler              270    53 0.082170543  1.943396 0.07999006
Swainson's Thrush           271   103 0.159689922  1.000000 0.11214301
Hammond's Flycatcher        272    28 0.043410853  3.678571 0.06129884
Western Tanager             273    33 0.051162791  3.121212 0.07273988
Black-headed Grosbeak       274     9 0.013953488 11.444444 0.20916487
Golden Crowned Kinglet      275    37 0.057364341  2.783784 0.09509474
Warbling Vireo              276    17 0.026356589  6.058824 0.14333613
MacGillivray's Warbler      277     6 0.009302326 17.166667 0.24337605
Stellar's Jay               278    10 0.015503876 10.300000 0.12151527
Common Nighthawk            279    26 0.040310078  3.961538 0.06520272
The same is applicable for labelsets and attributes, by means of the labelsets and attributes elements of the class.
To access the MLD content, attributes and label values, the print() function can be used, as well as the dataset element of the “mldr” object.
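For instance, both access paths can be used interchangeably (output omitted here for brevity):

> print(birds)             # displays the full MLD content
> birds$dataset[1:5, 1:6]  # first rows and columns of the underlying data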
Exploratory analysis of MLDs can be tedious, since most of them have thousands of attributes and hundreds of labels. The mldr package provides a plot() function specific for dealing with “mldr” objects, allowing the generation of several specific types of plots. The first argument given to plot() must be an “mldr” object, while the second one specifies the type of plot to be produced.
> plot(emotions, type = "LH")
There are seven different types of plots available: three histograms showing relations between instances and labels, two bar plots with similar purpose, a circular plot indicating types of attributes and a concurrence plot for labels. All of them are shown in Figure 2, generated by the following code:
> layout(matrix(c(1, 1, 7, 1, 1, 4, 5, 5, 4, 2, 6, 3), 4, 3, byrow = TRUE))
> plot(emotions, type = c("LC", "LH", "LSH", "LB", "LSB", "CH", "AT"))
The concurrence plot is the default one, with type "LC", and responds to the need of exploring interactions among labels, and specifically between majority and minority ones. This plot has a circular shape, with the circumference partitioned into several disjoint arcs representing labels. Each arc has length proportional to the number of instances where the label is present. These arcs are in turn divided into bands that join two of them, showing the relation between the corresponding labels. The width of each band indicates the strength of the relation, since it is proportional to the number of instances in which both labels appear simultaneously. In this manner, a concurrence plot can show whether imbalanced labels appear frequently together, a situation which could limit the possible improvement of a preprocessing technique (Charte et al. 2014).
Since drawing interactions among a lot of labels can produce a confusing result, this last type of plot accepts additional arguments: labelCount, which accepts an integer that will be used to generate the plot with that number of labels chosen at random; and labelIndices, which indicates exactly the indices of the labels to be displayed in the plot. For example, in order to plot the first ten labels of genbase:
> plot(genbase, labelIndices = genbase$labels$index[1:10])
The label histogram (type "LH") relates labels and instances in a way that shows how well-represented labels are in general. The X axis represents the number of instances and the Y axis the amount of labels. This means that if a large number of labels appear in very few instances, all data will be concentrated on the left side of the plot. On the contrary, if labels are generally present in many instances, data will tend to accumulate on the right side. This plot shows imbalance of labels when there is data accumulated on both sides of the plot, which implies that many labels are underrepresented and a large amount are overrepresented as well.
The labelset histogram (named "LSH") is similar to the former. However, instead of representing the number of instances in which each label appears, it shows the amount of labelsets. This indicates quantitatively whether labelsets repeat consistently or not among instances.
The label and labelset bar plots display exactly the number of instances for each one of the labels and labelsets, respectively. Their codes are "LB" for the label bar plot and "LSB" for the labelset one.
The cardinality histogram (type "CH") represents the amount of labels instances have in general. Therefore data accumulating on the right side of the plot indicates that instances do have a notable amount of labels, whereas data concentrating on the left side shows the opposite situation.
The attribute types plot (named "AT") is a pie chart displaying the number of labels, numeric attributes and finite set (character) attributes, thus showing the proportions between these types of attributes to ease the understanding of the amount of input information and that of output data.
Additionally, plot() accepts coloring arguments, col and color.function. The former can be used on all plot types except for the label concurrence plot, and must be a vector of colors. The latter is only used on the label concurrence plot and accepts a coloring function, such as rainbow or heat.colors, as can be seen in the following example:
> plot(emotions, type = "LC", color.function = heat.colors)
> plot(emotions, type = "LB", col = terrain.colors(emotions$measures$num.labels))
Manipulation of datasets is a crucial task in multilabel classification. Since transformation is one of the main approaches to tackle the problem, both BR and LP transformations are implemented in package mldr. They can be obtained using the mldr_transform function, which accepts an “mldr” object as first argument, the type of transformation, "BR" or "LP", as second, and an optional vector of label indices to be included in the transformation as last argument:
> emotionsbr <- mldr_transform(emotions, type = "BR")
> emotionslp <- mldr_transform(emotions, type = "LP", emotions$labels$index[1:4])
The BR transformation will return a list of “data.frame” objects, each one of them using one of the labels as class, whereas the LP transformation will return a single “data.frame” representing a multiclass dataset using each labelset as a class. Both of these transformations can be directly used in order to apply binary and multiclass classification algorithms, or even implement new ones.
> emo_lp <- mldr_transform(emotions, "LP")
> library(RWeka)
> classifier <- IBk(classLabel ~ ., data = emo_lp, control = Weka_control(K = 10))
> evaluate_Weka_classifier(classifier, numFolds = 5)
=== 5 Fold Cross Validation ===

=== Summary ===

Correctly Classified Instances         205               34.57   %
Incorrectly Classified Instances       388               65.43   %
Kappa statistic                          0.2695
Mean absolute error                      0.057
Root mean squared error                  0.1748
Relative absolute error                 83.7024 %
Root relative squared error             94.9069 %
Coverage of cases (0.95 level)          75.3794 %
Mean rel. region size (0.95 level)      19.574  %
Total Number of Instances              593
A filtering utility is included in the package as well. Using it is intuitive, since it can be called with the square bracket operator [. This allows an MLD to be partitioned or filtered according to a logical condition.
> emotions$measures$num.instances
[1] 593

> emotions[emotions$dataset$.SCUMBLE > 0.01]$measures$num.instances
[1] 222
Combined with the joining operator, +, this enables users to implement new preprocessing techniques that modify information in the MLD in order to improve classification results. For example, the following would be an implementation of an algorithm disabling majority labels on instances with highly imbalanced labels:
> mldbase <- mld[.SCUMBLE <= mld$measures$scumble]
> # Samples with co-occurrence of highly imbalanced labels
> mldhigh <- mld[.SCUMBLE > mld$measures$scumble]
> majIndexes <- mld$labels[mld$labels$IRLbl < mld$measures$meanIR, "index"]
> # Deactivate majority labels
> mldhigh$dataset[, majIndexes] <- 0
> mldbase + mldhigh # Join the instances without changes with the filtered ones
In this last example, the first two commands filter the MLD, separating the instances whose SCUMBLE is lower than the mean from those with a higher one. Then, the third line obtains the indices of the labels with an IRLbl lower than the mean; these are the majority labels of the dataset. Finally, these labels are set to 0 in the instances with high SCUMBLE, and the two partitions are joined again.
Lastly, another useful feature included in the mldr package is MLD comparison with the == operator. This indicates whether the two MLDs being compared share the same structure, which would mean they have the same attributes and these have the same types.
> emotions[1:10] == emotions[20:30]
[1] TRUE

> emotions == birds
[1] FALSE
Assuming that a set of predictions has been obtained for an MLD, e.g., through a set of binary classifiers, a multiclass classifier or any other algorithm, the next step would be to evaluate the classification performance. In the literature there exist more than 20 metrics for this task, and some of them are quite complex to calculate. The mldr package provides the mldr_evaluate function to accomplish this task, supplying both example based and label based metrics.
Multilabel evaluation metrics are grouped into two main categories: example based and label based metrics. Example based metrics are computed individually for each instance, then averaged to obtain the final value. Label based metrics are computed per label, instead of per instance. There are two approaches called micro-averaging and macro-averaging (described below). The output of the classifier can be a bipartition (i.e., a set of 0s and 1s denoting the predicted labels) or a ranking (i.e., a set of real values denoting the relevance of each label). For this reason, there are bipartition based and ranking based evaluation metrics for each one of the two previous categories.
\(D\) being the MLD, \(L\) the full set of labels used in \(D\), \(Y_i\) the subset of predicted labels for the i-th instance, and \(Z_i\) the true subset of labels, the example/bipartition based metrics returned by mldr_evaluate are the following:
Accuracy: It is defined (see Equation (8)) as the proportion of correctly predicted labels with respect to the total number of labels for each instance.
\[\label{eq:Accuracy} Accuracy = \frac{1}{\lvert D\rvert} \displaystyle\sum\limits_{i=1}^{\lvert D\rvert} \frac{\lvert Y_i \cap Z_i\rvert}{\lvert Y_i \cup Z_i\rvert}. \tag{8} \]
Precision: This metric is computed as indicated in Equation (9), giving as result the ratio of relevant labels predicted by the classifier.
\[\label{eq:Precision} Precision = \frac{1}{\lvert D\rvert} \displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{\lvert Y_i \cap Z_i\rvert }{\lvert Z_i\rvert }. \tag{9} \]
Recall: It is a metric (see Equation (10)) commonly used along with the previous one, measuring the proportion of predicted labels which are relevant.
\[\label{eq:Recall} Recall = \frac{1}{\lvert D\rvert } \displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{\lvert Y_i \cap Z_i\rvert}{\lvert Y_i\rvert}. \tag{10} \]
F-Measure: As can be seen in Equation (11), this metric is the harmonic mean between Precision (see Equation (9)) and Recall (see Equation (10)), providing a balanced assessment between precision and sensitivity.
\[\label{eq:F1} \textit{FMeasure} = 2 * \frac{Precision \cdot Recall}{Precision + Recall}. \tag{11} \]
Hamming Loss: It is the most common evaluation metric in the multilabel literature, computed (see Equation (12)) as the symmetric difference between predicted and true labels, divided by the total number of labels in the MLD.
\[\label{eq:HL} HammingLoss = \frac{1}{\lvert D\rvert } \displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{\lvert Y_i \triangle Z_i\rvert}{\lvert L\rvert}. \tag{12} \]
Subset Accuracy: This metric is also known as 0/1 Subset Accuracy and Classification Accuracy, and it is the most strict evaluation metric. The \(\left[\!\!\left[ expr \right]\!\!\right]\) operator (see Equation (13)) returns 1 when \(expr\) is true and 0 otherwise. In this case its value is 1 only if the predicted set of labels equals the true one.
\[\label{eq:SubsetAccuracy} SubsetAccuracy = \frac{1}{\lvert D\rvert } \displaystyle\sum\limits_{i=1}^{\lvert D\rvert }\left[\!\!\left[ Y_i = Z_i \right]\!\!\right] . \tag{13} \]
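To make the structure of these metrics concrete, below is a minimal sketch of Hamming Loss and Subset Accuracy computed outside the package, assuming hypothetical 0/1 matrices pred and truth of identical dimensions (instances in rows, labels in columns):

> hamming_loss <- function(pred, truth) mean(pred != truth)  # Equation (12)
> subset_accuracy <- function(pred, truth)                   # Equation (13)
+   mean(apply(pred == truth, 1, all))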
Let \(rank\left(x_i, y\right)\) be a function returning the position of \(y\), a certain label, in the \(x_i\) instance. The example/ranking based evaluation metrics returned by the mldr_evaluate function are the following ones:
Average Precision: This metric (see Equation (14)) computes the proportion of labels ranked ahead of a certain relevant label. The goal is to establish how many positions have to be traversed until this label is found.
\[\label{eq:AveragePrecision} \textit{AveragePrecision} = \frac{1}{\lvert D\rvert } \displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{1}{\lvert Y_i\rvert} \displaystyle\sum\limits_{y \in Y_i} \frac{\lvert \left\{y'\in Y_i : rank\left(x_i, y'\right) \leq rank\left(x_i, y\right) \right\}\rvert}{rank\left(x_i, y\right)}. \tag{14} \]
Coverage: Defined as indicated in Equation (15), this metric calculates the extent to which it is necessary to go up in the ranking to cover all relevant labels.
\[\label{eq:Coverage} \textit{Coverage} = \frac{1}{\lvert D\rvert } \displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \displaystyle\max\limits_{y \in Y_i} rank\left(x_i, y\right) - 1. \tag{15} \]
One Error: It is a metric (see Equation (16)) which determines how many times the best ranked label given by the classifier is not part of the true label set of the instance.
\[\label{eq:OneError} \textit{OneError} = \frac{1}{\lvert D\rvert } \displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \left[\!\!\left[ \mathop{argmax}\limits_{y \in Z_i} rank\left(x_i, y\right) \notin Y_i \right]\!\!\right] . \tag{16} \]
Ranking Loss: This metric (see Equation (17)) compares each pair of labels in \(L\), computing how many times a relevant label (member of the true labelset) appears ranked lower than a non-relevant label. In the equation, \(\overline{Y_i}\) denotes \(L\backslash Y_i\).
\[\label{eq:RankingLoss} \small \textit{RankingLoss} = \frac{1}{\lvert D\rvert } \displaystyle\sum\limits_{i=1}^{\lvert D\rvert } \frac{1}{\lvert Y_i\rvert\lvert \overline{Y_i}\rvert} \lvert \left\{\left(y_a, y_b\right) \in Y_i \times \overline{Y_i}: rank\left(x_i, y_a\right) > rank\left(x_i, y_b\right) \right\}\rvert. \tag{17} \]
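As an illustration of how these ranking based metrics operate, here is a minimal sketch of One Error, assuming a hypothetical real-valued matrix scores (one row per instance, one column per label, higher meaning better ranked) and a 0/1 matrix truth holding the true labels:

> one_error <- function(scores, truth) {
+   top <- max.col(scores)  # best ranked label for each instance
+   mean(truth[cbind(seq_len(nrow(truth)), top)] == 0)
+ }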
Regarding the label based metrics, there are two different ways to aggregate the values of the labels. The macro-averaging approach (see Equation (18)) computes the metric independently for each label and then averages the obtained values to get the final measure. On the contrary, the micro-averaging approach (see Equation (19)) first aggregates the counters for all the labels and then computes the metric only once. In the following equations TP, FP, TN and FN stand for True Positives, False Positives, True Negatives and False Negatives, respectively.
\[\begin{aligned} \label{eq:MacroB} MacroMetric &= \frac{1}{\lvert L\rvert} \sum\limits_{l=1}^{\lvert L\rvert}evalMetric\left(TP_l,FP_l,TN_l,FN_l\right). \end{aligned} \tag{18} \]
\[\begin{aligned} \label{eq:MicroB} MicroMetric &= evalMetric\left(\sum\limits_{l=1}^{\lvert L\rvert}TP_l,\sum\limits_{l=1}^{\lvert L\rvert}FP_l,\sum\limits_{l=1}^{\lvert L\rvert}TN_l,\sum\limits_{l=1}^{\lvert L\rvert}FN_l\right). \end{aligned} \tag{19} \]
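For instance, assuming hypothetical per-label counter vectors tp and fp (one entry per label), macro- and micro-averaged precision would be sketched as:

> macro_precision <- function(tp, fp) mean(tp / (tp + fp))           # Equation (18)
> micro_precision <- function(tp, fp) sum(tp) / (sum(tp) + sum(fp))  # Equation (19)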
All the bipartition based metrics, such as Precision, Recall or FMeasure, can be computed as label based measures following these two approaches. In this category there are also some ranking based metrics, such as MacroAUC (see Equation (20)) and MicroAUC (see Equation (21)).
\[\label{eq:MacroAUC} \begin{split} MacroAUC = \frac{1}{\lvert L\rvert} \sum\limits_{l=1}^{\lvert L\rvert} \frac{\lvert \left\{x', x'' : rank\left(x', y_l\right) \ge rank\left(x'', y_l\right), \left(x', x''\right) \in X_l \times \overline{X_l} \right\}\rvert}{\lvert X_l\rvert\lvert \overline{X_l}\rvert},\\ X_l = \left\{ x_i : y_l \in Y_i\right\},\ \overline{X_l} = \left\{x_i : y_l \notin Y_i\right\}. \end{split} \tag{20} \]
\[\label{eq:MicroAUC} \begin{split} MicroAUC = \frac{\lvert \left\{x', x'', y', y'' : rank\left(x', y'\right) \ge rank\left(x'', y''\right), \left(x', y'\right) \in S^+ , \left(x'', y''\right) \in S^- \right\}\rvert}{\lvert S^+\rvert\lvert S^-\rvert},\\ S^+ = \left\{ \left(x_i, y\right) : y \in Y_i\right\},\ S^- = \left\{ \left(x_i, y\right) : y \notin Y_i\right\}. \end{split} \tag{21} \]
When the partition of the MLD for which the predictions have been obtained, along with the predictions themselves, is given to the mldr_evaluate function, a list of 20 measures is returned. For instance:
> # Get the true labels in emotions
> predictions <- as.matrix(emotions$dataset[, emotions$labels$index])
> # and introduce some noise
> predictions[sample(1:593, 100), sample(1:6, 100, replace = TRUE)] <-
+ sample(0:1, 100, replace = TRUE)
> # then evaluate the predictive performance
> res <- mldr_evaluate(emotions, predictions)
> str(res)
List of 20
 $ Accuracy        : num 0.917
 $ AUC             : num 0.916
 $ AveragePrecision: num 0.673
 $ Coverage        : num 2.71
 $ FMeasure        : num 0.952
 $ HammingLoss     : num 0.0835
 $ MacroAUC        : num 0.916
 $ MacroFMeasure   : num 0.87
 $ MacroPrecision  : num 0.829
 $ MacroRecall     : num 0.915
 $ MicroAUC        : num 0.916
 $ MicroFMeasure   : num 0.872
 $ MicroPrecision  : num 0.834
 $ MicroRecall     : num 0.914
 $ OneError        : num 0.116
 $ Precision       : num 0.938
 $ RankingLoss     : num 0.518
 $ Recall          : num 0.914
 $ SubsetAccuracy  : num 0.831
 $ ROC             :List of 15
  ...
> plot(res$ROC, main = "ROC curve for emotions") # Plot ROC curve
If the pROC (Robin et al. 2011) package is available, this list will include non-null AUC (Area Under the ROC Curve) measures and also an element called ROC. The latter holds the information needed to plot the ROC (Receiver Operating Characteristic) curve, as shown in the last line of the previous example. The result would be a plot similar to that in Figure 3.
This package provides the user with a web-based graphical user interface built on top of the shiny package, allowing measures, graphics and other results to be obtained interactively. Once mldr is loaded, this GUI can be launched from the R console with a single command:
> mldrGUI()
This will cause the user’s default browser to start or open a new tab in which the GUI will be displayed, organized into a tab bar and a content pane. The tab bar allows the change of section so that different information is shown in the pane.
The GUI will initially display the Main section, as shown in Figure 4. It contains options to select an MLD from those available, and to load a new one by uploading its ARFF and XML files onto the application. On the right side, several plots are stacked. These show the amount of attributes of each type (numeric, character or label), the amount of labels per instance, the amount of instances corresponding to each label and the number of instances related to each labelset. Each plot can be saved as an image on the file system. Right below these graphics, some tables containing basic measures are shown. The first one lists generic measures related to the entire MLD, and is followed by measures specific to labels, such as Card or Dens. The last table shows a summary of measures for labelsets.
The Labels section contains a table enumerating each label of the MLD with its relevant details and measures: its index in the attribute list, its count and frequency, its IRLbl and its SCUMBLE. Labels in this table can be reordered using the headers, and filtered by the Search field. Furthermore, if the list is longer than the number specified in the Show field, it will be split into several pages. The data shown in all tables can be exported to files in several formats. On the right side, a plot shows the amount of instances that have each label. This is an interactive plot, and allows the range of labels to be manipulated.
Since relations between labels can determine the behavior of new data, studying labelsets is important in multilabel classification. Thus, the section named Labelsets provides information about them, listing each labelset along with its count. This list can be filtered and split into pages as well, and is accompanied by a bar plot showing the count of instances per labelset.
In order to obtain statistical measures about input attributes, the Attributes section organizes all of them into a paged table, displaying their type and some data or measures according to it. If the attribute is numeric, then there will be a table containing its minimum and maximum values, its quartiles and its mean. On the contrary, if the attribute takes values from a finite set, each possible value will be shown along with its count in the MLD.
Lastly, concurrence among labels has proven to be a factor to take into account when applying preprocessing techniques to MLDs. For this reason, the Concurrence section attempts to provide an easy way of visualizing concurrence among labels (see Figure 5), with a label concurrence plot displaying the labels selected in the left-side table and their co-occurrences represented by bands in the circle. By default, the ten labels with highest SCUMBLE are selected. The user is able to select and deselect other labels by clicking their corresponding row on the table.
In this paper the mldr package, aimed at providing exploratory analysis and manipulation tools for MLDs, has been introduced. The functions supplied by this package allow both loading existing MLDs and generating new ones. Several characterization measures and specific plots can be obtained for any MLD, and the content of an MLD can be extracted, filtered and joined, producing new MLDs. Any MLD can be transformed into a set of binary datasets or a multiclass dataset by means of the transformation functions of package mldr. Finally, a web-based graphical user interface eases access to most of this functionality for everyone.
In its current version, package mldr is a strong base to develop any preprocessing method for MLDs, as has been shown. The development of the mldr package will continue in the near future by including the tools needed to implement and evaluate multilabel classifiers. With this foundation, we aim to encourage other developers to incorporate their own algorithms into mldr, as we will do in forthcoming releases.
This paper is partially supported by the project TIN2012-33856 of the Spanish Ministry of Science and Technology.