Multilabel Classification with R Package mlr

We implemented several multilabel classification algorithms in the machine learning package mlr. The implemented methods are binary relevance, classifier chains, nested stacking, dependent binary relevance and stacking, which can be used with any base learner that is accessible in mlr. Moreover, there is access to the multilabel classification versions of randomForestSRC and rFerns. All these methods can be easily compared by different implemented multilabel performance measures and resampling methods in the standardized mlr framework. In a benchmark experiment with several multilabel datasets, the performance of the different methods is evaluated.


Introduction
Multilabel classification is a classification problem where multiple target labels can be assigned to each observation instead of only one, like in multiclass classification. It can be regarded as a special case of multivariate classification or multi-target prediction problems, for which the scale of each response variable can be of any kind, for example nominal, ordinal or interval.
Originally, multilabel classification was used for text classification (McCallum, 1999; Schapire and Singer, 2000) and is now used in several applications in different research fields. For example, in image classification, a photo can belong to the classes mountain and sunset simultaneously. Zhang and Zhou (2008) and others (Boutell et al., 2004) used multilabel algorithms to classify scenes on images of natural environments. Furthermore, gene function classification is a popular application of multilabel learning in the field of biostatistics (Elisseeff and Weston, 2002; Zhang and Zhou, 2008). Additionally, multilabel classification is useful to categorize audio files. Music genres (Sanden and Zhang, 2011), instruments (Kursa and Wieczorkowska, 2014), bird sounds (Briggs et al., 2013) or even emotions evoked by a song (Trohidis et al., 2008) can be labeled with several categories. A song could, for example, be classified both as a rock song and a ballad.
An overview of multilabel classification was given by Tsoumakas and Katakis (2007). Two different approaches exist for multilabel classification. On the one hand, there are algorithm adaptation methods that try to adapt multiclass algorithms so they can be applied directly to the problem. On the other hand, there are problem transformation methods, which try to transform the multilabel classification into binary or multiclass classification problems.
Regarding multilabel classification software, there is the mldr (Charte and Charte, 2015) R package that contains some functions to get basic characteristics of specific multilabel datasets. The package is also useful for transforming multilabel datasets that are typically saved as ARFF-files (Attribute-Relation File Format) to data frames and vice versa. This is especially helpful because until now only the software packages MEKA (Read and Reutemann, 2012) and Mulan (Tsoumakas et al., 2011) were available for multilabel classification and both require multilabel datasets saved as ARFF-files to be executed. Additionally, the mldr package provides a function that applies the binary relevance or label powerset transformation method which transforms a multilabel dataset into several binary datasets (one for each label) or into a multiclass dataset using the set of labels for each observation as a single target label, respectively. However, there is no R package that provides a standardized interface for executing different multilabel classification algorithms. With the extension of the mlr package described in this paper, it will be possible to execute several multilabel classification algorithms in R with many different base learners.
In the following section of this paper, we will describe the implemented multilabel classification methods and then give a practical instruction of how to execute these algorithms in mlr. Finally, we present a benchmark experiment that compares the performance of all implemented methods on several datasets.

Multilabel classification methods implemented in mlr
In this section, we present multilabel classification algorithms that are implemented in the mlr package (Bischl et al., 2016), which is a powerful and modularized toolbox for machine learning in R. The package offers a unified interface to more than a hundred learners from the areas of classification, regression, cluster analysis and survival analysis. Furthermore, the package provides functions and tools that facilitate complex workflows such as hyperparameter tuning.

Problem transformation methods
Problem transformation methods try to transform the multilabel classification problem so that a simple binary classification algorithm, the so-called base learner, can be applied.
Let $n$ be the number of observations, let $p$ be the number of predictor variables and let $Z = \{z_1, \ldots, z_m\}$ be the set of all labels. Observations follow an unknown probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is a $p$-dimensional input space of arbitrary measurement scales and $\mathcal{Y} = \{0, 1\}^m$ is the target space. In our notation, $x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_p)^\top \in \mathcal{X}$ refers to the $i$-th observation and $x_j = (x^{(1)}_j, \ldots, x^{(n)}_j)^\top$ refers to the $j$-th predictor variable, for all $i = 1, \ldots, n$ and $j = 1, \ldots, p$. The observations $x^{(i)}$ are associated with their multilabel outcomes $y^{(i)} = (y^{(i)}_1, \ldots, y^{(i)}_m)^\top \in \mathcal{Y}$, for all $i = 1, \ldots, n$. For all $k = 1, \ldots, m$, setting $y^{(i)}_k = 1$ indicates the relevance, i.e., the occurrence, of label $z_k$ for observation $x^{(i)}$ and setting $y^{(i)}_k = 0$ indicates the irrelevance of label $z_k$ for observation $x^{(i)}$. The set of all instances thus becomes $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. Furthermore, $y_k = (y^{(1)}_k, \ldots, y^{(n)}_k)^\top$ refers to the $k$-th target vector, for all $k = 1, \ldots, m$.
Throughout this paper, we visualize multilabel classification problems in the form of tables ($n = 6$, $p = 3$, $m = 3$):
[Table (1): six rows; predictor columns $x_1, x_2, x_3$ and label columns $y_1, y_2, y_3$]
The entries of $x_1, x_2, x_3$ can be of any (valid) kind, like continuous, binary, or categorical. The table in (1) visualizes this as an empty gray background. The target variables are indicated by a red background and can only take the binary values 0 or 1.
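The notation above can be made concrete with a small data frame. The following is only an illustrative sketch: the values, the column names and the object names `toy`, `X` and `Y` are made up and do not come from the paper.

```r
# Toy multilabel data set with n = 6 observations, p = 3 predictors, m = 3 labels.
# All values are invented for illustration.
set.seed(1)
toy = data.frame(
  x1 = rnorm(6),                                  # continuous predictor
  x2 = c(0, 1, 1, 0, 1, 0),                       # binary predictor
  x3 = factor(c("a", "b", "a", "c", "b", "c")),   # categorical predictor
  y1 = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),  # relevance of label z_1
  y2 = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE),  # relevance of label z_2
  y3 = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)   # relevance of label z_3
)

X = toy[, c("x1", "x2", "x3")]             # n x p predictor part
Y = as.matrix(toy[, c("y1", "y2", "y3")])  # n x m target part with entries in {FALSE, TRUE}

X[2, ]    # the second observation x^(2)
Y[2, ]    # its multilabel outcome y^(2)
Y[, "y3"] # the third target vector y_3
```

Each row of `Y` corresponds to one outcome $y^{(i)}$ and each column to one target vector $y_k$.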

Binary relevance
The binary relevance method (BR) is the simplest problem transformation method. BR learns a binary classifier for each label. Each classifier $C_1, \ldots, C_m$ is responsible for predicting the relevance of its corresponding label by a 0/1 prediction:
$$C_k: \mathcal{X} \to \{0, 1\}, \quad k = 1, \ldots, m.$$
These binary predictions are then combined to a multilabel target. An unlabeled observation $x^{(l)}$ is assigned the prediction $(C_1(x^{(l)}), C_2(x^{(l)}), \ldots, C_m(x^{(l)}))^\top$. Hence, labels are predicted independently of each other and label dependencies are not taken into account. BR has linear computational complexity with respect to the number of labels and can easily be parallelized.
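The BR scheme can be sketched in a few lines of base R. This is not the mlr implementation: logistic regression (`glm`) is assumed as a stand-in base learner, the data are simulated, and the helper names `br_train` and `br_predict` are hypothetical.

```r
# Binary relevance with logistic regression as a stand-in base learner.
set.seed(42)
n = 200
X = data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Simulated 0/1 labels (illustrative only)
Y = data.frame(
  y1 = as.integer(X$x1 + rnorm(n) > 0),
  y2 = as.integer(X$x2 + rnorm(n) > 0),
  y3 = as.integer(X$x1 - X$x3 + rnorm(n) > 0)
)

# Train one independent binary classifier C_k per label
br_train = function(X, Y) {
  lapply(Y, function(yk) {
    glm(yk ~ ., data = cbind(X, yk = yk), family = binomial())
  })
}

# Combine the m separate 0/1 predictions into one multilabel prediction
br_predict = function(models, newdata) {
  sapply(models, function(mod) {
    as.integer(predict(mod, newdata = newdata, type = "response") > 0.5)
  })
}

train_idx = 1:150
test_idx = 151:200
models = br_train(X[train_idx, ], Y[train_idx, ])
Y_hat = br_predict(models, X[test_idx, ])
head(Y_hat)
```

Because the m fits inside `br_train` do not depend on each other, the `lapply` call is the natural place for parallelization.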

Modeling label dependence
In the problem transformation setting, the arguably simplest way (Montañés et al., 2014) to model label dependence is to condition classifier models not only on $\mathcal{X}$, but also on other label information. The idea is to augment the input space $\mathcal{X}$ with information of the output space $\mathcal{Y}$, which is available in the training step. There are different ways to realize this idea of augmenting the input space. In essence, they can be distinguished in the following way:
• Should the true label information be used? (True vs. predicted label information)
• For predicting one label $z_k$, should all other labels augment the input space, or only a subset of labels? (Full vs. partial conditioning)

True vs. predicted label information
During the training of a classifier $C_k$ for the label $z_k$, the label information of other labels is available in the training data. Consequently, these true labels can directly be used as predictors to train the classifier. Alternatively, the predictions that are produced by some classifier can be used instead of the true labels.
A classifier that is trained on additional labels as predictors needs those additional labels as input variables. Since these labels are not available at prediction time, they need to be predicted first. When the true label information is used to augment the feature space in the training of a classifier, the assumption that the training data and the test data should be identically distributed is violated (Senge et al., 2013). If the true label information is used in the training data and the predicted label information is used in the test data, the training data is not representative for the test data. However, experiments (Montañés et al., 2014; Senge et al., 2013) show that none of these methods should be dismissed immediately. Note that we use the superscript "true" or "pred" to emphasize that a classifier $C^{true}_k$ or $C^{pred}_k$ used true labels or predicted labels as additional predictors during training, respectively.
Suppose there are $n = 6$ observations with $p = 3$ predictors and $m = 3$ labels. The true label $y_3$ shall be used to augment the feature space of a binary classifier $C^{true}_1$ for label $y_1$. $C^{true}_1$ is thus trained on all predictors and the true label $y_3$. The binary classification task for label $y_1$ is therefore:
    Train $C^{true}_1$ on the features $x_1, x_2, x_3, y_3$ with target $y_1$.    (2)
For an unlabeled observation $x^{(l)}$, only the three predictor variables $x^{(l)}_1, x^{(l)}_2, x^{(l)}_3$ are available at prediction time. However, the classifier $C^{true}_1$ needs a 4-dimensional observation $(x^{(l)}, y^{(l)}_3)$ as input.
The input $y^{(l)}_3$ therefore needs to be predicted first. A new level-1 classifier $C^{lvl1}_3$, which is trained on the set $D = \bigcup_{i=1}^{6} \{(x^{(i)}, y^{(i)}_3)\}$, will make those predictions for $y^{(l)}_3$. The prediction for the unlabeled observation thus becomes:
$$\hat{y}^{(l)}_1 = C^{true}_1(x^{(l)}, \hat{y}^{(l)}_3) \quad \text{with} \quad \hat{y}^{(l)}_3 = C^{lvl1}_3(x^{(l)}). \tag{3}$$
The alternative to (2) would be to use predicted labels $\hat{y}_3$ instead of true labels $y_3$. These labels should be produced by means of an out-of-sample prediction procedure (Senge et al., 2013). This can be done by an internal leave-one-out cross-validation procedure, which can of course be computationally intensive. Because of this, coarser resampling strategies can be used. As an example, an internal 2-fold cross-validation will be shown here. Again, let $D = \bigcup_{i=1}^{6} \{(x^{(i)}, y^{(i)}_3)\}$ be the set of all predictor variables with $y_3$ as target variable. Using 2-fold cross-validation, the dataset $D$ is split into two disjoint parts $D_1$ and $D_2$. Two classifiers $C_{D_1}$ and $C_{D_2}$ are then trained on $D_1$ and $D_2$, respectively, for the prediction of $y_3$:
    Train $C_{D_1}$ on the observations in $D_1$ (features $x_1, x_2, x_3$, target $y_3$).
    Train $C_{D_2}$ on the observations in $D_2$ (features $x_1, x_2, x_3$, target $y_3$).
Following the cross-validation paradigm, $D_1$ is used as test set for the classifier $C_{D_2}$, and $D_2$ is used as a test set for $C_{D_1}$:
$$\hat{y}^{(i)}_3 = C_{D_2}(x^{(i)}) \text{ for } (x^{(i)}, y^{(i)}_3) \in D_1, \qquad \hat{y}^{(i)}_3 = C_{D_1}(x^{(i)}) \text{ for } (x^{(i)}, y^{(i)}_3) \in D_2.$$
These predictions are merged for the final predicted label $\hat{y}_3$, which is used to augment the feature space. The classifier $C^{pred}_1$ is then trained on that augmented feature space:
    Train $C^{pred}_1$ on the features $x_1, x_2, x_3, \hat{y}_3$ with target $y_1$.
The prediction phase is completely analogous to (3). It is worthwhile to mention that the level-1 classifier $C^{lvl1}_3$, which will be used to obtain predictions $\hat{y}_3$ at prediction time, is trained on the whole set $D = D_1 \cup D_2$, following Simon (2007).
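The out-of-sample procedure described above can be sketched as follows. This is a simplified base R illustration under stated assumptions: logistic regression stands in for the base learner, the data are simulated, and `oos_predict` is a hypothetical helper, not an mlr function.

```r
set.seed(1)
n = 100
d = data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y3 = as.integer(d$x1 + rnorm(n) > 0)  # simulated label y_3

# Out-of-sample predictions for one label via an internal k-fold cross-validation:
# every observation is predicted by a model that did not see it during training.
oos_predict = function(data, target, folds = 2) {
  fold_id = sample(rep(seq_len(folds), length.out = nrow(data)))
  y_hat = integer(nrow(data))
  form = reformulate(".", response = target)
  for (f in seq_len(folds)) {
    held_out = fold_id == f
    mod = glm(form, data = data[!held_out, ], family = binomial())
    p = predict(mod, newdata = data[held_out, ], type = "response")
    y_hat[held_out] = as.integer(p > 0.5)
  }
  y_hat
}

# Merged predictions, to be used as an additional feature when training C^pred_1
y3_hat = oos_predict(d, "y3", folds = 2)

# The level-1 classifier used at prediction time is fit on the whole set D
lvl1 = glm(y3 ~ ., data = d, family = binomial())
```

Setting `folds = nrow(data)` would correspond to the leave-one-out variant mentioned in the text, at a correspondingly higher cost.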

Full vs. partial conditioning
Recall the set of all labels $Z = \{z_1, \ldots, z_m\}$. The prediction of a label $z_k$ can either be conditioned on all remaining labels $\{z_1, \ldots, z_{k-1}, z_{k+1}, \ldots, z_m\}$ (full conditioning) or just on a subset of labels (partial conditioning). The only method for partial conditioning that is examined in this paper is the chaining method. Here, labels $z_k$ are conditioned on all previous labels $\{z_1, \ldots, z_{k-1}\}$ for all $k = 1, \ldots, m$. This sequential structure is motivated by the product rule of probability (Montañés et al., 2014):
$$P(y \mid x) = \prod_{k=1}^{m} P(y_k \mid x, y_1, \ldots, y_{k-1}).$$
Methods that make use of this chaining structure are, e.g., classifier chains or nested stacking (these methods will be discussed further below).
To sum up the discussions above: there are four ways of modeling label dependencies through conditioning labels $z_k$ on other labels $z_\ell$, $k \neq \ell$. They can be distinguished by the subset of labels that is used for conditioning, and by the use of predicted or real labels in the training step. In Table 1 we show the four methods that implement these ideas and describe them subsequently.

                 True labels                  Pred. labels
Partial cond.    Classifier chains            Nested stacking
Full cond.       Dependent binary relevance   Stacking

Classifier chains
The classifier chains (CC) method implements the idea of using partial conditioning together with the true label information. It was first introduced by Read et al. (2011). CC selects an order on the set of labels $\{z_1, \ldots, z_m\}$, which can be formally written as a bijective function (permutation):
$$\tau: \{1, \ldots, m\} \to \{1, \ldots, m\}.$$
Labels will be chained along this order $\tau$:
$$z_{\tau(1)} \to z_{\tau(2)} \to \ldots \to z_{\tau(m)}.$$
However, for this paper the permutation shall be $\tau = id$ (only for simplicity reasons). The labels therefore follow the order $z_1 \to z_2 \to \ldots \to z_m$. In a similar fashion to the binary relevance (BR) method, CC trains $m$ binary classifiers $C_k$, which are responsible for predicting their corresponding label $z_k$, $k = 1, \ldots, m$. The classifiers $C_k$ are of the form
$$C_k: \mathcal{X} \times \{0, 1\}^{k-1} \to \{0, 1\},$$
where $\{0, 1\}^0 := \emptyset$. For a classifier $C_k$ the feature space is augmented by the true label information of all previous labels $z_1, z_2, \ldots, z_{k-1}$. Hence, the training data of $C_k$ consists of all observations $(x^{(i)}, y^{(i)}_1, \ldots, y^{(i)}_{k-1})$ with target $y^{(i)}_k$, for all $i = 1, \ldots, n$. In the example from above, this would look like:
    Train $C_1$ on the features $x_1, x_2, x_3$ with target $y_1$.
    Train $C_2$ on the features $x_1, x_2, x_3, y_1$ with target $y_2$.
    Train $C_3$ on the features $x_1, x_2, x_3, y_1, y_2$ with target $y_3$.
At prediction time, when an unlabeled observation $x^{(l)}$ is labeled, a prediction $\hat{y}^{(l)} = (\hat{y}^{(l)}_1, \ldots, \hat{y}^{(l)}_m)$ is obtained by successively predicting the labels along the chaining order:
$$\hat{y}^{(l)}_1 = C_1(x^{(l)}), \quad \hat{y}^{(l)}_2 = C_2(x^{(l)}, \hat{y}^{(l)}_1), \quad \ldots, \quad \hat{y}^{(l)}_m = C_m(x^{(l)}, \hat{y}^{(l)}_1, \ldots, \hat{y}^{(l)}_{m-1}). \tag{11}$$
Senge et al. (2013) summarize several factors that have an impact on the performance of CC:
• The length of the chain. A high number $(k - 1)$ of preceding classifiers in the chain comes with a high potential level of feature noise for the classifier $C_k$. One may assume that the probability of a mistake will increase with the level of feature noise in the input space. Then the probability of a mistake will be reinforced along the chain, due to the recursive structure of CC.
• The order of the chain. Some labels may be more difficult to predict than others. The order of a chain can therefore be important for the performance. It can be advantageous to put simple to predict labels in the beginning and harder to predict labels more towards the end of the chain. Some heuristics for finding an optimal chain ordering have been proposed in da Silva et al. (2014) and Read et al. (2013). Alternatively, Read et al. (2011) developed an ensemble of classifier chains, which builds many randomly ordered CC-classifiers and combines them in a voting scheme for a prediction. However, these methods are not the subject of this article.
• The dependency among labels. For an improvement of performance through chaining, there should be a dependence among labels; CC cannot gain in case of label independence. However, CC is also only likely to lose if the binary classifiers $C_k$ cannot ignore the added features $y_1, \ldots, y_{k-1}$.
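The training and prediction scheme of CC with the identity order can be sketched in base R as follows. The sketch assumes logistic regression as a stand-in base learner and simulated data; `cc_train` and `cc_predict` are hypothetical helper names and do not reflect the internals of mlr.

```r
set.seed(7)
n = 300
X = data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y1 = as.integer(X$x1 + rnorm(n) > 0)
y2 = as.integer(X$x2 + y1 + rnorm(n) > 0.5)   # depends on y1
y3 = as.integer(X$x3 - y2 + rnorm(n) > -0.5)  # depends on y2
Y = data.frame(y1, y2, y3)

cc_train = function(X, Y) {
  models = vector("list", ncol(Y))
  names(models) = names(Y)
  for (k in seq_along(Y)) {
    # Feature space of C_k: predictors plus the TRUE labels y_1, ..., y_{k-1}
    feats = X
    if (k > 1) feats = cbind(X, Y[, 1:(k - 1), drop = FALSE])
    models[[k]] = glm(target ~ ., data = cbind(feats, target = Y[[k]]),
      family = binomial())
  }
  models
}

cc_predict = function(models, newdata) {
  Y_hat = matrix(0L, nrow(newdata), length(models),
    dimnames = list(NULL, names(models)))
  feats = newdata
  for (k in seq_along(models)) {
    p = predict(models[[k]], newdata = feats, type = "response")
    Y_hat[, k] = as.integer(p > 0.5)
    # Feed the PREDICTED label forward as input of the next classifier
    feats[[names(models)[k]]] = Y_hat[, k]
  }
  Y_hat
}

idx = 1:200
chain = cc_train(X[idx, ], Y[idx, ])
Y_hat = cc_predict(chain, X[-idx, ])
```

The loop in `cc_predict` makes the recursive structure visible: an early mistake is passed on as a feature to all later classifiers, which is the error propagation discussed in the first bullet point.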

Nested stacking
The nested stacking method (NST), first proposed in Senge et al. (2013), implements the idea of using partial conditioning together with predicted label information. NST mimics the chaining structure of CC, but does not use real label information during training. Like in CC, the chaining order shall be $\tau = id$, again for simplicity reasons. CC uses real label information $y_k$ during training and predicted labels $\hat{y}_k$ at prediction time. However, unless the binary classifiers are perfect, it is likely that $y_k$ and $\hat{y}_k$ do not follow the same distribution. Hence, the key assumption of supervised learning, namely that the training data should be representative for the test data, is violated by CC. Nested stacking tries to overcome this issue by using predicted labels $\hat{y}_k$ instead of true labels $y_k$.

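NST trains $m$ binary classifiers $C_k$ on the training data $D_k = \bigcup_{i=1}^{n} \{(x^{(i)}, \hat{y}^{(i)}_1, \ldots, \hat{y}^{(i)}_{k-1}, y^{(i)}_k)\}$, for all $k = 1, \ldots, m$. The predicted labels should be obtained by an internal out-of-sample method (Senge et al., 2013). How these predictions are obtained was already explained in the section "True vs. predicted label information". The prediction phase is completely analogous to (11).
The training procedure is visualized in the following with 2-fold cross-validation as an internal out-of-sample method:
    Train $C_1$ on the features $x_1, x_2, x_3$ with target $y_1$; obtain $\hat{y}_1$ by 2-fold cross-validation.
    Train $C_2$ on the features $x_1, x_2, x_3, \hat{y}_1$ with target $y_2$; obtain $\hat{y}_2$ by 2-fold cross-validation.
    Train $C_3$ on the features $x_1, x_2, x_3, \hat{y}_1, \hat{y}_2$ with target $y_3$.
The factors which impact the performance of CC (i.e., length and order of the chain, and the dependency among labels) also impact NST, since NST mimics the chaining method of CC.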

Dependent binary relevance
The dependent binary relevance method (DBR) implements the idea of using full conditioning together with the true label information. DBR is built on two main hypotheses (Montañés et al., 2014):
(i) Taking conditional label dependencies into account is important for performing well in multilabel classification tasks.
(ii) Modeling and learning these label dependencies in an overcomplete way (take all other labels for modeling) may further improve model performance.
The first assumption is the main prerequisite for research in multilabel classification. It has been shown theoretically that simple binary relevance classifiers cannot achieve optimal performance for specific multilabel loss functions (Montañés et al., 2014). The second assumption, however, is harder to justify theoretically. Nonetheless, the practical usefulness of learning in an overcomplete way has been shown in many branches of (classical) single-label classification (e.g., ensemble methods (Dietterich, 2000)).
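Formally, DBR trains $m$ binary classifiers $C_1, \ldots, C_m$ (as many classifiers as labels) on the corresponding training data
$$D_k = \bigcup_{i=1}^{n} \{(x^{(i)}, y^{(i)}_1, \ldots, y^{(i)}_{k-1}, y^{(i)}_{k+1}, \ldots, y^{(i)}_m, y^{(i)}_k)\}, \quad k = 1, \ldots, m.$$
Thus, each classifier $C_k$ is of the form
$$C_k: \mathcal{X} \times \{0, 1\}^{m-1} \to \{0, 1\}.$$
Hence, for each classifier $C_k$ the true label information of all labels except $y_k$ is used as augmented features. Again, here is a visualization with the example from above:
    Train $C_1$ on the features $x_1, x_2, x_3, y_2, y_3$ with target $y_1$.
    Train $C_2$ on the features $x_1, x_2, x_3, y_1, y_3$ with target $y_2$.
    Train $C_3$ on the features $x_1, x_2, x_3, y_1, y_2$ with target $y_3$.
To make these classifiers applicable when an unlabeled instance $x^{(l)}$ needs to be labeled, the help of other multilabel classifiers is needed to produce predicted labels $\hat{y}^{(l)}_1, \ldots, \hat{y}^{(l)}_m$ as additional features. The classifiers that produce predicted labels as additional features are called base learners (Montañés et al., 2014). Theoretically, any multilabel classifier can be used as base learner. However, in this paper, the analysis is focused on BR as base learner only. The prediction of an unlabeled instance $x^{(l)}$ formally works as follows:
(i) First level: Produce predicted labels by using the BR base learner:
$$(\hat{y}^{(l)}_1, \ldots, \hat{y}^{(l)}_m) = (C^{BR}_1(x^{(l)}), \ldots, C^{BR}_m(x^{(l)})).$$
(ii) Second level, which is also called meta level (Montañés et al., 2014): Produce the final prediction by applying the DBR classifiers $C_1, \ldots, C_m$:
$$\hat{y}^{(l)} = (C_1(x^{(l)}, \hat{y}^{(l)}_2, \ldots, \hat{y}^{(l)}_m), \ldots, C_m(x^{(l)}, \hat{y}^{(l)}_1, \ldots, \hat{y}^{(l)}_{m-1})).$$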
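The two-level structure of DBR can be sketched in base R as follows. As before, logistic regression is assumed as a stand-in for the binary base learner, the data are simulated, and the helper names `fit_logit` and `pred01` are hypothetical.

```r
set.seed(3)
n = 300
X = data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y1 = as.integer(X$x1 + rnorm(n) > 0)
y2 = as.integer(X$x2 + y1 + rnorm(n) > 0.5)
y3 = as.integer(X$x3 + y1 - y2 + rnorm(n) > 0)
Y = data.frame(y1, y2, y3)
idx = 1:200  # training observations

fit_logit = function(feats, target) {
  glm(target ~ ., data = cbind(feats, target = target), family = binomial())
}
pred01 = function(mod, feats) {
  as.integer(predict(mod, newdata = feats, type = "response") > 0.5)
}

# First level: BR base learners, trained on the predictors only
br = lapply(Y[idx, ], function(yk) fit_logit(X[idx, ], yk))

# Meta level: C_k is trained on the predictors plus the TRUE values of all other labels
dbr = lapply(seq_along(Y), function(k) {
  fit_logit(cbind(X, Y[, -k, drop = FALSE])[idx, ], Y[idx, k])
})

# Prediction: the BR predictions replace the unknown labels of new observations
new = X[-idx, ]
Y_br = as.data.frame(lapply(br, pred01, feats = new))
Y_dbr = sapply(seq_along(dbr), function(k) {
  pred01(dbr[[k]], cbind(new, Y_br[, -k, drop = FALSE]))
})
```

Note that `Y[, -k]` drops only the target label itself, so each meta-level classifier sees all $m - 1$ remaining labels (full conditioning).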

Stacking
Stacking (STA) implements the last variant of Table 1, namely the use of full conditioning together with predicted label information. Stacking is short for stacked generalization (Wolpert, 1992) and was first proposed in the multilabel context by Godbole and Sarawagi (2004). Like in classical stacking, for each label it takes predictions of several other learners that were trained in a first step to get a new learner to make predictions for the corresponding label. Both hypotheses on which DBR is built also apply to STA, of course.
STA trains $m$ classifiers $C_1, \ldots, C_m$ on the corresponding training data
$$D_k = \bigcup_{i=1}^{n} \{(x^{(i)}, \hat{y}^{(i)}_1, \ldots, \hat{y}^{(i)}_m, y^{(i)}_k)\}, \quad k = 1, \ldots, m.$$
The classifiers $C_k$, $k = 1, \ldots, m$, are therefore of the following form:
$$C_k: \mathcal{X} \times \{0, 1\}^{m} \to \{0, 1\}.$$
Like in NST, the predicted labels should be obtained by an internal out-of-sample method (Sill et al.). STA can be seen as the alternative to DBR using predicted labels (like NST is for CC). However, the classifiers $C_k$, $k = 1, \ldots, m$, are trained on all predicted labels $\hat{y}_1, \ldots, \hat{y}_m$ for the STA approach (in DBR the label $y_k$ is left out of the augmented training set).
The training procedure is outlined in the following:
    For $k = 1, 2, 3$ train $C_k$ on the features $x_1, x_2, x_3, \hat{y}_1, \hat{y}_2, \hat{y}_3$ with target $y_k$.
Like in DBR, STA depends on a BR base learner to produce predicted labels as additional features. Again, the use of BR as a base learner is not mandatory, but it is the proposed method in Godbole and Sarawagi (2004).
The prediction of an unlabeled instance $x^{(l)}$ works almost identically to the DBR case and is illustrated here:
(i) First level. Produce predicted labels by using the BR base learner:
$$(\hat{y}^{(l)}_1, \ldots, \hat{y}^{(l)}_m) = (C^{BR}_1(x^{(l)}), \ldots, C^{BR}_m(x^{(l)})).$$
(ii) Meta level. Apply the STA classifiers $C_1, \ldots, C_m$:
$$\hat{y}^{(l)} = (C_1(x^{(l)}, \hat{y}^{(l)}_1, \ldots, \hat{y}^{(l)}_m), \ldots, C_m(x^{(l)}, \hat{y}^{(l)}_1, \ldots, \hat{y}^{(l)}_m)).$$

Multilabel performance measures
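Analogously to multiclass classification there exist multilabel classification performance measures. Six multilabel performance measures can be evaluated in mlr. These are: subset 0/1 loss, hamming loss, accuracy, precision, recall and $F_1$-index. Multilabel performance measures are defined on a per instance basis. The performance on a test set is the average over all instances. In the following, $D_{test}$ denotes a test set with $n$ observations and $\mathbb{1}(\cdot)$ denotes the indicator function.
(i) The subset 0/1 loss is used to see if the predicted labels $C(x^{(i)}) = \hat{y}^{(i)} = (\hat{y}^{(i)}_1, \ldots, \hat{y}^{(i)}_m)$ are equal to the actual labels $y^{(i)} = (y^{(i)}_1, \ldots, y^{(i)}_m)$. The subset 0/1 loss of a classifier $C$ on a test set $D_{test}$ thus becomes:
$$\mathrm{subset}_{0/1}(C, D_{test}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y^{(i)} \neq \hat{y}^{(i)}).$$
The subset 0/1 loss can be interpreted as the analogue of the mean misclassification error in multiclass classification. In the multilabel case it is a rather drastic measure because it treats a mistake on a single label as a complete failure (Senge et al., 2013).
(ii) The hamming loss also takes into account observations where only some labels have been predicted correctly. It corresponds to the proportion of labels whose relevance is incorrectly predicted. For an instance $(x^{(i)}, y^{(i)})$ and a classifier $C(x^{(i)}) = \hat{y}^{(i)}$ this is defined as:
$$\mathrm{HammingLoss}(C, (x^{(i)}, y^{(i)})) = \frac{1}{m} \sum_{k=1}^{m} \mathbb{1}(y^{(i)}_k \neq \hat{y}^{(i)}_k).$$
If one label is predicted incorrectly, this accounts for an error of $\frac{1}{m}$. For a test set $D_{test}$ the hamming loss becomes:
$$\mathrm{HammingLoss}(C, D_{test}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m} \sum_{k=1}^{m} \mathbb{1}(y^{(i)}_k \neq \hat{y}^{(i)}_k).$$
The following measures are scores rather than loss functions like the two previous ones.
(iii) The accuracy, also called Jaccard-Index, for a test set $D_{test}$ is defined as:
$$\mathrm{accuracy}(C, D_{test}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{k=1}^{m} \mathbb{1}(y^{(i)}_k = 1 \text{ and } \hat{y}^{(i)}_k = 1)}{\sum_{k=1}^{m} \mathbb{1}(y^{(i)}_k = 1 \text{ or } \hat{y}^{(i)}_k = 1)}.$$
(iv) The precision for a test set $D_{test}$ is defined as:
$$\mathrm{precision}(C, D_{test}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{k=1}^{m} \mathbb{1}(y^{(i)}_k = 1 \text{ and } \hat{y}^{(i)}_k = 1)}{\sum_{k=1}^{m} \mathbb{1}(\hat{y}^{(i)}_k = 1)}.$$
(v) The recall for a test set $D_{test}$ is defined as:
$$\mathrm{recall}(C, D_{test}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{k=1}^{m} \mathbb{1}(y^{(i)}_k = 1 \text{ and } \hat{y}^{(i)}_k = 1)}{\sum_{k=1}^{m} \mathbb{1}(y^{(i)}_k = 1)}.$$
(vi) For a test set $D_{test}$ the $F_1$-index is defined as follows:
$$F_1(C, D_{test}) = \frac{1}{n} \sum_{i=1}^{n} \frac{2 \sum_{k=1}^{m} \mathbb{1}(y^{(i)}_k = 1 \text{ and } \hat{y}^{(i)}_k = 1)}{\sum_{k=1}^{m} \left( \mathbb{1}(y^{(i)}_k = 1) + \mathbb{1}(\hat{y}^{(i)}_k = 1) \right)}.$$
The $F_1$-index is the harmonic mean of recall and precision on a per instance basis.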
All these measures lie between 0 and 1. In the case of the subset 0/1 loss and the hamming loss the values should be low, in all other cases the scores should be high. Demonstrative definitions with sets instead of vectors can be seen in Charte and Charte (2015).
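The six measures can be written compactly for 0/1 matrices. The following base R functions are a sketch for illustration, not the mlr implementations; the function names and the small example matrices are made up, and instances with an empty label set (zero denominators) are not handled specially.

```r
# Y and Y_hat are n x m 0/1 matrices of true and predicted labels.
subset01  = function(Y, Y_hat) mean(rowSums(Y != Y_hat) > 0)
hamloss   = function(Y, Y_hat) mean(Y != Y_hat)
accuracy  = function(Y, Y_hat) mean(rowSums(Y & Y_hat) / rowSums(Y | Y_hat))
precision = function(Y, Y_hat) mean(rowSums(Y & Y_hat) / rowSums(Y_hat))
recall    = function(Y, Y_hat) mean(rowSums(Y & Y_hat) / rowSums(Y))
f1        = function(Y, Y_hat) {
  mean(2 * rowSums(Y & Y_hat) / (rowSums(Y) + rowSums(Y_hat)))
}

# Three instances, three labels
Y     = rbind(c(1, 0, 1), c(0, 1, 0), c(1, 1, 0))
Y_hat = rbind(c(1, 0, 0), c(0, 1, 0), c(1, 1, 1))

subset01(Y, Y_hat)   # 2 of 3 instances contain at least one mistake: 2/3
hamloss(Y, Y_hat)    # 2 of 9 label entries are wrong: 2/9
accuracy(Y, Y_hat)   # (1/2 + 1 + 2/3) / 3 = 13/18
precision(Y, Y_hat)  # (1 + 1 + 2/3) / 3 = 8/9
recall(Y, Y_hat)     # (1/2 + 1 + 1) / 3 = 5/6
f1(Y, Y_hat)         # (2/3 + 1 + 4/5) / 3 = 37/45
```

The example shows how drastic the subset 0/1 loss is compared to the hamming loss: two single-label mistakes already produce a loss of 2/3, whereas the hamming loss is only 2/9.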

Implementation
In this section, we briefly describe how to perform multilabel classifications in mlr. We provide small code examples for better illustration. A short tutorial is also available at http://mlr-org.github.io/mlr-tutorial/release/html/multilabel/index.html. The first step is to transform the multilabel dataset into a 'data.frame' in R. The columns must consist of vectors of features and one logical vector for each label that indicates if the label is present for the observation or not. To fit a multilabel classification algorithm in mlr, a multilabel task has to be created, where a vector of targets corresponding to the column names of the labels has to be specified. This task is an S3 object that contains the data, the target labels and further descriptive information. In the following example, the yeast data frame is extracted from the yeast.task, which is provided by the mlr package. Then the 14 label names of the targets are extracted and the multilabel task is created.
    yeast = getTaskData(yeast.task)
    labels = colnames(yeast)[1:14]
    yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)
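For data that does not ship with mlr, the same steps apply. The following sketch builds a task from a small simulated data frame; the column names, the values and the task id "toy" are invented for illustration.

```r
library(mlr)

set.seed(1)
# Two numeric features and two logical label columns (made-up data)
df = data.frame(
  x1 = rnorm(20),
  x2 = rnorm(20),
  sunset = rnorm(20) > 0,
  mountain = rnorm(20) > 0
)

# The label columns are declared as targets by their column names
toy.task = makeMultilabelTask(id = "toy", data = df,
  target = c("sunset", "mountain"))

getTaskTargetNames(toy.task)
getTaskSize(toy.task)
```

Label columns that are stored as 0/1 integers or factors would have to be converted to logical vectors first, since the text above requires one logical vector per label.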

Problem transformation methods
To generate a problem transformation method learner, a binary classification base learner has to be created with 'makeLearner'. A list of available learners for classifications in mlr can be seen at http://mlr-org.github.io/mlr-tutorial/release/html/integrated_learners/. Specific hyperparameter settings of the base learner can be set in this step through the 'par.vals' argument in 'makeLearner'. Afterwards, a learner for any problem transformation method can be created by applying the function 'makeMultilabel[. . .]Wrapper', where [. . .] has to be substituted by the desired problem transformation method. In the following example, two multilabel variants with rpart as base learner are created. The base learner is configured to output probabilities instead of discrete labels during prediction.
    lrn = makeLearner("classif.rpart", predict.type = "prob")
    multilabel.lrn1 = makeMultilabelBinaryRelevanceWrapper(lrn)
    multilabel.lrn2 = makeMultilabelNestedStackingWrapper(lrn)
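For completeness, the following sketch creates one learner per problem transformation method and sets a base learner hyperparameter through 'par.vals'. The value `maxdepth = 5` and the list name `wrappers` are arbitrary choices for illustration, not recommendations from the paper.

```r
library(mlr)

# Base learner with a non-default hyperparameter (arbitrary illustrative value)
lrn = makeLearner("classif.rpart", predict.type = "prob",
  par.vals = list(maxdepth = 5))

# One wrapped learner per problem transformation method
wrappers = list(
  br  = makeMultilabelBinaryRelevanceWrapper(lrn),
  cc  = makeMultilabelClassifierChainsWrapper(lrn),
  nst = makeMultilabelNestedStackingWrapper(lrn),
  dbr = makeMultilabelDBRWrapper(lrn),
  sta = makeMultilabelStackingWrapper(lrn)
)

# Every wrapper is an ordinary mlr learner of type "multilabel"
sapply(wrappers, function(w) w$type)
```

Since all wrappers share the same interface, swapping the method or the base learner only changes the construction step; training, prediction and evaluation stay the same.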

Algorithm adaptation methods
Algorithm adaptation method learners can be created directly with 'makeLearner'. The names of the specific learners can be looked up at http://mlr-org.github.io/mlr-tutorial/release/html/integrated_learners/ in the multilabel section.
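A minimal sketch for the two algorithm adaptation methods mentioned in this paper is given below. It assumes that the packages randomForestSRC and rFerns are installed in addition to mlr; the variable names are arbitrary.

```r
library(mlr)

# Algorithm adaptation learners are created directly, without a wrapper.
# Assumption: the packages randomForestSRC and rFerns are installed.
lrn.rfsrc = makeLearner("multilabel.randomForestSRC")
lrn.rferns = makeLearner("multilabel.rFerns")

lrn.rfsrc$type
lrn.rferns$type
```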

Train, predict and evaluate
Training and predicting on data can be done as usual in mlr with the functions 'train' and 'predict'. Learner and task have to be specified in 'train'; trained model and task or new data have to be specified in 'predict'.
    mod = train(multilabel.lrn1, yeast.task, subset = 1:1500)
    pred = predict(mod, task = yeast.task, subset = 1501:1600)
The performance of the prediction can be assessed via the function 'performance'. Measures are represented as S3 objects and multiple objects can be passed in as a list. The default measure for multilabel classification is the hamming loss (multilabel.hamloss). All available measures for multilabel classification can be shown by 'listMeasures' or looked up in the appendix of the tutorial page (http://mlr-org.github.io/mlr-tutorial/release/html/measures/index.html).
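Put together, training, prediction and evaluation with several measures might look as follows. The sketch is self-contained and uses rpart with binary relevance as an arbitrary choice; the object names are illustrative.

```r
library(mlr)

lrn = makeLearner("classif.rpart", predict.type = "prob")
br.lrn = makeMultilabelBinaryRelevanceWrapper(lrn)

mod = train(br.lrn, yeast.task, subset = 1:1500)
pred = predict(mod, task = yeast.task, subset = 1501:1600)

# Several measures are passed as a list of measure objects
perf = performance(pred, measures = list(multilabel.hamloss,
  multilabel.subset01, multilabel.f1, multilabel.acc))
perf

# Names of all measures that are applicable to multilabel tasks
listMeasures("multilabel")
```

The result of 'performance' is a named numeric vector with one entry per requested measure.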

Binary performance
To calculate a binary performance measure like, e.g., the accuracy, the mean misclassification error (mmce) or the AUC for each individual label, the function 'getMultilabelBinaryPerformances' can be used. This function can be applied to a single multilabel test set prediction and also on a resampled multilabel prediction. To calculate the AUC, predicted probabilities are needed. These can be obtained by setting the argument 'predict.type = "prob"' in the 'makeLearner' function.
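A sketch of such a per-label evaluation is given below. It refits a binary relevance learner so that the example is self-contained; the choice of rpart, the train/test split and the object names are arbitrary.

```r
library(mlr)

# Probabilities are required for the AUC
lrn = makeLearner("classif.rpart", predict.type = "prob")
br.lrn = makeMultilabelBinaryRelevanceWrapper(lrn)

mod = train(br.lrn, yeast.task, subset = 1:1500)
pred = predict(mod, task = yeast.task, subset = 1501:2417)

# One row per label, one column per binary measure
bin.perf = getMultilabelBinaryPerformances(pred,
  measures = list(acc, mmce, auc))
bin.perf
```

Such a table helps to spot labels that are predicted poorly even when the aggregated multilabel measures look acceptable.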

Parallelization
In the case of a high number of labels and larger datasets, parallelization in the training and prediction process of the multilabel methods can reduce computation time. This can be achieved by using the package parallelMap in mlr (see also the tutorial section on parallelization: http://mlr-org.github.io/mlr-tutorial/release/html/multilabel/index.html). Currently, only the binary relevance method is parallelizable: the classifier for each label is trained in parallel, as they are independent of each other. The other problem transformation methods will also be parallelizable (as far as possible) soon.
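A minimal sketch of parallel training with binary relevance is shown below. It assumes that the parallelMap package is installed and that two local socket workers can be started on the machine; the number of workers is an arbitrary choice.

```r
library(mlr)
library(parallelMap)

br.lrn = makeMultilabelBinaryRelevanceWrapper(makeLearner("classif.rpart"))

# Start two local worker processes (assumption: sockets are available)
parallelStartSocket(2)
mod = train(br.lrn, yeast.task)
parallelStop()

mod
```

For small tasks such as this one, the overhead of starting workers can outweigh the gain; the benefit is expected with many labels and larger datasets, as stated above.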

Benchmark experiment
In a similar fashion to Wang et al. (2014), we performed a benchmark experiment on several datasets in order to compare the performances of the different multilabel algorithms.
Datasets: In Table 2 we provide an overview of the used datasets. We retrieved most datasets from the Mulan Java library for multilabel learning as well as from other benchmark experiments of multilabel classification methods. See Table 2 for article references. We uploaded all datasets to the open data platform OpenML (Casalicchio et al., 2017; Vanschoren et al., 2013), so they now can be downloaded directly from there. In some of the used datasets, sparse labels had to be removed in order to avoid problems during cross-validation. Several binary classification methods have difficulties when labels are sparse, i.e., a strongly imbalanced binary target class can lead to constant predictions for that target. That can sometimes lead to direct problems in the base learners (when training on constant class labels is simply not allowed) or, e.g., in classifier chains, when the base learner cannot handle constant features. Furthermore, one can reasonably argue that not much is to be learned for such a label. Hence, labels that appeared in fewer than 2% of the observations were removed. We computed cardinality scores (based on the remaining labels) indicating the mean number of labels assigned to each case in the respective dataset. The following description of the datasets refers to the final versions after removal of sparse labels.
• The first dataset (birds) consists of 645 audio recordings of 15 different vocalizing bird species (Briggs et al., 2013). Each sound can be assigned to various bird species.
• Another audio dataset (emotions) consists of 593 musical files with 6 clustered emotional labels (Trohidis et al., 2008) and 72 predictors. Each song can be labeled with one or more of the labels {amazed-surprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely, angry-fearful}.
• The genbase dataset contains protein sequences that can be assigned to several classes of protein families (Diplaris et al., 2005). The entire dataset contains 1186 binary predictors.
• The langLog dataset includes 998 textual predictors and was originally compiled in the doctoral thesis of Read (2010). It consists of 1460 text samples that can be assigned to one or more topics such as language, politics, errors, humor and computational linguistics.
• The UC Berkeley enron dataset represents a subset of the original enron dataset and consists of 1702 cases of emails with 24 labels and 1001 predictor variables (Klimt and Yang, 2004).
• A subset of the reuters dataset includes 2000 observations for text classification (Zhang and Zhou, 2008).
• The image benchmark dataset consists of 2000 natural scene images. Zhou and Zhang (2007) extracted 135 features for each image and made it publicly available as a processed image dataset. Each observation can be associated with different label sets, where all possible labels are {desert, mountains, sea, sunset, trees}. About 22% of the images belong to more than one class. However, images belonging to three classes or more are very rare.
• The scene dataset is an image classification task where labels like Beach, Mountain, Field, Urban are assigned to each image (Boutell et al., 2004).
• The yeast dataset (Elisseeff and Weston, 2002) consists of micro-array expression data, as well as phylogenetic profiles of yeast, and includes 2417 genes and 103 predictors. In total, 14 different labels can be assigned to a gene, but only 13 labels were used due to label sparsity.
• Another dataset for text classification is the slashdot dataset (Read et al., 2011). It consists of article titles and partial blurbs. Blurbs can be assigned to several categories (e.g., Science, News, Games) based on word predictors.
Table 2: Used benchmark datasets including number of instances, number of predictors, number of labels and label cardinality. Datasets with an asterisk differ from the original dataset as sparse labels have been removed. The genbase dataset contained many constant factor variables, which were automatically removed by mlr.
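The removal of sparse labels and the computation of the label cardinality described in the Datasets paragraph can be sketched in base R. The label matrix below is constructed artificially so that the relative frequencies are known exactly; the label names and the object names are made up.

```r
n = 1000
i = seq_len(n)
# Artificial label columns with known relative frequencies
labels = data.frame(
  a = i %% 3 == 0,   # present in 333 of 1000 observations
  b = i %% 10 == 0,  # present in 100 of 1000 observations
  c = i <= 10        # present in 10 of 1000 observations (sparse)
)

# Relative frequency of each label
freq = colMeans(labels)

# Keep only labels that appear in at least 2% of the observations
keep = names(freq)[freq >= 0.02]
labels.kept = labels[, keep, drop = FALSE]

# Label cardinality: mean number of labels assigned to each observation
cardinality = mean(rowSums(labels.kept))
keep         # "a" "b"
cardinality  # 0.433
```

Here label `c` occurs in only 1% of the observations and is dropped, so the cardinality is computed on the remaining two labels.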
Base Learners: We employed two different binary classification base learners for each problem transformation algorithm: random forest (rf) of the randomForest package (Liaw and Wiener, 2002) with ntree = 100 and adaboost (ad) from the ada package (Culp et al., 2012), each with standard hyperparameter settings.

Performance Measures: We used the six previously described performance measures. Furthermore, we calculated the reported values by means of a 10-fold cross-validation.
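A 10-fold cross-validation of a single multilabel learner can be sketched with the resampling functions of mlr. This is not the benchmark code of the paper: rpart and classifier chains on the yeast task are used here only as a lightweight stand-in for the base learners and datasets named above.

```r
library(mlr)

# Stand-in learner: classifier chains with rpart (not the benchmark setup)
cc.lrn = makeMultilabelClassifierChainsWrapper(makeLearner("classif.rpart"))

# 10-fold cross-validation
rdesc = makeResampleDesc("CV", iters = 10)

set.seed(1)
res = resample(cc.lrn, yeast.task, rdesc,
  measures = list(multilabel.hamloss, multilabel.f1),
  show.info = FALSE)

# Performance values aggregated over the 10 folds
res$aggr
```

The same resampling description can be reused for every learner, which keeps the comparison between methods fair.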
Code: For reproducibility, the complete code and results can be downloaded from Probst (2017). The R package batchtools was used for parallelization.
The results for hamming loss and $F_1$-index are illustrated in Figure 1. Tables 3 and 4 contain performance values with the best performing algorithms highlighted in blue. For all remaining measures one may refer to the Appendix. We did not perform any threshold tuning that would potentially improve the performance of some of the methods.
The results of the problem transformation methods in this benchmark experiment concur with the general conclusions and results in Montañés et al. (2014). The authors ran a similar benchmark study with penalized logistic regression as base learner. They concluded that, on average, DBR performs well in $F_1$ and accuracy. Also, CC outperforms the other methods regarding the subset 0/1 loss most of the time. For the hamming loss measure they got mixed results with no clear winner, which is in concordance with our benchmark results. As base learner, on average, adaboost performs better than random forest in our benchmark study.
Considering the measure $F_1$, the problem transformation methods DBR, CC, STA and NST outperform RFERN and RFSRC on most of the datasets and also almost always perform better than BR, which does not consider dependencies among the labels. RFSRC and RFERN only perform well on either precision or recall, but in order to be considered as good classifiers they should perform well on both. The generally poor performance of RFERN can be explained by the working mechanism of the algorithm, which randomly chooses variables and split points at each split of a fern. Hence, it cannot deal with too many features that are useless for the prediction of the target labels.

Summary
In this paper, we describe the implementation of multilabel classification algorithms in the R package mlr. The problem transformation methods binary relevance, classifier chains, nested stacking, dependent binary relevance and stacking are implemented and can be used with any base learner that is accessible in mlr. Moreover, there is access to the multilabel classification versions of randomForestSRC and rFerns. We compare all of these methods in a benchmark experiment with several datasets and different implemented multilabel performance measures. The dependent binary relevance method performs well regarding the measures $F_1$ and accuracy. Classifier chains outperform the other methods in terms of the subset 0/1 loss most of the time. Parallelization is available for the binary relevance method and will be available soon for the other problem transformation methods. Algorithm adaptation methods and problem transformation methods that are currently not available can be incorporated in the current mlr framework easily. In our benchmark experiment we had to remove labels which occurred too sparsely, because some algorithms crashed due to one-class problems, which appeared during cross-validation. A solution to this problem and an implementation into the mlr framework is of great interest.
For the featureless learner we have no precision results for several datasets. The reason is that the featureless learner does not predict any label for any observation in these datasets. Hence, the denominator in the precision formula is always zero. mlr predicts NA in this case.