The utiml Package: Multi-label Classification in R

Classification tasks in which each instance is associated with one or more labels are known as multi-label learning. Implementations of multi-label algorithms, developed by different researchers, differ in several respects, such as input/output format, internal functions and programming language, to mention just some of them. As a result, current machine learning tools include only a small subset of multi-label decomposition strategies. The utiml package is a framework for the application of classification algorithms to multi-label data. Like the well-known MULAN library used with Weka, it provides a set of multi-label procedures such as sampling methods, transformation strategies, threshold functions, pre-processing techniques and evaluation metrics. The package was designed to allow users to easily perform complete multi-label classification experiments in the R environment. This paper describes the utiml API and illustrates its use in different multi-label classification scenarios.


Introduction
Multi-label classification (MLC) is a classification task where an instance can be simultaneously classified in more than one of the existing classes. Labeled data extracted from several domains, like text, web pages, multimedia (audio, image, videos) and biology, are intrinsically multi-labeled. Additionally, the number of application domains with MLC data is growing fast.
Many current real-world data science applications are MLC by nature. They come from very diverse domains, such as labeling newspaper articles by subject and classifying proteins according to their functions. MLC algorithms have been successfully used in these and other MLC tasks (Diplaris et al., 2005). In a recent application, MLC algorithms were used to recommend food truck cuisines (Rivolli et al., 2017), assuming that a person can have more than one cuisine preference, all with the same level of preference.
Despite the growing relevance of MLC, there is a lack of comprehensive and easy-to-use tools for the R environment. A tool frequently used in MLC experiments is MULAN (Tsoumakas et al., 2011), a Java library built on top of Weka (Hall et al., 2009) to allow Weka users to deal with MLC data. Its popularity in the research community can be attributed to its ease of use and to the large number and variety of its functionalities. The MLC alternative for Python users is scikit-multilearn (Szymański, 2017), which provides a set of MLC algorithms and an interface to the MULAN library. Although other, simpler tools, like MEKA (Read et al., 2016) and general data mining software (Gibaja and Ventura, 2015), include good functionalities to deal with MLC tasks, they address few MLC features and are not available in R.
It is important to mention that there are packages that offer some level of support for MLC in R. The most complete is the mldr package, an exploratory tool for the manipulation and analysis of MLC datasets (Charte and Charte, 2015). Although it does not contain MLC strategies, it supports the ARFF variation for MLC data, largely used for data mining and machine learning (ML) experiments, and has useful features, such as dataset characterization, MLC evaluation measures, and a rich user interface for data exploration. Some works use the mlr package, which was not specifically designed for MLC. As a result, it provides only a few multi-label strategies (Probst et al., 2017) and does not support the MLC ARFF format. In fact, it is a general purpose package, with an interface to more than one hundred algorithms, that supports several ML tasks (Bischl et al., 2016). Another related package, MLPUGS, is a simple MLC package that contains only the implementation of the classifier chains (CC) strategy (Read et al., 2009).
Although the previous packages make it easier to perform some procedures related to MLC learning, their adoption in MLC experiments requires more effort from the developer/researcher than MULAN, available for Weka users. This motivated the authors to design utiml, a more comprehensive, specific, easy-to-use and extensible solution. The main features of the utiml package include:
• Pre-processing techniques: a set of techniques for the preparation and pre-processing of MLC data to be used in experiments. These techniques deal with simple tasks, like the removal of predictive attributes, instances and labels, the replacement of nominal attribute values by numerical values and data normalization.
• Sampling: a set of methods used to split MLC data through the holdout and k-fold methodologies. Random or stratified strategies can be used for data partitioning.
• Classification/Ranking: the main MLC strategies. The transformation strategies support several base algorithms and the result can be seen as bipartition, probability/score and ranking.
• Threshold: score-based and ranking-based threshold functions to be employed after the label prediction, so that bipartition values can be changed.
• Evaluation: traditional MLC evaluation measures and an MLC confusion matrix for the summarization of classification results.
The R Journal Vol.XX/YY, AAAA 20ZZ ISSN 2073-4859
This paper describes the main aspects and resources of the utiml package. The current version is 0.1.4 and an updated list with all resources available will be maintained in the vignette document and the reference manual. The following section provides a brief review of MLC learning. Next, the package API is detailed, its resources are presented and some illustrative examples are provided. Finally, the main issues regarding the package use are highlighted in the summary section.

Multi-label classification learning
MLC tasks have attracted growing attention in the ML community (de Carvalho and Freitas, 2009; Tsoumakas et al., 2010; Gibaja and Ventura, 2014). While in multi-class classification only a single class label is predicted, in MLC more than one class label can be predicted simultaneously. In the same way that multi-class classification can be seen as a generalization of binary classification, which restricts the number of classes to two, MLC can be seen as a generalization of multi-class classification, which restricts the number of predicted classes to one (de Carvalho and Freitas, 2009). The main MLC approaches are prediction of multiple labels, label ranking and multi-label ranking (Tsoumakas et al., 2010).
Multi-Label Classification (MLC), the most common task (Tsoumakas et al., 2010), induces a predictive model $h(x) \to Y$ from a set of training data, which later assigns one or more labels to each new example. This task can be formally defined as follows: let $D$ be a set of labeled instances $E_i = (x_i, Y_i)$, such that $x_i$ is the vector of attribute values of the instance and $Y_i$ is a subset of the set of labels $\lambda_1, \lambda_2, \ldots, \lambda_q$, which describes a position in a $\{0, 1\}^q$ output space.
The Label Ranking (LR) task can be characterized by a function $f(x, \lambda_i)$, which, for each class label, outputs a score value in the interval [0.0, 1.0], indicating the relevance, confidence or probability of instance $x$ belonging to the class whose label is $\lambda_i$. The higher the score value, the better the ranking position. While MLC predicts bipartitions and LR predicts scores, Multi-label Ranking (MLR) generates both. Since MLC can be derived from the LR formulation (Gibaja and Ventura, 2015), if a strategy can be used in the LR task, it can also be used in the two other tasks.
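To make the three types of output concrete, the base R sketch below derives a bipartition and a ranking from a score matrix. The score values, the label names and the 0.5 cut-off are illustrative assumptions and are not taken from the package.

```r
# Hypothetical scores f(x, lambda_i) for 3 instances and 4 labels
scores <- matrix(c(0.9, 0.2, 0.6, 0.1,
                   0.3, 0.8, 0.4, 0.7,
                   0.1, 0.2, 0.3, 0.4),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("x", 1:3), paste0("lambda", 1:4)))

# MLC output: bipartition obtained with an assumed global cut-off of 0.5
bipartition <- (scores >= 0.5) * 1

# LR output: rank 1 is given to the label with the highest score
ranking <- t(apply(scores, 1, function(s) rank(-s, ties.method = "first")))

print(bipartition)
print(ranking)
```

In this toy example no score of the third instance reaches the cut-off, so its bipartition is empty, while its ranking is still fully defined.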
These models can be obtained by two approaches (Tsoumakas et al., 2010): problem transformation and algorithm adaptation. Problem transformation converts the original MLC task into a set of binary or multi-class classification subtasks. Afterwards, any classification algorithm, here called base algorithm, can be used to induce models for the subtasks. In the algorithm adaptation approach, the multi-label support is embedded into the algorithm structure. Thus, while transformation fits data to algorithms, adaptation fits algorithms to data (Zhang and Zhou, 2014).
The transformation approach can be performed in three different ways: binary, pairwise and powerset. Binary transformation generates at least one dataset per label, as in the one-versus-all multiclass strategy. Pairwise transformation, instead, creates one dataset for each pair of labels, similarly to the one-versus-one multiclass strategy. Finally, powerset is a multi-class transformation that uses labelsets as classes. The adaptation approach, on the other hand, modifies conventional ML algorithms, like Decision Tree Induction Algorithms (DT), K-Nearest Neighbors (KNN), Random Forest (RF) and Support Vector Machines (SVM).
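The three transformations can be pictured with a few lines of base R. This is only a sketch on a made-up dataset: the attribute and label names are hypothetical, and the pairwise variant shown here keeps only the instances in which exactly one label of the pair is relevant, which is one common choice and an assumption of this example.

```r
# Toy multi-label data: 2 attributes and 3 labels (illustrative values)
toy <- data.frame(
  a1 = c(0.1, 0.4, 0.8, 0.3),
  a2 = c(1.0, 0.2, 0.5, 0.9),
  l1 = c(1, 0, 1, 0),
  l2 = c(0, 1, 1, 0),
  l3 = c(1, 0, 1, 1)
)
attrs  <- c("a1", "a2")
labels <- c("l1", "l2", "l3")

# Binary transformation: one binary dataset per label
binary.sets <- lapply(labels, function(lb) {
  cbind(toy[attrs], class = factor(toy[[lb]], levels = c(0, 1)))
})
names(binary.sets) <- labels

# Pairwise transformation: one dataset per pair of labels
pairs <- combn(labels, 2, simplify = FALSE)
pairwise.sets <- lapply(pairs, function(p) {
  keep <- toy[[p[1]]] != toy[[p[2]]]   # exactly one label of the pair is relevant
  cbind(toy[keep, attrs],
        class = factor(ifelse(toy[keep, p[1]] == 1, p[1], p[2])))
})

# Powerset transformation: each distinct labelset becomes one class
powerset <- cbind(toy[attrs],
                  class = factor(apply(toy[labels], 1, paste, collapse = "")))
```

Any single-label classifier available in R could then be trained on each of the resulting data frames, which is the role played by the base algorithm.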
Other steps required for the application of ML algorithms need to be adapted to deal with MLC tasks. For example, stratified sampling for MLC data must take into account multiple targets and the predictive performance evaluation must consider situations like partially correct results and ranking accuracy. A complete overview of the alternatives to deal with these issues can be seen in Zhang and Zhou (2014) and Gibaja and Ventura (2015).

Handling multi-label classification data
The predictive performance of MLC tasks can be strongly affected by the use of data pre-processing techniques. For data handling, utiml relies on the mldr package (Charte and Charte, 2015), which provides support for data pre-processing. Moreover, when utiml is installed/loaded, the mldr package is automatically installed/loaded. In particular, mldr supports the MLC ARFF format, which has an additional XML file describing the label columns. By default, the mldr package handles categorical data as "character" instead of "factor", which is not supported by the implementation of some traditional machine learning algorithms available in R, like Random Forest from the randomForest package. To address this limitation, the mldata function converts all text columns of an "mldr" dataset to factors. For example, mldata should be used to load the 'flags' dataset, which contains categorical attributes:
> flags <- mldata(mldr("flags"))
After a dataset is loaded, pre-processing techniques can be applied to it. Table 1 shows the pre-processing techniques available in the utiml package. All these functions receive an "mldr" dataset as argument and return a pre-processed version of this dataset.

| Function | Description |
| --- | --- |
| fill_sparce_mldata(mdata) | Replaces the NA values present in the dataset with 0 or "", according to the attribute type. |
| normalize_mldata(mdata) | Re-scales all numerical attribute values to values between 0 and 1 according to the min-max transformation. The lowest value is converted to 0.0 and the highest value is converted to 1.0. |
| remove_attributes(mdata, attributes) | Removes the specified attributes from the dataset. |
| remove_labels(mdata, labels) | Removes the specified labels from the dataset. |
| remove_unique_attributes(mdata) | Removes from the dataset attributes whose values are the same for all instances. |
| remove_unlabeled_instances(mdata) | Removes from the dataset instances without class labels. |
| remove_skewness_labels(mdata, t) | Removes from the dataset highly infrequent or highly frequent labels, according to a specific threshold value. The threshold t indicates the minimum number of positive and negative instances associated with each label. |
| replace_nominal_attributes(mdata) | Replaces categorical attributes by binary attributes. An attribute with n different values will be mapped to n − 1 new columns containing binary values. |

Table 1: Pre-processing techniques available in the utiml package

The utiml package also supports the main methodologies for data sampling, as shown in Table 2. The holdout and k-fold sampling functions can partition a dataset randomly or in a stratified way. The partitioning is selected by a parameter named method, which determines the sampling algorithm that creates the partitions. According to Sechidis et al. (2011), the accepted values are "random", "iterative" and "stratified", where the latter two are different stratification options. The "iterative" process stratifies an MLC dataset considering each label independently, while "stratified" is based on the different combinations of labels, also known as labelsets.
These techniques were designed to complement the mldr package and to simplify the data preparation for the learning step. Concerning the analysis of MLC data, utiml does not provide additional resources in the current version. However, it is possible to use the mldr package, which enables the understanding and exploration of several data aspects through an interactive interface (Charte and Charte, 2015).
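Two of the pre-processing steps listed in Table 1, the removal of constant attributes and the min-max normalization, can be sketched in base R as follows. This is not the package implementation: the data frame, its column names and the helper min.max are made up for the illustration.

```r
# Illustrative attributes: 'const' has the same value for all instances
x <- data.frame(height = c(150, 160, 180, 200),
                weight = c(50, 80, 65, 95),
                const  = c(1, 1, 1, 1))

# Remove attributes with a single distinct value (cf. remove_unique_attributes);
# doing this first also avoids a division by zero in the next step
x <- x[vapply(x, function(col) length(unique(col)) > 1, logical(1))]

# Min-max transformation: lowest value -> 0.0, highest value -> 1.0
min.max <- function(col) (col - min(col)) / (max(col) - min(col))
x.norm <- as.data.frame(lapply(x, min.max))

print(x.norm)
```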

| Sampling function | Description |
| --- | --- |
| create_holdout_partition(mdata, partitions, method) | Splits the data into at least two distinct parts. The second parameter defines the name and size of the partitions and method defines the type of sampling. |
| create_kfold_partition(mdata, k, method) | Creates an object that contains the k distinct parts of the dataset using the method for splitting the folds. It should be used in combination with partition_fold(object, fold), which provides the training and testing data for a specific fold. The parameter object is the result of create_kfold_partition. |
| create_random_subset(mdata, instances, attributes, replacement) | Creates a random subset of the dataset based on the proportion of instances and attributes. When replacement=TRUE, the same instance can appear more than once in the training data. |
| create_subset(mdata, rows, cols) | Creates a specific subset of the dataset based on the instances (rows) and attributes (cols) specified. |

Table 2: Sampling functions available in the utiml package

The utiml package also provides two MLC datasets: toyml, a synthetic dataset generated by the Mldatagen tool (Tomás et al., 2014), and foodtruck, in which several food truck cuisines are mapped as labels (Rivolli et al., 2017).
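The idea behind a random k-fold partition, and behind retrieving the training and test data of one fold, can be sketched in base R. The number of instances and the helper get.fold are assumptions of this example; the stratified options additionally use the label information and are not reproduced here.

```r
set.seed(123)   # fixed seed so that the partition can be reproduced
n <- 100        # assumed number of instances
k <- 10         # number of folds

# Assign each instance, at random, to one of k folds of equal size
fold.of <- sample(rep(seq_len(k), length.out = n))

# Training and test indices for a given fold, in the spirit of partition_fold
get.fold <- function(fold) {
  list(train = which(fold.of != fold), test = which(fold.of == fold))
}

f1 <- get.fold(1)
length(f1$train)  # 90 instances for training
length(f1$test)   # 10 instances for testing
```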

Multi-label classification strategies
The classification strategies are the heart of the utiml package. Table 3 shows the strategies available in the current version of the package. Some of the implemented strategies, such as brplus, ctrl, dbr, lift, prudent and rdbr, were not found by the authors in other tools.
Transformation strategies build the multi-label models by using an ML base algorithm. The base.algorithm parameter defines the base algorithm employed to create the internal models. Table 4 shows the ML algorithms currently supported by the package. Their use requires additional packages, as indicated in the column "R function Called". For example, the "C5.0" algorithm requires the C50 package to be installed. Only the "MAJORITY" and "RANDOM" options require no additional packages.
The arguments of the transformation strategies follow the pattern:
1. mdata: an "mldr" dataset object.
2. base.algorithm: a base algorithm, as listed in Table 4.
3. additional strategy parameters: specific parameters for each strategy. While the BR strategy contains no additional parameters, the ensemble ECC receives 4 specific parameters, namely m, subsample, attr.space and replacement.
4. ...: extra parameters used by the base.algorithm selected. As an illustration, if base.algorithm = "SVM", the extra parameters can be those defined in the svm function of the e1071 package, such as kernel, gamma and cost, among others.
5. cores: number of cores used for the parallelization of the training phase. Note that some classification strategies, such as lp, ignore this parameter because their tasks cannot be parallelized.
6. seed: a seed that ensures reproducibility. This is particularly important when the task is parallelized. If cores = 1, the effect of seed is similar to that of set.seed(seed). However, if cores is higher than 1, the set.seed(seed) command does not guarantee that the same result will be obtained, since the task is performed in parallel, so the seed argument should be used instead.
After the creation of an MLC model, the model can be applied to new data through the S3 predict function. The arguments of the predict function are:
1. object: a multi-label classifier.
2. newdata: an "mldr" dataset object with the instances to be predicted.
3. additional model parameters: specific parameters for each model. For example, the vote scheme to be used in the ecc prediction function can be defined.
4. probability: a logical value that indicates if the prediction result should be probability/score or bipartition. If TRUE, a probability result is returned; otherwise, the bipartition is obtained. The result can be changed, as observed next.
5. ...: extra parameters passed to the S3 predict function of the base.algorithm selected.
6. cores: number of cores used for the parallelization of the prediction phase. Some models, like CC, ignore this parameter because their tasks cannot be parallelized.
7. seed: a seed that ensures reproducibility. This is particularly important when the task is parallelized and the base algorithm is not deterministic.
The prediction result is an "mlresult" type object and can be used directly as a matrix, where each column is a label and each row is an instance. To change the type of result to a bipartition, probability/score or ranking matrix, the functions as.bipartition, as.probability and as.ranking, respectively, can be used.
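The argument pattern described above can be exercised with a short sketch. It assumes that utiml is installed (the block is skipped otherwise) and uses the "RANDOM" base algorithm only because it requires no additional package; the seed value is arbitrary and the model is trained and applied to the same toyml data purely for illustration.

```r
# The block runs only when utiml is installed; otherwise it is skipped
has.utiml <- requireNamespace("utiml", quietly = TRUE)

if (has.utiml) {
  library(utiml)

  # Training: mdata, base.algorithm, then cores and seed
  model <- br(toyml, "RANDOM", cores = 1, seed = 123)

  # Prediction: object, newdata and the type of result (probability/score)
  pred <- predict(model, toyml, probability = TRUE)

  # The same result seen as bipartition, probability/score and ranking
  print(head(as.bipartition(pred)))
  print(head(as.probability(pred)))
  print(head(as.ranking(pred)))
}
```

Replacing "RANDOM" by another base algorithm of Table 4 should only require that the corresponding package is installed.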

Multi-label post-processing
Threshold functions adjust the bipartition result according to the score/probabilities predicted by the predictive models. In MLC learning, these functions can be score-based or rank-based, depending on the type of data used to define the threshold values. A single threshold value for all labels is named global threshold, the use of one value per label is named label-wise and the use of one value per instance is named instance-wise (Al-Otaibi et al., 2014).
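The difference between global, label-wise and instance-wise thresholds can be illustrated in base R. The scores, the cut-off values and k are made-up values, and the code is a sketch of the three ideas rather than the implementation of the package functions.

```r
# Hypothetical score matrix: 3 instances (rows) and 3 labels (columns)
scores <- matrix(c(0.9, 0.4, 0.2,
                   0.6, 0.7, 0.1,
                   0.3, 0.2, 0.4),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("l1", "l2", "l3")))

# Global threshold: the same cut-off for every label
global <- (scores >= 0.5) * 1

# Label-wise threshold: one cut-off per label (column)
cutoffs <- c(l1 = 0.5, l2 = 0.3, l3 = 0.4)
label.wise <- sweep(scores, 2, cutoffs, ">=") * 1

# Instance-wise, rank-based threshold: the k best-scored labels of each instance
k <- 2
instance.wise <- t(apply(scores, 1, function(s) {
  (rank(-s, ties.method = "first") <= k) * 1
}))
```

With the global cut-off the third instance receives no label, whereas the label-wise and instance-wise variants assign it one and two labels, respectively.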
Table 5 shows the threshold functions available in the utiml package. All of them receive a probability/score matrix or an "mlresult" object as input and return a new "mlresult" object with different bipartitions as output. The only exception is scut_threshold, which returns threshold values instead of bipartitions and should be combined with the fixed_threshold function. Additionally, the subset_correction function can be used as a threshold function (Senge et al., 2013). It changes the bipartition based on the labelsets present in the training data and outputs only the known labelsets.
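The idea of restricting the output to known labelsets can be sketched in base R. The rule used here, which replaces each predicted bipartition by the closest training labelset according to the Hamming distance and keeps the first one in case of ties, is an assumption of this example and may differ from the procedure implemented in the package; the helper name correct is hypothetical.

```r
# Labelsets observed in the (hypothetical) training data, one per row
known <- rbind(c(1, 0, 0), c(1, 1, 0), c(0, 0, 1))

# Predicted bipartitions; the first row is not a known labelset
predicted <- rbind(c(0, 1, 0), c(1, 1, 0))

# Replace a prediction by the closest known labelset (Hamming distance);
# which.min keeps the first labelset in case of ties
correct <- function(row) {
  dist <- rowSums(abs(sweep(known, 2, row)))
  known[which.min(dist), ]
}
corrected <- t(apply(predicted, 1, correct))

print(corrected)
```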

| Threshold function | Description | Approach |
| --- | --- | --- |
| fixed_threshold(prediction, threshold) | Applies a fixed global or label-wise threshold. | score-based |
| lcard_threshold(prediction, cardinality) | Applies an instance-wise threshold using the cardinality measure. | rank-based |
| mcut_threshold(prediction) | Applies an instance-wise threshold and selects the subset of labels of the highest interval between two sorted scores. | |
| pcut_threshold(prediction, ratio) | Applies a global or label-wise threshold using the ratio value to define the proportion of instances that will be relevant. | |
| rcut_threshold(prediction, k) | Applies an instance-wise threshold and defines the k labels with highest scores as relevant. | rank-based |
| scut_threshold(prediction, expected, loss.function) | Returns a label-wise threshold using a loss function that minimizes the difference between the value predicted and the expected prediction value. | score-based |

Table 5: Threshold functions available in the utiml package

Finally, utiml also supports the evaluation of MLC models. The multilabel_evaluate and multilabel_confusion_matrix functions can be used during the evaluation. The first calculates the traditional evaluation measures also available in mldr and MULAN, whereas the second generates a multi-label confusion matrix ("mlconfmat" object) detailing labels and instances.

Figure 1: Groups of multi-label evaluation measures
The multilabel_evaluate function receives an "mlresult" or an "mlconfmat" object and the desired evaluation measures. One or more measures, as well as one or more groups of measures, can be selected. Figure 1 shows the measures currently supported. A complete review of MLC evaluation measures can be found in Zhang and Zhou (2014) and Gibaja and Ventura (2015). Moreover, if the parameter labels=TRUE is set, the function returns a list that contains both the multi-label results and the detailed results per label.
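To give an intuition of what some bipartition-based measures compute, the base R sketch below evaluates a small made-up prediction. The matrices are illustrative, the formulas follow the usual definitions of these measures, and every instance is assumed to have at least one expected and one predicted label so that no division by zero occurs.

```r
# Expected labels and predicted bipartitions for 4 instances and 3 labels
expected  <- rbind(c(1, 0, 1), c(0, 1, 0), c(1, 1, 0), c(0, 0, 1))
predicted <- rbind(c(1, 0, 1), c(0, 1, 1), c(1, 0, 0), c(0, 0, 1))

# Hamming loss: fraction of instance-label pairs predicted incorrectly
hamming.loss <- mean(expected != predicted)

# Subset accuracy: fraction of instances whose whole labelset is correct
subset.accuracy <- mean(rowSums(expected != predicted) == 0)

# Example-based precision and recall, averaged over the instances
tp <- rowSums(expected == 1 & predicted == 1)
precision <- mean(tp / rowSums(predicted))
recall    <- mean(tp / rowSums(expected))
```

The second and third instances are partially correct: they count as errors for the subset accuracy but still contribute to precision and recall.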

Default options
The utiml package uses the options function to customize some default parameters. For example, the default base algorithm for all transformation strategies is "SVM". The utiml.base.algorithm option can be used to change this parameter value. Table 6 shows the option names, a brief description of each option and their default values. The following code defines Random Forest as the default base algorithm and sets the default number of cores to 8, to illustrate the setting of the options.
> options(utiml.base.algorithm = "RF", utiml.cores = 8)
The utiml.empty.prediction option defines whether the MLC strategies can predict no labels for one or more instances. Among the alternatives to avoid an empty prediction (Liu and Chen, 2015), utiml outputs the labels with the highest probability/score. It must be observed that this option may directly interfere with the result of the bipartition evaluation measures. Thus, it must be set according to the characteristics of the experiment being carried out.
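The strategy of assigning the best-scored labels to instances with an empty prediction can be pictured with the following base R sketch. The scores and the 0.5 cut-off are assumptions of the example, and the code is not the package implementation.

```r
# Hypothetical scores: no label of the second instance reaches the 0.5 cut-off
scores <- rbind(c(0.8, 0.3, 0.6),
                c(0.2, 0.4, 0.1))
bipartition <- (scores >= 0.5) * 1

# For instances without predicted labels, mark the label(s) with the highest score
empty <- which(rowSums(bipartition) == 0)
for (i in empty) {
  bipartition[i, scores[i, ] == max(scores[i, ])] <- 1
}

print(bipartition)
```

After the correction every instance has at least one relevant label, which is why measures computed over bipartitions can change when the option is enabled.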

How to use the package for multi-label classification experiments
The toyml dataset was used in the examples illustrated in this section. As toyml has two irrelevant attributes ("iatt8" and "iatt9") and one redundant attribute ("ratt10"), the remove_attributes pre-processing function can be applied to remove them.
> new.toyml <- remove_attributes(toyml, c("iatt8", "iatt9", "ratt10"))
> pre.process <- function (mdata) {
+   aux <- remove_skewness_labels(mdata, 5)  # Remove infrequent labels (less than 5)
+   aux <- remove_unlabeled_instances(aux)   # Remove instances without labels
+   aux <- remove_unique_attributes(aux)     # Remove constant attributes
+   return(aux)                              # Return the pre-processed dataset
+ }
As toyml is already normalized and the dataset has a small number of instances, no other pre-processing technique is required. Thus, the pre.process function has no effect if applied in this case. For other datasets, the same procedure can be useful for their preparation for MLC experiments. Two scenarios that illustrate the use of utiml for the development of MLC experiments are presented next. Finally, a simple experimental analysis is performed using the foodtruck dataset, illustrating the use of the package in a more realistic scenario.

Training and test experiment example using holdout
This example shows an MLC experiment using holdout, in which 70% of the dataset instances are used for training and 30% for testing. A BR model that uses Random Forest as a base algorithm is induced and applied to the test instances. Next, predictions are assessed using MLC evaluation measures.
> head(as.bipartition(predictions
Three different ECC models are created in the following code to illustrate the use of different parameters and base algorithms. Each model uses a specific base algorithm and configuration, which, consequently, results in different models for the same data and MLC strategy.
> # Using KNN with k = 5 and changing ECC parameters
> model1 <- ecc(ds$train, "KNN", m=7, subsample=0.8, k=5)
> # Using C5.0 and changing ECC parameters
> model2 <- ecc(ds$train, "C5.0", subsample=0.6, attr.space=1)
> # Using SVM with cost = 10 and gamma = 0.5 and default ECC parameters
> model3 <- ecc(ds$train, "SVM", cost=10, gamma=0.5)
By default, the create_holdout_partition function creates two random partitions (train and test) with 70% and 30% of the dataset instances, respectively. The number of partitions, their sizes and the sampling method can be modified. The following code shows how to create three label-stratified partitions, named "train", "test" and "val", with 70%, 20% and 10% of the instances, respectively. The "val" partition can be used in a validation step for model selection or hyperparameter tuning.
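A minimal sketch of such a three-way split is given below. It assumes that utiml is installed (the block is skipped otherwise), uses the original toyml dataset so that the example is self-contained, and selects the "iterative" method as the label-wise stratification; the object names partitions and strat are arbitrary.

```r
# Names and proportions of the partitions
partitions <- c(train = 0.7, test = 0.2, val = 0.1)

# The block runs only when utiml is installed; otherwise it is skipped
has.utiml <- requireNamespace("utiml", quietly = TRUE)

if (has.utiml) {
  library(utiml)

  # "iterative" stratifies the data considering each label independently
  strat <- create_holdout_partition(toyml, partitions, method = "iterative")

  # Number of instances in each partition
  print(sapply(strat, function(part) part$measures$num.instances))
}
```

The partitions would then be accessed by name, for instance strat$train for model induction and strat$val for model selection.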

Training and test example experiment using k-fold cross validation
This section shows some examples of how to perform cross-validation MLC experiments. The cv function can be used to encapsulate the whole procedure, which simplifies the task, so that a 10-fold stratified cross-validation can be performed with a few lines of code. For instance, the RAkEL strategy using the "SVM" base algorithm can be evaluated in the following way:
> # Defining the evaluation measures
> measures <- c("hamming-loss", "subset-accuracy", "one-error")
> # Running 10-fold cross validation
> results <- cv(new.toyml, method="rakel", base.algorithm="SVM", cv.folds=10,
+               cv.sampling="stratified", cv.measures=measures, cv.seed=123)
> round(results, 4)
## hamming-loss one-error subset-accuracy
## 0.212 0.160 0.240
To obtain detailed results by folds and/or labels, the parameter cv.results=TRUE can be set. In this case, a list is returned where the multi-label and labels' results can be obtained as illustrated in the next example.
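The procedure that a cross-validation run encapsulates (partitioning, training, prediction, evaluation and averaging) can be written out by hand in base R. The sketch below uses synthetic labels and a trivial baseline that predicts the majority value of each label; both are assumptions of the example and are unrelated to the RAkEL result reported above.

```r
set.seed(123)
n <- 50   # assumed number of instances
k <- 5    # number of folds

# Synthetic label matrix with 3 labels (illustrative data only)
Y <- matrix(rbinom(n * 3, 1, 0.3), nrow = n)
fold.of <- sample(rep(seq_len(k), length.out = n))

# One Hamming loss value per fold, using a majority-value baseline per label
fold.loss <- vapply(seq_len(k), function(fold) {
  train <- Y[fold.of != fold, , drop = FALSE]
  test  <- Y[fold.of == fold, , drop = FALSE]
  majority <- (colMeans(train) >= 0.5) * 1      # "model" induced on the training folds
  pred <- matrix(majority, nrow(test), ncol(test), byrow = TRUE)
  mean(pred != test)                            # Hamming loss on the test fold
}, numeric(1))

# Cross-validated estimate: the average over the folds
cv.hamming.loss <- mean(fold.loss)
```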

Experiments with the food truck dataset
In order to show how the package can be used in a real-world problem, this section illustrates the use of utiml to perform an exploratory analysis of the food truck dataset (Rivolli et al., 2017). First, the br strategy is evaluated with different ML base algorithms ("C5.0", "RF", "SVM" and "XGB") to identify the algorithm that produces the best macro and micro-F1 results.
The next code shows that six labels (mexican_food, chinese_food, japanese_food, arabic_food, healthy_food and fitness_food) had neither True Positive (TP) nor False Positive (FP) predictions. Thus, for these labels, all instances were predicted as negative. This explains the difference observed between the macro and micro-F1 results, since the macro-F1 is the average of the per-label F1 values, which is 0 for these labels.
> set.seed(1)
> ds <- create_holdout_partition(foodtruck, method="iterative")
> model <- br(ds$train, "RF")
> pred <- predict(model, ds$test)
It must be observed that the cm object is a list containing several pieces of information about the prediction, like the confusion matrix values summarized by instances and labels. Any evaluation measure can be computed using only the information provided by this object. As an example, the next code summarizes the proportion of instances by the number of labels correctly predicted in the previous example. The results show that the BR model was not able to predict a correct label for 20% of the test instances; around 65% of the instances were correctly predicted with a single label; 12% were correctly predicted with 2 labels; and 3% were correctly predicted with 3 labels.
> prop.table(table(cm$TPi))
##     0     1     2     3
## 0.200 0.648 0.120 0.032
These results show that a researcher can simulate new scenarios and explore different solutions in order to improve the predictive performance in an MLC task. The utiml package offers several resources that simplify the most basic and recurrent procedures adopted in the MLC domain.
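The gap between macro and micro-F1 discussed in this section can be reproduced with a small numerical sketch in base R. The confusion counts are made up and do not come from the foodtruck experiment; setting F1 to 0 when a label has no true positives is the convention assumed here.

```r
# Hypothetical per-label confusion counts; label "c" is never predicted
tp <- c(a = 40, b = 10, c = 0)
fp <- c(a = 10, b = 5,  c = 0)
fn <- c(a = 10, b = 10, c = 8)

# F1 from confusion counts, defined as 0 when there are no true positives
f1 <- function(tp, fp, fn) ifelse(tp == 0, 0, 2 * tp / (2 * tp + fp + fn))

# Macro-F1: average of the per-label F1 values (label "c" contributes 0)
macro.f1 <- mean(f1(tp, fp, fn))

# Micro-F1: F1 computed from the counts summed over all labels
micro.f1 <- f1(sum(tp), sum(fp), sum(fn))

round(c(macro = macro.f1, micro = micro.f1), 3)
```

Because the never-predicted label enters the macro average with an F1 of 0 while adding only a few false negatives to the pooled counts, the macro-F1 is clearly lower than the micro-F1.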

Summary
Data classification is one of the main ML tasks. Although ML classification algorithms are usually designed and employed for single-label classification tasks, in several application domains an instance can have more than one class label. This paper introduced the utiml package, which provides several functions for MLC experiments in R. Similarly to MULAN, one of the most popular MLC tools, utiml offers a wide set of functionalities. The provided functions implement procedures that cover several MLC-related tasks, which include data pre-processing, data sampling, model induction, optimization and evaluation of MLC models. The utiml package also supports the intrinsic parallelization of tasks and allows the reproducibility of MLC experiments.
To the best of the authors' knowledge, some of the features present in utiml are not available in any other R tool, such as the implementation of MLC stratification (Sechidis et al., 2011), baselines (Metz et al., 2012), thresholds (Al-Otaibi et al., 2014) and an option that allows users to avoid the empty prediction problem (Liu and Chen, 2015). Moreover, as in MULAN, which enables users to take advantage of the resources available in the Weka environment, utiml users can benefit from the several libraries available in R.
The most important limitation of this package is that some common MLC procedures, like feature selection, the handling of imbalanced data and classification strategies based on the algorithm adaptation approach, are not available yet. They will be implemented in the future as a natural progression of this work and will be included in the next versions of the utiml package. The authors encourage other developers to integrate their own algorithms in the utiml package, so that it becomes a more robust and complete MLC package.

Table 3: Strategies available in the utiml package

Table 4: Base algorithms available in the utiml package