The NoiseFiltersR Package: Label Noise Preprocessing in R

In Data Mining, the value of extracted knowledge is directly related to the quality of the data used. This makes data preprocessing one of the most important steps in the knowledge discovery process. A common problem affecting data quality is the presence of noise. A training set with label noise can reduce the predictive performance of classification learning techniques and increase the overfitting of classification models. In this work we present the NoiseFiltersR package. It contains the first extensive R implementation of classical and state-of-the-art label noise filters, which are the most common techniques for preprocessing label noise. The algorithms used for the implementation of the label noise filters are appropriately documented and referenced. They can be called in an R-user-friendly manner, and their results are unified by means of the "filter" class, which also benefits from adapted print and summary methods.


Introduction
In recent years, Data Mining has been faced with increasingly challenging problems in terms of the nature of available data. Not only its size, but also its imperfections and varied shapes, are providing researchers with plenty of different scenarios to be addressed. Consequently, Data Preprocessing [4] has become an important part of the KDD (Knowledge Discovery from Databases) process, and the development of related software is also essential to provide practitioners with adequate tools.
Data Preprocessing intends to appropriately process the collected data so that subsequent learning algorithms can extract maximum knowledge out of it. It is known to be one of the most time-consuming steps in the whole KDD process. There exist several aspects involved in data preprocessing, like feature selection or dealing with missing values and noisy data. Feature selection aims at extracting the most relevant attributes for the learning step, thus reducing the complexity of models and the computing time. The treatment of missing values is also essential to keep as much information as possible in the data. Finally, noisy data refers to values that are either incorrect or clearly far from the general underlying distribution.
All these tasks have associated software available. For instance, the KEEL tool [1] contains a broad collection of data preprocessing algorithms, which covers all the aforementioned topics. There are other popular options for these tasks, like SPSS or SAS for missing values, and WEKA [18] or RapidMiner for feature selection. Apart from these, there exist many other general-purpose Data Mining software suites like R, KNIME or Python.
Regarding the R statistical software, there are plenty of packages available in the Comprehensive R Archive Network (CRAN) repository to address preprocessing tasks. For example, MICE [16] and Amelia [6] are very popular packages for handling missing values, whereas caret [9] or FSelector [13] provide a wide range of techniques for feature selection. There are also general-purpose packages for detecting outliers and anomalies, like mvoutlier [2].
However, to the best of our knowledge, CRAN lacks an extensive collection of classification-oriented label noise preprocessing algorithms, some of which are among the most influential preprocessing techniques [5]. This is the gap we intend to fill with the release of the NoiseFiltersR package, whose taxonomy is inspired by the recent survey on label noise by B. Frénay and M. Verleysen [3].
Yet, it should be noted that there are other packages that include some isolated implementations of label noise filters, since they are sometimes needed as auxiliary functions. This is the case of the unbalanced package [11], which deals with imbalanced classification. It contains basic versions of classical filters such as Tomek-Links [15] or ENN [17], which are typically applied after oversampling an imbalanced dataset (which is the main purpose of the unbalanced package).
In Section 2 we briefly introduce the problem of classification with label noise, as well as the most popular techniques to overcome it. In Section 3 we show how to use the NoiseFiltersR package to apply these techniques in a unified and R-user-friendly manner. Finally, Section 4 presents a general overview of this work. Once the package is loaded, the source and R code for this vignette are directly available from R with the command browseVignettes.

Label noise preprocessing
Data collection and preparation processes are usually subject to errors in Data Mining applications [19]. Consequently, real-world datasets are commonly affected by imperfections or noise. In classification problems, this noise negatively affects the learning process of classifiers, leading to less accurate predictions, excessively complex models, and longer computation time.
Two different types of noise are usually distinguished in the specialized literature for classification: attribute noise and label noise (which is also called class noise) [20]. The former refers to imperfections in the attributes of the training dataset, whereas the latter relates to errors in the labels used for classification. The NoiseFiltersR package (and the rest of this work) focuses on label noise, which is known to be the most disruptive one, since label quality is essential for the classifier training [20].
In order to address the problem of label noise, there exist two main approaches in the literature, and both are surveyed in the recent work [3]. On the one hand, algorithm level approaches [12] attempt to create robust classification algorithms that are little influenced by the presence of noise.
On the other hand, data level approaches [8] (also called filters) try to develop strategies to cleanse the dataset as a step prior to fitting the classifier. The NoiseFiltersR package follows the second approach, since this allows the data preprocessing to be carried out just once and any classifier to be applied thereafter, whereas the first option is specific to each classification algorithm (see footnote 1).
Regarding data-level handling of label noise, we take the aforementioned survey by Frénay and Verleysen [3] as the basis for our NoiseFiltersR package. That work provides an overview and references for the most popular classical and state-of-the-art filters, which are organized and classified taking into account several aspects:
Considering how to identify noisy instances, ensemble based, similarity based and data complexity based algorithms are distinguished. The first type makes use of predictions from classifier ensembles built over different partitions or resamples of the training data, the second is based on the label distribution in the nearest neighbors of each instance, and the third attempts to reduce complexity metrics which are related to the presence of noise. As we will explain in Section 3 (see Figure 1), the NoiseFiltersR package contains implementations of all these types of algorithms, and the explicit distinction is indicated in the documentation page of each function.
Regarding how to deal with the identified noise, noise removal and noise reparation strategies are considered. The first option removes the noisy instances, whereas the second one relabels them with the most likely label on the basis of the information available. There also exist hybrid approaches, which only carry out relabelling when they have enough confidence in the new label, and otherwise remove the instance. The discussion between noise removal, noise reparation and their possible synergies is an active and open field of research [3, Section VI.H]: most works agree on the potential damages of incorrect relabelling [10], although other studies also point out the dangers of removing too many instances and advocate hybrid approaches [14]. As we will see in Section 3, the NoiseFiltersR package includes filters which implement all these possibilities, and the specific behaviour is explicitly indicated in the documentation page of the corresponding function.
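As a rough illustration of the similarity based strategy and of the removal and reparation options discussed in this section, the following base R sketch flags an instance when its label disagrees with the majority label among its k nearest neighbours. The function name toy_knn_filter, its arguments and its return value are assumptions made for this example only; it is not the code of any filter in the package.

```r
# Toy similarity based filter in base R (illustrative only, not package code).
# An instance is flagged as noisy when its label differs from the majority
# label among its k nearest neighbours (Euclidean distance, numeric attributes).
toy_knn_filter <- function(x, classColumn = ncol(x), k = 3, relabel = FALSE) {
  labels <- x[[classColumn]]
  attrs  <- as.matrix(x[, -classColumn, drop = FALSE])
  d <- as.matrix(dist(attrs))   # pairwise distances between instances
  diag(d) <- Inf                # an instance is not its own neighbour
  majority <- vapply(seq_len(nrow(x)), function(i) {
    nn <- order(d[i, ])[seq_len(k)]
    names(which.max(table(labels[nn])))   # ties go to the first level
  }, character(1))
  noisy <- which(majority != as.character(labels))
  cleaned <- x
  if (relabel) {
    # Noise reparation: replace the label with the neighbourhood majority.
    cleaned[[classColumn]][noisy] <- majority[noisy]
  } else if (length(noisy) > 0) {
    # Noise removal: drop the flagged instances.
    cleaned <- x[-noisy, , drop = FALSE]
  }
  list(cleanData = cleaned, noisyIdx = noisy)
}

# Small hand-made dataset: the second instance sits in the "a" cluster
# but carries the label "b".
toy <- data.frame(
  v  = c(0, 0.1, 0.2, 0.3, 10, 10.1, 10.2, 10.3),
  cl = factor(c("a", "b", "a", "a", "b", "b", "b", "b"))
)
removed  <- toy_knn_filter(toy, k = 3)
repaired <- toy_knn_filter(toy, k = 3, relabel = TRUE)
```

With relabel = FALSE the flagged rows are dropped, and with relabel = TRUE they are kept with a new label, which mirrors the removal versus reparation distinction above. A hybrid approach would relabel only when the neighbourhood vote is sufficiently one-sided.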

The NoiseFiltersR package
The released package implements, documents, explains and provides references for a broad collection of label noise filters surveyed in [3]. To the best of our knowledge, it is the first comprehensive review and implementation of this topic for R, which has become an essential tool in Data Mining in recent years.
Footnote 1: Of course, in R there exist implementations of very popular label noise robust classifiers (the aforementioned algorithm-level approach), such as C4.5 and RIPPER, which are called J48 and JRip respectively in the RWeka package [7] (which is an R interface to the WEKA software [18]).
Namely, the NoiseFiltersR package includes a total of 30 filters, which were published across 24 research papers (each of these papers is referenced in the corresponding filter documentation page, see Section 3.2). Regarding the noise detection strategy, 13 of them are ensemble based filters, 14 can be cataloged as similarity based, and the other 3 are based on data complexity measures.
Taking into account the noise handling approach, 4 of them integrate the possibility of relabelling, whereas the other 26 only allow for removing (which clearly evidences a general preference for data removal in the literature). The full list of implemented filters and their distribution according to the two aforementioned criteria is displayed in Figure 1, which provides a general overview of the package. The rest of this section is organized as follows. Section 3.1 is devoted to the installation process.

In Section 3.2 we present the documentation, where further details of each filter can be looked up. Section 3.3 focuses on the two implemented methods to call the filters. Finally, Section 3.4 presents the filter class, which unifies the return value of the filters in the NoiseFiltersR package.

Installation
The NoiseFiltersR package is available at CRAN servers, so it can be downloaded and installed directly from the R command line by typing:
    install.packages("NoiseFiltersR")
This command will also install the eleven dependencies of the package, which mainly provide the classification algorithms needed for the implemented filters, and which can be looked up in the Imports section of the CRAN website for the package (https://cran.r-project.org/web/packages/NoiseFiltersR/index.html).
In order to easily access all the package's functions, it must be attached in the usual way:
    library(NoiseFiltersR)

Documentation
Whereas this vignette provides the user with an overview of the NoiseFiltersR package, it is also important to have access to specific information for each available filter. This information can be looked up in the corresponding documentation page, which in all cases includes the following essential items (see Figure 2 for an example):
A description section, which indicates the type of filter according to the taxonomy explained at the end of Section 2 and summarized in Figure 1.
A details section, which provides the user with a general explanation of the filter's behaviour and any other usage particularity or warning.
A references section that points to the original contribution where the filter was proposed, where further details, motivations or contextualization can be found.
As usual in R, the function documentation pages can be either checked in the CRAN website for the package or loaded from the command line with the commands ? or help:
    ?GE
    help(GE)

Calling the filters
When it comes to applying a label-noise filter in Data Mining applications, all we need to know is the dataset to be filtered and its class variable (i.e. the one that contains the label for each available instance). The NoiseFiltersR package provides two standard ways for tagging the class variable when calling the implemented filters (see also Figure 3 and the example below):
The default method receives the dataset to be filtered in the x argument, and the number of the class column through the classColumn argument. If the latter is not provided, the last column of the dataset is assumed to contain the labels.
The formula method is intended for regular R users, who are used to this approach when fitting regression or classification models. It allows for indicating the class variable (along with the attributes to be used) by means of an expression like Class ~ Attr1 + ... + AttrN (recall that Class ~ . makes use of all attributes).
Next, we provide an example on how to use these two methods for filtering the iris dataset with edgeBoostFilter (we do not change the default parameters of the filter):
    data(iris)
    str(iris)
    ## 'data.frame': 150 obs. of 5 variables:
    ## $ Sepal.Length: num 5.1 4.9 4. [...]
Notice that, in the last command of the example, we used the $ operator to access the objects returned from the filter. In the next section we explore the structure and contents of these objects.
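The two calling conventions described in this section can be sketched as follows. The object name out_Def is chosen here for illustration, whereas out_For and the formula call match the ones referred to later in the text. The calls are guarded so that the snippet also runs on a system where NoiseFiltersR is not installed, and no particular output is claimed.

```r
# Sketch of the default and formula methods on the iris dataset.
data(iris)

if (requireNamespace("NoiseFiltersR", quietly = TRUE)) {
  library(NoiseFiltersR)

  # Default method: dataset in x, class column given by its position.
  out_Def <- edgeBoostFilter(iris, classColumn = 5)

  # Formula method: class variable on the left, attributes on the right.
  out_For <- edgeBoostFilter(Species ~ ., data = iris)

  # The returned objects are accessed with the $ operator.
  print(dim(out_Def$cleanData))
  print(dim(out_For$cleanData))
}
```

Since Species is the last column of iris, omitting classColumn in the default method would select the same class variable.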

The filter class
The S3 class filter is designed to unify the return value of the filters inside the NoiseFiltersR package. It is a list that encapsulates seven elements with the most relevant information of the process:
cleanData is a data.frame containing the filtered dataset.
remIdx is a vector of integers indicating the indexes of removed instances (i.e. their row number with respect to the original data.frame).
repIdx is a vector of integers indicating the indexes of repaired/relabelled instances (i.e. their row number with respect to the original data.frame).
repLab is a factor containing the new labels for repaired instances.
parameters is a list that includes the tuning parameters used for the filter.
call is an expression that contains the original call to the filter.
extraInf is a character vector including additional information not covered by previous items.
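To make the seven-element layout concrete, here is a minimal base R sketch that assembles a list with the same element names. The constructor new_toy_filter and the class name toyFilter are hypothetical and exist only for this example; the sketch assumes the class variable is the last column and does not reproduce the package's internal code.

```r
# Hypothetical constructor mimicking the seven-element layout described above.
# "toyFilter" is an illustrative class name, not the package's "filter" class.
new_toy_filter <- function(data, remIdx = integer(0), repIdx = integer(0),
                           repLab = NULL, parameters = list(),
                           extraInf = NULL) {
  cl <- match.call()
  cleanData <- data
  if (length(repIdx) > 0) {
    # Relabel repaired instances (class assumed to be the last column).
    cleanData[[ncol(data)]][repIdx] <- repLab
  }
  if (length(remIdx) > 0) {
    # Drop removed instances; indexes refer to rows of the original data.
    cleanData <- cleanData[-remIdx, , drop = FALSE]
  }
  structure(
    list(cleanData = cleanData, remIdx = remIdx, repIdx = repIdx,
         repLab = repLab, parameters = parameters, call = cl,
         extraInf = extraInf),
    class = "toyFilter"
  )
}

# Pretend rows 71 and 84 were removed and row 120 was relabelled.
obj <- new_toy_filter(
  iris,
  remIdx = c(71L, 84L),
  repIdx = 120L,
  repLab = factor("versicolor", levels = levels(iris$Species)),
  parameters = list(k = 3)
)
names(obj)
```

Keeping remIdx and repIdx expressed as row numbers of the original data.frame, as the text specifies, lets a user map the result back to the unfiltered data.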
As an example, we can check the structure of the above out_For object, which was the return value of the edgeBoostFilter function:
    [...]
    ## $ call     : language edgeBoostFilter(formula = Species ~ ., data = iris)
    ## $ extraInf : chr "Highest edge value kept: 0.0669358381115568"
    ## - attr(*, "class")= chr "filter"
In order to cleanly display this filter class in the R console, two specific print and summary methods were implemented. The output of the first one consists of the following blocks:
The original call to the filter.
The tuning parameters used for the filter.
An overview of the results, with the number (and percentage of the total) of removed and repaired instances.
The summary method displays some extra blocks:
It always adds a title that summarizes the filter and dataset used.
If there exists additional information in the extraInf component of the object, it is displayed under a homonymous block.
If the argument explicit is set to TRUE (it defaults to FALSE), the explicit results (i.e. the indexes for removed and repaired instances and the new labels for the latter) are displayed.
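The blocks listed above (call, parameters, and counts with percentages) can be produced by an S3 print method along the following lines. This is a sketch for a hypothetical class toyFilter with the same element names as the filter class; the layout and wording of the package's actual print and summary output may differ, and someFilter is a placeholder name inside the stored call.

```r
# Illustrative S3 print method for a hypothetical "toyFilter" object.
print.toyFilter <- function(x, ...) {
  n_removed  <- length(x$remIdx)
  n_repaired <- length(x$repIdx)
  n_total    <- nrow(x$cleanData) + n_removed   # size of the original data
  cat("Call:\n")
  print(x$call)
  cat("\nParameters:\n")
  for (p in names(x$parameters)) {
    cat(p, ": ", format(x$parameters[[p]]), "\n", sep = "")
  }
  cat("\nResults:\n")
  cat(sprintf("Removed instances: %d (%.2f%%)\n",
              n_removed, 100 * n_removed / n_total))
  cat(sprintf("Repaired instances: %d (%.2f%%)\n",
              n_repaired, 100 * n_repaired / n_total))
  invisible(x)
}

# Hand-built object: two rows of iris removed, none repaired.
toy_obj <- structure(
  list(cleanData = iris[-c(71, 84), ], remIdx = c(71L, 84L),
       repIdx = integer(0), repLab = NULL, parameters = list(k = 3),
       call = quote(someFilter(Species ~ ., data = iris)), extraInf = NULL),
  class = "toyFilter"
)
print(toy_obj)
```

A summary method could reuse the same pieces and add a title, the extraInf block, and, when explicit = TRUE, the index vectors themselves.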
In the case of the out_For object, the summary gets the following: [...]
In this vignette we introduced the NoiseFiltersR package, which is the first extensive R implementation of classification-oriented label-noise filters. To set a context and motivation for this work, we presented the problem of label noise inside data preprocessing and the related software. As we explained, the released package unifies the return value of the filters by means of an S3 class, which benefits from specific print and summary methods. Moreover, it provides an R-user-friendly way to call the implemented filters, whose documentation is worth reading and points to the original reference where they were first published.
Regarding the potential extensions of this package, there exist several aspects which can be addressed in future releases. For instance, there exist some other label-noise filters reviewed in the main reference [3] whose noise identification strategy does not belong to the ones covered here: ensemble based, similarity based and data complexity based (as shown in Figure 1). A relevant extension would be the inclusion of some datasets with different levels of artificially introduced label noise, in order to ease the experimentation workflow (see footnote 2).
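Until such datasets are available, label noise can be introduced artificially with a few lines of base R. The sketch below flips a chosen fraction of labels to a different class picked uniformly at random; the helper name inject_label_noise and this particular noise scheme are assumptions made for illustration, not something the package provides.

```r
# Illustrative helper (hypothetical name): flip a fraction `rate` of the
# labels to a different class chosen uniformly at random.
inject_label_noise <- function(labels, rate, seed = NULL) {
  stopifnot(is.factor(labels), nlevels(labels) >= 2, rate >= 0, rate <= 1)
  if (!is.null(seed)) set.seed(seed)
  n_flip <- round(rate * length(labels))
  idx <- sample(length(labels), n_flip)   # instances whose label is flipped
  noisy <- labels
  for (i in idx) {
    others <- setdiff(levels(labels), as.character(labels[i]))
    noisy[i] <- others[sample.int(length(others), 1)]
  }
  list(labels = noisy, flipped = sort(idx))
}

# 10% label noise on the iris class variable.
noisy10 <- inject_label_noise(iris$Species, rate = 0.10, seed = 42)
noisy_iris <- iris
noisy_iris$Species <- noisy10$labels
sum(noisy_iris$Species != iris$Species)   # number of flipped labels
```

Because the flipped indexes are returned, the instances removed or repaired by a filter can be compared against the ground truth when running experiments.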

Figure 1: Names and taxonomy of available filters in the NoiseFiltersR package.

Figure 2: Extract from the GE filter's documentation page, showing the aspects highlighted above.

Figure 3: Extract from edgeBoostFilter's documentation page, which shows the two methods for calling filters in the NoiseFiltersR package. In both cases, tuning parameters of the filter are provided through additional arguments.