The NoiseFiltersR Package: Label Noise Preprocessing in R


In Data Mining, the value of extracted knowledge is directly related to the quality of the data used. This makes data preprocessing one of the most important steps in the knowledge discovery process. A common problem affecting data quality is the presence of noise. A training set with label noise can reduce the predictive performance of classification learning techniques and increase the overfitting of classification models. In this work we present the NoiseFiltersR package. It contains the first extensive R implementation of classical and state-of-the-art label noise filters, which are the most common techniques for preprocessing label noise. The algorithms used for the implementation of the label noise filters are appropriately documented and referenced. They can be called in an R-user-friendly manner, and their results are unified by means of the "filter" class, which also benefits from adapted print and summary methods.

1 Introduction

In recent years, the large quantity of data of many different kinds and from different sources has created numerous challenges in the Data Mining area. Not only their size, but also their imperfections and varied formats are providing researchers with plenty of new scenarios to be addressed. Consequently, Data Preprocessing has become an important part of the KDD (Knowledge Discovery from Databases) process, and related software development is also essential to provide practitioners with adequate tools.

Data Preprocessing intends to process the collected data appropriately so that subsequent learning algorithms can not only extract meaningful and relevant knowledge from the data, but also induce models with high predictive or descriptive performance. Data preprocessing is known as one of the most time-consuming steps in the whole KDD process. There exist several aspects involved in data preprocessing, like feature selection, dealing with missing values and detecting noisy data. Feature selection aims at extracting the most relevant attributes for the learning step, thus reducing the complexity of models and the computing time taken for their induction. The treatment of missing values is also essential to keep as much information as possible in the preprocessed dataset. Finally, noisy data refers to values that are either incorrect or clearly far from the general underlying data distribution.

All these tasks have associated software available. For instance, the KEEL tool contains a broad collection of data preprocessing algorithms, which covers all the aforementioned topics. There are many other general-purpose Data Mining software tools with data preprocessing functionalities, like WEKA, KNIME, RapidMiner or R.

Regarding the R statistical software, there are plenty of packages available in the Comprehensive R Archive Network (CRAN) repository to address preprocessing tasks. For example, MICE and Amelia are very popular packages for handling missing values, whereas caret or FSelector provide a wide range of techniques for feature selection. There are also general-purpose packages for detecting outliers and anomalies, like mvoutlier. If we examine software in CRAN developed to tackle label noise, there already exist non-preprocessing packages that provide label noise robust classifiers. For instance, robustDA implements a robust mixture discriminant analysis, while the probFDA package provides a probabilistic Fisher discriminant analysis related to the seminal work in .

However, to the best of our knowledge, CRAN lacks an extensive collection of label noise preprocessing algorithms for classification, some of which are among the most influential preprocessing techniques. This is the gap we intend to fill with the release of the NoiseFiltersR package, whose taxonomy is inspired by the recent survey on label noise by B. Frénay and M. Verleysen. It should be noted, though, that other packages include isolated implementations of label noise filters, since they are sometimes needed as auxiliary functions. This is the case of the unbalanced package, which deals with imbalanced classification. It contains basic versions of classical filters, such as Tomek-Links or ENN, which are typically applied after oversampling an imbalanced dataset (which is the main purpose of the unbalanced package).

In the following section we briefly introduce the problem of classification with label noise, as well as the most popular techniques to overcome this problem. Then, we show how to use the NoiseFiltersR package to apply these techniques in a unified and R-user-friendly manner. Finally, we present a general overview of this work and potential extensions.

2 Label noise preprocessing

Data collection and preparation processes are usually subject to errors in Data Mining applications. Consequently, real-world datasets are commonly affected by imperfections or noise. In a classification problem, several effects of this noise can be observed by analyzing its spatial characteristics: noise may create small clusters of instances of a particular class in the region of the instance space that corresponds to another class, displace or remove instances located in key areas within a given class, or disrupt the boundaries of the classes, resulting in an increased overlap between them. All these imperfections may harm the interpretation of the data, the design, size, building time, interpretability and accuracy of models, as well as decision making.

In order to alleviate the effects of noise, we first need to identify and quantify the components of the data that can be affected. As described by , from the large number of components that comprise a dataset, class labels and attribute values are two essential elements in classification datasets. Thus, two types of noise are commonly differentiated in the literature: attribute noise, which affects the attribute values, and label noise, which affects the class labels.

The NoiseFiltersR package (and the rest of this manuscript) focuses on label noise, which is known to be the most disruptive one, since label quality is essential for classifier training. In , the mechanisms that generate label noise are examined and related to the appropriate treatment procedures that can be safely applied. In the specialized literature there exist two main approaches to deal with label noise, both surveyed in : algorithm-level approaches, which rely on classifiers that are robust to label noise, and data-level approaches, which filter the noisy instances in a preprocessing step.

The NoiseFiltersR package follows the data-level approach, since this allows the data preprocessing to be carried out just once, with any classifier applied thereafter, whereas algorithm-level approaches are specific to each classification algorithm. (Of course, in R there exist implementations of very popular label noise robust classifiers, i.e. the aforementioned algorithm-level approach, such as C4.5 and RIPPER, which are called J48 and JRip respectively in the RWeka package, which is an R interface to the WEKA software, or the method described in , which has its own package.) Regarding data-level handling of label noise, we take the aforementioned survey by as the basis for our NoiseFiltersR package. That work provides an overview and references for the most popular classical and state-of-the-art filters, which are organized and classified taking into account several aspects, including the noise identification strategy and the noise handling approach (removing or relabelling).
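To make the data-level idea more concrete, the following base R sketch illustrates a simplified similarity-based filter in the spirit of ENN: an instance is removed when its label disagrees with the majority label of its k nearest neighbours. The function name enn_sketch, its arguments and the shape of its return value are assumptions made for this illustration; it is not the implementation shipped in NoiseFiltersR and it assumes purely numeric attributes.

```r
# Hypothetical helper for illustration only (not a NoiseFiltersR function).
# Simplified ENN-style rule: drop instances whose label differs from the
# majority label among their k nearest neighbours (Euclidean distance).
enn_sketch <- function(x, classColumn = ncol(x), k = 3) {
  labels <- x[[classColumn]]
  feats  <- as.matrix(x[, -classColumn, drop = FALSE])  # assumes numeric attributes
  d <- as.matrix(dist(feats))   # pairwise Euclidean distances
  diag(d) <- Inf                # an instance is not its own neighbour
  keep <- vapply(seq_len(nrow(x)), function(i) {
    nn <- order(d[i, ])[seq_len(k)]             # indexes of the k nearest neighbours
    votes <- table(labels[nn])
    majority <- names(votes)[which.max(votes)]  # ties go to the first level
    majority == as.character(labels[i])
  }, logical(1))
  list(cleanData = x[keep, , drop = FALSE], remIdx = which(!keep))
}

data(iris)
noisy <- iris
noisy$Species[c(1, 2)] <- "virginica"  # deliberately mislabel two setosa instances
out <- enn_sketch(noisy, classColumn = 5)
out$remIdx
```

Because the filtering happens before any model is trained, the resulting cleanData can then be passed to whichever classifier is preferred, which is the main argument for the data-level approach given above.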

3 The NoiseFiltersR package

The released package implements, documents, explains and provides references for a broad collection of label noise filters surveyed in . To the best of our knowledge, it is the first comprehensive review and implementation of this topic for R, which has become an essential tool in Data Mining in recent years.

Namely, the NoiseFiltersR package includes a total of 30 filters, which were published in 24 research papers. Each one of these papers is referenced in the corresponding filter documentation page, as shown in the next Documentation section (and particularly in Figure 1). Regarding the noise detection strategy, 13 of them are ensemble-based filters, 14 can be cataloged as similarity-based, and the other 3 are based on data complexity measures. Taking into account the noise handling approach, 4 of them integrate the possibility of relabelling, whereas the other 26 only allow for removing (which clearly evidences a general preference for data removal in the literature). The full list of implemented filters and their distribution according to the two aforementioned criteria is displayed in Table 1, which provides a general overview of the package.

Table 1: Names and taxonomy of available filters in the NoiseFiltersR package. Every filter is appropriately referenced in its documentation page, where the original paper is provided.
                   Noise Identification
Noise Handling     Ensemble                   Similarity    Data Complexity
Remove             C45robustFilter            AENN
                   C45votingFilter            BBNR
                   C45iteratedVotingFilter    CNN
                   dynamicCF                  DROP2         saturationFilter
                   edgeBoostFilter            DROP3         consensusSF
                   EF                         ENG           classifSF
                   ORBoostFilter              TomekLinks
                   hybridRepairFilter         GE

The rest of this section is organized as follows. First, a few lines are devoted to the installation process. Then, we present the documentation page for the filters, where further specific details can be looked up. After that, we focus on the two implemented methods to call the filters (default and formula). Finally, the "filter" class, which unifies the return value of the filters in the NoiseFiltersR package, is presented.


Installation

The NoiseFiltersR package is available on CRAN, so it can be downloaded and installed directly from the R command line by typing:

         > install.packages("NoiseFiltersR")

In order to easily access all the package’s functions, it must be attached in the usual way:

         > library(NoiseFiltersR)


Documentation

While this paper provides the user with an overview of the NoiseFiltersR package, it is also important to have access to specific information for each available filter. This information can be looked up in the corresponding documentation page, which in all cases includes the following essential items (see Figure 1 for an example):

Figure 1: Extract from the GE filter's documentation page, showing the aspects highlighted above.

As usual in R, the function documentation pages can be either checked on the CRAN website for the package or loaded from the command line with the commands ? or help:

    > ?GE
    > help(GE)

Calling the filters

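When we want to use a label noise filter in Data Mining applications, all we need to know is the dataset to be filtered and its class variable (i.e. the one that contains the label for each available instance). The NoiseFiltersR package provides two standard ways of specifying the class variable when calling the implemented filters (see also Figure 2 and the example below): the default method, which receives the dataset together with the index of the class column (classColumn argument), and the formula method, which receives a formula naming the class variable together with the dataset.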

Next, we provide an example of how to use these two methods to filter the iris dataset with edgeBoostFilter (we did not change the default parameters of the filter):

    # Checking the structure of the dataset (last variable is the class one)
    > data(iris)
    > str(iris)
    'data.frame':   150 obs. of  5 variables:
    $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 ...
    # Using the default method:
    > out_Def <- edgeBoostFilter(iris, classColumn = 5)
    # Using the formula method:
    > out_For <- edgeBoostFilter(Species~., iris)
    # Checking that the filtered datasets are identical:
    > identical(out_Def$cleanData, out_For$cleanData)
    [1] TRUE

Figure 2: Extract from edgeBoostFilter's documentation page, which shows the two methods for calling filters in the NoiseFiltersR package. In both cases, the parameters of the filter can be tuned through additional arguments.

Notice that, in the last command of the example, we used the $ operator to access the objects returned from the filter. In the next section we explore the structure and contents of these objects.
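For readers unfamiliar with how a single function can accept both calling conventions, the following base R sketch shows one common way to obtain this behaviour with S3 dispatch. The generic toyFilter, its placeholder filtering rule (dropping instances with a missing label) and the assumption that the formula has the form class ~ . are all hypothetical and chosen only for illustration; the sketch does not reproduce the internals of NoiseFiltersR.

```r
# Hypothetical generic for illustration only (not a NoiseFiltersR function).
toyFilter <- function(x, ...) UseMethod("toyFilter")

# Default method: receives the dataset and the index of the class column.
toyFilter.default <- function(x, classColumn = ncol(x), ...) {
  # Placeholder rule (an assumption of this sketch): drop rows with a missing label.
  rem <- which(is.na(x[[classColumn]]))
  clean <- if (length(rem) > 0) x[-rem, , drop = FALSE] else x
  list(cleanData = clean, remIdx = rem)
}

# Formula method: locates the class variable named on the left-hand side
# and delegates to the default method. Assumes a formula of the form class ~ .
toyFilter.formula <- function(formula, data, ...) {
  cls <- all.vars(formula[[2]])
  toyFilter.default(data, classColumn = match(cls, names(data)), ...)
}

data(iris)
d <- iris
d$Species[c(3, 10)] <- NA            # two instances with a missing label
out_def <- toyFilter(d, classColumn = 5)
out_for <- toyFilter(Species ~ ., d)
identical(out_def$cleanData, out_for$cleanData)
```

Delegating from the formula method to the default method keeps the filtering logic in a single place, so both calls necessarily return the same filtered dataset.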

The "filter" class

The S3 class "filter" is designed to unify the return value of the filters inside the NoiseFiltersR package. It is a list that encapsulates seven elements with the most relevant information of the process (cleanData, remIdx, repIdx, repLab, parameters, call and extraInf):

As an example, we can check the structure of the above out_For object, which was the return value of the edgeBoostFilter function:

    > str(out_For)
    List of 7
    $ cleanData :'data.frame':  142 obs. of  5 variables:
    ..$ Sepal.Length: num [1:142] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    ..$ Sepal.Width : num [1:142] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    ..$ Petal.Length: num [1:142] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    ..$ Petal.Width : num [1:142] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 ...
    $ remIdx    : int [1:8] 58 78 84 107 120 130 134 139
    $ repIdx    : NULL
    $ repLab    : NULL
    $ parameters:List of 3
    ..$ m        : num 15
    ..$ percent  : num 0.05
    ..$ threshold: num 0
    $ call      : language edgeBoostFilter(formula = Species ~ ., data = iris)
    $ extraInf  : chr "Highest edge value kept: 0.0669358381115568"
    - attr(*, "class")= chr "filter"
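The figures in this output can be cross-checked by hand with base R alone. The sketch below copies the eight indexes listed in remIdx and assumes that cleanData is simply the original dataset without those rows, which is consistent with the 142 observations reported above; the object names are ours.

```r
data(iris)
# Indexes copied from the remIdx element shown in the str() output.
remIdx <- c(58, 78, 84, 107, 120, 130, 134, 139)

# Assumption of this sketch: cleanData equals the original data minus the removed rows.
cleanSketch <- iris[-remIdx, ]
nrow(cleanSketch)

# Percentage of removed instances, as reported by the print method.
pctRemoved <- length(remIdx) / nrow(iris) * 100
pctRemoved

# Class distribution of the removed instances.
removedByClass <- table(iris$Species[remIdx])
removedByClass
```

In this example, none of the removed instances belongs to the setosa class: three are labelled versicolor and five virginica.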

In order to cleanly display this "filter" class in the R console, two specific print and summary methods were implemented. The appearance of the first one is as follows:

    > print(out_For)
    edgeBoostFilter(formula = Species ~ ., data = iris)
    m: 15
    percent: 0.05
    threshold: 0
    Number of removed instances: 8 (5.333333 %)
    Number of repaired instances: 0 (0 %)

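and contains three main blocks: the call to the filter, the parameters used, and the number (and percentage) of removed and repaired instances.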

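The summary method displays some extra blocks, such as a header with the names of the filter and the dataset, the additional information returned by the filter and, when explicit = TRUE, the explicit indexes of the removed instances.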

In the case of the previous out_For object, the output of the summary command has the following format:

    > summary(out_For, explicit = TRUE)
    Filter edgeBoostFilter applied to dataset iris 
    edgeBoostFilter(formula = Species ~ ., data = iris)
    m: 15
    percent: 0.05
    threshold: 0
    Number of removed instances: 8 (5.333333 %)
    Number of repaired instances: 0 (0 %)
    Additional information:
    Highest edge value kept: 0.0669358381115568 
    Explicit indexes for removed instances:
    58 78 84 107 120 130 134 139
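As an illustration of how such print and summary methods can be attached to an S3 class, the following base R sketch defines them for a hand-built object. The class name "toyFilterResult" is hypothetical (chosen so as not to clash with the package's "filter" class), the object is assembled manually from the values shown above, and the methods only approximate the layout of the real output.

```r
# Hypothetical S3 class for illustration only; it mimics part of the
# structure of the "filter" class but is not the package's implementation.
print.toyFilterResult <- function(x, ...) {
  n <- nrow(x$cleanData) + length(x$remIdx)   # size of the original dataset
  cat(deparse(x$call), "\n", sep = "")
  for (p in names(x$parameters)) {
    cat(p, ": ", format(x$parameters[[p]]), "\n", sep = "")
  }
  cat("Number of removed instances: ", length(x$remIdx),
      " (", length(x$remIdx) / n * 100, " %)\n", sep = "")
  invisible(x)
}

summary.toyFilterResult <- function(object, explicit = FALSE, ...) {
  print(object)
  if (explicit) {
    cat("Explicit indexes for removed instances:\n")
    cat(object$remIdx, "\n")
  }
  invisible(object)
}

data(iris)
remIdx <- c(58, 78, 84, 107, 120, 130, 134, 139)  # copied from the output above
res <- structure(
  list(cleanData  = iris[-remIdx, ],
       remIdx     = remIdx,
       parameters = list(m = 15, percent = 0.05, threshold = 0),
       # quote() stores the call without evaluating it, so the package is not needed
       call       = quote(edgeBoostFilter(formula = Species ~ ., data = iris))),
  class = "toyFilterResult"
)
print(res)
summary(res, explicit = TRUE)
```

Returning the object invisibly from both methods follows the usual R convention, so that typing the object name at the console prints it only once.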

4 Summary

In this paper, we introduced the NoiseFiltersR package, which is the first extensive R implementation of classification-oriented label noise filters. To set a context and motivation for this work, we presented the problem of label noise and the main approaches to deal with it inside data preprocessing, as well as the related software. As previously explained, the released package unifies the return value of the filters by means of the "filter" class, which benefits from specific print and summary methods. Moreover, it provides an R-user-friendly way to call the implemented filters, whose documentation is worth reading and points to the original reference where they were first published.

Regarding the potential extensions of this package, there exist several aspects which can be addressed in future releases. For instance, there exist some other label noise filters reviewed in the main reference whose noise identification strategy does not belong to the ones covered here: ensemble-based, similarity-based and data complexity-based (as shown in Table 1). Another relevant extension would be the inclusion of some datasets with different levels of artificially introduced label noise, in order to ease the experimentation workflow. (A wide variety of such datasets can be downloaded from the KEEL dataset repository website, and then loaded from R.)
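Until such datasets are bundled, label noise can also be introduced artificially from R itself. The following base R sketch uses one simple scheme, uniform random flipping of a given fraction of labels to a different class; the helper add_label_noise, its arguments and the noise scheme are assumptions made for this illustration and do not necessarily match the procedure used to build the KEEL noisy datasets.

```r
# Hypothetical helper for illustration only (not a NoiseFiltersR function).
# Flips the label of a random fraction of instances to a different class.
add_label_noise <- function(x, classColumn = ncol(x), level = 0.1, seed = 1) {
  set.seed(seed)                          # fixed seed for reproducibility
  labels <- x[[classColumn]]
  stopifnot(is.factor(labels), nlevels(labels) >= 2)
  n_noisy <- round(level * nrow(x))       # number of instances to corrupt
  idx <- sample(nrow(x), n_noisy)
  for (i in idx) {
    others <- setdiff(levels(labels), as.character(labels[i]))
    labels[i] <- sample(others, 1)        # pick any class other than the true one
  }
  x[[classColumn]] <- labels
  list(data = x, noisyIdx = sort(idx))
}

data(iris)
noisy <- add_label_noise(iris, classColumn = 5, level = 0.1)
length(noisy$noisyIdx)
```

Keeping the indexes of the corrupted instances makes it possible to compare them afterwards with the instances flagged by a filter.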

5 Acknowledgements

This work was supported by the Spanish Research Project TIN2014-57251-P, the Andalusian Research Plan P11-TIC-7765, and the Brazilian grants CeMEAI-FAPESP 2013/07375-0 and FAPESP 2012/22608-8. Luís P. F. Garcia was supported by FAPESP 2011/14602-7.

