The NoiseFiltersR Package: Label Noise Preprocessing in R

In Data Mining, the value of extracted knowledge is directly related to the quality of the data used. This makes data preprocessing one of the most important steps in the knowledge discovery process. A common problem affecting data quality is the presence of noise. A training set with label noise can reduce the predictive performance of classification learning techniques and increase the overfitting of classification models. In this work we present the NoiseFiltersR package. It contains the first extensive R implementation of classical and state-of-the-art label noise filters, which are the most common techniques for preprocessing label noise. The algorithms implementing the label noise filters are appropriately documented and referenced. They can be called in an R-user-friendly manner, and their results are unified by means of the "filter" class, which also benefits from adapted print and summary methods.

Pablo Morales (Department of Computer Science and Artificial Intelligence, University of Granada) , Julián Luengo (Department of Computer Science and Artificial Intelligence, University of Granada) , Luís P.F. Garcia (Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo) , Ana C. Lorena (Instituto de Ciência e Tecnologia, Universidade Federal de São Paulo) , André C.P.L.F. de Carvalho (Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo) , Francisco Herrera (Department of Computer Science and Artificial Intelligence, University of Granada)
2017-05-10

1 Introduction

In recent years, the large quantity of data of many different kinds and from different sources has created numerous challenges in the Data Mining area. Not only their size, but also their imperfections and varied formats provide researchers with plenty of new scenarios to be addressed. Consequently, Data Preprocessing (García et al. 2015) has become an important part of the KDD (Knowledge Discovery from Databases) process, and related software development is also essential to provide practitioners with the adequate tools.

Data Preprocessing intends to process the collected data appropriately so that subsequent learning algorithms can not only extract meaningful and relevant knowledge from the data, but also induce models with high predictive or descriptive performance. Data preprocessing is known as one of the most time-consuming steps in the whole KDD process. There exist several aspects involved in data preprocessing, like feature selection, dealing with missing values and detecting noisy data. Feature selection aims at extracting the most relevant attributes for the learning step, thus reducing the complexity of models and the computing time taken for their induction. The treatment of missing values is also essential to keep as much information as possible in the preprocessed dataset. Finally, noisy data refers to values that are either incorrect or clearly far from the general underlying data distribution.

All these tasks have associated software available. For instance, the KEEL tool (Alcalá et al. 2010) contains a broad collection of data preprocessing algorithms, which covers all the aforementioned topics. There exist many other general-purpose Data Mining tools with data preprocessing functionalities, like WEKA (Witten and Frank 2005), KNIME (Berthold et al. 2009), RapidMiner (Hofmann and Klinkenberg 2013) or R.

Regarding the R statistical software, there are plenty of packages available in the Comprehensive R Archive Network (CRAN) repository to address preprocessing tasks. For example, MICE (van Buuren and Groothuis-Oudshoorn 2011) and Amelia (Honaker et al. 2011) are very popular packages for handling missing values, whereas caret (Kuhn 2008) or FSelector (Romanski and Kotthoff 2014) provide a wide range of techniques for feature selection. There are also general-purpose packages for detecting outliers and anomalies, like mvoutlier (Filzmoser and Gschwandtner 2015). If we examine software in CRAN developed to tackle label noise, there already exist non-preprocessing packages that provide label-noise-robust classifiers. For instance, robustDA implements a robust mixture discriminant analysis (Bouveyron and Girard 2009), while the probFDA package provides a probabilistic Fisher discriminant analysis related to the seminal work of Lawrence and Schölkopf (2001).

However, to the best of our knowledge, CRAN lacks an extensive collection of label noise preprocessing algorithms for classification (Garcia et al. 2015; Sáez et al. 2016), some of which are among the most influential preprocessing techniques (García et al. 2016). This is the gap we intend to fill with the release of the NoiseFiltersR package, whose taxonomy is inspired by the recent survey on label noise by B. Frénay and M. Verleysen (Frénay and Verleysen 2014). Yet, it should be noted that there are other packages that include some isolated implementations of label noise filters, since they are sometimes needed as auxiliary functions. This is the case of the unbalanced (Pozzolo et al. 2015) package, which deals with imbalanced classification. It contains basic versions of classical filters, such as Tomek-Links (Tomek 1976) or ENN (Wilson 1972), which are typically applied after oversampling an imbalanced dataset (oversampling being the main purpose of the unbalanced package).

In the following section we briefly introduce the problem of classification with label noise, as well as the most popular techniques to overcome this problem. Then, we show how to use the NoiseFiltersR package to apply these techniques in a unified and R-user-friendly manner. Finally, we present a general overview of this work and potential extensions.

2 Label noise preprocessing

Data collection and preparation processes are usually subject to errors in Data Mining applications (Wu and Zhu 2008). Consequently, real-world datasets are commonly affected by imperfections or noise. In a classification problem, several effects of this noise can be observed by analyzing its spatial characteristics: noise may create small clusters of instances of a particular class in regions of the instance space corresponding to another class, displace or remove instances located in key areas within a concrete class, or disrupt the boundaries of the classes, resulting in increased overlap between them. All these imperfections may harm the interpretation of the data; the design, size, building time, interpretability and accuracy of models; as well as the decisions made from them (Zhong et al. 2004; Zhu and Wu 2004).

In order to alleviate the effects of noise, we need first to identify and quantify the components of the data that can be affected. As described by Wang et al. (1995), from the large number of components that comprise a dataset, class labels and attribute values are two essential elements in classification datasets (Wu 1996). Thus, two types of noise are commonly differentiated in the literature (Wu 1996; Zhu and Wu 2004):

- Label noise (also called class noise), which occurs when an instance is assigned an incorrect class label.
- Attribute noise, which refers to corruptions in the values of one or more input attributes of an instance.

The NoiseFiltersR package (and the rest of this manuscript) focuses on label noise, which is known to be the most disruptive one, since label quality is essential for the classifier training (Zhu and Wu 2004). In Frénay and Verleysen (2014) the mechanisms that generate label noise are examined, relating them to the appropriate treatment procedures that can be safely applied. In the specialized literature there exist two main approaches to deal with label noise, and both are surveyed in Frénay and Verleysen (2014):

- Algorithm-level approaches, which design or modify the classifier itself so that it becomes robust to the presence of label noise.
- Data-level (preprocessing) approaches, which detect the noisy instances in the training data and remove or relabel them before any classifier is learned.

The NoiseFiltersR package follows the data-level approach, since it allows the data preprocessing to be carried out just once, after which any classifier can be applied, whereas algorithm-level approaches are specific to each classification algorithm. Regarding the data-level handling of label noise, we take the aforementioned survey by Frénay and Verleysen (2014) as the basis for our NoiseFiltersR package. That work provides an overview and references for the most popular classical and state-of-the-art filters, which are organized and classified taking into account several aspects:

- The strategy used to identify noisy instances: ensemble-based, similarity-based, or based on data complexity measures.
- The way noisy instances are handled: removing them from the dataset or repairing (relabelling) them.

3 The NoiseFiltersR package

The released package implements, documents, explains and provides references for a broad collection of the label noise filters surveyed in Frénay and Verleysen (2014). To the best of our knowledge, it is the first comprehensive review and implementation of this topic for R, which has become an essential tool in Data Mining in recent years.

Namely, the NoiseFiltersR package includes a total of 30 filters, which were published in 24 research papers. Each of these papers is referenced in the corresponding filter documentation page, as shown in the next Documentation section (and particularly in Figure 1). Regarding the noise detection strategy, 13 of them are ensemble-based filters, 14 can be cataloged as similarity-based, and the other 3 are based on data complexity measures. Taking into account the noise handling approach, 4 of them integrate the possibility of relabelling, whereas the other 26 only allow removal (which clearly evidences a general preference for data removal in the literature). The full list of implemented filters and their distribution according to the two aforementioned criteria are displayed in Table 1, which provides a general overview of the package.

Table 1: Names and taxonomy of available filters in the NoiseFiltersR package. Every filter is appropriately referenced in its documentation page, where the original paper is provided.
Noise handling: Remove
  - Ensemble-based identification: C45robustFilter, C45votingFilter, C45iteratedVotingFilter, CVCF, dynamicCF, edgeBoostFilter, EF, HARF, INFFC, IPF, ORBoostFilter, PF
  - Similarity-based identification: AENN, BBNR, CNN, DROP1, DROP2, DROP3, ENG, ENN, PRISM, RNN, TomekLinks
  - Data complexity-based identification: saturationFilter, consensusSF, classifSF

Noise handling: Repair/Hybrid
  - Ensemble-based identification: hybridRepairFilter
  - Similarity-based identification: EWF, GE, ModeFilter
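
All these filters share a homogeneous interface, described in the Calling the filters section below. As a quick preview, a couple of the filters listed in Table 1 could be applied to the iris dataset as in the following sketch, where default parameters are assumed and the object names are merely illustrative:

    # Illustrative sketch: applying two of the listed filters with default parameters
    > data(iris)
    # Similarity-based removal filter, formula interface:
    > out_enn <- ENN(Species ~ ., data = iris)
    # Ensemble-based removal filter, default interface:
    > out_ipf <- IPF(iris, classColumn = 5)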

The rest of this section is organized as follows. First, a few lines are devoted to the installation process. Then, we present the documentation page for the filters, where further specific details can be looked up. After that, we focus on the two implemented methods to call the filters (default and formula). Finally, the "filter" class, which unifies the return value of the filters in NoiseFiltersR package, is presented.

Installation

The NoiseFiltersR package is available at CRAN servers, so it can be downloaded and installed directly from the R command line by typing:

         > install.packages("NoiseFiltersR")

In order to easily access all the package’s functions, it must be attached in the usual way:

         > library(NoiseFiltersR)

Documentation

Whereas this paper provides the user with an overview of the NoiseFiltersR package, it is also important to have access to specific information for each available filter. This information can be looked up in the corresponding documentation page, which in all cases includes essential items such as a general description of the filter, the explanation of its parameters, the structure of the returned object and the reference to the original paper where the filter was proposed (see Figure 1 for an example).

Figure 1: Extract from the GE filter's documentation page, showing the essential items mentioned above.

As usual in R, the function documentation pages can be either checked on the CRAN website for the package or loaded from the command line with the ? or help commands:

    > ?GE
    > help(GE)
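
Besides the individual help pages, the index with all the functions documented in the package can be browsed with the standard R help utility:

    > help(package = "NoiseFiltersR")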

Calling the filters

In order to use a label noise filter in Data Mining applications, all we need to know is the dataset to be filtered and its class variable (i.e., the one that contains the label for each available instance). The NoiseFiltersR package provides two standard ways of tagging the class variable when calling the implemented filters (see also Figure 2 and the example below):

- Default method: the dataset is passed as the first argument, and the index of the column containing the class labels is indicated through the classColumn argument.
- Formula method: a formula (such as Species ~ . in the example below) identifies the class variable, and the dataset is passed through the data argument.

Next, we provide an example of how to use these two methods for filtering the iris dataset with edgeBoostFilter (we did not change the default parameters of the filter):

    # Checking the structure of the dataset (last variable is the class one)
    > data(iris)
    > str(iris)
    'data.frame':   150 obs. of  5 variables:
    $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 ...
    
    # Using the default method:
    > out_Def <- edgeBoostFilter(iris, classColumn = 5)
    # Using the formula method:
    > out_For <- edgeBoostFilter(Species~., iris)
    # Checking that the filtered datasets are identical:
    > identical(out_Def$cleanData, out_For$cleanData)
    [1] TRUE
    
Figure 2: Extract from edgeBoostFilter’s documentation page, which shows the two methods for calling filters in the NoiseFiltersR package. In both cases, the parameters of the filter can be tuned through additional arguments.
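
For instance, the parameters of edgeBoostFilter (m, percent and threshold, which also appear in the parameters element of the returned object shown in the next section) could be tuned as in the following sketch, where the chosen values are merely illustrative:

    # Illustrative tuning of edgeBoostFilter through additional arguments
    > out_tuned <- edgeBoostFilter(Species ~ ., data = iris, m = 20, percent = 0.1)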

Notice that, in the last command of the example above (the call to identical), we used the $ operator to access the elements of the objects returned by the filter. In the next section we explore the structure and contents of these objects.

The "filter" class

The S3 class "filter" is designed to unify the return value of the filters inside the NoiseFiltersR package. It is a list that encapsulates seven elements with the most relevant information of the process:

- cleanData: a data.frame containing the filtered dataset.
- remIdx: a vector with the indexes of the removed instances (with respect to the original dataset).
- repIdx: a vector with the indexes of the repaired (relabelled) instances (with respect to the original dataset).
- repLab: a factor containing the new labels for the repaired instances.
- parameters: a list with the parameters used to apply the filter.
- call: the expression used to invoke the filter.
- extraInf: a character string with additional information not covered by the previous items.

As an example, we can check the structure of the above out_For object, which was the return value of the edgeBoostFilter function:

    > str(out_For)
    List of 7
    $ cleanData :'data.frame':  142 obs. of  5 variables:
    ..$ Sepal.Length: num [1:142] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    ..$ Sepal.Width : num [1:142] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    ..$ Petal.Length: num [1:142] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    ..$ Petal.Width : num [1:142] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 ...
    $ remIdx    : int [1:8] 58 78 84 107 120 130 134 139
    $ repIdx    : NULL
    $ repLab    : NULL
    $ parameters:List of 3
    ..$ m        : num 15
    ..$ percent  : num 0.05
    ..$ threshold: num 0
    $ call      : language edgeBoostFilter(formula = Species ~ ., data = iris)
    $ extraInf  : chr "Highest edge value kept: 0.0669358381115568"
    - attr(*, "class")= chr "filter"
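
Once the filtering is done, the cleanData element can be directly fed to any classification algorithm, and remIdx allows the flagged instances to be inspected. The following minimal sketch illustrates this (the use of rpart here is just an illustrative assumption; any classifier could be trained on the filtered dataset):

    # Inspect the instances that were removed as noisy
    > iris[out_For$remIdx, ]
    # Train a decision tree on the filtered dataset (rpart assumed to be installed)
    > library(rpart)
    > tree_clean <- rpart(Species ~ ., data = out_For$cleanData)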

In order to cleanly display this "filter" class in the R console, two specific print and summary methods were implemented. The appearance of the first one is as follows

    > print(out_For)
    
    Call:
    edgeBoostFilter(formula = Species ~ ., data = iris)
    
    Parameters:
    m: 15
    percent: 0.05
    threshold: 0
    
    Results:
    Number of removed instances: 8 (5.333333 %)
    Number of repaired instances: 0 (0 %)

and contains three main blocks:

- Call: the expression that was used to invoke the filter.
- Parameters: the values of the parameters used by the filter.
- Results: the number (and percentage) of removed and repaired instances.

The summary method displays some extra blocks:

- A header line indicating which filter was applied to which dataset.
- An Additional information block, which displays the extraInf element described above.
- When the argument explicit = TRUE is used, the explicit indexes of the removed (and, when applicable, repaired) instances.

In the case of the previous out_For object, the summary command produces the following output:

    > summary(out_For, explicit = TRUE)
    
    Filter edgeBoostFilter applied to dataset iris 
    
    Call:
    edgeBoostFilter(formula = Species ~ ., data = iris)
    
    Parameters:
    m: 15
    percent: 0.05
    threshold: 0
    
    Results:
    Number of removed instances: 8 (5.333333 %)
    Number of repaired instances: 0 (0 %)
    
    Additional information:
    Highest edge value kept: 0.0669358381115568 
    
    Explicit indexes for removed instances:
    58 78 84 107 120 130 134 139

4 Summary

In this paper, we introduced the NoiseFiltersR package, which is the first extensive R implementation of classification-oriented label noise filters. To set a context and motivation for this work, we presented the problem of label noise and the main approaches to deal with it through data preprocessing, as well as the related software. As previously explained, the released package unifies the return value of the filters by means of the "filter" class, which benefits from specific print and summary methods. Moreover, it provides an R-user-friendly way to call the implemented filters, whose documentation is worth reading and points to the original reference where each of them was first published.

Regarding the potential extensions of this package, there exist several aspects which can be addressed in future releases. For instance, there exist some other label noise filters reviewed in the main reference (Frénay and Verleysen 2014) whose noise identification strategy does not belong to the ones covered here: ensemble-based, similarity-based and data complexity-based (as shown in Table 1). Another relevant extension would be the inclusion of some datasets with different levels of artificially introduced label noise, in order to ease the experimentation workflow.

5 Acknowledgements

This work was supported by the Spanish Research Project TIN2014-57251-P, the Andalusian Research Plan P11-TIC-7765, and the Brazilian grants CeMEAI-FAPESP 2013/07375-0 and FAPESP 2012/22608-8. Luís P. F. Garcia was supported by FAPESP 2011/14602-7.

CRAN packages used

MICE, Amelia, caret, FSelector, mvoutlier, robustDA, probFDA, NoiseFiltersR, unbalanced, RWeka

CRAN Task Views implied by cited packages

HighPerformanceComputing, MachineLearning, MissingData, NaturalLanguageProcessing, Robust


References

J. Alcalá, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez and F. Herrera. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3): 255–287, 2010.
M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, K. Thiel and B. Wiswedel. KNIME - the Konstanz information miner: Version 2.0 and beyond. SIGKDD Explorations Newsletter, 11(1): 26–31, 2009. URL https://doi.org/10.1145/1656274.1656280.
C. Bouveyron and S. Girard. Robust supervised classification with mixture models: Learning from data with uncertain labels. Pattern Recognition, 42(11): 2649–2658, 2009. URL https://doi.org/10.1016/j.patcog.2009.03.027.
C. E. Brodley and M. A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11: 131–167, 1999. URL https://doi.org/10.1613/jair.606.
P. Filzmoser and M. Gschwandtner. Mvoutlier: Multivariate outlier detection based on robust methods. 2015. URL https://CRAN.R-project.org/package=mvoutlier. R package version 2.0.6.
B. Frénay and M. Verleysen. Classification in the presence of label noise: A survey. IEEE transactions on neural networks and learning systems, 25(5): 845–869, 2014. URL https://doi.org/10.1109/TNNLS.2013.2292894.
L. P. Garcia, J. A. Sáez, J. Luengo, A. C. Lorena, A. C. de Carvalho and F. Herrera. Using the One-vs-One decomposition to improve the performance of class noise filters via an aggregation strategy in multi-class classification problems. Knowledge-Based Systems, 90: 153–164, 2015. URL https://doi.org/10.1016/j.knosys.2015.09.023.
S. Garcı́a, J. Luengo and F. Herrera. Data preprocessing in data mining. Springer-Verlag, 2015. URL https://doi.org/10.1007/978-3-319-10247-4.
S. Garcı́a, J. Luengo and F. Herrera. Tutorial on practical tips of the most influential data preprocessing algorithms in data mining. Knowledge-Based Systems, 98: 1–29, 2016. URL https://doi.org/10.1016/j.knosys.2015.12.006.
M. A. Hernández and S. J. Stolfo. Real-World Data Is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery, 2: 9–37, 1998. URL https://doi.org/10.1023/A:1009761603038.
M. Hofmann and R. Klinkenberg. RapidMiner: Data mining use cases and business analytics applications. Chapman & Hall/CRC, 2013.
J. Honaker, G. King and M. Blackwell. Amelia II: A program for missing data. Journal of Statistical Software, 45(7): 1–47, 2011. URL https://doi.org/10.18637/jss.v045.i07.
K. Hornik, C. Buchta and A. Zeileis. Open-source machine learning: R meets Weka. Computational Statistics, 24(2): 225–232, 2009. URL https://doi.org/10.1007/s00180-008-0119-7.
T. M. Khoshgoftaar and P. Rebours. Improving software quality prediction by noise filtering techniques. Journal of Computer Science and Technology, 22(3): 387–396, 2007. URL https://doi.org/10.1007/s11390-007-9054-2.
M. Kuhn. Building predictive models in R using the caret package. Journal of Statistical Software, 28(5): 2008. URL https://doi.org/10.18637/jss.v028.i05.
N. D. Lawrence and B. Schölkopf. Estimating a kernel Fisher discriminant in the presence of label noise. In Proceedings of the eighteenth international conference on machine learning, pages. 306–313 2001.
Y. Li, L. F. A. Wessels, D. de Ridder and M. J. T. Reinders. Classification in the presence of class noise using a probabilistic kernel Fisher method. Pattern Recognition, 40(12): 3349–3357, 2007. URL https://doi.org/10.1016/j.patcog.2007.05.006.
Q. Miao, Y. Cao, G. Xia, M. Gong, J. Liu and J. Song. RBoost: Label noise-robust boosting algorithm based on a nonconvex loss function and the numerically stable base learners. IEEE Transactions on Neural Networks and Learning Systems, 27(11): 2216–2228, 2016. URL https://doi.org/10.1109/TNNLS.2015.2475750.
A. L. Miranda, L. P. F. Garcia, A. C. Carvalho and A. C. Lorena. Use of classification algorithms in noise detection and elimination. In International conference on hybrid artificial intelligence systems, pages. 417–424 2009. Springer. URL https://doi.org/10.1007/978-3-642-02319-4_50.
A. D. Pozzolo, O. Caelen and G. Bontempi. Unbalanced: Racing for unbalanced methods selection. 2015. URL https://CRAN.R-project.org/package=unbalanced. R package version 2.0.
J. R. Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.
P. Romanski and L. Kotthoff. FSelector: Selecting attributes. 2014. URL https://CRAN.R-project.org/package=FSelector. R package version 0.20.
J. A. Sáez, M. Galar, J. Luengo and F. Herrera. Analyzing the Presence of Noise in Multi-Class Problems: Alleviating Its Influence with the One-vs-One Decomposition. Knowledge and Information Systems, 38(1): 179–206, 2014. URL https://doi.org/10.1007/s10115-012-0570-1.
J. A. Sáez, M. Galar, J. Luengo and F. Herrera. INFFC: An iterative class noise filter based on the fusion of classifiers with noise sensitivity control. Information Fusion, 27: 19–32, 2016. URL https://doi.org/10.1016/j.inffus.2015.04.002.
C. M. Teng. Dealing with data corruption in remote sensing. In International symposium on intelligent data analysis, pages. 452–463 2005. Springer. URL https://doi.org/10.1007/11552253_41.
C.-M. Teng. Correcting Noisy Data. In Proceedings of the sixteenth international conference on machine learning, pages. 239–248 1999. San Francisco, CA, USA: Morgan Kaufmann Publishers.
I. Tomek. Two modifications of CNN. IEEE Trans. Systems, Man and Cybernetics, 6: 769–772, 1976. URL https://doi.org/10.1109/TSMC.1976.4309452.
S. van Buuren and K. Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3): 1–67, 2011. URL https://doi.org/10.18637/jss.v045.i03.
R. Y. Wang, V. C. Storey and C. P. Firth. A Framework for Analysis of Data Quality Research. IEEE Transactions on Knowledge and Data Engineering, 7(4): 623–640, 1995. URL https://doi.org/10.1109/69.404034.
D. L. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, 2(3): 408–421, 1972. URL https://doi.org/10.1109/TSMC.1972.4309137.
I. H. Witten and E. Frank. Data mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.
X. Wu. Knowledge acquisition from databases. Norwood, NJ, USA: Ablex Publishing Corp., 1996.
X. Wu and X. Zhu. Mining with noise knowledge: Error-aware data mining. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 38(4): 917–932, 2008. URL https://doi.org/10.1109/TSMCA.2008.923034.
S. Zhong, T. M. Khoshgoftaar and N. Seliya. Analyzing Software Measurement Data with Clustering Techniques. IEEE Intelligent Systems, 19(2): 20–27, 2004. URL https://doi.org/10.1109/MIS.2004.1274907.
X. Zhu and X. Wu. Class noise vs. Attribute noise: A quantitative study. Artificial Intelligence Review, 22(3): 177–210, 2004. URL https://doi.org/10.1007/s10462-004-0751-8.


Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Morales, et al., "The NoiseFiltersR Package: Label Noise Preprocessing in R", The R Journal, 2017

BibTeX citation

@article{RJ-2017-027,
  author = {Morales, Pablo and Luengo, Julián and Garcia, Luís P.F. and Lorena, Ana C. and Carvalho, André C.P.L.F. de and Herrera, Francisco},
  title = {The NoiseFiltersR Package: Label Noise Preprocessing in R},
  journal = {The R Journal},
  year = {2017},
  note = {https://rjournal.github.io/},
  volume = {9},
  issue = {1},
  issn = {2073-4859},
  pages = {219-228}
}