Detecting outliers in a multivariate and unsupervised context is an important and ongoing problem, notably for quality control. Many statistical methods are already implemented in R and are briefly surveyed in the present paper. But only a few lead to the accurate identification of potential outliers in the case of a small level of contamination. In this particular context, the Invariant Coordinate Selection (ICS) method shows remarkable properties for identifying outliers that lie on a low-dimensional subspace in its first invariant components. It is implemented in the ICSOutlier package. The main function of the package, ics.outlier, offers the possibility of labelling potential outliers in a completely automated way. Four examples, including two real examples in quality control, illustrate the use of the function. Compared with several other approaches, ICS is generally as efficient as its competitors and shows an advantage in the context of a small proportion of outliers lying in a low-dimensional subspace. In quality control, the method may help in properly identifying some defective products while not detecting too many false positives.
The unsupervised detection of multivariate outliers is a timeless subject of interest in statistics; see Aggarwal (2017), Hodge and Austin (2004), and Hadi et al. (2009) for a complete overview. Indeed, this can be the goal of the analysis, as in fraud detection, medical applications or manufacturing-defect detection. It can also be used for preprocessing in almost any statistical analysis that is sensitive to the presence of outliers, e.g. Principal Component Analysis (PCA). Many statistical methods exist and have already been implemented in R through tens of packages. Only some of these packages are dedicated to the unsupervised context and are mentioned below. One of the most common methods for multivariate outlier detection is the Mahalanobis Distance (MD), and many packages are based on this distance: mvoutlier (Filzmoser and Gschwandtner 2017), CerioliOutlierDetection (Green and Martin 2017a), rrcovHD (Todorov 2016), faoutlier (Chalmers and Flora 2015). Additionally, traditional methods such as the angle-based methods are implemented in the packages abodOutlier (Jimenez 2015) and HighDimOut (Fan 2015), the distribution-based methods in alphaOutlier (Rehage and Kuhnt 2016), extremevalues (van der Loo 2010), HDoutliers (Fraley 2016) and outliers (Komsta 2011), the density-based methods in DMwR2 (Torgo 2016), HighDimOut, ldbod (Williams 2017) and Rlof (Hu et al. 2015), and the depth-based methods in depth (Genest et al. 2017). Finally, other approaches have been implemented, such as the one based on Projection Pursuit (PP) in REPPlab (Fischer et al. 2016b), and PCA in OutlierDC (Eo and Cho 2014), pcadapt (Luu and Blum 2017), rrcov (Todorov and Filzmoser 2009) and rrcovHD. However, it is important to note that not all the implemented functions necessarily return outlierness measures together with the identities of the outlying observations. Restricting attention to the packages fulfilling these conditions and to multivariate methods for numerical variables leads to focusing on mvoutlier, CerioliOutlierDetection and rrcov. These packages implement modified versions of the robust Mahalanobis distance, different algorithms for outlier identification in high dimensions (Filzmoser and Todorov 2013), and classical and robust PCA. In PCA, the labelling of the outliers is based on the comparison of two outlierness measures: a score distance (SD) based on the first principal components and a distance in the orthogonal space (OD). All these methods rely on distances. To extend the analysis, we include comparisons with the Local Outlier Factor (LOF) (Rlof package) and Angle-Based Outlier Factor (ABOD) (abodOutlier package) approaches even if they only return outlierness measures and not the identities of the outlying observations.
These methods dedicated to outlier detection in an unsupervised context have different properties. Contrary to PCA, the Mahalanobis distance and its modified versions are invariant under any affine transformation as long as the location and scatter estimators are affine equivariant. However, several drawbacks have been noticed in the use of these distances. First, Filzmoser et al. (2005) highlight the fact that an outlier is not necessarily an extreme value. So, using a fixed threshold to label outliers for every data set is not the wisest solution and the threshold should be adjusted to the sample size. Moreover, as Cerioli (2010) notes, the threshold usually used is not adapted to the case in which there is no outlier in the data. Finally, for Cerioli (2010), the comparison of the distances of each observation to the threshold should not be performed independently without taking into account any simultaneous adjustments. Otherwise, the method may lead to many false positives, i.e. non-outlying observations that are identified as outliers. Note that this is a serious concern that applies to almost all outlier-detection methods. To the best of our knowledge, the mvoutlier and CerioliOutlierDetection packages are the first ones that tend to correct these drawbacks, but only for the robust Mahalanobis distance with the Minimum Covariance Determinant (MCD) estimators.
The ICSOutlier (Archimbaud et al. 2016) package implements another approach to the detection of outliers. It appears that ICS is as efficient as its competitors in many situations and shows an advantage in the case of a small proportion of outliers lying in a low-dimensional subspace. This method is particularly well-suited to quality control. Its current version is able to handle a small proportion of outliers that lie on a subspace, using the affine equivariant Invariant Coordinate Selection (ICS) method, as presented by Archimbaud et al. (2018). The method relies on a generalized diagonalization and on the selection of the invariant components associated with the largest eigenvalues. Note that when more than say 10% of the observations are suspected to be outlying, invariant components associated with the smallest eigenvalues may also be of interest and this alternative will be considered in a future version of the package. ICS fulfills all the properties mentioned previously: (i) it can detect the absence of outliers, (ii) it is not sensitive to the standardization of the data and (iii) it is a multivariate method which controls the number of observations identified as outliers. The restriction to a small proportion of outliers that belong to a subspace is of interest in some areas of manufacturing products with a high level of quality control (e.g. automotive, avionics or aerospace).
In the following sections, the principle of the Invariant Coordinate Selection (ICS) method is recalled as well as the three steps of the outlier detection procedure: (i) invariant components selection, (ii) outlierness measure definition and (iii) outlier identification. Then, we explain how to use the ICSOutlier package and finally we analyze the efficiency of the ICS method compared to other methods on four examples, including a new real data set included in the package.
The ICS method is a powerful method designed for exploring multivariate data by revealing their structure (Nordhausen et al. 2008). As explained in Tyler et al. (2009), it is based on a simultaneous spectral decomposition of two scatter matrices and leads to an affine invariant coordinate system. Readers not so familiar with the concept of scatter matrices are referred for example to Rousseeuw and Hubert (2013), Nordhausen and Tyler (2015) and references therein for definitions, examples and properties.
Following Archimbaud et al. (2018), for a \(p\)-variate dataset \(\mathbf X_n=(\mathbf x_1,\ldots,\mathbf x_n)'\), let \(\mathbf V_{1}(\mathbf X_n)\) and \(\mathbf V_{2}(\mathbf X_n)\) be two affine equivariant scatter matrices (symmetric and positive definite) and \(\mathbf m_{1}(\mathbf X_n)\) and \(\mathbf m_{2}(\mathbf X_n)\) their associated location estimators. ICS is looking for the diagonal \(p \times p\) matrix \(\mathbf D(\mathbf X_n)\) containing the eigenvalues of \(\mathbf V_{1}(\mathbf X_n)^{-1} \mathbf V_{2}(\mathbf X_n)\) in decreasing order and the \(p \times p\) matrix \(\mathbf B(\mathbf X_n)=(\mathbf b_1,\ldots,\mathbf b_p)'\) containing the corresponding eigenvectors as its rows and such that \[\mathbf B(\mathbf X_n) \mathbf V_{1}(\mathbf X_n) \mathbf B'(\mathbf X_n) = \mathbf I_p.\] We have: \[\mathbf V_{1}(\mathbf X_n)^{-1} \mathbf V_{2}(\mathbf X_n) \mathbf B(\mathbf X_n)' = \mathbf B(\mathbf X_n)' \mathbf D(\mathbf X_n).\]
Letting \(\mathbf V_{1}(\mathbf X_n)\) be “more robust” than
\(\mathbf V_{2}(\mathbf X_n)\), the interpretation of these eigenvalues is
the main point of the ICS method: they correspond to some kurtosis
measure that depends on the choice of \(\mathbf V_{1}(\mathbf X_n)\) and
\(\mathbf V_{2}(\mathbf X_n)\) . For example if \(\mathbf V_{1}\) is the
usual empirical variance-covariance matrix and \(\mathbf V_{2}\) is the
scatter matrix based on the fourth moments (the default in the
ics.outlier
function), the eigenvalues are ordered according to their
classical Pearson kurtosis values in decreasing order. In this case and
for a small proportion of outliers, it is advisable to analyze the
projections that maximize the kurtosis and are associated with the
largest eigenvalues (Archimbaud et al. 2018).
Then, using the location estimator \(\mathbf m_1(\mathbf X_n)\) associated with the scatter matrix \(\mathbf V_{1}(\mathbf X_n)\) , the corresponding centered scores are obtained as \(\mathbf z_i=\mathbf B(\mathbf X_n) (\mathbf x_i - \mathbf m_1(\mathbf X_n))\) for \(i=1,\ldots,n\), so that: \[\mathbf Z_n =(\mathbf z_1,\ldots,\mathbf z_n)'= (\mathbf X_n- \mathbf 1_n \mathbf m_1'(\mathbf X_n))\mathbf B'(\mathbf X_n).\] They are called the invariant components (IC) because of their affine invariance property in the sense that if \(\mathbf X_n^* = \mathbf X_n \mathbf A+ \mathbf 1_n \mathbf b'\) for any \(p \times p\) regular matrix \(\mathbf A\) and any \(p\)-vector \(\mathbf b\), \[(\mathbf X_n^* - \mathbf 1_n \mathbf m_1(\mathbf X_n^*)')\mathbf B(\mathbf X_n^*)' = (\mathbf X_n - \mathbf 1_n \mathbf m_1(\mathbf X_n)')\mathbf B(\mathbf X_n)' \mathbf J,\] where \(\mathbf 1_n\) is a \(n\)-vector of ones and \(\mathbf J\) is a \(p \times p\) diagonal matrix with diagonal elements \(\pm 1\), which means the invariant coordinates change at most their signs.
These scores are related to the Mahalanobis Distance (MD). For any observation \(\mathbf x_i\) with \(i=1,\ldots,n\), the squared Euclidean norm of its centered invariant coordinates is exactly the squared Mahalanobis distance from \(\mathbf m_1(\mathbf X_n)\) in the sense of \(\mathbf V_{1}(\mathbf X_n)\): \[\mathbf z_i' \mathbf z_i= (\mathbf x_i- \mathbf m_1(\mathbf X_n))'\mathbf V_{1}(\mathbf X_n)^{-1} (\mathbf x_i- \mathbf m_1(\mathbf X_n)).\] The added value of using ICS over MD is realized when the structure of the data lies in a subspace of dimension \(q<p\). In this context and with a small percentage of outliers, it is of interest to focus only on the \(q\) projections that maximize a kurtosis measure. Indeed, if the number \(k\) of selected invariant coordinates corresponds to the dimension \(q\) of the subspace on which the outliers lie, then the ICS method is expected to recover the subspace of interest for outlier detection. As detailed in Tyler et al. (2009), this subspace corresponds to Fisher’s discriminant subspace in the case of mixtures of Gaussian distributions. Nevertheless, the main difficulty is to correctly estimate this dimension \(k\).
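To make this correspondence concrete, the following small numerical check (ours, not part of the package documentation) verifies that, when all \(p\) invariant coordinates are kept, the squared ICS distances returned by ics.distances coincide with the squared Mahalanobis distances based on \(\mathbf m_1\) and \(\mathbf V_{1}\); it assumes the default scatter pair MeanCov and Mean3Cov4 discussed below, with MeanCov returning a list with components location and scatter.
# Numerical check (illustration only): with all p components, the squared ICS
# distances equal the squared Mahalanobis distances with respect to m1 and V1.
library("ICSOutlier")
set.seed(1)
X <- matrix(rnorm(300), 100, 3)
icsX <- ics2(X)                                   # defaults: S1 = MeanCov, S2 = Mean3Cov4
d.ics <- ics.distances(icsX, index = 1:ncol(X))   # squared ICS distances, all components
m1 <- MeanCov(X)                                  # same location and scatter as used for V1
d.md <- mahalanobis(X, m1$location, m1$scatter)   # squared Mahalanobis distances
all.equal(unname(d.ics), unname(d.md))            # expected to be TRUE up to rounding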
Depending on the combination of the scatters \(\mathbf V_{1}(\mathbf X_n)\) and \(\mathbf V_{2}(\mathbf X_n)\) chosen, the ICSOutlier package incorporates automated ways to select these \(k\) invariant coordinates. The two approaches proposed in Archimbaud et al. (2018) are implemented here: a test based on a quasi-inferential procedure and some normality tests. Considering \(\mathbf V_{1}(\mathbf X_n)\) to be more robust than \(\mathbf V_{2}(\mathbf X_n)\) and a small percentage of outliers (less than 10% is recommended), the structure of the outlierness should be contained in the first \(k\) non-normal components. Note that the structure may also be contained in the last components (associated with the smallest eigenvalues) if the proportion of outliers is large. What is meant by “large” depends on the scatter pair but also on the distribution of the data, as detailed in Archimbaud et al. (2018). Since the invariant coordinates are already ordered decreasingly according to a kurtosis measure, it is enough to test whether each component is Gaussian, beginning with the one associated with the largest eigenvalue, and to stop the test procedure as soon as a Gaussian component is found. So, if \(k+1\) denotes the rank of the first Gaussian component, the components are sequentially tested at the adapted level \(\alpha_j = \alpha/j\), for \(j=1,\dots,k\), as in Dray (2008). Because of the sequentiality, the Bonferroni correction adjusts the significance of each test and ensures a nominal level \(\alpha\).
The first approach is a Parallel Analysis (PA) based on Monte Carlo simulations of eigenvalues for Gaussian populations of the same dimension as the initial dataset. This method is common for selecting components in PCA as described in Peres-Neto et al. (2005). It is based on \(mEig\) simulations of a Gaussian population and, for each \(j=1,\dots,p\), the computation of the \(1-\alpha_j\) percentile of the j\(^{th}\) eigenvalue of ICS which is considered as a cut-off for the j\(^{th}\) component. Then, sequentially from \(j=1\), if the observed j\(^{th}\) eigenvalue exceeds the cut-off then the j\(^{th}\) invariant component is declared non-Gaussian and is selected. As soon as one eigenvalue is smaller than its associated cut-off, the invariant component is considered as Gaussian and the test procedure is stopped.
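For readers who want to see the mechanics, a rough sketch of the parallel analysis idea is given below; comp.simu.test implements it properly and more efficiently. The scatter pair is approximated here by the regular covariance matrix and the fourth-moment scatter cov4 from the ICS package, and the sample size, dimension, number of simulations and level are arbitrary illustration values.
# Rough sketch of parallel analysis for the ICS eigenvalues (illustration only)
library("ICS")                     # provides cov4()
set.seed(1)
n <- 200; p <- 4; mEig <- 200; alpha <- 0.05
sim.eig <- replicate(mEig, {
  Y <- matrix(rnorm(n * p), n, p)  # Gaussian data of the same size as the data set
  # eigenvalues of V1^{-1} V2 with V1 = cov and V2 = cov4
  sort(Re(eigen(solve(cov(Y)) %*% cov4(Y), only.values = TRUE)$values), decreasing = TRUE)
})
alpha.j <- alpha / seq_len(p)      # adjusted levels alpha/j used sequentially
cutoffs <- sapply(seq_len(p), function(j) quantile(sim.eig[j, ], probs = 1 - alpha.j[j]))
cutoffs                            # component-wise cut-offs for the observed eigenvalues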
The second approach is directly based on usual normality tests and is applied to the invariant coordinates. In the package the user can choose among the following five tests: the D’Agostino test of skewness (DA), the Anscombe-Glynn (AG) test of kurtosis, the Bonett-Seier (BS) test of Geary’s kurtosis, the Jarque-Bera (JB) test for normality, which is based on both skewness and kurtosis measures, and the Shapiro-Wilk (SW) normality test (see Yazici and Yolacan (2007) and Bonett and Seier (2002) for a complete description of each). The testing process is still sequential, with an adaptation of the level of each test. Once the first normal coordinate is found, the test procedure stops and only the non-Gaussian coordinates are selected.
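As an illustration of this sequential scheme (comp.norm.test automates it in the package), the sketch below applies the D’Agostino test from the moments package to the columns of a matrix Z that is assumed to already contain the invariant coordinates; the helper name select_nongaussian is ours.
# Sketch of the sequential selection with a normality test (illustration only);
# comp.norm.test in ICSOutlier provides the actual implementation.
library("moments")                       # provides agostino.test()
select_nongaussian <- function(Z, alpha = 0.05) {
  k <- 0
  for (j in seq_len(ncol(Z))) {          # Z: invariant coordinates, one column per component
    pval <- agostino.test(Z[, j])$p.value
    if (pval >= alpha / j) break         # first Gaussian-looking component: stop testing
    k <- j                               # normality rejected at level alpha/j: keep component j
  }
  k                                      # number of selected invariant components
}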
Let \(k\) denote the number of invariant components selected by one of the two test procedures detailed previously or by looking at the screeplot of the eigenvalues. Then, for each observation \(\mathbf x_i\), \(i=1,\ldots,n\), its squared ICS distance is computed based on the \(k\) selected invariant coordinates as: \[ICSD^2_{\mathbf V_{1}(\mathbf X_n)^{-1} \mathbf V_{2}(\mathbf X_n)} (\mathbf x_i, k)= \mathbf z_{i,k}' \mathbf z_{i,k},\] where \(\mathbf z_{i,k}=\mathbf B(\mathbf X_n,k) (\mathbf x_i - \mathbf m_1(\mathbf X_n))\), for \(i=1,\ldots,n\), and the \(k\times p\) matrix \(\mathbf B(\mathbf X_n,k)\) contains the first \(k\) rows of \(\mathbf B(\mathbf X_n)\).
Finally, the outliers are identified based on the comparison of their ICS distances with the expectation under the Gaussian distribution, as in Archimbaud et al. (2018). The cut-off is derived from \(mDist\) Monte Carlo simulations of a Gaussian population of the same dimension as the initial dataset. For each observation, the ICS distance is computed using the \(k\) selected components. For a given level \(\beta\), the cut-off corresponds to the average of the \(1-\beta\) percentiles of the distances over all the simulations. An observation is labeled as an outlier if its ICS distance is higher than this cut-off. By default, \(\beta=5\%\).
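The sketch below mimics this Monte Carlo cut-off for a toy configuration; dist.simu.test does this for the user (with parallelization), and the sample size, dimension, number of selected components, level and number of replications used here are illustrative values only.
# Rough sketch of the Monte Carlo cut-off for the ICS distances (illustration only)
library("ICSOutlier")
set.seed(1)
n <- 100; p <- 3; k <- 1; beta <- 0.05; mDist <- 100
sim.quant <- replicate(mDist, {
  Y <- matrix(rnorm(n * p), n, p)                     # standard Gaussian data, same size
  quantile(ics.distances(ics2(Y), index = 1:k), probs = 1 - beta)
})
cutoff <- mean(sim.quant)    # average of the (1 - beta) percentiles over the simulations
cutoff                       # observations with a larger ICS distance would be flagged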
The main function available in ICSOutlier is ics.outlier
which
directly calls all the other functions included in the package. First it
is necessary to apply the ics2
function from the
ICS (Nordhausen et al. 2008)
package in order to create an object of class ics2
. The ICS
package has been available for a few years, and contains a function
called ics
for implementing the ICS method. However, the ics
function cannot be used directly for outlier detection, and a new
function, ics2
, had to be added to the ICS package (version
1.3-0). The arguments of ics2
are mostly the same as those of the
function ics
, except that it is now necessary to add the location
vectors associated with the scatter estimators. The main function
ics.outlier
of the package ICSOutlier uses the output of the
ics2
function as an input. The function ics.outlier
implements the
three steps necessary for outlier detection using ICS in an automated
way:
Select the invariant coordinates which best recover the subspace where the outliers lie, using some test procedure.
Compute the ICS distance based on the selected invariant coordinates as an outlierness measure for each observation.
Label the outliers using the cut-off value obtained through simulations.
The links between the ics2
function from the ICS package and the
functions available in the ICSOutlier package are illustrated on
Figure 1.
The ics.outlier
function only takes as parameter an object of class
ics2
and not an object of class ics
because the latter returns only
uncentered invariant coordinates contrary to the ics2
class. From a
practical point of view, the S1 and S2 arguments of the ics2 function should each be the name of a function which returns a list with the first (resp. second) location vector \(\mathbf m_{1}\) (resp. \(\mathbf m_{2}\)) and the first (resp. second) scatter matrix \(\mathbf V_{1}\) (resp. \(\mathbf V_{2}\)).
In addition to this ics2
object, there are other parameters of
ics.outlier
that can be tuned, including an option to parallelize the
computations in order to increase the speed of the function:
method: name of the method used to select the ICS components. Options are "simulation" for the parallel analysis approach and "norm.test" for the normality tests approach. Depending on the method, either comp.norm.test or comp.simu.test is used.
test: name of the marginal normality test to use if method = "norm.test". Possibilities are "jarque.test", "anscombe.test", "bonett.test", "agostino.test" and "shapiro.test". Default is "agostino.test".
mEig: number of simulations performed to derive the cut-off values for selecting the ICS components. Only used if method = "simulation". See comp.simu.test for details.
level.test: level for the comp.norm.test or comp.simu.test functions, i.e. the initial level for selecting the invariant coordinates.
adjust: logical. For selecting the invariant coordinates, the level of the test can be adjusted for each component to deal with multiple testing. See comp.norm.test and comp.simu.test for details. Default is TRUE.
level.dist: level for the dist.simu.test function. The (1 - level(s))th quantile(s) used to determine the cut-off value(s) for the ICS distances.
mDist: number of simulations performed to derive the cut-off value for the ICS distances. See dist.simu.test for details.
type: currently the only option is "smallprop", which means that only the first ICS components can be selected. See comp.norm.test or comp.simu.test for details.
ncores: number of cores to be used in dist.simu.test and comp.simu.test. If NULL or 1, no parallel computing is used. Otherwise makeCluster with type = "PSOCK", from the parallel package, is used.
iseed: the seed passed on to clusterSetRNGStream if parallel computation is used. Default is NULL, which means no fixed seed is used.
pkg: a character vector listing all the packages which need to be loaded on the different cores via require if parallel computation is used. Must be at least "ICSOutlier" and must contain the packages needed to compute the scatter matrices.
qtype: specifies the quantile algorithm used in quantile.
By default the ics2 function performs ICS with the usual mean vector and the usual empirical variance-covariance matrix, as returned by MeanCov, and with the location vector based on the third moments associated with the scatter matrix based on the fourth moments, as returned by Mean3Cov4.
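To fix ideas, a hypothetical call combining several of the arguments listed above could look as follows; icsX stands for any object of class ics2 and the chosen values are for illustration only.
# Hypothetical call illustrating the main tuning arguments (icsX: an ics2 object)
icsOut <- ics.outlier(icsX,
                      method = "norm.test", test = "bonett.test",  # selection via normality tests
                      level.test = 0.05, adjust = TRUE,            # initial level, adjusted as alpha/j
                      level.dist = 0.025, mDist = 5000,            # quantile and simulations for the cut-off
                      ncores = 2, iseed = 123,                     # parallel computing with a fixed seed
                      pkg = "ICSOutlier")
summary(icsOut)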
In the following examples, we first illustrate, using an artificial data
set, how ics.outlier behaves when there is no outlier, and then apply
the function to three data sets available in R. For reproducibility all
examples have a fixed seed.
One of the advantages of using ICS for outlier detection is its ability to detect the absence of outliers. If the first invariant coordinate is considered normal, that means there is no outlier in the data set. So, in the presence of a multivariate normal data set, the function should in most cases select no component and hence will not detect any outliers.
For the first example we simulate a bivariate normally distributed data set with 500 observations and we determine the cut-offs for identifying the outliers at the level 2.5% for all methods.
library("ICSOutlier", quietly = TRUE)
# Data simulation
set.seed(123)
X <- matrix(rnorm(1000, 0, 0.1), 500, 2)
Using the default ICS setting and choosing the Jarque-Bera normality test to select the components at the default level of 5% can easily be done as follows.
icsX <- ics2(X)
icsOutlierJB <- ics.outlier(icsX, test = "jarque", level.test = 0.05,
                            level.dist = 0.025)
print(icsOutlierJB)
R> [1] "0 components were selected and no outliers were detected."
Hence in this outlier-free normal data no component was considered non-normal and therefore as desired no outliers were detected.
As a comparison we compute the robust Mahalanobis distances using the MCD with a breakdown point of 25% and compute the commonly used cut-off value coming from the \(\chi^2_2\) distribution as well as the adjusted cut-off value from Green and Martin (2017b) as implemented in the package CerioliOutlierDetection.
# Robust Mahalanobis distance with MCD estimates with a breakdown point of 25%
library("robustbase")
MCD <- covMcd(X, alpha = 0.75)
RD <- mahalanobis(X, MCD$center, MCD$cov)

# Cut-off based on the chi-square distribution
cutoff.chi.sq <- qchisq(0.975, df = ncol(X))
cutoff.chi.sq
R> [1] 7.377759

# Cut-off based on Green and Martin (2017)
library("CerioliOutlierDetection")
cutoff.GM <- hr05CutoffMvnormal(n.obs = nrow(X), p.dim = ncol(X), mcd.alpha = 0.75,
                                signif.alpha = 0.025, method = "GM14",
                                use.consistency.correction = TRUE)$cutoff.asy
cutoff.GM
R> [1] 14.22071
To visualize these results, the robust squared Mahalanobis distances are plotted with two different cut-off values in Figure 2.
# Code for the Figure 2
colPoints <- ifelse(RD >= min(c(cutoff.chi.sq, cutoff.GM)), 1, grey(0.5))
pchPoints <- ifelse(RD >= min(c(cutoff.chi.sq, cutoff.GM)), 16, 4)
plot(seq_along(RD), RD, pch = pchPoints, col = colPoints,
     ylim = c(0, max(RD, cutoff.chi.sq, cutoff.GM) + 2), cex.axis = 0.7, cex.lab = 0.7,
     ylab = expression(RD^2), xlab = "Observation Number")
abline(h = c(cutoff.chi.sq, cutoff.GM), lty = c("dashed", "dotted"))
legend("topleft", lty = c("dashed", "dotted"), cex = 0.7, ncol = 2, bty = "n",
       legend = c(expression(paste(chi[p]^2, " cut-off")), "GM cut-off"))
Hence in this example using the \(\chi_2^2\) cut-off yields 12 outliers while the more sophisticated cut-off does not classify any observation as outlying.
Another situation where the ics.outlier function can conclude that there are no outliers is when the squared ICS distances are below the corresponding cut-off for all observations, as illustrated in the help file of the function: ?ics.outlier.
The real data set HTP
, introduced in Archimbaud et al. (2018), is included in
this package. The data set provides the results of 88 numerical tests
for 902 high-tech parts. Based on these results the producer considered
all parts functional and all of them were sold. However two parts, 581
and 619, showed defects in use and were returned to the manufacturer.
These two observations can thus be considered as outliers and the
objective is to detect them. We use the ics.outlier
function with its
default settings, as most users initially would do but with the
parallelizing option for the computations.
# HTP dataset
library("ICSOutlier")
set.seed(123)
data(HTP)
outliers <- c(581, 619)

# default ICS
icsHTP <- ics2(HTP)

# Outlier detection with selection of components based on normality tests
# by default it can take quite long as mDist = 10000 so we choose to
# use all but one available cores to parallelize the simulations.
library(parallel)
icsOutlierDA <- ics.outlier(icsHTP, ncores = detectCores()-1, iseed = 123)
summary(icsOutlierDA)
R>
R> ICS based on two scatter matrices and two location estimates
R> S1: MeanCov
R> S2: Mean3Cov4
R>
R> Searching for a small proportion of outliers
R>
R> Components selected at nominal level 0.05: 14
R> Selection method: norm.test (agostino.test)
R> Number of outliers at nominal level 0.025: 43
# Code for the Figure 3
plot(icsOutlierDA, cex.lab = 0.7, cex.axis = 0.7)
points(outliers, icsOutlierDA@ics.distances[outliers], pch = 5)
text(outliers, icsOutlierDA@ics.distances[outliers], outliers, pos = 2, cex = 0.7)
Based on this result we are able to identify the two outliers among the 5% of the observations declared as outliers, taking into account the first 14 components selected by the D’Agostino normality test.
However, following Archimbaud et al. (2018), a simple alternative for selecting
components is to use the screeplot of the icsHTP
object, in a similar
way as for PCA. Note that while the eigenvalues plotted on the screeplot
are related to the variance for PCA, they correspond to a generalized
kurtosis for ICS (see Tyler et al. (2009)).
# Code for the Figure 4
screeplot(icsHTP, cex.lab = 0.7, cex.axis = 0.7, cex.names = 0.7, cex.main = 0.7)
Based on this screeplot, three components might be a reasonable choice
and then the functions ics.distances
and dist.simu.test
can be used
to obtain the desired distances and cut-off value.
ics.dist.scree <- ics.distances(icsHTP, index = 1:3)
# by default it can take quite long as m = 10000, so we choose to
# use all but one available cores to parallelize the simulations.
library(parallel)
ics.cutOff <- dist.simu.test(icsHTP, 1:3, ncores = detectCores()-1, iseed = 123)
ics.cutOff
R>    97.5%
R> 12.80771
Before visualizing these results we also apply two competing outlier detection methods: first, the Finite-Sample Reweighted MCD (FSRMCD) outlier detection test of Cerioli (2010), implemented in the package CerioliOutlierDetection, and second, the SIGN1 algorithm (Filzmoser et al. 2008) implemented in the package mvoutlier. For the FSRMCD algorithm, we choose as nominal level \(\alpha = 1- \gamma^{1/n}\) for the individual outlier tests, with \(\gamma\) the nominal size of the intersection test and \(n\) the number of observations. Then the ICS distances based on the three selected components are plotted against the distances from the two competing methods together with the corresponding cut-off values.
# FSRMCD
library("CerioliOutlierDetection")
FSRMCD <- cerioli2010.fsrmcd.test(HTP, signif.alpha = 1 - 0.975^(1/nrow(HTP)),
                                  mcd.alpha = 0.75)
# Two critical values: one for points included in the reweighted MCD (weights == 1)
# and one for points excluded from the reweighted MCD (weights == 0)
FSRMCD.cutoffs <- unique(FSRMCD$critvalfcn(FSRMCD$signif.alpha))
FSRMCD.cutoffs
R> [1] 144.8583 183.2797
# SIGN1
library("mvoutlier", quietly = TRUE)
SIGN1 <- sign1(HTP, qcrit = 0.975)
# Code for the Figure 5
par(mfrow = c(1, 2))
par(mar = c(4, 4, 2, 0.2))
# Comparison ICS vs FSRMCD
colPoints <- ifelse(ics.dist.scree >= ics.cutOff, 1, grey(0.5))
pchPoints <- ifelse(ics.dist.scree >= ics.cutOff, 16, 4)
plot(FSRMCD$mahdist.rw, ics.dist.scree, col = colPoints, pch = pchPoints,
     cex.lab = 0.7, cex.axis = 0.7, cex.main = 0.7,
     main = "ICS vs FSRMCD", ylab = "ICS Distances", xlab = "FSRMCD")
points(FSRMCD$mahdist.rw[outliers], ics.dist.scree[outliers], pch = 5)
text(FSRMCD$mahdist.rw[outliers], ics.dist.scree[outliers], labels = outliers, pos = 2,
cex = 0.7)
abline(h = ics.cutOff, v = FSRMCD.cutoffs, lty = "dashed")
# Comparison ICS vs SIGN1
plot(SIGN1$x.dist, ics.dist.scree, col = colPoints, pch = pchPoints,
cex.lab = 0.7, cex.axis = 0.7, cex.main = 0.7,
main = "ICS vs SIGN1", ylab = "ICS Distances", xlab = "SIGN1")
points(SIGN1$x.dist[outliers], ics.dist.scree[outliers], pch = 5)
text(SIGN1$x.dist[outliers], ics.dist.scree[outliers], labels = outliers, pos = 2,
cex = 0.7)
abline(h = ics.cutOff, v = SIGN1$const, lty = "dashed")
As illustrated in Figure 5, the cut-offs used for the FSRMCD and SIGN1 algorithms plotted on the x-axis lead to too many false positives. Even when focusing only on the rank of the outlying observations it is clear that the ICS method outperforms the other methods. The outlier labelled 619 is obviously outlying with all methods, but the outlier labelled 581 has the 4th highest squared ICS distance compared to 15th and 12th for the distances computed with FSRMCD and SIGN1 algorithms. Note that the data set contains highly collinear variables and it is well-known that high-breakdown point methods are not optimal in this case.
Archimbaud et al. (2018) contains a more detailed analysis of this dataset comparing ICS with other methods and also gives the impact of the choice of different scatter matrices. For the present data set the default scatter combination used here seems the best. In this analysis, it is also interesting to notice that the first component highlights only the two defective parts. The other two observations are detected on the second and the third components of ICS and as far as we know they are not defective parts but false positives.
The Reliability data set also comes from an industrial context and is part of the REPPlab package. It contains 55 variables measured during the production process of 520 units. The challenge for this data set is to identify the produced items with a fault not detected by the marginal tests. According to Fischer et al. (2016a), the observations numbered 414 and 512 are most likely outliers.
A special feature of this data set is that for example variable 24 is nearly constant which makes this data set hard to use with high-breakdown methods. For example the MCD cannot be computed for this data set when all variables are used. The use of ICS and other distance-based outlier detection methods is also detailed in Archimbaud et al. (2018).
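The near-degeneracy can be checked directly; the following lines (ours, not part of the original analysis) simply look at the smallest marginal standard deviations of the ReliabilityData columns, where values close to zero correspond to nearly constant variables such as the one mentioned above.
# Inspect the smallest marginal standard deviations of ReliabilityData (illustration only)
library("REPPlab")
data(ReliabilityData)
head(sort(apply(ReliabilityData, 2, sd)), 3)   # nearly constant variables appear first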
Let us now illustrate the use of other scatter matrices than the default
one together with the way to pass arguments to some of the functions.
This time the invariant components are selected using simulations
(Parallel analysis) with an initial decision level of 5%. These results
are then compared to two methods not included in Archimbaud et al. (2018), namely
the Local Outlier Factor (LOF; Breunig et al. 2000) and the Angle-Based
Outlier Factor (ABOD; Kriegel et al. 2008) outlier detection methods. First, we
apply the LOF method with the lof
function from the Rlof package
which parallelizes the computation of the local outlier factor using \(l\)
neighbours for each observation. The number of neighbours is determined
based on the guidelines from Breunig et al. (2000): we compute the outlierness
measures for \(l\) between 5 and 50 and we aggregate the results by taking
the maximum of the local outlier factor for each observation. However,
the distribution of the LOF values (and of the aggregated indices) is
not known and therefore, there exists no theoretical cut-off to identify
the outlying observations. In practice, observations with LOF values
much higher than one are considered as potentially outlying. Second, we
calculate the angle-based outlier factor for each observation through
the abod
function from the abodOutlier package on a random
sample of 10% (the default) of the data (not on the entire data set
because it is too time-consuming). As for the LOF method, the
distribution of the outlyingness indices is unknown. But for this
method, a potentially outlying observation is indicated by a small index
value.
The location and scatter combinations we use here are the regular mean
vector and covariance matrix and the joint maximum likelihood estimation
of location and scatter of a t-distribution with one degree of freedom,
also known as Cauchy MLE estimate. The function MeanCov
returns the
mean vector and regular covariance matrix as required for ics2
and for
the Cauchy MLE we use the function tM
where the degree of freedom is
specified using the argument df
. As the Cauchy MLE is considered the
more robust estimator it should be specified in ics2
as S1
.
# ReliabilityData example: the observations 414 and 512 are suspected to be outliers
library("ICSOutlier")
library("REPPlab")
set.seed(123)
data(ReliabilityData)
outliers <- c(414, 512)

# ICS with MLE Cauchy and the Mean-Cov
icsReliabilityData <- ics2(ReliabilityData, S1 = tM, S2 = MeanCov,
                           S1args = list(df = 1))

# Outlier detection with selection of components based on simulations
# it can take quite long as mEig = 5000 and mDist = 5000, so we choose
# to use all but one available cores to parallelize the simulations.
icsOutlierPA <- ics.outlier(icsReliabilityData, method = "simulation",
                            level.test = 0.05, mEig = 5000,
                            level.dist = 0.01, mDist = 5000,
                            ncores = detectCores()-1, iseed = 123)
icsOutlierPA
R> [1] "39 components were selected and 86 outliers were detected."
# LOF: Local Outlier Factor
library("Rlof", quietly = TRUE)
X.lof <- lof(ReliabilityData, 5:50, cores = 2)
X.lof.max <- apply(X.lof, 1, max)

# ABOD: Angle-Based Outlier Factor
library("abodOutlier", quietly = TRUE)
X.abod <- abod(ReliabilityData, method = "randomized")
# Code for the Figure 6
par(mfrow = c(1, 2))
par(mar = c(4, 4, 2, 0.2))
# Comparison ICS vs LOF
plot(X.lof.max, icsOutlierPA@ics.distances, cex.lab = 0.7, cex.axis = 0.7, pch = 4,
cex.main = 0.7, main = "ICS vs LOF", ylab = "ICS Distances", xlab = "LOF")
text(X.lof.max[outliers], icsOutlierPA@ics.distances[outliers], labels = outliers,
pos = 4, cex = 0.7)
# Comparison ICS vs ABOD
plot(X.abod, icsOutlierPA@ics.distances, cex.lab = 0.7, cex.axis = 0.7, cex.main = 0.7,
main = "ICS vs ABOD", ylab = "ICS Distances", xlab = "ABOD", pch = 4)
text(X.abod[outliers], icsOutlierPA@ics.distances[outliers], labels = outliers,
pos = 4, cex = 0.7)
This analysis shows that ICS selects quite a large number of components here and suggests 86 outliers at the level of 1%. For their part, the LOF and ABOD methods do not provide cut-offs for identifying outlying observations, so we only consider the ranks of the observations. When looking at the ICS distances in Figure 6, the observations 414 and 512 are clearly separated from the main bulk of the data, while the two observations do not differ at all from the rest of the data using the LOF and ABOD methods (see also Archimbaud et al. (2018) for more details on this example).
The hbk
data set is included in the
robustbase
package (Rousseeuw et al. 2017). It is an artificial data set created by
Hawkins et al. (1984) for illustrating the so-called masking effect of
outliers. It contains two groups of outliers with observations 1-10 in
the first group and observations 11-14 in the second group, among the 75
observations characterized by the three explanatory variables. Usually
with non-robust methods only the observations 12, 13 and 14 are
identified as outliers. For this example the percentage of outliers is
around 19% and so is larger than the percentage we recommend for the
current version of the package. Our aim is to illustrate that in such a
situation, the choice of the scatter matrices may have a large impact on
the results.
First, the combination of the highly robust MCD estimates with the mean
vector and regular covariance matrix is studied. Note that in order to
be able to use the MCD via the function CovMcd
a wrapper needs to be
written around the function so that it returns the proper list as
expected by ics2
. Using then the D’Agostino test for skewness for
selecting the invariant components leads to select two components at the
default level of 5%. The level of the quantile used for deriving the
cut-off for outlier identification is the default, 2.5%. For this
combination, ICS performs equally well as the MCD-based Mahalanobis
distances, as illustrated in the first two plots of Figure
7. All outliers are correctly identified, no false
positive is detected and the two groups of outliers are clearly
separated.
Next, we consider the combination of two non-robust location vectors and scatter matrices which are the default in ICS, with the same parameters for selecting the components and for identifying the outliers as previously. The results are presented in the third plot of Figure 7.
library("rrcov")
set.seed(123)
# HBK data set
data(hbk)
# ICS with MCD estimates and the usual estimates
# Need to create a wrapper for the CovMcd function to return first the location estimate
# and the scatter estimate secondly.
<- function(x,...){
myMCD <- CovMcd(x,...)
mcd return(list(location = mcd@center, scatter = mcd@cov))
}<- ics2(hbk[, 1:3], S1 = myMCD, S2 = MeanCov, S1args = list(alpha = 0.75))
icsHBK_mcd
# Outlier detection with selection of components based on the D'Agostino test for
# skewness. It can take quite long as mEig = 10000 and mDist = 10000, so we choose
# to use all but one available cores to parallelize the simulations.
library(parallel)
<- ics.outlier(icsHBK_mcd, mEig = 10000, level.dist = 0.025,
icsOutlierDA.MCD mDist = 10000,
ncores = detectCores()-1, iseed = 123,
pkg = c("ICSOutlier", "rrcov"))
icsOutlierDA.MCD
> [1] "2 components were selected and 14 outliers were detected." R
# Robust Mahalanobis distance with MCD estimates with a breakdown point of 25%
MCD <- covMcd(hbk[, 1:3], alpha = 0.75)
RD <- mahalanobis(hbk[, 1:3], MCD$center, MCD$cov)
cutoff.chi.sq <- qchisq(0.975, df = ncol(hbk[, 1:3]))

# Cut-off based on Green and Martin (2017)
library("CerioliOutlierDetection")
cutoff.GM <- hr05CutoffMvnormal(n.obs = nrow(hbk[, 1:3]), p.dim = ncol(hbk[, 1:3]),
                                mcd.alpha = 0.75, signif.alpha = 0.025, method = "GM14",
                                use.consistency.correction = TRUE)$cutoff.asy
# ICS with non robust estimates
icsHBK <- ics2(hbk[, 1:3], S1 = MeanCov, S2 = Mean3Cov4)

# Outlier detection with selection of components based on the D'Agostino test for
# skewness. It can take quite long as mEig = 10000 and mDist = 10000, so we choose
# to use all but one available cores to parallelize the simulations.
icsOutlierDA <- ics.outlier(icsHBK, mEig = 10000, level.dist = 0.025,
                            mDist = 10000,
                            ncores = detectCores()-1, iseed = 123,
                            pkg = c("ICSOutlier", "rrcov"))
icsOutlierDA
R> [1] "2 components were selected and 2 outliers were detected."
# Code for the Figure 7
par(mfrow = c(1, 3))
par(mar = c(4, 2, 2, 0.2))
# Robust Mahalanobis distance with MCD estimates
colPoints <- ifelse(RD >= cutoff.chi.sq, 1, grey(0.5))
pchPoints <- ifelse(RD >= cutoff.chi.sq, 16, 4)
plot(RD, col = colPoints, pch = pchPoints, xlab = "Observation Number",
     cex.lab = 0.7, cex.axis = 0.7, main = "Robust MD", cex.main = 0.7)
abline(h = c(cutoff.chi.sq, cutoff.GM), lty = 1:2)
legend("topright", lty = 1:2, cex = 0.7, bty = "n",
       legend = c(expression(paste(chi[p]^2, " cut-off")), "GM cut-off"))
# ICS with MCD estimates and regular covariance
plot(icsOutlierDA.MCD, cex.lab = 0.7, cex.axis = 0.7, main = "ICS MCD-COV",
cex.main = 0.7)
# ICS with default non-robust estimates
plot(icsOutlierDA, cex.lab = 0.7, cex.axis = 0.7, main = "ICS COV-COV4",
cex.main = 0.7)
Here the default ics2
scatter combination does not perform well. Only
three outliers from the second group are detected: the distance plot
indicates three distance levels, but the cut-off value is not good
and the separation is not clear, especially when compared with the MCD-COV
combination. However, when looking at the invariant coordinates in
Figure 8, we can see that the default scatter
combination reveals the 14 outliers and the masking effect appears only
when computing the ICS distances.
# Code for the Figure 8
plot(icsHBK, col = rep(3:1, c(10, 4, 61)))
The use of ICSOutlier for outlier detection has been illustrated on several examples. The results are already quite good with the default parameters as long as the percentage of outliers is small. Note however that the component selection procedure can sometimes be improved by a visual inspection of the screeplot, and the cut-off for outlier identification can be adjusted by looking at the ICS distances plot. The ability of the method to check for the absence of outliers is an advantage compared with methods such as the Mahalanobis distance. In the same vein, the number of false positives, i.e. the number of non-outliers declared as outliers, is restricted thanks to the use of the cut-off derived from simulations. Moreover, the real HTP data set included in the package shows that the ICS method is useful for outlier detection in the context of quality control, among other possible applications. Finally, the ICS method competes with common approaches based on distances (the Mahalanobis distance and its robust version, PCA-based methods and their variants and improvements) as well as with methods based on density (LOF) or on angles (ABOD).
The computation time for the cut-offs used for selecting the non-Gaussian invariant components, on the one hand, and for the threshold for labeling the outliers, on the other hand, can be reduced by using the parallelizing option. Moreover, for now, the function can only select the first invariant components, which is relevant when the percentage of outliers is small. For a large percentage, the invariant components associated with the smallest eigenvalues may also be of interest, and this extension is planned for a future version of the package. A generalization of the ICS method to the case of perfectly collinear variables is also work in progress.
The article is based upon work from CRoNoS COST Action IC1408, supported by COST (European Cooperation in Science and Technology).
The authors wish to thank the editor and the two reviewers for their comments and suggestions which helped improve not only the paper but also the R package.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Archimbaud, et al., "ICSOutlier: Unsupervised Outlier Detection for Low-Dimensional Contamination Structure", The R Journal, 2018
BibTeX citation
@article{RJ-2018-034,
  author = {Archimbaud, Aurore and Nordhausen, Klaus and Ruiz-Gazen, Anne},
  title = {ICSOutlier: Unsupervised Outlier Detection for Low-Dimensional Contamination Structure},
  journal = {The R Journal},
  year = {2018},
  note = {https://rjournal.github.io/},
  volume = {10},
  issue = {1},
  issn = {2073-4859},
  pages = {234-250}
}