Identification of informative variables in an information system is often performed using simple one-dimensional filtering procedures that discard information about interactions between variables. Such an approach may result in removing some relevant variables from consideration. Here we present the R package MDFS (MultiDimensional Feature Selection), which identifies informative variables taking into account synergistic interactions between multiple descriptors and the decision variable. MDFS is an implementation of an algorithm based on information theory (Mnich and Rudnicki 2017). The computational kernel of the package is implemented in C++. A high-performance version implemented in CUDA C is also available. The application of MDFS is demonstrated using the well-known Madelon dataset, in which the decision variable is generated from synergistic interactions between descriptor variables. It is shown that the application of multidimensional analysis results in better sensitivity and a better ranking of importance.
Identification of variables that are related to the decision variable is often the most important step in dataset analysis. It becomes particularly important when the number of variables describing the phenomenon under scrutiny is large.
Methods of feature selection fall into three main categories (Guyon and Elisseeff 2003): filters, wrappers, and embedded methods.
Filters are designed to provide a quick answer and are therefore the fastest. On the other hand, their simplicity is also the source of their errors. Rigorous univariate methods, such as the t-test, consider each variable separately and hence cannot detect interactions between variables.
The wrapper algorithms are designed around machine learning algorithms such as SVM (Cortes and Vapnik 1995), as in the SVM-RFE algorithm (Guyon et al. 2002), or random forest (Breiman 2001), as in the Boruta algorithm (Kursa et al. 2010). They can identify variables involved in non-linear interactions. Unfortunately, they are slow for systems with tens of thousands of variables; for example, the Boruta algorithm first expands the system with randomised copies of variables and then requires numerous runs of the random forest algorithm.
The embedded methods are mostly limited to linear approximations and are part of a modelling approach where the selection is directed towards the utility of the model, as in the lasso and the elastic net. Therefore, variables that are relevant for understanding the phenomena under scrutiny may be omitted and replaced by variables more suitable for building a particular model.
Here we introduce an R package implementing a filter based on information theory. The algorithm can identify synergistic relevant variables by performing an exhaustive search of low-dimensional combinations of variables.
Kohavi and John (1997) proposed that a variable is weakly relevant if there exists a subset of variables, not containing it, such that adding the variable to this subset increases the information it carries about the decision variable. The algorithm implements this definition of weak relevance directly, limiting the context to subsets of at most $k-1$ other variables and examining all such subsets exhaustively. The maximum decrease in conditional information entropy of the decision variable upon adding $x_i$ to a $(k-1)$-tuple of other variables is used as the measure of relevance of $x_i$, its (maximal) information gain.
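Formally, for a $k$-dimensional analysis this can be written as follows (a restatement consistent with Mnich and Rudnicki (2017); normalisation constants are omitted here):

$$ IG_k(x_i) = \max_{\{x_{m_1},\ldots,x_{m_{k-1}}\}} \left[ H\left(y \mid x_{m_1},\ldots,x_{m_{k-1}}\right) - H\left(y \mid x_i, x_{m_1},\ldots,x_{m_{k-1}}\right) \right], $$

where the maximum is taken over all $(k-1)$-tuples of variables other than $x_i$.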
To declare a variable relevant, the value of its information gain must be statistically significantly higher than the values obtained for irrelevant variables. For a sufficiently large sample, a suitably scaled information gain of an irrelevant variable asymptotically follows a chi-squared distribution, so a p-value can be computed for each variable; the p-values are then adjusted for multiple comparisons, for example with the Benjamini-Yekutieli correction (method "BY" of the p.adjust function).
In one dimension ($k = 1$) the measure reduces to the mutual information between the variable and the decision.
All variables that are weakly relevant in a one-dimensional test should also be discovered in higher-dimensional tests; nevertheless, their relative importance may be significantly influenced by interactions with other variables. Often the criterion for inclusion in further steps of data analysis and model building is simply taking a number of top-ranked variables, hence the ranking of importance matters.
The MDFS package (Piliszek et al. 2018) consists of two main parts. One is an R interface to two computational engines. These engines utilise either CPU or NVIDIA GPU and are implemented in standard C++ and in CUDA C, respectively. Either computational engine returns the maximum information gain obtained for each variable. The other part is a collection of R functions for the statistical analysis of these information gains.
The computational engines operate on discretised descriptive variables. The problem of selecting the right number of classes (the right value of the divisions parameter) and of placing the class boundaries is mitigated by performing several randomised discretisations and taking the best result over them.
Conditional information entropy is obtained from the experimental probabilities of the decision classes using the following formula:

$$ H(y \mid x_{m_1},\ldots,x_{m_k}) = - \sum_{v} \sum_{d} p(d, v) \log p(d \mid v), $$

where $v$ runs over the voxels of the discretised space spanned by the variables $x_{m_1},\ldots,x_{m_k}$ and $d$ runs over the decision classes; the probabilities are estimated from the voxel counts.
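To make the estimation concrete, the following minimal R sketch computes this entropy for a single discretised variable; the function name and the pseudo-count default are illustrative only and do not reproduce the package's internal kernel:

# Illustrative sketch (not the package's C++ kernel): conditional entropy
# of a decision y given one discretised variable x.disc, with pseudo-count xi.
CondEntropy <- function(x.disc, y, xi = 0.25) {
  counts <- table(x.disc, y) + xi     # regularised voxel-by-class counts
  p.joint <- counts / sum(counts)     # estimated joint probabilities p(d, v)
  p.cond <- counts / rowSums(counts)  # estimated conditional p(d | v)
  -sum(p.joint * log(p.cond))         # H(y | x) in nats
}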
The implementation of the algorithm is currently limited to binary decision variables. The analysis for information systems that have more than two categories can be performed either by executing all possible pairwise comparisons or one-vs-rest (see the sketch below). Then all variables that are relevant in the context of a single pairwise comparison should be considered relevant. In the case of a continuous decision variable, it must be discretised before the analysis. In the current implementation all variables are discretised into an equal number of classes. This constraint is introduced for increased efficiency of computations, in particular on a GPU.
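For example, a one-vs-rest analysis can be organised as in the following sketch; the helper function is ours (hypothetical) and relies only on the MDFS function and the relevant.variables field used elsewhere in this article:

# Sketch of a one-vs-rest analysis for a multi-class decision variable.
# `x` is the feature matrix and `y` a factor with more than two levels.
OneVsRestRelevant <- function(x, y, ...) {
  relevant <- integer(0)
  for (cl in levels(y)) {
    y.bin <- as.integer(y == cl)  # binary decision: class `cl` vs the rest
    result <- MDFS(x, y.bin, ...)
    relevant <- union(relevant, result$relevant.variables)
  }
  sort(relevant)  # variables relevant in at least one comparison
}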
Another limitation is the maximum number of dimensions, which is set to 5. This is due to several reasons. Firstly, the computational cost of the algorithm is proportional to the number of variables raised to the power of the dimension of the analysis, and it becomes prohibitively expensive for powers larger than 5 even for systems described with a hundred variables. Secondly, analysis in higher dimensions requires a substantial number of objects to fill the voxels sufficiently for the algorithm to detect real synergies. Finally, it is also related to the simplicity of an efficient implementation of the algorithm in CUDA.
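To illustrate the first of these reasons: for the 500 variables of the Madelon dataset analysed below, the 5D analysis must examine

$$ \binom{500}{5} \approx 2.55 \times 10^{11} $$

variable tuples, roughly a hundred times more than the approximately $2.6 \times 10^9$ tuples in 4D, which is reflected in the steep growth of the running times reported in the benchmark section.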
The most time-consuming part of the algorithm is computing the counters for all voxels. Fortunately, this part of the computations is relatively easy to parallelise, as the exhaustive search is very well suited for GPU. Therefore, a GPU version of the algorithm was developed in CUDA C for NVIDIA GPGPUs and is targeted towards problems with a very large number of features. The CPU version is also parallelised to utilise all cores available on a single node. The 1D analysis is available only in the CPU version since there is no benefit in running this kind of analysis on GPU.
There are three functions in the package which are to be run directly on the input dataset: MDFS, ComputeMaxInfoGains, and ComputeInterestingTuples. The first one, MDFS, is our recommended function for new users: it hides internal details and provides an easy-to-use interface for a basic end-to-end analysis, familiar to users of other statistical tests (e.g., t.test), so that the user can straightforwardly obtain the statistic values, p-values, and the list of relevant variables. ComputeMaxInfoGains returns the maximum IGs, as described in the theory section; it can optionally provide information about the tuple in which this maximum IG was observed. On the other hand, one might be interested in all tuples where a certain IG threshold has been achieved. The ComputeInterestingTuples function performs this type of analysis and reports which variable in which tuple achieved the corresponding IG value.
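A call of this kind of analysis might look like the following sketch; note that the name of the threshold argument (ig.thr) is our assumption for illustration and should be checked against the package manual:

# Sketch: report 2D tuples in which a variable reached a high IG.
# The threshold argument name (ig.thr) is an assumption, not verified here.
tuples <- ComputeInterestingTuples(data, decision, dimensions = 2,
                                   divisions = 1, ig.thr = 0.2)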
The ComputePValue function fits the IGs to the respective statistical distributions, as described in the theory section. The goodness of fit is tested using the one-sample Kolmogorov-Smirnov test and a warning is emitted if the threshold is exceeded. ComputePValue returns an object of the "MDFS" class which contains, in particular, the computed p-values. ComputePValue is implemented in a general way, extending beyond the limitations of the current implementation of ComputeMaxInfoGains. In particular, it can handle multi-class problems and a different number of divisions for each variable.
AddContrastVariables is a utility function used to construct contrast variables (Stoppiglia et al. 2003; Kursa et al. 2010). Contrast variables are used solely to improve the reliability of the fit of the statistical distribution: in the case of fitting the distribution to contrast variables, we know exactly how many irrelevant variables there are in the system. The contrast variables are not tested for relevance and hence are not counted when adjusting p-values for multiple comparisons.
As mentioned earlier, the recommended way to use the package is through the MDFS function. It uses the other packaged functions to achieve its goal in the standard and thoroughly tested way, so it may be considered the canonical package usage pattern. The MDFS function is general in terms of contrast variables being optional, hence let us examine a simplified version of it, assuming the contrast variables are actually being used. We also omit setting the seed, although we recommend setting it so that the result is reproducible; the MDFS wrapper accepts a seed and saves it with the result.
The first step is to build the contrast variables:
contrast <- AddContrastVariables(data, n.contrast)
In the next step, the compute-intensive computation of IGs is executed:
MIG.Result <- ComputeMaxInfoGains(contrast$x, decision,
dimensions = dimensions, divisions = divisions,
discretizations = discretizations, range = range, pseudo.count = pseudo.count)
The first two positional parameters are, respectively, the feature data (plus contrast variables) and the decision. The other parameters determine the type of computed IGs: dimensions controls the dimensionality, divisions controls the discretisation (the number of classes is equal to divisions+1), discretizations controls the number of discretisations, range controls how random the discretisation splits are, and pseudo.count controls the regularisation parameter used as a pseudo-count when estimating probabilities from counts.
Finally, the computed IGs are analysed and a statistical result is computed:
fs <- ComputePValue(MIG.Result$IG,
dimensions = dimensions, divisions = divisions,
contrast.mask = contrast$mask,
one.dim.mode = ifelse (discretizations==1, "raw",
ifelse(divisions*discretizations<12, "lin", "exp")))
statistic <- MIG.Result$IG[!contrast$mask]
p.value <- fs$p.value[!contrast$mask]
adjusted.p.value <- p.adjust(p.value, method = p.adjust.method)
relevant.variables <- which(adjusted.p.value < level)
The one.dim.mode parameter controls the expected distribution in 1D. The rule states that as long as we have one discretisation, the resulting distribution is chi-squared; otherwise, depending on the product of discretizations and divisions, the resulting distribution might be closer to a linear or an exponential function of chi-squared distributions, as in higher dimensions. This is a heuristic and might need to be tuned.
Features with adjusted p-values below the assumed significance level are reported as relevant.
For the demonstration of the MDFS package we used the training subset of the well-known Madelon dataset (Guyon et al. 2007). It is an artificial set with 2000 objects and 500 variables. The decision was generated using a 5-dimensional random parity function based on variables drawn from a normal distribution. The remaining variables were generated in the following way: fifteen variables were obtained as linear combinations of the 5 input variables, and the remaining 480 variables were drawn randomly from the normal distribution. The dataset can be accessed from the UCI Machine Learning Repository (Dheeru and Karra Taniskidou 2017) and is included in the MDFS package as well.
We conducted the analysis in all possible dimensionalities using both the CPU and GPU versions of the code. Additionally, a standard t-test analysis was performed for comparison.
In the first attempt we utilised the given information on the properties of the dataset under scrutiny. We knew in advance that Madelon was constructed as a random parity problem and that each base variable was constructed from a distinct distribution. Therefore, we could use one discretisation into 2 equipotent classes. In the second attempt the recommended ‘blind’ approach in 2D was followed, which utilises several randomized discretisations.
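The examples below assume that the package and its bundled copy of the dataset have been loaded; loading via data() is shown here as the conventional sketch:

library(MDFS)
data(madelon)  # the bundled training subset of Madelon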
For brevity, in the following examples the set of Madelon independent variables is named x and its decision is named y:
x <- madelon$data
y <- madelon$decision
For easier comparison we introduce a helper function to obtain, from the MDFS analysis, the relevant features' indices ordered by decreasing relevance (increasing p-value):
GetOrderedRelevant <- function (result) {
result$relevant.variables[order(result$p.value[result$relevant.variables])]
}
One can now perform the t-test analysis: compute the p-values, adjust them for multiple comparisons (with the p.adjust function), take the variables relevant at the 0.05 level, and order them by adjusted p-value:
> tt <- ttests(x, ina = y+1)[,2] # we only use p-values (2nd column)
> tt.adjusted <- p.adjust(tt, method = "holm")
> tt.relevant <- which(tt.adjusted < 0.05)
> tt.relevant.ordered <- tt.relevant[order(tt.adjusted[tt.relevant])]
> tt.relevant.ordered
[1] 476 242 337 65 129 106 339 49 379 443 473 454 494
A FWER correction is used because we expect strong separation between relevant and irrelevant features in this artificial dataset. We used the ttests function from the Rfast (Papadakis et al. 2018) package as it is an efficient version of the t-test that processes all variables at once.
To achieve the same with MDFS for 1, 2, and 3 dimensions one can use the wrapper MDFS function:
> d1 <- MDFS(x, y, n.contrast = 0, dimensions = 1, divisions = 1, range = 0)
> d1.relevant.ordered <- GetOrderedRelevant(d1)
> d1.relevant.ordered
[1] 476 242 339 337 65 129 106 49 379 454 494 443 473
> d2 <- MDFS(x, y, n.contrast = 0, dimensions = 2, divisions = 1, range = 0)
> d2.relevant.ordered <- GetOrderedRelevant(d2)
> d2.relevant.ordered
[1] 476 242 49 379 154 282 434 339 494 454 452 29 319 443 129 473 106 337 65
> d3 <- MDFS(x, y, n.contrast = 0, dimensions = 3, divisions = 1, range = 0)
> d3.relevant.ordered <- GetOrderedRelevant(d3)
> d3.relevant.ordered
[1] 154 434 282 49 379 476 242 319 29 452 494 106 454 129 473 443 339 337 65 456
The changes in the set of relevant variables can be examined with simple setdiff comparisons:
> setdiff(tt.relevant.ordered, d1.relevant.ordered)
integer(0)
> setdiff(d1.relevant.ordered, tt.relevant.ordered)
integer(0)
> setdiff(d1.relevant.ordered, d2.relevant.ordered)
integer(0)
> setdiff(d2.relevant.ordered, d1.relevant.ordered)
[1] 154 282 434 452 29 319
> setdiff(d2.relevant.ordered, d3.relevant.ordered)
integer(0)
> setdiff(d3.relevant.ordered, d2.relevant.ordered)
[1] 456
One may notice that ordering by importance leads to different results for these four tests.
In the above, knowledge of the properties of the Madelon dataset was used: there are many random variables, hence there was no need to add contrast variables, and the problem is best resolved by splitting features in half, hence one could use a single discretisation and set range to zero.
However, one is usually not equipped with such knowledge and may then need to use multiple random discretisations. Below, an example run of a 'blind' 2D analysis of Madelon is presented:
> d2b <- MDFS(x, y, dimensions = 2, divisions = 1, discretizations = 30, seed = 118912)
> d2b.relevant.ordered <- GetOrderedRelevant(d2b)
> d2b.relevant.ordered
[1] 476 242 379 49 154 434 106 282 473 339 443 452 29 454 494 319 65 337 129
> setdiff(d2b.relevant.ordered, d2.relevant.ordered)
integer(0)
> setdiff(d2.relevant.ordered, d2b.relevant.ordered)
integer(0)
This demonstrates that the same variables are discovered, yet with a different order.
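This can be confirmed directly; the outputs below follow from the vectors printed above:

> setequal(d2b.relevant.ordered, d2.relevant.ordered)
[1] TRUE
> identical(d2b.relevant.ordered, d2.relevant.ordered)
[1] FALSE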
The performance of the CPU version of the algorithm was measured on a computer with two Intel Xeon E5-2650 v2 processors, running at 2.6 GHz. Each processor has eight physical cores. Hyperthreading was disabled.
The GPU version was tested on a computer with a single NVIDIA Tesla K80 accelerator. The K80 is equipped with two GK210 chips and is therefore visible to the system as two separate GPGPUs. Both were utilised in the tests.
The Madelon dataset has moderate dimensionality by modern standards, hence it is amenable to high-dimensional analysis. The CPU version of the code handles analyses of up to four dimensions in a reasonable time, see Table 1.
The performance gap between the CPU and GPU versions is much larger than a simple comparison of hardware capabilities would suggest. This is due to two factors. Firstly, the GPU version has been highly optimised towards efficient memory usage: the representation of the data by bit-vectors and the direct exploitation of data locality allow for much higher data throughput, and the bit-vector representation additionally allows the use of the very efficient popcnt instruction for counter updates. On the other hand, the CPU version was written mainly as a reference implementation of the algorithm and has not been strongly optimised.
Table 1: Computation times for the analysis of the Madelon dataset using a single discretisation (t-test timing given for reference).

| | t-test | 1D | 2D | 3D | 4D | 5D |
|---|---|---|---|---|---|---|
| CPU | 0.01s | 0.01s | 0.44s | 42s | 1h:58m | 249h |
| GPU | - | - | 0.23s | 0.2s | 9.8s | 59m:37s |
Table 2: Computation times for the analysis of the Madelon dataset using 30 randomised discretisations.

| | 1D | 2D | 3D | 4D | 5D |
|---|---|---|---|---|---|
| CPU | 0.35s | 5.8s | 37m:11s | 92h | - |
| GPU | - | 2.9s | 3.3s | 7m:36s | 42h |
Table 3: Clusters of highly correlated relevant variables in the Madelon dataset; each cluster is denoted by its lowest-numbered member.

| Cluster | Members |
|---|---|
| 154 | 154, 282, 434 |
| 29 | 29, 319, 452 |
| 49 | 49, 379 |
| 242 | 476, 242 |
| 456 | 456 |
| 454 | 454, 494 |
| 339 | 339 |
| 443 | 473, 443 |
| 106 | 106, 129 |
| 65 | 337, 65 |
Table 4: Positions of the relevant variable clusters in the importance rankings obtained from the t-test and from 1D-5D MDFS analyses with a single discretisation. Clusters are denoted by their lowest-numbered member (see Table 3).

| | t-test | 1D | 2D | 3D | 4D | 5D |
|---|---|---|---|---|---|---|
| 1. | 242 | 242 | 242 | 154 | 154 | 154 |
| 2. | 65 | 339 | 49 | 49 | 49 | 29 |
| 3. | 106 | 65 | 154 | 242 | 29 | 49 |
| 4. | 339 | 106 | 339 | 29 | 242 | 242 |
| 5. | 49 | 49 | 454 | 454 | 454 | 456 |
| 6. | 443 | 454 | 29 | 106 | 339 | 454 |
| 7. | 454 | 443 | 443 | 443 | 106 | 339 |
| 8. | - | - | 106 | 339 | 456 | 443 |
| 9. | - | - | 65 | 65 | 443 | 106 |
| 10. | - | - | - | 456 | 65 | 65 |
Table 5: Positions of the relevant variable clusters in the importance rankings obtained from the t-test and from 1D-5D MDFS analyses following the 'blind' approach with 30 randomised discretisations. Clusters are denoted by their lowest-numbered member (see Table 3).

| | t-test | 1D | 2D | 3D | 4D | 5D |
|---|---|---|---|---|---|---|
| 1. | 242 | 242 | 242 | 154 | 154 | 154 |
| 2. | 65 | 339 | 49 | 49 | 49 | 29 |
| 3. | 106 | 65 | 154 | 242 | 29 | 49 |
| 4. | 339 | 443 | 106 | 29 | 242 | 242 |
| 5. | 49 | 106 | 443 | 106 | 454 | 454 |
| 6. | 443 | 454 | 339 | 454 | 106 | 456 |
| 7. | 454 | 49 | 29 | 443 | 339 | 443 |
| 8. | - | 205 | 454 | 339 | 443 | 339 |
| 9. | - | - | 65 | 65 | 456 | 106 |
| 10. | - | - | - | 456 | 65 | 65 |
The twenty relevant variables in Madelon can be easily identified by analysis of the histograms of variables, their correlation structure, and a priori knowledge of the method of construction of the dataset. In particular, the base variables, i.e., those variables that are directly connected to the decision variable, have a unique distribution with two distinct peaks, whereas all other variables have a smooth unimodal distribution; hence the identification of base variables is trivial. What is more, we know that the remaining informative variables are constructed as linear combinations of base variables, hence they should display significant correlations with the base variables. Finally, the nuisance variables are completely random, hence they should be correlated neither with the base variables nor with their linear combinations. The analysis of correlations between variables also reveals that there are several groups of very highly correlated variables, see Table 3.
This clear structure of the dataset creates an opportunity to confront the results of the MDFS analysis with the ground truth and to observe how the increasing precision of the analysis helps to discover this structure without using a priori knowledge of the dataset's construction.
One-dimensional analysis reveals 13 truly relevant variables (7 independent clusters), both by means of the t-test and the one-dimensional MDFS analysis. The variables discoverable only through interactions, namely 154, 282, 434, 29, 319, and 452, appear only when the two-dimensional analysis is applied, and variable 456 is additionally recovered in three and more dimensions.
We have introduced a new package for the identification of informative variables in multidimensional information systems which takes into account interactions between variables. The implemented method is significantly more sensitive than the standard univariate t-test: in the Madelon example it discovered all 20 relevant variables, whereas the t-test discovered only 13, and it also produced a more meaningful ranking of their importance.
The research was partially funded by the Polish National Science Centre, grant 2013/09/B/ST6/01550.