progenyClust: an R package for Progeny Clustering

Identifying the optimal number of clusters is a common problem faced by data scientists in various research fields and industry applications. Though many clustering evaluation techniques have been developed to solve this problem, the recently developed Progeny Clustering algorithm is a much faster alternative, and one that is relevant to biomedical applications. In this paper, we introduce the R package progenyClust, which implements and extends the original Progeny Clustering algorithm for evaluating clustering stability and identifying the optimal cluster number. We illustrate its applicability using two examples: a simulated test dataset as a proof of concept, and a cell imaging dataset demonstrating its application potential in biomedical research. The progenyClust package is versatile in that it offers great flexibility in choosing clustering methods and tuning parameters. In addition, the default parameter settings as well as the plot and summary methods offered in the package make the application of Progeny Clustering straightforward and coherent.

Chenyue W. Hu (Rice University) , Amina A. Qutub (Rice University)
2016-05-01

1 Introduction

Clustering is a classical and widely used machine learning technique, yet the field of clustering is constantly growing. The goal of clustering is to group objects that are similar to each other and separate objects that are not, based on common features. Clustering can, for example, be applied to distinguishing tumor subclasses based on gene expression data (Sørlie et al. 2001; Budinska et al. 2013), or to dividing sports fans based on their demographic information (Ross 2007). One critical challenge in clustering is identifying the optimal number of groups. Although some advanced clustering algorithms can automatically determine the cluster number (e.g. Affinity Propagation (Frey and Dueck 2007)), the commonly used algorithms (e.g. k-means (Hartigan and Wong 1979) and hierarchical clustering (Johnson 1967)) unfortunately require users to specify the cluster number before performing the clustering task. More often than not, however, users do not have prior knowledge of the number of clusters that exist in their data.

To solve this challenge of finding the optimal cluster number, quite a few clustering evaluation techniques (Arbelaitz et al. 2013; Charrad et al. 2014a) as well as R packages (e.g. cclust (Dimitriadou et al. 2015), clusterSim (Walesiak et al. 2015), cluster (Maechler et al. 2015), NbClust (Charrad et al. 2014b), fpc (Hennig 2015)) have been developed over the years to objectively assess clustering quality. The problem of identifying the optimal cluster number is thus transformed into the problem of clustering evaluation. In most of these solutions, clustering is first performed on the data with each of the candidate cluster numbers. The quality of these clustering results is then evaluated based on properties such as cluster compactness (Rousseeuw 1987; Tibshirani et al. 2001) or clustering stability (Ben-Hur et al. 2001; Monti et al. 2003). In particular, stability-based methods have been well received and widely promoted in recent years (Meinshausen and Bühlmann 2010). However, these methods are generally slow to compute because of the repetitive clustering mandated by the nature of stability assessment. Recently, a new method, Progeny Clustering, was developed by Hu et al. (2015) to assess clustering quality and to identify the optimal cluster number based on clustering stability. Compared to other clustering evaluation methods, Progeny Clustering requires fewer samples for stability assessment and is therefore able to greatly boost computing efficiency. However, this advantage rests on the assumption that features are independent within each cluster, so users need to either transform the data to create independent features, or consult other methods if this assumption does not hold for the data of interest.

Here, we introduce a new R package, progenyClust, that performs Progeny Clustering for continuous data. The package consists of a main function progenyClust() that requires few parameter specifications to run the algorithm on a given dataset, as well as a built-in function hclust.progenyClust for using hierarchical clustering as an alternative to k-means. Two example datasets, test and cell, used in the original publication of Progeny Clustering, are provided in the package for testing and sharing purposes. In addition, the progenyClust package includes an option to invert the stability scores, which is not considered in the original algorithm. This additional capability enables the algorithm to produce more interpretable and easier-to-plot results. The rest of the paper is organized as follows: we first describe how Progeny Clustering works and then go over the implementation of the progenyClust package. Following the description of the functions and datasets provided by the package, we provide a proof-of-concept example of how the package works and a real-world example in which the package is used to identify cell phenotypes based on imaging data.
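Before describing the algorithm and the package in detail, the following minimal sketch gives a flavor of the interface. It relies only on the functions and datasets named above and on the package defaults; the FUNclust argument used to swap in the hierarchical clustering wrapper is an assumption based on the package interface and should be checked against the documentation.

```r
## Minimal usage sketch (assumes the progenyClust package is installed from CRAN).
library(progenyClust)

data(test)    # simulated example dataset shipped with the package
set.seed(1)   # the algorithm is stochastic, so fix the seed for reproducibility

## Run Progeny Clustering with the default k-means clustering ...
out.km <- progenyClust(test)

## ... or with the built-in hierarchical clustering wrapper
## (argument name FUNclust assumed from the package interface).
out.hc <- progenyClust(test, FUNclust = hclust.progenyClust)

summary(out.km)   # reports the recommended number of clusters
plot(out.km)      # plots stability scores against candidate cluster numbers
```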

2 Progeny Clustering

In this section, we briefly review the Progeny Clustering algorithm (Hu et al. 2015). Progeny Clustering is a clustering evaluation method, so it needs to be coupled with a stand-alone clustering method such as k-means. The framework of Progeny Clustering is similar to that of other stability-based methods, which select the optimal cluster number as the one that renders the most stable clustering. The evaluation of clustering stability usually starts with an initial clustering of the full (or sometimes partial) dataset, followed by bootstrapping and repetitive clustering, and then uses a certain criterion to assess the stability of the clustering solutions. Progeny Clustering uses the same workflow, but innovates in the bootstrapping step and improves on the stability assessment.

Consider a finite dataset {\(x_{ij}\)}, \(i=1,\ldots,N\), \(j=1,\ldots,M\), that contains \(M\) variables (or features) for \(N\) independent observations (or samples). Given a number \(K\) (a positive integer) for clustering, a clustering method partitions the dataset into \(K\) clusters, each denoted as \(C_k\), \(k=1,\ldots,K\). Inspired by biological concepts, each cluster is treated as a subpopulation and the bootstrapped samples as progenies of that subpopulation. What makes Progeny Sampling unique during the bootstrapping step is that it randomly samples feature values with replacement within each cluster to construct new samples, rather than directly resampling existing samples. Let \(\tilde{N}\) be the number of progenies we generate from each cluster \(C_k\). Combining these progenies, we have a validation dataset {\(y_{ij}^{\left(k\right)}\)}, \(i=1,\ldots,\tilde{N}\), \(j=1,\ldots,M\), \(k=1,\ldots,K\), containing \(K\times \tilde{N}\) observations with \(M\) features. Using the same number \(K\) and the same clustering method, we partition the progenies {\(y_{ij}^{\left(k\right)}\)} into \(K\) progeny clusters, denoted by \(C'_k\), \(k=1,\ldots,K\). A symmetric co-occurrence matrix \(Q\) records the clustering memberships of each progeny as follows:

\[Q_{ab}= \begin{cases} 1, & \text{if the $a^{th}$ progeny and the $b^{th}$ progeny are in the same cluster $C'_k$} \\ 0, & \text{otherwise} \end{cases}\label{eq:a1} . \tag{1}\] The progenies in \(Q\) are ordered by the initial cluster (\(C_k\)) they were generated from, such that \(Q_{a+1},\ldots,Q_{a+\tilde{N}}\) correspond to the progenies of \(C_k\), where \(a=\left(k-1\right)\tilde{N}\). After repeating the above process (from generating progenies to obtaining \(Q\)) \(R\) times, we obtain a series of co-occurrence matrices \(Q^{\left(r\right)}\), \(r=1,\ldots,R\). Averaging the \(Q^{\left(r\right)}\) results in a probability matrix \(P\), i.e.

\[\label{eq:002} P_{ab}=\sum_r Q^{\left(r\right)}_{ab}/R . \tag{2}\]
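To make the sampling and co-occurrence steps concrete, the sketch below is illustrative R code, not the package's internal implementation (make_progeny and cooccurrence are hypothetical helper names). It generates progenies from one cluster by feature-wise resampling, builds the co-occurrence matrix \(Q\) of Equation (1) from a vector of progeny cluster memberships, and averages repeats into the probability matrix \(P\) of Equation (2).

```r
## Illustrative sketch only, not the package's internal code.

## Progeny Sampling for one cluster: each feature (column) is resampled
## independently with replacement to construct n.progeny new samples.
make_progeny <- function(cluster.data, n.progeny) {
  apply(cluster.data, 2, function(feature) sample(feature, n.progeny, replace = TRUE))
}

set.seed(1)
toy.cluster <- matrix(rnorm(20), nrow = 10, ncol = 2)   # 10 observations, 2 features
make_progeny(toy.cluster, n.progeny = 5)                # 5 progenies from this cluster

## Co-occurrence matrix Q of Equation (1): Q[a, b] = 1 if progenies a and b
## fall into the same progeny cluster C'_k, and 0 otherwise.
cooccurrence <- function(membership) {
  1 * outer(membership, membership, FUN = "==")
}

## Two hypothetical repeats (R = 2) for K = 2 initial clusters with
## Ntilde = 3 progenies each, ordered by their cluster of origin.
Q1 <- cooccurrence(c(1, 1, 1, 2, 2, 2))
Q2 <- cooccurrence(c(1, 1, 2, 1, 2, 2))

## Element-wise average gives the probability matrix P of Equation (2).
P <- (Q1 + Q2) / 2
```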

From this probability matrix \(P\), we compute the stability score for clustering the dataset {\(x_{ij}\)} into \(K\) clusters as

\[\label{eq:003} S= \frac{\displaystyle \sum_k \sum_{a, b \in C_k, b\neq a} P_{ab}/\left(\tilde{N}-1\right)} {\displaystyle \sum_k \sum_{a \in C_k, b \notin C_k} P_{ab}/\left(K\tilde{N}-\tilde{N}\right)} . \tag{3}\]
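Continuing the toy example, the following sketch evaluates Equation (3) on a small probability matrix (again illustrative only; stability_score is a hypothetical helper, and it assumes progenies are ordered by their cluster of origin as described above).

```r
## Illustrative sketch only: stability score S of Equation (3), assuming that
## rows/columns (k - 1) * Ntilde + 1, ..., k * Ntilde of P correspond to the
## progenies generated from cluster C_k.
stability_score <- function(P, K, Ntilde) {
  within <- 0
  between <- 0
  for (k in seq_len(K)) {
    idx <- ((k - 1) * Ntilde + 1):(k * Ntilde)
    block <- P[idx, idx]
    within <- within + sum(block) - sum(diag(block))   # a, b in C_k, b != a
    between <- between + sum(P[idx, -idx])             # a in C_k, b not in C_k
  }
  (within / (Ntilde - 1)) / (between / (K * Ntilde - Ntilde))
}

## Toy example using the P constructed in the previous sketch (K = 2, Ntilde = 3);
## a ratio above 1 means progenies co-cluster with their cluster of origin more
## often than with other clusters.
P <- (1 * outer(c(1, 1, 1, 2, 2, 2), c(1, 1, 1, 2, 2, 2), "==") +
      1 * outer(c(1, 1, 2, 1, 2, 2), c(1, 1, 2, 1, 2, 2), "==")) / 2
stability_score(P, K = 2, Ntilde = 3)
```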

A schematic of this process and the corresponding pseudocode are shown in Figure 1 and Figure 2, respectively.