progenyClust: an R package for Progeny Clustering

Identifying the optimal number of clusters is a common problem faced by data scientists in various research fields and industry applications. Though many clustering evaluation techniques have been developed to solve this problem, the recently developed Progeny Clustering algorithm is a much faster alternative, and one that is relevant to biomedical applications. In this paper, we introduce the R package progenyClust, which implements and extends the original Progeny Clustering algorithm for evaluating clustering stability and identifying the optimal cluster number. We illustrate its applicability using two examples: a simulated test dataset as a proof of concept, and a cell imaging dataset demonstrating its application potential in biomedical research. The progenyClust package is versatile in that it offers great flexibility in choosing clustering methods and tuning parameters. In addition, the default parameter settings as well as the plot and summary methods offered in the package make the application of Progeny Clustering straightforward and coherent.

Chenyue W. Hu (Rice University) , Amina A. Qutub (Rice University)
2016-05-01

1 Introduction

Clustering is a classical and widely used machine learning technique, yet the field of clustering is constantly growing. The goal of clustering is to group objects that are similar to each other and separate objects that are not, based on common features. Clustering can, for example, be applied to distinguishing tumor subclasses based on gene expression data (Sørlie et al. 2001; Budinska et al. 2013), or to dividing sports fans based on their demographic information (Ross 2007). One critical challenge in clustering is identifying the optimal number of groups. Although some advanced clustering algorithms can automatically determine the cluster number (e.g. Affinity Propagation (Frey and Dueck 2007)), the commonly used algorithms (e.g. k-means (Hartigan and Wong 1979) and hierarchical clustering (Johnson 1967)) unfortunately require users to specify the cluster number before performing the clustering task. More often than not, however, users do not have prior knowledge of the number of clusters that exist in their data.

To solve this challenge of finding the optimal cluster number, quite a few clustering evaluation techniques (Arbelaitz et al. 2013; Charrad et al. 2014a) as well as R packages (e.g. cclust (Dimitriadou et al. 2015), clusterSim (Walesiak et al. 2015), cluster (Maechler et al. 2015), NbClust (Charrad et al. 2014b), fpc (Hennig 2015)) have been developed over the years to objectively assess clustering quality. The problem of identifying the optimal cluster number is thus transformed into the problem of clustering evaluation. In most of these solutions, clustering is first performed on the data with each of the candidate cluster numbers. The quality of these clustering results is then evaluated based on properties such as cluster compactness (Rousseeuw 1987; Tibshirani et al. 2001) or clustering stability (Ben-Hur et al. 2001; Monti et al. 2003). In particular, stability-based methods have been well received and widely promoted in recent years (Meinshausen and Bühlmann 2010). However, these methods are generally slow to compute because of the repetitive clustering mandated by the nature of stability assessment. Recently, a new method, Progeny Clustering, was developed by Hu et al. (2015) to assess clustering quality and to identify the optimal cluster number based on clustering stability. Compared to other clustering evaluation methods, Progeny Clustering requires fewer samples for stability assessment and is therefore able to greatly boost computing efficiency. However, this advantage rests on the assumption that features are independent within each cluster, so users need to either transform the data to create independent features, or consult other methods if this assumption does not hold for the data of interest.

Here, we introduce a new R package, progenyClust, that performs Progeny Clustering for continuous data. The package consists of a main function progenyClust() that requires few parameter specifications to run the algorithm on a given dataset, as well as a built-in function hclust.progenyClust for using hierarchical clustering as an alternative to k-means. Two example datasets, test and cell, used in the original publication of Progeny Clustering, are provided in the package for testing and sharing purposes. In addition, the progenyClust package includes an option to invert the stability scores, which is not considered in the original algorithm. This additional capability enables the algorithm to produce more interpretable and easier-to-plot results. The rest of the paper is organized as follows: we first describe how Progeny Clustering works and then go over the implementation of the progenyClust package. Following the description of the functions and datasets provided by the package, we provide a proof-of-concept example of how the package works and a real-world example in which the package is used to identify cell phenotypes based on imaging data.
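Before describing the algorithm and the package in detail, the following minimal sketch gives a flavor of the interface. It relies only on the functions and datasets named above and on the package defaults; the FUNclust argument used to swap in the hierarchical clustering wrapper is an assumption based on the package interface and should be checked against the documentation.

```r
## Minimal usage sketch (assumes the progenyClust package is installed from CRAN).
library(progenyClust)

data(test)    # simulated example dataset shipped with the package
set.seed(1)   # the algorithm is stochastic, so fix the seed for reproducibility

## Run Progeny Clustering with the default k-means clustering ...
out.km <- progenyClust(test)

## ... or with the built-in hierarchical clustering wrapper
## (argument name FUNclust assumed from the package interface).
out.hc <- progenyClust(test, FUNclust = hclust.progenyClust)

summary(out.km)   # reports the recommended number of clusters
plot(out.km)      # plots stability scores against candidate cluster numbers
```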

2 Progeny Clustering

In this section, we briefly review the Progeny Clustering algorithm (Hu et al. 2015). Progeny Clustering is a clustering evaluation method, so it needs to be coupled with a stand-alone clustering method such as k-means. The framework of Progeny Clustering is similar to that of other stability-based methods, which select the optimal cluster number as the one that renders the most stable clustering. The evaluation of clustering stability usually starts with an initial clustering of the full (or sometimes partial) dataset, followed by bootstrapping and repetitive clustering, and then uses a certain criterion to assess the stability of the clustering solutions. Progeny Clustering uses the same workflow, but innovates in the bootstrapping step and improves on the stability assessment.

Consider a finite dataset {\(x_{ij}\)}, \(i=1,\ldots,N\), \(j=1,\ldots,M\), that contains \(M\) variables (or features) for \(N\) independent observations (or samples). Given a number \(K\) (a positive integer) for clustering, a clustering method partitions the dataset into \(K\) clusters, each denoted as \(C_k\), \(k=1,\ldots,K\). Inspired by biological concepts, each cluster is treated as a subpopulation and the bootstrapped samples as progenies of that subpopulation. What makes Progeny Sampling unique during the bootstrapping step is that it randomly samples feature values with replacement within each cluster to construct new samples, rather than directly resampling existing samples. Let \(\tilde{N}\) be the number of progenies we generate from each cluster \(C_k\). Combining these progenies, we have a validation dataset {\(y_{ij}^{\left(k\right)}\)}, \(i=1,\ldots,\tilde{N}\), \(j=1,\ldots,M\), \(k=1,\ldots,K\), containing \(K\times \tilde{N}\) observations with \(M\) features. Using the same number \(K\) and the same clustering method, we partition the progenies {\(y_{ij}^{\left(k\right)}\)} into \(K\) progeny clusters, denoted by \(C'_k\), \(k=1,\ldots,K\). A symmetric co-occurrence matrix \(Q\) records the clustering memberships of each progeny as follows:

\[Q_{ab}= \begin{cases} 1, & \text{if the $a^{th}$ progeny and the $b^{th}$ progeny are in the same cluster $C'_k$} \\ 0, & \text{otherwise} \end{cases}\label{eq:a1} . \tag{1}\] The progenies in \(Q\) are ordered by the initial cluster (\(C_k\)) they were generated from, such that \(Q_{a+1},\ldots,Q_{a+\tilde{N}}\) correspond to the progenies of \(C_k\), where \(a=\left(k-1\right)\tilde{N}\). After repeating the above process (from generating progenies to obtaining \(Q\)) \(R\) times, we obtain a series of co-occurrence matrices \(Q^{\left(r\right)}\), \(r=1,\ldots,R\). Averaging the \(Q^{\left(r\right)}\) results in a probability matrix \(P\), i.e.

\[\label{eq:002} P_{ab}=\sum_r Q^{\left(r\right)}_{ab}/R . \tag{2}\]
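To make the sampling and co-occurrence steps concrete, the sketch below is illustrative R code, not the package's internal implementation (make_progeny and cooccurrence are hypothetical helper names). It generates progenies from one cluster by feature-wise resampling, builds the co-occurrence matrix \(Q\) of Equation (1) from a vector of progeny cluster memberships, and averages repeats into the probability matrix \(P\) of Equation (2).

```r
## Illustrative sketch only, not the package's internal code.

## Progeny Sampling for one cluster: each feature (column) is resampled
## independently with replacement to construct n.progeny new samples.
make_progeny <- function(cluster.data, n.progeny) {
  apply(cluster.data, 2, function(feature) sample(feature, n.progeny, replace = TRUE))
}

set.seed(1)
toy.cluster <- matrix(rnorm(20), nrow = 10, ncol = 2)   # 10 observations, 2 features
make_progeny(toy.cluster, n.progeny = 5)                # 5 progenies from this cluster

## Co-occurrence matrix Q of Equation (1): Q[a, b] = 1 if progenies a and b
## fall into the same progeny cluster C'_k, and 0 otherwise.
cooccurrence <- function(membership) {
  1 * outer(membership, membership, FUN = "==")
}

## Two hypothetical repeats (R = 2) for K = 2 initial clusters with
## Ntilde = 3 progenies each, ordered by their cluster of origin.
Q1 <- cooccurrence(c(1, 1, 1, 2, 2, 2))
Q2 <- cooccurrence(c(1, 1, 2, 1, 2, 2))

## Element-wise average gives the probability matrix P of Equation (2).
P <- (Q1 + Q2) / 2
```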

From this probability matrix \(P\), we compute the stability score for clustering the dataset {\(x_{ij}\)} into \(K\) clusters as

\[\label{eq:003} S= \frac{\displaystyle \sum_k \sum_{a, b \in C_k, b\neq a} P_{ab}/\left(\tilde{N}-1\right)} {\displaystyle \sum_k \sum_{a \in C_k, b \notin C_k} P_{ab}/\left(K\tilde{N}-\tilde{N}\right)} . \tag{3}\]
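Continuing the toy example, the following sketch evaluates Equation (3) on a small probability matrix (again illustrative only; stability_score is a hypothetical helper, and it assumes progenies are ordered by their cluster of origin as described above).

```r
## Illustrative sketch only: stability score S of Equation (3), assuming that
## rows/columns (k - 1) * Ntilde + 1, ..., k * Ntilde of P correspond to the
## progenies generated from cluster C_k.
stability_score <- function(P, K, Ntilde) {
  within <- 0
  between <- 0
  for (k in seq_len(K)) {
    idx <- ((k - 1) * Ntilde + 1):(k * Ntilde)
    block <- P[idx, idx]
    within <- within + sum(block) - sum(diag(block))   # a, b in C_k, b != a
    between <- between + sum(P[idx, -idx])             # a in C_k, b not in C_k
  }
  (within / (Ntilde - 1)) / (between / (K * Ntilde - Ntilde))
}

## Toy example using the P constructed in the previous sketch (K = 2, Ntilde = 3);
## a ratio above 1 means progenies co-cluster with their cluster of origin more
## often than with other clusters.
P <- (1 * outer(c(1, 1, 1, 2, 2, 2), c(1, 1, 1, 2, 2, 2), "==") +
      1 * outer(c(1, 1, 2, 1, 2, 2), c(1, 1, 2, 1, 2, 2), "==")) / 2
stability_score(P, K = 2, Ntilde = 3)
```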

A schematic of this process and the corresponding pseudocode are shown in Figure 1 and Figure 2, respectively.