Identifying the optimal number of clusters is a common problem faced by data scientists in various research fields and industry applications. Though many clustering evaluation techniques have been developed to solve this problem, the recently developed Progeny Clustering algorithm offers a much faster alternative that is particularly relevant to biomedical applications. In this paper, we introduce an R package *progenyClust* that implements and extends the original Progeny Clustering algorithm for evaluating clustering stability and identifying the optimal cluster number. We illustrate its applicability using two examples: a simulated test dataset for proof-of-concept, and a cell imaging dataset that demonstrates its application potential in biomedical research. The *progenyClust* package is versatile in that it offers great flexibility in choosing methods and tuning parameters. In addition, the default parameter setting as well as the plot and summary methods offered in the package make the application of Progeny Clustering straightforward and coherent.

Clustering is a classical and widely used machine learning technique, yet the field of clustering is constantly growing. The goal of clustering is to group objects that are similar to each other and separate objects that are not, based on common features. Clustering can, for example, be applied to distinguishing tumor subclasses based on gene expression data (Sørlie et al. 2001; Budinska et al. 2013), or dividing sport fans based on their demographic information (Ross 2007). One critical challenge in clustering is identifying the optimal number of groups. Although some advanced clustering algorithms can automatically determine the cluster number (e.g. Affinity Propagation (Frey and Dueck 2007)), the commonly used algorithms (e.g. k-means (Hartigan and Wong 1979) and hierarchical clustering (Johnson 1967)) unfortunately require users to specify the cluster number before performing the clustering task. However, more often than not, users do not have prior knowledge of the number of clusters that exist in their data.

To solve this challenge of finding the optimal cluster number, quite a
few clustering evaluation techniques
(Arbelaitz et al. 2013; Charrad et al. 2014a) as well as R packages
(e.g. *cclust*
(Dimitriadou et al. 2015),
*clusterSim*
(Walesiak et al. 2015),
*cluster*
(Maechler et al. 2015),
*NbClust*
(Charrad et al. 2014b), *fpc*
(Hennig 2015)) were developed over the years to objectively
assess the clustering quality. The problem of identifying the optimal
cluster number is thus transformed into the problem of clustering
evaluation. In most of these solutions, clustering is first performed on
the data with each of the candidate cluster numbers. The quality of
these clustering results is then evaluated based on properties such as
cluster compactness (Rousseeuw 1987; Tibshirani et al. 2001) or clustering
stability (Ben-Hur et al. 2001; Monti et al. 2003). In particular, stability-based methods
have been well received and greatly promoted in recent years
(Meinshausen and Bühlmann 2010). However, these methods are generally slow
to compute because of the repetitive clustering process mandated by the
nature of stability assessment. Recently, a new method Progeny
Clustering was developed by Hu et al. (2015) to assess clustering quality
and to identify the optimal cluster number based on clustering
stability. Compared to other clustering evaluation methods, Progeny
Clustering requires fewer samples for clustering stability assessment,
thus it is able to greatly boost computing efficiency. However, this
advantage rests on the assumption that features are independent within
each cluster, so users need to either transform the data to create
independent features or consult other methods if this assumption does
not hold for the data of interest.

Here, we introduce a new R package,
*progenyClust*, that
performs Progeny Clustering for continuous data. The package consists of
a main function `progenyClust()` that requires few parameter
specifications to run the algorithm on any given dataset, as well as a
built-in function `hclust.progenyClust` to use hierarchical clustering
as an alternative to using `kmeans`. Two example datasets, `test` and
`cell`, used in the original publication of Progeny Clustering, are
provided in this package for testing and sharing purposes. In addition,
the *progenyClust* package includes an option to invert the stability
scores, which is not considered in the original algorithm. This
additional capability enables the algorithm to produce more
interpretable and easier-to-plot results. The rest of the paper is
organized as follows: We will first describe how Progeny Clustering
works and then go over the implementation of the *progenyClust* package.
Following the description of functions and datasets provided by the
package, we will provide one proof-of-concept example of how the package
works and a real world example where the package is used to identify
cell phenotypes based on imaging data.

In this section, we briefly review the Progeny Clustering algorithm (Hu et al. 2015). Progeny Clustering is a clustering evaluation method, so it must be coupled with a stand-alone clustering method such as k-means. The framework of Progeny Clustering is similar to that of other stability-based methods, which select the optimal cluster number as the one that renders the most stable clustering. The evaluation of clustering stability usually starts with an initial clustering of the full (or sometimes partial) dataset, followed by bootstrapping and repetitive clustering, and then applies a criterion to assess the stability of the clustering solutions. Progeny Clustering follows the same workflow, but innovates in the bootstrapping step and improves on the stability assessment.

Consider a finite dataset {\(x_{ij}\)}, \(i=1,\ldots,N\), \(j=1,\ldots,M\) that contains \(M\) variables (or features) for \(N\) independent observations (or samples). Given a number \(K\) (a positive integer) for clustering, a clustering method partitions the dataset into \(K\) clusters. Each cluster is denoted as \(C_k\), \(k=1,\ldots,K\). Inspired by biological concepts, each cluster is treated as a subpopulation and the bootstrapped samples as progenies from that subpopulation. The uniqueness of Progeny Sampling during the bootstrapping step is that it randomly samples feature values with replacement to construct new samples rather than directly sampling existing samples. Let \(\tilde{N}\) be the number of progenies we generate from each cluster \(C_k\). Combining these progenies, we have a validation dataset {\(y_{ij}^{\left(k\right)}\)}, \(i=1,\ldots,\tilde{N}\), \(j=1,\ldots,M\), \(k=1,\ldots,K\), containing \(K\times \tilde{N}\) observations with \(M\) features. Using the same number \(K\) and the same method for clustering, we partition the progenies {\(y_{ij}^{\left(k\right)}\)} into \(K\) progeny clusters, denoted by \(C'_k\), \(k=1,\ldots,K\). A symmetric co-occurrence matrix \(Q\) records the clustering memberships of each progeny as follows:
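To make the Progeny Sampling step concrete, the following is a minimal illustrative sketch in Python/NumPy (not the package's R implementation; the function and variable names are our own). Each cluster's progenies are built by resampling each feature column independently, with replacement, from that cluster's members:

```python
import numpy as np

def progeny_sample(X, labels, n_progeny, rng):
    """Generate n_progeny progenies per cluster by resampling each
    feature independently (with replacement) within the cluster."""
    progenies = []
    for k in np.unique(labels):
        Xk = X[labels == k]                      # members of cluster C_k
        # For each feature j, draw n_progeny values from column j of C_k
        cols = [rng.choice(Xk[:, j], size=n_progeny, replace=True)
                for j in range(Xk.shape[1])]
        progenies.append(np.column_stack(cols))  # (n_progeny, M) block
    return np.vstack(progenies)                  # (K * n_progeny, M)
```

Because features are resampled independently, the progenies preserve each cluster's marginal feature distributions but not within-sample feature correlations, which is why the independence assumption discussed above matters.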

\[Q_{ab}= \begin{cases} 1, & \text{if the $a^{th}$ progeny and the $b^{th}$ progeny are in the same cluster $C'_k$} \\ 0, & \text{otherwise} \end{cases}\label{eq:a1} . \tag{1}\] The progenies in \(Q\) are ordered by the initial cluster (\(C_k\)) from which they were generated, such that \(Q_{a+1},\ldots,Q_{a+\tilde{N}}\in C_k\), \(a=\left(k-1\right)\tilde{N}\). After repeating the above process (from generating progenies to obtaining \(Q\)) \(R\) times, we obtain a series of co-occurrence matrices \(Q^{\left(r\right)}\), \(r=1,\ldots,R\). Averaging the \(Q^{\left(r\right)}\) yields a stability probability matrix \(P\), i.e.

\[\label{eq:002} P_{ab}=\sum_r Q^{\left(r\right)}_{ab}/R . \tag{2}\]
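Continuing the illustrative Python sketch (names are again our own, not the package's API), one repeat builds \(Q\) from the progeny cluster labels, and averaging over the \(R\) repeats yields \(P\):

```python
import numpy as np

def cooccurrence(progeny_labels):
    """Co-occurrence matrix Q: Q[a, b] = 1 iff progenies a and b
    fall in the same progeny cluster C'_k (Equation 1)."""
    lab = np.asarray(progeny_labels)
    return (lab[:, None] == lab[None, :]).astype(float)

def stability_matrix(label_runs):
    """Average the co-occurrence matrices over R repeats (Equation 2)."""
    return np.mean([cooccurrence(lab) for lab in label_runs], axis=0)
```

Each entry \(P_{ab}\) then estimates the probability that progenies \(a\) and \(b\) are clustered together across repeats.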

From this probability matrix \(P\), we compute the stability score for clustering the dataset {\(x_{ij}\)} into \(K\) clusters as

\[\label{eq:003} S= \frac{\displaystyle \sum_k \sum_{a, b \in C_k, b\neq a} P_{ab}/\left(\tilde{N}-1\right)} {\displaystyle \sum_k \sum_{a \in C_k, b \notin C_k} P_{ab}/\left(K\tilde{N}-\tilde{N}\right)} . \tag{3}\]
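Equation 3 can be read as the ratio of the average within-cluster co-occurrence probability to the average between-cluster co-occurrence probability. A minimal Python sketch of this computation (our own illustrative code, assuming progenies are ordered by cluster of origin as described above):

```python
import numpy as np

def stability_score(P, n_progeny, K):
    """Stability score S (Equation 3): ratio of summed within-cluster
    to summed between-cluster co-occurrence probabilities, each
    normalized by the number of partners per progeny."""
    Nt = n_progeny
    within, between = 0.0, 0.0
    for k in range(K):
        idx = np.arange(k * Nt, (k + 1) * Nt)        # progenies from C_k
        block = P[np.ix_(idx, idx)]
        # b != a: drop the diagonal, then normalize by Nt - 1 partners
        within += (block.sum() - np.trace(block)) / (Nt - 1)
        other = np.delete(np.arange(K * Nt), idx)    # progenies outside C_k
        between += P[np.ix_(idx, other)].sum() / (K * Nt - Nt)
    return within / between
```

A larger \(S\) indicates that progenies tend to be re-clustered with progenies from their own cluster of origin, i.e. a more stable clustering at that \(K\).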

A schematic for this process and the pseudocode are shown in Figure 1 and Figure 2.