Partitioned Local Depth (PaLD) is a framework for holistic consideration of community structure for distance-based data. This paper describes an R package, pald, for calculating Partitioned Local Depth (PaLD) probabilities, implementing community analyses, determining community clusters, and creating data visualizations to display community structure. We present essentials of the PaLD approach, describe how to use the pald package, walk through several examples, and discuss the method in relation to commonly used techniques.
Partitioned Local Depth (PaLD) is a framework for holistic consideration of community structure for distance-based data. Leveraging a socially inspired perspective, the approach provides network-based community information which is founded on new measures of local depth and pairwise cohesion (partitioned local depth). The method does not require distributional assumptions, optimization criteria, nor extraneous inputs. A complete description of the perspective, together with a discussion of the underlying social motivation, theoretical results, and applications to additional data sets is provided in 1. A brief technical description is included below, directly following the introduction.
As suggested in 1, a main goal of PaLD is to “transform input dissimilarity comparisons into output pairwise relationship strengths (or cohesion) and resulting weighted networks”. This is intended to provide within- and between-community structural information which goes beyond simple cluster labeling. Building on existing approaches to (global) depth, local depth expresses features of centrality via an interpretable probability which is free of parameters and robust to outliers. Partitioning the probabilities which define local depth, we then obtain a measure of cohesion between pairs of individuals. Both local depth and cohesion reflect aspects of relative position (rather than absolute distance) and provide a straightforward way to account for varying density across the space. As shown in 1, provided that two sets are separated (in the sense that the minimum between-set distance is greater than the maximum within-set distance), cohesion is invariant under the contraction and dilation of the distances within each set. This property may be particularly valuable when one has reason to believe that there is heterogeneity in density across the space.
As cohesion captures a sense of the relationship strength between points, we can then analyze and visualize the resulting community structure via a network whose edges are weighted by (mutual) cohesion. The underlying social framework motivates a straightforward yet elegant threshold for distinguishing between strongly and weakly cohesive pairs.
Networks obtained from cohesion can be displayed using a force-directed graph drawing algorithm; here we will graphically emphasize the strong ties (colored by connected component). We refer to the connected components of the network of strong ties as community “clusters”. Note that, to qualify as a cluster under this definition, no member may have strong ties to elements outside the cluster; thus the existence of disjoint groups is a strong signal of separation. Here, clusters are identified without additional user inputs or optimization criteria. If one wishes to further break the community graph into groups, one may consider using community detection methods (such as spectral clustering or the Louvain algorithm), as available, say, in the igraph package. The collection of strong ties may be used in place of (weighted) \(k\)-nearest neighbors in settings such as classification and smoothing. Overall, the structural information obtained from local depth, cohesion and community graphs can provide a holistic perspective on the data which does not require distributional assumptions, optimization criteria, nor additional user inputs.
It is important to emphasize at the outset that community analyses go beyond simple cluster labeling to address intra- and inter-cluster structure, and can supplement results from other methods for clustering, embedding, data depth, nearest neighbors, etc. In addition, the method does not require distributional assumptions, optimization criteria, nor extraneous inputs. Theoretical considerations (see 1) and examples point to distinctive properties that assist in considerations of complex data, in particular with respect to varying density and high dimensions. See the section Cultural and Psychological distance analysis, below, for some discussion in the context of density-based methods.
After a brief introduction to the partitioned local depth approach (including a concrete and instructive example), below, we present a new package, pald, for calculating partitioned local depths, implementing community analyses, and creating data visualizations to display community structure. This paper describes how to use the package, walks through several examples, and contrasts the method results with commonly used techniques. Together, these demonstrate both the novelty of the method and utility of the implementation in the package described.
The PaLD methodology as introduced in 1 offers a parameter-free approach to analyzing community structure in distance-based data. First, consider a ground set \(\mathcal{S}\) equipped with a meaningful measure of pairwise distance (or dissimilarity), \(d:\mathcal{S}\times \mathcal{S}\rightarrow \mathbb{R}\cup \{\infty\}\). We require only that, for all \(x,y\in \mathcal{S}\) with \(y\neq x\),
\[\begin{aligned} d(x,x) \leq d(y,x) \quad\text{and}\quad d(x,x)< d(x,y). \end{aligned} \tag{1}\] Note that \(d\) need not be symmetric nor positive (see the remark on distances and dissimilarities below).
We begin with a concept formalizing locality to a pair. For any pair \((x, y)\in \mathcal{S}\times \mathcal{S}\), the local focus, \(U_{x,y}\) (illustrated in Figure 1 for two-dimensional Euclidean data) is defined via
\[\begin{aligned} U_{x,y} \stackrel{\text{def}}{=} \{ z \in \mathcal{S} \mid d(z,x) \leq d(y,x) \text{ or } d(z,y) \leq d(x,y) \}. \label{localfocusdefn} \end{aligned} \tag{2}\]
The set \(U_{x,y}\) consists of elements that are “locally” relevant to the pair \((x,y)\). Eq. (1) guarantees that both \(x\) and \(y\) are elements of \(U_{x,y}\). For discussion from a social perspective, wherein the data are embedded in a latent social space, see Social Framework in 1. Recall that in (2), the distance \(d\) need not be symmetric; leveraging the social perspective, we are interested in the direct closeness of \(z\) to \(x\) and \(y\), not the reverse (see also the remark on asymmetry, below).
Now, suppose that \(x\) is fixed, and select \(Y\) and \(Z\) uniformly at random from \(\mathcal{S}\setminus \{x\}\) and \(U_{x,Y}\), respectively. The local depth of \(x\), \(\ell_{\mathcal{S}}(x)\), is defined as the probability that \(Z\) is closer to \(x\) than it is to \(Y\) (with a coin flip in the case \(Z\) is equidistant), i.e.
\[\begin{aligned} \ell_{\mathcal{S}}(x)\stackrel{\mathrm{def}}{=}P(d(Z,x) < d(Z,Y)) \label{localdepthdefn}. \end{aligned} \tag{3}\]
The local depth of \(x\) can be thought of as a measure of the extent to which \(x\) is (relatively) surrounded by other elements of \(\mathcal{S}\) (under \(d\)).
Finally, the cohesion, \(C_{x,w}\), of \(w\) to \(x\), for a given \(w\in \mathcal{S}\), is obtained through a partitioning of the probability defining local depth in (3). In particular,
\[\begin{aligned} C_{x,w} \stackrel{\mathrm{def}}{=}P(Z=w \text{ and } d(Z,x) < d(Z,Y)) \label{cohesdefn}. \end{aligned} \tag{4}\]
The cohesion of \(w\) to \(x\) can be viewed as the contribution of \(w\) to the local depth of \(x\). Eq. (1) guarantees that when \(Z=x\), \(d(Z,x) < d(Z,Y)\). For some theoretical properties of cohesion, see 1. Note that the sum of all cohesions is conserved at \(n/2\), where \(n\stackrel{\mathrm{def}}{=}|\mathcal{S}|\); we thus have a constant mean local depth of \(0.5\) (independent of \(n\)).
A universal threshold for strong cohesion (see 1) is defined via
\[\begin{aligned} T \stackrel{\mathrm{def}}{=}P(Z=W \text{ and } d(Z,X)<d(Z,Y)), \label{Tsd0} \end{aligned} \tag{5}\]
where \(X\), \(Y\), \(Z\), and \(W\) are selected uniformly at random from \(\mathcal{S}\), \(\mathcal{S}\setminus \{X\}\), \(U_{X,Y}\), and \(U_{X,Y}\), respectively. The threshold, \(T\), can be conveniently calculated as one half the average of the diagonal of the matrix of pairwise cohesion values (see 1 for details), i.e. \[T = \frac{1}{2} \left( \frac{1}{n} \sum_{x\in \mathcal{S}} C_{x,x} \right). \label{Tsd} \tag{6}\] A strong connection between \(x\) and \(w\) is then established when: \[\min \{ C_{x,w}, C_{w,x} \} \geq T. \label{Tsd2} \tag{7}\]
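For readers who wish to see the computation spelled out, the threshold and strong-tie condition above reduce to a few lines of base R. The sketch below assumes a square matrix C whose (x, w) entry is the cohesion of w to x; the pald package, described below, provides these computations directly.

# A minimal base-R sketch of (6) and (7), assuming a square matrix C whose
# (x, w) entry holds the cohesion of w to x.
strong_ties <- function(C) {
  threshold <- mean(diag(C)) / 2    # the threshold T in (6)
  mutual <- pmin(C, t(C))           # mutual (symmetrized) cohesion
  mutual * (mutual >= threshold)    # retain only strong ties, as in (7)
}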
For further details on PaLD, including theoretical results, discussion of the underlying social perspective and applications, see Berenhaut et al. (2022; see also Berenhaut and Moore 2022).
Before moving on to a simple concrete example, we provide a few remarks, for the interested reader, on the concepts introduced above.
(Distances and dissimilarities) For the purposes of the package, we require only the inequalities in (1). This implies the convenient form of the threshold in (7) (see Berenhaut et al. 2022), and is crucial to the underlying social perspective of conflict. For a more general mathematical PaLD framework, wherein the concepts of locality and support are further refined, see Berenhaut et al. (2024). As the definitions in (2), (3) and (4) depend only on triplet distance comparisons, local depth and cohesion are invariant under monotone transformations of distance, and negative values are permissible. If meaningful in context, distances need not be symmetric and self-distances, \(d(x,x)\), may vary over \(x\) (subject to the constraint in (1)). The triangle inequality may be violated to some extent, mirroring ideas of strong triadic closure (Granovetter 1973).
(Local foci) The concept of local foci in (2) is crucial to all that follows. It allows for consideration of the data at varying scales without the need for localizing parameters. The selection of \(Z\) in (3) and (4) reflects the larger presence of individuals in smaller foci (see Feld (1981) for discussion in social settings). The results in 1 regarding limiting irrelevance of density and separation under increasing concentration are driven by local comparisons. Self-cohesions (i.e., diagonal elements of \(C\)) are not necessarily equal. This is driven by the fact that, by (4), \[\begin{aligned} C_{x,x}=\frac{1}{n-1} \sum_{y\neq x} \frac{1}{|{U_{x,y}}|}, \end{aligned}\] and \(C_{x,x}\) is dependent on the local density around \(x\).
(Non-monotonicity of the size of local foci) As the distance from an element, \(x\), increases, the size of the local foci need not increase monotonically. In particular, consider the one-dimensional set \(\mathcal{S}=\{B,x,A,C,D\}\), with \(B=5\), \(x=11\), \(A=16\), \(C=18\), \(D=19\). Under Euclidean distance, \(d(x,A)=5<6=d(x,B)\) but \(|U_{x,A}|=|\{x,A,C,D\}|=4\) and \(|U_{x,B}|=|\{x,B,A\}|=3\). This further emphasizes how local foci capture local density, an important feature of PaLD.
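The focus sizes quoted in this remark can be checked directly; a small standalone sketch (not a package function) is given below.

# Checking the local focus sizes in the remark above; U_{x,y} as in (2).
S <- c(B = 5, x = 11, A = 16, C = 18, D = 19)
local_focus <- function(x, y, S) {
  d <- function(a, b) abs(a - b)
  names(S)[d(S, S[x]) <= d(S[y], S[x]) | d(S, S[y]) <= d(S[x], S[y])]
}
local_focus("x", "A", S)   # "x" "A" "C" "D"  (size 4)
local_focus("x", "B", S)   # "B" "x" "A"      (size 3)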
(Asymmetry) Note that the definitions in (2), (3) and (4) are stated to allow for asymmetric distances, as may occur, for instance, when considering distances on graphs. Note that even symmetric distances can easily lead to asymmetric cohesion. This is the case even for the simple one-dimensional data considered in Example 1. In considering strong relationships based on cohesion, though, symmetrization is employed in (7).
(The use of weighted networks) In what follows, we will at times have occasion to consider a network with node set \(\mathcal{S}\) and pairwise edges weighted via the respective values of cohesion. When viewed as a weighted (connectivity) network in this sense,
the weighted nodal degrees are the values of local depth, addressing considerations of (local) data depth,
the network may be embedded in low-dimensional Euclidean space via standard network embedding techniques, providing visualization of the data that adapts to relative density,
the isolated nodes of either the original network or that restricted to strong ties can provide information regarding outliers,
the connected components of the network restricted to strong ties (i.e. those with weights above the threshold in (6)) can be taken as “clusters” in the classical sense,
weighted or unweighted network neighbors may be used in place of \(k\)-nearest neighbors in settings such as classification and smoothing, and
some network analyses can be informative regarding data structure.
Ideas of distance on networks can be complex, though, and this is particularly so for general weighted networks (e.g. correlation or social networks). As is the case here, edge weights may reflect intensity of connection as opposed to simple pairwise distance, and it is important to exercise caution when applying naïve functions intended for unweighted networks (or simple distance-based weighted networks) in igraph and other software.
(Matrix methods) For convenience, we use the
igraph package to
analyze weighted networks, where appropriate. Matrix-based methods
are also possible. For instance, we employ igraph::components
to
obtain community clusters from networks, but alternatively one
could extract blocks directly from the cohesion matrix using
packages such as
lintools.
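As an illustration of the igraph-based route just described, the following is a hedged sketch that builds the strong-tie network from a cohesion matrix C (as in the earlier sketch) and reads off its connected components.

# Community clusters via igraph, assuming a cohesion matrix C as above.
library(igraph)
threshold <- mean(diag(C)) / 2      # the threshold in (6)
strong <- pmin(C, t(C))             # mutual cohesion
strong[strong < threshold] <- 0     # drop weak ties
diag(strong) <- 0                   # no self-loops
g_strong <- graph_from_adjacency_matrix(strong, mode = "undirected", weighted = TRUE)
components(g_strong)$membership     # community cluster labels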
(Density-based methods and localized approaches) Complementary density-based methods, such as DBSCAN (density-based spatial clustering of applications with noise) and its hierarchical variant HDBSCAN (see for instance Campello et al. 2020), seek to identify high-density regions as clusters. We discuss these in the context of a concrete example below (see Cultural and Psychological distance analysis). Determination of required parameter values can be challenging. There are also extant methods for considering data depth which probe local structure (Agostinelli and Romanazzi 2011; Paindaveine and Van Bever 2013). These often also require localizing parameters. In (Berenhaut et al. 2022) (see Theorem 2: Limiting Irrelevance of Density), it is shown that local depth and cohesion account for varying density in the sense that, provided subsets are sufficiently separated, cohesion is maintained as within-subset distances are contracted or dilated, without the need to search over a parameter space.
(Social latent spaces) The definitions in (2), (3), (4), and (6) are developed from a social perspective. The interested reader can refer to the section Social Framework in 1.
As a concrete example demonstrating the PaLD framework, consider the one-dimensional set \(\mathcal{S}\stackrel{\mathrm{def}}{=}\{1,3,7,8,9,13,17\}\), with pairwise (Euclidean) distances given in the array below.
        1   3   7   8   9  13  17
  1     0   2   6   7   8  12  16
  3     2   0   4   5   6  10  14
  7     6   4   0   1   2   6  10
  8     7   5   1   0   1   5   9
  9     8   6   2   1   0   4   8
 13    12  10   6   5   4   0   4
 17    16  14  10   9   8   4   0
The local depths (rounded to three decimal places) are
     1      3      7      8      9     13     17
 0.409  0.463  0.588  0.675  0.600  0.471  0.294
while the pairwise cohesions (in matrix form) are given by
          1       3       7       8       9      13      17
  1   0.210   0.127*  0.036   0.024   0.012   0.000   0.000
  3   0.137*  0.220   0.048   0.036   0.024   0.000   0.000
  7   0.048   0.048   0.220   0.137*  0.109*  0.028   0.000
  8   0.024   0.052   0.163*  0.218   0.163*  0.056   0.000
  9   0.024   0.024   0.109*  0.137*  0.220   0.063   0.024
 13   0.000   0.000   0.012   0.036   0.103   0.188   0.133*
 17   0.000   0.000   0.000   0.000   0.012   0.107*  0.175
where (non-diagonal) cohesion values above the threshold of \[\begin{aligned} 0.1036=(1/14) (0.210+0.220+0.220+0.218+0.220+0.188+0.175) \end{aligned}\] are marked with an asterisk; see (7). The resulting weighted network of pairwise cohesion values is displayed in Figure 2; here strong ties (with cohesion values above the threshold) are colored according to community clusters (the connected components of the network of strong ties), i.e. \(\{1,3\}\), \(\{7,8,9\}\) and \(\{13,17\}\).
Note that the cohesion of the element 17 to the element 1 is zero. This reflects the fact that when an element, \(Y\), other than 1 is selected uniformly at random from \(\mathcal{S}\), and an element \(Z\) is selected uniformly at random from \(U_{x,Y}\), 17 is never, at once, equal to \(Z\) and closer to 1 than it is to the opposing element \(Y\). In fact, 17 is not an element of \(U_{1,y}\), for \(y\in \{3,7,8\}\). For \(y\in \{9,13,17\}\), \(17\in U_{1,y}\), but in each case, \(d(17,y)<d(17,1)\). This zero cohesion is reflected in the absence of any edge (grey or colored) from the node for 17 to the node for 1, in Figure 2. ◻
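The quantities in this example can be reproduced with the pald package introduced in the next section; a brief sketch follows.

# Reproducing Example 1 with the pald package (described below).
library(pald)
S <- c(1, 3, 7, 8, 9, 13, 17)
C <- cohesion_matrix(dist(S))
round(local_depths(C), 3)   # should match the local depths tabulated above
strong_threshold(C)         # approximately 0.104, as computed above
community_clusters(C)       # three clusters: {1, 3}, {7, 8, 9} and {13, 17}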
We now turn to discussion of the pald package.
The main functions in the pald package can be split into three categories:
A function for computing the matrix of cohesion values, as defined in (4).
Functions for extracting useful information from the cohesion matrix, such as local depths, neighbors, community clusters, and graph objects.
Plotting functions for community graphs.
In addition, the package provides a number of pertinent example data sets which may be used to demonstrate community analysis, including a synthetic data set of two-dimensional points created by Gionis et al. to demonstrate aggregation, clustering data generated from the scikit-learn Python package (Pedregosa et al. 2011), data describing cognate relationships between words across 87 Indo-European languages (Dyen et al. 1992), data compiled by Love and Irizarry (2015) of tissue gene expressions, data employing information from the World Values Survey (Inglehart et al. 2014) on cultural values regarding family, religion, education, and institutions for several regions (Muthukrishna et al. 2020), and three example data sets generated for the 1 paper.
While it is not a necessity, the
pald package is designed
to function well with the pipe operator, |>, particularly for those who
are familiar with tidyverse-based approaches. This functionality will be
demonstrated briefly below.
For the purposes of the
pald package, the sole
input for the Partitioned Local Depth (PaLD) computations is a distance
matrix or dist
object. Recall that the collection of input distances
(or dissimilarities) is assumed to satisfy the requirements in
(1). More generally, the method only requires
triplet distance comparisons, as opposed to exact numeric distances (see
1).
For demonstration purposes, we first show how one can compute a distance
matrix from an input data frame with, say, two variables x1
and x2
.
The input data may be of any dimension; in fact the PaLD framework
provides advantages when considering high-dimensional data (see the
Examples section as well as 1).
library(pald)
df <- data.frame(
  x1 = c(6, 8, 8, 16, 4, 14),
  x2 = c(5, 4, 10, 8, 4, 10)
)
rownames(df) <- c("A", "B", "C", "D", "E", "F")
The dist
function returns a (default Euclidean) pairwise distance
matrix for an input data frame, as demonstrated below. If the data are
already provided as a distance matrix (or dist
object), the user can
skip to the next step. Note that the distance matrix needed for the
subsequent functions does not need to be a dist
object and need not
be symmetric.
d <- dist(df)
The function above creates a dist
object. If converted to a matrix,
this will be an \(n\times n\) distance matrix, where \(n\) corresponds to
the number of observations in the original data frame (in this example
\(n = 6\)).
This dist
object, or a distance matrix, can then be passed to the
cohesion_matrix
function in order to calculate pairwise cohesion
values.
cohesion_matrix(d)
#> A B C D E F
#> A 0.25000000 0.18333333 0.06666667 0.0000000 0.18333333 0.0000000
#> B 0.14000000 0.24000000 0.05000000 0.0000000 0.10666667 0.0000000
#> C 0.07333333 0.07333333 0.20333333 0.0000000 0.03333333 0.0800000
#> D 0.00000000 0.00000000 0.00000000 0.2333333 0.00000000 0.1333333
#> E 0.14000000 0.10666667 0.03333333 0.0000000 0.24000000 0.0000000
#> F 0.00000000 0.00000000 0.05000000 0.1400000 0.00000000 0.2400000
#> attr(,"class")
#> [1] "cohesion_matrix" "matrix" "array"
Equivalently, the user can use the native pipe |>
as follows.
df |>
  dist() |>
  cohesion_matrix()
#> A B C D E F
#> A 0.25000000 0.18333333 0.06666667 0.0000000 0.18333333 0.0000000
#> B 0.14000000 0.24000000 0.05000000 0.0000000 0.10666667 0.0000000
#> C 0.07333333 0.07333333 0.20333333 0.0000000 0.03333333 0.0800000
#> D 0.00000000 0.00000000 0.00000000 0.2333333 0.00000000 0.1333333
#> E 0.14000000 0.10666667 0.03333333 0.0000000 0.24000000 0.0000000
#> F 0.00000000 0.00000000 0.05000000 0.1400000 0.00000000 0.2400000
#> attr(,"class")
#> [1] "cohesion_matrix" "matrix" "array"
The cohesion matrix output by the cohesion_matrix
function is the
main input for the majority of the remaining functions.
From the cohesion matrix, a variety of useful quantities can be computed. Below, we create a cohesion matrix using the functions described in the previous section.
df |>
  dist() |>
  cohesion_matrix() -> cohesion
The local_depths
function calculates the depth of each point, as
defined in (3), outputting a vector of local depth
probabilities.
local_depths(cohesion)
#> A B C D E F
#> 0.6833333 0.5366667 0.4633333 0.3666667 0.5200000 0.4300000
In this case, the (locally) deepest point is A
.
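Consistent with the conservation property noted in the background section (the mean local depth is always 1/2), we can check:

mean(local_depths(cohesion))   # 0.5, the constant mean local depth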
The strong_threshold
function will calculate the cohesion threshold
for strong ties given in (5), which reflects typical cohesion for local points
(see 1). Computationally, this is equal to half the
average of the diagonal of the cohesion matrix, and is a threshold that
may be used to distinguish between strong and weak ties.
strong_threshold(cohesion)
#> [1] 0.1172222
Here, the threshold is a little above 0.117
.
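As noted above, this value is half the average of the diagonal of the cohesion matrix, which can be verified directly:

mean(diag(cohesion)) / 2
#> [1] 0.1172222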
The function cohesion_strong
will update the cohesion matrix to set
all weak ties to zero (via the strong_threshold
function). Reflecting
(7), by default,
the matrix will also be symmetrized, using the entry-wise (parallel)
minimum of the cohesion matrix and its transpose.
cohesion_strong(cohesion)
#> A B C D E F
#> A 0.25 0.14 0.0000000 0.0000000 0.14 0.0000000
#> B 0.14 0.24 0.0000000 0.0000000 0.00 0.0000000
#> C 0.00 0.00 0.2033333 0.0000000 0.00 0.0000000
#> D 0.00 0.00 0.0000000 0.2333333 0.00 0.1333333
#> E 0.14 0.00 0.0000000 0.0000000 0.24 0.0000000
#> F 0.00 0.00 0.0000000 0.1333333 0.00 0.2400000
#> attr(,"class")
#> [1] "cohesion_matrix" "matrix" "array"
The community_graphs
function takes the cohesion matrix and creates
igraph objects, graphs
that describe the (symmetrized) relationship structure between points.
This function will output a list of three objects:
G
: the weighted (community) graph whose edge weights are mutual
cohesion
G_strong
: the weighted (community) graph consisting of edges for
which mutual (symmetrized) cohesion (i.e. the minimum of the two
directed cohesion values for any given pair) is greater than the
threshold for strong ties
layout
: the graph layout. By default this is provided by the
Fruchterman-Reingold (FR) force-directed graph drawing algorithm for
the graph G
, as implemented in the
igraph package.
graphs <- community_graphs(cohesion)
graphs[["G_strong"]]
#> IGRAPH c4d81a7 UNW- 6 3 --
#> + attr: name (v/c), weight (e/n)
#> + edges from c4d81a7 (vertex names):
#> [1] A--B A--E D--F
Here we see that the graph of strong ties has three connected components: the ties A-B and A-E form the first community cluster, the tie D-F forms another, and C (which has no strong ties) forms a community of size one.
The any_isolated()
function will check whether there are any isolated
points (according to cohesion).
any_isolated(cohesion)
Here, there are no isolated points, i.e. points having zero cohesion with all other points in the data (an extreme form of outlier).
The “community clusters” identified by PaLD are the connected components
of the graph of strong ties, G_strong
. To directly calculate these, we
can use the community_clusters
function. This will output a data frame
with two columns; the first corresponds to the individual (point
), as
identified by the row name of the original input data frame, df
, the
second identifies the community
that the individual belongs to.
community_clusters(cohesion)
#> point community
#> A A 1
#> B B 1
#> C C 2
#> D D 3
#> E E 1
#> F F 3
In this example, three communities are identified with these six points.
Points A
, B
, and E
fall into Community 1. Point C
is in
Community 2 (a community of size 1) and points D
and F
fall into
Community 3.
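These labels coincide (up to a possible relabeling) with the connected components of the strong-tie graph computed earlier:

igraph::components(graphs[["G_strong"]])$membership
# A, B, E in one component; C alone; D and F together (labels may be permuted)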
The final category of functions is for data visualization. We can
begin by visualizing the points in the data frame df
(Figure
3). When
visualizing these points, it is important to have the aspect ratio of
the x and y axes equal to 1 so as to not distort distances. When using
the ggplot2 package
for this visualization, one can use the command
coord_fixed(ratio = 1)
. If using the plot
function included in the
base library, one can use the asp = 1
argument.
library(ggplot2)
ggplot(df, aes(x1, x2)) +
geom_text(label = rownames(df)) +
coord_fixed(ratio = 1) +
xlim(c(4, 16)) +
ylim(c(4, 16))
We can pass the cohesion matrix to the plot_community_graphs
function
to view the relationship between points (Figure
4). The function
will also permit parameters that can be passed to plot.igraph
via the
...
argument.
plot_community_graphs(cohesion,
vertex.label.cex = 2,
vertex.label.dist = 0.9)
Notice in this plot the force-directed layout does not match that of the
original data frame as seen in Figure
3. Since our
original data is two-dimensional, it may be reasonable to use the latter
as the layout for plotting. Figure 5 includes this update as well as others addressing
some of the aesthetics, such as employing more readable labels. The
layout
argument allows the user to pass a matrix to dictate the
2-dimensional layout of the graph. For example, if we wanted the graph
to match the visualization displayed in Figure
3, we could pass
as.matrix(df)
(a matrix of the data frame df
) to the layout
argument (Figure 5). Here, we increase the vertex size and change the
vertex label color, through the argument specifications
vertex.size = 100
and vertex.label.color = "white"
. Additionally, to
allow axes, we use axes = TRUE
, and to put these back on the original
scale we set rescale = FALSE
, resetting the axis limits using xlim
and ylim
. The par(pty = "s")
function forces the subsequent plot to
be square.
par(pty = "s")
plot_community_graphs(cohesion,
layout = as.matrix(df),
vertex.size = 100,
vertex.label.color = "white",
axes = TRUE,
rescale = FALSE,
asp = 1,
xlim = c(4, 16),
ylim = c(4, 16))
We will demonstrate the utility of the pald package through several illustrative examples.
The first example is from a subset of data from Zilliox and Irizarry (2007),
McCall et al. (2011), and McCall et al. (2014), obtained from the
tissuesGeneExpression bioconductor package (Love and Irizarry 2015) consisting of
22,215-dimensional gene expression data from 189 tissue samples. A
(Euclidean) dist
object was created using this data set and is
included in the pald
package in an object called tissue_dist
.
The tissue_dist
object is a dist
object corresponding to a distance
matrix with 189 rows and 189 columns.
We can create the cohesion matrix using the cohesion_matrix
function.
tissue_cohesion <- cohesion_matrix(tissue_dist)
We can display relationships between tissue samples, both locally and
globally through the plot_community_graphs
function (Figure
6). For clarity
of the display, we show how to remove the labels using
show_labels = FALSE
. We will instead color according to the labels by
passing these to the vertex.color
argument for the plot.igraph
function (via the ...
argument). Similarly, we can add a legend using
the legend()
function, as you would for an
igraph visualization.
Additionally, we use the edge_width_factor
and emph_strong
arguments
to adjust the width of the lines between and within PaLD communities.
labels <- rownames(tissue_cohesion)
plot_community_graphs(tissue_cohesion,
                      show_labels = FALSE,
                      vertex.size = 4,
                      vertex.color = as.factor(labels),
                      edge_width_factor = 35,
                      emph_strong = 5)
legend("topleft",
       legend = unique(as.factor(labels)),
       pt.bg = unique(as.factor(labels)),
       col = "black",
       pch = 21)
A summary of community strong ties can be produced via the following
code. Note that tissue_graph_strong
is an
igraph object
corresponding to the graph displayed in Figure
6, restricted
only to strong ties.
tissue_graphs <- community_graphs(tissue_cohesion)
tissue_graph_strong <- tissue_graphs[["G_strong"]]
E <- igraph::get.edgelist(tissue_graph_strong)
tissue_strong_ties <- data.frame(
  strong_ties = apply(E, 1, paste, collapse = ",")
)
tissue_strong_ties |>
  table() |>
  sort(decreasing = TRUE)
#> strong_ties
#> kidney,kidney
#> 273
#> colon,colon
#> 241
#> hippocampus,hippocampus
#> 191
#> cerebellum,cerebellum
#> 186
#> liver,liver
#> 85
#> endometrium,endometrium
#> 81
#> placenta,placenta
#> 4
#> kidney,endometrium
#> 3
Note that there are only three strong ties between different tissue types (kidney and endometrium) in the community cluster network of 1,064 strong ties total.
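The counts quoted above can be verified from the edge list E computed earlier:

nrow(E)                 # total number of strong ties (1,064 here)
sum(E[, 1] != E[, 2])   # strong ties joining two different tissue types (3 here)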
This example explores a data set from Dyen et al. (1992) that summarizes
relationships between 87 Indo-European languages from the perspective of
cognates, coded using 2,655-dimensional binary vectors. A dist
object
was created from this data set and is included in the
pald package in an object
called cognate_dist
.
Here we will demonstrate how one can further apply functions in the
igraph package to
objects output from the
pald package. We can
first use the cohesion_matrix
function to calculate the cohesion
matrix and the community_graphs
function to create a list with the
weighted community graph, the weighted community graph with only strong
ties included, and the layout. From this, we can extract the graph with
only the strong ties, here called cognate_graph_strong
.
cognate_cohesion <- cohesion_matrix(cognate_dist)
cognate_graphs <- community_graphs(cognate_cohesion)

cognate_graph_strong <- cognate_graphs[["G_strong"]]
We can then use the neighbors
function from the
igraph package to
extract the strong neighbors in this graph. For example, we can extract
all neighbors for the language “French”, via the following code.
french_neighbors <- igraph::neighbors(cognate_graph_strong, "French")
french_neighbors
#> + 8/87 vertices, named, from 5c6bb3a:
#> [1] Italian
#> [2] Ladin
#> [3] Provencal
#> [4] Walloon
#> [5] French_Creole_C
#> [6] French_Creole_D
#> [7] Spanish
#> [8] Catalan
Similarly, we can sort and print the associated neighborhood weights by subsetting the cohesion matrix.
"French", french_neighbors] |>
cognate_cohesion[sort(decreasing = TRUE)
#> Walloon
#> 0.03258771
#> Provencal
#> 0.02871174
#> French_Creole_C
#> 0.02406057
#> French_Creole_D
#> 0.02406057
#> Ladin
#> 0.02094596
#> Italian
#> 0.01997696
#> Catalan
#> 0.01859688
#> Spanish
#> 0.01679733
We can again use the plot_community_graphs
function to visualize the
community clusters (Figure 7). One may note the commonly identifiable
language clusters and that, under a slight rotation, some of the
underlying geography is mirrored in the plot.
plot_community_graphs(
  cognate_cohesion,
  edge_width_factor = 30,
  emph_strong = 3,
  vertex.size = 3,
  vertex.label.cex = 0.7,
  vertex.label.dist = 1
)
The pald package includes three randomly generated data frames corresponding to plots from 1:
exdata1
is a data set consisting of 8 points used to create Figure
1 in 1
exdata2
is a data set consisting of 16 points used to create
Figure 2 in 1
exdata3
is a data set consisting of 240 points used to create
Figure 4D in 1
Here, we will demonstrate how to use exdata3
. These points were
generated from bivariate normal distributions with varying means and
variances. There are eight “true” communities.
We can contrast resulting community clusters obtained via PaLD
(i.e. connected components of the network of strong ties) with clusters
as obtained through common cluster analysis techniques. Here we will
consider two common clustering methods. The code below calculates the
cohesion matrix (exdata_cohesion
) as well as the community clusters
obtained via PaLD (exdata_pald
), along with \(k=8\) clusters obtained
via k-means (exdata_kmeans
) and hierarchical clustering using
complete linkage (exdata_hclust
). Recall that there is no need to
determine values for extraneous inputs in the case of PaLD (e.g. the
number of clusters, which must be specified for k-means and
hierarchical clustering).
exdata_cohesion <- exdata3 |>
  dist() |>
  cohesion_matrix()

exdata_pald <- community_clusters(exdata_cohesion)$community

exdata_kmeans <- kmeans(exdata3, 8)$cluster

exdata_hclust <- exdata3 |>
  dist() |>
  hclust() |>
  cutree(k = 8)
The information is displayed in Figure 8.
par(mfrow = c(1, 3), pty = "s")
plot(
  exdata3,
  pch = 16,
  col = pald_colors[exdata_pald],
  xlab = "",
  ylab = "",
  main = "PaLD Communities",
  asp = 1
)
plot(
  exdata3,
  pch = 16,
  col = pald_colors[exdata_kmeans],
  xlab = "",
  ylab = "",
  main = "K-Means Clusters (k = 8)",
  asp = 1
)
plot(
  exdata3,
  pch = 16,
  col = pald_colors[exdata_hclust],
  xlab = "",
  ylab = "",
  main = "Hierarchical Clusters (k = 8)",
  asp = 1
)
Cohesion is particularly useful when considering data with varying local density; see 1 for further examples, discussion, and theoretical results. Note that the PaLD algorithm is able to detect the eight natural groups within the data (along with inter- and intra-community structure not displayed here) without the use of any additional inputs (e.g., the number of clusters) or optimization criteria. Despite the user input of the “correct” number of clusters (i.e., \(k = 8\)), neither k-means nor hierarchical clustering provides the desired result.
In this example we perform a PaLD analysis for cultural distances
obtained in Muthukrishna et al. (2020) from two recent waves of the World
Values Survey (2005 to 2009 and 2010 to 2014; see Inglehart et al. (2014)).
Distances are computed using the cultural fixation index (CFST), which
is a measure built on the framework of fixation indices from population
biology (Cavalli-Sforza et al. 1994; Bell et al. 2009). Recall that the
foundation of PaLD in within-triplet comparisons allows for the
employment of application-dependent and non-Euclidean measures of
dissimilarity. The dist
object is included in the
pald package
(cultures
). We will first create the cohesion matrix using the dist
object cultures
, and proceed to plot the community graph.
cultures_cohesion <- cohesion_matrix(cultures)
plot_community_graphs(
  cultures_cohesion,
  edge_width_factor = 30,
  emph_strong = 3,
  vertex.label.cex = 0.7,
  vertex.size = 3,
  vertex.label.dist = 1
)
In addition to viewing the local and global community structure as seen
in Figure 9, the
pald package allows for a
two-dimensional display of cohesion against distance for the data, via
the dist_cohesion_plot
function, as seen below (Figure
10).
dist_cohesion_plot(cultures, mutual = TRUE)
Notice here that the magnitude of the distances within each of the identified communities varies substantially between regions; in fact, the two most disparate regions in the United States (at distance \(\approx\) 0.027) are far closer than the two most similar regions in India (at distance \(\approx\) 0.043). Despite this, India remains a cohesive whole, and locally disparate regions in the United States, such as East South Central and California, are not strongly cohesive. For discussion of subtleties in local density, see 1. Additionally, currently available techniques require specification of parameters, as seen in the previous section. For further discussion of the cultural distance data in relation to community analysis, see 1.
Given the inherent variation in density over regions, it is of interest
to consider complementary results from density-based methods such as
DBSCAN and its variant HDBSCAN. Employing the
dbscan package (see Hahsler et al. 2019), for DBSCAN with parameters eps=0.34
and
minPts=3
, we have a result with five clusters and six regions being
classified as noise. Therein (excluding noise), the Indian region is
split into two communities. Similarly, for HDBSCAN, with minPts=3
, we
have a result with five clusters and eight noise points. Here again
(excluding the noise points), India has been split into two communities.
The parameter choices used for both methods were determined by
maximizing normalized mutual information (NMI; see Ana and Jain (2003)), when
comparing with the underlying partition of the regions (by regional
information) into Europe, the United States, India, and China, giving values
of 0.848 and 0.819, respectively. For comparison, the corresponding NMI
value for the result in Figure 9 is 0.933. Further discussion is
available in 1. It is important to note the required
parameter choices for both DBSCAN and HDBSCAN and that results can be
somewhat sensitive to changes in these. For instance, for DBSCAN, an
increase in the value of minPts
from \(3\) to \(5\), and a decrease in
eps
value from \(0.34\) to \(0.30\) leads to a large number of noise
points (25), and a decrease in NMI value from \(0.848\) to \(0.685\). For
HDBSCAN an increase in the value of minPts
from \(3\) to \(5\) leads to 15
noise points, and a decrease in NMI value from \(0.819\) to \(0.760\), while
a decrease in the value of minPts
from \(3\) to \(2\) leads to 10 clusters
and a decrease in NMI to \(0.762\).
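For completeness, a sketch of the density-based comparison described above is given below, using the dbscan package with the parameter values reported in the text.

library(dbscan)
db <- dbscan(cultures, eps = 0.34, minPts = 3)   # cluster 0 denotes noise
hdb <- hdbscan(cultures, minPts = 3)
table(db$cluster)
table(hdb$cluster)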
Computation of the cohesion matrix as implemented in the package is of order \(O(n^3)\). The method is highly parallelizable, though, and recent work has resulted in extensive speed-up (see Devarakonda and Ballard 2024). Approximations are also available (see Baron et al. 2021). Note that the method is entirely deterministic, and one does not need to search a parameter space nor select initial values.
This paper introduces the pald package, demonstrating its utility for providing novel parameter-free community analysis which can easily be implemented for a variety of data sets, supplementing results from other methods for clustering, embedding, data depth, nearest neighbors, etc. Example code is provided along with discussion in the context of other commonly used R-based approaches.