clustAnalytics : An R Package for Assessing Stability and Signiﬁcance of Communities in Networks

This paper introduces the R package clustAnalytics , which comprises a set of criteria for 5 assessing the signiﬁcance and stability of communities in networks found by any clustering algorithm. clustAnalytics works with graphs of class igraph from the R-package igraph , extended to handle weighted and/or directed graphs. clustAnalytics provides a set of community scoring functions, and methods to systematically compare their values to those of a suitable null model, which are of used to test for cluster signiﬁcance. It also provides of a non parametric bootstrap method combined 10 with similarity metrics derived from information theory and combinatorics, useful to test for cluster stability, as well as a method to synthetically generate a weighted network with a ground truth community structure based on the preferential attachment model construction, producing networks with communities and scale-free degree distribution.


Introduction
The segmentation of a network into communities such that individuals in the same community share similar features, whilst individuals across communities are dissimilar is a technique known as clustering, and it is a principal task in the analysis of networks.This can be achieved by various algorithms that differ significantly in their understanding of what constitutes a community (or cluster) and in the way to find them.Once a clustering algorithm is selected, the user is faced with the problem of determining how meaningful the clusters obtained are.Here is where clustAnalytics comes into play.
clustAnalytics contains a suite of novel methods to validate the partitions into communities of networks obtained by any given clustering algorithm.In particular, its clustering validation methods focus on two of the most important aspects of cluster assessment: the significance and the stability of the resulting clusters.Clusters produced by a clustering algorithm are considered to be significant if there are strong connections within each cluster, and weaker connections (or fewer edges) between different clusters.On the other hand, stability measures how much a clustering remains unchanged under small perturbations of the network.In the case of weighted networks, these could include the addition and removal of vertices, as well as the perturbation of edge weights.clustAnalytics handles weighted networks (that is, those in which the connections between nodes have an assigned numerical value representing some property of the data), as well as unweighted, and contains several other functionalities for producing different statistics on a network, and measures of similarity of partitions produced by two different clustering algorithms, most notably an enhanced version of the Reduced Mutual Information (RMI) of Newman et al. (2020).In the following sections we introduce some examples of usage and highlight the principal tasks resolved by clustAnalytics.

Background concepts
Community detection on networks, which are represented by graphs, is a very active topic of research with many applications.The igraph (Csardi and Nepusz, 2006) package contains a collection of the most popular algorithms for this task, such as the Louvain (Blondel et al., 2008), walktrap (Pons and Latapy, 2005) or label propagation Raghavan et al. (2007) algorithms.
Evaluating the significance of the community structure of a network is no simple task, because there is not a single definition of what a significant community is.However, there is some agreement in the literature (see the survey by Fortunato (2010)) in that communities should have high internal connectivity (presence of edges connecting nodes in the community) while being well separated from each other.These notion can be quantified and formalized by applying several community scoring functions (also known as quality functions in (Fortunato, 2010)), that gauge either the intra-cluster or inter-cluster density.clustAnalytics implements the most relevant, or representative, community scoring functions following the taxonomy of these quality measures done by Yang and Leskovec (2015) and the further discussion on how to adapt them to weighted networks in (Arratia and Renedo-Mirambell, 2021  However, to evaluate the significance of clusters on a given network, one needs reference values of the scoring functions to determine whether they are actually higher than those of a comparable network with no community structure.For this, we use a method described in (Arratia and Renedo-Mirambell, 2021) that rewires edges (or transfers some of their weight, in the case of weighted networks) while keeping the degree distribution constant.Then, it can be determined that the partition of a network contains significant clusters if it obtains sufficiently better scores than those for a comparable network with uniformly distributed edges.
On the other hand, stability measures how much the partition of a network into communities remains unchanged under small perturbations.In the case of weighted networks, these could include the addition and removal of vertices, as well as the perturbation of edge weights.This is consistent with the idea that meaningful clusters should capture an inherent structure in the data and not be overly sensitive to small or local variations, or the particularities of the clustering algorithm.To measure the variation that such perturbations present in the clusters, there are multiple similarity metrics available.We have selected to include the Variation of Information (Meilȃ, 2007), the Reduced Mutual Information (Newman et al., 2020), and the Rand Index (both in its original and adjusted forms) (Hubert and Arabie, 1985).The first two are based on information theory, while the second one counts agreements and disagreements in the membership of pairs of elements.Then, it is possible to evaluate the network using resampling methods such as nonparametric bootstrap, as described for clustering on Euclidean data by Hennig (2007), and later for networks by Arratia and Renedo-Mirambell (2021), and quantify the deviations from the initial partition with the similarity measures.

The clustAnalytics package
The clustAnalytics package (version 0.5.0)contains 23 functions for assessing clustering significance and stability, and other useful utilites.These are listed in Table 1 grouped by category.It also contains some other 14 auxiliary functions to support package management and provide useful baseline graphs with communities.Check these in the reference manual.In what follows we detail the usage of the main functions.

Cluster significance
The scoring functions are formally defined in (Arratia and Renedo-Mirambell, 2021), and were selected and programmed based on the analysis of appropriate scoring functions for unweighted graphs made in (Yang and Leskovec, 2015).They will take into account the weights of the edges if the graph is weighted.They take as arguments the graph as an igraph object, and a membership vector: a vector of the same length as the graph order for which each element is an integer that indicates the cluster that its corresponding vertex belongs to.
A general call to all the scoring functions is made with scoring_functions(), which computes all The R Journal Vol.XX/YY, AAAA 20ZZ ISSN 2073-4859 the scores and returns a dataframe containing a row for each community (if type = "local") and a column for each score, or alternatively (if type = "global") returns a single row with the weighted average scores.Additionally, an individual function is available for each of the scores, as listed in Table 1.The package includes efficient implementations of the clustering coefficient and transitivity for weighted networks introduced by McAssey and Bijma (2015).
The main functions for significance evaluation are evaluate_significance_r() and evaluate_significance().The first one takes an igraph graph and a list of clustering algorithms, and computes the scoring functions of the resulting communities, both on the original graph and on rewired versions of it for comparison.The second version does the same while skipping the rewired graphs.By default the clustering algorithms used by these functions are Louvain, label propagation and Walktrap, but they can take any list of clustering algorithms for igraph graphs.Both functions allow for comparison against ground-truth in case this is known.
The edge rewiring method (including its versions for weighted networks) is available separately as rewireCpp.This differs from the igraph function rewire, in that it is capable of rewiring weighted as well as directed graphs while keeping the weighted degrees constant.

Rewiring algorithm
The function rewireCpp provided by the package is an implementation of the switching algorithm that rewires edges while keeping the degree distribution constant described in (Milo et al., 2003;Rao et al., 1996) (conceived originally for unweighted graphs).The function has been extended to work with weighted and/or directed graphs.
The directed version works very similarly to the undirected one.In the unweighted case, at each step of the algorithm, two directed edges AC and BD are selected randomly, and replaced with the new edges AD, BC (as in the original algorithm, any steps that would produce self-edges or multi-edges are skipped).For vertices A and B, we add and remove 1 to the out-degree, so it remains constant (as well as the in-degree, since no incoming edges are modified).Analogously, we add and remove one incoming edge to both the C and D vertices, so their in-degrees remain constant as well.
We do the same for the directed weighted case, extending the undirected unweighted algorithm.This time, when edges AC and BD are selected, there is a transfer of a certain amount w of weight from both AC and BD to AD and BC.This means that the only effects on the in and out-degrees are adding and removing w to out-degrees of vertices A and B, and the same to out-degrees of vertices C and D, which means that again they all remain constant.
If the graph is directed, the rewireCpp function automatically detects it and internally runs the implementation for directed graphs, so there is no need to specify direction as a parameter.The following example is a food network (where edges indicate predator-prey relationships) from the igraphdata package: > data(foodwebs, package="igraphdata") > rewired_ChesLower <-rewireCpp(foodwebs$ChesLower, weight_sel = "max_weight") In the weighted case, the rewiring algorithm transfers a certain amount w of weight from some edges to others.The package provides two settings, which we will choose according to what type of weighted graph we are working with: • Complete graphs with a fixed upper bound: These graphs have an edge between every pair of vertices, which will usually be the result of applying some function to each pair.For example, networks resulting from computing correlations of time series (where each series corresponds to a vertex, and the edge weights are the correlations between series) fall into this category.
• More sparse graphs with weights that are non-negative but not necessarily upper bounded: This describes most commonly found weighted graphs, where the weights quantify some characteristic of the edges.Multigraphs also fit here, if we reinterpret them as weighted graphs where the edge weight is the number of parallel edges between each pair of vertices.
Of the first type, we show an example built from correlations of currency exchange time series (from Arratia and Renedo-Mirambell ( 2021)).In this network (g_forex included in the package) vertices are pairs of exchange rates, and the edge weights are the correlations of their corresponding time series, scaled to the interval [0, 1].In this case, the appropriate setting is the one that keeps the variance of the edge weights constant.

Cluster stability
As for the study of cluster stability, the function used to perform the evaluation is boot_alg_list().This performs a bootstrap resampling (i.e.uniform sampling of the vertices with replacement) of the input graph, applies a given list of clustering algorithms, and measures the variation of the communities obtained in the resampled graphs with respect to the original communities.In more detail, for each input graph and a list of clustering algorithms, the set of vertices in the input graph is resampled (R times), the induced graph is obtained by taking the new set of vertices with the induced edges from the original graph (two vertices are joined with an edge on the resampled graph if they were on the original graph), and the clustering algorithms are applied to it.Then, the resulting clusterings (in each of the resampled graphs) are compared to the clustering of the original graph using several metrics: the variation of information (vi.dist from package mcclust), normalized reduced mutual information (NRMI) and both adjusted and regular Rand index (rand.indexfrom package fossil and adjustedRandIndex from package mclust).If return_data is set to TRUE, the output is a list of objects of class boot (from package boot); otherwise, returns a table with the mean distances from the clusters in the original graph to the resampled ones, for each of the algorithms.
The Reduced Mutual Information is provided as a separate function: reduced_mutual_information().This is an implementation of Newman's Reduced Mutual Information (RMI) Newman et al. (2020), a version of the mutual information that is corrected for chance.The exact computation of this metric is not reasonable for even moderately sized graphs, so it must be approximated.We provide two analytical methods for this approximation, while another that combines a Monte Carlo method with the analytical formula is currently under development and will be added in the next version of this package.

Graph generators and other utilities
In the analysis of clustering algorithms it is useful to generate controlled examples of networks with communities.The package igraph provides the function sample_sbm which builds random graphs with communities from the stochastic block model, and hence these networks have binomial degree distribution.
We provide in clustAnalytics the barabasi_albert_blocks() function, which produces scale-free graphs using extended versions of the Barabási-Albert model that include a community structure.This function generates the graph by iteratively adding vertices to an initial graph and joining them to the existing vertices using preferential attachment (existing higher degree vertices are more likely to receive new edges).Additionally, vertices are assigned labels indicating community membership, and the probability of one vertex connecting to another is affected by their community memberships according to a fitness matrix B (if a new vertex belongs to community i, the probability of connecting to a vertex of community j is proportional to B ij ).
The parameters that need to be set are m the number of new edges per step, the vector p of label probabilities, the fitness matrix B (with the same dimensions as the length of p), and t_max the final graph order.The initial graph G0 can be set manually, but if not, an appropriate graph will be generated with m edges per vertex, labels sampled from p, and edge probabilities proportional to B.
The R Journal Vol.XX/YY, AAAA 20ZZ ISSN 2073-4859 There are two variants of the model.If type="Hajek", new edges are connected with preferential attachment to any existing vertex but using the appropriate values of B as weights (see (Hajek and Sankagiri, 2019)).If type="block_first", new edges are connected first to a community with probability proportional to the values of B, and then a vertex is chosen within that community with regular preferential attachment.In this case, the resulting degree distribution is scale-free (see (Renedo-Mirambell and Arratia, 2021) for a proof of this fact).This is a simple example with just two communities and a graph of order 100 and size 400: > B <-matrix(c(1, 0.2, 0.2, 1), ncol=2) > G <-barabasi_albert_blocks(m=4, p=c(0.5, 0.5), B=B, t_max=100, type="Hajek", sample_with_replacement = FALSE) > plot(G, vertex.color=(V(G)$label),vertex.label=NA,vertex.size=10)Finally, it is worth mentioning the apply_subgraphs() function, which is used internally in the package, but has also been made available to the user because it can be very convenient.It simply calls a function f on each of the communities of a graph (treated as it's own igraph object), acting as a wrapper for the vapply function.The communities are given as a membership vector com.For a very simple example, we call it to obtain the order of each of the factions of the karate club graph: > apply_subgraphs(g=karate, com=V(karate)$Faction, f=gorder) [1] 16 18

Evaluating cluster Significance
The function evaluate_significance takes the graph and a list of clustering functions as arguments.
If the graph has a known ground truth community structure (such as the factions in the karate club), we can set gt_clustering as the membership vector to evaluate it and compare it to the results of the clustering algorithms.In our karate graph the ground truth is available with V(karate)$Faction.If a ground truth clustering has been provided, the row VIdist_to_GT indicates the variation of information distance (Meilȃ, 2007) between that and each of the partitions.In this case the label propagation algorithm obtains the partition closest to the ground truth, while the Louvain algorithm is the furthest.
With the function evaluate_significance_r we compute the scoring functions as above, and we compare the results to those of a distribution of randomized graphs obtained with the rewiring method.The parameters of the rewiring method can be selected as shown in Table 1, in this case we specify weight_sel="max_weight", but we could also set an upper bound if appropriate to the graph.The resulting (default) table is shown below.This is a table with three columns per algorithm: the original one, the mean of the corresponding rewired scores and it's percentile rank within the distribution of rewired scores.If parameter table_style = "string", the function instead returns a table with a column per algorithm where each element is of the format "original|rewired(percentile)".

Applying scoring functions
If it is the case that we already have some explicit community partition, but not the algorithm that produced it, we can assess its significance by applying the scoring functions directly to the network and the partition.Alternatively, apply the scoring functions individually.Each is called with the graph and the membership vector as arguments, and return a vector with the scores for each community: The R Journal Vol.XX/YY, AAAA 20ZZ ISSN 2073-4859 > cut_ratio(karate, V(karate)$Faction) [1] 0.07638889 0.07638889 > conductance(karate, V(karate)$Faction) [1] 0.10000000 0.09090909 A case in point are the clustering coefficient and transitivity.As they can be applied to weighted graphs in general and not only to their partition into communities, they are simply called with the graph as the only argument: > weighted_clustering_coefficient(karate) [1] 0.8127164 To be able to obtain the result for every community in the graph, we provide the function apply_subgraphs; which given a graph, a membership vector and a scalar function, applies the function to every community and returns the vector of results.In this case it works as follows: > apply_subgraphs(karate, V(karate)$Faction, weighted_clustering_coefficient) [1] 0.9514233 0.7783815

Evaluating cluster Stability
Here we perform a nonparametric bootstrap to the karate club graph and the same selection of algorithms.For each instance, the set of vertices is resampled, the induced graph is obtained by taking the new set of vertices with the induced edges from the original graph, and the clustering algorithms are applied.Then, these results are compared to the induced original clusterings using the metrics mentioned above: the variation of information (VI), the normalized reduced mutual information (NRMI), and both adjusted and regular Rand index (Rand and adRand).Note that in this table the variation of information is a distance, so lower values indicate similar partitions, while for the NRMI, Rand, and adRand, higher values mean the partitions are more similar (1 means they are the same partition).The exact definitions of these measures can all be found in (Arratia and Renedo-Mirambell, 2021).

Clustering assessment on synthetic ground truth networks
We can evaluate the significance and stability of clusters produced by a set of clustering algorithms on a network with known community synthetically created with the stochastic block model (with function sample_sbm) or the preferential attachment model (with barabasi_albert_blocks).The former produces a network with binomial degree distribution, and the latter produces networks with scale-free degree distribution.
Let us generate a graph from a stochastic block model in which we set very strong clusters: the elements in the diagonal of the matrix are much larger than the rest, so the probability of intra-cluster edges is much higher than that of inter-cluster edges.
> pm <-matrix (c(.3, .001We can clearly see that for all metrics, the results are much more stable, which makes sense because we created the sbm graph with very strong clusters.

clustAnalytics in context of related R packages
The outstanding quality of clustAnalytics is that it is a set of robust and efficient measures for assessing significance and stability of clustering algorithms on graphs with the convenience of working with igraph objects, which makes it a valuable complement to the igraph package (Csardi and Nepusz, 2006).A revision of the CRAN Task View: Cluster Analysis & Finite Mixture Models shows that there are very few packages devoted to assessing quality of clusters in general, and none for igraph graphs as input.One could use in a limited manner some of the existing packages by converting igraph graphs to their adjacency matrices, but then quality evaluation follows different paradigms not quite pertaining to networks.For instance, the package ClustAssess (Shahsavari et al., 2022) conceived for evaluating robustness of clustering of single-cell RNA sequences data using proportion of ambiguously clustered pairs, as well as similarity across methods and method stability using element-centric clustering comparison; sigclust (Huang et al., 2014) which provides a single function to assess the statistical significance of splitting a data set into two clusters; clValid (Brock et al., 2021) implements Dunn Index, Silhouette, Connectivity, Stability, BHI and BSI, for a statistical and biological-based validation of clustering results.None of these apply directly to igraph objects, and were not conceived for the analysis of clustering in social networks.The R Journal Vol.XX/YY, AAAA 20ZZ ISSN 2073-4859

Figure 1 :
Figure 1: Example of the barabasi_albert_communities function with the community labels as vertex colors.

Figure 2 :
Figure 2: Karate club graph before and after the edge randomization process.Colors represent the faction of each participant, the ground truth clustering in this network.

Table 1 :
clustAnalytics list of functions split by category.
To apply all scoring functions at once use scoring_functions with either type local or global: We now assess for stability of the clustering algorithms on this sbm graph.