Movatterモバイル変換

ideanet provides a set of tools for community detectionusing iterative maps on the CHAMP set. These tools help identifycommunity partitions across different resolution parameters, focusing ontheir domain of optimality and self-consistency within a specific blockmodel. This pipeline includes three core functions for finding andprocessing communities in undirected networks, described in thisvignette:

That is, CHAMP takes in a set of community structure partitions,where each partition is a different assignment of nodes intocommunities, and post-processes this set of partitions to identify whichpartition is optimal (relative to the input set of partitions, in themodularity sense) at each value of the resolution parameter$\gamma$. By doing so, CHAMP helps users besure they select community structures that are somewhere optimal in thespace of$\gamma$ values, andhighlights partitions that have wide domains of optimality. Theiterative map is then run on this restricted “CHAMP set” ofsomewhere-optimal partitions, highlighting an even smaller number ofpartitions that are fixed points of the iterative map (that is, they areself-consistent in the sense of the equivalence found by Newman,2016).

Note that in most cases the first step,get_partitions,which is simply a wrapper to multiple calls ofigraph::cluster_leiden and tocomm_detect,will be the most computationally intensive step of this pipeline.Neither CHAMP nor the iterative map find community structuresthemselves; rather, they are tools to highlight a subset of thepartitions input to CHAMP. Generally, the more completely one identifiesthe possible input set of partitions, the better the results will befrom CHAMP and the iterative map. It is important to emphasize thatCHAMP and the iterative map are completely deterministic after the inputset of partitions has been defined byget_partitions; thatis, all of the pseudo-stochasticity from the use of community detectionheuristics in this pipeline is in obtaining the input set of partitionsinget_partitions. The quality of the results from CHAMPand the iterative map are inherently limited by the quality of the inputset of partitions, and the optimality of one partition over anotheridentified by CHAMP in the second step of this pipeline is only relativeto the set input to CHAMP. Note also that one could add other partitionsdefined by hand or obtained by other algorithms, as long as they areformatted consistent with the list ofpartitions output byget_partitions. Finally, we stress that all of thecalculations in CHAMP and the iterative map are based on modularity asextended with a resolution parameter, while other objective functionsmay be preferable in different settings.

Keep in mind that, at the end of the day, this pipeline — combiningmodularity optimization with a resolution parameter, the algorithmicheuristics for this optimization, the CHAMP post-processing, and theequivalence with inference on the degree-corrected planted partitionmodel — is just one framework for community detection, identifyingappropriate resolution parameters for performing modularity-basedunsupervised clustering on network data. In many settings, thecommunities with large domains of optimality identified byCHAMP and the partitions that are fixed points found byget_CHAMP_map will have better overall properties than someother community labels. However, like unsupervised clustering of anyother data, there can be multiple good ways to cluster data and there isno single answer that is best in all settings (see, for example, Peel etal., 2017).

These functions, as part of the broader IDEANet project, aresupported by the National Science Foundation as part of the HumanNetworks and Data Science - Infrastructure program (BCS-2024271 andBCS-2140024).

Note that our current R implementation of this pipeline does nothandle multilayer networks. The Python implementation athttps://github.com/ragibson/ModularityPruning combinesthe multilayer capabilities of CHAMP with the generalization of Newman’sequivalence to multilayer networks derived by Pamfil et al. (2019). An Rimplementation of this multilayer framework is in development. We alsoassume that networks are undirected, though possibly weighted (but seethe note at the end of Example 2 below).

In addition to this vignette, we highly recommend you look attheCHAMP_karate vignette, whichdemonstrates this pipeline on the karate club network, and theCHAMP_football vignette, which goes intoan even deeper dive on the network of Division I-A college footballgames from the Fall 2000 season, demonstrating how this pipeline helpsuncover the conference structure.

Example: Families in Renaissance-era Florence

We demonstrate the 3 main steps of the pipeline with the florentinenetworkigraph object that combines both types of relationsbetween families in Florence, as built in thenetwritevignette.

nw_flor<-netwrite(nodelist = florentine_nodes,node_id ="id",i_elements = florentine_edges$source,j_elements = florentine_edges$target,type = florentine_edges$type,directed =FALSE,net_name ="florentine")

In the interest of simplicity, we will not visualize sociogramsdirectly in this vignette. However, if you are interested in looking atthe general structure of the networks used here, we provide code fordoing so and encourage you to paste it in your R console:

igraph::plot.igraph(nw_flor$florentine)

Step 1: Collect Partitions

The whole idea of CHAMP is that it post-processes community structurepartitions that you’ve already taken the time to compute. It is not acommunity finding algorithm itself; rather, it enforces the modularity(with resolution parameter) objective function back onto the (possiblyvery many) partitions that you have already computed. So to use this wefirst need to have a set of partitions, which we will obtain here withtheget_partitions function. Under the hood, it runscluster_leiden fromigraph at various$\gamma$ resolution parameter values, alongwith the routines incomm_detect, filters out the duplicatepartitions, and then formats the results appropriately in thepartitions list for what comes next. Because Leiden and theother routines incomm_detect are pseudo-stochasticheuristics, we set the random seed as part of the function call toensure reproducible results.

Arguments forget_partitions():

network: igraph object to analyze
gamma_range: range of$\gamma$ values to use as input forcluster_leiden. Default is a range of 0 to 3.
n_runs: number of times that the Leiden algorithm willbe executed at different$\gamma$values. Defaults to 100.
n_iterations: the number of optimization runs passed tocluster_leiden. Defaults to 2.
seed: set a seed for reproducible partitions.
add_comm_detect: Boolean to decide whether to also callthe clustering algorithms included in (default =TRUE).Alternatively, the output of can be provided directly here.

flor_partitions<-get_partitions(nw_flor$florentine,n_runs =500,seed =4781)#> Warning in comm_detect(network): Calling cluster_edge_betweenness with reciprocal weights, which may affect selected membership vector incorrectly.#> Warning in igraph::cluster_edge_betweenness(g_undir, weights =#> igraph::E(g_undir)$r_weight, : At#> vendor/cigraph/src/community/edge_betweenness.c:504 : Membership vector will be#> selected based on the highest modularity score.#>#> [1] "21 unique partitions generated between cluster_leiden() and comm_detect()"

Note that settingn_runs = 500 is an insanely overkillnumber of times to call Leiden for this small network, and indeed we seethe resulting report that we only obtain 21 unique partitions. But thiscall also probably only took a couple seconds at most (depending on yourhardware), and since the quality of the deterministic calculations thatfollow in the next two steps relies on finding good partitions inget_partitions, it is generally good practice to be willingto spend a few compute cycles in this step.

Thepartitions list returned byget_partitions (assigned toflor_partitionshere) includes the counts of the numbers of times that each uniquepartition was obtained by these pseudo-heuristic algorithms.

unlist(flor_partitions$count)#>  [1]  46 100   1   2  11   1   1  90   1   1 114  34   5   1  24   1  13  32   8#> [20]  19   7sum(unlist(flor_partitions$count))#> [1] 512

You can see in the above that some partitions show up many times —like the 2nd, 8th and 11th in the list — while a few were only foundonce. You can also see that there are a total of 512 results(n_runs = 500 calls tocluster_leiden atdifferent resolution parameters plus the 12 clustering results obtainedincomm_detect) that led to these 21 unique partitions. Youcan directly query different unique partitions inside the list asfollows.

flor_partitions$partitions[[2]]#> IGRAPH clustering leiden, groups: 3, mod: NA#> + groups:#>   $`1`#>   [1] "0"  "1"  "2"  "5"  "8"  "9"  "12" "13" "15"#>#>   $`2`#>   [1] "3"  "4"  "6"  "7"  "10" "14"#>#>   $`3`#>   [1] "11"#>flor_partitions$partitions[[10]]#> IGRAPH clustering leiden_cpm_membership, groups: 5, mod: NA#> + groups:#>   $`1`#>   [1] "0"#>#>   $`2`#>   [1] "1" "5"#>#>   $`3`#>   [1] "2"  "3"  "4"  "6"  "7"  "10" "14"#>#>   $`4`#>   + ... omitted several groups/vertices

You can see that the information about each of these partitionsincludes a label about which clustering routine identified thepartition. In particular, you can see that the 10th partition is labeled“leiden_cpm_membership”, indicating which routine called bycomm_detect found this, while the 2nd partition wasobtained from one of thecluster_leiden calls. Of course,as we saw above, the 2nd partition was obtained many times, and we donot distinguish here whether any of those copies of this partition werefound by one of the routines incomm_detect; that is, the“leiden” label here merely means that at least one of the copies of thispartition was found bycluster_leiden. In contrast, the“leiden_cpm_membership” label on the 10th partition means it was notfound by any of our manycluster_leiden calls at differentresolution parameter values.

You might also reasonably wonder about what resolution parametervalues were involved in finding some of these partitions. The range ofthesegamma values is included in the list.

print(flor_partitions$gamma_min)#> [1] 0print(flor_partitions$gamma_max)#> [1] 3

However, importantly, the CHAMP framework does not actually depend onthe resolution parameter value used in the algorithm that found apartition, so we do not store this information. Instead, in the nextstep CHAMP will take all of the unique partitions that we have found andtell us which of these partitions is best at each resolutionparameter.

Step 2: CHAMP

Next, we run theCHAMP algorithm on the resultingpartitions, and plot results to visualize the domains of optimality inthe resolution parameter space for each somewhere-optimal partition.

Arguments forCHAMP():

network: igraph object to analyze.
partitions: partitions object generated fromget_partitions.
plottitle: optionally, a title for the output plot.Null by default.

flor_partitions<-CHAMP(nw_flor$florentine, flor_partitions,plottitle ="Florentine (combined)")#> [1] "6 partitions in the CHAMP set (i.e., on the upper envelope of Q v. gamma)"

In the above plot generated byCHAMP, each of the 21unique partitions returned byget_partitions is representedby a line giving its values of$Q(\gamma)$, the modularity of thepartition as a function of resolution parameter$\gamma$.CHAMP thenidentifies which of these lines ever appear on the upper envelope of$Q(\gamma)$, that is, which of thecorresponding partitions are somewhere optimal in the space of$\gamma$ values. Notably,CHAMP only finds 6 somewhere-optimal partitions here. Theplot annotates the line segments with numbers indexing which of theinput partitions are optimal (relative to the input set) along that linesegment, as well as the$\gamma$values where lines intersect on the upper envelope (that is, where whichpartition is optimal changes). As a technical sidebar, we note that ourCHAMP implementation here only considers partitions thatare optimal with$Q(\gamma)>0$,cutting off the line segment for partition 11 at the point where itcrosses$Q=0$ and ignoring higherresolution parameter values.

CHAMP updates thepartitions list with thegeneratedCHAMPfigure plot shown above and aCHAMPsummary that provides essential information about the6 somewhere-optimal partitions found here.

print(flor_partitions$CHAMPsummary)

starting_gamma	ending_gamma	partition_num	num_communities	next_gamma	next_partition_num	next_num_communities	segment_length	gamma_range
0.0000000	0.3431373	1	2	NA	NA	NA	0.4852694	0.3431373
0.3431373	0.9536785	2	3	NA	NA	NA	0.6827173	0.6105412
0.9536785	0.9722222	5	4	NA	NA	NA	0.0196505	0.0185437
0.9722222	1.6535433	8	5	NA	NA	NA	0.7097419	0.6813211
1.6535433	1.7500000	12	6	NA	NA	NA	0.0991958	0.0964567
1.7500000	2.2992701	11	6	NA	NA	NA	0.5628423	0.5492701

In particular, theflor_partitions$CHAMPsummary$partition_num column and theassociated numerical annotations on the figure correspond to the list ofpartitions inflor_partitions$partitions. For example, thepartition that straddles the default$\gamma=1$ resolution parameter here, whichalso happens to have a largegamma_range andsegment_length (the length of the corresponding linesegment on the upper envelope of Q v.$\gamma$ in the figure) is the 5-communitypartition withpartition_num 8 here. Having thushighlighted this partition, we can directly examine it more closely.

flor_partitions$partitions[[8]]#> IGRAPH clustering leiden, groups: 5, mod: NA#> + groups:#>   $`1`#>   [1] "0"  "8"  "9"  "12" "13" "15"#>#>   $`2`#>   [1] "1" "5"#>#>   $`3`#>   [1] "2"  "4"  "10" "14"#>#>   $`4`#>   + ... omitted several groups/verticesigraph::membership(flor_partitions$partitions[[8]])#>  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15#>  1  2  3  4  3  2  4  4  1  1  3  5  1  1  3  1

igraph::plot.igraph(nw_flor$florentine,mark.groups = flor_partitions$partitions[[8]])

RunningCHAMP selects out only the partitions that areoptimal at some$\gamma$ and, in sodoing, allows the user to see which partitions are optimal over widerranges of the resolution parameter. In particular, we expect that theremight be multiple partitions of interest, and thatCHAMPwill help the user focus their attention down onto a smaller number ofpossible candidate partitions that have large ranges of optimality in$\gamma$.

We emphasize thatCHAMP is deterministic given an inputset of partitions, identifying which of those partitions is optimal atsome$\gamma$; but the set ofpartitions obtained in theget_partitions step depends onthe random seed, the number of runs, and the parameters passed tocluster_leiden. In practice it loosely appears that thepartitions thatCHAMP finds to be optimal over wider rangesof the resolution parameter are more easily obtained by fewer runs inget_partitions; but it may be interesting in future work toexplore some of these computational details including how many runs inget_partitions it takes to obtain relatively robustCHAMP results.

Step 3: Compute the SBM-equivalence iterative map on the CHAMPset

In the current example, with only 6 partitions remaining in “theCHAMP set”, and only some of those with larger ranges of optimality, auser might reasonably stop here with these results and proceed to studythe resulting communities in greater detail. But in many applications,there may still be many somewhere-optimal partitions in the CHAMP setwith comparable gamma range and/or segment length, and the user may belooking for a way to further focus their attention onto a smaller numberof partitions.

Theget_CHAMP_map function uses the equivalence ofmodularity optimization with inference on a degree-corrected plantedpartition model, as discovered by Newman (2016), to define thecorresponding iterative map restricted to the partitions in the CHAMPset, as defined by Gibson & Mucha (2022). In this iterative map,each partition in the set points to an “estimated$\gamma$” value consistent with theequivalence, and thereby points to whichever partition is optimal atthat estimated$\gamma$ value. Afixed point of the map, that is, a partition that points to itself (itpoints to an estimated$\gamma$ inits own domain of optimality), is thus self consistent in the sense ofthis equivalence, and becomes a natural partition for furtherattention.

Arguments forget_CHAMP_map():

network: igraph object to analyze.
partitions: partitions object generated fromget_partitions.
plottitle: optionally, a title for the output plot.Null by default.

flor_partitions<-get_CHAMP_map(nw_flor$florentine, flor_partitions,plotlabel ="Florentine (combined)")#> [1] "Partition # 2 (with 3 communities) is a fixed point of the iterative map"#> [1] "Partition # 8 (with 5 communities) is a fixed point of the iterative map"

The above figure re-visualizes each of the 6 partitions in the CHAMPset (that is, the somewhere-optimal partitions) as a line segmentindicating the number of communities in the partition (on the verticalaxis) versus its domain of optimality in$\gamma$ (on the horizontal axis). Thearrows in the figure visualize the iterative map on the CHAMP set,directed from the middle of each line segment to the “estimated$\gamma$” value associated with thatpartition and, thus, the partition that is optimal at that$\gamma$. These values are tabulated in thenow updatedflor_partitions$CHAMPsummary.

print(flor_partitions$CHAMPsummary)

starting_gamma	ending_gamma	partition_num	num_communities	next_gamma	next_partition_num	next_num_communities	segment_length	gamma_range
0.0000000	0.3431373	1	2	NA	NA	NA	0.4852694	0.3431373
0.3431373	0.9536785	2	3	0.8340116	2	3	0.6827173	0.6105412
0.9536785	0.9722222	5	4	1.0539077	8	5	0.0196505	0.0185437
0.9722222	1.6535433	8	5	1.1534626	8	5	0.7097419	0.6813211
1.6535433	1.7500000	12	6	1.2535777	8	5	0.0991958	0.0964567
1.7500000	2.2992701	11	6	1.2858669	8	5	0.5628423	0.5492701

Looking at the above figure and table, at the bottom of the table wesee that the 6-community partition withpartition_num 11maps to the estimated$\gamma$ valuelisted undernext_gamma, where the 5-community partition isoptimal (that is, the bottom row of the table listsnext_partition_num 8 andnext_num_communities5). Meanwhile, note that thenext_gamma value listed forpartition_num 5 is between thestarting_gammaandending_gamma values forpartition_num 8,whereas thenext_gamma andnext_partition_numforpartition_num 8 maps to itself. That is,partition_num 8 is a fixed point of the iterative map onthe CHAMP set. We see thatpartition_num 2, with 3communities, is also a fixed point, mapping to itself. Note that the“fixed point” message output fromget_CHAMP_map identifiesbothpartition_num 2 andpartition_num 8 asthe two partitions that each point to themselves.

As an aside, note the NA values in the row corresponding to the2-community partition. This partition is unlike the others here in thatit doesn’t map to anywhere, because this 2-community partition is atrivial solution in that it simply identifies the 2 components of thegraph (disconnected from one another) without any further communitystructure within those components, and in the absence of any furthercommunity structure it has no associated resolution. This trivialpartition where all nodes in each connected component are together in acommunity cannot be mapped to any other partition (hence the NA values).However, it is theoretically possible in the pipeline for otherpartitions to map to$\gamma$ valueswhere this trivial partition is optimal (but it does not happen in thepresent examples).

As the only fixed points of the map on the CHAMP set,partition_num 2 andpartition_num 8 are ofspecial interest to us. The fact thatpartition_num 8happens to be the same as the optimal partition at the defaultresolution parameter$\gamma=1$ is aspecial feature of this simple example network — in other examples, thefixed points of the CHAMP map may occur at other resolution parametervalues.

Again, there may be other good partitions in different senses thatare not highlighted by these procedures, but we expect there are manycases where this framework — gathering partitions, runningCHAMP, and picking out the fixed point of the iterative mapon the CHAMP set — easily and efficiently highlights partitions ofinterest.

Movatterモバイル変換

Selecting Community Detection ResolutionParameters with CHAMP Maps

Example: Families in Renaissance-era Florence

Step 1: Collect Partitions

Step 2: CHAMP

Step 3: Compute the SBM-equivalence iterative map on the CHAMPset

References