4.1 Hierarchical clustering
Boris Leroy, Maxime Lenormand and Pierre Denelle
2025-07-05
Source: vignettes/a4_1_hierarchical_clustering.Rmd
Hierarchical clustering consists in creating a hierarchical tree from a matrix of distances (or beta-diversities). From this hierarchical tree, clusters can be obtained by cutting the tree.
Although these methods are conceptually simple, their implementation can be complex and requires important user decisions. Here, we provide a step-by-step guide to performing hierarchical clustering analyses with bioregion, along with comments on the philosophy behind how we designed the functions.
Hierarchical clustering takes place on the right-hand side of the bioregion conceptual diagram:

1. Compute dissimilarity indices from input data
To initiate the hierarchical clustering procedure, you need to provide pairwise distances between sites. These pairwise distances can be obtained by running dissimilarity() on a species-site matrix, such as a presence-absence or an abundance matrix.
In the example below, we use the vegetation dataset from the package to compute distance metrics.
library("bioregion")# Work with the vegetation dataset we include in the packagedata(vegemat)# This is an abundance matrix where sites are in rows and species in columnsvegemat[1:10,1:10]
##       Species
## Site   10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
##   35       0     0     0     0     0     0     0     0     0     0
##   36       2     0     0     0     0     0     1    12     0     0
##   37       0     0     0     0     0     0     0     0     0     0
##   38       0     0     0     0     0     0     0     0     0     0
##   39       5     0     0     0     0     0     0     2     0     0
##   84       0     0     0     0     0     0     0     0     0     0
##   85       3     0     0     0     0     0     1     7     0     0
##   86       0     0     0     2     0     0     2    22     0     0
##   87      16     0     0     0     0     0     2    54     0     0
##   88     228     0     0     0     0     0     0     5     0     0
We are going to compute the \(\beta_{sim}\) diversity metric, which is a presence-absence dissimilarity index. The formula is as follows: \(\beta_{sim} = \min(b, c) / (a + \min(b, c))\)
where a is the number of species shared by both sites, b is the number of species occurring only in the first site, and c is the number of species occurring only in the second site.
We typically choose this metric for bioregionalization, because it is the turnover component of the Sorensen index (Baselga, 2012) (in a nutshell, it tells us how sites are different because they have distinct species), and because it is less dependent on species richness than the Jaccard turnover (Leprieur & Oikonomou, 2014). Alternatively, given that we have abundance data here, we could also use the Bray-Curtis turnover index (Baselga, 2013). The choice of the distance metric is very important for the outcome of the clustering procedure, so we recommend that you choose carefully depending on your research question.
dissim <- dissimilarity(vegemat)

head(dissim)
## Data.frame of dissimilarity between sites
##  - Total number of sites: 715
##  - Total number of species: 3697
##  - Number of rows: 255255
##  - Number of dissimilarity metrics: 1
## 
## 
##   Site1 Site2    Simpson
## 2    35    36 0.02325581
## 3    35    37 0.03100775
## 4    35    38 0.05426357
## 5    35    39 0.05426357
## 6    35    84 0.72093023
## 7    35    85 0.08527132
By default, only the Simpson index is computed, but other options are available in the metric argument of dissimilarity(). Furthermore, users can also write their own formula to compute any index they want with the argument formula; see ?dissimilarity().
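As a quick sketch (the exact metric names and the a, b, c quantities accepted by formula are listed in ?dissimilarity, so treat the names used below as assumptions to check against the help page):

# Compute several metrics at once (assuming "Simpson" and "Sorensen" are available metric names)
dissim_multi <- dissimilarity(vegemat, metric = c("Simpson", "Sorensen"))

# Provide a custom formula based on the a, b and c components
# (here, beta_sim written by hand; purely illustrative)
dissim_custom <- dissimilarity(vegemat, formula = "pmin(b, c) / (a + pmin(b, c))")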
We are now ready to start the hierarchical clustering procedure with the object dissim we have just created. Alternatively, you can also use other types of objects in hclu_hierarclust(), such as a distance matrix object (class dist) or a data.frame of your own crafting (make sure to read the required format carefully in ?hclu_hierarclust).
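For instance, a minimal sketch with a dist object (here a Euclidean distance computed on a subset of the abundance matrix, purely to illustrate the accepted input class, not a recommended metric):

dist_example <- dist(vegemat[1:50, ])
tree_from_dist <- hclu_hierarclust(dist_example)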
2. Hierarchical clustering with basic parameters
Hierarchical clustering, and the associated hierarchical tree, can be constructed in two ways:
- agglomerative, where all sites are initially assigned to their own bioregion and are progressively grouped together
- divisive, where all sites initially belong to the same unique bioregion and are then progressively divided into different bioregions
Subsections 2.1 to 2.3 detail the functioning of agglomerative hierarchical clustering, while subsection 2.4 illustrates the divisive method.
2.1 Basic usage
The basic use of the function is as follows:
tree1 <- hclu_hierarclust(dissim)
## Building the iterative hierarchical consensus tree... Note that this process can take time especially if you have a lot of sites.
## 
## Final tree has a 0.6863 cophenetic correlation coefficient with the initial dissimilarity matrix
The function gives us some information as it proceeds. Notably, it talks about a randomization of the dissimilarity matrix - this is a very important feature because hierarchical clustering is strongly influenced by the order of the sites in the distance matrix. Therefore, by default, the function randomizes the order of sites in the distance matrix over 100 trials (more information in the randomization section). It also tells us that the final tree has a cophenetic correlation coefficient of 0.69 with the initial dissimilarity matrix.
We can type the name of the object in the console to see more information:
tree1
## Clustering results for algorithm : hclu_hierarclust
##  (hierarchical clustering based on a dissimilarity matrix)
##  - Number of sites: 715
##  - Name of dissimilarity metric: Simpson
##  - Tree construction method: average
##  - Randomization of the dissimilarity matrix: yes, number of trials 100
##  - Method to compute the final tree: Iterative consensus hierarchical tree
##  - Cophenetic correlation coefficient: 0.686
## Clustering procedure incomplete - no clusters yet
The last line tells us that the clustering procedure is incomplete: the tree has been built, but it has not yet been cut - so there are no clusters in the object yet.
To cut the tree, we can use the cut_tree() function:
# Ask for 3 clusters
tree1 <- cut_tree(tree1, n_clust = 3)
## Determining the cut height to reach 3 groups...
## --> 0.5625
Here, we asked for 3 clusters, and the algorithm automatically finds the height at which 3 clusters are found (h = 0.562).
tree1
## Clustering results for algorithm : hclu_hierarclust
##  (hierarchical clustering based on a dissimilarity matrix)
##  - Number of sites: 715
##  - Name of dissimilarity metric: Simpson
##  - Tree construction method: average
##  - Randomization of the dissimilarity matrix: yes, number of trials 100
##  - Method to compute the final tree: Iterative consensus hierarchical tree
##  - Cophenetic correlation coefficient: 0.686
##  - Number of clusters requested by the user: 3
## Clustering results:
##  - Number of partitions: 1
##  - Number of clusters: 3
##  - Height of cut of the hierarchical tree: 0.562
When we type the name of the object in the console again, it gives us the results of the clustering: we have
- 1 partition: a partition is a clustering result. We only cut the tree once, so we only have 1 partition at the moment
- 3 clusters: this is the number of clusters in the partition. We asked for 3, and we obtained 3, which is good. Sometimes, however, we cannot get the number of clusters we asked for - in which case the outcome will be indicated.
- a height of cut at 0.562: this is the height of cut at which we obtain 3 clusters in our tree.
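As a quick check of the partition, you can count how many sites fall into each cluster (a small sketch, assuming the partition column is named K_3 as in the object above):

table(tree1$clusters$K_3)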
We can make a quick plot of our partitioned tree with
# We reduced the size of text labels with cex = .2, because there are too many sites
plot(tree1, cex = .2)
Let’s see how it looks on a map:
## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
# vegesf is the spatial dataset included in the package
data("vegesf")
library("sf")
map_bioregions(tree1$clusters[, c("ID", "K_3")], vegesf)
Now, this is a hierarchical tree, and cutting it only once (= only 1 partition) oversimplifies the result of the tree. Why not cut it multiple times? For example, we could make deep, intermediate, and shallow cuts to the tree, as in Ficetola et al. (2017), which would allow us to see broad- to fine-scale relationships among sites in our tree.
We can specify, e.g., 2, 3 and 12 clusters:
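The corresponding call (written here as a sketch matching the messages below) passes a vector of cluster numbers to cut_tree():

tree1 <- cut_tree(tree1, n_clust = c(2, 3, 12))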
## Determining the cut height to reach 2 groups...
## --> 0.625
## Determining the cut height to reach 3 groups...
## --> 0.5625
## Determining the cut height to reach 12 groups...
## --> 0.4453125
plot(tree1, cex=.2)
However, it may be more useful to choose the heights of cut, rather than the number of clusters. We could, for example, cut the tree at heights 0.4 (shallow cut), 0.5 (intermediate cut) and 0.6 (deep cut):
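A call of this form (a sketch consistent with the object printed further below) does that and replots the tree:

tree1 <- cut_tree(tree1, cut_height = c(0.4, 0.5, 0.6))
plot(tree1, cex = .2)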
The plot is not easy to read because of the large number of sites. We can instead extract the information directly from the object:
tree1
## Clustering results for algorithm : hclu_hierarclust
##  (hierarchical clustering based on a dissimilarity matrix)
##  - Number of sites: 715
##  - Name of dissimilarity metric: Simpson
##  - Tree construction method: average
##  - Randomization of the dissimilarity matrix: yes, number of trials 100
##  - Method to compute the final tree: Iterative consensus hierarchical tree
##  - Cophenetic correlation coefficient: 0.686
##  - Heights of cut requested by the user: 0.4 0.5 0.6
## Clustering results:
##  - Number of partitions: 3
##  - Partitions are not hierarchical
##  - Number of clusters: 2 6 23
##  - Height of cut of the hierarchical tree: 0.6 0.5 0.4
From the result, we can read that for the deep cut partition (h = 0.6) we have 2 clusters, for the intermediate cut partition (h = 0.5) we have 6 clusters and for the shallow cut partition (h = 0.4) we have 23 clusters.
Here is how the maps look:
for (i in 2:ncol(tree1$clusters)) {
  map_bioregions(tree1$clusters[, c(1, i)], vegesf)
}
In the next section we will see what the default settings are and why we chose them, and then we will see how to find optimal numbers of clusters.
2.2 Exploring the outputs
To explore the object, you can use str() to see the object structure:
str(tree1)
## List of 6
##  $ name        : chr "hclu_hierarclust"
##  $ args        :List of 14
##   ..$ index              : chr "Simpson"
##   ..$ method             : chr "average"
##   ..$ randomize          : logi TRUE
##   ..$ n_runs             : num 100
##   ..$ optimal_tree_method: chr "iterative_consensus_tree"
##   ..$ keep_trials        : logi FALSE
##   ..$ n_clust            : NULL
##   ..$ cut_height         : num [1:3] 0.4 0.5 0.6
##   ..$ find_h             : logi TRUE
##   ..$ h_max              : num 1
##   ..$ h_min              : num 0
##   ..$ consensus_p        : num 0.5
##   ..$ verbose            : logi TRUE
##   ..$ dynamic_tree_cut   : logi FALSE
##  $ inputs      :List of 7
##   ..$ bipartite      : logi FALSE
##   ..$ weight         : logi TRUE
##   ..$ pairwise       : logi TRUE
##   ..$ pairwise_metric: chr "Simpson"
##   ..$ dissimilarity  : logi TRUE
##   ..$ nb_sites       : int 715
##   ..$ hierarchical   : logi FALSE
##  $ algorithm   :List of 6
##   ..$ final.tree         :List of 5
##   .. ..- attr(*, "class")= chr "hclust"
##   ..$ final.tree.coph.cor: num 0.686
##   ..$ final.tree.msd     : num 0.023
##   ..$ trials             : chr "Trials not stored in output"
##   ..$ output_n_clust     : Named int [1:3] 2 6 23
##   .. ..- attr(*, "names")= chr [1:3] "h_0.6" "h_0.5" "h_0.4"
##   ..$ output_cut_height  : num [1:3] 0.6 0.5 0.4
##  $ clusters    :'data.frame': 715 obs. of 4 variables:
##   ..$ ID  : chr [1:715] "1003" "1004" "1005" "1006" ...
##   ..$ K_2 : chr [1:715] "1" "1" "1" "1" ...
##   ..$ K_6 : chr [1:715] "1" "1" "1" "1" ...
##   ..$ K_23: chr [1:715] "1" "1" "1" "1" ...
##  $ cluster_info:'data.frame': 3 obs. of 3 variables:
##   ..$ partition_name      : chr [1:3] "K_2" "K_6" "K_23"
##   ..$ n_clust             : int [1:3] 2 6 23
##   ..$ requested_cut_height: num [1:3] 0.6 0.5 0.4
##  - attr(*, "class")= chr [1:2] "bioregion.clusters" "list"
It shows you the different slots in the object, and how you can access them. For example, if I want to access the clusters slot, I have to type tree1$clusters.
- name: the name of the method we are using
- args: the arguments you have selected for your tree
- inputs: this is mostly for internal use in the package; it provides some info about the nature of input data and methods
- algorithm: this slot contains detailed information about the hierarchical clustering. For example, you can access the raw tree here, in hclust format, by typing tree1$algorithm$final.tree
- clusters: this is a data.frame containing your partitions. The first column is your sites, and all the other columns are the partitions.
- cluster_info: this is a small data.frame which will help you link your requests with the clusters data.frame. Its content varies depending on your choices; for example, in my case, it looks like this:
tree1$cluster_info
##       partition_name n_clust requested_cut_height
## h_0.6            K_2       2                  0.6
## h_0.5            K_6       6                  0.5
## h_0.4           K_23      23                  0.4
It shows the name of the partition (corresponding to column names in tree1$clusters), the number of clusters in each partition, and the cut height I initially requested.
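For instance, two common accessors (a small sketch based on the structure shown above):

# The raw tree, stored as a standard hclust object
class(tree1$algorithm$final.tree)

# The site-by-partition table
head(tree1$clusters)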
2.3 Explanation of the default settings and how to change them
2.3.1 Randomization of the distance matrix
The order of sites in the distance matrix influences the outcome of the hierarchical tree. Let’s see that with an example:
# Compute the tree without randomizing the distance matrix
tree2 <- hclu_hierarclust(dissim, randomize = FALSE)
## Output tree has a 0.6849 cophenetic correlation coefficient with the initial## dissimilarity matrix
plot(tree2, cex=.1)
This is how the tree looks when the matrix is not randomized. Now let’s randomize it and regenerate the tree:
# This line randomizes the order of rows in the distance matrix
dissim_random <- dissim[sample(1:nrow(dissim)), ]

# Recompute the tree
tree3 <- hclu_hierarclust(dissim_random, randomize = FALSE)
## Output tree has a 0.6778 cophenetic correlation coefficient with the initial## dissimilarity matrix
plot(tree3, cex=.1)
See how the tree looks different? This is problematic because it means that the outcome is heavily influenced by the order of sites in the distance matrix.
To address this issue, we have developed an iterative algorithm that reconstructs the entire tree from top to bottom, by selecting for each branch a majority decision among multiple randomizations of the distance matrix (100 by default; this can be increased). This method is called the Iterative Hierarchical Consensus Tree (argument optimal_tree_method = "iterative_consensus_tree", the default value) and it ensures that you obtain a consensus tree based on a majority decision for each branch of the tree. Trees produced with this method generally have a better topology than any individual tree. We estimate the performance of the topology with the cophenetic correlation coefficient, which is the correlation between the initial \(\beta_{sim}\) distance among sites and the cophenetic distance, i.e., the distance at which sites are connected in the tree. It tells us how representative the tree is of the initial distance matrix.
Although this method performs better than any other available method, it comes with a computing cost: it needs to randomize the distance matrix multiple times for each branching of the tree. Therefore, we recommend using it to obtain a robust tree - be patient in that case. Otherwise, if you only need a very fast look at a good tree, you can simply select the single best tree among multiple randomization trials. It will never be as good as the Iterative Hierarchical Consensus Tree, but it is a good choice for a fast exploration. To do that, choose optimal_tree_method = "best". It will select the tree that best represents the distance matrix, i.e., the one that has the highest cophenetic correlation coefficient among all trials.
Let’s see an example of optimal_tree_method = "best". We can also ask the function to keep all individual trees for further exploration (keep_trials = TRUE).
tree_best <- hclu_hierarclust(dissim_random,
                              randomize = TRUE,
                              optimal_tree_method = "best",
                              keep_trials = TRUE)
## Randomizing the dissimilarity matrix with 100 trials
## -- range of cophenetic correlation coefficients among trials: 0.6669 - 0.6974
## 
## Final tree has a 0.6974 cophenetic correlation coefficient with the initial dissimilarity matrix
Another possible approach is to build a simple consensus tree among all the trials. However, we generally do not recommend constructing a consensus tree, because the topology of simple consensus trees can be very problematic if there are a lot of ties in the distance matrix. Let’s see it in action here:
tree_consensus <- hclu_hierarclust(dissim_random,
                                   randomize = TRUE,
                                   optimal_tree_method = "consensus",
                                   keep_trials = TRUE)
## Randomizing the dissimilarity matrix with 100 trials
## 
## Final tree has a 0.202 cophenetic correlation coefficient with the initial dissimilarity matrix
See how the cophenetic correlation coefficient for the consensus tree is terrible compared to the IHCT and the best tree?
This consensus tree has almost no correlation with our initial distance matrix. This is because its topology is terribly wrong; see how the tree looks:
plot(tree_consensus)
Booo, this is just a large rake, not a tree!!!
2.3.2 Tree construction algorithm
By default, the function uses the UPGMA method (Unweighted Pair Group Method with Arithmetic Mean) because it has been recommended in bioregionalization for its better performance over other approaches (Kreft & Jetz, 2010). You can change this with the argument method; all methods implemented in stats::hclust() are available.
Note that the height distances currently computed for method = "ward.D" and method = "ward.D2" may differ from the calculations in stats::hclust(), due to the iterative nature of the algorithm. In addition, method = "single" is much slower than the other approaches, and we have not yet implemented a workaround to make it faster (do not hesitate to contact us if you need a faster implementation or have an idea of how to make it run faster).
2.3.3 Cutting the tree
There are three ways of cutting the tree:
- Specify the expected number of clusters: you can request a specific number of clusters (n_clust = 5 for example). You can also request multiple partitions, each with their own number of clusters (n_clust = c(5, 10, 15) for example).
Note: When you specify the number of clusters, the function will search for the associated height of cut automatically; you can disable this behaviour with find_h = FALSE. It will search for this h value between h_max (default 1) and h_min (default 0). These arguments can be adjusted if you are working with indices whose values do not range between 0 and 1.
- Specify the height of cut: you can request the height at which you want to cut the tree (e.g., cut_height = 0.5). You can also request multiple partitions, each with their own cut height (cut_height = c(0.4, 0.5, 0.6) for example).
- Use a dynamic tree cut method: rather than cutting the entire tree at once, this alternative approach consists in cutting individual branches at different heights. This method can be requested with dynamic_tree_cut = TRUE, and is based on the dynamicTreeCut R package.
(See the sketch just below for these three options in code.)
2.4. Divisive clustering
While the agglomerative hierarchical clustering in the previous subsections followed a bottom-up approach, divisive clustering follows a top-down approach. This means that, in the first step of the clustering, all sites belong to the same bioregion, and sites are then iteratively divided into different bioregions until each site belongs to its own bioregion.
Divisive clustering follows the DIvisive ANAlysis (DIANA) clustering algorithm described in Kaufman & Rousseeuw (2009).
At each step, the algorithm splits the largest cluster by identifying the most dissimilar observation (i.e. site) and then putting sites that are closer to this most dissimilar observation than to the ‘old party’ group into a splinter group. The result is that the large cluster is split into two clusters.
The function hclu_diana() performs the Diana divisive clustering.
# Compute the tree with the Diana algorithm
tree_diana <- hclu_diana(dissim)
## Output tree has a 0.41 cophenetic correlation coefficient with the initial dissimilarity matrix
plot(tree_diana)
3. How to find an optimal number of clusters?

Step 1. Build a tree with hclu_hierarclust().
Step 2. Explore a range of partitions, from a minimum (e.g., starting at 2 clusters) up to a maximum (e.g., \(n-1\) clusters where \(n\) is the number of sites).
Step 3. Calculate one or several metrics for each partition, to be used as the basis for evaluation plots.
Step 4. Search for one or several optimal number(s) of clusters using evaluation plots. Different criteria can be applied to identify the optimal number(s) of clusters.
Step 5. Export the optimal partitions from your cluster object.
3.1 A practical example
In this example we will compute the evaluation metric used by Holt et al. (2013), which compares the total dissimilarity of the distance matrix (sum of all distances) with the inter-cluster dissimilarity (sum of distances between clusters). Then we will choose the optimal number of clusters as the elbow of the evaluation plot.
data(vegemat)

# Calculate dissimilarities
dissim <- dissimilarity(vegemat)

# Step 1 & 2. Compute the tree and cut it into many different partitions
tree4 <- hclu_hierarclust(dissim, n_clust = 2:100)
## Building the iterative hierarchical consensus tree... Note that this process can take time especially if you have a lot of sites.
## 
## Final tree has a 0.6842 cophenetic correlation coefficient with the initial dissimilarity matrix
## Warning in cut_tree(outputs, n_clust = n_clust, cut_height = cut_height, : The
## requested number of cluster could not be found for k = 4. Closest number found:
## 5
## Warning in cut_tree(outputs, n_clust = n_clust, cut_height = cut_height, : The
## requested number of cluster could not be found for k = 67. Closest number
## found: 66
# Step 3. Calculate the same evaluation metric as Holt et al. 2013
eval_tree4 <- bioregionalization_metrics(tree4,
                                         dissimilarity = dissim, # Provide distances to compute the metrics
                                         eval_metric = "pc_distance")
## Computing similarity-based metrics...
## - pc_distance OK
# Step 4. Find the optimal number of clusters
opti_n_tree4 <- find_optimal_n(eval_tree4)
## Number of bioregionalizations: 99
## Searching for potential optimal number(s) of clusters based on the elbow method
## * elbow found at:
## pc_distance 19
## Plotting results...
opti_n_tree4
## Search for an optimal number of clusters:
##  - 99 partition(s) evaluated
##  - Range of clusters explored: from 2 to 100
##  - Evaluated metric(s): pc_distance
## 
## Potential optimal partition(s):
##  - Criterion chosen to optimise the number of clusters: elbow
##  - Optimal partition(s) of clusters for each metric:
##    pc_distance - 19
# Step 5. Extract the optimal number of clusters
# We get the name of the correct partition in the next line
K_name <- opti_n_tree4$evaluation_df$K[opti_n_tree4$evaluation_df$optimal_n_pc_distance]

# Look at the site-cluster table
head(tree4$clusters[, c("ID", K_name)])
##        ID K_19
## 1003 1003    1
## 1004 1004    1
## 1005 1005    1
## 1006 1006    1
## 1007 1007    1
## 1008 1008    1
# Make a map of the clusters
data("vegesf")
library("sf")
map_bioregions(tree4$clusters[, c("ID", K_name)], vegesf)
Or, if you are allergic to lines of code, you could also simply recut your tree at the identified optimal number(s) of clusters with cut_tree().
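For example, with the pc_distance elbow found above (a sketch):

tree4 <- cut_tree(tree4, n_clust = 19)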
3.2 Evaluation metrics
Currently, there are four evaluation metrics available in the package:
- pc_distance: \(\sum{\beta_{sim}^{between-cluster}} / \sum{\beta_{sim}}\). This is the metric computed in Holt et al. (2013).
- anosim: this is the statistic used in Analysis of Similarities, as suggested in Castro-Insua et al. (2018). It compares the between-cluster dissimilarities to the within-cluster dissimilarities. It is based on the difference of mean ranks between groups and within groups with the following formula: \(R = (r_B - r_W) / (N(N-1)/4)\), where \(r_B\) and \(r_W\) are the average ranks between and within clusters respectively, and \(N\) is the total number of sites.
- avg_endemism: the average percentage of endemism in clusters (Kreft & Jetz, 2010). It is calculated as follows: \(End_{mean} = \frac{\sum_{i=1}^{K} E_i / S_i}{K}\), where \(E_i\) is the number of endemic species in cluster \(i\), \(S_i\) is the number of species in cluster \(i\), and \(K\) the maximum number of clusters.
- tot_endemism: the total endemism across all clusters (Kreft & Jetz, 2010). It is calculated as follows: \(End_{tot} = E / C\), where \(E\) is the total number of endemic species (i.e., species occurring in only one cluster) and \(C\) is the number of non-endemic species.
Important note
To be able to calculate pc_distance and anosim, you need to provide your dissimilarity object to the argument dissimilarity. In addition, to be able to calculate avg_endemism and tot_endemism, you need to provide your species-site network to the argument net (don’t panic if you only have a species x site matrix! We have a function to make the conversion).
Let’s see that in practice. Depending on the size of your dataset, computing endemism-based metrics can take a while.
# Calculate pc_distance and anosim
bioregionalization_metrics(tree4,
                           dissimilarity = dissim,
                           eval_metric = c("pc_distance", "anosim"))
## Computing similarity-based metrics...
## - pc_distance OK
## - anosim OK
## Partition metrics:
##  - 99 partition(s) evaluated
##  - Range of clusters explored: from 2 to 100
##  - Requested metric(s): pc_distance anosim
##  - Metric summary:
##      pc_distance    anosim
## Min    0.4982831 0.6968281
## Mean   0.8950634 0.8029604
## Max    0.9823024 0.8669999
## 
## Access the data.frame of metrics with your_object$evaluation_df
# Calculate avg_endemism and tot_endemism
# I have an abundance matrix, I need to convert it into network format first:
vegenet <- mat_to_net(vegemat)

bioregionalization_metrics(tree4,
                           net = vegenet,
                           eval_metric = c("avg_endemism", "tot_endemism"))
## Computing composition-based metrics...
## - avg_endemism OK
## - tot_endemism OK
## Partition metrics:
##  - 99 partition(s) evaluated
##  - Range of clusters explored: from 2 to 100
##  - Requested metric(s): avg_endemism tot_endemism
##  - Metric summary:
##      avg_endemism tot_endemism
## Min   0.001177433   0.04895862
## Mean  0.008144637   0.08297200
## Max   0.181513150   0.31539086
## 
## Access the data.frame of metrics with your_object$evaluation_df
## Details of endemism % for each bioregionalization are available in
## your_object$endemism_results
3.3 Criteria to choose an optimal number of clusters
Choosing the optimal number of clusters is a long-standing issue in the literature, and there is no absolute and objective answer to this question. A plethora of methods have been proposed over the years, and the best approach to tackle this issue is probably to compare the results of multiple approaches to make an informed decision.
In the bioregion package, we have implemented several methods that are specifically suited for bioregionalization analysis.
For example, a standard criterion used for identifying the optimal number of clusters is the elbow method, which is the default criterion in find_optimal_n(). However, we recommend moving beyond the paradigm of a single optimal number of clusters, which is likely an oversimplification of the hierarchy of the tree. We recommend considering multiple cuts of the tree instead, and we provide several methods for doing so: identifying large steps in the curve or using multiple cutoffs. Additionally, we implement other approaches, such as using the maximum or minimum value of the metrics, or finding break points in the curve with a segmented model.
Before we look at these different methods, we will compute all the evaluation metrics and store them in eval_tree4:
vegenet <- mat_to_net(vegemat)

eval_tree4 <- bioregionalization_metrics(tree4,
                                         dissimilarity = dissim,
                                         net = vegenet,
                                         eval_metric = c("pc_distance", "anosim",
                                                         "avg_endemism", "tot_endemism"))
## Computing similarity-based metrics...
## - pc_distance OK
## - anosim OK
## Computing composition-based metrics...
## - avg_endemism OK
## - tot_endemism OK
3.3.1 Elbow method
The elbow method consists in finding the ‘elbow’ in the shape of the metric-cluster relationship. This method will typically work for metrics which have an L-shaped curve (typically pc_distance and the endemism metrics), but not for other metrics (e.g., the shape of anosim does not necessarily follow an L-shape).
The rationale behind the elbow method is to find a cutoff above which the metric values stop increasing significantly, such that adding new clusters does not provide much improvement in the metric.
The elbow method is the default method in find_optimal_n(). There are no parameters to adjust it. If the curve is not elbow-shaped, it may give spurious results.
find_optimal_n(eval_tree4)
## Number of bioregionalizations: 99
## Searching for potential optimal number(s) of clusters based on the elbow method
## * elbow found at:
##  pc_distance 19
##  anosim 24
##  avg_endemism 9
##  tot_endemism 12
## Warning in find_optimal_n(eval_tree4): The elbow method is likely not suitable
## for the ANOSIM metric. You should rather look for leaps in the curve (see
## criterion = 'increasing_step' or decreasing_step)
## Plotting results...
## Search for an optimal number of clusters:
##  - 99 partition(s) evaluated
##  - Range of clusters explored: from 2 to 100
##  - Evaluated metric(s): pc_distance anosim avg_endemism tot_endemism
## 
## Potential optimal partition(s):
##  - Criterion chosen to optimise the number of clusters: elbow
##  - Optimal partition(s) of clusters for each metric:
##    pc_distance - 19
##    anosim - 24
##    avg_endemism - 9
##    tot_endemism - 12
In our example above, the optimal number of clusters varies depending on the metric, from a minimum of 9 to a maximum of 24. The final choice depends on your preferences with respect to the metrics and your objectives with the clustering. Alternatively, two cut-offs could be used: a deep cut-off based on the endemism metrics, e.g. at a value of 9, and a shallow cutoff based on pc_distance, at 19.
3.3.2 Step method
The step method consists in identifying the largest “steps” in metrics, i.e., the largest increases or decreases in the value of the metric.
To do this, the function calculates all successive differences in metrics between partitions. It will then keep only the largest positive differences (increasing_step) or negative differences (decreasing_step). increasing_step is for increasing metrics (pc_distance) and decreasing_step is for decreasing metrics (avg_endemism and tot_endemism). anosim values can either increase or decrease depending on your dataset, so you would have to explore both ways.
By default, the function selects the top 1% steps:
find_optimal_n(eval_tree4,
               metrics_to_use = c("anosim", "pc_distance"),
               criterion = "increasing_step")
## Number of bioregionalizations: 99
## Searching for potential optimal number(s) of clusters based on the increasing_step method
## - Step method
## Plotting results...
## Search for an optimal number of clusters:
##  - 99 partition(s) evaluated
##  - Range of clusters explored: from 2 to 100
##  - Evaluated metric(s): anosim pc_distance
## 
## Potential optimal partition(s):
##  - Criterion chosen to optimise the number of clusters: increasing_step
##    (step quantile chosen: 0.99 (i.e., only the top 1 % increase in evaluation metrics are used as break points for the number of clusters)
##  - Optimal partition(s) of clusters for each metric:
##    anosim - 19
##    pc_distance - 12
find_optimal_n(eval_tree4,
               metrics_to_use = c("avg_endemism", "tot_endemism"),
               criterion = "decreasing_step")
## Number of bioregionalizations: 99
## Searching for potential optimal number(s) of clusters based on the decreasing_step method
## - Step method
## Plotting results...
## Search for an optimal number of clusters:
##  - 99 partition(s) evaluated
##  - Range of clusters explored: from 2 to 100
##  - Evaluated metric(s): avg_endemism tot_endemism
## 
## Potential optimal partition(s):
##  - Criterion chosen to optimise the number of clusters: decreasing_step
##    (step quantile chosen: 0.99 (i.e., only the top 1 % decrease in evaluation metrics are used as break points for the number of clusters)
##  - Optimal partition(s) of clusters for each metric:
##    avg_endemism - 5
##    tot_endemism - 5
However, you can adjust it in two different ways. First, choose a number of steps to select; e.g., to select the largest 3 steps, use step_levels = 3:
find_optimal_n(eval_tree4,
               metrics_to_use = c("anosim", "pc_distance"),
               criterion = "increasing_step",
               step_levels = 3)
## Number of bioregionalizations: 99
## Searching for potential optimal number(s) of clusters based on the increasing_step method
## - Step method
## Plotting results...
## Search for an optimal number of clusters:
##  - 99 partition(s) evaluated
##  - Range of clusters explored: from 2 to 100
##  - Evaluated metric(s): anosim pc_distance
## 
## Potential optimal partition(s):
##  - Criterion chosen to optimise the number of clusters: increasing_step
##    (step quantile chosen: 0.99 (i.e., only the top 1 % increase in evaluation metrics are used as break points for the number of clusters)
##  - Optimal partition(s) of clusters for each metric:
##    anosim - 5 12 19
##    pc_distance - 5 12 19
Note that these steps generally correspond to large jumps in the tree, which is why we like this approach: it fits well with the hierarchical nature of the tree.
Second, you can set a quantile of steps to select; e.g., to select the 5% largest steps, set the quantile to 0.95 (step_quantile = 0.95):
find_optimal_n(eval_tree4,
               metrics_to_use = c("anosim", "pc_distance"),
               criterion = "increasing_step",
               step_quantile = 0.95)
## Number of bioregionalizations: 99
## Searching for potential optimal number(s) of clusters based on the increasing_step method
## - Step method
## Plotting results...
## Search for an optimal number of clusters:
##  - 99 partition(s) evaluated
##  - Range of clusters explored: from 2 to 100
##  - Evaluated metric(s): anosim pc_distance
## 
## Potential optimal partition(s):
##  - Criterion chosen to optimise the number of clusters: increasing_step
##    (step quantile chosen: 0.95 (i.e., only the top 5 % increase in evaluation metrics are used as break points for the number of clusters)
##  - Optimal partition(s) of clusters for each metric:
##    anosim - 5 12 19 24 93
##    pc_distance - 3 5 12 19 68
Finally, a question that may arise is which cluster number to select when a large step occurs. For example, if the largest step occurs between partitions with 4 and 5 clusters, should we keep the partition with 4 clusters, or the partition with 5 clusters?
By default, the function keeps the partition with \(N + 1\) clusters (5 clusters in our example above). You can change this by setting step_round_above = FALSE:
find_optimal_n(eval_tree4,
               metrics_to_use = c("avg_endemism", "tot_endemism"),
               criterion = "decreasing_step",
               step_round_above = FALSE)
## Number of bioregionalizations: 99
## Searching for potential optimal number(s) of clusters based on the decreasing_step method
## - Step method
## Plotting results...
## Search for an optimal number of clusters:
##  - 99 partition(s) evaluated
##  - Range of clusters explored: from 2 to 100
##  - Evaluated metric(s): avg_endemism tot_endemism
## 
## Potential optimal partition(s):
##  - Criterion chosen to optimise the number of clusters: decreasing_step
##    (step quantile chosen: 0.99 (i.e., only the top 1 % decrease in evaluation metrics are used as break points for the number of clusters)
##  - Optimal partition(s) of clusters for each metric:
##    avg_endemism - 3
##    tot_endemism - 3
3.3.3 Cutting at different cut-off values
The idea of this method is to select the partitions at which the evaluation metric reaches chosen cut-off values. For example, in their study, Holt et al. (2013) used different cutoffs for pc_distance to find the global biogeographic regions: 0.90, 0.95, 0.99, 0.999. The higher the value, the more β-diversity is explained, but also the more clusters there are. Therefore, the choice is a trade-off between the total β-diversity explained and the number of clusters.
Ultimately, the choice of these values depends on different factors:
- The geographic scope of your study. A global-scale study can use large cutoffs like Holt et al. (2013) and end up with a reasonable number of clusters, whereas in regional- to local-scale studies with less endemism and more taxa shared among clusters, these values are too high, and other cutoffs should be explored, such as 0.5 and 0.75.
- The characteristics of your study which will increase or decrease the degree of endemism among clusters: dispersal capacities of your taxonomic group, the connectivity/barriers in your study area, etc. Use lower cutoffs when you have a large number of widespread species, and higher cutoffs when you have high degrees of endemism.
- Using abundance or phylogenetic data to compute β-diversity metrics may allow you to better distinguish clusters, which in turn will allow you to use higher cutoffs.
For example, in our case, a regional-scale study on vegetation, we can use three cutoffs: 0.6 (deep cutoff), 0.8 (intermediate cutoff), and 0.9 (shallow cutoff).
find_optimal_n(eval_tree4,
               metrics_to_use = "pc_distance",
               criterion = "cutoff",
               metric_cutoffs = c(.6, .8, .9))
## Number of bioregionalizations: 99
## Searching for potential optimal number(s) of clusters based on the cutoff method
## - Cutoff method
## Plotting results...
## Search for an optimal number of clusters:
##  - 99 partition(s) evaluated
##  - Range of clusters explored: from 2 to 100
##  - Evaluated metric(s): pc_distance
## 
## Potential optimal partition(s):
##  - Criterion chosen to optimise the number of clusters: cutoff
##    --> cutoff(s) chosen: 0.6 0.8 0.9
##  - Optimal partition(s) of clusters for each metric:
##    pc_distance - 5 12 29
3.3.4 Cutting at the maximum or minimum metric value
This criterion finds the maximum (criterion = "max") or minimum (criterion = "min") value of the metric across the list of partitions and selects the corresponding partition. Such a criterion can be interesting in the case of anosim, but is probably much less useful for the other metrics implemented in the package.
3.3.5 Finding break points in the curve
This criterion consists in applying a segmented regression model with the formula evaluation metric ~ number of clusters. The user can define the number of breaks to be identified on the curve. Note that such a model is likely to require a minimum number of points to find an appropriate number of clusters. In our example here, we make 100 cuts in the tree to have enough values.
tree5 <- cut_tree(tree4,
                  cut_height = seq(0, max(tree4$algorithm$final.tree$height),
                                   length = 100))

eval_tree5 <- bioregionalization_metrics(tree5,
                                         dissimilarity = dissim,
                                         net = vegenet,
                                         eval_metric = c("pc_distance", "anosim",
                                                         "avg_endemism", "tot_endemism"))
## Computing similarity-based metrics...
## - pc_distance OK
## - anosim OK
## Computing composition-based metrics...
## - avg_endemism OK
## - tot_endemism OK
find_optimal_n(eval_tree5, criterion="breakpoints")
## Number of bioregionalizations: 100
## Searching for potential optimal number(s) of clusters based on the breakpoints method
## Exact break point not in the list of bioregionalizations: finding the closest bioregionalization...
## Plotting results...
## (the red line is the prediction from the segmented regression)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Search for an optimal number of clusters:
##  - 100 partition(s) evaluated
##  - Range of clusters explored: from 1 to 701
##  - Evaluated metric(s): pc_distance anosim avg_endemism tot_endemism
## 
## Potential optimal partition(s):
##  - Criterion chosen to optimise the number of clusters: breakpoints
##  - Optimal partition(s) of clusters for each metric:
##    pc_distance - 16
##    anosim - 111
##    avg_endemism - 2
##    tot_endemism - 2
We can ask for a higher number of breaks:
- 2 breaks
find_optimal_n(eval_tree5, criterion="breakpoints", n_breakpoints=2)
## Number of bioregionalizations: 100
## Searching for potential optimal number(s) of clusters based on the breakpoints method
## Exact break point not in the list of bioregionalizations: finding the closest bioregionalization...
## Plotting results...
## (the red line is the prediction from the segmented regression)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Search for an optimal number of clusters:
##  - 100 partition(s) evaluated
##  - Range of clusters explored: from 1 to 701
##  - Evaluated metric(s): pc_distance anosim avg_endemism tot_endemism
## 
## Potential optimal partition(s):
##  - Criterion chosen to optimise the number of clusters: breakpoints
##  - Optimal partition(s) of clusters for each metric:
##    pc_distance - 13 92
##    anosim - 25 183
##    avg_endemism - 2 16
##    tot_endemism - 2 92
- 3 breaks
find_optimal_n(eval_tree5, criterion="breakpoints", n_breakpoints=3)
## Number of bioregionalizations: 100
## Searching for potential optimal number(s) of clusters based on the breakpoints method
## Exact break point not in the list of bioregionalizations: finding the closest bioregionalization...
## Plotting results...
## (the red line is the prediction from the segmented regression)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Search for an optimal number of clusters:
##  - 100 partition(s) evaluated
##  - Range of clusters explored: from 1 to 701
##  - Evaluated metric(s): pc_distance anosim avg_endemism tot_endemism
## 
## Potential optimal partition(s):
##  - Criterion chosen to optimise the number of clusters: breakpoints
##  - Optimal partition(s) of clusters for each metric:
##    pc_distance - 2 16 100
##    anosim - 25 183 566
##    avg_endemism - 2 13 56
##    tot_endemism - 2 14 111
Increasing the number of breaks can be useful in situations where you have, for example, non-linear shapes in the metric ~ number of clusters curves.
4. OPTICS as a semi-hierarchical clustering approach
OPTICS (Ordering Points To Identify the Clustering Structure) is a semi-hierarchical clustering approach that orders the points in the dataset so that the closest points become neighbors, calculates a ‘reachability’ distance between each point, and then extracts clusters from this reachability distance in a hierarchical manner. However, the hierarchical nature of clusters is not directly provided by the algorithm in a tree-like output. Hence, users should explore the ‘reachability plot’ to understand the hierarchical nature of their OPTICS clusters, and read the related publication to further understand this method (Hahsler et al., 2019).
To run the OPTICS algorithm, use the hclu_optics() function:
data(vegemat)

# Calculate dissimilarities
dissim <- dissimilarity(vegemat)

clust1 <- hclu_optics(dissim)

clust1
## Clustering results for algorithm : hclu_optics
##  - Number of sites: 715
## Clustering results:
##  - Number of partitions: 1
##  - Number of clusters: 9