stats::hclust withhclust1dThe purpose of this vignette is to provide guidelines on replacingstats::hclust calls withhclust1d calls forunivariate (1D) data, in a plug-and-play manner, i.e. without changingany surrounding code (or little of the surrounding code, as an option tothe programmer).
To enable use ofhclust1d you need to include this linein your script or markdown notebook:
library(hclust1d)In case of packages you need to importhclust1d in yourDESCRIPTION file.
The main reason a programmer might want to replacestats:hclust calls withhclust1d in case ofclustering 1D points is that the computational complexity ofhclust1d is\(\mathcal{O}(n\logn)\), while it is at least quadratic (and\(\mathcal{O}(n^2\log n)\) for case ofgeneral linkage) for multidimensional algorithms.
The better algorithmic time complexity (compared to multidimensionalhierarchical clustering) paired with its efficientC++implementation makehclust1d very fast. The computationaltime beatsstats::hclust on all sizes of data and isenpar withfastcluster::hclust with small data sizes.However, it is of orders of magnitude faster than both multivariateclustering routines on larger data sizes.
hclust1d withstats::hclustMaintaining compatibility ofhclust withstats::hclustwas high on a list of design priorities forhclust1d.
All linkage functions ofstats::hclust are supportedinhclust1d, too.
Input tostats::hclust should bedistS3 class structure as produced bystats::dist function andthe same input is accepted inhclust1d (with adistance argument explicitly set toTRUE).
There are three atypical linkages instats::hclust.Namely,stats::hclust requires that thesquareddist structure is provided forward.D,centroid andmedianlinkage functions. This is implicit. The same input is accepted inhclust1d (withdistance andsquared arguments explicitly set toTRUE).
The object returned fromhclust1d call is the sameS3 class as the result ofstats::hclust, namelyhclust S3 class.
The heights returned fromhclust1d are calculatedthe same, as instats::hclust:
ward.D,centroid andmedianlinkage functions,hclust1dThe list of all linkage functions supported inhclust1dis available by calling:
supported_methods()#> [1] "complete" "average" "centroid" "true_median" "median"#> [6] "mcquitty" "ward.D" "ward.D2" "single"The in-depth description of the linkage functions inhclust1d, together with the inter-cluster distance metricdefinition used in case of each linkage function (and returned as themerging height) can be found in ourgettingstarted vignette.
The choice of a linkage function is the same inhclust1das instats::hclust, i.e. by specifying amethod argument and passing the name of a linkage functionintohclust1d as a character string.
To provide an example, the following two calls executeaverage linkage hierarchical clustering on distancescomputed for a set of 1D points, by passingmethod = "average" argument to relevant calls:
points<-rnorm(10)res<- stats::hclust(stats::dist(points),method ="average")res<-hclust1d(stats::dist(points),method ="average",distance =TRUE)hclust1dThe user ofstats::hclust and ofhclust1dcan select a number of distance metrics when building distance-basedinput withstats::dist, by selecting an appropriate name ofa metric and passing it as amethod argument tostats::dist as a character string. Not all of them aresupported inhclust1d. The list of distance metricssupported inhclust1d is available by calling:
supported_dist.methods()#> [1] "euclidean" "maximum" "manhattan" "minkowski"The trick here is that for 1D pointseuclidean,maximum,manhattan andminkowskidistances are equivalent.
To provide an example, the following two calls executeaverage linkage hierarchical clustering on distancescomputed byminkowski\(L_3\) norm for a set of 1D points, bypassingmethod = "minkowski" andp = 3arguments to relevantstats::dist calls:
points<-rnorm(10)res<- stats::hclust(stats::dist(points,method ="minkowski",p=3),method="average")res<-hclust1d(stats::dist(points,method ="minkowski",p=3),method="average",distance =TRUE)We don’t supportmembers argument inhclust1d.
stats::hclust in case ofward.D,centroid ormedian linkage functionsThis section DOES NOT apply toward.D2 linkagefunction, despite the similarity in its name.
For those three linkage functions (ward.D,centroid ormedian) the default hierarchicalclustering routinestats::hclust requiressquareddistance structure calculated on original points. Consequently, as canbe seen from the sections above, to replace astats::hclustcall withhclust1d for 1D data, one needs to replace anycall to
res<- stats::hclust(squared_d,method = linkage_function_name,members =NULL)by a call to
res<-hclust1d(squared_d,method = linkage_function_name,distance =TRUE,square =TRUE)Somewhere in the code above this line,squared_d hasbeen computed by a call tostats::dist from a vector of 1Dpoints and subsequently squaring thestats::distresult.
Somewhere below in the coderes gets analyzed, but it isOK, because the results of both calls are compatible.
If the programmer has access to the originalstats::dist result (let’s denote this variabled), the computation ofsquared_d can beremoved (provided it is not used for other purpose) and a call to
res<- stats::hclust(squared_d,method = linkage_function_name,members =NULL)can be replaced by a call to
res<-hclust1d(d,method = linkage_function_name,distance =TRUE)If the programmer has access to the original points (let’s denotethis variablepoints), the computation ofsquared_d and ofd can be removed altogether(provided they are not used for other purpose) and a call to
res<- stats::hclust(squared_d,method = linkage_function_name,members =NULL)can be replaced by a call to
res<-hclust1d(points,method = linkage_function_name)stats::hclust in case of all other linkagefunctions, besidesward.D,centroid ormedianThis section applies to, among others,ward.D2linkage function.
For those remaining linkage functionsstats::hclustaccepts regular (unsquared) distance structure calculated onoriginal points. Consequently, as can be seen from the sections above,to replace astats::hclust call withhclust1dfor 1D data, one needs to replace any call to
res<- stats::hclust(d,method = linkage_function_name,members =NULL)by a call to
res<-hclust1d(d,method = linkage_function_name,distance =TRUE)Somewhere in the code above this line,d has beencomputed by a call tostats::dist from a vector of 1Dpoints.
Somewhere below in the coderes gets analyzed, but it isOK, because the results of both calls are compatible.
If the programmer has access to the original points (let’s denotethis variablepoints), the computation ofdcan be removed (provided it is not used for other purpose) and a callto
res<- stats::hclust(d,method = linkage_function_name,members =NULL)can be replaced by a call to
res<-hclust1d(points,method = linkage_function_name)