Movatterモバイル変換


[0]ホーム

URL:


Type:Package
Title:Fast, Robust, and Outlier Resistant Hierarchical Clustering
Version:1.0.6
Date:2025-12-01
Description:Includes the basic implementation of Genie - a hierarchical clustering algorithm that links two point groups in such a way that an inequity measure (namely, the Gini index) of the cluster sizes does not significantly increase above a given threshold. This method most often outperforms many other data segmentation approaches in terms of clustering quality as tested on a wide range of benchmark datasets. At the same time, Genie retains the high speed of the single linkage approach, therefore it is also suitable for analysing larger data sets. For more details see (Gagolewski et al. 2016 <doi:10.1016/j.ins.2016.05.003>). For an even faster and more feature-rich implementation, including, amongst others, see the 'genieclust' package (Gagolewski, 2021 <doi:10.1016/j.softx.2021.100722>).
License:GPL (≥ 3)
BugReports:https://github.com/gagolews/genie/issues
URL:https://genieclust.gagolewski.com/,https://github.com/gagolews/genie
Depends:R (≥ 4.0.0), stats, genieclust
Imports:Rcpp (≥ 1.0.0)
Suggests:datasets, testthat, stringi
LinkingTo:Rcpp (≥ 1.0.0)
Encoding:UTF-8
RoxygenNote:7.3.2
NeedsCompilation:yes
Packaged:2025-12-01 17:10:49 UTC; gagolews
Author:Marek GagolewskiORCID iD [aut, cre, cph], Maciej BartoszukORCID iD [aut], Anna CenaORCID iD [aut]
Maintainer:Marek Gagolewski <marek@gagolewski.com>
Repository:CRAN
Date/Publication:2025-12-01 18:10:02 UTC

The Genie Package

Description

Seehclust2() for details.

Author(s)

Marek Gagolewski, Maciej Bartoszuk, Anna Cena

See Also

Useful links:


Fast Hierarchical Clustering in Spaces Equipped Witha Dissimilarity Measure

Description

The reference implementation of the fast, robust and outlier resistantGenie algorithm described in (Gagolewski, Bartoszuk, Cena, 2016).Note that thegenie package has been superseded bygenieclust,seegclust andgeniefor more details.

Usage

hclust2(d = NULL, objects = NULL, thresholdGini = 0.3, ...)

Arguments

d

an object of classdist,NULL, or a single string, see below

objects

NULL, numeric matrix, a list, or a character vector

thresholdGini

single numeric value in [0,1],threshold for the Gini index, 1 gives the standard single linkage algorithm

...

internal parameters used to tune up the algorithm

Details

The time needed to apply a hierarchical clustering algorithmis most often dominated by the number of computations of a pairwisedissimilarity measure. Such a constraint, for larger data sets,puts at a disadvantage the use of all the classical linkagecriteria but the single linkage one. However, it is known that the singlelinkage clustering algorithm is very sensitive to outliers, produces highlyskewed dendrograms, and therefore usually does not reflect the trueunderlying data structure – unless the clusters are well-separated.

To overcome its limitations, in (Gagolewski, Bartoszuk, Cena, 2016)we proposed a new hierarchical clustering linkagecriterion. Namely, our algorithm links two clusters in such a way that a choseneconomic inequity measure (here, the Gini index) of the clustersizes does not increase drastically above a given threshold. Thebenchmarks indicate a high practical usefulness of the introduced method:it most often outperforms the Ward or average linkage in terms of theclustering quality while retaining the single linkage speed.The algorithm can be run in parallel (via OpenMP) on multiple threadsto speed up its execution further on.Its memory overhead is small: there is no need to precompute the completedistance matrix to perform the computations in order to obtain a desiredclustering.

For compatibility withhclust,d may be an objectof classdist. In such a case, theobjectsargument is ignored. Note that such an object requires ca.8n(n-1)/2bytes of computer's memory, wheren is the number of objects to cluster,and therefore this setting can be used to analyse data sets of sizesup to about 10,000-50,000.

Ifobjects is a character vector or a list, thendshould be a single string, one of:levenshtein (orNULL),hamming,dinu (Dinu, Sgarro, 2006),oreuclinf (Cena et al., 2015).Note that the list must consisteither of integer or of numeric vectors only (depending on the dissimilaritymeasure of choice). On the other hand, each string must be in ASCII,but you can always convert it to UTF-32 withstri_enc_toutf32.

Otherwise, ifobjects is a numeric matrix (here, each rowdenotes a distinct observation), thend should bea single string, one of:euclidean_squared (orNULL),euclidean (which yields the same results aseuclidean_squared)manhattan,maximum, orhamming.

Value

A named list of classhclust, seehclust,with additional components:

References

Cena A., Gagolewski M., Mesiar R., Problems and challenges of informationresources producers' clustering,Journal of Informetrics 9(2), 2015,pp. 273-284.

Dinu L.P., Sgarro A., A Low-complexity Distance for DNA Strings,Fundamenta Informaticae 73(3), 2006, pp. 361-372.

Gagolewski M., Bartoszuk M., Cena A.,Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm,Information Sciences 363, 2016, pp. 8-23.

Gagolewski M., Cena A., Bartoszuk M.Hierarchical clustering via penalty-based aggregation and the Genieapproach, In: Torra V. et al. (Eds.),Modeling Decisions forArtificial Intelligence (Lecture Notes in Artificial Intelligence9880), Springer, 2016.

Examples

library("datasets")data("iris")h <- hclust2(objects=as.matrix(iris[,2:3]), thresholdGini=0.2)plot(iris[,2], iris[,3], col=cutree(h, 3), pch=as.integer(iris[,5]), asp=1, las=1)

[8]ページ先頭

©2009-2025 Movatter.jp