Version: 0.3-66
Encoding: UTF-8
Title: Cluster Ensembles
Description: CLUster Ensembles.
License: GPL-2
Depends: R (≥ 3.2.0)
Imports: stats, cluster, graphics, methods
Suggests: e1071, lpSolve (≥ 5.5.7), quadprog (≥ 1.4-8), relations
Enhances: RWeka, ape, cba, cclust, flexclust, flexmix, kernlab, mclust, movMF, modeltools
NeedsCompilation: yes
Packaged: 2024-11-13 06:15:44 UTC; hornik
Author: Kurt Hornik [aut, cre], Walter Böhm [ctb]
Maintainer: Kurt Hornik <Kurt.Hornik@R-project.org>
Repository: CRAN
Date/Publication: 2024-11-13 07:13:18 UTC

Cassini Data Partitions Obtained by K-Means

Description

A cluster ensemble of 50 k-means partitions of the Cassini data into three classes.

Usage

data("CKME")

Format

A cluster ensemble of 50 (k-means) partitions.

Details

The ensemble was generated via

    require("clue")
    data("Cassini")
    set.seed(1234)
    CKME <- cl_boot(Cassini$x, 50, 3)

Cassini Data

Description

A Cassini data set with 1000 points in 2-dimensional space which are drawn from the uniform distribution on 3 structures. The two outer structures are banana-shaped; the “middle” structure in between them is a circle.

Usage

data("Cassini")

Format

A classed list with components

x

a matrix with 1000 rows and 2 columns giving the coordinates of the points.

classes

a factor indicating which structure the respective points belong to.

Details

Instances of Cassini data sets can be created using function mlbench.cassini in package mlbench. The data set at hand was obtained using

    library("mlbench")
    set.seed(1234)
    Cassini <- mlbench.cassini(1000)

Examples

data("Cassini")
op <- par(mfcol = c(1, 2))
## Plot the data set:
plot(Cassini$x, col = as.integer(Cassini$classes),
     xlab = "", ylab = "")
## Create a "random" k-means partition of the data:
set.seed(1234)
party <- kmeans(Cassini$x, 3)
## And plot that.
plot(Cassini$x, col = cl_class_ids(party),
     xlab = "", ylab = "")
## (We can see the problem ...)
par(op)

Gordon-Vichi Macroeconomic Partition Ensemble Data

Description

Soft partitions of 21 countries based on macroeconomic data for the years 1975, 1980, 1985, 1990, and 1995.

Usage

data("GVME")

Format

A named cluster ensemble of 5 soft partitions of 21 countries into 2 or 3 classes. The names are the years to which the partitions correspond.

Details

The partitions were obtained using fuzzy c-means on measurements of the following variables: the annual per capita gross domestic product (GDP) in USD (converted to 1987 prices); the percentage of GDP provided by agriculture; the percentage of employees who worked in agriculture; and gross domestic investment, expressed as a percentage of the GDP. See Gordon and Vichi (2001), page 230, for more details.

Source

Table 1 in Gordon and Vichi (2001).

References

A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229–248. doi:10.1007/BF02294837.


Gordon-Vichi Macroeconomic Consensus Partition Data

Description

The soft (“fuzzy”) consensus partitions for the macroeconomic partition data given in Gordon and Vichi (2001).

Usage

data("GVME_Consensus")

Format

A named cluster ensemble of eight soft partitions of 21 countries into two or three classes.

Details

The elements of the ensemble are consensus partitions for the macroeconomic partition data in Gordon and Vichi (2001), which are available as data set GVME. Element names are of the form "m/k", where m indicates the consensus method employed (one of ‘MF1’, ‘MF2’, ‘JMF’, and ‘S&S’, corresponding to the application of models 1, 2, and 3 in Gordon and Vichi (2001) and the approach in Sato and Sato (1994), respectively), and k denotes the number of classes (2 or 3).

Source

Tables 4 and 5 in Gordon and Vichi (2001).

References

A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229–248. doi:10.1007/BF02294837.

M. Sato and Y. Sato (1994). On a multicriteria fuzzy clustering method for 3-way data. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2, 127–142. doi:10.1142/S0218488594000122.

Examples

## Load the consensus partitions.
data("GVME_Consensus")
## Pick the partitions into 2 classes.
GVME_Consensus_2 <- GVME_Consensus[1 : 4]
## Fuzziness using the Partition Coefficient.
cl_fuzziness(GVME_Consensus_2)
## (Corresponds to 1 - F in the source.)
## Dissimilarities:
cl_dissimilarity(GVME_Consensus_2)
cl_dissimilarity(GVME_Consensus_2, method = "comem")

Rosenberg-Kim Kinship Terms Partition Data

Description

Partitions of 15 kinship terms given by 85 female undergraduates at Rutgers University who were asked to sort the terms into classes “on the basis of some aspect of meaning”.

Usage

data("Kinship82")

Format

A cluster ensemble of 85 hard partitions of the 15 kinship terms.

Details

Rosenberg and Kim (1975) describe an experiment where perceived similarities of the kinship terms were obtained from six different “sorting” experiments. These “original” Rosenberg-Kim kinship terms data were published in Arabie, Carroll and de Sarbo (1987), and are also contained in file ‘indclus.data’ in the shell archive https://netlib.org/mds/indclus.shar.

For one of the experiments, partitions of the terms were printed in Rosenberg (1982). Comparison with the original data indicates that the partition data have the “nephew” and “niece” columns interchanged, which is corrected in the data set at hand.

Source

Table 7.1 in Rosenberg (1982), with the “nephew” and “niece” columns interchanged.

References

P. Arabie, J. D. Carroll and W. S. de Sarbo (1987). Three-way scaling and clustering. Newbury Park, CA: Sage.

S. Rosenberg and M. P. Kim (1975). The method of sorting as a data-gathering procedure in multivariate research. Multivariate Behavioral Research, 10, 489–502. doi:10.1207/s15327906mbr1004_7.

S. Rosenberg (1982). The method of sorting in multivariate research with applications selected from cognitive psychology and person perception. In N. Hirschberg and L. G. Humphreys (eds.), Multivariate Applications in the Social Sciences, 117–142. Hillsdale, NJ: Erlbaum.


Gordon-Vichi Kinship82 Consensus Partition Data

Description

The soft (“fuzzy”) consensus partitions for the Rosenberg-Kim kinship terms partition data given in Gordon and Vichi (2001).

Usage

data("Kinship82_Consensus")

Format

A named cluster ensemble of three soft partitions of the 15 kinship terms into three classes.

Details

The elements of the ensemble are named "MF1", "MF2", and "JMF", and correspond to the consensus partitions obtained by applying models 1, 2, and 3 in Gordon and Vichi (2001) to the kinship terms partition data in Rosenberg (1982), which are available as data set Kinship82.

Source

Table 6 in Gordon and Vichi (2001).

References

A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229–248. doi:10.1007/BF02294837.

S. Rosenberg (1982). The method of sorting in multivariate research with applications selected from cognitive psychology and person perception. In N. Hirschberg and L. G. Humphreys (eds.), Multivariate Applications in the Social Sciences, 117–142. Hillsdale, NJ: Erlbaum.

Examples

## Load the consensus partitions.
data("Kinship82_Consensus")
## Fuzziness using the Partition Coefficient.
cl_fuzziness(Kinship82_Consensus)
## (Corresponds to 1 - F in the source.)
## Dissimilarities:
cl_dissimilarity(Kinship82_Consensus)
cl_dissimilarity(Kinship82_Consensus, method = "comem")

Miller-Nicely Consonant Phoneme Confusion Data

Description

Miller-Nicely data on the auditory confusion of 16 consonant phonemes.

Usage

data("Phonemes")

Format

A symmetric matrix of the misclassification probabilities of 16 English consonant phonemes.

Details

Miller and Nicely (1955) obtained the confusions by exposing female subjects to a series of syllables consisting of one of the 16 consonants followed by the vowel ‘a’ under 17 different experimental conditions. The data provided are obtained from aggregating the six so-called flat-noise conditions, in which only the speech-to-noise ratio was varied, into a single matrix of misclassification frequencies.

Source

The data set is also contained in file ‘mapclus.data’ in the shell archive https://netlib.org/mds/mapclus.shar.

References

G. A. Miller and P. E. Nicely (1955). An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27, 338–352. doi:10.1121/1.1907526.


Additive Tree Distances

Description

Objects representing additive tree distances.

Usage

as.cl_addtree(x)

Arguments

x

an R object representing additive tree distances.

Details

Additive tree distances are object dissimilarities d satisfying the so-called additive tree conditions, also known as four-point conditions: d_{ij} + d_{kl} \le \max(d_{ik} + d_{jl}, d_{il} + d_{jk}) for all quadruples i, j, k, l. Equivalently, for each such quadruple, the largest two values of the sums d_{ij} + d_{kl}, d_{ik} + d_{jl}, and d_{il} + d_{jk} must be equal. Centroid distances are additive tree distances where the inequalities in the four-point conditions are strengthened to equalities (such that all three sums are equal), and can be represented as d_{ij} = g_i + g_j, i.e., as sums of distances from a “centroid”. See, e.g., Barthélémy and Guénoche (1991) for more details on additive tree distances.
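The four-point condition can be checked directly for small dissimilarity objects; the following helper is a purely illustrative sketch (not part of the package API):

    ## Hypothetical helper: check the four-point condition for a
    ## dissimilarity object (only feasible for small numbers of objects).
    four_point_ok <- function(d, tol = 1e-8) {
        d <- as.matrix(d)
        all(sapply(combn(nrow(d), 4, simplify = FALSE), function(q) {
            i <- q[1]; j <- q[2]; k <- q[3]; l <- q[4]
            s <- sort(c(d[i, j] + d[k, l],
                        d[i, k] + d[j, l],
                        d[i, l] + d[j, k]))
            ## The largest two of the three sums must coincide.
            s[3] - s[2] <= tol
        }))
    }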

as.cl_addtree is a generic function. Its default method can handle objects representing ultrametric distances and raw additive distance matrices. In addition, there is a method for coercing objects of class "phylo" from package ape.

Functions ls_fit_addtree and ls_fit_centroid can be used to find the additive tree distance or centroid distance minimizing least squares distance (Euclidean dissimilarity) to a given dissimilarity object.

There is a plot method for additive tree distances.

Value

An object of class "cl_addtree" containing the additive tree distances.

References

J.-P. Barthélémy and A. Guénoche (1991). Trees and proximity representations. Chichester: John Wiley & Sons. ISBN 0-471-92263-3.


Agreement Between Partitions or Hierarchies

Description

Compute the agreement between (ensembles of) partitions or hierarchies.

Usage

cl_agreement(x, y = NULL, method = "euclidean", ...)

Arguments

x

an ensemble of partitions or hierarchies and dissimilarities, or something coercible to that (see cl_ensemble).

y

NULL (default), or as for x.

method

a character string specifying one of the built-in methods for computing agreement, or a function to be taken as a user-defined method. If a character string, its lower-cased version is matched against the lower-cased names of the available built-in methods using pmatch. See Details for available built-in methods.

...

further arguments to be passed to methods.

Details

If y is given, its components must be of the same kind as those of x (i.e., components must either all be partitions, or all be hierarchies or dissimilarities).

If all components are partitions, the following built-in methods for measuring agreement between two partitions with respective membership matrices u and v (brought to a common number of columns) are available:

"euclidean"

1 - d / m, where d is the Euclidean dissimilarity of the memberships, i.e., the square root of the minimal sum of the squared differences of u and all column permutations of v, and m is an upper bound for the maximal Euclidean dissimilarity. See Dimitriadou, Weingessel and Hornik (2002).

"manhattan"

1 - d / m, where d is the Manhattan dissimilarity of the memberships, i.e., the minimal sum of the absolute differences of u and all column permutations of v, and m is an upper bound for the maximal Manhattan dissimilarity.

"Rand"

the Rand index (the rate of distinct pairs of objects both in the same class or both in different classes in both partitions), see Rand (1971) or Gordon (1999), page 198. For soft partitions, (currently) the Rand index of the corresponding nearest hard partitions is used.

"cRand"

the Rand index corrected for agreement by chance, see Hubert and Arabie (1985) or Gordon (1999), page 198. Can only be used for hard partitions.

"NMI"

Normalized Mutual Information, see Strehl and Ghosh (2002). For soft partitions, (currently) the NMI of the corresponding nearest hard partitions is used.

"KP"

the Katz-Powell index, i.e., the product-moment correlation coefficient between the elements of the co-membership matrices C(u) = u u' and C(v), respectively, see Katz and Powell (1953). For soft partitions, (currently) the Katz-Powell index of the corresponding nearest hard partitions is used. (Note that for hard partitions, the (i,j) entry of C(u) is one iff objects i and j are in the same class.)

"angle"

the maximal cosine of the angle between the elements of u and all column permutations of v.

"diag"

the maximal co-classification rate, i.e., the maximal rate of objects with the same class ids in both partitions after arbitrarily permuting the ids.

"FM"

the index of Fowlkes and Mallows (1983), i.e., the ratio N_{xy} / \sqrt{N_x N_y} of the number N_{xy} of distinct pairs of objects in the same class in both partitions and the geometric mean of the numbers N_x and N_y of distinct pairs of objects in the same class in partition x and partition y, respectively. For soft partitions, (currently) the Fowlkes-Mallows index of the corresponding nearest hard partitions is used.

"Jaccard"

the Jaccard index, i.e., the ratio of the numbers of distinct pairs of objects in the same class in both partitions and in at least one partition, respectively. For soft partitions, (currently) the Jaccard index of the corresponding nearest hard partitions is used.

"purity"

the purity of the classes of x with respect to those of y, i.e., \sum_j \max_i n_{ij} / n, where n_{ij} is the joint frequency of objects in class i for x and in class j for y, and n is the total number of objects.

"PS"

Prediction Strength, see Tibshirani and Walther (2005): the minimum, over all classes j of y, of the maximal rate of objects in the same class for x and in class j for y.

If all components are hierarchies, available built-in methods for measuring agreement between two hierarchies with respective ultrametrics u and v are as follows.

"euclidean"

1 / (1 + d), where d is the Euclidean dissimilarity of the ultrametrics (i.e., the square root of the sum of the squared differences of u and v).

"manhattan"

1 / (1 + d), where d is the Manhattan dissimilarity of the ultrametrics (i.e., the sum of the absolute differences of u and v).

"cophenetic"

The cophenetic correlation coefficient. (I.e., the product-moment correlation of the ultrametrics.)

"angle"

the cosine of the angle between the ultrametrics.

"gamma"

1 - d, where d is the rate of inversions between the associated ultrametrics (i.e., the rate of pairs (i,j) and (k,l) for which u_{ij} < u_{kl} and v_{ij} > v_{kl}). (This agreement measure is a linear transformation of Kruskal's \gamma.)

The measures based on ultrametrics also allow computing agreement with “raw” dissimilarities on the underlying objects (R objects inheriting from class "dist").

If a user-defined agreement method is to be employed, it must be a function taking two clusterings as its arguments.
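For example, a naive user-defined method (purely illustrative, as it ignores permutations of the class labels) could compare the class ids of the nearest hard partitions directly:

    ## A minimal sketch of a user-defined agreement method.
    agree_ids <- function(x, y)
        mean(cl_class_ids(x) == cl_class_ids(y))
    data("CKME")
    cl_agreement(CKME[1 : 3], CKME[4 : 5], method = agree_ids)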

Symmetric agreement objects of class "cl_agreement" are implemented as symmetric proximity objects with self-proximities identical to one, and inherit from class "cl_proximity". They can be coerced to dense square matrices using as.matrix. It is possible to use 2-index matrix-style subscripting for such objects; unless this uses identical row and column indices, this results in a (non-symmetric agreement) object of class "cl_cross_agreement".

Value

If y is NULL, an object of class "cl_agreement" containing the agreements between all pairs of components of x. Otherwise, an object of class "cl_cross_agreement" with the agreements between the components of x and the components of y.

References

E. Dimitriadou, A. Weingessel and K. Hornik (2002). A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16, 901–912. doi:10.1142/S0218001402002052.

E. B. Fowlkes and C. L. Mallows (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78, 553–569. doi:10.1080/01621459.1983.10478008.

A. D. Gordon (1999). Classification (2nd edition). Boca Raton, FL: Chapman & Hall/CRC.

L. Hubert and P. Arabie (1985). Comparing partitions. Journal of Classification, 2, 193–218. doi:10.1007/bf01908075.

W. M. Rand (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846–850. doi:10.2307/2284239.

L. Katz and J. H. Powell (1953). A proposed index of the conformity of one sociometric measurement to another. Psychometrika, 18, 249–256. doi:10.1007/BF02289063.

A. Strehl and J. Ghosh (2002). Cluster ensembles — A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583–617. https://www.jmlr.org/papers/volume3/strehl02a/strehl02a.pdf.

R. Tibshirani and G. Walther (2005). Cluster validation by Prediction Strength. Journal of Computational and Graphical Statistics, 14/3, 511–528. doi:10.1198/106186005X59243.

See Also

cl_dissimilarity; classAgreement in package e1071.

Examples

## An ensemble of partitions.
data("CKME")
pens <- CKME[1 : 20]  # for saving precious time ...
summary(c(cl_agreement(pens)))
summary(c(cl_agreement(pens, method = "Rand")))
summary(c(cl_agreement(pens, method = "diag")))
cl_agreement(pens[1:5], pens[6:7], method = "NMI")
## Equivalently, using subscripting.
cl_agreement(pens, method = "NMI")[1:5, 6:7]
## An ensemble of hierarchies.
d <- dist(USArrests)
hclust_methods <-
    c("ward", "single", "complete", "average", "mcquitty")
hclust_results <- lapply(hclust_methods, function(m) hclust(d, m))
names(hclust_results) <- hclust_methods
hens <- cl_ensemble(list = hclust_results)
summary(c(cl_agreement(hens)))
## Note that the Euclidean agreements are *very* small.
## This is because the ultrametrics differ substantially in height:
u <- lapply(hens, cl_ultrametric)
round(sapply(u, max), 3)
## Rescaling the ultrametrics to [0, 1] gives:
u <- lapply(u, function(x) (x - min(x)) / (max(x) - min(x)))
shens <- cl_ensemble(list = lapply(u, as.cl_dendrogram))
summary(c(cl_agreement(shens)))
## Au contraire ...
summary(c(cl_agreement(hens, method = "cophenetic")))
cl_agreement(hens[1:3], hens[4:5], method = "gamma")

Bagging for Clustering

Description

Construct partitions of objects by running a base clustering algorithm on bootstrap samples from a given data set, and “suitably” aggregating these primary partitions.

Usage

cl_bag(x, B, k = NULL, algorithm = "kmeans", parameters = NULL,
       method = "DFBC1", control = NULL)

Arguments

x

the data set of objects to be clustered, as appropriate for the base clustering algorithm.

B

an integer giving the number of bootstrap replicates.

k

NULL (default), or an integer giving the number of classes to be used for a partitioning base algorithm.

algorithm

a character string or function specifying the base clustering algorithm.

parameters

a named list of additional arguments to be passed to the base algorithm.

method

a character string indicating the bagging method to use. Currently, only method "DFBC1" is available, which implements algorithm BagClust1 in Dudoit & Fridlyand (2003).

control

a list of control parameters for the aggregation. Currently not used.

Details

Bagging for clustering is really more a general conceptual framework than a specific algorithm. If the primary partitions generated in the bootstrap stage form a cluster ensemble (so that class memberships of the objects in x can be obtained), consensus methods for cluster ensembles (as implemented, e.g., in cl_consensus and cl_medoid) can be employed for the aggregation stage. In particular, (possibly new) bagging algorithms can easily be realized by directly running cl_consensus on the results of cl_boot.
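A minimal sketch of such a home-grown bagging procedure, using soft Euclidean consensus and purely illustrative parameter choices:

    data("Cassini")
    set.seed(1234)
    ## Bootstrap partitions, predicting memberships for all objects ...
    ens <- cl_boot(Cassini$x, 10, 3, resample = TRUE)
    ## ... and aggregate them by a consensus method.
    party <- cl_consensus(ens, method = "SE")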

In BagClust1, aggregation proceeds by generating a reference partition by running the base clustering algorithm on the whole given data set, and averaging the ensemble memberships after optimally matching them to the reference partition (in fact, by minimizing Euclidean dissimilarity, see cl_dissimilarity).

If the base clustering algorithm yields prototypes, aggregation can be based on clustering these. This is the idea underlying the “Bagged Clustering” algorithm introduced in Leisch (1999) and implemented by function bclust in package e1071.

Value

An R object representing a partition of the objects given in x.

References

S. Dudoit and J. Fridlyand (2003). Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19/9, 1090–1099. doi:10.1093/bioinformatics/btg038.

F. Leisch (1999). Bagged Clustering. Working Paper 51, SFB “Adaptive Information Systems and Modeling in Economics and Management Science”. doi:10.57938/9b129f95-b53b-44ce-a129-5b7a1168d832.

Examples

set.seed(1234)
## Run BagClust1 on the Cassini data.
data("Cassini")
party <- cl_bag(Cassini$x, 50, 3)
plot(Cassini$x, col = cl_class_ids(party), xlab = "", ylab = "")
## Actually, using fuzzy c-means as a base learner works much better:
if(require("e1071", quietly = TRUE)) {
    party <- cl_bag(Cassini$x, 20, 3, algorithm = "cmeans")
    plot(Cassini$x, col = cl_class_ids(party), xlab = "", ylab = "")
}

Bootstrap Resampling of Clustering Algorithms

Description

Generate bootstrap replicates of the results of applying a “base” clustering algorithm to a given data set.

Usage

cl_boot(x, B, k = NULL,
        algorithm = if (is.null(k)) "hclust" else "kmeans",
        parameters = list(), resample = FALSE)

Arguments

x

the data set of objects to be clustered, as appropriate for the base clustering algorithm.

B

an integer giving the number of bootstrap replicates.

k

NULL (default), or an integer giving the number of classes to be used for a partitioning base algorithm.

algorithm

a character string or function specifying the base clustering algorithm.

parameters

a named list of additional arguments to be passed to the base algorithm.

resample

a logical indicating whether the data should be resampled in addition to “sampling from the algorithm”. If resampling is used, the class memberships of the objects given in x are predicted from the results of running the base algorithm on bootstrap samples of x.

Details

This is a rather simple-minded function with limited applicability, and mostly useful for studying the effect of (uncontrolled) random initializations of fixed-point partitioning algorithms such as kmeans or cmeans, see the examples. To study the effect of varying control parameters or explicitly providing random starting values, the respective cluster ensemble has to be generated explicitly (most conveniently by using replicate to create a list lst of suitable instances of clusterings obtained by the base algorithm, and using cl_ensemble(list = lst) to create the ensemble).
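A minimal sketch of this explicit approach, with an arbitrarily chosen control parameter to vary:

    ## Explicitly create an ensemble of k-means results with varying
    ## numbers of random restarts (an illustrative choice).
    data("Cassini")
    lst <- replicate(5,
                     kmeans(Cassini$x, 3, nstart = sample(1 : 5, 1)),
                     simplify = FALSE)
    pens <- cl_ensemble(list = lst)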

Value

A cluster ensemble of length B, with either (if resampling is not used, default) the results of running the base algorithm on the given data set, or (if resampling is used) the memberships for the given data predicted from the results of running the base algorithm on bootstrap samples of the data.

Examples

## Study e.g. the effect of random kmeans() initializations.
data("Cassini")
pens <- cl_boot(Cassini$x, 15, 3)
diss <- cl_dissimilarity(pens)
summary(c(diss))
plot(hclust(diss))

Cluster Classes

Description

Extract the classes in a partition or hierarchy.

Usage

cl_classes(x)

Arguments

x

an R object representing a partition or hierarchy of objects.

Details

For partitions, the classes are the equivalence classes (“clusters”) of the partition; for soft partitions, the classes of the nearest hard partition are used.

For hierarchies represented by trees, the classes are the sets of objects corresponding to (joined at or split by) the nodes of the tree.
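As a simple illustration (assuming the Cassini data shipped with the package):

    ## Classes of a hard partition are simply the clusters ...
    data("Cassini")
    party <- kmeans(Cassini$x, 3)
    sapply(cl_classes(party), length)    # class sizes
    ## ... while a tree yields one class per node.
    length(cl_classes(hclust(dist(USArrests))))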

Value

A list inheriting from "cl_classes_of_objects" of vectors indicating the classes.


Consensus Partitions and Hierarchies

Description

Compute the consensus clustering of an ensemble of partitions or hierarchies.

Usage

cl_consensus(x, method = NULL, weights = 1, control = list())

Arguments

x

an ensemble of partitions or hierarchies, or something coercible to that (see cl_ensemble).

method

a character string specifying one of the built-in methods for computing consensus clusterings, or a function to be taken as a user-defined method, or NULL (default value). If a character string, its lower-cased version is matched against the lower-cased names of the available built-in methods using pmatch. See Details for available built-in methods and defaults.

weights

a numeric vector with non-negative case weights. Recycled to the number of elements in the ensemble given by x if necessary.

control

a list of control parameters. See Details.

Details

Consensus clusterings “synthesize” the information in the elements of a cluster ensemble into a single clustering, often by minimizing a criterion function measuring how dissimilar consensus candidates are from the (elements of) the ensemble (the so-called “optimization approach” to consensus clustering).

The most popular criterion functions are of the form L(x) = \sum w_b d(x_b, x)^p, where d is a suitable dissimilarity measure (see cl_dissimilarity), w_b is the case weight given to element x_b of the ensemble, and p \ge 1. If p = 1 and minimization is over all possible base clusterings, a consensus solution is called a median of the ensemble; if minimization is restricted to the elements of the ensemble, a consensus solution is called a medoid (see cl_medoid). For p = 2, we obtain least squares consensus partitions and hierarchies (generalized means). See also Gordon (1999) for more information.

If all elements of the ensemble are partitions, the built-in consensus methods compute consensus partitions by minimizing a criterion of the form L(x) = \sum w_b d(x_b, x)^p over all hard or soft partitions x with a given (maximal) number k of classes. Available built-in methods are as follows.

"SE"

a fixed-point algorithm for obtaining soft least squares Euclidean consensus partitions (i.e., for minimizing L with Euclidean dissimilarity d and p = 2 over all soft partitions with a given maximal number of classes).

This iterates between individually matching all partitions to the current approximation to the consensus partition, and computing the next approximation as the membership matrix closest to a suitable weighted average of the memberships of all partitions after permuting their columns for the optimal matchings of class ids.

The following control parameters are available for this method.

k

an integer giving the number of classes to be used for the least squares consensus partition. By default, the maximal number of classes in the ensemble is used.

maxiter

an integer giving the maximal number of iterations to be performed. Defaults to 100.

nruns

an integer giving the number of runs to be performed. Defaults to 1.

reltol

the relative convergence tolerance. Defaults to sqrt(.Machine$double.eps).

start

a matrix with number of rows equal to the number of objects of the cluster ensemble, and k columns, to be used as a starting value, or a list of such matrices. By default, suitable random membership matrices are used.

verbose

a logical indicating whether to provide some output on minimization progress. Defaults to getOption("verbose").

In the case of multiple runs, the first optimum found is returned.

This method can also be referred to as "soft/euclidean".

"GV1"

the fixed-point algorithm for the “first model” in Gordon and Vichi (2001) for minimizing L with d being GV1 dissimilarity and p = 2 over all soft partitions with a given maximal number of classes.

This is similar to "SE", but uses GV1 rather than Euclidean dissimilarity.

Available control parameters are the same as for "SE".

"DWH"

an extension of the greedy algorithm in Dimitriadou, Weingessel and Hornik (2002) for (approximately) obtaining soft least squares Euclidean consensus partitions. The reference provides some structure theory relating finding the consensus partition to an instance of the multiple assignment problem, which is known to be NP-hard, and suggests a simple heuristic based on successively matching an individual partition x_b to the current approximation to the consensus partition, and computing the memberships of the next approximation as a weighted average of those of the current one and of x_b after permuting its columns for the optimal matching of class ids.

The following control parameters are available for this method.

k

an integer giving the number of classes to be used for the least squares consensus partition. By default, the maximal number of classes in the ensemble is used.

order

a permutation of the integers from 1 to the size of the ensemble, specifying the order in which the partitions in the ensemble should be aggregated. Defaults to using a random permutation (unlike the reference, which does not permute at all).

"HE"

a fixed-point algorithm for obtaining hard least squares Euclidean consensus partitions (i.e., for minimizing L with Euclidean dissimilarity d and p = 2 over all hard partitions with a given maximal number of classes).

Available control parameters are the same as for "SE".

This method can also be referred to as "hard/euclidean".

"SM"

a fixed-point algorithm for obtaining soft median Manhattan consensus partitions (i.e., for minimizing L with Manhattan dissimilarity d and p = 1 over all soft partitions with a given maximal number of classes).

Available control parameters are the same as for "SE".

This method can also be referred to as "soft/manhattan".

"HM"

a fixed-point algorithm for obtaining hard median Manhattan consensus partitions (i.e., for minimizing L with Manhattan dissimilarity d and p = 1 over all hard partitions with a given maximal number of classes).

Available control parameters are the same as for "SE".

This method can also be referred to as "hard/manhattan".

"GV3"

a SUMT algorithm for the “third model” in Gordon and Vichi (2001) for minimizing L with d being co-membership dissimilarity and p = 2. (See sumt for more information on the SUMT approach.) This optimization problem is equivalent to finding the membership matrix m for which the sum of the squared differences between C(m) = m m' and the weighted average co-membership matrix \sum_b w_b C(m_b) of the partitions is minimal.

Available control parameters are method, control, eps, q, and verbose, which have the same roles as for sumt, and the following.

k

an integer giving the number of classes to be used for the least squares consensus partition. By default, the maximal number of classes in the ensemble is used.

nruns

an integer giving the number of runs to be performed. Defaults to 1.

start

a matrix with number of rows equal to the size of the cluster ensemble, and k columns, to be used as a starting value, or a list of such matrices. By default, a membership based on a rank k approximation to the weighted average co-membership matrix is used.

In the case of multiple runs, the first optimum found is returned.

"soft/symdiff"

a SUMT approach for minimizing L = \sum w_b d(x_b, x) over all soft partitions with a given maximal number of classes, where d is the Manhattan dissimilarity of the co-membership matrices (coinciding with symdiff partition dissimilarity in the case of hard partitions).

Available control parameters are the same as for "GV3".

"hard/symdiff"

an exact solver for minimizing L = \sum w_b d(x_b, x) over all hard partitions (possibly with a given maximal number of classes as specified by the control parameter k), where d is symdiff partition dissimilarity (so that soft partitions in the ensemble are replaced by their closest hard partitions), or equivalently, Rand distance or pair-bonds (Boorman-Arabie D) distance. The consensus solution is found via mixed linear or quadratic programming.

By default, method "SE" is used for ensembles of partitions.

If all elements of the ensemble are hierarchies, the following built-in methods for computing consensus hierarchies are available.

"euclidean"

an algorithm for minimizing L(x) = \sum w_b d(x_b, x)^2 over all dendrograms, where d is Euclidean dissimilarity. This is equivalent to finding the best least squares ultrametric approximation of the weighted average d = \sum w_b u_b of the ultrametrics u_b of the hierarchies x_b, which is attempted by calling ls_fit_ultrametric on d with appropriate control parameters.

This method can also be referred to as "cophenetic".

"manhattan"

a SUMT for minimizing L = \sum w_b d(x_b, x) over all dendrograms, where d is Manhattan dissimilarity.

Available control parameters are the same as for "euclidean".

"majority"

a hierarchy obtained from an extension of the majority consensus tree of Margush and McMorris (1981), which minimizes L(x) = \sum w_b d(x_b, x) over all dendrograms, where d is the symmetric difference dissimilarity. The unweighted p-majority tree is the n-tree (hierarchy in the strict sense) consisting of all subsets of objects contained in more than 100 p percent of the n-trees T_b induced by the dendrograms, where 1/2 \le p < 1 and p = 1/2 (default) corresponds to the standard majority tree. In the weighted case, it consists of all subsets A for which \sum_{b: A \in T_b} w_b > W p, where W = \sum_b w_b. We also allow for p = 1, which gives the strict consensus tree consisting of all subsets contained in each of the n-trees. The majority dendrogram returned is a representation of the majority tree where all splits have height one.

The fraction p can be specified via the control parameter p.

By default, method "euclidean" is used for ensembles of hierarchies.

If a user-defined consensus method is to be employed, it must be a function taking the cluster ensemble, the case weights, and a list of control parameters as its arguments, with formals named x, weights, and control, respectively.
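For example, a trivial user-defined method (purely illustrative) could ignore the weights and control parameters and simply return the medoid of the ensemble:

    ## A minimal sketch of a user-defined consensus method.
    consensus_medoid <- function(x, weights, control)
        cl_medoid(x)
    data("CKME")
    cl_consensus(CKME[1 : 10], method = consensus_medoid)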

Most built-in methods use heuristics for solving hard optimization problems, and cannot be guaranteed to find a global minimum. Standard practice is to use the best solution found in “sufficiently many” replications of the method.

Value

The consensus partition or hierarchy.

References

E. Dimitriadou, A. Weingessel and K. Hornik (2002). A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16, 901–912. doi:10.1142/S0218001402002052.

A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229–248. doi:10.1007/BF02294837.

A. D. Gordon (1999). Classification (2nd edition). Boca Raton, FL: Chapman & Hall/CRC.

T. Margush and F. R. McMorris (1981). Consensus n-trees. Bulletin of Mathematical Biology, 43, 239–244. doi:10.1007/BF02459446.

See Also

cl_medoid, consensus

Examples

## Consensus partition for the Rosenberg-Kim kinship terms partition
## data based on co-membership dissimilarities.
data("Kinship82")
m1 <- cl_consensus(Kinship82, method = "GV3",
                   control = list(k = 3, verbose = TRUE))
## (Note that one should really use several replicates of this.)
## Value for criterion function to be minimized:
sum(cl_dissimilarity(Kinship82, m1, "comem") ^ 2)
## Compare to the consensus solution given in Gordon & Vichi (2001).
data("Kinship82_Consensus")
m2 <- Kinship82_Consensus[["JMF"]]
sum(cl_dissimilarity(Kinship82, m2, "comem") ^ 2)
## Seems we get a better solution ...
## How dissimilar are these solutions?
cl_dissimilarity(m1, m2, "comem")
## How "fuzzy" are they?
cl_fuzziness(cl_ensemble(m1, m2))
## Do the "nearest" hard partitions fully agree?
cl_dissimilarity(as.cl_hard_partition(m1),
                 as.cl_hard_partition(m2))

## Consensus partition for the Gordon and Vichi (2001) macroeconomic
## partition data based on Euclidean dissimilarities.
data("GVME")
set.seed(1)
## First, using k = 2 classes.
m1 <- cl_consensus(GVME, method = "GV1",
                   control = list(k = 2, verbose = TRUE))
## (Note that one should really use several replicates of this.)
## Value of criterion function to be minimized:
sum(cl_dissimilarity(GVME, m1, "GV1") ^ 2)
## Compare to the consensus solution given in Gordon & Vichi (2001).
data("GVME_Consensus")
m2 <- GVME_Consensus[["MF1/2"]]
sum(cl_dissimilarity(GVME, m2, "GV1") ^ 2)
## Seems we get a slightly better solution ...
## But note that
cl_dissimilarity(m1, m2, "GV1")
## and that the maximal deviation of the memberships is
max(abs(cl_membership(m1) - cl_membership(m2)))
## so the differences seem to be due to rounding.
## Do the "nearest" hard partitions fully agree?
table(cl_class_ids(m1), cl_class_ids(m2))

## And now for k = 3 classes.
m1 <- cl_consensus(GVME, method = "GV1",
                   control = list(k = 3, verbose = TRUE))
sum(cl_dissimilarity(GVME, m1, "GV1") ^ 2)
## Compare to the consensus solution given in Gordon & Vichi (2001).
m2 <- GVME_Consensus[["MF1/3"]]
sum(cl_dissimilarity(GVME, m2, "GV1") ^ 2)
## This time we look much better ...
## How dissimilar are these solutions?
cl_dissimilarity(m1, m2, "GV1")
## Do the "nearest" hard partitions fully agree?
table(cl_class_ids(m1), cl_class_ids(m2))

Dissimilarity Between Partitions or Hierarchies

Description

Compute the dissimilarity between (ensembles of) partitions or hierarchies.

Usage

cl_dissimilarity(x, y = NULL, method = "euclidean", ...)

Arguments

x

an ensemble of partitions or hierarchies and dissimilarities, or something coercible to that (see cl_ensemble).

y

NULL (default), or as for x.

method

a character string specifying one of the built-in methods for computing dissimilarity, or a function to be taken as a user-defined method. If a character string, its lower-cased version is matched against the lower-cased names of the available built-in methods using pmatch. See Details for available built-in methods.

...

further arguments to be passed to methods.

Details

If y is given, its components must be of the same kind as those of x (i.e., components must either all be partitions, or all be hierarchies or dissimilarities).

If all components are partitions, the following built-in methods for measuring dissimilarity between two partitions with respective membership matrices u and v (brought to a common number of columns) are available:

"euclidean"

the Euclidean dissimilarity of the memberships, i.e., the square root of the minimal sum of the squared differences of u and all column permutations of v. See Dimitriadou, Weingessel and Hornik (2002).

"manhattan"

the Manhattan dissimilarity of the memberships, i.e., the minimal sum of the absolute differences of u and all column permutations of v.

"comemberships"

the Euclidean dissimilarity of the elements of the co-membership matrices C(u) = u u' and C(v), i.e., the square root of the sum of the squared differences of C(u) and C(v).

"symdiff"

the cardinality of the symmetric set difference of the sets of co-classified pairs of distinct objects in the partitions. I.e., the number of distinct pairs of objects in the same class in exactly one of the partitions. (Alternatively, the cardinality of the symmetric set difference between the (binary) equivalence relations corresponding to the partitions.) For soft partitions, (currently) the symmetric set difference of the corresponding nearest hard partitions is used.

"Rand"

the Rand distance, i.e., the rate of distinct pairs of objects in the same class in exactly one of the partitions. (Related to the Rand index a via the linear transformation d = (1 - a) / 2.) For soft partitions, (currently) the Rand distance of the corresponding nearest hard partitions is used.

"GV1"

the square root of the dissimilarity \Delta_1 used for the first model in Gordon and Vichi (2001), i.e., the square root of the minimal sum of the squared differences of the matched non-zero columns of u and v.

"BA/d"

distance measures for hard partitions discussed in Boorman and Arabie (1972), with d one of ‘A’, ‘C’, ‘D’, or ‘E’. For soft partitions, the distances of the corresponding nearest hard partitions are used.

"BA/A" is the minimum number of single element moves (move from one class to another or a new one) needed to transform one partition into the other. Introduced in Rubin (1967).

"BA/C" is the minimum number of lattice moves for transforming one partition into the other, where partitions are said to be connected by a lattice move if one is just finer than the other (i.e., there is no other partition between them) in the partition lattice (see cl_meet). Equivalently, with z the join of x and y and S giving the number of classes, this can be written as S(x) + S(y) - 2 S(z). Attributed to David Pavy.

"BA/D" is the “pair-bonds” distance, which can be defined as S(x) + S(y) - 2 S(z), with z the meet of x and y and S the supervaluation (i.e., non-decreasing with respect to the partial order on the partition lattice) function \sum_i (n_i (n_i - 1)) / (n (n - 1)), where the n_i are the numbers of objects in the respective classes of the partition (such that n_i (n_i - 1) / 2 are the numbers of pair bonds in the classes), and n the total number of objects.

"BA/E" is the normalized information distance, defined as 1 - I / H, where I is the average mutual information between the partitions, and H is the average entropy of the meet z of the partitions. Introduced in Rajski (1961).

(Boorman and Arabie also discuss a distance measure (B) based on the minimum number of set moves needed to transform one partition into the other, which, differently from the A and C distance measures, is hard to compute (Day, 1981) and (currently) not provided.)

"VI"

Variation of Information, see Meila (2003). If ... has an argument named weights, it is taken to specify case weights.

"Mallows"

the Mallows-type distance by Zhou, Li and Zha (2005), which is related to the Monge-Kantorovich mass transfer problem, and given as the p-th root of the minimal value of the transportation problem \sum w_{jk} \sum_i |u_{ij} - v_{ik}|^p with constraints w_{jk} \ge 0, \sum_j w_{jk} = \alpha_j, \sum_k w_{jk} = \beta_k, where \sum_j \alpha_j = \sum_k \beta_k. The parameters p, \alpha and \beta all default to one (in this case, the Mallows distance coincides with the Manhattan dissimilarity), and can be specified via additional arguments named p, alpha, and beta, respectively.

"CSSD"

the Cluster Similarity Sensitive Distance of Zhou, Li and Zha (2005), which is given as the minimal value of \sum_{k,l} (1 - 2 w_{kl} / (\alpha_k + \beta_l)) L_{kl}, where L_{kl} = \sum_i u_{ik} v_{il} d(p_{x;k}, p_{y;l}) with p_{x;k} and p_{y;l} the prototypes of the k-th class of x and the l-th class of y, respectively, d is the distance between these, and the w_{kl} as for Mallows distance. If prototypes are matrices, the Euclidean distance between these is used as default. Using the additional argument L, one can give a matrix of L_{kl} values, or the function d. Parameters \alpha and \beta all default to one, and can be specified via additional arguments named alpha and beta, respectively.

For hard partitions, both Manhattan and squared Euclidean dissimilarity give twice the transfer distance (Charon et al., 2006), which is the minimum number of objects that must be removed so that the implied partitions (restrictions to the remaining objects) are identical. This is also known as the R-metric in Day (1981), i.e., the number of augmentations and removals of single objects needed to transform one partition into the other, and the partition-distance in Gusfield (2002), and equals twice the number of single element moves distance of Boorman and Arabie.

For hard partitions, the pair-bonds (Boorman-Arabie D) distance is identical to the Rand distance, and can also be written as the Manhattan distance between the co-membership matrices corresponding to the partitions, or equivalently, their symdiff distance, normalized by n (n - 1).
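A minimal sketch illustrating the transfer distance relation for two hard partitions which differ by a single element move (so that both measures give 2):

    P1 <- as.cl_partition(c(1, 1, 2, 2, 3, 3))
    P2 <- as.cl_partition(c(1, 1, 1, 2, 3, 3))
    ## Twice the transfer distance, both ways:
    cl_dissimilarity(P1, P2, "manhattan")
    cl_dissimilarity(P1, P2, "euclidean") ^ 2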

If all components are hierarchies, available built-in methods for measuring dissimilarity between two hierarchies with respective ultrametrics u and v are as follows.

"euclidean"

the Euclidean dissimilarity of the ultrametrics (i.e., the square root of the sum of the squared differences of u and v).

"manhattan"

the Manhattan dissimilarity of the ultrametrics (i.e., the sum of the absolute differences of u and v).

"cophenetic"

1 - c^2, where c is the cophenetic correlation coefficient (i.e., the product-moment correlation of the ultrametrics).

"gamma"

the rate of inversions between the ultrametrics (i.e., the rate of pairs (i,j) and (k,l) for which u_{ij} < u_{kl} and v_{ij} > v_{kl}).

"symdiff"

the cardinality of the symmetric set difference of the sets of classes (hierarchies in the strict sense) induced by the dendrograms. I.e., the number of sets of objects obtained by a split in exactly one of the hierarchies.

"Chebyshev"

the Chebyshev (maximal) dissimilarity of the ultrametrics (i.e., the maximum of the absolute differences of u and v).

"Lyapunov"

the logarithm of the product of the maximal and minimal ratios of the ultrametrics. This is also known as the “Hilbert projective metric” on the cone represented by the ultrametrics (e.g., Jardine & Sibson (1971), page 107), and only defined for strict ultrametrics (which are strictly positive for distinct objects).

"BO"

the m_\delta family of tree metrics by Boorman and Olivier (1973), which are of the form m_\delta = \int_0^\infty \delta(p(h), q(h)) dh, where p(h) and q(h) are the hard partitions obtained by cutting the trees (dendrograms) at height h, and \delta is a suitable dissimilarity measure for partitions. In particular, when taking \delta as symdiff or Rand dissimilarity, m_\delta is the Manhattan dissimilarity of the hierarchies.

If ... has an argument named delta, it is taken to specify the partition dissimilarity \delta to be employed.

"spectral"

the spectral norm (2-norm) of the differences of the ultrametrics, suggested in Mérigot, Durbec and Gaertner (2010).

The measures based on ultrametrics also allow computing dissimilarity with “raw” dissimilarities on the underlying objects (R objects inheriting from class "dist").

If a user-defined dissimilarity method is to be employed, it must be a function taking two clusterings as its arguments.

Symmetric dissimilarity objects of class "cl_dissimilarity" are implemented as symmetric proximity objects with self-proximities identical to zero, and inherit from class "cl_proximity". They can be coerced to dense square matrices using as.matrix. It is possible to use 2-index matrix-style subscripting for such objects; unless this uses identical row and column indices, this results in a (non-symmetric dissimilarity) object of class "cl_cross_dissimilarity".

Symmetric dissimilarity objects also inherit from class "dist" (although they currently do not “strictly” extend this class), thus making it possible to use them directly for clustering algorithms based on dissimilarity matrices of this class, see the examples.

Value

If y is NULL, an object of class "cl_dissimilarity" containing the dissimilarities between all pairs of components of x. Otherwise, an object of class "cl_cross_dissimilarity" with the dissimilarities between the components of x and the components of y.

References

S. A. Boorman and P. Arabie (1972). Structural measures and the method of sorting. In R. N. Shepard, A. K. Romney, & S. B. Nerlove (eds.), Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, 1: Theory (pages 225–249). New York: Seminar Press.

S. A. Boorman and D. C. Olivier (1973). Metrics on spaces of finite trees. Journal of Mathematical Psychology, 10, 26–59. doi:10.1016/0022-2496(73)90003-5.

I. Charon, L. Denoeud, A. Guénoche and O. Hudry (2006). Maximum Transfer Distance Between Partitions. Journal of Classification, 23, 103–121. doi:10.1007/s00357-006-0006-2.

W. E. H. Day (1981). The complexity of computing metric distances between partitions. Mathematical Social Sciences, 1, 269–287. doi:10.1016/0165-4896(81)90042-1.

E. Dimitriadou, A. Weingessel and K. Hornik (2002). A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16, 901–912. doi:10.1142/S0218001402002052.

A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229–248. doi:10.1007/BF02294837.

D. Gusfield (2002). Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters, 82, 159–164. doi:10.1016/S0020-0190(01)00263-0.

N. Jardine and R. Sibson (1971). Mathematical Taxonomy. London: Wiley.

M. Meila (2003). Comparing clusterings by the variation of information. In B. Schölkopf and M. K. Warmuth (eds.), Learning Theory and Kernel Machines, pages 173–187. Springer-Verlag: Lecture Notes in Computer Science 2777.

B. Mérigot, J.-P. Durbec and J.-C. Gaertner (2010). On goodness-of-fit measure for dendrogram-based analyses. Ecology, 91, 1850–1859. doi:10.1890/09-1387.1.

C. Rajski (1961). A metric space of discrete probability distributions. Information and Control, 4, 371–377. doi:10.1016/S0019-9958(61)80055-7.

J. Rubin (1967). Optimal classification into groups: An approach for solving the taxonomy problem. Journal of Theoretical Biology, 15, 103–144. doi:10.1016/0022-5193(67)90046-X.

D. Zhou, J. Li and H. Zha (2005). A new Mallows distance based metric for comparing clusterings. In Proceedings of the 22nd International Conference on Machine Learning (Bonn, Germany, August 07–11, 2005), pages 1028–1035. ICML '05, volume 119. ACM Press, New York, NY. doi:10.1145/1102351.1102481.

See Also

cl_agreement

Examples

## An ensemble of partitions.
data("CKME")
pens <- CKME[1 : 30]
diss <- cl_dissimilarity(pens)
summary(c(diss))
cl_dissimilarity(pens[1:5], pens[6:7])
## Equivalently, using subscripting.
diss[1:5, 6:7]
## Can use the dissimilarities for "secondary" clustering
## (e.g. obtaining hierarchies of partitions):
hc <- hclust(diss)
plot(hc)
## Example from Boorman and Arabie (1972).
P1 <- as.cl_partition(c(1, 2, 2, 2, 3, 3, 2, 2))
P2 <- as.cl_partition(c(1, 1, 2, 2, 3, 3, 4, 4))
cl_dissimilarity(P1, P2, "BA/A")
cl_dissimilarity(P1, P2, "BA/C")
## Hierarchical clustering.
d <- dist(USArrests)
x <- hclust(d)
cl_dissimilarity(x, d, "cophenetic")
cl_dissimilarity(x, d, "gamma")

Cluster Ensembles

Description

Creation and manipulation of cluster ensembles.

Usage

cl_ensemble(..., list = NULL)
as.cl_ensemble(x)
is.cl_ensemble(x)

Arguments

...

R objects representing clusterings of or dissimilarities between the same objects.

list

a list of R objects as in ....

x

for as.cl_ensemble, an R object as in ...; for is.cl_ensemble, an arbitrary R object.

Details

cl_ensemble creates “cluster ensembles”, which are realized as lists of clusterings (or dissimilarities) with additional class information, always inheriting from "cl_ensemble". All elements of the ensemble must have the same number of objects.

If all elements are partitions, the ensemble has class "cl_partition_ensemble"; if all elements are dendrograms, it has class "cl_dendrogram_ensemble" and inherits from "cl_hierarchy_ensemble"; if all elements are hierarchies (but not always dendrograms), it has class "cl_hierarchy_ensemble". Note that empty or “mixed” ensembles cannot be categorized according to the kind of elements they contain, and hence only have class "cl_ensemble".

The list representation makes it possible to use lapply for computations on the individual clusterings in (i.e., the components of) a cluster ensemble.
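For instance, using the CKME ensemble shipped with the package:

    ## Ensembles are lists, so standard list operations apply, e.g.:
    data("CKME")
    table(sapply(CKME, n_of_classes))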

Available methods for cluster ensembles include those for subscripting, c, rep, and print. There is also a plot method for ensembles for which all elements can be plotted (currently, additive trees, dendrograms and ultrametrics).

Value

cl_ensemble returns a list of the given clusterings or dissimilarities, with additional class information (see Details).

Examples

d <- dist(USArrests)
hclust_methods <-
    c("ward", "single", "complete", "average", "mcquitty")
hclust_results <- lapply(hclust_methods, function(m) hclust(d, m))
names(hclust_results) <- hclust_methods
## Now create an ensemble from the results.
hens <- cl_ensemble(list = hclust_results)
hens
## Subscripting.
hens[1 : 3]
## Replication.
rep(hens, 3)
## Plotting.
plot(hens, main = names(hens))
## And continue to analyze the ensemble, e.g.
round(cl_dissimilarity(hens, method = "gamma"), 4)

Partition Fuzziness

Description

Compute the fuzziness of partitions.

Usage

cl_fuzziness(x, method = NULL, normalize = TRUE)

Arguments

x

a cluster ensemble of partitions, or an R object coercible to such.

method

a character string indicating the fuzziness measure to be employed, or NULL (default), or a function to be taken as a user-defined method. Currently available built-in methods are "PC" (Partition Coefficient) and "PE" (Partition Entropy), with the default corresponding to the first one. If method is a character string, its lower-cased version is matched against the lower-cased names of the available built-in methods using pmatch.

normalize

a logical indicating whether the fuzziness measure should be normalized in a way that hard partitions have value 0, and “completely fuzzy” partitions (where for all objects, all classes get the same membership) have value 1.

Details

If m contains the membership values of a partition, the (unnormalized) Partition Coefficient and Partition Entropy are given by \sum_{n,i} m_{n,i}^2 and \sum_{n,i} H(m_{n,i}), respectively, where H(u) = u \log u - (1-u) \log(1-u).
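A minimal sketch computing the unnormalized Partition Coefficient directly from the membership values of a soft partition (assuming package e1071 for fuzzy c-means):

    if(require("e1071", quietly = TRUE)) {
        data("Cassini")
        party <- cmeans(Cassini$x, 3)
        m <- cl_membership(party)
        sum(m ^ 2)    # unnormalized Partition Coefficient
    }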

Note that the normalization used here is different from the normalizations typically found in the literature.

If a user-defined fuzziness method is to be employed, it must be a function taking a matrix of membership values and a logical indicating whether normalization is to be performed as its arguments (in that order; argument names are not used).

Value

An object of class "cl_fuzziness" giving the fuzziness values.

References

J. C. Bezdek (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum.

See Also

Function fclustIndex in package e1071, which also computes several other “fuzzy cluster indexes” (typically based on more information than just the membership values).

Examples

if(require("e1071", quietly = TRUE)) {
    ## Use an on-line version of fuzzy c-means from package e1071 if
    ## available.
    data("Cassini")
    pens <- cl_boot(Cassini$x, B = 15, k = 3, algorithm = "cmeans",
                    parameters = list(method = "ufcl"))
    pens
    summary(cl_fuzziness(pens, "PC"))
    summary(cl_fuzziness(pens, "PE"))
}

Membership Margins

Description

Compute the margin of the memberships of a partition, i.e., the difference between the largest and second largest membership values of the respective objects.

Usage

cl_margin(x)

Arguments

x

an R object representing a partition of objects.

Details

For hard partitions, the margins are always 1.

For soft partitions, the margins may be taken as an indication of the “sureness” of classifying an object to the class with maximum membership value.
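Equivalently, the margins can be computed by hand from the membership values, as in this sketch:

    data("GVME")
    m <- cl_membership(GVME[["1980"]])
    ## Difference between the two largest memberships in each row:
    by_hand <- apply(m, 1, function(u) {
        s <- sort(u, decreasing = TRUE)
        s[1] - s[2]
    })
    all.equal(as.numeric(cl_margin(GVME[["1980"]])), as.numeric(by_hand))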

Examples

data("GVME")
## Look at the classes obtained for 1980:
split(cl_object_names(GVME[["1980"]]), cl_class_ids(GVME[["1980"]]))
## Margins:
x <- cl_margin(GVME[["1980"]])
## Add names, and sort:
names(x) <- cl_object_names(GVME[["1980"]])
sort(x)
## Note the "uncertainty" of assigning Egypt to the "intermediate" class
## of nations.

Medoid Partitions and Hierarchies

Description

Compute the medoid of an ensemble of partitions or hierarchies, i.e., the element of the ensemble minimizing the sum of dissimilarities to all other elements.

Usage

cl_medoid(x, method = "euclidean")

Arguments

x

an ensemble of partitions or hierarchies, or something coercible to that (see cl_ensemble).

method

a character string or a function, as for argument method of function cl_dissimilarity.

Details

Medoid clusterings are special cases of “consensus” clusterings characterized as the solutions of an optimization problem. See Gordon (1999) for more information.

The dissimilarities d for determining the medoid are obtained by calling cl_dissimilarity with arguments x and method. The medoid can then be found as the (first) row index for which the row sum of as.matrix(d) is minimal. Modulo possible differences in the case of ties, this gives the same results as (the medoid obtained by) pam in package cluster.
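A minimal sketch of this computation (up to possible ties):

    data("CKME")
    pens <- CKME[1 : 20]
    d <- cl_dissimilarity(pens)
    ind <- which.min(rowSums(as.matrix(d)))
    ## pens[[ind]] coincides with the medoid (modulo ties):
    cl_agreement(pens[[ind]], cl_medoid(pens))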

Value

The medoid partition or hierarchy.

References

A. D. Gordon (1999).Classification (2nd edition).Boca Raton, FL: Chapman & Hall/CRC.

See Also

cl_consensus

Examples

## An ensemble of partitions.
data("CKME")
pens <- CKME[1 : 20]
m1 <- cl_medoid(pens)
diss <- cl_dissimilarity(pens)
require("cluster")
m2 <- pens[[pam(diss, 1)$medoids]]
## Agreement of medoid consensus partitions.
cl_agreement(m1, m2)
## Or, more straightforwardly:
table(cl_class_ids(m1), cl_class_ids(m2))

Memberships of Partitions

Description

Compute the membership values for objects representing partitions.

Usage

cl_membership(x, k = n_of_classes(x))
as.cl_membership(x)

Arguments

x

an R object representing a partition of objects (for cl_membership) or raw memberships or class ids (for as.cl_membership).

k

an integer giving the number of columns (corresponding to class ids) to be used in the membership matrix. Must not be less than, and defaults to, the number of classes in the partition.

Details

cl_membership is a generic function.

The methods provided in package clue handle the partitions obtained from clustering functions in the base R distribution, as well as packages RWeka, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself).

as.cl_membership can be used for coercing “raw” classids (given as atomic vectors) or membership values (given as numericmatrices) to membership objects.

Value

An object of class "cl_membership" with the matrix of membership values.

See Also

is.cl_partition

Examples

## Getting the memberships of a single soft partition.
d <- dist(USArrests)
hclust_methods <-
    c("ward", "single", "complete", "average", "mcquitty")
hclust_results <- lapply(hclust_methods, function(m) hclust(d, m))
names(hclust_results) <- hclust_methods
## Now create an ensemble from the results.
hens <- cl_ensemble(list = hclust_results)
## And add the results of agnes and diana.
require("cluster")
hens <- c(hens, list(agnes = agnes(d), diana = diana(d)))
## Create a dissimilarity object from this.
d1 <- cl_dissimilarity(hens)
## And compute a soft partition.
party <- fanny(d1, 2)
round(cl_membership(party), 5)
## The "nearest" hard partition to this:
as.cl_hard_partition(party)
## (which has the same class ids as cl_class_ids(party)).

## Extracting the memberships from the elements of an ensemble of
## partitions.
pens <- cl_boot(USArrests, 30, 3)
pens
mems <- lapply(pens, cl_membership)
## And turning these raw memberships into an ensemble of partitions.
pens <- cl_ensemble(list = lapply(mems, as.cl_partition))
pens
pens[[length(pens)]]

Find Object Names

Description

Find the names of the objects from which a taxonomy (partition or hierarchy) or proximity was obtained.

Usage

cl_object_names(x)

Arguments

x

an R object representing a taxonomy or proximity.

Details

This is a generic function.

The methods provided in package clue handle the partitions and hierarchies obtained from clustering functions in the base R distribution, as well as packages RWeka, ape, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself), in as much as possible.

There is also a method for object dissimilarities which inherit from class "dist".

Value

A character vector of length n_of_objects(x) in case the names of the objects could be determined, or NULL.
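Examples

A minimal illustrative sketch (not from the original documentation): for a hierarchy computed by hclust, the object names should be the labels of the clustered objects, here the row names of USArrests.

hcl <- hclust(dist(USArrests))
## Should give the state names used as row names of USArrests:
cl_object_names(hcl)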


K-Medoids Partitions of Clusterings

Description

Compute k-medoids partitions of clusterings.

Usage

cl_pam(x, k, method = "euclidean", solver = c("pam", "kmedoids"))

Arguments

x

an ensemble of partitions or hierarchies, or something coercible to that (see cl_ensemble).

k

an integer giving the number of classes to be used in the partition.

method

a character string or a function, as for argument method of function cl_dissimilarity.

solver

a character string indicating the k-medoids solver to be employed. May be abbreviated. If "pam" (default), the Partitioning Around Medoids (Kaufman & Rousseeuw (1990), Chapter 2) heuristic pam of package cluster is used. Otherwise, the exact algorithm of kmedoids is employed.

Details

An optimal k-medoids partition of the given cluster ensemble is defined as a partition of the objects x_i (the elements of the ensemble) into k classes C_1, \ldots, C_k such that the criterion function L = \sum_{l=1}^k \min_{j \in C_l} \sum_{i \in C_l} d(x_i, x_j) is minimized.

Such secondary partitions (e.g., Gordon & Vichi, 1998) are obtained by computing the dissimilarities d of the objects in the ensemble for the given dissimilarity method, and applying a dissimilarity-based k-medoids solver to d.

Value

An object of class "cl_pam" representing the obtained “secondary” partition, which is a list with the following components.

cluster

the class ids of the partition.

medoid_ids

the indices of the medoids.

prototypes

a cluster ensemble with the k prototypes (medoids).

criterion

the value of the criterion function of the partition.

description

a character string indicating the dissimilarity method employed.

References

L. Kaufman and P. J. Rousseeuw (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

A. D. Gordon and M. Vichi (1998). Partitions of partitions. Journal of Classification, 15, 265–285. doi:10.1007/s003579900034.

See Also

cl_pclust for more general prototype-based partitions of clusterings.

Examples

data("Kinship82")party <- cl_pam(Kinship82, 3, "symdiff")## Compare results with tables 5 and 6 in Gordon & Vichi (1998).partylapply(cl_prototypes(party), cl_classes)table(cl_class_ids(party))

Prototype-Based Partitions of Clusterings

Description

Compute prototype-based partitions of a cluster ensemble by minimizing \sum w_b u_{bj}^m d(x_b, p_j)^e, the sum of the case-weighted and membership-weighted e-th powers of the dissimilarities between the elements x_b of the ensemble and the prototypes p_j, for suitable dissimilarities d and exponents e.

Usage

cl_pclust(x, k, method = NULL, m = 1, weights = 1,
          control = list())

Arguments

x

an ensemble of partitions or hierarchies, or something coercible to that (see cl_ensemble).

k

an integer giving the number of classes to be used in the partition.

method

the consensus method to be employed, see cl_consensus.

m

a number not less than 1 controlling the softness of the partition (as the “fuzzification parameter” of the fuzzy c-means algorithm). The default value of 1 corresponds to hard partitions obtained from a generalized k-means problem; values greater than one give partitions of increasing softness obtained from a generalized fuzzy c-means problem.

weights

a numeric vector of non-negative case weights. Recycled to the number of elements in the ensemble given by x if necessary.

control

a list of control parameters. See Details.

Details

Partitioning is performed using pclust via a family constructed from method. The dissimilarities d and exponent e are implied by the consensus method employed, and inferred via a registration mechanism currently only made available to built-in consensus methods. The default methods compute Least Squares Euclidean consensus clusterings, i.e., use Euclidean dissimilarity d and e = 2.

For m = 1, the partitioning procedure was introduced by Gaul and Schader (1988) for “Clusterwise Aggregation of Relations” (with the same domains), containing equivalence relations, i.e., hard partitions, as a special case.

Available control parameters are as for pclust.

The fixed point approach employed is a heuristic which cannot be guaranteed to find the global minimum (as this is already true for the computation of consensus clusterings). Standard practice would recommend to use the best solution found in “sufficiently many” replications of the base algorithm.

Value

An object of class "cl_partition" representing the obtained “secondary” partition by an object of class "cl_pclust", which is a list containing at least the following components.

prototypes

a cluster ensemble with the k prototypes.

membership

an object of class "cl_membership" with the membership values u_{bj}.

cluster

the class ids of the nearest hard partition.

silhouette

Silhouette information for the partition, see silhouette.

validity

precomputed validity measures for the partition.

m

the softness control argument.

call

the matched call.

d

the dissimilarity function d = d(x, p) employed.

e

the exponent e employed.

References

J. C. Bezdek (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum.

W. Gaul and M. Schader (1988). Clusterwise aggregation of relations. Applied Stochastic Models and Data Analysis, 4, 273–282. doi:10.1002/asm.3150040406.

Examples

## Use a precomputed ensemble of 50 k-means partitions of the
## Cassini data.
data("CKME")
CKME <- CKME[1 : 30]    # for saving precious time ...
diss <- cl_dissimilarity(CKME)
hc <- hclust(diss)
plot(hc)
## This suggests using a partition with three classes, which can be
## obtained using cutree(hc, 3).  Could use cl_consensus() to compute
## prototypes as the least squares consensus clusterings of the classes,
## or alternatively:
set.seed(123)
x1 <- cl_pclust(CKME, 3, m = 1)
x2 <- cl_pclust(CKME, 3, m = 2)
## Agreement of solutions.
cl_dissimilarity(x1, x2)
table(cl_class_ids(x1), cl_class_ids(x2))

Predict Memberships

Description

Predict class ids or memberships from R objects representing partitions.

Usage

cl_predict(object, newdata = NULL,
           type = c("class_ids", "memberships"), ...)

Arguments

object

an R object representing a partition of objects.

newdata

an optional data set giving the objects to make predictions for. This must be of the same “kind” as the data set employed for obtaining the partition. If omitted, the original data are used.

type

a character string indicating whether class ids or memberships should be returned. May be abbreviated.

...

arguments to be passed to and from methods.

Details

Many algorithms resulting in partitions of a given set of objects can be taken to induce a partition of the underlying feature space for the measurements on the objects, so that class memberships for “new” objects can be obtained from the induced partition. Examples include partitions based on assigning objects to their “closest” prototypes, or providing mixture models for the distribution of objects in feature space.

This is a generic function. The methods provided in package clue handle the partitions obtained from clustering functions in the base R distribution, as well as packages RWeka, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself).

Value

Depending on type, an object of class "cl_class_ids" with the predicted class ids, or of class "cl_membership" with the matrix of predicted membership values.

Examples

## Run kmeans on a random subset of the Cassini data, and predict the
## memberships for the "test" data set.
data("Cassini")
nr <- NROW(Cassini$x)
ind <- sample(nr, 0.9 * nr, replace = FALSE)
party <- kmeans(Cassini$x[ind, ], 3)
table(cl_predict(party, Cassini$x[-ind, ]),
      Cassini$classes[-ind])

Partition Prototypes

Description

Determine prototypes for the classes of an R object representing a partition.

Usage

cl_prototypes(x)

Arguments

x

an R object representing a partition of objects.

Details

Many partitioning methods are based on prototypes (“centers”, “centroids”, “medoids”, ...). In typical cases, these are points in the feature space for the measurements on the objects to be partitioned, such that one can quantify the distance between the objects and the prototypes, and, e.g., classify objects to their closest prototype.

This is a generic function. The methods provided in package clue handle the partitions obtained from clustering functions in the base R distribution, as well as packages cba, cclust, cluster, e1071, flexclust, kernlab, and mclust (and of course, clue itself).

Examples

## Show how prototypes ("centers") vary across k-means runs on
## bootstrap samples from the Cassini data.
data("Cassini")
nr <- NROW(Cassini$x)
out <- replicate(50,
                 { kmeans(Cassini$x[sample(nr, replace = TRUE), ], 3) },
                 simplify = FALSE)
## Plot the data points in light gray, and the prototypes found.
plot(Cassini$x, col = gray(0.8))
points(do.call("rbind", lapply(out, cl_prototypes)), pch = 19)

Tabulate Vector Objects

Description

Tabulate the unique values in vector objects.

Usage

cl_tabulate(x)

Arguments

x

a vector.

Value

A data frame with components:

values

the unique values.

counts

an integer vector with the number of times each of the unique values occurs in x.

Examples

data("Kinship82")tab <- cl_tabulate(Kinship82)## The counts:tab$counts## The most frequent partition:tab$values[[which.max(tab$counts)]]

Ultrametrics of Hierarchies

Description

Compute the ultrametric distances for objects representing (total indexed) hierarchies.

Usage

cl_ultrametric(x, size = NULL, labels = NULL)
as.cl_ultrametric(x)

Arguments

x

an R object representing a (total indexed) hierarchy of objects.

size

an integer giving the number of objects in the hierarchy.

labels

a character vector giving the names of the objects in the hierarchy.

Details

If x is not an ultrametric or a hierarchy with an ultrametric representation, cl_ultrametric uses cophenetic to obtain the ultrametric (also known as cophenetic) distances from the hierarchy, which in turn by default calls the S3 generic as.hclust on the hierarchy. Support for a class which represents hierarchies can thus be added by providing as.hclust methods for this class. In R 2.1.0 or better, cophenetic is an S3 generic as well, and one can also more directly provide methods for this if necessary.

as.cl_ultrametric is a generic function which can be used for coercing raw (non-classed) ultrametrics, represented as numeric vectors (of the lower-half entries) or numeric matrices, to ultrametric objects.

Ultrametric objects are implemented as symmetric proximity objects with a dissimilarity interpretation so that self-proximities are zero, and inherit from classes "cl_dissimilarity" and "cl_proximity". See section Details in the documentation for cl_dissimilarity for implications.

Ultrametric objects can also be coerced to classes "dendrogram" and "hclust", and hence in particular use the plot methods for these classes. By default, plotting an ultrametric object uses the plot method for dendrograms.

Value

An object of class "cl_ultrametric" containing the ultrametric distances.

See Also

is.cl_hierarchy

Examples

hc <- hclust(dist(USArrests))
u <- cl_ultrametric(hc)
## Subscripting.
u[1 : 5, 1 : 5]
u[1 : 5, 6 : 7]
## Plotting.
plot(u)

Validity Measures for Partitions and Hierarchies

Description

Compute validity measures for partitions and hierarchies, attempting to measure how well these clusterings capture the underlying structure in the data they were obtained from.

Usage

cl_validity(x, ...)
## Default S3 method:
cl_validity(x, d, ...)

Arguments

x

an object representing a partition or hierarchy.

d

a dissimilarity object from which x was obtained.

...

arguments to be passed to or from methods.

Details

cl_validity is a generic function.

For partitions, its default method gives the “dissimilarity accounted for”, defined as 1 - a_w / a_t, where a_t is the average total dissimilarity, and the “average within dissimilarity” a_w is given by

\frac{\sum_{i,j} \sum_k m_{ik} m_{jk} d_{ij}}{\sum_{i,j} \sum_k m_{ik} m_{jk}}

where d and m are the dissimilarities and memberships, respectively, and the sums are over all pairs of objects and all classes.

For hierarchies, the validity measures computed by default are “variance accounted for” (VAF, e.g., Hubert, Arabie & Meulman, 2006) and “deviance accounted for” (DEV, e.g., Smith, 2001). If u is the ultrametric corresponding to the hierarchy x and d the dissimilarity x was obtained from, these validity measures are given by

\mathrm{VAF} = \max\left(0, 1 - \frac{\sum_{i,j} (d_{ij} - u_{ij})^2}{ \sum_{i,j} (d_{ij} - \mathrm{mean}(d)) ^ 2}\right)

and

\mathrm{DEV} = \max\left(0, 1 - \frac{\sum_{i,j} |d_{ij} - u_{ij}|}{ \sum_{i,j} |d_{ij} - \mathrm{median}(d)|}\right)

respectively. Note that VAF and DEV are not invariant under rescaling u, and may be “arbitrarily small” (i.e., 0 using the above definitions) even though u and d are “structurally close” in some sense.

For the results of using agnes and diana, the agglomerative and divisive coefficients are provided in addition to the default ones.

Value

A list of class "cl_validity" with the computed validity measures.

References

L. Hubert, P. Arabie and J. Meulman (2006). The structural representation of proximity matrices with MATLAB. Philadelphia, PA: SIAM.

T. J. Smith (2001). Constructing ultrametric and additive trees based on the L_1 norm. Journal of Classification, 18, 185–207. https://link.springer.com/article/10.1007/s00357-001-0015-0.

See Also

cluster.stats in package fpc for a variety of cluster validation statistics; fclustIndex in package e1071 for several fuzzy cluster indexes; clustIndex in package cclust; silhouette in package cluster.
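Examples

A minimal illustrative sketch (not from the original documentation; the partition and dissimilarity are arbitrary choices): the “dissimilarity accounted for” by a k-means partition of the Cassini data.

data("Cassini")
party <- kmeans(Cassini$x, 3)
## Validity of the partition with respect to the Euclidean
## dissimilarities of the underlying data:
cl_validity(party, dist(Cassini$x))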


Fit Dissimilarities to a Hierarchy

Description

Find the ultrametric from a target equivalence class of hierarchies which minimizes weighted Euclidean or Manhattan dissimilarity to a given dissimilarity object.

Usage

ls_fit_ultrametric_target(x, y, weights = 1)
l1_fit_ultrametric_target(x, y, weights = 1)

Arguments

x

a dissimilarity object inheriting from class "dist".

y

a target hierarchy.

weights

a numeric vector or matrix with non-negative weights for obtaining a weighted fit. If a matrix, its numbers of rows and columns must be the same as the number of objects in x. Otherwise, it is recycled to the number of elements in x.

Details

The target equivalence class consists of all dendrograms for which the corresponding n-trees are the same as the one corresponding to y. I.e., all splits are the same as for y, and optimization is over the height of the splits.

The criterion function to be optimized over all ultrametrics from the equivalence class is \sum w_{ij} |x_{ij} - u_{ij}|^p, where p = 2 in the Euclidean and p = 1 in the Manhattan case, respectively.

The optimum can be computed as follows. Suppose split s joins object classes A and B. As the ultrametric dissimilarities of all objects in A to all objects in B must be the same value, say, u_{A,B} = u_s, the contribution from the split to the criterion function is of the form f_s(u_s) = \sum_{i \in A, j \in B} w_{ij} |x_{ij} - u_s|^p. We need to minimize \sum_s f_s(u_s) under the constraint that the u_s form a non-decreasing sequence, which is accomplished by using the Pool Adjacent Violators Algorithm (PAVA) using the weighted mean (p = 2) or weighted median (p = 1) for solving the blockwise optimization problems, as sketched below.
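As a rough sketch (not the package's internal implementation), the pooling step for p = 2 can be written as follows: blocks of violating heights are repeatedly merged into their weighted means until the sequence is non-decreasing.

## Isotonic regression via PAVA with weighted means (p = 2).
pava <- function(y, w = rep(1, length(y))) {
    ## Block values, weights, and sizes: initially one block per height.
    v <- y; ww <- w; s <- rep(1L, length(y))
    i <- 1
    while (i < length(v)) {
        if (v[i] > v[i + 1]) {
            ## Pool adjacent violators into their weighted mean.
            wt <- ww[i] + ww[i + 1]
            v[i] <- (ww[i] * v[i] + ww[i + 1] * v[i + 1]) / wt
            ww[i] <- wt
            s[i] <- s[i] + s[i + 1]
            v <- v[-(i + 1)]; ww <- ww[-(i + 1)]; s <- s[-(i + 1)]
            if (i > 1) i <- i - 1
        } else i <- i + 1
    }
    rep(v, s)  # expand the block means back to the original positions
}
pava(c(3, 1, 2, 4))  # gives 2 2 2 4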

Value

An object of class "cl_ultrametric" containing the optimal ultrametric distances.

See Also

ls_fit_ultrametric for finding the ultrametric minimizing Euclidean dissimilarity (without fixing the splits).

Examples

data("Phonemes")## Note that the Phonemes data set has the consonant misclassification## probabilities, i.e., the similarities between the phonemes.d <- as.dist(1 - Phonemes)## Find the maximal dominated and miminal dominating ultrametrics by## hclust() with single and complete linkage:y1 <- hclust(d, "single")y2 <- hclust(d, "complete")## Note that these are quite different:cl_dissimilarity(y1, y2, "gamma")## Now find the L2 optimal members of the respective dendrogram## equivalence classes.u1 <- ls_fit_ultrametric_target(d, y1)u2 <- ls_fit_ultrametric_target(d, y2)## Compute the L2 optimal ultrametric approximation to d.u <- ls_fit_ultrametric(d)## And compare ...cl_dissimilarity(cl_ensemble(Opt = u, Single = u1, Complete = u2), d)## The solution obtained via complete linkage is quite close:cl_agreement(u2, u, "cophenetic")

Hierarchies

Description

Determine whether an R object represents a hierarchy of objects, or coerce to an R object representing such.

Usage

is.cl_hierarchy(x)
is.cl_dendrogram(x)

as.cl_hierarchy(x)
as.cl_dendrogram(x)

Arguments

x

an R object.

Details

These functions are generic functions.

The methods provided in package clue handle the partitions and hierarchies obtained from clustering functions in the base R distribution, as well as packages RWeka, ape, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself).

The hierarchies considered by clue are n-trees (hierarchies in the strict sense) and dendrograms (also known as valued n-trees or total indexed hierarchies), which are represented by the virtual classes "cl_hierarchy" and "cl_dendrogram" (which inherits from the former), respectively.

n-trees on a set X of objects correspond to collections H of subsets of X, usually called classes of the hierarchy, which satisfy the following properties: H contains all singletons with objects of X and X itself, but not the empty set; and the intersection of two classes A and B in H is either empty or one of the two classes.

The classes of a hierarchy can be obtained by cl_classes.

Dendrograms are n-trees where additionally a height h is associated with each of the classes, so that for two classes A and B with non-empty intersection we have h(A) \le h(B) iff A is a subset of B. For each pair of objects one can then define u_{ij} as the height of the smallest class containing both i and j: this results in a dissimilarity on X which satisfies the ultrametric (3-point) conditions u_{ij} \le \max(u_{ik}, u_{jk}) for all triples (i, j, k) of objects. Conversely, an ultrametric dissimilarity induces a unique dendrogram.

The ultrametric dissimilarities of a dendrogram can be obtained by cl_ultrametric.

as.cl_hierarchy returns an object of class "cl_hierarchy" “containing” the given object x if this already represents a hierarchy (i.e., is.cl_hierarchy(x) is true), or the ultrametric obtained from x via as.cl_ultrametric.

as.cl_dendrogram returns an object which has class "cl_dendrogram" and inherits from "cl_hierarchy", and contains x if it represents a dendrogram (i.e., is.cl_dendrogram(x) is true), or the ultrametric obtained from x.

Conceptually, hierarchies and dendrograms are virtual classes, allowing for a variety of representations.

There are group methods for comparing dendrograms and computing their minimum, maximum, and range based on the meet and join operations, see cl_meet. There is also a plot method.

Value

For the testing functions, a logical indicating whether the given object represents a clustering of objects of the respective kind.

For the coercion functions, a container object inheriting from "cl_hierarchy", with a suitable representation of the hierarchy given by x.

Examples

hcl <- hclust(dist(USArrests))
is.cl_dendrogram(hcl)
is.cl_hierarchy(hcl)

K-Medoids Clustering

Description

Compute a k-medoids partition of a dissimilarity object.

Usage

kmedoids(x, k)

Arguments

x

a dissimilarity object inheriting from class "dist", or a square matrix of pairwise object-to-object dissimilarity values.

k

an integer giving the number of classes to be used in the partition.

Details

Let d denote the pairwise object-to-object dissimilarity matrix corresponding to x. A k-medoids partition of x is defined as a partition of the numbers from 1 to n, the number of objects in x, into k classes C_1, \ldots, C_k such that the criterion function L = \sum_l \min_{j \in C_l} \sum_{i \in C_l} d_{ij} is minimized.

This is an NP-hard optimization problem. PAM (Partitioning Around Medoids, see Kaufman & Rousseeuw (1990), Chapter 2) is a very popular heuristic for obtaining optimal k-medoids partitions, and provided by pam in package cluster.

kmedoids is an exact algorithm based on a binary linear programming formulation of the optimization problem (e.g., Gordon & Vichi (1998), [P4']), using lp from package lpSolve as solver. Depending on available hardware resources (the number of constraints of the program is of the order n^2), it may not be possible to obtain a solution.

Value

An object of class "kmedoids" representing the obtained partition, which is a list with the following components.

cluster

the class ids of the partition.

medoid_ids

the indices of the medoids.

criterion

the value of the criterion function of the partition.

References

L. Kaufman and P. J. Rousseeuw (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

A. D. Gordon and M. Vichi (1998). Partitions of partitions. Journal of Classification, 15, 265–285. doi:10.1007/s003579900034.
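Examples

A minimal illustrative sketch (not from the original documentation), kept small since the number of constraints of the underlying program grows like n^2; requires package lpSolve.

## Exact 2-medoids partition of a small dissimilarity object.
d <- dist(USArrests[1 : 10, ])
km <- kmedoids(d, 2)
km$cluster
km$medoid_ids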


Least Absolute Deviation Fit of Ultrametrics to Dissimilarities

Description

Find the ultrametric with minimal absolute distance (Manhattan dissimilarity) to a given dissimilarity object.

Usage

l1_fit_ultrametric(x, method = c("SUMT", "IRIP"), weights = 1,
                   control = list())

Arguments

x

a dissimilarity object inheriting from or coercible to class "dist".

method

a character string indicating the fitting method to be employed. Must be one of "SUMT" (default) or "IRIP", or a unique abbreviation thereof.

weights

a numeric vector or matrix with non-negative weights for obtaining a weighted least squares fit. If a matrix, its numbers of rows and columns must be the same as the number of objects in x, and the lower diagonal part is used. Otherwise, it is recycled to the number of elements in x.

control

a list of control parameters. See Details.

Details

The problem to be solved is minimizing

L(u) = \sum_{i,j} w_{ij} |x_{ij} - u_{ij}|

over all u satisfying the ultrametric constraints (i.e., for all i, j, k, u_{ij} \le \max(u_{ik}, u_{jk})). This problem is known to be NP-hard (Krivanek and Moravek, 1986).

We provide two heuristics for solving this problem.

Method "SUMT" implements a SUMT (Sequential Unconstrained Minimization Technique, see sumt) approach using the sign function for the gradients of the absolute value function.

Available control parameters are method, control, eps, q, and verbose, which have the same roles as for sumt, and the following.

nruns

an integer giving the number of runs to be performed. Defaults to 1.

start

a single dissimilarity, or a list of dissimilarities to be employed as starting values.

Method"IRIP" implements a variant of the IterativelyReweighted Iterative Projection approach of Smith (2001), whichattempts to solve theL_1 problem via a sequence of weightedL_2 problems, determiningu(t+1) by minimizing thecriterion function

\sum_{i,j} w_{ij} (x_{ij} - u_{ij})^2 / \max(|x_{ij} - u_{ij}(t)|, m)

withm a “small” non-zero value to avoid zero divisors.We use theSUMT method ofls_fit_ultrametricfor solving the weighted least squares problems.

Available control parameters are as follows.

maxiter

an integer giving the maximal number of iteration steps to be performed. Defaults to 100.

eps

a nonnegative number controlling the iteration, which stops when the maximal change in u is less than eps. Defaults to 10^{-6}.

reltol

the relative convergence tolerance. Iteration stops when the relative change in the L_1 criterion is less than reltol. Defaults to 10^{-6}.

MIN

the cutoff m. Defaults to 10^{-3}.

start

a dissimilarity object to be used as the starting value for u.

control

a list of control parameters to be used by the method of ls_fit_ultrametric employed for solving the weighted L_2 problems.

One may need to adjust the default control parameters to achieve convergence.

It should be noted that all methods are heuristics which cannot be guaranteed to find the global minimum.

Value

An object of class "cl_ultrametric" containing the fitted ultrametric distances.

References

M. Krivanek and J. Moravek (1986). NP-hard problems in hierarchical tree clustering. Acta Informatica, 23, 311–323. doi:10.1007/BF00289116.

T. J. Smith (2001). Constructing ultrametric and additive trees based on the L_1 norm. Journal of Classification, 18, 185–207. https://link.springer.com/article/10.1007/s00357-001-0015-0.

See Also

cl_consensus for computing least absolute deviation (Manhattan) consensus hierarchies; ls_fit_ultrametric.
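Examples

A minimal illustrative sketch along the lines of the ls_fit_ultrametric example below (the data choice and the correlation check are illustrative, not from the original documentation).

data("Phonemes")
## The Phonemes data set has the consonant misclassification
## probabilities, i.e., the similarities between the phonemes.
d <- as.dist(1 - Phonemes)
## L1 optimal ultrametric approximation to d.
u <- l1_fit_ultrametric(d)
## Cophenetic correlation:
cor(d, u)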


Cluster Lattices

Description

Computations on the lattice of all (hard) partitions, or the lattice of all dendrograms, or the meet semilattice of all hierarchies (n-trees) of/on a set of objects: meet, join, and comparisons.

Usage

cl_meet(x, y)
cl_join(x, y)

Arguments

x

an ensemble of partitions or dendrograms or hierarchies, or an R object representing a partition or dendrogram or hierarchy.

y

an R object representing a partition or dendrogram or hierarchy. Ignored if x is an ensemble.

Details

For a given finite set of objects X, the set H(X) of all (hard) partitions of X can be partially ordered by defining a partition P to be “finer” than a partition Q, i.e., P \le Q, if each class of P is contained in some class of Q. With this partial order, H(X) becomes a bounded lattice, with intersection and union of two elements given by their greatest lower bound (meet) and their least upper bound (join), respectively.

Specifically, the meet of two partitions computed by cl_meet is the partition obtained by intersecting the classes of the partitions; the classes of the join computed by cl_join are obtained by joining all elements in the same class in at least one of the partitions. Obviously, the least and greatest elements of the partition lattice are the partitions where each object is in a single class (sometimes referred to as the “splitter” partition) or in the same class (the “lumper” partition), respectively. Meet and join of an arbitrary number of partitions can be defined recursively.

In addition to computing the meet and join, the comparison operations corresponding to the above partial order as well as min, max, and range are available at least for R objects representing partitions inheriting from "cl_partition". The summary methods give the meet and join of the given partitions (for min and max), or a partition ensemble with the meet and join (for range).

If the partitions specified by x and y are soft partitions, the corresponding nearest hard partitions are used. Future versions may optionally provide suitable “soft” (fuzzy) extensions for computing meets and joins.

The set of all dendrograms on X can be ordered using pointwise inequality of the associated ultrametric dissimilarities: i.e., if D and E are the dendrograms with ultrametrics u and v, respectively, then D \le E if u_{ij} \le v_{ij} for all pairs (i, j) of objects. This again yields a lattice (of dendrograms). The join of D and E is the dendrogram with ultrametrics given by \max(u_{ij}, v_{ij}) (as this gives an ultrametric); the meet is the dendrogram with the maximal ultrametric dominated by \min(u_{ij}, v_{ij}), and can be obtained by applying single linkage hierarchical clustering to the minima.

The set of all hierarchies on X can be ordered by set-wise inclusion of the classes: i.e., if H and G are two hierarchies, then H \le G if all classes of H are also classes of G. This yields a meet semilattice, with meet given by the classes contained in both hierarchies. The join only exists if the union of the classes is a hierarchy.

In each case, a modular semilattice is obtained, which allows for a natural metrization via least element (semi)lattice move distances, see Barthélémy, Leclerc and Monjardet (1986). These latticial metrics are given by the BA/C (partitions), Manhattan (dendrograms), and symdiff (hierarchies) dissimilarities, respectively (see cl_dissimilarity).

Value

For cl_meet and cl_join, an object of class "cl_partition" or "cl_dendrogram" with the class ids or ultrametric dissimilarities of the meet and join of the partitions or dendrograms, respectively.

References

J.-P. Barthélémy, B. Leclerc and B. Monjardet (1986). On the use of ordered sets in problems of comparison and consensus of classifications. Journal of Classification, 3, 187–224. doi:10.1007/BF01894188.

Examples

## Two simple partitions of 7 objects.
A <- as.cl_partition(c(1, 1, 2, 3, 3, 5, 5))
B <- as.cl_partition(c(1, 2, 2, 3, 4, 5, 5))
## These disagree on objects 1-3, B splits objects 4 and 5 into
## separate classes.  Objects 6 and 7 are always in the same class.
(A <= B) || (B <= A)
## (Neither partition is finer than the other.)
cl_meet(A, B)
cl_join(A, B)
## Meeting with the lumper (greatest) or joining with the splitter
## (least) partition does not make a difference:
C_lumper <- as.cl_partition(rep(1, n_of_objects(A)))
cl_meet(cl_ensemble(A, B, C_lumper))
C_splitter <- as.cl_partition(seq_len(n_of_objects(A)))
cl_join(cl_ensemble(A, B, C_splitter))
## Another way of computing the join:
range(A, B, C_splitter)$max

Least Squares Fit of Additive Tree Distances to Dissimilarities

Description

Find the additive tree distance or centroid distance minimizing least squares distance (Euclidean dissimilarity) to a given dissimilarity object.

Usage

ls_fit_addtree(x, method = c("SUMT", "IP", "IR"), weights = 1,
               control = list())
ls_fit_centroid(x)

Arguments

x

a dissimilarity object inheriting from class "dist".

method

a character string indicating the fitting method to be employed. Must be one of "SUMT" (default), "IP", or "IR", or a unique abbreviation thereof.

weights

a numeric vector or matrix with non-negative weights for obtaining a weighted least squares fit. If a matrix, its numbers of rows and columns must be the same as the number of objects in x, and the lower diagonal part is used. Otherwise, it is recycled to the number of elements in x.

control

a list of control parameters. See Details.

Details

See as.cl_addtree for details on additive tree distances and centroid distances.

With L(d) = \sum w_{ij} (x_{ij} - d_{ij})^2, the problem to be solved by ls_fit_addtree is minimizing L over all additive tree distances d. This problem is known to be NP-hard.

We provide three heuristics for solving this problem.

Method "SUMT" implements the SUMT (Sequential Unconstrained Minimization Technique, Fiacco and McCormick, 1968) approach of de Soete (1983). Incomplete dissimilarities are currently not supported.

Methods "IP" and "IR" implement the Iterative Projection and Iterative Reduction approaches of Hubert and Arabie (1995) and Roux (1988), respectively. Non-identical weights and incomplete dissimilarities are currently not supported.

See ls_fit_ultrametric for details on these methods and available control parameters.

It should be noted that all methods are heuristics which cannot be guaranteed to find the global minimum. Standard practice would recommend to use the best solution found in “sufficiently many” replications of the base algorithm.

ls_fit_centroid finds the centroid distance d minimizing L(d) (currently, only for the case of identical weights). This optimization problem has a closed-form solution.

Value

An object of class "cl_addtree" containing the optimal additive tree distances.

References

A. V. Fiacco and G. P. McCormick (1968). Nonlinear programming: Sequential unconstrained minimization techniques. New York: John Wiley & Sons.

L. Hubert and P. Arabie (1995). Iterative projection strategies for the least squares fitting of tree structures to proximity data. British Journal of Mathematical and Statistical Psychology, 48, 281–317. doi:10.1111/j.2044-8317.1995.tb01065.x.

M. Roux (1988). Techniques of approximation for building two tree structures. In C. Hayashi and E. Diday and M. Jambu and N. Ohsumi (Eds.), Recent Developments in Clustering and Data Analysis, pages 151–170. New York: Academic Press.

G. de Soete (1983). A least squares algorithm for fitting additive trees to proximity data. Psychometrika, 48, 621–626. doi:10.1007/BF02293884.
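Examples

A minimal illustrative sketch (the data choice is illustrative, not from the original documentation).

data("Phonemes")
d <- as.dist(1 - Phonemes)
## Least squares additive tree distance fit:
a <- ls_fit_addtree(d)
## Centroid distance fit (closed-form solution):
g <- ls_fit_centroid(d)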


Least Squares Fit of Sums of Ultrametrics to Dissimilarities

Description

Find a sequence of ultrametrics with sum minimizing square distance (Euclidean dissimilarity) to a given dissimilarity object.

Usage

ls_fit_sum_of_ultrametrics(x, nterms = 1, weights = 1,
                           control = list())

Arguments

x

a dissimilarity object inheriting from or coercible to class "dist".

nterms

an integer giving the number of ultrametrics to be fitted.

weights

a numeric vector or matrix with non-negative weights for obtaining a weighted least squares fit. If a matrix, its numbers of rows and columns must be the same as the number of objects in x, and the lower diagonal part is used. Otherwise, it is recycled to the number of elements in x.

control

a list of control parameters. See Details.

Details

The problem to be solved is minimizing the criterion function

L(u(1), \dots, u(n)) = \sum_{i,j} w_{ij} \left(x_{ij} - \sum_{k=1}^n u_{ij}(k)\right)^2

over all u(1), \ldots, u(n) satisfying the ultrametric constraints.

We provide an implementation of the iterative heuristic suggested in Carroll & Pruzansky (1980) which in each step t sequentially refits the u(k) as the least squares ultrametric fit to the “residuals” x - \sum_{l \ne k} u(l) using ls_fit_ultrametric.

Available control parameters include

maxiter

an integer giving the maximal number of iteration steps to be performed. Defaults to 100.

eps

a nonnegative number controlling the iteration, which stops when the maximal change in all u(k) is less than eps. Defaults to 10^{-6}.

reltol

the relative convergence tolerance. Iteration stops when the relative change in the criterion function is less than reltol. Defaults to 10^{-6}.

method

a character string indicating the fitting method to be employed by the individual least squares fits.

control

a list of control parameters to be used by the method of ls_fit_ultrametric employed. By default, if the SUMT method is used, 10 inner SUMT runs are performed for each refitting.

It should be noted that the method used is a heuristic which cannot be guaranteed to find the global minimum.

Value

A list of objects of class "cl_ultrametric" containing the fitted ultrametric distances.

References

J. D. Carroll and S. Pruzansky (1980). Discrete and hybrid scaling models. In E. D. Lantermann and H. Feger (eds.), Similarity and Choice. Bern (Switzerland): Huber.
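Examples

A minimal illustrative sketch (the data choice and the two-term fit are illustrative, not from the original documentation).

data("Phonemes")
d <- as.dist(1 - Phonemes)
## Fit a sum of two ultrametrics.
u <- ls_fit_sum_of_ultrametrics(d, nterms = 2)
## Plot the dendrogram corresponding to each term.
plot(u[[1]])
plot(u[[2]])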


Least Squares Fit of Ultrametrics to Dissimilarities

Description

Find the ultrametric with minimal square distance (Euclidean dissimilarity) to given dissimilarity objects.

Usage

ls_fit_ultrametric(x, method = c("SUMT", "IP", "IR"), weights = 1,
                   control = list())

Arguments

x

a dissimilarity object inheriting from or coercible to class "dist", or an ensemble of such objects.

method

a character string indicating the fitting method to be employed. Must be one of "SUMT" (default), "IP", or "IR", or a unique abbreviation thereof.

weights

a numeric vector or matrix with non-negative weights for obtaining a weighted least squares fit. If a matrix, its numbers of rows and columns must be the same as the number of objects in x, and the lower diagonal part is used. Otherwise, it is recycled to the number of elements in x.

control

a list of control parameters. See Details.

Details

For a single dissimilarity object x, the problem to be solved is minimizing

L(u) = \sum_{i,j} w_{ij} (x_{ij} - u_{ij})^2

over all u satisfying the ultrametric constraints (i.e., for all i, j, k, u_{ij} \le \max(u_{ik}, u_{jk})). This problem is known to be NP-hard (Krivanek and Moravek, 1986).

For an ensemble of dissimilarity objects, the criterion function is

L(u) = \sum_b w_b \sum_{i,j} w_{ij} (x_{ij}(b) - u_{ij})^2,

where w_b is the weight given to element x_b of the ensemble and can be specified via control parameter weights (default: all ones). This problem reduces to the above basic problem with x as the w_b-weighted mean of the x_b.

We provide three heuristics for solving the basic problem.

Method "SUMT" implements the SUMT (Sequential Unconstrained Minimization Technique, Fiacco and McCormick, 1968) approach of de Soete (1986) which in turn simplifies the suggestions in Carroll and Pruzansky (1980). (See sumt for more information on the SUMT approach.) We then use a final single linkage hierarchical clustering step to ensure that the returned object exactly satisfies the ultrametric constraints. The starting value u_0 is obtained by “random shaking” of the given dissimilarity object (if not given). If there are missing values in x, i.e., the given dissimilarities are incomplete, we follow a suggestion of de Soete (1984), imputing the missing values by the weighted mean of the non-missing ones, and setting the corresponding weights to zero.

Available control parameters are method, control, eps, q, and verbose, which have the same roles as for sumt, and the following.

nruns

an integer giving the number of runs to be performed. Defaults to 1.

start

a single dissimilarity, or a list of dissimilarities to be employed as starting values.

The default optimization using conjugate gradients should work reasonably well for medium to large size problems. For “small” ones, using nlm is usually faster. Note that the number of ultrametric constraints is of the order n^3, where n is the number of objects in the dissimilarity object, suggesting to use the SUMT approach in favor of constrOptim.

If starting values for the SUMT are provided via start, the number of starting values gives the number of runs to be performed, and control option nruns is ignored. Otherwise, nruns starting values are obtained by random shaking of the dissimilarity to be fitted. In the case of multiple SUMT runs, the (first) best solution found is returned.

Method "IP" implements the Iterative Projection approach of Hubert and Arabie (1995). This iteratively projects the current dissimilarities to the closed convex set given by the ultrametric constraints (3-point conditions) for a single index triple (i, j, k), in fact replacing the two largest values among d_{ij}, d_{ik}, d_{jk} by their mean (a single such projection step is sketched below). The following control parameters can be provided via the control argument.

nruns

an integer giving the number of runs to be performed. Defaults to 1.

order

a permutation of the numbers from 1 to the number of objects in x, specifying the order in which the ultrametric constraints are considered, or a list of such permutations.

maxiter

an integer giving the maximal number of iterations to be employed.

tol

a double indicating the maximal convergence tolerance. The algorithm stops if the total absolute change in the dissimilarities in an iteration is less than tol.

verbose

a logical indicating whether to provide some output on minimization progress. Defaults to getOption("verbose").

If permutations are provided via order, the number of these gives the number of runs to be performed, and control option nruns is ignored. Otherwise, nruns randomly generated orders are tried. In the case of multiple runs, the (first) best solution found is returned.

Non-identical weights and incomplete dissimilarities are currently not supported.
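A rough sketch (not the package's internal implementation) of a single projection step for one triple of dissimilarities, as described above:

## Replace the two largest of d_ij, d_ik, d_jk by their mean.
project_triple <- function(d3) {
    o <- order(d3)
    d3[o[2 : 3]] <- mean(d3[o[2 : 3]])
    d3
}
project_triple(c(1, 5, 2))  # gives 1.0 3.5 3.5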

Method"IR" implements the Iterative Reduction approachsuggested by Roux (1988), see also Barthélémy and Guénoche (1991).This is similar to the Iterative Projection method, but modifies thedissimilarities between objects proportionally to the aggregatedchange incurred from the ultrametric projections. Available controlparameters are identical to those of method"IP".

Non-identical weights and incomplete dissimilarities are currently not supported.

It should be noted that all methods are heuristics which cannot be guaranteed to find the global minimum. Standard practice would recommend to use the best solution found in “sufficiently many” replications of the base algorithm.

Value

An object of class "cl_ultrametric" containing the fitted ultrametric distances.

References

J.-P. Barthélémy and A. Guénoche (1991). Trees and proximity representations. Chichester: John Wiley & Sons. ISBN 0-471-92263-3.

J. D. Carroll and S. Pruzansky (1980). Discrete and hybrid scaling models. In E. D. Lantermann and H. Feger (eds.), Similarity and Choice. Bern (Switzerland): Huber.

L. Hubert and P. Arabie (1995). Iterative projection strategies for the least squares fitting of tree structures to proximity data. British Journal of Mathematical and Statistical Psychology, 48, 281–317. doi:10.1111/j.2044-8317.1995.tb01065.x.

M. Krivanek and J. Moravek (1986). NP-hard problems in hierarchical tree clustering. Acta Informatica, 23, 311–323. doi:10.1007/BF00289116.

M. Roux (1988). Techniques of approximation for building two tree structures. In C. Hayashi and E. Diday and M. Jambu and N. Ohsumi (Eds.), Recent Developments in Clustering and Data Analysis, pages 151–170. New York: Academic Press.

G. de Soete (1984). Ultrametric tree representations of incomplete dissimilarity data. Journal of Classification, 1, 235–242. doi:10.1007/BF01890124.

G. de Soete (1986). A least squares algorithm for fitting an ultrametric tree to a dissimilarity matrix. Pattern Recognition Letters, 2, 133–137. doi:10.1016/0167-8655(84)90036-9.

See Also

cl_consensus for computing least squares (Euclidean) consensus hierarchies by least squares fitting of average ultrametric distances; l1_fit_ultrametric.

Examples

## Least squares fit of an ultrametric to the Miller-Nicely consonant
## phoneme confusion data.
data("Phonemes")
## Note that the Phonemes data set has the consonant misclassification
## probabilities, i.e., the similarities between the phonemes.
d <- as.dist(1 - Phonemes)
u <- ls_fit_ultrametric(d, control = list(verbose = TRUE))
## Cophenetic correlation:
cor(d, u)
## Plot:
plot(u)
## ("Basically" the same as Figure 1 in de Soete (1986).)

Classes in a Partition

Description

Determine the number of classes and the class ids in a partition of objects.

Usage

n_of_classes(x)
cl_class_ids(x)
as.cl_class_ids(x)

Arguments

x

an object representing a (hard or soft) partition (for n_of_classes and cl_class_ids), or raw class ids (for as.cl_class_ids).

Details

These functions are generic functions.

The methods provided in package clue handle the partitions obtained from clustering functions in the base R distribution, as well as packages RWeka, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself).

Note that the number of classes is taken as the number of distinct class ids actually used in the partition, and may differ from the number of columns in a membership matrix representing the partition.

as.cl_class_ids can be used for coercing “raw” class ids (given as atomic vectors) to class id objects.

Value

For n_of_classes, an integer giving the number of classes in the partition.

For cl_class_ids, a vector of integers with the corresponding class ids. For soft partitions, the class ids returned are those of the nearest hard partition obtained by taking the class ids of the (first) maximal membership values.

See Also

is.cl_partition

Examples

data("Cassini")party <- kmeans(Cassini$x, 3)n_of_classes(party)## A simple confusion matrix:table(cl_class_ids(party), Cassini$classes)## For an "oversize" membership matrix representation:n_of_classes(cl_membership(party, 6))

Number of Objects in a Partition or Hierarchy

Description

Determine the number of objects from which a partition or hierarchy was obtained.

Usage

n_of_objects(x)

Arguments

x

an R object representing a (hard or soft) partition or a hierarchy of objects, or dissimilarities between objects.

Details

This is a generic function.

The methods provided in package clue handle the partitions and hierarchies obtained from clustering functions in the base R distribution, as well as packages RWeka, ape, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself).

There is also a method for object dissimilarities which inherit from class "dist".

Value

An integer giving the number of objects.

See Also

is.cl_partition, is.cl_hierarchy

Examples

data("Cassini")pcl <- kmeans(Cassini$x, 3)n_of_objects(pcl)hcl <- hclust(dist(USArrests))n_of_objects(hcl)

Partitions

Description

Determine whether an R object represents a partition of objects, or coerce to an R object representing such.

Usage

is.cl_partition(x)
is.cl_hard_partition(x)
is.cl_soft_partition(x)

as.cl_partition(x)
as.cl_hard_partition(x)

Arguments

x

an R object.

Details

is.cl_partition and is.cl_hard_partition are generic functions.

The methods provided in package clue handle the partitions obtained from clustering functions in the base R distribution, as well as packages RWeka, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself).

is.cl_soft_partition gives true iff is.cl_partition is true and is.cl_hard_partition is false.

as.cl_partition returns an object of class "cl_partition" “containing” the given object x if this already represents a partition (i.e., is.cl_partition(x) is true), or the memberships obtained from x via as.cl_membership.

as.cl_hard_partition(x) returns an object which has class "cl_hard_partition" and inherits from "cl_partition", and contains x if it already represents a hard partition (i.e., provided that is.cl_hard_partition(x) is true), or the class ids obtained from x, using x if this is an atomic vector of raw class ids, or, if x represents a soft partition or is a raw matrix of membership values, using the class ids of the nearest hard partition, defined by taking the class ids of the (first) maximal membership values.

Conceptually, partitions and hard partitions are virtual classes, allowing for a variety of representations.

There are group methods for comparing partitions and computing their minimum, maximum, and range based on the meet and join operations, see cl_meet.

Value

For the testing functions, a logical indicating whether the given object represents a clustering of objects of the respective kind.

For the coercion functions, a container object inheriting from "cl_partition", with a suitable representation of the partition given by x.

Examples

data("Cassini")pcl <- kmeans(Cassini$x, 3)is.cl_partition(pcl)is.cl_hard_partition(pcl)is.cl_soft_partition(pcl)

Prototype-Based Partitioning

Description

Obtain prototype-based partitions of objects by minimizing the criterion \sum w_b u_{bj}^m d(x_b, p_j)^e, the sum of the case-weighted and membership-weighted e-th powers of the dissimilarities between the objects x_b and the prototypes p_j, for suitable dissimilarities d and exponents e.

Usage

pclust(x, k, family, m = 1, weights = 1, control = list())

pclust_family(D, C, init = NULL, description = NULL, e = 1,
              .modify = NULL, .subset = NULL)

pclust_object(prototypes, membership, cluster, family, m = 1,
              value, ..., classes = NULL, attributes = NULL)

Arguments

x

the object to be partitioned.

k

an integer giving the number of classes to be used in the partition.

family

an object of class "pclust_family" as generated by pclust_family, containing the information about d and e.

m

a number not less than 1 controlling the softness of the partition (as the “fuzzification parameter” of the fuzzy c-means algorithm). The default value of 1 corresponds to hard partitions obtained from a generalized k-means problem; values greater than one give partitions of increasing softness obtained from a generalized fuzzy c-means problem.

weights

a numeric vector of non-negative case weights. Recycled to the number of elements given by x if necessary.

control

a list of control parameters. See Details.

D

a function for computing dissimilarities d between objects and prototypes.

C

a ‘consensus’ function with formals x, weights and control for computing a consensus prototype p minimizing \sum_b w_b d(x_b, p) ^ e.

init

a function with formals x and k initializing an object with k prototypes from the object x to be partitioned.

description

a character string describing the family.

e

a number giving the exponent e of the criterion.

.modify

a function with formals x, i and value for modifying a single prototype, or NULL (default).

.subset

a function with formals x and i for subsetting prototypes, or NULL (default).

prototypes

an object representing the prototypes of the partition.

membership

an object of class "cl_membership" with the membership values u_{bj}.

cluster

the class ids of the nearest hard partition.

value

the value of the criterion to be minimized.

...

further elements to be included in the generated pclust object.

classes

a character vector giving further classes to be given to the generated pclust object in addition to "pclust", or NULL (default).

attributes

a list of attributes, or NULL (default).

Details

For m = 1, a generalization of the Lloyd-Forgy variant of the k-means algorithm is used, which iterates between reclassifying objects to their closest prototypes (according to the dissimilarities given by D), and computing new prototypes as the consensus for the classes (using C).

For m > 1, a generalization of the fuzzy c-means recipe (e.g., Bezdek (1981)) is used, which alternates between computing optimal memberships for fixed prototypes, and computing new prototypes as the suitably weighted consensus clusterings for the classes.

This procedure is repeated until convergence occurs, or the maximal number of iterations is reached.

Currently, no local improvement heuristics are provided.

It is possible to perform several runs of the procedure via control arguments nruns or start (the default is to perform a single run), in which case the first partition with the smallest value of the criterion is returned.

The dissimilarity and consensus functions as well as the exponent e are specified via family. In principle, arbitrary representations of objects to be partitioned and prototypes (which do not necessarily have to be “of the same kind”) can be employed. In addition to D and C, what is needed are means to obtain an initial collection of k prototypes (init), to modify a single prototype (.modify), and subset the prototypes (.subset). By default, list and (currently, only dense) matrix (with the usual convention that the rows correspond to the objects) are supported. Otherwise, the family has to provide the functions needed.

Available control parameters are as follows.

maxiter

an integer giving the maximal number of iterations to be performed. Defaults to 100.

nruns

an integer giving the number of runs to be performed. Defaults to 1.

reltol

the relative convergence tolerance. Defaults to sqrt(.Machine$double.eps).

start

a list of prototype objects to be used as starting values.

verbose

a logical indicating whether to provide some output on minimization progress. Defaults to getOption("verbose").

control

control parameters to be used in the consensus function.

The fixed point approach employed is a heuristic which cannot be guaranteed to find the global minimum, in particular if C is not an exact minimizer. Standard practice would recommend to use the best solution found in “sufficiently many” replications of the base algorithm.

Value

pclust returns the partition found as an object of class "pclust" (as obtained by calling pclust_object) which in addition to the default components contains call (the matched call) and a converged attribute indicating convergence status (i.e., whether the maximal number of iterations was reached).

pclust_family returns an object of class "pclust_family", which is a list with components corresponding to the formals of pclust_family.

pclust_object returns an object inheriting from class "pclust", which is a list with components corresponding to the formals (up to and including ...) of pclust_object, and additional classes and attributes specified by classes and attributes, respectively.

References

J. C. Bezdek (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum.

See Also

kmeans, cmeans.
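Examples

As a rough sketch of the family mechanism (not from the original documentation; it assumes the default dense-matrix conventions described above, with rows as objects), the following builds a k-means style family from squared Euclidean dissimilarities (e = 1) and the weighted mean as consensus function.

## Squared Euclidean dissimilarities between data rows and prototype rows.
D <- function(x, p)
    apply(p, 1, function(row) colSums((t(x) - row)^2))
## Weighted mean as consensus prototype.
C <- function(x, weights, control)
    colSums(x * weights) / sum(weights)
## Initialize prototypes as k randomly chosen data rows.
init <- function(x, k)
    x[sample(nrow(x), k), , drop = FALSE]
fam <- pclust_family(D, C, init,
                     description = "illustrative k-means family")
data("Cassini")
set.seed(1234)
pcl <- pclust(Cassini$x, 3, fam)
table(pcl$cluster, Cassini$classes)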


Solve Linear Sum Assignment Problem

Description

Solve the linear sum assignment problem using the Hungarian method.

Usage

solve_LSAP(x, maximum = FALSE)

Arguments

x

a matrix with nonnegative entries and at least as many columns as rows.

maximum

a logical indicating whether to minimize or maximize the sum of assigned costs.

Details

If nr and nc are the numbers of rows and columns of x, solve_LSAP finds an optimal assignment of rows to columns, i.e., a one-to-one map p of the numbers from 1 to nr to the numbers from 1 to nc (a permutation of these numbers in case x is a square matrix) such that \sum_{i=1}^{nr} x[i, p[i]] is minimized or maximized.

This assignment can be found using a linear program (and package lpSolve provides a function lp.assign for doing so), but typically more efficiently and provably in polynomial time O(n^3) using primal-dual methods such as the so-called Hungarian method (see the references).

Value

An object of class "solve_LSAP" with the optimal assignment of rows to columns.

Author(s)

Walter Böhm <Walter.Boehm@wu.ac.at> kindly provided C code implementing the Hungarian method.

References

C. Papadimitriou and K. Steiglitz (1982). Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs: Prentice Hall.

Examples

x <- matrix(c(5, 1, 4, 3, 5, 2, 2, 4, 4), nrow = 3)
solve_LSAP(x)
solve_LSAP(x, maximum = TRUE)
## To get the optimal value (for now):
y <- solve_LSAP(x)
sum(x[cbind(seq_along(y), y)])

Sequential Unconstrained Minimization Technique

Description

Solve constrained optimization problems via the Sequential Unconstrained Minimization Technique (SUMT).

Usage

sumt(x0, L, P, grad_L = NULL, grad_P = NULL, method = NULL,
     eps = NULL, q = NULL, verbose = NULL, control = list())

Arguments

x0

a list of starting values, or a single starting value.

L

a function to minimize.

P

a non-negative penalty function such that P(x) is zero iff the constraints are satisfied.

grad_L

a function giving the gradient of L, or NULL (default).

grad_P

a function giving the gradient of P, or NULL (default).

method

a character string, or NULL. If not given, "CG" is used. If equal to "nlm", minimization is carried out using nlm. Otherwise, optim is used with method as the given method.

eps

the absolute convergence tolerance. The algorithm stops if the (maximum) distance between successive x values is less than eps.

Defaults to sqrt(.Machine$double.eps).

q

a double greater than one controlling the growth of the \rho_k as described in Details.

Defaults to 10.

verbose

a logical indicating whether to provide some output on minimization progress.

Defaults to getOption("verbose").

control

a list of control parameters to be passed to the minimization routine in case optim is used.

Details

The Sequential Unconstrained Minimization Technique is a heuristic for constrained optimization. To minimize a function L subject to constraints, one employs a non-negative function P penalizing violations of the constraints, such that P(x) is zero iff x satisfies the constraints. One iteratively minimizes L(x) + \rho_k P(x), where the \rho values are increased according to the rule \rho_{k+1} = q \rho_k for some constant q > 1, until convergence is obtained in the sense that the Euclidean distance between successive solutions x_k and x_{k+1} is small enough. Note that the “solution” x obtained does not necessarily satisfy the constraints, i.e., has zero P(x). Note also that there is no guarantee that global (approximately) constrained optima are found. Standard practice would recommend to use the best solution found in “sufficiently many” replications of the algorithm.

The unconstrained minimizations are carried out by either optim or nlm, using analytic gradients if both grad_L and grad_P are given, and numeric ones otherwise.

If more than one starting value is given, the solution with the minimal augmented criterion function value is returned.

Value

A list inheriting from class "sumt", with components x, L, P, and rho giving the solution obtained, the value of the criterion and penalty function at x, and the final \rho value used in the augmented criterion function.

References

A. V. Fiacco and G. P. McCormick (1968). Nonlinear programming: Sequential unconstrained minimization techniques. New York: John Wiley & Sons.
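Examples

A minimal illustrative sketch (the toy objective and penalty are not from the original documentation): minimize (x_1 - 1)^2 + (x_2 - 2)^2 subject to x_1 = x_2.

L <- function(x) sum((x - c(1, 2))^2)
## Penalty: zero iff the constraint x[1] == x[2] is satisfied.
P <- function(x) (x[1] - x[2])^2
out <- sumt(c(0, 0), L, P)
## The constrained optimum is at (1.5, 1.5):
out$x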

