| Title: | Cluster Sharpening |
| Version: | 0.1.0.1 |
| Author: | Tomasz Konopka [aut, cre] |
| Maintainer: | Tomasz Konopka <tokonopka@gmail.com> |
| Description: | Clustering typically assigns data points into discrete groups, but the clusters can sometimes be indistinct. Cluster sharpening adjusts an existing clustering to create contrast between groups. This package provides a general interface for cluster sharpening along with several implementations based on different excision criteria. |
| Depends: | R (≥ 3.5.0) |
| Imports: | methods, stats |
| License: | MIT + file LICENSE |
| URL: | https://github.com/tkonopka/ksharp |
| BugReports: | https://github.com/tkonopka/ksharp/issues |
| LazyData: | true |
| Suggests: | cluster, dbscan, knitr, Rcssplot (≥ 1.0.0), rmarkdown, testthat |
| VignetteBuilder: | knitr |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.0.2 |
| NeedsCompilation: | no |
| Packaged: | 2020-01-18 15:56:31 UTC; tkonopka |
| Repository: | CRAN |
| Date/Publication: | 2020-01-26 10:10:02 UTC |
Toy dataset with two convex groups with partial overlap
Description
Toy dataset with two convex groups with partial overlap
Usage
data(kdata.1)
Format
matrix with two columns: D1, D2
Toy dataset with two non-overlapping and non-spherical groups
Description
Toy dataset with two non-overlapping and non-spherical groups
Usage
data(kdata.2)
Format
matrix with two columns: D1, D2
Toy dataset with three groups
Description
Toy dataset with three groups
Usage
data(kdata.3)
Format
matrix with two columns: D1, D2
Toy dataset with four groups atop a wide area of noise points
Description
Toy dataset with four groups atop a wide area of noise points
Usage
data(kdata.4)
Format
matrix with two columns: D1, D2
Sharpen a clustering
Description
Each data point in a clustering is assigned to a cluster, but some data points may lie in ambiguous zones between two or more clusters, or far from other points. Cluster sharpening assigns these border points into a separate noise group, thereby creating starker distinctions between groups.
Usage
ksharp(
  x,
  threshold = 0.1,
  data = NULL,
  method = c("silhouette", "neighbor", "medoid"),
  threshold.abs = NULL
)
Arguments
x | clustering object; several types of inputs are acceptable, including objects of class kmeans, pam, and self-made lists with a component "cluster". |
threshold | numeric; the fraction of points to place in the noise group |
data | matrix, raw data corresponding to clustering x; must be present when sharpening for the first time or if data is not present within x. |
method | character; determines the method used for sharpening |
threshold.abs | numeric; absolute value of the threshold for sharpening. When non-NULL, this value overrides the value in argument 'threshold'. |
Details
Noise points are assigned to a group with cluster index 0. This is analogous to the output produced by dbscan.
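As a rough illustration of how 'threshold' and 'threshold.abs' relate, the sketch below marks the lowest-scoring points as noise (cluster index 0). The helper sharpen_by_width and the quantile rule used to turn a fraction into an absolute cutoff are assumptions made for illustration; they are not taken from the package source.

```r
# Hypothetical helper (not part of ksharp): give cluster index 0 to points
# whose sharpness value falls below a cutoff.
sharpen_by_width = function(cluster, widths, threshold = 0.1,
                            threshold.abs = NULL) {
  # assumption: a fractional threshold is converted into an absolute
  # cutoff via a quantile of the sharpness values
  if (is.null(threshold.abs)) {
    threshold.abs = unname(quantile(widths, threshold))
  }
  result = cluster
  result[widths < threshold.abs] = 0
  result
}

# ten points in two clusters with made-up sharpness values
cluster = setNames(rep(1:2, each = 5), paste0("p", 1:10))
widths = c(0.9, 0.8, 0.1, 0.7, 0.6, 0.85, 0.05, 0.75, 0.65, 0.95)

# fractional threshold: roughly the lowest 20% of points become noise
sharp.frac = sharpen_by_width(cluster, widths, threshold = 0.2)
table(sharp.frac)

# absolute threshold: overrides the fractional one
sharp.abs = sharpen_by_width(cluster, widths, threshold.abs = 0.7)
table(sharp.abs)
```

In this sketch, an absolute cutoff makes the number of excised points depend on the data, whereas a fractional threshold fixes the approximate share of points excised.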
Value
clustering object based on input x, with adjusted cluster assignments and additional list components with sharpness measures. Cluster assignments are placed in $cluster and excised data points are given a cluster index of 0. Original cluster assignments are saved in $cluster.original. Sharpness measures are stored in components $silinfo, $medinfo, and $neiinfo, although these details may change in future versions of the package.
Examples
# prepare iris dataset for analysis
iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris.data)))
# cluster the dataset into three groups
iris.clustered = kmeans(iris.data, centers=3)
table(iris.clustered$cluster)
# sharpen the clustering by excluding 10% of the data points
iris.sharp = ksharp(iris.clustered, threshold=0.1, data=iris.data)
table(iris.sharp$cluster)
# visualize cluster assignments
iris.pca = prcomp(iris.data)$x[,1:2]
plot(iris.pca, col=iris$Species, pch=ifelse(iris.sharp$cluster==0, 1, 19))
Compute info on distances to medoids/centroids
Description
Analogous in structure to silinfo and neiinfo, it computes a "widths" matrix assessing how well each data point belongs to its cluster. Here, this measure is the ratio of two distances: in the numerator, the distance from the point to the nearest cluster center, and in the denominator, the distance from the point to its own cluster center.
Usage
medinfo(cluster, data, silwidths)
Arguments
cluster | named vector |
data | matrix with raw data |
silwidths | matrix with silhouette widths |
Value
list with component widths. The widths object is a matrix with one row per data item, with column med_ratio holding the sharpness measure.
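The distance ratio described for medinfo can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper center_ratio is hypothetical, cluster centers are taken to be column means (centroids), and "nearest cluster center" is read as the nearest center other than the point's own. The package may compute med_ratio differently.

```r
# Hypothetical sketch (not the ksharp implementation): ratio of the distance
# to the nearest other cluster center over the distance to the point's own
# cluster center. Larger values suggest a point sits firmly in its cluster.
center_ratio = function(cluster, data) {
  data = as.matrix(data)
  ids = sort(unique(cluster))
  # assumption: centers are column means, one row per cluster
  centers = t(sapply(ids, function(k) {
    colMeans(data[cluster == k, , drop = FALSE])
  }))
  # Euclidean distance from every point to every center
  d = sapply(seq_along(ids), function(j) {
    sqrt(rowSums(sweep(data, 2, centers[j, ])^2))
  })
  idx = seq_len(nrow(data))
  own = match(cluster, ids)
  d.own = d[cbind(idx, own)]
  # mask the own-center distance, then take the nearest remaining center
  d[cbind(idx, own)] = Inf
  d.other = apply(d, 1, min)
  setNames(d.other / d.own, rownames(data))
}

iris.data = as.matrix(iris[, 1:4])
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris.data)))
iris.clusters = setNames(as.integer(iris$Species), rownames(iris.data))
iris.ratio = center_ratio(iris.clusters, iris.data)
head(iris.ratio)
```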
Examples
# construct a manual clustering of the iris dataset
iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris.data)))
iris.dist = dist(iris.data)
iris.clusters = setNames(as.integer(iris$Species), rownames(iris.data))
# compute sharpness values based on medoids
iris.silinfo = silinfo(iris.clusters, iris.dist)
medinfo(iris.clusters, iris.data, iris.silinfo$widths)
Compute info on 'neighbor widths'
Description
This function provides information on how well each data point belongs to its cluster. For each query point, the function considers n of its nearest neighbors. The neighbor widths are defined as the fraction of those neighbors that belong to the same cluster as the query point. These values are termed 'widths' in analogy to silhouette widths, another measure of cluster membership.
Usage
neiinfo(cluster, dist)
Arguments
cluster | vector with assignments of data elements to clusters |
dist | distance object or matrix |
Details
The function has a signature similar to that of silinfo from this package.
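The neighbor-width idea (the fraction of a point's nearest neighbors that share its cluster) can be sketched in base R as below. The helper neighbor_fraction is hypothetical and is not the package's implementation; in particular, neiinfo does not expose the neighbor count as an argument, so the explicit n = 5 here is an assumption chosen only for illustration.

```r
# Hypothetical sketch (not the ksharp implementation): for each point, the
# fraction of its n nearest neighbors carrying the same cluster label.
neighbor_fraction = function(cluster, dist, n = 5) {
  dmat = as.matrix(dist)
  # align cluster labels with the distance matrix by name
  cluster = cluster[rownames(dmat)]
  # a point should not count as its own neighbor
  diag(dmat) = Inf
  result = sapply(seq_len(nrow(dmat)), function(i) {
    nearest = order(dmat[i, ])[seq_len(n)]
    mean(cluster[nearest] == cluster[i])
  })
  setNames(result, rownames(dmat))
}

iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris.data)))
iris.dist = dist(iris.data)
iris.clusters = setNames(as.integer(iris$Species), rownames(iris.data))
iris.nei = neighbor_fraction(iris.clusters, iris.dist, n = 5)
# points with low values sit near members of other clusters
head(sort(iris.nei))
```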
Value
list with component widths. The widths object is a matrix with one row per data item, with column neighborhood holding the sharpness value.
Examples
# construct a manual clustering of the iris dataset
iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris)))
iris.dist = dist(iris.data)
iris.clusters = setNames(as.integer(iris$Species), rownames(iris.data))
# compute neighbor-based sharpness widths
neiinfo(iris.clusters, iris.dist)
Compute info on silhouette widths
Description
This function provides information on how well each data point belongs to its cluster. For each query point, the function considers the average distance to other members of the same cluster and the average distance to members of another, nearest, cluster. The widths are defined as the
Usage
silinfo(cluster, dist)
Arguments
cluster | vector with assignments of data elements to clusters |
dist | distance object or matrix |
Details
The function signature is very similar to cluster::silhouette, but the implementation has important differences. This implementation requires that both the dist object and the cluster vector have names. This prevents accidental assignment of silhouette widths to the wrong elements.
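For orientation, the conventional silhouette width of a point is (b - a) / max(a, b), where a is the average distance to other members of its own cluster and b is the average distance to members of the nearest other cluster. The base R sketch below computes that conventional quantity and, like silinfo, insists on named inputs. The helper silhouette_width is hypothetical, and whether silinfo matches it in every detail is an assumption.

```r
# Hypothetical sketch of the conventional silhouette width,
# (b - a) / max(a, b); not the ksharp implementation.
# Assumes at least two clusters.
silhouette_width = function(cluster, dist) {
  dmat = as.matrix(dist)
  # require names so that widths cannot be matched to the wrong elements
  stopifnot(!is.null(names(cluster)), !is.null(rownames(dmat)))
  cluster = cluster[rownames(dmat)]
  ids = unique(cluster)
  result = sapply(seq_len(nrow(dmat)), function(i) {
    same = cluster == cluster[i]
    same[i] = FALSE
    # convention assumed here: singleton clusters get width 0
    if (!any(same)) return(0)
    a = mean(dmat[i, same])
    others = setdiff(ids, cluster[i])
    b = min(sapply(others, function(k) mean(dmat[i, cluster == k])))
    (b - a) / max(a, b)
  })
  setNames(result, rownames(dmat))
}

iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris.data)))
iris.dist = dist(iris.data)
iris.clusters = setNames(as.integer(iris$Species), rownames(iris.data))
iris.sil = silhouette_width(iris.clusters, iris.dist)
# negative widths flag points that are closer, on average, to another cluster
sum(iris.sil < 0)
```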
Value
list, analogous to the object within output from cluster::pam. In particular, the list has a component widths. The widths object is a matrix with one row per data item, with column sil_width holding the silhouette width.
Examples
# construct a manual clustering of the iris dataset
iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris.data)))
iris.dist = dist(iris.data)
iris.clusters = setNames(as.integer(iris$Species), rownames(iris.data))
# compute sharpness values based on silhouette widths
silinfo(iris.clusters, iris.dist)