
Title:

Cluster Sharpening

Version:

0.1.0.1

Author:

Tomasz Konopka [aut, cre]

Maintainer:

Tomasz Konopka <tokonopka@gmail.com>

Description:

Clustering typically assigns data points into discrete groups, but the clusters can sometimes be indistinct. Cluster sharpening adjusts an existing clustering to create contrast between groups. This package provides a general interface for cluster sharpening along with several implementations based on different excision criteria.

Depends:

R (≥ 3.5.0)

Imports:

methods, stats

License:

MIT + file LICENSE

URL:

https://github.com/tkonopka/ksharp

BugReports:

https://github.com/tkonopka/ksharp/issues

LazyData:

true

Suggests:

cluster, dbscan, knitr, Rcssplot (≥ 1.0.0), rmarkdown, testthat

VignetteBuilder:

knitr

Encoding:

UTF-8

RoxygenNote:

7.0.2

NeedsCompilation:

Packaged:

2020-01-18 15:56:31 UTC; tkonopka

Repository:

CRAN

Date/Publication:

2020-01-26 10:10:02 UTC

Toy dataset with two convex groups with partial overlap

Description

Toy dataset with two convex groups with partial overlap

Usage

data(kdata.1)

Format

matrix with two columns: D1, D2

Toy dataset with two non-overlapping and non-spherical groups

Description

Toy dataset with two non-overlapping and non-spherical groups

Usage

data(kdata.2)

Format

matrix with two columns: D1, D2

Toy dataset with three groups

Description

Toy dataset with three groups

Usage

data(kdata.3)

Format

matrix with two columns: D1, D2

Toy dataset with four groups atop a wide area of noise points

Description

Toy dataset with four groups atop a wide area of noise points

Usage

data(kdata.4)

Format

matrix with two columns: D1, D2

Sharpen a clustering

Description

Each data point in a clustering is assigned to a cluster, but some data points may lie in ambiguous zones between two or more clusters, or far from other points. Cluster sharpening assigns these border points into a separate noise group, thereby creating starker distinctions between groups.

Usage

ksharp(
  x,
  threshold = 0.1,
  data = NULL,
  method = c("silhouette", "neighbor", "medoid"),
  threshold.abs = NULL
)

Arguments

x

clustering object; several types of inputs are acceptable, including objects of class kmeans, pam, and self-made lists with a component "cluster".

threshold

numeric; the fraction of points to place in noise group

data

matrix, raw data corresponding to clustering x; must be present when sharpening for the first time or if data is not present within x.

method

character, determines method used for sharpening

threshold.abs

numeric; absolute value of the threshold for sharpening. When non-NULL, this value overrides the value in argument 'threshold'.

Details

Noise points are assigned to a group with cluster index 0. This is analogous to the output produced by dbscan.

Value

clustering object based on input x, with adjusted cluster assignments and additional list components with sharpness measures. Cluster assignments are placed in $cluster and excised data points are given a cluster index of 0. Original cluster assignments are saved in $cluster.original. Sharpness measures are stored in components $silinfo, $medinfo, and $neiinfo, although these details may change in future versions of the package.

Examples

# prepare iris dataset for analysis
iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris.data)))

# cluster the dataset into three groups
iris.clustered = kmeans(iris.data, centers=3)
table(iris.clustered$cluster)

# sharpen the clustering by excluding 10% of the data points
iris.sharp = ksharp(iris.clustered, threshold=0.1, data=iris.data)
table(iris.sharp$cluster)

# visualize cluster assignments
iris.pca = prcomp(iris.data)$x[,1:2]
plot(iris.pca, col=iris$Species, pch=ifelse(iris.sharp$cluster==0, 1, 19))
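The role of the threshold argument can be illustrated with a minimal base-R sketch. It assumes a vector of per-point sharpness values is already available and moves the lowest-scoring fraction of points into group 0. The function name sharpen_by_fraction and its rounding rule are hypothetical choices for illustration; ksharp may determine its cutoff differently.

```r
# Illustrative sketch only: sharpen_by_fraction() is a hypothetical helper,
# not part of ksharp, and the package may compute its cutoff differently.
sharpen_by_fraction <- function(cluster, widths, threshold = 0.1) {
  stopifnot(length(cluster) == length(widths),
            threshold >= 0, threshold <= 1)
  # number of points to move into the noise group (rounded down)
  n_noise <- floor(threshold * length(cluster))
  result <- cluster
  if (n_noise > 0) {
    # points with the lowest sharpness values become noise (index 0)
    noise <- order(widths)[seq_len(n_noise)]
    result[noise] <- 0L
  }
  result
}

# toy input: five points in two clusters, with made-up sharpness values
cluster <- c(a = 1L, b = 1L, c = 1L, d = 2L, e = 2L)
widths <- c(0.9, 0.8, 0.1, 0.7, 0.3)
sharp <- sharpen_by_fraction(cluster, widths, threshold = 0.4)
print(sharp)
```

With a threshold of 0.4, the two points with the lowest values (c and e) receive index 0, while the remaining points keep their original labels.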

Compute info on distances to medoids/centroids

Description

Analogous in structure to silinfo and neiinfo, it computes a "widths" matrix assessing how well each data point belongs to its cluster. Here, this measure is the ratio of two distances: in the numerator, the distance from the point to the nearest cluster center, and in the denominator, from the point to its own cluster center.

Usage

medinfo(cluster, data, silwidths)

Arguments

cluster

named vector

data

matrix with raw data

silwidths

matrix with silhouette widths

Value

list with component widths. The widths object is a matrix with one row per data item, with column med_ratio holding the sharpness measure.

Examples

# construct a manual clustering of the iris dataset
iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris.data)))
iris.dist = dist(iris.data)
iris.clusters = setNames(as.integer(iris$Species), rownames(iris.data))

# compute sharpness values based on medoids
iris.silinfo = silinfo(iris.clusters, iris.dist)
medinfo(iris.clusters, iris.data, iris.silinfo$widths)

Compute info on 'neighbor widths'

Description

This function provides information on how well each data point belongs to its cluster. For each query point, the function considers n of its nearest neighbors. The neighbor widths are defined as the fraction of those neighbors that belong to the same cluster as the query point. These values are termed 'widths' in analogy to silhouette widths, another measure of cluster membership.

Usage

neiinfo(cluster, dist)

Arguments

cluster

vector with assignments of data elements to clusters

dist

distance object or matrix

Details

The function has a signature similar to that of silinfo from this package.

Value

list with component widths. The widths object is a matrix with one row per data item, with column neighborhood holding the sharpness value.

Examples

# construct a manual clustering of the iris dataset
iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris)))
iris.dist = dist(iris.data)
iris.clusters = setNames(as.integer(iris$Species), rownames(iris.data))

# compute neighbor-based sharpness widths
neiinfo(iris.clusters, iris.dist)
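The neighbor-width idea can be sketched in base R without the package. The helper below, neighbor_fraction, is a hypothetical name, and the default of n = 5 neighbors is an assumption made for illustration; the text above does not state how neiinfo chooses n, and its actual implementation may differ.

```r
# Illustrative sketch only: neighbor_fraction() and its default n = 5 are
# assumptions for this example, not the ksharp implementation.
neighbor_fraction <- function(cluster, dist, n = 5) {
  d <- as.matrix(dist)
  # require matching names so values cannot be assigned to the wrong items
  stopifnot(!is.null(names(cluster)),
            identical(rownames(d), names(cluster)))
  diag(d) <- Inf  # a point is not counted as its own neighbor
  vapply(names(cluster), function(id) {
    nearest <- names(sort(d[id, ]))[seq_len(n)]
    # fraction of the n nearest neighbors sharing the query point's cluster
    mean(cluster[nearest] == cluster[id])
  }, numeric(1))
}

iris.data <- iris[, 1:4]
rownames(iris.data) <- paste0("iris_", seq_len(nrow(iris.data)))
iris.clusters <- setNames(as.integer(iris$Species), rownames(iris.data))
nei.widths <- neighbor_fraction(iris.clusters, dist(iris.data), n = 5)
head(nei.widths)
```

Values close to 1 indicate points surrounded by members of their own cluster; values near 0 mark points that sit among members of other clusters.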

Compute info on silhouette widths

Description

This function provides information on how well each data point belongs to its cluster. For each query point, the function considers the average distance to other members of the same cluster and the average distance to members of another, nearest, cluster. The widths are defined as the difference between these two averages (nearest other cluster minus own cluster), divided by the larger of the two.

Usage

silinfo(cluster, dist)

Arguments

cluster

vector with assignments of data elements to clusters

dist

distance object or matrix

Details

The function signature is very similar to cluster::silhouette, but the implementation has important differences. This implementation requires that both the dist object and the cluster vector have names. This prevents accidental assignment of silhouette widths to the wrong elements.

Value

list, analogous to object within output from cluster::pam. In particular, the list has a component widths. The widths object is a matrix with one row per data item, with column sil_width holding the silhouette width.

Examples

# construct a manual clustering of the iris dataset
iris.data = iris[, 1:4]
rownames(iris.data) = paste0("iris_", seq_len(nrow(iris.data)))
iris.dist = dist(iris.data)
iris.clusters = setNames(as.integer(iris$Species), rownames(iris.data))

# compute sharpness values based on silhouette widths
silinfo(iris.clusters, iris.dist)
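For readers who want to see the calculation spelled out, the following base-R sketch computes the conventional silhouette width, (b - a) / max(a, b), where a is the average distance to other members of a point's own cluster and b is the average distance to members of the nearest other cluster. The function name silhouette_width is hypothetical, the sketch assumes at least two clusters, and it is not the package's implementation; it only mirrors the name check described in the Details above.

```r
# Illustrative sketch only: silhouette_width() is a hypothetical helper that
# applies the conventional silhouette formula; it is not ksharp code.
# Assumes the clustering contains at least two clusters.
silhouette_width <- function(cluster, dist) {
  d <- as.matrix(dist)
  # both inputs must carry the same names, in the same order
  stopifnot(!is.null(names(cluster)),
            identical(rownames(d), names(cluster)))
  labels <- unique(cluster)
  vapply(names(cluster), function(id) {
    own <- cluster[[id]]
    same <- setdiff(names(cluster)[cluster == own], id)
    if (length(same) == 0) return(0)  # common convention for singletons
    a <- mean(d[id, same])            # average distance within own cluster
    b <- min(vapply(setdiff(labels, own), function(k) {
      mean(d[id, names(cluster)[cluster == k]])
    }, numeric(1)))                   # average distance to nearest other cluster
    (b - a) / max(a, b)
  }, numeric(1))
}

iris.data <- iris[, 1:4]
rownames(iris.data) <- paste0("iris_", seq_len(nrow(iris.data)))
iris.clusters <- setNames(as.integer(iris$Species), rownames(iris.data))
sil.widths <- silhouette_width(iris.clusters, dist(iris.data))
summary(sil.widths)
```

Widths range from -1 to 1. Negative values flag points that are, on average, closer to another cluster than to their own, which makes them natural candidates for the noise group.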