
Title:

Perform Monothetic Clustering with Extensions to Circular Data

Version:

1.2.1

Description:

Implementation of the Monothetic Clustering algorithm (Chavent, 1998 <doi:10.1016/S0167-8655(98)00087-7>) on continuous data sets. The package also includes several extensions: monothetic clustering on data sets with circular variables, visualizations of the results, and permutation- and cross-validation-based tests to support the decision on the number of clusters.

License:

GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]

URL:

https://vinhtantran.github.io/monoClust/, https://github.com/vinhtantran/monoClust

BugReports:

https://github.com/vinhtantran/monoClust/issues

Depends:

R (≥ 3.3.0)

Imports:

cluster (≥ 2.0.5), doParallel, dplyr (≥ 1.0.0), foreach, ggplot2, graphics, grDevices, parallel, permute, purrr (≥ 0.3.0), rlang (≥ 0.3.0), stats, stringr (≥ 0.5), tibble (≥ 3.0.0), tidyr (≥ 1.0.0)

Suggests:

knitr, mice, rmarkdown, covr, testthat (≥ 3.0.0)

VignetteBuilder:

knitr

Encoding:

UTF-8

LazyData:

true

RoxygenNote:

7.1.1

Config/testthat/edition:

NeedsCompilation:

Packaged:

2021-02-15 00:10:00 UTC; vinht

Author:

Tan Tran [aut, cre], Brian McGuire [aut], Mark Greenwood [aut]

Maintainer:

Tan Tran <vinhtantran@gmail.com>

Repository:

CRAN

Date/Publication:

2021-02-15 15:00:02 UTC

monoClust: Perform Monothetic Clustering with Extensions to Circular Data

Description

Implementation of the Monothetic Clustering algorithm (Chavent, 1998 <doi:10.1016/S0167-8655(98)00087-7>) on continuous data sets. The package also includes several extensions: monothetic clustering on data sets with circular variables, visualizations of the results, and permutation- and cross-validation-based tests to support the decision on the number of clusters.

Author(s)

Maintainer: Tan Tran <vinhtantran@gmail.com> (ORCID)

Authors:

Brian McGuire <mcguirebc@gmail.com>
Mark Greenwood <greenwood@montana.edu> (ORCID)

Monothetic Clustering

Description

Creates a MonoClust object after partitioning the data set using Monothetic Clustering.

Usage

MonoClust(
  toclust,
  cir.var = NULL,
  variables = NULL,
  distmethod = NULL,
  digits = getOption("digits"),
  nclusters = 2L,
  minsplit = 5L,
  minbucket = round(minsplit/3),
  ncores = 1L
)

Arguments

toclust

Data set as a data frame.

cir.var

Index or name of the circular variable in the data set.

variables

List of variables selected for the clustering procedure. It could be a vector of variable indexes or a vector of variable names.

distmethod

Distance method to use with the data set. Can be chosen from "euclidean" (for Euclidean distance), "manhattan" (for Manhattan distance), or "gower" (for Gower distance). If not set, Euclidean distance is used, unless cir.var is set, in which case Gower distance is used by default. Abbreviations can be used.

digits

Number of significant digits printed in the output.

nclusters

Number of clusters created. Default is 2.

minsplit

The minimum number of observations that must exist in a node in order for a split to be attempted. Default is 5.

minbucket

The minimum number of observations in any terminal leaf node. Default is minsplit/3.

ncores

Number of CPU cores on the current host. If greater than 1, parallel processing with foreach::foreach() is used to distribute the cut search on variables to processes. When set to NULL, all available cores are used.

Value

A MonoClust object. See MonoClust.object.

References

Chavent, M. (1998). A monothetic clustering method. Pattern Recognition Letters, 19(11), 989-996. doi:10.1016/S0167-8655(98)00087-7.
Tran, T. V. (2019). Monothetic Cluster Analysis with Extensions to Circular and Functional Data. Montana State University - Bozeman.

Examples

# Very simple data set
library(cluster)
data(ruspini)
ruspini4sol <- MonoClust(ruspini, nclusters = 4)
ruspini4sol

# Data with a circular variable
library(monoClust)
data(wind_sensit_2007)
# Use a small data set
set.seed(12345)
wind_reduced <- wind_sensit_2007[sample.int(nrow(wind_sensit_2007), 10), ]
circular_wind <- MonoClust(wind_reduced, cir.var = 3, nclusters = 2)
circular_wind

Monothetic Clustering Tree Object

Description

The structure and objects contained in MonoClust, an object returned from the MonoClust() function and used as the input in other functions in the package.

Value

frame

Data frame in the form of a tibble::tibble() representing a tree structure with one row for each node. The columns include:

number: Index of the node. Depth of a node can be derived by number %/% 2.
var: Name of the variable used in the split at a node, or "<leaf>" if it is a leaf node.
cut: Splitting value, so values of var that are smaller than that go to the left branch while values greater than that go to the right branch.
n: Cluster size, the number of observations in that cluster.
inertia: Inertia value of the cluster at that node.
bipartsplitrow: Position of the next split row in the data set (that position will belong to the left (smaller) node).
bipartsplitcol: Position of the next split variable in the data set.
inertiadel: Proportion of the inertia value of the cluster at that node to the inertia of the root.
medoid: Position of the data point regarded as the medoid of its cluster.
loc: y-coordinate of the splitting node to facilitate showing on the tree. See plot.MonoClust() for details.
split.order: Order of the splits, with the root being 0.
inertia_explained: Percent inertia explained as described in Chavent (2007). It is 1 - (sum(current inertia)/inertia[1]).
alt: A nested tibble of alternate splits at a node. It contains bipartsplitrow and bipartsplitcol with the same meaning as above. Note that this is for information purposes only. Currently monoClust does not support choosing an alternate splitting route. If needed, MonoClust() can be run step by step with nclusters = 2.

membership

Vector of the same length as the number of rows in the data, containing the value of frame$number corresponding to the leaf node that an observation falls into.

dist

Distance matrix calculated using the method indicated in the distmethod argument of MonoClust().

terms

Vector of variable names in the data that were used to split.

centroids

Data frame with one row for the centroid value of each cluster.

medoids

Named vector of positions of the data points regarded as medoids of clusters.

alt

Indicator of whether an alternate splitting route occurred when splitting.

circularroot

List of values designed for the circular variable in the data set. var is the name of the circular variable and cut is its first best split value. If a circular variable is not available, both objects are NULL.

References

Chavent, M., Lechevallier, Y., & Briant, O. (2007). DIVCLUS-T: A monothetic divisive hierarchical clustering method. Computational Statistics & Data Analysis, 52(2), 687-701. doi:10.1016/j.csda.2007.03.013.

Coerce Similar Object to MonoClust

Description

The function turns a MonoClust-similar object into a MonoClust object so it can use supported functions for MonoClust such as print.MonoClust() and plot.MonoClust().

Usage

as_MonoClust(x, ...)

## Default S3 method:
as_MonoClust(x, ...)

Arguments

x

An object that can be coerced to MonoClust object.

...

For extensibility.

Details

as_MonoClust() is an S3 generic. The function itself doesn't run unless it is implemented for another similar object. Currently, this function is not implemented within the monoClust package.

Find Centroid of the Cluster

Description

The centroid is the point whose coordinates are the means of its cluster's values.

Usage

centroid(data, frame, cloc)

Arguments

data

Original data set.

frame

The split tree transferred as data frame.

cloc

Vector of current cluster membership.

Value

A data frame with coordinates of centroids
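To illustrate the definition of a centroid, the per-cluster means can be computed in base R. This is only a minimal sketch and not the package's centroid() implementation; the data frame dat and the membership vector cloc_example are made-up names and values.

```r
# Made-up data: four observations in two clusters labeled 2 and 3
dat <- data.frame(x = c(0, 2, 10, 12), y = c(1, 3, 5, 7))
cloc_example <- c(2, 2, 3, 3)

# One row per cluster, each column holding the mean of that variable
centroids <- aggregate(dat, by = list(cluster = cloc_example), FUN = mean)
centroids
```

Each row of centroids is the coordinate-wise mean of the observations sharing a cluster label.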

First Gate Function

Description

This function checks which nodes are available to split and calls find_split() on each node, then decides which node creates the best split and calls splitter() to perform the split.

Usage

checkem(
  data,
  cuts,
  frame,
  cloc,
  dist,
  variables,
  minsplit,
  minbucket,
  split_order,
  ncores
)

Arguments

data

Original data set.

cuts

Cuts data set, which has the next higher value of each variable in the original data set.

frame

The split tree transferred as data frame.

cloc

Vector of current cluster membership.

dist

Distance matrix of all observations in the data.

variables

List of variables selected for the clustering procedure. It could be a vector of variable indexes or a vector of variable names.

minsplit

The minimum number of observations that must exist in a node in order for a split to be attempted.

minbucket

The minimum number of observations in any terminal leaf node. Default is minsplit/3.

split_order

The control argument that tracks how many splits have been done.

ncores

Number of CPU cores on the current host.

Value

It is not supposed to return anything because the global environment is used. However, if there is nothing left to split, it returns 0 to tell the caller to stop running the loop.

Add/Subtract Circular Values in Degrees/Radian

Description

Add/subtract two circular variables in degrees (%cd+% and %cd-%) and radians (%cr+% and %cr-%).

Usage

x %cd+% y
x %cd-% y
x %cr+% y
x %cr-% y

Arguments

x,y

Circular values in degrees/radians.

Value

A value in [0, 360) for degrees or in [0, 2*pi) for radians.

Examples

90 %cd+% 90
250 %cd+% 200
25 %cd-% 80
pi %cr+% (pi/2)
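The wrap-around behavior described in the Value section can be sketched with R's modulo operator. The helper names below are hypothetical and are not the package's operators; the sketch only assumes that results are wrapped into [0, 360) or [0, 2*pi).

```r
# Hypothetical helpers (not the package's %cd+% family): wrap sums and
# differences of angles back into [0, 360) degrees or [0, 2*pi) radians.
circ_add_deg <- function(x, y) (x + y) %% 360
circ_sub_deg <- function(x, y) (x - y) %% 360
circ_add_rad <- function(x, y) (x + y) %% (2 * pi)

circ_add_deg(90, 90)      # 180
circ_add_deg(250, 200)    # 90, because 450 wraps past 360
circ_sub_deg(25, 80)      # 305, because -55 wraps to 305
circ_add_rad(pi, pi / 2)  # 3 * pi / 2
```

In R, %% returns a non-negative remainder for a positive modulus, which is why the subtraction case needs no special handling.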

Distance Matrix of Circular Variables

Description

Calculates the distance matrix of observations with circular variables using an adapted version of Gower's distance. This distance should be compatible with Gower's distance for other variable types.

Usage

circ_dist(frame)

Arguments

frame

A data frame in which all columns are circular variables measured in degrees.

Details

The distance between two observations i and j on a circular variable q is suggested to be

d(y_{iq}, y_{jq}) = \frac{180 - |180 - |y_{iq} - y_{jq}||}{180}.
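As a quick check of the formula above, the distance for a single circular variable can be evaluated directly. This is a sketch with a hypothetical helper name, not the package's circ_dist(), which works on whole data frames and returns a "dist" object.

```r
# Hypothetical helper: circular distance between two angles in degrees,
# scaled so that opposite directions (180 degrees apart) have distance 1.
circ_pair_dist <- function(a, b) (180 - abs(180 - abs(a - b))) / 180

circ_pair_dist(350, 10)  # 20 degrees apart across 0, so 20 / 180
circ_pair_dist(0, 180)   # opposite directions, so 1
circ_pair_dist(90, 90)   # identical angles, so 0
```

The inner absolute value folds differences larger than 180 degrees back onto the shorter arc, so 350 and 10 are treated as 20 degrees apart rather than 340.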

Value

Object of class "dist".

References

Tran, T. V. (2019). Chapter 3. Monothetic Cluster Analysis with Extensions to Circular and Functional Data. Montana State University - Bozeman.

Examples

# Make a sample data set of 20 observations with 2 circular variables
data <- data.frame(var1 = sample.int(359, 20),
                   var2 = sample.int(359, 20))
circ_dist(data)

Cluster Statistics Calculation

Description

Calculates Calinski-Harabasz's pseudo-F (Calinski and Harabasz, 1974) and the average silhouette width (Rousseeuw, 1987).

Usage

cluster_stats(d, clustering)

Arguments

d

Distance object (as generated by dist()) or a distance matrix between cases.

clustering

Integer vector whose length is the number of cases, indicating a clustering. The clusters have to be numbered from 1 to the number of clusters.

Value

f_stat: Calinski-Harabasz's pseudo-F.
asw: Average silhouette width.
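For intuition, the pseudo-F statistic can be computed from pairwise distances alone. The sketch below assumes Euclidean distances and uses the standard identity that a sum of squares around a mean equals the sum of squared pairwise distances divided by twice the group size. The function name is hypothetical and this is not necessarily how cluster_stats() is implemented.

```r
# Hypothetical helper: Calinski-Harabasz pseudo-F from a distance object
pseudo_f_sketch <- function(d, clustering) {
  d2 <- as.matrix(d)^2                 # squared pairwise distances
  n <- nrow(d2)
  groups <- unique(clustering)
  k <- length(groups)
  total_ss <- sum(d2) / (2 * n)        # total sum of squares
  within_ss <- sum(vapply(groups, function(g) {
    idx <- clustering == g
    sum(d2[idx, idx, drop = FALSE]) / (2 * sum(idx))
  }, numeric(1)))
  between_ss <- total_ss - within_ss
  (between_ss / (k - 1)) / (within_ss / (n - k))
}

# Two tight, well-separated pairs of points
pts <- matrix(c(0, 0, 0, 1, 10, 10, 10, 11), ncol = 2, byrow = TRUE)
pseudo_f_sketch(dist(pts), c(1, 1, 2, 2))  # 400
```

Here the total sum of squares is 201 and the within-cluster sum of squares is 1, so the statistic is (200 / 1) / (1 / 2) = 400.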

References

Caliński, T. and Harabasz, J. (1974). "A dendrite method for cluster analysis". In: Communications in Statistics 3.1, pp. 1–27. doi:10.1080/03610927408827101.
Rousseeuw, P. J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis". In: Journal of Computational and Applied Mathematics 20, pp. 53–65. ISSN: 03770427. doi:10.1016/0377-0427(87)90125-7.

Create Labels for Split Variables

Description

This function creates the variable labels for a MonoClust tree.

Usage

create_labels(x, abbrev, digits = getOption("digits"), ...)

Arguments

x

MonoClust result object.

abbrev

Whether to print the abbreviated versions of variable names. Can be either "no" (default), "short", or "abbreviate". Short forms of them can also be used.

If "no", the labels recorded in x$labels are used.

If "short", variable names will be turned into "V1", "V2", ...

If "abbreviate", the abbreviate() function will be used. Use the optional arguments for this function.

digits

Number of significant digits to print.

...

Optional arguments to abbreviate().

Value

A list containing two elements:

varnames: A named vector of labels corresponding to the variables' names (stored as the vector names).
labels: Vector of labels of splitting rules to be displayed.

Cross-Validation Test on MonoClust

Description

Performs a cross-validation test for different numbers of clusters of Monothetic Clustering.

Usage

cv.test(data, fold = 10L, minnodes = 2L, maxnodes = 10L, ncores = 1L, ...)

Arguments

data

Data set to be partitioned.

fold

Number of folds (k). fold = 1 is the special case in which the function performs a Leave-One-Out Cross-Validation (LOOCV).

minnodes

Minimum number of clusters to be checked.

maxnodes

Maximum number of clusters to be checked.

ncores

Number of CPU cores on the current host. When set to NULL, all available cores are used.

...

Other parameters transferred to MonoClust().

Details

The k-fold cross-validation randomly partitions the data into k subsets of equal (or nearly equal) size. k - 1 subsets are used as the training data set to create a tree with a desired number of leaves, and the remaining subset is used as the validation data set to evaluate the predictive performance of the trained tree. The process repeats with each subset in turn as the validation set (m = 1, \ldots, k), and the mean squared difference,

MSE_m = \frac{1}{n_m} \sum_{q=1}^Q \sum_{i \in m} d^2_{euc}(y_{iq}, \hat{y}_{(-i)q}),

is calculated, where n_m is the number of observations in subset m, \hat{y}_{(-i)q} is the mean on variable q of the cluster, created from the training data, into which the observed value y_{iq} of the validation data set falls, and d^2_{euc}(y_{iq}, \hat{y}_{(-i)q}) is the squared Euclidean distance (dissimilarity) between the two values on variable q. The average of these test errors over the k subsets is the cross-validation-based estimate of the mean squared error of predicting a new observation,

CV_k = \overline{MSE} = \frac{1}{k} \sum_{m=1}^k MSE_m.
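The fold bookkeeping behind these formulas can be sketched in base R. The example below is an assumption-laden illustration, not the cv.test() implementation: it uses stats::kmeans() as a stand-in for MonoClust(), assigns each validation row to its nearest training cluster mean rather than following tree rules, and uses the built-in iris measurements as example data.

```r
set.seed(1)
dat <- as.matrix(iris[, 1:4])   # example data, four continuous variables
k <- 5                          # number of folds
n <- nrow(dat)

# Randomly assign each observation to one of k folds of (nearly) equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

mse <- vapply(seq_len(k), function(m) {
  train <- dat[fold_id != m, , drop = FALSE]
  valid <- dat[fold_id == m, , drop = FALSE]
  # Stand-in clustering (assumption): k-means with 3 clusters on the training set
  fit <- kmeans(train, centers = 3, nstart = 5)
  # Squared Euclidean distance from each validation row to each cluster mean
  d2 <- vapply(seq_len(nrow(fit$centers)), function(j) {
    rowSums(sweep(valid, 2, fit$centers[j, ])^2)
  }, numeric(nrow(valid)))
  # MSE_m: average squared distance to the assigned (nearest) cluster mean
  mean(apply(d2, 1, min))
}, numeric(1))

# Cross-validation estimate: the average of the k fold errors
cv_k <- mean(mse)
cv_k
```

Repeating this for several numbers of clusters gives one error estimate per candidate, which is the kind of table that cv.test() is documented to return.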

Value

A cv.MonoClust object containing a data frame of the mean squared error and its standard deviation.

Note

This function supports parallel processing with foreach::foreach(). It distributes MonoClust calls to processes.

Examples

library(cluster)
data(ruspini)

# Leave-one-out cross-validation
cv.test(ruspini, fold = 1, minnodes = 2, maxnodes = 4)

# 5-fold cross-validation
cv.test(ruspini, fold = 5, minnodes = 2, maxnodes = 4)

Make Error Bars

Description

Make Error Bars

Usage

error_bar(x, y, upper, lower = upper, length = 0.1, ...)

Arguments

x

x coordinates.

y

y coordinates.

upper

Distance from y to the upper bar.

lower

Distance from y to the lower bar.

length

Length of the horizontal bar.

...

Other arguments to graphics::arrows().

Value

Plot

Find the Closest Cut

Description

Finds the cuts for a quantitative variable. These cuts are the candidates considered when bi-partitioning the data. For a quantitative column, find the next larger value of each value; for the largest value, use that value + 1.

Usage

find_closest(col)

Arguments

col

a quantitative vector.

Value

a quantitative vector which contains the closest higher cut.
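The rule in the description can be sketched in a few lines of base R. This follows only the wording above (next larger value, or the value plus 1 for the maximum) and is not taken from the package source; the function name is hypothetical.

```r
# Hypothetical helper: for each value, return the next larger distinct value
# in the vector; the largest value maps to itself plus 1.
next_larger_sketch <- function(col) {
  vals <- sort(unique(col))
  nxt <- c(vals[-1], vals[length(vals)] + 1)
  nxt[match(col, vals)]
}

next_larger_sketch(c(3, 1, 2, 3))  # 4 2 3 4
```

Tied values share the same cut, because the lookup is done on the sorted distinct values.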

Find the Best Split

Description

Finds the best split, in terms of reduction in inertia, for the node passed in as a row of the frame. Find the terminal node with the greatest change in inertia and bi-partition it.

Usage

find_split(
  data,
  cuts,
  frame_row,
  cloc,
  dist,
  variables,
  minsplit,
  minbucket,
  ncores
)

Arguments

data

Original data set.

cuts

Cuts data set, which has the next higher value of each variable in the original data set.

frame_row

One row of the split tree as data frame.

cloc

Vector of current cluster membership.

dist

Distance matrix of all observations in the data.

variables

List of variables selected for the clustering procedure. It could be a vector of variable indexes or a vector of variable names.

minsplit

The minimum number of observations that must exist in a node in order for a split to be attempted.

minbucket

The minimum number of observations in any terminal leaf node. Default is minsplit/3.

ncores

Number of CPU cores on the current host.

Value

The updated frame_row with the next split updated.

GGPlot the Mean Square Error with Error Bar for +/- 1 Standard Error

Description

GGPlot the Mean Square Error with Error Bar for +/- 1 Standard Error

Usage

ggcv(
  cv.obj,
  title = "MSE for CV of monothetic clustering",
  xlab = "Number of clusters",
  ylab = "MSE +/- 1 SE",
  type = c("b", "p", "l"),
  linetype = 2,
  err.col = "red",
  err.width = 0.2
)

Arguments

cv.obj

A cv.MonoClust object (output of cv.test()).

title

Overall title for the plot.

xlab

Title for x axis.

ylab

Title for y axis.

type

What type of plot should be drawn. Choose between "l" (line only), "p" (point only), and "b" (both line and point).

linetype

The line type. See vignette("ggplot2-specs").

err.col

Color of the error bars.

err.width

Width of the bars.

Value

A ggplot2 object.

Examples

library(cluster)
data(ruspini)

# 10-fold cross-validation
cptable <- cv.test(ruspini, minnodes = 2, maxnodes = 4)
ggcv(cptable)

Parallel Coordinates Plot with Circular Variables

Description

Makes a parallel coordinates plot in which the circular variables are plotted as ellipses. The function currently works well with data that have one circular variable.

Usage

ggpcp(
  data,
  circ.var = NULL,
  is.degree = TRUE,
  rotate = 0,
  north = 0,
  cw = FALSE,
  order.appear = NULL,
  linetype = 1,
  size = 0.5,
  alpha = 0.5,
  clustering,
  medoids = NULL,
  cluster.col = NULL,
  show.medoids = FALSE,
  labelsize = 4,
  xlab = "Variables",
  ylab = NULL,
  legend.cluster = "groups"
)

Arguments

data

Data set.

circ.var

Circular variable(s) in the data set, indicated by names or index in the data set.

is.degree

Whether the unit of the circular variables is degree or not (radian). Default is TRUE.

rotate

The rotation (offset, shift) of the circular variable, in radians. Default is 0 (no rotation).

north

The value of the circular variable that is labeled North. Default is 0 radians.

cw

Which direction of the circular variable is considered increasing in value, clockwise (TRUE) or counter-clockwise (FALSE). Default is FALSE.

order.appear

The order of appearance of the variables, listed by a vector of names or index. If set, its length has to be equal to the number of variables in the data set.

linetype

Line type. Default is solid line. See details in vignette("ggplot2-specs").

size

Size of a line is its width in mm. Default is 0.5. See details in vignette("ggplot2-specs").

alpha

The transparency of the lines. Default is 0.5.

clustering

Cluster membership.

medoids

Vector of medoid observations of the clusters. Only required when show.medoids = TRUE.

cluster.col

Colors of the clusters, given as a vector. If set, the length of this vector must be equal to the number of clusters in clustering.

show.medoids

Whether to highlight the medoid lines or not. Default is FALSE.

labelsize

The size of labels on the plot. Default is 4.

xlab

Labels for x-axis.

ylab

Labels for y-axis.

legend.cluster

Labels for group membership. Implemented by setting the label for the ggplot color aesthetic.

Value

A ggplot2 object.

Examples

# Set color constant
COLOR4 <- c("#e41a1c", "#377eb8", "#4daf4a", "#984ea3")

# Reduce the size of the data for the sake of example speed
set.seed(12345)
wind_reduced <- wind_sensit_2007[sample.int(nrow(wind_sensit_2007), 50), ]
sol42007 <- MonoClust(wind_reduced, cir.var = 3, nclusters = 4)

library(ggplot2)
ggpcp(data = wind_reduced,
      circ.var = "WDIR",
      # To improve aesthetics
      rotate = pi*3/4-0.3,
      order.appear = c("WDIR", "has.sensit", "WS"),
      alpha = 0.5,
      clustering = sol42007$membership,
      medoids = sol42007$medoids,
      cluster.col = COLOR4,
      show.medoids = TRUE) +
  theme(panel.background = element_rect(color = "white"),
        panel.border = element_rect(color = "white", fill = NA),
        panel.grid.major = element_line(color = "#f0f0f0"),
        panel.grid.minor = element_blank(),
        axis.line = element_line(color = "black"),
        legend.key = element_rect(color = NA),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(face = "italic"),
        legend.justification = "center")

Cluster Inertia Calculation

Description

Calculates the inertia for a given subset of the distance matrix of the original data set, passed as x. Distance matrices are assumed to be stored as matrices, not as distance objects.

Usage

inertia_calc(x)

Arguments

x

Distance matrix, not an object of some distance measure.

Value

Inertia value of the matrix, using the formula in Chavent (1998). If x is a single number, return 0.

Examples

data(iris)
# Euclidean distance on first 20 rows of the 4 continuous variables
dist_mat <- as.matrix(dist(iris[1:20, 1:4]))
inertia_calc(dist_mat)
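For intuition about what an inertia value measures, the sketch below assumes that inertia is the within-cluster sum of squared Euclidean distances to the cluster mean, written in terms of pairwise distances. That reading, the formula, and the function name are assumptions for illustration and are not copied from inertia_calc().

```r
# Hypothetical helper: inertia from a full (symmetric) distance matrix,
# assuming inertia = sum of squared pairwise distances / (2 * n)
inertia_sketch <- function(x) {
  if (length(x) == 1) return(0)   # a single point has no spread
  sum(x^2) / (2 * nrow(x))
}

# Two points at distance 2: each is 1 away from their mean, so 1^2 + 1^2 = 2
pts <- matrix(c(0, 0, 0, 2), ncol = 2, byrow = TRUE)
inertia_sketch(as.matrix(dist(pts)))  # 2
```

The division by 2 * n accounts for every pair appearing twice in the full matrix.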

Test If The Object is A MonoClust

Description

This function returns TRUE for MonoClust objects, and FALSE for all other objects.

Usage

is_MonoClust(mono_obj)

Arguments

mono_obj

An object.

Value

TRUE if the object inherits from the MonoClust class.

Create Jump Table

Description

Creates a jump table from the MonoClust frame object. The number and var columns are used to create the table.

Usage

make_jump_table(frame)

Arguments

frame

MonoClust's frame object

Value

Jump table with number, var, and two new columns, left and right, that indicate the left and right node numbers at each split.

Find Medoid of the Cluster

Description

The medoid is the point that has the minimum total distance to all other points in the cluster.

Usage

medoid(members, dist_mat)

Arguments

members

Index vector indicating which observations belong to the cluster.

dist_mat

Distance matrix of the whole data set. An object of class dist must be coerced to a matrix before use.

Value

index of the medoid point in the members vector.

Examples

library(cluster)
data(ruspini)
ruspini4sol <- MonoClust(ruspini, nclusters = 4)
ruspini4sol
medoid(which(ruspini4sol$membership == 4), ruspini4sol$dist)
# Check against the medoid labeled "4" in the output
ruspini4sol$medoids
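The definition of a medoid can also be tried on a tiny data set without the package. This sketch uses a hypothetical function name and returns the medoid's position in the full data set, which is an illustrative choice and may differ from what medoid() returns.

```r
# Hypothetical helper: the member with the smallest total distance to the
# other members of its cluster
medoid_sketch <- function(members, dist_mat) {
  sub <- dist_mat[members, members, drop = FALSE]
  members[which.min(rowSums(sub))]
}

# Four points on a line; as.matrix() turns the dist object into a full matrix
pts <- c(0, 1, 2, 10)
dist_mat <- as.matrix(dist(pts))

medoid_sketch(c(1, 2, 3), dist_mat)  # 2: the point at 1 sits between 0 and 2
medoid_sketch(c(2, 3, 4), dist_mat)  # 3: the point at 2 has total distance 9
```

Unlike a centroid, the medoid is always one of the observed data points.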

Create A New Node for Split Data Frame

Description

This function is a helper that makes sure the default values of the split data frame are correct when unspecified. It helps reduce type errors, especially when using dplyr, which is stricter about data types.

Usage

new_node(
  number,
  var,
  cut = -99L,
  n,
  inertia,
  bipartsplitrow = -99L,
  bipartsplitcol = -99L,
  inertiadel = 0,
  inertia_explained = -99,
  medoid,
  loc,
  split.order = -99L,
  alt = list(tibble::tibble(bipartsplitrow = numeric(), bipartsplitcol = numeric()))
)

Arguments

number

Row index of the data frame.

var

Whether it is a leaf, or the name of the next split variable.

cut

The splitting value, so values (of var) smaller than that go to the left branch while values greater than that go to the right branch.

n

Cluster size. Number of observations in that cluster.

inertia

Inertia value of the cluster at that node.

bipartsplitrow

Position of the next split row in the data set (that position will belong to the left (smaller) node).

bipartsplitcol

Position of the next split variable in the data set.

inertiadel

The proportion of the inertia value of the cluster at that node to the inertia of the root.

inertia_explained

Percent inertia explained as described in Chavent (2007).

medoid

Position of the data point regarded as the medoid of its cluster.

loc

y-coordinate of the splitting node to facilitate showing on the tree. See plot.MonoClust() for details.

split.order

Order of the splits. Root is 0, and increasing.

alt

Indicator of an alternative cut yielding the same reduction in inertia at that split.

Value

A tibble with exactly one row and the correct default data types, even for unspecified variables.

References

Chavent, M., Lechevallier, Y., & Briant, O. (2007). DIVCLUS-T: A monothetic divisive hierarchical clustering method. Computational Statistics & Data Analysis, 52(2), 687-701. doi:10.1016/j.csda.2007.03.013.

Permutation Test on Monothetic Tree

Description

Tests the significance of each monothetic clustering split by permutation methods. The "simple-withhold" method ("sw") shuffles the observations between two groups without the splitting variable. The other two methods shuffle the values in the splitting variable to create a new data set, then either split again on that variable ("resplit-limit", "rl") or use all variables as the splitting candidates ("resplit-nolimit", "rn").

Usage

perm.test(
  object,
  data,
  auto.pick = FALSE,
  sig.val = 0.05,
  method = c("sw", "rl", "rn"),
  rep = 1000L,
  stat = c("f", "aw"),
  bon.adj = TRUE,
  ncores = 1L
)

Arguments

object

The MonoClust object as the result of the clustering.

data

The data set which is being clustered.

auto.pick

Whether the algorithm stops when the p-value becomes larger than sig.val, or keeps testing and lets the researcher pick the final splitting tree. Default value is FALSE.

sig.val

Significance value to decide when to stop splitting. This option is ignored if auto.pick = FALSE, and is 0.05 by default when auto.pick = TRUE.

method

Can be chosen between sw (simple-withhold, default), rl (resplit-limit), or rn (resplit-nolimit). See Details.

rep

Number of permutations used to build the reference distribution of the test statistic.

stat

Statistic to use. Choose between "f" (Calinski-Harabasz's pseudo-F (Calinski and Harabasz, 1974)) or "aw" (average silhouette width by Rousseeuw (1987)).

bon.adj

Whether to adjust for the multiple testing problem using Bonferroni correction.

ncores

Number of CPU cores on the current host. When set to NULL, all available cores are used.

Details

Permutation Methods

Simple-Withhold: Shuffle the observations between two proposed clusters

The stat values calculated from the shuffles create the reference distribution used to find the p-value. Because the splitting variable that was chosen is already the best in terms of reduction of inertia, that variable is withheld from the distance matrix used in the permutation test.

Resplit-Limit: Shuffle splitting variable, split again on that variable

This method shuffles the values of the splitting variable while keeping the other variables fixed to create a new data set, then the chosen stat is calculated for each rep to compare with the observed stat.

Resplit-Nolimit: Shuffle splitting variable, split on all variables

Similar to Resplit-Limit, but all variables are splitting candidates.
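The general shape of a simple-withhold style permutation test can be sketched in base R. Everything below is an illustrative assumption rather than the perm.test() implementation: the two groups are made up, a one-way ANOVA F statistic stands in for the pseudo-F, and the p-value uses the common convention of counting the observed statistic among the permutations.

```r
set.seed(42)
# Two proposed clusters (assumed to come from a split on some other variable)
groups <- rep(1:2, each = 15)
# A remaining variable, with the splitting variable withheld
x <- c(rnorm(15, mean = 0), rnorm(15, mean = 3))

# Stand-in statistic (assumption): one-way ANOVA F for the grouping
f_stat <- function(values, g) anova(lm(values ~ factor(g)))[["F value"]][1]

observed <- f_stat(x, groups)

# Reference distribution: shuffle the group labels and recompute the statistic
n_perm <- 199
perm_stats <- replicate(n_perm, f_stat(x, sample(groups)))

# Share of statistics at least as large as the observed one
p_value <- (1 + sum(perm_stats >= observed)) / (n_perm + 1)
p_value
```

A small p-value indicates that the observed separation between the two groups is unusual compared with random relabelings.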

Bonferroni Correction

A hypothesis test lower in the monothetic clustering tree can have its p-value corrected for the multiple tests that were performed before it in order to reach that node. The formula is

adj.p = unadj.p \times depth,

where depth is 1 at the root node.
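The adjustment formula can be applied to a vector of p-values in one line. The values below are made up, and capping the result at 1 is a common convention added here as an assumption; the formula above does not state it.

```r
# Made-up unadjusted p-values for splits at depths 1, 2, and 3
unadj_p <- c(0.01, 0.02, 0.40)
depth <- c(1, 2, 3)

# Multiply by depth; cap at 1 so the result remains a valid probability
adj_p <- pmin(1, unadj_p * depth)
adj_p  # 0.01 0.04 1.00
```

The root split is left unchanged because its depth is 1.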

Value

The same MonoClust object with an extra column (p-value), as well as the numofclusters object if auto.pick = TRUE.

Note

This function uses foreach::foreach() to facilitate parallel processing. It distributes reps to processes.

References

Caliński, T. and Harabasz, J. (1974). "A dendrite method for cluster analysis". In: Communications in Statistics 3.1, pp. 1–27. doi:10.1080/03610927408827101.

Rousseeuw, P. J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis". In: Journal of Computational and Applied Mathematics 20, pp. 53–65. ISSN: 03770427. doi:10.1016/0377-0427(87)90125-7.

Examples

library(cluster)
data(ruspini)
ruspini6sol <- MonoClust(ruspini, nclusters = 6)
ruspini6.p_value <- perm.test(ruspini6sol, data = ruspini, method = "sw",
                              rep = 1000)
ruspini6.p_value

Plot MonoClust Splitting Rule Tree

Description

Plots the MonoClust tree in the form of a dendrogram.

Usage

## S3 method for class 'MonoClust'
plot(
  x,
  uniform = FALSE,
  branch = 1,
  margin = c(0.12, 0.02, 0, 0.05),
  minbranch = 0.3,
  text = TRUE,
  which = 4,
  stats = TRUE,
  abbrev = c("no", "short", "abbreviate"),
  digits = getOption("digits") - 2,
  cols = NULL,
  col.type = c("l", "p", "b"),
  rel.loc.x = TRUE,
  show.pval = TRUE,
  ...
)

Arguments

x

MonoClust result object.

uniform

If TRUE, uniform vertical spacing of the nodes is used; this may be less cluttered when fitting a large plot onto a page. The default is to use a non-uniform spacing proportional to the inertia in the fit.

branch

Controls the shape of the branches from parent to child node. Any number from 0 to 1 is allowed. A value of 1 gives square-shouldered branches, a value of 0 gives V-shaped branches, with other values being intermediate.

margin

An extra fraction of white space to leave around the borders of the tree. (Long labels sometimes get cut off by the default computation).

minbranch

Set the minimum length for a branch to minbranch times the average branch length. This parameter is ignored if uniform = TRUE. Sometimes a split will give very little improvement, or even no improvement at all. A tree with branch lengths strictly proportional to improvement leaves no room to squeeze in node labels.

text

Whether to print the labels on the tree.

which

Labeling modes, which are:

1: only splitting variable names are shown, no splitting rules.
2: only splitting rules to the left branches are shown.
3: only splitting rules to the right branches are shown.
4 (default): splitting rules are shown on both sides of branches.

stats

Whether to show statistics (cluster sizes and medoid points) on the tree.

abbrev

Whether to print the abbreviated versions of variable names. Can be either "no" (default), "short", or "abbreviate". Short forms of them can also be used.

If "no", the labels recorded in x$labels are used.

If "short", variable names will be turned into "V1", "V2", ...

If "abbreviate", the abbreviate() function will be used. Use the optional arguments for this function.

digits

Number of significant digits to print.

cols

Whether to show color bars at the leaves. This helps match this tree plot with other plots whose cluster memberships are colored. It only works when text is TRUE. Either NULL, a vector of one color, or a vector of colors matching the number of leaves.

col.type

When cols is set, choose whether the color indicators are shown in the form of solid lines below the leaves ("l"), big points ("p"), or both ("b").

rel.loc.x

Whether to use the relative distance between clusters as the x coordinate of the leaves. Default is TRUE.

show.pval

If the MonoClust object has been run through perm.test(), whether to show p-values on the tree.

...

Arguments to be passed to graphics::plot.default() and graphics::lines().

Value

A plot of splitting rule.

Examples

library(cluster)
data(ruspini)

# MonoClust tree
ruspini4sol <- MonoClust(ruspini, nclusters = 4)
plot(ruspini4sol)

# MonoClust tree after permutation test is run
ruspini6sol <- MonoClust(ruspini, nclusters = 6)
ruspini6_test <- perm.test(ruspini6sol,
                           data = ruspini,
                           method = "sw",
                           rep = 1000)
plot(ruspini6_test, branch = 1, uniform = TRUE)

Plot the Mean Square Error with Error Bar for +/- 1 Standard Error

Description

Plot the Mean Square Error with Error Bar for +/- 1 Standard Error

Usage

## S3 method for class 'cv.MonoClust'
plot(
  x,
  main = "MSE for CV of monothetic clustering",
  xlab = "Number of clusters",
  ylab = "MSE +/- 1 SE",
  type = "b",
  lty = 2,
  err.col = "red",
  err.width = 0.1,
  ...
)

Arguments

x

A cv.MonoClust object (output of cv.test()).

main

Overall title for the plot.

xlab

Title for x axis.

ylab

Title for y axis.

type

What type of plot should be drawn. See graphics::par().

lty

The line type.

err.col

Color of the error bars.

err.width

Width of the bars.

...

Arguments to be passed to graphics::plot.default().

Value

A line plot with error bars.

Examples

library(cluster)
data(ruspini)

# 10-fold cross-validation
cptable <- cv.test(ruspini, minnodes = 2, maxnodes = 4)
plot(cptable)

Calculate Branch Coordinates

Description

Calculate Branch Coordinates

Usage

plot_prep_branch(x, y, node, branch = 0)

Arguments

x

Nodes x-coordinates.

y

Nodes y-coordinates.

node

Nodes row number.

branch

Value

Branch coordinates in a list of x and y axis.

Calculate Nodes Coordinates

Description

Calculate Nodes Coordinates

Usage

plot_prep_node(tree, uniform = FALSE, minbranch = 0.3)

Arguments

tree

MonoClust result object.

uniform

minbranch

Value

Nodes coordinates in a list of x and y axis.

Plot the monoClust Tree.

Description

This function plots the MonoClust tree. It is partially inspired by the rpart package.

Usage

plot_tree(
  x,
  uniform = FALSE,
  branch = 1,
  margin = 0,
  minbranch = 0.3,
  rel.loc.x = TRUE,
  ...
)

Arguments

x

MonoClust result object.

uniform

branch

margin

An extra fraction of white space to leave around the borders of the tree. (Long labels sometimes get cut off by the default computation.)

minbranch

rel.loc.x

Whether to use the relative distance between clusters as the x coordinate of the leaves. Default is TRUE.

...

Arguments to be passed to graphics::plot.default() and graphics::lines().

Value

Plot of tree

Predictions from a MonoClust Object

Description

Predict the cluster memberships of a new data set from a MonoClust object.

Usage

## S3 method for class 'MonoClust'
predict(object, newdata, type = c("centroid", "medoid"), ...)

Arguments

object

MonoClust result object.

newdata

Data frame containing the values to be predicted. If missing, the memberships of the MonoClust object are returned.

type

Type of returned cluster representatives. Either "centroid" to return the centroid values of the terminal clusters, or "medoid" to return the index of the medoid observations in the clustered data set.

...

Further arguments passed to or from other methods.

Value

A tibble of cluster index in cname and either centroid values or medoid observation indexes, depending on the value of the type argument.

Examples

library(cluster)
data(ruspini)

set.seed(1234)
test_index <- sample(1:nrow(ruspini), nrow(ruspini)/5)
train_index <- setdiff(1:nrow(ruspini), test_index)
ruspini_train <- ruspini[train_index, ]
ruspini_test <- ruspini[test_index, ]

ruspini_train_4sol <- MonoClust(ruspini_train, nclusters = 4)
predict(ruspini_train_4sol, newdata = ruspini_test)
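
To make the two values of type concrete, the sketch below contrasts a centroid (the vector of variable means, which need not be an observed point) with a medoid (the observation with the smallest total distance to the others). It uses only base R and a made-up cluster; it illustrates the general definitions and is not taken from the package's code.

```r
# A small, made-up cluster of five observations.
cluster_points <- data.frame(x = c(1, 2, 3, 4, 10),
                             y = c(1, 2, 3, 4, 10))

# Centroid: the mean of each variable.
centroid <- colMeans(cluster_points)

# Medoid: the row whose summed Euclidean distance to all rows is smallest.
d <- as.matrix(dist(cluster_points))
medoid_index <- which.min(rowSums(d))

centroid                        # x = 4, y = 4 (not an observed point)
cluster_points[medoid_index, ]  # the third observation, (3, 3)
```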

Print Monothetic Clustering Results

Description

Render the MonoClust split tree in an easy to read format with important information such as terminal nodes, p-value (if possible), etc.

Usage

## S3 method for class 'MonoClust'
print(
  x,
  abbrev = c("no", "short", "abbreviate"),
  spaces = 2L,
  digits = getOption("digits"),
  ...
)

Arguments

x

MonoClust result object.

abbrev

Whether to print the abbreviated versions of variable names. Can be either "no" (default), "short", or "abbreviate". Short forms of them can also be used.

If "no", the labels recorded in x$labels are used.

If "short", variable names will be turned into "V1", "V2", ...

If "abbreviate", the abbreviate() function will be used. Use the optional arguments for this function.

spaces

Number of spaces to indent between two tree levels.

digits

Number of significant digits to print.

...

Optional arguments to abbreviate().

Value

A nicely displayed MonoClust split tree.

Examples

library(cluster)
data(ruspini)

ruspini4sol <- MonoClust(ruspini, nclusters = 4)
print(ruspini4sol, digits = 2)

Print MonoClust Cross-Validation Result

Description

Print MonoClust Cross-Validation Result

Usage

## S3 method for class 'cv.MonoClust'
print(x, ...)

Arguments

x

A cv.MonoClust object (output of cv.test()).

...

Further arguments passed to or from other methods.

Examples

library(cluster)
data(ruspini)

# 10-fold cross-validation
cp_table <- cv.test(ruspini, minnodes = 2, maxnodes = 4)
print(cp_table)

Split Function

Description

Given split_row, the row position in the cluster frame to split at, this function performs the split and calculates all necessary information for the splitting tree and cluster memberships.

Usage

splitter(data, cuts, split_row, frame, cloc, dist, split_order = 0L)

Arguments

data

Original data set.

cuts

Cuts data set, which has the next higher value of each variable in the original data set.

split_row

The row index in frame that would be split on.

frame

The split tree transferred as data frame.

cloc

Vector of current cluster membership.

dist

Distance matrix of all observations in the data.

split_order

The control argument that tracks how many splits have been done.

Value

Updated frame and cloc, saved in a list.

Hypothesis Test at Split

Description

Hypothesis Test at Split

Usage

test_split(members_l, members_r, method, data, split_var, rep, stat, ncores)

Arguments

members_l,members_r

Vector of the index of observations that are members of the left child node and the right child node, respectively.

method

Can be chosen between sw (simple-withhold, default), rl (resplit-limit), or rn (resplit-nolimit). See Details.

data

The data set which is being clustered.

split_var

Splitting variable at current split.

rep

Number of permutations required to calculate test statistic.

stat

Statistic to use. Choose between "f" (Calinski-Harabasz's pseudo-F; Calinski and Harabasz, 1974) or "aw" (average silhouette width; Rousseeuw, 1987).

ncores

Number of CPU cores on the current host. When set to NULL, all available cores are used.

Value

p-value of the test
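
As background for the "f" statistic, the sketch below computes the Calinski-Harabasz pseudo-F, (B / (k - 1)) / (W / (n - k)), and a generic label-shuffling permutation p-value. It is an illustration only: pseudo_f and perm_p_value are hypothetical helper names, and the plain label permutation shown here is an assumption that does not reproduce the package's sw, rl, or rn schemes.

```r
# Calinski-Harabasz pseudo-F for a given partition.
pseudo_f <- function(x, labels) {
  x <- as.matrix(x)
  n <- nrow(x)
  groups <- unique(labels)
  k <- length(groups)
  # Within-cluster sum of squares (W).
  wss <- sum(sapply(groups, function(g) {
    xg <- x[labels == g, , drop = FALSE]
    sum(sweep(xg, 2, colMeans(xg))^2)
  }))
  # Total sum of squares; between-cluster part is B = total - W.
  tss <- sum(sweep(x, 2, colMeans(x))^2)
  bss <- tss - wss
  (bss / (k - 1)) / (wss / (n - k))
}

# Generic permutation p-value: shuffle labels, recompute the statistic.
perm_p_value <- function(x, labels, rep = 999) {
  observed <- pseudo_f(x, labels)
  permuted <- replicate(rep, pseudo_f(x, sample(labels)))
  (sum(permuted >= observed) + 1) / (rep + 1)
}

set.seed(1)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 5))  # two separated groups
labels <- rep(c(1, 2), each = 20)
p <- perm_p_value(x, labels, rep = 199)
p
```

Adding 1 to both numerator and denominator keeps the estimated p-value strictly positive, a common convention for permutation tests.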

Implementation of Print Labels on MonoClust Tree

Description

This function plots the labels onto the MonoClust tree. It is partially inspired by the rpart package.

Usage

text_tree(
  x,
  which = 4,
  digits = getOption("digits") - 2,
  stats = TRUE,
  abbrev,
  cols = NULL,
  cols.type = c("l", "p", "b"),
  rel.loc.x = TRUE,
  show.pval = TRUE,
  uniform = FALSE,
  minbranch = 0.3,
  ...
)

Arguments

x

MonoClust result object.

which

Labeling modes, which are:

1: only splitting variable names are shown, no splitting rules.
2: only splitting rules to the left branches are shown.
3: only splitting rules to the right branches are shown.
4 (default): splitting rules are shown on both sides of branches.

digits

Number of significant digits to print.

stats

Whether to show statistics (cluster sizes and medoid points) on the tree.

abbrev

Whether to print the abbreviated versions of variable names. Can be either "no" (default), "short", or "abbreviate". Short forms of them can also be used.

If "no", the labels recorded in x$labels are used.

If "short", variable names will be turned into "V1", "V2", ...

If "abbreviate", the abbreviate() function will be used. Use the optional arguments for this function.

cols

rel.loc.x

Whether to use the relative distance between clusters as the x coordinate of the leaves. Default is TRUE.

show.pval

If the MonoClust object has been run through perm.test(), whether to show p-values on the tree.

uniform

minbranch

...

Extra arguments to be passed to graphics::text().

Value

Labels on tree.

Transform Between Degree and Radian

Description

This function transforms a circular angle from degree to radian or from radian to degree.

Usage

torad(x)

todeg(x)

Arguments

x

A degree value if torad or radian value if todeg.

Value

A radian value if torad or degree value if todeg.

Examples

torad(90)
torad(-45)
todeg(pi/2)
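
The conversion itself is the standard identity radians = degrees * pi / 180. A minimal base R sketch follows; deg_to_rad and rad_to_deg are hypothetical stand-ins, and whether the package functions wrap results into a fixed range is not stated here, so no wrapping is shown.

```r
# Standard degree/radian identities (illustrative helpers, not package code).
deg_to_rad <- function(x) x * pi / 180
rad_to_deg <- function(x) x * 180 / pi

deg_to_rad(90)      # pi / 2
deg_to_rad(-45)     # -pi / 4
rad_to_deg(pi / 2)  # 90
```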

Find Tree Depth Based on Node Indexes

Description

Find Tree Depth Based on Node Indexes

Usage

tree_depth(nodes)

Arguments

nodes

Vector of node indexes in the tree.

Details

When building the MonoClust tree, node indexes were created with the rule that new node indexes are the split node times 2 plus 0 (left) and 1 (right). Therefore, this function is just a back-transform, taking a log base 2.

Value

Depth of each node, with 0 being the root relative to the input.
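
The back-transform described in Details can be sketched in a few lines. This assumes the root has index 1 and reads "relative to the input" as measuring depth from the shallowest node supplied; node_depth is a hypothetical name, not the package function.

```r
# Children of node i are 2 * i (left) and 2 * i + 1 (right), so the depth
# of a node is floor(log2(index)) when the root is indexed 1.
node_depth <- function(nodes) {
  depth <- floor(log2(nodes))
  depth - min(depth)  # assumption: shallowest supplied node has depth 0
}

node_depth(c(1, 2, 3, 6, 7))  # 0 1 1 2 2
node_depth(c(3, 6, 7))        # 0 1 1
```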

Traverse a Tree to Find the Leaves (Terminal Nodes)

Description

Traverse a Tree to Find the Leaves (Terminal Nodes)

Usage

tree_walk(new_point, jump_table)

Arguments

new_point

New data point

jump_table

Jump table

Value

The index of the terminal node after traversing the new data point on the tree.
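
The layout of the jump table is not described in this entry, so the following is only a sketch of the general idea under an assumed structure: one row per node, with the split variable, the cut value, and the indexes of the left and right children (NA for leaves). The column names, the "less than goes left" rule, and walk_to_leaf are all hypothetical.

```r
# Assumed jump table for a tree with leaves 2, 6 and 7.
jump_table <- data.frame(
  node  = c(1, 2, 3, 6, 7),
  var   = c("x", NA, "y", NA, NA),
  cut   = c(37, NA, 91, NA, NA),
  left  = c(2, NA, 6, NA, NA),
  right = c(3, NA, 7, NA, NA),
  stringsAsFactors = FALSE
)

walk_to_leaf <- function(new_point, jump_table) {
  current <- 1  # start at the root
  repeat {
    row <- jump_table[jump_table$node == current, ]
    # A node without a split variable is a leaf: stop here.
    if (is.na(row$var)) return(current)
    # Assumed rule: values below the cut go left, others go right.
    current <- if (new_point[[row$var]] < row$cut) row$left else row$right
  }
}

walk_to_leaf(list(x = 10, y = 50), jump_table)   # 2
walk_to_leaf(list(x = 60, y = 120), jump_table)  # 7
```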

Existence of Microorganisms Carried in Wind

Description

Data set is a part of a study on microorganisms carried in strong föhn winds at the Bonney Riegel location of Taylor Valley, an ice-free area in the Antarctic continent. Wind direction and wind speed data were obtained from the meteorological station. Wind direction was recorded every 30 seconds and wind speeds every 4 seconds at 1.15 meters above the ground surface. The recorded wind directions and speeds were averaged at 15 minute intervals. For wind direction, as discussed previously, winds from the north are defined as 0/360 degrees and from the east as 90 degrees. 2007 data were collected from August 4–11, 2007.

Usage

wind_sensit_2007

Format

A data frame with 671 rows and 3 variables:

has.sensit: A binary variable of the existence of particles in the wind (1) or not (0).
WS: Wind speed measured in m/s.
WDIR: Wind direction in degrees, where 0 indicates "from the north" and 90 indicates "from the east".
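
Because WDIR is circular (0 and 360 degrees both mean "from the north"), ordinary subtraction overstates how far apart two directions are. The sketch below shows one common way to measure the separation between two directions in degrees; angular_sep is a hypothetical helper for illustration and is not claimed to be the distance the package uses.

```r
# Smallest angle between two directions given in degrees.
angular_sep <- function(a, b) {
  d <- abs(a - b) %% 360
  pmin(d, 360 - d)
}

angular_sep(350, 10)  # 20, not 340
angular_sep(0, 360)   # 0: both mean "from the north"
angular_sep(90, 270)  # 180: east versus west
```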

Source

Sabacka, M., Priscu, J. C., Basagic, H. J., Fountain, A. G., Wall, D. H., Virginia, R. A., and Greenwood, M. C. (2012). "Aeolian flux of biotic and abiotic material in Taylor Valley, Antarctica". In: Geomorphology 155-156, pp. 102-111. issn: 0169555X. doi:10.1016/j.geomorph.2011.12.009.

Existence of Microorganisms Carried in Wind

Description

Usage

wind_sensit_2008

Format

A data frame with 673 rows and 3 variables:

has.sensit: A binary variable of the existence of particles in the wind (1) or not (0).
WS: Wind speed measured in m/s.
WDIR: Wind direction in degrees, where 0 indicates "from the north" and 90 indicates "from the east".
