| Title: | Perform Monothetic Clustering with Extensions to Circular Data |
| Version: | 1.2.1 |
| Description: | Implementation of the Monothetic Clustering algorithm (Chavent, 1998 <doi:10.1016/S0167-8655(98)00087-7>) on continuous data sets. A lot of extensions are included in the package, including applying Monothetic clustering on data sets with circular variables, visualizations with the results, and permutation and cross-validation based tests to support the decision on the number of clusters. |
| License: | GPL-2 \| GPL-3 [expanded from: GPL (≥ 2)] |
| URL: | https://vinhtantran.github.io/monoClust/, https://github.com/vinhtantran/monoClust |
| BugReports: | https://github.com/vinhtantran/monoClust/issues |
| Depends: | R (≥ 3.3.0) |
| Imports: | cluster (≥ 2.0.5), doParallel, dplyr (≥ 1.0.0), foreach, ggplot2, graphics, grDevices, parallel, permute, purrr (≥ 0.3.0), rlang (≥ 0.3.0), stats, stringr (≥ 0.5), tibble (≥ 3.0.0), tidyr (≥ 1.0.0) |
| Suggests: | knitr, mice, rmarkdown, covr, testthat (≥ 3.0.0) |
| VignetteBuilder: | knitr |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.1.1 |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Packaged: | 2021-02-15 00:10:00 UTC; vinht |
| Author: | Tan Tran |
| Maintainer: | Tan Tran <vinhtantran@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2021-02-15 15:00:02 UTC |
monoClust: Perform Monothetic Clustering with Extensions to Circular Data
Description
Implementation of the Monothetic Clustering algorithm (Chavent, 1998 <doi:10.1016/S0167-8655(98)00087-7>) on continuous data sets. A lot of extensions are included in the package, including applying Monothetic clustering on data sets with circular variables, visualizations with the results, and permutation and cross-validation based tests to support the decision on the number of clusters.
Author(s)
Maintainer: Tan Tran <vinhtantran@gmail.com> (ORCID)
Authors:
Brian McGuire <mcguirebc@gmail.com>
Mark Greenwood <greenwood@montana.edu> (ORCID)
See Also
Useful links:
Report bugs at https://github.com/vinhtantran/monoClust/issues
Monothetic Clustering
Description
Creates a MonoClust object after partitioning the data set using Monothetic Clustering.
Usage
MonoClust(
  toclust,
  cir.var = NULL,
  variables = NULL,
  distmethod = NULL,
  digits = getOption("digits"),
  nclusters = 2L,
  minsplit = 5L,
  minbucket = round(minsplit/3),
  ncores = 1L
)
Arguments
toclust | Data set as a data frame. |
cir.var | Index or name of the circular variable in the data set. |
variables | List of variables selected for clustering procedure. It could be a vector of variable indexes, or a vector of variable names. |
distmethod | Distance method to use with the data set. Can be chosen from "euclidean" (for Euclidean distance), "manhattan" (for Manhattan distance), or "gower" (for Gower distance). If not set, Euclidean distance is used unless |
digits | Number of significant digits printed in the output. |
nclusters | Number of clusters created. Default is 2. |
minsplit | The minimum number of observations that must exist in a node in order for a split to be attempted. Default is 5. |
minbucket | The minimum number of observations in any terminal leaf node. Default is round(minsplit/3). |
ncores | Number of CPU cores on the current host. If greater than 1, parallel processing with |
Value
A MonoClust object. See MonoClust.object.
References
Chavent, M. (1998). A monothetic clustering method. Pattern Recognition Letters, 19(11), 989-996. doi:10.1016/S0167-8655(98)00087-7.
Tran, T. V. (2019). Monothetic Cluster Analysis with Extensions to Circular and Functional Data. Montana State University - Bozeman.
Examples
library(monoClust)
# Very simple data set
library(cluster)
data(ruspini)
ruspini4sol <- MonoClust(ruspini, nclusters = 4)
ruspini4sol
# Data with circular variable
data(wind_sensit_2007)
# Use a small data set
set.seed(12345)
wind_reduced <- wind_sensit_2007[sample.int(nrow(wind_sensit_2007), 10), ]
circular_wind <- MonoClust(wind_reduced, cir.var = 3, nclusters = 2)
circular_wind
Monothetic Clustering Tree Object
Description
The structure and objects contained in MonoClust, an object returned from the MonoClust() function and used as the input in other functions in the package.
Value
- frame
Data frame in the form of a tibble::tibble() representing a tree structure with one row for each node. The columns include:
  - number
  Index of the node. Depth of a node can be derived by number %/% 2.
  - var
  Name of the variable used in the split at a node or "<leaf>" if it is a leaf node.
  - cut
  Splitting value, so values of var that are smaller than that go to the left branch while values greater than that go to the right branch.
  - n
  Cluster size, the number of observations in that cluster.
  - inertia
  Inertia value of the cluster at that node.
  - bipartsplitrow
  Position of the next split row in the data set (that position will belong to the left (smaller) node).
  - bipartsplitcol
  Position of the next split variable in the data set.
  - inertiadel
  Proportion of inertia value of the cluster at that node to the inertia of the root.
  - medoid
  Position of the data point regarded as the medoid of its cluster.
  - loc
  y-coordinate of the splitting node to facilitate showing on the tree. See plot.MonoClust() for details.
  - split.order
  Order of the splits, with the root being 0.
  - inertia_explained
  Percent inertia explained as described in Chavent (2007). It is 1 - (sum(current inertia) / inertia[1]).
  - alt
  A nested tibble of alternate splits at a node. It contains bipartsplitrow and bipartsplitcol with the same meaning as above. Note that this is for information only. Currently monoClust does not support choosing an alternate splitting route. If needed, MonoClust() can be run step by step with nclusters = 2.
- membership
Vector of the same length as the number of rows in the data, containing the value of frame$number corresponding to the leaf node that an observation falls into.
- dist
Distance matrix calculated using the method indicated in the distmethod argument of MonoClust().
- terms
Vector of variable names in the data that were used to split.
- centroids
Data frame with one row for the centroid value of each cluster.
- medoids
Named vector of positions of the data points regarded as medoids of clusters.
- alt
Indicator of whether an alternate splitting route occurred when splitting.
- circularroot
List of values designed for the circular variable in the data set. var is the name of the circular variable and cut is its first best split value. If a circular variable is not available, both objects are NULL.
References
Chavent, M., Lechevallier, Y., & Briant, O. (2007). DIVCLUS-T: A monothetic divisive hierarchical clustering method. Computational Statistics & Data Analysis, 52(2), 687-701. doi:10.1016/j.csda.2007.03.013.
Coerce Similar Object to MonoClust
Description
The function turns a MonoClust-similar object into a MonoClust object so it can use supported functions for MonoClust such as print.MonoClust() and plot.MonoClust().
Usage
as_MonoClust(x, ...)
## Default S3 method:
as_MonoClust(x, ...)
Arguments
x | An object that can be coerced to MonoClust object. |
... | For extensibility. |
Details
as_MonoClust() is an S3 generic. The function itself doesn't run unless it is implemented for another similar object. Currently, this function is not implemented within the monoClust package.
Find Centroid of the Cluster
Description
The centroid is the point whose coordinates are the means of the values in its cluster.
Usage
centroid(data, frame, cloc)
Arguments
data | Original data set. |
frame | The split tree transferred as data frame. |
cloc | Vector of current cluster membership. |
Value
A data frame with coordinates of centroids
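As a rough illustration of the definition above, the sketch below computes per-cluster means with base R. The data frame `toy` and the membership vector `cloc_toy` are made-up values, and `aggregate()` is only a stand-in; it is not claimed to be how centroid() works internally.

```r
# Hypothetical toy data: four observations, two variables
toy <- data.frame(x = c(1, 3, 10, 12), y = c(2, 4, 20, 22))
# Hypothetical cluster membership, one label per observation
cloc_toy <- c(1, 1, 2, 2)

# Centroid = mean of each variable within each cluster
centroids_toy <- aggregate(toy, by = list(cluster = cloc_toy), FUN = mean)
centroids_toy
```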
First Gate Function
Description
This function checks which nodes are available to split, calls find_split() on each node, decides which node creates the best split, and calls splitter() to perform the split.
Usage
checkem(
  data,
  cuts,
  frame,
  cloc,
  dist,
  variables,
  minsplit,
  minbucket,
  split_order,
  ncores
)
Arguments
data | Original data set. |
cuts | Cuts data set, which has the next higher value of each variable in the original data set. |
frame | The split tree transferred as data frame. |
cloc | Vector of current cluster membership. |
dist | Distance matrix of all observations in the data. |
variables | List of variables selected for clustering procedure. It could be a vector of variable indexes, or a vector of variable names. |
minsplit | The minimum number of observations that must exist in a node in order for a split to be attempted. |
minbucket | The minimum number of observations in any terminal leaf node. Default is |
split_order | Control argument tracking how many splits have been done. |
ncores | Number of CPU cores on the current host. |
Value
It is not supposed to return anything because the global environment was used. However, if there is nothing left to split, it returns 0 to tell the caller to stop running the loop.
Add/Subtract Circular Values in Degrees/Radian
Description
Add/subtract two circular variables in degrees (%cd+% and %cd-%) and radians (%cr+% and %cr-%).
Usage
x %cd+% y
x %cd-% y
x %cr+% y
x %cr-% y
Arguments
x,y | Circular values in degrees/radians. |
Value
A value in [0, 360) for degrees or in [0, 2*pi) for radians.
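The wrap-around behaviour described here can be sketched with the modulo operator. The helper names below are hypothetical and are not the package's operators; the sketch only illustrates the arithmetic under the assumption that results are reduced modulo a full circle.

```r
# Hypothetical helpers illustrating circular addition and subtraction
circ_add_deg <- function(x, y) (x + y) %% 360
circ_sub_deg <- function(x, y) (x - y) %% 360
circ_add_rad <- function(x, y) (x + y) %% (2 * pi)

circ_add_deg(250, 200)   # sum exceeds 360, so it wraps around
circ_sub_deg(25, 80)     # difference is negative, so it wraps around
circ_add_rad(pi, pi / 2) # stays inside [0, 2*pi)
```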
Examples
90 %cd+% 90
250 %cd+% 200
25 %cd-% 80
pi %cr+% (pi/2)
Distance Matrix of Circular Variables
Description
Calculates the distance matrix of observations with circular variables using an adapted version of Gower's distance. This distance should be compatible with Gower's distance for other variable types.
Usage
circ_dist(frame)
Arguments
frame | A data frame in which all columns are circular variables measured in degrees. |
Details
The distance between two observations i and j of a circular variable q is suggested to be
d(y_{iq}, y_{jq}) = \frac{180 - |180 - |y_{iq} - y_{jq}||}{180}.
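A minimal sketch of this formula in base R, for a single circular variable, is given below. The helper name `circ_pair_dist` and the sample angles are hypothetical; the code only evaluates the expression above and is not the package's circ_dist() implementation.

```r
# Hypothetical helper: scaled circular distance between two angles in degrees
circ_pair_dist <- function(a, b) {
  (180 - abs(180 - abs(a - b))) / 180
}

circ_pair_dist(10, 350) # 20 degrees apart across the 0/360 boundary
circ_pair_dist(0, 180)  # opposite directions give the maximum distance
circ_pair_dist(90, 90)  # identical angles

# Pairwise distances for one circular variable, as an object of class "dist"
angles <- c(10, 350, 180)
as.dist(outer(angles, angles, circ_pair_dist))
```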
Value
Object of class "dist".
References
Tran, T. V. (2019). Chapter 3. Monothetic Cluster Analysis with Extensions to Circular and Functional Data. Montana State University - Bozeman.
Examples
# Make a sample data set of 20 observations with 2 circular variables
data <- data.frame(var1 = sample.int(359, 20), var2 = sample.int(359, 20))
circ_dist(data)
Cluster Statistics Calculation
Description
Calinski-Harabasz's pseudo-F (Calinski and Harabasz, 1974) and average silhouette width (Rousseeuw, 1987) calculation.
Usage
cluster_stats(d, clustering)
Arguments
d | Distance object (as generated by |
clustering | Integer vector of length of the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters. |
Value
- f_stat
Calinski-Harabasz's pseudo-F.
- asw
Average silhouette width.
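For orientation, the pseudo-F statistic can be sketched from a distance object alone. The helper `pseudo_f` below is hypothetical and assumes Euclidean distances, for which a group's sum of squares equals the sum of its squared pairwise distances (over ordered pairs) divided by twice the group size; it is not the code used by cluster_stats().

```r
# Hypothetical helper: Calinski-Harabasz pseudo-F from distances and labels
pseudo_f <- function(d, clustering) {
  d2 <- as.matrix(d)^2
  n <- nrow(d2)
  k <- length(unique(clustering))
  # Sum of squares of a group, computed from its pairwise squared distances
  ss <- function(idx) sum(d2[idx, idx, drop = FALSE]) / (2 * length(idx))
  total <- ss(seq_len(n))
  within <- sum(vapply(split(seq_len(n), clustering), ss, numeric(1)))
  between <- total - within
  (between / (k - 1)) / (within / (n - k))
}

# Made-up one-dimensional data with two tight, well separated groups
x <- c(0, 1, 10, 11)
pseudo_f(dist(x), c(1, 1, 2, 2))
```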
References
Caliński, T. and Harabasz, J (1974). "A dendrite method for cluster analysis". In: Communications in Statistics 3.1, pp. 1–27. doi:10.1080/03610927408827101.
Rousseeuw, P. J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis". In: Journal of Computational and Applied Mathematics 20, pp. 53–65. ISSN: 03770427. doi:10.1016/0377-0427(87)90125-7.
Create Labels for Split Variables
Description
This function creates the variable labels for a MonoClust tree.
Usage
create_labels(x, abbrev, digits = getOption("digits"), ...)
Arguments
x | MonoClust result object. |
abbrev | Whether to print the abbreviated versions of variable names. Can be either "no" (default), "short", or "abbreviate". Short forms of them can also be used. If "no", the labels recorded in If "short", variable names will be turned into "V1", "V2", ... If "abbreviate", |
digits | Number of significant digits to print. |
... | Optional arguments to |
Value
A list containing two elements:
- varnames: A named vector of labels corresponding to the variable names (stored as the vector's names).
- labels: Vector of labels of splitting rules to be displayed.
Cross-Validation Test on MonoClust
Description
Perform a cross-validation test for different numbers of clusters of Monothetic Clustering.
Usage
cv.test(data, fold = 10L, minnodes = 2L, maxnodes = 10L, ncores = 1L, ...)
Arguments
data | Data set to be partitioned. |
fold | Number of folds (k). |
minnodes | Minimum number of clusters to be checked. |
maxnodes | Maximum number of clusters to be checked. |
ncores | Number of CPU cores on the current host. When set to NULL, all available cores are used. |
... | Other parameters transferred to |
Details
The k-fold cross-validation randomly partitions data into k subsets with equal (or close to equal) sizes. k - 1 subsets are used as the training data set to create a tree with a desired number of leaves and the other subset is used as the validation data set to evaluate the predictive performance of the trained tree. The process repeats for each subset as the validating set (m = 1, \ldots, k) and the mean squared difference,
MSE_m = \frac{1}{n_m} \sum_{q=1}^Q \sum_{i \in m} d^2_{euc}(y_{iq}, \hat{y}_{(-i)q}),
is calculated, where n_m is the number of observations in subset m, Q is the number of variables, \hat{y}_{(-i)q} is the mean on variable q of the training-data cluster into which the observed value y_{iq} of the validation data set falls, and d^2_{euc}(y_{iq}, \hat{y}_{(-i)q}) is the squared Euclidean distance (dissimilarity) between two observations at variable q. This process is repeated for the k subsets of the data set and the average of these test errors is the cross-validation-based estimate of the mean squared error of predicting a new observation,
CV_k = \overline{MSE} = \frac{1}{k} \sum_{m=1}^k MSE_m.
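The procedure above can be sketched in base R as follows. Everything in this block is illustrative: the data are simulated, and a single median cut on variable `a` is used as a placeholder for the trained tree (it is an assumption made for brevity, not the monothetic clustering algorithm).

```r
set.seed(1)
# Hypothetical data: 40 observations, two variables
dat <- data.frame(a = c(rnorm(20, 0), rnorm(20, 5)), b = rnorm(40))
k <- 5
n <- nrow(dat)
# Random fold labels with equal (or close to equal) sizes
fold_id <- sample(rep_len(seq_len(k), n))

mse <- numeric(k)
for (m in seq_len(k)) {
  train <- dat[fold_id != m, ]
  valid <- dat[fold_id == m, ]
  # Placeholder "tree" with two leaves: one cut on variable a (assumption)
  cut_a <- median(train$a)
  leaf_train <- ifelse(train$a < cut_a, 1, 2)
  leaf_valid <- ifelse(valid$a < cut_a, 1, 2)
  # Cluster means on every variable, one row per leaf
  means <- rbind(colMeans(train[leaf_train == 1, ]),
                 colMeans(train[leaf_train == 2, ]))
  # Squared Euclidean distance of each validation point to its leaf mean
  sq_err <- rowSums((as.matrix(valid) - means[leaf_valid, , drop = FALSE])^2)
  mse[m] <- mean(sq_err)
}
# Cross-validation estimate: average of the per-fold errors
cv_estimate <- mean(mse)
cv_estimate
```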
Value
An object of class cv.MonoClust containing a data frame of the mean squared errors and their standard deviations.
Note
This function supports parallel processing with foreach::foreach(). It distributes MonoClust calls to processes.
See Also
plot.cv.MonoClust(), MonoClust(), predict.MonoClust()
Examples
library(cluster)
data(ruspini)
# Leave-one-out cross-validation
cv.test(ruspini, fold = 1, minnodes = 2, maxnodes = 4)
# 5-fold cross-validation
cv.test(ruspini, fold = 5, minnodes = 2, maxnodes = 4)
Make Error Bars
Description
Make Error Bars
Usage
error_bar(x, y, upper, lower = upper, length = 0.1, ...)
Arguments
x | x coordinates. |
y | y coordinates. |
upper | Distance from y to the upper bar. |
lower | Distance from y to the lower bar. |
length | Length of the horizontal bar. |
... | Other arguments to |
Value
Plot
Find the Closest Cut
Description
Find the cuts for a quantitative variable. These cuts are the candidates considered when bi-partitioning the data. For each value in a quantitative column, find the next larger value; for the largest value, use that value + 1.
Usage
find_closest(col)
Arguments
col | a quantitative vector. |
Value
a quantitative vector which contains the closest higher cut.
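The rule described for this function can be sketched in a few lines of base R. The helper `next_higher` is a hypothetical re-statement of the documented behaviour (next larger value, or value + 1 for the maximum), not the package's own code.

```r
# Hypothetical helper: for each value, the next larger value in the vector;
# the largest value is mapped to itself plus one
next_higher <- function(col) {
  vapply(col, function(v) {
    larger <- col[col > v]
    if (length(larger) > 0) min(larger) else v + 1
  }, numeric(1))
}

next_higher(c(3, 1, 2, 3))
```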
Find the Best Split
Description
Find the best split in terms of reduction in inertia for the transferred node, indicated by row. Find the terminal node with the greatest change in inertia and bi-partition it.
Usage
find_split(
  data,
  cuts,
  frame_row,
  cloc,
  dist,
  variables,
  minsplit,
  minbucket,
  ncores
)
Arguments
data | Original data set. |
cuts | Cuts data set, which has the next higher value of each variable in the original data set. |
frame_row | One row of the split tree as data frame. |
cloc | Vector of current cluster membership. |
dist | Distance matrix of all observations in the data. |
variables | List of variables selected for clustering procedure. It could be a vector of variable indexes, or a vector of variable names. |
minsplit | The minimum number of observations that must exist in a node in order for a split to be attempted. |
minbucket | The minimum number of observations in any terminal leaf node. Default is |
ncores | Number of CPU cores on the current host. |
Value
The updated frame_row with the next split filled in.
GGPlot the Mean Square Error with Error Bar for +/- 1 Standard Error
Description
GGPlot the Mean Square Error with Error Bar for +/- 1 Standard Error
Usage
ggcv(
  cv.obj,
  title = "MSE for CV of monothetic clustering",
  xlab = "Number of clusters",
  ylab = "MSE +/- 1 SE",
  type = c("b", "p", "l"),
  linetype = 2,
  err.col = "red",
  err.width = 0.2
)
Arguments
cv.obj | A |
title | Overall title for the plot. |
xlab | Title for x axis. |
ylab | Title for y axis. |
type | What type of plot should be drawn. Choosing between |
linetype | The line type. See |
err.col | Color of the error bars. |
err.width | Width of the bars. |
Value
A ggplot2 object.
See Also
Plot using base R: plot.cv.MonoClust()
Examples
library(cluster)
data(ruspini)
# 10-fold cross-validation
cptable <- cv.test(ruspini, minnodes = 2, maxnodes = 4)
ggcv(cptable)
Parallel Coordinates Plot with Circular Variables
Description
Makes a parallel coordinates plot in which the circular variables are plotted as ellipses. The function currently works well with data that have one circular variable.
Usage
ggpcp(
  data,
  circ.var = NULL,
  is.degree = TRUE,
  rotate = 0,
  north = 0,
  cw = FALSE,
  order.appear = NULL,
  linetype = 1,
  size = 0.5,
  alpha = 0.5,
  clustering,
  medoids = NULL,
  cluster.col = NULL,
  show.medoids = FALSE,
  labelsize = 4,
  xlab = "Variables",
  ylab = NULL,
  legend.cluster = "groups"
)
Arguments
data | Data set. |
circ.var | Circular variable(s) in the data set, indicated by names or index in the data set. |
is.degree | Whether the unit of the circular variables is degree or not (radian). Default is TRUE. |
rotate | The rotation (offset, shift) of the circular variable, in radians. Default is 0 (no rotation). |
north | What value of the circular variable is labeled North. Default is 0 radian. |
cw | Which direction of the circular variable is considered increasing in value, clockwise ( |
order.appear | The order of appearance of the variables, listed by a vector of names or index. If set, length has to be equal to the number of variables in the data set. |
linetype | Line type. Default is solid line. See details in |
size | Size of a line is its width in mm. Default is 0.5. See details in |
alpha | The transparency of the lines. Default is 0.5. |
clustering | Cluster membership. |
medoids | Vector of medoid observations of cluster. Only required when |
cluster.col | Color of clusters, indicated by a vector. If set, the length of this vector must be equal to the number of clusters in |
show.medoids | Whether to highlight the medoid lines or not. Default is FALSE. |
labelsize | The size of labels on the plot. Default is 4. |
xlab | Labels for x-axis. |
ylab | Labels for y-axis. |
legend.cluster | Labels for group membership. Implemented by setting label for ggplot |
Value
A ggplot2 object.
Examples
# Set color constant
COLOR4 <- c("#e41a1c", "#377eb8", "#4daf4a", "#984ea3")
# Reduce the size of the data for the sake of example speed
set.seed(12345)
wind_reduced <- wind_sensit_2007[sample.int(nrow(wind_sensit_2007), 50), ]
sol42007 <- MonoClust(wind_reduced, cir.var = 3, nclusters = 4)
library(ggplot2)
ggpcp(data = wind_reduced,
      circ.var = "WDIR",
      # To improve aesthetics
      rotate = pi*3/4-0.3,
      order.appear = c("WDIR", "has.sensit", "WS"),
      alpha = 0.5,
      clustering = sol42007$membership,
      medoids = sol42007$medoids,
      cluster.col = COLOR4,
      show.medoids = TRUE) +
  theme(panel.background = element_rect(color = "white"),
        panel.border = element_rect(color = "white", fill = NA),
        panel.grid.major = element_line(color = "#f0f0f0"),
        panel.grid.minor = element_blank(),
        axis.line = element_line(color = "black"),
        legend.key = element_rect(color = NA),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(face = "italic"),
        legend.justification = "center")
Cluster Inertia Calculation
Description
Calculate inertia for a given subset of the distance matrix from the original data set provided to x. Assumes that distance matrices are stored as matrices and not distance objects.
Usage
inertia_calc(x)
Arguments
x | Distance matrix, not an object of some distance measure. |
Value
Inertia value of the matrix, formula in Chavent (1998). If x is a single number, return 0.
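To make the quantity concrete, the sketch below assumes that inertia is the sum of all squared entries of the full distance matrix divided by twice the number of observations. That reading of the formula, the helper name `inertia_sketch`, and the toy points are assumptions for illustration, not a statement of what inertia_calc() does. With Euclidean distances this quantity equals the sum of squared deviations from the centroid, which the last line checks directly.

```r
# Hypothetical helper, assuming inertia = sum of squared distances / (2 * n)
inertia_sketch <- function(x) {
  if (length(x) == 1) return(0) # a single number means a one-point cluster
  sum(x^2) / (2 * nrow(x))
}

# Made-up points: (0, 0), (3, 4), (6, 8)
pts <- matrix(c(0, 0, 3, 4, 6, 8), ncol = 2, byrow = TRUE)
dist_mat <- as.matrix(dist(pts))
inertia_sketch(dist_mat)

# Same quantity computed as the sum of squares around the centroid
sum(sweep(pts, 2, colMeans(pts))^2)
```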
Examples
data(iris)
# Euclidean distance on first 20 rows of the 4 continuous variables
dist_mat <- as.matrix(dist(iris[1:20, 1:4]))
inertia_calc(dist_mat)
Test If The Object is A MonoClust
Description
This function returns TRUE for MonoClust, and FALSE for all other objects.
Usage
is_MonoClust(mono_obj)
Arguments
mono_obj | An object. |
Value
TRUE if the object inherits from the MonoClust class.
Create Jump Table
Description
Create a jump table from the MonoClust's frame object. number and var will be used to create the table.
Usage
make_jump_table(frame)
Arguments
frame | MonoClust's frame object |
Value
Jump table with number, var, and two new columns, left and right, which indicate the left and right node numbers at the split.
Find Medoid of the Cluster
Description
The medoid is the point that has the minimum distance to all other points in the cluster.
Usage
medoid(members, dist_mat)
Arguments
members | Index vector indicating which observations belong to the cluster. |
dist_mat | distance matrix of the whole data set. A class of |
Value
index of the medoid point in the members vector.
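A minimal base R sketch of the medoid definition is shown below. The helper `medoid_sketch` and the toy data are hypothetical; the sketch returns the observation index (in the full data) with the smallest total distance to the other cluster members, which may differ in detail from what medoid() returns.

```r
# Hypothetical helper: medoid of the observations listed in `members`
medoid_sketch <- function(members, dist_mat) {
  # Restrict the distance matrix to the cluster members
  sub <- as.matrix(dist_mat)[members, members, drop = FALSE]
  # The medoid minimizes the total distance to the other members
  members[which.min(rowSums(sub))]
}

# Made-up one-dimensional data; the cluster consists of observations 1 to 3
x <- c(0, 1, 10, 50)
medoid_sketch(c(1, 2, 3), dist(x))
```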
Examples
library(cluster)
data(ruspini)
ruspini4sol <- MonoClust(ruspini, nclusters = 4)
ruspini4sol
medoid(which(ruspini4sol$membership == 4), ruspini4sol$dist)
# Compare with the medoid labeled "4"
ruspini4sol$medoids
Create A New Node for Split Data Frame
Description
This function is just a helper to make sure that the default values of the split data frame are correct when unspecified. It helps reduce type errors, especially when moving to dplyr, which is stricter about data types.
Usage
new_node(
  number,
  var,
  cut = -99L,
  n,
  inertia,
  bipartsplitrow = -99L,
  bipartsplitcol = -99L,
  inertiadel = 0,
  inertia_explained = -99,
  medoid,
  loc,
  split.order = -99L,
  alt = list(tibble::tibble(bipartsplitrow = numeric(), bipartsplitcol = numeric()))
)
Arguments
number | Row index of the data frame. |
var | Whether it is a leaf, or the name of the next split variable. |
cut | The splitting value, so values (of |
n | Cluster size. Number of observations in that cluster. |
inertia | Inertia value of the cluster at that node. |
bipartsplitrow | Position of the next split row in the data set (that position will belong to the left (smaller) node). |
bipartsplitcol | Position of the next split variable in the data set. |
inertiadel | The proportion of inertia value of the cluster at that node to the inertia of the root. |
inertia_explained | Percent inertia explained as described in Chavent (2007) |
medoid | Position of the data point regarded as the medoid of its cluster. |
loc | y-coordinate of the splitting node to facilitate showing on the tree. See |
split.order | Order of the splits. Root is 0, and increasing. |
alt | Indicator of an alternative cut yielding the same reduction in inertia at that split. |
Value
A tibble with only one row and the correct default data types, even for unspecified variables.
References
Chavent, M., Lechevallier, Y., & Briant, O. (2007). DIVCLUS-T: A monothetic divisive hierarchical clustering method. Computational Statistics & Data Analysis, 52(2), 687–701. https://doi.org/10.1016/j.csda.2007.03.013
Permutation Test on Monothetic Tree
Description
Tests the significance of each monothetic clustering split by permutation methods. The "simple-withhold" method ("sw") shuffles the observations between two groups without the splitting variable. The other two methods shuffle the values in the splitting variable to create a new data set, then either split again on that variable ("resplit-limit", "rl") or use all variables as the splitting candidates ("resplit-nolimit", "rn").
Usage
perm.test(
  object,
  data,
  auto.pick = FALSE,
  sig.val = 0.05,
  method = c("sw", "rl", "rn"),
  rep = 1000L,
  stat = c("f", "aw"),
  bon.adj = TRUE,
  ncores = 1L
)
Arguments
object | The |
data | The data set which is being clustered. |
auto.pick | Whether the algorithm stops when p-value becomes larger than |
sig.val | Significance value to decide when to stop splitting. This option is ignored if |
method | Can be chosen between |
rep | Number of permutations required to calculate test statistic. |
stat | Statistic to use. Choosing between |
bon.adj | Whether to adjust for the multiple testing problem using Bonferroni correction. |
ncores | Number of CPU cores on the current host. When set to NULL, all available cores are used. |
Details
Permutation Methods
Simple-Withhold: Shuffle the observations between two proposed clusters
The stat values calculated from the shuffles create the reference distribution used to find the p-value. Because the splitting variable that was chosen is already the best in terms of reduction of inertia, that variable is withheld from the distance matrix used in the permutation test.
Resplit-Limit: Shuffle splitting variable, split again on that variable
This method shuffles the values of the splitting variable while keeping the other variables fixed to create a new data set, then the chosen stat is calculated for each rep to compare with the observed stat.
Resplit-Nolimit: Shuffle splitting variable, split on all variables
Similar to Resplit-Limit, but all variables are splitting candidates.
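The simple-withhold idea can be illustrated with a short base R sketch. All names and data below are hypothetical: `groups` stands for the two clusters proposed by a split on `x1`, the distance matrix is built from `x2` only (the splitting variable is withheld), the pseudo-F helper assumes Euclidean distances, and the +1 correction in the p-value is a common convention assumed here rather than something stated in this documentation.

```r
set.seed(42)
# Hypothetical data; the split is assumed to have been made on x1
dat <- data.frame(x1 = c(rnorm(10, 0), rnorm(10, 6)),
                  x2 = c(rnorm(10, 0), rnorm(10, 6)))
groups <- rep(1:2, each = 10)
# Squared distances on the remaining variable only (x1 is withheld)
d2 <- as.matrix(dist(dat[, "x2", drop = FALSE]))^2

# Pseudo-F for a two-group labelling, computed from squared distances
pseudo_f <- function(d2, g) {
  ss <- function(idx) sum(d2[idx, idx, drop = FALSE]) / (2 * length(idx))
  n <- nrow(d2)
  within <- ss(which(g == 1)) + ss(which(g == 2))
  between <- ss(seq_len(n)) - within
  between / (within / (n - 2))
}

observed <- pseudo_f(d2, groups)
reps <- 199
# Reference distribution: shuffle observations between the two groups
perm_stats <- replicate(reps, pseudo_f(d2, sample(groups)))
# Share of shuffles at least as extreme as the observed split
p_value <- (1 + sum(perm_stats >= observed)) / (reps + 1)
p_value
```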
Bonferroni Correction
A hypothesis test that occurs lower in the monothetic clustering tree can have its p-value corrected for the multiple tests that happened before it in order to reach that node. The formula is
adj.p = unadj.p \times depth,
where depth is 1 at the root node.
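As a small worked example of this formula, the sketch below adjusts three made-up p-values. The vectors are hypothetical, and capping the result at 1 is an assumption added so that adjusted values remain valid probabilities; the documentation above does not state it.

```r
# Hypothetical unadjusted p-values and the depth of each split (root = 1)
unadj_p <- c(0.001, 0.012, 0.030)
depth <- c(1, 2, 3)

# adj.p = unadj.p * depth, capped at 1 (the cap is an assumption)
adj_p <- pmin(1, unadj_p * depth)
adj_p
```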
Value
The same MonoClust object with an extra column (p-value), as well as the numofclusters object if auto.pick = TRUE.
Note
This function uses foreach::foreach() to facilitate parallel processing. It distributes reps to processes.
References
Calinski, T. and Harabasz, J (1974). "A dendrite method for cluster analysis". In: Communications in Statistics 3.1, pp. 1-27. doi:10.1080/03610927408827101.
Rousseeuw, P. J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis". In: Journal of Computational and Applied Mathematics 20, pp. 53-65. ISSN: 03770427. doi:10.1016/0377-0427(87)90125-7.
Examples
library(cluster)
data(ruspini)
ruspini6sol <- MonoClust(ruspini, nclusters = 6)
ruspini6.p_value <- perm.test(ruspini6sol, data = ruspini, method = "sw", rep = 1000)
ruspini6.p_value
Plot MonoClust Splitting Rule Tree
Description
Plot the MonoClust tree in the form of a dendrogram.
Usage
## S3 method for class 'MonoClust'
plot(
  x,
  uniform = FALSE,
  branch = 1,
  margin = c(0.12, 0.02, 0, 0.05),
  minbranch = 0.3,
  text = TRUE,
  which = 4,
  stats = TRUE,
  abbrev = c("no", "short", "abbreviate"),
  digits = getOption("digits") - 2,
  cols = NULL,
  col.type = c("l", "p", "b"),
  rel.loc.x = TRUE,
  show.pval = TRUE,
  ...
)
Arguments
x | MonoClust result object. |
uniform | If TRUE, uniform vertical spacing of the nodes is used; this may be less cluttered when fitting a large plot onto a page. The default is to use a non-uniform spacing proportional to the inertia in the fit. |
branch | Controls the shape of the branches from parent to child node. Any number from 0 to 1 is allowed. A value of 1 gives square shouldered branches, a value of 0 gives V shaped branches, with other values being intermediate. |
margin | An extra fraction of white space to leave around the borders of the tree. (Long labels sometimes get cut off by the default computation). |
minbranch | Set the minimum length for a branch to |
text | Whether to print the labels on the tree. |
which | Labeling modes, which are:
|
stats | Whether to show statistics (cluster sizes and medoid points) on the tree. |
abbrev | Whether to print the abbreviated versions of variable names. Can be either "no" (default), "short", or "abbreviate". Short forms of them can also be used. If "no", the labels recorded in If "short", variable names will be turned into "V1", "V2", ... If "abbreviate", |
digits | Number of significant digits to print. |
cols | Whether to show color bars at leaves or not. It helps match this tree plot with other plots in which cluster membership is colored. It only works when |
col.type | When |
rel.loc.x | Whether to use the relative distance between clusters as x coordinate of the leaves. Default is TRUE. |
show.pval | If MonoClust object has been run through |
... | Arguments to be passed to |
Value
A plot of splitting rule.
Examples
library(cluster)
data(ruspini)
# MonoClust tree
ruspini4sol <- MonoClust(ruspini, nclusters = 4)
plot(ruspini4sol)
# MonoClust tree after permutation test is run
ruspini6sol <- MonoClust(ruspini, nclusters = 6)
ruspini6_test <- perm.test(ruspini6sol, data = ruspini, method = "sw", rep = 1000)
plot(ruspini6_test, branch = 1, uniform = TRUE)
Plot the Mean Square Error with Error Bar for +/- 1 Standard Error
Description
Plot the Mean Square Error with Error Bar for +/- 1 Standard Error
Usage
## S3 method for class 'cv.MonoClust'
plot(
  x,
  main = "MSE for CV of monothetic clustering",
  xlab = "Number of clusters",
  ylab = "MSE +/- 1 SE",
  type = "b",
  lty = 2,
  err.col = "red",
  err.width = 0.1,
  ...
)
Arguments
x | A |
main | Overall title for the plot. |
xlab | Title for x axis. |
ylab | Title for y axis. |
type | What type of plot should be drawn. See |
lty | The line type. |
err.col | Color of the error bars. |
err.width | Width of the bars. |
... | Arguments to be passed to |
Value
A line plot with error bars.
See Also
Plot using ggplot2: ggcv()
Examples
library(cluster)
data(ruspini)
# 10-fold cross-validation
cptable <- cv.test(ruspini, minnodes = 2, maxnodes = 4)
plot(cptable)
Calculate Branch Coordinates
Description
Calculate Branch Coordinates
Usage
plot_prep_branch(x, y, node, branch = 0)
Arguments
x | Nodes x-coordinates. |
y | Nodes y-coordinates. |
node | Nodes row number. |
branch | Controls the shape of the branches from parent to child node. Any number from 0 to 1 is allowed. A value of 1 gives square shouldered branches, a value of 0 gives V shaped branches, with other values being intermediate. |
Value
Branch coordinates in a list of x and y axis.
Calculate Nodes Coordinates
Description
Calculate Nodes Coordinates
Usage
plot_prep_node(tree, uniform = FALSE, minbranch = 0.3)
Arguments
tree | MonoClust result object. |
uniform | If TRUE, uniform vertical spacing of the nodes is used; this may be less cluttered when fitting a large plot onto a page. The default is to use a non-uniform spacing proportional to the inertia in the fit. |
minbranch | Set the minimum length for a branch to |
Value
Nodes coordinates in a list of x and y axis.
Plot the monoClust Tree.
Description
This function plots the MonoClust tree. It is partially inspired by the rpart package.
Usage
plot_tree(
  x,
  uniform = FALSE,
  branch = 1,
  margin = 0,
  minbranch = 0.3,
  rel.loc.x = TRUE,
  ...
)
Arguments
x | MonoClust result object. |
uniform | If TRUE, uniform vertical spacing of the nodes is used; this may be less cluttered when fitting a large plot onto a page. The default is to use a non-uniform spacing proportional to the inertia in the fit. |
branch | Controls the shape of the branches from parent to child node. Any number from 0 to 1 is allowed. A value of 1 gives square shouldered branches, a value of 0 gives V shaped branches, with other values being intermediate. |
margin | An extra fraction of white space to leave around the borders of the tree. (Long labels sometimes get cut off by the default computation). |
minbranch | Set the minimum length for a branch to |
rel.loc.x | Whether to use the relative distance between clusters as x coordinate of the leaves. Default is TRUE. |
... | Arguments to be passed to |
Value
Plot of tree
Predictions from a MonoClust Object
Description
Predict the cluster memberships of a new data set from a MonoClust object.
Usage
## S3 method for class 'MonoClust'
predict(object, newdata, type = c("centroid", "medoid"), ...)
Arguments
object | MonoClust result object. |
newdata | Data frame containing the values to be predicted. If missing, the memberships of the MonoClust object are returned. |
type | Type of returned cluster representatives. Either |
... | Further arguments passed to or from other methods. |
Value
A tibble of cluster index in cname and either centroid values or medoid observation indexes, depending on the value of the type argument.
Examples
library(cluster)
data(ruspini)
set.seed(1234)
test_index <- sample(1:nrow(ruspini), nrow(ruspini)/5)
train_index <- setdiff(1:nrow(ruspini), test_index)
ruspini_train <- ruspini[train_index, ]
ruspini_test <- ruspini[test_index, ]
ruspini_train_4sol <- MonoClust(ruspini_train, nclusters = 4)
predict(ruspini_train_4sol, newdata = ruspini_test)
Print Monothetic Clustering Results
Description
Render the MonoClust split tree in an easy-to-read format with important information such as terminal nodes, p-value (if possible), etc.
Usage
## S3 method for class 'MonoClust'
print(x, abbrev = c("no", "short", "abbreviate"), spaces = 2L, digits = getOption("digits"), ...)
Arguments
x | MonoClust result object. |
abbrev | Whether to print the abbreviated versions of variable names. Can be either "no" (default), "short", or "abbreviate". Short forms of them can also be used. If "no", the labels recorded in If "short", variable names will be turned into "V1", "V2", ... If "abbreviate", |
spaces | Number of spaces to indent between two tree levels. |
digits | Number of significant digits to print. |
... | Optional arguments to |
Value
A nicely displayed MonoClust split tree.
See Also
Examples
library(cluster)
data(ruspini)
ruspini4sol <- MonoClust(ruspini, nclusters = 4)
print(ruspini4sol, digits = 2)
Print MonoClust Cross-Validation Result
Description
Print MonoClust Cross-Validation Result
Usage
## S3 method for class 'cv.MonoClust'
print(x, ...)
Arguments
x | A |
... | Further arguments passed to or from other methods. |
Examples
library(cluster)
data(ruspini)
# 10-fold cross-validation
cp_table <- cv.test(ruspini, minnodes = 2, maxnodes = 4)
print(cp_table)
Split Function
Description
Given the row position in the cluster frame to split at (split_row), this function performs the split and calculates all information needed for the splitting tree and the cluster memberships.
Usage
splitter(data, cuts, split_row, frame, cloc, dist, split_order = 0L)
Arguments
data | Original data set. |
cuts | Cuts data set, which has the next higher value of each variable in the original data set. |
split_row | The row index in frame that would be split on. |
frame | The split tree transferred as data frame. |
cloc | Vector of current cluster membership. |
dist | Distance matrix of all observations in the data.exported function yet. Vector of 1 for all observations. |
split_order | Control argument that tracks how many splits have been done. |
Value
Updated frame and cloc, saved in a list.
Hypothesis Test at Split
Description
Hypothesis Test at Split
Usage
test_split(members_l, members_r, method, data, split_var, rep, stat, ncores)
Arguments
members_l, members_r | Vectors of the indices of observations that are members of the left child node and the right child node, respectively. |
method | Can be chosen between |
data | The data set which is being clustered. |
split_var | Splitting variable at current split. |
rep | Number of permutations required to calculate test statistic. |
stat | Statistic to use. Choosing between |
ncores | Number of CPU cores on the current host. When set to NULL, all available cores are used. |
Value
p-value of the test
Implementation of Print Labels on MonoClust Tree
Description
This function plots the labels onto the MonoClust tree. It is partially inspired by the rpart package.
Usage
text_tree(x, which = 4, digits = getOption("digits") - 2, stats = TRUE, abbrev, cols = NULL, cols.type = c("l", "p", "b"), rel.loc.x = TRUE, show.pval = TRUE, uniform = FALSE, minbranch = 0.3, ...)
Arguments
x | MonoClust result object. |
which | Labeling modes, which are:
|
digits | Number of significant digits to print. |
stats | Whether to show statistics (cluster sizes and medoid points) on the tree. |
abbrev | Whether to print the abbreviated versions of variable names. Can be either "no" (default), "short", or "abbreviate". Short forms of them can also be used. If "no", the labels recorded in If "short", variable names will be turned into "V1", "V2", ... If "abbreviate", |
cols | Whether to show color bars at the leaves. This helps match this tree plot with other plots whose cluster memberships were colored. It only works when |
rel.loc.x | Whether to use the relative distance between clusters as the x coordinate of the leaves. Default is TRUE. |
show.pval | If MonoClust object has been run through |
uniform | If TRUE, uniform vertical spacing of the nodes is used; this may be less cluttered when fitting a large plot onto a page. The default is to use a non-uniform spacing proportional to the inertia in the fit. |
minbranch | Set the minimum length for a branch to |
... | Extra arguments that would be transferred to |
Value
Labels on tree.
Transform Between Degree and Radian
Description
This function transforms a circular angle from degree to radian or from radian to degree.
Usage
torad(x)
todeg(x)
Arguments
x | A degree value if |
Value
A radian value if torad, or a degree value if todeg.
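The underlying arithmetic is the standard conversion between degrees and radians. The following is a minimal sketch of that formula only; `to_rad_sketch` and `to_deg_sketch` are hypothetical names, and the package's own functions may handle additional details (for example, wrapping of negative angles) that are not reproduced here.

```r
# Hypothetical helpers illustrating the conversion formula only.
to_rad_sketch <- function(deg) deg * pi / 180
to_deg_sketch <- function(rad) rad * 180 / pi

to_rad_sketch(90)      # pi / 2, about 1.5708
to_deg_sketch(pi / 2)  # 90
```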
Examples
torad(90)
torad(-45)
todeg(pi/2)
Find Tree Depth Based on Node Indexes
Description
Find Tree Depth Based on Node Indexes
Usage
tree_depth(nodes)
Arguments
nodes | Vector of node indexes in the tree. |
Details
When building the MonoClust tree, node indexes are created with the rule that the indexes of new nodes are the index of the split node times 2, plus 0 (left) or 1 (right). Therefore, this function is just a back-transform, taking a log base 2.
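The back-transform can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's code: it assumes the root has index 1, reports depth relative to the shallowest node in the input, and uses the hypothetical name `depth_sketch`.

```r
# Hypothetical sketch: children of node k are 2k (left) and 2k + 1 (right),
# so a node's depth is the integer part of log2(index).
depth_sketch <- function(nodes) {
  # A small offset guards against floating-point error at exact powers of 2.
  depth <- floor(log2(nodes) + 1e-7)
  # Report depth relative to the shallowest node supplied.
  depth - min(depth)
}

depth_sketch(c(1, 2, 3, 6, 7))  # 0 1 1 2 2
depth_sketch(c(3, 6, 7))        # 0 1 1
```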
Value
Depth of each node, where 0 is the root relative to the input.
Traverse a Tree to Find the Leaves (Terminal Nodes)
Description
Traverse a Tree to Find the Leaves (Terminal Nodes)
Usage
tree_walk(new_point, jump_table)
Arguments
new_point | New data point |
jump_table | Jump table |
Value
The index of the terminal node after traversing the new data point on the tree.
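The structure of the jump table is not documented here, so the following is only a sketch of how such a traversal could work. The column names (`node`, `var`, `cut`, `left`, `right`), the rule "go left when the value is below the cut", and the function name `walk_sketch` are all assumptions made for illustration.

```r
# Hypothetical jump table: one row per node; leaves have NA in `var`.
jump_table <- data.frame(
  node  = c(1, 2, 3, 6, 7),
  var   = c("x", NA, "y", NA, NA),
  cut   = c(40, NA, 100, NA, NA),
  left  = c(2, NA, 6, NA, NA),
  right = c(3, NA, 7, NA, NA),
  stringsAsFactors = FALSE
)

# Follow the splits from the root until a leaf is reached.
walk_sketch <- function(new_point, jump_table, root = 1) {
  current <- root
  repeat {
    row <- jump_table[jump_table$node == current, ]
    if (is.na(row$var)) return(current)  # leaf: no further split
    current <- if (new_point[[row$var]] < row$cut) row$left else row$right
  }
}

walk_sketch(list(x = 10, y = 50), jump_table)   # 2
walk_sketch(list(x = 60, y = 50), jump_table)   # 6
walk_sketch(list(x = 60, y = 150), jump_table)  # 7
```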
Existence of Microorganisms Carried in Wind
Description
Data set is part of a study on microorganisms carried in strong föhn winds at the Bonney Riegel location of Taylor Valley, an ice-free area on the Antarctic continent. Wind direction and wind speed data were obtained from the meteorological station. Wind direction was recorded every 30 seconds and wind speed every 4 seconds at 1.15 meters above the ground surface. The recorded wind directions and speeds were averaged at 15-minute intervals. For wind direction, as discussed previously, winds from the north are defined as 0/360 degrees and from the east as 90 degrees. The 2007 data were collected from August 4–11, 2007.
Usage
wind_sensit_2007
Format
A data frame with 671 rows and 3 variables:
- has.sensit
A binary variable indicating whether particles were present in the wind (1) or not (0).
- WS
Wind speed measured in m/s.
- WDIR
Wind direction in degrees, where 0 indicates "from the north" and 90 indicates "from the east".
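Because WDIR is circular (0 and 360 degrees both mean "from the north"), plain subtraction overstates the gap between directions on either side of north. The following sketch shows the usual wrap-around difference in base R; `circ_diff_sketch` is a hypothetical helper and is not necessarily how the package measures circular distance.

```r
# Hypothetical helper: smallest angular difference between two directions,
# in degrees, taking the 0/360 wrap-around into account.
circ_diff_sketch <- function(a, b) {
  d <- abs(a - b) %% 360
  pmin(d, 360 - d)
}

circ_diff_sketch(350, 10)  # 20, not 340
circ_diff_sketch(90, 270)  # 180
```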
Source
Sabacka, M., Priscu, J. C., Basagic, H. J., Fountain, A. G., Wall, D. H., Virginia, R. A., and Greenwood, M. C. (2012). "Aeolian flux of biotic and abiotic material in Taylor Valley, Antarctica". In: Geomorphology 155-156, pp. 102-111. issn: 0169555X. doi:10.1016/j.geomorph.2011.12.009.
Existence of Microorganisms Carried in Wind
Description
Data set is part of a study on microorganisms carried in strong föhn winds at the Bonney Riegel location of Taylor Valley, an ice-free area on the Antarctic continent. Wind direction and wind speed data were obtained from the meteorological station. Wind direction was recorded every 30 seconds and wind speed every 4 seconds at 1.15 meters above the ground surface. The recorded wind directions and speeds were averaged at 15-minute intervals. For wind direction, as discussed previously, winds from the north are defined as 0/360 degrees and from the east as 90 degrees. The 2008 data were collected from July 7–14, 2008.
Usage
wind_sensit_2008
Format
A data frame with 673 rows and 3 variables:
- has.sensit
A binary variable indicating whether particles were present in the wind (1) or not (0).
- WS
Wind speed measured in m/s.
- WDIR
Wind direction in degrees, where 0 indicates "from the north" and 90 indicates "from the east".
Source
Sabacka, M., Priscu, J. C., Basagic, H. J., Fountain, A. G., Wall, D. H., Virginia, R. A., and Greenwood, M. C. (2012). "Aeolian flux of biotic and abiotic material in Taylor Valley, Antarctica". In: Geomorphology 155-156, pp. 102-111. issn: 0169555X. doi:10.1016/j.geomorph.2011.12.009.