
Type:

Package

Title:

Cluster High Dimensional Categorical Datasets

Version:

0.3.0

Description:

Scalable Bayesian clustering of categorical datasets. The package implements a hierarchical Dirichlet (Process) mixture of multinomial distributions. It is thus a probabilistic latent class model (LCM) and can be used to reduce the dimensionality of hierarchical data and cluster individuals into latent classes. It can automatically infer an appropriate number of latent classes or find k classes, as defined by the user. The model is based on a paper by Dunson and Xing (2009) <doi:10.1198/jasa.2009.tm08439>, but implements a scalable variational inference algorithm so that it is applicable to large datasets. It is described and tested in the accompanying paper by Ahlmann-Eltze and Yau (2018) <doi:10.1109/DSAA.2018.00068>.

URL:

https://github.com/const-ae/mixdir

License:

GPL-3

Encoding:

UTF-8

LazyData:

true

Suggests:

testthat, tibble, purrr, dplyr, rmutil, pheatmap, mcclust, ggplot2, tidyr, utils

RoxygenNote:

6.1.1

Imports:

extraDistr, Rcpp

Depends:

R (≥ 2.10)

LinkingTo:

Rcpp

NeedsCompilation:

yes

Packaged:

2019-09-20 14:51:59 UTC; ahlmanne

Author:

Constantin Ahlmann-Eltze [aut, cre], Christopher Yau [ths]

Maintainer:

Constantin Ahlmann-Eltze <artjom31415@googlemail.com>

Repository:

CRAN

Date/Publication:

2019-09-20 15:10:05 UTC

Find the n defining features

Description

Reduce the dimensionality of a dataset by calculating how important each feature is for inferring the clustering.

Usage

find_defining_features(mixdir_obj, X, n_features = Inf,
  measure = c("JS", "ARI"), subsample_size = Inf, step_size = Inf,
  exponential_decay = TRUE, verbose = FALSE)

Arguments

mixdir_obj

the result from a call to mixdir(). It needs to have the field category_prob. category_prob is a list of a list of a named vector with probabilities for each feature, latent class and possible category.

X

the original dataset that was used for clustering.

n_features

the number of dimensions that should be selected. If it is Inf (the default), all features are returned ordered by importance (most important first).

measure

The measure used to assess the loss of clustering quality if a variable is removed. Two measures are implemented: "JS", short for Jensen-Shannon divergence, compares the original class probabilities with the new predicted class probabilities (smaller is better); "ARI", short for adjusted Rand index, compares the overlap of the original and the predicted classes (1 is perfect, 0 is as good as random) and requires the mcclust package.

subsample_size

Running this method on the full dataset can be slow. The calculation can be sped up by randomly selecting a subset of rows from X, which usually does not disproportionately hurt the selection performance.

step_size

The method can either remove each feature individually and return the n features that caused the greatest quality loss (step_size=Inf), or iteratively remove the least important one until the number of remaining features equals n_features (step_size=1). Using a smaller step size increases the sensitivity of the selection process, but takes longer to calculate.

exponential_decay

Boolean or number. Alternative way of calculating how many features to remove each step. The default is to always remove the least important 50% of the features (exponential_decay=2).

verbose

Boolean indicating if status messages should be printed.
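
The interplay of n_features and exponential_decay can be pictured as a shrinking schedule. The following is a minimal sketch, assuming that each round keeps roughly 1/decay of the remaining features; removal_schedule is a hypothetical helper written for illustration and is not part of the package, whose internal rounding may differ.

```r
# Hypothetical helper (not part of mixdir): number of features that remain
# after each elimination round under an assumed exponential decay schedule.
removal_schedule <- function(n_total, n_features, decay = 2) {
  stopifnot(is.finite(n_features), n_total >= n_features,
            n_features >= 1, decay > 1)
  remaining <- n_total
  kept <- remaining
  while (remaining > n_features) {
    # Keep about 1/decay of the features, but never fewer than requested
    remaining <- max(n_features, floor(remaining / decay))
    kept <- c(kept, remaining)
  }
  kept
}

# 22 candidate features reduced to 3, halving in each round
schedule <- removal_schedule(n_total = 22, n_features = 3)
print(schedule)
```

Under this assumption only a handful of rounds are needed, whereas step_size=1 would need one round per removed feature.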

Details

Iteratively find the variable whose removal least affects the clustering compared with the original. If n_features is a finite number, the quality is a single number that reflects how well those n features maintain the original clustering. If n_features=Inf, the method returns all features ordered by decreasing importance. The accompanying quality vector contains the "cumulative" loss incurred if the corresponding variable were removed. Note that the quality can differ depending on the step size scheme. For example, if all variables are removed in one step (step_size=Inf and exponential_decay=FALSE), the quality is not cumulative but simply the quality of the clustering excluding the corresponding feature. In that sense the quality vector should not be taken as a definitive answer; it should only be used as guidance to see where the quality jumps.
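
To make the "JS" measure concrete, here is a minimal base R sketch of the Jensen-Shannon divergence between two class probability vectors. The helper js_divergence and the toy vectors are assumptions made for illustration; the package's internal implementation (for example its choice of logarithm base or how it aggregates over individuals) may differ.

```r
# Illustrative helper (not the package's code): Jensen-Shannon divergence
# between two discrete distributions, using the natural logarithm.
js_divergence <- function(p, q) {
  stopifnot(length(p) == length(q))
  m <- (p + q) / 2
  kl <- function(a, b) {
    nz <- a > 0                      # terms with a == 0 contribute 0
    sum(a[nz] * log(a[nz] / b[nz]))
  }
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

p_full    <- c(0.7, 0.2, 0.1)  # made-up class probabilities, all features
p_reduced <- c(0.6, 0.3, 0.1)  # made-up class probabilities, one feature dropped

js_same <- js_divergence(p_full, p_full)        # identical distributions
js_diff <- js_divergence(p_full, p_reduced)     # slightly different
js_max  <- js_divergence(c(1, 0), c(0, 1))      # disjoint support
```

A value of zero means the reduced feature set reproduces the original class probabilities exactly, which is why smaller is better for this measure.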

Examples

  data("mushroom")
  res <- mixdir(mushroom[1:100, ], n_latent=20)
  find_defining_features(res, mushroom[1:100, ], n_features=3)
  find_defining_features(res, mushroom[1:100, ], n_features=Inf)

Find the top predictive features and values for each latent class

Description

Find the top predictive features and values for each latent class

Usage

find_predictive_features(mixdir_obj, top_n = 10)

Arguments

mixdir_obj

the result from a call to mixdir(). It needs to have the fields lambda and category_prob. lambda is a vector of probabilities for each latent class. category_prob is a list of a list of a named vector with probabilities for each feature, latent class and possible category.

top_n

the number of top answers per category that will be returned. Default: 10.

Value

A data frame with four columns: column, answer, class and probability. The probability column contains the chance that an observation belongs to the latent class if all that is known about that observation is that `column`=`category`.
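
One natural reading of this probability is Bayes' rule applied to a single feature. The formula below is a sketch under that assumption and is not taken from the package source. It borrows the notation of the model specification in the mixdir() help page: \lambda_k is the probability of latent class k (the lambda field) and U_{j,k}(a) is the probability of answer a for feature j in class k (the category_prob field).

```latex
P(z = k \mid X_j = a)
  = \frac{\lambda_k \, U_{j,k}(a)}{\sum_{l} \lambda_l \, U_{j,l}(a)}
```

An answer is therefore highly predictive of class k when it is common in class k and rare in the other classes, even if class k itself is small.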

Examples

  data("mushroom")
  res <- mixdir(mushroom[1:30, ], beta=1)
  find_predictive_features(res, top_n=3)

Find the most typical features and values for each latent class

Description

Find the most typical features and values for each latent class

Usage

find_typical_features(mixdir_obj, top_n = 10)

Arguments

mixdir_obj

the result from a call to mixdir().

top_n

the number of top answers per category that will be returned. Default: 10.

Value

A data frame with four columns: column, answer, class and probability. The probability column contains the chance to see the answer in that column.

Examples

  data("mushroom")
  res <- mixdir(mushroom[1:30, ], beta=1)
  find_typical_features(res, top_n=3)

Cluster high dimensional categorical datasets

Description

Cluster high dimensional categorical datasets

Usage

mixdir(X, n_latent = 3, alpha = NULL, beta = NULL,
  select_latent = FALSE, max_iter = 100, epsilon = 0.001,
  na_handle = c("ignore", "category"), repetitions = 1, ...)

Arguments

X

A matrix or data.frame of size (N_ind x N_quest) that contains the categorical responses. The values can be characters, integers or factors. The most flexibility is provided if factors are used.

n_latent

The number of latent factors that are used to approximate the model. Default: 3.

alpha

A single number, or a vector of two numbers in case select_latent=TRUE. If it is NULL, alpha is initialized to 1. It serves as a prior for the Dirichlet distributions over the latent groups; its values act as pseudo counts of individuals per group.

beta

A single number. If it is NULL, beta is initialized to 0.1. It serves as a prior for the Dirichlet distributions over the categorical responses. Large numbers favor an equal distribution of responses for a question of the individuals in the same latent group, small numbers indicate that individuals of the same latent group usually answer a question the same way.

select_latent

A boolean that indicates if the exact number n_latent should be used or if a Dirichlet Process prior is used that shrinks the number of used latent variables appropriately (can be controlled with alpha=c(a1, a2) and beta). Default: FALSE.

max_iter

The maximum number of iterations.

epsilon

A number that indicates the numerical precision necessary to consider the algorithm converged.

na_handle

Either "ignore" or "category". If it is "category", all NA's in the dataset are converted to the string "(Missing)" and treated as their own category. If it is "ignore", the NA's are treated as missing completely at random and are ignored during the parameter updates.

repetitions

A number specifying how often to repeat the calculation with different initializations. Automatically selects the best run (i.e. max(ELBO)). Default: 1.

...

Additional parameters passed on to the underlying functions. The parameters are verbose, phi_init, zeta_init and, if select_latent=FALSE, omega_init or, if select_latent=TRUE, kappa1_init and kappa2_init.
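
The effect of na_handle="category" on the input can be mimicked in base R. This is a minimal sketch on a made-up toy data frame (not the mushroom dataset); it only shows the described recoding of NA to "(Missing)" and does not reproduce the package's own preprocessing code.

```r
# Toy data with missing values (hypothetical values, chosen for illustration)
X <- data.frame(
  odor    = c("none", NA, "foul"),
  habitat = c("woods", "urban", NA),
  stringsAsFactors = FALSE
)

# Mimic na_handle = "category": every NA becomes the string "(Missing)"
X_cat <- X
X_cat[is.na(X_cat)] <- "(Missing)"

# "(Missing)" now counts as an ordinary answer of each column
print(table(X_cat$odor))
```

With na_handle="ignore" no such recoding takes place; according to the description above, the missing cells are skipped during the parameter updates.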

Details

The function uses a mixture of multinomials to fit the model. The full model specification is

\lambda | \alpha \sim DirichletProcess(\alpha)

z_i | \lambda \sim Multinomial(\lambda)

U_{j,k} | \beta \sim Dirichlet(\beta)

X_{i,j} | U_j, z_i=k \sim Multinomial(U_{j,k})

In case that select_latent=FALSE the first line is replaced with

\lambda | \alpha \sim Dirichlet(\alpha)

The initial inspiration came from Dunson and Xing (2009), who proposed a Gibbs sampling algorithm to solve this model. To speed up inference, a variational inference approach was derived and implemented in this package.
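
The model specification can be read as a recipe for generating data. The following base R sketch simulates from the finite variant (select_latent=FALSE) with symmetric Dirichlet priors. All sizes, the seed and the helper rdirichlet1 are assumptions chosen for illustration; the code does not use or reproduce the package's inference algorithm.

```r
set.seed(1)

# Illustrative helper: one Dirichlet draw via normalised Gamma variates
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

n_ind <- 50; n_quest <- 4; n_latent <- 3; n_cat <- 2  # made-up sizes
alpha <- 1; beta <- 0.1                                # documented defaults

# lambda | alpha ~ Dirichlet(alpha)
lambda <- rdirichlet1(rep(alpha, n_latent))

# z_i | lambda ~ Multinomial(lambda): latent class of each individual
z <- sample(n_latent, n_ind, replace = TRUE, prob = lambda)

# U_{j,k} | beta ~ Dirichlet(beta): answer probabilities per question and class
U <- lapply(seq_len(n_quest), function(j)
  lapply(seq_len(n_latent), function(k) rdirichlet1(rep(beta, n_cat))))

# X_{i,j} | U_j, z_i = k ~ Multinomial(U_{j,k})
X <- sapply(seq_len(n_quest), function(j)
  sapply(seq_len(n_ind), function(i)
    sample(n_cat, 1, prob = U[[j]][[z[i]]])))
```

Because beta is small, most U_{j,k} put nearly all mass on one answer, so individuals of the same class tend to answer alike, which matches the description of beta in the Arguments section.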

Value

A list that is tagged with the class "mixdir" containing 9 elements:

converged: a boolean indicator if the model has converged
convergence: a numerical vector with the ELBO of each iteration
ELBO: the final ELBO of the converged model
lambda: a numerical vector with the n_latent class probabilities
pred_class: an integer vector with the most likely class assignment for each individual.
class_prob: a matrix of size n_ind x n_latent which has for each individual the probability to belong to class k.
category_prob: a list with one entry for each feature (i.e. column of X). Each entry is again a list with one entry for each class, that contains the probability of individuals of that class to answer with a specific response.
specific_params: A list whose content depends on the parameter select_latent. If select_latent=FALSE it contains the two entries omega and phi which are the Dirichlet hyperparameters that the model has fitted. If select_latent=TRUE it contains kappa1, kappa2 and phi, which are the hyperparameters for the Dirichlet Process and the Dirichlet of the answer.
na_handle: a string indicating the method used to handle missing values. This is important for subsequent calls to predict.mixdir.
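
The descriptions of pred_class and class_prob suggest that the former is the row-wise argmax of the latter. The sketch below shows that relation on a made-up matrix; that the fitted object satisfies it exactly is an assumption based on the wording above, not something verified against the package source.

```r
# Toy class_prob matrix: 3 individuals x 2 latent classes (made-up values)
class_prob <- matrix(c(0.9, 0.1,
                       0.2, 0.8,
                       0.6, 0.4), ncol = 2, byrow = TRUE)

# Most likely class per individual = column index of the row maximum
pred_class <- apply(class_prob, 1, which.max)
print(pred_class)
```

For a fitted model res, the analogous check would be comparing apply(res$class_prob, 1, which.max) with res$pred_class.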

References

1. C. Ahlmann-Eltze and C. Yau, "MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data", 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 2018, pp. 526-539.

2. Dunson, D. B. and Xing, C. Nonparametric Bayes Modeling of Multivariate Categorical Data. J. Am. Stat. Assoc. 104, 1042–1051 (2009).

3. Blei, D. M., Ng, A. Y. and Jordan, M. I. Latent Dirichlet Allocation. J. Machine Learn. Res. 3, 993–1022 (2003).

4. Blei, D. M. and Jordan, M. I. Variational inference for Dirichlet process mixtures. Bayesian Anal. 1, 121–144 (2006).

Examples

  data("mushroom")
  res <- mixdir(mushroom[1:30, ])

Properties of 8124 mushrooms.

Description

A dataset containing 23 categorical properties of 23 different species of gilled mushrooms, including a categorization if it is edible or not.

Usage

mushroom

Format

A data frame with 8124 rows and 23 columns:

bruises: bruises, no
cap-color: brown, yellow, white, gray, red, pink, buff, purple, cinnamon, green
cap-shape: convex, bell, sunken, flat, knobbed, conical
cap-surface: smooth, scaly, fibrous, grooves
edible: poisonous, edible
gill-attachment: free, attached
gill-color: black, brown, gray, pink, white, chocolate, purple, red, buff, green, yellow, orange
gill-size: narrow, broad
gill-spacing: close, crowded
habitat: urban, grasses, meadows, woods, paths, waste, leaves
odor: pungent, almond, anise, none, foul, creosote, fishy, spicy, musty
population: scattered, numerous, abundant, several, solitary, clustered
ring-number: one, two, none
ring-type: pendant, evanescent, large, flaring, none
spore-print-color: black, brown, purple, chocolate, white, green, orange, yellow, buff
stalk-color-above-ring: white, gray, pink, brown, buff, red, orange, cinnamon, yellow
stalk-color-below-ring: white, pink, gray, buff, brown, red, yellow, orange, cinnamon
stalk-root: equal, club, bulbous, rooted, NA
stalk-shape: enlarging, tapering
stalk-surface-above-ring: smooth, fibrous, silky, scaly
stalk-surface-below-ring: smooth, fibrous, scaly, silky
veil-color: white, brown, orange, yellow
veil-type: partial

Details

The records are drawn from G. H. Lincoff (1981) (Pres.), The Audubon Society Field Guide to North American Mushrooms. New York: Alfred A. Knopf. (See pages 500–525 for the Agaricus and Lepiota Family.)

The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like “leaflets three, let it be” for Poisonous Oak and Ivy.

The actual dataset from the UCI repository has been cleaned up to properly label the missing values and have the full category names instead of their abbreviations.

Source

https://archive.ics.uci.edu/ml/datasets/Mushroom

References

Blake, C.L. & Merz, C.J. (1998). UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.

Examples

  data("mushroom")
  summary(mushroom)

Plot cluster distribution for a subset of features

Description

Plot cluster distribution for a subset of features

Usage

plot_features(features, category_prob,
  classes = seq_len(length(category_prob[[1]])))

Arguments

features

a character vector with feature names

category_prob

a list over all features containing a list of the probability of each answer for every class. It is usually obtained from the result of a call to mixdir().

classes

numerical vector specifying which latent classes are plotted. By default all.

Examples

  data("mushroom")
  res <- mixdir(mushroom[1:100, ], n_latent=4)
  plot_features(c("bruises", "edible"), res$category_prob)

  res2 <- mixdir(mushroom[1:100, ], n_latent=20)
  def_feats <- find_defining_features(res2, mushroom[1:100, ], n_features=Inf)
  # Select the classes from res2, the same fit that supplies category_prob
  plot_features(def_feats$features[1:6], category_prob = res2$category_prob,
                classes=which(res2$lambda > 0.01))

Predict the class of a new observation.

Description

Predict the class of a new observation.

Usage

## S3 method for class 'mixdir'
predict(object, newdata, ...)

Arguments

object

the result from a call to mixdir(). It needs to have the fields lambda, category_prob and na_handle. lambda is a vector of probabilities for each latent class. category_prob is a list of a list of a named vector with probabilities for each feature, latent class and possible category. na_handle must be either "ignore" or "category", depending on how NA's should be handled.

newdata

a named vector with a single new observation or a data.frame with the same structure as the original data used for fitting the model. Missing features or features not encountered during training are replaced by NA.

...

currently unused

Value

A matrix with the same number of rows as the input and one column for each latent class.
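
How such a row of class probabilities might arise can be sketched with Bayes' rule over the observed features. The code below is an illustration under stated assumptions: predict_sketch is a hypothetical helper, the lambda and category_prob values are made up, and NA features are simply skipped in the spirit of na_handle="ignore". It is not the package's predict method.

```r
# Hypothetical fitted quantities for 2 latent classes and 2 features
lambda <- c(0.6, 0.4)
category_prob <- list(
  odor   = list(c(none = 0.9, foul = 0.1),
                c(none = 0.2, foul = 0.8)),
  edible = list(c(edible = 0.8, poisonous = 0.2),
                c(edible = 0.1, poisonous = 0.9))
)

# Illustrative helper: posterior class probabilities for one observation.
# Features whose value is NA do not contribute to the product.
predict_sketch <- function(obs, lambda, category_prob) {
  score <- lambda
  for (feat in names(obs)) {
    if (is.na(obs[[feat]])) next
    lik <- vapply(category_prob[[feat]],
                  function(p) p[[obs[[feat]]]], numeric(1))
    score <- score * lik
  }
  score / sum(score)
}

# Only the odor is known; edible is missing and therefore ignored
post <- predict_sketch(c(odor = "foul", edible = NA), lambda, category_prob)
print(post)
```

Each additional observed feature multiplies in another likelihood term, so observations with few known features stay close to the prior class probabilities lambda.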

Examples

  data("mushroom")
  X <- as.matrix(mushroom)[1:30, ]
  res <- mixdir(X)

  # Predict Class
  predict(res, mushroom[40:45, ])
  predict(res, c(`gill-color`="black"))
