mixdir

The goal of mixdir is to cluster high dimensional categorical datasets.

It can

A detailed description of the algorithm and the features of the package can be found in the accompanying paper. If you find the package useful, please cite:

C. Ahlmann-Eltze and C. Yau, “MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data”, 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 2018, pp. 526-539.

Installation

    install.packages("mixdir")

    # Or to get the latest version from GitHub
    devtools::install_github("const-ae/mixdir")

Example

Clustering the mushroom data set.

    # Loading the library and the data
    library(mixdir)
    set.seed(1)

    data("mushroom")
    # High dimensional dataset: 8124 mushrooms and 23 different features
    mushroom[1:10, 1:5]
    #>    bruises cap-color cap-shape cap-surface    edible
    #> 1  bruises     brown    convex      smooth poisonous
    #> 2  bruises    yellow    convex      smooth    edible
    #> 3  bruises     white      bell      smooth    edible
    #> 4  bruises     white    convex       scaly poisonous
    #> 5       no      gray    convex      smooth    edible
    #> 6  bruises    yellow    convex       scaly    edible
    #> 7  bruises     white      bell      smooth    edible
    #> 8  bruises     white      bell       scaly    edible
    #> 9  bruises     white    convex       scaly poisonous
    #> 10 bruises    yellow      bell      smooth    edible

Calling the clustering function mixdir on a subset of the data:

    # Clustering into 3 latent classes
    result <- mixdir(mushroom[1:1000, 1:5], n_latent = 3)

Analyzing the result

    # Latent class of the first 10 mushrooms
    head(result$pred_class, n = 10)
    #>  [1] 3 1 1 3 2 1 1 1 3 1

    # Soft clustering for the first 10 mushrooms
    head(result$class_prob, n = 10)
    #>               [,1]         [,2]         [,3]
    #>  [1,] 3.103495e-07 1.055098e-05 9.999891e-01
    #>  [2,] 9.998594e-01 4.683764e-06 1.359291e-04
    #>  [3,] 9.998944e-01 3.111462e-06 1.025194e-04
    #>  [4,] 5.778033e-04 7.114603e-08 9.994221e-01
    #>  [5,] 3.662625e-07 9.999992e-01 4.183025e-07
    #>  [6,] 9.996461e-01 8.764031e-08 3.537838e-04
    #>  [7,] 9.998944e-01 3.111462e-06 1.025194e-04
    #>  [8,] 9.997331e-01 5.822320e-08 2.668420e-04
    #>  [9,] 5.778033e-04 7.114603e-08 9.994221e-01
    #> [10,] 9.999999e-01 5.850067e-09 9.845112e-08

    pheatmap::pheatmap(result$class_prob, cluster_cols = FALSE,
                       labels_col = paste("Class", 1:3))
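In the output above, each hard assignment in `result$pred_class` matches the column with the largest probability in the corresponding row of `result$class_prob`. The sketch below reproduces this for the first five mushrooms using base R only. The numbers are copied from the printed output; that `pred_class` is always the row-wise argmax is an assumption based on that output, not on the package source.

```r
# First five rows of result$class_prob, copied from the printed output.
class_prob <- rbind(
  c(3.103495e-07, 1.055098e-05, 9.999891e-01),
  c(9.998594e-01, 4.683764e-06, 1.359291e-04),
  c(9.998944e-01, 3.111462e-06, 1.025194e-04),
  c(5.778033e-04, 7.114603e-08, 9.994221e-01),
  c(3.662625e-07, 9.999992e-01, 4.183025e-07)
)

# Hard class = index of the largest probability in each row.
hard_class <- max.col(class_prob, ties.method = "first")
hard_class
#> [1] 3 1 1 3 2

# The largest probability per row shows how confident each assignment is.
confidence <- apply(class_prob, 1, max)
```

Rows with a `confidence` well below 1 would be the observations where the soft clustering carries information that the hard labels discard.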

    # Structure of latent class 1
    # (bruises, cap color either yellow or white, edible etc.)
    purrr::map(result$category_prob, 1)
    #> $bruises
    #>      bruises           no
    #> 0.9998223256 0.0001776744
    #>
    #> $`cap-color`
    #>        brown         gray          red        white       yellow
    #> 0.0001775934 0.0001819672 0.0001776373 0.4079822666 0.5914805356
    #>
    #> $`cap-shape`
    #>      bell    convex      flat    sunken
    #> 0.3926736 0.4767291 0.1304197 0.0001776
    #>
    #> $`cap-surface`
    #>   fibrous     scaly    smooth
    #> 0.0568571 0.4871396 0.4560033
    #>
    #> $edible
    #>       edible    poisonous
    #> 0.9998223174 0.0001776826

    # The most predictive features for each class
    find_predictive_features(result, top_n = 3)
    #>       column    answer class probability
    #> 19 cap-color    yellow     1   0.9993990
    #> 22 cap-shape      bell     1   0.9990947
    #> 1    bruises   bruises     1   0.7089533
    #> 48    edible poisonous     3   0.9980468
    #> 15 cap-color       red     3   0.8462032
    #> 9  cap-color     brown     3   0.6473043
    #> 5    bruises        no     2   0.9990364
    #> 11 cap-color      gray     2   0.9978218
    #> 32 cap-shape    sunken     2   0.9936162

    # For example: if all I know about a mushroom is that it has a
    # yellow cap, then I am 99% certain that it will be in class 1
    predict(result, c(`cap-color` = "yellow"))
    #>          [,1]         [,2]         [,3]
    #> [1,] 0.999399 0.0003004692 0.0003004907

    # Note the most predictive features are different from the most typical ones
    find_typical_features(result, top_n = 3)
    #>         column  answer class probability
    #> 1      bruises bruises     1   0.9998223
    #> 43      edible  edible     1   0.9998223
    #> 19   cap-color  yellow     1   0.5914805
    #> 3      bruises bruises     3   0.9995546
    #> 27   cap-shape  convex     3   0.7460615
    #> 9    cap-color   brown     3   0.6746224
    #> 44      edible  edible     2   0.9995310
    #> 5      bruises      no     2   0.9713177
    #> 35 cap-surface fibrous     2   0.7355413
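The typical and predictive rankings can disagree because they answer different questions. Judging from the printed values, a typical feature scores P(answer | class) (0.59 for a yellow cap in class 1), while a predictive feature scores P(class | answer) (0.999 for class 1 given a yellow cap). The base R sketch below shows how Bayes' rule connects the two. All numbers are hypothetical toy values, not taken from the mushroom fit, and the code is an illustration of the idea rather than the package's implementation.

```r
# Hypothetical toy values (not from the mushroom data).
class_weight <- c(class1 = 0.5, class2 = 0.5)   # P(class)

# P(answer | class) for one feature: rows are classes, columns are answers.
answer_given_class <- rbind(
  class1 = c(common = 0.90, rare = 0.10),
  class2 = c(common = 0.99, rare = 0.01)
)

# Joint P(class, answer): each row is scaled by its class weight.
joint <- answer_given_class * class_weight

# Bayes' rule: normalise every column to obtain P(class | answer).
class_given_answer <- sweep(joint, 2, colSums(joint), "/")

# Most typical answer for class 1: the one it produces most often.
typical <- names(which.max(answer_given_class["class1", ]))     # "common"

# Most predictive answer for class 1: the one that points to it most strongly.
predictive <- names(which.max(class_given_answer["class1", ]))  # "rare"
```

Here "common" is typical for class 1 but is even more frequent in class 2, so observing it says little about the class. "rare" is unusual in class 1, yet ten times rarer in class 2, so observing it gives P(class1 | rare) of about 0.91.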

Dimensionality Reduction

    # Defining Features
    def_feat <- find_defining_features(result, mushroom[1:1000, 1:5], n_features = 3)
    print(def_feat)
    #> $features
    #> [1] "cap-color" "bruises"   "edible"
    #>
    #> $quality
    #> [1] 74.35146

    # Plotting the most important features gives an immediate impression
    # of how the clusters differ
    plot_features(def_feat$features, result$category_prob)
    #> Loading required namespace: ggplot2
    #> Loading required namespace: tidyr

Underlying Model

The package implements a variational inference algorithm to solve a Bayesian latent class model (LCM).
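For intuition, a latent class model assumes that every observation belongs to one hidden class and that, given the class, its categorical features are drawn independently. The base R sketch below simulates data from such a model. It covers only the generative structure: the priors and the variational inference step are left out, and every name, size, and probability in it is an illustrative assumption rather than part of mixdir.

```r
# Minimal sketch of the generative process of a latent class model.
# All values are illustrative assumptions.
set.seed(42)
n_obs <- 200
class_weight <- c(0.6, 0.4)   # P(class) for two hidden classes

# One matrix per feature: rows are classes, columns are categories,
# and each row sums to 1.
feature_prob <- list(
  colour = rbind(c(red = 0.8, green = 0.1, blue = 0.1),
                 c(red = 0.1, green = 0.2, blue = 0.7)),
  shape  = rbind(c(round = 0.9, flat = 0.1),
                 c(round = 0.3, flat = 0.7))
)

# Step 1: draw the hidden class of every observation.
latent_class <- sample(seq_along(class_weight), n_obs,
                       replace = TRUE, prob = class_weight)

# Step 2: given its class, draw each feature independently.
simulated <- as.data.frame(
  lapply(feature_prob, function(p) {
    vapply(latent_class,
           function(k) sample(colnames(p), 1, prob = p[k, ]),
           character(1))
  }),
  stringsAsFactors = TRUE
)

head(simulated)
```

Clustering inverts this process: only `simulated` is observed, and the task is to recover `latent_class` together with the class weights and category probabilities.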

