Movatterモバイル変換

Type:

Package

Title:

Clustering Big Data using Expectation Maximization Star (EM*)Algorithm

Version:

2.0.5

Maintainer:

Sharma Parichit <parishar@iu.edu>

Description:

Implements the Improved Expectation Maximisation EM* and the traditional EM algorithm for clustering big data (gaussian mixture models for both multivariate and univariate datasets). This version implements the faster alternative-EM* that expedites convergence via structure based data segregation. The implementation supports both random and K-means++ based initialization. Reference: Parichit Sharma, Hasan Kurban, Mehmet Dalkilic (2022) <doi:10.1016/j.softx.2021.100944>. Hasan Kurban, Mark Jenne, Mehmet Dalkilic (2016) <doi:10.1007/s41060-017-0062-1>.

License:

GPL-3

Encoding:

UTF-8

LazyData:

true

Imports:

mvtnorm (≥ 1.0.7), matrixcalc (≥ 1.0.3), MASS (≥ 7.3.49),Rcpp (≥ 1.0.2)

LinkingTo:

Rcpp

RoxygenNote:

7.1.2

Depends:

R(≥ 3.2.0)

URL:

https://github.com/parichit/DCEM

BugReports:

https://github.com/parichit/DCEM/issues

Suggests:

knitr, rmarkdown

VignetteBuilder:

knitr

NeedsCompilation:

yes

Packaged:

2022-01-15 22:53:56 UTC; schmuck

Author:

Sharma Parichit [aut, cre, ctb], Kurban Hasan [aut, ctb], Dalkilic Mehmet [aut]

Repository:

CRAN

Date/Publication:

2022-01-16 00:02:52 UTC

DCEM: Clustering Big Data using Expectation Maximization Star (EM*) Algorithm.

Description

Implements the EM* and EM algorithmfor clustering the (univariate and multivariate) Gaussian mixture data.

Demonstration and Testing

Cleaning the data:The data should be cleaned (redundant columns should be removed). For examplecolumns containing the labels or redundant entries (such as a column ofall 0's or 1's). Seetrim_data for details oncleaning the data. Refer:dcem_test for more details.

Understanding the output of`dcem_test`

The function dcem_test() returns a list of objects.This list contains the parameters associated with the Gaussian(s),posterior probabilities (prob), mean (meu), co-variance/standard-deviation(sigma),priors (prior) and cluster membership for data (membership).

Note: The routine dcem_test() is only for demonstration purpose.The functiondcem_test calls the main routinedcem_train. Seedcem_train for further details.

How to run on your dataset

Seedcem_train anddcem_star_train for examples.

Package organization

The package is organized as a set of preprocessing functions and the coreclustering modules. These functions are briefly described below.

trim_data: This is used to remove the columnsfrom the dataset. The user should clean the dataset beforecalling the dcem_train routine.User can also clean the dataset themselves(without using trim_data) and then pass it to the dcem_train function
dcem_star_train anddcem_train: These are the primaryinterface to the EM* and EM algorithms respectively. These function accept the cleaned dataset and otherparameters (number of iterations, convergence threshold etc.) and run the algorithm until:
1. The number of iterations is reached.
2. The convergence is achieved.

DCEM supports following initialization schemes

Random Initialization: Initializes the mean randomly.Refermeu_uv andmeu_mv for initializationon univariate and multivariate data respectively.
Improved Initialization: Based on the Kmeans++ idea published in,K-means++: The Advantages of Careful Seeding, David Arthur and Sergei Vassilvitskii.URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf. Seemeu_uv_impr andmeu_mv_impr for details.
Choice of initialization scheme can be specified as theseedingparameter during the training. Seedcem_train for further details.

References

Parichit Sharma, Hasan Kurban, Mehmet Dalkilic DCEM: An R package for clustering big data viadata-centric modification of Expectation Maximization, SoftwareX, 17, 100944 URLhttps://doi.org/10.1016/j.softx.2021.100944

External Packages: DCEM requires R packages 'mvtnorm'[1], 'matrixcalc'[2]'RCPP'[3] and 'MASS'[4] for multivariate density calculation,checking matrix singularity, compiling routines written in C andsimulating mixture of gaussians, respectively.

[1] Alan Genz, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl,Torsten Hothorn (2019). mvtnorm: Multivariate Normal and t Distributions.R package version 1.0-7. URL http://CRAN.R-project.org/package=mvtnorm

[2] Frederick Novomestky (2012). matrixcalc: Collection of functions for matrix calculations. Rpackage version 1.0-3. https://CRAN.R-project.org/package=matrixcalc

[3] Dirk Eddelbuettel and Romain Francois (2011). Rcpp: Seamless R and C++ Integration. Journal ofStatistical Software, 40(8), 1-18. URL http://www.jstatsoft.org/v40/i08/.

[4] Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition.Springer, New York. ISBN 0-387-95457-0

[5] K-Means++: The Advantages of Careful Seeding, David Arthur and Sergei Vassilvitskii.URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf

build_heap: Part of DCEM package.

Description

Implements the creation of heap. Internally called by thedcem_star_train.

Usage

build_heap(data)

Arguments

data

(NumericMatrix): The dataset provided by the user.

Value

A NumericMatrix with the max heap property.

Author(s)

Parichit Sharmaparishar@iu.edu, Hasan Kurban, Mehmet Dalkilic

dcem_cluster (multivariate data): Part of DCEM package.

Description

Implements the Expectation Maximization algorithm for multivariate data. This function is calledby the dcem_train routine.

Usage

dcem_cluster_mv(data, meu, sigma, prior, num_clusters, iteration_count,threshold, num_data)

Arguments

data

A matrix: The dataset provided by the user.

meu

(matrix): The matrix containing the initial meu(s).

sigma

(list): A list containing the initial covariance matrices.

prior

(vector): A vector containing the initial prior.

num_clusters

(numeric): The number of clusters specified by the user. Default value is 2.

iteration_count

(numeric): The number of iterations for which the algorithm should run, if theconvergence is not achieved then the algorithm stops. Default: 200.

threshold

(numeric): A small value to check for convergence (if the estimated meu are within thisspecified threshold then the algorithm stops and exit).

Note: Choosing a very small value (0.0000001) for threshold can increase the runtime substantiallyand the algorithm may not converge. On the other hand, choosing a larger value (0.1)can lead to sub-optimal clustering. Default: 0.00001.

num_data

(numeric): The total number of observations in the data.

Value

A list of objects. This list contains parameters associated with theGaussian(s) (posterior probabilities, meu, co-variance and prior)

(1) Posterior Probabilities:prob :A matrix ofposterior-probabilities.
(2) Meu:meu: It is a matrix of meu(s). Each row inthe matrix corresponds to one meu.
(3) Sigma: Co-variance matrices:sigma
(4) prior:prior: A vector of prior.
(5) Membership:membership: A vector ofcluster membership for data.

References

dcem_cluster_uv (univariate data): Part of DCEM package.

Description

Implements the Expectation Maximization algorithm for the univariate data. This function is internallycalled by the dcem_train routine.

Usage

dcem_cluster_uv(data, meu, sigma, prior, num_clusters, iteration_count,threshold, num_data, numcols)

Arguments

data

(matrix): The dataset provided by the user (converted to matrix format).

meu

(vector): The vector containing the initial meu.

sigma

(vector): The vector containing the initial standard deviation.

prior

(vector): The vector containing the initial prior.

num_clusters

(numeric): The number of clusters specified by the user. Default is 2.

iteration_count

(numeric): The number of iterations for which the algorithm should run. If theconvergence is not achieved then the algorithm stops.Default: 200.

threshold

(numeric): A small value to check for convergence (if the estimated meu(s)are within the threshold then the algorithm stops).

Note: Choosing a very small value (0.0000001) for threshold can increase the runtimesubstantially and the algorithm may not converge. On the other hand, choosing a largervalue (0.1) can lead to sub-optimal clustering. Default: 0.00001.

num_data

(numeric): The total number of observations in the data.

numcols

(numeric): Number of columns in the dataset (After processing themissing values).

Value

A list of objects. This list contains parameters associated with theGaussian(s) (posterior probabilities, meu, standard-deviation and prior)

(1) Posterior Probabilities:prob: A matrix ofposterior-probabilities.
(2) Meu(s):meu: It is a vector ofmeu. Each element of the vector corresponds to one meu.
(3) Sigma: Standard-deviation(s):sigma: A vector of standarddeviation.
(4) prior:prior: A vector of prior.
(5) Membership:membership: A vector ofcluster membership for data.

References

dcem_predict: Part of DCEM package.

Description

Predict the cluster membership of test data based on the learned parameters i.e, output fromdcem_train ordcem_star_train.

Usage

dcem_predict(param_list, data)

Arguments

param_list

(list): List of distribution parameters. The list contains the learnedparameteres of the distribution.

data

(vector or dataframe): A vector of data for univariate data. A dataframe (rows representthe data and columns represent the features) for multivariate data.

Value

A list containing the cluster membership for the test data.

References

Examples

# Simulating a mixture of univariate samples from three distributions# with meu as 20, 70 and 100 and standard deviation as 10, 100 and 40 respectively.sample_uv_data = as.data.frame(c(rnorm(100, 20, 5), rnorm(70, 70, 1), rnorm(50, 100, 2)))# Select first few points from each distribution as test datatest_data = as.vector(sample_uv_data[c(1:5, 101:105, 171:175),])# Remove the test data from the training setsample_uv_data = as.data.frame(sample_uv_data[-c(1:5, 101:105, 171:175), ])# Randomly shuffle the samples.sample_uv_data = as.data.frame(sample_uv_data[sample(nrow(sample_uv_data)),])# Calling the dcem_train() function on the simulated data with threshold of# 0.000001, iteration count of 1000 and random seeding respectively.sample_uv_out = dcem_train(sample_uv_data, num_clusters = 3, iteration_count = 100,threshold = 0.001)# Predict the membership for test datatest_data_membership <- dcem_predict(sample_uv_out, test_data)# Access the outputprint(test_data_membership)

dcem_star_cluster_mv (multivariate data): Part of DCEM package.

Description

Implements the EM* algorithm for multivariate data. This function is calledby the dcem_star_train routine.

Usage

dcem_star_cluster_mv(data, meu, sigma, prior, num_clusters, iteration_count, num_data)

Arguments

data

(matrix): The dataset provided by the user.

meu

(matrix): The matrix containing the initial meu(s).

sigma

(list): A list containing the initial covariance matrices.

prior

(vector): A vector containing the initial priors.

num_clusters

(numeric): The number of clusters specified by the user. Default value is 2.

iteration_count

(numeric): The number of iterations for which the algorithm should run, if theconvergence is not achieved then the algorithm stops and exits. Default: 200.

num_data

(numeric): Number of rows in the dataset.

Value

A list of objects. This list contains parameters associated with theGaussian(s) (posterior probabilities, meu, co-variance and priors)

(1) Posterior Probabilities:probA matrix of posterior-probabilities for the points in the dataset.
(2) Meu:meu: A matrix of meu(s). Each row inthe matrix corresponds to one meu.
(3) Sigma: Co-variance matrices:sigma: List of co-variancematrices.
(4) Priors:prior: A vector of prior.
(5) Membership:membership: A vector of cluster membership for data.

References

dcem_star_cluster_uv (univariate data): Part of DCEM package.

Description

Implements the EM* algorithm for the univariate data. This function is called by thedcem_star_train routine.

Usage

dcem_star_cluster_uv(data, meu, sigma, prior, num_clusters, num_data,iteration_count)

Arguments

data

(matrix): The dataset provided by the user.

meu

(vector): The vector containing the initial meu.

sigma

(vector): The vector containing the initial standard deviation.

prior

(vector): The vector containing the initial priors.

num_clusters

(numeric): The number of clusters specified by the user. Default is 2.

num_data

(numeric): number of rows in the dataset (After processing the missing values).

iteration_count

(numeric): The number of iterations for which the algorithm should run.If the convergence is not achieved then the algorithm stops. Default is 100.

Value

A list of objects. This list contains parameters associated with theGaussian(s) (posterior probabilities, meu, standard-deviation and priors)

(1) Posterior Probabilities:probA matrix of posterior-probabilities
(2) Meu:meu: It is a vector of meu. Each element ofthe vector corresponds to one meu.
(3) Sigma: Standard-deviation(s):sigma
For univariate data: Vector of standard deviation.
(4) Priors:prior: A vector of priors.
(5) Membership:membership: A vector of cluster membership for data.

References

dcem_star_train: Part of DCEM package.

Description

Implements the improved EM* ([1], [2]) algorithm. EM* avoids revisiting all but highexpressive data via structure based data segregation thus resulting in significant speed gain.It calls thedcem_star_cluster_uv routine internally (univariate data) anddcem_star_cluster_mv for (multivariate data).

Usage

dcem_star_train(data, iteration_count,  num_clusters, seed_meu, seeding)

Arguments

data

(dataframe): The dataframe containing the data. Seetrim_data forcleaning the data.

iteration_count

(numeric): The number of iterations for which the algorithm should run, if theconvergence is not achieved then the algorithm stops and exit.Default: 200.

num_clusters

(numeric): The number of clusters. Default:2

seed_meu

(matrix): The user specified set of meu to use as initial centroids. Default:None

seeding

(string): The initialization scheme ('rand', 'improved'). Default:rand

Value

A list of objects. This list contains parameters associated with the Gaussian(s)(posterior probabilities, meu, sigma and priors). Theparameters can be accessed as follows where sample_out is the list containingthe output:

(1) Posterior Probabilities:sample_out$probA matrix of posterior-probabilities.
(2) Meu(s):sample_out$meu
For multivariate data: It is a matrix of meu(s). Each row inthe matrix corresponds to one mean.
For univariate data: It is a vector of meu(s). Each element of the vectorcorresponds to one meu.
(3) Co-variance matrices:sample_out$sigma
For multivariate data: List of co-variance matrices.
Standard-deviation:sample_out$sigma
For univariate data: Vector of standard deviation.
(4) Priors:sample_out$priorA vector of priors.
(5) Membership:sample_out$membership: A dataframe ofcluster membership for data. Columns numbers are data indices and valuesare the assigned clusters.

References

Examples

# Simulating a mixture of univariate samples from three distributions# with mean as 20, 70 and 100 and standard deviation as 10, 100 and 40 respectively.sample_uv_data = as.data.frame(c(rnorm(100, 20, 5), rnorm(70, 70, 1), rnorm(50, 100, 2)))# Randomly shuffle the samples.sample_uv_data = as.data.frame(sample_uv_data[sample(nrow(sample_uv_data)),])# Calling the dcem_star_train() function on the simulated data with iteration count of 1000# and random seeding respectively.sample_uv_out = dcem_star_train(sample_uv_data, num_clusters = 3, iteration_count = 100)# Simulating a mixture of multivariate samples from 2 gaussian distributions.sample_mv_data = as.data.frame(rbind(MASS::mvrnorm(n=2, rep(2,5), Sigma = diag(5)),MASS::mvrnorm(n=5, rep(14,5), Sigma = diag(5))))# Calling the dcem_star_train() function on the simulated data with iteration count of 100 and# random seeding method respectively.sample_mv_out = dcem_star_train(sample_mv_data, iteration_count = 100, num_clusters=2)# Access the outputsample_mv_out$meusample_mv_out$sigmasample_mv_out$priorsample_mv_out$probprint(sample_mv_out$membership)

dcem_test: Part of DCEM package.

Description

For demonstrating the execution on the bundled dataset.

Usage

dcem_test()

Details

The dcem_test performs the following steps in order:

Read the data from the disk (from the file data/ionosphere_data.csv). The data folder is under thepackage installation folder.
The dataset details can be see by typingionosphere_data inR-console or athttp://archive.ics.uci.edu/ml/datasets/Ionosphere.
Clean the data (by removing the columns).The data should be cleanedbefore use. Refertrim_data to see what columnsshould be removed and how. The package provides the basic interface for removingcolumns.
Call thedcem_star_train on the cleaned data.

Accessing the output parameters

The function dcem_test() calls thedcem_star_train.It returns a list of objects as output. This list contains estimatedparameters of the Gaussian (posterior probabilities, meu, sigma and prior). Theparameters can be accessed as follows where sample_out is the list containingthe output:

(1) Posterior Probabilities:sample_out$probA matrix of posterior-probabilities
(2) Meu:meu
For multivariate data: It is a matrix of meu(s). Each row inthe matrix corresponds to one meu.
(3) Co-variance matrices:sample_out$sigma
For multivariate data: List of co-variance matrices for the Gaussian(s).
Standard-deviation:sample_out$sigma
For univariate data: Vector of standard deviation for the Gaussian(s))
(4) Priors:sample_out$priorA vector of prior.
(5) Membership:sample_out$membership: A dataframe ofcluster membership for data. Columns numbers are data indices and valuesare the assigned clusters.

References

dcem_train: Part of DCEM package.

Description

Implements the EM algorithm. It calls the relevant clustering routine internallydcem_cluster_uv (univariate data) anddcem_cluster_mv (multivariate data).

Usage

dcem_train(data, threshold, iteration_count,  num_clusters, seed_meu, seeding)

Arguments

data

(dataframe): The dataframe containing the data. Seetrim_data forcleaning the data.

threshold

(decimal): A value to check for convergence (if the meu are within thisvalue then the algorithm stops and exit).Default: 0.00001.

iteration_count

(numeric): The number of iterations for which the algorithm should run, if theconvergence is not achieved within the specified count then the algorithm stops and exit.Default: 200.

num_clusters

(numeric): The number of clusters. Default:2

seed_meu

(matrix): The user specified set of meu to use as initial centroids. Default:None

seeding

(string): The initialization scheme ('rand', 'improved'). Default:rand

Value

(1) Posterior Probabilities:sample_out$prob: A matrix ofposterior-probabilities
(2) Meu:sample_out$meu
For multivariate data: It is a matrix of meu(s). Each row inthe matrix corresponds to one meu.
For univariate data: It is a vector of meu(s). Each element of the vectorcorresponds to one meu.
(3) Sigma:sample_out$sigma
For multivariate data: List of co-variance matrices for the Gaussian(s).
For univariate data: Vector of standard deviation for the Gaussian(s).
(4) Priors:sample_out$prior: A vector of priors.
(5) Membership:sample_out$membership: A dataframe ofcluster membership for data. Columns numbers are data indices and valuesare the assigned clusters.

References

Examples

# Simulating a mixture of univariate samples from three distributions# with meu as 20, 70 and 100 and standard deviation as 10, 100 and 40 respectively.sample_uv_data = as.data.frame(c(rnorm(100, 20, 5), rnorm(70, 70, 1), rnorm(50, 100, 2)))# Randomly shuffle the samples.sample_uv_data = as.data.frame(sample_uv_data[sample(nrow(sample_uv_data)),])# Calling the dcem_train() function on the simulated data with threshold of# 0.000001, iteration count of 1000 and random seeding respectively.sample_uv_out = dcem_train(sample_uv_data, num_clusters = 3, iteration_count = 100,threshold = 0.001)# Simulating a mixture of multivariate samples from 2 gaussian distributions.sample_mv_data = as.data.frame(rbind(MASS::mvrnorm(n=100, rep(2,5), Sigma = diag(5)),MASS::mvrnorm(n=50, rep(14,5), Sigma = diag(5))))# Calling the dcem_train() function on the simulated data with threshold of# 0.00001, iteration count of 100 and random seeding method respectively.sample_mv_out = dcem_train(sample_mv_data, threshold = 0.001, iteration_count = 100)# Access the outputprint(sample_mv_out$meu)print(sample_mv_out$sigma)print(sample_mv_out$prior)print(sample_mv_out$prob)print(sample_mv_out$membership)

expectation_mv: Part of DCEM package.

Description

Calculates the probabilistic weights for the multivariate data.

Usage

expectation_mv(data, weights, meu, sigma, prior, num_clusters, tolerance)

Arguments

data

(matrix): The input data.

weights

(matrix): The probability weight matrix.

meu

(matrix): The matrix of meu.

sigma

(list): The list of sigma (co-variance matrices).

prior

(vector): The vector of priors.

num_clusters

(numeric): The number of clusters.

tolerance

(numeric): The system epsilon value.

Value

Updated probability weight matrix.

expectation_uv: Part of DCEM package.

Description

Calculates the probabilistic weights for the univariate data.

Usage

expectation_uv(data, weights, meu, sigma, prior, num_clusters, tolerance)

Arguments

data

(matrix): The input data.

weights

(matrix): The probability weight matrix.

meu

(vector): The vector of meu.

sigma

(vector): The vector of sigma (standard-deviations).

prior

(vector): The vector of priors.

num_clusters

(numeric): The number of clusters.

tolerance

(numeric): The system epsilon value.

Value

Updated probability weight matrix.

get_priors: Part of DCEM package.

Description

Initialize the priors.

Usage

get_priors(num_priors)

Arguments

num_priors

(numeric): Number of priors one corresponding to each cluster.

Details

For example, if the user specify 2 priors then the vector will have 2entries (one for each cluster) where each will be 1/2 or 0.5.

Value

A vector of uniformly initialized prior values (numeric).

insert_nodes: Part of DCEM package.

Description

Implements the node insertion into the heaps.

Usage

insert_nodes(heap_list, heap_assn, data_probs, leaves_ind, num_clusters)

Arguments

heap_list

(list): The nested list containing the heaps. Each entry in thelist is a list maintained in max-heap structure.

heap_assn

(numeric): The vector representing the heap assignments.

data_probs

(string): A vector containing the probability for data.

leaves_ind

(numeric): A vector containing the indices of leaves in heap.

num_clusters

(numeric): The number of clusters. Default:2

Value

A nested list. Each entry in the list is a list maintainedin the max-heap structure.

References

Ionosphere data: A dataset of 351 radar readings

Description

This dataset contains 351 entries (radar readings from a system in goose bay laboratory) and 35 columns.The 35th columns is the label columns identifying the entry as either good or bad. Additionally, the 2nd columnonly contains 0's.

Usage

ionosphere_data

Format

A file with 351 rows and 35 columns of multivariate data in a csv file. All values are numeric.

Source

Space Physics GroupApplied Physics LaboratoryJohns Hopkins UniversityJohns Hopkins RoadLaurel, MD 20723Web URL:http://archive.ics.uci.edu/ml/datasets/Ionosphere

References:Sigillito, V. G., Wing, S. P., Hutton, L. V., & Baker, K. B. (1989).Classification of radar returns from the ionosphere using neural networks.Johns Hopkins APL Technical Digest, 10, 262-266.

max_heapify: Part of DCEM package.

Description

Implements the creation of max heap. Internally called by thedcem_star_train.

Usage

max_heapify(data, index, num_data)

Arguments

data

(NumericMatrix): The dataset provided by the user.

index

(int): The index of the data point.

num_data

(numeric): The total number of observations in the data.

Value

A NumericMatrix with the max heap property.

Author(s)

Parichit Sharmaparishar@iu.edu, Hasan Kurban, Mehmet Dalkilic

maximisation_mv: Part of DCEM package.

Description

Calculates meu, sigma and prior based on the updated probability weight matrix.

Usage

maximisation_mv(data, weights, meu, sigma, prior, num_clusters, num_data)

Arguments

data

(matrix): The input data.

weights

(matrix): The probability weight matrix.

meu

(matrix): The matrix of meu.

sigma

(list): The list of sigma (co-variance matrices).

prior

(vector): The vector of priors.

num_clusters

(numeric): The number of clusters.

num_data

(numeric): The total number of observations in the data.

Value

Updated values for meu, sigma and prior.

maximisation_uv: Part of DCEM package.

Description

Calculates meu, sigma and prior based on the updated probability weight matrix.

Usage

maximisation_uv(data, weights, meu, sigma, prior, num_clusters, num_data)

Arguments

data

(matrix): The input data.

weights

(matrix): The probability weight matrix.

meu

(vector): The vector of meu.

sigma

(vector): The vector of sigma (standard-deviations).

prior

(vector): The vector of priors.

num_clusters

(numeric): The number of clusters.

num_data

(numeric): The total number of observations in the data.

Value

Updated values for meu, sigma and prior.

meu_mv: Part of DCEM package.

Description

Initialize the meus(s) by randomly selecting the samples from the dataset. This is thedefault method for initializing the meu(s).

Usage

# Randomly seeding the mean(s).meu_mv(data, num_meu)

Arguments

data

(matrix): The dataset provided by the user.

num_meu

(numeric): The number of meu.

Value

A matrix containing the selected samples from the dataset.

meu_mv_impr: Part of DCEM package.

Description

Initialize the meu(s) by randomly selecting the samples from the dataset. It uses the proposedimplementation from K-means++: The Advantages of Careful Seeding, David Arthur and SergeiVassilvitskii. URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf.

Usage

# Randomly seeding the meu.meu_mv_impr(data, num_meu)

Arguments

data

(matrix): The dataset provided by the user.

num_meu

(numeric): The number of meu.

Value

A matrix containing the selected samples from the dataset.

meu_uv: Part of DCEM package.

Description

This function is internally called by the dcem_train to initialize themeu(s). It randomly selects the meu(s) from therange min(data):max(data).

Usage

# Randomly seeding the meu.meu_uv(data, num_meu)

Arguments

data

(matrix): The dataset provided by the user.

num_meu

(number): The number of meu.

Value

A vector containing the selected samples from the dataset.

meu_uv_impr: Part of DCEM package.

Description

This function is internally called by the dcem_train to initialize themeu(s). It uses the proposed implementation fromK-means++: The Advantages of Careful Seeding, David Arthur and Sergei Vassilvitskii.URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf.

Usage

# Seeding the meu using the K-means++ implementation.meu_uv_impr(data, num_meu)

Arguments

data

(matrix): The dataset provided by the user.

num_meu

(number): The number of meu.

Value

A vector containing the selected samples from the dataset.

separate_data: Part of DCEM package.

Description

Separate leaf nodes from the heaps.

Usage

separate_data(heap_list, num_clusters)

Arguments

heap_list

(list): The nested list containing the heaps. Each entry in thelist is a list maintained in max-heap structure.

num_clusters

(numeric): The number of clusters. Default:2

Value

A nested list where,

First entry is the list of heaps with leaves removed.

Second entry is the list of leaves.

References

sigma_mv: Part of DCEM package.

Description

Initializes the co-variance matrices as the identity matrices.

Usage

sigma_mv(num_sigma, numcol)

Arguments

num_sigma

(numeric): Number of covariance matrices.

numcol

(numeric): The number of columns in the dataset.

Value

A list of identity matrices. The number of entries in the listis equal to the input parameter (num_cov).

sigma_uv: Part of DCEM package.

Description

Initializes the standard deviation for the Gaussian(s).

Usage

sigma_uv(data, num_sigma)

Arguments

data

(matrix): The dataset provided by the user.

num_sigma

(number): Number of sigma (standard_deviations).

Value

A vector of standard deviation value(s).

trim_data: Part of DCEM package. Used internally in the package.

Description

Removes the specified column(s) from the dataset.

Usage

trim_data(columns, data)

Arguments

columns

(string): A comma separatedlist of column(s) that needs to be removed from the dataset.Default: ”

data

(dataframe): Dataframe containing the input data.

Value

A dataframe with the specified column(s) removed from it.

update_weights: Part of DCEM package.

Description

Update the probability values for specific data points that change between the heaps.

Usage

update_weights(temp_weights, weights, index_list, num_clusters)

Arguments

temp_weights

(matrix): A matrix of probabilistic weights for leaf data.

weights

(matrix): A matrix of probabilistic weights for all data.

index_list

(vector): A vector of indices.

num_clusters

(numeric): The number of clusters.

Value

Updated probabilistic weights matrix.

validate_data: Part of DCEM package. Used internally in the package.

Description

Implements sanity check for the input data. This function is for internal use and is calledby thedcem_train.

Usage

validate_data(columns, numcols)

Arguments

columns

(string): A comma separatedlist of columns that needs to be removed from the dataset. Default: ”

numcols

(numeric): Number of columns in the dataset.

Details

An example would be to check if the column to be removed existor not?trim_data internally calls this function before removingthe column(s).

Value

boolean: TRUE if the columns exists otherwise FALSE.

Movatterモバイル変換

DCEM: Clustering Big Data using Expectation Maximization Star (EM*) Algorithm.

Description

Demonstration and Testing

Understanding the output ofdcem_test

How to run on your dataset

Package organization

DCEM supports following initialization schemes

References

build_heap: Part of DCEM package.

Description

Usage

Arguments

Value

Author(s)

dcem_cluster (multivariate data): Part of DCEM package.

Description

Usage

Arguments

Value

References

dcem_cluster_uv (univariate data): Part of DCEM package.

Description

Usage

Arguments

Value

References

dcem_predict: Part of DCEM package.

Description

Usage

Arguments

Value

References

Examples

dcem_star_cluster_mv (multivariate data): Part of DCEM package.

Description

Usage

Arguments

Value

References

dcem_star_cluster_uv (univariate data): Part of DCEM package.

Description

Usage

Arguments

Value

References

dcem_star_train: Part of DCEM package.

Description

Usage

Arguments

Value

References

Examples

dcem_test: Part of DCEM package.

Description

Usage

Details

Accessing the output parameters

References

dcem_train: Part of DCEM package.

Description

Usage

Arguments

Value

References

Examples

expectation_mv: Part of DCEM package.

Description

Usage

Arguments

Value

expectation_uv: Part of DCEM package.

Description

Usage

Arguments

Value

get_priors: Part of DCEM package.

Description

Usage

Arguments

Understanding the output of`dcem_test`