| Type: | Package |
| Title: | Clustering Big Data using Expectation Maximization Star (EM*) Algorithm |
| Version: | 2.0.5 |
| Maintainer: | Sharma Parichit <parishar@iu.edu> |
| Description: | Implements the Improved Expectation Maximisation EM* and the traditional EM algorithm for clustering big data (gaussian mixture models for both multivariate and univariate datasets). This version implements the faster alternative-EM* that expedites convergence via structure based data segregation. The implementation supports both random and K-means++ based initialization. Reference: Parichit Sharma, Hasan Kurban, Mehmet Dalkilic (2022) <doi:10.1016/j.softx.2021.100944>. Hasan Kurban, Mark Jenne, Mehmet Dalkilic (2016) <doi:10.1007/s41060-017-0062-1>. |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| Imports: | mvtnorm (≥ 1.0.7), matrixcalc (≥ 1.0.3), MASS (≥ 7.3.49), Rcpp (≥ 1.0.2) |
| LinkingTo: | Rcpp |
| RoxygenNote: | 7.1.2 |
| Depends: | R (≥ 3.2.0) |
| URL: | https://github.com/parichit/DCEM |
| BugReports: | https://github.com/parichit/DCEM/issues |
| Suggests: | knitr, rmarkdown |
| VignetteBuilder: | knitr |
| NeedsCompilation: | yes |
| Packaged: | 2022-01-15 22:53:56 UTC; schmuck |
| Author: | Sharma Parichit [aut, cre, ctb], Kurban Hasan [aut, ctb], Dalkilic Mehmet [aut] |
| Repository: | CRAN |
| Date/Publication: | 2022-01-16 00:02:52 UTC |
DCEM: Clustering Big Data using Expectation Maximization Star (EM*) Algorithm.
Description
Implements the EM* and EM algorithms for clustering (univariate and multivariate) Gaussian mixture data.
Demonstration and Testing
Cleaning the data: The data should be cleaned (redundant columns should be removed), for example columns containing the labels or redundant entries (such as a column of all 0's or 1's). See trim_data for details on cleaning the data. Refer to dcem_test for more details.
Understanding the output of dcem_test
The function dcem_test() returns a list of objects. This list contains the parameters associated with the Gaussian(s): posterior probabilities (prob), mean (meu), co-variance/standard-deviation (sigma), priors (prior) and cluster membership for data (membership).
Note: The routine dcem_test() is only for demonstration purposes. The function dcem_test calls the main routine dcem_train. See dcem_train for further details.
How to run on your dataset
See dcem_train and dcem_star_train for examples.
Package organization
The package is organized as a set of preprocessing functions and the core clustering modules. These functions are briefly described below.
trim_data: Removes columns from the dataset. The user should clean the dataset before calling the dcem_train routine. Users can also clean the dataset themselves (without using trim_data) and then pass it to the dcem_train function.
dcem_star_train and dcem_train: These are the primary interfaces to the EM* and EM algorithms respectively. These functions accept the cleaned dataset and other parameters (number of iterations, convergence threshold etc.) and run the algorithm until either:
The number of iterations is reached.
Convergence is achieved.
DCEM supports the following initialization schemes
Random Initialization: Initializes the mean randomly. Refer to meu_uv and meu_mv for initialization on univariate and multivariate data respectively.
Improved Initialization: Based on the K-means++ idea published in K-means++: The Advantages of Careful Seeding, David Arthur and Sergei Vassilvitskii. URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf. See meu_uv_impr and meu_mv_impr for details.
The choice of initialization scheme can be specified as the seeding parameter during training. See dcem_train for further details.
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization, SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
External Packages: DCEM requires the R packages 'mvtnorm' [1], 'matrixcalc' [2], 'Rcpp' [3] and 'MASS' [4] for multivariate density calculation, checking matrix singularity, compiling routines written in C++ and simulating mixtures of Gaussians, respectively.
[1] Alan Genz, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, Torsten Hothorn (2019). mvtnorm: Multivariate Normal and t Distributions. R package version 1.0-7. URL http://CRAN.R-project.org/package=mvtnorm
[2] Frederick Novomestky (2012). matrixcalc: Collection of functions for matrix calculations. R package version 1.0-3. URL https://CRAN.R-project.org/package=matrixcalc
[3] Dirk Eddelbuettel and Romain Francois (2011). Rcpp: Seamless R and C++ Integration. Journal of Statistical Software, 40(8), 1-18. URL http://www.jstatsoft.org/v40/i08/.
[4] Venables, W. N. & Ripley, B. D. (2002). Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
[5] David Arthur and Sergei Vassilvitskii. K-Means++: The Advantages of Careful Seeding. URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
build_heap: Part of DCEM package.
Description
Implements the creation of the heap. Internally called by dcem_star_train.
Usage
build_heap(data)
Arguments
data | (NumericMatrix): The dataset provided by the user. |
Value
A NumericMatrix with the max heap property.
Author(s)
Parichit Sharma <parishar@iu.edu>, Hasan Kurban, Mehmet Dalkilic
dcem_cluster_mv (multivariate data): Part of DCEM package.
Description
Implements the Expectation Maximization algorithm for multivariate data. This function is called by the dcem_train routine.
Usage
dcem_cluster_mv(data, meu, sigma, prior, num_clusters, iteration_count, threshold, num_data)
Arguments
data | (matrix): The dataset provided by the user. |
meu | (matrix): The matrix containing the initial meu(s). |
sigma | (list): A list containing the initial covariance matrices. |
prior | (vector): A vector containing the initial prior. |
num_clusters | (numeric): The number of clusters specified by the user. Default value is 2. |
iteration_count | (numeric): The number of iterations for which the algorithm should run. If convergence is not achieved within this count, the algorithm stops. Default: 200. |
threshold | (numeric): A small value to check for convergence (if the estimated meu are within this specified threshold then the algorithm stops and exits). Note: Choosing a very small value (0.0000001) for threshold can increase the runtime substantially and the algorithm may not converge. On the other hand, choosing a larger value (0.1) can lead to sub-optimal clustering. Default: 0.00001. |
num_data | (numeric): The total number of observations in the data. |
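The threshold argument above can be pictured with a small base-R sketch. The helper name has_converged is hypothetical and not part of DCEM, and comparing the element-wise absolute change of the meu is an assumption about how "within the threshold" is measured.

```r
# Hypothetical helper (not DCEM's code): declare convergence when every
# estimated meu moved by less than `threshold` between two iterations.
has_converged <- function(old_meu, new_meu, threshold = 0.00001) {
  all(abs(new_meu - old_meu) < threshold)
}

# Tiny movement: below the default threshold.
small_step <- has_converged(c(20, 70), c(20.000001, 70.000002))
# Large movement in the first meu: not converged yet.
large_step <- has_converged(c(20, 70), c(20.5, 70))
print(c(small_step, large_step))
```

A smaller threshold demands more iterations before the check passes, which is the runtime trade-off described in the note on threshold.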
Value
A list of objects. This list contains parameters associated with the Gaussian(s) (posterior probabilities, meu, co-variance and prior).
(1) Posterior Probabilities: prob: A matrix of posterior-probabilities.
(2) Meu: meu: A matrix of meu(s). Each row in the matrix corresponds to one meu.
(3) Sigma: Co-variance matrices: sigma
(4) Prior: prior: A vector of priors.
(5) Membership: membership: A vector of cluster membership for data.
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization, SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
dcem_cluster_uv (univariate data): Part of DCEM package.
Description
Implements the Expectation Maximization algorithm for univariate data. This function is internally called by the dcem_train routine.
Usage
dcem_cluster_uv(data, meu, sigma, prior, num_clusters, iteration_count, threshold, num_data, numcols)
Arguments
data | (matrix): The dataset provided by the user (converted to matrix format). |
meu | (vector): The vector containing the initial meu. |
sigma | (vector): The vector containing the initial standard deviation. |
prior | (vector): The vector containing the initial prior. |
num_clusters | (numeric): The number of clusters specified by the user. Default is 2. |
iteration_count | (numeric): The number of iterations for which the algorithm should run. If convergence is not achieved within this count, the algorithm stops. Default: 200. |
threshold | (numeric): A small value to check for convergence (if the estimated meu(s) are within the threshold then the algorithm stops). Note: Choosing a very small value (0.0000001) for threshold can increase the runtime substantially and the algorithm may not converge. On the other hand, choosing a larger value (0.1) can lead to sub-optimal clustering. Default: 0.00001. |
num_data | (numeric): The total number of observations in the data. |
numcols | (numeric): Number of columns in the dataset (after processing the missing values). |
Value
A list of objects. This list contains parameters associated with the Gaussian(s) (posterior probabilities, meu, standard-deviation and prior).
(1) Posterior Probabilities: prob: A matrix of posterior-probabilities.
(2) Meu(s): meu: A vector of meu. Each element of the vector corresponds to one meu.
(3) Sigma: Standard-deviation(s): sigma: A vector of standard deviations.
(4) Prior: prior: A vector of priors.
(5) Membership: membership: A vector of cluster membership for data.
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization, SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
dcem_predict: Part of DCEM package.
Description
Predicts the cluster membership of test data based on the learned parameters, i.e., the output from dcem_train or dcem_star_train.
Usage
dcem_predict(param_list, data)
Arguments
param_list | (list): List of distribution parameters. The list contains the learned parameters of the distribution. |
data | (vector or dataframe): A vector of data for univariate data. A dataframe (rows represent the data and columns represent the features) for multivariate data. |
Value
A list containing the cluster membership for the test data.
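As a rough illustration of what predicting membership from learned parameters can look like, the base-R sketch below assigns each univariate test point to the cluster with the largest prior-weighted Gaussian density. The helper predict_membership_uv is hypothetical, and the argmax rule is an assumption rather than a statement about how dcem_predict is implemented.

```r
# Hypothetical helper (not DCEM's code): pick, for every test point, the
# cluster whose prior-weighted Gaussian density is largest.
predict_membership_uv <- function(x, meu, sigma, prior) {
  sapply(x, function(p) which.max(prior * dnorm(p, mean = meu, sd = sigma)))
}

# Illustrative "learned" parameters for three clusters.
meu <- c(20, 70, 100)
sigma <- c(5, 1, 2)
prior <- rep(1 / 3, 3)

membership <- predict_membership_uv(c(19, 71, 99), meu, sigma, prior)
print(membership)
```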
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization, SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
Examples
# Simulating a mixture of univariate samples from three distributions
# with meu as 20, 70 and 100 and standard deviation as 5, 1 and 2 respectively.
sample_uv_data = as.data.frame(c(rnorm(100, 20, 5), rnorm(70, 70, 1), rnorm(50, 100, 2)))
# Select the first few points from each distribution as test data
test_data = as.vector(sample_uv_data[c(1:5, 101:105, 171:175),])
# Remove the test data from the training set
sample_uv_data = as.data.frame(sample_uv_data[-c(1:5, 101:105, 171:175), ])
# Randomly shuffle the samples.
sample_uv_data = as.data.frame(sample_uv_data[sample(nrow(sample_uv_data)),])
# Calling the dcem_train() function on the simulated data with threshold of
# 0.001, iteration count of 100 and the default (random) seeding.
sample_uv_out = dcem_train(sample_uv_data, num_clusters = 3, iteration_count = 100, threshold = 0.001)
# Predict the membership for test data
test_data_membership <- dcem_predict(sample_uv_out, test_data)
# Access the output
print(test_data_membership)
dcem_star_cluster_mv (multivariate data): Part of DCEM package.
Description
Implements the EM* algorithm for multivariate data. This function is called by the dcem_star_train routine.
Usage
dcem_star_cluster_mv(data, meu, sigma, prior, num_clusters, iteration_count, num_data)
Arguments
data | (matrix): The dataset provided by the user. |
meu | (matrix): The matrix containing the initial meu(s). |
sigma | (list): A list containing the initial covariance matrices. |
prior | (vector): A vector containing the initial priors. |
num_clusters | (numeric): The number of clusters specified by the user. Default value is 2. |
iteration_count | (numeric): The number of iterations for which the algorithm should run. If convergence is not achieved within this count, the algorithm stops and exits. Default: 200. |
num_data | (numeric): Number of rows in the dataset. |
Value
A list of objects. This list contains parameters associated with the Gaussian(s) (posterior probabilities, meu, co-variance and priors).
(1) Posterior Probabilities: prob: A matrix of posterior-probabilities for the points in the dataset.
(2) Meu: meu: A matrix of meu(s). Each row in the matrix corresponds to one meu.
(3) Sigma: Co-variance matrices: sigma: List of co-variance matrices.
(4) Priors: prior: A vector of priors.
(5) Membership: membership: A vector of cluster membership for data.
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization, SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
dcem_star_cluster_uv (univariate data): Part of DCEM package.
Description
Implements the EM* algorithm for univariate data. This function is called by the dcem_star_train routine.
Usage
dcem_star_cluster_uv(data, meu, sigma, prior, num_clusters, num_data, iteration_count)
Arguments
data | (matrix): The dataset provided by the user. |
meu | (vector): The vector containing the initial meu. |
sigma | (vector): The vector containing the initial standard deviation. |
prior | (vector): The vector containing the initial priors. |
num_clusters | (numeric): The number of clusters specified by the user. Default is 2. |
num_data | (numeric): Number of rows in the dataset (after processing the missing values). |
iteration_count | (numeric): The number of iterations for which the algorithm should run. If convergence is not achieved within this count, the algorithm stops. Default is 100. |
Value
A list of objects. This list contains parameters associated with the Gaussian(s) (posterior probabilities, meu, standard-deviation and priors).
(1) Posterior Probabilities: prob: A matrix of posterior-probabilities.
(2) Meu: meu: A vector of meu. Each element of the vector corresponds to one meu.
(3) Sigma: Standard-deviation(s): sigma
For univariate data: Vector of standard deviation.
(4) Priors: prior: A vector of priors.
(5) Membership: membership: A vector of cluster membership for data.
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization, SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
dcem_star_train: Part of DCEM package.
Description
Implements the improved EM* ([1], [2]) algorithm. EM* avoids revisiting all but high expressive data via structure based data segregation, thus resulting in significant speed gain. It internally calls the dcem_star_cluster_uv routine for univariate data and dcem_star_cluster_mv for multivariate data.
Usage
dcem_star_train(data, iteration_count, num_clusters, seed_meu, seeding)
Arguments
data | (dataframe): The dataframe containing the data. See |
iteration_count | (numeric): The number of iterations for which the algorithm should run. If convergence is not achieved within this count, the algorithm stops and exits. Default: 200. |
num_clusters | (numeric): The number of clusters. Default: 2 |
seed_meu | (matrix): The user specified set of meu to use as initial centroids. Default: None |
seeding | (string): The initialization scheme ('rand', 'improved'). Default: rand |
Value
A list of objects. This list contains parameters associated with the Gaussian(s) (posterior probabilities, meu, sigma and priors). The parameters can be accessed as follows where sample_out is the list containing the output:
(1) Posterior Probabilities: sample_out$prob: A matrix of posterior-probabilities.
(2) Meu(s): sample_out$meu
For multivariate data: It is a matrix of meu(s). Each row in the matrix corresponds to one meu.
For univariate data: It is a vector of meu(s). Each element of the vector corresponds to one meu.
(3) Co-variance matrices: sample_out$sigma
For multivariate data: List of co-variance matrices.
Standard-deviation: sample_out$sigma
For univariate data: Vector of standard deviation.
(4) Priors: sample_out$prior: A vector of priors.
(5) Membership: sample_out$membership: A dataframe of cluster membership for data. Column numbers are data indices and values are the assigned clusters.
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization, SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
Examples
# Simulating a mixture of univariate samples from three distributions
# with mean as 20, 70 and 100 and standard deviation as 5, 1 and 2 respectively.
sample_uv_data = as.data.frame(c(rnorm(100, 20, 5), rnorm(70, 70, 1), rnorm(50, 100, 2)))
# Randomly shuffle the samples.
sample_uv_data = as.data.frame(sample_uv_data[sample(nrow(sample_uv_data)),])
# Calling the dcem_star_train() function on the simulated data with iteration count of 100
# and the default (random) seeding.
sample_uv_out = dcem_star_train(sample_uv_data, num_clusters = 3, iteration_count = 100)
# Simulating a mixture of multivariate samples from 2 Gaussian distributions.
sample_mv_data = as.data.frame(rbind(MASS::mvrnorm(n=2, rep(2,5), Sigma = diag(5)),
MASS::mvrnorm(n=5, rep(14,5), Sigma = diag(5))))
# Calling the dcem_star_train() function on the simulated data with iteration count of 100 and
# the default (random) seeding method.
sample_mv_out = dcem_star_train(sample_mv_data, iteration_count = 100, num_clusters=2)
# Access the output
sample_mv_out$meu
sample_mv_out$sigma
sample_mv_out$prior
sample_mv_out$prob
print(sample_mv_out$membership)
dcem_test: Part of DCEM package.
Description
For demonstrating the execution on the bundled dataset.
Usage
dcem_test()
Details
The dcem_test performs the following steps in order:
(1) Read the data from the disk (from the file data/ionosphere_data.csv). The data folder is under the package installation folder. The dataset details can be seen by typing ionosphere_data in the R console or at http://archive.ics.uci.edu/ml/datasets/Ionosphere.
(2) Clean the data (by removing the columns). The data should be cleaned before use. Refer to trim_data to see what columns should be removed and how. The package provides the basic interface for removing columns.
(3) Call dcem_star_train on the cleaned data.
Accessing the output parameters
The function dcem_test() calls dcem_star_train. It returns a list of objects as output. This list contains the estimated parameters of the Gaussian(s) (posterior probabilities, meu, sigma and prior). The parameters can be accessed as follows where sample_out is the list containing the output:
(1) Posterior Probabilities: sample_out$prob: A matrix of posterior-probabilities.
(2) Meu: sample_out$meu
For multivariate data: It is a matrix of meu(s). Each row in the matrix corresponds to one meu.
(3) Co-variance matrices: sample_out$sigma
For multivariate data: List of co-variance matrices for the Gaussian(s).
Standard-deviation: sample_out$sigma
For univariate data: Vector of standard deviation for the Gaussian(s).
(4) Priors: sample_out$prior: A vector of priors.
(5) Membership: sample_out$membership: A dataframe of cluster membership for data. Column numbers are data indices and values are the assigned clusters.
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization, SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
dcem_train: Part of DCEM package.
Description
Implements the EM algorithm. It internally calls the relevant clustering routine: dcem_cluster_uv (univariate data) or dcem_cluster_mv (multivariate data).
Usage
dcem_train(data, threshold, iteration_count, num_clusters, seed_meu, seeding)
Arguments
data | (dataframe): The dataframe containing the data. See |
threshold | (decimal): A value to check for convergence (if the meu are within this value then the algorithm stops and exits). Default: 0.00001. |
iteration_count | (numeric): The number of iterations for which the algorithm should run. If convergence is not achieved within the specified count, the algorithm stops and exits. Default: 200. |
num_clusters | (numeric): The number of clusters. Default: 2 |
seed_meu | (matrix): The user specified set of meu to use as initial centroids. Default: None |
seeding | (string): The initialization scheme ('rand', 'improved'). Default: rand |
Value
A list of objects. This list contains parameters associated with the Gaussian(s) (posterior probabilities, meu, sigma and priors). The parameters can be accessed as follows where sample_out is the list containing the output:
(1) Posterior Probabilities: sample_out$prob: A matrix of posterior-probabilities.
(2) Meu: sample_out$meu
For multivariate data: It is a matrix of meu(s). Each row in the matrix corresponds to one meu.
For univariate data: It is a vector of meu(s). Each element of the vector corresponds to one meu.
(3) Sigma: sample_out$sigma
For multivariate data: List of co-variance matrices for the Gaussian(s).
For univariate data: Vector of standard deviation for the Gaussian(s).
(4) Priors: sample_out$prior: A vector of priors.
(5) Membership: sample_out$membership: A dataframe of cluster membership for data. Column numbers are data indices and values are the assigned clusters.
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization, SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
Examples
# Simulating a mixture of univariate samples from three distributions
# with meu as 20, 70 and 100 and standard deviation as 5, 1 and 2 respectively.
sample_uv_data = as.data.frame(c(rnorm(100, 20, 5), rnorm(70, 70, 1), rnorm(50, 100, 2)))
# Randomly shuffle the samples.
sample_uv_data = as.data.frame(sample_uv_data[sample(nrow(sample_uv_data)),])
# Calling the dcem_train() function on the simulated data with threshold of
# 0.001, iteration count of 100 and the default (random) seeding.
sample_uv_out = dcem_train(sample_uv_data, num_clusters = 3, iteration_count = 100, threshold = 0.001)
# Simulating a mixture of multivariate samples from 2 Gaussian distributions.
sample_mv_data = as.data.frame(rbind(MASS::mvrnorm(n=100, rep(2,5), Sigma = diag(5)),
MASS::mvrnorm(n=50, rep(14,5), Sigma = diag(5))))
# Calling the dcem_train() function on the simulated data with threshold of
# 0.001, iteration count of 100 and the default (random) seeding method.
sample_mv_out = dcem_train(sample_mv_data, threshold = 0.001, iteration_count = 100)
# Access the output
print(sample_mv_out$meu)
print(sample_mv_out$sigma)
print(sample_mv_out$prior)
print(sample_mv_out$prob)
print(sample_mv_out$membership)
expectation_mv: Part of DCEM package.
Description
Calculates the probabilistic weights for the multivariate data.
Usage
expectation_mv(data, weights, meu, sigma, prior, num_clusters, tolerance)
Arguments
data | (matrix): The input data. |
weights | (matrix): The probability weight matrix. |
meu | (matrix): The matrix of meu. |
sigma | (list): The list of sigma (co-variance matrices). |
prior | (vector): The vector of priors. |
num_clusters | (numeric): The number of clusters. |
tolerance | (numeric): The system epsilon value. |
Value
Updated probability weight matrix.
expectation_uv: Part of DCEM package.
Description
Calculates the probabilistic weights for the univariate data.
Usage
expectation_uv(data, weights, meu, sigma, prior, num_clusters, tolerance)
Arguments
data | (matrix): The input data. |
weights | (matrix): The probability weight matrix. |
meu | (vector): The vector of meu. |
sigma | (vector): The vector of sigma (standard-deviations). |
prior | (vector): The vector of priors. |
num_clusters | (numeric): The number of clusters. |
tolerance | (numeric): The system epsilon value. |
Value
Updated probability weight matrix.
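The expectation step for univariate data can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: the helper e_step_uv is hypothetical, and the layout of the weight matrix (one row per cluster, one column per observation) is chosen for the example only.

```r
# Hypothetical helper (not DCEM's code): posterior probability of every
# cluster for every observation in a univariate Gaussian mixture.
e_step_uv <- function(x, meu, sigma, prior) {
  dens <- matrix(0, nrow = length(meu), ncol = length(x))
  for (k in seq_along(meu)) {
    # Prior-weighted Gaussian density of all points under cluster k.
    dens[k, ] <- prior[k] * dnorm(x, mean = meu[k], sd = sigma[k])
  }
  # Normalise each column so the weights of one observation sum to 1.
  sweep(dens, 2, colSums(dens), "/")
}

x <- c(19, 21, 70, 71)
weights <- e_step_uv(x, meu = c(20, 70), sigma = c(5, 1), prior = c(0.5, 0.5))
print(round(weights, 3))
```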
get_priors: Part of DCEM package.
Description
Initialize the priors.
Usage
get_priors(num_priors)
Arguments
num_priors | (numeric): Number of priors, one for each cluster. |
Details
For example, if the user specifies 2 priors then the vector will have 2 entries (one for each cluster), each equal to 1/2 or 0.5.
Value
A vector of uniformly initialized prior values (numeric).
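The uniform initialization described in the Details can be reproduced in one line of base R. The helper name uniform_priors is hypothetical and is used only to illustrate the 1/k rule, not to show the package's source.

```r
# Hypothetical helper (not DCEM's code): every cluster starts with the
# same prior probability, 1 / num_priors.
uniform_priors <- function(num_priors) {
  rep(1 / num_priors, num_priors)
}

priors <- uniform_priors(2)
print(priors)
```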
insert_nodes: Part of DCEM package.
Description
Implements the node insertion into the heaps.
Usage
insert_nodes(heap_list, heap_assn, data_probs, leaves_ind, num_clusters)
Arguments
heap_list | (list): The nested list containing the heaps. Each entry in the list is a list maintained in max-heap structure. |
heap_assn | (numeric): The vector representing the heap assignments. |
data_probs | (string): A vector containing the probability for data. |
leaves_ind | (numeric): A vector containing the indices of leaves in heap. |
num_clusters | (numeric): The number of clusters. Default: 2 |
Value
A nested list. Each entry in the list is a list maintained in the max-heap structure.
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization, SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
Ionosphere data: A dataset of 351 radar readings
Description
This dataset contains 351 entries (radar readings from a system in Goose Bay laboratory) and 35 columns. The 35th column is the label column identifying the entry as either good or bad. Additionally, the 2nd column only contains 0's.
Usage
ionosphere_data
Format
A file with 351 rows and 35 columns of multivariate data in a csv file. All values are numeric.
Source
Space Physics Group, Applied Physics Laboratory, Johns Hopkins University, Johns Hopkins Road, Laurel, MD 20723. Web URL: http://archive.ics.uci.edu/ml/datasets/Ionosphere
References: Sigillito, V. G., Wing, S. P., Hutton, L. V., & Baker, K. B. (1989). Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10, 262-266.
max_heapify: Part of DCEM package.
Description
Implements the creation of the max heap. Internally called by dcem_star_train.
Usage
max_heapify(data, index, num_data)
Arguments
data | (NumericMatrix): The dataset provided by the user. |
index | (int): The index of the data point. |
num_data | (numeric): The total number of observations in the data. |
Value
A NumericMatrix with the max heap property.
Author(s)
Parichit Sharma <parishar@iu.edu>, Hasan Kurban, Mehmet Dalkilic
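The max heap property mentioned above can be illustrated with a base-R sketch on a plain numeric vector. The helpers max_heapify_vec and build_heap_vec are hypothetical; the package routines operate on a NumericMatrix, and how they choose the key column is not shown here.

```r
# Hypothetical helper (not DCEM's code): sift the element at `index` down
# until its subtree is a max heap. With 1-based indexing the children of
# node i sit at positions 2i and 2i + 1.
max_heapify_vec <- function(keys, index, num_data) {
  repeat {
    left <- 2 * index
    right <- 2 * index + 1
    largest <- index
    if (left <= num_data && keys[left] > keys[largest]) largest <- left
    if (right <= num_data && keys[right] > keys[largest]) largest <- right
    if (largest == index) break
    tmp <- keys[index]
    keys[index] <- keys[largest]
    keys[largest] <- tmp
    index <- largest
  }
  keys
}

# Build a heap by sifting down every internal node, from last to first.
build_heap_vec <- function(keys) {
  n <- length(keys)
  if (n < 2) return(keys)
  for (i in seq(floor(n / 2), 1)) keys <- max_heapify_vec(keys, i, n)
  keys
}

heap <- build_heap_vec(c(0.2, 0.9, 0.4, 0.7, 0.1))
print(heap)
```

After the call, every parent is at least as large as its children, so the largest key sits at position 1.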
maximisation_mv: Part of DCEM package.
Description
Calculates meu, sigma and prior based on the updated probability weight matrix.
Usage
maximisation_mv(data, weights, meu, sigma, prior, num_clusters, num_data)
Arguments
data | (matrix): The input data. |
weights | (matrix): The probability weight matrix. |
meu | (matrix): The matrix of meu. |
sigma | (list): The list of sigma (co-variance matrices). |
prior | (vector): The vector of priors. |
num_clusters | (numeric): The number of clusters. |
num_data | (numeric): The total number of observations in the data. |
Value
Updated values for meu, sigma and prior.
maximisation_uv: Part of DCEM package.
Description
Calculates meu, sigma and prior based on the updated probability weight matrix.
Usage
maximisation_uv(data, weights, meu, sigma, prior, num_clusters, num_data)
Arguments
data | (matrix): The input data. |
weights | (matrix): The probability weight matrix. |
meu | (vector): The vector of meu. |
sigma | (vector): The vector of sigma (standard-deviations). |
prior | (vector): The vector of priors. |
num_clusters | (numeric): The number of clusters. |
num_data | (numeric): The total number of observations in the data. |
Value
Updated values for meu, sigma and prior.
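A minimal base-R sketch of the maximisation step for univariate data follows. The helper m_step_uv is hypothetical, and the weight matrix is assumed to have one row per cluster and one column per observation; the package may organise its data differently.

```r
# Hypothetical helper (not DCEM's code): re-estimate meu, sigma and prior
# from the probability weight matrix.
m_step_uv <- function(x, weights) {
  n_k <- rowSums(weights)                 # effective points per cluster
  meu <- as.vector(weights %*% x) / n_k   # weighted means
  sigma <- sapply(seq_along(meu), function(k) {
    # Weighted standard deviation around the updated meu.
    sqrt(sum(weights[k, ] * (x - meu[k])^2) / n_k[k])
  })
  list(meu = meu, sigma = sigma, prior = n_k / length(x))
}

x <- c(1, 3, 10, 14)
# Hard assignments for illustration: first two points to cluster 1.
weights <- rbind(c(1, 1, 0, 0), c(0, 0, 1, 1))
params <- m_step_uv(x, weights)
print(params)
```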
meu_mv: Part of DCEM package.
Description
Initializes the meu(s) by randomly selecting samples from the dataset. This is the default method for initializing the meu(s).
Usage
# Randomly seeding the mean(s).
meu_mv(data, num_meu)
Arguments
data | (matrix): The dataset provided by the user. |
num_meu | (numeric): The number of meu. |
Value
A matrix containing the selected samples from the dataset.
meu_mv_impr: Part of DCEM package.
Description
Initializes the meu(s) by randomly selecting samples from the dataset. It uses the proposed implementation from K-means++: The Advantages of Careful Seeding, David Arthur and Sergei Vassilvitskii. URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf.
Usage
# Randomly seeding the meu.
meu_mv_impr(data, num_meu)
Arguments
data | (matrix): The dataset provided by the user. |
num_meu | (numeric): The number of meu. |
Value
A matrix containing the selected samples from the dataset.
meu_uv: Part of DCEM package.
Description
This function is internally called by dcem_train to initialize the meu(s). It randomly selects the meu(s) from the range min(data):max(data).
Usage
# Randomly seeding the meu.
meu_uv(data, num_meu)
Arguments
data | (matrix): The dataset provided by the user. |
num_meu | (number): The number of meu. |
Value
A vector containing the selected samples from the dataset.
meu_uv_impr: Part of DCEM package.
Description
This function is internally called by dcem_train to initialize the meu(s). It uses the proposed implementation from K-means++: The Advantages of Careful Seeding, David Arthur and Sergei Vassilvitskii. URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf.
Usage
# Seeding the meu using the K-means++ implementation.
meu_uv_impr(data, num_meu)
Arguments
data | (matrix): The dataset provided by the user. |
num_meu | (number): The number of meu. |
Value
A vector containing the selected samples from the dataset.
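The K-means++ idea behind the improved seeding can be sketched in base R for univariate data. The helper kmeanspp_seed_uv is hypothetical and follows the published seeding rule (each new seed is drawn with probability proportional to its squared distance from the nearest seed already chosen); it is not taken from the package source and assumes the data contain at least num_meu distinct values.

```r
# Hypothetical helper (not DCEM's code): K-means++ style seeding.
kmeanspp_seed_uv <- function(x, num_meu) {
  centers <- numeric(num_meu)
  # The first seed is a uniformly random data point.
  centers[1] <- x[sample.int(length(x), 1)]
  if (num_meu > 1) {
    for (k in 2:num_meu) {
      # Squared distance from every point to its nearest chosen seed.
      d2 <- sapply(x, function(p) min((p - centers[seq_len(k - 1)])^2))
      # Far-away points are more likely to become the next seed.
      centers[k] <- x[sample.int(length(x), 1, prob = d2 / sum(d2))]
    }
  }
  centers
}

set.seed(1)
x <- c(rnorm(50, 20, 5), rnorm(50, 70, 1))
seeds <- kmeanspp_seed_uv(x, 2)
print(seeds)
```

Because a point already chosen has zero distance to itself, it cannot be drawn again, which tends to spread the initial meu across the data.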
separate_data: Part of DCEM package.
Description
Separate leaf nodes from the heaps.
Usage
separate_data(heap_list, num_clusters)
Arguments
heap_list | (list): The nested list containing the heaps. Each entry in the list is a list maintained in max-heap structure. |
num_clusters | (numeric): The number of clusters. Default: 2 |
Value
A nested list where,
First entry is the list of heaps with leaves removed.
Second entry is the list of leaves.
References
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization, SoftwareX, 17, 100944. URL https://doi.org/10.1016/j.softx.2021.100944
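For a heap stored as an array, the leaves are the entries without children, which with 1-based indexing are positions floor(n/2) + 1 to n. The base-R sketch below splits a single heap on that rule. The helper split_leaves is hypothetical, and whether separate_data identifies leaves this way is an assumption.

```r
# Hypothetical helper (not DCEM's code): split an array-based heap into
# its internal nodes and its leaves.
split_leaves <- function(heap) {
  n <- length(heap)
  if (n == 0) return(list(internal = heap, leaves = heap))
  first_leaf <- floor(n / 2) + 1
  list(
    internal = heap[seq_len(first_leaf - 1)],  # nodes with at least one child
    leaves = heap[first_leaf:n]                # nodes without children
  )
}

parts <- split_leaves(c(0.9, 0.7, 0.4, 0.2, 0.1))
print(parts)
```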
sigma_mv: Part of DCEM package.
Description
Initializes the co-variance matrices as the identity matrices.
Usage
sigma_mv(num_sigma, numcol)
Arguments
num_sigma | (numeric): Number of covariance matrices. |
numcol | (numeric): The number of columns in the dataset. |
Value
A list of identity matrices. The number of entries in the list is equal to the input parameter (num_sigma).
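A list of identity matrices of this shape can be built in base R as sketched below. The helper name identity_covariances is hypothetical and only illustrates the described return value.

```r
# Hypothetical helper (not DCEM's code): one numcol x numcol identity
# matrix per cluster, returned as a list.
identity_covariances <- function(num_sigma, numcol) {
  replicate(num_sigma, diag(numcol), simplify = FALSE)
}

sigma_list <- identity_covariances(num_sigma = 3, numcol = 5)
print(length(sigma_list))
print(dim(sigma_list[[1]]))
```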
sigma_uv: Part of DCEM package.
Description
Initializes the standard deviation for the Gaussian(s).
Usage
sigma_uv(data, num_sigma)
Arguments
data | (matrix): The dataset provided by the user. |
num_sigma | (number): Number of sigma (standard_deviations). |
Value
A vector of standard deviation value(s).
trim_data: Part of DCEM package. Used internally in the package.
Description
Removes the specified column(s) from the dataset.
Usage
trim_data(columns, data)
Arguments
columns | (string): A comma separated list of column(s) that need to be removed from the dataset. Default: '' |
data | (dataframe): Dataframe containing the input data. |
Value
A dataframe with the specified column(s) removed from it.
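The effect of removing columns named in a comma separated string can be sketched in base R. The helper drop_columns is hypothetical, and treating the entries of columns as 1-based column indices is an assumption made for the illustration.

```r
# Hypothetical helper (not DCEM's code): drop the columns listed in a
# comma separated string such as "2,35".
drop_columns <- function(columns, data) {
  # An empty string means there is nothing to remove.
  if (!nzchar(columns)) return(data)
  idx <- as.integer(trimws(strsplit(columns, ",", fixed = TRUE)[[1]]))
  data[, -idx, drop = FALSE]
}

df <- data.frame(a = 1:3, b = 0, label = c("g", "b", "g"))
cleaned <- drop_columns("2, 3", df)
print(cleaned)
```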
update_weights: Part of DCEM package.
Description
Update the probability values for specific data points that change between the heaps.
Usage
update_weights(temp_weights, weights, index_list, num_clusters)
Arguments
temp_weights | (matrix): A matrix of probabilistic weights for leaf data. |
weights | (matrix): A matrix of probabilistic weights for all data. |
index_list | (vector): A vector of indices. |
num_clusters | (numeric): The number of clusters. |
Value
Updated probabilistic weights matrix.
validate_data: Part of DCEM package. Used internally in the package.
Description
Implements a sanity check for the input data. This function is for internal use and is called by dcem_train.
Usage
validate_data(columns, numcols)
Arguments
columns | (string): A comma separated list of columns that need to be removed from the dataset. Default: '' |
numcols | (numeric): Number of columns in the dataset. |
Details
An example would be to check whether the column to be removed exists. trim_data internally calls this function before removing the column(s).
Value
boolean: TRUE if the columns exist, otherwise FALSE.
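The existence check described in the Details can be pictured with the base-R sketch below. The helper columns_exist is hypothetical, and it assumes that columns holds 1-based column indices separated by commas.

```r
# Hypothetical helper (not DCEM's code): TRUE only when every requested
# column index lies between 1 and numcols.
columns_exist <- function(columns, numcols) {
  idx <- as.integer(trimws(strsplit(columns, ",", fixed = TRUE)[[1]]))
  length(idx) > 0 && !anyNA(idx) && all(idx >= 1 & idx <= numcols)
}

valid <- columns_exist("2,35", numcols = 35)
invalid <- columns_exist("36", numcols = 35)
print(c(valid, invalid))
```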