# iClusterVB
iClusterVB allows for fast integrative clustering and feature selection for high-dimensional data.
Using a variational Bayes approach, its key features - clustering of mixed-type data, automated determination of the number of clusters, and feature selection in high-dimensional settings - address the limitations of traditional clustering methods while offering an alternative and potentially faster approach than MCMC algorithms, making iClusterVB a valuable tool for contemporary data analysis challenges.
You can install iClusterVB from CRAN with:

```r
install.packages("iClusterVB")
```

You can install the development version of iClusterVB from GitHub with:

```r
# install.packages("devtools")
devtools::install_github("AbdalkarimA/iClusterVB")
```
## Mandatory arguments
- `mydata`: A list of length R, where R is the number of datasets, containing the input data.
  - Note: For categorical data, 0's must be re-coded to another, non-0 value.
- `dist`: A vector of length R specifying the type of data or distribution. Options include: "gaussian" (for continuous data), "multinomial" (for binary or categorical data), and "poisson" (for count data).
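As a minimal sketch of the re-coding note above (the matrix `bin` is a made-up example, not part of the package), a binary view coded 0/1 can simply be shifted to 1/2 before being placed in `mydata`:

```r
set.seed(1)
# Hypothetical binary data view coded as 0/1
bin <- matrix(rbinom(20, size = 1, prob = 0.5), nrow = 5)

# Re-code 0's to a non-0 value by shifting the codes from 0/1 to 1/2
bin_recoded <- bin + 1
```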
## Optional arguments
- `K`: The maximum number of clusters, with a default value of 10. The algorithm will converge to a model with dominant clusters, removing redundant clusters and automating the process of determining the number of clusters.
- `initial_method`: The method for the initial cluster allocation, which the iClusterVB algorithm will then use to determine the final cluster allocation. Options include "VarSelLCM" (default) for VarSelLCM, "random" for a random sample, "kproto" for k-prototypes, "kmeans" for k-means (continuous data only), "mclust" for mclust (continuous data only), or "lca" for poLCA (categorical data only).
- `VS_method`: The feature selection method. The options are 0 (default) for clustering without feature selection and 1 for clustering with feature selection.
- `initial_cluster`: The initial cluster membership. The default is NULL, which uses `initial_method` for the initial cluster allocation. If it is not NULL, it will overwrite the previous initial values setting for this parameter.
- `initial_vs_prob`: The initial feature selection probability, a scalar. The default is NULL, which assigns a value of 0.5.
- `initial_fit`: Initial values based on a previously fitted iClusterVB model (an iClusterVB object). The default is NULL.
- `initial_omega`: Customized initial values for feature inclusion probabilities. The default is NULL. If the argument is not NULL, it will overwrite the previous initial values setting for this parameter. If `VS_method = 1`, `initial_omega` is a list of length R, and each element of the list is an array with `dim = c(N, p[[r]])`, where N is the sample size and `p[[r]]` is the number of features for dataset r, r = 1,...,R.
- `initial_hyper_parameters`: A list of the initial hyper-parameters of the prior distributions for the model. The default is NULL, which assigns `alpha_00 = 0.001`, `mu_00 = 0`, `s2_00 = 100`, `a_00 = 1`, `b_00 = 1`, `kappa_00 = 1`, `u_00 = 1`, `v_00 = 1`. These are $\boldsymbol{\alpha}_0, \mu_0, s^2_0, a_0, b_0, \boldsymbol{\kappa}_0, c_0, \text{and } d_0$ described in https://dx.doi.org/10.2139/ssrn.4971680.
- `max_iter`: The maximum number of iterations for the VB algorithm. The default is 200.
- `early_stop`: Whether to stop the algorithm upon convergence or to continue until `max_iter` is reached. Options are 1 (default) to stop when the algorithm converges, and 0 to stop only when `max_iter` is reached.
- `per`: Print information every "per" iterations. The default is 10.
- `convergence_threshold`: The convergence threshold for the change in ELBO. The default is 0.0001.
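For illustration, a customized `initial_omega` matching the structure described above can be built with `lapply`. The dimensions below (N = 240 individuals, 500 features per view, R = 3 views) mirror the simulated dataset used later in this README, and the starting probability of 0.5 is an arbitrary choice for the sketch:

```r
N <- 240                  # sample size
p <- list(500, 500, 500)  # number of features in each of R = 3 views

# One N x p[[r]] array of feature inclusion probabilities per view,
# all initialized at 0.5
initial_omega <- lapply(p, function(pr) array(0.5, dim = c(N, pr)))
```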
We will demonstrate the clustering and feature selection performance of iClusterVB using a simulated dataset comprising:
| Data View | Cluster | Distribution |
|---|---|---|
| 1 (Continuous) | Cluster 1 | |
| | Cluster 2 | |
| | Cluster 3 | |
| | Cluster 4 | |
| 2 (Continuous) | Cluster 1 | |
| | Cluster 2 | |
| | Cluster 3 | |
| | Cluster 4 | |
| 3 (Count) | Cluster 1 | |
| | Cluster 2 | |
| | Cluster 3 | |
| | Cluster 4 | |

*Distribution of relevant and noise features across clusters in each data view*
The simulated dataset is included as a list in the package.
```r
library(iClusterVB)

# Input data must be a list
dat1 <- list(
  gauss_1 = sim_data$continuous1_data,
  gauss_2 = sim_data$continuous2_data,
  multinomial_1 = sim_data$binary_data
)

dist <- c("gaussian", "gaussian", "multinomial")
```
```r
set.seed(123)
fit_iClusterVB <- iClusterVB(
  mydata = dat1,
  dist = dist,
  K = 8,
  initial_method = "VarSelLCM",
  VS_method = 1, # Variable selection is on
  max_iter = 100,
  per = 100
)
#> ------------------------------------------------------------
#> Pre-processing and initializing the model
#> ------------------------------------------------------------
#> ------------------------------------------------------------
#> Running the CAVI algorithm
#> ------------------------------------------------------------
#> iteration = 100 elbo = -21293757.232508
```
```r
table(fit_iClusterVB$cluster, sim_data$cluster_true)
#>
#>      1  2  3  4
#>   4  0  0 60  0
#>   5  0 60  0  0
#>   6  0  0  0 60
#>   8 60  0  0  0
```
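Note that the cluster labels returned by the algorithm (4, 5, 6, and 8 above) are arbitrary; what matters is that each recovered cluster maps one-to-one onto a true cluster. A self-contained toy illustration of this label switching, using made-up label vectors rather than the fitted model:

```r
# Two identical partitions of 240 individuals, under different label names
true_labels <- rep(1:4, each = 60)
vb_labels   <- rep(c(8, 5, 4, 6), each = 60)

# The cross-tabulation has exactly one 60-count cell per row and column,
# i.e. a perfect one-to-one match despite the differing labels
table(vb_labels, true_labels)
```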
```r
# We can obtain a summary using summary()
summary(fit_iClusterVB)
#> Total number of individuals:
#> [1] 240
#>
#> User-inputted maximum number of clusters: 8
#> Number of clusters determined by algorithm: 4
#>
#> Cluster Membership:
#>  4  5  6  8
#> 60 60 60 60
#>
#> # of variables above the posterior inclusion probability of 0.5 for View 1 - gaussian
#> [1] "54 out of a total of 500"
#>
#> # of variables above the posterior inclusion probability of 0.5 for View 2 - gaussian
#> [1] "59 out of a total of 500"
#>
#> # of variables above the posterior inclusion probability of 0.5 for View 3 - multinomial
#> [1] "500 out of a total of 500"
```
```r
plot(fit_iClusterVB)

# The `piplot` function can be used to visualize the probability of inclusion
piplot(fit_iClusterVB)
```

```r
# The `chmap` function can be used to display heat maps for each data view
list_of_plots <- chmap(fit_iClusterVB, rho = 0,
                       cols = c("green", "blue", "purple", "red"),
                       scale = "none")

# The `grid.arrange` function from gridExtra can be used to display all the
# plots together
gridExtra::grid.arrange(grobs = list_of_plots, ncol = 2, nrow = 2)
```