# mfair: Matrix Factorization with Auxiliary Information in R
The R package `mfair` implements the methods based on the paper *MFAI: A scalable Bayesian matrix factorization approach to leveraging auxiliary information*. MFAI integrates gradient boosted trees into the probabilistic matrix factorization framework to leverage auxiliary information effectively and adaptively.
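Roughly speaking, and mirroring the simulation in the example below rather than the full probabilistic model in the paper, MFAI approximates the main data matrix $Y \in \mathbb{R}^{N \times M}$ by a low-rank product whose factors are driven by the auxiliary covariates $X$:

$$
Y \approx Z W^\top, \qquad Z_{\cdot k} = F_k(X) + \eta_k, \quad k = 1, \dots, K,
$$

where each unknown function $F_k(\cdot)$ is fitted with gradient boosted trees and the noise terms are assumed Gaussian; please see the paper for the precise model and priors.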
Note: Two years later, I realized there are a bunch of areas for improvement in my code. Taking memory management as an example, using functions like `c()`, `append()`, `cbind()`, or `rbind()` to dynamically grow variables is not recommended, especially with large datasets. A more efficient approach is to pre-allocate memory if the output size is known. If you don't know the size, a good way is to store the outputs in a list and merge them afterwards using functions like `lapply()` and `do.call()`; a minimal sketch follows below. For more details, please refer to *Advanced R*, Chapter 24 (Improving performance), or the *Advanced R Course*, Chapter 5 (Performance).
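As a minimal illustration of the patterns above (a toy loop, not code from the package):

```r
n <- 1e4

# Slow: growing a vector with c() copies it on every iteration
res_grow <- c()
for (i in seq_len(n)) {
  res_grow <- c(res_grow, i^2)
}

# Better: pre-allocate when the output size is known
res_prealloc <- numeric(n)
for (i in seq_len(n)) {
  res_prealloc[i] <- i^2
}

# When the size is unknown: store pieces in a list, then combine once
res_list <- lapply(seq_len(n), function(i) i^2)
res_combined <- do.call(c, res_list)
```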
For a quick start, you can install the development version of `mfair` from GitHub with:
```r
# install.packages("devtools")
devtools::install_github("YangLabHKUST/mfair")
```
For more illustrations and examples, you can alternatively use:
```r
# install.packages("devtools")
devtools::install_github("YangLabHKUST/mfair", build_vignettes = TRUE)
```
to build the vignettes simultaneously. Please note that it can take a few more minutes.
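After installation with `build_vignettes = TRUE`, the built vignettes can be listed with standard R tooling (base/utils functions, not a package-specific helper):

```r
# List the vignettes shipped with the installed package
browseVignettes("mfair")
```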
This is a basic example which shows you how to solve a common problem:
```r
set.seed(20230306)
library(mfair)

# Simulate data
# Set the data dimension and rank
N <- 100
M <- 200
K_true <- 2L

# Set the proportion of variance explained (PVE)
PVE_Z <- 0.9
PVE_Y <- 0.5

# Generate auxiliary information X
X1 <- runif(N, min = -10, max = 10)
X2 <- runif(N, min = -10, max = 10)
X <- cbind(X1, X2)

# F(X)
FX1 <- X1 / 2 - X2
FX2 <- (X1^2 - X2^2 + 2 * X1 * X2) / 10
FX <- cbind(FX1, FX2)

# Generate the factor matrix Z (= F(X) + noise)
sig1_sq <- var(FX1) * (1 / PVE_Z - 1)
Z1 <- FX1 + rnorm(n = N, mean = 0, sd = sqrt(sig1_sq))
sig2_sq <- var(FX2) * (1 / PVE_Z - 1)
Z2 <- FX2 + rnorm(n = N, mean = 0, sd = sqrt(sig2_sq))
Z <- cbind(Z1, Z2)

# Generate the loading matrix W
W <- matrix(rnorm(M * K_true), nrow = M, ncol = K_true)

# Generate the main data matrix Y_obs (= Y + noise)
Y <- Z %*% t(W)
Y_var <- var(as.vector(Y))
epsilon_sq <- Y_var * (1 / PVE_Y - 1)
Y_obs <- Y + matrix(
  rnorm(N * M, mean = 0, sd = sqrt(epsilon_sq)),
  nrow = N, ncol = M
)

# Create MFAIR object
mfairObject <- createMFAIR(Y_obs, as.data.frame(X), K_max = K_true)
#> The main data matrix Y is completely observed!
#> The main data matrix Y has been centered with mean = 0.196418181646673!

# Fit the MFAI model
mfairObject <- fitGreedy(mfairObject, sf_para = list(verbose_loop = FALSE))
#> Set K_max = 2!
#> Initialize the parameters of Factor 1......
#> After 2 iterations Stage 1 ends!
#> After 77 iterations Stage 2 ends!
#> Factor 1 retained!
#> Initialize the parameters of Factor 2......
#> After 2 iterations Stage 1 ends!
#> After 78 iterations Stage 2 ends!
#> Factor 2 retained!

# Prediction based on the low-rank approximation
Y_hat <- predict(mfairObject)
#> The main data matrix Y has no missing entries!

# Root-mean-square-error
sqrt(mean((Y_obs - Y_hat)^2))
#> [1] 11.04169

# Predicted/true matrix variance ratio
var(as.vector(Y_hat)) / var(as.vector(Y_obs))
#> [1] 0.466571

# Prediction/noise variance ratio
var(as.vector(Y_hat)) / var(as.vector(Y_obs - Y_hat))
#> [1] 0.9493455
```
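For intuition only (this baseline is not part of the `mfair` workflow), one can compare the fit above with a plain rank-2 truncated SVD, which ignores the auxiliary information `X`; since the data are simulated, both can be evaluated against the noiseless signal `Y`:

```r
# Rank-K_true truncated SVD of the centered data (ignores the auxiliary information X)
Y_centered <- Y_obs - mean(Y_obs)
s <- svd(Y_centered, nu = K_true, nv = K_true)
Y_hat_svd <- s$u %*% diag(s$d[1:K_true]) %*% t(s$v) + mean(Y_obs)

# Recovery of the noiseless signal Y (available because the data are simulated)
sqrt(mean((Y - Y_hat_svd)^2)) # SVD baseline
sqrt(mean((Y - Y_hat)^2))     # MFAI fit from above
```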
`mfair` can also handle a data matrix with missing entries:
```r
# Split the data into the training set and test set
n_all <- N * M
training_ratio <- 0.5
train_set <- sample(1:n_all, n_all * training_ratio, replace = FALSE)
Y_train <- Y_test <- Y_obs
Y_train[-train_set] <- NA
Y_test[train_set] <- NA

# Create MFAIR object
mfairObject <- createMFAIR(Y_train, as.data.frame(X), Y_sparse = TRUE, K_max = K_true)
#> The main data matrix Y has 50% missing entries!
#> The main data matrix Y has been transferred to the sparse mode!
#> The main data matrix Y has been centered with mean = 0.0364079914822442!

# Fit the MFAI model
mfairObject <- fitGreedy(mfairObject, sf_para = list(verbose_loop = FALSE))
#> Set K_max = 2!
#> Initialize the parameters of Factor 1......
#> After 2 iterations Stage 1 ends!
#> After 99 iterations Stage 2 ends!
#> Factor 1 retained!
#> Initialize the parameters of Factor 2......
#> After 2 iterations Stage 1 ends!
#> After 68 iterations Stage 2 ends!
#> Factor 2 retained!

# Prediction based on the low-rank approximation
Y_hat <- predict(mfairObject)

# Root-mean-square-error
sqrt(mean((Y_test - Y_hat)^2, na.rm = TRUE))
#> [1] 11.6532

# Predicted/true matrix variance ratio
var(as.vector(Y_hat), na.rm = TRUE) / var(as.vector(Y_obs), na.rm = TRUE)
#> [1] 0.4505973

# Prediction/noise variance ratio
var(as.vector(Y_hat), na.rm = TRUE) / var(as.vector(Y_obs - Y_hat), na.rm = TRUE)
#> [1] 0.8830444
```
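As a quick sanity check (again, not part of the package), a naive baseline for the missing-entry setting is to impute every held-out entry with the training-set mean and compare its test RMSE with the MFAI result above:

```r
# Naive baseline: fill all missing entries with the mean of the observed training entries
Y_mean <- mean(Y_train, na.rm = TRUE)
sqrt(mean((Y_test - Y_mean)^2, na.rm = TRUE))
```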
Empirically, the backfitting algorithm can further improve the performance:
```r
# Refine the MFAI model with the backfitting algorithm
mfairObject <- fitBack(mfairObject,
  verbose_bf_inner = FALSE,
  sf_para = list(verbose_sf = FALSE, verbose_loop = FALSE)
)
#> Iteration: 1, relative difference of model parameters: 0.2827721.
#> Iteration: 2, relative difference of model parameters: 0.05223744.
#> Iteration: 3, relative difference of model parameters: 0.07034077.
#> Iteration: 4, relative difference of model parameters: 0.08374642.
#> Iteration: 5, relative difference of model parameters: 0.009833483.

# Prediction based on the low-rank approximation
Y_hat <- predict(mfairObject)

# Root-mean-square-error
sqrt(mean((Y_test - Y_hat)^2, na.rm = TRUE))
#> [1] 11.63093

# Predicted/true matrix variance ratio
var(as.vector(Y_hat), na.rm = TRUE) / var(as.vector(Y_obs), na.rm = TRUE)
#> [1] 0.4697753

# Prediction/noise variance ratio
var(as.vector(Y_hat), na.rm = TRUE) / var(as.vector(Y_obs - Y_hat), na.rm = TRUE)
#> [1] 0.9254147
```
vignette("ml100k")
- Explore thevignette illustrating the spatial and temporal dynamicsof gene regulation among braintissues:
vignette("neocortex")
- For more documentation and examples, please visit our packagewebsite.
If you find the `mfair` package or any of the source code in this repository useful for your work, please cite:
Wang, Z., Zhang, F., Zheng, C., Hu, X., Cai, M., & Yang, C. (2024). MFAI: A Scalable Bayesian Matrix Factorization Approach to Leveraging Auxiliary Information. *Journal of Computational and Graphical Statistics*, 33(4), 1339–1349. https://doi.org/10.1080/10618600.2024.2319160
The R package `mfair` is developed and maintained by Zhiwei Wang. Please feel free to contact Zhiwei Wang, Prof. Mingxuan Cai, or Prof. Can Yang if you have any inquiries.