- Notifications
You must be signed in to change notification settings - Fork3
pixushi/tempted
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This is a vignette for the R packagetempted
, which implements thestatistical method TEMPoral TEnsor Decomposition (TEMPTED). The goal ofTEMPTED is to perform dimensionality reduction for multivariatelongitudinal data, with a special attention to longitudinal mirobiomestudies.
Package dependencies: R (>= 4.2.0), np (>= 0.60-17), ggplot2 (>=3.4.0), methods (>= 4.2.1).
Run time for all the example codes in this demo was within oneminute on a MacBook Pro with 2.3 GHz Intel Core i7 processor. You mayexpect longer run time if you have more subjects or more features.
You cancite this paper for using TEMPTED:
Shi P, Martino C, Han R, Janssen S, Buck G, Serrano M, Owzar K, KnightR, Shenhav L, Zhang AR.Time-Informed Dimensionality Reduction forLongitudinal Microbiome Studies.bioRxiv.
The statistical theories behind TEMPTED can be found in this paper:
Han R, Shi P, Zhang AR.Guaranteed Functional Tensor Singular ValueDecomposition. Journal of the American Statistical Association (2023):1-13.
TEMPTED is alsoimplemented in Python through the Python packagegemelli
and as a plugin of Qiime2. It can be installed throughpip install gemelli
.Documentation forgemelli
.
You can directly install TEMPTED from CRAN by
install.packages("tempted")
Typical installation time is within a few minutes on a typicaldesktop.
You can install the development version of TEMPTED fromGitHub with:
# install.packages("devtools")devtools::install_github("pixushi/tempted")
or download thetarballto your folder of interest and install using
install.packages("folder_of_interest/tempted_0.1.0.tar.gz",repos=NULL,type="source")
library(gridExtra)library(tempted)
The example dataset is originally fromBokulich, Nicholas A., etal. (2016). We provide threedata objects:
meta_table
: A data.frame with rows representing samples and matchingwith data.framecount_table
andprocessed_table
and three columns:studyid
: character denoting the subject ID of the infants.delivery
: character denoting the delivery mode of the infants.}day_of_life
: character denoting the age of infants measured indays when microbiome sample was taken.
count_tabe
: A data.frame with rows representing samples and matchingwith data.framemeta_table
and columns representing microbialfeatures (i.e. OTUs). Each entry is a read count.processed_table
: A data.frame with rows representing samples andmatching with data.framemeta_table
and columns representingmicrobial features (i.e. ASVs, genes). Entries do not need to betransformed, and will be directly used bytempted()
. This data.frameis used to illustrate howtempted()
can be used for general form ofmultivariate longitudinal data already preprocessed by user.
# match rows of different data frames# check number of samplestable(rownames(count_table)==rownames(meta_table))#>#> TRUE#> 852metauni<- unique(meta_table[,c('studyid','delivery')])
We provide example codes for the following scenarios:
- Run TEMPTED for Microbiome Count Data (Straightforward Way)
- Run TEMPTED for Microbiome Compositional Data (Straightforward Way)
- Run TEMPTED for General Form of Multivariate Longitudinal Data(Straightforward Way)
- Run TEMPTED in Customized Way
- Transferring TEMPTED result from training to testing data
A complete description of all parameters can be found in the pdf manual.Here we explain the key parameters:
feature_table
: A sample by feature matrix.time_point
: The time stamp of each sample, matched with the rows offeature_table
.subjectID
: The subject ID of each sample, matched with the rows offeature_table
.threshold
: A threshold for feature filtering for microbiome data.Features with zero value percentage >= threshold will be excluded.Default is 0.95.smooth
: Smoothing parameter for RKHS norm. Larger means smoothertemporal loading functions. Default is set to be 1e-8. Value can beadjusted depending on the dataset by checking the smoothness of theestimated temporal loading function in plot.transform
: The transformation applied to the data. “logcomp” for logof compositions. “comp” for compositions. “ast” for arcsine squaredtransformation. “clr” for central log ratio transformation. “logit”for logit transformation. “lfb” for log fold over baseline. “none” forno transformation. Default transform=“clr” is recommended formicrobiome data. For data that are already transformed, usetransform=“none”.r
: Number of components to decompose into, i.e. rank of the CP typedecomposition. Default is set to 3.pct_ratio
: The percent of features to sum up for taking log ratios.Default is 0.05, i.e. 5%.absolute
:absolute = TRUE
means features are ranked by theabsolute value of feature loadings, and the toppct_ratio
percent offeatures are picked.absolute = FALSE
means features are ranked bythe original value of feature loadings, and the top and bottompct_ratio
percent of features are picked. Then ratio is taken as theabundance of the features with positive loading over the abundance ofthe features with negative loading. Input for ratio_feature.pct_aggregate
: The percent of features to aggregate, features rankedby absolute value of the feature loading of each component. Default is1, which means 100% of features are aggregated. Settingpct_aggregate = 0.01
means top 1% of features is aggregated, wherefeatures are ranked in absolute value of feature loading of eachcomponent. Input for aggregate_feature.pseudo
: A small number to add to all the counts before normalizinginto proportions and log transformation. Default is 1/2 of thesmallest non-zero value that is specific for each sample. This pseudocount is added fortransform=c("logcomp", "clr", "logit", "lfb")
.
IMPORTANT NOTE: In matrix singular value decomposition, the sign ofsubject scores and feature loadings can be flipped together. Similarly,you can flip the signs of any pair of subject loadings, featureloadings, and temporal loadings.
res_count<- tempted_all(count_table,meta_table$day_of_life,meta_table$studyid,threshold=0.95,transform="clr",pseudo=0.5,r=2,smooth=1e-5,pct_ratio=0.1,pct_aggregate=1)#> Calculate the 1th Component#> Convergence reached at dif=3.57400400849311e-05, iter=4#> Calculate the 2th Component#> Convergence reached at dif=4.745809269163e-05, iter=7
Subject loadings are stored in variableA_hat
of thetempted_all()
output.
A_data<-metaunirownames(A_data)<-A_data$studyidA_data<- cbind(res_count$A_hat[rownames(A_data),],A_data)p_subj<- ggplot(data=A_data, aes(x=A_data[,1],y=A_data[,2],color=delivery))+ geom_point()+ labs(x='Component 1',y='Component 2',title='subject loading') print(p_subj)
Temporal loadings are stored in variablePhi_hat
of thetempted_all()
output. We provide an R functionplot_time_loading()
to plot these curves. Optionr
lets you decide how many components toplot.
p_time<- plot_time_loading(res_count,r=2)+ geom_line(size=1.5)+ labs(title='temporal loadings',x='days')print(p_time)
Feature loadings are stored in variableB_hat
of thetempted_all()
output.
p_feature<- ggplot(as.data.frame(res_count$B_hat), aes(x=PC1,y=PC2))+ geom_point()+ labs(x='Component 1',y='Component 2',title='feature loading')print(p_feature)
The log ratios are stored in variablemetafeature_ratio
of thetempted_all()
output.
group<- unique(meta_table[,c("studyid","delivery")])plot_metafeature(res_count$metafeature_ratio,group)+ xlab("Days of Life")#> Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 |
The subject trajectores are stored inmetafeature_aggregate
of thetempted_all()
output.
group<- unique(meta_table[,c("studyid","delivery")])plot_metafeature(res_count$metafeature_aggregate,group)+ xlab("Days of Life")#> Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 |
This is an alternative way to visualize the subject trajectories.Instead of plotting against the time points, you can visualize thesamples in low-dimensional spaces.
tab_feat_obs<-res_count$metafeature_aggregatecolnames(tab_feat_obs)[2]<-'studyid'tab_feat_obs<- merge(tab_feat_obs,metauni)reshape_feat_obs<- reshape(tab_feat_obs,idvar=c("studyid","timepoint") ,v.names=c("value"),timevar="PC",direction="wide")colnames(reshape_feat_obs)<- sub(".*value[.]","", colnames(reshape_feat_obs))colnames(reshape_feat_obs)#> [1] "studyid" "timepoint" "delivery" "PC1" "PC2"p_aggfeat_scatter<- ggplot(data=reshape_feat_obs, aes(x=PC1,y=PC2,color=timepoint,shape=delivery))+ geom_point()+ scale_color_gradient(low="#2b83ba",high="#d7191c")+ labs(x='Component 1',y='Component 2',color='Day')p_aggfeat_scatter
# subsetting timepoint between 3 months and 1 yearp_aggfeat_scatter2<- ggplot(data=dplyr::filter(reshape_feat_obs,timepoint<365&timepoint>30), aes(x=PC1,y=PC2,color=delivery))+ geom_point()+ labs(x='Component 1',y='Component 2',color='Delivery Mode')p_aggfeat_scatter2
IMPORTANT NOTE: Different form the count data,pseudo = NULL
isused so that 1/2 of the smallest non-zero value is added to each sample.This pseudo count is added fortransform=c("logcomp", "clr", "logit", "lfb")
.
proportion_table<-count_table/rowSums(count_table)res_proportion<- tempted_all(proportion_table,meta_table$day_of_life,meta_table$studyid,threshold=0.95,transform="clr",pseudo=NULL,r=2,smooth=1e-5)#> Calculate the 1th Component#> Convergence reached at dif=3.57400414642375e-05, iter=4#> Calculate the 2th Component#> Convergence reached at dif=4.7458092689688e-05, iter=7
A_data<-metaunirownames(A_data)<-A_data$studyidA_data<- cbind(res_proportion$A_hat[rownames(A_data),],A_data)p_subj<- ggplot(data=A_data, aes(x=A_data[,1],y=A_data[,2],color=delivery))+ geom_point()+ labs(x='Component 1',y='Component 2',title='subject loading') print(p_subj)
p_time<- plot_time_loading(res_proportion,r=2)+ geom_line(size=1.5)+ labs(title='temporal loadings',x='days')print(p_time)
pfeature<- ggplot(as.data.frame(res_proportion$B_hat), aes(x=PC1,y=PC2))+ geom_point()+ labs(x='Component 1',y='Component 2',title='feature loading')print(pfeature)
group<- unique(meta_table[,c("studyid","delivery")])plot_metafeature(res_proportion$metafeature_ratio,group)+ xlab("Days of Life")#> Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 |
group<- unique(meta_table[,c("studyid","delivery")])plot_metafeature(res_proportion$metafeature_aggregate,group)+ xlab("Days of Life")#> Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 |
This is an alternative way to visualize the subject trajectories.Instead of plotting against the time points, you can visualize thesamples in low-dimensional spaces.
tab_feat_obs<-res_proportion$metafeature_aggregatecolnames(tab_feat_obs)[2]<-'studyid'tab_feat_obs<- merge(tab_feat_obs,metauni)reshape_feat_obs<- reshape(tab_feat_obs,idvar=c("studyid","timepoint") ,v.names=c("value"),timevar="PC",direction="wide")colnames(reshape_feat_obs)<- sub(".*value[.]","", colnames(reshape_feat_obs))colnames(reshape_feat_obs)#> [1] "studyid" "timepoint" "delivery" "PC1" "PC2"p_aggfeat_scatter<- ggplot(data=reshape_feat_obs, aes(x=PC1,y=PC2,color=timepoint,shape=delivery))+ geom_point()+ scale_color_gradient(low="#2b83ba",high="#d7191c")+ labs(x='Component 1',y='Component 2',color='Day')p_aggfeat_scatter
# subsetting timepoint between 3 months and 1 yearp_aggfeat_scatter2<- ggplot(data=dplyr::filter(reshape_feat_obs,timepoint<365&timepoint>30), aes(x=PC1,y=PC2,color=delivery))+ geom_point()+ labs(x='Component 1',y='Component 2',color='Delivery Mode')p_aggfeat_scatter2
IMPORTANT NOTE: Different form the microbiome data, no features aregoing to be filtered out by settingthreshold=1
, no transformation isperformed by settingtransform="none"
, and no log ratio is calculatedby settingdo_ratio=FALSE
.
res_processed<- tempted_all(processed_table,meta_table$day_of_life,meta_table$studyid,threshold=1,transform="none",r=2,smooth=1e-5,do_ratio=FALSE)#> Calculate the 1th Component#> Convergence reached at dif=3.8157412154411e-05, iter=4#> Calculate the 2th Component#> Convergence reached at dif=2.44419512850139e-05, iter=7
A_data<-metaunirownames(A_data)<-A_data$studyidA_data<- cbind(res_processed$A_hat[rownames(A_data),],A_data)p_subj<- ggplot(data=A_data, aes(x=A_data[,1],y=A_data[,2],color=delivery))+ geom_point()+ labs(x='Component 1',y='Component 2',title='subject loading') print(p_subj)
p_time<- plot_time_loading(res_processed,r=2)+ geom_line(size=1.5)+ labs(title='temporal loadings',x='days')print(p_time)
pfeature<- ggplot(as.data.frame(res_processed$B_hat), aes(x=PC1,y=PC2))+ geom_point()+ labs(x='Component 1',y='Component 2',title='feature loading')print(pfeature)
group<- unique(meta_table[,c("studyid","delivery")])plot_metafeature(res_processed$metafeature_aggregate,group)+ xlab("Days of Life")#> Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 | Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 |
tab_feat_obs<-res_processed$metafeature_aggregatecolnames(tab_feat_obs)[2]<-'studyid'tab_feat_obs<- merge(tab_feat_obs,metauni)reshape_feat_obs<- reshape(tab_feat_obs,idvar=c("studyid","timepoint") ,v.names=c("value"),timevar="PC",direction="wide")colnames(reshape_feat_obs)<- sub(".*value[.]","", colnames(reshape_feat_obs))colnames(reshape_feat_obs)#> [1] "studyid" "timepoint" "delivery" "PC1" "PC2"p_aggfeat_scatter<- ggplot(data=reshape_feat_obs, aes(x=PC1,y=PC2,color=timepoint,shape=delivery))+ geom_point()+ scale_color_gradient(low="#2b83ba",high="#d7191c")+ labs(x='Component 1',y='Component 2',color='Day')p_aggfeat_scatter
# subsetting timepoint between 3 months and 1 yearp_aggfeat_scatter2<- ggplot(data=dplyr::filter(reshape_feat_obs,timepoint<365&timepoint>30), aes(x=PC1,y=PC2,color=delivery))+ geom_point()+ labs(x='Component 1',y='Component 2',color='Delivery Mode')p_aggfeat_scatter2
This is a breakdown of whattempted_all()
does in each function itwraps into.
The functionformat_tempted()
transforms the sample-by-feature tableand the corresponding time points and subject IDs into the list formatthat is accepted bytempted()
. It filters features with a lot ofzeros, perform transformation of the features, and add pseudo count fortransformations that cannot handle zeros. When a subject has multiplesamples from the same time point,format_tempted()
will only keep thefirst sample in the data input.
IMPORTANT NOTE: For read count table asfeature_table
, setpseudo=0.5
. For compositional/proportion data asfeature_table
,default setting is appropriate. For non-microbiome data, settransform="none"
andthreshold=1
.
# format the data frames into a list that can be used by TEMPTEDdatlist<- format_tempted(count_table,meta_table$day_of_life,meta_table$studyid,threshold=0.95,pseudo=0.5,transform="clr")length(datlist)#> [1] 42print(dim(datlist[[1]]))#> [1] 796 29
The functionsvd_centralize()
uses matrix SVD to fit a constanttrajectory (straight flat line) for all subject-feature pairs, andremove it from the data. This step is optional. If it is not used, werecommend the user to add 1 to the rank parameterr
intempted()
andin general the first component estimated bytempted()
will reflectthis constant trajectory. In this example, we usedr=2
. Optionsmooth
allows the user to choose the smoothness of the estimatedtemporal loading.
# centralize data using matrix SVD after logsvd_tempted<- svd_centralize(datlist)res_tempted<- tempted(svd_tempted$datlist,r=2,resolution=101,smooth=1e-5)#> Calculate the 1th Component#> Convergence reached at dif=3.531461408802e-05, iter=4#> Calculate the 2th Component#> Convergence reached at dif=4.37783267396658e-05, iter=7# alternatively, you can replace the two lines above by the two lines below.# the 2nd and 3rd component in the result below will be# similar to the 1st and 2nd component in the result above.#svd_tempted <- NULL#res_tempted <- tempted(datlist, r = 3, resolution = 101)
The output oftempted()
can be used in the same way as the output oftempted_all()
to plot temporal loadings, subject loadings, and featureloadings as in the previous sections of this document.
The feature loadings can be used to rank features. The abundance ratioof top ranking features over bottom ranking features corresponding toeach component can be a biologically meaningful marker. We provide an Rfunctionratio_feature()
to calculate such the abundance ratio. Theabundance ratio is stored in the output variablemetafeature_ratio
inthe output. An TRUE/FALSE vector indicating whether the feature is inthe top ranking or bottom ranking is stored in variabletoppct
andbottompct
, respectively. Below are trajectories of the aggregatedfeatures using observed data. Users can choose the percentage for thecutoff of top/bottom ranking features through optionpct
(by defaultpct=0.05
). By defaultabsolute=TRUE
means the features are chosen ifthey rank in the toppct
percentile in the absolute value of thefeature loadings, and abundance ratio is taken between the features withpositive loadings over negative loadings. Whenabsolute=FALSE
, thefeatures are chose if they rank in the toppct
percentile and haspositive loading or rank in the bottompct
percentile and has negativeloading. We also provide an optioncontrast
allowing users to rankfeatures using linear combinations of the feature loadings fromdifferent components.
datlist_raw<- format_tempted(count_table,meta_table$day_of_life,meta_table$studyid,threshold=0.95,transform='none')contrast<- cbind(c(1/2,1), c(1/2,-1))contrast#> [,1] [,2]#> [1,] 0.5 0.5#> [2,] 1.0 -1.0ratio_feat<- ratio_feature(res_tempted,datlist_raw,contrast=contrast,pct=0.1)tab_feat_obs<-ratio_feat$metafeature_ratiocolnames(tab_feat_obs)[2]<-'studyid'tab_feat_obs<- merge(tab_feat_obs,metauni)colnames(tab_feat_obs)#> [1] "studyid" "value" "timepoint" "PC" "delivery"p_feat_obs<- ggplot(data=tab_feat_obs, aes(x=timepoint,y=value,group=studyid,color=delivery))+ geom_line()+ facet_wrap(.~PC,scales="free",nrow=1)+ xlab("Days of Life")p_feat_obs_summary<- plot_metafeature(ratio_feat$metafeature_ratio,group,bws=30,nrow=1)+ xlab("Days of Life")grid.arrange(p_feat_obs,p_feat_obs_summary,nrow=2)
The feature loadings can be used as weights to aggregate features. Theaggregation can be done using the low-rank denoised data tensor, or theoriginal observed data tensor. We provide an R functionaggregate_feature()
to perform the aggregation. The aggregatedfeatures using low-rank denoised tensor is stored in variablemetafeature_aggregate_est
and the aggregated features using observeddata is stored in variablemetafeature_aggregate
. Below aretrajectories of the aggregated features using observed data. Only thefeatures with absolute loading in the top pct percentile (by default setto 100%, i.e. all features, through optionpct=1
) are used for theaggregation. We also provide an optioncontrast
allowing users toaggregate features using linear combinations of the feature loadingsfrom different components.
## observed, by individual subjectcontrast<- cbind(c(1/2,1), c(1/2,-1))agg_feat<- aggregate_feature(res_tempted,svd_tempted,datlist,contrast=contrast,pct=1)tab_feat_obs<-agg_feat$metafeature_aggregatecolnames(tab_feat_obs)[2]<-'studyid'tab_feat_obs<- merge(tab_feat_obs,metauni)colnames(tab_feat_obs)#> [1] "studyid" "value" "timepoint" "PC" "delivery"p_feat_obs<- ggplot(data=tab_feat_obs, aes(x=timepoint,y=value,group=studyid,color=delivery))+ geom_line()+ facet_wrap(.~PC,scales="free",nrow=1)+ xlab("Days of Life")p_feat_obs_summary<- plot_metafeature(agg_feat$metafeature_aggregate,group,bws=30,nrow=1)+ xlab("Days of Life")grid.arrange(p_feat_obs,p_feat_obs_summary,nrow=2)
The feature loadings can also be used to investigate the drivingfeatures behind each component. Here we focus on the 2nd component andprovide the trajectories of the top features using the estimated andobserved data tensor, respectively.
proportion_table<-count_table/rowSums(count_table)feat_names<- c("OTU4447072","OTU4467447")# individual samplestab_sel_feat<- cbind(proportion_table[,feat_names],meta_table)tab_sel_feat<- reshape(tab_sel_feat,varying=list(feat_names),times=feat_names,timevar="feature",v.names="RA",ids=rownames(tab_sel_feat),direction="long")p_topfeat<- ggplot(data=tab_sel_feat)+ geom_point(aes(x=day_of_life,y=RA,color=delivery))+ facet_wrap(vars(feature))+ labs(x="Day of Life",y="Relative Abundance")+ coord_trans(y="sqrt")# summary plotp_topfeat_summary<- plot_feature_summary(proportion_table[,feat_names],meta_table$day_of_life,meta_table$delivery,bws=30)+ labs(x="Day of Life",y="Relative Abundance")grid.arrange(p_topfeat,p_topfeat_summary,nrow=2)
Here we take thes subject withstudyid="2"
as the testing subject, andthe remaining subjects as training subjects.
id_test<-meta_table$studyid=="2"count_train<-count_table[!id_test,]meta_train<-meta_table[!id_test,]count_test<-count_table[id_test,]meta_test<-meta_table[id_test,]
datlist_train<- format_tempted(count_train,meta_train$day_of_life,meta_train$studyid,threshold=0.95,pseudo=0.5,transform="clr")mean_svd_train<- svd_centralize(datlist_train,r=1)res_tempted_train<- tempted(mean_svd_train$datlist,r=2,smooth=1e-5)#> Calculate the 1th Component#> Convergence reached at dif=3.89952383939211e-05, iter=4#> Calculate the 2th Component#> Convergence reached at dif=4.45458808530784e-05, iter=7
IMPORTANT NOTE: the testing samples should contain the features inthe training data.
# get the overlapping features between testing and trainingcount_test<-count_test[,rownames(datlist_train[[1]])[-1]]# format testing datadatlist_test<- format_tempted(count_test,meta_test$day_of_life,meta_test$studyid,threshold=1,pseudo=0.5,transform="clr")# estimate the subject loading of the testing subjectsub_test<- est_test_subject(datlist_test,res_tempted_train,mean_svd_train)sub_test#> PC1 PC2#> 2 -0.07967744 0.1688863
Here we use logistic regression to illustrate how the subject loadingsof the testing data can be used.
# train logistic regression classifier on training subjectsmetauni<- unique(meta_table[,c("studyid","delivery")])rownames(metauni)<-metauni$studyidAtrain<- as.data.frame(res_tempted_train$A_hat)Atrain$delivery<-metauni[rownames(Atrain),"delivery"]=="Cesarean"glm_train<- glm(delivery~PC1+PC2,data=Atrain,family=binomial(link="logit"))summary(glm_train)#>#> Call:#> glm(formula = delivery ~ PC1 + PC2, family = binomial(link = "logit"),#> data = Atrain)#>#> Coefficients:#> Estimate Std. Error z value Pr(>|z|)#> (Intercept) -2.965 1.652 -1.795 0.07266 .#> PC1 -11.334 9.129 -1.242 0.21440#> PC2 16.865 5.529 3.050 0.00229 **#> ---#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#>#> (Dispersion parameter for binomial family taken to be 1)#>#> Null deviance: 55.637 on 40 degrees of freedom#> Residual deviance: 34.562 on 38 degrees of freedom#> AIC: 40.562#>#> Number of Fisher Scoring iterations: 6# predict the label of testing subject "2", whose true label is "Cesarean"predict(glm_train,newdata=as.data.frame(sub_test),type="response")#> 2#> 0.6870014