
Introduction to TEMPTED

This is a vignette for the R package tempted, which implements the statistical method TEMPoral TEnsor Decomposition (TEMPTED). The goal of TEMPTED is to perform dimensionality reduction for multivariate longitudinal data, with special attention to longitudinal microbiome studies.

Package dependencies: R (>= 4.2.0), np (>= 0.60-17), ggplot2 (>= 3.4.0), methods (>= 4.2.1).

Run time for all the example code in this demo was within one minute on a MacBook Pro with a 2.3 GHz Intel Core i7 processor. You may expect a longer run time if you have more subjects or more features.

You can cite this paper for using TEMPTED:

Shi P, Martino C, Han R, Janssen S, Buck G, Serrano M, Owzar K, Knight R, Shenhav L, Zhang AR. Time-Informed Dimensionality Reduction for Longitudinal Microbiome Studies. bioRxiv.

The statistical theories behind TEMPTED can be found in this paper:

Han R, Shi P, Zhang AR. Guaranteed Functional Tensor Singular Value Decomposition. Journal of the American Statistical Association (2023): 1-13.

TEMPTED is also implemented in Python through the Python package gemelli and as a plugin of Qiime2. It can be installed through pip install gemelli; see the documentation for gemelli.

Installation

You can install TEMPTED directly from CRAN with:

install.packages("tempted")

Typical installation time is within a few minutes on a standard desktop.

You can install the development version of TEMPTED from GitHub with:

# install.packages("devtools")
devtools::install_github("pixushi/tempted")

or download the tarball to your folder of interest and install using

install.packages("folder_of_interest/tempted_0.1.0.tar.gz",repos=NULL,type="source")

Load packages for this vignette

library(gridExtra)
library(ggplot2)   # attached explicitly for the ggplot() calls below
library(tempted)

Read the example data

The example dataset is originally from Bokulich, Nicholas A., et al. (2016). We provide three data objects:

  • meta_table: A data.frame with rows representing samples, matching with data.frames count_table and processed_table, and with three columns:

    • studyid: character denoting the subject ID of the infants.
    • delivery: character denoting the delivery mode of the infants.
    • day_of_life: character denoting the age of the infants, measured in days, when the microbiome sample was taken.

  • count_table: A data.frame with rows representing samples and matching with data.frame meta_table, and columns representing microbial features (i.e. OTUs). Each entry is a read count.

  • processed_table: A data.frame with rows representing samples and matching with data.frame meta_table, and columns representing microbial features (i.e. ASVs, genes). Entries do not need to be transformed and will be used directly by tempted(). This data.frame is used to illustrate how tempted() can be used for a general form of multivariate longitudinal data that has already been preprocessed by the user.

# match rows of different data frames
# check number of samples
table(rownames(count_table)==rownames(meta_table))
#>
#> TRUE
#>  852
metauni <- unique(meta_table[,c('studyid','delivery')])

Running TEMPTED for different formats of data

We provide example code for the following scenarios:

  1. Run TEMPTED for Microbiome Count Data (Straightforward Way)
  2. Run TEMPTED for Microbiome Compositional Data (Straightforward Way)
  3. Run TEMPTED for General Form of Multivariate Longitudinal Data (Straightforward Way)
  4. Run TEMPTED in Customized Way
  5. Transferring TEMPTED result from training to testing data

Run TEMPTED for Microbiome Count Data (Straightforward Way)

Run TEMPTED

A complete description of all parameters can be found in the pdf manual. Here we explain the key parameters:

  • feature_table: A sample by feature matrix.
  • time_point: The time stamp of each sample, matched with the rows of feature_table.
  • subjectID: The subject ID of each sample, matched with the rows of feature_table.
  • threshold: A threshold for feature filtering for microbiome data. Features with zero value percentage >= threshold will be excluded. Default is 0.95.
  • smooth: Smoothing parameter for the RKHS norm. A larger value means smoother temporal loading functions. Default is 1e-8. The value can be adjusted depending on the dataset by checking the smoothness of the estimated temporal loading functions in the plot.
  • transform: The transformation applied to the data. “logcomp” for log of compositions. “comp” for compositions. “ast” for arcsine squared transformation. “clr” for central log ratio transformation. “logit” for logit transformation. “lfb” for log fold over baseline. “none” for no transformation. The default transform=“clr” is recommended for microbiome data. For data that are already transformed, use transform=“none”. (A sketch of the “clr” option is given after this list.)
  • r: Number of components to decompose into, i.e. the rank of the CP-type decomposition. Default is 3.
  • pct_ratio: The percent of features to sum up for taking log ratios. Default is 0.05, i.e. 5%.
  • absolute: absolute = TRUE means features are ranked by the absolute value of the feature loadings, and the top pct_ratio percent of features are picked. absolute = FALSE means features are ranked by the original value of the feature loadings, and the top and bottom pct_ratio percent of features are picked. The ratio is then taken as the abundance of the features with positive loadings over the abundance of the features with negative loadings. Input for ratio_feature().
  • pct_aggregate: The percent of features to aggregate, where features are ranked by the absolute value of the feature loading of each component. Default is 1, which means 100% of features are aggregated. Setting pct_aggregate = 0.01 means the top 1% of features is aggregated. Input for aggregate_feature().
  • pseudo: A small number to add to all the counts before normalizing into proportions and log transformation. Default is 1/2 of the smallest non-zero value, specific to each sample. This pseudo count is added for transform=c("logcomp", "clr", "logit", "lfb").
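
To make the transform and pseudo options concrete, here is a minimal base-R sketch of a “clr”-style transform with the default pseudo count for a single hypothetical sample. The exact formula used by the package is documented in the manual, so treat this as an illustration rather than the package implementation:

# a sketch of the "clr" transform for one hypothetical sample's counts
x <- c(10, 0, 5, 85)               # hypothetical read counts for one sample
x <- x + min(x[x > 0]) / 2         # pseudo count: half the smallest non-zero value
p <- x / sum(x)                    # normalize into proportions (compositions)
clr_x <- log(p) - mean(log(p))     # centered log-ratio of the sample
clr_x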

IMPORTANT NOTE: In matrix singular value decomposition, the sign of subject scores and feature loadings can be flipped together. Similarly, you can flip the signs of any pair of subject loadings, feature loadings, and temporal loadings.

res_count <- tempted_all(count_table,
                         meta_table$day_of_life,
                         meta_table$studyid,
                         threshold=0.95,
                         transform="clr",
                         pseudo=0.5,
                         r=2,
                         smooth=1e-5,
                         pct_ratio=0.1,
                         pct_aggregate=1)
#> Calculate the 1th Component
#> Convergence reached at dif=3.57400400849311e-05, iter=4
#> Calculate the 2th Component
#> Convergence reached at dif=4.745809269163e-05, iter=7
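
As a concrete illustration of the sign-flip note above, the sketch below flips the sign of component 1 in the subject and feature loadings of res_count. Because exactly two of the three loading vectors are flipped together, the implied fit is unchanged. This is a sketch on a copy of the result, not part of the package workflow:

# flip the signs of component 1 for subject and feature loadings together;
# flipping any pair of loading vectors leaves the fitted tensor unchanged
res_flip <- res_count
res_flip$A_hat[, 1] <- -res_flip$A_hat[, 1]   # subject loadings
res_flip$B_hat[, 1] <- -res_flip$B_hat[, 1]   # feature loadings
# Phi_hat (temporal loadings) is left untouched in this particular flip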

Low-dimensional representation of subjects

Subject loadings are stored in the variable A_hat of the tempted_all() output.

A_data <- metauni
rownames(A_data) <- A_data$studyid
A_data <- cbind(res_count$A_hat[rownames(A_data),], A_data)
p_subj <- ggplot(data=A_data, aes(x=A_data[,1], y=A_data[,2], color=delivery)) +
  geom_point() +
  labs(x='Component 1', y='Component 2', title='subject loading')
print(p_subj)

Plot the temporal loadings

Temporal loadings are stored in the variable Phi_hat of the tempted_all() output. We provide an R function plot_time_loading() to plot these curves. Option r lets you decide how many components to plot.

p_time <- plot_time_loading(res_count, r=2) +
  geom_line(size=1.5) +
  labs(title='temporal loadings', x='days')
print(p_time)

Plot the feature loadings

Feature loadings are stored in the variable B_hat of the tempted_all() output.

p_feature <- ggplot(as.data.frame(res_count$B_hat), aes(x=PC1, y=PC2)) +
  geom_point() +
  labs(x='Component 1', y='Component 2', title='feature loading')
print(p_feature)

Plot log ratio of top features

The log ratios are stored in the variable metafeature_ratio of the tempted_all() output.

group <- unique(meta_table[,c("studyid","delivery")])
plot_metafeature(res_count$metafeature_ratio, group) + xlab("Days of Life")

Plot subject trajectories

The subject trajectories are stored in metafeature_aggregate of the tempted_all() output.

group <- unique(meta_table[,c("studyid","delivery")])
plot_metafeature(res_count$metafeature_aggregate, group) + xlab("Days of Life")

Low-dimensional representation of samples

This is an alternative way to visualize the subject trajectories. Instead of plotting against the time points, you can visualize the samples in low-dimensional spaces.

tab_feat_obs <- res_count$metafeature_aggregate
colnames(tab_feat_obs)[2] <- 'studyid'
tab_feat_obs <- merge(tab_feat_obs, metauni)
reshape_feat_obs <- reshape(tab_feat_obs,
                            idvar=c("studyid","timepoint"),
                            v.names=c("value"),
                            timevar="PC",
                            direction="wide")
colnames(reshape_feat_obs) <- sub(".*value[.]", "", colnames(reshape_feat_obs))
colnames(reshape_feat_obs)
#> [1] "studyid"   "timepoint" "delivery"  "PC1"       "PC2"
p_aggfeat_scatter <- ggplot(data=reshape_feat_obs,
                            aes(x=PC1, y=PC2, color=timepoint, shape=delivery)) +
  geom_point() +
  scale_color_gradient(low="#2b83ba", high="#d7191c") +
  labs(x='Component 1', y='Component 2', color='Day')
p_aggfeat_scatter

# subsetting time points between 1 month and 1 year
p_aggfeat_scatter2 <- ggplot(data=dplyr::filter(reshape_feat_obs, timepoint<365 & timepoint>30),
                             aes(x=PC1, y=PC2, color=delivery)) +
  geom_point() +
  labs(x='Component 1', y='Component 2', color='Delivery Mode')
p_aggfeat_scatter2

Run TEMPTED for Microbiome Compositional Data (Straightforward Way)

Run TEMPTED

IMPORTANT NOTE: Different from the count data, pseudo = NULL is used so that 1/2 of the smallest non-zero value is added to each sample. This pseudo count is added for transform=c("logcomp", "clr", "logit", "lfb").

proportion_table <- count_table/rowSums(count_table)
res_proportion <- tempted_all(proportion_table,
                              meta_table$day_of_life,
                              meta_table$studyid,
                              threshold=0.95,
                              transform="clr",
                              pseudo=NULL,
                              r=2,
                              smooth=1e-5)
#> Calculate the 1th Component
#> Convergence reached at dif=3.57400414642375e-05, iter=4
#> Calculate the 2th Component
#> Convergence reached at dif=4.7458092689688e-05, iter=7

Low-dimensional representation of subjects

A_data <- metauni
rownames(A_data) <- A_data$studyid
A_data <- cbind(res_proportion$A_hat[rownames(A_data),], A_data)
p_subj <- ggplot(data=A_data, aes(x=A_data[,1], y=A_data[,2], color=delivery)) +
  geom_point() +
  labs(x='Component 1', y='Component 2', title='subject loading')
print(p_subj)

Plot the temporal loadings

p_time <- plot_time_loading(res_proportion, r=2) +
  geom_line(size=1.5) +
  labs(title='temporal loadings', x='days')
print(p_time)

Plot the feature loadings

p_feature <- ggplot(as.data.frame(res_proportion$B_hat), aes(x=PC1, y=PC2)) +
  geom_point() +
  labs(x='Component 1', y='Component 2', title='feature loading')
print(p_feature)

Plot log ratio of top features

group <- unique(meta_table[,c("studyid","delivery")])
plot_metafeature(res_proportion$metafeature_ratio, group) + xlab("Days of Life")

Plot subject trajectories

group <- unique(meta_table[,c("studyid","delivery")])
plot_metafeature(res_proportion$metafeature_aggregate, group) + xlab("Days of Life")

Low-dimensional representation of samples

This is an alternative way to visualize the subject trajectories. Instead of plotting against the time points, you can visualize the samples in low-dimensional spaces.

tab_feat_obs <- res_proportion$metafeature_aggregate
colnames(tab_feat_obs)[2] <- 'studyid'
tab_feat_obs <- merge(tab_feat_obs, metauni)
reshape_feat_obs <- reshape(tab_feat_obs,
                            idvar=c("studyid","timepoint"),
                            v.names=c("value"),
                            timevar="PC",
                            direction="wide")
colnames(reshape_feat_obs) <- sub(".*value[.]", "", colnames(reshape_feat_obs))
colnames(reshape_feat_obs)
#> [1] "studyid"   "timepoint" "delivery"  "PC1"       "PC2"
p_aggfeat_scatter <- ggplot(data=reshape_feat_obs,
                            aes(x=PC1, y=PC2, color=timepoint, shape=delivery)) +
  geom_point() +
  scale_color_gradient(low="#2b83ba", high="#d7191c") +
  labs(x='Component 1', y='Component 2', color='Day')
p_aggfeat_scatter

# subsetting time points between 1 month and 1 year
p_aggfeat_scatter2 <- ggplot(data=dplyr::filter(reshape_feat_obs, timepoint<365 & timepoint>30),
                             aes(x=PC1, y=PC2, color=delivery)) +
  geom_point() +
  labs(x='Component 1', y='Component 2', color='Delivery Mode')
p_aggfeat_scatter2

Run TEMPTED for General Form of Multivariate Longitudinal Data (Straightforward Way)

Run TEMPTED

IMPORTANT NOTE: Different from the microbiome data, no features are filtered out by setting threshold=1, no transformation is performed by setting transform="none", and no log ratio is calculated by setting do_ratio=FALSE.

res_processed <- tempted_all(processed_table,
                             meta_table$day_of_life,
                             meta_table$studyid,
                             threshold=1,
                             transform="none",
                             r=2,
                             smooth=1e-5,
                             do_ratio=FALSE)
#> Calculate the 1th Component
#> Convergence reached at dif=3.8157412154411e-05, iter=4
#> Calculate the 2th Component
#> Convergence reached at dif=2.44419512850139e-05, iter=7

Low-dimensional representation of subjects

A_data <- metauni
rownames(A_data) <- A_data$studyid
A_data <- cbind(res_processed$A_hat[rownames(A_data),], A_data)
p_subj <- ggplot(data=A_data, aes(x=A_data[,1], y=A_data[,2], color=delivery)) +
  geom_point() +
  labs(x='Component 1', y='Component 2', title='subject loading')
print(p_subj)

Plot the temporal loadings

p_time <- plot_time_loading(res_processed, r=2) +
  geom_line(size=1.5) +
  labs(title='temporal loadings', x='days')
print(p_time)

Plot the feature loadings

p_feature <- ggplot(as.data.frame(res_processed$B_hat), aes(x=PC1, y=PC2)) +
  geom_point() +
  labs(x='Component 1', y='Component 2', title='feature loading')
print(p_feature)

Plot subject trajectories

group <- unique(meta_table[,c("studyid","delivery")])
plot_metafeature(res_processed$metafeature_aggregate, group) + xlab("Days of Life")

Low-dimensional representation of samples

tab_feat_obs <- res_processed$metafeature_aggregate
colnames(tab_feat_obs)[2] <- 'studyid'
tab_feat_obs <- merge(tab_feat_obs, metauni)
reshape_feat_obs <- reshape(tab_feat_obs,
                            idvar=c("studyid","timepoint"),
                            v.names=c("value"),
                            timevar="PC",
                            direction="wide")
colnames(reshape_feat_obs) <- sub(".*value[.]", "", colnames(reshape_feat_obs))
colnames(reshape_feat_obs)
#> [1] "studyid"   "timepoint" "delivery"  "PC1"       "PC2"
p_aggfeat_scatter <- ggplot(data=reshape_feat_obs,
                            aes(x=PC1, y=PC2, color=timepoint, shape=delivery)) +
  geom_point() +
  scale_color_gradient(low="#2b83ba", high="#d7191c") +
  labs(x='Component 1', y='Component 2', color='Day')
p_aggfeat_scatter

# subsetting time points between 1 month and 1 year
p_aggfeat_scatter2 <- ggplot(data=dplyr::filter(reshape_feat_obs, timepoint<365 & timepoint>30),
                             aes(x=PC1, y=PC2, color=delivery)) +
  geom_point() +
  labs(x='Component 1', y='Component 2', color='Delivery Mode')
p_aggfeat_scatter2

Run TEMPTED in Customized Way

This section breaks down tempted_all() into the individual functions it wraps; a rough sketch of the full pipeline is given below, and the subsections that follow walk through each step.
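
For orientation, the sketch below strings the wrapped functions together in the order they are covered. The argument values are copied from the worked example that follows, and the exact internal defaults of tempted_all() may differ:

# a roadmap of the customized workflow, spelled out in the subsections below
datlist     <- format_tempted(count_table, meta_table$day_of_life, meta_table$studyid,
                              threshold=0.95, pseudo=0.5, transform="clr")     # format & transform
datlist_raw <- format_tempted(count_table, meta_table$day_of_life, meta_table$studyid,
                              threshold=0.95, transform="none")                # untransformed copy for ratios
svd_tempted <- svd_centralize(datlist)                                         # remove constant trajectory
res_tempted <- tempted(svd_tempted$datlist, r=2, resolution=101, smooth=1e-5)  # tensor decomposition
ratio_feat  <- ratio_feature(res_tempted, datlist_raw, pct=0.1)                # log ratio of top/bottom features
agg_feat    <- aggregate_feature(res_tempted, svd_tempted, datlist, pct=1)     # subject trajectories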

Format and Transform Data

The function format_tempted() transforms the sample-by-feature table and the corresponding time points and subject IDs into the list format that is accepted by tempted(). It filters features with a lot of zeros, performs transformation of the features, and adds a pseudo count for transformations that cannot handle zeros. When a subject has multiple samples from the same time point, format_tempted() will only keep the first sample in the data input.

IMPORTANT NOTE: For a read count table as feature_table, set pseudo=0.5. For compositional/proportion data as feature_table, the default setting is appropriate. For non-microbiome data, set transform="none" and threshold=1.

# format the data frames into a list that can be used by TEMPTED
datlist <- format_tempted(count_table,
                          meta_table$day_of_life,
                          meta_table$studyid,
                          threshold=0.95,
                          pseudo=0.5,
                          transform="clr")
length(datlist)
#> [1] 42
print(dim(datlist[[1]]))
#> [1] 796  29
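
Following the note above, a sketch of the corresponding call for general (non-microbiome) data such as processed_table; the object name datlist_general is ours, chosen for illustration:

# general multivariate longitudinal data: keep all features, no transformation
datlist_general <- format_tempted(processed_table,
                                  meta_table$day_of_life,
                                  meta_table$studyid,
                                  threshold=1,
                                  transform="none")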

Run TEMPTED

The function svd_centralize() uses matrix SVD to fit a constant trajectory (a straight flat line) for all subject-feature pairs and removes it from the data. This step is optional. If it is not used, we recommend adding 1 to the rank parameter r in tempted(); in general, the first component estimated by tempted() will then reflect this constant trajectory. In this example, we use r=2. Option smooth allows the user to choose the smoothness of the estimated temporal loading.

# centralize data using matrix SVD after log
svd_tempted <- svd_centralize(datlist)
res_tempted <- tempted(svd_tempted$datlist, r=2, resolution=101, smooth=1e-5)
#> Calculate the 1th Component
#> Convergence reached at dif=3.531461408802e-05, iter=4
#> Calculate the 2th Component
#> Convergence reached at dif=4.37783267396658e-05, iter=7
# alternatively, you can replace the two lines above by the two lines below.
# the 2nd and 3rd component in the result below will be
# similar to the 1st and 2nd component in the result above.
#svd_tempted <- NULL
#res_tempted <- tempted(datlist, r=3, resolution=101)

The output of tempted() can be used in the same way as the output of tempted_all() to plot temporal loadings, subject loadings, and feature loadings, as in the previous sections of this document.
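
For example, assuming res_tempted stores its loadings under the same names (A_hat, B_hat, Phi_hat) and with subject IDs as row names of A_hat, as the tempted_all() output does, the earlier plotting code carries over with only the object name changed (a sketch, not package output):

# temporal loadings of the customized run
p_time_custom <- plot_time_loading(res_tempted, r=2) +
  geom_line(size=1.5) +
  labs(title='temporal loadings', x='days')
print(p_time_custom)

# subject loadings of the customized run, colored by delivery mode
A_data <- metauni
rownames(A_data) <- A_data$studyid
A_data <- cbind(res_tempted$A_hat[rownames(A_data),], A_data)
p_subj_custom <- ggplot(data=A_data, aes(x=A_data[,1], y=A_data[,2], color=delivery)) +
  geom_point() +
  labs(x='Component 1', y='Component 2', title='subject loading')
print(p_subj_custom)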

Log ratio of top/bottom feature

The feature loadings can be used to rank features. The abundance ratio of top-ranking features over bottom-ranking features corresponding to each component can be a biologically meaningful marker. We provide an R function ratio_feature() to calculate this abundance ratio. The abundance ratio is stored in the variable metafeature_ratio of the output. TRUE/FALSE vectors indicating whether a feature is top-ranking or bottom-ranking are stored in the variables toppct and bottompct, respectively. Below are trajectories of these abundance ratios computed from the observed data. Users can choose the percentage cutoff for top/bottom-ranking features through option pct (by default pct=0.05). By default absolute=TRUE, meaning features are chosen if they rank in the top pct percentile in absolute value of the feature loadings, and the abundance ratio is taken between the features with positive loadings and the features with negative loadings. When absolute=FALSE, features are chosen if they rank in the top pct percentile and have positive loadings, or rank in the bottom pct percentile and have negative loadings. We also provide an option contrast allowing users to rank features using linear combinations of the feature loadings from different components.

datlist_raw <- format_tempted(count_table,
                              meta_table$day_of_life,
                              meta_table$studyid,
                              threshold=0.95,
                              transform='none')
contrast <- cbind(c(1/2,1), c(1/2,-1))
contrast
#>      [,1] [,2]
#> [1,]  0.5  0.5
#> [2,]  1.0 -1.0
ratio_feat <- ratio_feature(res_tempted, datlist_raw, contrast=contrast, pct=0.1)
tab_feat_obs <- ratio_feat$metafeature_ratio
colnames(tab_feat_obs)[2] <- 'studyid'
tab_feat_obs <- merge(tab_feat_obs, metauni)
colnames(tab_feat_obs)
#> [1] "studyid"   "value"     "timepoint" "PC"        "delivery"
p_feat_obs <- ggplot(data=tab_feat_obs,
                     aes(x=timepoint, y=value, group=studyid, color=delivery)) +
  geom_line() +
  facet_wrap(.~PC, scales="free", nrow=1) +
  xlab("Days of Life")
p_feat_obs_summary <- plot_metafeature(ratio_feat$metafeature_ratio, group, bws=30, nrow=1) +
  xlab("Days of Life")
grid.arrange(p_feat_obs, p_feat_obs_summary, nrow=2)

Subject trajectories

The feature loadings can be used as weights to aggregate features. The aggregation can be done using the low-rank denoised data tensor or the original observed data tensor. We provide an R function aggregate_feature() to perform the aggregation. The aggregated features using the low-rank denoised tensor are stored in the variable metafeature_aggregate_est, and the aggregated features using the observed data are stored in the variable metafeature_aggregate. Below are trajectories of the aggregated features using observed data. Only the features with absolute loadings in the top pct percentile (by default 100%, i.e. all features, through option pct=1) are used for the aggregation. We also provide an option contrast allowing users to aggregate features using linear combinations of the feature loadings from different components.

## observed, by individual subject
contrast <- cbind(c(1/2,1), c(1/2,-1))
agg_feat <- aggregate_feature(res_tempted, svd_tempted, datlist, contrast=contrast, pct=1)
tab_feat_obs <- agg_feat$metafeature_aggregate
colnames(tab_feat_obs)[2] <- 'studyid'
tab_feat_obs <- merge(tab_feat_obs, metauni)
colnames(tab_feat_obs)
#> [1] "studyid"   "value"     "timepoint" "PC"        "delivery"
p_feat_obs <- ggplot(data=tab_feat_obs,
                     aes(x=timepoint, y=value, group=studyid, color=delivery)) +
  geom_line() +
  facet_wrap(.~PC, scales="free", nrow=1) +
  xlab("Days of Life")
p_feat_obs_summary <- plot_metafeature(agg_feat$metafeature_aggregate, group, bws=30, nrow=1) +
  xlab("Days of Life")
grid.arrange(p_feat_obs, p_feat_obs_summary, nrow=2)

Plot trajectory of top features

The feature loadings can also be used to investigate the driving features behind each component. Here we focus on the 2nd component and provide the trajectories of the top features using the estimated and observed data tensors, respectively.

proportion_table <- count_table/rowSums(count_table)
feat_names <- c("OTU4447072", "OTU4467447")
# individual samples
tab_sel_feat <- cbind(proportion_table[,feat_names], meta_table)
tab_sel_feat <- reshape(tab_sel_feat,
                        varying=list(feat_names),
                        times=feat_names,
                        timevar="feature",
                        v.names="RA",
                        ids=rownames(tab_sel_feat),
                        direction="long")
p_topfeat <- ggplot(data=tab_sel_feat) +
  geom_point(aes(x=day_of_life, y=RA, color=delivery)) +
  facet_wrap(vars(feature)) +
  labs(x="Day of Life", y="Relative Abundance") +
  coord_trans(y="sqrt")
# summary plot
p_topfeat_summary <- plot_feature_summary(proportion_table[,feat_names],
                                          meta_table$day_of_life,
                                          meta_table$delivery,
                                          bws=30) +
  labs(x="Day of Life", y="Relative Abundance")
grid.arrange(p_topfeat, p_topfeat_summary, nrow=2)

Transferring TEMPTED result from training to testing data

Split the example data into training and testing

Here we take the subject with studyid="2" as the testing subject and the remaining subjects as training subjects.

id_test <- meta_table$studyid=="2"
count_train <- count_table[!id_test,]
meta_train <- meta_table[!id_test,]
count_test <- count_table[id_test,]
meta_test <- meta_table[id_test,]

Run TEMPTED on training subjects

datlist_train <- format_tempted(count_train,
                                meta_train$day_of_life,
                                meta_train$studyid,
                                threshold=0.95,
                                pseudo=0.5,
                                transform="clr")
mean_svd_train <- svd_centralize(datlist_train, r=1)
res_tempted_train <- tempted(mean_svd_train$datlist, r=2, smooth=1e-5)
#> Calculate the 1th Component
#> Convergence reached at dif=3.89952383939211e-05, iter=4
#> Calculate the 2th Component
#> Convergence reached at dif=4.45458808530784e-05, iter=7

Obtain subject loading for testing subject

IMPORTANT NOTE: The testing samples should contain the features in the training data.

# get the overlapping features between testing and training
count_test <- count_test[, rownames(datlist_train[[1]])[-1]]
# format testing data
datlist_test <- format_tempted(count_test,
                               meta_test$day_of_life,
                               meta_test$studyid,
                               threshold=1,
                               pseudo=0.5,
                               transform="clr")
# estimate the subject loading of the testing subject
sub_test <- est_test_subject(datlist_test, res_tempted_train, mean_svd_train)
sub_test
#>           PC1       PC2
#> 2 -0.07967744 0.1688863

Here we use logistic regression to illustrate how the subject loadingsof the testing data can be used.

# train logistic regression classifier on training subjects
metauni <- unique(meta_table[,c("studyid","delivery")])
rownames(metauni) <- metauni$studyid
Atrain <- as.data.frame(res_tempted_train$A_hat)
Atrain$delivery <- metauni[rownames(Atrain),"delivery"]=="Cesarean"
glm_train <- glm(delivery~PC1+PC2, data=Atrain, family=binomial(link="logit"))
summary(glm_train)
#>
#> Call:
#> glm(formula = delivery ~ PC1 + PC2, family = binomial(link = "logit"),
#>     data = Atrain)
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)   -2.965      1.652  -1.795  0.07266 .
#> PC1          -11.334      9.129  -1.242  0.21440
#> PC2           16.865      5.529   3.050  0.00229 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#>     Null deviance: 55.637  on 40  degrees of freedom
#> Residual deviance: 34.562  on 38  degrees of freedom
#> AIC: 40.562
#>
#> Number of Fisher Scoring iterations: 6
# predict the label of testing subject "2", whose true label is "Cesarean"
predict(glm_train, newdata=as.data.frame(sub_test), type="response")
#>         2
#> 0.6870014
