- Notifications
You must be signed in to change notification settings - Fork0
Code for "Conditional prediction of consecutive tumor evolution using cancer progression models: What genotype comes next?", J.Diaz-Colunga, R.Diaz-Uriarte:https://doi.org/10.1371/journal.pcbi.1009055
License
rdiaz02/what_genotype_next
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Code and data for "Conditional prediction of consecutive tumor evolution using cancer progression models: What genotype comes next?"
Published in PLoS Computational Biology,https://doi.org/10.1371/journal.pcbi.1009055
Here we provide an overview of the main scripts and directories. Otherfiles and directories will be mentioned, as needed, below. Where deemed ofused for readers, we also provide .Rout files. RData and rds files areprovided when their size permits.
Executing the following scripts in the indicated order and with defaultoptions creates a ./data directory within the current one where transitionmatrices extracted from both the OncoSimulR simulations and the CPMs arestored. It also generates a table.RData object containing comparisonsacross said matrices.
oncoFunctions.R - Contains functions called by many other scripts and some tests.
next-genotype_transitionMatrix-sim.R - Extracts transition matrices directly from simulations. By default, matrices are stored in the ./sim directory.
next-genotype_transitionMatrix-cpm.R - Extracts transition matrices from the outputs of the CPMs. By default, matrices are stored in the ./cpm directory.
next-genotype_transitionMatrix-timeDiscretizedCPMs.R - Works just like the previos script, but for Schill's MHN and the time-discretized versions of CBN_ot and MCCBN. Output matrices are also stored under the ./cpm directory by default.
structData.R - Takes the matrices under the ./sim and ./cpm directories (by default, but different directories can be provided) and groups them into nested lists stored in .RData objects and saved in the ./data directory (by default).
makeTable.R - Reads the content of the ./data directory (by default) and creates the table.rds file containing a data frame with the comparative statistics across transition matrices.
oncoPlots.R - Sourcing this file provides a set of functions that canbe used to produce plots taking the table generated by makeTable asinput. Not used.
merge-additional-info.R - Run aftermakeTable.R. Merges additional info and outputs table-replicates.rds(approx. 20 GB file).
average-and-array-statistics.R - Run aftermerge-additional-info.R. Computes per-arraystatistics (with different weighting schemes) and averages overreplicates, genotypes, and number of mutations, appropriately.
makeTableObservedFreqs.R - Can be run at any time, but necessarilybefore merge-additional-info.R. Returns observed frequencies of genotypesin actual samples used.
makeTableRanksFitness.R - Can be run at any time, but necessarilybefore merge-additional-info.R. Returns fitness ranks of genotypes infitness landscapes.
makeTableLocalMax.R - Can be run at any time, but necessarilybefore merge-additional-info.R. Returns frequency with which each genotypeis a local maximum in simulations.
makeTableWeights.R - Can be run at any time, but necessarilybefore merge-additional-info.R. Returns the true frequency with which agenotype is the most abundant genotype during the simulations, as well asthe frequency of each genotype in the complete 20000 samples. These twoare the two sets of weights used.
makeTable-to-diff-wrt-null-2.R - Computes difference in JS between thenull model and the method output.
paired-wide-data.R - Created wide (or paired-wide) data frames, so findingout best method or performing paired tests is easier.
create_rank_data.R - Rank methods and obtain the best method in eachcase. Requires wide data. Comparisons made at the number mutations, array,and single genotype (per replicate) case.
create_wide_g_long_g.R - Create minimal, concise and with factors,data sets for the glmertree models.
data_for_weighted_glmertree_plots.R - (under plots-glms) Create dataset to produce plots from lmertree fits. This produces""data_for_weighted_glmertree_plots.RData".
merge-compsfixNulls-wide_g.R andexplore-reduce-method-comp.R:comparisons among methods: merging with wide_g data for lmertreefits. For some tables below. (E.g., in file when_good_conditions_0677.R)
method-comp-biol-data.R Comparison of predictions between methods onthe biological data, under /run-biol-examples.
./plot-glms Directory with code for fitting the lmertrees and plottingthe results. Additionally, also scripts for producing tables inSupplementary material of performance when similar CE and TD(when_good_conditions_0677.R)
./run-biol-examples Code and data to run the CPM analyses and obtain thetransition matrices on the cancer data sets.
./biol-data-plots Plots in manuscript and supplementary material for theoutput from the analyses on the cancer data sets.
./Schill-MHN Code from Schill et al., for MHN plus additional code forus to obtain transition matrices. Since there is no license in theoriginal repository (https://github.com/RudiSchill/MHN) we have notprovided it. The code should be left under Shcill-MHN/MHN (withoutcreating a new subdirectory). A README.txt further explains the files andfunctions under this directory.
CBN-with-bug-fix-and-likelihood A directory that contains thecompressed directory of H-CBN with a couple of bug fixes and enhancementsby RDU, as explained in the Supplementary Material. The original code fromhttps://bsse.ethz.ch/cbg/software/ct-cbn.html is under the GNU GPL (v. 3),the same used here. The authors of the code are NikoBeerenwinkel, Moritz Gerstung, and Seth Sullivant.
Fitnesslandscape-characteristics Code for plots of fitness landscapescharacteristics. One of them is used in the Supplementary Material(gamma-rsign-obs-peaks-FL.pdf)
The time discretized CBN and MCCBN transition matrices as well as MHNsare needed for analyses below. (Before the next-genotype-... arerun). Run Time-discretized-CBN-MCCBN/time-discretized-CBN.R andSchill-MHN/run-Schill-MHN-trans-mat.R
At any time, but before running merge-additional-info.R, run each ofmakeTableObservedFreqs.R, makeTableRanksFitness.R, makeTableLocalMax.R,and makeTableWeights.R.
Output:
makeTableObservedFreqs.R: outObsFreqs.RData (~ 95 MB, but fast script)
makeTableRanksFitness.R: outRanksFitness.RData. (~ 43 MB, butfast script)
makeTableWeights.R: weightsAll.RData [plus intermediate RData:outfrerqsSampled and outFrerqsTrue]. (~ 17 MB; takes about 6 h)
makeTableLocalMax.R: outLocalMax.RData (~ 560 KB; takes about 7 h)
Run all three scripts next-genotype_transitionMatrix-*
Run structData
Run makeTable.R. The -scratch flag is needed the first time this scriptruns, then once the temporary file makeTable_tmp.rds file has beencreated, formatting changes can be made and implemented by runningmakeTable with no flags. If any upstream changes are made, a new-scratch run will be needed to incorporate them.
Run merge-additional-file-info.R. Input:pre-table-replicates.RData (plus fl_sampl.RData and the output fromeach of the files run in step 0). Output: table-replicates.rds.
Run average-and-array-statistics.R. Input:table-replicates.rds. Output: FIXME: zz, spell out.
At this point, the generated tables can be used for downstream analyses and plots.
Run makeTable-to-diff-wrt-null-2.R: Input:data_no_any_over_replicate.rds, array_statistics_over_replicate.rds,data_no_any.rds, num_mut_statistics_over_replicate.rds. Output:genotype_diff_null.RData, array_diff_null.RData,genotype_diff_null_no_average.RData, num_mut_diff_null.RData.
Run paired-data.R. Input: genotype_diff_null.RData,array_diff_null.RData, genotype_diff_null_no_average.RData,num_mut_diff_null.RData. Output: wide_array.RData, wide_genotype.RData,wide_genotype_no_average.RData, wide_num_mut.RData.
Run create_rank_data.R, then create wide_g_long_g.R. Note that weeliminate cases with sampledProp == 0 (create_wide_g_long_g.R), thenthose with sampledProp < 1e-3 (fit-lmtree scripts andmodelse-beta-regression.R), as well as the last genotype ---7mutations or 10 mutations, for landscapes with 7 and 10 loci,respectively--- (fit-lmtree scripts and modelse-beta-regression.R).
Run the lmertree fits.In subdirectory plots-glms/fits-Weighted-01-sqrt. A launch.sh will launch thelmertree fits. Once done, plotting with plotting-lmertree-fits-weighted-01-sqrt.R
Analyze the cancer data sets. Under ./run-biol-examples
Launch analyses with CPMs. See details under README.txt. Filelaunch-run-rerun.sh launches all R processes.
Once analyses with CPMs are completed, compare output:method-comp-biol-data.R
Run the two files under ./biol-data-plots to generate the plots forthe cancer data sets (main manuscript and supplementary material).
For questions or comments, please contactJ. Diaz-Colunga orR. Diaz-Uriarte.
About
Code for "Conditional prediction of consecutive tumor evolution using cancer progression models: What genotype comes next?", J.Diaz-Colunga, R.Diaz-Uriarte:https://doi.org/10.1371/journal.pcbi.1009055