# HIV-DRM-machine-learning

This repository contains the data and software used to run the analyses in the article "Using machine learning and Big data to explore the drug resistance landscape in HIV".
This is the main repository for this article.
It contains the pipelines used to process and generate data and results, the notebooks used to process the results and generate figures, as well as the publicly available data used in this study.
The processed results used to generate figures are also available in this repository.
The data used in the manuscript is included in this repository. More information is available here.
Several steps are needed to be able to run this pipeline.
For this pipeline you will need `python >= 3.6`, `snakemake >= 5.26.1` and the packages specified in `utils_hiv/requirements.txt`. To install the necessary packages in a conda virtual environment, run:
```
$ cd /path/to/this/directory
$ conda create -n pipelineDRMs python=3.7 snakemake">=5.26.1" -y
$ conda activate pipelineDRMs
$ pip install -e utils_hiv
```
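To confirm that the environment is set up correctly, a quick check like the following can be run inside the activated environment (this snippet is only a suggested sanity check, not part of the repository):

```python
# Optional sanity check (not part of the repository): run inside the activated
# pipelineDRMs environment to confirm the python and snakemake versions.
import sys
import snakemake

assert sys.version_info >= (3, 6), "python >= 3.6 is required"
print("python:", sys.version.split()[0], "| snakemake:", snakemake.__version__)
```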
For this pipeline to run you will need several alignments of HIV-1 pol RT sequences from at least 2 datasets: a training set and at least one testing set.
Each training/testing set is composed of a FASTA alignment of treatment-naive sequences and another of treatment-experienced sequences.
To get the position of each residue with respect to the reference HXB2 sequence, and to get a format suitable for encoding, you should upload each of your alignments (`trainNaive.fa`, `trainTreated.fa`, `testNaive.fa` and `testTreated.fa`) to Stanford's HIVdb program. For each uploaded alignment you will get the `PrettyRTAA.tsv` and `ResistanceSummary.tsv` files, which are needed for dataset encoding.
In our study, the training data corresponds to the UK dataset and the testing data corresponds to the African dataset.
This pipeline takes the files generated above by Stanford's HIVdb and encodes them in vector form.
The pipeline takes as input the directory where those files are stored; each dataset you want to encode must be in a separate subdirectory. The pipeline also needs a directory where the metadata files are located.
In our example we want to encode 2 datasets, a UK dataset and an African dataset, so our directory and files should look like this:
```
.
├── data_dir
│   ├── Africa
│   │   ├── PrettyRT_naive.tsv
│   │   ├── PrettyRT_treated.tsv
│   │   ├── ResistanceSummary_naive.tsv
│   │   └── ResistanceSummary_treated.tsv
│   └── UK
│       ├── PrettyRT_naive.tsv
│       ├── PrettyRT_treated.tsv
│       ├── ResistanceSummary_naive.tsv
│       └── ResistanceSummary_treated.tsv
└── metadata_dir
    ├── Africa-metadata.tsv
    └── UK-metadata.tsv
```
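As a rough illustration, assuming exactly the layout and file names shown above, a small helper like the one below could be used to check that every expected HIVdb output file is present before launching the pipeline (this helper is not part of the repository):

```python
from pathlib import Path

# Hypothetical helper (not part of the repository): checks that each dataset
# subdirectory contains the four HIVdb output files the preprocessing
# pipeline expects, using the layout shown above.
def check_dataset(data_dir, name):
    expected = [
        "PrettyRT_naive.tsv",
        "PrettyRT_treated.tsv",
        "ResistanceSummary_naive.tsv",
        "ResistanceSummary_treated.tsv",
    ]
    missing = [f for f in expected if not (Path(data_dir) / name / f).is_file()]
    if missing:
        raise FileNotFoundError(f"{name}: missing {missing}")

for dataset in ["UK", "Africa"]:
    check_dataset("data_dir", dataset)
```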
The pipeline looks as follows, with the `process_data` rule encoding sequences as binary vectors of mutation presence/absence, and the `homogenize_data` rule making sure all encoded datasets have the same set of features, so that classifiers trained on one dataset can predict labels for another.
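The following minimal sketch illustrates the idea behind these two rules; it is not the actual `utils_hiv` implementation, and the mutation feature names are only illustrative examples:

```python
import pandas as pd

# Minimal sketch of the idea behind the process_data and homogenize_data rules
# (not the actual utils_hiv implementation): sequences are encoded as binary
# vectors of mutation presence/absence, then all datasets are reindexed on a
# common feature set so that a model trained on one can score the other.
# Feature names below are illustrative only.
train = pd.DataFrame({"RT_M184V": [1, 0], "RT_K103N": [0, 1]}, index=["seq1", "seq2"])
test = pd.DataFrame({"RT_M184V": [1], "RT_T215Y": [1]}, index=["seq3"])

# homogenize: take the union of features and mark missing mutations as absent (0)
all_features = sorted(set(train.columns) | set(test.columns))
train = train.reindex(columns=all_features, fill_value=0)
test = test.reindex(columns=all_features, fill_value=0)
```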
This pipeline trains the classifiers on a training set and gets predictions on a testing set. The inputs are specified in the `config.yaml` configuration file. The input data is the data generated by the preprocessing pipeline.
The pipeline takes as input an encoded training and testing set (i.e. the UK dataset) and any number of external testing sets (i.e. the African dataset).
All configuration options are listed and described in the configuration file `config.yaml`, which must be given to the pipeline.
The following figure shows a run of our pipeline, for which we specified the following options (a hypothetical configuration excerpt is sketched after the list):
- we want 3 models trained: Random Forest (RF), Naive Bayes (Bayes) and Logistic regression (Logistic)
- we want 3 training sessions:
  - training on B subtype of the training set and testing on C subtype of the training set
  - training on C subtype of the training set and testing on B subtype of the training set
  - training on All subtypes of the training set and testing on All subtypes of the external testing set
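To make this concrete, here is a hypothetical configuration excerpt mirroring the run described above; the actual option names are documented in `config.yaml` itself and may differ:

```python
import yaml

# Hypothetical configuration excerpt (option names are assumptions, not the
# real keys of config.yaml) mirroring the run described above.
example_config = yaml.safe_load("""
models: [RF, Bayes, Logistic]
trainings:
  - {train_subtype: B, test_subtype: C}
  - {train_subtype: C, test_subtype: B}
  - {train_subtype: All, test_subtype: All, use_external_testing_set: true}
""")
print(example_config["models"])  # ['RF', 'Bayes', 'Logistic']
```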
In an actual run of this pipeline we might also want to increase the number of repeated training sessions for models that have a random aspect, such as Random Forests.
To execute the pipeline, run the following steps:
```
$ conda activate pipelineDRMs
$ snakemake \
    --snakefile=Snakefile_main.smk \
    --configfile=path/to/config.yml \
    --keep-going \
    --jobs [nb. of cores/threads to use]
```
To execute this pipeline in a SLURM cluster environment (fill out the partition name, account name and QOS accordingly):
```
$ module load [modules]  # (i.e. conda, python, ...)
$ conda activate pipelineDRMs
$ snakemake \
    --snakefile=Snakefile_main.smk \
    --configfile=path/to/config.yml \
    --keep-going \
    --cluster "sbatch -c {threads} -o {params.logs}/{params.name}.log -e {params.logs}/{params.name}.log --mem {params.mem} -p [partition name] --qos=[qos name] -A [account name] -J {params.name}" \
    --jobs [nb. of cores/threads to use]
```
For more information on pipeline execution in HPC cluster environments, see the snakemake documentation.
The results from the main pipeline can then be processed by the `gather_results.py` script. This script takes as input the list of result directories that were created by the pipeline and outputs a concatenated tab-delimited file with all predictions, as well as a concatenated tab-delimited file containing the importances/weights assigned by all trained models to the dataset features. This script can also be used to concatenate the results of several runs of the main pipeline. These files can then be used for interpretation and figure generation (several examples are in the `notebooks` directory).
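The sketch below shows the general shape of such a gathering step, assuming each result directory contains a per-run `predictions.tsv` file (an assumed file name, not necessarily the script's real output); the actual logic lives in `gather_results.py`:

```python
import sys
from pathlib import Path
import pandas as pd

# Rough sketch of a results-gathering step (the real logic lives in
# gather_results.py): concatenate the per-run prediction tables from every
# result directory passed on the command line into one tab-delimited file.
# The file name "predictions.tsv" is an assumption for illustration.
result_dirs = [Path(d) for d in sys.argv[1:]]
predictions = pd.concat(
    [pd.read_csv(d / "predictions.tsv", sep="\t") for d in result_dirs],
    keys=[d.name for d in result_dirs],
    names=["run", "row"],
)
predictions.to_csv("all_predictions.tsv", sep="\t")
```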