- Notifications
You must be signed in to change notification settings - Fork7
A pipeline for identifying indel derived neoantigens using RNA-Seq data
License
ylab-hi/ScanNeo
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
🔥NEW🔥: ScanNeo supports both human and mouse genome.
A pipeline for identifying indel derived neoantigens using RNA-Seq data
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
You need Python 3.7 or above to run ScanNeo.
Installanaconda (python 3.7) firstly.
Make sure you have added Anacondabin directory to the $PATH environment variable in the file ~/.bashrc.Otherwise, please add it the the $PATH by editing .bashrc file, manually.
export PATH="/home/usr/Python3/bin:$PATH"
You can print the current value of the PATH variable by echoing the PATH environment variable:
echo $PATH
Then, install dependent packages via conda in bioconda channel.
conda install -c bioconda optitype # this step will automatically install razers3, pyomo and glpkconda install -c bioconda ensembl-vepconda install -c bioconda sambambaconda install -c bioconda bedtoolsconda install -c bioconda picardconda install -c bioconda bwaconda install -c bioconda yaraconda install -c bioconda razers3conda install -c bioconda tabixconda install -c conda-forge coincbcconda install -c conda-forge pyomoconda install -c bioconda pyfaidxconda install -c conda-forge pyvcfconda install -c anaconda hdf5
The installed executables will be available when you type their names in the Linux Shell, such asvep_install,sambamba,bwa andpicard.
git clone https://github.com/cauyrd/transIndel
PlacetransIndel_build_RNA.py andtransIndel_call.py in your folder in PATH.
Download the archives forHLA class I and unpack them
wget -c https://downloads.iedb.org/tools/mhci/3.1/IEDB_MHC_I-3.1.tar.gztar -zxvf IEDB_MHC_I-3.1.tar.gzcd mhc_i./configure
ScanNeo is only compatible with IEDB 3.1 and above now.
tcsh andgawk are needed for IEDB binding prediction tools. You have to install them if they are not available.
Install them on Debian/Ubuntu/Mint Linux.$ sudo apt-get install tcsh gawk
Install them on CentOS/RHEL.# yum install tcsh gawk
Or install them via condaconda install -c conda-forge tcsh gawk
ForHuman genome
vep_install -a cf -s homo_sapiens -y GRCh38 --CONVERT # hg38vep_install -a cf -s homo_sapiens -y GRCh37 --CONVERT # hg19
ForMouse genome
vep_install -a cf -s mus_musculus -y GRCm39 --CONVERT # mm39vep_install -a cf -s mus_musculus -y GRCm38 --CONVERT # mm10
VEP annotations will be installed at ~/.vep be default.
git clone https://github.com/ylab-hi/ScanNeo.gitcd VEP_pluginscp Downstream.pm ~/.vep/Pluginscp Wildtype.pm ~/.vep/Plugins
Make modification onOptiTypePipeline.pyChange from
this_dir=os.path.realpath(os.path.join(os.getcwd(),os.path.dirname(os.path.realpath(__file__))))
to
this_dir=os.path.dirname(os.path.realpath(__file__))
[mapping]# Absolute path to RazerS3 binary, and number of threads to use for mappingrazers3=/path/to/anaconda3/bin/razers3threads=16[ilp]# A Pyomo-supported ILP solver. The solver must be globally accessible in the# environment OptiType is run, so make sure to include it in PATH.# Note: this is NOT a path to the solver binary, but a keyword argument for# Pyomo. Examples: glpk, cplex, cbc.# Optitype uses glpk by default, but we recommand use cbc insteadsolver=cbcthreads=1
Index the HLA reference genome hla_reference_rna.fasta from Optitypepath/to/anaconda3/bin/data/hla_reference_rna.fasta
yara_indexer hla_reference_rna.fasta -o hla.index
By executing this command, it will generate 12 files, namely,
- hla.index.lf.drp
- hla.index.lf.drv
- hla.index.rid.concat
- hla.index.sa.ind
- hla.index.sa.val
- hla.index.txt.limits
- hla.index.lf.drs
- hla.index.lf.pst
- hla.index.rid.limits
- hla.index.sa.len
- hla.index.txt.concat
- hla.index.txt.size
[fasta]# reference genome file in FASTA formathg38=/path/to/hg38.fahg19=/path/to/hg19.famm39=/path/to/mm39.famm10=/path/to/mm10.fa[annotation]# gene annotation file in GTF formathg38=/path/to/gencode.v21.annotation.gtfhg19=/path/to/gencode.v19.annotation.gtfmm39=/path/to/gencode.vM27.annotation.gtfmm10=/path/to/gencode.vM23.annotation.gtf[yara]# yara index for hla_reference_rna.fastaindex=/path/to/hla.index# the number of threads to usethreads=8
ScanNeo.py indel -i rnaseq_bam -r hg38
-h, --help show this help message and exit-i INPUT, --input INPUT RNA-seq alignment file (BAM)--mapq Remove reads with MAPQ = 0-r {hg19,hg38,mm39,mm10}, --ref {hg19,hg38,mm39,mm10} reference genome (default: hg38)
input_bam_file :input RNA-seq BAM file. (e.g., rna-seq.bam)reference_genome :specify reference genome (hg19 or hg38)
your_output_bam_file:BAM file for CIGAR string redefinement. (rna-seq.indel.bam)vcf_file:Reported Indels with VCF format. (rna-seq.indel.vcf)
ScanNeo.py anno -i input_vcf_file -o output_annotated_vcf_file [options]
-h, --help show this help message and exit-i INPUT, --input INPUT input VCF file-c CUTOFF, --cutoff CUTOFF MAF cutoff default: 0.01-s, --slippage Keep putative PCR slippage derived indels-d DIR, --dir DIR Specify the VEP cache/plugin directory to use. (default: $HOME/.vep/)-r {hg19,hg38,mm39,mm10}, --ref {hg19,hg38,mm39,mm10} reference genome (default: hg38)-o OUTPUT, --output OUTPUT output annotated and filtered vcf file (default: output.vcf)
input_vcf_file :input VCF file is produced by ScanNeo indelcutoff :MAF cutoff according to 1000 genome project and gnomAD projectreference_genome :specify reference genome (hg19 or hg38)
output_annotated_vcf_file :VEP annotated Indels with VCF format
ScanNeo.py hla -i vep.vcf --alleles HLA-A*02:01,HLA-B*08:01,HLA-C*03:03 -e 8,9 -o output.tsv [options] # for humanScanNeo.py hla -i vep.vcf --alleles H-2-Dd,H-2-Kk,H-2-Ld -e 8,9 -o output.tsv [options] # for mouseScanNeo.py hla -i vep.vcf -b RNA_seq.bam -e 8,9 -o output.tsv [options]
-h, --help show this help message and exit-i VCF, --input VCF VEP annotated and filtered VCF--alleles ALLELES Name of the allele to use for epitope prediction. Multiple alleles can be specified using a comma- separated listinput HLA class I alleles-b BAM, --bam BAM Input RNA-Seq BAM file if you don't know sample HLA class I alleles. Works for human only!-l LENGTH, --length LENGTH Length of the peptide sequence to use when creating the FASTA (default: 21)--binding BINDING_THRESHOLD binding threshold ic50 (default: 500 nM)--af AF_FIELD The field name for allele frequency in VCF (default: AB)-e EPITOPE_LENGTHS, --epitope-length EPITOPE_LENGTHS Length of subpeptides (neoepitopes) to predict. Multiple epitope lengths can be specified using a comma-separated list. Typical epitope lengths vary between 8-11. (default: 8,9,10,11)-p PATH_TO_IEDB, --path-to-iedb PATH_TO_IEDB Directory that contains the local installation of IEDB-m {lowest,median}, --metric {lowest,median} The ic50 scoring metric to use when filtering epitopes by binding-threshold lowest: Best MT Score - lowest MT ic50 binding score of all chosen prediction methods. median: Median MT Score - median MT ic50 binding score of all chosen prediction methods. (default: lowest)-o OUTPUT, --output OUTPUT output text file name, name.tsv
name.tsv file contains neoantigen prediction results
Report Columns
Column Name | Description |
---|---|
Chromosome | The chromosome of this variant |
Start | The start position of this variant in the zero-based, half-open coordinate system |
Stop | The stop position of this variant in the zero-based, half-open coordinate system |
Reference | The reference allele |
Variant | The alternate allele |
Transcript | The Ensembl ID of the affected transcript |
Ensembl Gene ID | The Ensembl ID of the affected gene |
Variant Type | The type of variant. missense for missense mutations, inframe_ins for inframe insertions, inframe_del for inframe deletions, and FS for frameshift variants |
Mutation | The amnio acid change of this mutation |
Protein Position | The protein position of the mutation |
Gene Name | The Ensembl gene name of the affected gene |
HLA Allele | The HLA allele for this prediction |
Peptide Length | The peptide length of the epitope |
Sub-peptide Position | The one-based position of the epitope in the protein sequence used to make the prediction |
Mutation Position | The one-based position of the start of the mutation in the epitope. 0 if the start of the mutation is before the epitope |
MT Epitope Seq | Mutant epitope sequence |
WT Epitope Seq | Wildtype (reference) epitope sequence at the same position in the full protein sequence. NA if there is no wildtype sequence at this position or if more than half of the amino acids of the mutant epitope are mutated |
Best MT Score Method | Prediction algorithm with the lowest mutant ic50 binding affinity for this epitope |
Best MT Score | Lowest ic50 binding affinity of all prediction algorithms used |
Corresponding WT Score | ic50 binding affinity of the wildtype epitope. NA if there is no WT Epitope Seq. |
Corresponding Fold Change | Corresponding WT Score / Best MT Score. NA if there is no WT Epitope Seq. |
Median MT Score | Median ic50 binding affinity of the mutant epitope of all prediction algorithms used |
Median WT Score | Median ic50 binding affinity of the wildtype epitope of all prediction algorithms used. NA if there is no WT Epitope Seq. |
Median Fold Change | Median WT Score / Median MT Score. NA if there is no WT Epitope Seq. |
Individual Prediction Algorithm WT and MT Scores (multiple) | ic50 scores for the MT Epitope Seq and WT Eptiope Seq for the individual prediction algorithms used |
Ranking Score | A useful metric for neoantigen prioritization. A high score suggests a high priority. |
Gene Ranking Score | The median value of ranking scores for neoantigens from the same gene. A high score suggests a high priority |
ScanNeo_utils.py RNA_seq.bam
This command will output the input BAM file name and the inferred HLA class I types in the screen.
RNA_seq.bam HLA-A*01:01,HLA-A*31:01,HLA-B*37:01,HLA-B*51:01,HLA-C*06:02,HLA-C*15:02
Generate an annotated fasta file from a VCF with protein sequences of mutations and matching wildtypes
ScanNeo.py fasta -i input.vcf -l 21 -o output.fasta
-h, --help show this help message and exit-i VCF, --input VCF A VEP-annotated single-sample VCF containing transcript, Wildtype protein sequence, and Downstream protein sequence information-l LENGTH, --length LENGTH Length of the peptide sequence to use when creating the FASTA (default: 21)-d DOWNSTREAM_LENGTH, --downstream_length DOWNSTREAM_LENGTH Cap to limit the downstream sequence length for frameshifts when creating the fasta file.Use 'full' to include the full downstream sequence. (Default: 1000)-o OUTPUT, --output OUTPUT output fasta file name (default: output.fasta)
output.fasta file contains predicted protein fasta sequences (including Wildtype and Mutant sequences)
This project is licensed underNPOSL-3.0.
Bug reports or feature requests can be submitted on theScanNeo Github page.
Please cite our paper atBioinformatics andSTAR Protocols.
About
A pipeline for identifying indel derived neoantigens using RNA-Seq data