WES HLA Typing based on multiple alternative tools
lkuchenb/MultiHLA
This workflow enables the concurrent analysis of WES or WGS data using several publicly available tools to derive HLA haplotypes from this type of data.
xHLA
Xie, C., Yeo, Z. X., Wong, M., Piper, J., Long, T., Kirkness, E. F., ... & Brady, C. (2017). Fast and accurate HLA typing from short-read next-generation sequence data with xHLA. Proceedings of the National Academy of Sciences, 114(30), 8059-8064.
The workflow maps the reads against hg38 without alt contigs using bwa mem, as instructed by the authors. The mapped reads are then sorted and indexed using samtools. The workflow utilizes the Docker image provided by the authors to perform the actual HLA typing.
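The mapping, sorting and indexing steps described above can be sketched as follows. This is a minimal illustration and not the workflow's actual rule: the function name, the file paths and the thread count are hypothetical, and it assumes that the reference FASTA has already been indexed with bwa index.

```python
import shlex


def xhla_mapping_commands(ref_fasta, fastq_r1, fastq_r2, out_bam, threads=4):
    """Build shell commands for mapping, sorting and indexing one FASTQ pair.

    Hypothetical helper for illustration only. It assumes `bwa index` was
    already run on `ref_fasta` (hg38 without alt contigs).
    """
    ref, r1, r2, bam = (
        shlex.quote(str(p)) for p in (ref_fasta, fastq_r1, fastq_r2, out_bam)
    )
    threads = int(threads)
    # Map the read pair and pipe the SAM stream directly into a
    # coordinate sort, writing a BAM file.
    map_and_sort = (
        f"bwa mem -t {threads} {ref} {r1} {r2} "
        f"| samtools sort -@ {threads} -o {bam} -"
    )
    # Index the sorted BAM so that downstream tools can access it by region.
    index = f"samtools index {bam}"
    return [map_and_sort, index]


if __name__ == "__main__":
    for command in xhla_mapping_commands(
        "ref/hg38.fa",              # hypothetical reference path
        "fastq/s_R1.fastq.gz",      # hypothetical FASTQ pair
        "fastq/s_R2.fastq.gz",
        "s.bam",
    ):
        print(command)
```

The sketch only prints the commands, which makes it easy to inspect them before running anything. Within MultiHLA itself these steps are handled by the workflow.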
HLA-VBSeq
Nariai, N., Kojima, K., Saito, S., Mimori, T., Sato, Y., Kawai, Y., ... & Nagasaki, M. (2015, December). HLA-VBSeq: accurate HLA typing at full resolution from whole-genome sequencing data. In BMC genomics (Vol. 16, No. S2, p. S7). BioMed Central.
Wang, Y. Y., Mimori, T., Khor, S. S., Gervais, O., Kawai, Y., Hitomi, Y., ... & Nagasaki, M. (2019). HLA-VBSeq v2: improved HLA calling accuracy with full-length Japanese class-I panel. Human Genome Variation, 6(1), 1-5.
The workflow maps the reads against hg19 without alt contigs. The authors' instructions merely state to "map against hg19" without any further specifics, but mapping against hg19 with alt contigs yielded very poor typing results with missing HLA class I genes. The workflow therefore uses hg19 without alt contigs.
HLA-VBSeq released two reference database versions:
- v1 database, based on IMGT/HLA database Release 3.15.0
- v2 database, based on IMGT/HLA database Release 3.31.0 and the Japanese HLA reference dataset
OptiType
Szolek, A., Schubert, B., Mohr, C., Sturm, M., Feldhahn, M., & Kohlbacher, O. (2014). OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics, 30(23), 3310-3316.
The workflow invokes the OptiType snakemake wrapper without prior filtering of reads.
HLA-LA
Dilthey, A. T., Mentzer, A. J., Carapito, R., Cutland, C., Cereb, N., Madhi, S. A., ... & Phillippy, A. M. (2019). HLA*LA - HLA typing from linearly projected graph alignments. Bioinformatics, 35(21), 4394-4396.
The workflow uses reads mapped against the human genome (hg38) without alt contigs as input for HLA-LA. A corresponding reference txt file for HLA-LA is part of this workflow repository. The preprocessed graph directory PRG_MHC_GRCh38_withIMGT can either be placed manually in typing/hla_la/hla_la.graphs/ or it will be downloaded and preprocessed automatically. The workflow uses the HLA-LA bioconda package for graph preprocessing and HLA typing.
arcasHLA
Orenbuch, R., Filip, I., Comito, D., Shaman, J., Pe’er, I., & Rabadan, R. (2020). arcasHLA: high-resolution HLA typing from RNAseq. Bioinformatics, 36(1), 33-40.
The workflow maps RNAseq reads against the human genome (hg38) without alt contigs using the STAR aligner with default parameters. It then invokes the 'extract' and 'genotype' subtools provided by arcasHLA.
Install snakemake
conda install -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
conda activate snakemake
Clone the MultiHLA repository
git clone https://github.com/lkuchenb/MultiHLA.git hla_typing
cd hla_typing
Put the input files in place
MultiHLA comes with a predefined folder structure:
dataset/
A dataset is defined as a set of samples. Place a TSV file here for every dataset with the following three named columns:
SampleName  FileNameR1                             FileNameR2
Donor1      SEQ_D1_DAT_01_S53_L001_R1_001.fastq.gz  SEQ_D1_DAT_01_S53_L001_R2_001.fastq.gz
Donor1      SEQ_D1_DAT_01_S53_L002_R1_001.fastq.gz  SEQ_D1_DAT_01_S53_L002_R2_001.fastq.gz
Donor2      SEQ_D2_DAT_01_S54_L001_R1_001.fastq.gz  SEQ_D2_DAT_01_S54_L001_R2_001.fastq.gz
Donor2      SEQ_D2_DAT_01_S54_L002_R1_001.fastq.gz  SEQ_D2_DAT_01_S54_L002_R2_001.fastq.gz
Donor3      SEQ_D3_DAT_01_S55_L001_R1_001.fastq.gz  SEQ_D3_DAT_01_S55_L001_R2_001.fastq.gz
Donor3      SEQ_D3_DAT_01_S55_L002_R1_001.fastq.gz  SEQ_D3_DAT_01_S55_L002_R2_001.fastq.gz
FASTQ files have to come in gzipped pairs and be named {prefix}_R[12]{suffix}.fastq.gz. A sample can be covered by an arbitrary number of FASTQ pairs (at least one).
fastq/
Place the FASTQ files as listed in your dataset sheet here.
ref/
Place or link the required human genome references here as described for each supported method. Any reference that is missing will be downloaded automatically.
trim/
This is an output folder. It will be filled with adapter trimmed versions of the provided FASTQ files.
typing/{method}/
This is an output folder. It will be filled with subfolders for each method.
workflow/
This folder contains the workflow code.
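Before starting a run, it can help to check that a dataset sheet follows the conventions described above. The sketch below reads a tab-separated sheet, checks each file name against the {prefix}_R[12]{suffix}.fastq.gz pattern and groups the FASTQ pairs by sample. The function names are hypothetical and not part of MultiHLA. The requirement that both files of a pair share the same prefix and suffix is an assumption made for this example, not something the workflow is known to enforce.

```python
import csv
import re
from collections import defaultdict

# Pattern for {prefix}_R[12]{suffix}.fastq.gz as described in the text.
FASTQ_NAME = re.compile(
    r"^(?P<prefix>.+)_R(?P<read>[12])(?P<suffix>.*)\.fastq\.gz$"
)


def check_pair(r1_name, r2_name):
    """Return True if the two names look like a matching R1/R2 pair.

    Assumption for this sketch: R1 and R2 must share prefix and suffix.
    """
    m1 = FASTQ_NAME.match(r1_name)
    m2 = FASTQ_NAME.match(r2_name)
    if not (m1 and m2):
        return False
    return (
        m1["read"] == "1"
        and m2["read"] == "2"
        and m1["prefix"] == m2["prefix"]
        and m1["suffix"] == m2["suffix"]
    )


def read_dataset_sheet(lines):
    """Group FASTQ pairs by sample from an iterable of TSV lines."""
    samples = defaultdict(list)
    for row in csv.DictReader(lines, delimiter="\t"):
        r1, r2 = row["FileNameR1"], row["FileNameR2"]
        if not check_pair(r1, r2):
            raise ValueError(f"Unexpected FASTQ pair: {r1!r}, {r2!r}")
        samples[row["SampleName"]].append((r1, r2))
    return dict(samples)


if __name__ == "__main__":
    sheet = [
        "SampleName\tFileNameR1\tFileNameR2",
        "Donor1\tSEQ_D1_DAT_01_S53_L001_R1_001.fastq.gz"
        "\tSEQ_D1_DAT_01_S53_L001_R2_001.fastq.gz",
        "Donor1\tSEQ_D1_DAT_01_S53_L002_R1_001.fastq.gz"
        "\tSEQ_D1_DAT_01_S53_L002_R2_001.fastq.gz",
    ]
    for sample, pairs in read_dataset_sheet(sheet).items():
        print(sample, len(pairs), "FASTQ pair(s)")
```

To check a real sheet, pass an open file handle instead of the list, for example `read_dataset_sheet(open(path, newline=""))`, where `path` points at your dataset TSV.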
Run the workflow
Invoke snakemake using:
snakemake --use-conda --use-singularity
This enables snakemake to automatically install dependencies into conda environments that are created on the fly, and also enables the container-based jobs to run. To process all samples of a dataset, for example the dataset dataset_1 described in datasets/dataset_1.tsv, use:
snakemake --use-conda --use-singularity typing/dataset_1.all.multihla
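The target name follows directly from the name of the dataset sheet, so the invocation can be derived programmatically when several datasets need to be processed. The following is a minimal sketch. The helper name is hypothetical, and the only convention it relies on is the typing/{dataset}.all.multihla pattern shown in the example above.

```python
from pathlib import Path


def snakemake_command(dataset_sheet):
    """Build the snakemake call that processes all samples of one dataset.

    Hypothetical helper: the dataset name is taken from the TSV file name.
    """
    name = Path(dataset_sheet).name
    if not name.endswith(".tsv"):
        raise ValueError(f"Expected a .tsv dataset sheet, got {name!r}")
    dataset = name[: -len(".tsv")]
    target = f"typing/{dataset}.all.multihla"
    return ["snakemake", "--use-conda", "--use-singularity", target]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(snakemake_command("datasets/dataset_1.tsv")))
```

The returned list can be passed to `subprocess.run(..., check=True)` from within the repository directory once the printed command looks right.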
Memory and run time requirements for each job are noted in their resources (mem_mb and time).