Daniel-Ze/NLRextract

Find NLR hmms in protein sequences.

NLRextract: search for NLR related hmms

NLRextract searches for NLR-related HMMs in multi-FASTA protein sequences. It uses HMMs from Pfam v32 and includes a search for the CC-domain-related MADA motif [1]. This tool is under development, so let me know if something is odd or can be improved. I tested the performance of NLRextract against NLR-parser [2] and NLRtracker [3] in a small article on my website: https://www.biotinkertech.eu/project_NLRextract.html

IMPORTANT!!
24.02.22 - Potentially wrong output of NLRextract using system supplied awk:

Filtering of hmmsearch / hmmscan output was so far done with the system-supplied awk. A standard installation of Ubuntu 20.04 ships mawk as its awk, and this version has problems with numbers in scientific notation. Therefore gawk=5.1.0 was added to environment.yml and the bash script 'NLRextract' was adjusted to use gawk for filtering results. If you have run NLRextract on a standard Ubuntu 20.04 installation, please rerun your analysis with the updated version!

IMPORTANT!!
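To illustrate the kind of E-value filtering the notice refers to, here is a minimal sketch. It is not the filter used inside NLRextract: the function name `filter_by_evalue`, the threshold, and the assumption that the E-value sits in column 5 of a whitespace-separated table with `#` comment lines are all illustrative.

```shell
#!/bin/sh
# Prefer gawk when it is installed; otherwise fall back to the system awk.
AWK=$(command -v gawk || command -v awk)

# filter_by_evalue THRESHOLD  (reads a table on stdin)
# Keeps non-comment rows whose 5th column (assumed to hold the E-value)
# is at or below THRESHOLD. Adding 0 forces a numeric comparison, so
# values such as 3.2e-07 are compared as numbers rather than as strings.
filter_by_evalue() {
    "$AWK" -v thr="$1" '!/^#/ && ($5 + 0) <= (thr + 0)'
}

# Made-up rows for demonstration only; prints the protA and protC rows.
printf '%s\n' \
    '# target acc query acc E-value' \
    'protA - NB-ARC - 1e-10' \
    'protB - NB-ARC - 0.5' \
    'protC - NB-ARC - 3.2e-07' | filter_by_evalue 1e-5
```

Comparing the row counts produced by `awk` and `gawk` on the same table is a quick way to check whether an older analysis may have been affected.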

What it does:

Search for NB-ARC, CC, TIR, RPW8 and LRR domains in the proteins
Create a Venn diagram from the domain information
Extract the overlaps of the different domain combinations
Extract sequences of proteins with the different domain combinations
Plot number of different domains, and number of different NLRs
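The overlap step can be pictured as a set intersection on lists of sequence names. The sketch below is only an illustration of that idea in plain shell (NLRextract itself does this in vennNLR.r); the function name `shared_ids` and the input file names are hypothetical.

```shell
#!/bin/sh
# shared_ids FILE_A FILE_B
# Prints the sequence names that occur in both lists, one per line, sorted.
# Each list is de-duplicated first, so a name printed twice by the combined
# stream must come from both files.
shared_ids() {
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -d
}

# Example call (hypothetical file names, one sequence name per line):
#   shared_ids nbarc_names.txt tir_names.txt > tir_nbarc_names.txt
```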

What it needs:

Please have a look at the environment.yml file for more information

(base) 💻 daniel:NLRextract $ cat environment.yml
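After creating the environment, it can be useful to confirm that the external tools are reachable. The following is a small sketch, not part of NLRextract; the function name `check_tools` and the wording of the error message are illustrative. The tool names in the example call are the ones mentioned in this README (Rscript, hmmsearch, bedtools, gawk).

```shell
#!/bin/sh
# check_tools TOOL...
# Reports whether each tool is on PATH and returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "[info] - Found $tool in your path."
        else
            # Illustrative message, not NLRextract's own output.
            echo "[error] $tool not found in your path." >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call, run inside the activated conda environment:
#   check_tools Rscript hmmsearch bedtools gawk
```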

What it includes:

NLRextract -> Script to run hmmsearch, hmmscan and plotNLR.r/vennNLR.r
vennNLR.r -> Create a Venn diagram from the domain information and extract the combinations
plotNLR.r -> Create barplots for the domains and the NLR proteins

How to install it:

Clone it

(base) 💻 daniel~ $ git clone https://github.com/Daniel-Ze/NLRextract.git

Put the containing folder in your $PATH
chmod a+x NLRextract
Edit line 3 in NLRextract if you chose to install it anywhere other than your home folder:

#!/bin/bash
########################################################### To get the script running
NLRextracthome=~/NLRextract # edit this line
########################################################### and put the folder in PATH
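One way to put the folder in your PATH is sketched below. It assumes the repository was cloned into your home folder, as in the clone step above; the helper name `prepend_to_path` is hypothetical. Adding the same lines to your shell start-up file would make the change permanent.

```shell
#!/bin/sh
# prepend_to_path DIR
# Puts DIR at the front of PATH unless it is already listed there.
prepend_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Assumption: the clone lives in $HOME/NLRextract.
prepend_to_path "$HOME/NLRextract"
```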

Install the environment:

(base) 💻 daniel~ $ cd NLRextract
(base) 💻 daniel:NLRextract $ mamba env create -f environment.yml

How to run it:

(base) 💻 daniel:NLRextract $ NLRextract
/Users/daniel/miniconda3/etc/profile.d/conda.sh exists.
[info] No conda environment name supplied. Defaulting to: NRLextract
[info] Activating conda environment NLRextract:
[info] - Found Rscript in your path.
[info] - Found hmmsearch in your path.
[info] - Found bedtools in your path.
[error] No protein file supplied.
NLRextract will run hmmersearch and search for the HMMs located in /Users/daniel/NLRextract/hmm/.
Make sure that you supply a protein multifasta file.
Usage: NLRextract -p protein.fa
-c number of CPUs to use (default: 1)
-p path to the protein multifasta (mandatory)
-s suffix for folder (default: random string)
-e conda environment (default: NLRextract)
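For several proteomes, the options from the usage message can be combined in a loop. The sketch below only prints the commands so they can be inspected first; the helper name `print_nlrextract_cmds` and the example file names are hypothetical, while `-p`, `-c` and `-s` are the flags listed in the usage message.

```shell
#!/bin/sh
# print_nlrextract_cmds CPUS FASTA...
# Prints one NLRextract command per protein FASTA file, using the file name
# without its last extension as the output folder suffix.
# Note: paths are printed unquoted, so avoid spaces in file names.
print_nlrextract_cmds() {
    cpus=$1
    shift
    for fasta in "$@"; do
        suffix=$(basename "$fasta")
        suffix=${suffix%.*}
        printf 'NLRextract -p %s -c %s -s %s\n' "$fasta" "$cpus" "$suffix"
    done
}

# Example call (hypothetical inputs):
#   print_nlrextract_cmds 4 proteomes/TAIR10.fa proteomes/other.fa
```

Once the printed commands look right, they can be piped to `sh` to run them.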

Output:

The script generates several folders containing the output of the different tools used:

NLRextract_TAIR10_test/
├── [3.1K]  nlr.stderr   # Error reports of all steps
├── [204K]  clust/       # Alignment and phylogenetic tree data
├── [ 92K]  domain/      # Fasta sequences of the extracted domains
├── [1.1M]  fasta/       # Fasta sequences of the extracted NLR proteins
├── [1.2M]  gff/         # GFF files of domains and NLR proteins
├── [233M]  hmmer/       # HMMER results for the NLR HMM motifs
├── [ 59K]  name/        # Sequence names of NLR proteins and domains
├── [ 26K]  seqname/     # Sequence names of proteins with a NLR domain
└── [ 34K]  stats/       # Summary of findings plus plots
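A quick sanity check on the extracted sequences is to count the FASTA records per file. This is a minimal sketch, not part of NLRextract; the function name `count_fasta_records` is hypothetical, and since the file names inside `fasta/` are not listed here, the glob in the example call is an assumption.

```shell
#!/bin/sh
# count_fasta_records FILE...
# Prints "<file><TAB><count>" for each file, where count is the number of
# header lines (lines starting with ">").
count_fasta_records() {
    for fasta in "$@"; do
        # grep -c exits non-zero when the count is 0, hence "|| true".
        n=$(grep -c '^>' "$fasta" || true)
        printf '%s\t%s\n' "$fasta" "$n"
    done
}

# Example call (assumed layout):
#   count_fasta_records NLRextract_TAIR10_test/fasta/*
```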

Changes:

UPDATE 24.02.21

switched from awk to gawk=5.1.0

PREVIOUS UPDATES

Added usage info
Added suffix and number of CPU options
Adjusted plotNLR.r output
Added phylogenetic trees

This work is licensed under a Creative Commons Attribution 4.0 International License.

[1] Hiroaki Adachi, Mauricio P Contreras, Adeline Harant, Chih-hang Wu, Lida Derevnina, Toshiyuki Sakai, Cian Duggan, Eleonora Moratto, Tolga O Bozkurt, Abbas Maqbool, Joe Win, Sophien Kamoun, 2019, An N-terminal motif in NLR immune receptors is functionally conserved across distantly related plant species. eLife, 8:e49956. http://dx.doi.org/10.7554/eLife.49956
[2] Burkhard Steuernagel, Florian Jupe, Kamil Witek, Jonathan D.G. Jones, Brande B.H. Wulff, 2015, NLR-parser: rapid annotation of plant NLR complements. Bioinformatics, Vol. 31, Issue 10, Pages 1665–1667. https://doi.org/10.1093/bioinformatics/btv005
[3] Jiorgos Kourelis, Toshiyuki Sakai, Hiroaki Adachi, Sophien Kamoun, 2021, RefPlantNLR: a comprehensive collection of experimentally validated plant NLRs. bioRxiv. https://doi.org/10.1101/2020.07.08.193961
