clemgoub/TE-Aid

Annotation helper tool for the manual curation of transposable element consensus sequences

TE+Aid

TE-Aid is a shell+R program that helps with the manual curation of transposable elements (TEs). It takes a TE consensus sequence (fasta format) as input and requires a reference genome (also in fasta). Using R and the NCBI blast+ suite, TE-Aid produces 4 figures reporting:

(top left) the genomic hits with divergence to consensus
(top right) the genomic coverage of the consensus
(bottom left) a self dot-plot
(bottom right) a structure analysis including: TIR and LTR suggestions, open reading frames (ORFs) and TE protein hit annotation.

🗞️ TE-Aid is presented in "A beginner’s guide to manual curation of transposable elements" by Clement Goubert, Rory J. Craig, Agustin F. Bilat, Valentina Peona, Aaron A. Vogan & Anna V. Protasio, published in Mobile DNA (2022).

Pipeline overview:

The TE (ideally, a candidate consensus sequence) is searched against the provided reference genome with blastn
- Fig 1: genomic hits (horizontal lines) are represented relative to the query (TE consensus); the y axis represents the blastn divergence
- Fig 2: pileup of the genomic hits relative to position along the query (TE consensus)
The query is then blasted against itself in order to detect micro repeats and inversions (putative TIRs, LTRs)
- Fig 3: self dot-plot and Fig 4 (top): TIRs and LTRs are suggested (colored arrows)
- Bonus: a self dot-plot with emboss dotmatcher is also produced in an extra file
Putative ORFs are searched with emboss getorf and the peptides queried against a TE protein database (distributed with RepeatMasker)
- Fig 4: ORFs (black rectangles: + orientation; red rectangles: - orientation), TE protein hits

The consensus size, number of fragments (hits) and full length copies (according to a user-defined threshold) are automatically printed on the graph. If any ORFs and protein hits are found, their locations relative to the consensus are printed to stdout.

TE-Aid has been tested on MacOSX (shell, sh, zsh) and Linux (shell, sh).

Support: click the "Issues" tab on GitHub or email me.

TE-Aid comes from consensus2genome, which is now deprecated.

Version and branches

TE+Aid is fully open software and is being integrated into a growing number of projects (thank you! ❤️). To track project-specific modifications of the base code, I have created dedicated branches based on the pull requests of developers. Do not hesitate to check them out!

The main branch may not include all these modifications, but I am happy to consider any request to modify the main branch. If you think your changes should make it to the main branch but are only available in a parallel branch, please let me know, and when time allows, I'll be happy to review and merge!

Install

Dependencies

R (Rscript)
- Biostrings
- Rcpp (when using -r option)
NCBI Blast+ suite
EMBOSS getorf

TE-Aid calls NCBI blast and R from the command line with the blastn, blastp, makeblastdb and Rscript commands. All these executables must be accessible in the user path (usually the case following the default install). You can also set up a conda environment specifically for TE-Aid (see below). Otherwise, you need to locate the executables and add their folders to your local path before using TE-Aid. For instance:

export PATH="/path/to/blast/bins/folder/:$PATH"
export PATH="/path/to/R/bins/folder/:$PATH"

These lines can be added to the user's ~/.bashrc (Linux) or ~/.zshrc (macOS) to add these programs permanently to $PATH.
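As a quick sanity check before running TE-Aid, the sketch below reports which executables cannot be reached through $PATH. The function name check_tools is hypothetical (it is not part of TE-Aid); the tool names passed to it are the ones listed in the Dependencies section.

```shell
# Report which of the given executables are missing from PATH.
# Returns 0 if all are found, 1 if at least one is missing.
# Note: check_tools is an illustrative helper, not a TE-Aid command.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Executables named in the Dependencies section.
check_tools blastn blastp makeblastdb Rscript getorf \
    || echo "Install the missing tools or add their folders to PATH." >&2
```

If nothing is printed, all five executables were found. Any "missing:" line names a tool whose folder still needs to be added to $PATH, or that should be installed through the conda environment.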

Install TE-Aid from GitHub

git clone https://github.com/clemgoub/TE-Aid.git

Setting a conda environment with all dependencies

You can set up a conda environment for running TE-Aid after cloning the repository with the commands below (use mamba instead of conda because it is much faster):

cd TE-Aid
mamba env create -f TE_AID.yml

After that, you'll have all the dependencies ready once you activate the environment:

mamba activate TE_AID

Usage and options

Minimal command line

<user-path>/TE-Aid [-q|--query <query.TE.fa>] [-g|--genome <genome.fa>] [options]

Note: replace <user-path> with the path of the downloaded TE-Aid folder.

Mandatory arguments:

    -q, --query                   TE consensus (fasta file)
    -g, --genome                  Reference genome (fasta file)

Optional arguments:

    -h, --help                    show this help message and exit
    -o, --output                  output folder (default "./")
    -t, --tables                  write features coordinates in tables (self dot-plot, ORFs and protein hits coordinates)
    -T, --all-Tables              same as -t plus write the genomic blastn table.
                                  Warning: can be very large if your TE is highly repetitive!
    -r, --remove-redundant        remove redundant hits from genomic blastn table and a title of the first plot
    -e, --e-value                 genome blastn: e-value threshold to keep hit (default: 10e-8)
    -f, --full-length-threshold   genome blastn: min. proportion (hit_size)/(consensus_size) to be considered "full length" (0-1; default: 0.9)
    -m, --min-orf                 getorf: minimum ORF size (in bp)
    -R, --no-reverse-orfs         getorf: don't use ORFs in the reverse complement of your sequence
    -a, --alpha                   graphical: transparency value for blastn hit (0-1; default 0.3)
    -F, --full-length-alpha       graphical: transparency value for full-length blastn hits (0-1; default 1)
    -y, --auto-y                  graphical: manual override for y lims (default: TRUE; otherwise: -y NUM)
    -D, --emboss-dotmatcher       produce a dotplot with EMBOSS dotmatcher
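To illustrate the proportion used by -f, here is a minimal sketch that counts hits qualifying as "full length", assuming the test is (hit_size)/(consensus_size) >= threshold. The function count_full_length and its input format (one hit length per line on stdin) are assumptions made for illustration; TE-Aid computes this in its own R scripts and may differ in detail, for instance in how a hit exactly at the threshold is treated.

```shell
# Count hits whose length is at least `threshold` times the consensus length.
# Usage: count_full_length <consensus_size> [threshold]   (hit lengths on stdin)
# Note: illustrative helper only, not part of TE-Aid.
count_full_length() {
    consensus_size=$1
    threshold=${2:-0.9}   # same default as the -f option
    awk -v cons="$consensus_size" -v thr="$threshold" '
        $1 / cons >= thr { n++ }   # hit covers enough of the consensus
        END { print n + 0 }        # n + 0 prints 0 when nothing matched
    '
}

# Four hypothetical hits against a 1000 bp consensus: 950 and 1000 qualify.
printf '950\n400\n1000\n120\n' | count_full_length 1000 0.9
# prints 2
```

Lowering the threshold (for example 0.4) would also count the 400 bp hit, which is the effect of passing a smaller value to -f.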

Tutorial

In this example we are going to analyze some transposable elements of Drosophila melanogaster. The consensus sequences for this tutorial are located in the Example/ folder, and you will need to download the D. melanogaster reference genome (dm6). Let's go!

1. Download the D. melanogaster genome

curl -o Example/dm6.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz
gunzip Example/dm6.fa.gz

A couple of D. melanogaster TE consensus sequences are present in the Example/ folder.
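Before moving on, it can help to confirm that a file looks like fasta. The sketch below counts header lines (lines starting with ">"). count_fasta_records is a hypothetical helper, not part of TE-Aid, and the tiny demo file only stands in for a real input such as Example/dm6.fa.

```shell
# Print the number of fasta records (header lines starting with ">") in a file.
# Note: illustrative helper only, not part of TE-Aid.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Demo on a small temporary fasta file with two records.
demo=$(mktemp)
printf '>seq1\nACGTACGT\n>seq2\nGGCC\n' > "$demo"
count_fasta_records "$demo"
# prints 2
rm -f "$demo"
```

On the tutorial data, the same call would be count_fasta_records Example/dm6.fa for the genome, or count_fasta_records Example/Jockey_DM.fasta for a consensus. A count of 0 suggests the download or decompression did not produce a fasta file.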

2. Analyze the TE consensus

Let's start with Jockey, a recent LINE element in the D. melanogaster genome

./TE-Aid -q Example/Jockey_DM.fasta -g Example/dm6.fa -o ../dm6example

Next is Gypsy-2, from the LTR lineage

./TE-Aid -q Example/Gypsy2_DM.fasta -g Example/dm6.fa -o ../dm6example
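When several consensus sequences need to be analyzed, the two commands above generalize to a loop. The sketch below only prints one TE-Aid command per .fasta file so that it can be inspected first. The function te_aid_commands is hypothetical and uses only the -q, -g and -o options shown in this tutorial.

```shell
# Print one TE-Aid command per *.fasta file found in a folder (dry run).
# Usage: te_aid_commands <query_dir> <genome.fa> <output_dir>
# Note: illustrative helper only, not part of TE-Aid.
te_aid_commands() {
    query_dir=$1
    genome=$2
    out_dir=$3
    for query in "$query_dir"/*.fasta; do
        [ -e "$query" ] || continue   # skip when the glob matched nothing
        echo "./TE-Aid -q $query -g $genome -o $out_dir"
    done
}

# With the tutorial layout, this lists one command per consensus in Example/.
te_aid_commands Example Example/dm6.fa ../dm6example
```

Piping the output to sh would run the commands one after another. This simple form assumes paths without spaces.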

About

doi.org/10.1186/s13100-021-00259-7

Release: v1.0.0 (latest), Mar 5, 2025
