Same species annotation lift over pipeline.

flo - basic gff annotations lift over using chain files

Lift over is a way of mapping annotations from one genome assembly to another. The idea of "lift over" is the same as what tools like UCSC LiftOver and NCBI's LiftUp web service do. However, NCBI's and UCSC's web services are available only for a limited number of species.

To perform lift over locally, one can use UCSC chain files (Kent et al. 2003) with programs such as UCSC's liftOver or CrossMap. A chain file captures large, homologous segments between two genomes as chains of gapless blocks of alignment. One way of generating chain files is using this bash script and UCSC tools.
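
To make the "chains of gapless blocks" idea concrete, here is a minimal Ruby sketch that reads chain-format text, assuming the layout described in the UCSC chain format documentation (a `chain` header line followed by `size dt dq` block lines, with the last block giving only `size`). The function name `parse_chains` and the example data are made up for illustration; this is not part of flo.

```ruby
# Parse UCSC chain-format text into an array of hashes.
# Header: chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
# Body:   "size dt dq" for each gapless block; the last block has only "size".
def parse_chains(text)
  chains = []
  text.each_line do |line|
    fields = line.split
    next if fields.empty?
    if fields.first == 'chain'
      chains << {
        score: fields[1].to_i,
        t_name: fields[2], t_start: fields[5].to_i, t_end: fields[6].to_i,
        q_name: fields[7], q_start: fields[10].to_i, q_end: fields[11].to_i,
        blocks: []
      }
    else
      # dt / dq are the gaps to the next block on each genome (absent on the last block).
      size, dt, dq = fields.map(&:to_i)
      chains.last[:blocks] << { size: size, dt: dt || 0, dq: dq || 0 }
    end
  end
  chains
end

# Made-up example data (not from a real alignment).
EXAMPLE_CHAIN = <<~CHAIN
  chain 1000 scaffold_1 500 + 10 125 contig_7 400 + 0 115 1
  50 5 0
  40 0 5
  20
CHAIN

chains = parse_chains(EXAMPLE_CHAIN)
aligned = chains.first[:blocks].sum { |b| b[:size] }
puts "#{chains.first[:t_name]} -> #{chains.first[:q_name]}: #{aligned} aligned bases"
```

Each block is an ungapped stretch of alignment; the gaps between blocks are what allow a single chain to span insertions and deletions between the two assemblies.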

flo is an implementation of the above script in the Ruby programming language. Further, both liftOver and CrossMap process GFF files line by line instead of transcripts as a whole. This results in some output that is not biologically meaningful. flo provides basic filtering of UCSC liftOver's GFF output.
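
As an illustration of why line-by-line lift over needs transcript-level filtering, the sketch below keeps a transcript only if all of its lifted child features ended up on the same sequence and strand as the transcript itself. This is one plausible filter written for this example, not necessarily the rule flo applies; the helper names and GFF lines are hypothetical.

```ruby
# Extract one attribute (e.g. ID, Parent) from column 9 of a GFF3 line.
def gff_attr(attributes, key)
  pair = attributes.split(';').map { |kv| kv.split('=', 2) }.find { |k, _| k == key }
  pair && pair[1]
end

# Return ids of transcripts whose lifted children all share the
# transcript's seqid (column 1) and strand (column 7).
def coherent_transcripts(gff_text)
  rows = gff_text.each_line.map(&:chomp)
                 .reject { |l| l.empty? || l.start_with?('#') }
                 .map { |l| l.split("\t") }
  parents  = rows.select { |r| gff_attr(r[8], 'Parent').nil? }
  children = rows.group_by { |r| gff_attr(r[8], 'Parent') }
  kept = parents.select do |t|
    kids = children.fetch(gff_attr(t[8], 'ID'), [])
    !kids.empty? && kids.all? { |k| k[0] == t[0] && k[6] == t[6] }
  end
  kept.map { |t| gff_attr(t[8], 'ID') }
end

# Made-up lifted output: one exon of tx2 landed on a different scaffold.
LIFTED = [
  %w[scaf1 flo mRNA 100 900 . + . ID=tx1],
  %w[scaf1 flo exon 100 400 . + . Parent=tx1],
  %w[scaf1 flo exon 600 900 . + . Parent=tx1],
  %w[scaf2 flo mRNA 50 700 . - . ID=tx2],
  %w[scaf2 flo exon 50 300 . - . Parent=tx2],
  %w[scaf9 flo exon 10 200 . + . Parent=tx2]
].map { |f| f.join("\t") }.join("\n")

kept = coherent_transcripts(LIFTED)
puts kept.inspect
```

A line-by-line tool would happily emit all six lines; looking at each transcript as a whole is what exposes tx2 as broken.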

We created flo for our work on the fire ant genome. If you use flo, please cite the following paper:

The fire ant social chromosome supergene variant Sb shows low diversity but high divergence from SB. 2017. R Pracana, A Priyam, I Levantis, Y Wurm. Molecular Ecology, doi: 10.1111/mec.14054.

Using flo | Results & discussion | Tweak flo

Using flo

To use flo you must have Ruby 2.0 or higher and the BioRuby gem. Ruby 2.0 can be installed through package managers on Linux and is available by default on Mac. To install the BioRuby gem:

sudo gem install bio

flo additionally requires a few programs from UCSC tools, GNU Parallel and genometools. These can be installed in any directory by running the 'scripts/install.sh' script after you have downloaded flo:

wget -c https://github.com/yeban/flo/archive/master.tar.gz -O flo.tar.gz
tar xvf flo.tar.gz
mv flo-master flo

It's best to run flo in a new directory - we will call it project dir:

mkdir flo_species_name
cd flo_species_name

Copy over the example configuration file from where you installed flo to the project dir:

cp /path/to/flo/opts_example.yaml flo_opts.yaml

Install flo's dependencies in the ext/ directory in the project dir:

/path/to/flo/scripts/install.sh

Now edit flo_opts.yaml to indicate:

Location of source and target assembly in FASTA format (required).
Location of GFF3 file(s) containing annotations on the source assembly. If this is omitted, flo will stop after generating the chain file.
BLAT parameters (optional). By default the target assembly is assumed to be of the same species. If the target assembly is a different (but closely related) species, you may want to lower minIdentity.
Number of CPU cores to use (required - not auto detected). This cannot be greater than the number of scaffolds in the target assembly.
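
Because the number of CPU cores is not auto detected and must not exceed the number of scaffolds in the target assembly, it can help to count the scaffolds before editing the configuration. The sketch below counts FASTA header lines and caps a requested value accordingly; `usable_processes` and the inline FASTA are hypothetical and only illustrate the constraint.

```ruby
require 'stringio'

# Count sequences in FASTA input by counting header lines.
def count_fasta_records(io)
  io.each_line.count { |line| line.start_with?('>') }
end

# Cap the requested number of processes at the number of target
# scaffolds, and never return less than 1.
def usable_processes(requested, target_io)
  [[requested, count_fasta_records(target_io)].min, 1].max
end

# Made-up target assembly with three scaffolds.
target = StringIO.new(">scaf1\nACGT\n>scaf2\nGGCC\n>scaf3\nTTAA\n")
n = usable_processes(8, target)
puts n
```

For a real assembly, pass an open file instead of the `StringIO`, for example `File.open('/path/to/target.fa')`.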

Here, it's important to note that flo can only work with transcripts and their child exons and CDS. Transcripts can be annotated as: mRNA, transcript, or gene. However, if you have a 'gene' annotation for each transcript, you will need to remove that:

/path/to/flo/gff_remove_feats.rb gene xx_genes.gff > xx_transcripts.gff

Alternatively, if you have more than one transcript annotated for each gene, you can select the longest transcript for each gene to work with:

/path/to/flo/gff_longest_transcripts.rb xx_genes.gff > xx_longest_transcripts.gff
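
To show what "longest transcript per gene" means on GFF3 data, here is a small standalone sketch. It assumes transcripts are typed mRNA or transcript, carry a Parent attribute pointing at their gene, and that "longest" means the largest genomic span (end - start + 1). The bundled gff_longest_transcripts.rb may define length differently; the names below are illustrative only.

```ruby
# Parse GFF3 column 9 into a hash of attribute => value.
def gff_attrs(col9)
  pairs = col9.split(';').map { |kv| kv.split('=', 2) }
  Hash[pairs.select { |kv| kv.size == 2 }]
end

# Return the id of the longest transcript for each gene, in input order.
def longest_transcript_ids(gff_text, types = %w[mRNA transcript])
  transcripts = gff_text.each_line.map { |l| l.chomp.split("\t") }
                        .select { |f| f.size == 9 && types.include?(f[2]) }
  transcripts.group_by { |f| gff_attrs(f[8])['Parent'] }.map do |_gene, txs|
    longest = txs.max_by { |f| f[4].to_i - f[3].to_i + 1 }
    gff_attrs(longest[8])['ID']
  end
end

# Made-up annotation: gene1 has two transcripts, gene2 has one.
GENES_GFF = [
  %w[scaf1 src gene 100 900 . + . ID=gene1],
  %w[scaf1 src mRNA 100 500 . + . ID=tx1a;Parent=gene1],
  %w[scaf1 src mRNA 100 900 . + . ID=tx1b;Parent=gene1],
  %w[scaf2 src gene 50 300 . - . ID=gene2],
  %w[scaf2 src mRNA 50 300 . - . ID=tx2a;Parent=gene2]
].map { |f| f.join("\t") }.join("\n")

longest = longest_transcript_ids(GENES_GFF)
puts longest.inspect
```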

Finally, run flo as:

rake -f /path/to/flo/Rakefile

A common problem is that the 1st column of the GFF file doesn't match the chromosome, scaffold, or contig ids in the source assembly. In this case liftOver will generate an empty output file and flo stops at this point. You can fix the GFF file and resume flo by running the above command again.
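
A quick way to catch this mismatch before running flo is to compare the ids in column 1 of the GFF file with the ids in the FASTA headers. The sketch below does that on in-memory strings; it assumes the sequence id is the header text up to the first whitespace, and the function names and data are invented for the example.

```ruby
require 'set'

# Sequence ids from FASTA headers: text after '>' up to the first whitespace.
def fasta_ids(fasta_text)
  fasta_text.each_line.select { |l| l.start_with?('>') }
            .map { |l| l[1..-1].split.first }.to_set
end

# Distinct values of column 1 in a GFF file, skipping comments and blank lines.
def gff_seqids(gff_text)
  gff_text.each_line.reject { |l| l.start_with?('#') || l.strip.empty? }
          .map { |l| l.split("\t").first }.to_set
end

# Ids used in the GFF that have no matching sequence in the FASTA.
def missing_seqids(gff_text, fasta_text)
  (gff_seqids(gff_text) - fasta_ids(fasta_text)).to_a.sort
end

# Made-up data: the second GFF line uses an id that is not in the FASTA.
fasta = ">scaffold_1 length=8\nACGTACGT\n>scaffold_2\nGGCC\n"
gff = "##gff-version 3\n" \
      "scaffold_1\tsrc\tmRNA\t1\t8\t.\t+\t.\tID=tx1\n" \
      "Scaffold2\tsrc\tmRNA\t1\t4\t.\t+\t.\tID=tx2\n"

missing = missing_seqids(gff, fasta)
puts missing.inspect
```

An empty result means every sequence id in the GFF has a counterpart in the source assembly.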

flo writes all output to a directory called run/ in the current directory. The chain file generated by flo can be found at run/liftover.chn. If flo completed successfully, a directory is created in run/ for each given GFF3 file, containing:

lifted.gff3 and unlifted.gff3 - liftOver's output
lifted_cleaned.gff - lifted.gff3 cleaned by flo -> final output
unmapped.txt - ids of all transcripts that were not lifted, and of those whose coding sequence is not identical before and after lift over. Non-identical coding sequences can be the result of SNPs and short indels between the samples used to construct the source and target assemblies; they could be due to sequencing error in the target assembly or annotation error in the source assembly; or the transcript may have mapped to a duplicated region. Lifted transcripts with non-identical coding sequences are included in the final GFF, but their ids are also listed here to signal lower confidence, because it is difficult to separate true polymorphism from assembly errors and paralogous sequence variation.
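
The two kinds of entries in unmapped.txt can be pictured with a small comparison of coding sequences before and after lift over. The sketch below assumes the CDS of each transcript has already been extracted into a hash of id to sequence; `classify_transcripts` and the sequences are hypothetical and do not reproduce flo's own comparison code.

```ruby
# Classify each source transcript by what happened to its CDS after lift over.
def classify_transcripts(source_cds, lifted_cds)
  source_cds.each_with_object({}) do |(id, seq), status|
    status[id] =
      if !lifted_cds.key?(id)
        :not_lifted
      elsif lifted_cds[id].upcase == seq.upcase
        :identical
      else
        :changed
      end
  end
end

# Made-up coding sequences: tx2 differs by one base, tx3 was not lifted.
source_cds = { 'tx1' => 'ATGAAATAG', 'tx2' => 'ATGCCCTAA', 'tx3' => 'ATGGGGTGA' }
lifted_cds = { 'tx1' => 'ATGAAATAG', 'tx2' => 'ATGCCTTAA' }

status = classify_transcripts(source_cds, lifted_cds)
unmapped = status.reject { |_id, s| s == :identical }.keys
puts unmapped.inspect
```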

Results & discussion

Both the strengths and weaknesses of flo largely reflect those of the underlying tools - the chain file and UCSC liftOver. In general, gaps and errors in assemblies may split a long chain. Gene models that are split across different chains, as well as those that are duplicated in the target assembly, are not lifted.

For an ant genome (~350 Mb) we saw 90% of annotations map identically to the new assembly (unpublished result).
flo has been used in:

Tweak flo

If you would like to optimise how chain files are created:

The UCSC wiki and website are an amazing resource to learn about BLAT and chain files. Don't forget to read the Kent et al. 2003 paper cited above first.
Read the Rakefile from top to bottom. Ruby is similar to, yet simpler than, Perl and bash.

You can test things by lifting annotations from an assembly onto itself.
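
When an assembly is lifted onto itself, every feature should come back with unchanged coordinates, so any difference points at a problem in the chain file or the parameters. A minimal sketch of such a check, assuming each feature of interest carries an ID attribute (the function names and GFF lines are made up):

```ruby
# Map feature ID -> [seqid, start, end, strand] for GFF3 text.
def feature_coords(gff_text)
  gff_text.each_line.each_with_object({}) do |line, coords|
    f = line.chomp.split("\t")
    next unless f.size == 9
    id = f[8][/(?:\A|;)ID=([^;]+)/, 1]
    coords[id] = [f[0], f[3].to_i, f[4].to_i, f[6]] if id
  end
end

# Ids whose coordinates differ after lift over, or that went missing.
def moved_features(input_gff, lifted_gff)
  before = feature_coords(input_gff)
  after  = feature_coords(lifted_gff)
  before.keys.reject { |id| before[id] == after[id] }
end

# Made-up self-lift result in which tx2 shifted by 10 bases.
input_gff  = "scaf1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=tx1\n" \
             "scaf2\tsrc\tmRNA\t50\t700\t.\t-\t.\tID=tx2\n"
lifted_gff = "scaf1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=tx1\n" \
             "scaf2\tsrc\tmRNA\t60\t710\t.\t-\t.\tID=tx2\n"

moved = moved_features(input_gff, lifted_gff)
puts moved.inspect
```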
