alejandrogzi/bed2gtf


high-performance BED-to-GTF converter written in Rust

License

MIT license


bed2gtf

A high-performance bed-to-gtf converter written in Rust.

    // translates
    chr27 17266469 17281218 ENST00000541931.8 1000 + 17266469 17281218 0,0,200 2 103,74, 0,14675,

    // into
    chr27 bed2gtf gene 17266470 17285418 . + . gene_id "ENSG00000151743";
    chr27 bed2gtf transcript 17266470 17281218 . + . gene_id "ENSG00000151743"; transcript_id "ENST00000541931.8";
    chr27 bed2gtf exon 17266470 17266572 . + . gene_id "ENSG00000151743"; transcript_id "ENST00000541931.8"; exon_number "1"; exon_id "ENST00000541931.8.1";
    ...

Converts

Homo sapiens GRCh38 GENCODE 44 (252,835 transcripts) in 3.25 seconds.
Mus musculus GRCm39 GENCODE 44 (149,547 transcripts) in 1.99 seconds.
Canis lupus familiaris ROS_Cfam_1.0 Ensembl 110 (55,335 transcripts) in 1.20 seconds.
Gallus gallus bGalGal1 Ensembl 110 (72,689 transcripts) in 1.36 seconds.

What's new in v.1.9.3
Fixes a bug with .gz decoder
Implements reading .bed.gz files!
Fixes bug described in issue #11 with versioning

Usage

    Usage: bed2gtf[EXE] --bed/-b <BED> --isoforms/-i <ISOFORMS> --output/-o <OUTPUT>

    Arguments:
        -b, --bed <BED>: a .bed file
        -i, --isoforms <ISOFORMS>: a tab-delimited file
        -o, --output <OUTPUT>: path to output file
        -g, --gz[=<FLAG>]
            Compress output file
            [default: false]
            [possible values: true, false]
        -n, --no-gene[=<FLAG>]
            Flag to disable gene_id feature
            [default: false]
            [possible values: true, false]

    Options:
        --help: print help
        --version: print version
        --threads/-t: number of threads (default: max ncpus)
        --gz: compress output .gtf

Warning

All transcripts in the .bed file must appear in the isoforms file.

Tip

Here are some commands to get you started:

    # convert a .bed file to .gtf (if you have an isoforms file [gene -> transcript names] and want gene_ids in the output .gtf)
    bed2gtf -b file.bed -i isoforms.txt -o file.gtf

    # convert a .bed file to .gtf without isoforms [same as UCSC bedToGtf]
    bed2gtf -b file.bed -o file.gtf --no-gene

    # convert a .bed.gz file to a .gtf [with or without isoforms]
    bed2gtf -b file.bed.gz -i isoforms.txt -o file.gtf
    bed2gtf -b file.bed.gz -o file.gtf --no-gene

    # convert a .bed.gz to a .gtf.gz [with or without isoforms]
    bed2gtf -b file.bed.gz -i isoforms.txt -o file.gtf --gz
    bed2gtf -b file.bed.gz -o file.gtf --gz --no-gene

crate: https://crates.io/crates/bed2gtf

click for detailed formats

bed2gtf just needs two files:

a .bed file

tab-delimited files with 3 required and 9 optional fields:

    chrom   chromStart  chromEnd    name               ...
      |         |           |         |
    chr20   50222035    50222038    ENST00000595977    ...

see BED format for more information

a tab-delimited .txt/.tsv/.csv/... file with genes/isoforms (all the transcripts in .bed file should appear in the isoforms file):
```
> cat isoforms.txt
ENSG00000198888    ENST00000361390
ENSG00000198763    ENST00000361453
ENSG00000198804    ENST00000361624
ENSG00000188868    ENST00000595977
```
you can build a custom file for your preferred species using Ensembl BioMart.
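The relationship between the two input files can be sketched in a few lines of Rust. This is only an illustration, not bed2gtf's actual code: `parse_isoforms` and `missing_transcripts` are hypothetical helper names, the sketch assumes the gene comes first and the transcript second on each isoforms line (as in the example above), and `ENST_UNKNOWN` is a made-up transcript name used to trigger the check.

```rust
use std::collections::HashMap;

/// Parse isoforms text (one `gene<TAB>transcript` pair per line)
/// into a transcript -> gene lookup table.
fn parse_isoforms(text: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for line in text.lines() {
        let mut fields = line.split_whitespace();
        // Skip blank or malformed lines instead of panicking.
        if let (Some(gene), Some(transcript)) = (fields.next(), fields.next()) {
            map.insert(transcript.to_string(), gene.to_string());
        }
    }
    map
}

/// Return the transcript names (4th BED column) that have no entry in the table.
fn missing_transcripts(bed: &str, isoforms: &HashMap<String, String>) -> Vec<String> {
    bed.lines()
        .filter_map(|line| line.split('\t').nth(3))
        .filter(|name| !isoforms.contains_key(*name))
        .map(|name| name.to_string())
        .collect()
}

fn main() {
    let isoforms = parse_isoforms(
        "ENSG00000198888\tENST00000361390\nENSG00000188868\tENST00000595977\n",
    );
    // The second BED line uses a hypothetical name that is absent from the table.
    let bed = "chr20\t50222035\t50222038\tENST00000595977\nchr20\t1\t2\tENST_UNKNOWN\n";
    println!("missing: {:?}", missing_transcripts(bed, &isoforms));
}
```

Running a check like this before a conversion is one way to confirm that every transcript in the .bed file appears in the isoforms file.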

Installation

to install bed2gtf on your system follow these steps:

get rust: `curl https://sh.rustup.rs -sSf | sh` on unix, or go here for other options
run `cargo install bed2gtf` (make sure `~/.cargo/bin` is in your `$PATH` before running it)
use `bed2gtf` with the required arguments
enjoy!

Build

to build bed2gtf from this repo, do:

get rust (as described above)
run `git clone https://github.com/alejandrogzi/bed2gtf.git && cd bed2gtf`
run `cargo run --release -- -b <BED> -i <ISOFORMS> -o <OUTPUT>`

Container image

to build the development container image:

run `git clone https://github.com/alejandrogzi/bed2gtf.git && cd bed2gtf`
initialize docker with `start docker` or `systemctl start docker`
build the image: `docker image build --tag bed2gtf .`
run `docker run --rm -v "[dir_where_your_gtf_is]:/dir" bed2gtf -b /dir/<BED> -i /dir/<ISOFORMS> -o /dir/<OUTPUT>`

Conda

to use bed2gtf through Conda, run:

`conda install bed2gtf -c bioconda` or `conda create -n bed2gtf -c bioconda bed2gtf`

Output

bed2gtf writes the output to the path you specify, which can be the same directory as the .bed file:

    bed2gtf -b annotation.bed -i isoforms.txt -o output.gtf

    .
    ├── ...
    ├── isoforms.txt
    ├── annotation.bed
    └── output.gtf

where output.gtf is the result.

FAQ

Why?

UCSC offers a fast way to convert BED into GTF files through KentUtils or specific binaries (1), and several other bioinformaticians have shared scripts that try to replicate a similar solution (2,3,4).

A GTF file is a 9-column tab-delimited file that holds gene annotation data for a specific assembly (5). The 9th column defines the attributes of each entry. This field is important, as some post-processing tools that handle GTF files need it to extract gene information (e.g. STAR, arriba, etc). An incomplete GTF attribute field would probably lead to annotation-related errors in these tools.
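To make the 9-column layout concrete, here is a minimal Rust sketch that assembles one GTF record, with the attribute field built from `key "value";` pairs. The function name `gtf_line` and its parameters are hypothetical and not taken from bed2gtf's source; score and frame are simply written as `.` in this sketch. The values come from the exon line in the example at the top of this README.

```rust
/// Build one 9-column, tab-delimited GTF record.
/// Score (column 6) and frame (column 8) are written as "." in this sketch.
fn gtf_line(
    chrom: &str,
    source: &str,
    feature: &str,
    start: u64,
    end: u64,
    strand: char,
    attributes: &[(&str, &str)],
) -> String {
    // Column 9: `key "value";` pairs separated by single spaces.
    let attrs = attributes
        .iter()
        .map(|(key, value)| format!("{} \"{}\";", key, value))
        .collect::<Vec<_>>()
        .join(" ");
    format!(
        "{}\t{}\t{}\t{}\t{}\t.\t{}\t.\t{}",
        chrom, source, feature, start, end, strand, attrs
    )
}

fn main() {
    let line = gtf_line(
        "chr27",
        "bed2gtf",
        "exon",
        17266470,
        17266572,
        '+',
        &[
            ("gene_id", "ENSG00000151743"),
            ("transcript_id", "ENST00000541931.8"),
            ("exon_number", "1"),
            ("exon_id", "ENST00000541931.8.1"),
        ],
    );
    println!("{}", line);
}
```

Downstream tools typically look up `gene_id` and `transcript_id` in that last column, which is why a missing or duplicated value there tends to surface as an annotation error.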

Of the available tools/scripts mentioned above, none produces a GTF file with a fully functional attribute field. (1) uses a two-step approach (bedToGenePred | genePredToGtf) written in C, which is extremely fast. Since a .bed file does not preserve any gene-related information, this approach fails to a) include correct gene_id attributes (it duplicates the transcript_id instead) if no refTable is included, and b) append "gene" features in the 3rd column.

This is an example:

    chr27 stdin transcript 17266470 17281218 . + . gene_id "ENST00000541931.8"; transcript_id "ENST00000541931.8";
    chr27 stdin exon 17266470 17266572 . + . gene_id "ENST00000541931.8"; transcript_id "ENST00000541931.8"; exon_number "1"; exon_id "ENST00000541931.8.1";

On the other hand, the available scripts (2,3,4) produce badly formatted output that cannot be used as input to other tools. Some of them use a highly customized format, far from a complete GTF file (2):

    chr20 ---- peak 50222035 50222038 . + . peak_id "chr20_50222035_50222038";
    chr20 ---- peak 50188548 50189130 . + . peak_id "chr20_50188548_50189130";

and others (4) just provide exon-related information:

    chr20 ensembl exon 50222035 50222038 . + . gene_id "ENST00000595977.1735"; transcript_id "ENST00000595977.1735"; exon_number "0
    chr20 ensembl exon 50188548 50188930 . + . gene_id "ENST00000595977.3403"; transcript_id "ENST00000595977.3403"; exon_number "0

This is where bed2gtf comes in: a fast and memory-efficient BED-to-GTF converter written in Rust. In ~4 seconds this tool produces a fully functional GTF file with all the features needed by post-processing tools.

How?

bed2gtf is basically a reimplementation of the C binaries merged into 1 step. This tool evaluates the position of k exons in j transcripts, calculates start/stop codon and UTR positions while preserving reading frames, and shifts start coordinates by +1 (to be compatible with the 1-based GTF convention). The isoforms file works like the refTable in the C binaries, mapping each transcript to its respective gene; however, bed2gtf takes advantage of this and adds an additional "gene" line (to be compatible with other tools).
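The exon coordinate step described above can be illustrated with a short Rust sketch. It is not bed2gtf's implementation: `bed12_exons` is a hypothetical helper that only covers the block-to-exon conversion (no codon, UTR, or frame handling) and assumes a full 12-column BED line.

```rust
/// Convert the block fields of a BED12 line into 1-based, inclusive exon
/// coordinates. Returns None if a field is missing or is not a number.
fn bed12_exons(line: &str) -> Option<Vec<(u64, u64)>> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    if fields.len() < 12 {
        return None;
    }
    let chrom_start: u64 = fields[1].parse().ok()?;
    let block_count: usize = fields[9].parse().ok()?;

    // blockSizes and blockStarts are comma-separated and may end with a comma.
    let parse_list = |s: &str| -> Option<Vec<u64>> {
        s.split(',')
            .filter(|x| !x.is_empty())
            .map(|x| x.parse().ok())
            .collect()
    };
    let sizes = parse_list(fields[10])?;
    let starts = parse_list(fields[11])?;
    if sizes.len() != block_count || starts.len() != block_count {
        return None;
    }

    Some(
        starts
            .iter()
            .zip(&sizes)
            .map(|(offset, size)| {
                // BED is 0-based and half-open; GTF is 1-based and inclusive,
                // so only the start moves by +1 while the end stays the same.
                let start = chrom_start + offset + 1;
                let end = chrom_start + offset + size;
                (start, end)
            })
            .collect(),
    )
}

fn main() {
    let bed = "chr27 17266469 17281218 ENST00000541931.8 1000 + 17266469 17281218 0,0,200 2 103,74, 0,14675,";
    println!("{:?}", bed12_exons(bed));
}
```

For the BED line from the example at the top of this README, the first exon comes out as 17266470 to 17266572, which matches the exon line shown there.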

References


Latest release: v.1.9.3 (Nov 20, 2024), 11 releases in total
