ML-Bioinfo-CEITEC/genomic_benchmarks

Benchmarks for classification of genomic sequences

License

Apache-2.0 license


Genomic Benchmarks 🧬🏋️✔️

In this repository, we collect benchmarks for classification of genomic sequences. It is shipped as a Python package, together with functions that help to download and manipulate datasets and to train neural network (NN) models. The current state-of-the-art model on Genomic Benchmarks is HyenaDNA; see the metrics in the experiments folder.

Install

Genomic Benchmarks can be installed as follows:

pip install genomic-benchmarks

To use it with papermill, TF or pytorch, install the corresponding dependencies:

    # if you want to use jupyter and papermill
    # (version specifiers are quoted so the shell does not treat ">" as a redirect)
    pip install "jupyter>=1.0.0"
    pip install "papermill>=2.3.0"

    # if you want to train NN with TF
    pip install "tensorflow>=2.6.0"
    pip install tensorflow-addons
    pip install typing-extensions --upgrade  # fixing TF installation issue

    # if you want to train NN with torch
    pip install "torch>=1.10.0"
    pip install torchtext
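To see which of the optional dependency groups above are already available in the current environment, a small standard-library check can help. This is only a sketch: the helper `missing_modules` and the grouping in `OPTIONAL_MODULES` are made up for illustration, and the importable module names (for example `tensorflow_addons` for the `tensorflow-addons` distribution) are assumptions that may differ between releases.

```python
from importlib.util import find_spec

# Assumed top-level module names for the optional packages listed above.
OPTIONAL_MODULES = {
    "notebooks": ["jupyter", "papermill"],
    "tensorflow": ["tensorflow", "tensorflow_addons"],
    "pytorch": ["torch", "torchtext"],
}


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


for group, names in OPTIONAL_MODULES.items():
    missing = missing_modules(names)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{group}: {status}")
```

The check only looks modules up without importing them, so it stays fast even when TensorFlow or PyTorch is installed.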

For package development, use Python 3.8 (ideally 3.8.9) and the installation described in README_devel.md.

Usage

Get the list of all datasets with the list_datasets function:

    >>> from genomic_benchmarks.data_check import list_datasets
    >>> list_datasets()
    ['demo_coding_vs_intergenomic_seqs', 'demo_human_or_worm', 'dummy_mouse_enhancers_ensembl', 'human_enhancers_cohn', 'human_enhancers_ensembl', 'human_ensembl_regulatory', 'human_nontata_promoters', 'human_ocr_ensembl']

You can get basic information about a benchmark with the info function:

    >>> from genomic_benchmarks.data_check import info
    >>> info("human_nontata_promoters", version=0)
    Dataset `human_nontata_promoters` has 2 classes: negative, positive.
    All lenghts of genomic intervals equals 251.
    Totally 36131 sequences have been found, 27097 for training and 9034 for testing.
              train  test
    negative  12355  4119
    positive  14742  4915
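The per-class counts reported by info are enough to estimate how well a trivial classifier would do, which is a useful floor when reading model metrics. The sketch below hard-codes the counts from the output above; the `counts` dictionary and the `majority_baseline` helper are illustrative names, not part of the package.

```python
# Per-class sequence counts copied from the info() output shown above.
counts = {
    "train": {"negative": 12355, "positive": 14742},
    "test": {"negative": 4119, "positive": 4915},
}


def majority_baseline(split_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    return max(split_counts.values()) / sum(split_counts.values())


for split, split_counts in counts.items():
    total = sum(split_counts.values())
    baseline = majority_baseline(split_counts)
    print(f"{split}: {total} sequences, majority-class baseline {baseline:.3f}")
```

Both splits are roughly 54% positive, so a model for human_nontata_promoters should be judged against that baseline rather than against 50%.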

The function download_dataset downloads the full-sequence form of the requested benchmark, split into train and test sets with one folder for each class. Unless specified otherwise, the data are stored in the .genomic_benchmarks subfolder of your home directory. By default, the dataset is obtained from our cloud cache (use_cloud_cache=True).

    >>> from genomic_benchmarks.loc2seq import download_dataset
    >>> download_dataset("human_nontata_promoters", version=0)
    Downloading 1VdUg0Zu8yfLS6QesBXwGz1PIQrTW3Ze4 into /home/petr/.genomic_benchmarks/human_nontata_promoters.zip... Done.
    Unzipping... Done.
    PosixPath('/home/petr/.genomic_benchmarks/human_nontata_promoters')
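Because the downloaded benchmark is laid out as one folder per split and one subfolder per class, it can be sanity-checked with the standard library alone. The following is a minimal sketch, assuming one file per sequence inside each class folder; `count_sequences` is a hypothetical helper, not a function of the package.

```python
from pathlib import Path


def count_sequences(dataset_dir):
    """Count files per split and class in a <split>/<class>/<file> layout."""
    counts = {}
    for split_dir in sorted(Path(dataset_dir).iterdir()):
        if not split_dir.is_dir():
            continue
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts


if __name__ == "__main__":
    # Default download location described above.
    dataset_dir = Path.home() / ".genomic_benchmarks" / "human_nontata_promoters"
    if dataset_dir.is_dir():
        print(count_sequences(dataset_dir))
    else:
        print(f"{dataset_dir} not found; run download_dataset first")
```

If the assumption holds, the result should agree with the train and test table printed by info.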

Getting a TensorFlow Dataset for the benchmark and displaying samples is straightforward:

    >>> from pathlib import Path
    >>> import tensorflow as tf
    >>>
    >>> BATCH_SIZE = 64
    >>> SEQ_TRAIN_PATH = Path.home() / '.genomic_benchmarks' / 'human_nontata_promoters' / 'train'
    >>> CLASSES = ['negative', 'positive']
    >>>
    >>> train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    ...     directory=SEQ_TRAIN_PATH,
    ...     batch_size=BATCH_SIZE,
    ...     class_names=CLASSES)
    Found 27097 files belonging to 2 classes.
    >>>
    >>> list(train_dset)[0][0][0]
    <tf.Tensor: shape=(), dtype=string, numpy=b'TCCTGCCTTTCCACTTGCACCAGTTTTCCCACCCCAGCCTCAGGGCGGGGCTGCCTCGTCACTTGTCTCGGGGCAGATCTGCCCTACACACGTTAGCGCCGCGCGCAAAGCAGCCCCGCAGCACCCAGGCGCCTCCTGGCGGCGCCGCGAAGGGGCGGGGCTGTCGGCTGCGCGTTGTGCGCTGTCCCAGGTTGGAAACCAGTGCCCCAGGCGGCGAGGAGAGCGGTGCCTTGCAGGGATGCTGCGGGCGG'>

See How_To_Train_CNN_Classifier_With_TF.ipynb for a more detailed description of how to train a CNN classifier with TensorFlow.

Getting a PyTorch Dataset and displaying samples is also easy:

    >>> from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanNontataPromoters
    >>>
    >>> dset = HumanNontataPromoters(split='train', version=0)
    >>> dset[0]
    ('CAATCTCACAGGCTCCTGGTTGTCTACCCATGGACCCAGAGGTTCTTTGACAGCTTTGGCAACCTGTCCTCTGCCTCTGCCATCATGGGCAACCCCAAAGTCAAGGCACATGGCAAGAAGGTGCTGACTTCCTTGGGAGATGCCATAAAGCACCTGGATGATCTCAAGGGCACCTTTGCCCAGCTGAGTGAACTGCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAAGGTGAGTCCAGGAGATGT', 0)

See How_To_Train_CNN_Classifier_With_Pytorch.ipynb for a more detailed description of how to train a CNN classifier with PyTorch.
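Both dataset flavours return raw nucleotide strings, so some numeric encoding is needed before a sequence can be fed to a network. One common option is one-hot encoding, sketched below in plain Python. This is not necessarily the preprocessing used in the notebooks; the `one_hot` helper, the column order in `ALPHABET`, and the all-zero row for unknown bases are assumptions made for the illustration.

```python
# Column order of the encoding; an arbitrary choice made for this sketch.
ALPHABET = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Bases outside ALPHABET (for example N) become an all-zero row.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in sequence.upper():
        row = [0.0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1.0
        rows.append(row)
    return rows


example = "CAATN"
for base, row in zip(example, one_hot(example)):
    print(base, row)
```

The nested list converts directly to a tensor or array, with one row per base and one column per nucleotide.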

Hugging Face

We also provide these benchmarks through the Hugging Face Hub: https://huggingface.co/katarinagresova

If you are used to working with Hugging Face datasets, you can use this option to access Genomic Benchmarks. See How_To_Use_Datasets_From_HF.ipynb for a guide.

Structure of package

datasets: Each folder is one benchmark dataset (or a set of benchmarks in subfolders); see README.md for the format specification
docs: Each folder contains a Python notebook that was used to create the dataset
experiments: Training of simple neural network models for each benchmark dataset; these can be used as baselines
notebooks: Main use cases demonstrated in the form of Jupyter notebooks
src/genomic_benchmarks: Python module for dataset manipulation (downloading, checking, etc.)
tests: Unit tests for pytest and pytest-cov

How to contribute

How to contribute a model

If you beat our current best model on any dataset, or just came up with an interesting new idea, let us know about it: make your code publicly available (GitHub repo, Colab...) and fill in the form at

https://forms.gle/pvkkrgHNCNmAAC1TA

How to contribute a dataset

If you have an interesting genomic dataset, send us an issue with a description and, if possible, a link to the data (e.g. BED file and FASTQ reference). In the future, we will provide functions to make the import easy.

If you are a hero, read the specification of our dataset format and send us a pull request with new datasets/[YOUR_DATASET_NAME] and docs/[YOUR_DATASET_NAME] folders.

How to improve code in this package

We welcome new code contributors. If you see a bug, send us an issue with a minimal reproducible example. Or even better, fix the bug and send us a pull request.

Citing Genomic Benchmarks

If you use Genomic Benchmarks in your research, please cite it as follows.

Text

Grešová, Katarína, et al. "Genomic benchmarks: a collection of datasets for genomic sequence classification." BMC Genomic Data 24.1 (2023): 25.

BibTeX

    @article{grevsova2023genomic,
      title={Genomic benchmarks: a collection of datasets for genomic sequence classification},
      author={Gre{\v{s}}ov{\'a}, Katar{\'\i}na and Martinek, Vlastimil and {\v{C}}ech{\'a}k, David and {\v{S}}ime{\v{c}}ek, Petr and Alexiou, Panagiotis},
      journal={BMC Genomic Data},
      volume={24},
      number={1},
      pages={25},
      year={2023},
      publisher={Springer}
    }
