Benchmarks for classification of genomic sequences


ML-Bioinfo-CEITEC/genomic_benchmarks


Genomic Benchmarks 🧬🏋️✔️

In this repository, we collect benchmarks for the classification of genomic sequences. They are shipped as a Python package, together with functions that help download and manipulate the datasets and train neural network models. The current state-of-the-art model on Genomic Benchmarks is HyenaDNA; see the metrics in the experiments folder.

Install

Genomic Benchmarks can be installed as follows:

pip install genomic-benchmarks

To use it with papermill, TensorFlow, or PyTorch, install the corresponding dependencies (the version specifiers are quoted so that the shell does not treat `>` as a redirection):

# if you want to use jupyter and papermill
pip install "jupyter>=1.0.0"
pip install "papermill>=2.3.0"

# if you want to train NN with TF
pip install "tensorflow>=2.6.0"
pip install tensorflow-addons
pip install typing-extensions --upgrade  # fixing TF installation issue

# if you want to train NN with torch
pip install "torch>=1.10.0"
pip install torchtext
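Because these dependencies are optional, it can help to check which of them are importable before opening a notebook. The following is a minimal sketch using only the standard library; the helper name `available_extras` is hypothetical and not part of genomic_benchmarks, and the module names are the usual import names of the packages above.

```python
from importlib.util import find_spec

# Assumed top-level import names of the optional dependencies listed above.
OPTIONAL_MODULES = ["papermill", "tensorflow", "tensorflow_addons", "torch", "torchtext"]


def available_extras(modules=OPTIONAL_MODULES):
    """Map each top-level module name to True if it can be found for import."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, present in available_extras().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

`find_spec` only locates a module without importing it, so the check stays fast even when TensorFlow or PyTorch is installed.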

For package development, use Python 3.8 (ideally 3.8.9) and the installation described here.

Usage

Get the list of all datasets with the list_datasets function:

>>> from genomic_benchmarks.data_check import list_datasets
>>>
>>> list_datasets()
['demo_coding_vs_intergenomic_seqs', 'demo_human_or_worm', 'dummy_mouse_enhancers_ensembl', 'human_enhancers_cohn', 'human_enhancers_ensembl', 'human_ensembl_regulatory', 'human_nontata_promoters', 'human_ocr_ensembl']

You can get basic information about a benchmark with the info function:

>>> from genomic_benchmarks.data_check import info
>>>
>>> info("human_nontata_promoters", version=0)
Dataset `human_nontata_promoters` has 2 classes: negative, positive.
All lenghts of genomic intervals equals 251.
Totally 36131 sequences have been found, 27097 for training and 9034 for testing.
          train  test
negative  12355  4119
positive  14742  4915
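The counts reported by info are enough to check the split ratio and the class balance by hand. The sketch below simply copies the numbers from the output above into a dictionary (it does not call the package) and prints the share of each split and the fraction of positive sequences.

```python
# Counts copied from the info() output above.
counts = {
    "train": {"negative": 12355, "positive": 14742},
    "test": {"negative": 4119, "positive": 4915},
}

total = sum(sum(split.values()) for split in counts.values())

for split_name, split in counts.items():
    n = sum(split.values())
    positive_share = split["positive"] / n
    print(f"{split_name}: {n} sequences ({n / total:.1%} of all), {positive_share:.1%} positive")
```

For this dataset the split is roughly 75/25 between training and testing, with about 54% positive sequences in both parts.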

The download_dataset function downloads the full-sequence form of the requested benchmark (split into train and test sets, with one folder for each class). Unless specified otherwise, the data are stored in the .genomic_benchmarks subfolder of your home directory. By default, the dataset is obtained from our cloud cache (use_cloud_cache=True).

>>> from genomic_benchmarks.loc2seq import download_dataset
>>>
>>> download_dataset("human_nontata_promoters", version=0)
Downloading 1VdUg0Zu8yfLS6QesBXwGz1PIQrTW3Ze4 into /home/petr/.genomic_benchmarks/human_nontata_promoters.zip... Done.
Unzipping...Done.
PosixPath('/home/petr/.genomic_benchmarks/human_nontata_promoters')
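Once a dataset is downloaded, the layout described above (one folder per split, one subfolder per class) can be inspected with plain pathlib. This is a minimal sketch under that assumption; `count_sequences` is a hypothetical helper, not a function of the package, and it assumes one file per sequence.

```python
from pathlib import Path


def count_sequences(dataset_dir):
    """Count files per split and class, assuming a <split>/<class>/<file> layout."""
    counts = {}
    for split_dir in sorted(p for p in Path(dataset_dir).iterdir() if p.is_dir()):
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir())
        }
    return counts


if __name__ == "__main__":
    # Default download location mentioned above.
    root = Path.home() / ".genomic_benchmarks" / "human_nontata_promoters"
    if root.exists():
        print(count_sequences(root))
```

If the layout matches, the result should agree with the table printed by info for the same dataset and version.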

Getting a TensorFlow Dataset for the benchmark and displaying samples is straightforward:

>>> from pathlib import Path
>>> import tensorflow as tf
>>>
>>> BATCH_SIZE = 64
>>> SEQ_TRAIN_PATH = Path.home() / '.genomic_benchmarks' / 'human_nontata_promoters' / 'train'
>>> CLASSES = ['negative', 'positive']
>>>
>>> train_dset = tf.keras.preprocessing.text_dataset_from_directory(
...     directory=SEQ_TRAIN_PATH,
...     batch_size=BATCH_SIZE,
...     class_names=CLASSES)
Found 27097 files belonging to 2 classes.
>>>
>>> list(train_dset)[0][0][0]
<tf.Tensor: shape=(), dtype=string, numpy=b'TCCTGCCTTTCCACTTGCACCAGTTTTCCCACCCCAGCCTCAGGGCGGGGCTGCCTCGTCACTTGTCTCGGGGCAGATCTGCCCTACACACGTTAGCGCCGCGCGCAAAGCAGCCCCGCAGCACCCAGGCGCCTCCTGGCGGCGCCGCGAAGGGGCGGGGCTGTCGGCTGCGCGTTGTGCGCTGTCCCAGGTTGGAAACCAGTGCCCCAGGCGGCGAGGAGAGCGGTGCCTTGCAGGGATGCTGCGGGCGG'>

See How_To_Train_CNN_Classifier_With_TF.ipynb for a more detailed description of how to train a CNN classifier with TensorFlow.

Getting a PyTorch Dataset and displaying samples is also easy:

>>> from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanNontataPromoters
>>>
>>> dset = HumanNontataPromoters(split='train', version=0)
>>> dset[0]
('CAATCTCACAGGCTCCTGGTTGTCTACCCATGGACCCAGAGGTTCTTTGACAGCTTTGGCAACCTGTCCTCTGCCTCTGCCATCATGGGCAACCCCAAAGTCAAGGCACATGGCAAGAAGGTGCTGACTTCCTTGGGAGATGCCATAAAGCACCTGGATGATCTCAAGGGCACCTTTGCCCAGCTGAGTGAACTGCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAAGGTGAGTCCAGGAGATGT', 0)

See How_To_Train_CNN_Classifier_With_Pytorch.ipynb for a more detailed description of how to train a CNN classifier with PyTorch.
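Both dataset forms return raw nucleotide strings, so a model needs a numeric encoding first. The sketch below shows one common choice, one-hot encoding, in pure Python. It is an illustration only and not necessarily the preprocessing used in the notebooks; `one_hot_encode` is a hypothetical helper, and the handling of characters outside A, C, G, T is an assumption.

```python
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order.

    Characters outside ACGT (for example N) become an all-zero row.
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


# Shortened stand-in for a (sequence, label) item such as dset[0].
sample, label = ("CAATCT", 0)
print(one_hot_encode(sample)[:2])  # [[0, 1, 0, 0], [1, 0, 0, 0]]
```

Since every sequence in human_nontata_promoters has length 251, each encoded sample would have 251 rows of 4 values, which can be stacked into a batch without padding.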

Hugging Face

We also provide these benchmarks through the Hugging Face Hub: https://huggingface.co/katarinagresova

If you are used to working with Hugging Face datasets, you can use this option to access Genomic Benchmarks. See How_To_Use_Datasets_From_HF.ipynb for a guide.

Structure of package

  • datasets: Each folder is one benchmark dataset (or a set of benchmarks in subfolders); see README.md for the format specification
  • docs: Each folder contains a Python notebook that was used to create the dataset
  • experiments: Training of simple neural network model(s) for each benchmark dataset; can be used as a baseline
  • notebooks: Main use cases demonstrated in the form of Jupyter notebooks
  • src/genomic_benchmarks: Python module for dataset manipulation (downloading, checking, etc.)
  • tests: Unit tests for pytest and pytest-cov

How to contribute

How to contribute a model

If you beat our current best model on any dataset or just came up with an interesting new idea, let us know about it: make your code publicly available (GitHub repo, Colab...) and fill in the form at

https://forms.gle/pvkkrgHNCNmAAC1TA

How to contribute a dataset

If you have an interesting genomic dataset, send us an issue with a description and, if possible, a link to the data (e.g. a BED file and a FASTA reference). In the future, we will provide functions to make the import easy.

If you are a hero, read the specification of our dataset format and send us a pull request with new datasets/[YOUR_DATASET_NAME] and docs/[YOUR_DATASET_NAME] folders.

How to improve code in this package

We welcome new code contributors. If you see a bug, send us an issue with a minimal reproducible example. Or even better, fix the bug and send us a pull request.

Citing Genomic Benchmarks

If you use Genomic Benchmarks in your research, please cite it as follows.

Text

Grešová, Katarína, et al. "Genomic benchmarks: a collection of datasets for genomic sequence classification." BMC Genomic Data 24.1 (2023): 25.

BibTeX

@article{grevsova2023genomic,
  title={Genomic benchmarks: a collection of datasets for genomic sequence classification},
  author={Gre{\v{s}}ov{\'a}, Katar{\'\i}na and Martinek, Vlastimil and {\v{C}}ech{\'a}k, David and {\v{S}}ime{\v{c}}ek, Petr and Alexiou, Panagiotis},
  journal={BMC Genomic Data},
  volume={24},
  number={1},
  pages={25},
  year={2023},
  publisher={Springer}
}
