Benchmarks for classification of genomic sequences
ML-Bioinfo-CEITEC/genomic_benchmarks
In this repository, we collect benchmarks for classification of genomic sequences. It is shipped as a Python package, together with functions that help to download and manipulate datasets and train NN models. The current SOTA model on Genomic Benchmarks is HyenaDNA; see the metrics in the experiments folder.
Genomic Benchmarks can be installed as follows:
    pip install genomic-benchmarks
To use it with papermill, TF or pytorch, install the corresponding dependencies:
    # if you want to use jupyter and papermill
    pip install "jupyter>=1.0.0"
    pip install "papermill>=2.3.0"

    # if you want to train NN with TF
    pip install "tensorflow>=2.6.0"
    pip install tensorflow-addons
    pip install typing-extensions --upgrade  # fixing TF installation issue

    # if you want to train NN with torch
    pip install "torch>=1.10.0"
    pip install torchtext
For package development, use Python 3.8 (ideally 3.8.9) and the installation described here.
Get the list of all datasets with the `list_datasets` function:

    >>> from genomic_benchmarks.data_check import list_datasets
    >>>
    >>> list_datasets()
    ['demo_coding_vs_intergenomic_seqs', 'demo_human_or_worm', 'dummy_mouse_enhancers_ensembl', 'human_enhancers_cohn', 'human_enhancers_ensembl', 'human_ensembl_regulatory', 'human_nontata_promoters', 'human_ocr_ensembl']
You can get basic information about a benchmark with the `info` function:

    >>> from genomic_benchmarks.data_check import info
    >>>
    >>> info("human_nontata_promoters", version=0)
    Dataset `human_nontata_promoters` has 2 classes: negative, positive.

    All lenghts of genomic intervals equals 251.

    Totally 36131 sequences have been found, 27097 for training and 9034 for testing.
              train  test
    negative  12355  4119
    positive  14742  4915
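The counts printed by `info` are enough to work out the majority-class baseline that any classifier should beat. A minimal sketch, with the numbers copied from the output above; the dictionary layout and the helper name `majority_baseline` are illustrative assumptions, not part of the package:

```python
# Counts copied from the info() output for human_nontata_promoters.
counts = {
    "train": {"negative": 12355, "positive": 14742},
    "test": {"negative": 4119, "positive": 4915},
}


def majority_baseline(split_counts):
    """Accuracy of always predicting the most frequent class (hypothetical helper)."""
    total = sum(split_counts.values())
    return max(split_counts.values()) / total


for split, split_counts in counts.items():
    n_sequences = sum(split_counts.values())
    baseline = majority_baseline(split_counts)
    print(f"{split}: {n_sequences} sequences, majority-class accuracy {baseline:.3f}")
```

Both splits are only mildly imbalanced, so plain accuracy remains a reasonable first metric for this dataset.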
The function `download_dataset` downloads the full-sequence form of the required benchmark (split into train and test sets, one folder for each class). If not specified otherwise, the data will be stored in the `.genomic_benchmarks` subfolder of your home directory. By default, the dataset is obtained from our cloud cache (`use_cloud_cache=True`).
    >>> from genomic_benchmarks.loc2seq import download_dataset
    >>>
    >>> download_dataset("human_nontata_promoters", version=0)
    Downloading 1VdUg0Zu8yfLS6QesBXwGz1PIQrTW3Ze4 into /home/petr/.genomic_benchmarks/human_nontata_promoters.zip... Done.
    Unzipping...Done.
    PosixPath('/home/petr/.genomic_benchmarks/human_nontata_promoters')
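Since the download is a plain directory tree, it can also be read without any deep learning framework. The sketch below assumes the layout implied by the description above, `<dataset>/<split>/<class>/` with one sequence per file. The function name `load_split` is hypothetical and not part of the package.

```python
from pathlib import Path


def load_split(dataset_dir, split="train"):
    """Return (class_names, samples) for one split of a downloaded benchmark.

    Hypothetical helper. Assumes <dataset_dir>/<split>/<class>/ holds one
    sequence per file. Labels are indices into the sorted class folder names.
    """
    split_dir = Path(dataset_dir) / split
    class_names = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    samples = []
    for label, class_name in enumerate(class_names):
        for seq_file in sorted((split_dir / class_name).iterdir()):
            if seq_file.is_file():
                samples.append((seq_file.read_text().strip(), label))
    return class_names, samples


if __name__ == "__main__":
    # Default download location described above; skipped if nothing is there.
    root = Path.home() / ".genomic_benchmarks" / "human_nontata_promoters"
    if root.exists():
        classes, train = load_split(root, "train")
        print(classes, len(train))
```

Sorting the class folders keeps the label assignment deterministic across runs and machines.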
Getting TensorFlow Dataset for the benchmark and displaying samples is straightforward:
    >>> from pathlib import Path
    >>> import tensorflow as tf
    >>>
    >>> BATCH_SIZE = 64
    >>> SEQ_TRAIN_PATH = Path.home() / '.genomic_benchmarks' / 'human_nontata_promoters' / 'train'
    >>> CLASSES = ['negative', 'positive']
    >>>
    >>> train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    ...     directory=SEQ_TRAIN_PATH,
    ...     batch_size=BATCH_SIZE,
    ...     class_names=CLASSES)
    Found 27097 files belonging to 2 classes.
    >>>
    >>> list(train_dset)[0][0][0]
    <tf.Tensor: shape=(), dtype=string, numpy=b'TCCTGCCTTTCCACTTGCACCAGTTTTCCCACCCCAGCCTCAGGGCGGGGCTGCCTCGTCACTTGTCTCGGGGCAGATCTGCCCTACACACGTTAGCGCCGCGCGCAAAGCAGCCCCGCAGCACCCAGGCGCCTCCTGGCGGCGCCGCGAAGGGGCGGGGCTGTCGGCTGCGCGTTGTGCGCTGTCCCAGGTTGGAAACCAGTGCCCCAGGCGGCGAGGAGAGCGGTGCCTTGCAGGGATGCTGCGGGCGG'>
See How_To_Train_CNN_Classifier_With_TF.ipynb for a more detailed description of how to train a CNN classifier with TensorFlow.
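The dataset yields raw strings, so a model needs a numeric encoding first. One common choice is one-hot encoding. The sketch below is a plain-Python illustration, not necessarily the encoding used in the notebook; the function name `one_hot` and the treatment of non-ACGT characters as all-zero rows are assumptions.

```python
NUCLEOTIDES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order.

    Characters outside ACGT (for example N) become all-zero rows (an assumption).
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


encoded = one_hot("TCCTGN")
print(len(encoded), encoded[0])
```

Because all sequences in `human_nontata_promoters` have the same length (251), the encoded samples can be stacked into a fixed-size array without padding.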
Getting Pytorch Dataset and displaying samples is also easy:
    >>> from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanNontataPromoters
    >>>
    >>> dset = HumanNontataPromoters(split='train', version=0)
    >>> dset[0]
    ('CAATCTCACAGGCTCCTGGTTGTCTACCCATGGACCCAGAGGTTCTTTGACAGCTTTGGCAACCTGTCCTCTGCCTCTGCCATCATGGGCAACCCCAAAGTCAAGGCACATGGCAAGAAGGTGCTGACTTCCTTGGGAGATGCCATAAAGCACCTGGATGATCTCAAGGGCACCTTTGCCCAGCTGAGTGAACTGCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAAGGTGAGTCCAGGAGATGT', 0)
See How_To_Train_CNN_Classifier_With_Pytorch.ipynb for a more detailed description of how to train a CNN classifier with PyTorch.
We also provide these benchmarks through the Hugging Face Hub: https://huggingface.co/katarinagresova
If you are used to working with Hugging Face datasets, you can use this option to access Genomic Benchmarks. See How_To_Use_Datasets_From_HF.ipynb for a guide.
- datasets: Each folder is one benchmark dataset (or a set of benchmarks in subfolders), see README.md for the format specification
- docs: Each folder contains a Python notebook that has been used for the dataset creation
- experiments: Training a simple neural network model(s) for each benchmark dataset, can be used as a baseline
- notebooks: Main use-cases demonstrated in a form of Jupyter notebooks
- src/genomic_benchmarks: Python module for datasets manipulation (downloading, checking, etc.)
- tests: Unit tests for `pytest` and `pytest-cov`
If you beat our current best model on any dataset or just came up with an interesting new idea, let us know about it: make your code publicly available (GitHub repo, Colab...) and fill in the form at
https://forms.gle/pvkkrgHNCNmAAC1TA
If you have an interesting genomic dataset, send us an issue with the description and, if possible, a link to the data (e.g. BED file and FASTQ reference). In the future, we will provide functions to make the import easy.
If you are a hero, read the specification of our dataset format and send us a pull request with new `datasets/[YOUR_DATASET_NAME]` and `docs/[YOUR_DATASET_NAME]` folders.
We welcome new code contributors. If you see a bug, send us an issue with a minimal reproducible example. Or even better, fix the bug and send us a pull request.
If you use Genomic Benchmarks in your research, please cite it as follows.
Grešová, Katarína, et al. "Genomic benchmarks: a collection of datasets for genomic sequence classification." BMC Genomic Data 24.1 (2023): 25.
    @article{grevsova2023genomic,
      title={Genomic benchmarks: a collection of datasets for genomic sequence classification},
      author={Gre{\v{s}}ov{\'a}, Katar{\'\i}na and Martinek, Vlastimil and {\v{C}}ech{\'a}k, David and {\v{S}}ime{\v{c}}ek, Petr and Alexiou, Panagiotis},
      journal={BMC Genomic Data},
      volume={24},
      number={1},
      pages={25},
      year={2023},
      publisher={Springer}
    }