Quadtree - gradient-boosted decision tree model used to predict guanine quadruplexes in DNA sequences
patrikkaura/quadtree
Quadtree is a gradient-boosted decision tree model that predicts guanine quadruplexes in DNA sequences. It is built on top of the LightGBM Python library. Each base in a sequence is encoded according to a given encoding prescription. The model was trained for use with a sliding window, which is moved along the sequence so that the whole sequence is analysed. The model can be used as a Python script or through the preview website quadtree.vercel.app.
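The encoding and sliding-window steps described above can be sketched as follows. This is only an illustration, not the code in `quadtree.py`: the integer codes in `ENCODING`, the `UNKNOWN` code, the window length, and the step size are all assumptions, since this README does not state the actual encoding prescription or window size.

```python
from typing import Iterator, List, Tuple

# Hypothetical encoding prescription: the real mapping used by Quadtree is not
# given in this README, so these integer codes are placeholders.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}
UNKNOWN = 4  # assumed code for any other symbol, e.g. N


def encode(sequence: str) -> List[int]:
    """Map each base to its integer code (case-insensitive)."""
    return [ENCODING.get(base, UNKNOWN) for base in sequence.upper()]


def sliding_windows(
    sequence: str, window: int, step: int = 1
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (start, encoded_window) for every full window in the sequence."""
    encoded = encode(sequence)
    # Sequences shorter than the window produce no windows at all.
    for start in range(0, len(encoded) - window + 1, step):
        yield start, encoded[start:start + window]


# Window length 4 and step 2 are chosen only to keep the example small.
windows = list(sliding_windows("GGGAGGGTN", window=4, step=2))
```

Each encoded window would then be passed to the trained model as one feature row, with the start offset kept so that a score can be mapped back to a position in the original sequence.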
quadtree
└─ web -> preview website source code
└─ python
   └─ model -> lightgbm model params
   └─ train -> example files showing how training was performed
   └─ quadtree.py -> predictor
- lightgbm==3.3.2
- numpy==1.21.2
Before use, install the requirements:
pip install -r requirements.txt
from quadtree import Quadtree
model = Quadtree()
- sequence as a string (maximum length is not limited)
- threshold (recommended value is 0.2)
- quadnet model file path
result = model.analyse(sequence='ATTAATACTTTTAACAATTGTAGTATATAAAAAAGGGAGTAACC...', model_path='/path/to/quadnet_model.txt', score_threshold=0.1)
Results are returned in a form that can be loaded into a pandas DataFrame.
import pandas as pd
df = pd.DataFrame(result)
| | index | position | sequence | length |
---|---|---|---|---|
0 | 0 | 907 | GCAACAATGGCTGATCCAGAAGGTACAGACGGGGAGGGCACGGGTTGTAACGGCTGGTTTTATGTACAAGCTATTGTAGACAAAAAAACAGGAGATGTAATATCA | 105 |
1 | 1 | 1184 | GAGGCAGCACAGAAAACAGTCCATTAGGGGAGCGGCTGGAGGTGGATACAGAGTTAAGTCCACGGTTACAAGAAATATCTTTAAATAGTGGGCAGA | 96 |
2 | 2 | 1389 | ATGTAGTGGCGGCAGTACGGAGGCTATAGACAACGGGGGCACAGAGGGCAACAACAGCAGTGTAGACGGTACAAGTGACAATAGCAATATAGAAAATGTAAATCCAC | 107 |
3 | 3 | 1635 | AGATTGGGTTACAGCTATATTTGGAGTAAACCCAACAATAGCAGAAGGATTTAAAACACTAATACAGCCATTTAT | 75 |
4 | 4 | 2229 | AATAGATGAAGGGGGAGATTGGAGACCAATAGTGCAATTCCTGCGATACCAACAAATAGAGTTTATAACATTTTTAG | 77 |
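This README does not describe how per-position scores become the regions listed above. One plausible post-processing step, shown here purely as an assumption, is to merge consecutive positions whose score reaches the threshold into a single region. The function `scores_to_regions`, the 0-based positions, and the inclusive threshold comparison are all hypothetical and are not taken from `quadtree.py`.

```python
from typing import Dict, List, Optional, Sequence, Union

Record = Dict[str, Union[int, str]]


def scores_to_regions(
    sequence: str, scores: Sequence[float], threshold: float
) -> List[Record]:
    """Merge runs of positions with score >= threshold into region records."""
    if len(scores) != len(sequence):
        raise ValueError("expected one score per base")

    regions: List[Record] = []
    start: Optional[int] = None

    def close(end: int) -> None:
        # Record the run [start, end) using the same columns as the table above.
        regions.append({
            "index": len(regions),
            "position": start,  # 0-based start, an assumption
            "sequence": sequence[start:end],
            "length": end - start,
        })

    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i            # a new run begins
        elif score < threshold and start is not None:
            close(i)             # the run ended just before position i
            start = None
    if start is not None:
        close(len(sequence))     # the run extends to the end of the sequence

    return regions


# Made-up scores for illustration only; they are not model output.
example = scores_to_regions(
    "ACGGGTGGGA",
    [0.0, 0.1, 0.5, 0.6, 0.3, 0.1, 0.0, 0.4, 0.9, 0.7],
    threshold=0.2,
)
```

A list of records in this shape can be passed directly to `pd.DataFrame`, which yields the same columns as the table above.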
These parameters were used to train the LightGBM model:
LGBM Classifier | value |
---|---|
colsample bytree | 0.817574864502621 |
learning rate | 0.03744835808549148 |
max bin | 127 |
min child samples | 3 |
number of estimators | 1000 |
number of leaves | 74 |
regularization alpha | 0.0033803043003857677 |
regularization lambda | 0.7013136087939289 |
objective | binary |
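For reference, the table values can be written as keyword arguments for LightGBM's scikit-learn style `LGBMClassifier`. The mapping from the table's row labels to keyword names (for example "number of leaves" to `num_leaves`) is an assumption based on the LightGBM parameter names; the actual training scripts are in the `train` folder. The helper `build_classifier` is hypothetical and needs the `lightgbm` package installed.

```python
# Table values expressed with LightGBM scikit-learn API keyword names.
# The label-to-keyword mapping is an assumption, not taken from the repo.
LGBM_PARAMS = {
    "colsample_bytree": 0.817574864502621,
    "learning_rate": 0.03744835808549148,
    "max_bin": 127,
    "min_child_samples": 3,
    "n_estimators": 1000,
    "num_leaves": 74,
    "reg_alpha": 0.0033803043003857677,
    "reg_lambda": 0.7013136087939289,
    "objective": "binary",
}


def build_classifier(params=LGBM_PARAMS):
    """Create an untrained LGBMClassifier from the parameters above."""
    # Imported lazily so the dictionary can be inspected without lightgbm.
    from lightgbm import LGBMClassifier

    return LGBMClassifier(**params)
```

Calling `build_classifier()` returns an unfitted estimator; it still has to be trained on encoded windows and labels before it can make predictions.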
- Patrik Kaura - Main developer - patrikkaura
This project is licensed under the MIT License - see the LICENSE file for details.