# MolPMoFiT
Implementation of *Inductive Transfer Learning for Molecular Activity Prediction: Next-Gen QSAR Models with MolPMoFiT*.

Molecular Prediction Model Fine-Tuning (MolPMoFiT) is a transfer learning method based on self-supervised pre-training plus task-specific fine-tuning for QSPR/QSAR modeling.

MolPMoFiT is adapted from ULMFiT and implemented with PyTorch and fastai v1. A large-scale molecular structure prediction model is pre-trained on one million unlabeled molecules from ChEMBL in a self-supervised manner; it can then be fine-tuned for various QSPR/QSAR tasks on smaller chemical datasets with specific endpoints.
## Setup

We recommend building the environment with Conda:

```
conda env create -f molpmofit.yml
```
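After creating the environment, a quick sanity check can confirm the key libraries resolve (a sketch; it assumes the environment defined in `molpmofit.yml` provides fastai 1.x, RDKit, and PyTorch, as the notebooks require):

```python
# Confirm the key libraries import inside the new environment.
import fastai, rdkit, torch
print(fastai.__version__)          # expect a 1.x release for fastai v1 code
print(rdkit.__version__)           # RDKit is used for SMILES handling
print(torch.cuda.is_available())   # True if a GPU is visible
```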
## Data

We provide all the datasets needed to reproduce the experiments in the `data` folder:

- `data/MSPM` contains the dataset used to train the general-domain molecular structure prediction model.
- `data/QSAR` contains the datasets for the QSAR tasks.
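For a quick look at a QSAR dataset, pandas works directly on the files (a sketch; the file name and column headers used here are hypothetical and should be checked against the actual files in `data/QSAR`):

```python
import pandas as pd

# Hypothetical file name; substitute a real file from data/QSAR.
df = pd.read_csv("data/QSAR/example_dataset.csv")
print(df.shape)
print(df.head())  # typically a SMILES column plus an endpoint/label column
```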
## Notebooks

The code is provided as Jupyter notebooks in the `notebooks` folder. All the code was developed on an Ubuntu 18.04 workstation with 2 Quadro P4000 GPUs.

- `01_MSPM_Pretraining.ipynb`: training the general-domain molecular structure prediction model (MSPM).
- `02_MSPM_TS_finetuning.ipynb`: (1) fine-tuning the general MSPM on a target dataset to generate a task-specific MSPM; (2) fine-tuning the task-specific MSPM to train a QSAR model.
- `03_QSAR_Classifcation.ipynb`: fine-tuning the general-domain MSPM to train a classification model.
- `04_QSAR_Regression.ipynb`: fine-tuning the general-domain MSPM to train a regression model.

A minimal sketch of this two-stage fine-tuning pattern follows.
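The sketch below uses fastai v1's stock AWD_LSTM and placeholder DataFrames/column names (`train_df`, `valid_df`, `smiles`, `label` are assumptions); the actual notebooks build a SMILES-specific tokenizer/vocabulary and start from the ChEMBL-pretrained MSPM weights rather than fastai's English language model:

```python
from fastai.text import *  # fastai v1 API

# train_df / valid_df: pandas DataFrames with 'smiles' and 'label'
# columns (placeholder names for this sketch).
path = Path('.')

# Stage 1: fine-tune the language model (MSPM) on the target-task SMILES.
data_lm = TextLMDataBunch.from_df(path, train_df, valid_df, text_cols='smiles')
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
lm.fit_one_cycle(1, 1e-2)
lm.save_encoder('task_specific_encoder')

# Stage 2: fine-tune a QSAR classifier on top of the saved encoder,
# reusing the language model's vocabulary.
data_clas = TextClasDataBunch.from_df(path, train_df, valid_df,
                                      text_cols='smiles', label_cols='label',
                                      vocab=data_lm.vocab)
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder('task_specific_encoder')
clf.fit_one_cycle(1, 1e-2)
```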
## Pretrained Models

- Download `ChEMBL_1M_atom`. See `notebooks/05_Pretrained_Models.ipynb` for usage instructions. This model is trained on 1M ChEMBL molecules with the atomwise tokenization method (original MolPMoFiT).
- Download `ChEMBL_1M_SPE`. See `notebooks/06_SPE_Pretrained_Models.ipynb` for usage instructions. This model is trained on 1M ChEMBL molecules with the SMILES Pair Encoding tokenization method. SMILES Pair Encoding (SmilesPE) is a data-driven substructure tokenization algorithm for deep learning.
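The two tokenization schemes can be tried directly with the SmilesPE package (a sketch based on SmilesPE's documented API; `SPE_ChEMBL.txt` is the pre-trained SPE vocabulary distributed with SmilesPE and must be downloaded separately):

```python
import codecs
from SmilesPE.pretokenizer import atomwise_tokenizer
from SmilesPE.tokenizer import SPE_Tokenizer

smi = 'CC[N+](C)(C)Cc1ccccc1Br'

# Atomwise tokenization (used by ChEMBL_1M_atom): one token per atom/bond symbol.
print(atomwise_tokenizer(smi))
# ['C', 'C', '[N+]', '(', 'C', ')', '(', 'C', ')', 'C', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']

# SMILES Pair Encoding (used by ChEMBL_1M_SPE): frequent substructures become tokens.
spe = SPE_Tokenizer(codecs.open('SPE_ChEMBL.txt'))  # pre-trained SPE vocabulary
print(spe.tokenize(smi))  # e.g. 'CC [N+](C)(C) Cc1ccccc1 Br'
```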