# MolPMoFiT
Implementation of the paper *Inductive Transfer Learning for Molecular Activity Prediction: Next-Gen QSAR Models with MolPMoFiT*.
Molecular Prediction Model Fine-Tuning (MolPMoFiT) is a transfer learning method based on self-supervised pre-training + task-specific fine-tuning for QSPR/QSAR modeling.
MolPMoFiT is adapted from ULMFiT and implemented with PyTorch and Fastai v1. A large-scale molecular structure prediction model is first pre-trained on one million unlabeled molecules from ChEMBL in a self-supervised manner; it can then be fine-tuned for various QSPR/QSAR tasks on smaller chemical datasets with specific endpoints.
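As a rough illustration of the first, self-supervised stage, the Fastai v1 pipeline looks something like the sketch below. This is a minimal sketch, not the repo's exact code: the file names, column names, and hyperparameters are hypothetical, and the actual notebooks add a custom SMILES tokenizer and SMILES augmentation on top of this.

```python
# Minimal sketch of stage 1 (general-domain MSPM pre-training) with Fastai v1.
# File names, column names, and hyperparameters are illustrative only.
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

train_df = pd.read_csv('data/MSPM/train.csv')  # hypothetical file with a 'smiles' column
valid_df = pd.read_csv('data/MSPM/valid.csv')

# Build a language-model DataBunch over raw SMILES strings.
data_lm = TextLMDataBunch.from_df(path='.', train_df=train_df, valid_df=valid_df,
                                  text_cols='smiles')

# AWD-LSTM language model trained from scratch to predict the next SMILES token.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
learn.fit_one_cycle(10, 3e-3)
learn.save('mspm_general')          # full language model for later fine-tuning
learn.save_encoder('mspm_encoder')  # encoder reused by downstream QSAR models
```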
We recommend building the environment with Conda:

```bash
conda env create -f molpmofit.yml
```
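Once the environment is created and activated (the environment name comes from the yml; `molpmofit` is assumed here), a quick sanity check that the expected Fastai v1 API is on the path:

```python
# Verify the environment: MolPMoFiT targets Fastai v1, whose API differs
# substantially from Fastai 2.x, so check versions before running the notebooks.
import torch, fastai
print(torch.__version__, fastai.__version__)  # expect a 1.x Fastai release
```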
We provide all the datasets needed to reproduce the experiments in the `data` folder:

- `data/MSPM` contains the dataset used to train the general-domain molecular structure prediction model.
- `data/QSAR` contains the datasets for the QSAR tasks.
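To see exactly which files ship in these folders, a quick listing from the repo root (this assumes nothing about file names or formats):

```python
# Enumerate every data file bundled with the repository.
from pathlib import Path

for f in sorted(Path('data').rglob('*')):
    if f.is_file():
        print(f)
```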
The code is provided as Jupyter notebooks in the `notebooks` folder. All the code was developed on an Ubuntu 18.04 workstation with two Quadro P4000 GPUs.
- `01_MSPM_Pretraining.ipynb`: trains the general-domain molecular structure prediction model (MSPM).
- `02_MSPM_TS_finetuning.ipynb`: (1) fine-tunes the general MSPM on a target dataset to produce a task-specific MSPM; (2) fine-tunes the task-specific MSPM to train a QSAR model (a sketch of this workflow follows the list).
- `03_QSAR_Classifcation.ipynb`: fine-tunes the general-domain MSPM to train a classification model.
- `04_QSAR_Regression.ipynb`: fine-tunes the general-domain MSPM to train a regression model.
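The fine-tuning workflow of notebooks 02 and 03 boils down to the two ULMFiT-style steps below. Again, a minimal sketch under assumptions: the endpoint file, column names, split, and hyperparameters are hypothetical, and vocabulary handling is abbreviated.

```python
# Minimal sketch of stages 2-3: fine-tune the general MSPM on the target
# SMILES, then train a QSAR classifier on top of its encoder.
import pandas as pd
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

df = pd.read_csv('data/QSAR/my_endpoint.csv')  # hypothetical endpoint file
train_df, valid_df = df[:-200], df[-200:]      # illustrative split

# Stage 2: fine-tune the general MSPM on target-domain SMILES.
data_lm = TextLMDataBunch.from_df('.', train_df, valid_df, text_cols='smiles')
# NOTE: for the saved weights to line up, the vocab from pre-training must be
# reused here (pass vocab=... to from_df); omitted to keep the sketch short.
lm = language_model_learner(data_lm, AWD_LSTM, pretrained=False)
lm.load('mspm_general')            # weights saved during pre-training
lm.fit_one_cycle(5, 1e-3)
lm.save_encoder('mspm_task_specific')

# Stage 3: train the QSAR classifier on the fine-tuned encoder.
data_clas = TextClasDataBunch.from_df('.', train_df, valid_df,
                                      text_cols='smiles', label_cols='active',
                                      vocab=data_lm.vocab)
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder('mspm_task_specific')
clf.freeze()                       # train only the new classifier head first
clf.fit_one_cycle(4, 1e-2)
clf.unfreeze()                     # then unfreeze the rest, as in ULMFiT
clf.fit_one_cycle(4, slice(1e-4, 1e-2))
```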
- Download `ChEMBL_1M_atom`. See `notebooks/05_Pretrained_Models.ipynb` for usage instructions. This model is trained on 1M ChEMBL molecules with the atomwise tokenization method (original MolPMoFiT).
- Download `ChEMBL_1M_SPE`. See `notebooks/06_SPE_Pretrained_Models.ipynb` for usage instructions. This model is trained on 1M ChEMBL molecules with the SMILES Pair Encoding tokenization method.
  - SMILES Pair Encoding (SmilesPE) is a data-driven substructure tokenization algorithm for deep learning.
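To make the difference between the two tokenization schemes concrete, here is a small example using the SmilesPE package (`pip install SmilesPE`). It assumes you have downloaded the trained `SPE_ChEMBL.txt` vocabulary distributed by the SmilesPE project; the outputs in the comments follow its documentation.

```python
# Contrast of the two tokenization schemes used by the pretrained models.
import codecs
from SmilesPE.pretokenizer import atomwise_tokenizer
from SmilesPE.tokenizer import SPE_Tokenizer

smi = 'CC[N+](C)(C)Cc1ccccc1Br'

# Atomwise: one token per atom/bond/branch symbol.
print(atomwise_tokenizer(smi))
# ['C', 'C', '[N+]', '(', 'C', ')', '(', 'C', ')', 'C', 'c', '1', 'c', ...]

# SMILES Pair Encoding: frequent substructures become single tokens.
spe = SPE_Tokenizer(codecs.open('SPE_ChEMBL.txt'))
print(spe.tokenize(smi))
# 'CC [N+](C) (C)C c1ccccc1 Br'
```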