AutoPeptideML

AutoML system for building trustworthy peptide bioactivity predictors

Documentation: https://ibm.github.io/AutoPeptideML
Source Code: https://github.com/IBM/AutoPeptideML
Webserver: http://peptide.ucd.ie/AutoPeptideML
Google Colaboratory Notebook: AutoPeptideML_Collab.ipynb
Blog post: Portal - AutoPeptideML v. 1.0 Tutorial
Papers:
- AutoPeptideML (v. 1.0)
- ML Generalization from canonical to non-canonical peptides

AutoPeptideML allows researchers without prior knowledge of machine learning to build models that are:

Trustworthy: Robust evaluation following community guidelines for ML evaluation reporting in life sciences (DOME).
Interpretable: Output contains a PDF summary of the model evaluation explaining how to interpret the results to understand how reliable the model is.
Reproducible: Output contains all necessary information for other researchers to reproduce the training and verify the results.
State-of-the-art: Models generated with this system are competitive with state-of-the-art handcrafted approaches.

To use version 1.0, which may be necessary for backward compatibility with previously built models, please refer to the branch: AutoPeptideML v.1.0.6

Model builder

To build a new model, AutoPeptideML (v.2.0) introduces a new utility that automatically prepares an experiment configuration file. This i) improves the reproducibility of the pipeline and ii) keeps the interface user-friendly despite the much greater flexibility.

autopeptideml prepare-config

This launches an interactive CLI that walks you through:

Choosing a modeling task (classification or regression)
Selecting input modality (macromolecules or sequences)
Loading and parsing datasets (csv, tsv, or fasta)
Defining evaluation strategy
Picking models and representations
Setting hyperparameter search strategy and training parameters

You’ll be prompted to answer various questions like:

- What is the modelling problem you're facing? (Classification or Regression)
- How do you want to define your peptides? (Macromolecules or Sequences)
- What models would you like to consider? (knn, adaboost, rf, etc.)

And so on. The final config is written to:

<outputdir>/config.yml

This config file allows for easy reproducibility of the results, so that anyone can repeat the training processes. You can check the configuration file and make any changes you deem necessary. Finally, you can build the model by simply running:

autopeptideml build-model --config-path <outputdir>/config.yml

Prediction

To use a model that has already been built, run:

autopeptideml predict <model_outputdir> <features_path> <feature_field> --output-path <my_predictions_path.csv>

Where <features_path> is the path to a CSV file with a column <feature_field> that contains the peptide sequences/SMILES. The output file <my_predictions_path.csv> will contain the original data with two additional columns: score (the predictions) and std (the standard deviation between the predictions of the models in the ensemble), which can be used as a measure of the uncertainty of the prediction.
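As a rough illustration of how the score and std columns might be used downstream, the sketch below reads a predictions file with Python's standard csv module and separates confident from uncertain predictions. The sample rows, the `sequence` column name, and the 0.1 cutoff on std are made up for the example; only the score and std column names come from the description above.

```python
import csv
import io

# Hypothetical stand-in for <my_predictions_path.csv>; a real file will also
# contain whatever columns were present in the input features file.
SAMPLE = """sequence,score,std
ACDEFGHIK,0.91,0.03
LMNPQRSTV,0.55,0.21
WYACDEFGH,0.12,0.05
"""


def split_by_uncertainty(handle, max_std=0.1):
    """Split prediction rows into (confident, uncertain) by ensemble std."""
    confident, uncertain = [], []
    for row in csv.DictReader(handle):
        row["score"] = float(row["score"])
        row["std"] = float(row["std"])
        # max_std is an arbitrary illustrative cutoff, not a recommended value.
        (confident if row["std"] <= max_std else uncertain).append(row)
    return confident, uncertain


confident, uncertain = split_by_uncertainty(io.StringIO(SAMPLE))
for row in confident:
    print(f"{row['sequence']}: score={row['score']:.2f} (std={row['std']:.2f})")
```

To run it on a real output file, replace `io.StringIO(SAMPLE)` with `open("my_predictions_path.csv", newline="")`. A suitable cutoff depends on the task and should be chosen by inspecting the distribution of std values.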

Benchmark data

Data used to benchmark our approach has been selected from the benchmarks collected by Du et al., 2023. A new set of benchmarks was constructed from the original set following the new data acquisition and dataset partitioning methods within AutoPeptideML. To download the datasets:

Original UniDL4BioPep Benchmarks: Please check the project GitHub repository.
⚠️ New AutoPeptideML Benchmarks (Amended version): Can be downloaded from this link. Please note that these are not exactly the same benchmarks as used in the paper (see Issue #24 for more details).
PeptideGeneralizationBenchmarks: Benchmarks evaluating how peptide representation methods generalize from canonical (peptides composed of the 20 standard amino acids) to non-canonical (peptides with non-standard amino acids or other chemical modifications). Check out the paper pre-print. They have their own dedicated repository: PeptideGeneralizationBenchmarks GitHub repository.

Installation

Installing in a conda environment is recommended. For creating the environment, please run:

conda create -n autopeptideml python
conda activate autopeptideml

1. Python Package

1.1. From PyPI

pip install autopeptideml

1.2. Directly from source

pip install git+https://github.com/IBM/AutoPeptideML

2. Third-party dependencies

To use MMSeqs2 (https://github.com/steineggerlab/mmseqs2):

# static build with AVX2 (fastest) (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

# static build with SSE4.1 (check using: cat /proc/cpuinfo | grep sse4)
wget https://mmseqs.com/latest/mmseqs-linux-sse41.tar.gz; tar xvfz mmseqs-linux-sse41.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

# static build with SSE2 (slowest, for very old systems) (check using: cat /proc/cpuinfo | grep sse2)
wget https://mmseqs.com/latest/mmseqs-linux-sse2.tar.gz; tar xvfz mmseqs-linux-sse2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

# MacOS
brew install mmseqs2

To use Needleman-Wunsch, either:

conda install -c bioconda emboss

or:

sudo apt install emboss

To use ECFP fingerprints:

pip install rdkit

To use MAPc fingerprints:

pip install mapchiral

To use PepFuNN fingerprints:

pip install git+https://github.com/novonordisk-research/pepfunn

To use PeptideCLM:

pip install smilesPE
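Since all of the dependencies above are optional, it can be handy to check which ones are visible from the active environment. The following is a minimal sketch using only the Python standard library. The executable names (`mmseqs`, `needle`) and the import names (`rdkit`, `mapchiral`, `pepfunn`, `SmilesPE`) are assumptions based on how these tools are usually installed, not something AutoPeptideML itself exposes.

```python
import importlib.util
import shutil

# Executable names are assumptions: MMseqs2 installs `mmseqs`, and EMBOSS
# provides its Needleman-Wunsch aligner as `needle`.
EXECUTABLES = {"MMSeqs2": "mmseqs", "Needleman-Wunsch (EMBOSS)": "needle"}

# Import names are assumptions based on each package's usual import name.
MODULES = {
    "ECFP fingerprints": "rdkit",
    "MAPc fingerprints": "mapchiral",
    "PepFuNN fingerprints": "pepfunn",
    "PeptideCLM": "SmilesPE",
}


def check_dependencies():
    """Return {feature: bool} saying whether each optional dependency is found."""
    status = {}
    for feature, exe in EXECUTABLES.items():
        # shutil.which returns None when the executable is not on PATH.
        status[feature] = shutil.which(exe) is not None
    for feature, module in MODULES.items():
        # find_spec returns None when a top-level module cannot be imported.
        status[feature] = importlib.util.find_spec(module) is not None
    return status


for feature, available in check_dependencies().items():
    print(f"{feature:28s} {'found' if available else 'missing'}")
```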

Documentation

Configuration file

Top-level structure

pipeline: {...}
databases: {...}
test: {...}
val: {...}
train: {...}
representation: {...}
outputdir: "path/to/experiment_results"
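Before launching a long run on a hand-edited configuration, a quick sanity check of the top-level sections can catch typos early. The sketch below assumes the configuration has already been parsed into a Python dictionary (for example with a YAML library) and that all seven sections listed above are expected to be present; whether AutoPeptideML treats every one of them as mandatory is not stated here, so take the check as illustrative.

```python
# Section names taken from the top-level structure shown above.
EXPECTED_SECTIONS = (
    "pipeline",
    "databases",
    "test",
    "val",
    "train",
    "representation",
    "outputdir",
)


def missing_sections(config):
    """Return the expected top-level keys that are absent from `config`."""
    return [key for key in EXPECTED_SECTIONS if key not in config]


# Hypothetical, deliberately incomplete configuration.
config = {
    "pipeline": {},
    "databases": {},
    "train": {},
    "outputdir": "path/to/experiment_results",
}
print(missing_sections(config))  # ['test', 'val', 'representation']
```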

`pipeline`

Defines the preprocessing pipeline depending on the modality (mol or seqs). It includes data cleaning and transformations, such as:

filter-smiles
canonical-cleaner
sequence-to-smiles
smiles-to-sequences

The name of a pipeline object has to include the word pipe. Pipelines can be elements within a pipeline. Here is an example, where aggregate combines the output from the different elements. In this case, the two elements process SMILES and sequences independently and then combine them into a single datastream.

pipeline:
  name: "macromolecules_pipe"
  aggregate: true
  verbose: false
  elements:
    - pipe-smiles-input: {...}
    - pipe-seq-input: {...}

`databases`

Defines dataset paths and how to interpret them.

Required:

path: Path to main dataset.
feat_fields: Column name with SMILES or sequences.
label_field: Column with classification/regression labels.
verbose: Logging flag.

Optional:

neg_database: If using negative sampling.
  path: Path to negative dataset.
  feat_fields: Feature column.
  columns_to_exclude: Bioactivity columns to ignore.

databases:
  dataset:
    path: "data/main.csv"
    feat_fields: "sequence"
    label_field: "activity"
    verbose: false
  neg_database:
    path: "data/negatives.csv"
    feat_fields: "sequence"
    columns_to_exclude: ["to_exclude"]
    verbose: false

`test`

Defines evaluation and similarity filtering settings.

min_threshold: Identity threshold for filtering.
sim_arguments: Similarity computation details.

For sequences:

alignment_algorithm: mmseqs, mmseqs+prefilter, needle
denominator: How identity is normalized: longest, shortest, n_aligned
prefilter: Whether to use a prefilter.
field_name: Name of the column with the peptide sequences/SMILES.
verbose: Logging flag.

For molecules:

sim_function: e.g., tanimoto, jaccard
radius: Radius used to define the substructures when computing the fingerprint.
bits: Size of the fingerprint; larger values give more resolution but demand more computational resources.

Other options:

partitions: min, all, <threshold>
algorithm: ccpart, ccpart_random, graph_part
threshold_step: Step size for threshold evaluation.
filter: Minimum acceptable proportion of data in the test set (e.g., with a test set proportion of 20% and filter=0.185, test sets with less than 18.5% of the data are not considered).
verbose: Logging level.

Example:

test:
  min_threshold: 0.1
  sim_arguments:
    data_type: "sequence"
    alignment_algorithm: "mmseqs"
    denominator: "shortest"
    prefilter: true
    min_threshold: 0.1
    field_name: "sequence"
    verbose: 2
  partitions: "all"
  algorithm: "ccpart"
  threshold_step: 0.1
  filter: 0.185
  verbose: 2
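The denominator option changes the identity value quite a bit for peptides of different lengths. The sketch below shows one common reading of the three choices (longest, shortest, n_aligned) on a pair of pre-aligned sequences with gaps written as `-`. It is an illustration only: the function name is hypothetical, and the exact definitions used by MMSeqs2 or needle inside AutoPeptideML may differ in detail.

```python
def sequence_identity(aligned_a, aligned_b, denominator="shortest"):
    """Identity between two aligned sequences (gaps written as '-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = list(zip(aligned_a, aligned_b))
    # Positions where both sequences carry the same residue.
    n_identical = sum(a == b and a != "-" for a, b in pairs)
    # Positions where neither sequence has a gap.
    n_aligned = sum(a != "-" and b != "-" for a, b in pairs)
    # Ungapped lengths of the two sequences.
    len_a = sum(a != "-" for a in aligned_a)
    len_b = sum(b != "-" for b in aligned_b)
    denominators = {
        "longest": max(len_a, len_b),
        "shortest": min(len_a, len_b),
        "n_aligned": n_aligned,
    }
    if denominator not in denominators:
        raise ValueError(f"unknown denominator: {denominator!r}")
    value = denominators[denominator]
    return n_identical / value if value else 0.0


# Hypothetical alignment: 9 vs 6 residues, 5 aligned columns, 5 identities.
a = "ACDE-FGHIK"
b = "AC-EYFG---"
for choice in ("longest", "shortest", "n_aligned"):
    print(choice, round(sequence_identity(a, b, choice), 3))
```

With the same alignment, longest gives the most conservative (lowest) identity and n_aligned the most permissive, which matters when the value is compared against min_threshold.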

`val`

Cross-validation strategy:

type: kfold or single
k: Number of folds.
random_state: Seed for reproducibility.
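For readers unfamiliar with k-fold cross-validation, the sketch below shows what k and random_state control: the data is shuffled with a fixed seed and cut into k folds, each of which serves once as the validation set. This is a plain-Python illustration with a hypothetical function name, not AutoPeptideML's own splitting code.

```python
import random


def kfold_indices(n_samples, k, random_state=None):
    """Yield (train_idx, val_idx) pairs for k shuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # A dedicated Random instance makes the shuffle reproducible per seed.
    random.Random(random_state).shuffle(indices)
    # Striding by k keeps fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


for train_idx, val_idx in kfold_indices(n_samples=10, k=5, random_state=1):
    print("validation:", sorted(val_idx), "| training size:", len(train_idx))
```

Running it twice with the same random_state yields identical folds, which is what makes the validation results repeatable.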

`train`

Training configuration.

Required:

task: class or reg
optim_strategy: Optimization strategy.
  trainer: grid or optuna
  n_steps: Number of trials (Optuna only).
  direction: maximize or minimize
  metric: mcc or mse
  partition: Partitioning type.
  n_jobs: Parallel jobs.
  patience: Early stopping patience.
hspace: Search space.
  representations: List of representations to try.
  models:
    type: select or ensemble
    elements: Model names and their hyperparameter space.

Example:

train:
  task: "class"
  optim_strategy:
    trainer: "optuna"
    n_steps: 100
    direction: "maximize"
    task: "class"
    metric: "mcc"
    partition: "random"
    n_jobs: 8
    patience: 20
  hspace:
    representations: ["chemberta-2", "ecfp-4"]
    models:
      type: "select"
      elements:
        knn:
          n_neighbors:
            type: int
            min: 1
            max: 20
            log: false
          weights:
            type: categorical
            values: ["uniform", "distance"]
```
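To make the hyperparameter entries more concrete, the sketch below draws one random trial from a dictionary that mirrors the knn block of the example. It only illustrates how type, min, max, log, and values could be interpreted. It does not use Optuna and is not the search code AutoPeptideML runs; the `float` branch is an assumption, since the example only shows int and categorical.

```python
import math
import random

# Mirrors the `knn` entry of the example configuration above.
KNN_SPACE = {
    "n_neighbors": {"type": "int", "min": 1, "max": 20, "log": False},
    "weights": {"type": "categorical", "values": ["uniform", "distance"]},
}


def sample_param(spec, rng):
    """Draw one value for a single hyperparameter specification."""
    kind = spec["type"]
    if kind == "categorical":
        return rng.choice(spec["values"])
    if kind not in ("int", "float"):  # "float" is assumed, not in the example
        raise ValueError(f"unsupported type: {kind!r}")
    low, high = spec["min"], spec["max"]
    if spec.get("log", False):
        # Sample uniformly in log space (requires low > 0).
        value = math.exp(rng.uniform(math.log(low), math.log(high)))
    elif kind == "int":
        return rng.randint(low, high)  # both bounds inclusive
    else:
        value = rng.uniform(low, high)
    if kind == "int":
        # Round the log-sampled value and clamp it back into range.
        return min(high, max(low, round(value)))
    return value


rng = random.Random(0)
trial = {name: sample_param(spec, rng) for name, spec in KNN_SPACE.items()}
print(trial)
```

A grid trainer would instead enumerate the values, and an Optuna-style trainer would propose them adaptively, but the meaning of the fields stays the same.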

`representation`

Specifies molecular or sequence representations.

Each element includes:

engine: lm (language model) or fp (fingerprint)
model: Model name (e.g., chemberta-2, esm2-150m)
device: cpu, gpu, or mps
batch_size: Size per batch
average_pooling: Whether to average token representations (only for lm)

representation:
  verbose: true
  elements:
    - chemberta-2:
        engine: "lm"
        model: "chemberta-2"
        device: "gpu"
        batch_size: 32
        average_pooling: true
    - ecfp-4:
        engine: "fp"
        fp: "ecfp"
        radius: 2
        nbits: 2048
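The average_pooling option can be pictured as follows: a language model returns one vector per token, and averaging them column by column gives a single fixed-length vector per peptide. The sketch below uses plain Python lists and a hypothetical function name. Real implementations typically also deal with padding and special tokens, and how AutoPeptideML treats those is not covered here.

```python
def average_pool(token_embeddings):
    """Average a (n_tokens x dim) list of vectors into one vector of length dim."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    n_tokens = len(token_embeddings)
    # zip(*rows) iterates over columns, i.e. one embedding dimension at a time.
    return [sum(column) / n_tokens for column in zip(*token_embeddings)]


# Hypothetical 3-token peptide with 4-dimensional token embeddings.
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
print(average_pool(tokens))  # [2.0, 1.0, 2.0, 2.0]
```

The pooled vector has the same length regardless of peptide length, which is what lets downstream models such as knn or rf consume it directly.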

More details about API

Please check the Code reference documentation.

License

AutoPeptideML is open-source software licensed under the MIT License. Check the details in the LICENSE file.

Credits

Special thanks to Silvia González López for designing the AutoPeptideML logo and to Marcos Martínez Galindo for his aid in setting up the AutoPeptideML webserver.