cthoyt/chembl-downloader

Write reproducible code for getting and processing ChEMBL

ChEMBL Downloader

Reproducibly download, open, parse, and query ChEMBL.

Don't worry about downloading/extracting ChEMBL or versioning - just use chembl_downloader to write code that knows how to download it and use it automatically.

💪 Getting Started

Download and extract the SQLite dump using the following:

    import chembl_downloader

    path = chembl_downloader.download_extract_sqlite(version='28')

After it's been downloaded and extracted once, it's smart and does not need to download again. It gets stored using pystow automatically in the ~/.data/chembl directory.

Full technical documentation can be found on ReadTheDocs. Tutorials can be found in Jupyter notebooks in the notebooks/ directory of the repository.

Download the Latest Version

You can modify the previous code slightly by omitting the version keyword argument to automatically find the latest version of ChEMBL:

    import chembl_downloader

    path = chembl_downloader.download_extract_sqlite()

The version keyword argument is available for all functions in this package (e.g., including connect(), cursor(), and query()), but will be omitted below for brevity.
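To illustrate pinning, the sketch below runs one SQL string against several ChEMBL releases. The names query_versions and query_func are hypothetical and introduced only for this example; they are not part of the package. By default the helper defers to chembl_downloader.query, and the injectable function makes it possible to try the logic without downloading anything.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def query_versions(
    sql: str,
    versions: Iterable[str],
    query_func: Optional[Callable[..., Any]] = None,
) -> Dict[str, Any]:
    """Run the same SQL against several ChEMBL versions (hypothetical helper)."""
    if query_func is None:
        # Imported lazily so the helper can be exercised without the package.
        import chembl_downloader

        query_func = chembl_downloader.query
    # Each call pins the release explicitly through the ``version`` keyword.
    return {version: query_func(sql, version=version) for version in versions}


def fake_query(sql: str, version: str) -> str:
    """Stand-in for the real query function; nothing is downloaded."""
    return f"rows from ChEMBL {version}"


results = query_versions("SELECT 1", ["28", "29"], query_func=fake_query)
print(results)
```

With the package installed, calling query_versions(sql, ["28", "29"]) without query_func would download each release on first use, so expect the first run to be slow.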

Automatically Connect to SQLite

Inside the archive is a single SQLite database file. Normally, people manually untar this folder then do something with the resulting file. Don't do this; it's not reproducible! Instead, the file can be downloaded and a connection can be opened automatically with:

    from contextlib import closing

    import chembl_downloader

    with chembl_downloader.connect() as conn:
        # sqlite3 cursors are not context managers, so wrap them in closing()
        with closing(conn.cursor()) as cursor:
            cursor.execute(...)  # run your query string
            rows = cursor.fetchall()  # get your results

The cursor() function provides a convenient wrapper around this operation:

    import chembl_downloader

    with chembl_downloader.cursor() as cursor:
        cursor.execute(...)  # run your query string
        rows = cursor.fetchall()  # get your results
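The same execute-then-fetch pattern can be tried on a tiny in-memory database before committing to the full download. In this sketch the table contents are invented and only stand in for ChEMBL; it uses the standard library's sqlite3 module directly rather than chembl_downloader.

```python
import sqlite3
from contextlib import closing

# A toy stand-in for the ChEMBL database; the rows are invented for this example.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE molecule_dictionary (chembl_id TEXT, pref_name TEXT)")
    conn.executemany(
        "INSERT INTO molecule_dictionary VALUES (?, ?)",
        [("CHEMBL_A", "EXAMPLE ONE"), ("CHEMBL_B", None)],
    )
    with closing(conn.cursor()) as cursor:
        # Pass parameters with ? placeholders rather than string formatting.
        cursor.execute(
            "SELECT chembl_id FROM molecule_dictionary WHERE pref_name = ?",
            ("EXAMPLE ONE",),
        )
        rows = cursor.fetchall()

print(rows)
```

Each row comes back as a tuple, even when a single column is selected.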

Run a Query and Get a Pandas DataFrame

The most powerful function is query(), which builds on the previous connect() function in combination with pandas.read_sql to make a query and load the results into a pandas DataFrame for any downstream use.

    import chembl_downloader

    sql = """
    SELECT
        MOLECULE_DICTIONARY.chembl_id,
        MOLECULE_DICTIONARY.pref_name
    FROM MOLECULE_DICTIONARY
    JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno == COMPOUND_STRUCTURES.molregno
    WHERE molecule_dictionary.pref_name IS NOT NULL
    LIMIT 5
    """

    df = chembl_downloader.query(sql)
    df.to_csv(..., sep='\t', index=False)
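To see what the join selects without pandas or the full database, here is a minimal sketch on a made-up two-table schema. The rows are invented, and any column not named in the query above is a placeholder. The output is written as tab-separated text with the standard library's csv module.

```python
import csv
import io
import sqlite3
from contextlib import closing

SQL = """
SELECT
    MOLECULE_DICTIONARY.chembl_id,
    MOLECULE_DICTIONARY.pref_name
FROM MOLECULE_DICTIONARY
JOIN COMPOUND_STRUCTURES
    ON MOLECULE_DICTIONARY.molregno = COMPOUND_STRUCTURES.molregno
WHERE MOLECULE_DICTIONARY.pref_name IS NOT NULL
LIMIT 5
"""

with closing(sqlite3.connect(":memory:")) as conn:
    # Toy schema and rows, invented for illustration only.
    conn.executescript(
        """
        CREATE TABLE MOLECULE_DICTIONARY (molregno INTEGER, chembl_id TEXT, pref_name TEXT);
        CREATE TABLE COMPOUND_STRUCTURES (molregno INTEGER, structure TEXT);
        INSERT INTO MOLECULE_DICTIONARY VALUES (1, 'CHEMBL_A', 'NAMED');
        INSERT INTO MOLECULE_DICTIONARY VALUES (2, 'CHEMBL_B', NULL);
        INSERT INTO MOLECULE_DICTIONARY VALUES (3, 'CHEMBL_C', 'NO STRUCTURE');
        INSERT INTO COMPOUND_STRUCTURES VALUES (1, 'CCO');
        INSERT INTO COMPOUND_STRUCTURES VALUES (2, 'CCN');
        """
    )
    rows = conn.execute(SQL).fetchall()

# Serialize the result as TSV, mirroring df.to_csv(..., sep='\t', index=False).
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["chembl_id", "pref_name"])
writer.writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Only the first toy molecule survives: the second has no name, and the third has no matching structure row, so the inner join drops it.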

Suggestion 1: use pystow to make a reproducible file path that's portable to other people's machines (e.g., it doesn't have your username in the path).

Suggestion 2: RDKit is now pip-installable with pip install rdkit-pypi, which means most users don't have to muck around with complicated conda environments and configurations. One of the powerful but understated tools in RDKit is the rdkit.Chem.PandasTools module.

SDF Usage

Access an RDKit supplier over entries in the SDF dump

This example is a bit more fit-for-purpose than the last two. The supplier() function makes sure that the latest SDF dump is downloaded and loads it from the gzip file into a rdkit.Chem.ForwardSDMolSupplier using a context manager to make sure the file doesn't get closed until after parsing is done. Like the previous examples, it can also explicitly take a version.

    from rdkit import Chem

    import chembl_downloader

    with chembl_downloader.supplier() as suppl:
        data = []
        for i, mol in enumerate(suppl):
            if mol is None or mol.GetNumAtoms() > 50:
                continue
            fp = Chem.PatternFingerprint(mol, fpSize=1024, tautomerFingerprints=True)
            smi = Chem.MolToSmiles(mol)
            data.append((smi, fp))

This example was adapted from Greg Landrum's RDKit blog post on generalized substructure search.

Iterate over SMILES

This example uses the supplier() method and RDKit to get SMILES strings from molecules in ChEMBL's SDF file. If you want direct access to the RDKit molecule objects, use supplier().

    import chembl_downloader

    for smiles in chembl_downloader.iterate_smiles():
        print(smiles)
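Printing every SMILES string in ChEMBL takes a long time, so it can help to peek at only the first few. The sketch below does this with itertools.islice; first_smiles and its smiles_iter parameter are hypothetical names for this example, and the short generator stands in for the real iterator so the code runs without a download.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional


def first_smiles(n: int, smiles_iter: Optional[Iterable[str]] = None) -> List[str]:
    """Return the first ``n`` SMILES strings (hypothetical helper)."""
    if smiles_iter is None:
        # Imported lazily so the helper can be tried without the package.
        import chembl_downloader

        smiles_iter = chembl_downloader.iterate_smiles()
    # islice stops consuming the iterator after n items.
    return list(islice(smiles_iter, n))


def fake_smiles() -> Iterator[str]:
    """Stand-in iterator with a few well-known SMILES strings."""
    yield from ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]


preview = first_smiles(2, fake_smiles())
print(preview)
```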

Get an RDKit substructure library

Building on the supplier() function, get_substructure_library() makes the preparation of a substructure library automated and reproducible. Additionally, it caches the results of the build, so the build, which takes on the order of tens of minutes, only has to be done once; future loading from a pickle object takes on the order of seconds.

The implementation was inspired by Greg Landrum's RDKit blog post, Some new features in the SubstructLibrary. The following example shows how it can be used to accomplish some of the first tasks presented in the post:

    from rdkit import Chem

    import chembl_downloader

    library = chembl_downloader.get_substructure_library()
    query = Chem.MolFromSmarts('[O,N]=C-c:1:c:c:n:c:c:1')
    matches = library.GetMatches(query)
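The build-once, load-from-pickle behavior described above can be sketched in general terms as follows. This is an illustration of the caching idea only, not the package's actual implementation; load_or_build and expensive_build are hypothetical names, and a small dictionary stands in for the substructure library.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, List


def load_or_build(cache_path: Path, build: Callable[[], Any]) -> Any:
    """Load a pickled object if it exists, otherwise build and cache it."""
    if cache_path.is_file():
        with cache_path.open("rb") as file:
            return pickle.load(file)
    result = build()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as file:
        pickle.dump(result, file, protocol=pickle.HIGHEST_PROTOCOL)
    return result


calls: List[int] = []


def expensive_build() -> dict:
    """Stand-in for a slow build step; records how often it runs."""
    calls.append(1)
    return {"n_molecules": 3}


with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "library.pkl"
    first = load_or_build(path, expensive_build)  # builds and writes the cache
    second = load_or_build(path, expensive_build)  # reads the cache instead

print(first == second, len(calls))
```

As with any pickle, only load cache files that you created yourself.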

Morgan Fingerprints Usage

Get the Morgan Fingerprint file

ChEMBL makes available a file containing pre-computed 2048-bit, radius-2 Morgan fingerprints for each molecule. It can be downloaded using:

    import chembl_downloader

    path = chembl_downloader.download_fps()

The version and other keyword arguments are also valid for this function.

Load fingerprints with chemfp

The following wraps the download_fps function with chemfp's fingerprint loader:

    import chembl_downloader

    arena = chembl_downloader.chemfp_load_fps()

The version and other keyword arguments are also valid for this function. More information on working with the arena object can be found here.

Command Line Interface

After installing, run the following CLI command to ensure the data is downloaded and send its path to stdout:

    $ chembl_downloader download

Use the test subcommand to show two example queries:

    $ chembl_downloader test

Configuration

If you want to store the data elsewhere using pystow (e.g., I also keep a copy of this file in pyobo), you can use the prefix argument.

    import chembl_downloader

    # It gets downloaded/extracted to
    # ~/.data/pyobo/raw/chembl/29/chembl_29/chembl_29_sqlite/chembl_29.db
    path = chembl_downloader.download_extract_sqlite(prefix=['pyobo', 'raw', 'chembl'])

See the pystow documentation on configuring the storage location further.

The prefix keyword argument is available for all functions in this package (e.g., including connect(), cursor(), and query()).
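To make the effect of prefix concrete, the sketch below rebuilds the path from the comment in the example above out of its parts. expected_sqlite_path is a hypothetical helper, and the directory layout is inferred from that single comment, so it may not hold for every release or pystow configuration; in real code, rely on the path returned by download_extract_sqlite().

```python
from pathlib import Path
from typing import Sequence


def expected_sqlite_path(
    version: str,
    prefix: Sequence[str] = ("chembl",),
    base: Path = Path("~/.data"),
) -> Path:
    """Guess where the extracted SQLite file lands (hypothetical helper).

    The layout (prefix / version / archive folders) is an assumption based
    on the example comment, not a documented guarantee.
    """
    name = f"chembl_{version}"
    return base.joinpath(*prefix, version, name, f"{name}_sqlite", f"{name}.db")


path = expected_sqlite_path("29", prefix=["pyobo", "raw", "chembl"])
print(path.as_posix())
```

Call path.expanduser() to resolve the leading ~ to your home directory.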

🚀 Installation

The most recent release can be installed from PyPI with uv:

    $ uv pip install chembl_downloader

or with pip:

    $ python3 -m pip install chembl_downloader

The most recent code and data can be installed directly from GitHub with uv:

    $ uv pip install git+https://github.com/cthoyt/chembl-downloader.git

or with pip:

    $ python3 -m pip install git+https://github.com/cthoyt/chembl-downloader.git

Users

See who's using chembl-downloader.

Statistics and Compatibility

chembl-downloader is compatible with all versions of ChEMBL. However, some files are not available for all versions. For example, the SQLite version of the database was first added in release 21 (2015-02-12).

ChEMBL Version	Release Date	Total Named Compounds from SQLite
31	2022-07-12	41,585
30	2022-02-22	41,549
29	2021-07-01	41,383
28	2021-01-15	41,049
27	2020-05-18	40,834
26	2020-02-14	40,822
25	2019-02-01	39,885
24_1	2018-05-01	39,877
24
23	2017-05-18	39,584
22_1	2016-11-17
22		39,422
21	2015-02-12	39,347
20	2015-02-03	-
19	2014-07-23	-
18	2014-04-02	-
17	2013-09-16	-
16	2013-05-15	-
15	2013-01-30	-
14	2012-07-18	-
13	2012-02-29	-
12	2011-11-30	-
11	2011-06-07	-
10	2011-06-07	-
09	2011-01-04	-
08	2010-11-05	-
07	2010-09-03	-
06	2010-09-03	-
05	2010-06-07	-
04	2010-05-26	-
03	2010-04-30	-
02	2009-12-07	-
01	2009-10-28	-
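The availability rule stated above (SQLite dumps exist from release 21 onward) can be expressed as a small check. major_release and has_sqlite_dump are hypothetical helpers for this example; they only parse version labels in the formats shown in the table, such as 09, 24_1, or 31.

```python
def major_release(version: str) -> int:
    """Parse the major release number from labels like '09', '24_1', or '31'."""
    return int(version.split("_")[0])


def has_sqlite_dump(version: str) -> bool:
    """Return True if the release is 21 or later, per the note above."""
    return major_release(version) >= 21


checks = {v: has_sqlite_dump(v) for v in ["09", "20", "21", "24_1", "31"]}
print(checks)
```

A check like this can guard calls to the SQLite-based functions when looping over many releases.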

👐 Contributing

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.

👋 Attribution

⚖️ License

The code in this package is licensed under the MIT License.

🍪 Cookiecutter

This package was created with @audreyfeldroy's cookiecutter package using @cthoyt's cookiecutter-snekpack template.

🛠️ For Developers

See developer instructions

The final section of the README is for those who want to get involved by making a code contribution.

Development Installation

To install in development mode, use the following:

    $ git clone https://github.com/cthoyt/chembl-downloader.git
    $ cd chembl-downloader
    $ uv pip install -e .

Alternatively, install using pip:

    $ python3 -m pip install -e .

Updating Package Boilerplate

This project uses cruft to keep boilerplate (i.e., configuration, contribution guidelines, documentation configuration) up-to-date with the upstream cookiecutter package. Install cruft with either uv tool install cruft or python3 -m pip install cruft then run:

    $ cruft update

More info on Cruft's update command is available here.

🥼 Testing

After cloning the repository and installing tox with uv tool install tox --with tox-uv or python3 -m pip install tox tox-uv, the unit tests in the tests/ folder can be run reproducibly with:

    $ tox -e py

Additionally, these tests are automatically re-run with each commit in a GitHub Action.

📖 Building the Documentation

The documentation can be built locally using the following:

    $ git clone https://github.com/cthoyt/chembl-downloader.git
    $ cd chembl-downloader
    $ tox -e docs
    $ open docs/build/html/index.html

The documentation build automatically installs the package as well as the docs extra specified in the pyproject.toml. Sphinx plugins like texext can be added there. Additionally, they need to be added to the extensions list in docs/source/conf.py.

The documentation can be deployed to ReadTheDocs using this guide. The .readthedocs.yml YAML file contains all the configuration you'll need. You can also set up continuous integration on GitHub to check not only that Sphinx can build the documentation in an isolated environment (i.e., with tox -e docs-test) but also that ReadTheDocs can build it too.

Configuring ReadTheDocs

1. Log in to ReadTheDocs with your GitHub account to install the integration at https://readthedocs.org/accounts/login/?next=/dashboard/
2. Import your project by navigating to https://readthedocs.org/dashboard/import then clicking the plus icon next to your repository
3. You can rename the repository on the next screen using a more stylized name (i.e., with spaces and capital letters)
4. Click next, and you're good to go!

📦 Making a Release

Configuring Zenodo

Zenodo is a long-term archival system that assigns a DOI to each release of your package.

1. Log in to Zenodo via GitHub with this link: https://zenodo.org/oauth/login/github/?next=%2F. This brings you to a page that lists all of your organizations and asks you to approve installing the Zenodo app on GitHub. Click "grant" next to any organizations you want to enable the integration for, then click the big green "approve" button. This step only needs to be done once.
2. Navigate to https://zenodo.org/account/settings/github/, which lists all of your GitHub repositories (both in your username and any organizations you enabled). Click the on/off toggle for any relevant repositories. When you make a new repository, you'll have to come back to this page.

After these steps, you're ready to go! After you make a release on GitHub (steps for this are below), you can navigate to https://zenodo.org/account/settings/github/repository/cthoyt/chembl-downloader to see the DOI for the release and link to the Zenodo record for it.

Registering with the Python Package Index (PyPI)

You only have to do the following steps once.

1. Register for an account on the Python Package Index (PyPI)
2. Navigate to https://pypi.org/manage/account and make sure you have verified your email address. A verification email might not have been sent by default, so you might have to click the "options" dropdown next to your address to get to the "re-send verification email" button
3. Two-factor authentication has been required for PyPI since the end of 2023 (see this blog post from PyPI). This means you have to first issue account recovery codes, then set up two-factor authentication
4. Issue an API token from https://pypi.org/manage/account/token

Configuring your machine's connection to PyPI

You have to do the following steps once per machine.

    $ uv tool install keyring
    $ keyring set https://upload.pypi.org/legacy/ __token__
    $ keyring set https://test.pypi.org/legacy/ __token__

Note that this deprecates previous workflows using .pypirc.

Uploading to PyPI

After installing the package in development mode and installing tox with uv tool install tox --with tox-uv or python3 -m pip install tox tox-uv, run the following from the console:

    $ tox -e finish

This script does the following:

1. Uses bump-my-version to switch the version number in the pyproject.toml, CITATION.cff, src/chembl_downloader/version.py, and docs/source/conf.py to not have the -dev suffix
2. Packages the code in both a tar archive and a wheel using uv build
3. Uploads to PyPI using uv publish
4. Pushes to GitHub. You'll need to make a release that goes with the commit where the version was bumped.
5. Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can use tox -e bumpversion -- minor after.
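The version changes in the steps above can be pictured with a small sketch. This is not how bump-my-version is implemented; strip_dev and next_dev are hypothetical helpers, and the major.minor.patch-dev format and the starting value 0.5.1-dev are assumptions made for illustration.

```python
DEV_SUFFIX = "-dev"


def strip_dev(version: str) -> str:
    """Drop the -dev suffix to get the release version."""
    if version.endswith(DEV_SUFFIX):
        return version[: -len(DEV_SUFFIX)]
    return version


def next_dev(version: str, part: str = "patch") -> str:
    """Bump the given part and re-attach the -dev suffix."""
    major, minor, patch = (int(piece) for piece in strip_dev(version).split("."))
    if part == "patch":
        patch += 1
    elif part == "minor":
        minor, patch = minor + 1, 0
    else:
        raise ValueError(f"unsupported part: {part}")
    return f"{major}.{minor}.{patch}{DEV_SUFFIX}"


release = strip_dev("0.5.1-dev")  # version that gets packaged and uploaded
after_patch = next_dev(release)  # default bump once the release is done
after_minor = next_dev(after_patch, "minor")  # optional extra minor bump
print(release, after_patch, after_minor)
```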

Releasing on GitHub

1. Navigate to https://github.com/cthoyt/chembl-downloader/releases/new to draft a new release
2. Click the "Choose a Tag" dropdown and select the tag corresponding to the release you just made
3. Click the "Generate Release Notes" button to get a quick outline of recent changes. Modify the title and description as you see fit
4. Click the big green "Publish Release" button

This will trigger Zenodo to assign a DOI to your release as well.

About

Write reproducible code for getting and processing ChEMBL

chembl-downloader.readthedocs.io

Releases: 17

v0.5.0 Latest

Feb 20, 2025

+ 16 releases
