Write reproducible code for getting and processing ChEMBL
cthoyt/chembl-downloader
Reproducibly download, open, parse, and query ChEMBL.
Don't worry about downloading/extracting ChEMBL or versioning - just use `chembl_downloader` to write code that knows how to download it and use it automatically.
Download and extract the SQLite dump using the following:

    import chembl_downloader

    path = chembl_downloader.download_extract_sqlite(version='28')
After it's been downloaded and extracted once, it's smart and does not need to download again. It gets stored using `pystow` automatically in the `~/.data/chembl` directory.
Full technical documentation can be found on ReadTheDocs. Tutorials can be found in Jupyter notebooks in the `notebooks/` directory of the repository.
You can modify the previous code slightly by omitting the `version` keyword argument to automatically find the latest version of ChEMBL:

    import chembl_downloader

    path = chembl_downloader.download_extract_sqlite()
The `version` keyword argument is available for all functions in this package (e.g., including `connect()`, `cursor()`, and `query()`), but will be omitted below for brevity.
Inside the archive is a single SQLite database file. Normally, people manually untar this folder then do something with the resulting file. Don't do this, it's not reproducible! Instead, the file can be downloaded and a connection can be opened automatically with:

    import chembl_downloader

    with chembl_downloader.connect() as conn:
        with conn.cursor() as cursor:
            cursor.execute(...)  # run your query string
            rows = cursor.fetchall()  # get your results
The `cursor()` function provides a convenient wrapper around this operation:

    import chembl_downloader

    with chembl_downloader.cursor() as cursor:
        cursor.execute(...)  # run your query string
        rows = cursor.fetchall()  # get your results
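
For intuition, a wrapper of this kind can be sketched with the standard library alone. This is not the package's implementation: `open_cursor` is a hypothetical helper, and an in-memory database stands in for the ChEMBL SQLite file.

```python
import sqlite3
from contextlib import closing, contextmanager


@contextmanager
def open_cursor(path):
    """Yield a cursor on the SQLite database at ``path``.

    Hypothetical helper for illustration only. Both the cursor and the
    connection are closed when the ``with`` block exits.
    """
    with closing(sqlite3.connect(path)) as conn:
        with closing(conn.cursor()) as cursor:
            yield cursor


# ":memory:" stands in for the path to the extracted ChEMBL database
with open_cursor(":memory:") as cursor:
    cursor.execute("SELECT 1 + 1")  # run your query string
    rows = cursor.fetchall()  # get your results
```

Nesting the two `closing()` calls means the caller only has to manage a single `with` block.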
The most powerful function is `query()`, which builds on the previous `connect()` function in combination with `pandas.read_sql` to make a query and load the results into a pandas DataFrame for any downstream use.

    import chembl_downloader

    sql = """
    SELECT MOLECULE_DICTIONARY.chembl_id, MOLECULE_DICTIONARY.pref_name
    FROM MOLECULE_DICTIONARY
    JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno == COMPOUND_STRUCTURES.molregno
    WHERE molecule_dictionary.pref_name IS NOT NULL
    LIMIT 5
    """

    df = chembl_downloader.query(sql)
    df.to_csv(..., sep='\t', index=False)
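
To see what this query selects without downloading anything, the sketch below runs the same SQL against a toy in-memory database. The two tables are cut down to a few columns, and every row is made up for illustration; none of it is real ChEMBL data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema and invented rows, only loosely shaped like the ChEMBL tables
conn.executescript(
    """
    CREATE TABLE molecule_dictionary (molregno INTEGER PRIMARY KEY, chembl_id TEXT, pref_name TEXT);
    CREATE TABLE compound_structures (molregno INTEGER PRIMARY KEY, canonical_smiles TEXT);
    INSERT INTO molecule_dictionary VALUES (1, 'TOY1', 'EXAMPLE A'), (2, 'TOY2', NULL), (3, 'TOY3', 'EXAMPLE C');
    INSERT INTO compound_structures VALUES (1, 'CCO'), (2, 'CC'), (4, 'C');
    """
)

sql = """
SELECT MOLECULE_DICTIONARY.chembl_id, MOLECULE_DICTIONARY.pref_name
FROM MOLECULE_DICTIONARY
JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno == COMPOUND_STRUCTURES.molregno
WHERE molecule_dictionary.pref_name IS NOT NULL
LIMIT 5
"""

rows = conn.execute(sql).fetchall()
conn.close()
```

Only the first toy molecule survives: the second has no preferred name, and the third has no matching structure row, so the inner join drops it.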
Suggestion 1: use `pystow` to make a reproducible file path that's portable to other people's machines (e.g., it doesn't have your username in the path).
Suggestion 2: RDKit is now pip-installable with `pip install rdkit-pypi`, which means most users don't have to muck around with complicated conda environments and configurations. One of the powerful but understated tools in RDKit is the `rdkit.Chem.PandasTools` module.
This example is a bit more fit-for-purpose than the last two. The `supplier()` function makes sure that the latest SDF dump is downloaded and loads it from the gzip file into a `rdkit.Chem.ForwardSDMolSupplier` using a context manager to make sure the file doesn't get closed until after parsing is done. Like the previous examples, it can also explicitly take a `version`.

    from rdkit import Chem

    import chembl_downloader

    with chembl_downloader.supplier() as suppl:
        data = []
        for i, mol in enumerate(suppl):
            if mol is None or mol.GetNumAtoms() > 50:
                continue
            fp = Chem.PatternFingerprint(mol, fpSize=1024, tautomerFingerprints=True)
            smi = Chem.MolToSmiles(mol)
            data.append((smi, fp))
This example was adapted from Greg Landrum's RDKit blog post on generalized substructure search.
This example uses the `supplier()` method and RDKit to get SMILES strings from molecules in ChEMBL's SDF file. If you want direct access to the RDKit molecule objects, use `supplier()`.

    import chembl_downloader

    for smiles in chembl_downloader.iterate_smiles():
        print(smiles)
Building on the `supplier()` function, the `get_substructure_library()` function makes the preparation of a substructure library automated and reproducible. Additionally, it caches the results of the build. The build takes on the order of tens of minutes but only has to be done once; future loading from a pickle object takes on the order of seconds.
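
The build-once, load-from-pickle pattern described here can be sketched generically as follows. This is an illustration under stated assumptions, not the package's code: `cached_build` and `slow_build` are hypothetical names, and a small dictionary stands in for the substructure library.

```python
import pickle
import tempfile
from pathlib import Path


def cached_build(cache_path, build):
    """Return the pickled object at ``cache_path``, building and saving it on first use."""
    cache_path = Path(cache_path)
    if cache_path.is_file():
        with cache_path.open("rb") as file:
            return pickle.load(file)
    result = build()
    with cache_path.open("wb") as file:
        pickle.dump(result, file)
    return result


calls = []


def slow_build():
    """Stand-in for an expensive build step; records each invocation."""
    calls.append(1)
    return {"patterns": ["a", "b", "c"]}


with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "library.pkl"
    first = cached_build(path, slow_build)  # builds and writes the pickle
    second = cached_build(path, slow_build)  # loads from the pickle, no rebuild
```

The second call never invokes `slow_build`, which is why repeated loads are cheap once the cache file exists.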
The implementation was inspired by Greg Landrum's RDKit blog post, Some new features in the SubstructLibrary. The following example shows how it can be used to accomplish some of the first tasks presented in the post:

    from rdkit import Chem

    import chembl_downloader

    library = chembl_downloader.get_substructure_library()
    query = Chem.MolFromSmarts('[O,N]=C-c:1:c:c:n:c:c:1')
    matches = library.GetMatches(query)
ChEMBL makes available a file containing pre-computed 2048-bit, radius-2 Morgan fingerprints for each molecule. It can be downloaded using:

    import chembl_downloader

    path = chembl_downloader.download_fps()
The `version` and other keyword arguments are also valid for this function.
Load fingerprints with `chemfp`
The following wraps the `download_fps` function with `chemfp`'s fingerprint loader:

    import chembl_downloader

    arena = chembl_downloader.chemfp_load_fps()
The `version` and other keyword arguments are also valid for this function. More information on working with the `arena` object can be found here.
After installing, run the following CLI command to ensure the database is downloaded and send its path to stdout:

    $ chembl_downloader
Use `--test` to show two example queries:

    $ chembl_downloader --test
If you want to store the data elsewhere using `pystow` (e.g., in `pyobo` I also keep a copy of this file), you can use the `prefix` argument.

    import chembl_downloader

    # It gets downloaded/extracted to
    # ~/.data/pyobo/raw/chembl/29/chembl_29/chembl_29_sqlite/chembl_29.db
    path = chembl_downloader.download_extract_sqlite(prefix=['pyobo', 'raw', 'chembl'])
See the `pystow` documentation on configuring the storage location further.
The `prefix` keyword argument is available for all functions in this package (e.g., including `connect()`, `cursor()`, and `query()`).
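
As a rough illustration of how the prefix maps onto the directory layout, the sketch below rebuilds the path from the comment in the `prefix` example. `expected_sqlite_path` is a hypothetical helper, not part of the package. It assumes the default base directory `~/.data` and that other versions follow the same layout as version 29, which may not hold.

```python
from pathlib import Path


def expected_sqlite_path(prefix, version, base=None):
    """Compose the location of the extracted SQLite file.

    Hypothetical helper: mirrors the layout
    <base>/<prefix...>/<version>/chembl_<version>/chembl_<version>_sqlite/chembl_<version>.db
    """
    if base is None:
        base = Path.home() / ".data"  # assumed default storage directory
    name = f"chembl_{version}"
    return Path(base).joinpath(*prefix, version, name, f"{name}_sqlite", f"{name}.db")


path = expected_sqlite_path(["pyobo", "raw", "chembl"], "29")
```

In practice, prefer the path returned by `download_extract_sqlite()` over composing one by hand.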
The most recent release can be installed from PyPI with uv:

    $ uv pip install chembl_downloader

or with pip:

    $ python3 -m pip install chembl_downloader
The most recent code and data can be installed directly from GitHub with uv:

    $ uv --preview pip install git+https://github.com/cthoyt/chembl-downloader.git

or with pip:

    $ UV_PREVIEW=1 python3 -m pip install git+https://github.com/cthoyt/chembl-downloader.git

Note that this requires enabling `UV_PREVIEW` mode until the uv build backend becomes a stable feature.
See who's using `chembl-downloader`.
`chembl-downloader` is compatible with all versions of ChEMBL. However, some files are not available for all versions. For example, the SQLite version of the database was first added in release 21 (2015-02-12).
ChEMBL Version | Release Date | Total Named Compounds from SQLite |
---|---|---|
31 | 2022-07-12 | 41,585 |
30 | 2022-02-22 | 41,549 |
29 | 2021-07-01 | 41,383 |
28 | 2021-01-15 | 41,049 |
27 | 2020-05-18 | 40,834 |
26 | 2020-02-14 | 40,822 |
25 | 2019-02-01 | 39,885 |
24_1 | 2018-05-01 | 39,877 |
24 | | |
23 | 2017-05-18 | 39,584 |
22_1 | 2016-11-17 | |
22 | | 39,422 |
21 | 2015-02-12 | 39,347 |
20 | 2015-02-03 | - |
19 | 2014-07-23 | - |
18 | 2014-04-02 | - |
17 | 2013-09-16 | - |
16 | 2013-05-15 | - |
15 | 2013-01-30 | - |
14 | 2012-07-18 | - |
13 | 2012-02-29 | - |
12 | 2011-11-30 | - |
11 | 2011-06-07 | - |
10 | 2011-06-07 | - |
09 | 2011-01-04 | - |
08 | 2010-11-05 | - |
07 | 2010-09-03 | - |
06 | 2010-09-03 | - |
05 | 2010-06-07 | - |
04 | 2010-05-26 | - |
03 | 2010-04-30 | - |
02 | 2009-12-07 | - |
01 | 2009-10-28 | - |
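
Version labels such as `24_1` or `09` do not sort correctly as plain strings, so code that compares them needs a numeric key. The sketch below is one way to do that and to encode the rule that SQLite dumps start at release 21. Both functions are hypothetical helpers, not part of the package.

```python
def version_key(version):
    """Turn a version label such as '24_1' or '09' into a sortable tuple of integers."""
    return tuple(int(part) for part in version.split("_"))


def has_sqlite_dump(version, first_sqlite=(21,)):
    """Check whether a version is at or after the first release with a SQLite dump."""
    return version_key(version) >= first_sqlite


versions = ["09", "21", "22_1", "24", "24_1", "20"]
latest = max(versions, key=version_key)
with_sqlite = [v for v in sorted(versions, key=version_key) if has_sqlite_dump(v)]
```

Tuples compare element by element, so `24_1` sorts after `24` and `09` sorts before `20`.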
Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.
The code in this package is licensed under the MIT License.
This package was created with @audreyfeldroy's cookiecutter package using @cthoyt's cookiecutter-snekpack template.
See developer instructions
The final section of the README is for those who want to get involved by making a code contribution.
To install in development mode, use the following:

    $ git clone https://github.com/cthoyt/chembl-downloader.git
    $ cd chembl-downloader
    $ uv --preview pip install -e .
Alternatively, install using pip:

    $ UV_PREVIEW=1 python3 -m pip install -e .

Note that this requires enabling `UV_PREVIEW` mode until the uv build backend becomes a stable feature.
This project uses `cruft` to keep boilerplate (i.e., configuration, contribution guidelines, documentation configuration) up-to-date with the upstream cookiecutter package. Install cruft with either `uv tool install cruft` or `python3 -m pip install cruft`, then run:

    $ cruft update

More info on Cruft's update command is available here.
After cloning the repository and installing `tox` with `uv tool install tox --with tox-uv` or `python3 -m pip install tox tox-uv`, the unit tests in the `tests/` folder can be run reproducibly with:

    $ tox -e py

Additionally, these tests are automatically re-run with each commit in a GitHub Action.
The documentation can be built locally using the following:

    $ git clone https://github.com/cthoyt/chembl-downloader.git
    $ cd chembl-downloader
    $ tox -e docs
    $ open docs/build/html/index.html
The documentation build automatically installs the package as well as the `docs` extra specified in the `pyproject.toml`. `sphinx` plugins like `texext` can be added there. Additionally, they need to be added to the `extensions` list in `docs/source/conf.py`.
The documentation can be deployed to ReadTheDocs using this guide. The `.readthedocs.yml` YAML file contains all the configuration you'll need. You can also set up continuous integration on GitHub to check not only that Sphinx can build the documentation in an isolated environment (i.e., with `tox -e docs-test`) but also that ReadTheDocs can build it too.
- Log in to ReadTheDocs with your GitHub account to install the integration at https://readthedocs.org/accounts/login/?next=/dashboard/
- Import your project by navigating to https://readthedocs.org/dashboard/import then clicking the plus icon next to your repository
- You can rename the repository on the next screen using a more stylized name (i.e., with spaces and capital letters)
- Click next, and you're good to go!
Zenodo is a long-term archival system that assigns a DOI to each release of your package.
- Log in to Zenodo via GitHub with this link: https://zenodo.org/oauth/login/github/?next=%2F. This brings you to a page that lists all of your organizations and asks you to approve installing the Zenodo app on GitHub. Click "grant" next to any organizations you want to enable the integration for, then click the big green "approve" button. This step only needs to be done once.
- Navigate to https://zenodo.org/account/settings/github/, which lists all of your GitHub repositories (both in your username and any organizations you enabled). Click the on/off toggle for any relevant repositories. When you make a new repository, you'll have to come back to this
After these steps, you're ready to go! After you make a "release" on GitHub (steps for this are below), you can navigate to https://zenodo.org/account/settings/github/repository/cthoyt/chembl-downloader to see the DOI for the release and link to the Zenodo record for it.
You only have to do the following steps once.
- Register for an account on the Python Package Index (PyPI)
- Navigate to https://pypi.org/manage/account and make sure you have verified your email address. A verification email might not have been sent by default, so you might have to click the "options" dropdown next to your address to get to the "re-send verification email" button
- 2-Factor authentication is required for PyPI since the end of 2023 (see this blog post from PyPI). This means you have to first issue account recovery codes, then set up 2-factor authentication
- Issue an API token from https://pypi.org/manage/account/token
You have to do the following steps once per machine.

    $ uv tool install keyring
    $ keyring set https://upload.pypi.org/legacy/ __token__
    $ keyring set https://test.pypi.org/legacy/ __token__

Note that this deprecates previous workflows using `.pypirc`.
After installing the package in development mode and installing `tox` with `uv tool install tox --with tox-uv` or `python3 -m pip install tox tox-uv`, run the following from the console:

    $ tox -e finish
This script does the following:
- Uses bump-my-version to switch the version number in the `pyproject.toml`, `CITATION.cff`, `src/chembl_downloader/version.py`, and `docs/source/conf.py` to not have the `-dev` suffix
- Packages the code in both a tar archive and a wheel using `uv build`
- Uploads to PyPI using `uv publish`.
- Push to GitHub. You'll need to make a release going with the commit where the version was bumped.
- Bump the version to the next patch. If you made big changes and want to bump the version by minor, you can use `tox -e bumpversion -- minor` after.
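
The version arithmetic in these steps can be pictured with a small sketch. It is not how bump-my-version works internally. The helper names are hypothetical, the version numbers are invented, and it assumes versions of the form `MAJOR.MINOR.PATCH` with an optional `-dev` suffix.

```python
def release_version(version):
    """Drop a trailing '-dev' suffix, leaving other versions unchanged."""
    suffix = "-dev"
    return version[: -len(suffix)] if version.endswith(suffix) else version


def next_dev_version(version, part="patch"):
    """Bump the given part of a MAJOR.MINOR.PATCH version and mark it as in development."""
    major, minor, patch = (int(x) for x in release_version(version).split("."))
    if part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unsupported part: {part}")
    return f"{major}.{minor}.{patch}-dev"


released = release_version("0.4.6-dev")  # the version that gets built and uploaded
after_patch = next_dev_version(released)  # default bump after the release
after_minor = next_dev_version(after_patch, part="minor")  # optional follow-up minor bump
```

Under these assumptions, the release is `0.4.6`, development continues at `0.4.7-dev`, and a follow-up minor bump moves it to `0.5.0-dev`.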
- Navigate to https://github.com/cthoyt/chembl-downloader/releases/new to draft a new release
- Click the "Choose a Tag" dropdown and select the tag corresponding to the release you just made
- Click the "Generate Release Notes" button to get a quick outline of recent changes. Modify the title and description as you see fit
- Click the big green "Publish Release" button
This will trigger Zenodo to assign a DOI to your release as well.