Write reproducible code for getting and processing ChEMBL

cthoyt/chembl-downloader

Reproducibly download, open, parse, and query ChEMBL.

Don't worry about downloading/extracting ChEMBL or versioning - just use chembl_downloader to write code that knows how to download it and use it automatically.

💪 Getting Started

Download and extract the SQLite dump using the following:

    import chembl_downloader

    path = chembl_downloader.download_extract_sqlite(version='28')

After it's been downloaded and extracted once, it's smart and does not need to download again. It gets stored using pystow automatically in the ~/.data/chembl directory.

Full technical documentation can be found on ReadTheDocs. Tutorials can be found in Jupyter notebooks in the notebooks/ directory of the repository.

Download the Latest Version

You can modify the previous code slightly by omitting the version keyword argument to automatically find the latest version of ChEMBL:

    import chembl_downloader

    path = chembl_downloader.download_extract_sqlite()

The version keyword argument is available for all functions in this package (e.g., including connect(), cursor(), and query()), but will be omitted below for brevity.

Automatically Connect to SQLite

Inside the archive is a single SQLite database file. Normally, people manually untar this folder then do something with the resulting file. Don't do this, it's not reproducible! Instead, the file can be downloaded and a connection can be opened automatically with:

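    from contextlib import closing

    import chembl_downloader

    with chembl_downloader.connect() as conn:
        # sqlite3 cursors are not context managers, so wrap the cursor in closing()
        with closing(conn.cursor()) as cursor:
            cursor.execute(...)  # run your query string
            rows = cursor.fetchall()  # get your results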

The cursor() function provides a convenient wrapper around this operation:

    import chembl_downloader

    with chembl_downloader.cursor() as cursor:
        cursor.execute(...)  # run your query string
        rows = cursor.fetchall()  # get your results
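For intuition about what such a wrapper takes care of, here is a minimal sketch using only the standard library. It is not the package's implementation: open_cursor is a hypothetical helper introduced for illustration, and an in-memory database stands in for the ChEMBL SQLite file.

```python
import sqlite3
from contextlib import closing, contextmanager
from typing import Iterator


@contextmanager
def open_cursor(path: str) -> Iterator[sqlite3.Cursor]:
    """Yield a cursor, closing it and its connection on exit."""
    with closing(sqlite3.connect(path)) as conn:
        with closing(conn.cursor()) as cursor:
            yield cursor


# ":memory:" is a stand-in for the path to the extracted ChEMBL database
with open_cursor(":memory:") as cursor:
    cursor.execute("SELECT 1 + 1")
    rows = cursor.fetchall()
```

Nesting the two closing() calls means the cursor is released before the connection, even if the query raises.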

Run a Query and Get a Pandas DataFrame

The most powerful function is query() which builds on the previous connect() function in combination with pandas.read_sql to make a query and load the results into a pandas DataFrame for any downstream use.

    import chembl_downloader

    sql = """
    SELECT
        MOLECULE_DICTIONARY.chembl_id,
        MOLECULE_DICTIONARY.pref_name
    FROM MOLECULE_DICTIONARY
    JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno == COMPOUND_STRUCTURES.molregno
    WHERE molecule_dictionary.pref_name IS NOT NULL
    LIMIT 5
    """

    df = chembl_downloader.query(sql)
    df.to_csv(..., sep='\t', index=False)
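To try the same query without downloading ChEMBL, the sketch below runs it against a toy in-memory database and writes tab-separated output. It uses the standard library's sqlite3 and csv modules as a stand-in for query() and pandas; the two tables are cut down to a few columns and the rows are made up for illustration.

```python
import csv
import io
import sqlite3
from contextlib import closing

SQL = """
SELECT
    MOLECULE_DICTIONARY.chembl_id,
    MOLECULE_DICTIONARY.pref_name
FROM MOLECULE_DICTIONARY
JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno == COMPOUND_STRUCTURES.molregno
WHERE MOLECULE_DICTIONARY.pref_name IS NOT NULL
LIMIT 5
"""

with closing(sqlite3.connect(":memory:")) as conn:
    # Toy tables with made-up rows; the real ChEMBL tables have many more columns
    conn.execute(
        "CREATE TABLE MOLECULE_DICTIONARY (molregno INTEGER, chembl_id TEXT, pref_name TEXT)"
    )
    conn.execute("CREATE TABLE COMPOUND_STRUCTURES (molregno INTEGER, canonical_smiles TEXT)")
    conn.executemany(
        "INSERT INTO MOLECULE_DICTIONARY VALUES (?, ?, ?)",
        [(1, "TOY1", "EXAMPLE A"), (2, "TOY2", None), (3, "TOY3", "EXAMPLE C")],
    )
    conn.executemany(
        "INSERT INTO COMPOUND_STRUCTURES VALUES (?, ?)",
        [(1, "CCO"), (2, "CCN"), (3, "CCC")],
    )
    result = conn.execute(SQL)
    header = [column[0] for column in result.description]
    rows = sorted(result.fetchall())  # sort because the query has no ORDER BY

# Write the header and rows as tab-separated text
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
tsv = buffer.getvalue()
```

The row with a NULL pref_name is filtered out by the WHERE clause, so only two of the three toy molecules reach the output.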

Suggestion 1: use pystow to make a reproducible file path that's portable to other people's machines (e.g., it doesn't have your username in the path).

Suggestion 2: RDKit is now pip-installable with pip install rdkit-pypi, which means most users don't have to muck around with complicated conda environments and configurations. One of the powerful but understated tools in RDKit is the rdkit.Chem.PandasTools module.

SDF Usage

Access an RDKit supplier over entries in the SDF dump

This example is a bit more fit-for-purpose than the last two. The supplier() function makes sure that the latest SDF dump is downloaded and loads it from the gzip file into a rdkit.Chem.ForwardSDMolSupplier using a context manager to make sure the file doesn't get closed until after parsing is done. Like the previous examples, it can also explicitly take a version.

    from rdkit import Chem

    import chembl_downloader

    with chembl_downloader.supplier() as suppl:
        data = []
        for i, mol in enumerate(suppl):
            if mol is None or mol.GetNumAtoms() > 50:
                continue
            fp = Chem.PatternFingerprint(mol, fpSize=1024, tautomerFingerprints=True)
            smi = Chem.MolToSmiles(mol)
            data.append((smi, fp))

This example was adapted from Greg Landrum's RDKit blog post on generalized substructure search.
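The reason supplier() is a context manager can be shown with a small standard-library sketch: records are parsed lazily, so the gzip file has to stay open until the caller has finished iterating. This is an illustration only; toy_supplier and _records are hypothetical names, and plain text records stand in for RDKit molecules.

```python
import gzip
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


def _records(handle) -> Iterator[str]:
    """Lazily yield SDF records, which are terminated by a ``$$$$`` line."""
    lines = []
    for line in handle:
        if line.strip() == "$$$$":
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)


@contextmanager
def toy_supplier(path: Path) -> Iterator[Iterator[str]]:
    """Keep the gzip file open for as long as the caller is inside the block."""
    with gzip.open(path, mode="rt") as handle:
        yield _records(handle)


with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "toy.sdf.gz"
    # Write a tiny gzipped file with two placeholder records
    with gzip.open(path, mode="wt") as handle:
        handle.write("record one\n$$$$\nrecord two\n$$$$\n")
    with toy_supplier(path) as records:
        names = [record.strip() for record in records]
```

Consuming the iterator after leaving the with block would fail because the file handle is already closed, which is the situation the context manager is there to prevent.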

Iterate over SMILES

This example uses the supplier() method and RDKit to get SMILES strings from molecules in ChEMBL's SDF file. If you want direct access to the RDKit molecule objects, use supplier().

    import chembl_downloader

    for smiles in chembl_downloader.iterate_smiles():
        print(smiles)

Get an RDKit substructure library

Building on the supplier() function, get_substructure_library() makes the preparation of a substructure library automated and reproducible. Additionally, it caches the results of the build, so the build, which takes on the order of tens of minutes, only has to be done once. Future loading from a pickle object takes on the order of seconds.

The implementation was inspired by Greg Landrum's RDKit blog post, Some new features in the SubstructLibrary. The following example shows how it can be used to accomplish some of the first tasks presented in the post:

    from rdkit import Chem

    import chembl_downloader

    library = chembl_downloader.get_substructure_library()
    query = Chem.MolFromSmarts('[O,N]=C-c:1:c:c:n:c:c:1')
    matches = library.GetMatches(query)
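The build-once, load-from-pickle behavior described for get_substructure_library() follows a common caching pattern, sketched below with the standard library. This is not the package's code: cached_build and slow_build are hypothetical names, and a small dictionary stands in for the substructure library.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def cached_build(path: Path, build: Callable[[], T]) -> T:
    """Load a pickled object from ``path``, or build it and pickle it first."""
    if path.is_file():
        with path.open("rb") as file:
            return pickle.load(file)
    obj = build()
    with path.open("wb") as file:
        pickle.dump(obj, file, protocol=pickle.HIGHEST_PROTOCOL)
    return obj


build_calls = []


def slow_build() -> dict:
    """Stand-in for an expensive build step; records each invocation."""
    build_calls.append(1)
    return {"size": 3}


with tempfile.TemporaryDirectory() as directory:
    cache_path = Path(directory) / "library.pkl"
    first = cached_build(cache_path, slow_build)
    second = cached_build(cache_path, slow_build)  # served from the pickle
```

As with any pickle-based cache, the file should only be loaded if it was written by code you trust.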

Morgan Fingerprints Usage

Get the Morgan Fingerprint file

ChEMBL makes available a file containing pre-computed 2048-bit, radius-2 Morgan fingerprints for each molecule. It can be downloaded using:

    import chembl_downloader

    path = chembl_downloader.download_fps()

The version and other keyword arguments are also valid for this function.
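If you want to inspect the fingerprints without extra dependencies, the sketch below parses a few lines and computes a Tanimoto similarity. It assumes the file follows chemfp's FPS text format (header lines starting with #, then a hex-encoded fingerprint, a tab, and an identifier); parse_fps and tanimoto are hypothetical helpers, and the toy fingerprints are 16 bits rather than 2048.

```python
from typing import Dict


def parse_fps(text: str) -> Dict[str, int]:
    """Map identifier -> fingerprint (as an int) from FPS-formatted text."""
    fingerprints = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header metadata
        hex_fp, identifier = line.split("\t")[:2]
        fingerprints[identifier] = int(hex_fp, 16)
    return fingerprints


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: shared on-bits divided by the union of on-bits."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


# Toy 16-bit fingerprints; the real file uses 2048 bits (512 hex characters)
toy = "#FPS1\n#num_bits=16\n00ff\tTOY1\n0f0f\tTOY2\n"
fps = parse_fps(toy)
similarity = tanimoto(fps["TOY1"], fps["TOY2"])
```

Parsing the hex string as one integer does not preserve chemfp's bit numbering, but Tanimoto only depends on bit counts, so the result is unaffected as long as both fingerprints are parsed the same way. For searches over the full file, a dedicated tool is the better choice.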

Load fingerprints with chemfp

The following wraps the download_fps function with chemfp's fingerprint loader:

    import chembl_downloader

    arena = chembl_downloader.chemfp_load_fps()

The version and other keyword arguments are also valid for this function. More information on working with the arena object can be found in the chemfp documentation.

Command Line Interface

After installing, run the following CLI command to ensure ChEMBL is downloaded and send the path to stdout:

    $ chembl_downloader

Use --test to show two example queries:

    $ chembl_downloader --test

Configuration

If you want to store the data elsewhere using pystow (e.g., in pyobo I also keep a copy of this file), you can use the prefix argument.

    import chembl_downloader

    # It gets downloaded/extracted to
    # ~/.data/pyobo/raw/chembl/29/chembl_29/chembl_29_sqlite/chembl_29.db
    path = chembl_downloader.download_extract_sqlite(prefix=['pyobo', 'raw', 'chembl'])

See the pystow documentation on configuring the storage location further.

The prefix keyword argument is available for all functions in this package (e.g., including connect(), cursor(), and query()).
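To make the effect of prefix concrete, the sketch below rebuilds the example path from its parts. The layout is inferred from the single example path in the comment above and assumes the default base directory of ~/.data; expected_sqlite_path is a hypothetical helper, not a function of chembl_downloader or pystow, and the real location may differ if the storage location is configured otherwise.

```python
from pathlib import Path
from typing import Sequence


def expected_sqlite_path(prefix: Sequence[str], version: str) -> Path:
    """Guess where the extracted SQLite file lands (hypothetical helper)."""
    name = f"chembl_{version}"
    # Layout inferred from the single example path, not from pystow's code
    return Path.home().joinpath(
        ".data", *prefix, version, name, f"{name}_sqlite", f"{name}.db"
    )


path = expected_sqlite_path(["pyobo", "raw", "chembl"], "29")
relative = path.relative_to(Path.home()).as_posix()
```

In real code, prefer the path returned by download_extract_sqlite() over reconstructing it by hand.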

🚀 Installation

The most recent release can be installed from PyPI with uv:

    $ uv pip install chembl_downloader

or with pip:

    $ python3 -m pip install chembl_downloader

The most recent code and data can be installed directly from GitHub with uv:

    $ uv --preview pip install git+https://github.com/cthoyt/chembl-downloader.git

or with pip:

    $ UV_PREVIEW=1 python3 -m pip install git+https://github.com/cthoyt/chembl-downloader.git

Note that this requires setting UV_PREVIEW mode enabled until the uv build backend becomes a stable feature.

Users

See who's using chembl-downloader.

Statistics and Compatibility

chembl-downloader is compatible with all versions of ChEMBL. However, some files are not available for all versions. For example, the SQLite version of the database was first added in release 21 (2015-02-12).

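| ChEMBL Version | Release Date | Total Named Compounds from SQLite |
|----------------|--------------|-----------------------------------|
| 31             | 2022-07-12   | 41,585                            |
| 30             | 2022-02-22   | 41,549                            |
| 29             | 2021-07-01   | 41,383                            |
| 28             | 2021-01-15   | 41,049                            |
| 27             | 2020-05-18   | 40,834                            |
| 26             | 2020-02-14   | 40,822                            |
| 25             | 2019-02-01   | 39,885                            |
| 24_1           | 2018-05-01   | 39,877                            |
| 24             |              |                                   |
| 23             | 2017-05-18   | 39,584                            |
| 22_1           | 2016-11-17   |                                   |
| 22             |              | 39,422                            |
| 21             | 2015-02-12   | 39,347                            |
| 20             | 2015-02-03   | -                                 |
| 19             | 2014-07-23   | -                                 |
| 18             | 2014-04-02   | -                                 |
| 17             | 2013-09-16   | -                                 |
| 16             | 2013-05-15   | -                                 |
| 15             | 2013-01-30   | -                                 |
| 14             | 2012-07-18   | -                                 |
| 13             | 2012-02-29   | -                                 |
| 12             | 2011-11-30   | -                                 |
| 11             | 2011-06-07   | -                                 |
| 10             | 2011-06-07   | -                                 |
| 09             | 2011-01-04   | -                                 |
| 08             | 2010-11-05   | -                                 |
| 07             | 2010-09-03   | -                                 |
| 06             | 2010-09-03   | -                                 |
| 05             | 2010-06-07   | -                                 |
| 04             | 2010-05-26   | -                                 |
| 03             | 2010-04-30   | -                                 |
| 02             | 2009-12-07   | -                                 |
| 01             | 2009-10-28   | -                                 |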
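Because the SQLite dump only exists from release 21 onward, a script that loops over versions may want to skip the older ones. The sketch below shows one way to do that; parse_version and has_sqlite_dump are hypothetical helpers, and the cutoff is taken from the statement in this section rather than from the package.

```python
from typing import Tuple

FIRST_SQLITE_RELEASE = 21  # per the compatibility note in this section


def parse_version(version: str) -> Tuple[int, int]:
    """Split a ChEMBL version such as ``24_1`` into (major, minor)."""
    major, _, minor = version.partition("_")
    return int(major), int(minor or 0)


def has_sqlite_dump(version: str) -> bool:
    """Return True if the release is new enough to ship a SQLite dump."""
    major, _minor = parse_version(version)
    return major >= FIRST_SQLITE_RELEASE


versions = ["31", "09", "22_1", "20", "21"]
# Sort numerically (so "09" precedes "20") and keep only releases with SQLite
available = [v for v in sorted(versions, key=parse_version) if has_sqlite_dump(v)]
```

Sorting on the parsed tuple rather than the raw string keeps point releases such as 22_1 in their natural position.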

👐 Contributing

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.

👋 Attribution

⚖️ License

The code in this package is licensed under the MIT License.

🍪 Cookiecutter

This package was created with @audreyfeldroy's cookiecutter package using @cthoyt's cookiecutter-snekpack template.

🛠️ For Developers

See developer instructions

The final section of the README is for if you want to get involved by making a code contribution.

Development Installation

To install in development mode, use the following:

    $ git clone https://github.com/cthoyt/chembl-downloader.git
    $ cd chembl-downloader
    $ uv --preview pip install -e .

Alternatively, install using pip:

    $ UV_PREVIEW=1 python3 -m pip install -e .

Note that this requires setting UV_PREVIEW mode enabled until the uv build backend becomes a stable feature.

Updating Package Boilerplate

This project uses cruft to keep boilerplate (i.e., configuration, contribution guidelines, documentation configuration) up-to-date with the upstream cookiecutter package. Install cruft with either uv tool install cruft or python3 -m pip install cruft then run:

    $ cruft update

More info on cruft's update command is available in the cruft documentation.

🥼 Testing

After cloning the repository and installing tox with uv tool install tox --with tox-uv or python3 -m pip install tox tox-uv, the unit tests in the tests/ folder can be run reproducibly with:

    $ tox -e py

Additionally, these tests are automatically re-run with each commit in a GitHub Action.

📖 Building the Documentation

The documentation can be built locally using the following:

    $ git clone https://github.com/cthoyt/chembl-downloader.git
    $ cd chembl-downloader
    $ tox -e docs
    $ open docs/build/html/index.html

The documentation build automatically installs the package as well as the docs extra specified in the pyproject.toml. Sphinx plugins like texext can be added there. Additionally, they need to be added to the extensions list in docs/source/conf.py.

The documentation can be deployed to ReadTheDocs using this guide. The .readthedocs.yml YAML file contains all the configuration you'll need. You can also set up continuous integration on GitHub to check not only that Sphinx can build the documentation in an isolated environment (i.e., with tox -e docs-test) but also that ReadTheDocs can build it too.

Configuring ReadTheDocs

  1. Log in to ReadTheDocs with your GitHub account to install the integration at https://readthedocs.org/accounts/login/?next=/dashboard/
  2. Import your project by navigating to https://readthedocs.org/dashboard/import then clicking the plus icon next to your repository
  3. You can rename the repository on the next screen using a more stylized name (i.e., with spaces and capital letters)
  4. Click next, and you're good to go!

📦 Making a Release

Configuring Zenodo

Zenodo is a long-term archival system that assigns a DOI to each release of your package.

  1. Log in to Zenodo via GitHub with this link: https://zenodo.org/oauth/login/github/?next=%2F. This brings you to a page that lists all of your organizations and asks you to approve installing the Zenodo app on GitHub. Click "grant" next to any organizations you want to enable the integration for, then click the big green "approve" button. This step only needs to be done once.
  2. Navigate to https://zenodo.org/account/settings/github/, which lists all of your GitHub repositories (both in your username and any organizations you enabled). Click the on/off toggle for any relevant repositories. When you make a new repository, you'll have to come back to this page.

After these steps, you're ready to go! After you make a release on GitHub (steps for this are below), you can navigate to https://zenodo.org/account/settings/github/repository/cthoyt/chembl-downloader to see the DOI for the release and link to the Zenodo record for it.

Registering with the Python Package Index (PyPI)

You only have to do the following steps once.

  1. Register for an account on the Python Package Index (PyPI)
  2. Navigate to https://pypi.org/manage/account and make sure you have verified your email address. A verification email might not have been sent by default, so you might have to click the "options" dropdown next to your address to get to the "re-send verification email" button
  3. Two-factor authentication has been required for PyPI since the end of 2023 (see this blog post from PyPI). This means you have to first issue account recovery codes, then set up two-factor authentication
  4. Issue an API token from https://pypi.org/manage/account/token

Configuring your machine's connection to PyPI

You have to do the following steps once per machine.

    $ uv tool install keyring
    $ keyring set https://upload.pypi.org/legacy/ __token__
    $ keyring set https://test.pypi.org/legacy/ __token__

Note that this deprecates previous workflows using .pypirc.

Uploading to PyPI

After installing the package in development mode and installing tox with uv tool install tox --with tox-uv or python3 -m pip install tox tox-uv, run the following from the console:

    $ tox -e finish

This script does the following:

  1. Uses bump-my-version to switch the version number in the pyproject.toml, CITATION.cff, src/chembl_downloader/version.py, and docs/source/conf.py to not have the -dev suffix
  2. Packages the code in both a tar archive and a wheel using uv build
  3. Uploads to PyPI using uv publish
  4. Pushes to GitHub. You'll need to make a release going with the commit where the version was bumped.
  5. Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can use tox -e bumpversion -- minor after.
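The version changes in steps 1 and 5 can be pictured with the small sketch below. It only illustrates the string manipulation under the assumption of a MAJOR.MINOR.PATCH-dev scheme; finalize and bump are hypothetical helpers, the version numbers are made up, and the actual work is done by bump-my-version according to the project's configuration.

```python
SUFFIX = "-dev"


def finalize(version: str) -> str:
    """Step 1: drop the development suffix to get the release version."""
    if not version.endswith(SUFFIX):
        raise ValueError(f"not a development version: {version}")
    return version[: -len(SUFFIX)]


def bump(version: str, part: str = "patch") -> str:
    """Step 5: move to the next development version."""
    if version.endswith(SUFFIX):
        version = version[: -len(SUFFIX)]
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "minor":
        minor, patch = minor + 1, 0  # a minor bump resets the patch number
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unsupported part: {part}")
    return f"{major}.{minor}.{patch}{SUFFIX}"


released = finalize("1.2.3-dev")        # the version that gets published
next_patch = bump(released)             # default follow-up after a release
next_minor = bump(next_patch, "minor")  # optional extra minor bump afterwards
```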

Releasing on GitHub

  1. Navigate to https://github.com/cthoyt/chembl-downloader/releases/new to draft a new release
  2. Click the "Choose a Tag" dropdown and select the tag corresponding to the release you just made
  3. Click the "Generate Release Notes" button to get a quick outline of recent changes. Modify the title and description as you see fit
  4. Click the big green "Publish Release" button

This will trigger Zenodo to assign a DOI to your release as well.

