
Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules

Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato,§ Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams,∇ and Alán Aspuru-Guzik‡,⊥,*
Kyulux North America Inc., 10 Post Office Square, Suite 800, Boston, Massachusetts 02109, United States
Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, United States
Department of Computer Science, University of Toronto, 6 King’s College Road, Toronto, Ontario M5S 3H5, Canada
§Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, U.K.
Google Brain, Mountain View, California, United States
Princeton University, Princeton, New Jersey, United States
Biologically-Inspired Solar Energy Program, Canadian Institute for Advanced Research (CIFAR), Toronto, Ontario M5S 1M1, Canada
*E-mail: alan@aspuru.edu.

Received 2017 Dec 2; Issue date 2018 Feb 28.

Copyright © 2018 American Chemical Society

This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.

PMCID: PMC5833007  PMID: 29532027

Abstract


We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in a set of molecules with fewer than nine heavy atoms.

Short abstract

To solve the inverse design challenge in chemistry, we convert molecules into continuous vector representations using neural networks. We demonstrate gradient-based property optimization of molecules.

Introduction

The goal of drug and material design is to identify novel molecules that have certain desirable properties. We view this as an optimization problem, in which we are searching for the molecules that maximize our quantitative desiderata. However, optimization in molecular space is extremely challenging, because the search space is large, discrete, and unstructured. Making and testing new compounds are costly and time-consuming, and the number of potential candidates is overwhelming. Only about 10^8 substances have ever been synthesized,1 whereas the range of potential drug-like molecules is estimated to be between 10^23 and 10^60.2

Virtual screening can be used to speed up this search.3–6 Virtual libraries containing thousands to hundreds of millions of candidates can be assayed with first-principles simulations or statistical predictions based on learned proxy models, and only the most promising leads are selected and tested experimentally.

However, even when accurate simulations are available,7 computational molecular design is limited by the search strategy used to explore chemical space. Current methods either exhaustively search through a fixed library,8,9 or use discrete local search methods such as genetic algorithms10–15 or similar discrete interpolation techniques.16–18 Although these techniques have led to useful new molecules, these approaches still face large challenges. Fixed libraries are monolithic, costly to fully explore, and require hand-crafted rules to avoid impractical chemistries. The genetic generation of compounds requires manual specification of heuristics for mutation and crossover rules. Discrete optimization methods have difficulty effectively searching large areas of chemical space because it is not possible to guide the search with gradients.

A molecular representation method that is continuous, data-driven, and can easily be converted into a machine-readable molecule has several advantages. First, hand-specified mutation rules are unnecessary, as new compounds can be generated automatically by modifying the vector representation and then decoding. Second, if we develop a differentiable model that maps from molecular representations to desirable properties, we can enable the use of gradient-based optimization to make larger jumps in chemical space. Gradient-based optimization can be combined with Bayesian inference methods to select compounds that are likely to be informative about the global optimum. Third, a data-driven representation can leverage large sets of unlabeled chemical compounds to automatically build an even larger implicit library, and then use the smaller set of labeled examples to build a regression model from the continuous representation to the desired properties. This lets us take advantage of large chemical databases containing millions of molecules, even when many properties are unknown for most compounds.

Recent advances in machine learning have resulted in powerful probabilistic generative models that, after being trained on real examples, are able to produce realistic synthetic samples. Such models usually also produce low-dimensional continuous representations of the data being modeled, allowing interpolation or analogical reasoning for natural images,19 text,20 speech, and music.21,22 We apply such generative models to chemical design, using a pair of deep networks trained as an autoencoder to convert molecules represented as SMILES strings into a continuous vector representation. In principle, this method of converting from a molecular representation to a continuous vector representation could be applied to any molecular representation, including chemical fingerprints,23 convolutional neural networks on graphs,24 similar graph-convolutions,25 and Coulomb matrices.26 We chose to use the SMILES representation because it can be readily converted into a molecule.

Using this new continuous vector-valued representation, we experiment with the use of continuous optimization to produce novel compounds. We trained the autoencoder jointly on a property prediction task: we added a multilayer perceptron that predicts property values from the continuous representation generated by the encoder, and included the regression error in our loss function. We then examined the effects that joint training had on the latent space, and tested optimization in this latent space for new molecules that optimize our desired properties.

Representation and Autoencoder Framework

The autoencoder comprises two deep networks: an encoder network to convert each string into a fixed-dimensional vector, and a decoder network to convert vectors back into strings (Figure 1a). The autoencoder is trained to minimize error in reproducing the original string; i.e., it attempts to learn the identity function. Key to the design of the autoencoder is the mapping of strings through an information bottleneck. This bottleneck—here the fixed-length continuous vector—induces the network to learn a compressed representation that captures the most statistically salient information in the data. We call the vector-encoded molecule the latent representation of the molecule.

Figure 1.

(a) A diagram of the autoencoder used for molecular design, including the joint property prediction model. Starting from a discrete molecular representation, such as a SMILES string, the encoder network converts each molecule into a vector in the latent space, which is effectively a continuous molecular representation. Given a point in the latent space, the decoder network produces a corresponding SMILES string. A multilayer perceptron network estimates the value of target properties associated with each molecule. (b) Gradient-based optimization in continuous latent space. After training a surrogate model f(z) to predict the properties of molecules based on their latent representation z, we can optimize f(z) with respect to z to find new latent representations expected to have high values of desired properties. These new latent representations can then be decoded into SMILES strings, at which point their properties can be tested empirically.

For unconstrained optimization in the latent space to work, points in the latent space must decode into valid SMILES strings that capture the chemical nature of the training data. Without this constraint, the latent space learned by the autoencoder may be sparse and may contain large “dead areas”, which decode to invalid SMILES strings. To help ensure that points in the latent space correspond to valid realistic molecules, we chose to use a variational autoencoder (VAE)27 framework. VAEs were developed as a principled approximate-inference method for latent-variable models, in which each datum has a corresponding, but unknown, latent representation. VAEs generalize autoencoders, adding stochasticity to the encoder which, combined with a penalty term, encourages all areas of the latent space to correspond to a valid decoding. The intuition is that adding noise to the encoded molecules forces the decoder to learn how to decode a wider variety of latent points and find more robust representations. Variational autoencoders with recurrent neural network encoding/decoding were proposed by Bowman et al. in the context of written English sentences, and we followed their approach closely.20 To leverage the power of recent advances in sequence-to-sequence autoencoders for modeling text, we used the SMILES28 representation, a commonly used text encoding for organic molecules. We also tested InChI29 as an alternative string representation, but found it to perform substantially worse than SMILES, presumably due to a more complex syntax that includes counting and arithmetic.
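
The core of the VAE is that the encoder outputs the parameters of a Gaussian distribution over latent vectors rather than a single point, and a Kullback–Leibler penalty keeps those distributions close to a standard normal prior. The sketch below illustrates these two ingredients with NumPy; the function names and shapes are ours, not those of the released code.

```python
import numpy as np

def reparameterize(mu, log_var, rng=np.random.default_rng(0)):
    # Sample z = mu + sigma * eps with eps ~ N(0, I): the "reparameterization trick"
    # that lets gradients flow through the stochastic encoder.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, I) ): the penalty term that discourages
    # "dead areas" by pulling every encoding toward the same prior.
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)
```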

The character-by-character nature of the SMILES representation and the fragility of its internal syntax (opening and closing cycles and branches, allowed valences, etc.) can still result in the output of invalid molecules from the decoder, even with the variational constraint. When converting a molecule from a latent representation to a molecule, the decoder model samples a string from the probability distribution over characters in each position generated by its final layer. As such, multiple SMILES strings are possible from a single latent space representation. We employed the open source cheminformatics suite RDKit30 to validate the chemical structures of output molecules and discard invalid ones. While it would be more efficient to limit the autoencoder to generate only valid strings, this postprocessing step is lightweight and allows for greater flexibility in the autoencoder to learn the architecture of the SMILES.
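
This postprocessing step can be reproduced with a few lines of RDKit: parse each decoded string, drop anything that does not yield a molecule, and canonicalize the rest. This is a minimal sketch of such a filter, not the authors' exact pipeline.

```python
from rdkit import Chem

def keep_valid(decoded_smiles):
    """Keep only decoder outputs that RDKit can parse into a molecule."""
    valid = []
    for smi in decoded_smiles:
        mol = Chem.MolFromSmiles(smi)  # returns None for invalid SMILES
        if mol is not None:
            valid.append(Chem.MolToSmiles(mol))  # store the canonical form
    return valid
```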

To enable molecular design, the chemical structures encoded in the continuous representation of the autoencoder need to be correlated with the target properties that we are seeking to optimize. Therefore, we added a model to the autoencoder that predicts the properties from the latent space representation. This autoencoder was then trained jointly on the reconstruction task and a property prediction task; an additional multilayer perceptron (MLP) was used to predict the property from the latent vector of the encoded molecule. To propose promising new candidate molecules, we can start from the latent vector of an encoded molecule and then move in the direction most likely to improve the desired attribute. The resulting new candidate vectors can then be decoded into corresponding molecules (Figure 1b).
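
Conceptually, the property model is a small MLP attached to the latent vector, and its regression error is simply added to the autoencoder loss. The Keras-style sketch below uses the layer sizes reported in the Methods section (three layers of 67 neurons on the 196-dimensional ZINC latent space); the loss weights in the comment are illustrative placeholders, not values from the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 196  # ZINC latent space size reported in the Methods section

z = keras.Input(shape=(LATENT_DIM,))
h = layers.Dense(67, activation="tanh")(z)
h = layers.Dense(67, activation="tanh")(h)
h = layers.Dense(67, activation="tanh")(h)
y_hat = layers.Dense(1)(h)  # scalar property, e.g. logP, QED, or SAS
property_head = keras.Model(z, y_hat, name="property_predictor")

# Joint objective (schematic):
# total_loss = reconstruction_loss + kl_weight * kl_loss + prop_weight * mse(y_true, y_hat)
```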

Two autoencoder systems were trained: one with 108 000 molecules from the QM9 data set of molecules with fewer than 9 heavy atoms31 and another with 250 000 drug-like commercially available molecules extracted at random from the ZINC database.32 We performed random optimization over hyperparameters specifying the deep autoencoder architecture and training, such as the choice between a recurrent or convolutional encoder, the number of hidden layers, layer sizes, regularization, and learning rates. The latent space representations for the QM9 and ZINC data sets had 156 dimensions and 196 dimensions, respectively.

Results and Discussion

Representation of Molecules in Latent Space

First, we analyze the fidelity of the autoencoder and the ability of the latent space to capture structural molecular features. Figure 2a shows a kernel density estimate of each dimension when encoding a set of 5000 randomly selected ZINC molecules from outside the training set. The kernel density estimate shows the distribution of data points along each dimension of the latent space. Whereas the distribution of data points in each individual dimension shows a slightly different mean and standard deviation, all the distributions are normal, as enforced by the variational regularizer.
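
A per-dimension density plot like the one in Figure 2a can be produced directly from the encoded vectors; the snippet below is a generic sketch using SciPy's kernel density estimator, with the array Z standing in for the encodings of the held-out molecules.

```python
import numpy as np
from scipy.stats import gaussian_kde

def latent_kdes(Z, grid=np.linspace(-4.0, 4.0, 200)):
    """One smoothed density per latent dimension of the encoded set Z (n_molecules x n_dims)."""
    return np.stack([gaussian_kde(Z[:, d])(grid) for d in range(Z.shape[1])])
```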

Figure 2.

Representations of the sampling results from the variational autoencoder. (a) Kernel Density Estimation (KDE) of each latent dimension of the autoencoder, i.e., the distribution of encoded molecules along each dimension of our latent space representation; (b) histogram of sampled molecules for a single point in the latent space; the distances of the molecules from the original query are shown by the lines corresponding to the right axis; (c) molecules sampled near the location of ibuprofen in latent space. The values below the molecules are the distance in latent space from the decoded molecule to ibuprofen; (d) slerp interpolation between two molecules in latent space using six steps of equal distance.

The variational autoencoder is a doubly probabilistic model. In addition to the Gaussian noise added to the encoder, which can be turned off by simply sampling the mean of the encoding distribution, the decoding process is also nondeterministic, as the string output is sampled from the final layer of the decoder. This implies that decoding a single point in the latent space back to a string representation is stochastic. Figure 2b shows the probability of decoding the latent representation of a sample FDA-approved drug molecule into several different molecules. For most latent points, a prominent molecule is decoded, and many other slight variations appear with lower frequencies. When these resulting SMILES are re-encoded into the latent space, the most frequent decoding also tends to be the one with the lowest Euclidean distance to the original point, indicating the latent space is indeed capturing features relevant to molecules.

Figure 2c shows some molecules in the latent space that are close to ibuprofen. These structures become less similar with increasing distance in the latent space. When the distance approaches the average distance between molecules in the training set, the changes are more pronounced, eventually resembling random molecules likely to be sampled from the training set. SI Figure 1d shows the distribution of distances in latent space between 50 000 random points from our ZINC training set. We estimate that we can find 30 such molecules in the locality of a molecule, i.e., 30 molecules closer to a given seed molecule from our data set than any other molecule in our data set. As such, we estimate that our autoencoder, which was trained on 250 000 molecules from ZINC, encodes approximately 7.5 million molecules. The probability of decoding from a point in latent space depends on how close this point is to the latent representations of other molecules; we observed a decoding rate of 73–79% for points that are close to known molecules, and 4% for randomly selected latent points.

A continuous latent space allows interpolation of molecules by following the shortest Euclidean path between their latent representations. When exploring high dimensional spaces, it is important to note that Euclidean distance might not map directly to notions of similarity of molecules.33 In high dimensional spaces, most of the mass of independent normally distributed random variables is not near the mean, but in an annulus around the mean.34 Interpolating linearly between two points might pass through an area of low probability; to keep the sampling in areas of high probability, we utilize spherical interpolation35 (slerp). With slerp, the path between two points is a circular arc lying on the surface of an N-dimensional sphere. Figure 2d shows the spherical interpolation between two random drug molecules, showing smooth transitions in between. SI Figure 3 shows the difference between linear and spherical interpolation.
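
Slerp between two latent vectors has a simple closed form; the sketch below is one common NumPy implementation (with a fallback to linear interpolation when the vectors are nearly parallel), not necessarily the exact routine used by the authors.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between latent vectors z0 and z1 for t in [0, 1]."""
    z0, z1 = np.asarray(z0, dtype=float), np.asarray(z1, dtype=float)
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))  # angle between the two vectors
    if np.isclose(np.sin(omega), 0.0):
        return (1.0 - t) * z0 + t * z1                # nearly parallel: plain lerp
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

# Six equally spaced points along the arc, as in Figure 2d:
# path = [slerp(z_start, z_end, t) for t in np.linspace(0.0, 1.0, 6)]
```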

Figure 3.

Two-dimensional PCA analysis of the latent space for the variational autoencoder. The two axes are the principal components selected from the PCA analysis; the color bar shows the value of the selected property. The first column shows the representation of all molecules from the listed data set using autoencoders trained without joint property prediction. The second column shows the representation of molecules using an autoencoder trained with joint property prediction. The third column shows a representation of random points in the latent space of the autoencoder trained with joint property prediction; the property values for these points are predicted using the property predictor network. The first three rows show the results of training on molecules from the ZINC data set for the logP, QED, and SAS properties; the last two rows show the results of training on the QM9 data set for the LUMO energy and the electronic spatial extent (R2).

Table 1 compares the distribution of chemical properties in the training sets against molecules generated with a baseline genetic algorithm, and molecules generated from the variational autoencoder. In the genetic algorithm, molecules were generated with a list of hand-designed rules.10–15 This process was seeded using 1000 random molecules from the ZINC data set and generated over 10 iterations. For molecules generated using the variational autoencoder, we collected the set of all molecules generated from 400 decoding attempts from the latent space points encoded from the same 1000 seed molecules. We compare the water–octanol partition coefficient (logP), the synthetic accessibility score (SAS),37 and the Quantitative Estimation of Drug-likeness (QED),38 which ranges in value between 0 and 1, with higher values indicating that the molecule is more drug-like. SI Figure 2 shows histograms of the properties of the molecules generated by each of these approaches and compares them to the distribution of properties from the original data set. Despite the fact that the VAE is trained purely on the SMILES strings independently of chemical properties, it is able to generate realistic-looking molecules whose features follow the intrinsic distribution of the training data. The molecules generated using the VAE show chemical properties that are more similar to the original data set than the set of molecules generated by the genetic algorithm. The two rightmost columns in Table 1 report the fraction of molecules that belong to the 17 million drug-like compounds from which the training set was selected and how often they can be found in a library of existing organic compounds. In the case of drug-like molecules, the VAE generates molecules that follow the property distribution of the training data, but are new, as the combinatorial space is extremely large and the training set is an arbitrary subsample. The hand-selected mutations are less able to generate new compounds while at the same time biasing the properties of the set toward higher chemical complexity and decreased drug-likeness. In the case of the QM9 data set, since the combinatorial space is smaller, the training set has more coverage and the VAE generates essentially the same population statistics as the training data.

Table 1. Comparison of Molecule Generation Results to Original Data Sets.

source(a)   data set(b)   samples(c)   logP(d)        SAS(e)         QED(f)         % in ZINC(g)   % in emol(h)
Data        ZINC          249k         2.46 (1.43)    3.05 (0.83)    0.73 (0.14)    100            12.9
GA          ZINC          5303         2.84 (1.86)    3.80 (1.01)    0.57 (0.20)    6.5            4.8
VAE         ZINC          8728         2.67 (1.46)    3.18 (0.86)    0.70 (0.14)    5.8            7.0
Data        QM9           134k         0.30 (1.00)    4.25 (0.94)    0.48 (0.07)    0.0            8.6
GA          QM9           5470         0.96 (1.53)    4.47 (1.01)    0.53 (0.13)    0.01           83.8
VAE         QM9           2839         0.30 (0.97)    4.34 (0.98)    0.47 (0.08)    0.0            8.9
(a) Describes the source of the molecules: Data refers to the original data set, GA refers to the genetic algorithm baseline, and VAE to our variational autoencoder trained without property prediction.

(b) Shows the data set used, either ZINC or QM9.

(c) Shows the number of samples generated for comparison; for Data, this value simply reflects the size of the data set. Columns d–f show the mean and, in parentheses, the standard deviation of selected properties of the generated molecules and compare them to the mean and standard deviation of properties in the original data set.

(d) Shows the water–octanol partition coefficient (logP).36

(e) Shows the synthetic accessibility score (SAS).37

(f) Shows the Quantitative Estimate of Drug-likeness (QED),38 ranging from 0 to 1. Columns g and h examine how many of the molecules generated by each method are found in two major molecule databases, and compare these values against the original data set:

(g) ZINC;

(h) E-molecules.39
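
The properties in Table 1 are standard cheminformatics descriptors and can be computed with RDKit; the snippet below is a rough sketch. logP and QED are available directly in RDKit, while the SA score lives in RDKit's contrib directory, so its import path (shown commented out) may vary between installs.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, QED
# The SA score (Ertl & Schuffenhauer) ships in RDKit's Contrib folder; path may differ:
# import sys, os
# from rdkit.Chem import RDConfig
# sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
# import sascorer

def profile(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return {
        "logP": Crippen.MolLogP(mol),  # Wildman-Crippen water-octanol partition coefficient
        "QED": QED.qed(mol),           # quantitative estimate of drug-likeness, between 0 and 1
        # "SAS": sascorer.calculateScore(mol),  # synthetic accessibility, ~1 (easy) to 10 (hard)
    }

print(profile("CC(C)Cc1ccc(cc1)C(C)C(=O)O"))  # ibuprofen
```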

Property Prediction of Molecules

The interest in discovering new molecules and chemicals is most often in relation to maximizing some desirable property. For this reason, we extended the purely generative model to also predict property values from the latent representation. We trained a multilayer perceptron jointly with the autoencoder to predict properties from the latent representation of each molecule.

With joint training for property prediction, the distribution of molecules in the latent space is organized by property values. Figure 3 shows the mapping of property values to the latent space representation of molecules, compressed into two dimensions using PCA. The latent space generated by autoencoders jointly trained with the property prediction task shows a gradient of property values in the distribution of molecules; molecules with high values are located in one region, and molecules with low values are in another. Autoencoders that were trained without the property prediction task do not show a discernible pattern with respect to property values in the resulting latent representation distribution.

While the primary purpose of adding property prediction was to organize the latent space, it is interesting to observe how the property predictor model compares with other standard models for property prediction. For a fairer comparison against other methods, we increased the size of our perceptron to two layers of 1000 neurons. Table 2 compares the performance of commonly used molecular embeddings and models to the VAE. Our VAE model shows property prediction performance for electronic properties (i.e., orbital energies) similar to graph convolutions for some properties; prediction accuracy could be improved with further hyperparameter optimization.

Table 2. MAE Prediction Error for Properties Using Various Methods on the ZINC and QM9 Data Sets.

database/property   mean(a)   ECFP(b)   CM(b)   GC(b)   1-hot SMILES(c)   Encoder(d)   VAE(e)
ZINC250k/logP       1.14      0.38      –       0.05    0.16              0.13         0.15
ZINC250k/QED        0.112     0.045     –       0.017   0.041             0.037        0.054
QM9/HOMO, eV        0.44      0.20      0.16    0.12    0.12              0.13         0.16
QM9/LUMO, eV        1.05      0.20      0.16    0.15    0.11              0.14         0.16
QM9/Gap, eV         1.07      0.30      0.24    0.18    0.16              0.18         0.21
(a) Baseline, mean prediction.

(b) As implemented in the DeepChem benchmark (MoleculeNet):40 ECFP, circular fingerprints; CM, Coulomb matrix; GC, graph convolutions.

(c) 1-hot encoding of SMILES used as input to the property predictor.

(d) The network trained without decoder loss.

(e) Full variational autoencoder network trained for individual properties.

Optimization of Molecules via Properties

We next optimized molecules in the latent space from the autoencoder which was jointly trained for property prediction. In order to create a smoother landscape over which to perform optimization, we used a Gaussian process model as a surrogate for the property predictor model. Gaussian processes can be used to predict any smooth continuous function41 and are extremely lightweight, requiring only a few minutes to train on a data set of a few thousand molecules. The Gaussian process was trained to predict target properties for molecules given the latent space representation of the molecules as an input.

The 2000 molecules used for training the Gaussian process were selected to be maximally diverse. Using this model, we optimized in the latent space to find a molecule that maximized our objective. As a baseline, we compared our optimization results against molecules found using a random Gaussian search and molecules optimized via a genetic algorithm.

The objective we chose to optimize was 5 × QED – SAS, where QED is the Quantitative Estimation of Drug-likeness,38 and SAS is the Synthetic Accessibility score.37 This objective represents a rough estimate of finding the most drug-like molecule that is also easy to synthesize. To provide the greatest challenge for our optimizer, we started with molecules from the ZINC data set that had an objective score in the bottom 10%, i.e., were in the 10th percentile.
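
The workflow of this section can be sketched in a few lines with scikit-learn and SciPy: fit a Gaussian process to (latent vector, 5 × QED – SAS) pairs, then maximize the GP's mean prediction starting from a seed latent point. This is a generic illustration under our own naming, not the authors' exact optimizer.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def fit_surrogate(Z, y):
    """Z: latent vectors of the diverse training subset; y: 5*QED - SAS for those molecules."""
    kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(Z.shape[1]))
    return GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(Z, y)

def optimize_latent(gp, z_seed):
    """Maximize the GP mean prediction, starting from the latent vector of a seed molecule."""
    neg_objective = lambda z: -gp.predict(z.reshape(1, -1))[0]
    result = minimize(neg_objective, z_seed, method="L-BFGS-B")
    return result.x  # candidate latent point, to be decoded back into a SMILES string
```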

From Figure 4a we can see that the optimization with the Gaussian process (GP) model on the latent space representation consistently results in molecules with a higher percentile score than the two baseline search methods. Figure 4b shows the path of one optimization from the starting molecule to the final molecule in the two-dimensional PCA representation, with the final molecule ending up in the region of high objective value. Figure 4c shows molecules decoded along this optimization path using a Gaussian interpolation.

Figure 4.

Optimization results for the jointly trained autoencoder using 5 × QED – SAS as the objective function. (a) shows a violin plot which compares the distribution of sampled molecules from normal random sampling, SMILES optimization via a common chemical transformation with a genetic algorithm, and from optimization on the trained Gaussian process model with varying amounts of training points. To offset differences in computational cost between the random search and the optimization on the Gaussian process model, the results of 400 iterations of random search were compared against the results of 200 iterations of optimization. This graph shows the combined results of four sets of trials. (b) shows the starting and ending points of several optimization runs on a PCA plot of latent space colored by the objective function. Highlighted in black is the path illustrated in part (c). (c) shows a spherical interpolation between the actual start and finish molecules using a constant step size. The QED, SAS, and percentile score are reported for each molecule.

Performing this optimization on a GP model trained with 1000 molecules leads to a slightly wider range of molecules, as shown in Figure 4a. Since the training set is smaller, the predictive power of the GP is lower; as a result, optimization in the latent space converges to several local minima instead of a single global optimum. In cases where it is difficult to define an objective that completely describes all the traits desired in a molecule, it may be better to use this localized optimization approach to reach a larger diversity of potential molecules.

Conclusion

We propose a new family of methods for exploring chemical space based on continuous encodings of molecules. These methods eliminate the need to hand-craft libraries of compounds and allow a new type of directed gradient-based search through chemical space. In our autoencoder model, we observed high fidelity in reconstruction of SMILES strings and the ability to capture characteristic features of a molecular training set. The autoencoder exhibited good predictive power when trained jointly with a property prediction task, and the ability to perform gradient-based optimization of molecules in the resulting smoothed latent space.

There are several directions for further improvement of this approach to molecular design. In this work, we used a text-based molecular encoding, but using a graph-based autoencoder would have several advantages. Forcing the decoder to produce valid SMILES strings makes the learning problem unnecessarily hard, since the decoder must also implicitly learn which strings are valid SMILES. An autoencoder that directly outputs molecular graphs is appealing, since it could explicitly address issues of graph isomorphism and the problem of strings that do not correspond to valid molecular graphs. Building an encoder which takes in molecular graphs is straightforward through the use of off-the-shelf molecular fingerprinting methods, such as ECFP23 or a continuously parametrized variant of ECFP such as neural molecular fingerprints.24 However, building a neural network which can output arbitrary graphs is an open problem.

Further extensions of this work to use an explicitly defined grammar for SMILES instead of forcing the model to learn one42 or to actively learn valid sequences43,44 are underway, as is the application of adversarial networks to this task.45–47 Several subsequent works have further explored the use of Long Short-Term Memory (LSTM) networks and recurrent networks applied to SMILES strings to generate new molecules48,49 and predict the outcomes of organic chemistry reactions.50

The autoencoder sometimes produced molecules that are formally valid as graphs but contain moieties that are not desirable because of stability or synthetic constraints. Examples are acid chlorides, anhydrides, cyclopentadienes, aziridines, enamines, hemiaminals, enol ethers, cyclobutadiene, and cycloheptatriene. One option is to train the autoencoder to predict properties related to steric or other structural constraints. In general, the objective function to be optimized needs to capture as many desirable traits as possible and balance them to ensure that the optimizer focuses on genuinely desirable compounds. This approach has also been tested in a few following works.43,44

The results reported in this work, and its application to optimizing objective functions of molecular properties, have already influenced, and will continue to influence, new avenues for molecular design.

Methods

Autoencoder Architecture

Strings of characters can be encoded into vectors using recurrent neural networks (RNNs). An encoder RNN can be paired with a decoder RNN to perform sequence-to-sequence learning.51 We also experimented with convolutional networks for string encoding52 and observed improved performance. This is explained by the presence of repetitive, translationally invariant substrings that correspond to chemical substructures, e.g., cycles and functional groups.

Our SMILES-based text encoding used a subset of 35 different characters for ZINC and 22 different characters for QM9. For ease of computation, we encoded strings up to a maximum length of 120 characters for ZINC and 34 characters for QM9, although in principle there is no hard limit to string length. Shorter strings were padded with spaces to this same length. We used only canonicalized SMILES for training to avoid dealing with equivalent SMILES representations. The structure of the VAE deep network was as follows: For the autoencoder used for the ZINC data set, the encoder used three 1D convolutional layers of filter sizes 9, 9, 10 and 9, 9, 11 convolution kernels, respectively, followed by one fully connected layer of width 196. The decoder fed into three layers of gated recurrent unit (GRU) networks53 with hidden dimension of 488. For the model used for the QM9 data set, the encoder used three 1D convolutional layers of filter sizes 2, 2, 1 and 5, 5, 4 convolution kernels, respectively, followed by one fully connected layer of width 156. The three recurrent neural network layers each had a hidden dimension of 500 neurons.
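
For readers who want to map these hyperparameters onto code, the following Keras sketch wires up the ZINC variant (convolutional encoder into a 196-dimensional latent space, three stacked GRU decoder layers of width 488). The variational sampling, loss terms, and teacher forcing described below are omitted, and the layer arguments are our reading of the text rather than the released implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, N_CHARS, LATENT_DIM = 120, 35, 196  # ZINC settings from the text

# Encoder: 1D convolutions over one-hot SMILES, then dense layers to (mu, log_var)
x_in = keras.Input(shape=(MAX_LEN, N_CHARS))
h = layers.Conv1D(9, 9, activation="tanh")(x_in)
h = layers.Conv1D(9, 9, activation="tanh")(h)
h = layers.Conv1D(10, 11, activation="tanh")(h)
h = layers.Flatten()(h)
h = layers.Dense(196, activation="tanh")(h)
z_mean = layers.Dense(LATENT_DIM)(h)
z_log_var = layers.Dense(LATENT_DIM)(h)
encoder = keras.Model(x_in, [z_mean, z_log_var], name="encoder")

# Decoder: repeat the latent vector along the sequence axis, then three GRU layers
z_in = keras.Input(shape=(LATENT_DIM,))
d = layers.RepeatVector(MAX_LEN)(z_in)
d = layers.GRU(488, return_sequences=True)(d)
d = layers.GRU(488, return_sequences=True)(d)
d = layers.GRU(488, return_sequences=True)(d)
x_out = layers.TimeDistributed(layers.Dense(N_CHARS, activation="softmax"))(d)
decoder = keras.Model(z_in, x_out, name="decoder")
```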

The last layer of the RNN decoder defines a probability distribution over all possible characters at each position in the SMILES string. This means that the writeout operation is stochastic, and the same point in latent space may decode into different SMILES strings, depending on the random seed used to sample characters. The output GRU layer had one additional input, corresponding to the character sampled from the softmax output of the previous time step, and was trained using teacher forcing.54 This increased the accuracy of generated SMILES strings, which resulted in higher fractions of valid SMILES strings for latent points outside the training data, but also made training more difficult, since the decoder showed a tendency to ignore the (variational) encoding and rely solely on the input sequence. The variational loss was annealed according to a sigmoid schedule after 29 epochs, running for a total of 120 epochs.
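
Sampling a character from the decoder's softmax output is the source of the stochastic writeout; a minimal sketch of that step is shown below (the optional temperature parameter is our addition for illustration, not something described in the paper).

```python
import numpy as np

def sample_char(char_probs, temperature=1.0, rng=np.random.default_rng()):
    """Draw one character index from the softmax distribution at a single sequence position."""
    logits = np.log(np.asarray(char_probs) + 1e-12) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)
```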

For property prediction, two fully connected layers of 1000 neurons were used to predict properties from the latent representation, with a dropout rate of 0.20. To simply shape the latent space, a smaller perceptron of 3 layers of 67 neurons was used for the property predictor, trained with a dropout rate of 0.15. For the algorithm trained on the ZINC data set, the objective properties include logP, QED, and SAS. For the algorithm trained on the QM9 data set, the objective properties include HOMO energies, LUMO energies, and the electronic spatial extent (R2). The property prediction loss was annealed in at the same time as the variational loss. We used the Keras55 and TensorFlow56 packages to build and train this model and the RDKit package for cheminformatics.30
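
One way to realize the sigmoid annealing schedule mentioned above is to scale the KL and property loss terms by an epoch-dependent weight; the sketch below shows this idea, with the steepness parameter being an assumption on our part rather than a value from the paper.

```python
import numpy as np

def anneal_weight(epoch, start_epoch=29, steepness=1.0):
    """Sigmoid schedule: the weight ramps from ~0 to ~1 around start_epoch, so the
    network first learns to reconstruct before the regularizing terms kick in."""
    return 1.0 / (1.0 + np.exp(-steepness * (epoch - start_epoch)))

# total_loss(epoch) = reconstruction + anneal_weight(epoch) * (kl_loss + property_loss)
```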

Acknowledgments

This work was supported financially by the Samsung Advanced Institute of Technology. The authors acknowledge the use of the Harvard FAS Odyssey Cluster and support from FAS Research Computing. J.N.W. acknowledges support from the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1144152. J.M.H.-L. acknowledges support from the Rafael del Pino Foundation. R.P.A. acknowledges support from the Alfred P. Sloan Foundation and NSF IIS-1421780. A.A.G. acknowledges support from The Department of Energy, Office of Basic Energy Sciences under Award DE-SC0015959. We thank Dr. Anders Frøseth for his generous support of this work.

Supporting Information Available

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acscentsci.7b00572.

  • Peripheral findings including statistics on the latent space, reconstruction accuracy, training robustness with respect to data set size, and more sampling interpolation examples (PDF)

Author Contributions

R.G.-B., J.N.W., J.M.H.-L., and D.D. contributed equally to this work.

The authors declare no competing financial interest.

Notes

The code and full training data sets are available at https://github.com/aspuru-guzik-group/chemical_vae.


References

  1. Kim S.; Thiessen P. A.; Bolton E. E.; Chen J.; Fu G.; Gindulyte A.; Han L.; He J.; He S.; Shoemaker B. A.; Wang J.; Yu B.; Zhang J.; Bryant S. H. PubChem Substance and Compound databases. Nucleic Acids Res. 2016, 44, D1202–D1213. 10.1093/nar/gkv951.
  2. Polishchuk P. G.; Madzhidov T. I.; Varnek A. Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput.-Aided Mol. Des. 2013, 27, 675–679. 10.1007/s10822-013-9672-4.
  3. Shoichet B. K. Virtual screening of chemical libraries. Nature 2004, 432, 862–865. 10.1038/nature03197.
  4. Scior T.; Bender A.; Tresadern G.; Medina-Franco J. L.; Martinez-Mayorga K.; Langer T.; Cuanalo-Contreras K.; Agrafiotis D. K. Recognizing Pitfalls in Virtual Screening: A Critical Review. J. Chem. Inf. Model. 2012, 52, 867–881. 10.1021/ci200528d.
  5. Cheng T.; Li Q.; Zhou Z.; Wang Y.; Bryant S. H. Structure-Based Virtual Screening for Drug Discovery: a Problem-Centric Review. AAPS J. 2012, 14, 133–141. 10.1208/s12248-012-9322-0.
  6. Pyzer-Knapp E. O.; Suh C.; Gómez-Bombarelli R.; Aguilera-Iparraguirre J.; Aspuru-Guzik A. What Is High-Throughput Virtual Screening? A Perspective from Organic Materials Discovery. Annu. Rev. Mater. Res. 2015, 45, 195–216. 10.1146/annurev-matsci-070214-020823.
  7. Schneider G. Virtual screening: an endless staircase? Nat. Rev. Drug Discovery 2010, 9, 273–276. 10.1038/nrd3139.
  8. Hachmann J.; Olivares-Amaya R.; Atahan-Evrenk S.; Amador-Bedolla C.; Sánchez-Carrera R. S.; Gold-Parker A.; Vogt L.; Brockway A. M.; Aspuru-Guzik A. The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the world community grid. J. Phys. Chem. Lett. 2011, 2, 2241–2251. 10.1021/jz200866s.
  9. Gómez-Bombarelli R.; et al. Nat. Mater. 2016, 15, 1120–1127. 10.1038/nmat4717.
  10. Virshup A. M.; Contreras-García J.; Wipf P.; Yang W.; Beratan D. N. Stochastic Voyages into Uncharted Chemical Space Produce a Representative Library of All Possible Drug-Like Compounds. J. Am. Chem. Soc. 2013, 135, 7296–7303. 10.1021/ja401184g.
  11. Rupakheti C.; Virshup A.; Yang W.; Beratan D. N. Strategy To Discover Diverse Optimal Molecules in the Small Molecule Universe. J. Chem. Inf. Model. 2015, 55, 529–537. 10.1021/ci500749q.
  12. Reymond J.-L. The Chemical Space Project. Acc. Chem. Res. 2015, 48, 722–730. 10.1021/ar500432k.
  13. Reymond J.-L.; van Deursen R.; Blum L. C.; Ruddigkeit L. Chemical space as a source for new drugs. MedChemComm 2010, 1, 30–38. 10.1039/c0md00020e.
  14. Kanal I. Y.; Owens S. G.; Bechtel J. S.; Hutchison G. R. Efficient Computational Screening of Organic Polymer Photovoltaics. J. Phys. Chem. Lett. 2013, 4, 1613–1623. 10.1021/jz400215j.
  15. O’Boyle N. M.; Campbell C. M.; Hutchison G. R. Computational Design and Selection of Optimal Organic Photovoltaic Materials. J. Phys. Chem. C 2011, 115, 16200–16210. 10.1021/jp202765c.
  16. van Deursen R.; Reymond J.-L. Chemical Space Travel. ChemMedChem 2007, 2, 636–640. 10.1002/cmdc.200700021.
  17. Wang M.; Hu X.; Beratan D. N.; Yang W. Designing molecules by optimizing potentials. J. Am. Chem. Soc. 2006, 128, 3228–3232. 10.1021/ja0572046.
  18. Balamurugan D.; Yang W.; Beratan D. N. Exploring chemical space with discrete, gradient, and hybrid optimization methods. J. Chem. Phys. 2008, 129, 174105. 10.1063/1.2987711.
  19. Radford A.; Metz L.; Chintala S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2015; https://arxiv.org/abs/1511.06434.
  20. Bowman S. R.; Vilnis L.; Vinyals O.; Dai A. M.; Jozefowicz R.; Bengio S. Generating Sentences from a Continuous Space, 2015; https://arxiv.org/abs/1511.06349.
  21. van den Oord A.; Dieleman S.; Zen H.; Simonyan K.; Vinyals O.; Graves A.; Kalchbrenner N.; Senior A.; Kavukcuoglu K. WaveNet: A Generative Model for Raw Audio, 2016; https://arxiv.org/abs/1609.03499.
  22. Engel J.; Resnick C.; Roberts A.; Dieleman S.; Eck D.; Simonyan K.; Norouzi M. Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders, 2017; http://arxiv.org/abs/1704.01279.
  23. Rogers D.; Hahn M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t.
  24. Duvenaud D. K.; Maclaurin D.; Iparraguirre J.; Gómez-Bombarelli R.; Hirzel T.; Aspuru-Guzik A.; Adams R. P. Convolutional Networks on Graphs for Learning Molecular Fingerprints. Adv. Neural Information Processing Syst. 2015, 2215–2223.
  25. Kearnes S.; McCloskey K.; Berndl M.; Pande V.; Riley P. Molecular Graph Convolutions: Moving Beyond Fingerprints, 2016; http://arxiv.org/abs/1603.00856.
  26. Rupp M.; Tkatchenko A.; Müller K.-R.; von Lilienfeld O. A. Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning. Phys. Rev. Lett. 2012, 108, 058301. 10.1103/PhysRevLett.108.058301.
  27. Kingma D. P.; Welling M. Auto-encoding Variational Bayes, 2013; https://arxiv.org/abs/1312.6114.
  28. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 1988, 28, 31–36. 10.1021/ci00057a005.
  29. Heller S.; McNaught A.; Stein S.; Tchekhovskoi D.; Pletnev I. InChI - the worldwide chemical structure identifier standard. J. Cheminf. 2013, 5, 7. 10.1186/1758-2946-5-7.
  30. RDKit: Open-source cheminformatics; http://www.rdkit.org [Online; accessed 11-April-2017].
  31. Ramakrishnan R.; Dral P. O.; Rupp M.; von Lilienfeld O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 2014, 1, 140022. 10.1038/sdata.2014.22.
  32. Irwin J. J.; Sterling T.; Mysinger M. M.; Bolstad E. S.; Coleman R. G. ZINC: A Free Tool to Discover Chemistry for Biology. J. Chem. Inf. Model. 2012, 52, 1757–1768. 10.1021/ci3001277.
  33. Aggarwal C. C.; Hinneburg A.; Keim D. A. Database Theory – ICDT 2001: 8th International Conference, London, UK, January 4–6, 2001, Proceedings; Springer: Berlin, Heidelberg, 2001; pp 420–434.
  34. Domingos P. A few useful things to know about machine learning. Commun. ACM 2012, 55, 78. 10.1145/2347736.2347755.
  35. White T. Sampling Generative Networks, 2016; http://arxiv.org/abs/1609.04468.
  36. Wildman S. A.; Crippen G. M. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Comput. Sci. 1999, 39, 868–873. 10.1021/ci990307l.
  37. Ertl P.; Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminf. 2009, 1, 8. 10.1186/1758-2946-1-8.
  38. Bickerton G. R.; Paolini G. V.; Besnard J.; Muresan S.; Hopkins A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 2012, 4, 90–98. 10.1038/nchem.1243.
  39. E-molecules. https://www.emolecules.com/info/plus/download-database [Online; accessed 22-July-2017].
  40. Wu Z.; Ramsundar B.; Feinberg E. N.; Gomes J.; Geniesse C.; Pappu A. S.; Leswing K.; Pande V. MoleculeNet: A Benchmark for Molecular Machine Learning, 2017; https://arxiv.org/abs/1703.00564.
  41. Rasmussen C. E.; Williams C. K. Gaussian Processes for Machine Learning; MIT Press: Cambridge, 2006; Vol. 1.
  42. Kusner M. J.; Paige B.; Hernández-Lobato J. M. Grammar Variational Autoencoder, 2017; https://arxiv.org/abs/1703.01925.
  43. Janz D.; van der Westhuizen J.; Hernández-Lobato J. M. Actively Learning what makes a Discrete Sequence Valid, 2017; http://arxiv.org/abs/1603.00856.
  44. Jaques N.; Gu S.; Bahdanau D.; Hernández-Lobato J. M.; Turner R. E.; Eck D. Sequence Tutor: Conservative fine-tuning of sequence generation models with KL-control. International Conference on Machine Learning, 2017; pp 1645–1654.
  45. Guimaraes G. L.; Sanchez-Lengeling B.; Farias P. L. C.; Aspuru-Guzik A. Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. arXiv:1705.10843, 2017.
  46. Sanchez-Lengeling B.; Outeiral C.; Guimaraes G. L.; Aspuru-Guzik A. Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC), 2017; https://chemrxiv.org/articles/ORGANIC_1_pdf/5309668.
  47. Blaschke T.; Olivecrona M.; Engkvist O.; Bajorath J.; Chen H. Application of generative autoencoder in de novo molecular design. Mol. Inf. 2017, 36, 1700123. 10.1002/minf.201700123.
  48. Yang X.; Zhang J.; Yoshizoe K.; Terayama K.; Tsuda K. ChemTS: an efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 2017, 18, 972–976. 10.1080/14686996.2017.1401424.
  49. Segler M. H.; Kogej T.; Tyrchan C.; Waller M. P. Generating focussed molecule libraries for drug discovery with recurrent neural networks, 2017; https://arxiv.org/abs/1701.01329.
  50. Liu B.; Ramsundar B.; Kawthekar P.; Shi J.; Gomes J.; Luu Nguyen Q.; Ho S.; Sloane J.; Wender P.; Pande V. Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models. ACS Cent. Sci. 2017, 3, 1103–1113. 10.1021/acscentsci.7b00303.
  51. Sutskever I.; Vinyals O.; Le Q. V. Sequence to sequence learning with neural networks. Adv. Neural Information Processing Syst. 2014, 3104–3112.
  52. Kalchbrenner N.; Grefenstette E.; Blunsom P. A Convolutional Neural Network for Modelling Sentences, 2014; https://arxiv.org/abs/1404.2188.
  53. Chung J.; Gülçehre Ç.; Cho K.; Bengio Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014; http://arxiv.org/abs/1412.3555.
  54. Williams R. J.; Zipser D. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Comput. 1989, 1, 270–280. 10.1162/neco.1989.1.2.270.
  55. Chollet F. Keras; https://github.com/fchollet/keras, 2015.
  56. Abadi M.; et al. TensorFlow: A system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016; pp 265–283.

