ACS Cent Sci. 2019 Sep 12;5(9):1523–1531. doi:10.1021/acscentsci.9b00476

BigSMILES: A Structurally-Based Line Notation for Describing Macromolecules

Tzyy-Shyang Lin^†, Connor W Coley^†, Hidenobu Mochigase^†, Haley K Beech^†, Wencong Wang^‡, Zi Wang^§, Eliot Woods^∥, Stephen L Craig^§, Jeremiah A Johnson^‡, Julia A Kalow^∥, Klavs F Jensen^†, Bradley D Olsen^†,^*

^†Department of Chemical Engineering and ^‡Department of Chemistry, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States

^§Department of Chemistry, Duke University, Durham, North Carolina 27708, United States

^∥Department of Chemistry, Northwestern University, Evanston, Illinois 60208, United States

E-mail: bdolsen@mit.edu.

Received 2019 May 14; Issue date 2019 Sep 25.

This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.

PMCID: PMC6764162 PMID: 31572779

Abstract

Having a compact yet robust structurally based identifier or representation system is a key enabling factor for efficient sharing and dissemination of research results within the chemistry community, and such systems lay down the essential foundations for future informatics and data-driven research. While substantial advances have been made for small molecules, the polymer community has struggled in coming up with an efficient representation system. This is because, unlike other disciplines in chemistry, the basic premise that each distinct chemical species corresponds to a well-defined chemical structure does not hold for polymers. Polymers are intrinsically stochastic molecules that are often ensembles with a distribution of chemical structures. This difficulty limits the applicability of all deterministic representations developed for small molecules. In this work, a new representation system that is capable of handling the stochastic nature of polymers is proposed. The new system is based on the popular “simplified molecular-input line-entry system” (SMILES), and it aims to provide representations that can be used as indexing identifiers for entries in polymer databases. As a pilot test, the entries of the standard data set of the glass transition temperature of linear polymers (Bicerano, 2002) were converted into the new BigSMILES language. Furthermore, it is hoped that the proposed system will provide a more effective language for communication within the polymer community and increase cohesion between the researchers within the community.

Short abstract

BigSMILES, a line notation that supports intrinsically stochastic molecules on top of the simplified molecular-input line-entry system (SMILES), is presented to pave the way for polymer informatics.

1. Introduction

Line notations that encode the connectivity of a molecule into a line of text are a very popular choice for storing chemical structures owing to their memory compactness, their simultaneous human readability and machine-friendliness, and their compatibility with most software and input systems.¹ In synergy with the advances in machine learning and data mining algorithms, a good line notation can enable data-driven research and materials discovery.²⁻⁴ For small molecules, many line notations have been developed, including the simplified molecular-input line-entry system (SMILES),^5,6 the SYBYL line notation (SLN),⁷ the Wiswesser line notation (WLN),⁸ ROSDAL,⁹ the modular chemical descriptor language (MCDL),¹⁰ or more recently the international chemical identifier (InChI).¹¹ Among them, SMILES is the most popular linear notation, and it is generally considered the most human-readable variant, with by far the widest software support.¹ In practice, SMILES provides a simple set of representations that are suitable as labels for chemical data and as a memory compact identifier for data exchange between researchers. Moreover, SMILES and its extensions serve as descriptive codes that allow rapid generation of graphical objects that could be searched for chemical structures with tools such as Open Babel.¹² Furthermore, as a text-based system, SMILES is also a natural fit to many text-based machine learning algorithms. When combined with string kernels, SMILES strings can be used with kernelized learning methods such as the support vector machine.¹³ These superior characteristics have made SMILES a perfect tool for translating chemistry knowledge into a machine-friendly form, and it has been successfully applied for small molecule property prediction¹⁴⁻¹⁶ and computer-aided synthesis planning.^2,17,18

However, polymers have resisted description by these structural languages. This is because most structural languages such as SMILES have been designed to describe molecules or chemical fragments that are well-defined atomistic graphs. Since polymers are stochastic molecules, they do not have unique SMILES representations. As discussed by Audus and de Pablo in a recent viewpoint,¹⁹ this lack of a unified naming or identifier convention for polymer materials is one of the major hurdles slowing down the development of the polymer informatics field. While pioneering efforts on polymer informatics such as the Polymer Genome project²⁰ have demonstrated the usefulness of SMILES extensions in polymer informatics, the fast development of new chemistry and the rapid development of materials informatics and data-driven research make the need for a universally applicable naming convention for polymers ever more urgent.²⁰⁻²⁸

Recently, several notable schemes have been proposed as a potential solution to this issue: the hierarchical editing language for macromolecules (HELM) developed by the Pistoia Alliance,²⁹ the International Union of Pure and Applied Chemistry (IUPAC) international chemical identifier (InChI),¹¹ and the CurlySMILES language,³⁰ an extension of SMILES that aims to provide support for polymers, composite materials, and crystals. However, while HELM is useful in describing macromolecules and biopolymers with well-defined structures, it is not designed to capture the stochastic nature of polymers. On the other hand, InChI is not specifically designed for polymers and does not support branched polymers, and CurlySMILES primarily focuses on polymers where structural features, such as the head–tail configuration, are already well-defined. Moreover, CurlySMILES requires the introduction of many new parameters to accompany its annotation syntax, which significantly increase the complexity of the language and reduce its readability. Finally, CurlySMILES does not support the encoding of randomly branched polymers. As such, the need for a flexible structurally based identifier system that supports a wide variety of different polymeric structures remains extremely pressing.

Here, a new structurally based construct is proposed as an addition to the highly successful SMILES representation that can treat the stochastic nature of polymer materials. Since polymers are high-molar-mass molecules, this construct is named BigSMILES. In BigSMILES, polymeric fragments are represented by a list of repeating units enclosed by curly brackets. The chemical structures of the repeating units are encoded using normal SMILES syntax, but with additional bonding descriptors that specify how different repeating units are connected to form polymers. As depicted in Figure 1, this simple design of syntax would enable the encoding of macromolecules over a wide range of different chemistries, including homopolymers, random copolymers, and block copolymers, and a variety of molecular connectivities, ranging from linear polymers to ring polymers to even branched polymers. Except for a handful of new additional rules and operators, all syntax of BigSMILES follows the same syntax as the original SMILES. This means that, as in SMILES, BigSMILES representations are compact, self-contained text strings. Furthermore, a multitude of polymer structures that are more complicated than the examples schematically illustrated in Figure 1 can be constructed through the composition of three new basic operators and original SMILES symbols. This is demonstrated in detail along with the discussion of BigSMILES syntax in the next section.

Figure 1. Schematic illustration of the syntax of BigSMILES and some of the structures that can be encoded using BigSMILES. Polymeric fragments are represented by a list of repeating units enclosed within curly brackets. Repeating units are composed of SMILES strings (represented by the red circles and blue squares in the left panel) with additional descriptors (black structures on the left panel) that give the details of the connectivity pattern between repeating units.

2. Syntax

The major extension of BigSMILES to SMILES is the introduction of an additional stochastic object that represents a fragment of a molecule that is intrinsically stochastic in its structure. Unlike small molecules, for which each string corresponds to a single chemical structure, references to the BigSMILES string refer to a group of molecules that have a distribution of structural features and properties. In analogy with statistical physics, this ensemble of polymer molecules consists of many molecular states, where each molecular state is an arrangement of atoms into a molecule that could possibly be realized. Each molecular state has some probability of occurrence, which is determined by the specific rules of chemistry governing the system as the molecules were formed. The rules determining the probability of observing a given molecule in a polymer may be extremely complex, involving changes in the probability of forming a given molecular configuration as a function of both time and space within a reactive system.

While the exact quantification of structural features and properties can be difficult, the monomer and mechanism of polymerization used restrict the set of possible chemical structures present in the ensemble based on the generally known rules of connectivity. Exploiting this, a stochastic object is defined as a machine-friendly representation of the molecular ensemble, without specifying the probability of occurrence of any individual molecular state. The stochastic objects resemble the widely used structural formula representation that is commonly used to describe macromolecules. The object is identified by a pair of curly brackets around a comma-delimited list of repeating units of the polymer:
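As a minimal sketch of how such an object might be handled in software, the following Python function splits a curly-bracketed, comma-delimited list into its repeating units. The function name `split_repeat_units` and the example string are illustrative assumptions and are not part of the BigSMILES definition; commas inside nested curly brackets are left untouched so that an object nested within a repeating unit stays intact.

```python
from typing import List


def split_repeat_units(stochastic_object: str) -> List[str]:
    """Split '{unit_1,unit_2,...}' into its top-level repeating units.

    Hypothetical helper: it only inspects curly brackets and commas and
    does not validate the SMILES content of each unit.
    """
    text = stochastic_object.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("a stochastic object must be enclosed in curly brackets")

    units: List[str] = []
    current: List[str] = []
    depth = 0  # nesting depth of curly brackets inside the object
    for ch in text[1:-1]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "," and depth == 0:
            # A top-level comma closes the current repeating unit.
            units.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    units.append("".join(current).strip())
    return units


# Illustrative input (an assumption, not a string quoted from the text).
print(split_repeat_units("{CC,CC(CC)}"))
```

The sketch deliberately says nothing about how often each unit occurs, in keeping with the point that the stochastic object lists the possible units without assigning probabilities to molecular states.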

The curly bracket is used to avoid conflict with other notation in the existing SMILES syntax. In this representation, each repeating unit within the object essentially resembles the repeating units that are bracketed within the parentheses in a structural formula. The comparison between a BigSMILES stochastic object and a corresponding structural formula is illustrated in Figure 2. In BigSMILES, the entire object, which is bracketed by the curly brackets, symbolizes a piece of a molecular fragment that has a random structure. Since BigSMILES is an extension of the SMILES language, the SMILES syntax for specifying chiral centers (namely, the use of “@” and “@@” symbols), aromatic atoms (the use of lowercase letters), electric charges, ring closures, and many other features⁵ is retained to provide means for encoding detailed molecular structures. Section SVI (p S17) and Section SVII (p S18) in the SI contain more details on the treatment of tacticity and polyelectrolytes.

Figure 2. Comparison between the structural formula (top) of poly(ethylene-vinyl acetate) (EVA) and the BigSMILES representation (bottom) of EVA. The representations of ethylene monomer shaded in orange and vinyl acetate monomer shaded in green in the structural formula are very similar to the machine-friendly BigSMILES stochastic object representation. Note that the BigSMILES representation exists as both a simplified representation, in which “$” are omitted, and a full representation, where bonding sites to other repeating units are explicitly indicated. While it is conventional to draw the structural formula with the repeating units in their canonical orientation, this does not imply that all the repeating units are oriented in this specific configuration within the polymer chain; the orientation may be head-to-tail or head-to-head in many cases depending on the nature of the polymerization. Here, the nature of vinyl polymerization is captured by the BigSMILES representation by allowing the units to take both orientations.

2.1. Bonding Descriptor Syntax

In simple linear polymer segments, the repeat units may be written in a way such that the strings for each repeat unit may be directly concatenated together in any order or orientation to form a representation of a polymer molecule. Figure 2 includes a representation of one such hypothetical polymer segment; however, in many cases (such as polymers synthesized via ring-opening polymerization with repeating units always in a specific orientation, or if there exist multiple sets of orthogonal reactions that prohibit the formation of a certain connection between some repeating units), more complex bonding patterns can arise. To differentiate and clearly specify different connectivity patterns between repeating units, two types of bonding descriptors are introduced.

The first type of connection is AA type bonding, where connections can occur between any two bonding moieties within a group of possible moieties. This is commonly found in the bonds formed from chain polymerization of vinyl monomers, where each polymerized vinyl carbon can in principle connect to any other polymerized vinyl carbon found in other repeat units (allowing for head-to-head, tail-to-tail, and head-to-tail addition). Section SI (p S2) in the Supporting Information gives a more detailed discussion of this feature. For this type of connectivity, the “$” notation is used. For example, for a linear polymer segment formed from vinyl monomers ethylene and 1-butene, the stochastic object reads

As illustrated by the example, in general, there are multiple equally valid representations for each repeating unit, and the bonding site to other repeating units (the position of symbol “$”) can be placed at any position in the repeating unit. Furthermore, there can be more than two such sites per repeat unit, which becomes useful when the notation is generalized to represent branched polymers. For example, Figure S2a (p S11, Section SIV in the SI) gives a representation for branched polyethylene using branching units with three or more connection sites to the other repeating units. It should be emphasized that the list in the BigSMILES stochastic object is defined based on repeating units rather than monomers. Therefore, for monomers such as isoprene, which may have up to four isomerization states upon polymerization,³¹ each isomerization state is treated as a distinct repeating unit in the stochastic object, as illustrated in Figure 3b.

Figure 3. Examples to illustrate the syntax of BigSMILES for polymers synthesized via different chemistries. (a) Vinyl copolymer poly(ethylene-co-1-butene) formed from chain polymerization, (b) four distinct isomerization states for polyisoprene (1,2-addition isomer is retained for completeness despite the fact that its amount can be negligibly small in natural rubber), (c) step polymerized nylon-6,6, (d) step polymerized poly(alanine-co-glycine), and (e) poly(ethylene glycol) methacrylate formed from polymerization of epoxides.

If multiple orthogonal sets of AA type connections exist within the same molecule, the symbol “$” can be appended with a positive integer n into “$n” to distinguish between different sets of connections. By default, “$” represents a single bond connection; however, if the repeating units are connected by other bonds, the bond type or bond order can be specified by using the SMILES bond order representation, with “$=n” for double bonds, “$#n” for triple bonds, and “$\n” and “$/n” for explicitly specifying the cis–trans isomerization states of single bonds directly adjacent to double bonds, respectively. Note that integer IDs n should serve as unique identifiers for the different sets of bonds and therefore not be reused within the same stochastic object. Since the scope of this identifier is only within the stochastic object (between the curly brackets), identifiers within different stochastic objects are distinct even if the IDs appear to be identical. Furthermore, while explicitly stating the additional bond order in every bonding descriptor enhances clarity, the first occurrence of the bond of a particular ID is treated as the definition for the connectivity pattern associated with the ID, and the details of the bond order can be omitted for simplicity in later occurrences. If there is only one group of bonds within the stochastic object, the integer ID can be dropped for simplicity if no additional descriptor (such as the bond order) is needed for the bond. In the special case where there are just two connective sites per repeat unit, and only one type of AA bond of bond order one exists, if the repeat unit is written such that these two sites are at the termini of the repeat unit, the symbol “$” may be omitted altogether, as in the case illustrated in Figure 3a,b. This provides a substantial simplification for a very wide range of common polymers and is referred to as the “simplified representation.”
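The rule that the first occurrence of an ID defines its bond type can be sketched in a few lines of Python. The helper names `parse_aa` and `resolve_bond_types` are hypothetical, the descriptors are assumed to have already been extracted from the string as separate tokens, and the ASCII hyphen "-" is assumed as the label for a single bond (the SMILES single-bond symbol); none of these choices is prescribed by the text.

```python
import re
from typing import List, Optional, Tuple

# "$", optionally followed by a bond symbol (=, #, \ or /) and a positive integer ID.
_AA_DESCRIPTOR = re.compile(r"\$([=#\\/]?)([1-9][0-9]*)?")


def parse_aa(token: str) -> Tuple[str, Optional[int]]:
    """Return (bond symbol or '', integer ID or None) for one '$' descriptor."""
    match = _AA_DESCRIPTOR.fullmatch(token)
    if match is None:
        raise ValueError(f"not an AA bonding descriptor: {token!r}")
    bond, ident = match.groups()
    return bond, (int(ident) if ident else None)


def resolve_bond_types(tokens: List[str]) -> List[Tuple[Optional[int], str]]:
    """Fill in omitted bond symbols from the first occurrence of each ID."""
    definition = {}  # ID -> bond symbol fixed by the first occurrence
    resolved = []
    for token in tokens:
        bond, ident = parse_aa(token)
        if ident not in definition:
            # First occurrence defines the bond; a bare "$" means a single bond.
            definition[ident] = bond or "-"
        resolved.append((ident, bond or definition[ident]))
    return resolved


# Illustrative tokens (assumed): ID 1 is declared as a double bond once.
print(resolve_bond_types(["$=1", "$1", "$2", "$2"]))
```

Keeping one dictionary per stochastic object mirrors the statement that the scope of an ID ends at the closing curly bracket.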

In the $ representation of AA bonding, any bond indicated by “$n” can be joined to any other bond “$n”, and the repeating unit in the polymeric structure need not connect in the orientation specified in the repeat unit list. Therefore, structures with repeat units in the flipped orientation are implicitly included. For instance, this bonding descriptor can be used for representing vinyl polymers, for which both the head-to-head and the head-to-tail configurations need to be included so the overall BigSMILES representations capture the full ensemble of the possible configuration of the polymer (Figure S1, p S2). Including both configurations is especially important in describing polymers such as poly(vinyl alcohol) or fluorinated vinyl polymers, for which there are known to be a significant number of head-to-head oriented pairs along the chain.³² However, it is emphasized that while the bonding descriptor specifies the ensemble of possible configurations, it does not provide information on the relative weights for each of the configurations.
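To make the implicit inclusion of flipped units concrete, the following Python sketch enumerates the junctions that arise when each unit in a short chain may appear in either orientation. It assumes, purely for illustration, that every repeat unit is written head-first, and it encodes the written orientation as +1 and the flipped one as -1; `junction_types` is a hypothetical name.

```python
from itertools import product
from typing import List, Sequence


def junction_types(orientations: Sequence[int]) -> List[str]:
    """Label each junction in a chain of units given as +1 (as written) or -1 (flipped)."""
    labels = []
    for left, right in zip(orientations, orientations[1:]):
        # A unit written head-first ends in its tail; a flipped unit ends in its head.
        end_of_left = "tail" if left == 1 else "head"
        start_of_right = "head" if right == 1 else "tail"
        labels.append("-to-".join(sorted([end_of_left, start_of_right])))
    return labels


# All four orientation choices for a dimer collapse onto three junction types.
dimers = {junction_types(pair)[0] for pair in product((1, -1), repeat=2)}
print(sorted(dimers))
```

The enumeration only lists which junctions are possible; like the bonding descriptor itself, it carries no information on how frequently each one occurs.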

For the second type of bonding, AB type bonding, a bonding moiety cannot connect directly to other moieties within the same group but can only connect to moieties in another conjugate group. This is commonly seen in monomers polymerized with condensation reactions. For example, in a polyamide, the amide bonds between monomers are always between an acid moiety and an amine moiety but never between two acid or two amine moieties. In this case, angle brackets “<” and “>” are used to indicate the bonds, where bonds must form between conjugate pairs of brackets. For example, the polymeric segment of nylon-6,6, as shown in Figure 3c, may be represented in BigSMILES as

As the asymmetric bonding descriptor represents bonds and connectivity resulting from the reaction of a pair of conjugate end groups, such as polymers synthesized from the polycondensation reaction of a pair of end groups, conjugate symbols are selected for each of the two bonds. For instance, in the nylon-6,6 example, all the amine ends are denoted by the symbol “>”, whereas the carboxyl ends are denoted by the opposite symbol “<”; similarly, the amine ends on the polypeptide in Figure 3d share the symbol “>”, and the carboxyl ends are denoted by the opposite symbol “<”. Similar to the “$” symbol, if multiple groups of AB type bonds exist, or higher bond order is needed, the notation can be extended to “<bn” and “>bn”, where b is either “–”, “=”, “#”, “\”, or “/” depending on the bond order or bond type, and n is a positive integer. Again, for single bonds, where b is “–”, b can be omitted for simplicity. Practical examples on the usage of the bonding descriptors and common errors in encoding BigSMILES strings are, respectively, provided in Section SIX (pp S45–S53) and Section SII (pp S6–S9) of the Supporting Information.
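The pairing rules for the two descriptor types can be summarized in a short Python sketch: “$” sites pair with “$” sites of the same ID, and “<” sites pair only with “>” sites of the same ID. The function name `can_bond` is hypothetical, the descriptors are assumed to be given as isolated tokens, and the single-bond symbol is assumed to be typed as an ASCII hyphen; the bond symbol is ignored in the comparison because it may be omitted after its first occurrence.

```python
import re
from typing import Optional, Tuple

# Descriptor symbol, optional bond symbol, optional positive integer ID.
_DESCRIPTOR = re.compile(r"([$<>])([-=#\\/]?)([1-9][0-9]*)?")


def _kind_and_id(token: str) -> Tuple[str, Optional[str]]:
    match = _DESCRIPTOR.fullmatch(token)
    if match is None:
        raise ValueError(f"not a bonding descriptor: {token!r}")
    return match.group(1), match.group(3)


def can_bond(a: str, b: str) -> bool:
    """Return True if two bonding descriptors are allowed to form a bond."""
    kind_a, id_a = _kind_and_id(a)
    kind_b, id_b = _kind_and_id(b)
    if id_a != id_b:
        return False  # different IDs denote orthogonal sets of bonds
    if kind_a == "$":
        return kind_b == "$"  # AA type: any "$" pairs with any "$"
    return {kind_a, kind_b} == {"<", ">"}  # AB type: only conjugate pairs


# Acid-amine pairing is allowed, acid-acid pairing is not.
print(can_bond("<", ">"), can_bond("<", "<"))
```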

2.2. Fragment Name Definition Notation

In BigSMILES syntax, repeating units are represented by an extended version of SMILES strings. While this design ensures that BigSMILES strings are standalone and self-descriptive, in some cases it might be more beneficial to have some portions of the BigSMILES representations be replaced by more abstract but compact proxies, for example, the names of repeating units. This is especially helpful when the structure is complex, and the resulting BigSMILES representation becomes long. To facilitate understanding, a definition of molecular fragments that associate user-defined names with partial BigSMILES strings is allowed in BigSMILES using the following syntax:

The definition of repeat unit names is placed at the end of the entire BigSMILES string, with each definition of fragment enclosed by curly brackets and delimited by periods. When fragments are used within the original BigSMILES object, the fragment name should be enclosed in square brackets to avoid potential confusion of # with triple bonds. Note that the fragments should conform to the BigSMILES syntax and produce a syntactically valid BigSMILES string when embedded within the original BigSMILES object through a substring replacement. In addition, while having complete (fully bracketed on both sides) BigSMILES stochastic objects within the fragment definition is allowed, no bonding descriptors (except for those within fully bracketed stochastic objects) should be included within the fragment definition, so that all occurrences of the bonding to other repeating units appear explicitly in the BigSMILES stochastic objects. In many cases, the fragment definition notation can significantly increase the readability of the BigSMILES strings; two such examples are illustrated in Figure 4. Fragment notation also provides a way of introducing monomer libraries to improve the readability of BigSMILES. An initial library illustrating a wide variety of examples is provided in the Supporting Information (Section SI, p S3, Table S1).
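The substring replacement described above might be sketched as follows. The sketch takes the fragment definitions as a Python dictionary rather than parsing them from the tail of the string, and it assumes that a fragment is referenced as `[#Name]`; the helper names, the fragment name `Ph`, and the example strings are all illustrative assumptions. The check on exposed bonding descriptors implements the restriction that descriptors may appear in a fragment only inside fully bracketed stochastic objects.

```python
from typing import Dict


def has_exposed_descriptor(fragment: str) -> bool:
    """Return True if '$', '<' or '>' occurs outside any curly brackets."""
    depth = 0
    for ch in fragment:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif depth == 0 and ch in "$<>":
            return True
    return False


def expand_fragments(main: str, fragments: Dict[str, str]) -> str:
    """Replace every '[#Name]' reference in `main` by its definition."""
    for name, body in fragments.items():
        if has_exposed_descriptor(body):
            raise ValueError(f"fragment {name!r} exposes a bonding descriptor")
        main = main.replace(f"[#{name}]", body)
    return main


# Hypothetical example: a phenyl pendant group referenced by name.
print(expand_fragments("{$CC([#Ph])$}", {"Ph": "c1ccccc1"}))
```

Because the reference is matched together with its square brackets, a name such as `Ph` is not confused with a longer name that merely starts with the same characters.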

Figure 4. Examples to illustrate useful features of BigSMILES syntax. First, pendant groups (a) or arms (b) can be replaced by user-defined names to improve readability. (c, d) Second, direct concatenation of BigSMILES stochastic objects provides simple representation for block copolymers. Finally, nesting of stochastic objects becomes useful in representing copolymers with oligomer chain extenders (e) or polymer grafts (f).

2.3. Concatenation and Nesting of Stochastic Objects

The BigSMILES stochastic object defined earlier represents polymeric fragments. In principle, as in the SMILES language, the adjacent strings outside the stochastic object concatenated to the string within the stochastic object form a continuous chemical structure. However, to ensure chemical validity, how the termini connect to exterior strings should be specified using leading and trailing bonding descriptors within the curly brackets:

The additional bonding descriptors indicate how the exterior atoms are connected to the fragment. Therefore, they should be conjugates to the specific desired terminal; i.e., the additional descriptor should be “>nb” if the desired terminal bond type is “<nb”. This syntax is especially useful for explicitly specifying how polymers with AB type bonds connect to the exterior. For example, if the end groups bracketing the terminals of the nylon-6,6 in Figure 3c are explicitly specified, with both ends terminated by a carboxylate group, additional “>” would need to be added to both ends of the stochastic object to indicate that the terminal bonds are connected to the carbon on the carboxylate group rather than the nitrogen on the amine group:

It should be noted that this concatenation syntax only allows up to two connections to the exterior. In some cases, because of the nature of the repeating unit, by specifying the ending bonding orientation, the bond type at the beginning of the object is also determined. This is common in polycondensation of AB type monomers or ring-opening polymerization, where the connectivity on one end completely determines the connectivity pattern on the other end. For example, if the end groups OH were to be positioned on the left of the stochastic object representing a glycine alanine copolymer, only the C-to-N orientation of the polyamide makes sense given the placement of the end group. In these cases where at least one end of the polymer is capped by external groups, if all repeating units within the stochastic objects have only a pair of conjugate connective sites belonging to the same AB bond group, and all repeating units are written so that the sites are placed at the termini with the same orientation, then “<” and “>” at the termini of the repeating units may be omitted to simplify the representation. With this simplification, the PEG example in Figure 3e may be simplified as

whereas the previous glycine alanine copolymer can be simplified as

Although the N-terminus of the polymer seems uncapped, there is an implicit hydrogen that terminates the polymer. Collectively, these simplifications for AB bonding are also referred to as the “simplified representation.”

The SMILES feature that allows string concatenation to represent a continuous chemical structure enables blocks of polymeric structure in a copolymer to be written as the direct concatenation of several stochastic objects. For example, a polyethylene-block-polystyrene structure shown in Figure 4c can be easily encoded by concatenating the two polymer segments

Similarly, this representation can be generalized to represent multiblock copolymers, such as the triblock poly(ethylene glycol)-block-poly(propylene glycol)-block-poly(ethylene glycol) (PEG-b-PPG-b-PEG) illustrated in Figure 4d. Note that, in this triblock copolymer example, the syntax is greatly simplified with the omission of the terminal “<” and “>” in the repeating units.

In the BigSMILES syntax, it is possible to nest multiple levels of stochastic objects within a stochastic object to create more complex structures. To illustrate the syntax of nesting, consider synthesis of a polyurethane through polycondensation of 1,3-propanediol, ethylene glycol oligomers, and toluene diisocyanate (TDI), as illustrated in Figure 4e. The ethylene glycol oligomers are encoded as one stochastic object, and this can be nested in a second stochastic object representing the overall polyurethane polymer:

This example can be easily generalized to describe polymers resulting from the polycondensation of more than two types of oligomers or repeating units.

Another scenario that demonstrates the convenience of nesting is the representation of graft polymers. Consider polyisobutene-graft-poly(methyl methacrylate) (PIB-g-PMMA) synthesized by grafting from the linear copolymer of poly[isobutene-co-(m-bromomethylstyrene)], illustrated in Figure 4f, as an example. With the polymer graft nested within the backbone, the graft polymer can be represented by

or, separately defining the polymer graft with the syntax provided in the previous section, the polymer can also be represented as

When possible, readability and ease of comprehension will usually benefit from encoding a polymer in a non-nested way.

2.4. Branched Polymers and Polymer Networks

Up to now, all examples have been focused on linear polymer segments, where each repeat unit has two attachment points corresponding to the start and end of its SMILES string. However, the stochastic object can also be generalized to represent randomly branching polymers. For example, consider a low-density polyethylene (LDPE) molecule with long chain branching (Figure S2a, p S11, Section SIV). Its BigSMILES representation is

Unlike other repeating units discussed up to this point, the second and third repeating units each have functionality larger than two (and therefore the “$” symbols cannot be omitted). Therefore, they serve as branching points, and the entire stochastic object represents a randomly branching structure, which resembles the structure of LDPE. Note that while linear segments of the LDPE molecule can have an odd number of carbons because of branching, the overall linear backbone of LDPE must have an even number of carbons. Hence, the repeating units in this case consist of molecular fragments with two carbons. In practical cases, the fraction of the last repeat unit should be very small compared to the other two repeating units, and this unit is retained in the list here for completeness. Other branched polymers or polymer networks can also be encoded similarly. In Section SIII (pp S10 and S11) of the Supporting Information, more examples, including hyperbranched polymers, end-linked polymer networks, and vulcanized networks, are given; additional discussion on noncovalent or dynamic networks can be found in Section SIV, pp S12 and S13 of the SI.

2.5. End Groups

In BigSMILES, there are two valid ways of specifying end groups. The first way is to explicitly append the end groups around the polymeric fragment represented by the stochastic object; this method allows specification of a deterministic end group. This was used in the previous section to specify the structure for a methacrylate terminated PEG, as illustrated in Figure 3e. The other way is to append the list of possible end groups as a comma-delimited list to the end of the list of repeating units, separated by a semicolon:

The end groups are represented as if they are also repeating units, with the same bonding descriptors “$nb”, “<nb”, or “>nb” as repeating units that indicate the allowed connectivity patterns between repeating units and the end groups. However, the nature of end groups dictates that they should have only one possible bond to another repeating unit, to terminate the structure. For example, in the nylon-6,6 case, two different end groups are possible

where the carboxylic and amine end groups are included within the list of repeating units. Note that, in this example, hydrogen atoms are explicitly written for clarity. When end groups are specified using this representation, it means that all the unconnected bonds on the molecular fragment generated using the list of repeating units with two or more connections to other repeating units are capped with the specified end groups. This representation can be especially useful when there are multiple possible end groups. For instance, the variability of the end groups on the two ends of nylon-6,6 synthesized from polycondensation of adipic acid and hexamethylenediamine is implicitly considered by using this representation. The effectiveness of the latter representation is especially demonstrated by the following example. Consider linear polystyrene synthesized from AIBN initiated radical polymerization. It could have three different end groups depending on the route of termination:

The possible terminal structures are illustrated in Figure 5a. In this example, the SMILES string leading to the random fragment is synonymous with specifying that the fragment already has one of its two ends capped by an initiator, indicated by the leading [$]. Therefore, it leaves only one unconnected bond on the fragment. The other end group can be one of the three end groups trailing the ethylene monomer in the list. The first possible case is that the other end group is also the initiator, which corresponds to the second entry on the list (first one in the end group list). This happens when termination by coupling takes place. The styrene repeating unit within the end group is written in a reversed orientation to emphasize the preferred configuration in polymerization. On the other hand, when termination from disproportionation occurs, the end of the polymer can be capped by either of the two groups at the end of the list.
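The semicolon-separated form and the one-bond requirement on end groups can be checked with a small Python sketch. It assumes an object without leading or trailing terminal descriptors, counts only the descriptor symbols “$”, “<”, and “>” that lie outside nested curly brackets, and uses a nylon-like example string that is an illustrative assumption consistent with the conjugate-pairing rule, not a string quoted from this work; all function names are hypothetical.

```python
from typing import List, Tuple


def _split_top_level(text: str, sep: str) -> List[str]:
    """Split `text` at `sep`, ignoring separators inside curly brackets."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def split_units_and_end_groups(obj: str) -> Tuple[List[str], List[str]]:
    """Return (repeating units, end groups) of '{units;end groups}'."""
    text = obj.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("a stochastic object must be enclosed in curly brackets")
    sections = _split_top_level(text[1:-1], ";")
    if len(sections) > 2:
        raise ValueError("at most one semicolon is expected")
    units = _split_top_level(sections[0], ",")
    ends = _split_top_level(sections[1], ",") if len(sections) == 2 else []
    return units, ends


def count_descriptors(unit: str) -> int:
    """Count '$', '<' and '>' symbols outside nested stochastic objects."""
    depth, count = 0, 0
    for ch in unit:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif depth == 0 and ch in "$<>":
            count += 1
    return count


# Illustrative (assumed) object: two repeating units and two end groups.
units, ends = split_units_and_end_groups(
    "{<C(=O)CCCCC(=O)<,>NCCCCCCN>;<[H],>O[H]}"
)
print(units, ends)
print(all(count_descriptors(e) == 1 for e in ends))  # each end group has one bond
```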

Figure 5. (a) Illustration of possible termination products for free radical polymerization of polystyrene. (b) Polystyrene ring polymer synthesized from azide–alkyne click chemistry. Since the rings and cycles within the repeating units are independent of the macrocycle (that lead to the formation of the ring polymer), the ring closure integer identifier within the stochastic object is independent of the identifiers outside of the object even if the numbers were the same. (c) Ring polymer synthesized from ring expansion metathesis polymerization (REMP).

When randomly branched polymers are considered, the representation that includes end groups in the list of repeating units is highly advantageous. Consider the hyperbranched polymer example in Figure S2b (p S11, Section SIV); if the end group for the #B moiety is #E, then the end groups of the hyperbranched polymer can be easily specified by including #E in the list of repeating units. Note that, in this case, it is impossible to explicitly append end group #E to the polymer fragment, because different members of the ensemble of molecules represented by the stochastic fragment have different numbers of unclosed bonds.

2.6. Macrocycles

For macrocycles that are well-defined, such as the cycle structures in ring polymers, the macrocycles are encoded using the usual syntax for describing cycles in SMILES. To illustrate this, consider a polystyrene ring synthesized through alkyne–azide click chemistry, as illustrated in Figure 5b. In this case, since rings and cycles within repeating units do not extend beyond a single repeating unit, the macrocycle associated with the ring polymer can be treated with the usual SMILES ring closure syntax. The integer identifier that was used within the repeating units for ring closure is considered to be independent of any ring closure ID that was used in other parts of the BigSMILES string. Therefore, in the BigSMILES representation for this polymer, the ring closure for the macrocycle is selected to be between the sulfur atom and its neighboring atom. Meanwhile, the ring closures denoted by 2 and 3 describe the ring closure in the phenyl group and the ring with nitrogen atoms. It should be emphasized that, similar to the ID used in bonding descriptors, the scope for ring indices within a stochastic object is local to the object, and independent of other ring indices not within the stochastic object. In principle, other well-defined, nonstochastic cycles can be encoded in a similar manner. For example, a ring polymer synthesized with ring expansion metathesis polymerization (REMP) developed by Grubbs and co-workers³³ can also be encoded with similar syntax, as illustrated in Figure 5c.
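The scoping rule for ring indices can be illustrated with a short Python sketch. It separates the text inside stochastic objects from the text outside them and checks that every ring closure digit is paired within its own scope. The function names and the test string are assumptions made for illustration; the string is a simplified cyclic polystyrene written for this sketch, not the click-chemistry example of Figure 5b. Two-digit closures written with "%" and nested stochastic objects are not handled.

```python
import re
from collections import Counter

# Bracketed tokens (bracket atoms and bonding descriptors such as [$1]) may
# contain digits that are not ring closures, so they are removed first.
BRACKET = re.compile(r"\[[^\[\]]*\]")


def split_scopes(bigsmiles):
    """Split a string into the text outside '{...}' and a list of object bodies."""
    outside, objects, current = [], [], []
    inside = False
    for ch in bigsmiles:
        if ch == "{":
            inside, current = True, []
        elif ch == "}":
            inside = False
            objects.append("".join(current))
        elif inside:
            current.append(ch)
        else:
            outside.append(ch)
    return "".join(outside), objects


def ring_digit_counts(fragment):
    """Count single-digit ring closure labels in one scope."""
    stripped = BRACKET.sub("", fragment)
    return Counter(ch for ch in stripped if ch.isdigit())


def scopes_balanced(bigsmiles):
    """True if every ring digit is opened and closed within its own scope."""
    outside, objects = split_scopes(bigsmiles)
    return all(
        count % 2 == 0
        for scope in [outside, *objects]
        for count in ring_digit_counts(scope).values()
    )


# The macrocycle and the phenyl ring both use the digit 1, in different scopes.
ring_polymer = "C1CC{[$]CC(c1ccccc1)[$]}CC1"
outside_text, object_bodies = split_scopes(ring_polymer)
```

A reader that ignored the scoping rule would pair the first "1" outside the object with the first "1" of the phenyl ring; treating the object as its own scope avoids this.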

On the other hand, randomly formed cycles, such as the random loops in polymer networks, cannot be explicitly enumerated because each cycle requires indexing in the SMILES language. While the examples shown in Figure S2b–d (p S11, Section SIV in the SI) do not explicitly present the possibility of macrocycle formation, the rules of connectivity implicitly allow it, and enumeration of molecular states represented by the BigSMILES structure according to algorithms for generating gel connectivity, such as the algorithms adopted by Stepto and co-workers,³⁴ Eichinger and co-workers,³⁵ or Olsen and co-workers,³⁶,³⁷ will include the formation of these cyclic structures. Examples of BigSMILES strings for such structures are included in Section SIV (pp S10–S11) of the Supporting Information.

2.7. Ladder Polymers and Repeating Units with Multiatom Connections

The syntax up to this point assumes that neighboring repeating units are always connected through a single pair of atoms. However, for some materials, such as ladder polymers, this condition does not hold. To represent ladder polymers or other polymers with multiple connections between a single monomer pair, the bonding descriptors are nested by the following syntax:

The outer layer (everything except the part bracketed by “[...]”) encodes the connectivity between the repeating units with the same syntax as detailed in previous sections. Atoms on a repeating unit connecting to the same neighboring repeating units are indicated by an identical outer layer bond type, bond order b, and bond ID n. For detailed examples of the use of nested bonding descriptors, please refer to p S14, Section SV in the Supporting Information.

3. Discussion

BigSMILES provides a well-defined, compact, and machine-friendly extension to the SMILES language that allows stochastic polymer structures to be represented. In this stochastic sense, a polymeric material is actually a set of molecules which may be conceptualized as an ensemble of different chemical states (defined by the bonding pattern of atoms), each with a probability of occurrence within the set of molecules that represents the material. BigSMILES enables, in a compact form, the ensemble of different chemical states to be represented; however, it does not provide information on the probability of observing any given chemical state. This is conceptually similar to the chemical structure of a polymer, which does not specify, for example, the molar mass distribution.

In principle, information about the probability of observing each molecular configuration within the ensemble can be quantified by measurement of physicochemical properties, such as the molar mass distribution, tacticity, or monomer reactivity ratios and feed ratios. However, developing an identifier notation by using a fixed set of property descriptors is challenging in practice. In most practical settings, only a few of the chemical structural features and properties of the macromolecules are characterized experimentally. Furthermore, the literature lacks consensus on how to treat this problem: researchers typically do not measure the same data using consistent methods for each polymer, and the data required to fully define molecular probabilities are usually missing. In some cases, measurements may not even be possible. This means that any form of encoding that relies on describing the macromolecules using a predefined set of properties will not meet the needs of the macromolecular community, nor will it be universally accepted. There are also substantial issues with data uncertainty and disagreements about evidence that have the potential to cause controversy. Therefore, to make the representation general and universally applicable, a syntax is developed that clearly separates the definition of the ensemble of molecular states accessible in a polymerization, a relatively noncontroversial topic, from the probability of achieving a given molecular state, a topic around which there is much greater debate and uncertainty of measurement. This is analogous to defining an ensemble of states in statistical mechanics without assigning the Boltzmann weights. While both are important for property calculation, by separating the two tasks it is possible to provide concrete molecular identifiers. Alternatively, the demarcation of stochastic objects with curly brackets could enable additional specifications to be included in the list of elements beyond repeat units and end groups, providing an additional forum for the specification of certain additional chemical properties.

In the current form, a single polymer can be represented by multiple distinct yet equally valid BigSMILES representations. For practical purposes, canonicalization of BigSMILES to provide a unique representation for each distinct polymer would be essential for the application of BigSMILES to polymer informatics. Software packages to accompany BigSMILES are also of prime importance for practical purposes because they would serve both as a standard representation generator and as a tool that could help eliminate human errors. The developments of both the canonicalization scheme and the supporting software are currently in progress and will be reported in the near future. In its current form, BigSMILES can still be used as a structural identifier in applications such as a data entry identifier in polymer databases. To demonstrate its general applicability, the entries of a well-known data set³⁸ of glass transition data are converted into BigSMILES representations (cf. Section SVIII, pp S19–S44, in the Supporting Information). In addition to being used as identifiers, BigSMILES representations are designed so that they can also be used as the basis of a chemical fingerprint generator. By considering pairs or triplets of repeating units and higher-order structures, chemical motifs with different levels of complexity and detail can be easily generated with the representation. These motifs can be used in cheminformatics applications to construct feature vectors that are fed into supervised learning models for property predictions. Furthermore, these generated motifs can also be used in chemical fragment search or chemical structure search. Finally, the structures of BigSMILES representations also allow generic chemical pattern searches. Queries such as “find polymers that are linear”, “find linear copolymers that have two components”, or “find branched polymers with trifunctional junctions” can be straightforwardly processed with regular expressions or other pattern matching languages. These aforementioned features, including the generator for chemical fingerprints, chemical fragment search, and generic structural feature queries, will be implemented and demonstrated in future work. This capability enables access and searching of polymer materials from multiple levels of abstraction, which we believe will be highly convenient for the community.
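As an illustration of how the three example queries might be processed, the following Python sketch counts bonding descriptors per repeating unit with a regular expression. It is a hypothetical sketch and not the implementation referred to above. It assumes each entry is a single stochastic object written as "{unit,unit;end,end}" with no terminal bonding descriptors, that a linear polymer is one in which every repeating unit carries exactly two descriptors, and that a trifunctional junction is a unit with exactly three. The entry names and strings in the toy database are illustrative.

```python
import re

# A bonding descriptor is assumed to be a bracket starting with $, < or >.
DESCRIPTOR = re.compile(r"\[[$<>][^\[\]]*\]")


def repeat_units(stochastic_object):
    """Return the repeating units of one '{...}' object, ignoring end groups."""
    body = stochastic_object.strip()[1:-1]  # drop the curly brackets
    unit_list = body.split(";")[0]          # end groups, if any, follow the ';'
    return [u.strip() for u in unit_list.split(",") if u.strip()]


def is_linear(obj):
    """Every repeating unit has exactly two bonding descriptors."""
    return all(len(DESCRIPTOR.findall(u)) == 2 for u in repeat_units(obj))


def is_linear_copolymer(obj, components=2):
    """Linear, with the requested number of distinct entries in the unit list."""
    return is_linear(obj) and len(repeat_units(obj)) == components


def has_trifunctional_junction(obj):
    """At least one repeating unit has exactly three bonding descriptors."""
    return any(len(DESCRIPTOR.findall(u)) == 3 for u in repeat_units(obj))


# Toy database with illustrative entries (names and strings are assumptions).
database = {
    "polyethylene": "{[$]CC[$]}",
    "poly(ethylene-co-propylene)": "{[$]CC[$],[$]CC(C)[$]}",
    "branched polyethylene": "{[$]CC[$],[$]CC([$])[$]}",
}

linear = [name for name, s in database.items() if is_linear(s)]
copolymers = [name for name, s in database.items() if is_linear_copolymer(s)]
branched = [name for name, s in database.items() if has_trifunctional_junction(s)]
```

Because the queries act on the text of the identifier alone, they can be run without constructing any molecular graph, which is the sense in which such searches operate at a higher level of abstraction than substructure matching.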

4. Summary

In this work, a new text-based structural representation system designed to accommodate the stochastic nature of polymers is proposed. By adding a novel stochastic object to the widely used simplified molecular-input line-entry system, the features of SMILES can now be applied to polymers through BigSMILES. As the new representation system adds only a few elementary rules to the original syntax of SMILES while maintaining full compatibility with SMILES, most of the advantages of SMILES, including memory compactness, machine friendliness, and wide applicability, are retained in BigSMILES. Therefore, BigSMILES representations are excellent candidates for indexing identifiers in a polymer database system, as well as structural descriptors that could be used to search for polymer materials. Furthermore, as the chemical spaces represented by the BigSMILES strings can be straightforwardly probed with iterative generation of molecular fragments of varying sizes, BigSMILES representations can be readily used to automatically extract chemical subgraphs and generate molecular fingerprints. This feature can provide a convenient foundation for the generation of data sets that could be used along with machine learning models to fuel data-driven research. Ultimately, BigSMILES benefits the polymer community and increases cohesion between studies by providing a common language that is more effective and suitable for polymers.
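The iterative generation of fragments of varying sizes can be pictured with a deliberately simplified Python sketch that enumerates dyads and triads of repeating units as candidate motif labels. This is an assumption-laden illustration rather than the fingerprint generator under development: the function name is hypothetical, every unit is assumed to carry "$"-type descriptors so that any unit may neighbor any other, head and tail orientation within a unit is ignored, and a sequence read in either direction is counted as the same motif.

```python
from itertools import product


def motifs(units, size):
    """Enumerate sequences of `size` units; a sequence and its reverse are one motif."""
    seen = set()
    for seq in product(units, repeat=size):
        # Keep one canonical representative of each sequence/reverse pair.
        seen.add(min(seq, tuple(reversed(seq))))
    return sorted(seen)


# Illustrative repeating units (ethylene and propylene backbones, descriptors omitted).
units = ["CC", "CC(C)"]
dyads = motifs(units, 2)
triads = motifs(units, 3)
```

Each motif label could then serve as one position of a feature vector, with larger values of `size` giving a more detailed but longer fingerprint.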

Acknowledgments

This work was funded by the Center for the Chemistry of Molecularly Optimized Networks, a National Science Foundation (NSF) Center for Chemical Innovation (CHE-1832256). H.M. was supported by Furukawa Electric Co. Ltd. The authors would like to thank Ísis Biembengut (Braskem) for helpful discussions.

Supporting Information Available

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acscentsci.9b00476.

Discussion on orientation of repeating units, list of common repeating units and their equivalent string replacement, representation for charged polymers and tacticity; common mistakes in encoding BigSMILES strings; examples of BigSMILES encodings, including branched polymers, polymer networks, ladder polymers, and other more complex polymers; and discussion on treating dynamic, topological, and physical bonds (PDF)

The authors declare no competing financial interest.

Supplementary Material

oc9b00476_si_001.pdf (1.7 MB, PDF)

References

O’Boyle N. M. Towards a Universal SMILES Representation - A Standard Method to Generate Canonical SMILES Based on the InChI. J. Cheminf. 2012, 4 (1), 22. 10.1186/1758-2946-4-22.
Coley C. W.; Green W. H.; Jensen K. F. Machine Learning in Computer-Aided Synthesis Planning. Acc. Chem. Res. 2018, 51 (5), 1281–1289. 10.1021/acs.accounts.8b00087.
Gómez-Bombarelli R.; Wei J. N.; Duvenaud D.; Hernández-Lobato J. M.; Sánchez-Lengeling B.; Sheberla D.; Aguilera-Iparraguirre J.; Hirzel T. D.; Adams R. P.; Aspuru-Guzik A. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci. 2018, 4 (2), 268–276. 10.1021/acscentsci.7b00572.
Kramer S.; De Raedt L.; Helma C. In Molecular Feature Mining in HIV Data, Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining; 2001; pp 136–143.
Weininger D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Model. 1988, 28 (1), 31–36. 10.1021/ci00057a005.
Weininger D.; Weininger A.; Weininger J. L. SMILES. 2. Algorithm for Generation of Unique SMILES Notation. J. Chem. Inf. Model. 1989, 29 (2), 97–101. 10.1021/ci00062a008.
Ash S.; Cline M. A.; Homer R. W.; Hurst T.; Smith G. B. SYBYL Line Notation (SLN): A Versatile Language for Chemical Structure Representation. J. Chem. Inf. Comput. Sci. 1997, 37 (1), 71–79. 10.1021/ci960109j.
Vollmer J. J. Wiswesser Line Notation: An Introduction. J. Chem. Educ. 1983, 60 (3), 192. 10.1021/ed060p192.
Rohbeck H.-G. Representation of Structure Description Arranged Linearly. In Software Development in Chemistry 5; Springer: Berlin, Heidelberg, 1991; pp 49–58.
Gakh A. A.; Burnett M. N. Modular Chemical Descriptor Language (MCDL): Composition, Connectivity, and Supplementary Modules. J. Chem. Inf. Comput. Sci. 2001, 41 (6), 1494–1499. 10.1021/ci000108y.
Heller S. R.; McNaught A.; Pletnev I.; Stein S.; Tchekhovskoi D. InChI, the IUPAC International Chemical Identifier. J. Cheminf. 2015, 7 (1), 23. 10.1186/s13321-015-0068-4.
O’Boyle N. M.; Banck M.; James C. A.; Morley C.; Vandermeersch T.; Hutchison G. R. Open Babel: An Open Chemical Toolbox. J. Cheminf. 2011, 3 (1), 33. 10.1186/1758-2946-3-33.
Cao D.-S.; Zhao J.-C.; Yang Y.-N.; Zhao C.-X.; Yan J.; Liu S.; Hu Q.-N.; Xu Q.-S.; Liang Y.-Z. In Silico Toxicity Prediction by Support Vector Machine and SMILES Representation-Based String Kernel. SAR QSAR Environ. Res. 2012, 23 (1–2), 141–153. 10.1080/1062936X.2011.645874.
Napolitano F.; Zhao Y.; Moreira V. M.; Tagliaferri R.; Kere J.; D’Amato M.; Greco D. Drug Repositioning: A Machine-Learning Approach through Data Integration. J. Cheminf. 2013, 5 (1), 30. 10.1186/1758-2946-5-30.
Montavon G.; Rupp M.; Gobre V.; Vazquez-Mayagoitia A.; Hansen K.; Tkatchenko A.; Müller K.-R.; von Lilienfeld O. A. Machine Learning of Molecular Electronic Properties in Chemical Compound Space. New J. Phys. 2013, 15 (9), 095003. 10.1088/1367-2630/15/9/095003.
Wu Z.; Ramsundar B.; Feinberg E. N.; Gomes J.; Geniesse C.; Pappu A. S.; Leswing K.; Pande V. MoleculeNet: A Benchmark for Molecular Machine Learning. Chem. Sci. 2018, 9 (2), 513–530. 10.1039/C7SC02664A.
Coley C. W.; Jin W.; Rogers L.; Jamison T. F.; Jaakkola T. S.; Green W. H.; Barzilay R.; Jensen K. F. A Graph-Convolutional Neural Network Model for the Prediction of Chemical Reactivity. Chem. Sci. 2019, 10 (2), 370–377. 10.1039/C8SC04228D.
Gao H.; Struble T. J.; Coley C. W.; Wang Y.; Green W. H.; Jensen K. F. Using Machine Learning To Predict Suitable Conditions for Organic Reactions. ACS Cent. Sci. 2018, 4 (11), 1465–1476. 10.1021/acscentsci.8b00357.
Audus D. J.; de Pablo J. J. Polymer Informatics: Opportunities and Challenges. ACS Macro Lett. 2017, 6 (10), 1078–1082. 10.1021/acsmacrolett.7b00228.
Kim C.; Chandrasekaran A.; Huan T. D.; Das D.; Ramprasad R. Polymer Genome: A Data-Powered Polymer Informatics Platform for Property Predictions. J. Phys. Chem. C 2018, 122 (31), 17575–17585. 10.1021/acs.jpcc.8b02913.
Huan T. D.; Mannodi-Kanakkithodi A.; Ramprasad R. Accelerated Materials Property Predictions and Design Using Motif-Based Fingerprints. Phys. Rev. B: Condens. Matter Mater. Phys. 2015, 92 (1), 014106. 10.1103/PhysRevB.92.014106.
Mannodi-Kanakkithodi A.; Pilania G.; Huan T. D.; Lookman T.; Ramprasad R. Machine Learning Strategy for Accelerated Design of Polymer Dielectrics. Sci. Rep. 2016, 6, 20952. 10.1038/srep20952.
Nagasawa S.; Al-Naamani E.; Saeki A. Computer-Aided Screening of Conjugated Polymers for Organic Solar Cell: Classification by Random Forest. J. Phys. Chem. Lett. 2018, 9 (10), 2639–2646. 10.1021/acs.jpclett.8b00635.
Cravero F.; Schustik S.; Martínez M. J.; Barranco C. D.; Díaz M. F.; Ponzoni I. In Feature Selection and Polydispersity Characterization for QSPR Modelling: Predicting a Tensile Property, International Conference on Practical Applications of Computational Biology & Bioinformatics; 2018; pp 43–51.
Tchoua R. B.; Chard K.; Audus D. J.; Ward L. T.; Lequieu J.; de Pablo J. J.; Foster I. T. In Towards a Hybrid Human-Computer Scientific Information Extraction Pipeline, 2017 IEEE 13th International Conference on e-Science; 2017; pp 109–118.
Pilania G.; Wang C.; Jiang X.; Rajasekaran S.; Ramprasad R. Accelerating Materials Property Predictions Using Machine Learning. Sci. Rep. 2013, 3, 2810. 10.1038/srep02810.
Sharma V.; Wang C.; Lorenzini R. G.; Ma R.; Zhu Q.; Sinkovits D. W.; Pilania G.; Oganov A. R.; Kumar S.; Sotzing G. A.; et al. Rational Design of All Organic Polymer Dielectrics. Nat. Commun. 2014, 5, 4845. 10.1038/ncomms5845.
Peerless J. S.; Milliken N. J. B.; Oweida T. J.; Manning M. D.; Yingling Y. G. Soft Matter Informatics: Current Progress and Challenges. Adv. Theory Simulations 2019, 2, 1800129. 10.1002/adts.201800129.
Zhang T.; Li H.; Xi H.; Stanton R. V.; Rotstein S. H. HELM: A Hierarchical Notation Language for Complex Biomolecule Structure Representation. J. Chem. Inf. Model. 2012, 52 (10), 2796–2806. 10.1021/ci3001925.
Drefahl A. CurlySMILES: A Chemical Language to Customize and Annotate Encodings of Molecular and Nanodevice Structures. J. Cheminf. 2011, 3 (1), 1. 10.1186/1758-2946-3-1.
Hasegawa H.; Tanaka H.; Yamasaki K.; Hashimoto T. Bicontinuous Microdomain Morphology of Block Copolymers. 1. Tetrapod-Network Structure of Polystyrene-Polyisoprene Diblock Polymers. Macromolecules 1987, 20 (7), 1651–1662. 10.1021/ma00173a036.
Odian G. Principles of Polymerization; John Wiley & Sons: Hoboken, NJ, 2004.
Bielawski C. W.; Benitez D.; Grubbs R. H. An “Endless” Route to Cyclic Polymers. Science 2002, 297 (5589), 2041–2044. 10.1126/science.1075401.
Rolfes H.; Stepto R. F. T. A Development of Ahmad-Stepto Gelation Theory. Makromol. Chem., Macromol. Symp. 1993, 76, 1–12. 10.1002/masy.19930760103.
Leung Y.-K.; Eichinger B. E. Computer Simulation of End-Linked Elastomers. I. Trifunctional Networks Cured in the Bulk. J. Chem. Phys. 1984, 80 (8), 3877–3884. 10.1063/1.447169.
Wang R.; Alexander-Katz A.; Johnson J. A.; Olsen B. D. Universal Cyclic Topology in Polymer Networks. Phys. Rev. Lett. 2016, 116 (18), 188302. 10.1103/PhysRevLett.116.188302.
Lin T.-S.; Wang R.; Johnson J. A.; Olsen B. D. Topological Structure of Networks Formed from Symmetric Four-Arm Precursors. Macromolecules 2018, 51 (3), 1224–1231. 10.1021/acs.macromol.7b01829.
Bicerano J. Prediction of Polymer Properties; CRC Press: Boca Raton, 2002.
