Quantitative structure–activity relationship models (QSAR models) are regression or classification models used in the chemical and biological sciences and engineering. Like other regression models, QSAR regression models relate a set of "predictor" variables (X) to the potency of the response variable (Y), while classification QSAR models relate the predictor variables to a categorical value of the response variable.
In QSAR modeling, the predictors consist of physico-chemical properties or theoretical molecular descriptors[1][2] of chemicals; the QSAR response variable could be a biological activity of the chemicals. QSAR models first summarize a supposed relationship between chemical structures and biological activity in a data set of chemicals. Second, QSAR models predict the activities of new chemicals.[3][4]
Related terms include quantitative structure–property relationships (QSPR) when a chemical property is modeled as the response variable.[5][6] "Different properties or behaviors of chemical molecules have been investigated in the field of QSPR. Some examples are quantitative structure–reactivity relationships (QSRRs), quantitative structure–chromatography relationships (QSCRs) and, quantitative structure–toxicity relationships (QSTRs), quantitative structure–electrochemistry relationships (QSERs), and quantitative structure–biodegradability relationships (QSBRs)."[7]
As an example, biological activity can be expressed quantitatively as the concentration of a substance required to give a certain biological response. Additionally, when physicochemical properties or structures are expressed by numbers, one can find a mathematical relationship, or quantitative structure-activity relationship, between the two. The mathematical expression, if carefully validated,[8][9][10][11] can then be used to predict the modeled response of other chemical structures.[12]
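As a minimal sketch of expressing activity quantitatively, a measured concentration is often converted to a negative logarithmic scale before modeling. The pIC50 convention used here (pIC50 = -log10 of the IC50 in mol/L) is an assumption chosen for illustration, not something the text prescribes, and the compound names and IC50 values are made up.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)


# Illustrative (made-up) IC50 values in nM.
measurements = {"compound_A": 100.0, "compound_B": 2500.0}
for name, ic50 in measurements.items():
    print(name, round(pic50_from_ic50_nm(ic50), 2))
```

On this scale a more potent compound (lower concentration needed) receives a larger number, which is usually more convenient as a regression response.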
A QSAR has the form of a mathematical model:
Activity = f(physicochemical properties and/or structural properties) + error
The error includes model error (bias) and observational variability, that is, the variability in observations even on a correct model.
The principal steps of QSAR/QSPR include:[7]
The basic assumption for all molecule-based hypotheses is that similar molecules have similar activities. This principle is also called Structure–Activity Relationship (SAR). The underlying problem is therefore how to define a small difference on a molecular level, since each kind of activity, e.g. reaction ability, biotransformation ability, solubility, target activity, and so on, might depend on a different kind of difference. Examples were given in the bioisosterism reviews by Patani and LaVoie[13] and Brown.[14]
In general, one is more interested in finding strong trends. Created hypotheses usually rely on a finite number of chemicals, so care must be taken to avoid overfitting: the generation of hypotheses that fit training data very closely but perform poorly when applied to new data.
The SAR paradox refers to the fact that not all similar molecules have similar activities.[15]
Analogously, the "partition coefficient" (a measurement of differential solubility and itself a component of QSAR predictions) can be predicted either by atomic methods (known as "XLogP" or "ALogP") or by chemical fragment methods (known as "CLogP" and other variations). It has been shown that the logP of a compound can be determined by the sum of its fragments; fragment-based methods are generally accepted as better predictors than atomic-based methods.[16] Fragmentary values have been determined statistically, based on empirical data for known logP values. This method gives mixed results and is generally not trusted to have accuracy of more than ±0.1 units.[17]
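The additive idea behind fragment methods can be sketched as a weighted sum of fragment contributions. The fragment names and numerical values below are hypothetical placeholders chosen for illustration; they are not the fitted constants of CLogP or of any published scheme, and real methods also apply correction factors that are omitted here.

```python
# Hypothetical fragment contributions to logP (NOT fitted or published values).
FRAGMENT_LOGP = {"CH3": 0.9, "CH2": 0.5, "OH": -1.1, "C6H5": 1.9}


def estimate_logp(fragment_counts: dict[str, int]) -> float:
    """Sum the fragment contributions, each weighted by its count in the molecule."""
    unknown = set(fragment_counts) - set(FRAGMENT_LOGP)
    if unknown:
        raise KeyError(f"no contribution available for: {sorted(unknown)}")
    return sum(FRAGMENT_LOGP[frag] * n for frag, n in fragment_counts.items())


# 1-Butanol decomposed as one CH3, three CH2 and one OH fragment.
butanol = {"CH3": 1, "CH2": 3, "OH": 1}
print(round(estimate_logp(butanol), 2))
```

Because the table is a plain lookup, a molecule containing a fragment that was never parameterized raises an error rather than silently contributing zero.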
Group or fragment-based QSAR is also known as GQSAR.[18] GQSAR allows flexibility to study various molecular fragments of interest in relation to the variation in biological response. The molecular fragments could be substituents at various substitution sites in a congeneric set of molecules, or could be defined on the basis of pre-defined chemical rules in the case of non-congeneric sets. GQSAR also considers cross-term fragment descriptors, which could be helpful in identifying key fragment interactions that determine variation in activity.[18] Lead discovery using fragnomics is an emerging paradigm. In this context, FB-QSAR proves to be a promising strategy for fragment library design and in fragment-to-lead identification endeavours.[19]
An advanced approach to fragment- or group-based QSAR, based on the concept of pharmacophore similarity, has also been developed.[20] This method, pharmacophore-similarity-based QSAR (PS-QSAR), uses topological pharmacophoric descriptors to develop QSAR models. The resulting activity prediction may help assess the contribution of certain pharmacophore features, encoded by the respective fragments, toward activity improvement and/or detrimental effects.[20]
The acronym 3D-QSAR or 3-D QSAR refers to the application of force field calculations requiring three-dimensional structures of a given set of small molecules with known activities (training set). The training set needs to be superimposed (aligned) by either experimental data (e.g. based on ligand-protein crystallography) or molecule superimposition software. It uses computed potentials, e.g. the Lennard-Jones potential, rather than experimental constants and is concerned with the overall molecule rather than a single substituent. The first 3-D QSAR was named Comparative Molecular Field Analysis (CoMFA) by Cramer et al. It examined the steric fields (shape of the molecule) and the electrostatic fields,[21] which were correlated by means of partial least squares regression (PLS).
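How a steric field becomes a descriptor vector can be sketched by evaluating a 12-6 Lennard-Jones potential between a probe and every atom at each point of a regular lattice. This is an illustrative toy, not the CoMFA implementation: the coordinates, the single shared epsilon and sigma, the grid spacing and the energy cap are all assumptions, and alignment of the molecules is taken as already done.

```python
import math
from itertools import product


def lennard_jones(r: float, epsilon: float = 0.1, sigma: float = 3.4) -> float:
    """12-6 Lennard-Jones energy; epsilon and sigma are placeholder parameters."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def steric_field(atoms, grid_points, cutoff: float = 30.0):
    """Sum probe-atom energies at each grid point, capping each total at `cutoff`."""
    field = []
    for point in grid_points:
        energy = 0.0
        for atom in atoms:
            r = max(math.dist(point, atom), 1e-6)  # guard against r == 0
            energy += lennard_jones(r)
        field.append(min(energy, cutoff))
    return field


# Toy "molecule": three atom positions in angstroms (made-up coordinates).
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
# Coarse 3 x 3 x 3 lattice with 4 angstrom spacing around the molecule.
axis = (-4.0, 0.0, 4.0)
grid = list(product(axis, axis, axis))
descriptors = steric_field(atoms, grid)
print(len(descriptors))
```

Each molecule in an aligned training set would yield one such vector on the same lattice, and the resulting matrix is the kind of input that a regression method such as PLS then correlates with activity.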
The created data space is then usually reduced by a subsequent feature extraction step (see also dimensionality reduction). The subsequent learning method can be any of the already mentioned machine learning methods, e.g. support vector machines.[22] An alternative approach uses multiple-instance learning by encoding molecules as sets of data instances, each of which represents a possible molecular conformation. A label or response is assigned to each set corresponding to the activity of the molecule, which is assumed to be determined by at least one instance in the set (i.e. some conformation of the molecule).[23]
On June 18, 2011, the Comparative Molecular Field Analysis (CoMFA) patent dropped any restriction on the use of GRID and partial least-squares (PLS) technologies.[citation needed]
In this approach, descriptors quantifying various electronic, geometric, or steric properties of a molecule are computed and used to develop a QSAR.[24] This approach is different from the fragment (or group contribution) approach in that the descriptors are computed for the system as a whole rather than from the properties of individual fragments. This approach is different from the 3D-QSAR approach in that the descriptors are computed from scalar quantities (e.g., energies, geometric parameters) rather than from 3D fields.
An example of this approach is the QSARs developed for olefin polymerization by half sandwich compounds.[25][26]
It has been shown that activity prediction is even possible based purely on the SMILES string.[27][28][29]
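One simple way to turn a SMILES string into model input is to count overlapping character n-grams. The sketch below is an assumption about how such a featurization might look and is not the method of the cited studies; in particular, splitting by character breaks multi-character atom symbols such as "Cl" or "Br", which a real tokenizer would keep together.

```python
from collections import Counter


def smiles_ngrams(smiles: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams in a SMILES string."""
    return Counter(smiles[i:i + n] for i in range(len(smiles) - n + 1))


def featurize(smiles_list, n: int = 2):
    """Build a shared, sorted vocabulary and one count vector per molecule."""
    counts = [smiles_ngrams(s, n) for s in smiles_list]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[gram] for gram in vocab] for c in counts]


# Ethanol, ethylamine and benzene as example inputs.
vocab, X = featurize(["CCO", "CCN", "c1ccccc1"])
print(vocab)
print(X)
```

The rows of `X` can be passed to any regression or classification method in place of computed descriptors.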
Similarly to string-based methods, the molecular graph can be used directly as input for QSAR models,[30][31] but this usually yields inferior performance compared to descriptor-based QSAR models.[32][33]
QSAR has been merged with the similarity-based read-across technique to develop a new field of q-RASAR. The DTC Laboratory at Jadavpur University has developed this hybrid method, and the details are available at their laboratory page. Recently, the q-RASAR framework has been improved by its integration with the ARKA descriptors in QSAR.
In the literature, chemists are often found to prefer partial least squares (PLS) methods,[citation needed] since they perform feature extraction and induction in one step.
Computer SAR models typically calculate a relatively large number of features. Because those lack structural interpretation ability, the preprocessing steps face a feature selection problem (i.e., which structural features should be interpreted to determine the structure-activity relationship). Feature selection can be accomplished by visual inspection (qualitative selection by a human); by data mining; or by molecule mining.
A typical data mining based prediction uses, for example, support vector machines, decision trees, or artificial neural networks to induce a predictive learning model.
Molecule mining approaches, a special case of structured data mining approaches, apply a similarity-matrix-based prediction or an automatic fragmentation scheme into molecular substructures. Furthermore, there are also approaches using maximum common subgraph searches or graph kernels.[34][35]
Typically, QSAR models derived from nonlinear machine learning are seen as a "black box" that fails to guide medicinal chemists. A relatively new concept, matched molecular pair analysis[36] (MMPA), or prediction-driven MMPA, is coupled with a QSAR model in order to identify activity cliffs.[37]
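An activity cliff is a pair of structurally similar compounds with a large difference in activity. The sketch below flags such pairs using Tanimoto similarity over sets of substructure keys; this is a simplified stand-in chosen for illustration, not matched molecular pair analysis itself, which identifies pairs differing by a single defined transformation. The feature keys, activities and thresholds are all made up.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def activity_cliffs(compounds, min_similarity: float = 0.7, min_delta: float = 2.0):
    """Return name pairs that are similar in structure but far apart in activity.

    `compounds` maps a name to (feature_set, activity); thresholds are arbitrary.
    """
    cliffs = []
    for (n1, (f1, a1)), (n2, (f2, a2)) in combinations(compounds.items(), 2):
        if tanimoto(f1, f2) >= min_similarity and abs(a1 - a2) >= min_delta:
            cliffs.append((n1, n2))
    return cliffs


# Made-up substructure keys and pIC50-like activities.
data = {
    "mol_1": ({"ring", "amide", "OMe", "F"}, 8.1),
    "mol_2": ({"ring", "amide", "OMe"}, 5.4),
    "mol_3": ({"ring", "ester"}, 5.2),
}
print(activity_cliffs(data))
```

In a prediction-driven setting the activities could be model outputs rather than measurements, so that flagged pairs point a chemist to the structural change the model considers decisive.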
QSAR modeling produces predictive models derived from application of statistical tools correlating biological activity (including desirable therapeutic effect and undesirable side effects) or physico-chemical properties in QSPR models of chemicals (drugs/toxicants/environmental pollutants) with descriptors representative of molecular structure or properties. QSARs are being applied in many disciplines, for example: risk assessment, toxicity prediction, and regulatory decisions[38] in addition to drug discovery and lead optimization.[39] Obtaining a good quality QSAR model depends on many factors, such as the quality of input data, the choice of descriptors and statistical methods for modeling and for validation. Any QSAR modeling should ultimately lead to statistically robust and predictive models capable of making accurate and reliable predictions of the modeled response of new compounds.
For validation of QSAR models, usually various strategies are adopted:[40]
The success of any QSAR model depends on accuracy of the input data, selection of appropriate descriptors and statistical tools, and most importantly validation of the developed model. Validation is the process by which the reliability and relevance of a procedure are established for a specific purpose; for QSAR models, validation must mainly address the robustness, prediction performance, and applicability domain (AD) of the models.[8][9][11][41][42]
Some validation methodologies can be problematic. For example, leave-one-out cross-validation generally leads to an overestimation of predictive capacity. Even with external validation, it is difficult to determine whether the selection of training and test sets was manipulated to maximize the predictive capacity of the model being published.
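To make the leave-one-out procedure concrete, the following sketch computes the fitted R² and the leave-one-out q² for a one-descriptor linear model. The data are made up, and the single-descriptor least-squares model is an assumption chosen to keep the example short; it shows how the statistic is obtained and does not by itself demonstrate how a model behaves on external compounds.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def r_squared(xs, ys):
    """Fitted R^2, evaluated on the same data used for fitting."""
    a, b = fit_line(xs, ys)
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / sum((y - my) ** 2 for y in ys)


def q_squared_loo(xs, ys):
    """Leave-one-out q^2: each point is predicted by a model fitted without it."""
    my = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    return 1 - press / sum((y - my) ** 2 for y in ys)


# Made-up descriptor values and activities.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]
print(round(r_squared(x, y), 3), round(q_squared_loo(x, y), 3))
```

Every left-out compound here comes from the same small set as the training compounds, which is one reason an internal q² can look better than performance on an independent test set.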
Different aspects of validation of QSAR models that need attention include methods of selection of training set compounds,[43] setting training set size[44] and impact of variable selection[45] for training set models for determining the quality of prediction. Development of novel validation parameters for judging quality of QSAR models is also important.[11][46][47]
One of the first historical QSAR applications was to predict boiling points.[48]
It is well known, for instance, that within a particular family of chemical compounds, especially in organic chemistry, there are strong correlations between structure and observed properties. A simple example is the relationship between the number of carbons in alkanes and their boiling points. There is a clear trend of increasing boiling point with an increasing number of carbons, and this serves as a means for predicting the boiling points of higher alkanes.
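The alkane trend can be sketched by fitting a straight line to the boiling points of pentane through octane and extrapolating to nonane. The boiling points are approximate, rounded reference values, and the choice of a linear model over this narrow range is an assumption made for illustration.

```python
# Approximate normal boiling points (degrees C) of n-alkanes, keyed by carbon count.
BOILING_POINT_C = {5: 36.1, 6: 68.7, 7: 98.4, 8: 125.6, 9: 150.8}


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


# Train on C5 to C8 only, then predict the next homologue (C9, nonane).
train = {n: bp for n, bp in BOILING_POINT_C.items() if n <= 8}
intercept, slope = fit_line(list(train), list(train.values()))
predicted = intercept + slope * 9
print(f"predicted nonane b.p.: {predicted:.1f} C, reference: {BOILING_POINT_C[9]} C")
```

The linear fit overshoots the reference value by a few degrees, because the increment per added carbon shrinks as the chain grows; a model with curvature would be needed to extrapolate further.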
Other applications that remain of interest are the Hammett equation, the Taft equation, and pKa prediction methods.[49]
The biological activity of molecules is usually measured in assays to establish the level of inhibition of particular signal transduction or metabolic pathways. Drug discovery often involves the use of QSAR to identify chemical structures that could have good inhibitory effects on specific targets and have low toxicity (non-specific activity). Of special interest is the prediction of the partition coefficient logP, which is an important measure used in identifying "druglikeness" according to Lipinski's Rule of Five.[50]
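A Rule of Five check can be sketched as a count of threshold violations. The thresholds are the commonly quoted ones (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors); tolerating one violation is a common convention assumed here, and the two example molecules and their property values are made up.

```python
from dataclasses import dataclass


@dataclass
class Molecule:
    name: str
    mol_weight: float       # daltons
    logp: float             # octanol-water partition coefficient (log10 scale)
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(m: Molecule) -> list[str]:
    """List which of the four Rule of Five criteria the molecule breaks."""
    rules = [
        (m.mol_weight > 500, "molecular weight > 500 Da"),
        (m.logp > 5, "logP > 5"),
        (m.h_bond_donors > 5, "more than 5 hydrogen-bond donors"),
        (m.h_bond_acceptors > 10, "more than 10 hydrogen-bond acceptors"),
    ]
    return [message for broken, message in rules if broken]


def is_druglike(m: Molecule, max_violations: int = 1) -> bool:
    """Accept molecules with at most `max_violations` broken criteria."""
    return len(lipinski_violations(m)) <= max_violations


# Made-up property values for two hypothetical molecules.
small = Molecule("example_small", 180.2, 1.2, 1, 4)
large = Molecule("example_large", 812.0, 6.3, 7, 12)
for mol in (small, large):
    print(mol.name, is_druglike(mol), lipinski_violations(mol))
```

In practice the logP passed to such a filter is often itself a predicted value, which is why the accuracy of logP estimation matters for druglikeness screening.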
While many quantitative structure–activity relationship analyses involve the interactions of a family of molecules with an enzyme or receptor binding site, QSAR can also be used to study the interactions between the structural domains of proteins. Protein-protein interactions can be quantitatively analyzed for structural variations resulting from site-directed mutagenesis.[51]
Reducing the risk of a SAR paradox is part of the machine learning method, especially given that only a finite amount of data is available (see also MVUE). In general, all QSAR problems can be divided into coding[52] and learning.[53]
(Q)SAR models have been used for risk management. QSARs are suggested by regulatory authorities; in the European Union, QSARs are suggested by the REACH regulation, where "REACH" abbreviates "Registration, Evaluation, Authorisation and Restriction of Chemicals". Regulatory application of QSAR methods includes in silico toxicological assessment of genotoxic impurities.[54] Commonly used QSAR assessment software such as DEREK or CASE Ultra (MultiCASE) is used to assess the genotoxicity of impurities according to ICH M7.[55]
The chemical descriptor space whose convex hull is generated by a particular training set of chemicals is called the training set's applicability domain. Prediction of properties of novel chemicals that are located outside the applicability domain uses extrapolation, and so is less reliable (on average) than prediction within the applicability domain. The assessment of the reliability of QSAR predictions remains a research topic, as a unified strategy has yet to be adopted by modellers and regulatory authorities.[56]
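The convex-hull definition can be illustrated in two dimensions, where the hull is a polygon and membership reduces to a sign test on cross products. Restricting the example to two descriptors is an assumption made for readability; real descriptor spaces have many dimensions, where exact hulls become expensive and other domain estimates are often used instead. The descriptor values are made up.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_domain(hull, query):
    """True if `query` lies inside or on the boundary of the hull polygon."""
    n = len(hull)
    if n < 3:
        raise ValueError("need at least three non-collinear training points")
    return all(cross(hull[i], hull[(i + 1) % n], query) >= 0 for i in range(n))


# Made-up training compounds described by two descriptors each.
training = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 1.0)]
hull = convex_hull(training)
print(in_domain(hull, (1.0, 1.0)))  # interpolation: inside the domain
print(in_domain(hull, (5.0, 1.0)))  # extrapolation: outside the domain
```

A prediction for the second query point would rely on extrapolation and would therefore deserve less confidence than one for the first.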
The QSAR equations can be used to predict biological activities of newer molecules before their synthesis.
Examples of machine learning tools for QSAR modeling include:[57]
S.No. | Name | Algorithms | External link |
---|---|---|---|
1. | R | RF, SVM, Naïve Bayesian, and ANN | "R: The R Project for Statistical Computing". |
2. | libSVM | SVM | "LIBSVM -- A Library for Support Vector Machines". |
3. | Orange | RF, SVM, and Naïve Bayesian | "Orange Data Mining". |
4. | RapidMiner | SVM, RF, Naïve Bayes, DT, ANN, and k-NN | "RapidMiner \| #1 Open Source Predictive Analytics Platform". |
5. | Weka | RF, SVM, and Naïve Bayes | "Weka 3 - Data Mining with Open Source Machine Learning Software in Java". Archived from the original on 2011-10-28. Retrieved 2016-03-24. |
6. | Knime | DT, Naïve Bayes, and SVM | "KNIME \| Open for Innovation". |
7. | AZOrange[58] | RT, SVM, ANN, and RF | "AZCompTox/AZOrange: AstraZeneca add-ons to Orange". GitHub. 2018-09-19. |
8. | Tanagra | SVM, RF, Naïve Bayes, and DT | "TANAGRA - A free DATA MINING software for teaching and research". Archived from the original on 2017-12-19. Retrieved 2016-03-24. |
9. | Elki | k-NN | "ELKI Data Mining Framework". Archived from the original on 2016-11-19. |
10. | MALLET | | "MALLET homepage". |
11. | MOA | | "MOA Massive Online Analysis \| Real Time Analytics for Data Streams". Archived from the original on 2017-06-19. |
12. | DeepChem | Logistic Regression, Naive Bayes, RF, ANN, and others | "DeepChem". deepchem.io. Retrieved 20 October 2017. |
13. | alvaModel[59] | Regression (OLS, PLS, k-NN, SVM and Consensus) and Classification (LDA/QDA, PLS-DA, k-NN, SVM and Consensus) | "alvaModel: a software tool to create QSAR/QSPR models". alvascience.com. |
14. | scikit-learn (Python)[60] | Logistic Regression, Naive Bayes, kNN, RF, SVM, GP, ANN, and others | "SciKit-Learn". scikit-learn.org. Retrieved 13 August 2023. |
15. | Scikit-Mol[61] | Integration of Scikit-learn models and RDKit featurization | scikit-mol on pypi.org |
16. | scikit-fingerprints[62] | Molecular fingerprints, API compatible with Scikit-learn models | "scikit-fingerprints". GitHub. Retrieved 29 December 2024. |
17. | DTC Lab Tools | Multiple Linear Regression, Partial Least Squares, Applicability Domain, Validation, and others | "DTCLab Tools". Retrieved 12 May 2025. |
18. | DTC Lab Supplementary Tools | Quantitative Read-across, q-RASAR, ARKA, Regression and Classification-based ML tools, and others | "DTCLab Supplementary Tools". Retrieved 12 May 2025. |