
Combining data and theory for derivable scientific discovery with AI-Descartes

Cristina Cornelio1,2, Sanjeeb Dash1, Vernon Austel1, Tyler R. Josephson3,4, Joao Goncalves1, Kenneth L. Clarkson1, Nimrod Megiddo1, Bachir El Khadir1, Lior Horesh1,5
1IBM Research—Mathematics and Theoretical Computer Science, New York, NY USA
2Samsung AI—Machine Learning and Data Intelligence, Cambridge, UK
3Department of Chemical, Biochemical, and Environmental Engineering, University of Maryland, Baltimore County, MD USA
4Department of Chemistry and Chemical Theory Center, University of Minnesota, Minneapolis, MN USA
5Columbia University, Computer Science, New York, NY USA

Corresponding author.

Received 2021 Oct 28; Accepted 2023 Mar 8; Collection date 2023.

© The Author(s) 2023

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

PMCID: PMC10097814; PMID: 37045814

Abstract

Scientists aim to discover meaningful formulae that accurately describe experimental data. Mathematical models of natural phenomena can be manually created from domain knowledge and fitted to data, or, in contrast, created automatically from large datasets with machine-learning algorithms. The problem of incorporating prior knowledge expressed as constraints on the functional form of a learned model has been studied before, while finding models that are consistent with prior knowledge expressed via general logical axioms is an open problem. We develop a method to enable principled derivations of models of natural phenomena from axiomatic knowledge and experimental data by combining logical reasoning with symbolic regression. We demonstrate these concepts for Kepler’s third law of planetary motion, Einstein’s relativistic time-dilation law, and Langmuir’s theory of adsorption. We show we can discover governing laws from few data points when logical reasoning is used to distinguish between candidate formulae having similar error on the data.

Subject terms: Computer science, Information theory and computation


Automatic extraction of consistent governing laws from data is a challenging problem. The authors propose a method that takes as input experimental data and background theory and combines symbolic regression with logical reasoning to obtain scientifically meaningful symbolic formulas.

Introduction

Artificial neural networks (NNs) and statistical regression are commonly used to automate the discovery of patterns and relations in data. NNs return “black-box” models, where the underlying functions are typically used for prediction only. In standard regression, the functional form is determined in advance, so model discovery amounts to parameter fitting. In symbolic regression (SR) [1, 2], the functional form is not determined in advance, but is instead composed from operators in a given list (e.g., +, −, ×, and ÷) and calculated from the data. SR models are typically more “interpretable” than NN models and require less data. Thus, for discovering laws of nature in symbolic form from experimental data, SR may work better than NNs or fixed-form regression [3]; integration of NNs with SR has been a topic of recent research in neuro-symbolic AI [4–6]. A major challenge in SR is to identify, out of the many models that fit the data, those that are scientifically meaningful. Schmidt and Lipson [3] identify meaningful functions as those that balance accuracy and complexity. However, many such expressions exist for a given dataset, and not all are consistent with the known background theory.

Another approach would be to start from the known background theory, but there are no existing practical reasoning tools that generate theorems consistent with experimental data from a set of known axioms. Automated Theorem Provers (ATPs), the most widely used reasoning tools, instead solve the task of proving a conjecture for a given logical theory. Computational complexity is a major challenge for ATPs; for certain types of logic, proving a conjecture is undecidable. Moreover, deriving models from a logical theory using formal reasoning tools is especially difficult when arithmetic and calculus operators are involved (e.g., see the work of Grigoryev et al. [7] for the case of inequalities). Machine-learning techniques have been used to improve the performance of ATPs, for example, by using reinforcement learning to guide the search process [8]. This research area has received much attention recently [9–11].

Models that are derivable, and not merely empirically accurate, are appealing because they are arguably correct, predictive, and insightful. We attempt to obtain such models by combining a novel mathematical-optimization-based SR method with a reasoning system. This yields an end-to-end discovery system, which extracts formulas from data via SR, and then furnishes either a formal proof of derivability of the formula from a set of axioms, or a proof of inconsistency. We present novel measures that indicate how close a formula is to a derivable formula, when the model is provably non-derivable, and we calculate the values of these measures using our reasoning system. In earlier work combining machine learning with reasoning, Marra et al. [12] use a logic-based description to constrain the output of a GAN neural architecture for generating images. Scott et al. [13] and Ashok et al. [14] combine machine-learning tools and reasoning engines to search for functional forms that satisfy prespecified constraints. They augment the initial dataset with new points in order to improve the efficiency of learning methods and the accuracy of the final model. Kubalík et al. [15] also exploit prior knowledge to create additional data points. However, these works only consider constraints on the functional form to be learned, and do not incorporate general background-theory axioms (logic constraints that describe the other laws and unmeasured variables involved in the phenomenon).

Results

Discovery as a formal mathematical problem

Our automated scientific discovery method aims to discover an unknown symbolic model y = f*(x) (bold letters indicate vectors), where x is the vector (x1, …, xn) of independent variables and y is the dependent variable. The discovered model f (an approximation of f*) should fit a collection of m data points, ((X1, Y1), …, (Xm, Ym)), be derivable from a background theory, have low complexity, and have bounded prediction error. More specifically, the inputs to our system are 4-tuples (B, C, D, M), described below (a minimal data-structure sketch follows the list).

  • Background Knowledge B: a set of domain-specific axioms expressed as logic formulae. They involve x, y, and possibly more variables that are necessary to formulate the background theory. In this work we focus mainly on first-order-logic formulae with equality, inequality, and basic arithmetic operators. We assume that the background theory B is complete, that is, it contains all the axioms necessary to comprehensively explain the phenomena under consideration, and consistent, that is, the axioms do not contradict one another. These two assumptions guarantee that there exists a unique derivable function fB that logically represents the variable of interest y. Note that although the derivable function is unique, there may exist different functional forms that are equivalent on the domain of interest. Considering the domain with two points {0, 1} for a variable x, the two functional forms f(x) = x and f(x) = x² both define the same function.

  • A Hypothesis Class C: a set of admissible symbolic models defined by a grammar, a set of invariance constraints to avoid redundant expressions (e.g., A + B is equivalent to B + A), and constraints on the functional form (e.g., monotonicity).

  • Data D: a set of m examples, each providing certain values for x1, …, xn, and y.

  • Modeler Preferences M: a set of numerical parameters (e.g., error bounds on accuracy).
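For concreteness, the sketch below shows one possible way to organize these four inputs as a data structure. It is illustrative only: the field names, string-based axiom encoding, and default bounds are our assumptions, not the system’s actual interface.

```python
# A minimal sketch (not the authors' implementation) of the 4-tuple input
# (B, C, D, M); all names and encodings here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class DiscoveryInput:
    background_theory: list   # B: axioms as logic formulae, e.g. "Fg = G*m1*m2/d^2"
    operators: set            # C: grammar operators, e.g. {"+", "-", "*", "/", "sqrt"}
    form_constraints: list    # C: invariance / functional-form constraints
    data: list                # D: m examples, each ((x1, ..., xn), y)
    max_error: float = 0.05   # M: modeler-specified accuracy bound
    max_complexity: int = 20  # M: modeler-specified complexity bound

kepler_input = DiscoveryInput(
    background_theory=["m1*d1 = m2*d2", "d = d1 + d2", "Fg = G*m1*m2/d^2"],
    operators={"+", "-", "*", "/", "sqrt"},
    form_constraints=[],
    data=[((1.0, 3.0e-6, 1.0), 1.0)],  # scaled (m1, m2, d) -> p, illustrative values
)
```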

Generalized notion of distance

In general, there may not exist a function f ∈ C that fits the data exactly and is derivable from B. This could happen because the symbolic model generating the data might not belong to C, the sensors used to collect the data might give noisy measurements, or the background knowledge might be inaccurate or incomplete. To quantify the compatibility of a symbolic model with data and background theory, we introduce the notion of distance between a model f and B. Roughly, it reflects the error between the predictions of f and the predictions of a formula fB derivable from B (thus, the distance equals zero when f is derivable from B). Figure 1 provides a visualization of these two notions of distance for the problem of learning Kepler’s third law of planetary motion from solar-system data and background theory.

Fig. 1. Visualization of relevant sets and their distances.


The numerical data, background theory, and a discovered model are depicted for Kepler’s third law of planetary motion giving the orbital period of a planet in the solar system. The data consists of measurements (m1, m2, d, p) of the mass of the sun m1, the orbital period p and mass m2 for each planet, and its distance d from the sun. The background theory amounts to Newton’s laws of motion, i.e., the formulae for centrifugal force, gravitational force, and equilibrium conditions. The 4-tuples (m1, m2, d, p) are projected into (m1 + m2, d, p). The blue manifold represents solutions of fB, which is the function derivable from the background-theory axioms that represents the variable of interest. The gray manifold represents solutions of the discovered model f. The double arrows indicate the distances β(f) and ε(f).

Integration of statistical and symbolic AI

Our system consists mainly of an SR module and a reasoning module. The SR module returns multiple candidate symbolic models (or formulae) expressing y as a function of x1, …, xn that fit the data. For each of these models, the system outputs the distance ε(f) between f and D and the distance β(f) between f and B. We will also be referring to ε(f) and β(f) as errors.

These functions are also tested to see if they satisfy the specified constraints on the functional form (in C) and the modeler-specified level of accuracy and complexity (in M). When the models are passed to the reasoning module (along with the background theory B), they are tested for derivability. If a model is found to be derivable from B, it is returned as the chosen model for prediction; otherwise, if the reasoning module concludes that no candidate model is derivable, it is necessary to either collect additional data or add constraints. In this case, the reasoning module will return a quality assessment of the input set of candidate hypotheses based on the distance β, removing models that do not satisfy the modeler-specified bounds on β. The distance (or error) β is computed between a function (or formula) f, derived from numerical data, and the derivable function fB, which is implicitly defined by the set of axioms in B and is logically represented by the variable of interest y. The distance between the function fB and any other formula f depends only on the background theory and the formula f, and not on any particular functional form of fB. Moreover, the reasoning module can prove that a model is not derivable by returning counterexample points that satisfy B but do not fit the model.

Interplay between data and theory in AI-Descartes

SR is typically solved with genetic programming (GP) [13, 16]; however, methods based on mixed-integer nonlinear programming (MINLP) have recently been proposed [17–19]. In this work, we develop a new MINLP-based SR solver (described in the Supplementary Information). The input consists of a subset of the operators {+, −, ×, ÷, √, log, exp}, an upper bound on expression complexity, and an upper bound on the number of constants used that do not equal 1. Given a dataset, the system formulates multiple MINLP instances to find an expression that minimizes the least-square error. Each instance is solved approximately, subject to a time limit. Both linear and nonlinear constraints can be imposed. In particular, dimensional consistency is imposed when physical dimensions of variables are available.
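To illustrate the shape of this search, the sketch below uses a deliberately simplified stand-in: instead of solving MINLP instances, it enumerates a few fixed expression templates and fits their constants by least squares. The templates and synthetic data are our illustrative choices, not the solver’s actual search space.

```python
# Toy stand-in (not the authors' MINLP solver) for the symbolic-regression step:
# enumerate fixed functional templates, fit constants, rank by l2 error.
import numpy as np
from scipy.optimize import curve_fit

templates = {
    "c*x^3":       lambda x, c: c * x**3,
    "sqrt(c*x^3)": lambda x, c: np.sqrt(np.abs(c) * x**3),
    "c1*x + c2":   lambda x, c1, c2: c1 * x + c2,
}

def fit_all(x, y):
    results = []
    for name, f in templates.items():
        n_params = f.__code__.co_argcount - 1       # constants in this template
        try:
            popt, _ = curve_fit(f, x, y, p0=np.ones(n_params), maxfev=5000)
            err = np.sqrt(np.sum((f(x, *popt) - y) ** 2))  # absolute l2 error
            results.append((err, name, popt))
        except RuntimeError:
            pass                                    # fit did not converge
    return sorted(results, key=lambda t: t[0])

x = np.linspace(0.4, 30.0, 20)    # e.g. scaled orbital distances (illustrative)
y = np.sqrt(0.1319 * x**3)        # synthetic "period" data
for err, name, popt in fit_all(x, y):
    print(f"{name:14s} error={err:.4g} constants={np.round(popt, 4)}")
```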

We use KeYmaera X [20] as a reasoning tool; it is an ATP for hybrid systems and combines different types of reasoning: deductive, real-algebraic, and computer-algebraic reasoning. We also use Mathematica [21] for certain types of analysis of symbolic expressions. While a formula found by any grammar-based system (such as an SR system) is syntactically correct, it may contradict the axioms of the theory or not be derivable from them. In some cases, a formula may not be derivable as the theory may not have enough axioms; the formula may be provable under an extended axiom set or an alternative one (e.g., using a relativistic set of axioms rather than a “Newtonian” one).

An overview of our system seen as a discovery cycle is shown in Fig. 2. Our discovery cycle is inspired by Descartes, who advanced the scientific method and emphasized the role that logical deduction, and not empirical evidence alone, plays in forming and validating scientific discoveries. Our present approach differs from implementations of the scientific method that obtain hypotheses from theory and then check them against data; instead, we obtain hypotheses from data and assess them against theory. A more detailed schematic of the system is depicted in Fig. 3, where the colored components correspond to the system we present in this work, and the gray components refer to standard techniques for scientific discovery that we have not yet integrated into our current implementation.

Fig. 2. An interpretation of the scientific method as implemented by our system.


The colors match the respective components of the system in Fig. 3.

Fig. 3. System overview.


Colored components correspond to our system, and gray components indicate standard techniques for scientific discovery (human-driven or artificial) that have not been integrated into the current system. The colors match the respective components of the discovery cycle of Fig. 2. The present system generates hypotheses from data using symbolic regression, which are posed as conjectures to an automated deductive reasoning system, which proves or disproves them based on background theory or provides reasoning-based quality measures.

Experimental validation

We tested the different capabilities of our system on three problems (more details in the Methods section). First, we considered the problem of deriving Kepler’s third law of planetary motion, providing reasoning-based measures to analyze the quality and generalizability of the generated formulae. Extracting this law from experimental data is challenging, especially when the masses involved are of very different magnitudes. This is the case for the solar system, where the solar mass is much larger than the planetary masses. The reasoning module helps in choosing between different candidate formulae and identifying the one that generalizes well: using our integration of data and theory, we were able to re-discover Kepler’s third law. We then considered Einstein’s time-dilation formula. Although we did not recover this formula from data, we used the reasoning module to identify the formula that generalizes best. Moreover, by analyzing the reasoning errors with two different sets of axioms (one with “Newtonian” assumptions and one relativistic), we were able to identify the theory that better explains the phenomenon. Finally, we considered Langmuir’s adsorption equation, whose background theory contains material-dependent coefficients. By relating these coefficients to the ones in the SR-generated models via existential quantification, we were able to logically prove one of the extracted formulae.

Discussion

We have demonstrated the value of combining logical reasoning with symbolic regression in obtaining meaningful symbolic models of physical phenomena, in the sense that they are consistent with background theory and generalize well in a domain that is significantly larger than the experimental data. The synthesis of regression and reasoning yields better models than can be obtained by SR or logical reasoning alone.

Improvements or replacements of individual system components and introduction of new modules such as abductive reasoning or experimental design [22] (not described in this work for the sake of brevity) would extend the capabilities of the overall system. A deeper integration of reasoning and regression can help synthesize models that are both data driven and based on first principles, and lead to a revolution in the scientific discovery process. The discovery of models that are consistent with prior knowledge will accelerate scientific discovery, and enable going beyond existing discovery paradigms.

Methods

We next describe in detail the methodologies used to address the three problems studied to validate our method: Kepler’s third law of planetary motion, relativistic time dilation, and Langmuir’s adsorption equation.

Kepler’s third law of planetary motion

Kepler’s law relates the distance d between two bodies (e.g., the sun and a planet in the solar system) and their orbital periods. It can be expressed as

p = √( 4π²d³ / (G(m1 + m2)) ),    (1)

where p is the period, G is the gravitational constant, and m1 and m2 are the two masses. It can be derived using the following axioms of the background theory B, describing the center of mass (axiom K1), the distance between bodies (axiom K2), the gravitational force (axiom K3), the centrifugal force (axiom K4), the force balance (axiom K5), and the period (axiom K6):

K1. Center of mass: m1·d1 = m2·d2
K2. Distance between bodies: d = d1 + d2
K3. Gravitational force: Fg = G·m1·m2/d²
K4. Centrifugal force: Fc = m2·d2·w²
K5. Force balance: Fg = Fc
K6. Period definition: p = 2π/w
K7. Positivity constraints: m1 > 0, m2 > 0, p > 0, d1 > 0, d2 > 0.    (2)

We consider three real-world datasets: planets of the solar system (from the NASA Planetary Fact Sheet [23]), the solar-system planets along with exoplanets from Trappist-1 and the GJ 667 system (from the NASA Exoplanet Archive [24]), and binary stars [25]. These datasets contain measurements of pairs of masses (a sun and a planet for the first two, and two suns for the third), the distance between them, and the orbital period of the planet around its sun in the first two datasets or the orbital period around the common center of mass in the third dataset. The data we use is given in the Supplementary Information. Note that the dataset does not contain measurements for a number of variables in the axiom system, such as d1, d2, Fg, etc.

The goal is to recover Kepler’s third law (Eq. (1)) from the data, that is, to obtain p as the above-stated function of d, m1, and m2.
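As an informal cross-check (outside the paper’s toolchain, which encodes the axioms in KeYmaera X), the derivation of Eq. (1) from axioms K1–K6 can be reproduced with a computer-algebra system; the sketch below uses sympy and is ours, not the authors’ encoding:

```python
# Eliminate the unmeasured variables (d1, d2, Fg, Fc, w) from axioms K1-K6
# and solve for the period p; positivity (K7) picks the physical branch.
import sympy as sp

m1, m2, d, d1, d2, Fg, Fc, w, p, G = sp.symbols(
    'm1 m2 d d1 d2 Fg Fc w p G', positive=True)

axioms = [
    sp.Eq(m1 * d1, m2 * d2),         # K1: center of mass
    sp.Eq(d, d1 + d2),               # K2: distance between bodies
    sp.Eq(Fg, G * m1 * m2 / d**2),   # K3: gravitational force
    sp.Eq(Fc, m2 * d2 * w**2),       # K4: centrifugal force
    sp.Eq(Fg, Fc),                   # K5: force balance
    sp.Eq(p, 2 * sp.pi / w),         # K6: period definition
]

sol = sp.solve(axioms, [p, d1, d2, Fg, Fc, w], dict=True)
print(sp.simplify(sol[0][p]))  # -> 2*pi*sqrt(d**3/(G*(m1 + m2))), i.e. Eq. (1)
```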

The SR module takes as input the set of operators {+, −, ×, ÷, √} and outputs a set of candidate formulas. None of the formulae obtained via SR are derivable, though some are close approximations to derivable formulae. We evaluate the quality of these formulae by writing a logic program for calculating the error β of a formula with respect to a derivable formula. We use three measures, defined below, to assess the correctness of a data-driven formula from a reasoning viewpoint: the pointwise reasoning error, the generalization reasoning error, and variable dependence.

Pointwise reasoning error

The key idea is to compute a distance between a formula generated from the numerical data and some derivable formula that is implicitly defined by the axiom set. The distance is measured by the l2 or l∞ norm applied to the differences between the values of the numerically derived formula and a derivable formula at the points in the dataset. This definition can be extended to other norms.

We compute the relative error of a numerically derived formula f(x) applied to the m data points Xi (i = 1, …, m) with respect to fB(x), derivable from the axioms, via the following expressions:

β₂ʳ = √( Σ_{i=1..m} ((f(Xi) − fB(Xi)) / fB(Xi))² )    and    β∞ʳ = max_{1≤i≤m} |(f(Xi) − fB(Xi)) / fB(Xi)|,    (3)

where fB(Xi) denotes a derivable formula for the variable of interest y evaluated at the data point Xi.

The KeYmaera formulation of these two measures for the first formula of Table 1 can be found in the Supplementary Information. Absolute-error variants of the first and second expressions in Eq. (3) are denoted by β₂ᵃ and β∞ᵃ, respectively. The numerical (data) error measures ε₂ʳ and ε∞ʳ are defined by replacing fB(Xi) by Yi in Eq. (3). Analogous to β₂ᵃ and β∞ᵃ, we also define absolute numerical-error measures ε₂ᵃ and ε∞ᵃ.
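In code, these measures are one-liners; the sketch below (ours, for illustration) takes the candidate formula and a derivable formula already evaluated at the data points:

```python
# Numpy transcription of Eq. (3) and its absolute variants; f_vals and fB_vals
# are f and f_B evaluated at the data points X_1..X_m.
import numpy as np

def reasoning_errors(f_vals, fB_vals):
    diff = f_vals - fB_vals
    rel = diff / fB_vals
    return {
        "beta2_rel":   np.sqrt(np.sum(rel ** 2)),   # beta_2^r
        "betainf_rel": np.max(np.abs(rel)),         # beta_inf^r
        "beta2_abs":   np.sqrt(np.sum(diff ** 2)),  # beta_2^a
        "betainf_abs": np.max(np.abs(diff)),        # beta_inf^a
    }

# The numerical errors eps_2^r, eps_inf^r, etc. reuse the same computation with
# the observed values Y_i in place of fB_vals:
#   numerical = reasoning_errors(f_vals, Y_vals)
```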

Table 1.

Error values of candidate solutions for the Kepler dataset

| Dataset | Candidate formula (p =) | ε₂ʳ | ε∞ʳ | β₂ʳ | β∞ʳ | β∞,Sʳ | m1 | m2 | d |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Solar | 0.1319d³ | 0.0129 | 0.0064 | 0.0146 | 0.0052 | 0.0052 | 0 | 0 | 1 |
| Solar | 0.1316(d³ + d) | 1.9348 | 1.7498 | 1.9385 | 1.7533 | 1.7559 | 0 | 0 | 0 |
| Solar | (0.03765d³ + d²)/(2 + d) | 0.3102 | 0.2766 | 0.3095 | 0.2758 | 0.2758 | 0 | 0 | 0 |
| Exoplanet | 0.1319d³/m1 | 0.0845 | 0.0819 | 0.0231 | 0.0052 | 0.0052 | 0 | 0 | 1 |
| Exoplanet | m1²m2³/d + 0.1319d³/m1 | 0.1988 | 0.1636 | 0.1320 | 0.1097 | > 550 | 0 | 0 | 0 |
| Exoplanet | (10.7362m1)d³/2 | 1.2246 | 0.4697 | 1.2418 | 0.4686 | 0.4686 | 0 | 0 | 1 |
| Binary stars | 1/(d²m1²) + 1/(dm2²)·m1³m2² + 0.4787d³/m2 + d²m2² | 0.0023 | 0.0015 | 0.0059 | 0.0050 | Timeout | 0 | 0 | 0 |
| Binary stars | (d³ + m1³m2/d)/m1 + m2 | 0.0032 | 0.0031 | 0.0038 | 0.0031 | Timeout | 0 | 0 | 0 |
| Binary stars | d³/(0.9967m1 + m2) | 0.0058 | 0.0053 | 0.0014 | 0.0008 | 0.0020 | 1 | 1 | 1 |

Numerical error values, pointwise reasoning error values, and generalization reasoning error values are shown, together with an analysis of the variable dependence of candidate solutions. For simplicity of notation, the table uses the variables d, m1, m2, and p while referring to their scaled counterparts. All errors are relative. Columns (numbered 1–10, left to right): (1) dataset; (2) candidate formula; (3–4) numerical errors ε₂ʳ, ε∞ʳ; (5–6) pointwise reasoning errors β₂ʳ, β∞ʳ; (7) generalization reasoning error β∞,Sʳ; (8–10) dependencies on m1, m2, d.

Table 1 reports in columns 5 and 6 the values of β₂ʳ and β∞ʳ, respectively. It also reports the relative numerical errors ε₂ʳ and ε∞ʳ in columns 3 and 4, measured by the l2 and l∞ norms, respectively, for the candidate expressions given in column 2 when evaluated on the points in the dataset. We minimize the absolute l2 error ε₂ᵃ (and not the relative error ε₂ʳ) when obtaining candidate expressions via symbolic regression.

The pointwise reasoning errors β₂ and β∞ are not very informative if SR yields a low-error candidate expression (measured with respect to the data) and the data itself satisfies the background theory up to a small error, which is indeed the case with the data we use; the reasoning errors and numerical errors are then very similar.

Generalization reasoning error

Even when one can find a function that fits given data points well, it is challenging to obtain a function that generalizes well, that is, one which yields good results at points of the domain not equal to the data points. Let β∞,Sʳ be calculated for a candidate formula f(x) over a domain S that is not equal to the original set of data points, as follows:

β∞,Sʳ = max_{x∈S} |(f(x) − fB(x)) / fB(x)|,    (4)

where we consider the relative error and, as before, the function fB(x) is not known but is implicitly defined by the axioms in the background theory. We call this measure the relative generalization reasoning error. If we do not divide by fB(x) in the above expression, we get the corresponding absolute version β∞,Sᵃ. For the Kepler dataset, we let S be the smallest multi-dimensional interval (or Cartesian product of intervals on the real line) containing all data points. In column 7 of Table 1, we show the relative generalization reasoning error β∞,Sʳ on the Kepler datasets with S defined as above. If this error is roughly the same as β∞ʳ, the pointwise relative reasoning error for l∞ (e.g., for the solar-system dataset), then the formula extracted from the numerical data is as accurate at points in S as it is at the original data points.
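In the paper this supremum is certified by the reasoner with fB only implicit in the axioms. As a purely numerical illustration, the sketch below approximates β∞,Sʳ by a dense grid search against an explicitly known ground-truth formula — an assumption the reasoning module does not need; the box, formulas, and scaling are illustrative:

```python
# Grid-search approximation of Eq. (4) over a box S; fB is written explicitly
# here only for illustration (the reasoning module works from the axioms).
import itertools
import numpy as np

def gen_reasoning_error(f, fB, box, pts=40):
    grids = [np.linspace(lo, hi, pts) for lo, hi in box]
    return max(abs((f(*x) - fB(*x)) / fB(*x))
               for x in itertools.product(*grids))

# Illustrative scaled Kepler setup: the candidate ignores both masses.
f  = lambda m1, m2, d: np.sqrt(d**3)              # mass-independent candidate
fB = lambda m1, m2, d: np.sqrt(d**3 / (m1 + m2))  # derivable (scaled) law
S  = [(0.9, 1.1), (1e-6, 1e-3), (0.4, 31.0)]      # box around the data (illustrative)
print(gen_reasoning_error(f, fB, S))              # small: f generalizes within S
```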

Variable dependence

In order to check if the functional dependence of a candidate formula on a specific variable is accurate, we compute the generalization error over a domain S where the domain of this variable is extended by an order of magnitude beyond the smallest interval containing the values of the variable in the dataset. Thus we can check whether there exist special conditions under which the formula does not hold. We modify the endpoints of an interval by one order of magnitude, one variable at a time. If we notice an increase in the generalization reasoning error while modifying intervals for one variable, we deem the candidate formula as missing a dependency on that variable. A missing dependency might occur, for example, because the exponent for a variable is incorrect, or because that variable is not considered at all when it should be. One can get further insight into the type of dependency by analyzing how the error varies (e.g., linearly or exponentially). Table 1 provides, in columns 8–10, results regarding the candidate formulae for Kepler’s third law. For each formula, the dependencies on m1, m2, and d are indicated by 1 or 0 (for correct or incorrect dependency). For example, the candidate formula p = 0.1319d³ for the solar system does not depend on either mass, and the dependency analysis suggests that the formula approximates the phenomenon well in the solar system, but not for larger masses.
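The same numerical stand-in extends to this probe, reusing gen_reasoning_error and the Kepler setup from the previous sketch; the widening factor and flag threshold below are our illustrative choices, not the paper’s reasoner-based test:

```python
# Sketch of the variable-dependence probe: widen one variable's interval by an
# order of magnitude at a time and watch whether the error grows.
def dependence_flags(f, fB, box, factor=10.0, tol=1.05):
    base = gen_reasoning_error(f, fB, box)
    flags = []
    for j, (lo, hi) in enumerate(box):
        widened = list(box)
        widened[j] = (lo, hi * factor)      # extend this variable's upper endpoint
        grown = gen_reasoning_error(f, fB, widened)
        # 1 = dependency captured, 0 = missing (cf. columns 8-10 of Table 1);
        # the threshold tol is an illustrative choice.
        flags.append(1 if grown <= tol * max(base, 1e-12) else 0)
    return flags

print(dependence_flags(f, fB, S))   # -> [0, 0, 1]: masses missing, d captured
```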

The best formula for the binary-star dataset, d³/(0.9967m1 + m2), has no missing dependency (all ones in columns 8–10); that is, it generalizes well: increasing the domain along any variable does not increase the generalization reasoning error.

Figure 4 provides a visualization of the two errors ε₂ʳ and β₂ʳ for the first three functions of Table 1 (solar-system dataset) and the ground truth f*.

Fig. 4. Depiction of symbolic models for Kepler’s third law of planetary motion giving the orbital period of a planet in the solar system.


The models produced by our SR system are represented by points (ε, β), where ε represents distance to data, and β represents distance to background theory. Both distances are computed with an appropriate norm on the scaled data.

Relativistic time dilation

Einstein’s theory of relativity postulates that the speed of light is constant, and implies that two observers in relative motion to each other will experience time differently and observe different clock frequencies. The frequency f for a clock moving at speed v is related to the frequency f0 of a stationary clock by the formula

(f − f0)/f0 = √(1 − v²/c²) − 1,    (5)

where c is the speed of light. This formula was recently confirmed experimentally by Chou et al. [26] using high-precision atomic clocks. We test our system on the experimental data reported by Chou et al. [26], which consists of measurements of v and associated values of (f − f0)/f0, reproduced in the Supplementary Information. We take the axioms for the derivation of the time-dilation formula from the work of Behroozi [27] and Smith [28]. These are also listed in the Supplementary Information and involve variables that are not present in the experimental data.
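Before looking at the candidates, it helps to see why quadratic fits arise at all: for v ≪ c, Eq. (5) is numerically indistinguishable from −v²/(2c²). A quick check (our illustration, in SI units; the fitted constants in Table 2 differ because the experimental data are rescaled):

```python
# Compare Eq. (5) with its second-order Taylor approximation at laboratory speeds.
import numpy as np

c = 299_792_458.0                  # speed of light, m/s
v = np.linspace(1.0, 40.0, 40)     # clock speeds of order 10 m/s
exact  = np.sqrt(1.0 - (v / c)**2) - 1.0   # Eq. (5): (f - f0)/f0
taylor = -v**2 / (2.0 * c**2)              # low-speed approximation
print(np.max(np.abs(exact - taylor)))      # at double-precision rounding level
```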

In Table 2 we give some functions obtained by our SR module (using {+, −, ×, ÷, √} as the set of input operators) along with the numerical errors of the associated functions and generalization reasoning errors. The sixth column gives the set S as an interval for v for which our reasoning module can verify that the absolute generalization reasoning error of the function in the first column is at most 1. The last column gives the interval for v for which we can verify a relative generalization reasoning error of at most 2%. Even though the last function has low relative error according to this metric, it can be ruled out as a reasonable candidate if one assumes the target function should be continuous (it has a singularity at v = 1). Thus, even though we cannot obtain the original function, we obtain another which generalizes well, as it yields excellent predictions for a very large range of velocities.

Table 2.

Candidate functions derived from time dilation data, and associated error values

| Candidate formula (f =) | ε₂ᵃ | ε∞ᵃ | ε₂ʳ | ε∞ʳ | S s.t. β∞,Sᵃ ≤ 1 | S s.t. β∞,Sʳ ≤ 0.02 |
| --- | --- | --- | --- | --- | --- | --- |
| −0.00563v² | 0.3822 | 0.3067 | 1.0811 | 0.0018 | 37 ≤ v ≤ 115 | 37 ≤ v ≤ 108 |
| v/(1 + 0.00689v) − v | 0.3152 | 0.2097 | 1.0125 | 0.0069 | 37 ≤ v ≤ 49 | 37 ≤ v ≤ 38 |
| 0.00537v²/(−v + v²/(v − 1)) | 0.3027 | 0.2299 | 1.2544 | 0.0021 | 37 ≤ v ≤ 98 | 37 ≤ v ≤ 109 |
| 0.00545v⁴/(−v² + v²/(v − 1)) | 0.3238 | 0.2531 | 1.1308 | 0.0010 | 37 ≤ v ≤ 126 | 37 ≤ v ≤ 107 |

The values of v are given in m/s.

In this case, our system can also help rule out alternative axioms. Consider replacing the axiom that the speed of light is a constant value c by a “Newtonian” assumption that light behaves like other mechanical objects: if emitted from an object with velocity v in a direction perpendicular to the direction of motion of the object, it has velocity √(v² + c²). Replacing c by √(v² + c²) (in axiom R2 in the Supplementary Information, to obtain R2’) produces a self-consistent axiom system (as confirmed by the theorem prover), albeit one leading to no time dilation. Our reasoning module concludes that none of the functions in Table 2 is compatible with this updated axiom system: the absolute generalization reasoning error is greater than 1 even on the dataset domain, as is the pointwise reasoning error. Consequently, the data is used indirectly to discriminate between axiom systems relevant for the phenomenon under study; SR poses only accurate formulae as conjectures.

Langmuir’s adsorption equation

The Langmuir adsorption equation (Nobel Prize in Chemistry, 1932) [29] describes a chemical process in which gas molecules contact a surface, and relates the loading q on the surface to the pressure p of the gas:

q = qmax·Ka·p / (1 + Ka·p).    (6)

The constants qmax and Ka characterize the maximum loading and the adsorption strength, respectively. A similar model for a material with two types of adsorption sites yields:

q = qmax,1·Ka,1·p / (1 + Ka,1·p) + qmax,2·Ka,2·p / (1 + Ka,2·p),    (7)

with parameters for maximum loading and adsorption strength on each type of site. The parameters in Eqs. (6) and (7) are fitted to experimental data using linear or nonlinear regression, and depend on the material, the gas, and the temperature.
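As a baseline for what fitting these parameters means, the sketch below fits Eq. (6) by nonlinear least squares; the pressures, loadings, and true parameter values are synthetic stand-ins, not Langmuir’s measurements:

```python
# Conventional nonlinear fit of the one-site Langmuir isotherm, Eq. (6),
# on synthetic data; illustrative only.
import numpy as np
from scipy.optimize import curve_fit

def langmuir(p, q_max, K_a):
    return q_max * K_a * p / (1.0 + K_a * p)

rng = np.random.default_rng(0)
p = np.array([2.0, 5.0, 10.0, 20.0, 50.0, 100.0])            # pressures (arbitrary units)
q = langmuir(p, 10.5, 0.12) + rng.normal(0.0, 0.05, p.size)  # noisy loadings

(q_max_fit, K_a_fit), _ = curve_fit(langmuir, p, q, p0=[1.0, 1.0])
print(q_max_fit, K_a_fit)   # close to the true values (10.5, 0.12)
```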

We used data from Langmuir’s 1918 publication [29] for methane adsorption on mica at a temperature of 90 K, and also data from the work of Sun et al. [30] (Table 1) for isobutane adsorption on silicalite at a temperature of 277 K. In both cases, observed values of q are given for specific values of p; the goal is to express q as a function of p. We give the SR module the operators {+, −, ×, ÷}, and obtain the best-fitting functions with two and four constants. The code ran for 20 minutes on 45 cores, and seven of these functions are displayed for each dataset.

To encode the background theory, following Langmuir’s original theory [29], we elicited the following set A of axioms:

L1. Site balance: S0 = S + Sa
L2. Adsorption rate model: rads = kads·p·S
L3. Desorption rate model: rdes = kdes·Sa
L4. Equilibrium assumption: rads = rdes
L5. Mass balance on q: q = Sa.    (8)

Here, S0 is the total number of sites, of which S are unoccupied and Sa are occupied (L1). The adsorption rate rads is proportional to the pressure p and the number of unoccupied sites (L2). The desorption rate rdes is proportional to the number of occupied sites (L3). At equilibrium, rads = rdes (L4), and the total amount adsorbed, q, is the number of occupied sites (L5), because the model assumes each site adsorbs at most one molecule. Langmuir solved these equations to obtain

q = S0·(kads/kdes)·p / (1 + (kads/kdes)·p),    (9)

which corresponds to Eq. (6), where qmax = S0 and Ka = kads/kdes. An axiomatic formulation for the multi-site Langmuir expression is described in the Supplementary Information. Additionally, constants and variables are constrained to be positive (e.g., S0 > 0, S > 0, and Sa > 0) or non-negative (e.g., q ≥ 0).
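As with Kepler’s law, this elimination of the unmeasured quantities can be reproduced mechanically; the sketch below uses sympy and is again our illustration, not the KeYmaera X encoding used for the formal proofs:

```python
# Eliminate S, Sa, rads, rdes from axioms L1-L5 and solve for the loading q.
import sympy as sp

S0, S, Sa, r_ads, r_des, k_ads, k_des, p, q = sp.symbols(
    'S0 S Sa r_ads r_des k_ads k_des p q', positive=True)

axioms = [
    sp.Eq(S0, S + Sa),            # L1: site balance
    sp.Eq(r_ads, k_ads * p * S),  # L2: adsorption rate model
    sp.Eq(r_des, k_des * Sa),     # L3: desorption rate model
    sp.Eq(r_ads, r_des),          # L4: equilibrium assumption
    sp.Eq(q, Sa),                 # L5: mass balance on q
]

sol = sp.solve(axioms, [q, S, Sa, r_ads, r_des], dict=True)
print(sp.simplify(sol[0][q]))  # -> S0*k_ads*p/(k_ads*p + k_des), i.e. Eq. (9)
```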

The logic formulation to prove is:

(C ∧ A) → f,    (10)

where C is the conjunction of the non-negativity constraints, A is a conjunction of the axioms, the union of C and A constitutes the background theory B, and f is the formula we wish to prove.

SR can only generate numerical expressions involving the (dependent and independent) variables occurring in the input data, with certain values for constants; for example, the expression f = p/(0.709p + 0.157). The expressions built from variables and constants of the background theory, such as Eq. (9), involve the constants (in their symbolic form) explicitly: for example, kads and kdes appear explicitly in Eq. (9), while SR only generates a numerical instance of the ratio of these constants. Thus, we cannot use Formula (10) directly to prove formulae generated from SR. Instead, we replace each numerical constant of the formula by a logic variable ci: for example, the formula f = p/(0.709p + 0.157) is replaced by f = p/(c1·p + c2), introducing two new variables c1 and c2. We then quantify the new variables existentially, and define a new set of non-negativity constraints C′. In the example above we will have C′ = c1 > 0 ∧ c2 > 0.

The final formulation to prove is:

∃c1 … ∃cn: (C ∧ A) → (f ∧ C′).    (11)

For example, f = p/(c1·p + c2) is proved true if the reasoner can prove that there exist values of c1 and c2 such that f satisfies the background theory A and the constraints C′. Here c1 and c2 can be functions of the constants kads, kdes, S0, and/or real numbers, but not of the variables q and p.
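A computer-algebra sketch of this existential step for the example above (our illustration; the actual proof obligation is discharged in KeYmaera X): clearing denominators and matching coefficients of p yields explicit witnesses for c1 and c2.

```python
# Find witnesses c1, c2 making the SR template p/(c1*p + c2) coincide with the
# derivable Eq. (9) for all p; an illustrative stand-in for the KeYmaera proof.
import sympy as sp

S0, k_ads, k_des, p, c1, c2 = sp.symbols('S0 k_ads k_des p c1 c2', positive=True)

f_template = p / (c1 * p + c2)                                     # constants abstracted
f_derived  = S0 * (k_ads / k_des) * p / (1 + (k_ads / k_des) * p)  # Eq. (9)

numer = sp.numer(sp.together(f_template - f_derived))  # clear denominators
conditions = sp.Poly(sp.expand(numer), p).coeffs()     # must vanish for all p
print(sp.solve(conditions, [c1, c2], dict=True))
# -> [{c1: 1/S0, c2: k_des/(S0*k_ads)}]
```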

We also consider background knowledge in the form of a list of desired properties of the relation between p and q, which helps trim the set of candidate formulae. Thus, we define a collection K of constraints on f, where q = f(p), enforcing monotonicity or certain types of limiting behavior (see Supplementary Information). We use Mathematica [21] to verify that a candidate function satisfies the constraints in K.

In Table 3, column 1 gives the data source, and column 2 gives the “hyperparameters” used in our SR experiments: we allow either two or four constants in the derived expressions. Furthermore, as the first constraint C1 from K can be modeled by simply adding the data point p = q = 0, we also experiment with an “extra point”.

Table 3.

Results on two datasets for the Langmuir problem

| Data | Condition | Candidate formula (q =) | ε₂ʳ | ε∞ʳ | KeYmaera provability | K constr. |
| --- | --- | --- | --- | --- | --- | --- |
| Langmuir [29] (Table IX) | 2 const. | f1: (p² + 2p − 1)/(0.00888p² + 0.118p) | 0.0631 | 0.0486 | Timeout | 2/5 |
| | | f2: p/(0.00927p + 0.0759) * | 0.1799 | 0.1258 | Yes | 5/5 |
| | 4 const. | f3: (p² − 10.5p − 15)/(0.00892p² − 1.23) | 0.0443 | 0.0295 | Timeout | 2/5 |
| | | f4: (8.86p + 13.9)/(0.0787p + 1) | 0.0658 | 0.0465 | No | 4/5 |
| | | f5: p²/(0.00895p² + 0.0934p − 0.0860) | 0.0759 | 0.0496 | No | 2/5 |
| | 4 const., extra point | f6: (p² + p)/(0.00890p² + 0.106p − 0.0311) | 0.0683 | 0.0470 | Timeout | 2/5 |
| | | f7: (112p² − p)/(p² + 10.4p − 9.66) | 0.0771 | 0.0532 | Timeout | 3/5 |
| Sun et al. [30] (Table 1) | 2 const. | g1: (p + 3)/(0.584p + 4.01) | 0.1625 | 0.1007 | No | 4/5 |
| | | g2: p/(0.709p + 0.157) | 0.9680 | 0.5120 | Yes | 5/5 |
| | 4 const. | g3: (0.0298p² + 1)/(0.0185p² + 1.16) − 0.000905/p² | 0.1053 | 0.0538 | Timeout | 2/5 |
| | | g4: 1/(p² + 1) + (2.53p − 1)/(1.54p + 2.77) | 0.1300 | 0.0725 | Timeout | 3/5 |
| | 4 const., extra point | g5: (1.74p² + 7.61p)/(p² + 9.29p + 0.129) | 0.1119 | 0.0996 | Timeout | 5/5 |
| | | g6: (0.226p² + 0.762p − 7.62·10⁻⁴)/(0.131p² + p) | 0.1540 | 0.0935 | Timeout | 2/5 |
| | | g7: (4.78p² + 26.6p)/(2.71p² + 30.4p + 1) | 0.1239 | 0.1364 | Timeout | 5/5 |

(* This expression is also generated when using four constants, and also when the extra point (0, 0) is added).

Column 3 displays a derived expression, while columns 4 and 5 give, respectively, the relative numerical errors ε₂ʳ and ε∞ʳ. If the expression can be derived from our background theory, this is indicated in column 6. These results are visualized in Fig. 5. Column 7 indicates the number of constraints from K that each expression satisfies, as verified by Mathematica. Among the top two-constant expressions, f1 fits the data better than f2, but f2 is derivable from the background theory, whereas f1 is not.

Fig. 5. Symbolic regression solutions to two adsorption datasets.


Fig. 5a refers to the methane adsorption on mica at a temperature of 90 K, while Fig. 5b refers to the isobutane adsorption on silicalite at a temperature of 277 K. f2 and g2 are equivalent to the single-site Langmuir equation; g5 and g7 are equivalent to the two-site Langmuir equation.

When we search for four-constant expressions on Langmuir’s dataset [29], we get much smaller errors than Eq. (6) or even Eq. (7), but we do not obtain the two-site formula (Eq. (7)) as a candidate expression. For the dataset from Sun et al. [30], g2 has a form equivalent to Langmuir’s one-site formula, and g5 and g7 have forms equivalent to Langmuir’s two-site formula, with appropriate values of qmax,i and Ka,i for i = 1, 2.

System limitations and future improvements

Our results on three problems and associated data are encouraging and provide the foundations of a new approach to automated scientific discovery. However, our work is only a first, although crucial, step towards completing the missing links in automating the scientific method.

One limitation of the reasoning component is the assumption of correctness and completeness of the background theory. The incompleteness could be partially addressed by the introduction of abductive reasoning [31] (as depicted in Fig. 3). Abduction is a logic technique that aims to find explanations for an observation (or a set of observations), given a logical theory. The explanation axioms are produced so as to satisfy the following: (1) the explanation axioms are consistent with the original logical theory, and (2) the observation can be deduced from the enhanced theory (the original logical theory combined with the explanation axioms). In our context, the logical theory corresponds to the set of background-knowledge axioms that describe a scientific phenomenon, the observation is one of the formulas extracted from the numerical data, and the explanations are the missing axioms in the incomplete background theory.

However, the availability of background-theory axioms in machine-readable format for physics and the other natural sciences is currently limited. Acquiring axioms could potentially be automated (or partially automated) using knowledge-extraction techniques. Extraction from technical books or articles that describe a natural-science phenomenon can be done, for example, with deep-learning methods (e.g., the work of Pfahler and Morik [32], Alexeeva et al. [33], or Wang and Liu [34]), both from plain natural-language text and from semi-structured text such as LaTeX or HTML. Despite recent advances in this research field, the quality of the existing tools remains quite inadequate with respect to the scope of our system.

Another limitation of our system, one that depends heavily on the tools used, is its scaling behavior. Excessive computational complexity is a major challenge for automated theorem provers (ATPs): for certain types of logic (including the one that we use), proving a conjecture is undecidable. Deriving models from a logical theory using formal reasoning tools is even more difficult when complex arithmetic and calculus operators are used. Moreover, the run-time variance of a theorem prover is very large: the system can at times solve some “large” problems while having difficulties with some “smaller” problems. Recent developments in the neuro-symbolic area use deep-learning techniques to enhance standard theorem provers (e.g., see Crouse et al. [8]). We are still at the early stages of this research, and there is much that can be done. We envision that the performance and capability (in terms of speed and expressivity) of theorem provers will improve with time. Symbolic-regression tools, including the one based on solving mixed-integer nonlinear programs (MINLP) that we developed, often take an excessive amount of time to explore the space of possible symbolic expressions and find one that has low error and expression complexity, especially with noisy data. In practice, the worst-case solution time for MINLP solvers (including BARON) grows exponentially with the input-data encoding size (additional details in the Supplementary Information). However, MINLP solver performance and genetic-programming-based symbolic-regression solvers are active areas of research.

Our proposed system could benefit from other improvements in individual components (especially in the functionality available). For example, KeYmaera X only supports differential equations in time and not in other variables, and does not support higher-order logic; BARON cannot handle differential equations.

Beyond improving individual components, our system can be improved by introducing techniques such as experimental design (not described in this work but envisioned in Fig. 3). A fundamental question in the holistic view of the discovery process is what data should be collected to give us maximum information regarding the underlying model. The goal of optimal experimental design (OED) is to find an optimal sequence of data-acquisition steps such that the uncertainty associated with the inferred parameters, or some predicted quantity derived from them, is minimized with respect to a statistical or information-theoretic criterion. In many realistic settings, experimentation may be restricted or costly, providing limited support for any given hypothesis as to the underlying functional form. It is therefore critical at times to incorporate an effective OED framework. In the context of model discovery, a large body of work addresses the question of experimental design for predetermined functional forms, and another body of research addresses the selection of a model (functional form) out of a set of candidates. A framework that can deal with both the functional form and the continuous set of parameters that define the model behavior is obviously desirable [22]; one that consistently accounts for logical derivability or knowledge-oriented considerations [35] would be even better.

Supplementary information

Peer Review File (4.2MB, pdf)

Acknowledgements

We thank J. Ilja Siepmann for initially suggesting adsorption as a problem for symbolic regression. We thank James Chin-wen Chou for providing the atomic clock data. Funding: This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) (PA-18-02-02). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. T.R.J. was supported by the U.S. Department of Energy (DOE), Office of Basic Energy Sciences, Division of Chemical Sciences, Geosciences and Biosciences (DE-FG02-17ER16362), as well as startup funding from the University of Maryland, Baltimore County. T.R.J. also gratefully acknowledges the University of Minnesota Institute for Mathematics and its Applications (IMA).

Author contributions

C.C. conceptualized the overarching derivable symbolic discovery architecture, designed the project, designed and implemented the reasoning module, designed and discussed the experiments, performed the experiments for the reasoning module and for the comparison with the state of the art, analyzed and formatted the data, formalized the scientific theories, formatted code and data for the release, wrote and edited the manuscript, and designed the figures. S.D. conceptualized the overarching derivable symbolic discovery architecture, designed the project, designed the SR module architecture, designed and discussed the experiments, performed the experiments for the SR module, analyzed and formatted the data, and wrote and edited the manuscript. V.A. designed and implemented the SR module. T.R.J. identified target problems and experimental datasets, formalized the scientific theories, discussed the experiments, designed the figures, and wrote and edited the manuscript. J.G. prepared the data, executed the computational experiments for the SR module and for the comparison with the state of the art, formatted code and data for the release, and edited the manuscript. K.C. discussed and designed the overarching project, discussed the experiments, and edited and revised the manuscript. N.M. discussed and designed the overarching project, discussed the experiments, provided conceptual advice, and edited the manuscript. B.E.K. designed figure 1, discussed the reasoning measures, and edited the manuscript. L.H. conceptualized the overarching derivable symbolic discovery architecture, designed the project, designed the experiments, analyzed the results per validation of the framework, and wrote and edited the manuscript.

Peer review

Peer review information

Nature Communications thanks Joseph Scott, Hiroaki Kitano and the other, anonymous reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.

Data availability

The data used in this study are available in the AI-Descartes GitHub repository [36]: https://github.com/IBM/AI-Descartes.

Code availability

The code used for this work can be found, freely available, at the AI-Descartes GitHub repository [36]: https://github.com/IBM/AI-Descartes.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Cristina Cornelio, Email: c.cornelio@samsung.com.

Lior Horesh, Email: lhoresh@us.ibm.com.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-023-37236-y.

References

  • 1. Koza, J. R. Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT Press, Cambridge, 1992).
  • 2. Koza, J. R. Genetic Programming II: Automatic Discovery of Reusable Programs (MIT Press, Cambridge, 1994).
  • 3. Schmidt, M. & Lipson, H. Distilling free-form natural laws from experimental data. Science 324, 81–85 (2009). doi: 10.1126/science.1165893.
  • 4. Martius, G. & Lampert, C. H. Extrapolation and learning equations. In Proceedings of the 29th Conference on Neural Information Processing Systems (NIPS-16) (2016).
  • 5. Iten, R., Metger, T., Wilming, H., Rio, L. & Renner, R. Discovering physical concepts with neural networks. Physical Review Letters 124 (2020).
  • 6. Udrescu, S.-M. & Tegmark, M. AI Feynman: A physics-inspired method for symbolic regression. Science Advances 6 (16) (2020).
  • 7. Grigoryev, D., Hirsch, E. & Pasechnik, D. Complexity of semialgebraic proofs. Moscow Math. J. 2, 647–679 (2002). doi: 10.17323/1609-4514-2002-2-4-647-679.
  • 8. Crouse, M. et al. A deep reinforcement learning based approach to learning transferable proof guidance strategies. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21) (2021).
  • 9. Parrilo, P. A. Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization. Ph.D. thesis, Caltech, Pasadena (2000).
  • 10. Barak, B. & Steurer, D. Sum-of-squares proofs and the quest toward optimal algorithms. International Congress of Mathematicians (ICM), Seoul, South Korea, August 13–21, 2014.
  • 11. Fawzi, A., Malinowski, M., Fawzi, H. & Fawzi, O. Learning dynamic polynomial proofs. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 32, 4181–4190 (2019).
  • 12. Marra, G., Giannini, F., Diligenti, M. & Gori, M. Constraint-based visual generation. In Artificial Neural Networks and Machine Learning – ICANN 2019: Image Processing, 28th International Conference on Artificial Neural Networks, Proceedings, 565–577 (2019).
  • 13. Scott, J., Panju, M. & Ganesh, V. LGML: Logic Guided Machine Learning (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence 34, 13909–13910 (2020).
  • 14. Ashok, D., Scott, J., Wetzel, S. J., Panju, M. & Ganesh, V. Logic guided genetic algorithms (student abstract). Proceedings of the AAAI Conference on Artificial Intelligence 35, 15753–15754 (2021). doi: 10.1609/aaai.v35i18.17873.
  • 15. Kubalík, J., Derner, E. & Babuška, R. Symbolic regression driven by training data and prior knowledge. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, 958–966 (2020).
  • 16. Augusto, D. A. & Barbosa, H. J. Symbolic regression via genetic programming. In Proceedings of the 6th Brazilian Symposium on Neural Networks, 173–178 (IEEE, 2000).
  • 17. Austel, V. et al. Globally optimal symbolic regression. NIPS Symposium on Interpretable Machine Learning (2017).
  • 18. Cozad, A. Data- and Theory-Driven Techniques for Surrogate-Based Optimization. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA (2014).
  • 19. Cozad, A. & Sahinidis, N. V. A global MINLP approach to symbolic regression. Math. Program. Ser. B 170, 97–119 (2018). doi: 10.1007/s10107-018-1289-x.
  • 20. Fulton, N., Mitsch, S., Quesel, J.-D., Völp, M. & Platzer, A. KeYmaera X: An axiomatic tactical theorem prover for hybrid systems. In Proceedings of the International Conference on Automated Deduction, CADE-25 (2015).
  • 21. Wolfram Mathematica. https://www.wolfram.com. Version 12.
  • 22. Clarkson, K. L. et al. Bayesian experimental design for symbolic discovery. Preprint at https://arxiv.org/abs/2211.15860 (2022).
  • 23. NASA. Planetary Fact Sheet. https://nssdc.gsfc.nasa.gov/planetary/factsheet/ (2017).
  • 24. NASA. Exoplanet Archive. https://exoplanetarchive.ipac.caltech.edu/ (2017).
  • 25. Novaković, B. Orbits of five visual binary stars. Baltic Astronomy 16, 435–442 (2007).
  • 26. Chou, C. W., Hume, D. B., Rosenband, T. & Wineland, D. J. Optical clocks and relativity. Science 329, 1630–1632 (2010). doi: 10.1126/science.1192720.
  • 27. Behroozi, F. A simple derivation of time dilation and length contraction in special relativity. Phys. Teach. 52, 410–412 (2014). doi: 10.1119/1.4895356.
  • 28. Smith, G. S. A simple electromagnetic model for the light clock of special relativity. Eur. J. Phys. 32, 1585–1595 (2011). doi: 10.1088/0143-0807/32/6/012.
  • 29. Langmuir, I. The adsorption of gases on plane surfaces of glass, mica and platinum. J. Am. Chem. Soc. 40, 1361–1403 (1918). doi: 10.1021/ja02242a004.
  • 30. Sun, M. S., Shah, D. B., Xu, H. H. & Talu, O. Adsorption equilibria of C1 to C4 alkanes, CO2, and SF6 on silicalite. J. Phys. Chem. 102, 1466–1473 (1998). doi: 10.1021/jp9730196.
  • 31. Douven, I. Abduction. In Zalta, E. N. (ed.) The Stanford Encyclopedia of Philosophy (Metaphysics Research Lab, Stanford University, Summer 2021 edn).
  • 32. Pfahler, L. & Morik, K. Semantic search in millions of equations. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-20), 135–143 (Association for Computing Machinery, 2020).
  • 33. Alexeeva, M. et al. MathAlign: Linking formula identifiers to their contextual natural language descriptions. In Proceedings of the 12th Language Resources and Evaluation Conference, 2204–2212 (European Language Resources Association, 2020).
  • 34. Wang, Z. & Liu, J.-C. S. Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training. Int. J. Document Anal. Recognit. 24, 63–75 (2021). doi: 10.1007/s10032-020-00360-2.
  • 35. Haber, E., Horesh, L. & Tenorio, L. Numerical methods for experimental design of large-scale linear ill-posed inverse problems. Inverse Problems 24, 055012 (2008). doi: 10.1088/0266-5611/24/5/055012.
  • 36. Cornelio, C. et al. AI-Descartes GitHub repository: Combining data and theory for derivable scientific discovery with AI-Descartes. https://github.com/IBM/AI-Descartes (2023).
