. 2023 Apr 12;14:1777. doi:10.1038/s41467-023-37236-y

Combining data and theory for derivable scientific discovery with AI-Descartes

Cristina Cornelio^1,^2,^✉,Sanjeeb Dash¹,Vernon Austel¹,Tyler R Josephson^3,⁴,Joao Goncalves¹,Kenneth L Clarkson¹,Nimrod Megiddo¹,Bachir El Khadir¹,Lior Horesh^1,^5,^✉

¹IBM Research—Mathematics and Theoretical Computer Science, New York, NY USA

²Samsung AI—Machine Learning and Data Intelligence, Cambridge, UK

³Department of Chemical, Biochemical, and Environmental Engineering, University of Maryland, Baltimore County, MD USA

⁴Department of Chemistry and Chemical Theory Center, University of Minnesota, Minneapolis, MN USA

⁵Columbia University, Computer Science, New York, NY USA

^✉

Corresponding author.

Received 2021 Oct 28; Accepted 2023 Mar 8; Collection date 2023.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visithttp://creativecommons.org/licenses/by/4.0/.

PMC Copyright notice

PMCID: PMC10097814 PMID:37045814

Abstract

Scientists aim to discover meaningful formulae that accurately describe experimental data. Mathematical models of natural phenomena can be manually created from domain knowledge and fitted to data, or, in contrast, created automatically from large datasets with machine-learning algorithms. The problem of incorporating prior knowledge expressed as constraints on the functional form of a learned model has been studied before, while finding models that are consistent with prior knowledge expressed via general logical axioms is an open problem. We develop a method to enable principled derivations of models of natural phenomena from axiomatic knowledge and experimental data by combining logical reasoning with symbolic regression. We demonstrate these concepts for Kepler’s third law of planetary motion, Einstein’s relativistic time-dilation law, and Langmuir’s theory of adsorption. We show we can discover governing laws from few data points when logical reasoning is used to distinguish between candidate formulae having similar error on the data.

Subject terms: Computer science, Information theory and computation

Automatic extraction of consistent governing laws from data is a challenging problem. The authors propose a method that takes as input experimental data and background theory and combines symbolic regression with logical reasoning to obtain scientifically meaningful symbolic formulas.

Introduction

Artificial neural networks (NN) and statistical regression are commonly used to automate the discovery of patterns and relations in data. NNs return “black-box” models, where the underlying functions are typically used for prediction only. In standard regression, the functional form is determined in advance, so model discovery amounts to parameter fitting. In symbolic regression (SR)^1,2, the functional form is not determined in advance, but is instead composed from operators in a given list (e.g., + , − , × , and ÷) and calculated from the data. SR models are typically more “interpretable” than NN models, and require less data. Thus, for discovering laws of nature in symbolic form from experimental data, SR may work better than NNs or fixed-form regression³; integration of NNs with SR has been a topic of recent research in neuro-symbolic AI^4–6. A major challenge in SR is to identify, out of many models that fit the data, those that are scientifically meaningful. Schmidt and Lipson³ identify meaningful functions as those that balance accuracy and complexity. However many such expressions exist for a given dataset, and not all are consistent with the known background theory.

Another approach would be to start from the known background theory, but there are no existing practical reasoning tools that generate theorems consistent with experimental data from a set of known axioms. Automated Theorem Provers (ATPs), the most widely-used reasoning tools, instead solve the task of proving a conjecture for a given logical theory. Computational complexity is a major challenge for ATPs; for certain types of logic, proving a conjecture is undecidable. Moreover, deriving models from a logical theory using formal reasoning tools is especially difficult when arithmetic and calculus operators are involved (e.g., see the work of Grigoryev et al.⁷ for the case of inequalities). Machine-learning techniques have been used to improve the performance of ATPs, for example, by using reinforcement learning to guide the search process⁸. This research area has received much attention recently^9–11.

Models that are derivable, and not merely empirically accurate, are appealing because they are arguably correct, predictive, and insightful. We attempt to obtain such models by combining a novel mathematical-optimization-based SR method with a reasoning system. This yields an end-to-end discovery system, which extracts formulas from data via SR, and then furnishes either a formal proof of derivability of the formula from a set of axioms, or a proof of inconsistency. We present novel measures that indicate how close a formula is to a derivable formula, when the model is provably non-derivable, and we calculate the values of these measures using our reasoning system. In earlier work combining machine learning with reasoning, Marra et al.¹² use a logic-based description to constrain the output of a GAN neural architecture for generating images. Scott et al.¹³ and Ashok et al.¹⁴ combine machine-learning tools and reasoning engines to search for functional forms that satisfy prespecified constraints. They augment the initial dataset with new points in order to improve the efficiency of learning methods and the accuracy of the final model. Kubalik et al.¹⁵ also exploit prior knowledge to create additional data points. However, these works only consider constraints on the functional form to be learned, and do not incorporate general background-theory axioms (logic constraints that describe the other laws and unmeasured variables that are involved in the phenomenon).

Results

Discovery as a formal mathematical problem

Our automated scientific discovery method aims to discover an unknownsymbolic model y = f *(x) (bold letters indicate vectors) wherex is the vector (x₁, …, x_n) of independent variables, andy is the dependent variable. The discovered modelf (an approximation off *) should fit a collection ofm data points, ((X¹, Y ¹), ⋯ , (X^m, Y^m)), be derivable from a background theory, have low complexity and bounded prediction error. More specifically, the inputs to our system are 4-tuples $⟨ B, C, D, M ⟩$ as follows.

Background Knowledge $B$ : a set of domain-specific axioms expressed as logic formulae. They involvex,y, and possibly more variables that are necessary to formulate the background theory. In this work we focus mainly on first-order-logic formulae with equality, inequality and basic arithmetic operators. We assume that the background theory $B$ iscomplete, that is, it contains all the axioms necessary to comprehensively explain the phenomena under consideration, andconsistent, that is, the axioms do not contradict one another. These two assumptions guarantee that there exists a unique derivable function $f_{B}$ that logically represents the variable of interesty. Note that although the derivable function is unique, there may exist different functional forms that are equivalent on the domain of interest. Considering the domain with two points {0, 1} for a variablex, the two functional formsf (x) = x andf (x) = x² both define the same function.
A Hypothesis Class $C$ : a set of admissible symbolic models defined by a grammar, a set of invariance constraints to avoid redundant expressions (e.g.,A + B is equivalent toB + A) and constraints on the functional form (e.g., monotonicity).
Data $D$ : a set ofm examples, each providing certain values forx₁, …, x_n, andy.
Modeler Preferences $M$ : a set of numerical parameters (e.g., error bounds on accuracy).

Generalized notion of distance

In general, there may not exist a function $f \in C$ that fits the data exactly and is derivable from $B$ . This could happen because the symbolic model generating the data might not belong to $C$ , the sensors used to collect the data might give noisy measurements, or the background knowledge might be inaccurate or incomplete. To quantify the compatibility of a symbolic model with data and background theory, we introduce the notion ofdistance between a modelf and $B$ . Roughly, it reflects the error between the predictions off and the predictions of a formula $f_{B}$ derivable from $B$ (thus, the distance equals zero whenf is derivable from $B$ ). Figure 1 provides a visualization of these two notions of distance for the problem of learning Kepler’s third law of planetary motion from solar-system data and background theory.

Fig. 1 — The numerical data, background theory, and a discovered model are depicted for Kepler’s third law of planetary motion giving the orbital period of a planet in the solar system. The data consists of measurements (m₁, m₂, d, p) of the mass of the sunm₁, the orbital periodp and massm₂ for each planet and its distanced from the sun. The background theory amounts to Newton’s laws of motion, i.e., the formulae for centrifugal force, gravitational force, and equilibrium conditions. The 4-tuples (m₁, m₂, d, p) are projected into (m₁ + m₂, d, p). The blue manifold represents solutions of $f_{B}$ , which is the function derivable from the background-theory axioms that represents the variable of interest. The gray manifold represents solutions of the discovered modelf. The double arrows indicate the distancesβ ( f ) andε( f ).

Integration of statistical and symbolic AI

Our system consists mainly of an SR module and a reasoning module. The SR module returns multiple candidate symbolic models (or formulae) expressingy as a function ofx₁, …, x_n and that fit the data. For each of these models, the system outputs the distanceε( f ) betweenf and $D$ and the distanceβ( f ) betweenf and $B$ . We will also be referring toε( f ) andβ( f ) as errors.

These functions are also tested to see if they satisfy the specified constraints on the functional form (in $C$ ) and the modeler-specified level of accuracy and complexity (in $M$ ). When the models are passed to the reasoning module (along with the background theory $B$ ), they are tested for derivability. If a model is found to be derivable from $B$ , it is returned as the chosen model for prediction; otherwise, if the reasoning module concludes that no candidate model is derivable, it is necessary to either collect additional data or add constraints. In this case, the reasoning module will return a quality assessment of the input set of candidate hypotheses based on the distanceβ, removing models that do not satisfy the modeler-specified bounds onβ. The distance (or error)β is computed between a function (or formula)f, derived from numerical data, and the derivable function $f_{B}$ which is implicitly defined by the set of axioms in $B$ and is logically represented by the variable of interesty. The distance between the function $f_{B}$ and any other formulaf depends only on the background theory and the formulaf and not on any particular functional form of $f_{B}$ . Moreover, the reasoning module can prove that a model is not derivable by returning counterexample points that satisfy $B$ but do not fit the model.

Interplay between data and theory in AI-Descartes

SR is typically solved with genetic programming (GP)^1–3,16, however methods based on mixed-integer nonlinear programming (MINLP) have recently been proposed^17–19. In this work, we develop a new MINLP-based SR solver (described in the Supplementary Information). The input consists of a subset of the operators { $+, -, \times, \div, \sqrt{}, \log, \exp$ }, an upper bound on expression complexity, and an upper bound on the number of constants used that do not equal 1. Given a dataset, the system formulates multiple MINLP instances to find an expression that minimizes the least-square error. Each instance is solved approximately, subject to a time limit. Both linear and nonlinear constraints can be imposed. In particular, dimensional consistency is imposed when physical dimensions of variables are available.

We use KeYmaera X²⁰ as a reasoning tool; it is an ATP for hybrid systems and combines different types of reasoning: deductive, real-algebraic, and computer-algebraic reasoning. We also use Mathematica²¹ for certain types of analysis of symbolic expressions. While a formula found by any grammar-based system (such as an SR system) is syntactically correct, it may contradict the axioms of the theory or not be derivable from them. In some cases, a formula may not be derivable as the theory may not have enough axioms; the formula may be provable under an extended axiom set or an alternative one (e.g., using a relativistic set of axioms rather than a “Newtonian” one).

An overview of our system seen as a discovery cycle is shown in Fig. 2. Our discovery cycle is inspired by Descartes who advanced the scientific method and emphasized the role that logical deduction, and not empirical evidence alone, plays in forming and validating scientific discoveries. Our present approach differs from implementations of the scientific method that obtain hypotheses from theory and then check them against data; instead we obtain hypotheses from data and assess them against theory. A more detailed schematic of the system is depicted in Fig. 3, where the colored components correspond to the system we present in this work, and the gray components refer to standard techniques for scientific discovery that we have not yet integrated into our current implementation.

Fig. 3 — Colored components correspond to our system, and gray components indicate standard techniques for scientific discovery (human-driven or artificial) that have not been integrated into the current system. The colors match the respective components of the discovery cycle of Fig. 2. The present system generates hypotheses from data using symbolic regression, which are posed as conjectures to an automated deductive reasoning system, which proves or disproves them based on background theory or provides reasoning-based quality measures.

Experimental validation

We tested the different capabilities of our system on three problems (more details in the Methods section). First, we considered the problem of deriving Kepler’s third law of planetary motion, providing reasoning-based measures to analyze the quality and generalizablity of the generated formulae. Extracting this law from experimental data is challenging, especially when the masses involved are of very different magnitudes. This is the case for the solar system, where the solar mass is much larger than the planetary masses. The reasoning module helps in choosing between different candidate formulae and identifying the one that generalizes well: using our data and theory integration we were able to re-discover Kepler’s third law. We then considered Einstein’s time-dilation formula. Although we did not recover this formula from data, we used the reasoning module to identify the formula that generalizes best. Moreover, analyzing the reasoning errors with two different sets of axioms (one with “Newtonian” assumptions and one relativistic), we were able to identify the theory that better explains the phenomenon. Finally, we considered Langmuir’s adsorption equation, whose background theory contains material-dependent coefficients. By relating these coefficients to the ones in the SR-generated models via existential quantification, we were able to logically prove one of the extracted formulae.

Discussion

We have demonstrated the value of combining logical reasoning with symbolic regression in obtaining meaningful symbolic models of physical phenomena, in the sense that they are consistent with background theory and generalize well in a domain that is significantly larger than the experimental data. The synthesis of regression and reasoning yields better models than can be obtained by SR or logical reasoning alone.

Improvements or replacements of individual system components and introduction of new modules such as abductive reasoning or experimental design²² (not described in this work for the sake of brevity) would extend the capabilities of the overall system. A deeper integration of reasoning and regression can help synthesize models that are both data driven and based on first principles, and lead to a revolution in the scientific discovery process. The discovery of models that are consistent with prior knowledge will accelerate scientific discovery, and enable going beyond existing discovery paradigms.

Methods

We next describe in detail the methodologies used to address the three problems studied to validate our method: Kepler’s third law of planetary motion, relativistic time dilation, and Langmuir’s adsorption equation.

Kepler’s third law of planetary motion

Kepler’s law relates the distanced between two bodies (e.g., the sun and a planet in the solar system) and their orbital periods. It can be expressed as

p = \sqrt{\frac{4 π^{2} d^{3}}{G (m_{1} + m_{2})}},

wherep is the period,G is the gravitational constant, andm₁ andm₂ are the two masses. It can be derived using the following axioms of the background theory $B$ , describing the center of mass (axiom K1), the distance between bodies (axiom K2), the gravitational force (axiom K3), the centrifugal force (axiom K4), the force balance (axiom K5), and the period (axiom K6):

\begin{matrix} K 1 . Center of mass & m_{1} d_{1} = m_{2} d_{2} \\ K 2 . Distance between bodies & d = d_{1} + d_{2} \\ K 3 . Gravitational force & F_{g} = \frac{G m_{1} m_{2}}{d^{2}} \\ K 4 . Centrifugal force & F_{c} = m_{2} d_{2} w^{2} \\ K 5 . Force balance & F_{g} = F_{c} \\ K 6 . Period definition & p = \frac{2 π}{w} \\ K 7 . Positivity constraints & m_{1} > 0, m_{2} > 0, p > 0, d_{1} > 0, d_{2} > 0 . \end{matrix}

We consider three real-world datasets: planets of the solar system (from the NASA Planetary Fact Sheet²³), the solar-system planets along with exoplanets from Trappist-1 and the GJ 667 system (from the NASA exoplanet archive²⁴), and binary stars²⁵. These datasets contain measurements of pairs of masses (a sun and a planet for the first two, and two suns for the third), the distance between them, and the orbital period of the planet around its sun in the first two datasets or the orbital period around the common center of mass in the third dataset. The data we use is given in the Supplementary Information. Note that the dataset does not contain measurements for a number of variables in the axiom system, such asd₁, d₂, F_g, etc.

The goal is to recover Kepler’s third law (Eq. (1)) from the data, that is, to obtainp as the above-stated function ofd,m₁ andm₂.

The SR module takes as input the set of operators { $+, -, \times, \div,$ √ } and outputs a set of candidate formulas. None of the formulae obtained via SR are derivable, though some are close approximations to derivable formulae. We evaluate the quality of these formulae by writing a logic program for calculating the errorβ of a formula with respect to a derivable formula. We use three measures, defined below, to assess the correctness of a data-driven formula from a reasoning viewpoint: thepointwise reasoning error, thegeneralization reasoning error, andvariable dependence.

Pointwise reasoning error

The key idea is to compute a distance between a formula generated from the numerical data and some derivable formula that is implicitly defined by the axiom set. The distance is measured by the $l_{2}$ or $l_{\infty}$ norm applied to the differences between the values of the numerically-derived formula and a derivable formula at the points in the dataset. This definition can be extended to other norms.

We compute the relative error of numerically derived formulaf (x) applied to them data pointsXⁱ (i = 1, …, m) with respect to $f_{B} (x)$ , derivable from the axioms via the following expressions:

β_{2}^{r} = \sqrt{\sum_{i = 1}^{m} {(\frac{f (X^{i}) - f_{B} (X^{i})}{f_{B} (X^{i})})}^{2}} and β_{\infty}^{r} = \max_{1 \leq i \leq m} \{\frac{∣ f (X^{i}) - f_{B} (X^{i}) ∣}{∣ f_{B} (X^{i}) ∣}\}

where $f_{B} (X^{i})$ denotes a derivable formula for the variable of interesty evaluated at the data pointXⁱ.

The KeYmaera formulation of these two measures for the first formula of Table 1 can be found in the Supplementary Information. Absolute-error variants of the first and second expressions in Eq. (3) are denoted by $β_{2}^{a}, β_{\infty}^{a}$ , respectively. The numerical (data) error measures $ε_{2}^{r}$ and $ε_{\infty}^{r}$ are defined by replacing $f_{B} (X^{i})$ byY_i in Eq. (3). Analogous to $β_{2}^{a}$ and $β_{\infty}^{a}$ , we also define absolute-numerical-error measures $ε_{2}^{a}$ and $ε_{\infty}^{a}$ .

Table 1.

Error values of candidate solutions for the Kepler dataset

1	2	3	4	5	6	7	8	9	10
Dataset	Candidate formula	numerical error		point. reas. err.		gen. reas. error	dependencies
	p =	$ε_{2}^{r}$	$ε_{\infty}^{r}$	$β_{2}^{r}$	$β_{\infty}^{r}$	$β_{\infty, S}^{r}$	m₁	m₂	d
Solar	$\sqrt{0.1319 d^{3}}$	0.0129	0.0064	0.0146	0.0052	0.0052	0	0	1
	$\sqrt{0.1316 (d^{3} + d)}$	1.9348	1.7498	1.9385	1.7533	1.7559	0	0	0
	(0.03765d³ + d²)/(2 + d)	0.3102	0.2766	0.3095	0.2758	0.2758	0	0	0
Exoplanet	$\sqrt{0.1319 d^{3} / m_{1}}$	0.0845	0.0819	0.0231	0.0052	0.0052	0	0	1
	$\sqrt{m_{1}^{2} m_{2}^{3} / d + 0.1319 d^{3} / m_{1}}$	0.1988	0.1636	0.1320	0.1097	>550	0	0	0
	$\sqrt{(1 - 0.7362 m_{1}) d^{3} / 2}$	1.2246	0.4697	1.2418	0.4686	0.4686	0	0	1
Binary stars	$1 / (d^{2} m_{1}^{2}) + 1 / (d m_{2}^{2}) - m_{1}^{3} m_{2}^{2} + \sqrt{0.4787 d^{3} / m_{2} + d^{2} m_{2}^{2}}$	0.0023	0.0015	0.0059	0.0050	Timeout	0	0	0
	$(\sqrt{d^{3}} + m_{1}^{3} m_{2} / \sqrt{d}) / \sqrt{m_{1} + m_{2}}$	0.0032	0.0031	0.0038	0.0031	Timeout	0	0	0
	$\sqrt{d^{3} / (0.9967 m_{1} + m_{2})}$	0.0058	0.0053	0.0014	0.0008	0.0020	1	1	1

Open in a new tab

Numerical error values, pointwise reasoning error values, and generalization error values are shown. We also give an analysis of the variable dependence of candidate solutions. For simplicity of notation, in the table we use the variablesd,m₁,m₂ andp, while referring to the scaled counterparts. We assume that all the errors are relative.

Table 1 reports in columns 5 and 6 the values of $β_{2}^{r}$ and $β_{\infty}^{r}$ , respectively. It also reports the relative numerical errors $ε_{2}^{r}$ and $ε_{\infty}^{r}$ in columns 3 and 4, measured by the $l_{2}$ and $l_{\infty}$ norms, respectively, for the candidate expressions given in column 2 when evaluated on the points in the dataset. We minimize the absolute $l_{2}$ error $ε_{2}^{a}$ (and not the relative error $ε_{2}^{r}$ ), when obtaining candidate expressions via symbolic regression.

The pointwise reasoning errorsβ₂ and $β_{\infty}$ are not very informative if SR yields a low-error candidate expression (measured with respect to the data), and the data itself satisfies the background theory up to a small error, which indeed is the case with the data we use; the reasoning errors and numerical errors are very similar.

Generalization reasoning error

Even when one can find a function that fits given data points well, it is challenging to obtain a function that generalizes well, that is, one which yields good results at points of the domain not equal to the data points. Let $β_{\infty, S}^{r}$ be calculated for a candidate formulaf (x) over a domainS that is not equal to the original set of data points as follows:

β_{\infty, S}^{r} = \max_{x \in S} \{\frac{∣ f (x) - f_{B} (x) ∣}{∣ f_{B} (x) ∣}\},

where we consider the relative error and, as before, the function $f_{B} (x)$ is not known, but is implicitly defined by the axioms in the background theory. We call this measure therelative generalization reasoning error. If we do not divide by $f_{B} (x)$ in the above expression, we get the correspondingabsolute version $β_{\infty, S}^{a}$ . For the Kepler dataset, we letS be the smallest multi-dimensional interval (or Cartesian product of intervals on the real line) containing all data points. In column 7 of Table 1, we show the relative generalization reasoning error $β_{\infty, S}^{r}$ on the Kepler datasets withS defined as above. If this error is roughly the same as $β_{\infty}^{r}$ the pointwise relative reasoning error for $l_{\infty}$ (e.g., for the solar system dataset) then the formula extracted from the numerical data is as accurate at points inS as it is at the original data points.

Variable dependence

In order to check if the functional dependence of a candidate formula on a specific variable is accurate, we compute the generalization error over a domain $S^{'}$ where the domain of this variable is extended by an order of magnitude beyond the smallest interval containing the values of the variable in the dataset. Thus we can check whether there exist special conditions under which the formula does not hold. We modify the endpoints of an interval by one order of magnitude, one variable at a time. If we notice an increase in the generalization reasoning error while modifying intervals for one variable, we deem the candidate formula as missing a dependency on that variable. A missing dependency might occur, for example, because the exponent for a variable is incorrect, or that variable is not considered at all when it should be. One can get further insight into the type of dependency by analyzing how the error varies (e.g., linearly or exponentially). Table 1 provides, in columns 8–10, results regarding the candidate formulae for Kepler’s third law. For each formula, the dependencies onm₁,m₂, andd are indicated by 1 or 0 (for correct or incorrect dependency). For example, the candidate formula $p = \sqrt{0.1319 d^{3}}$ for the solar system does not depend on either mass, and the dependency analysis suggests that the formula approximates well the phenomenon in the solar system, but not for larger masses.

The best formula for the binary-star dataset, $\sqrt{d^{3} / (0.9967 m_{1} + m_{2})}$ , has no missing dependency (all ones in columns 8–10), that is, it generalizes well; increasing the domain along any variable does not increase the generalized reasoning error.

Figure 4 provides a visualization of the two errors $ε_{2}^{r}$ and $β_{2}^{r}$ for the first three functions of Table 1 (solar-system dataset) and the ground truth $f^{*}$ .

Fig. 4 — The models produced by our SR system are represented by points (ε, β), whereε represents distance to data, andβ represents distance to background theory. Both distances are computed with an appropriate norm on the scaled data.

Relativistic time dilation

Einstein’s theory of relativity postulates that the speed of light is constant, and implies that two observers in relative motion to each other will experience time differently and observe different clock frequencies. The frequency $f$ for a clock moving at speedv is related to the frequency $f_{0}$ of a stationary clock by the formula

\frac{f - f_{0}}{f_{0}} = \sqrt{1 - \frac{v^{2}}{c^{2}}} - 1,

wherec is the speed of light. This formula was recently confirmed experimentally by Chou et al.²⁶ using high precision atomic clocks. We test our system on the experimental data reported by Chou et al.²⁶ which consists of measurements ofv and associated values of $(f - f_{0}) / f_{0}$ , reproduced in the Supplementary Information. We take the axioms for derivation of the time dilation formula from the work of Behroozi²⁷ and Smith²⁸. These are also listed in the Supplementary Information and involve variables that are not present in the experimental data.

In Table 2 we give some functions obtained by our SR module (using { $+, -, \times, \div,$ √ } as the set of input operators) along with the numerical errors of the associated functions and generalization reasoning errors. The sixth column gives the setS as an interval forv for which our reasoning module can verify that the absolute generalization reasoning error of the function in the first column is at most 1. The last column gives the interval forv for which we can verify a relative generalization reasoning error of at most 2%. Even though the last function has low relative error according to this metric, it can be ruled out as a reasonable candidate if one assumes the target function should be continuous (it has a singularity atv = 1). Thus, even though we cannot obtain the original function, we obtain another which generalizes well, as it yields excellent predictions for a very large range of velocities.

Table 2.

Candidate functions derived from time dilation data, and associated error values

Candidate	Numerical Error Absolute		Numerical Error Relative		S s.t. Absolute	S s.t. Relative
formula	Numerical Error Absolute		Numerical Error Relative		Gen. Reas. Error	Gen. Reas. Error
$f$ =	$ε_{2}^{a}$	$ε_{\infty}^{a}$	$ε_{2}^{r}$	$ε_{\infty}^{r}$	$β_{\infty, S}^{a} \leq 1$	$β_{\infty, S}^{r} \leq . 02$
−0.00563v²	0.3822	0.3067	1.0811	0.0018	37 ≤ v ≤ 115	37 ≤ v ≤ 10⁸
$\frac{v}{1 + 0.00689 v} - v$	0.3152	0.2097	1.0125	0.0069	37 ≤ v ≤ 49	37 ≤ v ≤ 38
$- 0.00537 \frac{v^{2} \sqrt{v + v^{2}}}{(v - 1)}$	0.3027	0.2299	1.2544	0.0021	37 ≤ v ≤ 98	37 ≤ v ≤ 109
$- 0.00545 \frac{v^{4}}{\sqrt{v^{2} + v^{- 2}} (v - 1)}$	0.3238	0.2531	1.1308	0.0010	37 ≤ v ≤ 126	37 ≤ v ≤ 10⁷

Open in a new tab

The values ofv are defined inm/s.

In this case, our system can also help rule out alternative axioms. Consider replacing the axiom that the speed of light is a constant valuec by a “Newtonian” assumption that light behaves like other mechanical objects: if emitted from an object with velocityv in a direction perpendicular to the direction of motion of the object, it has velocity $\sqrt{v^{2} + c^{2}}$ . Replacingc by $\sqrt{v^{2} + c^{2}}$ (in axiom R2 in the Supplementary Information to obtain R2’) produces a self-consistent axiom system (as confirmed by the theorem prover), albeit one leading to no time dilation. Our reasoning module concludes that none of the functions in Table 2 is compatible with this updated axiom system: the absolute generalization reasoning error is greater than 1 even on the dataset domain, as well as the pointwise reasoning error. Consequently, the data is used indirectly to discriminate between axiom systems relevant for the phenomenon under study; SR poses only accurate formulae as conjectures.

Langmuir’s adsorption equation

The Langmuir adsorption equation (Nobel Prize in Chemistry, 1932)²⁹ describes a chemical process in which gas molecules contact a surface, and relates the loadingq on the surface to the pressurep of the gas:

q = \frac{q_{\max} K_{a} p}{1 + K_{a} p} .

The constants $q_{\max}$ andK_a characterize the maximum loading and the adsorption strength, respectively. A similar model for a material with two types of adsorption sites yields:

q = \frac{q_{\max, 1} K_{a, 1} p}{1 + K_{a, 1} p} + \frac{q_{\max, 2} K_{a, 2} p}{1 + K_{a, 2} p},

with parameters for maximum loading and adsorption strength on each type of site. The parameters in Eqs. (6) and (7) fit experimental data using linear or nonlinear regression, and depend on the material, gas, and temperature.

We used data from Langmuir’s 1918 publication²⁹ for methane adsorption on mica at a temperature of 90 K, and also data from the work of Sun et al.³⁰ (Table 1) for isobutane adsorption on silicalite at a temperature of 277 K. In both cases, observed values ofq are given for specific values ofp; the goal is to expressq as a function ofp. We give the SR module the operators { + , − , × , ÷}, and obtain the best fitting functions with two and four constants. The code ran for 20 minutes on 45 cores, and seven of these functions are displayed for each dataset.

To encode the background theory, following Langmuir’s original theory²⁹, we elicited the following set $A$ of axioms:

\begin{matrix} L 1 . Site balance & S_{0} = S + S_{a} \\ L 2 . Adsorption rate model & r_{ads} = k_{ads} p S \\ L 3 . Desorption rate model & r_{des} = k_{des} S_{a} \\ L 4 . Equilibrium assumption & r_{ads} = r_{des} \\ L 5 . Mass balance on q & q = S_{a} . \end{matrix}

Here,S₀ is the total number of sites, of whichS are unoccupied andS_a are occupied (L1). The adsorption rater_ads is proportional to the pressurep and the number of unoccupied sites (L2). The desorption rater_des is proportional to the number of occupied sites (L3). At equilibrium,r_ads = r_des (L4), and the total amount adsorbed,q, is the number of occupied sites (L5) because the model assumes each site adsorbs at most one molecule. Langmuir solved these equations to obtain

q = \frac{S_{0} (k_{ads} / k_{des}) p}{1 + (k_{ads} / k_{des}) p},

which corresponds to Eq. (6), where $q_{\max} = S_{0}$ andK_a = k_ads/k_des. An axiomatic formulation for the multi-site Langmuir expression is described in the Supplementary Information. Additionally, constants and variables are constrained to be positive (e.g.,S₀ > 0,S > 0, andS_a > 0) or non-negative (e.g.,q ≥ 0).

The logic formulation to prove is:

(C \land A) \to f,

where $C$ is the conjunction of the non-negativity constraints, $A$ is a conjunction of the axioms, the union of $C$ and $A$ constitutes the background theory $B$ , andf is the formula we wish to prove.

SR can only generate numerical expressions involving the (dependent and independent) variables occurring in the input data, with certain values for constants; for example, the expressionf = p/(0.709p + 0.157). The expressions built from variables and constants from the background theory, such as Eq. (9), involve the constants (in their symbolic form) explicitly: for example,k_ads andk_des appear explicitly in Eq. (9) while SR only generates a numerical instance of the ratio of these constants. Thus, we cannot use Formula (10) directly to prove formulae generated from SR. Instead, we replace each numerical constant of the formula by a logic variablec_i : for example, the formulaf = p/(0.709p + 0.157) is replaced by $f^{'} = p / (c_{1} p + c_{2})$ , introducing two new variablesc₁ andc₂. We then quantify the new variables existentially, and define a new set of non-negativity constraints $C^{'}$ . In the example above we will have $C^{'} = c_{1} > 0 \land c_{2} > 0$ .

The final formulation to prove is:

\exists c_{1} \dots \exists c_{n} (C \land A) \to (f^{'} \land C^{'}) .

For example, $f^{'} = p / (c_{1} p + c_{2})$ is proved true if the reasoner can prove that there exist values ofc₁ andc₂ such that $f^{'}$ satisfies the background theory $A$ and the constraints $C$ . Herec₁ andc₂ can be functions of constantsk_ads,k_des,S₀, and/or real numbers, but not the variablesq andp.

We also consider background knowledge in the form of a list of desired properties of the relation betweenp andq, which helps trim the set of candidate formulae. Thus, we define a collection $K$ of constraints onf, whereq = f ( p), enforcing monotonicity or certain types of limiting behavior (see Supplementary Information). We use Mathematica²¹ to verify that a candidate function satisfies the constraints in $K$ .

In Table 3, column 1 gives the data source, and column 2 gives the “hyperparameters” used in our SR experiments: we allow either two or four constants in the derived expressions. Furthermore, as the first constraint C1 from $K$ can be modeled by simply adding the data pointp = q = 0, we also experiment with an “extra point”.

Table 3.

Results on two datasets for the Langmuir problem

1	2	3	4	5	6	7
Data	Condition	Candidate formula	Numerical Error		KeYmaera	$K$
		q =	$ε_{2}^{r}$	$ε_{\infty}^{r}$	provability	constr.
Langmuir²⁹ (Table IX)	2 const.	f₁: (p² + 2p − 1)/(0.00888p² + 0.118p)	0.0631	0.0486	Timeout	2/5
	2 const.	f₂: p/(0.00927p + 0.0759) *	0.1799	0.1258	Yes	5/5
	4 const.	f₃: (p² − 10.5p − 15)/(0.00892p² − 1.23)	0.0443	0.0295	Timeout	2/5
		f₄: (8.86p + 13.9)/(0.0787p + 1)	0.0658	0.0465	No	4/5
		f₅: p²/(0.00895p² + 0.0934p − 0.0860)	0.0759	0.0496	No	2/5
	4 const. extra-point	f₆: (p² + p)/(0.00890p² + 0.106p −0.0311)	0.0683	0.0470	Timeout	2/5
	4 const. extra-point	f₇: (112p² − p)/(p² + 10.4p − 9.66)	0.0771	0.0532	Timeout	3/5
Sun et al.³⁰ (Table 1)	2 const.	g₁: (p + 3)/(0.584p + 4.01)	0.1625	0.1007	No	4/5
	2 const.	g₂: p/(0.709p + 0.157)	0.9680	0.5120	Yes	5/5
	4 const.	g₃: (0.0298p² + 1)/(0.0185p² + 1.16) − 0.000905/p²	0.1053	0.0538	Timeout	2/5
	4 const.	g₄: 1/(p² + 1) + (2.53p − 1)/(1.54p + 2.77)	0.1300	0.0725	Timeout	3/5
	4 const. extra-point	g₅: (1.74p² + 7.61p)/(p² + 9.29p + 0.129)	0.1119	0.0996	Timeout	5/5
		g₆: (0.226p² + 0.762p − 7.62 ⋅ 10⁻⁴)/(0.131p² + p)	0.1540	0.0935	Timeout	2/5
		g₇: (4.78p² + 26.6p)/(2.71p² + 30.4p + 1)	0.1239	0.1364	Timeout	5/5

Open in a new tab

(* This expression is also generated when using four constants, and also when the extra point (0, 0) is added).

Column 3 displays a derived expression, while columns 4 and 5 give, respectively, the relative numerical errors $ε_{2}^{r}$ and $ε_{\infty}^{r}$ . If the expression can be derived from our background theory, then we indicate that in column 6. These results are visualized in Fig. 5. Column 7 indicates the number of constraints from $K$ that each expression satisfies, verified by Mathematica. Among the top two-constant expressions,f₁ fits the data better thanf₂, which is derivable from the background theory, whereasf₁ is not.

Fig. 5a refers to the methane adsorption on mica at a temperature of 90 K, while Fig. 5b refers to the isobutane adsorption on silicalite at a temperature of 277 K.f₂ andg₂ are equivalent to the single-site Langmuir equation;g₅ andg₇ are equivalent to the two-site Langmuir equation.

When we search for four-constant expressions²⁹, we get much smaller errors than Eq. (6) or even Eq. (7), but we do not obtain the two-site formula (Eq. (7)) as a candidate expression. For the dataset from Sun et al.³⁰,g₂ has a form equivalent to Langmuir’s one-site formula, andg₅ andg₇ have forms equivalent to Langmuir’s two-site formula, with appropriate values of $q_{\max, i}$ andK_a,i fori = 1, 2.

System limitations and future improvements

Our results on three problems and associated data are encouraging and provide the foundations of a new approach to automated scientific discovery. However our work is only a first, although crucial, step towards completing the missing links in automating the scientific method.

One limitation of the reasoning component is the assumption of correctness and completeness of the background theory. The incompleteness could be partially solved by the introduction of abductive reasoning³¹ (as depicted in Fig. 3). Abduction is a logic technique that aims to find explanations of an (or a set of) observation, given a logical theory. The explanation axioms are produced in a way that satisfy the following: (1) the explanation axioms are consistent with the original logical theory and (2) the observation can be deduced by the new enhanced theory (the original logical theory combined with the explanation axioms). In our context the logical theory corresponds to the set of background knowledge axioms that describe a scientific phenomenon, the observation is one of the formulas extracted from the numerical data and the explanations are the missing axioms in the incomplete background theory.

However the availability of background theory axioms in machine readable format for physics and other natural sciences is currently limited. Acquiring axioms could potentially be automated (or partially automated) using knowledge extraction techniques. Extraction from technical books or articles that describe a natural science phenomenon can be done by, for example, deep learning methods (e.g. the work of Pfahler and Morik³², Alexeeva et al.³³, or Wang and Liu³⁴) both from NL plain text or semi-structured text such as LateX or HTML. Despite the recent advancements in this research field, the quality of the existing tools remains quite inadequate with respect to the scope of our system.

Another limitation of our system, that heavily depends on the tools used, is the scaling behavior. Excessive computational complexity is a major challenge for automated theorem provers (ATPs): for certain types of logic (including the one that we use), proving a conjecture is undecidable. Deriving models from a logical theory using formal reasoning tools is even more difficult when using complex arithmetic and calculus operators. Moreover, the run-time variance of a theorem prover is very large: the system can at times solve some “large” problems while having difficulties with some “smaller” problems. Recent developments in the neuro-symbolic area use deep-learning techniques to enhance standard theorem provers (e.g., see Crouse et al.⁸). We are still at the early stages of this research and there is still a lot that can be done. We envision that the performance and capability (in terms of speed and expressivity) of theorem provers will improve with time. Symbolic regression tools, including the one based on solving mixed-integer nonlinear programs (MINLP) that we developed, often take an excessive amount of time to explore the space of possible symbolic expressions and find one that has low error and expression complexity, especially with noisy data. In practice, the worst-case solution time for MINLP solvers (including BARON) grows exponentially with input data encoding size (additional details in the Supplementary Information). However, MINLP solver performance and genetic programming based symbolic regression solvers are active areas of research.

Our proposed system could benefit from other improvements in individual components (especially in the functionality available). For example, Keymaera only supports differential equations in time and not in other variables and does not support higher order logic; BARON cannot handle differential equations.

Beyond improving individual components, our system can be improved by introducing techniques such as experimental design (not described in this work but envisioned in Fig. 3). A fundamental question in the holistic view of the discovery process is what data should be collected to give us maximum information regarding the underlying model. The goal of optimal experimental design (OED) is to find an optimal sequence of data acquisition steps such that the uncertainty associated with the inferred parameters, or some predicted quantity derived from them, is minimized with respect to a statistical or information theoretic criterion. In many realistic settings, experimentation may be restricted or costly, providing limited support for any given hypothesis as to the underlying functional form. It is therefore critical at times to incorporate an effective OED framework. In the context of model discovery, a large body of work addresses the question of experimental design for predetermined functional forms, and another body of research addresses the selection of a model (functional form) out of a set of candidates. A framework that can deal with both the functional form and the continuous set of parameters that define the model behavior is obviously desirable²²; one that consistently accounts for logical derivability or knowledge-oriented considerations³⁵ would be even better.

Supplementary information

Supplementary Information^{(430.4KB, pdf)}

Peer Review File^{(4.2MB, pdf)}

Acknowledgements

We thank J. Ilja Siepmann for initially suggesting adsorption as a problem for symbolic regression. We thank James Chin-wen Chou for providing the atomic clock data. Funding: This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) (PA-18-02-02). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. T.R.J. was supported by the U.S. Department of Energy (DOE), Office of Basic Energy Sciences, Division of Chemical Sciences, Geosciences and Biosciences (DE-FG02-17ER16362), as well as startup funding from the University of Maryland, Baltimore County. T.R.J. also gratefully acknowledges the University of Minnesota Institute for Mathematics and its Applications (IMA).

Author contributions

C.C. conceptualized the overarching derivable symbolic discovery architecture, designed the project, designed and implemented the reasoning module, designed and discussed the experiments, performed the experiments for the reasoning module and for the comparison with the state of the art, analyzed and formatted the data, formalized the scientific theories, formatted code and data for the release, wrote and edited the manuscript, and designed the figures. S.D. conceptualized the overarching derivable symbolic discovery architecture, designed the project, designed the SR module architecture, designed and discussed the experiments, performed the experiments for the SR module, analyzed and formatted the data, and wrote and edited the manuscript. V.A. designed and implemented the SR module. T.R.J. identified target problems and experimental datasets, formalized the scientific theories, discussed the experiments, designed the figures, and wrote and edited the manuscript. J.G. prepared the data, executed the computational experiments for the SR module and for the comparison with the state of the art, formatted code and data for the release, and edited the manuscript. K.C. discussed and designed the overarching project, discussed the experiments, and edited and revised the manuscript. N.M. discussed and designed the overarching project, discussed the experiments, provided conceptual advice, and edited the manuscript. B.E.K. designed figure 1, discussed the reasoning measures, and edited the manuscript. L.H. conceptualized the overarching derivable symbolic discovery architecture, designed the project, designed the experiments, analyzed the results per validation of the framework, and wrote and edited the manuscript.

Peer review

Peer review information

Nature Communications thanks Joseph Scott, Hiroaki Kitano and the other, anonymous reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.

Data availability

The data used in this study are available in the AI-Descartes GitHub repository³⁶:https://github.com/IBM/AI-Descartes.

Code availability

The code used for this work can be found, freely available, at the AI-Descartes GitHub repository³⁶:https://github.com/IBM/AI-Descartes.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Cristina Cornelio, Email: c.cornelio@samsung.com.

Lior Horesh, Email: lhoresh@us.ibm.com.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-023-37236-y.

References

1.Koza JR. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge: MIT Press; 1992. [Google Scholar]
2.Koza JR. Genetic Programming II: Automatic Discovery of Reusable Programs. Cambridge: MIT Press; 1994. [Google Scholar]
3.Schmidt M, Lipson H. Distilling free-form natural laws from experimental data. Science. 2009;324:81–85. doi: 10.1126/science.1165893. [DOI] [PubMed] [Google Scholar]
4.Martius, G. & Lampert, C. H. Extrapolation and learning equations. InProceedings of the 29th Conference on Neural Information Processing Systems (NIPS-16) (2016).
5.Iten, R., Metger, T., Wilming, H., Rio, L. & Renner, R. Discovering physical concepts with neural networks.Physical Review Letters124, (2020). [DOI] [PubMed]
6.Udrescu, S.-M. & Tegmark, M. AI Feynman: A physics-inspired method for symbolic regression.Science Advances6.16 (2020). [DOI] [PMC free article] [PubMed]
7.Grigoryev D, Hirsch E, Pasechnik D. Complexity of semialgebraic proofs. Moscow Math. J. 2002;2:647–679. doi: 10.17323/1609-4514-2002-2-4-647-679. [DOI] [Google Scholar]
8.Crouse, M. et al. A deep reinforcement learning based approach to learning transferable proof guidance strategies. InProceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21) (2021).
9.Parrilo, P. A.Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. Ph.D. thesis, Caltech, Pasadena (2000).
10.Barak, B. & Steurer, D. Sum-of-squares proofs and the quest toward optimal algorithms.International Congress of Mathematicians (ICM), Seoul, South Korea, August 13-21, 2014.
11.Fawzi, A., Malinowski, M., Fawzi, H. & Fawzi, O. Learning dynamic polynomial proofs. InProceedings of Advances in Neural Information Processing Systems (NeurIPS)32, 41817–4190 (2019).
12.Marra, G., Giannini, F., Diligenti, M. & Gori, M. Constraint-based visual generation. In Artificial Neural Networks and Machine Learning – ICANN 2019: Image Processing: 28th International Conference on Artificial Neural Networks, Proceedings, 565–577 (2019).
13.Scott, J., Panju, M., & Ganesh, V. LGML: Logic Guided Machine Learning (Student Abstract). In Proceedings of the AAAI Conference on Artificial Intelligence,34, 13909–13910 (2020).
14.Ashok D, Scott J, Wetzel SJ, Panju M, Ganesh V. Logic guided genetic algorithms (student abstract) Proceedings of the AAAI Conference on Artificial Intelligence. 2021;35:15753–15754. doi: 10.1609/aaai.v35i18.17873. [DOI] [Google Scholar]
15.Kubalík, J., Derner, E. & Babuška, R. Symbolic regression driven by training data and prior knowledge. InProceedings of the 2020 Genetic and Evolutionary Computation Conference, 958–966 (2020).
16.Augusto, D. A. & Barbosa, H. J. Symbolic regression via genetic programming. InProceedings 6th Brazilian Symp. Neural Networks, 173–178 (IEEE, 2000).
17.Austel, V. et al. Globally optimal symbolic regression.NIPS Symposium on Interpretable Machine Learning (2017).
18.Cozad, A.Data- and theory-driven techniques for surrogate-based optimization. Ph.D. thesis, Carnegie Mellon, Pittsburgh, PA (2014).
19.Cozad A, Sahinidis NV. A global MINLP approach to symbolic regression. Math. Program. Ser. B. 2018;170:97–119. doi: 10.1007/s10107-018-1289-x. [DOI] [Google Scholar]
20.Fulton, N., Mitsch, S., Quesel, J.-D., Völp, M. & Platzer, A. KeYmaera X: An axiomatic tactical theorem prover for hybrid systems. InProceedings of the International Conference on Automated Deduction, CADE-25 (2015).
21.Wolfram Mathematica.https://www.wolfram.com. Version: 12.
22.Clarkson, K. L. et al. Bayesian experimental design for symbolic discovery.Preprint athttps://arxiv.org/abs/2211.15860 (2022).
23.NASA. Planets Factsheet.https://nssdc.gsfc.nasa.gov/planetary/factsheet/ (2017).
24.NASA. Exoplanet Archive.https://exoplanetarchive.ipac.caltech.edu/ (2017).
25.Novaković B. Orbits of five visual binary stars. Baltic Astronomy. 2007;16:435–442. [Google Scholar]
26.Chou CW, Hume DB, Rosenband T, Wineland DJ. Optical clocks and relativity. Science. 2010;329:1630–1632. doi: 10.1126/science.1192720. [DOI] [PubMed] [Google Scholar]
27.Behroozi F. A simple derivation of time dilation and length contraction in special relativity. Phys. Teach. 2014;52:410–412. doi: 10.1119/1.4895356. [DOI] [Google Scholar]
28.Smith GS. A simple electromagnetic model for the light clock of special relativity. Eur. J. Phys. 2011;32:1585–1595. doi: 10.1088/0143-0807/32/6/012. [DOI] [Google Scholar]
29.Langmuir I. The adsorption of gases on plane surfaces of glass, mica and platinum. J. Amer. Chem. Soc. 1918;40:1361–1403. doi: 10.1021/ja02242a004. [DOI] [Google Scholar]
30.Sun MS, Shah DB, Xu HH, Talu O. Adsorption equilibria of C1 to C4 alkanes, CO2, and SF6 on silicalite. J. Phys. Chem. 1998;102:1466–1473. doi: 10.1021/jp9730196. [DOI] [Google Scholar]
31.Douven, I. Abduction. In Zalta, E. N. (ed.)The Stanford Encyclopedia of Philosophy (Metaphysics Research Lab, Stanford University, 2021), Summer 2021 edn.
32.Pfahler, L. & Morik, K. Semantic search in millions of equations. InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD-20, 135–143 (Association for Computing Machinery, 2020).
33.Alexeeva, M. et al. MathAlign: Linking formula identifiers to their contextual natural language descriptions. InProceedings of the 12th Language Resources and Evaluation Conference, 2204–2212 (European Language Resources Association, 2020).
34.Wang Z, Liu J-CS. Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training. Int. J. Document Anal. Recognit. 2021;24:63–75. doi: 10.1007/s10032-020-00360-2. [DOI] [Google Scholar]
35.Haber E, Horesh L, Tenorio L. Numerical methods for experimental design of large-scale linear ill-posed inverse problems. Inverse Problems. 2008;24:055012. doi: 10.1088/0266-5611/24/5/055012. [DOI] [Google Scholar]
36.Cornelio, C. et al.[AI-Descartes GitHub repository] Combining data and theory for derivable scientific discovery with AI-Descartes.https://github.com/IBM/AI-Descartes (2023). [DOI] [PMC free article] [PubMed]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Information^{(430.4KB, pdf)}

Peer Review File^{(4.2MB, pdf)}

Data Availability Statement

The data used in this study are available in the AI-Descartes GitHub repository³⁶:https://github.com/IBM/AI-Descartes.

The code used for this work can be found, freely available, at the AI-Descartes GitHub repository³⁶:https://github.com/IBM/AI-Descartes.

Movatterモバイル変換

PERMALINK

Combining data and theory for derivable scientific discovery with AI-Descartes

Cristina Cornelio

Sanjeeb Dash

Vernon Austel

Tyler R Josephson

Joao Goncalves

Kenneth L Clarkson

Nimrod Megiddo

Bachir El Khadir

Lior Horesh

Abstract

Introduction

Results

Discovery as a formal mathematical problem

Generalized notion of distance

Fig. 1. Visualization of relevant sets and their distances.

Integration of statistical and symbolic AI

Interplay between data and theory in AI-Descartes

Fig. 2. An interpretation of the scientific method as implemented by our system.

Fig. 3. System overview.

Experimental validation

Discussion

Methods

Kepler’s third law of planetary motion

Pointwise reasoning error

Table 1.

Generalization reasoning error

Variable dependence

Fig. 4. Depiction of symbolic models for Kepler’s third law of planetary motion giving the orbital period of a planet in the solar system.

Relativistic time dilation

Table 2.

Langmuir’s adsorption equation

Table 3.

Fig. 5. Symbolic regression solutions to two adsorption datasets.

System limitations and future improvements

Supplementary information

Acknowledgements

Author contributions

Peer review

Peer review information

Data availability

Code availability

Competing interests

Footnotes

Contributor Information

Supplementary information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases