SIAM News Blog
Insights and Commentary

Can Scientific Discovery Be Automated?

On the Prospects of Equation Discovery and Symbolic Regression 

Figure 1. A section of “Man, Controller of the Universe” by Diego Rivera. Powered by the scientific conquest of the celestial and micro-biological scales, a worker steers the fate of humanity with an enormous machine. Figure courtesy of AAP86/Wikimedia Commons and shared via the Creative Commons Attribution-ShareAlike 3.0 Unported license.

In his 1933 lecture “On the Method of Theoretical Physics,” Albert Einstein offered insight into empiricism and “pure thought” in apprehending nature through physical laws. He began his discussion with a warning: “if you wish to learn from the theoretical physicist anything about the methods which he uses, I would give you the following piece of advice: don’t listen to his words, examine his achievements” [4]. He emphasized that experience “cannot possibly be the source” of fundamental concepts. Einstein did not rely on existing measurements of the velocities of Brownian particles in devising his statistical theory of this phenomenon, and he used elaborate thought experiments involving chasing beams of light to develop the special theory of relativity [16]. Far from being derived from the data, general relativity received its first experimental confirmation with Eddington’s 1919 measurement of light deflecting around the sun, four years after the theory’s introduction in 1915. As Einstein’s epochal scientific creativity shows, scientific discovery is a nebulous affair. On this topic, Karl Popper insists that devising a theory “seems to me neither to call for logical analysis nor to be susceptible of it,” later expressing that how scientists generate a theory “is irrelevant to the logical analysis of scientific knowledge” [11].

In recent years, a new voice has emerged in discussions of scientific discovery. Out of the burgeoning field of scientific machine learning, researchers who specialize in “equation discovery” and “symbolic regression” suggest an ambitious new scientific epistemology: in the age of big data, the painstaking process of manually fitting algebraic and differential equations to measurements (a process which has supposedly characterized the tedious science of yesterday) can be offloaded to machines. This is not Chris Anderson’s “End of Theory” [1], which, in its reliance on black-box models based on deep neural networks, spells the end of scientific understanding as we know it. Rather, it presents an engine of “interpretable” scientific discovery, which charts a course between Andersonian excesses and the darkness of science’s past. For example, astrophysicist and symbolic regression researcher Miles Cranmer claims that in our era of artificial intelligence (AI), the “next great scientific theory is hiding inside a neural network,” meaning that he expects symbolic regression strategies to discover new laws of nature [3]. Similarly, in their extension of a foundational equation discovery framework to partial differential equations (PDEs), Samuel H. Rudy and his collaborators argue that their method will address the “many complex systems that have eluded quantitative analytic descriptions or even characterization of a suitable choice of variables (e.g., neuroscience, power grid, epidemiology, finance, ecology, etc.)” [13]. Now, having developed ingenious techniques to find the parsimonious ordinary differential equations (ODEs) and PDEs to describe diverse physical systems, is an age of scientific abundance upon us? After all, these methods have been used to rediscover fundamental laws of nature [10]. This is the kind of optimism one encounters in the literature, though the achievement of new laws of nature has yet to surface. What can we realistically expect to discover with these methods?

For our purposes, the details of how these methods work are not important; put simply, their ability to revolutionize scientific discovery relies on the claim that the crux of discovery is finding parsimonious equations to describe existing data. This explains the literature’s frequent reference to Johannes Kepler, who fashioned his three laws of planetary motion from scrutinizing astronomical data, as a prototype of discovery. If this claim is accurate, then equation discovery and symbolic regression truly do stand poised to revolutionize our understanding of nature, full stop. If not, we require a more nuanced understanding of the kinds of problems for which equation discovery and symbolic regression are well-equipped.  
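As a rough illustration of what "finding a parsimonious equation to describe existing data" can look like in the Kepler case, the minimal Python sketch below fits a power law T = c * a^p to planetary orbits by least squares in log-log space. It is not the method of any cited paper. The orbital values are rounded textbook figures, and the names PLANETS and fit_power_law are hypothetical, introduced only for this example.

```python
import math

# Rounded textbook values, assumed here for illustration only:
# (semi-major axis in astronomical units, orbital period in years).
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def fit_power_law(pairs):
    """Least-squares fit of log T = log c + p * log a; returns (c, p)."""
    xs = [math.log(a) for a, _ in pairs]
    ys = [math.log(t) for _, t in pairs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Ordinary least-squares slope and intercept for a straight line.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    p = sxy / sxx
    c = math.exp(y_mean - p * x_mean)
    return c, p


coefficient, exponent = fit_power_law(list(PLANETS.values()))
print(f"T ~ {coefficient:.3f} * a^{exponent:.3f}")
```

The fitted exponent comes out close to 3/2, which is Kepler's third law. Note that the sketch is handed the two relevant variables and the power-law form in advance; only the two parameters are recovered from the data.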

In the words of Thomas Kuhn in The Structure of Scientific Revolutions, we distinguish between two modes of scientific discovery: “normal scientific” and “revolutionary.” Kuhn explains that a paradigm consists of a scientific community's shared framework and “[commitment] to the same rules and standards for scientific practice,” based on the symbolic laws, values, and “exemplar” problems. Then, Kuhn describes normal science as a process of “puzzle solving,” whereby a scientist works within the set of rules furnished by the paradigm to find an answer to a well-defined question. As such, we take normal scientific discovery to be the discovery of a law within an established research tradition, or paradigm. Revolutionary science, however, is quite a different matter; we take revolutionary discoveries to be ones which define research paradigms. For example, the use of symbolic regression for the discovery of coarse-grained equations to describe cloud cover in climate models [7] is a normal scientific discovery, but the Navier-Stokes equations are revolutionary. The discovery that material radiates heat at a given rate is normal scientific, but obtaining governing differential equations by applying energy conservation to material bodies is revolutionary. Revolutionary discoveries provide the rules by which subsequent research plays. In his study of computer simulations, philosopher of science Eric Winsberg argues that theory, which here we equate with revolutionary discovery, is the fundamental starting point of any predictive simulation, with additional modeling assumptions required to apply the theory to a system [17]. Thus, in computational science, most of the research is “theory articulation,” where revolutionary science is not even the goal. 

While we have no doubt that equation discovery and symbolic regression techniques will continue to succeed on problems of normal scientific discovery, the discussion in the literature of Newton’s dynamics and of discovering “great scientific theories” suggests a lack of clarity regarding the differences between the normal scientific and revolutionary modes of discovery. Here, we argue that equation discovery and symbolic regression cannot be used for revolutionary science.

The Past

The history of the scientific method is often taken to begin with Francis Bacon’s The New Organon. Skeptical of the Aristotelian faith that “axioms” of nature could be obtained by reason alone, Bacon proposes an inductive approach to the acquisition of scientific laws based on careful observation of natural phenomena with the help of measurement instruments [2]. He argues that the relevant features of a phenomenon of interest can be enumerated, and that its nature can then be ascertained using specific techniques of inquiry. The basic movements of Bacon’s approach are consonant with those of the equation discovery and symbolic regression literature: first, the data is obtained, and second, the laws are extracted. While the details of law extraction differ, the two approaches agree that the origins of scientific knowledge are found in data obtained from the senses and/or measurement apparatus. 

Figure 2. A section of Paul Gauguin’s “Where Do We Come From? What Are We? Where Are We Going?” depicting three stages of life, with a central Eve-like figure picking fruit from a tree. Figure courtesy of Surajr7/Wikimedia Commons. Public domain image.

This simple-minded picture of induction has been extensively criticized. The first problem is often known as the “theory-ladenness” of observation. As philosopher Carl Hempel puts it, “empirical facts or findings can be qualified as logically relevant or irrelevant only in reference to a given hypothesis, but not in reference to a given problem” [8]. Furthermore, if an attempt to generate scientific knowledge required the collection of all relevant facts, such an investigation would need to “await the end of the world.” Hempel calls this the “narrow inductivist” philosophy. Kuhn echoed similar concerns, stating “in the absence of a paradigm or some candidate for paradigm, all of the facts that could pertain to the development of a given science are likely to seem equally relevant” [9]. Thus, according to these philosophers, the assumption that data can be collected in a theory-neutral way before discovery is problematic. A more radical critique of this framing of science is given by Paul Feyerabend, who argues that the way science progresses is to go against the data. In his view, Copernicus needed to contradict the experience of the senses in proposing a mobile earth, as it was plain to see that stones falling went straight to the surface of the earth, providing “an irrefutable argument for the earth being motionless” [5].  

The narrow inductivist portrait of revolutionary scientific discovery is thus threatened from two sides. First, the necessary data cannot be collected without strong prior commitments as to what is relevant. While this poses no trouble for normal scientific discovery—where the relevant variables and structure of the governing equations are already known—it presents challenges for revolutionary science. The determination of relevance is not strictly antecedent to a scientific theory; rather, it is a fundamental part of the theory. Bacon himself attempts to showcase his inductive method to devise a theory of heat, but he does not distinguish between the concepts of heat and temperature. In compiling the data he deems relevant to his theory, he includes everything from the flammability of materials to the burning sensation produced by liquor, and the “sparkle” of seawater at night when “struck forcefully by oars” [2]. It took an additional 200 years for Joseph Fourier to conceptualize the phenomenon of heat transfer with sufficient clarity for the relevant data to be collected. In his Analytical Theory of Heat, Fourier writes: “if it be imagined that each molecule carries a separate thermometer, which indicates its temperature at every instant, the state of the solid will from time to time be represented by the variable system of all these thermometric heights” [6]. Evidently, the image of tiny thermometers spread continuously throughout a body is not self-evident. Neither is the fact that the sensation of burning is something distinct from thermodynamic temperature. Thus, in the context of revolutionary scientific discovery, prior knowledge is scarce, fragmentary, and contested. Framing a problem in the right way is a tremendously important step toward a solution, and it is not at all clear how data can assist this framing process. As the inventor Charles Kettering famously said, “a problem well stated is half solved.”

The second threat is that prior data may actually be an obstacle to developing the new theory. The immediacy of sense experience had to be set aside in order to entertain the Copernican hypothesis of a mobile Earth. Feyerabend shows that Galileo first adopts the Copernican hypothesis despite its contradiction with experience, and then subsequently demonstrates how this contradiction can be removed. This counter-inductive step recurs throughout the history of science. A more recent example is that of peridynamics, which models fracture in solids using the continuum equivalent of “action at a distance,” giving rise to non-local integro-differential equations, a rarity in fundamental physical models [15]. Just as obtaining the right data requires insight into the problem at hand, working with the wrong data, or even the wrong interpretation of the right data, can obscure the underlying truth.

As we have seen, the history of science offers counterexamples to the narrow inductivist model of discovery on which the success of equation discovery and symbolic regression relies. To the best of our knowledge, the problem of the theory-ladenness of observation has not been addressed head-on in the literature. However, the failure of revolutionary science to conform to this model in the past does not guarantee that this will continue. Perhaps our understanding of nature has progressed to such a degree that large quantities of trustworthy, relevant data can be obtained and interpreted. What are the prospects of a data-driven scientific epistemology in an age of abundant data and strong paradigms already established for investigating the natural world?

The Future

Our hopes for revolutionary science with equation discovery and symbolic regression are not high. Even ignoring the possibility of counter-inductive reasoning, and despite all our scientific and technological sophistication, we still do not know how to frame problems at the frontiers of knowledge. Ongoing attempts to interpret large language models (LLMs) help clarify this point. For example, if one were to play 20 questions with an LLM, a game where the model is asked to think of an object and answer questions about it, the model would not think of an object at the start but suddenly choose an object at the end of the dialogue consistent with the answers to the questions [14]. Can the model really understand the game if it doesn’t think of an object? This raises important questions: is the definition of understanding something we can agree upon, and can it be measured? The philosophical flavor of these questions is suggestive of pre-paradigmatic debate over fundamentals. What would it even mean to collect data on understanding in machines? What form would a law take in this context? Until a law is discovered—assuming it exists at all—we cannot be sure the problem was appropriately framed, and it is not clear how equation discovery and symbolic regression can help figure this out.  

Closing Remarks

A scientific education naturalizes the entities that appear in mathematical models of nature. The concepts of acceleration, mass, force, energy, temperature, and probability all appear self-evident to the practicing scientist, as if they manifest directly to the senses. We tend to treat these objects as the building blocks of theory, but not as themselves beholden to theory. Yet, the history of science shows the story to be more complex. As Kuhn says, until the Aristotelian paradigm of dynamics was overthrown, “there were no pendulums, but only swinging stones, for the scientist to see. Pendulums were brought into existence by something very like a paradigm-induced gestalt switch” [9].  

The primary obstacle to using equation discovery and symbolic regression for revolutionary science is the fact that the collection and interpretation of data are theory-dependent; a law of nature and the entities that comprise it co-evolve. Unlike in normal discovery, in the messy business of revolutionary scientific discovery, the data does not wait passively for the scientist to pull a law out of it. Generating and interpreting data is a creative, intuitive act on par in epistemic significance with fitting a parsimonious equation. How fundamental discoveries are made, we cannot say, but it does not appear to be straightforwardly algorithmic. This may call for a tempering of one’s hopes regarding a data-driven revolution in science, at least the one suggested by equation discovery and symbolic regression. But, on the bright side, we scientists will have plenty of work left to do.

References

[1] Anderson, C. (2008, June 23). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. WIRED. Retrieved from https://www.wired.com/2008/06/pb-theory/
[2] Bacon, F. (2000). Francis Bacon: The new organon (Jardine, L., & Silverthorne, M., Eds.). Cambridge, U.K.: Cambridge University Press.
[3] Cranmer, M. (2024, April 3). The next great scientific theory is hiding inside a neural network. [Presentation]. Simons Foundation Presidential Lectures. Retrieved from https://www.simonsfoundation.org/event/the-next-great-scientific-theory-is-hiding-inside-a-neural-network/
[4] Einstein, A. (1934). On the method of theoretical physics. Philosophy of Science, 1(2), 163–169.
[5] Feyerabend, P. (2010). Against method: Outline of an anarchistic theory of knowledge. New York, NY: Verso. 
[6] Fourier, J. (1878). The analytical theory of heat. Cambridge, U.K.: Cambridge University Press. 
[7] Grundner, A., Beucler, T., Gentine, P., & Eyring, V. (2024). Data-driven equation discovery of a cloud cover parameterization. J. Adv. Model. Earth Syst., 16(3). 
[8] Hempel, C. (1966). Philosophy of natural science. Upper Saddle River, NJ: Prentice Hall.
[9] Kuhn, T.S. (2012). The structure of scientific revolutions (50th Anniversary ed). Chicago, IL: University of Chicago Press.
[10] Lemos, P., Jeffrey, N., Cranmer, M., Ho, S., & Battaglia, P. (2022). Rediscovering orbital mechanics with machine learning. Preprint, arXiv:2202.02306.
[11] Popper, K. (1959). The logic of scientific discovery. New York, NY: Basic Books.
[12] Rosenfeld, L. (1969). Newton’s views on aether and gravitation. Arch. Hist. Exact Sci., 6(1), 29–37.
[13] Rudy, S.H., Brunton, S.L., Proctor, J.L., & Kutz, J.N. (2016). Data-driven discovery of partial differential equations. Preprint, arXiv:1609.06401.
[14] Shanahan, M., McDonell, K., & Reynolds, L. (2023). Role-play with large language models. Preprint, arXiv:2305.16367.
[15] Silling, S. & Lehoucq, R.B. (2010). Peridynamic theory of solid mechanics. Adv. Appl. Mech., 44, 73–168. 
[16] Smith, G.E. & Seth, R. (2020). The historical background: Brownian motion as of 1905. In R. Seth and G.E. Smith (Eds.), Brownian motion and molecular reality. Oxford, U.K.: Oxford University Press.
[17] Winsberg, E. (2010). Science in the age of computer simulation. Chicago, IL: University of Chicago Press.

About the Author

Conor Rowan

Ph.D. student, University of Colorado Boulder

Conor Rowan is a Ph.D. student in aerospace engineering at the University of Colorado Boulder. He is interested in continuum mechanics, scientific machine learning, and the philosophy of science.