Opportunities for machine learning in scientific discovery

Ricardo Vinuesa: FLOW, Engineering Mechanics, KTH Royal Institute of Technology, Stockholm, Sweden
Jean Rabault: IT Department, Norwegian Meteorological Institute, 0313 Oslo, Norway
Hossein Azizpour: Robotics, Perception and Learning, KTH Royal Institute of Technology, Stockholm, Sweden; Swedish e-Science Research Centre (SeRC), Stockholm, Sweden
Stefan Bauer: TUM School of Computation, Information and Technology, Technical University Munich, Munich, Germany; Helmholtz AI, Helmholtz Center Munich, Munich, Germany
Bingni W. Brunton: Department of Biology, University of Washington, Seattle, WA 98195, USA
Arne Elofsson: Dept. of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University, 171 21 Solna; Swedish e-Science Research Centre (SeRC), Stockholm, Sweden
Elias Jarlebring: Dept. Mathematics, KTH Royal Institute of Technology, 100 44 Stockholm, Sweden; Swedish e-Science Research Centre (SeRC), Stockholm, Sweden
Hedvig Kjellström: Robotics, Perception and Learning, KTH Royal Institute of Technology, Stockholm, Sweden; Swedish e-Science Research Centre (SeRC), Stockholm, Sweden
Stefano Markidis: Department of Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden; Swedish e-Science Research Centre (SeRC), Stockholm, Sweden
David Marlevi: Dept. Molecular Medicine and Surgery, Karolinska Institutet, 171 77 Stockholm, Sweden; Inst. for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Paola Cinnella: Institut Jean le Rond D’Alembert, Sorbonne Université, France
Steven L. Brunton: Department of Mechanical Engineering, University of Washington, Seattle, WA 98195, USA
Abstract

Technological advancements have substantially increased computational power and data availability, enabling the application of powerful machine-learning (ML) techniques across various fields. However, our ability to leverage ML methods for scientific discovery, i.e. to obtain fundamental and formalized knowledge about natural processes, is still in its infancy. In this review, we explore how the scientific community can increasingly leverage ML techniques to achieve scientific discoveries. We observe that the applicability and opportunity of ML depends strongly on the nature of the problem domain, and whether we have full (e.g., turbulence), partial (e.g., computational biochemistry), or no (e.g., neuroscience) a-priori knowledge about the governing equations and physical properties of the system. Although challenges remain, principled use of ML is opening up new avenues for fundamental scientific discoveries. Throughout these diverse fields, there is a theme that ML is enabling researchers to embrace complexity in observational data that was previously intractable to classic analysis and numerical investigations.

Keywords: machine learning (ML); deep learning (DL); artificial intelligence (AI); scientific discovery; physics; life sciences; computer science

Introduction

Machine learning (ML) has shown great potential to transform a broad range of domains [1, 2, 3, 4, 5], and it is increasingly being applied to problems in science and engineering. ML has been widely used for predictive tasks in these areas, and despite an initial promising phase where ML methods have outperformed well-established techniques [6, 7, 8, 9, 10], such predictive applications are starting to exhibit diminishing returns. There are, by contrast, increasing opportunities in the academic usage of ML for scientific discovery, i.e. answering challenging scientific questions while leveraging existing fundamental knowledge. Such focus on scientific discovery can move the frontiers of science forward when progress in more traditional methods has slowed. Furthermore, the development of novel and more powerful ML methods can help to tackle some open subjects in the context of predictions from scarce, noisy, or incomplete data, out-of-sample generalization, extreme-event predictions and predictions under uncertainty.

Science is fundamentally interested in identifying structure and explaining the world, the many systems that constitute it, and the laws that govern it. The development of the scientific method as we know it has taken time and been progressive, from early forms found in antiquity (Aristotle, first forms of causality [11], etc.) to rigorous methodologies based on objective, quantitative, and mathematical evidence [12]. This approach has encountered great success, although the “unreasonable effectiveness of mathematics in natural sciences” [13] increasingly appears to encounter difficulty making further progress with the “complexity” observed in more challenging problems and real-world data. Complexity is hard to define but often arises from non-linearity, high dimensionality, and multiscale dynamics in space and time, leading to a system comprising a very large number of parts and mechanisms that cannot easily be simplified or approximated. Traditional tools inherited from the rigorous mathematical tradition of the 1700s to late 1900s are challenged by these problems: still today, we can observe and, to some degree, simulate and reproduce in numerical models complex problems such as turbulence, processes in the brain, biological systems, etc. But arguably, we do not fully understand these phenomena at a deeper level.

Machine learning comprises a growing set of algorithms, enabled by high-performance computing and increasingly vast data, that show incredible promise for handling complexity [14, 15]. Interestingly, artificial neural networks are themselves “complex” artificial systems that can now perform “complex” tasks for which no traditional algorithms are known. Even though their governing laws are based on simple operations, they are fully observable, and they run deterministically on microprocessors, we cannot in general explain the decisions and outputs generated by neural networks. Regardless, neural networks are enabling novel discoveries with profound impact, such as the first new class of antibiotics in decades [16]. This gives rise to several challenges, such as the need for explainable artificial intelligence (XAI) [17, 18].

Symbolic approaches, such as gene-expression programming, sparse regression, and sparse Bayesian learning  [19, 20, 21, 22, 23], have led to successful approaches over the years (see e.g. the work on “machine scientists” from the 1980s [24]). Such approaches are limited by exponential complexity with the size of the search space, motivating recent attempts to combine symbolic and deep-learning-based strategies [25]; however, they can still be used to enable advances in science, including the discovery of novel materials [26]. This observation also raises several fundamental, quasi-philosophical questions: is a complex system complex enough to understand its complexity, or can it only understand lesser complexities? This is also related to the question of how much AI can discover that is not already contained in the training data [27]. These considerations naturally raise the fundamental question of what opportunities (and challenges) are offered by the growing impact of data-driven methods and machine learning in science: will we, as some authors have suggested, observe the “unreasonable effectiveness of data” [28], and of deep learning [29]? And as a consequence, what does this mean in practice for future scientific discoveries? How and to what degrees can we extract fundamental understanding, not just computational recipes, from modern machine learning, and by using which approaches? In this sense, and for this article, we do not consider optimization or automation tasks as “scientific discovery”. Still, being able to effectively solve governing equations (for instance, partial differential equations, PDEs) through ML would be considered as “scientific discovery”, as long as this enables addressing fundamental questions that cannot be answered with other tools.

There have been some recent studies on the possible impact of ML on science. For instance, Wang et al. [30] focus on the potential of self-supervised learning and generative models for generally achieving improvements in scientific experiments and simulations. Zenil et al. [31] are more focused on how AI can help formulate scientific questions and answer them, emphasizing the potential of large language models (LLMs) in this context. The latter are raising a number of open questions in terms of potential benefits and threats of their application in science [32]. Fajardo-Fontiveros et al. [33] discuss the interesting question of when it is possible to learn models from data and what is the maximum level of noise acceptable for learning the correct model.

In the present work we adopt a different approach, analyzing separately the potential of AI to enable or facilitate scientific discovery in three types of problems: (i) problems where the governing phenomenological equations are entirely known, which corresponds to cases where it would be directly possible, given enough computational power, to simulate and reproduce the system entirely; (ii) cases where we have some partial knowledge regarding the governing equations and/or some physical properties that hold; and (iii) scenarios where nothing is known about the governing equations or the physical properties of the problem under study. We illustrate these categories by considering examples from two concrete application areas: Physical Sciences and Life Sciences. A summary of the various categories and applications is presented in Figure 1, where we discuss the examples of turbulent flows, dark matter, drug discovery and brain research, sorted from more to less knowledge of the governing equations and/or their underlying properties. While the proposed categorization is a convenient schematic view, the lines between full knowledge and partial or no knowledge become increasingly blurred as system complexity grows. Examples given in the following sections illustrate how different uses of ML may support scientific discovery by helping to tackle complexity.

Complete information is available

There are several applications within Physical Sciences where the underlying governing equations are known perfectly but the high-level global dynamics are still not well understood. Even if, technically, we know the governing quantum equations of chemistry underlying the dynamics of bio-molecules, and subsequently of full organisms, complexity makes biology a partial-knowledge problem, because only very limited parts of complex biological systems can be actually measured or simulated, while extreme complexity makes it impossible to simulate the full human brain. Similarly, turbulent flows of continuous media (i.e. flows characterized by very small values of the Knudsen number $Kn = l_{\text{m.f.p.}}/L$, where $l_{\text{m.f.p.}}$ is the mean free path of molecules constituting the fluid and $L$ is a macroscopic length scale) are well described by the Navier–Stokes equations, which in turn can be derived from the Boltzmann equations describing the kinetics of gases at the microscopic level [34]. Nevertheless, the tiny and highly chaotic small scales in turbulent flows can no longer be measured or simulated as the advection forces become increasingly dominant over diffusion, i.e. as the Reynolds number increases, making turbulence a major unsolved problem [35]. On the other hand, the mean-flow properties or the largest, coherent scales can still be simulated and observed, thus providing a partial knowledge of the system, but we do not have a consistent theory to prove fundamental flow properties based on these, nor describe the detailed interactions among turbulent structures and phenomena they imply.

In this context, supervised ML holds promise for scientific discoveries, as summarized in Figure 2. Indeed, it is possible to generate large amounts of fully resolved direct-numerical-simulation (DNS) data, yielding large training datasets of high quality. Such data can be used, for example, to learn what the most important structures are through explainable deep learning [36], sensitivity analysis [37] or information theory [38]. In such a context, a broad range of ML techniques, such as symbolic regression, reduced-order modeling and autoencoders, can be leveraged to discover novel structures and relationships in flow features and to uncover previously unknown physical properties of turbulence. Similarly, turbulence closure modeling has seen significant advances using supervised-ML techniques [39, 40]. In particular, novel closures have been developed for Reynolds-averaged Navier–Stokes (RANS) [41, 42, 43] and large-eddy simulation (LES) [44] turbulence models. Importantly, these closures are analytic expressions that are interpretable and generalizable, built on sparse symbolic-regression techniques [22]. The broader family of symbolic-regression methods [45, 41, 46] promises to further aid in the development of these improved closure models. Furthermore, accelerated computational-fluid-dynamics codes are also being developed using machine learning, for example to discover improved stencils and numerical schemes for computations [47, 48]. In the context of astrophysical sciences, a supervised classifier called SPOCK has made it possible to predict, from short-term simulations, the long-term stability of compact multi-planet systems, a task that otherwise requires integrating the laws of gravitation over billions of orbital periods [49]. Not only was SPOCK able to predict the long-term stability of small systems similar to those used for its training, but it was also able to generalize to large multi-planet systems.

An additional challenge associated with complex physical and biological systems is to perform optimal control, which is a key aspect of many scientific questions as well as industrial applications [50, 51]. In this context there are also many promising applications of ML: specifically, deep reinforcement learning (DRL) is now leading to discoveries in Physics similar to those of AlphaGo [52]. It is common to use simulation data, based on the known governing equations of the system, to define the environment with which the DRL agent interacts to develop the best-performing policy. In the case of experimental environments, numerical data can also be used to enhance the measurements. The use of DRL has spread across different physics applications in recent years. Several examples can be found in quantum physics, where DRL was able to find the ground state and describe the unitary time evolution of complex interacting quantum systems [53]. Reinforcement learning has also been employed in astronomy for the adaptive control of astronomy systems [54]. In the field of turbulence research, DRL has enabled the discovery of novel and previously unknown flow-control strategies which outperform the previous state of the art [55]. Similar advances have been made in, e.g., tokamak-instability control [14, 56]. This allows the discovery of novel and more effective ways to tune instability control, which is of great importance in practice. It may also lead to a better understanding of the complexity arising in turbulent systems. By performing an a-posteriori analysis of the novel control laws using traditional scientific methods, one can analyze and explain how such control laws work, providing novel insights into the underlying physics. Moreover, novel physical regimes that are not spontaneously observed, for example because they lie behind an energy barrier or are intrinsically unstable, can be discovered through DRL. For instance, the application of RL in thermodynamics [57] has made it possible to learn previously unknown thermodynamic cycles.

Intractable simulations and optimization of complex systems with known governing equations can be dramatically accelerated using ML. For example, autoencoders can be leveraged to discover effective latent-space representations of complex systems [58]. Such latent spaces can then be used to design effective time integrators and faster simulations [59], or to perform optimization at a lower computational cost [60]. While speeding up simulations is not, per se, a scientific discovery, faster simulations make systematic studies possible, and the improved understanding that results can lead to breakthroughs. For example, in high-energy physics, particle discovery relies on the ability to accurately compare observed detector-response data with expectations based on physical models [61]. While the processes of subatomic particle interactions with matter are known, the calculation of the detector response is analytically intractable, and Monte-Carlo methods must be used to simulate the propagation of particles in detectors for comparison with the data [62, 63]. As a consequence, recent advances in high-fidelity fast generative models, such as generative adversarial networks (GANs) [64] or variational autoencoders (VAEs) [65], offer a promising alternative for simulation, gaining orders of magnitude in simulation speed over existing techniques, provided that these methodologies can be developed to achieve the required accuracy, which is a subject of ongoing research [66]. A striking example is given by the recent introduction of LLMs for weather and climate modeling, which is not only revolutionizing weather forecasting [67], but also has the potential to accelerate scientific discoveries in climate change. Studies on paleoclimates can also be accelerated by leveraging AI to replace or complement the cumbersome coupled resolution of several complex climate processes (e.g. convection, clouds, atmospheric chemistry), and by enabling access to the finer resolutions required to understand regional to local changes in climate [15].
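As a rough illustration of what a latent-space representation means in practice, the sketch below compresses a set of snapshots with a truncated singular value decomposition, which can be viewed as the optimal linear autoencoder. It is only a linear stand-in for the nonlinear autoencoders of Ref. [58]; the synthetic two-mode field, the latent dimension `r = 2` and the names `encode` and `decode` are assumptions made for this example.

```python
import numpy as np

# Synthetic snapshot matrix (hypothetical data): each column is the
# state of a 1D periodic field at one time instant.
x = np.linspace(0.0, 2.0 * np.pi, 128, endpoint=False)
t = np.linspace(0.0, 10.0, 200)
snapshots = (np.outer(np.sin(x), np.cos(t))
             + 0.5 * np.outer(np.sin(2.0 * x), np.sin(3.0 * t)))

# Truncated SVD: the leading left singular vectors span the latent space.
U, s, Vt = np.linalg.svd(snapshots, full_matrices=False)
r = 2                      # latent dimension (assumed for this example)
basis = U[:, :r]

def encode(fields):
    """Project full-order fields onto the r-dimensional latent space."""
    return basis.T @ fields

def decode(latent):
    """Lift latent coordinates back to the full-order space."""
    return basis @ latent

latent = encode(snapshots)           # shape (r, number of snapshots)
reconstruction = decode(latent)
rel_err = (np.linalg.norm(snapshots - reconstruction)
           / np.linalg.norm(snapshots))
print(f"latent shape: {latent.shape}, relative error: {rel_err:.2e}")
```

Because the synthetic field is built from exactly two spatial modes, two latent coordinates reproduce it to round-off. A time integrator or optimizer could then operate on `latent` instead of the full field, which is where the computational savings discussed above would come from.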

In applied Mathematics and scientific computing, approaches based on DRL similar to AlphaGo [52] can also lead to discoveries. Fawzi et al. [8] discovered novel algorithmic optimizations to accelerate matrix operations. This is important for both practical applications (faster programs, in particular for numerical simulations) and scientific advancement, opening up new algorithmic optimization possibilities that were previously unknown. Large language models (LLMs) are now also being leveraged to generate possible algorithms to approach solutions for classical problems in combinatorics, e.g. the cap set problem [5]. As in the matrix-acceleration task, the validation of suggested solutions can be performed with a classical algorithm, which allows the discovery of strategies and algorithms that are both previously unknown and formally verifiable. In both cases, AI is proving useful as a heuristic method to provide good candidate solutions for a problem where the verification of a solution candidate can be done cheaply, e.g. in deterministic linear [8] or polynomial [5] time. Note, however, that deciding how to find, or discover, an adequate candidate remains challenging.
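The asymmetry between finding and checking a candidate can be made concrete with the cap set problem. A cap set is a subset of F_3^n containing no three distinct points on a line, or equivalently no three distinct points whose coordinate-wise sum is zero modulo 3. A minimal sketch of a verifier follows; the function name `is_cap_set` and the small example sets are illustrative assumptions, not the evaluation code of Ref. [5].

```python
from itertools import combinations

def is_cap_set(points, n):
    """Check that `points` (tuples with entries in {0, 1, 2} and length n)
    contain no three distinct elements summing to zero modulo 3."""
    pts = set(points)
    if len(pts) != len(points):
        return False  # duplicated points are not allowed
    if any(len(p) != n or any(c not in (0, 1, 2) for c in p) for p in pts):
        return False  # not a subset of F_3^n
    for a, b in combinations(pts, 2):
        # The unique third point on the line through a and b is -(a + b) mod 3;
        # it differs from both a and b whenever a != b.
        c = tuple((-(u + v)) % 3 for u, v in zip(a, b))
        if c in pts:
            return False
    return True

# A valid cap set of size 4 in dimension 2, and a full line that is not one.
candidate = [(0, 0), (0, 1), (1, 0), (1, 1)]
line = [(0, 0), (0, 1), (0, 2)]
print(is_cap_set(candidate, 2), is_cap_set(line, 2))
```

The check costs a number of operations that grows quadratically with the size of the candidate set, so any proposal, however it was generated, can be accepted or rejected cheaply.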

Partial information is available

In contrast to the previous section, the following discussion illustrates scientific-discovery techniques enabled by ML where only partial information is available about the governing equations and/or the mechanisms of the phenomenon under study. This occurs in physical or biological systems for which descriptions of the macroscopic behavior are possible according to first principles, but the microscopic structure is too complex to be completely described or observed. This is the case, for instance, of complex materials (i.e., composite or textured) or certain types of fluids (e.g., granular flows, multiphase flows, flows with particles or sediments, flows through canopies, etc.). Conversely, some other systems may exhibit elementary microscopic behaviors that nevertheless lead to extremely complex and/or chaotic macroscopic behaviors that cannot be easily described. In Biology this can be the case of the spread of infectious diseases, whereby the underlying mechanisms can be described by e.g. cellular automata [68], while the collective behavior can be highly complex and hard to predict [69]. Another example is given by active matter, generally of biological origin, which exhibits a form of “turbulent” spectrum despite being described at the microscopic level by linear equations. This leads to highly chaotic behaviors known as “active turbulence” [70]. In all these cases, only the macroscopic or microscopic governing equations are known, while the rest is unknown or hard to describe.

In such cases, inductive biases [71, 72], i.e. the known information about the system behavior, can be incorporated into the architecture of data-driven methods to exploit established physical or biological knowledge. Such biases may consist of frame invariance, equivariance, or symmetry-based constraints, thereby reducing the sample complexity of the discovery process. Physical constraints, e.g. flow incompressibility, thermodynamic properties, etc., can also be incorporated, leading to the discovery of more compact and interpretable models with better generalization properties (universality). For example, Moya et al. [73] propose a framework for training neural networks where the learning is constrained by the known thermodynamic properties of the system in such a way that the model is general and can be applied to several physical systems (this particular study focuses on fluid mechanics). This enables exploring phenomena through digital twins where only some data are available, and makes it possible to combine data-driven aspects with simulations while preserving generalization properties [74]. Another example of taking advantage of known physical properties of the system is embedding symmetries in autoencoders, intending to develop reduced-order models (ROMs) in physical systems invariant to input transformations [75]. This process makes it possible to train ROMs more effectively without having to learn the symmetries and it enables examining the physics of the system in the latent space more effectively. A general framework to impose and discover symmetries in physical systems for ML applications was proposed by Otto et al. [76]. Such approaches, combining data-driven methods and physics, are essential to achieve novel techniques for scientific discovery. Recent work has shown the potential of novel deep-learning methods for geometric reasoning [9], a possibility that can be critical when dealing with symmetries and other physical properties in systems containing intrinsic structures.
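One simple way to build a physical constraint such as incompressibility into a model, rather than hoping it is learned from data, is to let the model output a stream function and derive the velocity from it. The sketch below illustrates this on a periodic 2D grid; a random array stands in for the output of a trained network, and all names (`ddx`, `velocity_from_streamfunction`, `divergence`) are assumptions made for this example rather than the approach of any cited reference.

```python
import numpy as np

def ddx(f, h, axis):
    """Second-order central difference on a periodic grid."""
    return (np.roll(f, -1, axis=axis) - np.roll(f, 1, axis=axis)) / (2.0 * h)

def velocity_from_streamfunction(psi, h):
    """Velocity (u, v) = (dpsi/dy, -dpsi/dx); axis 0 is x, axis 1 is y."""
    u = ddx(psi, h, axis=1)
    v = -ddx(psi, h, axis=0)
    return u, v

def divergence(u, v, h):
    """Discrete divergence du/dx + dv/dy."""
    return ddx(u, h, axis=0) + ddx(v, h, axis=1)

rng = np.random.default_rng(0)
n = 64
h = 2.0 * np.pi / n

# Constrained model: an arbitrary scalar output (stand-in for a network)
# is interpreted as a stream function.
psi = rng.standard_normal((n, n))
u, v = velocity_from_streamfunction(psi, h)
max_div = np.abs(divergence(u, v, h)).max()

# Unconstrained baseline: two independent outputs used directly as (u, v).
u_free = rng.standard_normal((n, n))
v_free = rng.standard_normal((n, n))
max_div_free = np.abs(divergence(u_free, v_free, h)).max()

print(f"constrained: {max_div:.2e}, unconstrained: {max_div_free:.2e}")
```

Because the two discrete difference operators commute, the constrained velocity field is divergence-free up to round-off for any value of `psi`, so the model never has to spend data learning that property.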

Machine learning can also be used to obtain physical knowledge where partial knowledge is available through the discovery of constitutive laws. Many materials are characterized by complex rheologies that are difficult or impossible to describe using standard modeling approaches. However, costly high-fidelity or ab initio simulations can be produced under various loading configurations and used to indirectly infer the constitutive equations or rheological behaviour. For instance, De Lorenzis and coworkers have recently proposed the EUCLID hybrid finite-element/neural-network framework for learning constitutive equations in hyperelastic solids [77], and the SpaRTA framework initially introduced in Ref. [41] for data-driven discovery of turbulence models has been adapted to the discovery of constitutive equations for elastic solids [78]. Sparse identification has also been used to discover constitutive equations of crystal structures, learning from ab initio calculations or interatomic potentials [79]. Finally, a deep-learning method incorporating strong inductive biases, such as objectivity, consistency, and stability, was developed by As’ad et al. [80] to learn constitutive laws for complex nonlinear materials. In all these examples, which are schematically represented in Figure 3, the macroscopic behaviour of the material follows the conservation laws of Mechanics, and only the constitutive equations are unknown. Mahmoudabadbozchelou et al. [81] introduced the notion of “digital rheometer twin”, where ML methodologies are leveraged to learn the hidden rheology of complex fluids through a limited number of experiments.

Multiscale modeling and stochastic simulations are other areas where learning from simulated data can lead to a real discovery. In multiscale simulations, an appropriate model is available at a small scale (e.g., the fundamental laws of molecular dynamics), and the goal is to learn a model at another scale (e.g., a continuum-scale partial differential equation) from the data generated at the first scale. In stochastic simulations, the governing equations contain uncertain parameters or are driven by randomly fluctuating forcing terms representing subgrid variability and processes (e.g., Langevin equations). Solutions to such problems are typically characterized using probability density functions (PDFs). The goal is to learn the deterministic dynamics of either the PDF of a system state or its statistical moments. While PDF equations can be exact under certain conditions, their derivation requires closure approximations based on field-specific knowledge and can introduce uncontrollable errors. Machine learning can then infer closure terms from the databases, e.g. based on sparse regression [82]. The D-CIPHER method has recently been shown to discover many ordinary and partial differential equations [83]. Similarly, when studying biomolecules, their interactions, and generally their functions that serve the mechanisms of a cell, Newton’s laws of motion are commonly used to model molecular dynamics at the atomistic level. However, in these cases, the exact energy functions are not known as these are generally optimized in smaller systems using a combination of techniques [84]. Therefore, the exact energy function governing these simulations always contains some noise and includes necessary compromises to achieve computational efficiency. Furthermore, even if the precise energy functions were known, such simulations are prohibitively expensive [85], and we do not yet have complete predictive rules for larger-scale interactions at the size of biological macromolecules. Here, ML methods provide alternative paths to analyse the energies and dynamics of these systems, with both molecular-dynamics simulations and experiments contributing to generate databases for learning large-scale interactions [86]. They can also serve as a regulariser for ML models, thus enabling the use of a more coarse-grained, and therefore faster, representation [87, 88, 89].
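The idea of learning deterministic moment dynamics from stochastic simulations, mentioned above for Langevin-type equations, can be illustrated with a toy problem. The sketch below simulates an ensemble of Ornstein–Uhlenbeck paths with the Euler–Maruyama scheme and then fits the linear law dm/dt = -theta m for the ensemble mean m by least squares. The choice of process, all parameter values and the variable names are assumptions made for this example; it is not the closure-learning procedure of Ref. [82].

```python
import numpy as np

# Toy Langevin-type model (assumed): dX = -theta * X dt + sigma dW
theta_true, sigma = 1.0, 0.5
dt, n_steps, n_paths = 0.01, 200, 20000
rng = np.random.default_rng(0)

# Euler-Maruyama integration of the whole ensemble, storing only the mean.
x = np.full(n_paths, 2.0)
mean = np.empty(n_steps + 1)
mean[0] = x.mean()
for k in range(n_steps):
    dW = rng.standard_normal(n_paths) * np.sqrt(dt)
    x = x - theta_true * x * dt + sigma * dW
    mean[k + 1] = x.mean()

# Learn the deterministic dynamics of the first moment, dm/dt = -theta * m,
# by one-parameter least squares on forward differences of the mean.
dmdt = np.diff(mean) / dt
m = mean[:-1]
theta_hat = -np.dot(m, dmdt) / np.dot(m, m)
print(f"estimated theta: {theta_hat:.3f}")
```

In this linear case the moment equation closes exactly, so a single coefficient suffices. For nonlinear drifts the mean would depend on higher moments, which is where the learned closure terms discussed above would be needed.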

In the context of Life Sciences, structural biology is a very important example where only partial knowledge of the phenomena is available. Yet, ML is leading to significant new scientific discoveries. Fast identification of free-energy minima has been developed for small molecules [90], proteins [7], ligand-binding [91], and mixtures of these [10]. More concretely, one notable recent advance is AlphaFold [7], which employs a complex deep architecture with several embedded inductive biases to fold proteins from their one-dimensional amino-acid sequence into their three-dimensional (3D) native form. Although AlphaFold does not provide any insights into how proteins fold, it has proven very valuable for the structural-biology community. Among AlphaFold’s architectural biases, the most notable are: (i) the use of multiple sequence alignment (MSA), a source of co-evolutionary signal for protein folding, which is processed by a novel transformer architecture, the EvoFormer; and (ii) structural equivariance, which is guaranteed by using a 3D equivariant transformer trained with a novel loss function, the FAPE loss.

There are several additional examples of ML leading to scientific discovery in systems where partial information is available, for instance through generative models. Such models constitute a powerful tool which has crossed over into popular culture in recent years, especially due to their capability of generating artistic pictures or videos. Recently, generative AI has been used to learn physical models from large datasets by incorporating prior knowledge expressed as constraints on the functional form of the learned model or from axiomatic knowledge and experimental data by combining logical reasoning with symbolic regression [92]. The discovered models enable generalizing known phenomena to new configurations, for example, new geometries or operating conditions. This can help to shed light on the physical phenomena in new scenarios beyond those in which data is available. When it comes to quantum systems, deep reinforcement learning (DRL) has enabled the discovery of novel approaches to put a quantum system in a given state, which provides novel insights into the underlying physics [93, 94]. Furthermore, ML is currently making it possible to reduce the noise in quantum-computing systems, while quantum computing is in turn making it possible to improve ML performance with reduced energy consumption [95]. Another example is climate, where ML is helping to develop climate models enabling the characterization of novel physics. This includes establishing new large-eddy simulation (LES) models for climate that ensure stable behavior for long-term forecasts [96]. Note that in LES only the larger scales are resolved, whereas the smaller ones need to be represented by a model, which, in this case, is developed through ML. There are also several studies which reflect how ML can help to improve classical weather-prediction systems [97, 98].

AlphaFold [7] has also spurred many innovations in the ML field, e.g. single-sequence methods such as ESMfold [99]. These single-sequence-based methods are based on foundation models trained to recover a masked region of a protein sequence [100], a technique also used in AlphaFold to improve its performance. They have also been used for protein design [101], but have lately been replaced by generative models that take both sequence and structure into account [102]. This highlights that a method not trained directly to uncover scientific insights can be utilised to provide such insights in another scientific discipline. Another such example is the use of diffusion models [103, 104] to generate protein backbone structures suitable for protein design that are routinely tested using AlphaFold. Lately, diffusion and flow-matching models can also be used to altogether bypass the need for simulations when generating a molecular ensemble [105].

No information is available

In various scientific fields, certain phenomena exist whose origins and descriptions remain largely unknown, and we lack known governing equations or foundational physical models to accurately capture their critical aspects. For example, the field of neuroscience has no governing equations from first principles, as there are no known conservation laws, symmetries, or other physics that may be used to derive generalizable differential equations. Even if we have access to equations to describe the behavior of individual molecules or cells, the complexity of the brain makes it unfeasible to simulate at every scale; furthermore, it is not tractable to take enough measurements to initialize such a simulation. At the same time, rapid progress is being made that advances our ability to acquire neural data. For instance, large-scale neural recording and imaging techniques now routinely produce datasets of unprecedented spatial and temporal resolution [106, 107], and advances in connectomics offer detailed views of cell types and neural circuit architecture [108, 109]. Taken together, the state of what is possible in data-driven modeling promises a new era of empirical models that leverage machine learning for discovery in our understanding of neuroscience and behavior.

Full numerical simulations of the brain and behavior of an animal are impossible to construct and infeasible to initialize, but data-driven models that recapitulate key input-output relationships and generate predictions can be crucial tools in future discovery. In scientific fields focused on complex phenomena for which we lack known governing equations, a common approach to gaining understanding is through perturbation experiments. For instance, in systems neuroscience, experimentally activating a specific population of neurons at a particular phase of a visual discrimination task can directly bias the animal’s perception. Such causal experiments allow us to gain insights into this part of the visual perception pathway. Thus, data-driven models are particularly useful when they are constructed in a space that matches experimentally measurable quantities and are capable of generating novel, testable hypotheses. What would happen if we combinatorially activated populations of neurons, and at several different phases of the task? Because data-driven models are not based on known physical equations, such predictions can diverge quite drastically from reality, so we emphasize the importance of a rapid iteration between generating hypotheses and performing experiments, which in turn generate data that continually refine models.

Another way in which ML contributes to discovery is through providing an approach to synthesize diverse experimental data that measure parts of the same system, but collected separately and at different resolutions. For instance, the MICrONS dataset [110] includes both functional-imaging data and structural reconstructions of the same cortical neurons. Although we know that morphological and physiological features of these neurons are related, each dataset is incomplete, and the relationships between them are difficult to describe. In such cases, a data-driven model is a valuable way to formalize these relationships.

Where no a-priori knowledge of governing equations is available, ML methods can serve as an approach to learn dynamics directly from observations. There is a long history of learning differential equations to model neural dynamics, exemplified by the famed Hodgkin–Huxley equations written to describe the electrical action potential based on changes in the conductance of ion channels [111]. There are several categories of modern ML approaches to learn dynamical models. For instance, the SINDy (sparse identification of non-linear dynamics) family of techniques [22] identifies the underlying dynamics of a system from the data and trajectories it generates by using sparsity-promoting regression onto a library of candidate functions, or related sparsity-generating techniques. Related approaches have been used for the same purposes using either genetic algorithms [112] or reinforcement learning [113]. Where we have no governing equations but are aware of some underlying structure, incorporating such constraints into the learning process can yield faster and more robust discovery of equations describing the dynamics. Once a set of equations that governs a system is identified, it can be applied to discover novel mechanisms and to efficiently perform engineering tasks on complex systems.
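As a minimal illustration of the sparse-regression idea behind SINDy [22], the following Python sketch recovers a damped linear oscillator from a simulated trajectory. The test system, the polynomial candidate library, the threshold value and the helper names (`simulate`, `library`, `stlsq`) are illustrative assumptions of ours and are not taken from [22] or from any specific software package; time derivatives are estimated by finite differences.

```python
import numpy as np

# Hypothetical test system (an assumption for illustration):
# dx/dt = -0.1 x + 2 y,   dy/dt = -2 x - 0.1 y
A_TRUE = np.array([[-0.1, 2.0], [-2.0, -0.1]])

def simulate(x0, dt, n_steps):
    """Integrate the linear test system with a classical RK4 scheme."""
    f = lambda x: A_TRUE @ x
    X = np.empty((n_steps, 2))
    X[0] = x0
    for k in range(n_steps - 1):
        x = X[k]
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        X[k + 1] = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return X

def library(X):
    """Candidate terms [1, x, y, x^2, x*y, y^2] for a two-variable state."""
    x, y = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x * x, x * y, y * y])

def stlsq(Theta, dXdt, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):
            keep = ~small[:, k]
            if keep.any():
                # Refit only the terms that survived the threshold.
                Xi[keep, k] = np.linalg.lstsq(
                    Theta[:, keep], dXdt[:, k], rcond=None)[0]
    return Xi

dt = 0.01
X = simulate([2.0, 0.0], dt, 2500)
dXdt = np.gradient(X, dt, axis=0)      # finite-difference derivative estimate
Xi = stlsq(library(X), dXdt)           # column k holds the equation for state k

names = ["1", "x", "y", "x^2", "x*y", "y^2"]
for k, lhs in enumerate(["dx/dt", "dy/dt"]):
    terms = [f"{Xi[j, k]:+.3f} {names[j]}"
             for j in range(len(names)) if Xi[j, k] != 0.0]
    print(lhs, "=", " ".join(terms))
```

The thresholding step is what makes the result readable as an equation: only a few library terms survive, and their coefficients can be compared directly with a mechanistic hypothesis. With noisy experimental data, both the derivative estimate and the threshold would need more care than in this sketch.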

Another approach to equation discovery is the neural ordinary differential equation model (NODE) [114]. Unlike traditional neural networks that map inputs to outputs through a series of layers with fixed parameters, NODEs formulate the transformation from input to output as the solution to an ODE and learn the dynamics of this transformation over continuous time. While NODE is a powerful and general method, it cannot provide direct scientific insights into the functional relationships learned from the dataset. To address this problem, there has been recent interest in hybrid approaches that rely on transformer backbones [115] or on shallow neural-network architectures [116] inspired by SINDy [22] for learning differential equations from data. Note that interpretable models may lead to new scientific insights. For instance, such interpretable ML models can be trained on patient data with specific diagnoses or prognoses and then subsequently inspected to discover novel biomarkers for the detection of diseases or outcome-prediction of treatments [117, 118].
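To make the NODE formulation concrete, the sketch below shows only the forward pass: a small neural network defines the vector field dh/dt = f(h, t), and the output is obtained by integrating this ODE from the input state. The network size, the fixed-step RK4 solver and the names `vector_field` and `node_forward` are assumptions chosen for illustration; training (for example by differentiating through the solver or with the adjoint method of [114]) is omitted, and the weights here are random rather than learned.

```python
import numpy as np

def vector_field(h, t, params):
    """Small MLP f(h, t; params) that defines dh/dt."""
    W1, b1, W2, b2 = params
    z = np.tanh(W1 @ np.append(h, t) + b1)   # hidden layer sees state and time
    return W2 @ z + b2

def node_forward(h0, params, t0=0.0, t1=1.0, n_steps=50):
    """Map the input h0 to the output h(t1) by solving dh/dt = f(h, t) with RK4."""
    h, t = np.asarray(h0, dtype=float), t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        k1 = vector_field(h, t, params)
        k2 = vector_field(h + 0.5 * dt * k1, t + 0.5 * dt, params)
        k3 = vector_field(h + 0.5 * dt * k2, t + 0.5 * dt, params)
        k4 = vector_field(h + dt * k3, t + dt, params)
        h = h + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return h

# Randomly initialised (untrained) parameters: 2-D state, 8 hidden units.
rng = np.random.default_rng(0)
params = (rng.normal(scale=0.5, size=(8, 3)), np.zeros(8),
          rng.normal(scale=0.5, size=(2, 8)), np.zeros(2))

h_in = np.array([1.0, -0.5])
h_out = node_forward(h_in, params)
print("input:", h_in, "output:", h_out)
```

The learned object is the vector field itself, evaluated wherever the solver requests it, rather than a fixed stack of layers. This is also why the fitted dynamics are hard to read off: they are encoded in the weights `W1` and `W2`, not in a short symbolic expression.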

From a statistical perspective, a key element that enables the learning of the underlying mechanisms is whether, in addition to observational data (e.g. electronic-health records), interventional data are available. As an example of interventional data, perturbational CRISPR-Ko in single-cell biology [119] already includes genome-wide perturbational experiments, featuring millions of perturbations [120]. With increasingly automated setups becoming available [121, 122], a key research question is how to experimentally design and select the next intervention to facilitate identifying the underlying mechanisms and governing equations. While decades of research have focused on active learning for system identification in linear cases [123], more research is needed to consider learning an underlying equation from data in such a setting when the parametrized model is a NODE and in generally unknown contexts [124, 125, 126].

With no prior knowledge of the functional form of the equations, one often does not observe the variables of interest directly. In these cases, representation learning and latent-space identification are essential for understanding and predicting complex dynamical systems. A common assumption underlying the idea of representation learning is that while the observations might exhibit complex behaviour, the underlying dynamics might be expressible in a simple form in some abstract space. Thus, many techniques for representation learning from high-dimensional data rely on variational autoencoders and latent space embedding by a generative model [127, 128]. By identifying the correct latent spaces, one often hopes to achieve: i) dimensionality reduction, ii) simplifying system characterization, and iii) facilitating interpretation by identifying some form of causal structure [129].
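The assumption that complex observations hide simple latent dynamics can be illustrated with a deliberately linear toy problem. In the sketch below, a two-dimensional damped rotation is observed only through 50 noisy mixed channels; a truncated singular value decomposition then recovers a two-dimensional coordinate system in which a linear model reproduces the true eigenvalues. Linear PCA is used here as a simple stand-in for the variational autoencoders mentioned above, and all dimensions, noise levels and variable names are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden 2-D dynamics: a slowly decaying rotation (hypothetical example).
theta, decay = 0.1, 0.99
A_true = decay * np.array([[np.cos(theta), -np.sin(theta)],
                           [np.sin(theta),  np.cos(theta)]])
T = 500
Z = np.empty((T, 2))
Z[0] = [1.0, 0.0]
for k in range(T - 1):
    Z[k + 1] = A_true @ Z[k]

# Observations: 50 channels, each a noisy linear mixture of the latent state.
C = rng.normal(size=(50, 2))
Y = Z @ C.T + 1e-3 * rng.normal(size=(T, 50))

# Step 1: dimensionality reduction. The singular values reveal two dominant modes.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
Z_hat = Y @ Vt[:2].T                  # latent coordinates (up to a linear change of basis)

# Step 2: fit linear dynamics z[k+1] = A_hat z[k] in the recovered coordinates.
M = np.linalg.lstsq(Z_hat[:-1], Z_hat[1:], rcond=None)[0]
A_hat = M.T

def sorted_eigs(A):
    """Eigenvalues ordered by imaginary part, for a stable comparison."""
    lam = np.linalg.eigvals(A)
    return lam[np.argsort(lam.imag)]

eig_true, eig_hat = sorted_eigs(A_true), sorted_eigs(A_hat)
print("leading singular values:", s[:4])
print("true eigenvalues:     ", eig_true)
print("recovered eigenvalues:", eig_hat)
```

The recovered coordinates differ from the true latent variables by an unknown linear transformation, so only basis-independent quantities such as eigenvalues are compared. This is a simple instance of the identifiability question that motivates the search for structured and causal latent representations.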

Causality and causal formulations of dynamical systems [130] have been of significant interest in the past years, especially in Earth sciences [131, 132, 133] and molecular biology [134, 135]. Using data from biology, Lippe et al. [136] identified causal ODEs using invariance from heterogeneous experiments as a learning signal, while other approaches focus on understanding latent causal factors of a dynamical system by learning disentangled representations from time series [137, 138]. These approaches aim to model not only the latent space but also the temporal dependencies within a sequence [139, 140]. Learning these structured latent spaces is of crucial importance since it provides an effective coordinate system in which the dynamics have a simple representation, which is a key requirement for generalization and interpretability [141]. In Figure 4 we provide a schematic representation of the identification of an underlying causal structure from observations where the variables of interest are not directly observed.

Orthogonal to the approaches of using AI to learn and identify dynamical systems is the current research trend of investigating different ODEs and solvers for improving the general class of diffusion models [142]. The model class of diffusion, denoising autoencoders, or flow-matching models [143] has the key advantage that it is amenable to theoretical investigations while providing state-of-the-art results in various domains. This is an example where not only can system identification profit from machine learning, but general ML approaches also profit from insights and cross-pollination with control theory and existing knowledge of dynamical systems [144, 145]. While these models have found widespread success not only in image generation [146] but also in many scientific applications from protein modeling [147] to materials science [148], we currently have very limited causal or scientific insights from these large pre-trained models [149], which may provide ample opportunities for further research.

In acquiring data from complex systems with unknown governing equations, our measurements are often indirect and incomplete, so that they require extensive processing before they are suitable for modeling and learning using the approaches described above. In such cases, applications of ML have critically facilitated data collection and have the potential to further advance scientific discovery by expanding the realm of the observable. For instance, where neural recordings are incomplete or corrupted, machine-learning approaches to imputation and generative modeling can produce more complete time-series data that improve downstream applications [150, 151, 152]. In large-scale image analysis, several notable examples highlight how once-laborious manual annotation has been successfully automated by advanced computer vision, turning large quantities of unstructured images or movies into structured scientific data. In microscopy, automated segmentation and profiling have made tractable the analysis of heterogeneous cell populations as well as their development in time [153, 154]. Automated segmentation and image analysis have also enabled the assembly of high-value datasets for the community, including brain connectomes at cellular and synaptic resolution [155, 108, 156, 109, 110]. In animal behavior and ethology, tracking the body movements of one or many individuals transforms video data into analyzable kinematics and poses, which can then be related to the underlying neural computation and to their impact on social interactions and behavior [157, 158, 159, 160, 161]. Taken together, it is hard to overstate the ongoing impact of machine learning as a critical tool that, when used in conjunction with other approaches, catalyzes scientific discovery.

Conclusions and outlook

Machine-learning methods have already enabled several scientific and technological advancements that were still considered out of reach just a few years back, from effectively solving image classification and segmentation to winning the game of Go. In this article, we have illustrated how contemporary developments in data-driven methods now provide unique opportunities enabling fundamental scientific discovery, focusing on state-of-the-art methods that provide means for pushing the frontiers of science beyond what has been achievable so far. To provide concrete examples of this potential, we have categorised available approaches according to the level of classical scientific knowledge available in the problems to address. These range from settings where governing equations are well known and partially understood to settings where virtually nothing is formally known about the underlying governing equations and principles. As highlighted in the summarizing Table 1, the potential of ML for performing new scientific discoveries spans various methodological approaches. These can enable scientific discovery across a wide spectrum, such as revealing underlying governing principles for physical systems, enabling parametric evaluation of highly complex systems, inducing unknown behaviour in systems for which we only have partial deterministic understanding, and causally inferring principles in systems of complex multiscale nature. The potential of ML for performing scientific discoveries covers a wide range of application areas, from theoretical to applied settings, across Physics, Mathematics, Chemistry and Life Sciences. We argue that continued focus on such discovery tasks represents one of the more fruitful and beneficial uses of machine learning in modern-day scientific practice, enabling scientific leaps in time frames previously unimaginable.

Pushing the scientific frontiers using data-driven methods, however, comes with its own set of challenges. Whilst a clear benefit of ML approaches is associated with their ability to learn and efficiently represent complex systems, this more often than not requires a significant amount of training data to successfully capture the underlying phenomena. In some instances data availability might not be a primary hurdle; in scientific discovery, however, data scarcity is frequently an obstacle, for example when causally describing infrequent astrophysical events, inferring descriptions of yet-to-be-tested (or yet-to-be-discovered) pharmaceutical agents, or achieving a mechanistic understanding of pathophysiological phenomena of rare diseases. Interestingly, complementary ML approaches can either facilitate the creation of suitable training data (e.g. speeding up computational simulations), or avoid the need for extensive datasets (e.g. by self-supervised out-of-distribution generalization, dimension folding, etc.). As a consequence, even ML techniques that do not lead to scientific discovery per se might facilitate discovery in instances where available training data are scarce. The validation of ML-driven scientific discovery is another challenge, since the underlying governing principles needed to confirm output descriptions might not exist. However, validation of ML-driven discoveries should not be viewed any differently than validation of deterministically driven discoveries, where hypothesis-driven investigation, observational confirmation and comparative benchmarking can all help to confirm inferred behaviour or descriptions.

Lastly, when employing ML for scientific discovery, a challenge lies in overcoming the black-box nature of standard applied deep-learning frameworks to gain formal knowledge. A black-box setup utilizing a trained network is by construction not directly open for human insight [18]: in particular, internalized neural-network data are hidden in the network weights and not easily interpreted, making it challenging to gain formal knowledge and understanding. In the context of scientific discovery, this is a major challenge for the formalization of knowledge, as well as for public dissemination and acceptance. Both explainable [36] and interpretable [162] machine learning offer an alternative in which human oversight is preserved; however, the development and applicability of these methods are still challenging in many applications. If explainable and/or interpretable ML can be deployed, the discovered new knowledge or proposed scientific advances can be grounded and validated in existing notions and know-how, securing scientifically sound knowledge progression.

Scientific communities across a wide range of disciplines are now, following the rapid development of ML techniques, benefiting from incorporating these methods into their research methodology and toolbox, opening new opportunities for scientific discoveries. While this comes with new challenges, we believe that the recent development shows that we can expect further evolution of ML techniques adapted to various specific needs across disciplines. These are likely to provide opportunities to mitigate the limitations outlined in recent works, enabling new discoveries to take place.

References

  • [1] Deng, J. et al. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (2009).
  • [2] Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
  • [3] Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).
  • [4] Vinuesa, R. et al. The role of artificial intelligence in achieving the Sustainable Development Goals. Nature Communications 11, 233 (2020).
  • [5] Romera-Paredes, B. et al. Mathematical discoveries from program search with large language models. Nature 625, 468–475 (2024).
  • [6] Guastoni, L. et al. Convolutional-network models to predict wall-bounded turbulence from wall quantities. Journal of Fluid Mechanics 928, A27 (2021).
  • [7] Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
  • [8] Fawzi, A. et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610, 47–53 (2022).
  • [9] Trinh, T. H., Wu, Y., Le, Q. V., He, H. & Luong, T. Solving olympiad geometry without human demonstrations. Nature 625, 476–482 (2024).
  • [10] Krishna, R. et al. Generalized biomolecular modeling and design with RoseTTAFold all-atom. Science eadl2528, DOI: 10.1126/science.adl2528 (2024).
  • [11] Camps-Valls, G. et al. Discovering causal relations and equations from data. Physics Reports 1044, 1–68 (2023).
  • [12] Castillo, M. The scientific method: a need for something better? American Journal of Neuroradiology 34, 1669–1671 (2013).
  • [13] Wigner, E. P. The unreasonable effectiveness of mathematics in the natural sciences. In Mathematics and Science, 291–306 (World Scientific, 1990).
  • [14] Degrave, J. et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 602, 414–419 (2022).
  • [15] Wong, C. How AI is improving climate forecasts. Nature (2024).
  • [16] Wong, F. et al. Discovery of a structural class of antibiotics with explainable deep learning. Nature 626, 177–185, DOI: 10.1038/s41586-023-06887-8 (2024).
  • [17] Hoffman, R. R., Mueller, S. T., Klein, G. & Litman, J. Metrics for explainable AI: Challenges and prospects. Preprint arXiv:1812.04608 (2018).
  • [18] Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 206–215 (2019).
  • [19] Ferreira, C. Gene expression programming: mathematical modeling by an artificial intelligence, vol. 21 (Springer, 2006).
  • [20] Schmidt, M. & Lipson, H. Distilling free-form natural laws from experimental data. Science 324, 81–85 (2009).
  • [21] McConaghy, T. FFX: Fast, scalable, deterministic symbolic regression technology. Genetic Programming Theory and Practice IX 235–260 (2011).
  • [22] Brunton, S. L., Proctor, J. L. & Kutz, J. N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113, 3932–3937 (2016).
  • [23] Fuentes, R. et al. Equation discovery for nonlinear dynamical systems: A Bayesian viewpoint. Mechanical Systems and Signal Processing 154, 107528 (2021).
  • [24] Langley, P., Bradshaw, G. L. & Simon, H. A. BACON.5: The discovery of conservation laws. In IJCAI, vol. 81, 121–126 (1981).
  • [25] Guimerà, R. et al. A Bayesian machine scientist to aid in the solution of challenging scientific problems. Science Advances 6, eaav6971 (2020).
  • [26] Merchant, A. et al. Scaling deep learning for materials discovery. Nature 624, 80–85, DOI: 10.1038/s41586-023-06735-9 (2023).
  • [27] Leslie, D. Does the sun rise for ChatGPT? Scientific discovery in the age of generative AI. AI and Ethics 1–6 (2023).
  • [28] Halevy, A., Norvig, P. & Pereira, F. The unreasonable effectiveness of data. IEEE Intelligent Systems 24, 8–12 (2009).
  • [29] Sejnowski, T. J. The unreasonable effectiveness of deep learning in artificial intelligence. Proceedings of the National Academy of Sciences 117, 30033–30038 (2020).
  • [30] Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).
  • [31] Zenil, H. et al. The future of fundamental science led by generative closed-loop artificial intelligence. Preprint arXiv:2307.07522 (2023).
  • [32] Birhane, A., Kasirzadeh, A., Leslie, D. & Wachter, S. Science in the age of large language models. Nature Reviews Physics 5, 277–280 (2023).
  • [33] Fajardo-Fontiveros, O. et al. Fundamental limits to learning closed-form mathematical models from data. Nature Communications 14, 1043 (2023).
  • [34] Chapman, S. & Cowling, T. G. The mathematical theory of non-uniform gases: an account of the kinetic theory of viscosity, thermal conduction and diffusion in gases (Cambridge University Press, 1990).
  • [35] Fefferman, C. L. Existence and smoothness of the Navier-Stokes equation. The Millennium Prize Problems 57, 67 (2000).
  • [36] Cremades, A. et al. Identifying regions of importance in wall-bounded turbulence through explainable deep learning. Nature Communications, To Appear. Preprint arXiv:2302.01250 (2023).
  • [37] Encinar, M. P. & Jiménez, J. Identifying causally significant features in three-dimensional isotropic turbulence. Journal of Fluid Mechanics 965, A20 (2023).
  • [38] Lozano-Durán, A. & Arranz, G. Information-theoretic formulation of dynamical systems: Causality, modeling, and control. Physical Review Research 4, 023195 (2022).
  • [39] Duraisamy, K., Iaccarino, G. & Xiao, H. Turbulence modeling in the age of data. Annual Review of Fluid Mechanics 51, 357–377 (2019).
  • [40] Brunton, S. L., Noack, B. R. & Koumoutsakos, P. Machine learning for fluid mechanics. Annual Review of Fluid Mechanics 52, 477–508 (2020).
  • [41] Schmelzer, M., Dwight, R. P. & Cinnella, P. Discovery of algebraic Reynolds-stress models using sparse symbolic regression. Flow, Turbulence and Combustion 104, 579–603 (2020).
  • [42] Beetham, S. & Capecelatro, J. Formulating turbulence closures using sparse regression with embedded form invariance. Physical Review Fluids 5, 084611 (2020).
  • [43] Beetham, S., Fox, R. O. & Capecelatro, J. Sparse identification of multiphase turbulence closures for coupled fluid–particle flows. Journal of Fluid Mechanics 914 (2021).
  • [44] Zanna, L. & Bolton, T. Data-driven equation discovery of ocean mesoscale closures. Geophysical Research Letters 47, e2020GL088376 (2020).
  • [45] Bongard, J. & Lipson, H. Automated reverse engineering of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 104, 9943–9948 (2007).
  • [46] Cranmer, M. Interpretable machine learning for science with PySR and SymbolicRegression.jl. Preprint arXiv:2305.01582 (2023).
  • [47] Bar-Sinai, Y., Hoyer, S., Hickey, J. & Brenner, M. P. Learning data-driven discretizations for partial differential equations. Proceedings of the National Academy of Sciences 116, 15344–15349 (2019).
  • [48] Kochkov, D. et al. Machine learning accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences 118, e2101784118 (2021).
  • [49] Tamayo, D. et al. Predicting the long-term stability of compact multiplanet systems. Proceedings of the National Academy of Sciences 117, 18194–18205 (2020).
  • [50] Sipp, D., Marquet, O., Meliga, P. & Barbagallo, A. Dynamics and Control of Global Instabilities in Open-Flows: A Linearized Approach. Applied Mechanics Reviews 63, 030801, DOI: 10.1115/1.4001478 (2010).
  • [51] Al-Housseiny, T. T., Tsai, P. A. & Stone, H. A. Control of interfacial instabilities using flow geometry. Nature Physics 8, 747–750 (2012).
  • [52] Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
  • [53] Carleo, G. & Troyer, M. Solving the quantum many-body problem with artificial neural networks. Science 355, 602–606 (2017).
  • [54] Nousiainen, J., Rajani, C., Kasper, M. & Helin, T. Adaptive optics control using model-based reinforcement learning. Optics Express 29, 15327–15344 (2021).
  • [55] Guastoni, L., Rabault, J., Schlatter, P., Azizpour, H. & Vinuesa, R. Deep reinforcement learning for turbulent drag reduction in channel flows. The European Physical Journal E 46, 27 (2023).
  • [56] Seo, J. et al. Feedforward beta control in the KSTAR tokamak by deep reinforcement learning. Nuclear Fusion 61, 106010 (2021).
  • [57] Beeler, C. et al. Optimizing thermodynamic trajectories using evolutionary and gradient-based reinforcement learning. Physical Review E 104, 064128 (2021).
  • [58] Solera-Rico, A. et al. β-Variational autoencoders and transformers for reduced-order modelling of fluid flows. Nature Communications 15, 1361 (2024).
  • [59] Wiewel, S., Becher, M. & Thuerey, N. Latent space physics: Towards learning the temporal evolution of fluid flow. In Computer Graphics Forum, vol. 38, 71–82 (Wiley Online Library, 2019).
  • [60] Park, S. et al. Optimization of physical quantities in the autoencoder latent space. Scientific Reports 12, 9003 (2022).
  • [61] A detailed map of Higgs boson interactions by the ATLAS experiment ten years after the discovery. Nature 607, 52–59 (2022).
  • [62] Sjöstrand, T., Mrenna, S. & Skands, P. Pythia 6.4 physics and manual. Journal of High Energy Physics 2006, 026 (2006).
  • [63] ATLAS Collaboration. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B 716, 1–29 (2012).
  • [64] Goodfellow, I. et al. Generative adversarial networks. Preprint arXiv:1406.2661 (2014).
  • [65] Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint arXiv:1312.6114 (2014).
  • [66] Albertsson, K. et al. Machine learning in high energy physics community white paper. In Journal of Physics: Conference Series, vol. 1085, 022008 (IOP Publishing, 2018).
  • [67] Pathak, J. et al. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. Preprint arXiv:2202.11214 (2022).
  • [68] Neumann, J. v. Theory of self-reproducing automata. Edited by Arthur W. Burks (1966).
  • [69] Souza, L. F., Rocha Filho, T. M. & Moret, M. A. Relating SARS-CoV-2 variants using cellular automata imaging. Scientific Reports 12, 10297 (2022).
  • [70] Alert, R., Casademunt, J. & Joanny, J.-F. Active turbulence. Annual Review of Condensed Matter Physics 13, 143–170 (2022).
  • [71] Cranmer, M. et al. Discovering symbolic models from deep learning with inductive biases. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada (2020).
  • [72] Liu, Z., Chen, Y., Du, Y. & Tegmark, M. Physics-augmented learning: A new paradigm beyond physics-informed learning. Preprint arXiv:2109.13901 (2021).
  • [73] Moya, B., Badías, A., González, D., Chinesta, F. & Cueto, E. A thermodynamics-informed active learning approach to perception and reasoning about fluids. Computational Mechanics 72, 577–591 (2023).
  • [74] Kapteyn, M. G., Pretorius, J. V. R. & Willcox, K. E. A probabilistic graphical model foundation for enabling predictive digital twins at scale. Nature Computational Science 1, 337–347 (2021).
  • [75] Kneer, S., Sayadi, T., Sipp, D., Schmid, P. & Rigas, G. Symmetry-aware autoencoders: s-PCA and s-NLPCA. Preprint arXiv:2111.02893v3 (2022).
  • [76] Otto, S. E., Zolman, N., Kutz, J. N. & Brunton, S. L. A unified framework to enforce, discover, and promote symmetry in machine learning. Preprint arXiv:2311.00212 (2023).
  • [77] Flaschel, M., Kumar, S. & De Lorenzis, L. Automated discovery of generalized standard material models with EUCLID. Computer Methods in Applied Mechanics and Engineering 405, 115867 (2023).
  • [78] Wang, M., Chen, C. & Liu, W. Establish algebraic data-driven constitutive models for elastic solids with a tensorial sparse symbolic regression method and a hybrid feature selection technique. Journal of the Mechanics and Physics of Solids 159, 104742 (2022).
  • [79] Im, S., Kim, H., Kim, W., Chung, H. & Cho, M. Discovering constitutive equations of crystal structures by sparse identification. International Journal of Mechanical Sciences 236, 107756 (2022).
  • [80] As’ad, F., Avery, P. & Farhat, C. A mechanics-informed artificial neural network approach in data-driven constitutive modeling. International Journal for Numerical Methods in Engineering 123, 2738–2759 (2022).
  • [81] Mahmoudabadbozchelou, M., Kamani, K. M., Rogers, S. A. & Jamali, S. Digital rheometer twins: Learning the hidden rheology of complex fluids through rheology-informed graph neural networks. Proceedings of the National Academy of Sciences 119, e2202234119 (2022).
  • [82] Bakarji, J. & Tartakovsky, D. M. Data-driven discovery of coarse-grained equations. Journal of Computational Physics 434, 110219 (2021).
  • [83] Kacprzyk, K., Qian, Z. & van der Schaar, M. D-CIPHER: Discovery of closed-form partial differential equations. Preprint arXiv:2206.10586 (2023).
  • [84] Adcock, S. A. & McCammon, J. A. Molecular dynamics: survey of methods for simulating the activity of proteins. Chemical Reviews 106, 1589–1615 (2006).
  • [85] Freddolino, P. L., Harrison, C. B., Liu, Y. & Schulten, K. Challenges in protein-folding simulations. Nature Physics 6, 751–758 (2010).
  • [86] Glielmo, A. et al. Unsupervised learning methods for molecular simulation data. Chemical Reviews 121, 9722–9758 (2021).
  • [87] Noé, F., Tkatchenko, A., Müller, K.-R. & Clementi, C. Machine learning for molecular simulation. Annual Review of Physical Chemistry 71, 361–390 (2020).
  • [88] Noé, F., Olsson, S., Köhler, J. & Wu, H. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science 365, eaaw1147 (2019).
  • [89] Noé, F., Tkatchenko, A., Müller, K.-R. & Clementi, C. Machine learning for molecular simulation. Annual Review of Physical Chemistry 71, 361–390 (2020).
  • [90] Nagai, R., Akashi, R. & Sugino, O. Completing density functional theory by machine learning hidden messages from molecules. npj Computational Materials 6, 43 (2020).
  • [91] Corso, G., Stärk, H., Jing, B., Barzilay, R. & Jaakkola, T. DiffDock: Diffusion steps, twists, and turns for molecular docking. Preprint arXiv:2210.01776 (2023).
  • [92] Cornelio, C. et al. Combining data and theory for derivable scientific discovery with AI-Descartes. Nature Communications 14, 1777 (2023).
  • [93] Ma, H., Dong, D., Ding, S. X. & Chen, C. Curriculum-based deep reinforcement learning for quantum control. IEEE Transactions on Neural Networks and Learning Systems (2022).
  • [94] Melnikov, A. A. et al. Active learning machine learns to create new quantum experiments. Proceedings of the National Academy of Sciences 115, 1221–1226 (2018).
  • [95] Melnikov, A., Kordzanganeh, M., Alodjants, A. & Lee, R.-K. Quantum machine learning: from physics to software engineering. Advances in Physics: X 8, 2165452 (2023).
  • [96] Frezat, H., Sommer, J., Fablet, R., Balarac, G. & Lguensat, R. A posteriori learning for quasi-geostrophic turbulence parametrization. Journal of Advances in Modeling Earth Systems 14, e2022MS003124 (2022).
  • [97] Molina, M. J. et al. A review of recent and emerging machine learning applications for climate variability and weather phenomena. \JournalTitleArtificial Intelligence for the Earth Systems 2, 220086 (2023).
  • [98] de Burgh-Day, C. O. & Leeuwenburg, T. Machine learning for numerical weather and climate modelling: a review. \JournalTitleGeoscientific Model Development 16, 6433–6477 (2023).
  • [99] Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. \JournalTitleScience 379, 1123–1130 (2023).
  • [100] Elnaggar, A. et al. Prottrans: Toward understanding the language of life through self-supervised learning. \JournalTitleIEEE Trans Pattern Anal Mach Intell 44, 7112–7127, DOI: 10.1109/TPAMI.2021.3095381 (2022).
  • [101] Madani, A. et al. Large language models generate functional protein sequences across diverse families. \JournalTitleNat Biotechnol 41, 1099–1106, DOI: 10.1038/s41587-022-01618-2 (2023).
  • [102] Dauparas, J. et al. Robust deep learning-based protein sequence design using ProteinMPNN. \JournalTitleScience 378, 49–56, DOI: 10.1126/science.add2187 (2022).
  • [103] Watson, J. et al. De novo design of protein structure and function with RFdiffusion. \JournalTitleNature 620, 1089–1100, DOI: 10.1038/s41586-023-06415-8 (2023).
  • [104] Ingraham, J. et al. Illuminating protein space with a programmable generative model. \JournalTitleNature 623, 1070–1078, DOI: 10.1038/s41586-023-06728-8 (2023).
  • [105] **g, B., Berger, B. & Jaakkola, T. Alphafold meets flow matching for generating protein ensembles. \JournalTitlePreprint arXiv:2402.04845 (2024). 2402.04845.
  • [106] Steinmetz, N. A., Zatka-Haas, P., Carandini, M. & Harris, K. D. Distributed coding of choice, action and engagement across the mouse brain. Nature 576, 266–273, DOI: 10.1038/s41586-019-1787-x (2019).
  • [107] Zhou, Y. et al. Distributed functions of prefrontal and parietal cortices during sequential categorical decisions. eLife 10, e58782, DOI: 10.7554/eLife.58782 (2021).
  • [108] Dorkenwald, S. et al. Flywire: online community for whole-brain connectomics. Nature methods 19, 119–128 (2022).
  • [109] Yao, S. et al. A whole-brain monosynaptic input connectome to neuron classes in mouse visual cortex. Nature neuroscience 26, 350–364 (2023).
  • [110] Consortium, M. et al. Functional connectomics spanning multiple areas of mouse visual cortex. BioRxiv 2021–07 (2021).
  • [111] Nelson, M. & Rinzel, J. The Hodgkin–Huxley model. The Book of Genesis (1995).
  • [112] Chen, Y., Luo, Y., Liu, Q., Xu, H. & Zhang, D. Symbolic genetic algorithm for discovering open-form partial differential equations (sga-pde). Physical Review Research 4, 023174 (2022).
  • [113] Du, M., Chen, Y. & Zhang, D. Discover: Deep identification of symbolic open-form pdes via enhanced reinforcement-learning. Preprint arXiv:2210.02181 (2022).
  • [114] Chen, R. T., Rubanova, Y., Bettencourt, J. & Duvenaud, D. K. Neural ordinary differential equations. Advances in neural information processing systems 31 (2018).
  • [115] Becker, S., Klein, M., Neitz, A., Parascandolo, G. & Kilbertus, N. Predicting ordinary differential equations with transformers. In International Conference on Machine Learning, 1978–2002 (PMLR, 2023).
  • [116] Sahoo, S., Lampert, C. & Martius, G. Learning equations for extrapolation and control. In International Conference on Machine Learning, 4442–4450 (PMLR, 2018).
  • [117] Qiu, S. et al. Development and validation of an interpretable deep learning framework for alzheimer’s disease classification. Brain 143, 1920–1933 (2020).
  • [118] Jin, T., Nguyen, N. D., Talos, F. & Wang, D. Ecmarker: interpretable machine learning model identifies gene expression biomarkers predicting clinical outcomes and reveals molecular mechanisms of human disease in early stages. Bioinformatics 37, 1115–1124 (2021).
  • [119] Ji, Y., Lotfollahi, M., Wolf, F. A. & Theis, F. J. Machine learning for perturbational single-cell omics. Cell Systems 12, 522–537 (2021).
  • [120] Gasperini, M. et al. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell 176, 377–390 (2019).
  • [121] Häse, F., Roch, L. M. & Aspuru-Guzik, A. Next-generation experimentation with self-driving laboratories. Trends in Chemistry 1, 282–291 (2019).
  • [122] MacLeod, B. P. et al. A self-driving laboratory advances the pareto front for material properties. Nat. Commun. 13, 995 (2022).
  • [123] Wagenmaker, A. & Jamieson, K. Active learning for identification of linear dynamical systems. In Conference on Learning Theory, 3487–3582 (PMLR, 2020).
  • [124] Pauwels, E., Lajaunie, C. & Vert, J.-P. A Bayesian active learning strategy for sequential experimental design in systems biology. BMC Systems Biology 8, 1–11 (2014).
  • [125] Du, J., Futoma, J. & Doshi-Velez, F. Model-based reinforcement learning for semi-Markov decision processes with neural ODEs. Advances in Neural Information Processing Systems 33, 19805–19816 (2020).
  • [126] Wu, K., O’Leary-Roseberry, T., Chen, P. & Ghattas, O. Large-scale Bayesian optimal experimental design with derivative-informed projected neural network. Journal of Scientific Computing 95, 30 (2023).
  • [127] Bengio, Y., Courville, A. & Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 1798–1828 (2013).
  • [128] Pandarinath, C. et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature methods 15, 805–815 (2018).
  • [129] Schölkopf, B. et al. Toward causal representation learning. Proceedings of the IEEE 109, 612–634 (2021).
  • [130] Peters, J., Bauer, S. & Pfister, N. Causal models for dynamical systems. In Probabilistic and Causal Inference: The Works of Judea Pearl, 671–690 (2022).
  • [131] Runge, J. et al. Inferring causation from time series in earth system sciences. Nat. Commun. 10, 2553 (2019).
  • [132] Nowack, P., Runge, J., Eyring, V. & Haigh, J. D. Causal networks for climate model evaluation and constrained projections. Nature communications 11, 1415 (2020).
  • [133] Nguyen, T., Brandstetter, J., Kapoor, A., Gupta, J. K. & Grover, A. Climax: A foundation model for weather and climate. Preprint arXiv:2301.10343 (2023).
  • [134] Tejada-Lapuerta, A. et al. Causal machine learning for single-cell genomics. Preprint arXiv:2310.14935 (2023).
  • [135] Lobentanzer, S., Rodriguez-Mier, P., Bauer, S. & Saez-Rodriguez, J. Molecular causality in the advent of foundation models. Preprint arXiv:2401.09558 (2024).
  • [136] Pfister, N., Bauer, S. & Peters, J. Learning stable and predictive structures in kinetic systems. Proceedings of the National Academy of Sciences 116, 25405–25411 (2019).
  • [137] Lippe, P. et al. Citris: Causal identifiability from temporal intervened sequences. In International Conference on Machine Learning, 13557–13603 (PMLR, 2022).
  • [138] Song, X. et al. Temporally disentangled representation learning under unknown nonstationarity. Advances in Neural Information Processing Systems 36 (2024).
  • [139] Yildiz, C., Heinonen, M. & Lahdesmaki, H. Ode2vae: Deep generative second order odes with bayesian neural networks. Advances in Neural Information Processing Systems 32 (2019).
  • [140] Girin, L. et al. Dynamical variational autoencoders: A comprehensive review. Preprint arXiv:2008.12595 (2020).
  • [141] Champion, K., Lusch, B., Kutz, J. N. & Brunton, S. L. Data-driven discovery of coordinates and governing equations. Proceedings of the National Academy of Sciences 116, 22445–22451 (2019).
  • [142] Lu, C. et al. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems 35, 5775–5787 (2022).
  • [143] Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M. & Le, M. Flow matching for generative modeling. Preprint arXiv:2210.02747 (2022).
  • [144] Berner, J., Richter, L. & Ullrich, K. An optimal control perspective on diffusion-based generative modeling. Preprint arXiv:2211.01364 (2022).
  • [145] Karras, T. et al. Analyzing and improving the training dynamics of diffusion models. Preprint arXiv:2312.02696 (2023).
  • [146] Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10684–10695 (2022).
  • [147] Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).
  • [148] Zeni, C. et al. Mattergen: a generative model for inorganic materials design. Preprint arXiv:2312.03687 (2023).
  • [149] Nichani, E., Damian, A. & Lee, J. D. How transformers learn causal structure with gradient descent. Preprint arXiv:2402.14735 (2024).
  • [150] Peterson, S. M. et al. Ajile12: Long-term naturalistic human intracranial neural recordings and pose. Scientific data 9, 184 (2022).
  • [151] Talukder, S., Sun, J. J., Leonard, M., Brunton, B. W. & Yue, Y. Deep neural imputation: A framework for recovering incomplete brain recordings. Preprint arXiv:2206.08094 (2022).
  • [152] Vetter, J., Macke, J. H. & Gao, R. Generating realistic neurophysiological time series with denoising diffusion probabilistic models. bioRxiv 2023–08 (2023).
  • [153] Kirillov, A. et al. Segment anything. Preprint arXiv:2304.02643 (2023).
  • [154] Greenwald, N. F. et al. Whole-cell segmentation of tissue images with human-level performance using large-scale data annotation and deep learning. Nature biotechnology 40, 555–565 (2022).
  • [155] Scheffer, L. K. et al. A connectome and analysis of the adult drosophila central brain. eLife 9, e57443 (2020).
  • [156] Takemura, S.-Y. et al. A connectome of the male drosophila ventral nerve cord. bioRxiv 2023–06 (2023).
  • [157] Mathis, A. et al. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience 21, 1281–1289, DOI: 10.1038/s41593-018-0209-y (2018).
  • [158] Pereira, T. D. et al. SLEAP: A deep learning system for multi-animal pose tracking. Nature Methods 19, 486–495, DOI: 10.1038/s41592-022-01426-1 (2022).
  • [159] Karashchuk, P. et al. Anipose: A toolkit for robust markerless 3D pose estimation. Cell Reports 36, 109730, DOI: 10.1016/j.celrep.2021.109730 (2021).
  • [160] Schweihoff, J. F. et al. DeepLabStream enables closed-loop behavioral experiments using deep learning-based markerless, real-time posture detection. Communications Biology 4, 1–11, DOI: 10.1038/s42003-021-01654-9 (2021).
  • [161] Dunn, T. W. et al. Geometric deep learning enables 3D kinematic profiling across species and environments. Nature Methods 18, 564–573, DOI: 10.1038/s41592-021-01106-6 (2021).
  • [162] Vinuesa, R. & Sirmacek, B. Interpretable deep-learning models to help achieve the Sustainable Development Goals. Nature Machine Intelligence 3, 926 (2021).
  • [163] Vinuesa, R. et al. Turbulent boundary layers around wing sections up to $Re_{c}=1{,}000{,}000$. International Journal of Heat and Fluid Flow 72, 86–99 (2018).
  • [164] Institute, S. T. S. Dark matter even darker than once thought. Science Release – ESA (2015).
  • [165] Fleming, N. Computer-calculated compounds. Nature 557, S55–S57 (2018).
  • [166] Eivazi, H., Le Clainche, S., Hoyas, S. & Vinuesa, R. Towards extraction of orthogonal and parsimonious non-linear modes from turbulent flows. Expert Systems with Applications 202, 117038 (2022).
  • [167] Suárez, P. et al. Active flow control for three-dimensional cylinders through deep reinforcement learning. Preprint arXiv:2309.02462 (2023).

Acknowledgements

We thank the following researchers for helpful discussions during the preparation of this article: Frida Bender, Annica Ekman, Inga Koszalka, Romit Maulik, Henrik Nielsen, Gunilla Svensson, Björn Wallner. RV and HA acknowledge SeRC and Digital Futures for funding the workshop that initiated this work. RV acknowledges financial support from ERC grant no. 2021-CoG-101043998, DEEPCONTROL. DM acknowledges financial support from ERC grant no. 2022-StG-101075494, MultiPRESS. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. AE was funded by Vetenskapsrådet (grant no. 2021-03979), the Knut and Alice Wallenberg Foundation, and SeRC. SLB acknowledges funding support from the US National Science Foundation AI Institute in Dynamic Systems (grant number 2112085) and from The Boeing Company.

Author contributions

RV and HA initiated the idea for this article following a workshop held at KTH in November 2022. All authors contributed equally to the rest of this work.

Competing interests

The authors declare no competing interests.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figure 1: Schematic representation of the various applications of ML for scientific discovery, depending on the amount of knowledge available in each category. A number of examples are provided, including turbulent flows [163], dark matter [164], drug discovery [165] and brain research. Other relevant figures were adapted from Refs. [166, 71].
Figure 2: Schematic representation of ML directions to enable scientific discoveries when complete information about the governing equations is available. In such a case, supervised, unsupervised, and reinforcement-learning methodologies can all be leveraged. Supervised and unsupervised methodologies are made possible by the ability to generate large synthetic datasets by simulating the governing equations. This makes it possible to deploy a variety of ML techniques that can discover complex hidden relations, nonlinear coordinate systems, and hidden dynamics, or solve problems that are otherwise intractable. Reinforcement learning can also be used by coupling it to the physics simulator. This approach has already proven successful at discovering previously unknown control strategies and regimes of complex systems, and it can generate high-quality heuristic guesses for problems where verifying a solution is easy but proposing good candidate solutions is hard. Some panels were adapted from Refs. [167, 7].
Figure 3: Example of machine learning applied to a case where partial knowledge is available about the underlying system, illustrating a model (for instance a flow with complex rheology or a flow through a porous medium) which depends on a set of known inputs $\mathbf{x}$ (e.g. geometry, boundary conditions, etc.) as well as on a set of hidden (unobservable) variables $\alpha$ describing, e.g., the fluid constitutive behavior. The latter may involve small-scale phenomena that can be difficult or impossible to describe. In such conditions, experimental or numerical data for observable quantities (e.g. velocity fields or stresses) can be used to infer the unknown field by training a machine learning model (here represented as a neural network, although other ML approaches are possible). The model is subjected to a set of available physical constraints (e.g. positivity, symmetries or invariances). The whole process makes it possible, on the one hand, to train a data-driven closure model for the hidden variables $\alpha$ and, on the other hand, to gain a-posteriori physical knowledge of the fluid constitutive properties.
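The workflow summarized in the Figure 3 caption can be illustrated with a deliberately small sketch. Everything here is a hypothetical stand-in rather than a method from the article: the observable `tau = alpha(x) * s`, the exponential form of the closure, and the learning rate are all assumptions chosen to keep the example short. The point it shows is that a positivity constraint can be enforced by construction, by parameterizing the hidden coefficient as `exp(a + b * x)`, while the parameters are fitted to observable data only.

```python
import numpy as np

# Hypothetical setup: the observable tau = alpha(x) * s depends on a known
# input x, a known forcing s, and a hidden positive coefficient alpha(x).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
s = rng.uniform(0.5, 1.5, size=200)
alpha_true = np.exp(0.5 - 1.0 * x)   # ground truth, never shown to the learner
tau_obs = alpha_true * s             # noise-free synthetic observations

# Closure model: alpha_hat(x) = exp(a + b * x), positive by construction.
a, b = 0.0, 0.0
lr = 0.1
for _ in range(10000):
    alpha_hat = np.exp(a + b * x)
    resid = alpha_hat * s - tau_obs
    # Gradients of the mean-squared error with respect to a and b.
    common = 2.0 * resid * alpha_hat * s
    a -= lr * common.mean()
    b -= lr * (common * x).mean()

alpha_hat = np.exp(a + b * x)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Inspecting the fitted `a` and `b` afterwards is a toy analogue of gaining a-posteriori knowledge about the hidden property. A realistic closure would replace the two-parameter model with a neural network and the algebraic relation with a simulator.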
Figure 4: Schematic representation of a model (for instance the observed symptoms of an unknown or complex disease within a population, or observed opinion dynamics within a social network) where the behavior observed in data depends on an unknown dynamic or causal structure. The observed behavior or dynamics might occur on several different spatial and temporal scales, and the observed data might capture the underlying system more or less completely. In such conditions, representation-learning methods can be employed to distill an explanation of the observed data, in the form of a system of ODEs or a causal-graph representation.
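One simple way to distill a system of ODEs from data, in the spirit of the Figure 4 caption, is sparse regression over a library of candidate terms. The sketch below uses sequentially thresholded least squares on a hypothetical damped oscillator. The system, the polynomial library, and the threshold are illustrative assumptions, and the time derivatives are taken as known exactly, whereas real measurements would require numerical differentiation and noise handling.

```python
import numpy as np

def library(X):
    """Candidate terms [1, x, y, x^2, x*y, y^2] evaluated at each sample."""
    x, y = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])

def stlsq(Theta, dXdt, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):
            keep = ~small[:, k]
            if keep.any():
                # Refit only the surviving terms for state variable k.
                Xi[keep, k] = np.linalg.lstsq(
                    Theta[:, keep], dXdt[:, k], rcond=None
                )[0]
    return Xi

# Hypothetical ground truth: dx/dt = y, dy/dt = -x - 0.1 * y.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
dXdt = np.column_stack([X[:, 1], -X[:, 0] - 0.1 * X[:, 1]])

Xi = stlsq(library(X), dXdt)
print(Xi)  # rows follow the library order, columns are dx/dt and dy/dt
```

Each nonzero entry of `Xi` names a library term that appears in the recovered equation, so the result can be read directly as a symbolic model. The threshold sets the trade-off between sparsity and fit, and it would need tuning on noisy data.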
Table 1: Overview of the opportunities for machine learning in scientific discovery. Following the distinction presented in our work, the level of prior, deterministic knowledge (left) can be used to differentiate the methods (second from right) and applications (right) through which dedicated AI systems can drive scientific advances. This also allows for various modes of discovery (second from left), ranging from cases where machine learning enables discovery through efficient use of computation, parametric sweeps, etc., to cases where machine learning is used to causally infer underlying mechanistic behaviours in complex multidisciplinary systems.