-
The FLAMINGO Project: A comparison of galaxy cluster samples selected on mass, X-ray luminosity, Compton-Y parameter, or galaxy richness
Authors:
Roi Kugel,
Joop Schaye,
Matthieu Schaller,
Ian G. McCarthy,
Joey Braspenning,
John C. Helly,
Victor J. Forouhar Moreno,
Robert J. McGibbon
Abstract:
Galaxy clusters provide an avenue to expand our knowledge of cosmology and galaxy evolution. Because it is difficult to accurately measure the total mass of a large number of individual clusters, cluster samples are typically selected using an observable proxy for mass. Selection effects are therefore a key problem in understanding galaxy cluster statistics. We make use of the $(2.8~\rm{Gpc})^3$ FLAMINGO hydrodynamical simulation to investigate how selection based on X-ray luminosity, thermal Sunyaev-Zeldovich effect or galaxy richness influences the halo mass distribution. We define our selection cuts based on the median value of the observable at a fixed mass and compare the resulting samples to a mass-selected sample. We find that all samples are skewed towards lower mass haloes. For X-ray luminosity and richness cuts below a critical value, scatter dominates over the trend with mass and the median mass becomes biased increasingly low with respect to a mass-selected sample. At $z\leq0.5$, observable cuts corresponding to median halo masses between $M_\text{500c}=10^{14}$ and $10^{15}~\rm{M_{\odot}}$ give nearly unbiased median masses for all selection methods, but X-ray selection results in biased medians for higher masses. For cuts corresponding to median masses $<10^{14}~\rm{M_{\odot}}$ at $z\leq0.5$ and for all masses at $z\geq1$, only Compton-Y selection yields nearly unbiased median masses. Importantly, even when the median mass is unbiased, the scatter is not, because for each selection the sample is skewed towards lower masses than a mass-selected sample. Each selection leads to a different bias in secondary quantities like cool-core fraction, temperature and gas fraction.
Submitted 5 June, 2024;
originally announced June 2024.
-
Multi-Epoch Machine Learning 2: Identifying physical drivers of galaxy properties in simulations
Authors:
Robert McGibbon,
Sadegh Khochfar
Abstract:
Using a novel machine learning method, we investigate the buildup of galaxy properties in different simulations, and in various environments within a single simulation. The aim of this work is to show the power of this approach at identifying the physical drivers of galaxy properties within simulations. We compare how the stellar mass is dependent on the value of other galaxy and halo properties at different points in time by examining the feature importance values of a machine learning model. By training the model on IllustrisTNG we show that stars are produced at earlier times in higher density regions of the universe than they are in low density regions. We also apply the technique to the Illustris, EAGLE, and CAMELS simulations. We find that stellar mass is built up in a similar way in EAGLE and IllustrisTNG, but significantly differently in the original Illustris, suggesting that subgrid model physics is more important than the choice of hydrodynamics method. These differences are driven by the efficiency of supernova feedback. Applying principal component analysis to the CAMELS simulations allows us to identify a component associated with the importance of a halo's gravitational potential and another component representing the time at which galaxies form. We discover that the speed of galactic winds is a more critical subgrid parameter than the total energy per unit star formation. Finally we find that the Simba black hole feedback model has a larger effect on galaxy formation than the IllustrisTNG black hole feedback model.
Submitted 13 June, 2023;
originally announced June 2023.
-
Multi-Epoch Machine Learning 1: Unravelling Nature vs Nurture for Galaxy Formation
Authors:
Robert McGibbon,
Sadegh Khochfar
Abstract:
We present a novel machine learning method for predicting the baryonic properties of dark matter only subhalos from N-body simulations. Our model is built using the extremely randomized tree (ERT) algorithm and takes subhalo properties over a wide range of redshifts as its input features. We train our model using the IllustrisTNG simulations to predict black hole mass, gas mass, magnitudes, star formation rate, stellar mass, and metallicity. We compare the results of our method with a baseline model from previous works, and against a model that only considers the mass history of the subhalo. We find that our new model significantly outperforms both of the other models. We then investigate the predictive power of each input by looking at feature importance scores from the ERT algorithm. We produce feature importance plots for each baryonic property, and find that they differ significantly. We identify low redshifts as being most important for predicting star formation rate and gas mass, with high redshifts being most important for predicting stellar mass and metallicity, and consider what this implies for nature vs nurture. We find that the physical properties of galaxies investigated in this study are all driven by nurture and not nature. The only property showing a somewhat stronger impact of nature is the present-day star formation rate of galaxies. Finally, we verify that the feature importance plots are discovering physical patterns, and that the trends shown are not an artefact of the ERT algorithm.
Submitted 4 May, 2022; v1 submitted 15 December, 2021;
originally announced December 2021.
-
QUOTAS: A new research platform for the data-driven investigation of black holes
Authors:
Priyamvada Natarajan,
Kwok Sun Tang,
Robert McGibbon,
Sadegh Khochfar,
Brian Nord,
Steinn Sigurdsson,
Joe Tricot,
Nico Cappelluti,
Daniel George,
Jack Hidary
Abstract:
We present QUOTAS, a novel research platform for the data-driven investigation of super-massive black hole (SMBH) populations. While SMBH data sets -- observations and simulations -- have grown rapidly in complexity and abundance, our computational environments and analysis tools have not matured commensurately to exhaust opportunities for discovery. Motivated to explore the connection between BH host galaxies and their parent dark matter halos, in this pilot version of QUOTAS, we assemble and co-locate the high-redshift, luminous quasar population at $z \geq 3$ alongside simulated data of the same epochs. Leveraging machine learning (ML) algorithms, we expand simulation volumes that successfully replicate halo populations beyond the training set. Training ML on the Illustris-TNG300 simulation that includes baryonic physics, we populate the larger LEGACY Expanse dark matter-only box with quasars. Our first science results comparing observational and ML simulated quasars at $z \sim 3$ reveal that while the recovered Black Hole Mass Functions and clustering are in good agreement, simulated SMBHs fail to accrete, shine and grow at high enough rates to match observed quasars. We conclude that sub-grid models of mass accretion and SMBH feedback implemented in Illustris-TNG300 do not reproduce their observed mass growth. QUOTAS demonstrates the power of ML, both for analyzing large complex datasets and for offering a unique opportunity to interrogate our theoretical model assumptions. We deploy ML again to derive and devise an optimal survey strategy for bringing the undetected lower luminosity quasar population into view. QUOTAS and all related materials are publicly available on the Google Kaggle platform.
Submitted 14 April, 2023; v1 submitted 25 March, 2021;
originally announced March 2021.
-
Efficient hyperparameter optimization by way of PAC-Bayes bound minimization
Authors:
John J. Cherian,
Andrew G. Taube,
Robert T. McGibbon,
Panagiotis Angelikopoulos,
Guy Blanc,
Michael Snarski,
Daniel D. Richman,
John L. Klepeis,
David E. Shaw
Abstract:
Identifying optimal values for a high-dimensional set of hyperparameters is a problem that has received growing attention given its importance to large-scale machine learning applications such as neural architecture search. Recently developed optimization methods can be used to select thousands or even millions of hyperparameters. Such methods often yield overfit models, however, leading to poor performance on unseen data. We argue that this overfitting results from using the standard hyperparameter optimization objective function. Here we present an alternative objective that is equivalent to a Probably Approximately Correct-Bayes (PAC-Bayes) bound on the expected out-of-sample error. We then devise an efficient gradient-based algorithm to minimize this objective; the proposed method has asymptotic space and time complexity equal to or better than other gradient-based hyperparameter optimization methods. We show that this new method significantly reduces out-of-sample error when applied to hyperparameter optimization problems known to be prone to overfitting.
Submitted 14 August, 2020;
originally announced August 2020.
-
Theano: A Python framework for fast computation of mathematical expressions
Authors:
The Theano Development Team,
Rami Al-Rfou,
Guillaume Alain,
Amjad Almahairi,
Christof Angermueller,
Dzmitry Bahdanau,
Nicolas Ballas,
Frédéric Bastien,
Justin Bayer,
Anatoly Belikov,
Alexander Belopolsky,
Yoshua Bengio,
Arnaud Bergeron,
James Bergstra,
Valentin Bisson,
Josh Bleecher Snyder,
Nicolas Bouchard,
Nicolas Boulanger-Lewandowski,
Xavier Bouthillier,
Alexandre de Brébisson,
Olivier Breuleux,
Pierre-Luc Carrier,
Kyunghyun Cho,
Jan Chorowski,
Paul Christiano
, et al. (88 additional authors not shown)
Abstract:
Theano is a Python library that allows users to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers, especially in the machine learning community, and has shown steady performance improvements. Theano has been actively and continuously developed since 2008; multiple frameworks have been built on top of it, and it has been used to produce many state-of-the-art machine learning models.
The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently-introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it.
Submitted 9 May, 2016;
originally announced May 2016.
-
Identification of simple reaction coordinates from complex dynamics
Authors:
Robert T. McGibbon,
Brooke E. Husic,
Vijay S. Pande
Abstract:
Reaction coordinates are widely used throughout chemical physics to model and understand complex chemical transformations. We introduce a definition of the natural reaction coordinate, suitable for condensed phase and biomolecular systems, as a maximally predictive one-dimensional projection. We then show this criterion is uniquely satisfied by a dominant eigenfunction of an integral operator associated with the ensemble dynamics. We present a new sparse estimator for these eigenfunctions which can search through a large candidate pool of structural order parameters and build simple, interpretable approximations that employ only a small number of these order parameters. Example applications with a small molecule's rotational dynamics and simulations of protein conformational change and folding show that this approach can filter through statistical noise to identify simple reaction coordinates from complex dynamics.
Submitted 6 January, 2017; v1 submitted 28 February, 2016;
originally announced February 2016.
-
Efficient maximum likelihood parameterization of continuous-time Markov processes
Authors:
Robert T. McGibbon,
Vijay S. Pande
Abstract:
Continuous-time Markov processes over finite state-spaces are widely used to model dynamical processes in many fields of natural and social science. Here, we introduce a maximum likelihood estimator for constructing such models from data observed at a finite time interval. This estimator is dramatically more efficient than prior approaches, enables the calculation of deterministic confidence intervals in all model parameters, and can easily enforce important physical constraints on the models such as detailed balance. We demonstrate and discuss the advantages of these models over existing discrete-time Markov models for the analysis of molecular dynamics simulations.
Submitted 30 June, 2015; v1 submitted 7 April, 2015;
originally announced April 2015.
-
Perspective: Markov Models for Long-Timescale Biomolecular Dynamics
Authors:
Christian R. Schwantes,
Robert T. McGibbon,
Vijay S. Pande
Abstract:
Molecular dynamics simulations have the potential to provide atomic-level detail and insight to important questions in chemical physics that cannot be observed in typical experiments. However, simply generating a long trajectory is insufficient, as researchers must be able to transform the data in a simulation trajectory into specific scientific insights. Although this analysis step has often been taken for granted, it deserves further attention as large-scale simulations become increasingly routine. In this perspective, we discuss the application of Markov models to the analysis of large-scale biomolecular simulations. We draw attention to recent improvements in the construction of these models as well as several important open issues. In addition, we highlight recent theoretical advances that pave the way for a new generation of models of molecular kinetics.
Submitted 22 August, 2014;
originally announced August 2014.
-
Variational cross-validation of slow dynamical modes in molecular kinetics
Authors:
Robert T. McGibbon,
Vijay S. Pande
Abstract:
Markov state models (MSMs) are a widely used method for approximating the eigenspectrum of the molecular dynamics propagator, yielding insight into the long-timescale statistical kinetics and slow dynamical modes of biomolecular systems. However, the lack of a unified theoretical framework for choosing between alternative models has hampered progress, especially for non-experts applying these methods to novel biological systems. Here, we consider cross-validation with a new objective function for estimators of these slow dynamical modes, a generalized matrix Rayleigh quotient (GMRQ), which measures the ability of a rank-$m$ projection operator to capture the slow subspace of the system. It is shown that a variational theorem bounds the GMRQ from above by the sum of the first $m$ eigenvalues of the system's propagator, but that this bound can be violated when the requisite matrix elements are estimated subject to statistical uncertainty. This overfitting can be detected and avoided through cross-validation. These results make it possible to construct Markov state models for protein dynamics in a way that appropriately captures the tradeoff between systematic and statistical errors.
Submitted 27 March, 2015; v1 submitted 30 July, 2014;
originally announced July 2014.
-
Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models
Authors:
Robert T. McGibbon,
Bharath Ramsundar,
Mohammad M. Sultan,
Gert Kiss,
Vijay S. Pande
Abstract:
We present a machine learning framework for modeling protein dynamics. Our approach uses L1-regularized, reversible hidden Markov models to understand large protein datasets generated via molecular dynamics simulations. Our model is motivated by three design principles: (1) the requirement of massive scalability; (2) the need to adhere to relevant physical law; and (3) the necessity of providing accessible interpretations, critical for both cellular biology and rational drug design. We present an EM algorithm for learning and introduce a model selection criterion based on the physical notion of convergence in relaxation timescales. We contrast our model with standard methods in biophysics and demonstrate improved robustness. We implement our algorithm on GPUs and apply the method to two large protein simulation datasets generated respectively on the NCSA Bluewaters supercomputer and the Folding@Home distributed computing network. Our analysis identifies the conformational dynamics of the ubiquitin protein critical to cellular signaling, and elucidates the stepwise activation mechanism of the c-Src kinase protein.
Submitted 6 May, 2014;
originally announced May 2014.