-
Orthogonal Gradient Boosting for Simpler Additive Rule Ensembles
Authors:
Fan Yang,
Pierre Le Bodic,
Michael Kamp,
Mario Boley
Abstract:
Gradient boosting of prediction rules is an efficient approach to learn potentially interpretable yet accurate probabilistic models. However, actual interpretability requires limiting the number and size of the generated rules, and existing boosting variants are not designed for this purpose. Though corrective boosting refits all rule weights in each iteration to minimise prediction risk, the included rule conditions tend to be sub-optimal, because commonly used objective functions fail to anticipate this refitting. Here, we address this issue by a new objective function that measures the angle between the risk gradient vector and the projection of the condition output vector onto the orthogonal complement of the already selected conditions. This approach correctly approximates the ideal update of adding the risk gradient itself to the model and favours the inclusion of more general and thus shorter rules. As we demonstrate using a wide range of prediction tasks, this significantly improves the comprehensibility/accuracy trade-off of the fitted ensemble. Additionally, we show how objective values for related rule conditions can be computed incrementally to avoid any substantial computational overhead of the new method.
Submitted 23 February, 2024;
originally announced February 2024.
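The objective described in this abstract can be illustrated with a minimal sketch. It assumes the objective is the absolute cosine of the angle between the risk gradient and the orthogonalised condition output vector; the paper's exact formulation (for example any regularisation term) may differ. The function name `orthogonal_cosine` and the toy vectors are hypothetical.

```python
import numpy as np

def orthogonal_cosine(g, q, Q_selected, eps=1e-12):
    """Absolute cosine between gradient g and the part of q that is
    orthogonal to the span of the already selected condition outputs.

    g          : risk gradient vector, shape (n,)
    q          : 0/1 output vector of the candidate condition, shape (n,)
    Q_selected : outputs of already selected conditions, shape (n, k)
    """
    if Q_selected.shape[1] > 0:
        # Least-squares coefficients give the projection of q onto span(Q_selected).
        coef, *_ = np.linalg.lstsq(Q_selected, q, rcond=None)
        q_perp = q - Q_selected @ coef
    else:
        q_perp = q.copy()
    norm = np.linalg.norm(q_perp)
    if norm < eps:
        # Candidate lies entirely in the span: it adds nothing after refitting.
        return 0.0
    return float(abs(g @ q_perp) / (np.linalg.norm(g) * norm))

# Toy data (hypothetical): one condition covering the first two points is selected.
g = np.array([1.0, -1.0, 2.0, 2.0])
Q_selected = np.array([[1.0], [1.0], [0.0], [0.0]])

score_a = orthogonal_cosine(g, np.array([1.0, 1.0, 0.0, 0.0]), Q_selected)  # redundant
score_b = orthogonal_cosine(g, np.array([0.0, 0.0, 1.0, 1.0]), Q_selected)  # specific
score_c = orthogonal_cosine(g, np.array([1.0, 1.0, 1.0, 1.0]), Q_selected)  # general
```

In this toy case the general condition covering all four points scores the same as the more specific one, because both have the same component orthogonal to the selected condition. This is consistent with the abstract's claim that the objective does not penalise more general rules.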
-
Roadmap on Data-Centric Materials Science
Authors:
Stefan Bauer,
Peter Benner,
Tristan Bereau,
Volker Blum,
Mario Boley,
Christian Carbogno,
C. Richard A. Catlow,
Gerhard Dehm,
Sebastian Eibl,
Ralph Ernstorfer,
Ádám Fekete,
Lucas Foppa,
Peter Fratzl,
Christoph Freysoldt,
Baptiste Gault,
Luca M. Ghiringhelli,
Sajal K. Giri,
Anton Gladyshev,
Pawan Goyal,
Jason Hattrick-Simpers,
Lara Kabalan,
Petr Karpov,
Mohammad S. Khorrami,
Christoph Koch,
Sebastian Kokott
, et al. (36 additional authors not shown)
Abstract:
Science is and always has been based on data, but the terms "data-centric" and the "4th paradigm of" materials research indicate a radical change in how information is retrieved and handled and in how research is performed. It signifies a transformative shift towards managing vast data collections, digital repositories, and innovative data analytics methods. The integration of Artificial Intelligence (AI) and its subset Machine Learning (ML) has become pivotal in addressing all these challenges. This Roadmap on Data-Centric Materials Science explores fundamental concepts and methodologies, illustrating diverse applications in electronic-structure theory, soft matter theory, microstructure research, and experimental techniques like photoemission, atom probe tomography, and electron microscopy. While the roadmap delves into specific areas within the broad interdisciplinary field of materials science, the provided examples elucidate key concepts applicable to a wider range of topics. The discussed instances offer insights into addressing the multifaceted challenges encountered in contemporary materials research.
Submitted 1 May, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Scaling K2 VII: Evidence for a high occurrence rate of hot sub-Neptunes at intermediate ages
Authors:
Jessie L. Christiansen,
Jon K. Zink,
Kevin K. Hardegree-Ullman,
Rachel B. Fernandes,
Philip F. Hopkins,
Luisa M. Rebull,
Kiersten M. Boley,
Galen J. Bergsten,
Sakhee Bhure
Abstract:
The NASA K2 mission obtained high precision time-series photometry for four young clusters, including the near-twin 600-800 Myr-old Praesepe and Hyades clusters. Hot sub-Neptunes are highly prone to mass-loss mechanisms, given their proximity to the host star and the weakly bound gaseous envelopes, and analyzing this population at young ages can provide strong constraints on planetary evolution models. Using our automated transit detection pipeline, we recover 15 planet candidates across the two clusters, including 10 previously confirmed planets. We find a hot sub-Neptune occurrence rate of 79-107% for GKM stars in the Praesepe cluster. This is 2.5-3.5 sigma higher than the occurrence rate of $16.54^{+1.00}_{-0.98}$% for the same planets orbiting the ~3-9 Gyr-old GKM field stars observed by K2, even after accounting for the slightly super-solar metallicity ([Fe/H]~0.2 dex) of the Praesepe cluster. We examine the effect of adding ~100 targets from the Hyades cluster, and extending the planet parameter space under examination, and find similarly high occurrence rates in both cases. The high occurrence rate of young, hot sub-Neptunes could indicate either that these planets are undergoing atmospheric evolution as they age, or that planetary systems that formed when the Galaxy was much younger are substantially different from those forming today. Under the assumption of the atmospheric mass-loss scenario, a significantly higher occurrence rate of these planets at the intermediate ages of Praesepe and Hyades appears more consistent with the core-powered mass loss scenario sculpting the hot sub-Neptune population, compared to the photoevaporation scenario.
Submitted 30 November, 2023;
originally announced November 2023.
-
From Prediction to Action: Critical Role of Performance Estimation for Machine-Learning-Driven Materials Discovery
Authors:
Mario Boley,
Felix Luong,
Simon Teshuva,
Daniel F Schmidt,
Lucas Foppa,
Matthias Scheffler
Abstract:
Materials discovery driven by statistical property models is an iterative decision process, during which an initial data collection is extended with new data proposed by a model-informed acquisition function--with the goal to maximize a certain "reward" over time, such as the maximum property value discovered so far. While the materials science community achieved much progress in developing property models that predict well on average with respect to the training distribution, this form of in-distribution performance measurement is not directly coupled with the discovery reward. This is because an iterative discovery process has a shifting reward distribution that is over-proportionally determined by the model performance for exceptional materials. We demonstrate this problem using the example of bulk modulus maximization among double perovskite oxides. We find that the in-distribution predictive performance suggests random forests as superior to Gaussian process regression, while the results are reversed in terms of the discovery rewards. We argue that the lack of proper performance estimation methods from pre-computed data collections is a fundamental problem for improving data-driven materials discovery, and we propose a novel such estimator that, in contrast to naïve reward estimation, successfully predicts Gaussian processes with the "expected improvement" acquisition function as the best out of four options in our demonstrational study for double perovskites. Importantly, it does so without requiring the more than a thousand ab initio computations that were needed to confirm this prediction.
Submitted 6 December, 2023; v1 submitted 27 November, 2023;
originally announced November 2023.
-
Bayes beats Cross Validation: Efficient and Accurate Ridge Regression via Expectation Maximization
Authors:
Shu Yu Tew,
Mario Boley,
Daniel F. Schmidt
Abstract:
We present a novel method for tuning the regularization hyper-parameter, $λ$, of a ridge regression that is faster to compute than leave-one-out cross-validation (LOOCV) while yielding estimates of the regression parameters of equal or, particularly in the setting of sparse covariates, superior quality to those obtained by minimising the LOOCV risk. The LOOCV risk can suffer from multiple and bad local minima for finite $n$ and thus requires the specification of a set of candidate $λ$, which can fail to provide good solutions. In contrast, we show that the proposed method is guaranteed to find a unique optimal solution for large enough $n$, under relatively mild conditions, without requiring the specification of any difficult-to-determine hyper-parameters. This is based on a Bayesian formulation of ridge regression that we prove to have a unimodal posterior for large enough $n$, allowing for both the optimal $λ$ and the regression coefficients to be jointly learned within an iterative expectation maximization (EM) procedure. Importantly, we show that by utilizing an appropriate preprocessing step, a single iteration of the main EM loop can be implemented in $O(\min(n, p))$ operations, for input data with $n$ rows and $p$ columns. In contrast, evaluating a single value of $λ$ using fast LOOCV costs $O(n \min(n, p))$ operations when using the same preprocessing. This advantage amounts to an asymptotic improvement of a factor of $l$ for $l$ candidate values for $λ$ (in the regime $q, p \in O(\sqrt{n})$ where $q$ is the number of regression targets).
Submitted 2 November, 2023; v1 submitted 28 October, 2023;
originally announced October 2023.
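The fast LOOCV baseline that this abstract compares against can be sketched as follows. This is not the paper's EM method. It is the standard leave-one-out shortcut for ridge regression, assuming no intercept and taking a thin SVD of the design matrix as the preprocessing step (an assumption; the abstract does not say which preprocessing the authors use). All function names are hypothetical.

```python
import numpy as np

def loocv_risk_ridge(U, s, Uty, y, lam):
    """Mean squared leave-one-out error of ridge regression for one lambda.

    U, s : left singular vectors and singular values of X (thin SVD)
    Uty  : precomputed U.T @ y
    Cost per lambda is O(n * min(n, p)) once the SVD is available.
    """
    d = s**2 / (s**2 + lam)              # per-component shrinkage factors
    y_hat = U @ (d * Uty)                # fitted values on the full data
    h = (U**2) @ d                       # diagonal of the ridge hat matrix
    loo_resid = (y - y_hat) / (1.0 - h)  # exact leave-one-out residuals
    return float(np.mean(loo_resid**2))

def loocv_risk_brute_force(X, y, lam):
    """Reference implementation: refit ridge n times, each time without one row."""
    n, p = X.shape
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        beta = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
        errs.append((y[i] - X[i] @ beta) ** 2)
    return float(np.mean(errs))

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=30)

U, s, _ = np.linalg.svd(X, full_matrices=False)  # one-off preprocessing
Uty = U.T @ y
fast = loocv_risk_ridge(U, s, Uty, y, lam=1.0)
slow = loocv_risk_brute_force(X, y, lam=1.0)
```

The two values agree up to floating-point error. Scanning $l$ candidate values of $λ$ repeats only the cheap per-$λ$ step, which is the $O(n \min(n, p))$ cost per candidate that the abstract quotes for fast LOOCV.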
-
Fizzy Super-Earths: Impacts of Magma Composition on the Bulk Density and Structure of Lava Worlds
Authors:
Kiersten M. Boley,
Wendy R. Panero,
Cayman T. Unterborn,
Joseph G. Schulze,
Romy Rodríguez Martínez,
Ji Wang
Abstract:
Lava worlds are a potential emerging population of Super-Earths that are on close-in orbits around their host stars with likely partially molten mantles. To date, few studies address the impact of magma on the observed properties of a planet. At ambient conditions magma is less dense than solid rock; however, it is also more compressible with increasing pressure. Therefore, it is unclear how large-scale magma oceans affect planet observables, such as bulk density. We update ExoPlex, a thermodynamically self-consistent planet interior software, to include anhydrous, hydrous (2.2 wt% H$_2$O), and carbonated magmas (5.2 wt% CO$_2$). We find that Earth-like planets with magma oceans larger than $\sim 1.5 R_{\oplus}$ and $\sim 3.2 M_{\oplus}$ are modestly denser than an equivalent mass solid planet. From our model, three classes of mantle structures emerge for magma ocean planets: (1) mantle magma ocean, (2) surface magma ocean, and (3) one consisting of a surface magma ocean, solid rock layer, and a basal magma ocean. The class of planets in which a basal magma ocean is present may sequester dissolved volatiles on billion-year timescales, in which a 4 $M_{\oplus}$ planet can trap more than 130 times the mass of water in Earth's present-day oceans and 1000 times the carbon in the Earth's surface and crust.
Submitted 25 July, 2023;
originally announced July 2023.
-
A Comparison of the Composition of Planets in Single- and Multi-Planet Systems Orbiting M dwarfs
Authors:
Romy Rodríguez Martínez,
David V. Martin,
B. Scott Gaudi,
Joseph G. Schulze,
Anusha Pai Asnodkar,
Kiersten M. Boley,
Sarah Ballard
Abstract:
We investigate and compare the composition of M-dwarf planets in systems with only one known planet ("singles") to those residing in multi-planet systems ("multis") and the fundamental properties of their host stars. We restrict our analysis to planets with directly measured masses and radii, which comprise a total of 70 planets: 30 singles and 40 multis in 19 systems. We compare the bulk densities for the full sample, which includes planets ranging in size from $0.52 R_{\oplus}$ to $12.8 R_{\oplus}$, and find that single planets have significantly lower densities on average than multis, which we cannot attribute to selection biases. We compare the bulk densities normalized by an Earth model for planets with $R_{p} < 6 R_{\oplus}$, and find that multis are also denser with 99% confidence. We calculate and compare the core/water mass fractions (CMF/WMF) of low-mass planets ($M_p < 10 M_{\oplus}$), and find that the likely rocky multis (with $R_p < 1.6 R_{\oplus}$) have lower CMFs than singles. We also compare the [Fe/H] metallicity and rotation period of all single versus multi-planet host stars with such measurements in the literature and find that multi-planet hosts are significantly more metal-poor than those hosting a single planet. Moreover, we find that host star metallicity decreases with increasing planet multiplicity. In contrast, we find only a modest difference in the rotation period. The significant differences in planetary composition and metallicity of the host stars point to different physical processes governing the formation of single- and multi-planet systems in M dwarfs.
Submitted 24 July, 2023;
originally announced July 2023.
-
Scaling K2. VI. Reduced Small Planet Occurrence in High Galactic Amplitude Stars
Authors:
Jon K. Zink,
Kevin K. Hardegree-Ullman,
Jessie L. Christiansen,
Erik A. Petigura,
Kiersten M. Boley,
Sakhee Bhure,
Malena Rice,
Samuel W. Yee,
Howard Isaacson,
Rachel B. Fernandes,
Andrew W. Howard,
Sarah Blunt,
Jack Lubin,
Ashley Chontos,
Daria Pidhorodetska,
Mason G. MacDougall
Abstract:
In this study, we performed a homogeneous analysis of the planets around FGK dwarf stars observed by the Kepler and K2 missions, providing spectroscopic parameters for 310 K2 targets -- including 239 Scaling K2 hosts -- observed with Keck/HIRES. For orbital periods less than 40 days, we found that the distribution of planets as a function of orbital period, stellar effective temperature, and metallicity was consistent between K2 and Kepler, reflecting consistent planet formation efficiency across numerous ~1 kpc sight-lines in the local Milky Way. Additionally, we detected a 3X excess of sub-Saturns relative to warm Jupiters beyond 10 days, suggesting a closer association between sub-Saturn and sub-Neptune formation than between sub-Saturn and Jovian formation. Performing a joint analysis of Kepler and K2 demographics, we observed diminishing super-Earth, sub-Neptune, and sub-Saturn populations at higher stellar effective temperatures, implying an inverse relationship between formation and disk mass. In contrast, no apparent host-star spectral-type dependence was identified for our population of Jupiters, which indicates gas-giant formation saturates within the FGK mass regimes. We present support for stellar metallicity trends reported by previous Kepler analyses. Using GAIA DR3 proper motion and RV measurements, we discovered a galactic location trend: stars that make large vertical excursions from the plane of the Milky Way host fewer super-Earths and sub-Neptunes. While oscillation amplitude is associated with metallicity, metallicity alone cannot explain the observed trend, demonstrating that galactic influences are imprinted on the planet population. Overall, our results provide new insights into the distribution of planets around FGK dwarf stars and the factors that influence their formation and evolution.
Submitted 22 May, 2023;
originally announced May 2023.
-
A Reanalysis of the Composition of K2-106b: an Ultra-short Period Super-Mercury Candidate
Authors:
Romy Rodríguez Martínez,
B. Scott Gaudi,
Joseph G. Schulze,
Lorena Acuña,
Jared Kolecki,
Jennifer A. Johnson,
Anusha Pai Asnodkar,
Kiersten M. Boley,
Magali Deleuil,
Olivier Mousis,
Wendy R. Panero,
Ji Wang
Abstract:
We present a reanalysis of the K2-106 transiting planetary system, with a focus on the composition of K2-106b, an ultra-short period, super-Mercury candidate. We globally model existing photometric and radial velocity data and derive a planetary mass and radius for K2-106b of $M_{p} = 8.53\pm1.02~M_{\oplus}$ and $R_{p} = 1.71^{+0.069}_{-0.057}~R_{\oplus}$, which leads to a density of $ρ_{p} = 9.4^{+1.6}_{-1.5}$ $\rm g~cm^{-3}$, a significantly lower value than previously reported in the literature. We use planet interior models that assume a two-layer planet comprised of a liquid, pure Fe core and iron-free, $\rm MgSiO_{3}$ mantle, and we determine the range of core mass fractions that are consistent with the observed mass and radius. We use existing high-resolution spectra of the host star to derive Fe/Mg/Si abundances ([Fe/H]$=-0.03 \pm 0.01$, [Mg/H]$= 0.04 \pm 0.02$, [Si/H]$=0.03 \pm 0.06$) to infer the composition of K2-106b. We find that although K2-106b has a high density and core mass fraction ($44^{+12}_{-15}\%$) compared to the Earth ($33\%$), its composition is consistent with what is expected assuming that it reflects the relative refractory abundances of its host star. K2-106b is therefore unlikely to be a super-Mercury, as has been suggested in previous literature.
Submitted 16 August, 2022;
originally announced August 2022.
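The bulk density quoted in this abstract follows directly from the stated mass and radius, as the short check below shows. It scales Earth's mean density (about 5.514 g/cm^3, a value supplied here and not taken from the abstract) by mass over radius cubed, using only the central values and ignoring the quoted uncertainties.

```python
EARTH_MEAN_DENSITY = 5.514  # g/cm^3; assumed reference value, not from the abstract

def bulk_density(mass_earth, radius_earth):
    """Bulk density in g/cm^3 for a planet with mass and radius in Earth units."""
    return EARTH_MEAN_DENSITY * mass_earth / radius_earth**3

# Central values for K2-106b from the abstract: 8.53 Earth masses, 1.71 Earth radii.
rho = bulk_density(8.53, 1.71)
print(f"K2-106b bulk density: {rho:.1f} g/cm^3")
```

The result rounds to 9.4 g/cm^3, matching the central value reported in the abstract.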
-
Spectroscopy of TOI-1259B -- an unpolluted white dwarf companion to an inflated warm Saturn
Authors:
Evan Fitzmaurice,
David V. Martin,
Romy Rodriguez Martinez,
Patrick Vallely,
Alexander P. Stephan,
Kiersten M. Boley,
Rick Pogge,
Kareem El-Badry,
Vedad Kunovac,
Amaury H. M. J. Triaud
Abstract:
TOI-1259 consists of a transiting exoplanet orbiting a main sequence star, with a bound outer white dwarf companion. Fewer than a dozen systems with this architecture are known. We conduct follow-up spectroscopy on the white dwarf TOI-1259B using the Large Binocular Telescope (LBT) to better characterise it. We observe only strong hydrogen lines, making TOI-1259B a DA white dwarf. We see no evidence of heavy element pollution, which would have been evidence of planetary material around the white dwarf. Such pollution is seen in ~25-50% of white dwarfs, but it is unknown if this rate is higher or lower in TOI-1259-like systems that contain a known planet. Our spectroscopy permits an improved white dwarf age measurement of $4.05^{+1.00}_{-0.42}$ Gyr, which matches gyrochronology of the main sequence star. This is the first of an expanded sample of similar binaries that will allow us to calibrate these dating methods and provide a new perspective on planets in binaries.
Submitted 12 September, 2022; v1 submitted 2 June, 2022;
originally announced June 2022.
-
Searching For Transiting Planets Around Halo Stars. II. Constraining the Occurrence Rate of Hot Jupiters
Authors:
Kiersten M. Boley,
Ji Wang,
Joel C. Zinn,
Karen A. Collins,
Kevin I. Collins,
Tianjun Gan,
Ting S. Li
Abstract:
Jovian planet formation has been shown to be strongly correlated with host star metallicity, which is thought to be a proxy for disk solids. Observationally, previous works have indicated that jovian planets preferentially form around stars with solar and super-solar metallicities. Given these findings, it is challenging to form planets within metal-poor environments, particularly for hot Jupiters that are thought to form via metallicity-dependent core accretion. Although previous studies have conducted planet searches for hot Jupiters around metal-poor stars, they have been limited by small sample sizes resulting from a lack of high-quality data, making hot Jupiter occurrence within the metal-poor regime difficult to constrain until now. We use a large sample of halo stars observed by TESS to constrain the upper limit of hot Jupiter occurrence within the metal-poor regime ($-2.0 \leq$ [Fe/H] $\leq -0.6$). Placing the most stringent upper limit on hot Jupiter occurrence, we find the mean 1-$σ$ upper limit to be 0.18% for radii of 0.8-2 $R_{\rm Jupiter}$ and periods of 0.5-10 days. This result is consistent with previous predictions indicating that there exists a certain metallicity below which no planets can form.
Submitted 28 June, 2021; v1 submitted 24 June, 2021;
originally announced June 2021.
-
Learning Rules for Materials Properties and Functions
Authors:
Mario Boley,
Matthias Scheffler
Abstract:
In materials science and engineering, one is typically searching for materials that exhibit exceptional performance for a certain function, and the number of these materials is extremely small. Thus, statistically speaking, we are interested in the identification of *rare phenomena*, and the scientific discovery typically resembles the proverbial hunt for the needle in a haystack.
Submitted 3 April, 2021;
originally announced April 2021.
-
Better Short than Greedy: Interpretable Models through Optimal Rule Boosting
Authors:
Mario Boley,
Simon Teshuva,
Pierre Le Bodic,
Geoffrey I Webb
Abstract:
Rule ensembles are designed to provide a useful trade-off between predictive accuracy and model interpretability. However, the myopic and random search components of current rule ensemble methods can compromise this goal: they often need more rules than necessary to reach a certain accuracy level or can even outright fail to accurately model a distribution that can actually be described well with a few rules. Here, we present a novel approach aiming to fit rule ensembles of maximal predictive power for a given ensemble size (and thus model comprehensibility). In particular, we present an efficient branch-and-bound algorithm that optimally solves the per-rule objective function of the popular second-order gradient boosting framework. Our main insight is that the boosting objective can be tightly bounded in linear time of the number of covered data points. Along with an additional novel pruning technique related to rule redundancy, this leads to a computationally feasible approach for boosting optimal rules that, as we demonstrate on a wide range of common benchmark problems, consistently outperforms the predictive performance of boosting greedy rules.
Submitted 20 January, 2021;
originally announced January 2021.
-
Discovering Reliable Causal Rules
Authors:
Kailash Budhathoki,
Mario Boley,
Jilles Vreeken
Abstract:
We study the problem of deriving policies, or rules, that when enacted on a complex system, cause a desired outcome. Absent the ability to perform controlled experiments, such rules have to be inferred from past observations of the system's behaviour. This is a challenging problem for two reasons: First, observational effects are often unrepresentative of the underlying causal effect because they are skewed by the presence of confounding factors. Second, naive empirical estimations of a rule's effect have a high variance, and, hence, their maximisation can lead to random results.
To address these issues, first we measure the causal effect of a rule from observational data---adjusting for the effect of potential confounders. Importantly, we provide a graphical criterion under which causal rule discovery is possible. Moreover, to discover reliable causal rules from a sample, we propose a conservative and consistent estimator of the causal effect, and derive an efficient and exact algorithm that maximises the estimator. On synthetic data, the proposed estimator converges faster to the ground truth than the naive estimator and recovers relevant causal rules even at small sample sizes. Extensive experiments on a variety of real-world datasets show that the proposed algorithm is efficient and discovers meaningful rules.
Submitted 8 September, 2020; v1 submitted 6 September, 2020;
originally announced September 2020.
-
Relative Flatness and Generalization
Authors:
Henning Petzka,
Michael Kamp,
Linara Adilova,
Cristian Sminchisescu,
Mario Boley
Abstract:
Flatness of the loss curve is conjectured to be connected to the generalization ability of machine learning models, in particular neural networks. While it has been empirically observed that flatness measures consistently correlate strongly with generalization, it is still an open theoretical problem why and under which circumstances flatness is connected to generalization, in particular in light of reparameterizations that change certain flatness measures but leave generalization unchanged. We investigate the connection between flatness and generalization by relating it to the interpolation from representative data, deriving notions of representativeness, and feature robustness. The notions allow us to rigorously connect flatness and generalization and to identify conditions under which the connection holds. Moreover, they give rise to a novel, but natural relative flatness measure that correlates strongly with generalization, simplifies to ridge regression for ordinary least squares, and solves the reparameterization issue.
Submitted 4 November, 2021; v1 submitted 3 January, 2020;
originally announced January 2020.
-
Communication-Efficient Distributed Online Learning with Kernels
Authors:
Michael Kamp,
Sebastian Bothe,
Mario Boley,
Michael Mock
Abstract:
We propose an efficient distributed online learning protocol for low-latency real-time services. It extends a previously presented protocol to kernelized online learners that represent their models by a support vector expansion. While such learners often achieve higher predictive performance than their linear counterparts, communicating the support vector expansions becomes inefficient for large numbers of support vectors. The proposed extension allows for a larger class of online learning algorithms---including those alleviating the problem above through model compression. In addition, we characterize the quality of the proposed protocol by introducing a novel criterion that requires the communication to be bounded by the loss suffered.
Submitted 28 November, 2019;
originally announced November 2019.
-
Adaptive Communication Bounds for Distributed Online Learning
Authors:
Michael Kamp,
Mario Boley,
Michael Mock,
Daniel Keren,
Assaf Schuster,
Izchak Sharfman
Abstract:
We consider distributed online learning protocols that control the exchange of information between local learners in a round-based learning scenario. The learning performance of such a protocol is intuitively optimal if approximately the same loss is incurred as in a hypothetical serial setting. If a protocol accomplishes this, it is inherently impossible to achieve a strong communication bound at the same time. In the worst case, every input is essential for the learning performance, even for the serial setting, and thus needs to be exchanged between the local learners. However, it is reasonable to demand a bound that scales well with the hardness of the serialized prediction problem, as measured by the loss received by a serial online learning algorithm. We provide formal criteria based on this intuition and show that they hold for a simplified version of a previously published protocol.
Submitted 28 November, 2019;
originally announced November 2019.
-
Discovering Reliable Correlations in Categorical Data
Authors:
Panagiotis Mandros,
Mario Boley,
Jilles Vreeken
Abstract:
In many scientific tasks we are interested in discovering whether there exist any correlations in our data. This raises many questions, such as how to reliably and interpretably measure correlation between a multivariate set of attributes, how to do so without having to make assumptions on distribution of the data or the type of correlation, and, how to efficiently discover the top-most reliably correlated attribute sets from data. In this paper we answer these questions for discovery tasks in categorical data.
In particular, we propose a corrected-for-chance, consistent, and efficient estimator for normalized total correlation, by which we obtain a reliable, naturally interpretable, non-parametric measure for correlation over multivariate sets. For the discovery of the top-k correlated sets, we derive an effective algorithmic framework based on a tight bounding function. This framework offers exact, approximate, and heuristic search. Empirical evaluation shows that already for small sample sizes the estimator leads to low-regret optimization outcomes, while the algorithms are shown to be highly effective for both large and high-dimensional data. Through two case studies we confirm that our discovery framework identifies interesting and meaningful correlations.
Submitted 30 August, 2019;
originally announced August 2019.
-
Effective Parallelisation for Machine Learning
Authors:
Michael Kamp,
Mario Boley,
Olana Missura,
Thomas Gärtner
Abstract:
We present a novel parallelisation scheme that simplifies the adaptation of learning algorithms to growing amounts of data as well as growing needs for accurate and confident predictions in critical applications. In contrast to other parallelisation techniques, it can be applied to a broad class of learning algorithms without further mathematical derivations and without writing dedicated code, while at the same time maintaining theoretical performance guarantees. Moreover, our parallelisation scheme is able to reduce the runtime of many learning algorithms to polylogarithmic time on quasi-polynomially many processing units. This is a significant step towards a general answer to an open question on the efficient parallelisation of machine learning algorithms in the sense of Nick's Class (NC). The cost of this parallelisation is in the form of a larger sample complexity. Our empirical study confirms the potential of our parallelisation scheme with fixed numbers of processors and instances in realistic application scenarios.
Submitted 8 October, 2018;
originally announced October 2018.
-
Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms
Authors:
Panagiotis Mandros,
Mario Boley,
Jilles Vreeken
Abstract:
The reliable fraction of information is an attractive score for quantifying (functional) dependencies in high-dimensional data. In this paper, we systematically explore the algorithmic implications of using this measure for optimization. We show that the problem is NP-hard, which justifies the usage of worst-case exponential-time as well as heuristic search methods. We then substantially improve the practical performance for both optimization styles by deriving a novel admissible bounding function that has an unbounded potential for additional pruning over the previously proposed one. Finally, we empirically investigate the approximation ratio of the greedy algorithm and show that it produces highly competitive results in a fraction of time needed for complete branch-and-bound style search.
Submitted 14 September, 2018;
originally announced September 2018.
-
Efficiently Discovering Locally Exceptional yet Globally Representative Subgroups
Authors:
Janis Kalofolias,
Mario Boley,
Jilles Vreeken
Abstract:
Subgroup discovery is a local pattern mining technique to find interpretable descriptions of sub-populations that stand out on a given target variable. That is, these sub-populations are exceptional with regard to the global distribution. In this paper we argue that in many applications, such as scientific discovery, subgroups are only useful if they are additionally representative of the global distribution with regard to a control variable. That is, when the distribution of this control variable is the same, or almost the same, as over the whole data.
We formalise this objective function and give an efficient algorithm to compute its tight optimistic estimator for the case of a numeric target and a binary control variable. This enables us to use the branch-and-bound framework to efficiently discover the top-$k$ subgroups that are both exceptional as well as representative. Experimental evaluation on a wide range of datasets shows that with this algorithm we discover meaningful representative patterns and are up to orders of magnitude faster in terms of node evaluations as well as time.
Submitted 22 September, 2017;
originally announced September 2017.
-
Discovering Reliable Approximate Functional Dependencies
Authors:
Panagiotis Mandros,
Mario Boley,
Jilles Vreeken
Abstract:
Given a database and a target attribute of interest, how can we tell whether there exists a functional, or approximately functional dependence of the target on any set of other attributes in the data? How can we reliably, without bias to sample size or dimensionality, measure the strength of such a dependence? And, how can we efficiently discover the optimal or $α$-approximate top-$k$ dependencies? These are exactly the questions we answer in this paper.
As we want to be agnostic on the form of the dependence, we adopt an information-theoretic approach, and construct a reliable, bias correcting score that can be efficiently computed. Moreover, we give an effective optimistic estimator of this score, by which for the first time we can mine the approximate functional dependencies from data with guarantees of optimality. Empirical evaluation shows that the derived score achieves a good bias for variance trade-off, can be used within an efficient discovery algorithm, and indeed discovers meaningful dependencies. Most important, it remains reliable in the face of data sparsity.
Submitted 18 June, 2017; v1 submitted 25 May, 2017;
originally announced May 2017.
-
Identifying Consistent Statements about Numerical Data with Dispersion-Corrected Subgroup Discovery
Authors:
Mario Boley,
Bryan R. Goldsmith,
Luca M. Ghiringhelli,
Jilles Vreeken
Abstract:
Existing algorithms for subgroup discovery with numerical targets do not optimize the error or target variable dispersion of the groups they find. This often leads to unreliable or inconsistent statements about the data, rendering practical applications, especially in scientific domains, futile. Therefore, we here extend the optimistic estimator framework for optimal subgroup discovery to a new class of objective functions: we show how tight estimators can be computed efficiently for all functions that are determined by subgroup size (non-decreasing dependence), the subgroup median value, and a dispersion measure around the median (non-increasing dependence). In the important special case when dispersion is measured using the average absolute deviation from the median, this novel approach yields a linear time algorithm. Empirical evaluation on a wide range of datasets shows that, when used within branch-and-bound search, this approach is highly efficient and indeed discovers subgroups with much smaller errors.
Submitted 23 April, 2017; v1 submitted 26 January, 2017;
originally announced January 2017.
-
Uncovering structure-property relationships of materials by subgroup discovery
Authors:
B. R. Goldsmith,
M. Boley,
J. Vreeken,
M. Scheffler,
L. M. Ghiringhelli
Abstract:
Subgroup discovery (SGD) is presented here as a data-mining approach to help find interpretable local patterns, correlations, and descriptors of a target property in materials-science data. Specifically, we will be concerned with data generated by density-functional theory calculations. At first, we demonstrate that SGD can identify physically meaningful models that classify the crystal structures of 82 octet binary semiconductors as either rocksalt or zincblende. SGD identifies an interpretable two-dimensional model derived from only the atomic radii of valence s and p orbitals that properly classifies the crystal structures for 79 of the 82 octet binary semiconductors. The SGD framework is subsequently applied to 24 400 configurations of neutral gas-phase gold clusters with 5 to 14 atoms to discern general patterns between geometrical and physicochemical properties. For example, SGD helps find that van der Waals interactions within gold clusters are linearly correlated with their radius of gyration and are weaker for planar clusters than for nonplanar clusters. Also, a descriptor that predicts a local linear correlation between the chemical hardness and the cluster isomer stability is found for the even-sized gold clusters.
Submitted 13 December, 2016;
originally announced December 2016.
-
Probabilistic Structured Predictors
Authors:
Shankar Vembu,
Thomas Gärtner,
Mario Boley
Abstract:
We consider MAP estimators for structured prediction with exponential family models. In particular, we concentrate on the case that efficient algorithms for uniform sampling from the output space exist. We show that under this assumption (i) exact computation of the partition function remains a hard problem, and (ii) the partition function and the gradient of the log partition function can be approximated efficiently. Our main result is an approximation scheme for the partition function based on Markov Chain Monte Carlo theory. We also show that the efficient uniform sampling assumption holds in several application settings that are of importance in machine learning.
Submitted 9 May, 2012;
originally announced May 2012.