-
Identifying Climate Targets in National Laws and Policies using Machine Learning
Authors:
Matyas Juhasz,
Tina Marchand,
Roshan Melwani,
Kalyan Dutia,
Sarah Goodenough,
Harrison Pim,
Henry Franks
Abstract:
Quantified policy targets are a fundamental element of climate policy, typically characterised by domain-specific and technical language. Current methods for curating comprehensive views of global climate policy targets entail significant manual effort. At present there are few scalable methods for extracting climate targets from national laws or policies, which limits policymakers' and researcher…
▽ More
Quantified policy targets are a fundamental element of climate policy, typically characterised by domain-specific and technical language. Current methods for curating comprehensive views of global climate policy targets entail significant manual effort. At present there are few scalable methods for extracting climate targets from national laws or policies, which limits policymakers' and researchers' ability to (1) assess private and public sector alignment with global goals and (2) inform policy decisions. In this paper we present an approach for extracting mentions of climate targets from national laws and policies. We create an expert-annotated dataset identifying three categories of target ('Net Zero', 'Reduction' and 'Other' (e.g. renewable energy targets)) and train a classifier to reliably identify them in text. We investigate bias and equity impacts related to our model and identify specific years and country names as problematic features. Finally, we investigate the characteristics of the dataset produced by running this classifier on the Climate Policy Radar (CPR) dataset of global national climate laws and policies and UNFCCC submissions, highlighting the potential of automated and scalable data collection for existing climate policy databases and supporting further research. Our work represents a significant upgrade in the accessibility of these key climate policy elements for policymakers and researchers. We publish our model at https://huggingface.co/ClimatePolicyRadar/national-climate-targets and related dataset at https://huggingface.co/datasets/ClimatePolicyRadar/national-climate-targets.
△ Less
Submitted 4 April, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
SRATTA : Sample Re-ATTribution Attack of Secure Aggregation in Federated Learning
Authors:
Tanguy Marchand,
Régis Loeb,
Ulysse Marteau-Ferey,
Jean Ogier du Terrail,
Arthur Pignet
Abstract:
We consider a cross-silo federated learning (FL) setting where a machine learning model with a fully connected first layer is trained between different clients and a central server using FedAvg, and where the aggregation step can be performed with secure aggregation (SA). We present SRATTA an attack relying only on aggregated models which, under realistic assumptions, (i) recovers data samples fro…
▽ More
We consider a cross-silo federated learning (FL) setting where a machine learning model with a fully connected first layer is trained between different clients and a central server using FedAvg, and where the aggregation step can be performed with secure aggregation (SA). We present SRATTA an attack relying only on aggregated models which, under realistic assumptions, (i) recovers data samples from the different clients, and (ii) groups data samples coming from the same client together. While sample recovery has already been explored in an FL setting, the ability to group samples per client, despite the use of SA, is novel. This poses a significant unforeseen security threat to FL and effectively breaks SA. We show that SRATTA is both theoretically grounded and can be used in practice on realistic models and datasets. We also propose counter-measures, and claim that clients should play an active role to guarantee their privacy during training.
△ Less
Submitted 13 June, 2023;
originally announced June 2023.
-
FLamby: Datasets and Benchmarks for Cross-Silo Federated Learning in Realistic Healthcare Settings
Authors:
Jean Ogier du Terrail,
Samy-Safwan Ayed,
Edwige Cyffers,
Felix Grimberg,
Chaoyang He,
Regis Loeb,
Paul Mangold,
Tanguy Marchand,
Othmane Marfoq,
Erum Mushtaq,
Boris Muzellec,
Constantin Philippenko,
Santiago Silva,
Maria Teleńczuk,
Shadi Albarqouni,
Salman Avestimehr,
Aurélien Bellet,
Aymeric Dieuleveut,
Martin Jaggi,
Sai Praneeth Karimireddy,
Marco Lorenzi,
Giovanni Neglia,
Marc Tommasi,
Mathieu Andreux
Abstract:
Federated Learning (FL) is a novel approach enabling several clients holding sensitive data to collaboratively train machine learning models, without centralizing data. The cross-silo FL setting corresponds to the case of few ($2$--$50$) reliable clients, each holding medium to large datasets, and is typically found in applications such as healthcare, finance, or industry. While previous works hav…
▽ More
Federated Learning (FL) is a novel approach enabling several clients holding sensitive data to collaboratively train machine learning models, without centralizing data. The cross-silo FL setting corresponds to the case of few ($2$--$50$) reliable clients, each holding medium to large datasets, and is typically found in applications such as healthcare, finance, or industry. While previous works have proposed representative datasets for cross-device FL, few realistic healthcare cross-silo FL datasets exist, thereby slowing algorithmic research in this critical application. In this work, we propose a novel cross-silo dataset suite focused on healthcare, FLamby (Federated Learning AMple Benchmark of Your cross-silo strategies), to bridge the gap between theory and practice of cross-silo FL. FLamby encompasses 7 healthcare datasets with natural splits, covering multiple tasks, modalities, and data volumes, each accompanied with baseline training code. As an illustration, we additionally benchmark standard FL algorithms on all datasets. Our flexible and modular suite allows researchers to easily download datasets, reproduce results and re-use the different components for their research. FLamby is available at~\url{www.github.com/owkin/flamby}.
△ Less
Submitted 5 May, 2023; v1 submitted 10 October, 2022;
originally announced October 2022.
-
SecureFedYJ: a safe feature Gaussianization protocol for Federated Learning
Authors:
Tanguy Marchand,
Boris Muzellec,
Constance Beguier,
Jean Ogier du Terrail,
Mathieu Andreux
Abstract:
The Yeo-Johnson (YJ) transformation is a standard parametrized per-feature unidimensional transformation often used to Gaussianize features in machine learning. In this paper, we investigate the problem of applying the YJ transformation in a cross-silo Federated Learning setting under privacy constraints. For the first time, we prove that the YJ negative log-likelihood is in fact convex, which all…
▽ More
The Yeo-Johnson (YJ) transformation is a standard parametrized per-feature unidimensional transformation often used to Gaussianize features in machine learning. In this paper, we investigate the problem of applying the YJ transformation in a cross-silo Federated Learning setting under privacy constraints. For the first time, we prove that the YJ negative log-likelihood is in fact convex, which allows us to optimize it with exponential search. We numerically show that the resulting algorithm is more stable than the state-of-the-art approach based on the Brent minimization method. Building on this simple algorithm and Secure Multiparty Computation routines, we propose SecureFedYJ, a federated algorithm that performs a pooled-equivalent YJ transformation without leaking more information than the final fitted parameters do. Quantitative experiments on real data demonstrate that, in addition to being secure, our approach reliably normalizes features across silos as well as if data were pooled, making it a viable approach for safe federated feature Gaussianization.
△ Less
Submitted 13 October, 2022; v1 submitted 4 October, 2022;
originally announced October 2022.
-
Wavelet Conditional Renormalization Group
Authors:
Tanguy Marchand,
Misaki Ozawa,
Giulio Biroli,
Stéphane Mallat
Abstract:
We develop a multiscale approach to estimate high-dimensional probability distributions from a dataset of physical fields or configurations observed in experiments or simulations. In this way we can estimate energy functions (or Hamiltonians) and efficiently generate new samples of many-body systems in various domains, from statistical physics to cosmology. Our method -- the Wavelet Conditional Re…
▽ More
We develop a multiscale approach to estimate high-dimensional probability distributions from a dataset of physical fields or configurations observed in experiments or simulations. In this way we can estimate energy functions (or Hamiltonians) and efficiently generate new samples of many-body systems in various domains, from statistical physics to cosmology. Our method -- the Wavelet Conditional Renormalization Group (WC-RG) -- proceeds scale by scale, estimating models for the conditional probabilities of "fast degrees of freedom" conditioned by coarse-grained fields. These probability distributions are modeled by energy functions associated with scale interactions, and are represented in an orthogonal wavelet basis. WC-RG decomposes the microscopic energy function as a sum of interaction energies at all scales and can efficiently generate new samples by going from coarse to fine scales. Near phase transitions, it avoids the "critical slowing down" of direct estimation and sampling algorithms. This is explained theoretically by combining results from RG and wavelet theories, and verified numerically for the Gaussian and $\varphi^4$ field theories. We show that multiscale WC-RG energy-based models are more general than local potential models and can capture the physics of complex many-body interacting systems at all length scales. This is demonstrated for weak-gravitational-lensing fields reflecting dark matter distributions in cosmology, which include long-range interactions with long-tail probability distributions. WC-RG has a large number of potential applications in non-equilibrium systems, where the underlying distribution is not known {\it a priori}. Finally, we discuss the connection between WC-RG and deep network architectures.
△ Less
Submitted 11 July, 2022;
originally announced July 2022.
-
Self supervised learning improves dMMR/MSI detection from histology slides across multiple cancers
Authors:
Charlie Saillard,
Olivier Dehaene,
Tanguy Marchand,
Olivier Moindrot,
Aurélie Kamoun,
Benoit Schmauch,
Simon Jegou
Abstract:
Microsatellite instability (MSI) is a tumor phenotype whose diagnosis largely impacts patient care in colorectal cancers (CRC), and is associated with response to immunotherapy in all solid tumors. Deep learning models detecting MSI tumors directly from H&E stained slides have shown promise in improving diagnosis of MSI patients. Prior deep learning models for MSI detection have relied on neural n…
▽ More
Microsatellite instability (MSI) is a tumor phenotype whose diagnosis largely impacts patient care in colorectal cancers (CRC), and is associated with response to immunotherapy in all solid tumors. Deep learning models detecting MSI tumors directly from H&E stained slides have shown promise in improving diagnosis of MSI patients. Prior deep learning models for MSI detection have relied on neural networks pretrained on ImageNet dataset, which does not contain any medical image. In this study, we leverage recent advances in self-supervised learning by training neural networks on histology images from the TCGA dataset using MoCo V2. We show that these networks consistently outperform their counterparts pretrained using ImageNet and obtain state-of-the-art results for MSI detection with AUCs of 0.92 and 0.83 for CRC and gastric tumors, respectively. These models generalize well on an external CRC cohort (0.97 AUC on PAIP) and improve transfer from one organ to another. Finally we show that predictive image regions exhibit meaningful histological patterns, and that the use of MoCo features highlighted more relevant patterns according to an expert pathologist.
△ Less
Submitted 13 September, 2021;
originally announced September 2021.