-
DiscoBAX: Discovery of Optimal Intervention Sets in Genomic Experiment Design
Authors:
Clare Lyle,
Arash Mehrjou,
Pascal Notin,
Andrew Jesson,
Stefan Bauer,
Yarin Gal,
Patrick Schwab
Abstract:
The discovery of therapeutics to treat genetically-driven pathologies relies on identifying genes involved in the underlying disease mechanisms. Existing approaches search over the billions of potential interventions to maximize the expected influence on the target phenotype. However, to reduce the risk of failure in future stages of trials, practical experiment design aims to find a set of interventions that maximally change a target phenotype via diverse mechanisms. We propose DiscoBAX, a sample-efficient method for maximizing the rate of significant discoveries per experiment while simultaneously probing for a wide range of diverse mechanisms during a genomic experiment campaign. We provide theoretical guarantees of approximate optimality under standard assumptions, and conduct a comprehensive experimental evaluation covering both synthetic as well as real-world experimental design tasks. DiscoBAX outperforms existing state-of-the-art methods for experimental design, selecting effective and diverse perturbations in biological systems.
Submitted 7 December, 2023;
originally announced December 2023.
-
The CausalBench challenge: A machine learning contest for gene network inference from single-cell perturbation data
Authors:
Mathieu Chevalley,
Jacob Sackett-Sanders,
Yusuf Roohani,
Pascal Notin,
Artemy Bakulin,
Dariusz Brzezinski,
Kaiwen Deng,
Yuanfang Guan,
Justin Hong,
Michael Ibrahim,
Wojciech Kotlowski,
Marcin Kowiel,
Panagiotis Misiakos,
Achille Nazaret,
Markus Püschel,
Chris Wendler,
Arash Mehrjou,
Patrick Schwab
Abstract:
In drug discovery, mapping interactions between genes within cellular systems is a crucial early step. This helps formulate hypotheses regarding molecular mechanisms that could potentially be targeted by future medicines. The CausalBench Challenge was an initiative to invite the machine learning community to advance the state of the art in constructing gene-gene interaction networks. These networks, derived from large-scale, real-world datasets of single cells under various perturbations, are crucial for understanding the causal mechanisms underlying disease biology. Using the framework provided by the CausalBench benchmark, participants were tasked with enhancing the capacity of state-of-the-art methods to leverage large-scale genetic perturbation data. This report provides an analysis and summary of the methods submitted during the challenge to give a partial picture of the state of the art at the time of the challenge. The winning solutions significantly improved performance compared to previous baselines, establishing a new state of the art for this critical task in biology and medicine.
Submitted 29 August, 2023;
originally announced August 2023.
-
RITA: a Study on Scaling Up Generative Protein Sequence Models
Authors:
Daniel Hesslow,
Niccoló Zanichelli,
Pascal Notin,
Iacopo Poli,
Debora Marks
Abstract:
In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models in next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community.
Submitted 14 July, 2022; v1 submitted 11 May, 2022;
originally announced May 2022.