Search | arXiv e-print repository

Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling

Authors: Ahmed Elnaggar, Hazem Essam, Wafaa Salah-Eldin, Walid Moustafa, Mohamed Elkerdawy, Charlotte Rochereau, Burkhard Rost

Abstract: As opposed to scaling up protein language models (PLMs), we seek to improve performance via protein-specific optimization. Although the proportionality between the language model size and the richness of its learned representations is validated, we prioritize accessibility and pursue a path of data-efficient, cost-reduced, and knowledge-guided optimization. Through over twenty experiments spanning masking, architecture, and pre-training data, we derive insights from protein-specific experimentation into building a model that interprets the language of life, optimally. We present Ankh, the first general-purpose PLM trained on Google's TPU-v4 surpassing the state-of-the-art performance with fewer parameters (<10% for pre-training, <7% for inference, and <30% for the embedding dimension). We provide a representative range of structure and function benchmarks where Ankh excels. We further provide a protein variant generation analysis on High-N and One-N input data scales where Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics. We dedicate our work to promoting accessibility to research innovation via attainable resources.

Submitted 16 January, 2023; originally announced January 2023.

Comments: 29 pages, 6 figures

arXiv:2211.07274

doi 10.1103/PhysRevE.107.L042602

The effect of curvature on the diffusion of colloidal bananas

Authors: Justin-Aurel Ulbrich, Carla Fernandez-Rico, Brian Rost, Jacopo Vialetto, Lucio Isa, Jeffrey S. Urbach, Roel P. A. Dullens

Abstract: Anisotropic colloidal particles exhibit complex dynamics which play a crucial role in their functionality, transport and phase behaviour. In this work, we investigate the two-dimensional diffusion of smoothly curved colloidal rods -- also known as colloidal bananas -- as a function of their opening angle, α. We measure the translational and rotational diffusion coefficients of the particles with opening angles ranging from 0° (straight rods) to nearly 360° (closed rings). In particular, we find that the anisotropic diffusion of the particles varies non-monotonically with their opening angle and that the axis of fastest diffusion switches from the long to the short axis of the particles when α>180°. We also find that the rotational diffusion coefficient of nearly closed rings is approximately an order of magnitude higher than that of straight rods of the same length. Finally, we show that the experimental results are consistent with Slender Body Theory, indicating that the dynamical behavior of the particles arises primarily from their local drag anisotropy. These results highlight the impact of curvature on the Brownian Motion of elongated colloidal particles, which must be taken into account when seeking to understand the behaviour of curved colloidal particles.

Submitted 14 November, 2022; originally announced November 2022.

Comments: 5 pages (including references) and 4 figures

arXiv:2204.12400

Robust measurements of $n$-point correlation functions of driven-dissipative quantum systems on a digital quantum computer

Authors: Lorenzo Del Re, Brian Rost, Michael Foss-Feig, A. F. Kemper, J. K. Freericks

Abstract: We propose and demonstrate a unified hierarchical method to measure $n$-point correlation functions that can be applied to driven, dissipative, or otherwise open or non-equilibrium quantum systems. In this method, the time evolution of the system is repeatedly interrupted by interacting an ancilla qubit with the system through a controlled operation, and measuring the ancilla immediately afterwards. We discuss the robustness of this method as compared to other ancilla-based interferometric techniques (such as the Hadamard test), and highlight its advantages for near-term quantum simulations of open quantum systems. We implement the method on a quantum computer in order to measure single-particle Green's functions of a driven-dissipative fermionic system. This work shows that dynamical correlation functions for driven-dissipative systems can be robustly measured with near-term quantum computers.

Submitted 28 February, 2024; v1 submitted 26 April, 2022; originally announced April 2022.

arXiv:2108.01183

Long-Time Error-Mitigating Simulation of Open Quantum Systems on Near Term Quantum Computers

Authors: Brian Rost, Lorenzo Del Re, Nathan Earnest, Alexander F. Kemper, Barbara Jones, James K. Freericks

Abstract: We study an open quantum system simulation on quantum hardware, which demonstrates robustness to hardware errors even with deep circuits containing up to two thousand entangling gates. We simulate two systems of electrons coupled to an infinite thermal bath: 1) a system of dissipative free electrons in a driving electric field; and 2) the thermalization of two interacting electrons in a single orbital in a magnetic field -- the Hubbard atom. These problems are solved using IBM quantum computers, showing no signs of decreasing fidelity at long times. Our results demonstrate that algorithms for simulating open quantum systems are able to far outperform similarly complex non-dissipative algorithms on noisy hardware. Our two examples show promise that the driven-dissipative quantum many-body problem can eventually be solved on quantum computers.

Submitted 5 June, 2024; v1 submitted 2 August, 2021; originally announced August 2021.

arXiv:2104.02443

CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing

Authors: Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, Burkhard Rost

Abstract: Currently, a growing number of mature natural language processing applications make people's lives more convenient. Such applications are built by source code - the language in software engineering. However, the applications for understanding source code language to ease the software engineering process are under-researched. Simultaneously, the transformer model, especially its combination with transfer learning, has been proven to be a powerful technique for natural language processing tasks. These breakthroughs point out a promising direction for processing source code and cracking software engineering tasks. This paper describes CodeTrans - an encoder-decoder transformer model for tasks in the software engineering domain, that explores the effectiveness of encoder-decoder transformer models for six software engineering tasks, including thirteen sub-tasks. Moreover, we have investigated the effect of different training strategies, including single-task learning, transfer learning, multi-task learning, and multi-task learning with fine-tuning. CodeTrans outperforms the state-of-the-art models on all the tasks. To expedite future works in the software engineering domain, we have published our pre-trained models of CodeTrans. https://github.com/agemagician/CodeTrans

Submitted 12 May, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

Comments: 28 pages, 6 tables and 1 figure

arXiv:2007.06225

ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

Authors: Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, Burkhard Rost

Abstract: Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second were per-protein predictions of protein sub-cellular localization (ten-state accuracy: Q10=81%) and membrane vs. water-soluble (2-state accuracy Q2=91%). For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https://github.com/agemagician/ProtTrans.

Submitted 4 May, 2021; v1 submitted 13 July, 2020; originally announced July 2020.

Comments: 17 pages, 9 figures, 4 tables

arXiv:2001.00794

Simulation of Thermal Relaxation in Spin Chemistry Systems on a Quantum Computer Using Inherent Qubit Decoherence

Authors: Brian Rost, Barbara Jones, Mariya Vyushkova, Aaila Ali, Charlotte Cullip, Alexander Vyushkov, Jarek Nabrzyski

Abstract: Current and near term quantum computers (i.e. NISQ devices) are limited in their computational power in part due to qubit decoherence. Here we seek to take advantage of qubit decoherence as a resource in simulating the behavior of real world quantum systems, which are always subject to decoherence, with no additional computational overhead. As a first step toward this goal we simulate the thermal relaxation of quantum beats in radical ion pairs (RPs) on a quantum computer as a proof of concept of the method. We present three methods for implementing the thermal relaxation, one which explicitly applies the relaxation Kraus operators, one which combines results from two separate circuits in a classical post-processing step, and one which relies on leveraging the inherent decoherence of the qubits themselves. We use our methods to simulate two real world systems and find excellent agreement between our results, experimental data, and the theoretical prediction.

Submitted 6 November, 2020; v1 submitted 3 January, 2020; originally announced January 2020.

Comments: 14 pages, 13 figures

arXiv:1912.08310

Driven-dissipative quantum mechanics on a lattice: Simulating a fermionic reservoir on a quantum computer

Authors: Lorenzo Del Re, Brian Rost, A. F. Kemper, J. K. Freericks

Abstract: The driven-dissipative many-body problem remains one of the most challenging unsolved problems in quantum mechanics. The advent of quantum computers may provide a unique platform for efficiently simulating such driven-dissipative systems. But there are many choices for how one can engineer the reservoir. One can simply employ ancilla qubits to act as a reservoir and then digitally simulate them via algorithmic cooling. A more attractive approach, which allows one to simulate an infinite reservoir, is to integrate out the bath degrees of freedom and describe the driven-dissipative system via a master equation, that can also be simulated on a quantum computer. In this work, we consider the particular case of non-interacting electrons on a lattice driven by an electric field and coupled to a fermionic thermostat. Then, we provide two different quantum circuits: the first one reconstructs the full dynamics of the system using Trotter steps, while the second one dissipatively prepares the final non-equilibrium steady state in a single step. We run both circuits on the IBM quantum experience. For circuit (i), we achieved up to 5 Trotter steps. When partial resets become available on quantum computers, we expect that the maximum simulation time can be significantly increased. The methods developed here suggest generalizations that can be applied to simulating interacting driven-dissipative systems.

Submitted 17 August, 2020; v1 submitted 17 December, 2019; originally announced December 2019.

arXiv:1905.01564

doi 10.1103/PhysRevE.102.023103

Effective aspect ratio of helices in shear flow

Authors: Brian W. Rost, Justin T. Stimatze, David A. Egolf, Jeffrey S. Urbach

Abstract: We report the results of simulations of rigid colloidal helices suspended in a shear flow, using dissipative particle dynamics for a coarse-grained representation of the suspending fluid, as well as deterministic trajectories of non-Brownian helices calculated from the resistance tensor derived under the slender-body approximation. The shear flow produces nonuniform rotation of the helices, similarly to other high aspect ratio particles, such that more elongated helices spend more time aligned with the fluid velocity. We introduce a geometric effective aspect ratio calculated directly from the helix geometry and a dynamical effective aspect ratio derived from the trajectories of the particles and find that the two effective aspect ratios are approximately equal over the entire parameter range tested. We also describe observed transient deflections of the helical axis into the vorticity direction that can occur when the helix is rotating through the gradient direction and that depend on the rotation of the helix about its axis.

Submitted 11 August, 2020; v1 submitted 4 May, 2019; originally announced May 2019.

Comments: 9 pages, 8 figures. Supplements include interactive Mathematica notebook and static PDF version

Journal ref: Phys. Rev. E 102, 023103 (2020)

arXiv:1605.04614

DeepLearningKit - an GPU Optimized Deep Learning Framework for Apple's iOS, OS X and tvOS developed in Metal and Swift

Authors: Amund Tveit, Torbjørn Morland, Thomas Brox Røst

Abstract: In this paper we present DeepLearningKit - an open source framework that supports using pretrained deep learning models (convolutional neural networks) for iOS, OS X and tvOS. DeepLearningKit is developed in Metal in order to utilize the GPU efficiently and Swift for integration with applications, e.g. iOS-based mobile apps on iPhone/iPad, tvOS-based apps for the big screen, or OS X desktop applications. The goal is to support using deep learning models trained with popular frameworks such as Caffe, Torch, TensorFlow, Theano, Pylearn, Deeplearning4J and Mocha. Given the massive GPU resources and time required to train Deep Learning models we suggest an App Store like model to distribute and download pretrained and reusable Deep Learning models.

Submitted 15 May, 2016; originally announced May 2016.

Comments: 9 pages, 12 figures, open source documentation and code at deeplearningkit.org and github.com/deeplearningkit

arXiv:1601.00891

doi 10.1186/s13059-016-1037-6

An expanded evaluation of protein function prediction methods shows an improvement in accuracy

Authors: Yuxiang Jiang, Tal Ronnen Oron, Wyatt T Clark, Asma R Bankapur, Daniel D'Andrea, Rosalba Lepore, Christopher S Funk, Indika Kahanda, Karin M Verspoor, Asa Ben-Hur, Emily Koo, Duncan Penfold-Brown, Dennis Shasha, Noah Youngs, Richard Bonneau, Alexandra Lin, Sayed ME Sahraeian, Pier Luigi Martelli, Giuseppe Profiti, Rita Casadio, Renzhi Cao, Zhaolong Zhong, Jianlin Cheng, Adrian Altenhoff, Nives Skunca, et al. (122 additional authors not shown)

Abstract: Background: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. At the same time, the limitations in technology for generating data and the inherently stochastic nature of biomolecular events have led to the discrepancy between the volume of data and the amount of knowledge gleaned from it. A major bottleneck in our ability to understand the molecular underpinnings of life is the assignment of function to biological macromolecules, especially proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, accurately assessing methods for protein function prediction and tracking progress in the field remain challenging. Methodology: We have conducted the second Critical Assessment of Functional Annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. One hundred twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3,681 proteins from 18 species. CAFA2 featured significantly expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis also compared the best methods participating in CAFA1 to those of CAFA2.
Conclusions: The top performing methods in CAFA2 outperformed the best methods from CAFA1, demonstrating that computational function prediction is improving. This increased accuracy can be attributed to the combined effect of the growing number of experimental annotations and improved methods for function prediction.

Submitted 2 January, 2016; originally announced January 2016.

Comments: Submitted to Genome Biology

Showing 1–11 of 11 results for author: Rost, B