Search | arXiv e-print repository

OpenProteinSet: Training data for structural biology at scale

Authors: Gustaf Ahdritz, Nazim Bouatta, Sachin Kadyan, Lukas Jarosch, Daniel Berenberg, Ian Fisk, Andrew M. Watkins, Stephen Ra, Richard Bonneau, Mohammed AlQuraishi

Abstract: Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally… ▽ More Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research. △ Less

Submitted 10 August, 2023; originally announced August 2023.

arXiv:2209.08868 [pdf, other]

Snowmass 2021 Computational Frontier CompF4 Topical Group Report: Storage and Processing Resource Access

Authors: W. Bhimji, D. Carder, E. Dart, J. Duarte, I. Fisk, R. Gardner, C. Guok, B. Jayatilaka, T. Lehman, M. Lin, C. Maltzahn, S. McKee, M. S. Neubauer, O. Rind, O. Shadura, N. V. Tran, P. van Gemmeren, G. Watts, B. A. Weaver, F. Würthwein

Abstract: Computing plays a significant role in all areas of high energy physics. The Snowmass 2021 CompF4 topical group's scope is facilities R&D, where we consider "facilities" as the computing hardware and software infrastructure inside the data centers plus the networking between data centers, irrespective of who owns them, and what policies are applied for using them. In other words, it includes commer… ▽ More Computing plays a significant role in all areas of high energy physics. The Snowmass 2021 CompF4 topical group's scope is facilities R&D, where we consider "facilities" as the computing hardware and software infrastructure inside the data centers plus the networking between data centers, irrespective of who owns them, and what policies are applied for using them. In other words, it includes commercial clouds, federally funded High Performance Computing (HPC) systems for all of science, and systems funded explicitly for a given experimental or theoretical program. This topical group report summarizes the findings and recommendations for the storage, processing, networking and associated software service infrastructures for future high energy physics research, based on the discussions organized through the Snowmass 2021 community study. △ Less

Submitted 29 September, 2022; v1 submitted 19 September, 2022; originally announced September 2022.

Comments: Snowmass 2021 Computational Frontier CompF4 topical group report. v2: Expanded introduction. Updated author list. 52 pages, 6 figures

arXiv:1901.07143 [pdf, other]

doi 10.1051/epjconf/201921406030

Using Big Data Technologies for HEP Analysis

Authors: Matteo Cremonesi, Claudio Bellini, Bianny Bian, Luca Canali, Vasileios Dimakopoulos, Peter Elmer, Ian Fisk, Maria Girone, Oliver Gutsche, Siew-Yan Hoh, Bo Jayatilaka, Viktor Khristenko, Andrea Luiselli, Andrew Melo, Evangelos Evangelos, Dominick Olivito, Jacopo Pazzini, Jim Pivarski, Alexey Svyatkovskiy, Marco Zanetti

Abstract: The HEP community is approaching an era were the excellent performances of the particle accelerators in delivering collision at high rate will force the experiments to record a large amount of information. The growing size of the datasets could potentially become a limiting factor in the capability to produce scientific results timely and efficiently. Recently, new technologies and new approaches… ▽ More The HEP community is approaching an era were the excellent performances of the particle accelerators in delivering collision at high rate will force the experiments to record a large amount of information. The growing size of the datasets could potentially become a limiting factor in the capability to produce scientific results timely and efficiently. Recently, new technologies and new approaches have been developed in industry to answer to the necessity to retrieve information as quickly as possible to analyze PB and EB datasets. Providing the scientists with these modern computing tools will lead to rethinking the principles of data analysis in HEP, making the overall scientific process faster and smoother. In this paper, we are presenting the latest developments and the most recent results on the usage of Apache Spark for HEP analysis. The study aims at evaluating the efficiency of the application of the new tools both quantitatively, by measuring the performances, and qualitatively, focusing on the user experience. The first goal is achieved by develo** a data reduction facility: working together with CERN Openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing 1 PB of public data in 5 hours, collected by the CMS experiment, to 1 TB of data in a format suitable for physics analysis. The second goal is achieved by implementing multiple physics use-cases in Apache Spark using as input preprocessed datasets derived from official CMS data and simulation. By performing different end-analyses up to the publication plots on different hardware, feasibility, usability and portability are compared to the ones of a traditional ROOT-based workflow. △ Less

Submitted 21 January, 2019; originally announced January 2019.

arXiv:1711.00375 [pdf, other]

CMS Analysis and Data Reduction with Apache Spark

Authors: Oliver Gutsche, Luca Canali, Illia Cremer, Matteo Cremonesi, Peter Elmer, Ian Fisk, Maria Girone, Bo Jayatilaka, Jim Kowalkowski, Viktor Khristenko, Evangelos Motesnitsalis, Jim Pivarski, Saba Sehrish, Kacper Surdy, Alexey Svyatkovskiy

Abstract: Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems for distributed data processing, collectively called "Big Data" technologies have emerged from industry and open source projects to support the a… ▽ More Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems for distributed data processing, collectively called "Big Data" technologies have emerged from industry and open source projects to support the analysis of Petabyte and Exabyte datasets in industry. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and tools, promising a fresh look at analysis of very large datasets that could potentially reduce the time-to-physics with increased interactivity. Moreover these new tools are typically actively developed by large communities, often profiting of industry resources, and under open source licensing. These factors result in a boost for adoption and maturity of the tools and for the communities supporting them, at the same time hel** in reducing the cost of ownership for the end-users. In this talk, we are presenting studies of using Apache Spark for end user data analysis. We are studying the HEP analysis workflow separated into two thrusts: the reduction of centrally produced experiment datasets and the end-analysis up to the publication plot. Studying the first thrust, CMS is working together with CERN openlab and Intel on the CMS Big Data Reduction Facility. The goal is to reduce 1 PB of official CMS data to 1 TB of ntuple output for analysis. We are presenting the progress of this 2-year project with first results of scaling up Spark-based HEP analysis. Studying the second thrust, we are presenting studies on using Apache Spark for a CMS Dark Matter physics search, comparing Spark's feasibility, usability and performance to the ROOT-based analysis. △ Less

Submitted 31 October, 2017; originally announced November 2017.

Comments: Proceedings for 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2017). arXiv admin note: text overlap with arXiv:1703.04171

arXiv:1710.00100 [pdf, other]

doi 10.1007/s41781-017-0001-9

HEPCloud, a New Paradigm for HEP Facilities: CMS Amazon Web Services Investigation

Authors: Burt Holzman, Lothar A. T. Bauerdick, Brian Bockelman, Dave Dykstra, Ian Fisk, Stuart Fuess, Gabriele Garzoglio, Maria Girone, Oliver Gutsche, Dirk Hufnagel, Hyunwoo Kim, Robert Kennedy, Nicolo Magini, David Mason, Panagiotis Spentzouris, Anthony Tiradani, Steve Timm, Eric W. Vaandering

Abstract: Historically, high energy physics computing has been performed on large purpose-built computing systems. These began as single-site compute facilities, but have evolved into the distributed computing grids used today. Recently, there has been an exponential increase in the capacity and capability of commercial clouds. Cloud resources are highly virtualized and intended to be able to be flexibly de… ▽ More Historically, high energy physics computing has been performed on large purpose-built computing systems. These began as single-site compute facilities, but have evolved into the distributed computing grids used today. Recently, there has been an exponential increase in the capacity and capability of commercial clouds. Cloud resources are highly virtualized and intended to be able to be flexibly deployed for a variety of computing tasks. There is a growing nterest among the cloud providers to demonstrate the capability to perform large-scale scientific computing. In this paper, we discuss results from the CMS experiment using the Fermilab HEPCloud facility, which utilized both local Fermilab resources and virtual machines in the Amazon Web Services Elastic Compute Cloud. We discuss the planning, technical challenges, and lessons learned involved in performing physics workflows on a large-scale set of virtualized resources. In addition, we will discuss the economics and operational efficiencies when executing workflows both in the cloud and on dedicated resources. △ Less

Submitted 29 September, 2017; originally announced October 2017.

Comments: 15 pages, 9 figures

Journal ref: Comput Softw Big Sci (2017) 1:1

arXiv:1012.1980 [pdf]

First Experiences with LHC Grid Computing and Distributed Analysis

Authors: Ian Fisk

Abstract: In this presentation the experiences of the LHC experiments using grid computing were presented with a focus on experience with distributed analysis. After many years of development, preparation, exercises, and validation the LHC (Large Hadron Collider) experiments are in operations. The computing infrastructure has been heavily utilized in the first 6 months of data collection. The general experi… ▽ More In this presentation the experiences of the LHC experiments using grid computing were presented with a focus on experience with distributed analysis. After many years of development, preparation, exercises, and validation the LHC (Large Hadron Collider) experiments are in operations. The computing infrastructure has been heavily utilized in the first 6 months of data collection. The general experience of exploiting the grid infrastructure for organized processing and preparation is described, as well as the successes employing the infrastructure for distributed analysis. At the end the expected evolution and future plans are outlined. △ Less

Submitted 9 December, 2010; originally announced December 2010.

Comments: 6 pages, 2 figures, Proceedings of the Hadron Collider Physics Symposium 2010 (HCP2010), Toronto Canada

arXiv:cs/0305066 [pdf, ps, other]

The CMS Integration Grid Testbed

Authors: Gregory E. Graham, M. Anzar Afaq, Shafqat Aziz, L. A. T. Bauerdick, Michael Ernst, Joseph Kaiser, Natalia Ratnikova, Hans Wenzel, Yujun Wu, Erik Aslakson, Julian Bunn, Saima Iqbal, Iosif Legrand, Harvey Newman, Suresh Singh, Conrad Steenberg, James Branson, Ian Fisk, James Letts, Adam Arbree, Paul Avery, Dimitri Bourilkov, Richard Cavanaugh, Jorge Rodriguez, Suchindra Kategari , et al. (5 additional authors not shown)

Abstract: The CMS Integration Grid Testbed (IGT) comprises USCMS Tier-1 and Tier-2 hardware at the following sites: the California Institute of Technology, Fermi National Accelerator Laboratory, the University of California at San Diego, and the University of Florida at Gainesville. The IGT runs jobs using the Globus Toolkit with a DAGMan and Condor-G front end. The virtual organization (VO) is managed us… ▽ More The CMS Integration Grid Testbed (IGT) comprises USCMS Tier-1 and Tier-2 hardware at the following sites: the California Institute of Technology, Fermi National Accelerator Laboratory, the University of California at San Diego, and the University of Florida at Gainesville. The IGT runs jobs using the Globus Toolkit with a DAGMan and Condor-G front end. The virtual organization (VO) is managed using VO management scripts from the European Data Grid (EDG). Gridwide monitoring is accomplished using local tools such as Ganglia interfaced into the Globus Metadata Directory Service (MDS) and the agent based Mona Lisa. Domain specific software is packaged and installed using the Distrib ution After Release (DAR) tool of CMS, while middleware under the auspices of the Virtual Data Toolkit (VDT) is distributed using Pacman. During a continuo us two month span in Fall of 2002, over 1 million official CMS GEANT based Monte Carlo events were generated and returned to CERN for analysis while being demonstrated at SC2002. In this paper, we describe the process that led to one of the world's first continuously available, functioning grids. △ Less

Submitted 10 June, 2003; v1 submitted 30 May, 2003; originally announced May 2003.

Comments: CHEP 2003 MOCT010

ACM Class: A.0; C.2.4

Journal ref: eConfC0303241:MOCT010B,2003

Showing 1–7 of 7 results for author: Fisk, I