-
SpecHD: Hyperdimensional Computing Framework for FPGA-based Mass Spectrometry Clustering
Authors:
Sumukh Pinge,
Weihong Xu,
Jaeyoung Kang,
Tianqi Zhang,
Niema Moshiri,
Wout Bittremieux,
Tajana Rosing
Abstract:
Mass spectrometry-based proteomics is a key enabler for personalized healthcare, providing a deep dive into the complex protein compositions of biological systems. This technology has vast applications in biotechnology and biomedicine but faces significant computational bottlenecks. Current methodologies often require multiple hours or even days to process extensive datasets, particularly in the domain of spectral clustering. To tackle these inefficiencies, we introduce SpecHD, a hyperdimensional computing (HDC) framework supplemented by an FPGA-accelerated architecture with integrated near-storage preprocessing. Utilizing streamlined binary operations in an HDC environment, SpecHD capitalizes on the low-latency and parallel capabilities of FPGAs. This approach markedly improves clustering speed and efficiency, serving as a catalyst for real-time, high-throughput data analysis in future healthcare applications. Our evaluations demonstrate that SpecHD not only maintains but often surpasses existing clustering quality metrics while drastically cutting computational time. Specifically, it can cluster a large-scale human proteome dataset, comprising 25 million MS/MS spectra and 131 GB of MS data, in just 5 minutes. With an energy-efficiency improvement exceeding 31x and a speedup of 6x to 54x over existing state-of-the-art solutions, SpecHD emerges as a promising solution for the rapid analysis of mass spectrometry data, with great implications for personalized healthcare.
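The "streamlined binary operations" mentioned in the abstract can be illustrated with a minimal sketch. It is not SpecHD's encoder or its FPGA pipeline: the dimensionality, the noise model, the greedy clustering rule, and every function name below are assumptions chosen for illustration. It only shows why binary hypervectors are cheap to compare, since a Hamming distance reduces to an XOR followed by a bit count.

```python
import random

DIM = 2048  # hypervector dimensionality (illustrative choice, not from the paper)

def random_hv(rng):
    """Return a random binary hypervector stored as a Python int bitset."""
    return rng.getrandbits(DIM)

def hamming(a, b):
    """Hamming distance: XOR the two bitsets, then count the set bits."""
    return bin(a ^ b).count("1")

def flip_bits(hv, n_flips, rng):
    """Return a noisy copy of hv with n_flips distinct bit positions inverted."""
    for pos in rng.sample(range(DIM), n_flips):
        hv ^= 1 << pos
    return hv

def greedy_cluster(hvs, threshold):
    """Assign each hypervector to the first cluster whose representative is
    within `threshold` bits; otherwise open a new cluster (hypothetical rule)."""
    reps, labels = [], []
    for hv in hvs:
        for idx, rep in enumerate(reps):
            if hamming(hv, rep) <= threshold:
                labels.append(idx)
                break
        else:
            reps.append(hv)
            labels.append(len(reps) - 1)
    return labels

rng = random.Random(0)
base_a, base_b = random_hv(rng), random_hv(rng)
# Four stand-in "spectra": noisy copies of two unrelated base hypervectors.
spectra = [flip_bits(base_a, 50, rng), flip_bits(base_b, 50, rng),
           flip_bits(base_a, 50, rng), flip_bits(base_b, 50, rng)]
labels = greedy_cluster(spectra, threshold=DIM // 4)
print(labels)
```

Unrelated random hypervectors differ in roughly half of their bits, while noisy copies of the same vector stay close, so a loose threshold separates the two groups.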
Submitted 20 November, 2023;
originally announced November 2023.
-
HD-Bind: Encoding of Molecular Structure with Low Precision, Hyperdimensional Binary Representations
Authors:
Derek Jones,
Jonathan E. Allen,
Xiaohua Zhang,
Behnam Khaleghi,
Jaeyoung Kang,
Weihong Xu,
Niema Moshiri,
Tajana S. Rosing
Abstract:
Publicly available collections of drug-like molecules have recently grown to comprise tens of billions of possibilities due to advances in chemical synthesis. Traditional methods for identifying "hit" molecules from a large collection of potential drug-like candidates have relied on biophysical theory to compute approximations to the Gibbs free energy of the binding interaction between the drug and its protein target. A major drawback of these approaches is that they require exceptional computing capabilities to consider even relatively small collections of molecules.
Hyperdimensional Computing (HDC) is a recently proposed learning paradigm that leverages low-precision binary vector arithmetic to build efficient representations of the data without the gradient-based optimization required by many conventional machine learning and deep learning approaches. This algorithmic simplicity allows for acceleration in hardware that has been previously demonstrated for a range of application areas. We consider existing HDC approaches for molecular property classification and introduce two novel encoding algorithms that leverage the extended connectivity fingerprint (ECFP) algorithm.
We show that HDC-based inference methods are as much as 90 times more efficient than more complex representative machine learning methods and achieve an acceleration of nearly 9 orders of magnitude as compared to inference with molecular docking. We demonstrate multiple approaches for the encoding of molecular data for HDC and examine their relative performance on a range of challenging molecular property prediction and drug-protein binding classification tasks. Our work thus motivates further investigation into molecular representation learning to develop ultra-efficient pre-screening tools.
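A rough sketch of gradient-free HDC classification, under stated assumptions, may make the idea concrete. It does not reproduce the paper's two encoding algorithms. The integer feature IDs are hypothetical stand-ins for ECFP substructure identifiers, the bipolar (+1/-1) form is used as an equivalent of binary vectors, and all names and the toy data are invented for illustration.

```python
import random

DIM = 1024  # illustrative dimensionality

def item_hv(feature_id):
    """Deterministic pseudo-random bipolar hypervector for one feature ID."""
    rng = random.Random(feature_id)
    return [1 if rng.random() < 0.5 else -1 for _ in range(DIM)]

def encode(feature_ids):
    """Bundle item hypervectors: element-wise sum, then sign (ties go to +1)."""
    sums = [0] * DIM
    for fid in feature_ids:
        for i, v in enumerate(item_hv(fid)):
            sums[i] += v
    return [1 if s >= 0 else -1 for s in sums]

def similarity(a, b):
    """Normalized dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / DIM

def train(examples):
    """Single pass: sum the encodings of each class, then binarize."""
    acc = {}
    for feature_ids, label in examples:
        proto = acc.setdefault(label, [0] * DIM)
        for i, v in enumerate(encode(feature_ids)):
            proto[i] += v
    return {label: [1 if s >= 0 else -1 for s in proto]
            for label, proto in acc.items()}

def predict(prototypes, feature_ids):
    """Return the label whose prototype is most similar to the query."""
    query = encode(feature_ids)
    return max(prototypes, key=lambda label: similarity(prototypes[label], query))

# Toy data: each "molecule" is a list of hypothetical substructure IDs.
training = [
    ([1, 2, 3, 4, 5], "binder"),
    ([1, 2, 3, 6, 7], "binder"),
    ([10, 11, 12, 13, 14], "non-binder"),
    ([10, 11, 12, 15, 16], "non-binder"),
]
prototypes = train(training)
print(predict(prototypes, [1, 2, 3, 8, 9]))
```

Training here is a single accumulation pass per class, which is the property the abstract contrasts with gradient-based optimization.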
Submitted 27 March, 2023;
originally announced March 2023.
-
RAPIDx: High-performance ReRAM Processing in-Memory Accelerator for Sequence Alignment
Authors:
Weihong Xu,
Saransh Gupta,
Niema Moshiri,
Tajana Rosing
Abstract:
Genome sequence alignment is the core of many biological applications. The advancement of sequencing technologies produces a tremendous amount of data, making sequence alignment a critical bottleneck in bioinformatics analysis. The existing hardware accelerators for alignment suffer from limited on-chip memory, costly data movement, and poorly optimized alignment algorithms. They cannot afford to concurrently process the massive amount of data generated by sequencing machines. In this paper, we propose a ReRAM-based accelerator, RAPIDx, using processing in-memory (PIM) for sequence alignment. RAPIDx achieves superior efficiency and performance via software-hardware co-design. First, we propose an adaptive banded parallelism alignment algorithm suited to PIM architectures. Compared to the original dynamic programming-based alignment, the proposed algorithm significantly reduces the required complexity, data bit width, and memory footprint at the cost of negligible accuracy degradation. Then we propose an efficient PIM architecture that implements the proposed algorithm. The data flow in RAPIDx achieves four-level parallelism, and we design an in-situ alignment computation flow in ReRAM, delivering $5.5$-$9.7\times$ efficiency and throughput improvements compared to our previous PIM design, RAPID. RAPIDx is reconfigurable and can serve as a co-processor integrated into existing genome analysis pipelines to boost sequence alignment or edit distance calculation. On short-read alignment, RAPIDx delivers $131.1\times$ and $46.8\times$ throughput improvements over state-of-the-art CPU and GPU libraries, respectively. Compared to ASIC accelerators for long-read alignment, the performance of RAPIDx is $1.8$-$2.9\times$ higher.
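To show why banding reduces the work of dynamic programming-based alignment, here is a minimal fixed-band edit distance in software. It is a swapped-in textbook technique, not RAPIDx's adaptive banded parallelism algorithm or its PIM data flow; the function name and the `band` parameter are assumptions. Only cells within `band` of the main diagonal are evaluated, so the cost per row is bounded by the band width rather than the sequence length.

```python
def banded_edit_distance(a, b, band):
    """Edit distance restricted to DP cells with |i - j| <= band.

    The result is an upper bound on the true edit distance and is exact
    whenever the true distance is at most `band`. Returns None when the
    length difference alone exceeds the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    INF = float("inf")
    # Row 0: only columns inside the band are reachable.
    prev = [j if j <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        if lo == 0:
            cur[0] = i  # delete the first i characters of a
        for j in range(max(1, lo), hi + 1):
            substitute = prev[j - 1] + (a[i - 1] != b[j - 1])
            delete = prev[j] + 1
            insert = cur[j - 1] + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur  # keep only one previous row in memory
    return prev[m]

print(banded_edit_distance("kitten", "sitting", 3))
print(banded_edit_distance("flaw", "lawn", 1))
print(banded_edit_distance("flaw", "lawn", 0))
```

With a band of 0 only substitutions along the diagonal are allowed, so the last call overestimates the distance; widening the band to 1 recovers the exact value.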
Submitted 24 January, 2023; v1 submitted 10 November, 2022;
originally announced November 2022.
-
NiemaGraphGen: A memory-efficient global-scale contact network simulation toolkit
Authors:
Niema Moshiri
Abstract:
Epidemic simulations require the ability to sample contact networks from various random graph models. Existing methods can simulate city-scale or even country-scale contact networks, but they cannot feasibly simulate global-scale contact networks due to high memory consumption. NiemaGraphGen (NGG) is a memory-efficient graph generation tool that enables the simulation of global-scale contact networks. NGG avoids storing the entire graph in memory and is instead intended to be used in a data streaming pipeline, resulting in memory consumption that is orders of magnitude smaller than that of existing tools. NGG provides a massively scalable solution for simulating social contact networks, enabling global-scale epidemic simulation studies.
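The streaming idea can be sketched with a generator that emits edges one at a time instead of building a graph object. This is not NGG's code or command-line interface; the Erdős–Rényi model, the function names, and the downstream consumer are illustrative assumptions. The point is only that memory use stays constant in the number of edges when each edge is handed to the next pipeline stage as soon as it is sampled.

```python
import random

def stream_er_edges(n, p, seed=None):
    """Yield the edges of an Erdős–Rényi G(n, p) graph one at a time.

    Each unordered node pair is included independently with probability p.
    Nothing is stored, so memory use does not grow with the edge count.
    """
    rng = random.Random(seed)
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                yield u, v

def count_edges(edges):
    """Example consumer: reduce the stream to a single running total."""
    total = 0
    for _ in edges:
        total += 1
    return total

# The consumer pulls edges lazily; no edge list is ever materialized.
print(count_edges(stream_er_edges(1000, 0.01, seed=42)))
```

This naive version still takes time quadratic in the number of nodes; it illustrates the memory behaviour, not a fast sampler.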
Submitted 13 January, 2022;
originally announced January 2022.
-
Ten Simple Rules for Attending Your First Conference
Authors:
Elizabeth Leininger,
Kelly Shaw,
Niema Moshiri,
Kelly Neiles,
Getiria Onsongo,
Anna Ritz
Abstract:
Conferences are a mainstay of most scientific disciplines, where scientists of all career stages come together to share cutting-edge ideas and approaches. If you do research, chances are you will attend one or more of these meetings in your career. Conferences are a microcosm of their discipline, and while conferences offer different perspectives in different disciplines, they all offer experiences that range from a casual chat waiting in line for coffee to watching someone present their groundbreaking, hot-off-the-press research. The authors of this piece have attended our fair share of conferences and have collectively mentored hundreds of students in understanding the "unwritten rules" and pro-tips of conference attendance. As you head to your first scientific conference, these rules will help you navigate the conference environment and make the most of your experience.
We have also developed a web portal which contains far more information about these rules, tables of professional societies and conferences in different disciplines, and other resources that may come in handy for first-time conference attendees and their mentors. We encourage any reader to use, adapt, and contribute to these materials.
Submitted 19 April, 2021; v1 submitted 21 January, 2021;
originally announced January 2021.
-
The dual-Barabási-Albert model
Authors:
Niema Moshiri
Abstract:
The ability to sample random networks that can accurately represent real social contact networks is essential to the study of viral epidemics. The Barabási-Albert model and its extensions attempt to capture reality by generating networks with power-law degree distributions, but properties of the resulting distributions (e.g. minimum, average, and maximum degree) are often unrepresentative of the social contacts the models attempt to capture. I propose a novel extension of the Barabási-Albert model, which I call the "dual-Barabási-Albert" (DBA) model, that attempts to better capture these properties of real networks of social contact.
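The abstract does not spell out the growth rule, so the sketch below assumes the formulation documented for NetworkX's dual_barabasi_albert_graph: each new node attaches m1 edges with probability p and m2 edges otherwise, choosing targets by preferential attachment. The star-shaped starting graph and all names are illustrative choices, and the code is a plain-Python approximation rather than the author's reference implementation.

```python
import random

def dual_barabasi_albert(n, m1, m2, p, seed=None):
    """Return the edge list of a graph grown under the assumed DBA rule."""
    m0 = max(m1, m2)
    if min(m1, m2) < 1 or n <= m0 or not 0 <= p <= 1:
        raise ValueError("need 1 <= m1, m2 < n and 0 <= p <= 1")
    rng = random.Random(seed)
    # Starting graph (an assumption): a star with centre 0 and m0 leaves.
    edges = [(0, leaf) for leaf in range(1, m0 + 1)]
    # Each node appears once per incident edge, so a uniform draw from
    # this list is a degree-proportional draw over nodes.
    endpoints = [node for edge in edges for node in edge]
    for new_node in range(m0 + 1, n):
        m = m1 if rng.random() < p else m2
        targets = set()
        while len(targets) < m:  # resample until m distinct targets
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((new_node, target))
            endpoints.extend((new_node, target))
    return edges

edges = dual_barabasi_albert(n=1000, m1=1, m2=4, p=0.5, seed=3)
print(len(edges), 2 * len(edges) / 1000)  # edge count and average degree
```

Mixing two attachment counts lets the minimum degree stay at min(m1, m2) while the average degree is tuned through p, which is the kind of control over degree statistics the abstract is concerned with.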
Submitted 24 October, 2018;
originally announced October 2018.
-
Ten Simple Rules for Reproducible Research in Jupyter Notebooks
Authors:
Adam Rule,
Amanda Birmingham,
Cristal Zuniga,
Ilkay Altintas,
Shih-Cheng Huang,
Rob Knight,
Niema Moshiri,
Mai H. Nguyen,
Sara Brin Rosenthal,
Fernando Pérez,
Peter W. Rose
Abstract:
Reproducibility of computational studies is a hallmark of scientific methodology. It enables researchers to build with confidence on the methods and findings of others, reuse and extend computational pipelines, and thereby drive scientific progress. Since many experimental studies rely on computational analyses, biologists need guidance on how to set up and document reproducible data analyses or simulations.
In this paper, we address several questions about reproducibility. For example, what are the technical and non-technical barriers to reproducible computational studies? What opportunities and challenges do computational notebooks offer to overcome some of these barriers? What tools are available and how can they be used effectively?
We have developed a set of rules to serve as a guide to scientists with a specific focus on computational notebook systems, such as Jupyter Notebooks, which have become a tool of choice for many applications. Notebooks combine detailed workflows with narrative text and visualization of results. Combined with software repositories and open source licensing, notebooks are powerful tools for transparent, collaborative, reproducible, and reusable data analyses.
Submitted 13 October, 2018;
originally announced October 2018.