Search | arXiv e-print repository

Mamba State-Space Models Can Be Strong Downstream Learners

Authors: John T. Halloran, Manbir Gulati, Paul F. Roysdon

Abstract: Mamba state-space models (SSMs) have recently outperformed state-of-the-art (SOTA) Transformer large language models (LLMs) in various tasks and been widely adapted. However, Mamba's downstream learning capabilities remain either unexplored, e.g., mixed-precision (MPFT) and parameter-efficient fine-tuning (PEFT), or under-evaluated, e.g., in-context learning (ICL). For the latter, recent works reported Mamba's ICL rivals SOTA Transformer LLMs using non-standard benchmarks. In contrast, we show that on standard benchmarks, pretrained Mamba models achieve only 38% of the ICL performance improvements (over zero-shot) of comparable Transformers. Enabling MPFT and PEFT in Mamba architectures is challenging due to recurrent dynamics and highly customized CUDA kernels, respectively. However, we prove that Mamba's recurrent dynamics are robust to small input changes using dynamical systems theory. Empirically, we show that performance changes in Mamba's inference and fine-tuning due to mixed-precision align with Transformer LLMs. Furthermore, we show that targeting key memory buffers in Mamba's customized CUDA kernels for low-rank adaptation regularizes SSM parameters, thus achieving parameter efficiency while retaining speedups. We show that combining MPFT and PEFT enables up to 2.15 times more tokens-per-second and 65.5% reduced per-token-memory compared to full Mamba fine-tuning, while achieving up to 81.5% of the ICL performance improvements (over zero-shot) of comparably fine-tuned Transformers.

Submitted 31 May, 2024; originally announced June 2024.

Comments: 16 pages, 4 figures, 3 tables

arXiv:2201.01240

Feedback and Engagement on an Introductory Programming Module

Authors: Beate Grawemeyer, John Halloran, Matthew England, David Croft

Abstract: We ran a study on engagement and achievement for a first year undergraduate programming module which used an online learning environment containing tasks which generate automated feedback. Students could also access human feedback from traditional labs. We gathered quantitative data on engagement and achievement which allowed us to split the cohort into 6 groups. We then ran interviews with students after the end of the module to produce qualitative data on perceptions of what feedback is, how useful it is, the uses made of it, and how it bears on engagement. A general finding was that human and automated feedback are different but complementary. However there are different feedback needs by group. Our findings imply: (1) that a blended human-automated feedback approach improves engagement; and (2) that this approach needs to be differentiated according to type of student. We give implications for the design of feedback for programming modules.

Submitted 4 January, 2022; originally announced January 2022.

Comments: To appear in Proc. CEP 2022

ACM Class: K.3.2

arXiv:2008.03433

GPU-Accelerated Primal Learning for Extremely Fast Large-Scale Classification

Authors: John T. Halloran, David M. Rocke

Abstract: One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON. While TRON has recently been shown to enjoy substantial speedups on shared-memory multi-core systems, exploiting graphical processing units (GPUs) to speed up the method is significantly more difficult, owing to the highly complex and heavily sequential nature of the algorithm. In this work, we show that using judicious GPU-optimization principles, TRON training time for different losses and feature representations may be drastically reduced. For sparse feature sets, we show that using GPUs to train logistic regression classifiers in LIBLINEAR is up to an order-of-magnitude faster than solely using multithreading. For dense feature sets, which impose far more stringent memory constraints, we show that GPUs substantially reduce the lengthy SVM learning times required for state-of-the-art proteomics analysis, leading to dramatic improvements over recently proposed speedups. Furthermore, we show how GPU speedups may be mixed with multithreading to enable such speedups when the dataset is too large for GPU memory requirements; on a massive dense proteomics dataset of nearly a quarter-billion data instances, these mixed-architecture speedups reduce SVM analysis time from over half a week to less than a single day while using limited GPU memory.

Submitted 14 October, 2020; v1 submitted 7 August, 2020; originally announced August 2020.

Comments: 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada

arXiv:1909.02136

Learning Concave Conditional Likelihood Models for Improved Analysis of Tandem Mass Spectra

Authors: John T. Halloran, David M. Rocke

Abstract: The most widely used technology to identify the proteins present in a complex biological sample is tandem mass spectrometry, which quickly produces a large collection of spectra representative of the peptides (i.e., protein subsequences) present in the original sample. In this work, we greatly expand the parameter learning capabilities of a dynamic Bayesian network (DBN) peptide-scoring algorithm, Didea, by deriving emission distributions for which its conditional log-likelihood scoring function remains concave. We show that this class of emission distributions, called Convex Virtual Emissions (CVEs), naturally generalizes the log-sum-exp function while rendering both maximum likelihood estimation and conditional maximum likelihood estimation concave for a wide range of Bayesian networks. Utilizing CVEs in Didea allows efficient learning of a large number of parameters while ensuring global convergence, in stark contrast to Didea's previous parameter learning framework (which could only learn a single parameter using a costly grid search) and other trainable models (which only ensure convergence to local optima). The newly trained scoring function substantially outperforms the state-of-the-art in both scoring function accuracy and downstream Fisher kernel analysis. Furthermore, we significantly improve Didea's runtime performance through successive optimizations to its message passing schedule and derive explicit connections between Didea's new concave score and related MS/MS scoring functions.

Submitted 4 September, 2019; originally announced September 2019.

Comments: 16 pages. A partitioned version of this appeared in NeurIPS 2018

arXiv:1909.02093

Gradients of Generative Models for Improved Discriminative Analysis of Tandem Mass Spectra

Authors: John T. Halloran, David M. Rocke

Abstract: Tandem mass spectrometry (MS/MS) is a high-throughput technology used to identify the proteins in a complex biological sample, such as a drop of blood. A collection of spectra is generated at the output of the process, each spectrum of which is representative of a peptide (protein subsequence) present in the original complex sample. In this work, we leverage the log-likelihood gradients of generative models to improve the identification of such spectra. In particular, we show that the gradient of a recently proposed dynamic Bayesian network (DBN) may be naturally employed by a kernel-based discriminative classifier. The resulting Fisher kernel substantially improves upon recent attempts to combine generative and discriminative models for post-processing analysis, outperforming all other methods on the evaluated datasets. We extend the improved accuracy offered by the Fisher kernel framework to other search algorithms by introducing Theseus, a DBN representing a large number of widely used MS/MS scoring functions. Furthermore, with gradient ascent and max-product inference at hand, we use Theseus to learn model parameters without any supervision.

Submitted 4 September, 2019; originally announced September 2019.

Comments: 13 pages. A partitioned version of this appeared in NIPS 2017

arXiv:1807.06574

Jensen: An Easily-Extensible C++ Toolkit for Production-Level Machine Learning and Convex Optimization

Authors: Rishabh Iyer, John T. Halloran, Kai Wei

Abstract: This paper introduces Jensen, an easily extensible and scalable toolkit for production-level machine learning and convex optimization. Jensen implements a framework of convex (or loss) functions, convex optimization algorithms (including Gradient Descent, L-BFGS, Stochastic Gradient Descent, Conjugate Gradient, etc.), and a family of machine learning classifiers and regressors (Logistic Regression, SVMs, Least Square Regression, etc.). This framework makes it possible to deploy and train models with a few lines of code, and also extend and build upon this by integrating new loss functions and optimization algorithms.

Submitted 17 July, 2018; originally announced July 2018.

arXiv:1210.4904

Spectrum Identification using a Dynamic Bayesian Network Model of Tandem Mass Spectra

Authors: Ajit P. Singh, John Halloran, Jeff A. Bilmes, Katrin Kirchhoff, William S. Noble

Abstract: Shotgun proteomics is a high-throughput technology used to identify unknown proteins in a complex mixture. At the heart of this process is a prediction task, the spectrum identification problem, in which each fragmentation spectrum produced by a shotgun proteomics experiment must be mapped to the peptide (protein subsequence) which generated the spectrum. We propose a new algorithm for spectrum identification, based on dynamic Bayesian networks, which significantly outperforms the de-facto standard tools for this task: SEQUEST and Mascot.

Submitted 16 October, 2012; originally announced October 2012.

Comments: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012)

Report number: UAI-P-2012-PG-775-785

Showing 1–7 of 7 results for author: Halloran, J