Search | arXiv e-print repository

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Authors: Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, Xian Li

Abstract: We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Expert (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.

Submitted 12 March, 2024; originally announced March 2024.
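
The merge step described in the BTX abstract above (feedforward parameters kept as separate experts, all remaining parameters averaged) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `btx_merge`, the `is_ffn` predicate, and the toy parameter names are hypothetical, parameters are plain Python lists rather than tensors, and the MoE finetuning stage that learns token-level routing is not shown.

```python
from statistics import fmean


def btx_merge(expert_states, is_ffn):
    """Combine per-expert parameter dicts (hypothetical helper).

    expert_states: list of dicts mapping parameter name -> list of floats.
    is_ffn: predicate deciding whether a parameter belongs to a feedforward block.
    """
    merged = {}
    for name in expert_states[0]:
        per_expert = [state[name] for state in expert_states]
        if is_ffn(name):
            # Feedforward weights stay separate: one copy per expert.
            merged[name] = [list(values) for values in per_expert]
        else:
            # Every other weight is averaged elementwise across experts.
            merged[name] = [fmean(values) for values in zip(*per_expert)]
    return merged


# Two toy "experts" with one attention weight and one feedforward weight each.
experts = [
    {"attn.w": [1.0, 2.0], "ffn.w": [0.1, 0.2]},
    {"attn.w": [3.0, 4.0], "ffn.w": [0.3, 0.4]},
]
merged = btx_merge(experts, is_ffn=lambda name: name.startswith("ffn"))
```

In this toy setup `merged["attn.w"]` holds the elementwise mean of the two attention weights, while `merged["ffn.w"]` holds both feedforward weights side by side, ready for a router to choose between them.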

arXiv:2402.13464 [pdf, other]

doi 10.1145/3613904.3641982

Investigating Why Clinicians Deviate from Standards of Care: Liberating Patients from Mechanical Ventilation in the ICU

Authors: Nur Yildirim, Susanna Zlotnikov, Aradhana Venkat, Gursimran Chawla, Jennifer Kim, Leigh A. Bukowski, Jeremy M. Kahn, James McCann, John Zimmerman

Abstract: Clinical practice guidelines, care pathways, and protocols are designed to support evidence-based practices for clinicians; however, their adoption remains a challenge. We set out to investigate why clinicians deviate from the "Wake Up and Breathe" protocol, an evidence-based guideline for liberating patients from mechanical ventilation in the intensive care unit (ICU). We conducted over 40 hours of direct observations of live clinical workflows, 17 interviews with frontline care providers, and 4 co-design workshops at three different medical intensive care units. Our findings indicate that unlike prior literature suggests, disagreement with the protocol is not a substantial barrier to adoption. Instead, the uncertainty surrounding the application of the protocol for individual patients leads clinicians to deprioritize adoption in favor of tasks where they have high certainty. Reflecting on these insights, we identify opportunities for technical systems to help clinicians in effectively executing the protocol and discuss future directions for HCI research to support the integration of protocols into clinical practice in complex, team-based healthcare settings.

Submitted 20 February, 2024; originally announced February 2024.

Comments: to appear at CHI 2024

arXiv:2402.13437 [pdf, other]

doi 10.1145/3613904.3641896

Sketching AI Concepts with Capabilities and Examples: AI Innovation in the Intensive Care Unit

Authors: Nur Yildirim, Susanna Zlotnikov, Deniz Sayar, Jeremy M. Kahn, Leigh A. Bukowski, Sher Shah Amin, Kathryn A. Riman, Billie S. Davis, John S. Minturn, Andrew J. King, Dan Ricketts, Lu Tang, Venkatesh Sivaraman, Adam Perer, Sarah M. Preum, James McCann, John Zimmerman

Abstract: Advances in artificial intelligence (AI) have enabled unprecedented capabilities, yet innovation teams struggle when envisioning AI concepts. Data science teams think of innovations users do not want, while domain experts think of innovations that cannot be built. A lack of effective ideation seems to be a breakdown point. How might multidisciplinary teams identify buildable and desirable use cases? This paper presents a first hand account of ideating AI concepts to improve critical care medicine. As a team of data scientists, clinicians, and HCI researchers, we conducted a series of design workshops to explore more effective approaches to AI concept ideation and problem formulation. We detail our process, the challenges we encountered, and practices and artifacts that proved effective. We discuss the research implications for improved collaboration and stakeholder engagement, and discuss the role HCI might play in reducing the high failure rate experienced in AI innovation.

Submitted 20 February, 2024; originally announced February 2024.

Comments: to appear at CHI 2024

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2310.17864 [pdf, other]

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance.

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2310.01352 [pdf, other]

RA-DIT: Retrieval-Augmented Dual Instruction Tuning

Authors: Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, Scott Yih

Abstract: Retrieval-augmented language models (RALMs) improve performance by accessing long-tail and up-to-date knowledge from external data stores, but are challenging to build. Existing approaches require either expensive retrieval-specific modifications to LM pre-training or use post-hoc integration of the data store that leads to suboptimal performance. We introduce Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning methodology that provides a third option by retrofitting any LLM with retrieval capabilities. Our approach operates in two distinct fine-tuning steps: (1) one updates a pre-trained LM to better use retrieved information, while (2) the other updates the retriever to return more relevant results, as preferred by the LM. By fine-tuning over tasks that require both knowledge utilization and contextual awareness, we demonstrate that each stage yields significant performance improvements, and using both leads to additional gains. Our best model, RA-DIT 65B, achieves state-of-the-art performance across a range of knowledge-intensive zero- and few-shot learning benchmarks, significantly outperforming existing in-context RALM approaches by up to +8.9% in 0-shot setting and +1.4% in 5-shot setting on average.

Submitted 6 May, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: v4: ICLR 2024 camera-ready version

arXiv:2303.08046 [pdf, other]

Ultra-High-Resolution Detector Simulation with Intra-Event Aware GAN and Self-Supervised Relational Reasoning

Authors: Baran Hashemi, Nikolai Hartmann, Sahand Sharifzadeh, James Kahn, Thomas Kuhr

Abstract: Simulating high-resolution detector responses is a storage-costly and computationally intensive process that has long been challenging in particle physics. Despite the ability of deep generative models to make this process more cost-efficient, ultra-high-resolution detector simulation still proves to be difficult as it contains correlated and fine-grained mutual information within an event. To overcome these limitations, we propose Intra-Event Aware GAN (IEA-GAN), a novel fusion of Self-Supervised Learning and Generative Adversarial Networks. IEA-GAN presents a Relational Reasoning Module that approximates the concept of an "event" in detector simulation, allowing for the generation of correlated layer-dependent contextualized images for high-resolution detector responses with a proper relational inductive bias. IEA-GAN also introduces a new intra-event aware loss and a Uniformity loss, resulting in significant enhancements to image fidelity and diversity. We demonstrate IEA-GAN's application in generating sensor-dependent images for the high-granularity Pixel Vertex Detector (PXD), with more than 7.5M information channels and a non-trivial geometry, at the Belle II Experiment. Applications of this work include controllable simulation-based inference and event generation, high-granularity detector simulation such as at the HL-LHC (High Luminosity LHC), and fine-grained density estimation and sampling. To the best of our knowledge, IEA-GAN is the first algorithm for faithful ultra-high-resolution detector simulation with event-based reasoning.

Submitted 7 March, 2023; originally announced March 2023.

arXiv:2302.06117 [pdf, other]

The Framework Tax: Disparities Between Inference Efficiency in NLP Research and Deployment

Authors: Jared Fernandez, Jacob Kahn, Clara Na, Yonatan Bisk, Emma Strubell

Abstract: Increased focus on the computational efficiency of NLP systems has motivated the design of efficient model architectures and improvements to underlying hardware accelerators. However, the resulting increases in computational throughput and reductions in floating point operations have not directly translated to improvements in wall-clock inference latency. We demonstrate that these discrepancies can be largely attributed to bottlenecks introduced by deep learning frameworks. We denote this phenomenon as the framework tax, and observe that the disparity is growing as hardware speed increases over time. In this work, we examine this phenomenon through a series of case studies analyzing the effects of model design decisions, framework paradigms, and hardware platforms on total model latency. Code is available at https://github.com/JaredFern/Framework-Tax.

Submitted 22 December, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

Comments: EMNLP 2023

arXiv:2302.00096 [pdf, other]

Ignore, Trust, or Negotiate: Understanding Clinician Acceptance of AI-Based Treatment Recommendations in Health Care

Authors: Venkatesh Sivaraman, Leigh A. Bukowski, Joel Levin, Jeremy M. Kahn, Adam Perer

Abstract: Artificial intelligence (AI) in healthcare has the potential to improve patient outcomes, but clinician acceptance remains a critical barrier. We developed a novel decision support interface that provides interpretable treatment recommendations for sepsis, a life-threatening condition in which decisional uncertainty is common, treatment practices vary widely, and poor outcomes can occur even with optimal decisions. This system formed the basis of a mixed-methods study in which 24 intensive care clinicians made AI-assisted decisions on real patient cases. We found that explanations generally increased confidence in the AI, but concordance with specific recommendations varied beyond the binary acceptance or rejection described in prior work. Although clinicians sometimes ignored or trusted the AI, they also often prioritized aspects of the recommendations to follow, reject, or delay in a process we term "negotiation." These results reveal novel barriers to adoption of treatment-focused AI tools and suggest ways to better support differing clinician perspectives.

Submitted 31 January, 2023; originally announced February 2023.

Comments: CHI 2023

arXiv:2210.12924 [pdf, other]

OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks

Authors: Benoit Steiner, Mostafa Elhoushi, Jacob Kahn, James Hegarty

Abstract: The size of deep neural networks has grown exponentially in recent years. Unfortunately, hardware devices have not kept pace with the rapidly increasing memory requirements. To cope with this, researchers have turned to techniques such as spilling and recomputation, which increase training time, or reduced precision and model pruning, which can affect model accuracy. We present OLLA, an algorithm that optimizes the lifetime and memory location of the tensors used to train neural networks. Our method reduces the memory usage of existing neural networks, without needing any modification to the models or their training procedures. We formulate the problem as a joint integer linear program (ILP). We present several techniques to simplify the encoding of the problem, and enable our approach to scale to the size of state-of-the-art neural networks using an off-the-shelf ILP solver. We experimentally demonstrate that OLLA only takes minutes if not seconds to allow the training of neural networks using one-third less memory on average.

Submitted 2 November, 2022; v1 submitted 23 October, 2022; originally announced October 2022.

arXiv:2208.14924 [pdf, other]

doi 10.1088/2632-2153/ac8de0

Learning Tree Structures from Leaves For Particle Decay Reconstruction

Authors: James Kahn, Ilias Tsaklidis, Oskar Taubert, Lea Reuter, Giulio Dujany, Tobias Boeckh, Arthur Thaller, Pablo Goldenzweig, Florian Bernlochner, Achim Streit, Markus Götz

Abstract: In this work, we present a neural approach to reconstructing rooted tree graphs describing hierarchical interactions, using a novel representation we term the Lowest Common Ancestor Generations (LCAG) matrix. This compact formulation is equivalent to the adjacency matrix, but enables learning a tree's structure from its leaves alone without the prior assumptions required if using the adjacency matrix directly. Employing the LCAG therefore enables the first end-to-end trainable solution which learns the hierarchical structure of varying tree sizes directly, using only the terminal tree leaves to do so. In the case of high-energy particle physics, a particle decay forms a hierarchical tree structure of which only the final products can be observed experimentally, and the large combinatorial space of possible trees makes an analytic solution intractable. We demonstrate the use of the LCAG as a target in the task of predicting simulated particle physics decay structures using both a Transformer encoder and a Neural Relational Inference encoder Graph Neural Network. With this approach, we are able to correctly predict the LCAG purely from leaf features for a maximum tree-depth of 8 in 92.5% of cases for trees up to 6 leaves (inclusive) and 59.7% for trees up to 10 in our simulated dataset.

Submitted 1 September, 2022; v1 submitted 31 August, 2022; originally announced August 2022.

Comments: 14 pages, 6 figures, accepted in Machine Learning: Science and Technology

arXiv:2203.11027 [pdf, other]

Reasoning over Public and Private Data in Retrieval-Based Systems

Authors: Simran Arora, Patrick Lewis, Angela Fan, Jacob Kahn, Christopher Ré

Abstract: Users and organizations are generating ever-increasing amounts of private data from a wide range of sources. Incorporating private data is important to personalize open-domain applications such as question-answering, fact-checking, and personal assistants. State-of-the-art systems for these tasks explicitly retrieve relevant information to a user question from a background corpus before producing an answer. While today's retrieval systems assume the corpus is fully accessible, users are often unable or unwilling to expose their private data to entities hosting public data. We first define the PUBLIC-PRIVATE AUTOREGRESSIVE INFORMATION RETRIEVAL (PAIR) privacy framework for the novel retrieval setting over multiple privacy scopes. We then argue that an adequate benchmark is missing to study PAIR since existing textual benchmarks require retrieving from a single data distribution. However, public and private data intuitively reflect different distributions, motivating us to create ConcurrentQA, the first textual QA benchmark to require concurrent retrieval over multiple data-distributions. Finally, we show that existing systems face large privacy vs. performance tradeoffs when applied to our proposed retrieval setting and investigate how to mitigate these tradeoffs.

Submitted 14 March, 2022; originally announced March 2022.

arXiv:2201.12465 [pdf, other]

Flashlight: Enabling Innovation in Tools for Machine Learning

Authors: Jacob Kahn, Vineel Pratap, Tatiana Likhomanenko, Qiantong Xu, Awni Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard Grave, Gilad Avidov, Benoit Steiner, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

Abstract: As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increase, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the difficulties involved in prototyping new computational paradigms with existing frameworks. Large frameworks prioritize machine learning researchers and practitioners as end users and pay comparatively little attention to systems researchers who can push frameworks forward -- we argue that both are equally important stakeholders. We introduce Flashlight, an open-source library built to spur innovation in machine learning tools and systems by prioritizing open, modular, customizable internals and state-of-the-art, research-ready models and training setups across a variety of domains. Flashlight allows systems researchers to rapidly prototype and experiment with novel ideas in machine learning computation and has low overhead, competing with and often outperforming other popular machine learning frameworks. We see Flashlight as a tool enabling research that can benefit widely used libraries downstream and bring machine learning and systems researchers closer together. Flashlight is available at https://github.com/flashlight/flashlight.

Submitted 22 June, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

Comments: Presented at ICML 2022

arXiv:2106.13706 [pdf, other]

Accelerated Computation of a High Dimensional Kolmogorov-Smirnov Distance

Authors: Alex Hagen, Shane Jackson, James Kahn, Jan Strube, Isabel Haide, Karl Pazdernik, Connor Hainje

Abstract: Statistical testing is widespread and critical for a variety of scientific disciplines. The advent of machine learning and the increase of computing power has increased the interest in the analysis and statistical testing of multidimensional data. We extend the powerful Kolmogorov-Smirnov two sample test to a high dimensional form in a similar manner to Fasano (Fasano, 1987). We call our result the d-dimensional Kolmogorov-Smirnov test (ddKS) and provide three novel contributions therewith: we develop an analytical equation for the significance of a given ddKS score, we provide an algorithm for computation of ddKS on modern computing hardware that is of constant time complexity for small sample sizes and dimensions, and we provide two approximate calculations of ddKS: one that reduces the time complexity to linear at larger sample sizes, and another that reduces the time complexity to linear with increasing dimension. We perform power analysis of ddKS and its approximations on a corpus of datasets and compare to other common high dimensional two sample tests and distances: Hotelling's T^2 test and Kullback-Leibler divergence. Our ddKS test performs well for all datasets, dimensions, and sizes tested, whereas the other tests and distances fail to reject the null hypothesis on at least one dataset. We therefore conclude that ddKS is a powerful multidimensional two sample test for general use, and can be calculated in a fast and efficient manner using our parallel or approximate methods. Open source implementations of all methods described in this work are located at https://github.com/pnnl/ddks.

Submitted 25 June, 2021; originally announced June 2021.

Comments: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence
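
For context on the entry above, the classic one-dimensional two-sample Kolmogorov-Smirnov statistic, which ddKS extends to d dimensions, is the largest absolute gap between the two empirical distribution functions. The sketch below computes only that one-dimensional statistic; it is not the ddKS algorithm, it is unrelated to the code in the linked repository, and the function name `ks_two_sample_statistic` is a hypothetical choice for illustration.

```python
from bisect import bisect_right


def ks_two_sample_statistic(xs, ys):
    """Largest absolute gap between the empirical CDFs of xs and ys (1-D only)."""
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at sample points,
    # so it is enough to compare them at every observed value.
    for value in xs_sorted + ys_sorted:
        cdf_x = bisect_right(xs_sorted, value) / len(xs_sorted)
        cdf_y = bisect_right(ys_sorted, value) / len(ys_sorted)
        largest_gap = max(largest_gap, abs(cdf_x - cdf_y))
    return largest_gap


identical = ks_two_sample_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_two_sample_statistic([1, 2, 3], [4, 5, 6])
overlapping = ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

Identical samples give a statistic of 0, fully separated samples give 1, and partially overlapping samples fall in between.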

arXiv:2104.05588 [pdf, other]

doi 10.1186/s40537-021-00556-1

Accelerating Neural Network Training with Distributed Asynchronous and Selective Optimization (DASO)

Authors: Daniel Coquelin, Charlotte Debus, Markus Götz, Fabrice von der Lehr, James Kahn, Martin Siggel, Achim Streit

Abstract: With increasing data and model complexities, the time required to train neural networks has become prohibitively large. To address the exponential rise in training time, users are turning to data parallel neural networks (DPNN) to utilize large-scale distributed resources on computer clusters. Current DPNN approaches implement the network parameter updates by synchronizing and averaging gradients across all processes with blocking communication operations. This synchronization is the central algorithmic bottleneck. To combat this, we introduce the Distributed Asynchronous and Selective Optimization (DASO) method which leverages multi-GPU compute node architectures to accelerate network training. DASO uses a hierarchical and asynchronous communication scheme comprised of node-local and global networks while adjusting the global synchronization rate during the learning process. We show that DASO yields a reduction in training time of up to 34% on classical and state-of-the-art networks, as compared to other existing data parallel training methods.

Submitted 15 April, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

Journal ref: J Big Data 9, 14 (2022)

arXiv:2104.01027 [pdf, other]

Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Authors: Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

Abstract: Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training data differs from the domain of the labeled data for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to large performance improvements across a variety of setups. On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%. This has obvious practical implications since it is much easier to obtain unlabeled target domain data than labeled data. Moreover, we find that pre-training on multiple domains improves generalization performance on domains not seen during training. Code and models will be made available at https://github.com/pytorch/fairseq.

Submitted 8 September, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

arXiv:2102.12523 [pdf, other]

doi 10.1145/3442381.3450093

Online Mobile App Usage as an Indicator of Sleep Behavior and Job Performance

Authors: Chunjong Park, Morelle Arian, Xin Liu, Leon Sasson, Jeffrey Kahn, Shwetak Patel, Alex Mariakakis, Tim Althoff

Abstract: Sleep is critical to human function, mediating factors like memory, mood, energy, and alertness; therefore, it is commonly conjectured that a good night's sleep is important for job performance. However, both real-world sleep behavior and job performance are hard to measure at scale. In this work, we show that people's everyday interactions with online mobile apps can reveal insights into their job performance in real-world contexts. We present an observational study in which we objectively tracked the sleep behavior and job performance of salespeople (N = 15) and athletes (N = 19) for 18 months, using a mattress sensor and online mobile app. We first demonstrate that cumulative sleep measures are correlated with job performance metrics, showing that an hour of daily sleep loss for a week was associated with a 9.0% and 9.5% reduction in performance of salespeople and athletes, respectively. We then examine the utility of online app interaction time as a passively collectible and scalable performance indicator. We show that app interaction time is correlated with the performance of the athletes, but not the salespeople. To support that our app-based performance indicator captures meaningful variation in psychomotor function and is robust against potential confounds, we conducted a second study to evaluate the relationship between sleep behavior and app interaction time in a cohort of 274 participants. Using a generalized additive model to control for per-participant random effects, we demonstrate that participants who lost one hour of daily sleep for a week exhibited 5.0% slower app interaction times. We also find that app interaction time exhibits meaningful chronobiologically consistent correlations with sleep history, time awake, and circadian rhythms. Our findings reveal an opportunity for online app developers to generate new insights regarding cognition and productivity.

Submitted 24 February, 2021; originally announced February 2021.

arXiv:2102.02852 [pdf, other]

Eliciting judgements about dependent quantities of interest: The SHELF extension and copula methods illustrated using an asthma case study

Authors: Björn Holzhauer, Lisa V. Hampson, John Paul Gosling, Björn Bornkamp, Joseph Kahn, Markus R. Lange, Wen-Lin Luo, Caterina Brindicci, David Lawrence, Steffen Ballerstedt, Anthony O'Hagan

Abstract: Pharmaceutical companies regularly need to make decisions about drug development programs based on the limited knowledge from early stage clinical trials. In this situation, eliciting the judgements of experts is an attractive approach for synthesising evidence on the unknown quantities of interest. When calculating the probability of success for a drug development program, multiple quantities of interest - such as the effect of a drug on different endpoints - should not be treated as unrelated. We discuss two approaches for establishing a multivariate distribution for several related quantities within the SHeffield ELicitation Framework (SHELF). The first approach elicits experts' judgements about a quantity of interest conditional on knowledge about another one. For the second approach, we first elicit marginal distributions for each quantity of interest. Then, for each pair of quantities, we elicit the concordance probability that both lie on the same side of their respective elicited medians. This allows us to specify a copula to obtain the joint distribution of the quantities of interest. We show how these approaches were used in an elicitation workshop that was performed to assess the probability of success of the registrational program of an asthma drug. The judgements of the experts, which were obtained prior to completion of the pivotal studies, were well aligned with the final trial results.

Submitted 15 February, 2021; v1 submitted 4 February, 2021; originally announced February 2021.

Comments: 29 pages, 7 figures

MSC Class: 62P10; 62P30; 62C99

arXiv:2010.11745 [pdf, ps, other]

Rethinking Evaluation in ASR: Are Our Models Robust Enough?

Authors: Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Paden Tomasello, Jacob Kahn, Gilad Avidov, Ronan Collobert, Gabriel Synnaeve

Abstract: Is pushing numbers on a single benchmark valuable in automatic speech recognition? Research results in acoustic modeling are typically evaluated based on performance on a single dataset. While the research community has coalesced around various benchmarks, we set out to understand generalization performance in acoustic modeling across datasets - in particular, if models trained on a single dataset transfer to other (possibly out-of-domain) datasets. We show that, in general, reverberative and additive noise augmentation improves generalization performance across domains. Further, we demonstrate that when a large enough set of benchmarks is used, average word error rate (WER) performance over them provides a good proxy for performance on real-world noisy data. Finally, we show that training a single acoustic model on the most widely-used datasets - combined - reaches competitive performance on both research and real-world benchmarks.

Submitted 2 May, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

MSC Class: 68T07; 68T10 ACM Class: I.2.6; I.5.4

arXiv:2010.11524 [pdf, other]

SlimIPL: Language-Model-Free Iterative Pseudo-Labeling

Authors: Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert

Abstract: Recent results in end-to-end automatic speech recognition have demonstrated the efficacy of pseudo-labeling for semi-supervised models trained both with Connectionist Temporal Classification (CTC) and Sequence-to-Sequence (seq2seq) losses. Iterative Pseudo-Labeling (IPL), which continuously trains a single model using pseudo-labels iteratively re-generated as the model learns, has been shown to further improve performance in ASR. We improve upon the IPL algorithm: as the model learns, we propose to iteratively re-generate transcriptions with hard labels (the most probable tokens), that is, without a language model. We call this approach Language-Model-Free IPL (slimIPL) and give a resultant training setup for low-resource settings with CTC-based models. slimIPL features a dynamic cache for pseudo-labels which reduces sensitivity to changes in relabeling hyperparameters and results in improved training stability. slimIPL is also highly efficient and requires 3.5-4x fewer computational resources to converge than other state-of-the-art semi/self-supervised approaches. With only 10 hours of labeled audio, slimIPL is competitive with self-supervised approaches, and is state-of-the-art with 100 hours of labeled audio without the use of a language model both at test time and during pseudo-label generation.

Submitted 29 August, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

arXiv:2010.01003 [pdf, other]

Differentiable Weighted Finite-State Transducers

Authors: Awni Hannun, Vineel Pratap, Jacob Kahn, Wei-Ning Hsu

Abstract: We introduce a framework for automatic differentiation with weighted finite-state transducers (WFSTs) allowing them to be used dynamically at training time. Through the separation of graphs from operations on graphs, this framework enables the exploration of new structured loss functions which in turn eases the encoding of prior knowledge into learning algorithms. We show how the framework can combine pruning and back-off in transition models with various sequence-level loss functions. We also show how to learn over the latent decomposition of phrases into word pieces. Finally, to demonstrate that WFSTs can be used in the interior of a deep neural network, we propose a convolutional WFST layer which maps lower-level representations to higher-level representations and can be used as a drop-in replacement for a traditional convolution. We validate these algorithms with experiments in handwriting recognition and speech recognition.

Submitted 2 October, 2020; originally announced October 2020.

arXiv:2005.09267 [pdf, other]

Iterative Pseudo-Labeling for Speech Recognition

Authors: Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Hannun, Gabriel Synnaeve, Ronan Collobert

Abstract: Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data. We study the main components of IPL: decoding with a language model and data augmentation. We then demonstrate the effectiveness of IPL by achieving state-of-the-art word-error rate on the Librispeech test sets in both standard and low-resource settings. We also study the effect of language models trained on different corpora to show IPL can effectively utilize additional text. Finally, we release a new large in-domain text corpus which does not overlap with the Librispeech training transcriptions to foster research in low-resource, semi-supervised ASR.

Submitted 26 August, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

Comments: INTERSPEECH 2020

arXiv:2004.07175 [pdf, other]

Sampling Rates for $\ell^1$-Synthesis

Authors: Maximilian März, Claire Boyer, Jonas Kahn, Pierre Weiss

Abstract: This work investigates the problem of signal recovery from undersampled noisy sub-Gaussian measurements under the assumption of a synthesis-based sparsity model. Solving the $\ell^1$-synthesis basis pursuit allows for a simultaneous estimation of a coefficient representation as well as the sought-for signal. However, due to linear dependencies within redundant dictionary atoms it might be impossible to identify a specific representation vector, although the actual signal is still successfully recovered. The present manuscript studies both estimation problems from a non-uniform, signal-dependent perspective. By utilizing recent results on the convex geometry of linear inverse problems, the sampling rates describing the phase transitions of each formulation are identified. In both cases, they are given by the conic Gaussian mean width of an $\ell^1$-descent cone that is linearly transformed by the dictionary. In general, this expression does not allow a simple calculation by following the polarity-based approach commonly found in the literature. Hence, two upper bounds involving the sparsity of coefficient representations are provided: The first one is based on a local condition number and the second one on a geometric analysis that makes use of the thinness of high-dimensional polyhedral cones with not too many generators. It is furthermore revealed that both recovery problems can differ dramatically with respect to robustness to measurement noise -- a fact that seems to have gone unnoticed in most of the related literature. All insights are carefully underpinned by numerical simulations.

Submitted 15 April, 2020; originally announced April 2020.

arXiv:2001.09727 [pdf, other]

Scaling Up Online Speech Recognition Using ConvNets

Authors: Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

Abstract: We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy. The system has almost three times the throughput of a well-tuned hybrid ASR baseline while also having lower latency and a better word error rate. Also important to the efficiency of the recognizer is our highly optimized beam search decoder. To show the impact of our design choices, we analyze throughput, latency, accuracy, and discuss how these metrics can be tuned based on the user requirements.

Submitted 27 January, 2020; originally announced January 2020.

arXiv:1912.07875 [pdf, ps, other]

doi 10.1109/ICASSP40776.2020.9052942

Libri-Light: A Benchmark for ASR with Limited or No Supervision

Authors: Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux

Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.

Submitted 17 December, 2019; originally announced December 2019.

arXiv:1911.08460 [pdf, ps, other]

End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

Authors: Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Tatiana Likhomanenko, Edouard Grave, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, Ronan Collobert

Abstract: We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the performance gaps between them. In doing so, we reach a new state-of-the-art for end-to-end acoustic models decoded with an external language model in the standard supervised learning setting, and a new absolute state-of-the-art with semi-supervised training. Finally, we study the effect of leveraging different amounts of unlabeled audio, propose several ways of evaluating the characteristics of unlabeled audio which improve acoustic modeling, and show that acoustic models trained with more audio rely less on external language models.

Submitted 14 July, 2020; v1 submitted 19 November, 2019; originally announced November 2019.

Comments: Published at the workshop on Self-supervision in Audio and Speech (SAS) at the 37th International Conference on Machine Learning (ICML 2020), Vienna, Austria

arXiv:1910.13433 [pdf, other]

Thresholds versus fractional expectation-thresholds

Authors: Keith Frankston, Jeff Kahn, Bhargav Narayanan, Jinyoung Park

Abstract: Proving a conjecture of Talagrand, a fractional version of the 'expectation-threshold' conjecture of Kalai and the second author, we show for any increasing family $F$ on a finite set $X$ that $p_c (F) =O( q_f (F) \log \ell(F))$, where $p_c(F)$ and $q_f(F)$ are the threshold and 'fractional expectation-threshold' of $F$, and $\ell(F)$ is the largest size of a minimal member of $F$. This easily implies several heretofore difficult results and conjectures in probabilistic combinatorics, including thresholds for perfect hypergraph matchings (Johansson--Kahn--Vu), bounded-degree spanning trees (Montgomery), and bounded-degree spanning graphs (new). We also resolve (and vastly extend) the 'axial' version of the random multi-dimensional assignment problem (earlier considered by Martin--Mézard--Rivoire and Frieze--Sorkin). Our approach builds on a recent breakthrough of Alweiss, Lovett, Wu and Zhang on the Erdős--Rado 'Sunflower Conjecture'.

Submitted 10 December, 2019; v1 submitted 29 October, 2019; originally announced October 2019.

Comments: 16 pages, submitted, now includes some discussion of applications

arXiv:1909.09116 [pdf, ps, other]

doi 10.1109/ICASSP40776.2020.9054295

Self-Training for End-to-End Speech Recognition

Authors: Jacob Kahn, Ann Lee, Awni Hannun

Abstract: We revisit self-training in the context of end-to-end speech recognition. We demonstrate that training with pseudo-labels can substantially improve the accuracy of a baseline model. Key to our approach are a strong baseline acoustic and language model used to generate the pseudo-labels, filtering mechanisms tailored to common errors from sequence-to-sequence models, and a novel ensemble approach to increase pseudo-label diversity. Experiments on the LibriSpeech corpus show that with an ensemble of four models and label filtering, self-training yields a 33.9% relative improvement in WER compared with a baseline trained on 100 hours of labelled data in the noisy speech setting. In the clean speech setting, self-training recovers 59.3% of the gap between the baseline and an oracle model, which is at least 93.8% relatively higher than what previous approaches can achieve.

Submitted 23 February, 2020; v1 submitted 19 September, 2019; originally announced September 2019.

Comments: To be published in the 45th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2020

arXiv:1812.07625 [pdf, other]

doi 10.1109/ICASSP.2019.8683535

wav2letter++: The Fastest Open-source Speech Recognition System

Authors: Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, Ronan Collobert

Abstract: This paper introduces wav2letter++, the fastest open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. Here we explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than 2x faster than other optimized frameworks for training end-to-end neural networks for speech recognition. We also show that wav2letter++'s training times scale linearly to 64 GPUs, the highest we tested, for models with 100 million parameters. High-performance frameworks enable fast iteration, which is often a crucial factor in successful research and model tuning on new datasets and tasks.

Submitted 18 December, 2018; originally announced December 2018.

arXiv:1809.11162 [pdf, other]

Fast state tomography with optimal error bounds

Authors: Madalin Guta, Jonas Kahn, Richard Kueng, Joel A. Tropp

Abstract: Projected least squares (PLS) is an intuitive and numerically cheap technique for quantum state tomography. The method first computes the least-squares estimator (or a linear inversion estimator) and then projects the initial estimate onto the space of states. The main result of this paper equips this point estimator with a rigorous, non-asymptotic confidence region expressed in terms of the trace distance. The analysis holds for a variety of measurements, including 2-designs and Pauli measurements. The sample complexity of the estimator is comparable to the strongest convergence guarantees available in the literature and -- in the case of measuring the uniform POVM -- saturates fundamental lower bounds. The results are derived by reinterpreting the least-squares estimator as a sum of random matrices and applying a matrix-valued concentration inequality. The theory is supported by numerical simulations for mutually unbiased bases, Pauli observables, and Pauli basis measurements.

Submitted 28 September, 2018; originally announced September 2018.

Comments: 5+10 pages, 2+1 figures

MSC Class: Primary: 81P50. Secondary: 15B52

arXiv:1804.08477 [pdf, other]

ASR Performance Prediction on Unseen Broadcast Programs using Convolutional Neural Networks

Authors: Zied Elloumi, Laurent Besacier, Olivier Galibert, Juliette Kahn, Benjamin Lecouteux

Abstract: In this paper, we address a relatively new task: prediction of ASR performance on unseen broadcast programs. We first propose a heterogeneous French corpus dedicated to this task. Two prediction approaches are compared: a state-of-the-art performance prediction based on regression (engineered features) and a new strategy based on convolutional neural networks (learnt features). We particularly focus on the combination of both textual (ASR transcription) and signal inputs. While the joint use of textual and signal features did not work for the regression baseline, the combination of inputs for CNNs leads to the best WER prediction performance. We also show that our CNN prediction remarkably predicts the WER distribution on a collection of speech recordings.

Submitted 23 April, 2018; originally announced April 2018.

Comments: IEEE ICASSP 2018

arXiv:1707.05797 [pdf]

doi 10.1109/JLT.2018.2811755

Low-complexity implementation of convex optimization-based phase retrieval

Authors: Sercan O. Arik, Joseph M. Kahn

Abstract: Phase retrieval has important applications in optical imaging, communications and sensing. Lifting the dimensionality of the problem allows phase retrieval to be approximated as a convex optimization problem in a higher-dimensional space. Convex optimization-based phase retrieval has been shown to yield high accuracy, yet its low-complexity implementation has not been explored. In this paper, we study three fundamental approaches for its low-complexity implementation: the projected gradient method, the Nesterov accelerated gradient method, and the alternating direction method of multipliers (ADMM). We derive the corresponding estimation algorithms and evaluate their complexities. We compare their performance in the application area of direct-detection mode-division multiplexing. We demonstrate that they yield negligible estimation penalties (less than 0.2 dB for transmitter processing and less than 0.6 dB for receiver equalization) while yielding low computational cost, as their implementation complexities all scale quadratically in the number of unknown parameters. Among the three methods, ADMM achieves convergence after the smallest number of iterations.

Submitted 19 March, 2018; v1 submitted 18 July, 2017; originally announced July 2017.

arXiv:1609.04608 [pdf, other]

doi 10.1109/TPAMI.2018.2815524

Recursive nearest agglomeration (ReNA): fast clustering for approximation of structured signals

Authors: Andrés Hoyos-Idrobo, Gaël Varoquaux, Jonas Kahn, Bertrand Thirion

Abstract: In this work, we revisit fast dimension reduction approaches, as with random projections and random sampling. Our goal is to summarize the data to decrease computational costs and memory footprint of subsequent analysis. Such dimension reduction can be very efficient when the signals of interest have a strong structure, such as with images. We focus on this setting and investigate feature clustering schemes for data reductions that capture this structure. An impediment to fast dimension reduction is that good clustering comes with large algorithmic costs. We address it by contributing a linear-time agglomerative clustering scheme, Recursive Nearest Agglomeration (ReNA). Unlike existing fast agglomerative schemes, it avoids the creation of giant clusters. We empirically validate that it approximates the data as well as traditional variance-minimizing clustering schemes that have a quadratic complexity. In addition, we analyze signal approximation with feature clustering and show that it can remove noise, improving subsequent analysis steps. As a consequence, data reduction by clustering features with ReNA yields very fast and accurate models, making it possible to process large datasets on a budget. Our theoretical analysis is backed by extensive experiments on publicly-available data that illustrate the computation efficiency and the denoising properties of the resulting dimension reduction scheme.

Submitted 19 March, 2018; v1 submitted 15 September, 2016; originally announced September 2016.

Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, In press

arXiv:1511.04898 [pdf, other]

Fast clustering for scalable statistical analysis on structured images

Authors: Bertrand Thirion, Andrés Hoyos-Idrobo, Jonas Kahn, Gael Varoquaux

Abstract: The use of brain images as markers for diseases or behavioral differences is challenged by the small effect sizes and the ensuing lack of power, an issue that has incited researchers to rely more systematically on large cohorts. Coupled with resolution increases, this leads to very large datasets. A striking example in the case of brain imaging is that of the Human Connectome Project: 20 Terabytes of data and growing. The resulting data deluge poses severe challenges regarding the tractability of some processing steps (discriminant analysis, multivariate models) due to the memory demands posed by these data. In this work, we revisit dimension reduction approaches, such as random projections, with the aim of replacing costly function evaluations by cheaper ones while decreasing the memory requirements. Specifically, we investigate the use of alternate schemes, based on fast clustering, that are well suited for signals exhibiting a strong spatial structure, such as anatomical and functional brain images. Our contribution is twofold: i) we propose a linear-time clustering scheme that bypasses the percolation issues inherent in these algorithms and thus provides compressions nearly as good as traditional quadratic-complexity variance-minimizing clustering schemes, ii) we show that cluster-based compression can have the virtuous effect of removing high-frequency noise, actually improving subsequent estimation steps. As a consequence, the proposed approach yields very accurate models on several large-scale problems yet with impressive gains in computational efficiency, making it possible to analyze large datasets.

Submitted 16 November, 2015; originally announced November 2015.

Comments: ICML Workshop on Statistics, Machine Learning and Neuroscience (Stamlins 2015), Jul 2015, Lille, France

arXiv:1308.2794 [pdf, ps, other]

Functions without influential coalitions

Authors: Jeff Kahn, Gil Kalai

Abstract: We give counterexamples to a conjecture of Benny Chor and another of the second author, both from the late 80s, by exhibiting functions for which the influences of large coalitions are unexpectedly small relative to the expectations of the functions.

Submitted 13 August, 2013; originally announced August 2013.

Comments: 13 pages

arXiv:1301.1752 [pdf, ps, other]

A bipartite graph with non-unimodal independent set sequence

Authors: Arnab Bhattacharyya, Jeff Kahn

Abstract: We show that the independent set sequence of a bipartite graph need not be unimodal.

Submitted 8 January, 2013; originally announced January 2013.

arXiv:1207.4144 [pdf]

A Generative Bayesian Model for Aggregating Experts' Probabilities

Authors: Joseph Kahn

Abstract: In order to improve forecasts, a decisionmaker often combines probabilities given by various sources, such as human experts and machine learning classifiers. When few training data are available, aggregation can be improved by incorporating prior knowledge about the event being forecasted and about salient properties of the experts. To this end, we develop a generative Bayesian aggregation model for probabilistic classification. The model includes an event-specific prior, measures of individual experts' bias, calibration, accuracy, and a measure of dependence between experts. Rather than require absolute measures, we show that aggregation may be expressed in terms of relative accuracy between experts. The model results in a weighted logarithmic opinion pool (LogOps) that satisfies consistency criteria such as the external Bayesian property. We derive analytic solutions for independent and for exchangeable experts. Empirical tests demonstrate the model's use, comparing its accuracy with other aggregation methods.

Submitted 11 July, 2012; originally announced July 2012.

Comments: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

Report number: UAI-P-2004-PG-301-308

arXiv:1206.1016 [pdf, ps, other]

Mantel's Theorem for random graphs

Authors: Bobby DeMarco, Jeff Kahn

Abstract: For a graph $G$, denote by $t(G)$ (resp. $b(G)$) the maximum size of a triangle-free (resp. bipartite) subgraph of $G$. Of course $t(G) \geq b(G)$ for any $G$, and a classic result of Mantel from 1907 (the first case of Turán's Theorem) says that equality holds for complete graphs. A natural question, first considered by Babai, Simonovits and Spencer about 20 years ago is, when (i.e. for what $p=p(n)$) is the "Erdős-Rényi" random graph $G=G(n,p)$ likely to satisfy $t(G) = b(G)$? We show that this is true if $p > C n^{-1/2} \log^{1/2} n$ for a suitable constant $C$, which is best possible up to the value of $C$.

Submitted 5 June, 2012; originally announced June 2012.

Comments: 15 pages

MSC Class: 05D40; 05C35; 05C80

Showing 1–38 of 38 results for author: Kahn, J