
Audiobox: Unified Audio Generation with Natural Language Prompts

Authors: Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu

Abstract: Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data. However, these models lack controllability in several aspects: speech generation models cannot synthesize novel styles based on text description and are limited on domain coverage such as outdoor environments; sound generation models only provide coarse-grained control based on descriptions like "a person speaking" and would only generate mumbling human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms. We allow transcript, vocal, and other audio styles to be controlled independently when generating speech. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speeds up generation by over 25 times compared to the default ODE solver for flow-matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/

Submitted 25 December, 2023; originally announced December 2023.

arXiv:2308.11596 [pdf, other]

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, et al. (43 additional authors not shown)

Abstract: What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication

Submitted 24 October, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

ACM Class: I.2.7

arXiv:2207.04672 [pdf]

No Language Left Behind: Scaling Human-Centered Machine Translation

Authors: NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, et al. (14 additional authors not shown)

Abstract: Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system. Finally, we open source all contributions described in this work, accessible at https://github.com/facebookresearch/fairseq/tree/nllb.

Submitted 25 August, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

Comments: 190 pages

MSC Class: 68T50 ACM Class: I.2.7

arXiv:1902.04003 [pdf, other]

Stabilized MorteX method for mesh tying along embedded interfaces

Authors: Basava Raju Akula, Julien Vignollet, Vladislav A. Yastrebov

Abstract: We present a unified framework to tie overlapping meshes in solid mechanics applications. This framework is a combination of the X-FEM method and the mortar method, which uses Lagrange multipliers to fulfill the tying constraints. As known, mixed formulations are prone to mesh locking which manifests itself by the emergence of spurious oscillations in the vicinity of the tying interface. To overcome this inherent difficulty, we suggest a new coarse-grained interpolation of Lagrange multipliers. This technique consists in selective assignment of Lagrange multipliers on nodes of the mortar side and in non-local interpolation of the associated traction field. The optimal choice of the coarse-graining spacing is guided solely by the mesh-density contrast between the mesh of the mortar side and the number of blending elements of the host mesh. The method is tested on two patch tests (compression and bending) for different interpolations and element types as well as for different material and mesh contrasts. The optimal mesh convergence and removal of spurious oscillations is also demonstrated on the Eshelby inclusion problem for high contrasts of inclusion/matrix materials. Few additional examples confirm the performance of the elaborated framework.

Submitted 3 February, 2019; originally announced February 2019.

Comments: 32 pages, 36 figures, 64 references

arXiv:1902.04000 [pdf, other]

MorteX method for contact along real and embedded surfaces: coupling X-FEM with the Mortar method

Authors: Basava Raju Akula, Julien Vignollet, Vladislav A. Yastrebov

Abstract: A method to treat frictional contact problems along embedded surfaces in the finite element framework is developed. Arbitrarily shaped embedded surfaces, cutting through finite element meshes, are handled by the X-FEM. The frictional contact problem is solved using the monolithic augmented Lagrangian method within the mortar framework which was adapted for handling embedded surfaces. We report that the resulting mixed formulation is prone to mesh locking in case of high elastic and mesh density contrasts across the contact interface. The mesh locking manifests itself in spurious stress oscillations in the vicinity of the contact interface. We demonstrate that in the classical patch test, these oscillations can be removed simply by using triangular blending elements. In a more general case, the triangulation is shown inefficient, therefore stabilization of the problem is achieved by adopting a recently proposed coarse-graining interpolation of Lagrange multipliers. Moreover, we demonstrate that the coarse-graining is also beneficial for the classical mortar method to avoid spurious oscillations for contact interfaces with high elastic contrast. The performance of this novel method, called MorteX, is demonstrated on several examples which show as accurate treatment of frictional contact along embedded surfaces as the classical mortar method along boundary fitted surfaces.

Submitted 3 February, 2019; originally announced February 2019.

Comments: 30 pages, 28 figures, 58 references

arXiv:0911.1672 [pdf]

Biological Computing Fundamentals and Futures

Authors: Balaji Akula, James Cusick

Abstract: The fields of computing and biology have begun to cross paths in new ways. In this paper a review of the current research in biological computing is presented. Fundamental concepts are introduced and these foundational elements are explored to discuss the possibilities of a new computing paradigm. We assume the reader to possess a basic knowledge of Biology and Computer Science.

Submitted 9 November, 2009; originally announced November 2009.

Comments: Introduction to Biological computing, 7 pages

Showing 1–6 of 6 results for author: Akula, B