Audiobox: Unified Audio Generation with Natural Language Prompts

Authors: Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu

Abstract: Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data. However, these models lack controllability in several aspects: speech generation models cannot synthesize novel styles based on text description and are limited in domain coverage such as outdoor environments; sound generation models only provide coarse-grained control based on descriptions like "a person speaking" and would only generate mumbling human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms. We allow transcript, vocal, and other audio styles to be controlled independently when generating speech. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speed up generation by over 25 times compared to the default ODE solver for flow-matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/

Submitted 25 December, 2023; originally announced December 2023.

arXiv:2209.12856

Digital Twin in Safety-Critical Robotics Applications: Opportunities and Challenges

Authors: Sabur Baidya, Sumit K. Das, Mohammad Helal Uddin, Chase Kosek, Chris Summers

Abstract: Digital Twin technology is being envisioned to be an integral part of the industrial evolution in modern generation. With the rapid advancement in the Internet-of-Things (IoT) technology and increasing trend of automation, integration between the virtual and the physical world is now realizable to produce practical digital twins. However, the existing definitions of digital twin are incomplete and sometimes ambiguous. Herein, we conduct a historical review and analyze the modern generic view of digital twin to create its new extended definition. We also review and discuss the existing work on digital twin in safety-critical robotics applications. In particular, the usage of digital twin in industrial applications necessitates autonomous and remote operations due to environmental challenges. However, the uncertainties in the environment may need close monitoring and quick adaptation of the robots, which need to be safety-proof and cost effective. We demonstrate a case study on developing a framework for safety-critical robotic arm applications and present the system performance to show its advantages, and discuss the challenges and scopes ahead.

Submitted 26 September, 2022; originally announced September 2022.

arXiv:2204.13250

Watts: Infrastructure for Open-Ended Learning

Authors: Aaron Dharna, Charlie Summers, Rohin Dasari, Julian Togelius, Amy K. Hoover

Abstract: This paper proposes a framework called Watts for implementing, comparing, and recombining open-ended learning (OEL) algorithms. Motivated by modularity and algorithmic flexibility, Watts atomizes the components of OEL systems to promote the study of and direct comparisons between approaches. Examining implementations of three OEL algorithms, the paper introduces the modules of the framework. The hope is for Watts to enable benchmarking and to explore new types of OEL algorithms. The repo is available at https://github.com/aadharna/watts

Submitted 27 April, 2022; originally announced April 2022.

Comments: ICLR Workshop on Agent Learning in Open-Endedness (ALOE 2022)

arXiv:2103.04514

Nondeterminism and Instability in Neural Network Optimization

Authors: Cecilia Summers, Michael J. Dinneen

Abstract: Nondeterminism in neural network optimization produces uncertainty in performance, making small improvements difficult to discern from run-to-run variability. While uncertainty can be reduced by training multiple model copies, doing so is time-consuming, costly, and harms reproducibility. In this work, we establish an experimental protocol for understanding the effect of optimization nondeterminism on model diversity, allowing us to isolate the effects of a variety of sources of nondeterminism. Surprisingly, we find that all sources of nondeterminism have similar effects on measures of model diversity. To explain this intriguing fact, we identify the instability of model training, taken as an end-to-end procedure, as the key determinant. We show that even one-bit changes in initial parameters result in models converging to vastly different values. Last, we propose two approaches for reducing the effects of instability on run-to-run variability.

Submitted 10 July, 2021; v1 submitted 7 March, 2021; originally announced March 2021.

Comments: ICML 2021

arXiv:2001.07343

Lyceum: An efficient and scalable ecosystem for robot learning

Authors: Colin Summers, Kendall Lowrey, Aravind Rajeswaran, Siddhartha Srinivasa, Emanuel Todorov

Abstract: We introduce Lyceum, a high-performance computational ecosystem for robot learning. Lyceum is built on top of the Julia programming language and the MuJoCo physics simulator, combining the ease-of-use of a high-level programming language with the performance of native C. In addition, Lyceum has a straightforward API to support parallel computation across multiple cores and machines. Overall, depending on the complexity of the environment, Lyceum is 5-30x faster compared to other popular abstractions like OpenAI's Gym and DeepMind's dm-control. This substantially reduces training time for various reinforcement learning algorithms; and is also fast enough to support real-time model predictive control through MuJoCo. The code, tutorials, and demonstration videos can be found at: www.lyceum.ml.

Submitted 21 January, 2020; originally announced January 2020.

arXiv:1908.08031

MuSHR: A Low-Cost, Open-Source Robotic Racecar for Education and Research

Authors: Siddhartha S. Srinivasa, Patrick Lancaster, Johan Michalove, Matt Schmittle, Colin Summers, Matthew Rockett, Rosario Scalise, Joshua R. Smith, Sanjiban Choudhury, Christoforos Mavrogiannis, Fereshteh Sadeghi

Abstract: We present MuSHR, the Multi-agent System for non-Holonomic Racing. MuSHR is a low-cost, open-source robotic racecar platform for education and research, developed by the Personal Robotics Lab in the Paul G. Allen School of Computer Science & Engineering at the University of Washington. MuSHR aspires to contribute towards democratizing the field of robotics as a low-cost platform that can be built and deployed by following detailed, open documentation and do-it-yourself tutorials. A set of demos and lab assignments developed for the Mobile Robots course at the University of Washington provide guided, hands-on experience with the platform, and milestones for further development. MuSHR is a valuable asset for academic research labs, robotics instructors, and robotics enthusiasts.

Submitted 24 December, 2023; v1 submitted 21 August, 2019; originally announced August 2019.

Comments: Added Rosario Scalise to the author list

arXiv:1906.03749

Improved Adversarial Robustness via Logit Regularization Methods

Authors: Cecilia Summers, Michael J. Dinneen

Abstract: While great progress has been made at making neural networks effective across a wide range of visual tasks, most models are surprisingly vulnerable. This frailness takes the form of small, carefully chosen perturbations of their input, known as adversarial examples, which represent a security threat for learned vision models in the wild -- a threat which should be responsibly defended against in safety-critical applications of computer vision. In this paper, we advocate for and experimentally investigate the use of a family of logit regularization techniques as an adversarial defense, which can be used in conjunction with other methods for creating adversarial robustness at little to no marginal cost. We also demonstrate that much of the effectiveness of one recent adversarial defense mechanism can in fact be attributed to logit regularization, and show how to improve its defense against both white-box and black-box attacks, in the process creating a stronger black-box attack against PGD-based models. We validate our methods on three datasets and include results on both gradient-free attacks and strong gradient-based iterative attacks with as many as 1,000 steps.

Submitted 9 June, 2019; originally announced June 2019.

arXiv:1906.03548

Four Things Everyone Should Know to Improve Batch Normalization

Authors: Cecilia Summers, Michael J. Dinneen

Abstract: A key component of most neural network architectures is the use of normalization layers, such as Batch Normalization. Despite its common use and large utility in optimizing deep architectures, it has been challenging both to generically improve upon Batch Normalization and to understand the circumstances that lend themselves to other enhancements. In this paper, we identify four improvements to the generic form of Batch Normalization and the circumstances under which they work, yielding performance gains across all batch sizes while requiring no additional computation during training. These contributions include proposing a method for reasoning about the current example in inference normalization statistics, fixing a training vs. inference discrepancy; recognizing and validating the powerful regularization effect of Ghost Batch Normalization for small and medium batch sizes; examining the effect of weight decay regularization on the scaling and shifting parameters gamma and beta; and identifying a new normalization algorithm for very small batch sizes by combining the strengths of Batch and Group Normalization. We validate our results empirically on six datasets: CIFAR-100, SVHN, Caltech-256, Oxford Flowers-102, CUB-2011, and ImageNet.

Submitted 14 February, 2020; v1 submitted 8 June, 2019; originally announced June 2019.

Comments: ICLR 2020, 18 pages

arXiv:1805.11272

Improved Mixed-Example Data Augmentation

Authors: Cecilia Summers, Michael J. Dinneen

Abstract: In order to reduce overfitting, neural networks are typically trained with data augmentation, the practice of artificially generating additional training data via label-preserving transformations of existing training examples. While these types of transformations make intuitive sense, recent work has demonstrated that even non-label-preserving data augmentation can be surprisingly effective, examining this type of data augmentation through linear combinations of pairs of examples. Despite their effectiveness, little is known about why such methods work. In this work, we aim to explore a new, more generalized form of this type of data augmentation in order to determine whether such linearity is necessary. By considering this broader scope of "mixed-example data augmentation", we find a much larger space of practical augmentation techniques, including methods that improve upon previous state-of-the-art. This generalization has benefits beyond the promise of improved performance, revealing a number of types of mixed-example data augmentation that are radically different from those considered in prior work, which provides evidence that current theories for the effectiveness of such methods are incomplete and suggests that any such theory must explain a much broader phenomenon. Code is available at https://github.com/ceciliaresearch/MixedExample.

Submitted 19 January, 2019; v1 submitted 29 May, 2018; originally announced May 2018.

Comments: 9 pages

arXiv:1505.00519

Large Scale Discovery of Seasonal Music From User Data

Authors: Cameron Summers, Phillip Popp

Abstract: The consumption history of online media content such as music and video offers a rich source of data from which to mine information. Trends in this data are of particular interest because they reflect user preferences as well as associated cultural contexts that can be exploited in systems such as recommendation or search. This paper classifies songs as seasonal using a large, real-world dataset of user listening data. Results show strong performance of classification of Christmas music with Gaussian Mixture Models.

Submitted 3 May, 2015; originally announced May 2015.

Comments: 4 pages, 1 figure

Showing 1–10 of 10 results for author: Summers, C