-
Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
Authors:
Chung-Ming Chien,
Andros Tjandra,
Apoorv Vyas,
Matt Le,
Bowen Shi,
Wei-Ning Hsu
Abstract:
As the scale of generative models continues to grow, efficient reuse and adaptation of pre-trained models have become crucial considerations. In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained ones, we explore various efficient fine-tuning approaches. Our experiment shows that the LoRA with bias-tuning configuration yields the best performance, enhancing controllability without compromising speech quality. Across three fine-grained conditional generation tasks, we demonstrate the effectiveness and resource efficiency of Voicebox Adapter. Follow-up experiments further highlight the robustness of Voicebox Adapter across diverse data setups.
Submitted 10 June, 2024;
originally announced June 2024.
-
Audiobox: Unified Audio Generation with Natural Language Prompts
Authors:
Apoorv Vyas,
Bowen Shi,
Matthew Le,
Andros Tjandra,
Yi-Chiao Wu,
Baishan Guo,
Jiemin Zhang,
Xinyue Zhang,
Robert Adkins,
William Ngan,
Jeff Wang,
Ivan Cruz,
Bapi Akula,
Akinniyi Akinyemi,
Brian Ellis,
Rashel Moritz,
Yael Yungster,
Alice Rakotoarison,
Liang Tan,
Chris Summers,
Carleigh Wood,
Joshua Lane,
Mary Williamson,
Wei-Ning Hsu
Abstract:
Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data. However, these models lack controllability in several aspects: speech generation models cannot synthesize novel styles based on text description and are limited on domain coverage such as outdoor environments; sound generation models only provide coarse-grained control based on descriptions like "a person speaking" and would only generate mumbling human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms. We allow transcript, vocal, and other audio styles to be controlled independently when generating speech. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speeds up generation by over 25 times compared to the default ODE solver for flow-matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/
Submitted 25 December, 2023;
originally announced December 2023.
-
CaptainCook4D: A dataset for understanding errors in procedural activities
Authors:
Rohith Peddi,
Shivvrat Arya,
Bharath Challa,
Likhitha Pallapothula,
Akshay Vyas,
Jikai Wang,
Qifan Zhang,
Vasundhara Komaragiri,
Eric Ragan,
Nicholas Ruozzi,
Yu Xiang,
Vibhav Gogate
Abstract:
Following step-by-step procedures is an essential component of various activities carried out by individuals in their daily lives. These procedures serve as a guiding framework that helps to achieve goals efficiently, whether it is assembling furniture or preparing a recipe. However, the complexity and duration of procedural activities inherently increase the likelihood of making errors. Understanding such procedural activities from a sequence of frames is a challenging task that demands an accurate interpretation of visual information and the ability to reason about the structure of the activity. To this end, we collect a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings (94.5 hours) of people performing recipes in real kitchen environments. This dataset consists of two distinct types of activity: one in which participants adhere to the provided recipe instructions and another in which they deviate and induce errors. We provide 5.3K step annotations and 10K fine-grained action annotations and benchmark the dataset for the following tasks: supervised error recognition, multistep localization, and procedure learning.
Submitted 22 December, 2023;
originally announced December 2023.
-
Generative Pre-training for Speech with Flow Matching
Authors:
Alexander H. Liu,
Matt Le,
Apoorv Vyas,
Bowen Shi,
Andros Tjandra,
Wei-Ning Hsu
Abstract:
Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training.
Submitted 25 March, 2024; v1 submitted 24 October, 2023;
originally announced October 2023.
-
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Authors:
Matthew Le,
Apoorv Vyas,
Bowen Shi,
Brian Karrer,
Leda Sari,
Rashel Moritz,
Mary Williamson,
Vimal Manohar,
Yossi Adi,
Jay Mahadeokar,
Wei-Ning Hsu
Abstract:
Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are not filtered or enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. Audio samples can be found at https://voicebox.metademolab.com.
Submitted 19 October, 2023; v1 submitted 23 June, 2023;
originally announced June 2023.
-
Scaling Speech Technology to 1,000+ Languages
Authors:
Vineel Pratap,
Andros Tjandra,
Bowen Shi,
Paden Tomasello,
Arun Babu,
Sayani Kundu,
Ali Elkahky,
Zhaoheng Ni,
Apoorv Vyas,
Maryam Fazel-Zarandi,
Alexei Baevski,
Yossi Adi,
Xiaohui Zhang,
Wei-Ning Hsu,
Alexis Conneau,
Michael Auli
Abstract:
Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
Submitted 22 May, 2023;
originally announced May 2023.
-
Beyond first-order methods for non-convex non-concave min-max optimization
Authors:
Abhijeet Vyas,
Brian Bullins
Abstract:
We propose a study of structured non-convex non-concave min-max problems which goes beyond standard first-order approaches. Inspired by the tight understanding established in recent works [Adil et al., 2022, Lin and Jordan, 2022b], we develop a suite of higher-order methods which show the improvements attainable beyond the monotone and Minty condition settings. Specifically, we provide a new understanding of the use of discrete-time $p^{th}$-order methods for operator norm minimization in the min-max setting, establishing an $O(1/ε^\frac{2}{p})$ rate to achieve $ε$-approximate stationarity, under the weakened Minty variational inequality condition of Diakonikolas et al. [2021]. We further present a continuous-time analysis alongside rates which match those for the discrete-time setting, and our empirical results highlight the practical benefits of our approach over first-order methods.
Submitted 17 April, 2023;
originally announced April 2023.
-
Neural Operator: Is data all you need to model the world? An insight into the impact of Physics Informed Machine Learning
Authors:
Hrishikesh Viswanath,
Md Ashiqur Rahman,
Abhijeet Vyas,
Andrey Shor,
Beatriz Medeiros,
Stephanie Hernandez,
Suhas Eswarappa Prameela,
Aniket Bera
Abstract:
Numerical approximations of partial differential equations (PDEs) are routinely employed to formulate the solution of physics, engineering and mathematical problems involving functions of several variables, such as the propagation of heat or sound, fluid flow, elasticity, electrostatics, electrodynamics, and more. While this has led to solving many complex phenomena, there are some limitations. Conventional approaches such as Finite Element Methods (FEMs) and Finite Difference Methods (FDMs) require considerable time and are computationally expensive. In contrast, data-driven machine learning-based methods such as neural networks provide a faster, fairly accurate alternative, and have certain advantages such as discretization invariance and resolution invariance. This article aims to provide a comprehensive insight into how data-driven approaches can complement conventional techniques to solve engineering and physics problems, while also noting some of the major pitfalls of machine learning-based approaches. Furthermore, we highlight operator learning, a novel and fast (~1000x) machine learning-based approach to learning the solution operator of a PDE. We will note how these new computational approaches can bring immense advantages in tackling many problems in fundamental and applied physics.
Submitted 18 September, 2023; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Towards the Age of Intelligent Vehicular Networks for Connected and Autonomous Vehicles in 6G
Authors:
Van-Linh Nguyen,
Ren-Hung Hwang,
Po-Ching Lin,
Abhishek Vyas,
Van-Tao Nguyen
Abstract:
Twenty-two years after the advent of the first-generation vehicular network, i.e., dedicated short-range communications (DSRC) standard/IEEE 802.11p, the vehicular technology market has become very competitive with a new player, Cellular Vehicle-to-Everything (C-V2X). Currently, C-V2X technology likely dominates the race because of the big advantages of comprehensive coverage and high throughput/reliability. Meanwhile, DSRC-based technologies are struggling to survive and rebound with many hopes betting on the success of the second-generation standard, IEEE P802.11bd. While the standards battle to attract automotive makers and dominate the commercial market landing, the research community has started thinking about the shape of the next-generation vehicular networks. This article details the state-of-the-art progress of vehicular networks, particularly the cellular V2X-related technologies in specific use cases, compared to the features of the current generation. Through the typical examples, we also highlight why 5G is inadequate to provide the best connectivity for vehicular applications, and then 6G technologies can fill up the vacancy.
Submitted 3 September, 2022;
originally announced September 2022.
-
"All of them claim to be the best": Multi-perspective study of VPN users and VPN providers
Authors:
Reethika Ramesh,
Anjali Vyas,
Roya Ensafi
Abstract:
As more users adopt VPNs for a variety of reasons, it is important to develop empirical knowledge of their needs and mental models of what a VPN offers. Moreover, studying VPN users alone is not enough because, by using a VPN, a user essentially transfers trust, say from their network provider, onto the VPN provider. To that end, we are the first to study the VPN ecosystem from both the users' and the providers' perspectives. In this paper, we conduct a quantitative survey of 1,252 VPN users in the U.S. and qualitative interviews of nine providers to answer several research questions regarding the motivations, needs, threat model, and mental model of users, and the key challenges and insights from VPN providers. We create novel insights by augmenting our multi-perspective results, and highlight cases where the user and provider perspectives are misaligned. Alarmingly, we find that users rely on and trust VPN review sites, but VPN providers shed light on how these sites are mostly motivated by money. Worryingly, we find that users have flawed mental models about the protection VPNs provide, and about data collected by VPNs. We present actionable recommendations for technologists and security and privacy advocates by identifying potential areas on which to focus efforts and improve the VPN ecosystem.
Submitted 28 September, 2022; v1 submitted 6 August, 2022;
originally announced August 2022.
-
Competitive Gradient Optimization
Authors:
Abhijeet Vyas,
Kamyar Azizzadenesheli
Abstract:
We study the problem of convergence to a stationary point in zero-sum games. We propose competitive gradient optimization (CGO), a gradient-based method that incorporates the interactions between the two players in zero-sum games for optimization updates. We provide continuous-time analysis of CGO and its convergence properties while showing that in the continuous limit, CGO predecessors degenerate to their gradient descent ascent (GDA) variants. We provide a rate of convergence to stationary points and further propose a generalized class of $α$-coherent functions for which we provide convergence analysis. We show that for strictly $α$-coherent functions, our algorithm converges to a saddle point. Moreover, we propose optimistic CGO (OCGO), an optimistic variant, for which we show a convergence rate to saddle points in the $α$-coherent class of functions.
Submitted 27 May, 2022;
originally announced May 2022.
-
On-demand compute reduction with stochastic wav2vec 2.0
Authors:
Apoorv Vyas,
Wei-Ning Hsu,
Michael Auli,
Alexei Baevski
Abstract:
Squeeze and Efficient Wav2vec (SEW) is a recently proposed architecture that squeezes the input to the transformer encoder for compute efficient pre-training and inference with wav2vec 2.0 (W2V2) models. In this work, we propose stochastic compression for on-demand compute reduction for W2V2 models. As opposed to using a fixed squeeze factor, we sample it uniformly during training. We further introduce query and key-value pooling mechanisms that can be applied to each transformer layer for further compression. Our results for models pre-trained on 960h Librispeech dataset and fine-tuned on 10h of transcribed data show that using the same stochastic model, we get a smooth trade-off between word error rate (WER) and inference time with only marginal WER degradation compared to the W2V2 and SEW models trained for a specific setting. We further show that we can fine-tune the same stochastically pre-trained model to a specific configuration to recover the WER difference resulting in significant computational savings on pre-training models from scratch.
Submitted 25 April, 2022;
originally announced April 2022.
-
Comparing CTC and LFMMI for out-of-domain adaptation of wav2vec 2.0 acoustic model
Authors:
Apoorv Vyas,
Srikanth Madikeri,
Hervé Bourlard
Abstract:
In this work, we investigate if the wav2vec 2.0 self-supervised pretraining helps mitigate the overfitting issues with connectionist temporal classification (CTC) training to reduce its performance gap with flat-start lattice-free MMI (E2E-LFMMI) for automatic speech recognition with limited training data. Towards that objective, we use the pretrained wav2vec 2.0 BASE model and fine-tune it on three different datasets including out-of-domain (Switchboard) and cross-lingual (Babel) scenarios. Our results show that for supervised adaptation of the wav2vec 2.0 model, both E2E-LFMMI and CTC achieve similar results; significantly outperforming the baselines trained only with supervised data. Fine-tuning the wav2vec 2.0 model with E2E-LFMMI and CTC we obtain the following relative WER improvements over the supervised baseline trained with E2E-LFMMI. We get relative improvements of 40% and 44% on the clean-set and 64% and 58% on the test set of Librispeech (100h) respectively. On Switchboard (300h) we obtain relative improvements of 33% and 35% respectively. Finally, for Babel languages, we obtain relative improvements of 26% and 23% on Swahili (38h) and 18% and 17% on Tagalog (84h) respectively.
Submitted 6 April, 2021;
originally announced April 2021.
-
Lattice-Free MMI Adaptation Of Self-Supervised Pretrained Acoustic Models
Authors:
Apoorv Vyas,
Srikanth Madikeri,
Hervé Bourlard
Abstract:
In this work, we propose lattice-free MMI (LFMMI) for supervised adaptation of self-supervised pretrained acoustic model. We pretrain a Transformer model on thousand hours of untranscribed Librispeech data followed by supervised adaptation with LFMMI on three different datasets. Our results show that fine-tuning with LFMMI, we consistently obtain relative WER improvements of 10% and 35.3% on the clean and other test sets of Librispeech (100h), 10.8% on Switchboard (300h), and 4.3% on Swahili (38h) and 4.4% on Tagalog (84h) compared to the baseline trained only with supervised data.
Submitted 6 April, 2021; v1 submitted 28 December, 2020;
originally announced December 2020.
-
Dynamic Structure Learning through Graph Neural Network for Forecasting Soil Moisture in Precision Agriculture
Authors:
Anoushka Vyas,
Sambaran Bandyopadhyay
Abstract:
Soil moisture is an important component of precision agriculture as it directly impacts the growth and quality of vegetation. Forecasting soil moisture is essential to schedule irrigation and optimize the use of water. Physics-based soil moisture models need rich features and heavy computation, which is not scalable. In recent literature, conventional machine learning models have been applied to this problem. These models are fast and simple, but they often fail to capture the spatio-temporal correlation that soil moisture exhibits over a region. In this work, we propose a novel graph neural network based solution that learns temporal graph structures and forecasts soil moisture in an end-to-end framework. Our solution is able to handle the problem of missing ground truth soil moisture, which is common in practice. We show the merit of our algorithm on real-world soil moisture data.
Submitted 16 May, 2022; v1 submitted 7 December, 2020;
originally announced December 2020.
-
Foundations of the Socio-physical Model of Activities (SOMA) for Autonomous Robotic Agents
Authors:
Daniel Beßler,
Robert Porzel,
Mihai Pomarlan,
Abhijit Vyas,
Sebastian Höffner,
Michael Beetz,
Rainer Malaka,
John Bateman
Abstract:
In this paper, we present foundations of the Socio-physical Model of Activities (SOMA). SOMA represents both the physical as well as the social context of everyday activities. Such tasks seem to be trivial for humans, however, they pose severe problems for artificial agents. For starters, a natural language command requesting something will leave many pieces of information necessary for performing the task unspecified. Humans can solve such problems fast as we reduce the search space by recourse to prior knowledge such as a connected collection of plans that describe how certain goals can be achieved at various levels of abstraction. Rather than enumerating fine-grained physical contexts SOMA sets out to include socially constructed knowledge about the functions of actions to achieve a variety of goals or the roles objects can play in a given situation. As the human cognition system is capable of generalizing experiences into abstract knowledge pieces applicable to novel situations, we argue that both physical and social context need be modeled to tackle these challenges in a general manner. This is represented by the link between the physical and social context in SOMA where relationships are established between occurrences and generalizations of them, which has been demonstrated in several use cases that validate SOMA.
Submitted 24 November, 2020;
originally announced November 2020.
-
Pkwrap: a PyTorch Package for LF-MMI Training of Acoustic Models
Authors:
Srikanth Madikeri,
Sibo Tong,
Juan Zuluaga-Gomez,
Apoorv Vyas,
Petr Motlicek,
Hervé Bourlard
Abstract:
We present a simple wrapper that is useful to train acoustic models in PyTorch using Kaldi's LF-MMI training framework. The wrapper, called pkwrap (short form of PyTorch kaldi wrapper), enables the user to utilize the flexibility provided by PyTorch in designing model architectures. It exposes the LF-MMI cost function as an autograd function. Other capabilities of Kaldi have also been ported to PyTorch. This includes the parallel training ability when multi-GPU environments are unavailable and decode with graphs created in Kaldi. The package is available on Github at https://github.com/idiap/pkwrap.
Submitted 7 October, 2020;
originally announced October 2020.
-
FPGA Implementation of Simplified Spiking Neural Network
Authors:
Shikhar Gupta,
Arpan Vyas,
Gaurav Trivedi
Abstract:
Spiking Neural Networks (SNN) are third-generation Artificial Neural Networks (ANN) which are close to the biological neural system. In recent years SNN has become popular in the area of robotics and embedded applications, therefore, it has become imperative to explore its real-time and energy-efficient implementations. SNNs are more powerful than their predecessors because they encode temporal information and use biologically plausible plasticity rules. In this paper, a simpler and computationally efficient SNN model using FPGA architecture is described. The proposed model is validated on a Xilinx Virtex 6 FPGA and analyzes a fully connected network which consists of 800 neurons and 12,544 synapses in real-time.
Submitted 2 October, 2020;
originally announced October 2020.
-
Fast Transformers with Clustered Attention
Authors:
Apoorv Vyas,
Angelos Katharopoulos,
François Fleuret
Abstract:
Transformers have been proven a successful model for a variety of tasks in sequence modeling. However, computing the attention matrix, which is their key component, has quadratic complexity with respect to the sequence length, thus making them prohibitively expensive for large sequences. To address this, we propose clustered attention, which instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids. To further improve this approximation, we use the computed clusters to identify the keys with the highest attention per query and compute the exact key/query dot products. This results in a model with linear complexity with respect to the sequence length for a fixed number of clusters. We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget. Finally, we demonstrate that our model can approximate arbitrarily complex attention distributions with a minimal number of clusters by approximating a pretrained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no loss in performance.
Submitted 29 September, 2020; v1 submitted 9 July, 2020;
originally announced July 2020.
-
DART: Open-Domain Structured Data Record to Text Generation
Authors:
Linyong Nan,
Dragomir Radev,
Rui Zhang,
Amrit Rau,
Abhinand Sivaprasad,
Chiachun Hsieh,
Xiangru Tang,
Aadit Vyas,
Neha Verma,
Pranav Krishna,
Yangxiaokang Liu,
Nadia Irwanto,
Jessica Pan,
Faiaz Rahman,
Ahmad Zaidi,
Mutethia Mutuma,
Yasin Tarabar,
Ankit Gupta,
Tao Yu,
Yi Chern Tan,
Xi Victoria Lin,
Caiming Xiong,
Richard Socher,
Nazneen Fatema Rajani
Abstract:
We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-Text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and dialogue-act-based meaning representation tasks by utilizing techniques such as: tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.
Submitted 12 April, 2021; v1 submitted 6 July, 2020;
originally announced July 2020.
-
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Authors:
Angelos Katharopoulos,
Apoorv Vyas,
Nikolaos Pappas,
François Fleuret
Abstract:
Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
Submitted 31 August, 2020; v1 submitted 29 June, 2020;
originally announced June 2020.
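The associativity argument in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the non-causal case, uses plain Python lists, and assumes the feature map phi(x) = elu(x) + 1, which the abstract does not specify. The function names are hypothetical.

```python
import math

def elu_plus_one(x):
    # Assumed positive feature map phi(x) = elu(x) + 1 (not stated in the abstract).
    return x + 1.0 if x > 0 else math.exp(x)

def phi(M):
    # Apply the feature map element-wise to a matrix given as a list of rows.
    return [[elu_plus_one(v) for v in row] for row in M]

def quadratic_attention(Q, K, V):
    # O(N^2): every query is compared against every key explicitly.
    Qf, Kf = phi(Q), phi(K)
    out = []
    for q in Qf:
        sims = [sum(a * b for a, b in zip(q, k)) for k in Kf]
        norm = sum(sims)
        out.append([sum(s * v[j] for s, v in zip(sims, V)) / norm
                    for j in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    # O(N): compute S = phi(K)^T V and z = sum_j phi(K_j) once, then reuse
    # them for every query. Associativity gives (phi(Q) phi(K)^T) V = phi(Q) S.
    Qf, Kf = phi(Q), phi(K)
    d, m = len(Kf[0]), len(V[0])
    S = [[0.0] * m for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(Kf, V):
        for a in range(d):
            z[a] += k[a]
            for j in range(m):
                S[a][j] += k[a] * v[j]
    out = []
    for q in Qf:
        norm = sum(q[a] * z[a] for a in range(d))
        out.append([sum(q[a] * S[a][j] for a in range(d)) / norm
                    for j in range(m)])
    return out
```

Both functions return the same values up to floating-point rounding; only the order of the matrix products differs, which is what removes the N x N intermediate.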
-
Gandhipedia: A one-stop AI-enabled portal for browsing Gandhian literature, life-events and his social network
Authors:
Sayantan Adak,
Atharva Vyas,
Animesh Mukherjee,
Heer Ambavi,
Pritam Kadasi,
Mayank Singh,
Shivam Patel
Abstract:
We introduce an AI-enabled portal that presents an excellent visualization of Mahatma Gandhi's life events by constructing temporal and spatial social networks from the Gandhian literature. Applying an ensemble of methods drawn from NLTK, Polyglot and spaCy, we extract the key persons and places mentioned in Gandhi's written works. We visualize these entities and the connections between them, based on co-mentions within the same time frame, as networks in an interactive web portal. Clicking a node in the network fires search queries about the entity; all the information about that entity in the book from which the network is constructed is retrieved and presented on the portal. Overall, this system can be used as a digital and user-friendly resource to study Gandhian literature.
Submitted 5 June, 2020;
originally announced June 2020.
-
ESPRIT: Explaining Solutions to Physical Reasoning Tasks
Authors:
Nazneen Fatema Rajani,
Rui Zhang,
Yi Chern Tan,
Stephan Zheng,
Jeremy Weiss,
Aadit Vyas,
Abhijit Gupta,
Caiming Xiong,
Richard Socher,
Dragomir Radev
Abstract:
Neural networks lack the ability to reason about qualitative physics and so cannot generalize to scenarios and tasks unseen during training. We propose ESPRIT, a framework for commonsense reasoning about qualitative physics in natural language that generates interpretable descriptions of physical events. We use a two-step approach of first identifying the pivotal physical events in an environment and then generating natural language descriptions of those events using a data-to-text approach. Our framework learns to generate explanations of how the physical simulation will causally evolve so that an agent or a human can easily reason about a solution using those interpretable descriptions. Human evaluations indicate that ESPRIT produces crucial fine-grained details and has high coverage of physical concepts compared to even human annotations. Dataset, code and documentation are available at https://github.com/salesforce/esprit.
Submitted 13 May, 2020; v1 submitted 2 May, 2020;
originally announced May 2020.
-
Encoding Knowledge Graph Entity Aliases in Attentive Neural Network for Wikidata Entity Linking
Authors:
Isaiah Onando Mulang,
Kuldeep Singh,
Akhilesh Vyas,
Saeedeh Shekarpour,
Maria Esther Vidal,
Jens Lehmann,
Soren Auer
Abstract:
Collaborative knowledge graphs such as Wikidata rely excessively on the crowd to author information. Since the crowd is not bound to a standard protocol for assigning entity titles, the knowledge graph is populated by non-standard, noisy, long, or sometimes even awkward titles. The issue of long, implicit, and non-standard entity representations is a challenge for Entity Linking (EL) approaches in achieving high precision and recall. The underlying KG is, in general, the source of target entities for EL approaches; however, it often contains other relevant information, such as aliases of entities (e.g., Obama and Barack Hussein Obama are aliases for the entity Barack Obama). EL models usually ignore such readily available entity attributes. In this paper, we examine the role of knowledge graph context in an attentive neural network approach for entity linking on Wikidata. Our approach contributes by exploiting the sufficient context from a KG as a source of background knowledge, which is then fed into the neural network. This approach demonstrates merit in addressing challenges associated with entity titles (multi-word, long, implicit, case-sensitive). Our experimental study shows an improvement of approximately 8% over the baseline approach, and our approach significantly outperforms an end-to-end approach for Wikidata entity linking.
Submitted 26 September, 2020; v1 submitted 12 December, 2019;
originally announced December 2019.
-
Security, Privacy and Safety Risk Assessment for Virtual Reality Learning Environment Applications
Authors:
Aniket Gulhane,
Akhil Vyas,
Reshmi Mitra,
Roland Oruche,
Gabriela Hoefer,
Samaikya Valluripally,
Prasad Calyam,
Khaza Anuarul Hoque
Abstract:
Social Virtual Reality based Learning Environments (VRLEs) such as vSocial render instructional content in a three-dimensional immersive computer experience for training youth with learning impediments. Few prior works have explored attack vulnerability in VR technology, and hence there is a need for systematic frameworks to quantify risks corresponding to security, privacy, and safety (SPS) threats. The SPS threats can adversely impact the educational user experience and hinder delivery of VRLE content. In this paper, we propose a novel risk assessment framework that utilizes attack trees to calculate a risk score for varied VRLE threats with rate and duration of threats as inputs. We compare the impact of a well-constructed attack tree with an ad hoc attack tree to study the trade-offs between overheads in managing attack trees and the cost of risk mitigation when vulnerabilities are identified. We use a vSocial VRLE testbed in a case study to showcase the effectiveness of our framework and demonstrate how a suitable attack tree formalism can result in a safer, more privacy-preserving, and more secure VRLE system.
Submitted 29 November, 2018;
originally announced November 2018.
-
Out-of-Distribution Detection Using an Ensemble of Self Supervised Leave-out Classifiers
Authors:
Apoorv Vyas,
Nataraj Jammalamadaka,
Xia Zhu,
Dipankar Das,
Bharat Kaul,
Theodore L. Willke
Abstract:
As deep learning methods form a critical part of commercially important applications such as autonomous driving and medical diagnostics, it is important to reliably detect out-of-distribution (OOD) inputs while employing these algorithms. In this work, we propose an OOD detection algorithm which comprises an ensemble of classifiers. We train each classifier in a self-supervised manner by leaving out a random subset of training data as OOD data and the rest as in-distribution (ID) data. We propose a novel margin-based loss over the softmax output which seeks to maintain at least a margin $m$ between the average entropy of the OOD and in-distribution samples. In conjunction with the standard cross-entropy loss, we minimize the novel loss to train an ensemble of classifiers. We also propose a novel method to combine the outputs of the ensemble of classifiers to obtain an OOD detection score and class prediction. Overall, our method convincingly outperforms Hendrycks et al. [7] and the current state-of-the-art ODIN [13] on several OOD detection benchmarks.
Submitted 4 September, 2018;
originally announced September 2018.
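One way to read the margin-based loss described in the abstract above is as a hinge on the gap between mean entropies. The sketch below is an assumption, not the authors' code: the hinge form, the function names, and the use of plain probability lists are all illustrative choices, since the abstract only states that a margin $m$ is maintained between the average entropy of OOD and in-distribution samples.

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of one softmax output vector.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(batch):
    # Average entropy over a batch of softmax output vectors.
    return sum(entropy(p) for p in batch) / len(batch)

def entropy_margin_loss(id_probs, ood_probs, m):
    # Assumed hinge form: the loss is zero once the mean OOD entropy
    # exceeds the mean in-distribution entropy by at least the margin m.
    return max(0.0, m + mean_entropy(id_probs) - mean_entropy(ood_probs))
```

Under this reading, confident predictions on in-distribution samples and near-uniform predictions on held-out samples drive the loss to zero; in training it would be added to the standard cross-entropy loss, as the abstract describes.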
-
Face Recognition Techniques: A Survey
Authors:
Raunak Dave,
Ankit Vyas,
Nikita P Desai
Abstract:
Research has expanded to extracting auxiliary information from various biometric techniques such as fingerprints, face, iris, palm and voice. This information contains major features such as gender, age, beard, mustache, scars, height, hair, skin color, glasses, weight, facial marks and tattoos. All this information contributes strongly to the identification of humans. The major challenges in face recognition are finding the age and gender of the person. This paper contributes a survey of various face recognition techniques for finding age and gender. The existing techniques are discussed based on their performance. This paper also provides future directions for further research.
Submitted 30 January, 2021; v1 submitted 20 March, 2018;
originally announced March 2018.
-
Automated Early Leaderboard Generation From Comparative Tables
Authors:
Mayank Singh,
Rajdeep Sarkar,
Atharva Vyas,
Pawan Goyal,
Animesh Mukherjee,
Soumen Chakrabarti
Abstract:
A leaderboard is a tabular presentation of performance scores of the best competing techniques that address a specific scientific problem. Manually maintained leaderboards take time to emerge, which induces a latency in performance discovery and meaningful comparison. This can delay dissemination of best practices to non-experts and practitioners. Regarding papers as proxies for techniques, we present a new system to automatically discover and maintain leaderboards in the form of partial orders between papers, based on performance reported therein. In principle, a leaderboard depends on the task, data set, other experimental settings, and the choice of performance metrics. Often there are also tradeoffs between different metrics. Thus, leaderboard discovery is not just a matter of accurately extracting performance numbers and comparing them. In fact, the levels of noise and uncertainty around performance comparisons are so large that reliable traditional extraction is infeasible. We mitigate these challenges by using relatively cleaner, structured parts of the papers, e.g., performance tables. We propose a novel performance improvement graph with papers as nodes, where edges encode noisy performance comparison information extracted from tables. Every individual performance edge is extracted from a table with citations to other papers. These extractions resemble (noisy) outcomes of 'matches' in an incomplete tournament. We propose several approaches to rank papers from these noisy 'match' outcomes. We show that our ranking scheme can reproduce various manually curated leaderboards very well. Using widely-used lists of state-of-the-art papers in 27 areas of Computer Science, we demonstrate that our system produces very reliable rankings.
Submitted 19 February, 2019; v1 submitted 13 February, 2018;
originally announced February 2018.
-
Optimization of Ensemble Supervised Learning Algorithms for Increased Sensitivity, Specificity, and AUC of Population-Based Colorectal Cancer Screenings
Authors:
Anirudh Kamath,
Aditya Singh,
Raj Ramnani,
Ayush Vyas,
Jay Shenoy
Abstract:
Over 150,000 new people in the United States are diagnosed with colorectal cancer each year. Nearly a third die from it (American Cancer Society). The only approved noninvasive diagnosis tools currently involve fecal blood count tests (FOBTs) or stool DNA tests. Fecal blood count tests take only five minutes and are available over the counter for as low as \$15. They are highly specific, yet not nearly as sensitive, yielding a high percentage (25%) of false negatives (Colon Cancer Alliance). Moreover, FOBT results are far too generalized, meaning that a positive result could mean much more than just colorectal cancer, and could just as easily mean hemorrhoids, anal fissure, proctitis, Crohn's disease, diverticulosis, ulcerative colitis, rectal ulcer, rectal prolapse, ischemic colitis, angiodysplasia, rectal trauma, proctitis from radiation therapy, and others. Stool DNA tests, the modern benchmark for CRC screening, have a much higher sensitivity and specificity, but also cost \$600, take two weeks to process, and are not for high-risk individuals or people with a history of polyps. To yield a cheap and effective CRC screening alternative, a unique ensemble-based classification algorithm is put in place that considers the FIT result, BMI, smoking history, and diabetic status of patients. This method is tested under ten-fold cross validation to have a .95 AUC, 92% specificity, 89% sensitivity, .88 F1, and 90% precision. Once clinically validated, this test promises to be cheaper, faster, and potentially more accurate when compared to a stool DNA test.
Submitted 14 August, 2017; v1 submitted 13 August, 2017;
originally announced August 2017.
-
Trajectory and Policy Aware Sender Anonymity in Location Based Services
Authors:
Alin Deutsch,
Richard Hull,
Avinash Vyas,
Kevin Keliang Zhao
Abstract:
We consider Location-based Service (LBS) settings, where an LBS provider logs the requests sent by mobile device users over a period of time and later wants to publish/share these logs. Log sharing can be extremely valuable for advertising, data mining research and network management, but it poses a serious threat to the privacy of LBS users. Sender anonymity solutions prevent a malicious attacker from inferring the interests of LBS users by associating them with their service requests after gaining access to the anonymized logs. With the fast-increasing adoption of smartphones and the concern that historic user trajectories are becoming more accessible, it becomes necessary for any sender anonymity solution to protect against attackers that are trajectory-aware (i.e., have access to historic user trajectories) as well as policy-aware (i.e., they know the log anonymization policy). We call such attackers TP-aware.
This paper introduces a first privacy guarantee against TP-aware attackers, called TP-aware sender k-anonymity. It turns out that there are many possible TP-aware anonymizations for the same LBS log, each with a different utility to the consumer of the anonymized log. The problem of finding the optimal TP-aware anonymization is investigated. We show that trajectory-awareness renders the problem computationally harder than the trajectory-unaware variants found in the literature (NP-complete in the size of the log, versus PTIME). We describe a PTIME l-approximation algorithm for trajectories of length l and empirically show that it scales to large LBS logs (up to 2 million users).
Submitted 29 February, 2012;
originally announced February 2012.