Search | arXiv e-print repository

An Evaluation Benchmark for Autoformalization in Lean4

Authors: Aryan Gulati, Devanshu Ladsaria, Shubhra Mishra, Jasdeep Sidhu, Brando Miranda

Abstract: Large Language Models (LLMs) hold the potential to revolutionize autoformalization. The introduction of Lean4, a mathematical programming language, presents an unprecedented opportunity to rigorously assess the autoformalization capabilities of LLMs. This paper introduces a novel evaluation benchmark designed for Lean4, applying it to test the abilities of state-of-the-art LLMs, including GPT-3.5, GPT-4, and Gemini Pro. Our comprehensive analysis reveals that, despite recent advancements, these LLMs still exhibit limitations in autoformalization, particularly in more complex areas of mathematics. These findings underscore the need for further development in LLMs to fully harness their potential in scientific research and development. This study not only benchmarks current LLM capabilities but also sets the stage for future enhancements in autoformalization.

Submitted 1 June, 2024; originally announced June 2024.

Comments: To appear at ICLR 2024 as part of the Tiny Papers track

arXiv:2404.05227 [pdf, ps, other]

A Note on the Common Haar State Model

Authors: Prabhanjan Ananth, Aditya Gulati, Yao-Ting Lin

Abstract: Common random string model is a popular model in classical cryptography with many constructions proposed in this model. We study a quantum analogue of this model called the common Haar state model, which was also studied in an independent work by Chen, Coladangelo and Sattath (arXiv 2024). In this model, every party in the cryptographic system receives many copies of one or more i.i.d Haar states. Our main result is the construction of a statistically secure PRSG with: (a) the output length of the PRSG is strictly larger than the key size, (b) the security holds even if the adversary receives $O\left(\frac{\lambda}{(\log(\lambda))^{1.01}}\right)$ copies of the pseudorandom state. We show the optimality of our construction by showing a matching lower bound. Our construction is simple and its analysis uses elementary techniques.

Submitted 8 April, 2024; originally announced April 2024.

Comments: 17 pages, 1 figure. arXiv admin note: text overlap with arXiv:2311.18566 by other authors

arXiv:2403.05530 [pdf, other]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2402.18032 [pdf, other]

Human Shape and Clothing Estimation

Authors: Aayush Gupta, Aditya Gulati, Himanshu, Lakshya LNU

Abstract: Human shape and clothing estimation has gained significant prominence in various domains, including online shopping, fashion retail, augmented reality (AR), virtual reality (VR), and gaming. The visual representation of human shape and clothing has become a focal point for computer vision researchers in recent years. This paper presents a comprehensive survey of the major works in the field, focusing on four key aspects: human shape estimation, fashion generation, landmark detection, and attribute recognition. For each of these tasks, the survey paper examines recent advancements, discusses their strengths and limitations, and qualitative differences in approaches and outcomes. By exploring the latest developments in human shape and clothing estimation, this survey aims to provide a comprehensive understanding of the field and inspire future research in this rapidly evolving domain.

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2311.02901 [pdf, other]

Pseudorandom Isometries

Authors: Prabhanjan Ananth, Aditya Gulati, Fatih Kaleoglu, Yao-Ting Lin

Abstract: We introduce a new notion called ${\cal Q}$-secure pseudorandom isometries (PRI). A pseudorandom isometry is an efficient quantum circuit that maps an $n$-qubit state to an $(n+m)$-qubit state in an isometric manner. In terms of security, we require that the output of a $q$-fold PRI on $\rho$, for $\rho \in {\cal Q}$, for any polynomial $q$, should be computationally indistinguishable from the output of a $q$-fold Haar isometry on $\rho$. By fine-tuning ${\cal Q}$, we recover many existing notions of pseudorandomness. We present a construction of PRIs and assuming post-quantum one-way functions, we prove the security of ${\cal Q}$-secure pseudorandom isometries (PRI) for different interesting settings of ${\cal Q}$. We also demonstrate many cryptographic applications of PRIs, including, length extension theorems for quantum pseudorandomness notions, message authentication schemes for quantum states, multi-copy secure public and private encryption schemes, and succinct quantum commitments.

Submitted 10 November, 2023; v1 submitted 6 November, 2023; originally announced November 2023.

arXiv:2304.06277 [pdf]

Optimizing Multi-Domain Performance with Active Learning-based Improvement Strategies

Authors: Anand Gokul Mahalingam, Aayush Shah, Akshay Gulati, Royston Mascarenhas, Rakshitha Panduranga

Abstract: Improving performance in multiple domains is a challenging task, and often requires significant amounts of data to train and test models. Active learning techniques provide a promising solution by enabling models to select the most informative samples for labeling, thus reducing the amount of labeled data required to achieve high performance. In this paper, we present an active learning-based framework for improving performance across multiple domains. Our approach consists of two stages: first, we use an initial set of labeled data to train a base model, and then we iteratively select the most informative samples for labeling to refine the model. We evaluate our approach on several multi-domain datasets, including image classification, sentiment analysis, and object recognition. Our experiments demonstrate that our approach consistently outperforms baseline methods and achieves state-of-the-art performance on several datasets. We also show that our method is highly efficient, requiring significantly fewer labeled samples than other active learning-based methods. Overall, our approach provides a practical and effective solution for improving performance across multiple domains using active learning techniques.

Submitted 13 April, 2023; originally announced April 2023.

Comments: 13 pages, 20 figures, draft work previously published as a medium story

arXiv:2304.00171 [pdf, other]

Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR

Authors: Rami Botros, Anmol Gulati, Tara N. Sainath, Krzysztof Choromanski, Ruoming Pang, Trevor Strohman, Weiran Wang, Jiahui Yu

Abstract: Conformer models maintain a large number of internal states, the vast majority of which are associated with self-attention layers. With limited memory bandwidth, reading these from memory at each inference step can slow down inference. In this paper, we design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs. We explore various ideas to improve the execution speed, including replacing lower conformer blocks with convolution-only blocks, strategically downsizing the architecture, and utilizing an RNNAttention-Performer. Our optimized conformer can be readily incorporated into a cascaded-encoder setting, allowing a second-pass decoder to operate on its output and improve the accuracy whenever more resources are available. Altogether, we find that these optimizations can reduce latency by a factor of 6.8x, and come at a reasonable trade-off in quality. With the cascaded second-pass, we show that the recognition accuracy is completely recoverable. Thus, our proposed encoder can double as a strong standalone encoder in on device, and as the first part of a high-performance ASR pipeline.

Submitted 31 March, 2023; originally announced April 2023.

arXiv:2211.01444 [pdf, other]

Pseudorandom (Function-Like) Quantum State Generators: New Definitions and Applications

Authors: Prabhanjan Ananth, Aditya Gulati, Luowen Qian, Henry Yuen

Abstract: Pseudorandom quantum states (PRS) are efficiently constructible states that are computationally indistinguishable from being Haar-random, and have recently found cryptographic applications. We explore new definitions, new properties and applications of pseudorandom states, and present the following contributions: 1. New Definitions: We study variants of pseudorandom function-like state (PRFS) generators, introduced by Ananth, Qian, and Yuen (CRYPTO'22), where the pseudorandomness property holds even when the generator can be queried adaptively or in superposition. We show feasibility of these variants assuming the existence of post-quantum one-way functions. 2. Classical Communication: We show that PRS generators with logarithmic output length imply commitment and encryption schemes with classical communication. Previous constructions of such schemes from PRS generators required quantum communication. 3. Simplified Proof: We give a simpler proof of the Brakerski--Shmueli (TCC'19) result that polynomially-many copies of uniform superposition states with random binary phases are indistinguishable from Haar-random states. 4. Necessity of Computational Assumptions: We also show that a secure PRS with output length logarithmic, or larger, in the key length necessarily requires computational assumptions.

Submitted 9 June, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

arXiv:2210.01122 [pdf, other]

BIASeD: Bringing Irrationality into Automated System Design

Authors: Aditya Gulati, Miguel Angel Lozano, Bruno Lepri, Nuria Oliver

Abstract: Human perception, memory and decision-making are impacted by tens of cognitive biases and heuristics that influence our actions and decisions. Despite the pervasiveness of such biases, they are generally not leveraged by today's Artificial Intelligence (AI) systems that model human behavior and interact with humans. In this theoretical paper, we claim that the future of human-machine collaboration will entail the development of AI systems that model, understand and possibly replicate human cognitive biases. We propose the need for a research agenda on the interplay between human cognitive biases and Artificial Intelligence. We categorize existing cognitive biases from the perspective of AI systems, identify three broad areas of interest and outline research directions for the design of AI systems that have a better understanding of our own biases.

Submitted 1 December, 2023; v1 submitted 30 September, 2022; originally announced October 2022.

Comments: 14 pages, 1 figure; Accepted for presentation at the AAAI Fall Symposium 2022 on Thinking Fast and Slow and Other Cognitive Theories in AI. Corrected typos; v3: Updated figure 1, added table 4

arXiv:2207.02107 [pdf, other]

EasyABM: a lightweight and easy to use heterogeneous agent-based modelling tool written in Julia

Authors: Renu Solanki, Monisha Khanna, Shailly Anand, Anita Gulati, Prateek Kumar, Munendra Kumar, Dushyant Kumar

Abstract: Agent based modelling is a computational approach that aims to understand the behaviour of complex systems through simplified interactions of programmable objects in computer memory called agents. Agent based models (ABMs) are predominantly used in fields of biology, ecology, social sciences and economics where the systems of interest often consist of several interacting entities. In this work, we present a Julia package EasyABM.jl for simplifying the process of studying agent based models. EasyABM.jl provides an intuitive and easy to understand functional approach for building and analysing agent based models.

Submitted 5 July, 2022; originally announced July 2022.

Comments: 18 pages, 7 figures

arXiv:2110.10329 [pdf, other]

SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training

Authors: Ankur Bapna, Yu-an Chung, Nan Wu, Anmol Gulati, Ye Jia, Jonathan H. Clark, Melvin Johnson, Jason Riesa, Alexis Conneau, Yu Zhang

Abstract: Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-training within a single model. We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech. To further align our model representations across modalities, we leverage alignment losses, specifically Translation Language Modeling (TLM) and Speech Text Matching (STM) that make use of supervised speech-text recognition data. We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST 2 speech translation, by around 1 BLEU compared to single-modality pre-trained models, while retaining close to SotA performance on LibriSpeech and SpeechStew ASR tasks. On four GLUE tasks and text-normalization, we observe evidence of capacity limitations and interference between the two modalities, leading to degraded performance compared to an equivalent text-only model, while still being competitive with BERT. Through extensive empirical analysis we also demonstrate the importance of the choice of objective function for speech pre-training, and the beneficial effect of adding additional supervised signals on the quality of the learned representations.

Submitted 19 October, 2021; originally announced October 2021.

arXiv:2109.13226 [pdf, other]

doi 10.1109/JSTSP.2022.3182537

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, et al. (1 additional authors not shown)

Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitudes of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.

Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

arXiv:2104.14830 [pdf, other]

Scaling End-to-End Models for Large-Scale Multilingual ASR

Authors: Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma, Junwen Bai

Abstract: Building ASR models across many languages is a challenging multi-task learning problem due to large variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity. We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.6K to 53.5K hours. We adopt GShard [1] to efficiently scale up to 10B parameters. Empirically, we find that (1) scaling the number of model parameters is an effective way to solve the capacity bottleneck - our 500M-param model already outperforms monolingual baselines and scaling it to 1B and 10B brought further quality gains; (2) larger models are not only more data efficient, but also more efficient in terms of training cost as measured in TPU days - the 1B-param model reaches the same accuracy at 34% of training time as the 500M-param model; (3) given a fixed capacity budget, adding depth works better than width and large encoders do better than large decoders; (4) with continuous training, they can be adapted to new languages and domains.

Submitted 11 September, 2021; v1 submitted 30 April, 2021; originally announced April 2021.

Comments: ASRU 2021

arXiv:2101.06914 [pdf, other]

Capitol (Pat)riots: A comparative study of Twitter and Parler

Authors: Hitkul, Avinash Prabhu, Dipanwita Guhathakurta, Jivitesh jain, Mallika Subramanian, Manvith Reddy, Shradha Sehgal, Tanvi Karandikar, Amogh Gulati, Udit Arora, Rajiv Ratn Shah, Ponnurangam Kumaraguru

Abstract: On 6 January 2021, a mob of right-wing conservatives stormed the USA Capitol Hill interrupting the session of congress certifying 2020 Presidential election results. Immediately after the start of the event, posts related to the riots started to trend on social media. A social media platform which stood out was a free speech endorsing social media platform Parler; it is being claimed as the platform on which the riots were planned and talked about. Our report presents a contrast between the trending content on Parler and Twitter around the time of riots. We collected data from both platforms based on the trending hashtags and draw comparisons based on what are the topics being talked about, who are the people active on the platforms and how organic is the content generated on the two platforms. While the content trending on Twitter had strong resentments towards the event and called for action against rioters and inciters, Parler content had a strong conservative narrative echoing the ideas of voter fraud similar to the attacking mob. We also find a disproportionately high manipulation of traffic on Parler when compared to Twitter.

Submitted 18 January, 2021; originally announced January 2021.

arXiv:2011.10978 [pdf, ps, other]

On algorithms to find p-ordering

Authors: Aditya Gulati, Sayak Chakrabarti, Rajat Mittal

Abstract: The concept of p-ordering for a prime p was introduced by Manjul Bhargava (in his PhD thesis) to develop a generalized factorial function over an arbitrary subset of integers. This notion of p-ordering provides a representation of polynomials modulo prime powers, and has been used to prove properties of roots sets modulo prime powers. We focus on the complexity of finding a p-ordering given a prime p, an exponent k and a subset of integers modulo p^k. Our first algorithm gives a p-ordering for set of size n in time O(nk\log p), where set is considered modulo p^k. The subsets modulo p^k can be represented succinctly using the notion of representative roots (Panayi, PhD Thesis, 1995; Dwivedi et.al, ISSAC, 2019); a natural question would be, can we find a p-ordering more efficiently given this succinct representation. Our second algorithm achieves precisely that, we give a p-ordering in time O(d^2k\log p + nk \log p + nd), where d is the size of the succinct representation and n is the required length of the p-ordering. Another contribution that we make is to compute the structure of roots sets for prime powers p^k, when k is small. The number of root sets have been given in the previous work (Dearden and Metzger, Eur. J. Comb., 1997; Maulick, J. Comb. Theory, Ser. A, 2001), we explicitly describe all the root sets for p^2, p^3 and p^4.

Submitted 22 November, 2020; originally announced November 2020.

Comments: 26 pages
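
Illustrative note on the entry above: the abstract does not restate what a p-ordering is. Under Bhargava's standard definition, a_0 is any element of the set, and each later a_k is chosen to minimize the p-adic valuation of the product of (a_k - a_i) over all earlier i. The sketch below is a naive greedy built directly from that definition. It is not either of the paper's algorithms and does not reach their running times; it works on plain integers rather than residues modulo p^k, and the names `p_adic_valuation` and `naive_p_ordering` are hypothetical.

```python
from typing import Iterable, List


def p_adic_valuation(x: int, p: int) -> int:
    """Return the largest e such that p**e divides x (x must be nonzero)."""
    if x == 0:
        raise ValueError("valuation of 0 is infinite")
    x = abs(x)
    e = 0
    while x % p == 0:
        x //= p
        e += 1
    return e


def naive_p_ordering(elements: Iterable[int], p: int) -> List[int]:
    """Greedy p-ordering of a finite set of distinct integers.

    Starts from the smallest element (any start is allowed by the
    definition) and repeatedly appends the remaining element whose
    differences with the chosen ones have minimal total valuation.
    """
    remaining = sorted(set(elements))
    if not remaining:
        return []
    ordering = [remaining.pop(0)]
    while remaining:
        # The valuation of a product is the sum of the valuations.
        best = min(
            remaining,
            key=lambda c: sum(p_adic_valuation(c - a, p) for a in ordering),
        )
        remaining.remove(best)
        ordering.append(best)
    return ordering


if __name__ == "__main__":
    print(naive_p_ordering([0, 1, 2, 3, 4], 2))
```

For a run of consecutive integers the natural order is a p-ordering for every prime, which is why the example keeps 0 through 4 in order for p = 2.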

arXiv:2011.10798 [pdf, other]

A Better and Faster End-to-End Model for Streaming ASR

Authors: Bo Li, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, Wei Han, Qiao Liang, Yu Zhang, Trevor Strohman, Yonghui Wu

Abstract: End-to-end (E2E) models have shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay the predictions towards the end and thus has much higher partial latency compared to a conventional ASR model. To address this issue, we look at encouraging the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving on latency results in a quality degradation. To address this, we explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which has shown good improvements for ASR. Secondly, we also explore running a 2nd-pass beam search to improve quality. In order to ensure the 2nd-pass completes quickly, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders [5]. Overall, we find that the Conformer RNN-T with Cascaded Encoders offers a better quality and latency tradeoff for streaming ASR.

Submitted 11 February, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

Comments: Accepted in ICASSP 2021

arXiv:2010.16244 [pdf, other]

doi 10.1109/ICRA48506.2021.9561562

Interleaving Fast and Slow Decision Making

Authors: Aditya Gulati, Sarthak Soni, Shrisha Rao

Abstract: The "Thinking, Fast and Slow" paradigm of Kahneman proposes that we use two different styles of thinking -- a fast and intuitive System 1 for certain tasks, along with a slower but more analytical System 2 for others. While the idea of using this two-system style of thinking is gaining popularity in AI and robotics, our work considers how to interleave the two styles of decision-making, i.e., how System 1 and System 2 should be used together. For this, we propose a novel and general framework which includes a new System 0 to oversee Systems 1 and 2. At every point when a decision needs to be made, System 0 evaluates the situation and quickly hands over the decision-making process to either System 1 or System 2. We evaluate such a framework on a modified version of the classic Pac-Man game, with an already-trained RL algorithm for System 1, a Monte-Carlo tree search for System 2, and several different possible strategies for System 0. As expected, arbitrary switches between Systems 1 and 2 do not work, but certain strategies do well. With System 0, an agent is able to perform better than one that uses only System 1 or System 2.

Submitted 26 March, 2021; v1 submitted 30 October, 2020; originally announced October 2020.

Comments: 7 pages, 11 figures; typos corrected, references added

arXiv:2010.11148 [pdf, other]

FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization

Authors: Jiahui Yu, Chung-Cheng Chiu, Bo Li, Shuo-yiin Chang, Tara N. Sainath, Yanzhang He, Arun Narayanan, Wei Han, Anmol Gulati, Yonghui Wu, Ruoming Pang

Abstract: Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible. However, emitting fast without degrading quality, as measured by word error rate (WER), is highly challenging. Existing approaches including Early and Late Penalties and Constrained Alignments penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models. While successful in reducing delay, these approaches suffer from significant accuracy regression and also require additional word alignment information from an existing model. In this work, we propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models, and does not require any alignment. We demonstrate that FastEmit is more suitable to the sequence-level optimization of transducer models for streaming ASR by applying it on various end-to-end streaming ASR networks including RNN-Transducer, Transformer-Transducer, ConvNet-Transducer and Conformer-Transducer. We achieve 150-300 ms latency reduction with significantly better accuracy over previous techniques on a Voice Search test set. FastEmit also improves streaming ASR accuracy from 4.4%/8.9% to 3.1%/7.5% WER, while reducing 90th percentile latency from 210 ms to only 30 ms on LibriSpeech.

Submitted 3 February, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

Comments: Accepted in ICASSP 2021

arXiv:2010.06030 [pdf, other]

Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling

Authors: Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Wu, Ruoming Pang

Abstract: Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR, especially with inplace knowledge distillation during the training. The Dual-mode ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets, a widely used public dataset LibriSpeech and a large-scale dataset MultiDomain. Experiments and ablation studies demonstrate that Dual-mode ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both emission latency and recognition accuracy of streaming ASR. With Dual-mode ASR, we achieve new state-of-the-art streaming ASR results on both LibriSpeech and MultiDomain in terms of accuracy and latency.

Submitted 27 January, 2021; v1 submitted 12 October, 2020; originally announced October 2020.

Comments: Accepted in ICLR 2021

arXiv:2009.05566 [pdf, other]

Accelerating 2PC-based ML with Limited Trusted Hardware

Authors: Muqsit Nawaz, Aditya Gulati, Kunlong Liu, Vishwajeet Agrawal, Prabhanjan Ananth, Trinabh Gupta

Abstract: This paper describes the design, implementation, and evaluation of Otak, a system that allows two non-colluding cloud providers to run machine learning (ML) inference without knowing the inputs to inference. Prior work for this problem mostly relies on advanced cryptography such as two-party secure computation (2PC) protocols that provide rigorous guarantees but suffer from high resource overhead. Otak improves efficiency via a new 2PC protocol that (i) tailors recent primitives such as function and homomorphic secret sharing to ML inference, and (ii) uses trusted hardware in a limited capacity to bootstrap the protocol. At the same time, Otak reduces trust assumptions on trusted hardware by running a small code inside the hardware, restricting its use to a preprocessing step, and distributing trust over heterogeneous trusted hardware platforms from different vendors. An implementation and evaluation of Otak demonstrates that its CPU and network overhead converted to a dollar amount is 5.4$-$385$\times$ lower than state-of-the-art 2PC-based works. Besides, Otak's trusted computing base (code inside trusted hardware) is only 1,300 lines of code, which is 14.6$-$29.2$\times$ lower than the code-size in prior trusted hardware-based works.

Submitted 11 September, 2020; originally announced September 2020.

Comments: 19 pages

arXiv:2005.10627 [pdf, other]

Dynamic Sparsity Neural Networks for Automatic Speech Recognition

Authors: Zhaofeng Wu, Ding Zhao, Qiao Liang, Jiahui Yu, Anmol Gulati, Ruoming Pang

Abstract: In automatic speech recognition (ASR), model pruning is a widely adopted technique that reduces model size and latency to deploy neural network models on edge devices with resource constraints. However, multiple models with different sparsity levels usually need to be separately trained and deployed to heterogeneous target hardware with different resource specifications and for applications that have various latency requirements. In this paper, we present Dynamic Sparsity Neural Networks (DSNN) that, once trained, can instantly switch to any predefined sparsity configuration at run-time. We demonstrate the effectiveness and flexibility of DSNN using experiments on internal production datasets with Google Voice Search data, and show that the performance of a DSNN model is on par with that of individually trained single sparsity networks. Our trained DSNN model, therefore, can greatly ease the training process and simplify deployment in diverse scenarios with resource constraints.

Submitted 8 February, 2021; v1 submitted 16 May, 2020; originally announced May 2020.

Comments: ICASSP 2021. (c) 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2005.08100 [pdf, other]

Conformer: Convolution-augmented Transformer for Speech Recognition

Authors: Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang

Abstract: Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolution neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way. To this regard, we propose the convolution-augmented transformer for speech recognition, named Conformer. Conformer significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother. We also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.

Submitted 16 May, 2020; originally announced May 2020.

Comments: Submitted to Interspeech 2020

arXiv:2005.05513 [pdf, other]

Psychometric Analysis and Coupling of Emotions Between State Bulletins and Twitter in India during COVID-19 Infodemic

Authors: Baani Leen Kaur Jolly, Palash Aggrawal, Amogh Gulati, Amarjit Singh Sethi, Ponnurangam Kumaraguru, Tavpritesh Sethi

Abstract: The COVID-19 infodemic has been spreading faster than the pandemic itself. The misinformation riding upon the infodemic wave poses a major threat to people's health and governance systems. Since social media is the largest source of information, managing the infodemic not only requires mitigating misinformation but also an early understanding of psychological patterns resulting from it. During the COVID-19 crisis, Twitter alone has seen a sharp 45% increase in the usage of its curated events page, and a 30% increase in its direct messaging usage, since March 6th 2020. In this study, we analyze the psychometric impact and coupling of the COVID-19 infodemic with the official bulletins related to COVID-19 at the national and state level in India. We look at these two sources with a psycho-linguistic lens of emotions and quantify the extent and coupling between the two. We modified path, a deep skip-gram based open-sourced lexicon builder for effective capture of health-related emotions. We were then able to capture the time-evolution of health-related emotions in social media and official bulletins. An analysis of lead-lag relationships between the time series of extracted emotions from official bulletins and social media using Granger's causality showed that state bulletins were leading the social media for some emotions such as Medical Emergency. Further insights that are potentially relevant for the policymaker and the communicators actively engaged in mitigating misinformation are also discussed.
Our paper also introduces CoronaIndiaDataset, the first social media based COVID-19 dataset at national and state levels from India with over 5.6 million national and 2.6 million state-level tweets. Finally, we present our findings as COVibes, an interactive web application capturing psychometric insights captured upon the CoronaIndiaDataset, both at a national and state level.

Submitted 13 May, 2020; v1 submitted 11 May, 2020; originally announced May 2020.

arXiv:2005.03191 [pdf, other]

ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

Authors: Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu

Abstract: Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet and achieves a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system of 2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.

Submitted 15 May, 2020; v1 submitted 6 May, 2020; originally announced May 2020.

Comments: Submitted to Interspeech 2020

Showing 1–25 of 25 results for author: Gulati, A