Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2312.11805

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2311.07911

Instruction-Following Evaluation for Large Language Models

Authors: Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou

Abstract: One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval

Submitted 14 November, 2023; originally announced November 2023.

MSC Class: 68T50 (Primary) 68T99 (Secondary) ACM Class: I.2.7

arXiv:2305.10403

PaLM 2 Technical Report

Authors: Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, et al. (103 additional authors not shown)

Abstract: We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.

Submitted 13 September, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

arXiv:2304.04947

Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference

Authors: Tao Lei, Junwen Bai, Siddhartha Brahma, Joshua Ainslie, Kenton Lee, Yanqi Zhou, Nan Du, Vincent Y. Zhao, Yuexin Wu, Bo Li, Yu Zhang, Ming-Wei Chang

Abstract: We propose Conditional Adapter (CoDA), a parameter-efficient transfer learning method that also improves inference efficiency. CoDA generalizes beyond standard adapter approaches to enable a new way of balancing speed and accuracy using conditional computation. Starting with an existing dense pretrained model, CoDA adds sparse activation together with a small number of new parameters and a light-weight training phase. Our experiments demonstrate that the CoDA approach provides an unexpectedly efficient way to transfer knowledge. Across a variety of language, vision, and speech tasks, CoDA achieves a 2x to 8x inference speed-up compared to the state-of-the-art Adapter approaches with moderate to no accuracy loss and the same parameter efficiency.

Submitted 26 November, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

Comments: NeurIPS camera ready version

arXiv:2303.09752

CoLT5: Faster Long-Range Transformers with Conditional Computation

Authors: Joshua Ainslie, Tao Lei, Michiel de Jong, Santiago Ontañón, Siddhartha Brahma, Yury Zemlyanskiy, David Uthus, Mandy Guo, James Lee-Thorp, Yi Tay, Yun-Hsuan Sung, Sumit Sanghai

Abstract: Many natural language processing tasks benefit from long inputs, but processing long documents with Transformers is expensive -- not only due to quadratic attention complexity but also from applying feedforward and projection layers to every token. However, not all tokens are equally important, especially for longer documents. We propose CoLT5, a long-input Transformer model that builds on this intuition by employing conditional computation, devoting more resources to important tokens in both feedforward and attention layers. We show that CoLT5 achieves stronger performance than LongT5 with much faster training and inference, achieving SOTA on the long-input SCROLLS benchmark. Moreover, CoLT5 can effectively and tractably make use of extremely long inputs, showing strong gains up to 64k input length.

Submitted 23 October, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

Comments: Accepted at EMNLP 2023

arXiv:2211.01267

Multi-Vector Retrieval as Sparse Alignment

Authors: Yujie Qian, Jinhyuk Lee, Sai Meher Karthik Duddu, Zhuyun Dai, Siddhartha Brahma, Iftekhar Naim, Tao Lei, Vincent Y. Zhao

Abstract: Multi-vector retrieval models improve over single-vector dual encoders on many information retrieval tasks. In this paper, we cast the multi-vector retrieval problem as sparse alignment between query and document tokens. We propose AligneR, a novel multi-vector retrieval model that learns sparsified pairwise alignments between query and document tokens (e.g. 'dog' vs. 'puppy') and per-token unary saliences reflecting their relative importance for retrieval. We show that controlling the sparsity of pairwise token alignments often brings significant performance gains. While most factoid questions focusing on a specific part of a document require a smaller number of alignments, others requiring a broader understanding of a document favor a larger number of alignments. Unary saliences, on the other hand, decide whether a token ever needs to be aligned with others for retrieval (e.g. 'kind' from 'kind of currency is used in new zealand'). With sparsified unary saliences, we are able to prune a large number of query and document token vectors and improve the efficiency of multi-vector retrieval. We learn the sparse unary saliences with entropy-regularized linear programming, which outperforms other methods to achieve sparsity. In a zero-shot setting, AligneR scores 51.1 points nDCG@10, achieving a new retriever-only state-of-the-art on 13 tasks in the BEIR benchmark. In addition, adapting pairwise alignments with a few examples (<= 8) further improves the performance up to 15.7 points nDCG@10 for argument retrieval tasks. The unary saliences of AligneR help us to keep only 20% of the document token representations with minimal performance loss. We further show that our model often produces interpretable alignments and significantly improves its performance when initialized from larger language models.

Submitted 2 November, 2022; originally announced November 2022.

arXiv:2210.11416

Scaling Instruction-Finetuned Language Models

Authors: Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, et al. (10 additional authors not shown)

Abstract: Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

Submitted 6 December, 2022; v1 submitted 20 October, 2022; originally announced October 2022.

Comments: Public checkpoints: https://huggingface.co/docs/transformers/model_doc/flan-t5

arXiv:2210.03841

Breaking BERT: Evaluating and Optimizing Sparsified Attention

Authors: Siddhartha Brahma, Polina Zablotskaia, David Mimno

Abstract: Transformers allow attention between all pairs of tokens, but there is reason to believe that most of these connections - and their quadratic time and memory - may not be necessary. But which ones? We evaluate the impact of sparsification patterns with a series of ablation experiments. First, we compare masks based on syntax, lexical similarity, and token position to random connections, and measure which patterns reduce performance the least. We find that on three common finetuning tasks even using attention that is at least 78% sparse can have little effect on performance if applied at later transformer layers, but that applying sparsity throughout the network reduces performance significantly. Second, we vary the degree of sparsity for three patterns supported by previous work, and find that connections to neighbouring tokens are the most significant. Finally, we treat sparsity as an optimizable parameter, and present an algorithm to learn degrees of neighboring connections that gives a fine-grained control over the accuracy-sparsity trade-off while approaching the performance of existing methods.

Submitted 7 October, 2022; originally announced October 2022.

Comments: Shorter version accepted to SNN2021 workshop

arXiv:2011.14459

Improved Semantic Role Labeling using Parameterized Neighborhood Memory Adaptation

Authors: Ishan Jindal, Ranit Aharonov, Siddhartha Brahma, Huaiyu Zhu, Yunyao Li

Abstract: Deep neural models achieve some of the best results for semantic role labeling. Inspired by instance-based learning that utilizes nearest neighbors to handle low-frequency context-specific training samples, we investigate the use of memory adaptation techniques in deep neural models. We propose a parameterized neighborhood memory adaptive (PNMA) method that uses a parameterized representation of the nearest neighbors of tokens in a memory of activations and makes predictions based on the most similar samples in the training data. We empirically show that PNMA consistently improves the SRL performance of the base model irrespective of types of word embeddings. Coupled with contextualized word embeddings derived from BERT, PNMA improves over existing models for both span and dependency semantic parsing datasets, especially on out-of-domain text, reaching F1 scores of 80.2 and 84.97 on the CoNLL2005 and CoNLL2009 datasets, respectively.

Submitted 29 November, 2020; originally announced November 2020.

arXiv:2011.04732

CLAR: A Cross-Lingual Argument Regularizer for Semantic Role Labeling

Authors: Ishan Jindal, Yunyao Li, Siddhartha Brahma, Huaiyu Zhu

Abstract: Semantic role labeling (SRL) identifies predicate-argument structure(s) in a given sentence. Although different languages have different argument annotations, polyglot training, the idea of training one model on multiple languages, has previously been shown to outperform monolingual baselines, especially for low resource languages. In fact, even a simple combination of data has been shown to be effective with polyglot training by representing the distant vocabularies in a shared representation space. Meanwhile, despite the dissimilarity in argument annotations between languages, certain argument labels do share common semantic meaning across languages (e.g. adjuncts have more or less similar semantic meaning across languages). To leverage such similarity in annotation space across languages, we propose a method called Cross-Lingual Argument Regularizer (CLAR). CLAR identifies such linguistic annotation similarity across languages and exploits this information to map the target language arguments using a transformation of the space on which source language arguments lie. By doing so, our experimental results show that CLAR consistently improves SRL performance on multiple languages over monolingual and polyglot baselines for low resource languages.

Submitted 9 November, 2020; originally announced November 2020.

Comments: EMNLP 2020, ACL Findings

arXiv:2009.08560

doi 10.18653/v1/2020.emnlp-main.91

Small but Mighty: New Benchmarks for Split and Rephrase

Authors: Li Zhang, Huaiyu Zhu, Siddhartha Brahma, Yunyao Li

Abstract: Split and Rephrase is a text simplification task of rewriting a complex sentence into simpler ones. As a relatively new task, it is paramount to ensure the soundness of its evaluation benchmark and metric. We find that the widely used benchmark dataset universally contains easily exploitable syntactic cues caused by its automatic generation process. Taking advantage of such cues, we show that even a simple rule-based model can perform on par with the state-of-the-art model. To remedy such limitations, we collect and release two crowdsourced benchmark datasets. We not only make sure that they contain significantly more diverse syntax, but also carefully control for their quality according to a well-defined set of criteria. While no satisfactory automatic metric exists, we apply fine-grained manual evaluation based on these criteria using crowdsourcing, showing that our datasets better represent the task and are significantly more challenging for the models.

Submitted 12 December, 2020; v1 submitted 17 September, 2020; originally announced September 2020.

Comments: In EMNLP 2020

Journal ref: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020) 1198-1205

arXiv:1808.08424

Efficiently Processing Workflow Provenance Queries on SPARK

Authors: Rajmohan C, Pranay Lohia, Himanshu Gupta, Siddhartha Brahma, Mauricio Hernandez, Sameep Mehta

Abstract: In this paper, we investigate how we can leverage the Spark platform for efficiently processing provenance queries on large volumes of workflow provenance data. We focus on processing provenance queries at the attribute-value level, which is the finest granularity available. We propose a novel weakly connected component based framework which is carefully engineered to quickly determine a minimal volume of data containing the entire lineage of the queried attribute-value. This minimal volume of data is then processed to figure out the provenance of the queried attribute-value. The proposed framework computes weakly connected components on the workflow provenance graph and further partitions the large components as a collection of weakly connected sets. The framework exploits the workflow dependency graph to effectively partition the large components into a collection of weakly connected sets. We study the effectiveness of the proposed framework through experiments on a provenance trace obtained from a real-life unstructured text curation workflow. On provenance graphs containing up to 500M nodes and edges, we show that the proposed framework answers provenance queries in real-time and easily outperforms the naive approaches.

Submitted 25 October, 2018; v1 submitted 25 August, 2018; originally announced August 2018.

arXiv:1808.05908

Improved Language Modeling by Decoding the Past

Authors: Siddhartha Brahma

Abstract: Highly regularized LSTMs achieve impressive results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last token in the context using the predicted distribution of the next token. This biases the model towards retaining more contextual information, in turn improving its ability to predict the next token. With negligible overhead in the number of parameters and training time, our Past Decode Regularization (PDR) method achieves a word level perplexity of 55.6 on the Penn Treebank and 63.5 on the WikiText-2 datasets using a single softmax. We also show gains by using PDR in combination with a mixture-of-softmaxes, achieving a word level perplexity of 53.8 and 60.5 on these datasets. In addition, our method achieves 1.169 bits-per-character on the Penn Treebank Character dataset for character level language modeling. These results constitute a new state-of-the-art in their respective settings.

Submitted 23 January, 2019; v1 submitted 14 August, 2018; originally announced August 2018.

arXiv:1808.04343

REGMAPR - Text Matching Made Easy

Authors: Siddhartha Brahma

Abstract: Text matching is a fundamental problem in natural language processing. Neural models using bidirectional LSTMs for sentence encoding and inter-sentence attention mechanisms perform remarkably well on several benchmark datasets. We propose REGMAPR - a simple and general architecture for text matching that does not use inter-sentence attention. Starting from a Siamese architecture, we augment the embeddings of the words with two features based on exact and paraphrase match between words in the two sentences. We train the model using three types of regularization on datasets for textual entailment, paraphrase detection and semantic relatedness. REGMAPR performs comparably or better than more complex neural models or models using a large number of handcrafted features. REGMAPR achieves state-of-the-art results for paraphrase detection on the SICK dataset and for textual entailment on the SNLI dataset among models that do not use inter-sentence attention.

Submitted 10 September, 2018; v1 submitted 13 August, 2018; originally announced August 2018.

arXiv:1808.04217

Unsupervised Learning of Sentence Representations Using Sequence Consistency

Authors: Siddhartha Brahma

Abstract: Computing universal distributed representations of sentences is a fundamental task in natural language processing. We propose ConsSent, a simple yet surprisingly powerful unsupervised method to learn such representations by enforcing consistency constraints on sequences of tokens. We consider two classes of such constraints -- sequences that form a sentence and between two sequences that form a sentence when merged. We learn sentence encoders by training them to distinguish between consistent and inconsistent examples, the latter being generated by randomly perturbing consistent examples in six different ways. Extensive evaluation on several transfer learning and linguistic probing tasks shows improved performance over strong unsupervised and supervised baselines, substantially surpassing them in several cases. Our best results are achieved by training sentence encoders in a multitask setting and by an ensemble of encoders trained on the individual tasks.

Submitted 23 January, 2019; v1 submitted 10 August, 2018; originally announced August 2018.

arXiv:1805.07340

Improved Sentence Modeling using Suffix Bidirectional LSTM

Authors: Siddhartha Brahma

Abstract: Recurrent neural networks have become ubiquitous in computing representations of sequential data, especially textual data in natural language processing. In particular, Bidirectional LSTMs are at the heart of several neural models achieving state-of-the-art performance in a wide variety of tasks in NLP. However, BiLSTMs are known to suffer from sequential bias - the contextual representation of a token is heavily influenced by tokens close to it in a sentence. We propose a general and effective improvement to the BiLSTM model which encodes each suffix and prefix of a sequence of tokens in both forward and reverse directions. We call our model Suffix Bidirectional LSTM or SuBiLSTM. This introduces an alternate bias that favors long range dependencies. We apply SuBiLSTMs to several tasks that require sentence modeling. We demonstrate that using SuBiLSTM instead of a BiLSTM in existing models leads to improvements in performance in learning general sentence representations, text classification, textual entailment and paraphrase detection. Using SuBiLSTM we achieve new state-of-the-art results for fine-grained sentiment classification and question classification.

Submitted 10 September, 2018; v1 submitted 18 May, 2018; originally announced May 2018.

arXiv:1802.07374 [pdf, ps, other]

On the scaling of polynomial features for representation matching

Authors: Siddhartha Brahma

Abstract: In many neural models, new features as polynomial functions of existing ones are used to augment representations. Using the natural language inference task as an example, we investigate the use of scaled polynomials of degree 2 and above as matching features. We find that scaling degree 2 features has the highest impact on performance, reducing classification error by 5% in the best models.

Submitted 20 February, 2018; originally announced February 2018.

Comments: 4 pages, Submitted to ICLR 2018 workshop

arXiv:1802.07370 [pdf, other]

SufiSent - Universal Sentence Representations Using Suffix Encodings

Authors: Siddhartha Brahma

Abstract: Computing universal distributed representations of sentences is a fundamental task in natural language processing. We propose a method to learn such representations by encoding the suffixes of word sequences in a sentence and training on the Stanford Natural Language Inference (SNLI) dataset. We demonstrate the effectiveness of our approach by evaluating it on the SentEval benchmark, improving on existing approaches on several transfer tasks.

Submitted 20 February, 2018; originally announced February 2018.

Comments: 4 pages, Submitted to ICLR 2018 workshop

arXiv:1612.02062 [pdf, other]

Consistency in the face of change: an adaptive approach to physical layer cooperation

Authors: Ayan Sengupta, Yahya H. Ezzeldin, Siddhartha Brahma, Christina Fragouli, Suhas Diggavi

Abstract: Most existing works on physical-layer (PHY) cooperation (beyond routing) focus on how to best use a given, static relay network--while wireless networks are anything but static. In this paper, we pose a different set of questions: given that we have multiple devices within range, which relay(s) do we use for PHY cooperation, to maintain a consistent target performance? How can we efficiently adapt, as network conditions change? And how important is it, in terms of performance, to adapt? Although adapting to the best path when routing is a well understood problem, how to do so over PHY cooperation networks is an open question. Our contributions are: (1) We demonstrate via theoretical evaluation, a diminishing returns trend as the number of deployed relays increases. (2) Using a simple algorithm based on network metrics, we efficiently select the sub-network to use at any given time to maintain a target reliability. (3) When streaming video from Netflix, we experimentally show (using measurements from a WARP radio testbed employing DIQIF relaying) that our adaptive PHY cooperation scheme provides a throughput gain of 2x over nonadaptive PHY schemes, and a gain of 6x over genie-aided IP-level adaptive routing.

Submitted 6 December, 2016; originally announced December 2016.

arXiv:1610.01085 [pdf, other]

Towards the Design of Prospect-Theory based Human Decision Rules for Hypothesis Testing

Authors: V. Sriram Siddhardh Nadendla, Swastik Brahma, Pramod K. Varshney

Abstract: Detection rules have traditionally been designed for rational agents that minimize the Bayes risk (average decision cost). With the advent of crowd-sensing systems, there is a need to redesign binary hypothesis testing rules for behavioral agents, whose cognitive behavior is not captured by traditional utility functions such as Bayes risk. In this paper, we adopt prospect theory based models for decision makers. We consider special agent models namely optimists and pessimists in this paper, and derive optimal detection rules under different scenarios. Using an illustrative example, we also show how the decision rule of a human agent deviates from the Bayesian decision rule under various behavioral models, considered in this paper.

Submitted 4 October, 2016; originally announced October 2016.

Comments: 8 pages, 5 figures, Presented at the 54th Annual Allerton Conference on Communication, Control, and Computing, 2016

arXiv:1509.08496 [pdf, ps, other]

doi 10.1109/LSP.2016.2604280

Optimal Auction Design with Quantized Bids

Authors: Nianxia Cao, Swastik Brahma, Pramod K. Varshney

Abstract: This letter considers the design of an auction mechanism to sell the object of a seller when the buyers quantize their private value estimates regarding the object prior to communicating them to the seller. The designed auction mechanism maximizes the utility of the seller (i.e., the auction is optimal), prevents buyers from communicating falsified quantized bids (i.e., the auction is incentive-compatible), and ensures that buyers will participate in the auction (i.e., the auction is individually-rational). The letter also investigates the design of the optimal quantization thresholds using which buyers quantize their private value estimates. Numerical results provide insights regarding the influence of the quantization thresholds on the auction mechanism.

Submitted 28 September, 2015; originally announced September 2015.

Comments: 6 pages, 3 figures, TSP letter

arXiv:1508.03011 [pdf, other]

Matching-based Spectrum Allocation in Cognitive Radio Networks

Authors: Raghed El-Bardan, Walid Saad, Swastik Brahma, Pramod K. Varshney

Abstract: In this paper, a novel spectrum association approach for cognitive radio networks (CRNs) is proposed. Based on a measure of both inference and confidence as well as on a measure of quality-of-service, the association between secondary users (SUs) in the network and frequency bands licensed to primary users (PUs) is investigated. The problem is formulated as a matching game between SUs and PUs. In this game, SUs employ a soft-decision Bayesian framework to detect PUs' signals and, eventually, rank them based on the logarithm of the a posteriori ratio. A performance measure that captures both the ranking metric and rate is further computed by the SUs. Using this performance measure, a PU evaluates its own utility function that it uses to build its own association preferences. A distributed algorithm that allows both SUs and PUs to interact and self-organize into a stable match is proposed. Simulation results show that the proposed algorithm can improve the sum of SUs' rates by up to 20 % and 60 % relative to the deferred acceptance algorithm and random channel allocation approach, respectively. The results also show an improved convergence time.

Submitted 12 August, 2015; originally announced August 2015.

Comments: 16 pages, 4 figures

arXiv:1504.03413 [pdf, ps, other]

doi 10.1109/TSIPN.2016.2607119

Consensus based Detection in the Presence of Data Falsification Attacks

Authors: Bhavya Kailkhura, Swastik Brahma, Pramod K. Varshney

Abstract: This paper considers the problem of detection in distributed networks in the presence of data falsification (Byzantine) attacks. Detection approaches considered in the paper are based on fully distributed consensus algorithms, where all of the nodes exchange information only with their neighbors in the absence of a fusion center. In such networks, we characterize the negative effect of Byzantines on the steady-state and transient detection performance of the conventional consensus based detection algorithms. To address this issue, we study the problem from the network designer's perspective. More specifically, we first propose a distributed weighted average consensus algorithm that is robust to Byzantine attacks. We show that, under reasonable assumptions, the global test statistic for detection can be computed locally at each node using our proposed consensus algorithm. We exploit the statistical distribution of the nodes' data to devise techniques for mitigating the influence of data falsifying Byzantines on the distributed detection system. Since some parameters of the statistical distribution of the nodes' data might not be known a priori, we propose learning based techniques to enable an adaptive design of the local fusion or update rules.

Submitted 13 April, 2015; originally announced April 2015.

arXiv:1410.5904 [pdf, ps, other]

Distributed Detection in Tree Networks: Byzantines and Mitigation Techniques

Authors: Bhavya Kailkhura, Swastik Brahma, Berkan Dulek, Yunghsiang S Han, Pramod K. Varshney

Abstract: In this paper, the problem of distributed detection in tree networks in the presence of Byzantines is considered. Closed form expressions for optimal attacking strategies that minimize the miss detection error exponent at the fusion center (FC) are obtained. We also look at the problem from the network designer's (FC's) perspective. We study the problem of designing optimal distributed detection parameters in a tree network in the presence of Byzantines. Next, we model the strategic interaction between the FC and the attacker as a Leader-Follower (Stackelberg) game. This formulation provides a methodology for predicting attacker and defender (FC) equilibrium strategies, which can be used to implement the optimal detector. Finally, a reputation based scheme to identify Byzantines is proposed and its performance is analytically evaluated. We also provide some numerical examples to gain insights into the solution.

Submitted 21 October, 2014; originally announced October 2014.

arXiv:1408.3434 [pdf, other]

doi 10.1109/LSP.2014.2365196

Asymptotic Analysis of Distributed Bayesian Detection with Byzantine Data

Authors: Bhavya Kailkhura, Yunghsiang S. Han, Swastik Brahma, Pramod K. Varshney

Abstract: In this letter, we consider the problem of distributed Bayesian detection in the presence of data falsifying Byzantines in the network. The problem of distributed detection is formulated as a binary hypothesis test at the fusion center (FC) based on 1-bit data sent by the sensors. Adopting Chernoff information as our performance metric, we study the detection performance of the system under Byzantine attack in the asymptotic regime. The expression for minimum attacking power required by the Byzantines to blind the FC is obtained. More specifically, we show that above a certain fraction of Byzantine attackers in the network, the detection scheme becomes completely incapable of utilizing the sensor data for detection. When the fraction of Byzantines is not sufficient to blind the FC, we also provide closed form expressions for the optimal attacking strategies for the Byzantines that most degrade the detection performance.

Submitted 14 August, 2014; originally announced August 2014.

Comments: arXiv admin note: substantial text overlap with arXiv:1307.3544

arXiv:1405.5543 [pdf, ps, other]

doi 10.1109/TSP.2015.2398838

Target Tracking via Crowdsourcing: A Mechanism Design Approach

Authors: Nianxia Cao, Swastik Brahma, Pramod K. Varshney

Abstract: In this paper, we propose a crowdsourcing based framework for myopic target tracking by designing an incentive-compatible mechanism based optimal auction in a wireless sensor network (WSN) containing sensors that are selfish and profit-motivated. For typical WSNs which have limited bandwidth, the fusion center (FC) has to distribute the total number of bits that can be transmitted from the sensors to the FC among the sensors. To accomplish the task, the FC conducts an auction by soliciting bids from the selfish sensors, which reflect how much they value their energy cost. Furthermore, the rationality and truthfulness of the sensors are guaranteed in our model. The final problem is formulated as a multiple-choice knapsack problem (MCKP), which is solved by the dynamic programming method in pseudo-polynomial time. Simulation results show the effectiveness of our proposed approach in terms of both the tracking performance and lifetime of the sensor network.

Submitted 21 May, 2014; originally announced May 2014.

Comments: 13 pages, 11 figures, IEEE Signal Processing Transaction

arXiv:1403.6807 [pdf, other]

Optimal Spectrum Auction Design with Two-Dimensional Truthful Revelations under Uncertain Spectrum Availability

Authors: V. Sriram Siddhardh Nadendla, Swastik Brahma, Pramod K. Varshney

Abstract: In this paper, we propose a novel sealed-bid auction framework to address the problem of dynamic spectrum allocation in cognitive radio (CR) networks. We design an optimal auction mechanism that maximizes the moderator's expected utility, when the spectrum is not available with certainty. We assume that the moderator employs collaborative spectrum sensing in order to make a reliable inference about spectrum availability. Due to the presence of a collision cost whenever the moderator makes an erroneous inference, and a sensing cost at each CR, we investigate feasibility conditions that guarantee a non-negative utility at the moderator. We present tight theoretical-bounds on instantaneous network throughput and also show that our algorithm provides maximum throughput if the CRs have i.i.d. valuations. Since the moderator fuses CRs' sensing decisions to obtain a global inference regarding spectrum availability, we propose a novel strategy-proof fusion rule that encourages the CRs to simultaneously reveal truthful sensing decisions, along with truthful valuations to the moderator. Numerical examples are also presented to provide insights into the performance of the proposed auction under different scenarios.

Submitted 13 November, 2015; v1 submitted 26 March, 2014; originally announced March 2014.

Comments: 14 double-column pages, 7 figures, 2 tables, Under review in IEEE/ACM Transactions in Networking

arXiv:1309.4513 [pdf, ps, other]

doi 10.1109/TSP.2014.2321735

Distributed Detection in Tree Topologies with Byzantines

Authors: Bhavya Kailkhura, Swastik Brahma, Yunghsiang S. Han, Pramod K. Varshney

Abstract: In this paper, we consider the problem of distributed detection in tree topologies in the presence of Byzantines. The expression for minimum attacking power required by the Byzantines to blind the fusion center (FC) is obtained. More specifically, we show that when more than a certain fraction of individual node decisions are falsified, the decision fusion scheme becomes completely incapable. We obtain closed form expressions for the optimal attacking strategies that minimize the detection error exponent at the FC. We also look at the possible counter-measures from the FC's perspective to protect the network from these Byzantines. We formulate the robust topology design problem as a bi-level program and provide an efficient algorithm to solve it. We also provide some numerical results to gain insights into the solution.

Submitted 17 September, 2013; originally announced September 2013.

arXiv:1307.3544 [pdf, other]

doi 10.1109/TSP.2015.2450191

Distributed Bayesian Detection with Byzantine Data

Authors: Bhavya Kailkhura, Yunghsiang S. Han, Swastik Brahma, Pramod K. Varshney

Abstract: In this paper, we consider the problem of distributed Bayesian detection in the presence of Byzantines in the network. It is assumed that a fraction of the nodes in the network are compromised and reprogrammed by an adversary to transmit false information to the fusion center (FC) to degrade detection performance. The problem of distributed detection is formulated as a binary hypothesis test at the FC based on 1-bit data sent by the sensors. The expression for minimum attacking power required by the Byzantines to blind the FC is obtained. More specifically, we show that above a certain fraction of Byzantine attackers in the network, the detection scheme becomes completely incapable of utilizing the sensor data for detection. We analyze the problem under different attacking scenarios and derive results for different non-asymptotic cases. It is found that existing asymptotics-based results do not hold under several non-asymptotic scenarios. When the fraction of Byzantines is not sufficient to blind the FC, we also provide closed form expressions for the optimal attacking strategies for the Byzantines that most degrade the detection performance.

Submitted 3 September, 2014; v1 submitted 12 July, 2013; originally announced July 2013.

Comments: 32 pages, 4 figures, Submitted to IEEE Transactions on Signal Processing

Showing 1–30 of 30 results for author: Brahma, S