Search | arXiv e-print repository

Pseudo-OOD training for robust language models

Authors: Dhanasekar Sundararaman, Nikhil Mehta, Lawrence Carin

Abstract: While pre-trained large-scale deep models have garnered attention as an important topic for many downstream natural language processing (NLP) tasks, such models often make unreliable predictions on out-of-distribution (OOD) inputs. As such, OOD detection is a key component of a reliable machine-learning model for any industry-scale application. Common approaches often assume access to additional O… ▽ More While pre-trained large-scale deep models have garnered attention as an important topic for many downstream natural language processing (NLP) tasks, such models often make unreliable predictions on out-of-distribution (OOD) inputs. As such, OOD detection is a key component of a reliable machine-learning model for any industry-scale application. Common approaches often assume access to additional OOD samples during the training stage, however, outlier distribution is often unknown in advance. Instead, we propose a post hoc framework called POORE - POsthoc pseudo-Ood REgularization, that generates pseudo-OOD samples using in-distribution (IND) data. The model is fine-tuned by introducing a new regularization loss that separates the embeddings of IND and OOD data, which leads to significant gains on the OOD prediction task during testing. We extensively evaluate our framework on three real-world dialogue systems, achieving new state-of-the-art in OOD detection. △ Less

Submitted 17 October, 2022; originally announced October 2022.

Comments: Work in progress

arXiv:2208.01755 [pdf, ps, other]

Debiasing Gender Bias in Information Retrieval Models

Authors: Dhanasekar Sundararaman, Vivek Subramanian

Abstract: Biases in culture, gender, ethnicity, etc. have existed for decades and have affected many areas of human social interaction. These biases have been shown to impact machine learning (ML) models, and for natural language processing (NLP), this can have severe consequences for downstream tasks. Mitigating gender bias in information retrieval (IR) is important to avoid propagating stereotypes. In thi… ▽ More Biases in culture, gender, ethnicity, etc. have existed for decades and have affected many areas of human social interaction. These biases have been shown to impact machine learning (ML) models, and for natural language processing (NLP), this can have severe consequences for downstream tasks. Mitigating gender bias in information retrieval (IR) is important to avoid propagating stereotypes. In this work, we employ a dataset consisting of two components: (1) relevance of a document to a query and (2) "gender" of a document, in which pronouns are replaced by male, female, and neutral conjugations. We definitively show that pre-trained models for IR do not perform well in zero-shot retrieval tasks when full fine-tuning of a large pre-trained BERT encoder is performed and that lightweight fine-tuning performed with adapter networks improves zero-shot retrieval performance almost by 20% over baseline. We also illustrate that pre-trained models have gender biases that result in retrieved articles tending to be more often male than female. We overcome this by introducing a debiasing technique that penalizes the model when it prefers males over females, resulting in an effective model that retrieves articles in a balanced fashion across genders. △ Less

Submitted 20 September, 2022; v1 submitted 2 August, 2022; originally announced August 2022.

Comments: Updated title to be reflective of the methods

arXiv:2205.03559 [pdf, other]

Improving Downstream Task Performance by Treating Numbers as Entities

Authors: Dhanasekar Sundararaman, Vivek Subramanian, Guoyin Wang, Liyan Xu, Lawrence Carin

Abstract: Numbers are essential components of text, like any other word tokens, from which natural language processing (NLP) models are built and deployed. Though numbers are typically not accounted for distinctly in most NLP tasks, there is still an underlying amount of numeracy already exhibited by NLP models. In this work, we attempt to tap this potential of state-of-the-art NLP models and transfer their… ▽ More Numbers are essential components of text, like any other word tokens, from which natural language processing (NLP) models are built and deployed. Though numbers are typically not accounted for distinctly in most NLP tasks, there is still an underlying amount of numeracy already exhibited by NLP models. In this work, we attempt to tap this potential of state-of-the-art NLP models and transfer their ability to boost performance in related tasks. Our proposed classification of numbers into entities helps NLP models perform well on several tasks, including a handcrafted Fill-In-The-Blank (FITB) task and on question answering using joint embeddings, outperforming the BERT and RoBERTa baseline classification. △ Less

Submitted 18 September, 2022; v1 submitted 7 May, 2022; originally announced May 2022.

Comments: Accepted to CIKM 2022

arXiv:2201.00075 [pdf, other]

How do lexical semantics affect translation? An empirical study

Authors: Vivek Subramanian, Dhanasekar Sundararaman

Abstract: Neural machine translation (NMT) systems aim to map text from one language into another. While there are a wide variety of applications of NMT, one of the most important is translation of natural language. A distinguishing factor of natural language is that words are typically ordered according to the rules of the grammar of a given language. Although many advances have been made in develo** NMT… ▽ More Neural machine translation (NMT) systems aim to map text from one language into another. While there are a wide variety of applications of NMT, one of the most important is translation of natural language. A distinguishing factor of natural language is that words are typically ordered according to the rules of the grammar of a given language. Although many advances have been made in develo** NMT systems for translating natural language, little research has been done on understanding how the word ordering of and lexical similarity between the source and target language affect translation performance. Here, we investigate these relationships on a variety of low-resource language pairs from the OpenSubtitles2016 database, where the source language is English, and find that the more similar the target language is to English, the greater the translation performance. In addition, we study the impact of providing NMT models with part of speech of words (POS) in the English sequence and find that, for Transformer-based models, the more dissimilar the target language is from English, the greater the benefit provided by POS. △ Less

Submitted 31 December, 2021; originally announced January 2022.

arXiv:1911.06156 [pdf, other]

Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding

Authors: Dhanasekar Sundararaman, Vivek Subramanian, Guoyin Wang, Shi**g Si, Dinghan Shen, Dong Wang, Lawrence Carin

Abstract: Attention-based models have shown significant improvement over traditional algorithms in several NLP tasks. The Transformer, for instance, is an illustrative example that generates abstract representations of tokens inputted to an encoder based on their relationships to all tokens in a sequence. Recent studies have shown that although such models are capable of learning syntactic features purely b… ▽ More Attention-based models have shown significant improvement over traditional algorithms in several NLP tasks. The Transformer, for instance, is an illustrative example that generates abstract representations of tokens inputted to an encoder based on their relationships to all tokens in a sequence. Recent studies have shown that although such models are capable of learning syntactic features purely by seeing examples, explicitly feeding this information to deep learning models can significantly enhance their performance. Leveraging syntactic information like part of speech (POS) may be particularly beneficial in limited training data settings for complex models such as the Transformer. We show that the syntax-infused Transformer with multiple features achieves an improvement of 0.7 BLEU when trained on the full WMT 14 English to German translation dataset and a maximum improvement of 1.99 BLEU points when trained on a fraction of the dataset. In addition, we find that the incorporation of syntax into BERT fine-tuning outperforms baseline on a number of downstream tasks from the GLUE benchmark. △ Less

Submitted 9 November, 2019; originally announced November 2019.

arXiv:1906.08340 [pdf, other]

Learning Compressed Sentence Representations for On-Device Text Processing

Authors: Dinghan Shen, Pengyu Cheng, Dhanasekar Sundararaman, Xinyuan Zhang, Qian Yang, Meng Tang, Asli Celikyilmaz, Lawrence Carin

Abstract: Vector representations of sentences, trained on massive text corpora, are widely used as generic sentence embeddings across a variety of NLP problems. The learned representations are generally assumed to be continuous and real-valued, giving rise to a large memory footprint and slow retrieval speed, which hinders their applicability to low-resource (memory and computation) platforms, such as mobil… ▽ More Vector representations of sentences, trained on massive text corpora, are widely used as generic sentence embeddings across a variety of NLP problems. The learned representations are generally assumed to be continuous and real-valued, giving rise to a large memory footprint and slow retrieval speed, which hinders their applicability to low-resource (memory and computation) platforms, such as mobile devices. In this paper, we propose four different strategies to transform continuous and generic sentence embeddings into a binarized form, while preserving their rich semantic information. The introduced methods are evaluated across a wide range of downstream tasks, where the binarized sentence embeddings are demonstrated to degrade performance by only about 2% relative to their continuous counterparts, while reducing the storage requirement by over 98%. Moreover, with the learned binary representations, the semantic relatedness of two sentences can be evaluated by simply calculating their Hamming distance, which is more computational efficient compared with the inner product operation between continuous embeddings. Detailed analysis and case study further validate the effectiveness of proposed methods. △ Less

Submitted 19 June, 2019; originally announced June 2019.

Comments: To appear at ACL 2019

arXiv:1711.10002 [pdf]

TweetIT- Analyzing Topics for Twitter Users to garner Maximum Attention

Authors: Dhanasekar Sundararaman, Priya Arora, Vishwanath Seshagiri

Abstract: Twitter, a microblogging service, is todays most popular platform for communication in the form of short text messages, called Tweets. Users use Twitter to publish their content either for expressing concerns on information news or views on daily conversations. When this expression emerges, they are experienced by the worldwide distribution network of users and not only by the interlocutor(s). Dep… ▽ More Twitter, a microblogging service, is todays most popular platform for communication in the form of short text messages, called Tweets. Users use Twitter to publish their content either for expressing concerns on information news or views on daily conversations. When this expression emerges, they are experienced by the worldwide distribution network of users and not only by the interlocutor(s). Depending upon the impact of the tweet in the form of the likes, retweets and percentage of followers increases for the user considering a window of time frame, we compute attention factor for each tweet for the selected user profiles. This factor is used to select the top 1000 Tweets, from each user profile, to form a document. Topic modelling is then applied to this document to determine the intent of the user behind the Tweets. After topics are modelled, the similarity is determined between the BBC news data-set containing the modelled topic, and the user document under evaluation. Finally, we determine the top words for a user which would enable us to find the topics which garnered attention and has been posted recently. The experiment is performed using more than 1.1M Tweets from around 500 Twitter profiles spanning Politics, Entertainment, Sports etc. and hundreds of BBC news articles. The results show that our analysis is efficient enough to enable us to find the topics which would act as a suggestion for users to get higher popularity rating for the user in the future. △ Less

Submitted 27 November, 2017; originally announced November 2017.

arXiv:1711.06970 [pdf]

How much is my car worth? A methodology for predicting used cars prices using Random Forest

Authors: Nabarun Pal, Priya Arora, Dhanasekar Sundararaman, Puneet Kohli, Sai Sumanth Palakurthy

Abstract: Cars are being sold more than ever. Develo** countries adopt the lease culture instead of buying a new car due to affordability. Therefore, the rise of used cars sales is exponentially increasing. Car sellers sometimes take advantage of this scenario by listing unrealistic prices owing to the demand. Therefore, arises a need for a model that can assign a price for a vehicle by evaluating its fea… ▽ More Cars are being sold more than ever. Develo** countries adopt the lease culture instead of buying a new car due to affordability. Therefore, the rise of used cars sales is exponentially increasing. Car sellers sometimes take advantage of this scenario by listing unrealistic prices owing to the demand. Therefore, arises a need for a model that can assign a price for a vehicle by evaluating its features taking the prices of other cars into consideration. In this paper, we use supervised learning method namely Random Forest to predict the prices of used cars. The model has been chosen after careful exploratory data analysis to determine the impact of each feature on price. A Random Forest with 500 Decision Trees were created to train the data. From experimental results, the training accuracy was found out to be 95.82%, and the testing accuracy was 83.63%. The the model can predict the price of cars accurately by choosing the most correlated features. △ Less

Submitted 19 November, 2017; originally announced November 2017.

Comments: FICC Camera Ready

arXiv:1706.05361 [pdf]

Twigraph: Discovering and Visualizing Influential Words between Twitter Profiles

Authors: Dhanasekar Sundararaman, Sudharshan Srinivasan

Abstract: The social media craze is on an ever increasing spree, and people are connected with each other like never before, but these vast connections are visually unexplored. We propose a methodology Twigraph to explore the connections between persons using their Twitter profiles. First, we propose a hybrid approach of recommending social media profiles, articles, and advertisements to a user.The profiles… ▽ More The social media craze is on an ever increasing spree, and people are connected with each other like never before, but these vast connections are visually unexplored. We propose a methodology Twigraph to explore the connections between persons using their Twitter profiles. First, we propose a hybrid approach of recommending social media profiles, articles, and advertisements to a user.The profiles are recommended based on the similarity score between the user profile, and profile under evaluation. The similarity between a set of profiles is investigated by finding the top influential words thus causing a high similarity through an Influence Term Metric for each word. Then, we group profiles of various domains such as politics, sports, and entertainment based on the similarity score through a novel clustering algorithm. The connectivity between profiles is envisaged using word graphs that help in finding the words that connect a set of profiles and the profiles that are connected to a word. Finally, we analyze the top influential words over a set of profiles through clustering by finding the similarity of that profiles enabling to break down a Twitter profile with a lot of followers to fine level word connections using word graphs. The proposed method was implemented on datasets comprising 1.1 M Tweets obtained from Twitter. Experimental results show that the resultant influential words were highly representative of the relationship between two profiles or a set of profiles △ Less

Submitted 29 June, 2017; v1 submitted 16 June, 2017; originally announced June 2017.

Showing 1–9 of 9 results for author: Sundararaman, D