Search | arXiv e-print repository

How Good is Zero-Shot MT Evaluation for Low Resource Indian Languages?

Authors: Anushka Singh, Ananya B. Sai, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M Khapra

Abstract: While machine translation evaluation has been studied primarily for high-resource languages, there has been a recent interest in evaluation for low-resource languages due to the increasing availability of data and models. In this paper, we focus on a zero-shot evaluation setting focusing on low-resource Indian languages, namely Assamese, Kannada, Maithili, and Punjabi. We collect sufficient Multi-Dimensional Quality Metrics (MQM) and Direct Assessment (DA) annotations to create test sets and meta-evaluate a plethora of automatic evaluation metrics. We observe that even for learned metrics, which are known to exhibit zero-shot performance, the Kendall Tau and Pearson correlations with human annotations are only as high as 0.32 and 0.45. Synthetic data approaches show mixed results and overall do not help close the gap by much for these languages. This indicates that there is still a long way to go for low-resource evaluation.

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2404.09664 [pdf, other]

Closing the Gap in the Trade-off between Fair Representations and Accuracy

Authors: Biswajit Rout, Ananya B. Sai, Arun Rajkumar

Abstract: The rapid developments of various machine learning models and their deployments in several applications have led to discussions around the importance of looking beyond the accuracies of these models. Fairness of such models is one such aspect that is deservedly gaining more attention. In this work, we analyse the natural language representations of documents and sentences (i.e., encodings) for any embedding-level bias that could potentially also affect the fairness of the downstream tasks that rely on them. We identify bias in these encodings either towards or against different sub-groups based on the difference in their reconstruction errors along various subsets of principal components. We explore and recommend ways to mitigate such bias in the encodings while also maintaining a decent accuracy in classification models that use them.

Submitted 15 April, 2024; originally announced April 2024.

Comments: DAI-24

arXiv:2307.03322 [pdf, other]

BiPhone: Modeling Inter Language Phonetic Influences in Text

Authors: Abhirut Gupta, Ananya B. Sai, Richard Sproat, Yuri Vasilevski, James S. Ren, Ambarish Jash, Sukhdeep S. Sodhi, Aravindan Raghuveer

Abstract: A large number of people are forced to use the Web in a language they have low literacy in due to technology asymmetries. Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly. We also introduce a new phoneme prediction pre-training task which helps byte models to recover performance close to SuperGLUE. Finally, we also release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.

Submitted 6 July, 2023; originally announced July 2023.

Comments: Accepted at ACL 2023

arXiv:2306.04366 [pdf, other]

doi 10.1109/TMC.2024.3373469

Enhancing Worker Recruitment in Collaborative Mobile Crowdsourcing: A Graph Neural Network Trust Evaluation Approach

Authors: Zhongwei Zhan, Yingjie Wang, Peiyong Duan, Akshita Maradapu Vera Venkata Sai, Zhaowei Liu, Chaocan Xiang, Xiangrong Tong, Weilong Wang, Zhipeng Cai

Abstract: Collaborative Mobile Crowdsourcing (CMCS) allows platforms to recruit worker teams to collaboratively execute complex sensing tasks. The efficiency of such collaborations could be influenced by trust relationships among workers. To obtain the asymmetric trust values among all workers in the social network, the Trust Reinforcement Evaluation Framework (TREF) based on Graph Convolutional Neural Networks (GCNs) is proposed in this paper. The task completion effect is comprehensively calculated by considering the workers' ability benefits, distance benefits, and trust benefits in this paper. The worker recruitment problem is modeled as an Undirected Complete Recruitment Graph (UCRG), for which a specific Tabu Search Recruitment (TSR) algorithm solution is proposed. An optimal execution team is recruited for each task by the TSR algorithm, and the collaboration team for the task is obtained under the constraint of privacy loss. To enhance the efficiency of the recruitment algorithm on a large scale and scope, the Mini-Batch K-Means clustering algorithm and edge computing technology are introduced, enabling distributed worker recruitment. Lastly, extensive experiments conducted on five real datasets validate that the recruitment algorithm proposed in this paper outperforms other baselines. Additionally, TREF proposed herein surpasses the performance of state-of-the-art trust evaluation methods in the literature.

Submitted 21 March, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: The article has been accepted by IEEE TMC, and its DOI is 10.1109/TMC.2024.3373469

arXiv:2212.10180 [pdf, other]

IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation metrics for Indian Languages

Authors: Ananya B. Sai, Vignesh Nagarajan, Tanay Dixit, Raj Dabre, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

Abstract: The rapid growth of machine translation (MT) systems has necessitated comprehensive studies to meta-evaluate evaluation metrics being used, which enables a better selection of metrics that best reflect MT quality. Unfortunately, most of the research focuses on high-resource languages, mainly English, the observations for which may not always apply to other languages. Indian languages, having over a billion speakers, are linguistically different from English, and to date, there has not been a systematic study of evaluating MT systems from English into Indian languages. In this paper, we fill this gap by creating an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems, and use it to establish correlations between annotator scores and scores obtained using existing automatic metrics. Our results show that pre-trained metrics, such as COMET, have the highest correlations with annotator scores. Additionally, we find that the metrics do not adequately capture fluency-based errors in Indian languages, and there is a need to develop metrics focused on Indian languages. We hope that our dataset and analysis will help promote further research in this area.

Submitted 3 July, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: ACL 2023 long paper

arXiv:2210.11664 [pdf, other]

Promoting Rigour in Blockchains Energy & Environmental Footprint Research: A Systematic Literature Review

Authors: Ashish Rajendra Sai, Harald Vranken

Abstract: There is a growing interest in understanding the energy and environmental footprint of digital currencies, specifically in cryptocurrencies such as Bitcoin and Ethereum. These cryptocurrencies are operated by a geographically distributed network of computing nodes, making it hard to accurately estimate their energy consumption. Existing studies, both in academia and industry, attempt to model the cryptocurrencies' energy consumption, often based on a number of assumptions, for instance about the hardware in use or the geographic distribution of the computing nodes. A number of these studies have already been widely criticized for their design choices and subsequent over- or under-estimation of the energy use. In this study, we evaluate the reliability of prior models and estimates by leveraging existing scientific literature from fields cognizant of blockchain such as social energy sciences and information systems. We first design a quality assessment framework based on existing research; we then conduct a systematic literature review examining scientific and non-academic literature, demonstrating common issues and potential avenues of addressing these issues. Our goal with this article is to advance the field by promoting scientific rigor in studies focusing on Blockchain's energy footprint. To that end, we provide a novel set of codes of conduct for the five most widely used research methodologies: quantitative energy modeling, literature reviews, data analysis & statistics, case studies, and experiments. We envision that these codes of conduct would assist in standardizing the design and assessment of studies focusing on blockchain-based systems' energy and environmental footprint.

Submitted 20 October, 2022; originally announced October 2022.

Comments: This article is currently under peer review

arXiv:2112.02721 [pdf, other]

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Authors: Kaustubh D. Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Shrivastava, Samson Tan, Tongshuang Wu, Jascha Sohl-Dickstein, Jinho D. Choi, Eduard Hovy, Ondrej Dusek, Sebastian Ruder, Sajant Anand, Nagender Aneja, Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle, Caroline Brun, Marco Antonio Sobrevilla Cabezudo, et al. (101 additional authors not shown)

Abstract: Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robustness analysis results are available publicly on the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter).

Submitted 11 October, 2022; v1 submitted 5 December, 2021; originally announced December 2021.

Comments: 39 pages, repository at https://github.com/GEM-benchmark/NL-Augmenter

arXiv:2109.05771 [pdf, other]

Perturbation CheckLists for Evaluating NLG Evaluation Metrics

Authors: Ananya B. Sai, Tanay Dixit, Dev Yashpal Sheth, Sreyas Mohan, Mitesh M. Khapra

Abstract: Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated. For example, there is a very low correlation between human scores on fluency and data coverage for the task of structured data to text generation. This suggests that the current recipe of proposing new automatic evaluation metrics for NLG by showing that they correlate well with scores assigned by humans for a single criterion (overall quality) alone is inadequate. Indeed, our extensive study involving 25 automatic evaluation metrics across 6 different tasks and 18 different evaluation criteria shows that there is no single metric which correlates well with human scores on all desirable criteria, for most NLG tasks. Given this situation, we propose CheckLists for better design and evaluation of automatic metrics. We design templates which target a specific criterion (e.g., coverage) and perturb the output such that the quality gets affected only along this specific criterion (e.g., the coverage drops). We show that existing evaluation metrics are not robust against even such simple perturbations and disagree with scores assigned by humans to the perturbed output. The proposed templates thus allow for a fine-grained assessment of automatic evaluation metrics exposing their limitations and will facilitate better design, analysis and evaluation of such metrics.

Submitted 13 September, 2021; originally announced September 2021.

Comments: Accepted at EMNLP 2021. See https://iitmnlp.github.io/EvalEval/ for our templates and code

arXiv:2108.13599 [pdf, other]

Through the Looking Glass: Diminishing Occlusions in Robot Vision Systems with Mirror Reflections

Authors: Kentaro Yoshioka, Hidenori Okuni, Tuan Thanh Ta, Akihide Sai

Abstract: The quality of robot vision greatly affects the performance of automation systems, where occlusions stand as one of the biggest challenges. If the target is occluded from the sensor, detecting and grasping such objects becomes very challenging. For example, when multiple robot arms cooperate in a single workplace, occlusions will be created under the robot arm itself and hide objects underneath. While occlusions can be greatly reduced by installing multiple sensors, the increase in sensor costs cannot be ignored. Moreover, the sensor placements must be rearranged every time the robot operation routine and layout change. To diminish occlusions, we propose the first robot vision system with tilt-type mirror reflection sensing. By instantly tilting the sensor itself, we obtain two sensing results with different views: conventional direct line-of-sight sensing and non-line-of-sight sensing via mirror reflections. Our proposed system removes occlusions adaptively by detecting the occlusions in the scene and dynamically configuring the sensor tilt angle to sense the detected occluded area. Thus, sensor rearrangements are not required even after changes in robot operation or layout. Since the required hardware is the tilt-unit and a commercially available mirror, the cost increase is marginal. Through experiments, we show that our system can achieve a similar detection accuracy as systems with multiple sensors, regardless of the single-sensor implementation.

Submitted 30 August, 2021; originally announced August 2021.

Comments: Accepted to IROS 2021

arXiv:2108.12901 [pdf, other]

BoostNSift: A Query Boosting and Code Sifting Technique for Method Level Bug Localization

Authors: Abdul Razzaq, Jim Buckley, James Vincent Patten, Muslim Chochlov, Ashish Rajendra Sai

Abstract: Locating bugs is an important, but effort-intensive and time-consuming task, when dealing with large-scale systems. To address this, Information Retrieval (IR) techniques are increasingly being used to suggest potential buggy source code locations, for given bug reports. While IR techniques are very scalable, in practice their effectiveness in accurately localizing bugs in a software system remains low. Results of empirical studies suggest that the effectiveness of bug localization techniques can be augmented by the configuration of queries used to locate buggy code. However, in most IR-based bug localization techniques, presented by researchers, the impact of the queries' configurations is not fully considered. In a similar vein, techniques consider all code elements as equally suspicious of being buggy while localizing bugs, but this is not always the case either. In this paper, we present a new method-level, information-retrieval-based bug localization technique called "BoostNSift". BoostNSift exploits the important information in queries by 'boost'ing that information, and then 'sift's the identified code elements, based on a novel technique that emphasizes the code elements' specific relatedness to a bug report over its generic relatedness to all bug reports. To evaluate the performance of BoostNSift, we employed a state-of-the-art empirical design that has been commonly used for evaluating file level IR-based bug localization techniques: 6851 bugs are selected from commonly used Eclipse, AspectJ, SWT, and ZXing benchmarks and made openly available for method-level analyses.

Submitted 29 August, 2021; originally announced August 2021.

arXiv:2009.12542 [pdf, other]

Taxonomy of Centralization in Public Blockchain Systems: A Systematic Literature Review

Authors: Ashish Rajendra Sai, Jim Buckley, Brian Fitzgerald, Andrew Le Gear

Abstract: Bitcoin introduced delegation of control over a monetary system from a select few to all who participate in that system. This delegation is known as the decentralization of controlling power and is a powerful security mechanism for the ecosystem. After the introduction of Bitcoin, the field of cryptocurrency has seen widespread attention from industry and academia, so much so that the original novel contribution of Bitcoin, i.e., decentralization, may be overlooked due to decentralization's assumed fundamental existence for the functioning of such cryptoassets. However, recent studies have observed a trend of increased centralization in cryptocurrencies such as Bitcoin and Ethereum. As this increased centralization has an impact on the security of the blockchain, it is crucial that it is measured, towards adequate control. This research derives an initial taxonomy of centralization present in decentralized blockchains through rigorous synthesis using a systematic literature review. This is followed by iterative refinement through expert interviews. We systematically analyzed 89 research papers published between 2009 and 2019. Our study contributes to the existing body of knowledge by highlighting the multiple definitions and measurements of centralization in the literature. We identify different aspects of centralization and propose an encompassing taxonomy of centralization concerns. This taxonomy is based on empirically observable and measurable characteristics. It consists of 13 aspects of centralization classified over six architectural layers: Governance, Network, Consensus, Incentive, Operational, and Application. We also discuss how the implications of centralization can vary depending on the aspects studied. We believe that this review and taxonomy provide a comprehensive overview of centralization in decentralized blockchains involving various conceptualizations and measures.

Submitted 26 September, 2020; originally announced September 2020.

Comments: Currently under review at ELS Information Processing and Management

arXiv:2009.11321 [pdf, other]

Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining

Authors: Ananya B. Sai, Akash Kumar Mohankumar, Siddhartha Arora, Mitesh M. Khapra

Abstract: There is an increasing focus on model-based dialog evaluation metrics such as ADEM, RUBER, and the more recent BERT-based metrics. These models aim to assign a high score to all relevant responses and a low score to all irrelevant responses. Ideally, such models should be trained using multiple relevant and irrelevant responses for any given context. However, no such data is publicly available, and hence existing models are usually trained using a single relevant response and multiple randomly selected responses from other contexts (random negatives). To allow for better training and robust evaluation of model-based metrics, we introduce the DailyDialog++ dataset, consisting of (i) five relevant responses for each context and (ii) five adversarially crafted irrelevant responses for each context. Using this dataset, we first show that even in the presence of multiple correct references, n-gram based metrics and embedding based metrics do not perform well at separating relevant responses from even random negatives. While model-based metrics perform better than n-gram and embedding based metrics on random negatives, their performance drops substantially when evaluated on adversarial examples. To check if large scale pretraining could help, we propose a new BERT-based evaluation metric called DEB, which is pretrained on 727M Reddit conversations and then finetuned on our dataset. DEB significantly outperforms existing models, showing better correlation with human judgements and better performance on random negatives (88.27% accuracy). However, its performance again drops substantially, when evaluated on adversarial responses, thereby highlighting that even large-scale pretrained evaluation models are not robust to the adversarial examples in our dataset. The dataset and code are publicly available.

Submitted 23 September, 2020; originally announced September 2020.

Comments: Accepted for publication in TACL

arXiv:2008.12009 [pdf, other]

A Survey of Evaluation Metrics Used for NLG Systems

Authors: Ananya B. Sai, Akash Kumar Mohankumar, Mitesh M. Khapra

Abstract: The success of Deep Learning has created a surge in interest in a wide range of Natural Language Generation (NLG) tasks. Deep Learning has not only pushed the state of the art in several existing NLG tasks but has also facilitated researchers to explore various newer NLG tasks such as image captioning. Such rapid progress in NLG has necessitated the development of accurate automatic evaluation metrics that would allow us to track the progress in the field of NLG. However, unlike classification tasks, automatically evaluating NLG systems in itself is a huge challenge. Several works have shown that early heuristic-based metrics such as BLEU, ROUGE are inadequate for capturing the nuances in the different NLG tasks. The expanding number of NLG models and the shortcomings of the current metrics have led to a rapid surge in the number of evaluation metrics proposed since 2014. Moreover, various evaluation metrics have shifted from using pre-determined heuristic-based formulae to trained transformer models. This rapid change in a relatively short time has led to the need for a survey of the existing NLG metrics to help existing and new researchers to quickly come up to speed with the developments that have happened in NLG evaluation in the last few years. Through this survey, we first wish to highlight the challenges and difficulties in automatically evaluating NLG systems. Then, we provide a coherent taxonomy of the evaluation metrics to organize the existing metrics and to better understand the developments in the field. We also describe the different metrics in detail and highlight their key contributions. Later, we discuss the main shortcomings identified in the existing metrics and describe the methodology used to evaluate evaluation metrics. Finally, we discuss our suggestions and recommendations on the next steps forward to improve the automatic evaluation metrics.

Submitted 5 October, 2020; v1 submitted 27 August, 2020; originally announced August 2020.

Comments: A condensed version of this paper is submitted to ACM CSUR

arXiv:2007.08222 [pdf, other]

Inheritance software metrics on smart contracts

Authors: Ashish Rajendra Sai, Conor Holmes, Jim Buckley, Andrew Le Gear

Abstract: Blockchain systems have gained substantial traction recently, partly due to the potential of decentralized immutable mediation of economic activities. Ethereum is a prominent example that has the provision for executing stateful computing scripts known as Smart Contracts. These smart contracts resemble traditional programs, but with immutability being the core differentiating factor. Given their immutability and potential high monetary value, it becomes imperative to develop high-quality smart contracts. Software metrics have traditionally been an essential tool in determining programming quality. Given the similarity between smart contracts (written in Solidity for Ethereum) and object-oriented (OO) programming, OO metrics would appear applicable. In this paper, we empirically evaluate inheritance-based metrics as applied to smart contracts. We adopt this focus because, traditionally, inheritance has been linked to a more complex codebase which we posit is not the case with Solidity based smart contracts. In this work, we evaluate the hypothesis that, due to the differences in the context of smart contracts and OO programs, it may not be appropriate to use the same interpretation of inheritance based metrics for assessment.

Submitted 16 July, 2020; originally announced July 2020.

Comments: Accepted by International Conference on Program Comprehension (ICPC 2020)

arXiv:2007.05611 [pdf, other]

Deep Contextual Clinical Prediction with Reverse Distillation

Authors: Rohan S. Kodialam, Rebecca Boiarsky, Justin Lim, Neil Dixit, Aditya Sai, David Sontag

Abstract: Healthcare providers are increasingly using machine learning to predict patient outcomes to make meaningful interventions. However, despite innovations in this area, deep learning models often struggle to match performance of shallow linear models in predicting these outcomes, making it difficult to leverage such techniques in practice. In this work, motivated by the task of clinical prediction from insurance claims, we present a new technique called Reverse Distillation which pretrains deep models by using high-performing linear models for initialization. We make use of the longitudinal structure of insurance claims datasets to develop Self Attention with Reverse Distillation, or SARD, an architecture that utilizes a combination of contextual embedding, temporal embedding and self-attention mechanisms and most critically is trained via reverse distillation. SARD outperforms state-of-the-art methods on multiple clinical prediction outcomes, with ablation studies revealing that reverse distillation is a primary driver of these improvements. Code is available at https://github.com/clinicalml/omop-learn.

Submitted 16 December, 2020; v1 submitted 10 July, 2020; originally announced July 2020.

Comments: To appear in AAAI 2021

arXiv:1904.02665 [pdf, ps, other]

Frustratingly Poor Performance of Reading Comprehension Models on Non-adversarial Examples

Authors: Soham Parikh, Ananya B. Sai, Preksha Nema, Mitesh M. Khapra

Abstract: When humans learn to perform a difficult task (say, reading comprehension (RC) over longer passages), it is typically the case that their performance improves significantly on an easier version of this task (say, RC over shorter passages). Ideally, we would want an intelligent agent to also exhibit such a behavior. However, on experimenting with state of the art RC models using the standard RACE dataset, we observe that this is not true. Specifically, we see counter-intuitive results wherein even when we show frustratingly easy examples to the model at test time, there is hardly any improvement in its performance. We refer to this as non-adversarial evaluation as opposed to adversarial evaluation. Such non-adversarial examples allow us to assess the utility of specialized neural components. For example, we show that even for easy examples where the answer is clearly embedded in the passage, the neural components designed for paying attention to relevant portions of the passage fail to serve their intended purpose. We believe that the non-adversarial dataset created as a part of this work would complement the research on adversarial evaluation and give a more realistic assessment of the ability of RC models. All the datasets and codes developed as a part of this work will be made publicly available.

Submitted 4 April, 2019; originally announced April 2019.

Comments: 8 pages

arXiv:1904.02651 [pdf, other]

ElimiNet: A Model for Eliminating Options for Reading Comprehension with Multiple Choice Questions

Authors: Soham Parikh, Ananya B. Sai, Preksha Nema, Mitesh M. Khapra

Abstract: The task of Reading Comprehension with Multiple Choice Questions requires a human (or machine) to read a given passage, question pair and select one of the n given options. The current state of the art model for this task first computes a question-aware representation for the passage and then selects the option which has the maximum similarity with this representation. However, when humans perform this task they do not just focus on option selection but use a combination of elimination and selection. Specifically, a human would first try to eliminate the most irrelevant option and then read the passage again in the light of this new information (and perhaps ignore portions corresponding to the eliminated option). This process could be repeated multiple times till the reader is finally ready to select the correct option. We propose ElimiNet, a neural network-based model which tries to mimic this process. Specifically, it has gates which decide whether an option can be eliminated given the passage, question pair and if so it tries to make the passage representation orthogonal to this eliminated option (akin to ignoring portions of the passage corresponding to the eliminated option). The model makes multiple rounds of partial elimination to refine the passage representation and finally uses a selection module to pick the best option. We evaluate our model on the recently released large scale RACE dataset and show that it outperforms the current state of the art model on 7 out of the 13 question types in this dataset. Further, we show that taking an ensemble of our elimination-selection based method with a selection based method gives us an improvement of 3.1% over the best-reported performance on this dataset.

Submitted 4 April, 2019; originally announced April 2019.

Comments: IJCAI-18

Journal ref: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (2018) Main track. Pages 4272-4278

arXiv:1902.08832 [pdf, other]

Re-evaluating ADEM: A Deeper Look at Scoring Dialogue Responses

Authors: Ananya B. Sai, Mithun Das Gupta, Mitesh M. Khapra, Mukundhan Srinivasan

Abstract: Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. ADEM (Lowe et al. 2017) formulated the automatic evaluation of dialogue systems as a learning problem and showed that such a model was able to predict responses which correlate significantly with human judgements, both at utterance and system level. Their system was shown to have beaten word-overlap metrics such as BLEU with large margins. We start with the question of whether an adversary can game the ADEM model. We design a battery of targeted attacks at the neural network based ADEM evaluation system and show that automatic evaluation of dialogue systems still has a long way to go. ADEM can get confused with a variation as simple as reversing the word order in the text! We report experiments on several such adversarial scenarios that draw out counterintuitive scores on the dialogue responses. We take a systematic look at the scoring function proposed by ADEM and connect it to linear system theory to predict the shortcomings evident in the system. We also devise an attack that can fool such a system to rate a response generation system as favorable. Finally, we allude to future research directions of using the adversarial attacks to design a truly automated dialogue evaluation system.

Submitted 23 February, 2019; originally announced February 2019.

Comments: Accepted as a long paper in the proceedings of AAAI-2019

arXiv:1808.09335 [pdf]

PhaseMAC: A 14 TOPS/W 8bit GRO based Phase Domain MAC Circuit for In-Sensor-Computed Deep Learning Accelerators

Authors: Kentaro Yoshioka, Yosuke Toyama, Koichiro Ban, Daisuke Yashima, Shigeru Maya, Akihide Sai, Kohei Onizuka

Abstract: PhaseMAC (PMAC), a phase domain Gated-Ring-Oscillator (GRO) based 8bit MAC circuit, is proposed to minimize both area and power consumption of deep learning accelerators. PMAC is composed of only digital cells and consumes significantly smaller power than standard digital designs, owing to its efficient analog accumulation nature. It occupies 26.6 times smaller area than conventional analog designs, which is competitive with digital MAC circuits. PMAC achieves a peak efficiency of 14 TOPS/W, which is the best reported and 48% higher than conventional arts. Results in anomaly detection tasks are demonstrated, which is the hottest application in the industrial IoT scene.

Submitted 23 August, 2018; originally announced August 2018.

Comments: Presented at Symp. VLSI 2018

Showing 1–19 of 19 results for author: Sai, A