Search | arXiv e-print repository

[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs

Authors: Abhinav Rao, Monojit Choudhury, Somak Aditya

Abstract: We introduce two paradoxes concerning jailbreak of foundation models: First, it is impossible to construct a perfect jailbreak classifier, and second, a weaker model cannot consistently detect whether a stronger (in a pareto-dominant sense) model is jailbroken or not. We provide formal proofs for these paradoxes and a short case study on Llama and GPT4-o to demonstrate this. We discuss broader the… ▽ More We introduce two paradoxes concerning jailbreak of foundation models: First, it is impossible to construct a perfect jailbreak classifier, and second, a weaker model cannot consistently detect whether a stronger (in a pareto-dominant sense) model is jailbroken or not. We provide formal proofs for these paradoxes and a short case study on Llama and GPT4-o to demonstrate this. We discuss broader theoretical and practical repercussions of these results. △ Less

Submitted 20 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.11661 [pdf, other]

Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting

Authors: Sagnik Mukherjee, Muhammad Farid Adilazuarda, Sunayana Sitaram, Kalika Bali, Alham Fikri Aji, Monojit Choudhury

Abstract: Socio-demographic prompting is a commonly employed approach to study cultural biases in LLMs as well as for aligning models to certain cultures. In this paper, we systematically probe four LLMs (Llama 3, Mistral v0.2, GPT-3.5 Turbo and GPT-4) with prompts that are conditioned on culturally sensitive and non-sensitive cues, on datasets that are supposed to be culturally sensitive (EtiCor and CALI)… ▽ More Socio-demographic prompting is a commonly employed approach to study cultural biases in LLMs as well as for aligning models to certain cultures. In this paper, we systematically probe four LLMs (Llama 3, Mistral v0.2, GPT-3.5 Turbo and GPT-4) with prompts that are conditioned on culturally sensitive and non-sensitive cues, on datasets that are supposed to be culturally sensitive (EtiCor and CALI) or neutral (MMLU and ETHICS). We observe that all models except GPT-4 show significant variations in their responses on both kinds of datasets for both kinds of prompts, casting doubt on the robustness of the culturally-conditioned prompting as a method for eliciting cultural bias in models or as an alignment strategy. The work also calls rethinking the control experiment design to tease apart the cultural conditioning of responses from "placebo effect", i.e., random perturbations of model responses due to arbitrary tokens in the prompt. △ Less

Submitted 20 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.00314 [pdf, other]

CASE: Efficient Curricular Data Pre-training for Building Assistive Psychology Expert Models

Authors: Sarthak Harne, Monjoy Narayan Choudhury, Madhav Rao, TK Srikanth, Seema Mehrotra, Apoorva Vashisht, Aarushi Basu, Manjit Sodhi

Abstract: The limited availability of psychologists necessitates efficient identification of individuals requiring urgent mental healthcare. This study explores the use of Natural Language Processing (NLP) pipelines to analyze text data from online mental health forums used for consultations. By analyzing forum posts, these pipelines can flag users who may require immediate professional attention. A crucial… ▽ More The limited availability of psychologists necessitates efficient identification of individuals requiring urgent mental healthcare. This study explores the use of Natural Language Processing (NLP) pipelines to analyze text data from online mental health forums used for consultations. By analyzing forum posts, these pipelines can flag users who may require immediate professional attention. A crucial challenge in this domain is data privacy and scarcity. To address this, we propose utilizing readily available curricular texts used in institutes specializing in mental health for pre-training the NLP pipelines. This helps us mimic the training process of a psychologist. Our work presents CASE-BERT that flags potential mental health disorders based on forum text. CASE-BERT demonstrates superior performance compared to existing methods, achieving an f1 score of 0.91 for Depression and 0.88 for Anxiety, two of the most commonly reported mental health disorders. Our code is publicly available. △ Less

Submitted 16 June, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

arXiv:2405.17840 [pdf, other]

Benchmarks Underestimate the Readiness of Multi-lingual Dialogue Agents

Authors: Andrew H. Lee, Sina J. Semnani, Galo Castillo-López, Gäel de Chalendar, Monojit Choudhury, Ashna Dua, Kapil Rajesh Kavitha, Sungkyun Kim, Prashant Kodali, Ponnurangam Kumaraguru, Alexis Lombard, Mehrad Moradshahi, Gihyun Park, Nasredine Semmar, Jiwon Seo, Tianhao Shen, Manish Shrivastava, Deyi Xiong, Monica S. Lam

Abstract: Creating multilingual task-oriented dialogue (TOD) agents is challenging due to the high cost of training data acquisition. Following the research trend of improving training data efficiency, we show for the first time, that in-context learning is sufficient to tackle multilingual TOD. To handle the challenging dialogue state tracking (DST) subtask, we break it down to simpler steps that are mor… ▽ More Creating multilingual task-oriented dialogue (TOD) agents is challenging due to the high cost of training data acquisition. Following the research trend of improving training data efficiency, we show for the first time, that in-context learning is sufficient to tackle multilingual TOD. To handle the challenging dialogue state tracking (DST) subtask, we break it down to simpler steps that are more compatible with in-context learning where only a handful of few-shot examples are used. We test our approach on the multilingual TOD dataset X-RiSAWOZ, which has 12 domains in Chinese, English, French, Korean, Hindi, and code-mixed Hindi-English. Our turn-by-turn DST accuracy on the 6 languages range from 55.6% to 80.3%, seemingly worse than the SOTA results from fine-tuned models that achieve from 60.7% to 82.8%; our BLEU scores in the response generation (RG) subtask are also significantly lower than SOTA. However, after manual evaluation of the validation set, we find that by correcting gold label errors and improving dataset annotation schema, GPT-4 with our prompts can achieve (1) 89.6%-96.8% accuracy in DST, and (2) more than 99% correct response generation across different languages. This leads us to conclude that current automatic metrics heavily underestimate the effectiveness of in-context learning. △ Less

Submitted 16 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.05572 [pdf, other]

From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences

Authors: Prashant Kodali, Anmol Goel, Likhith Asapu, Vamshi Krishna Bonagiri, Anirudh Govil, Monojit Choudhury, Manish Shrivastava, Ponnurangam Kumaraguru

Abstract: Current computational approaches for analysing or generating code-mixed sentences do not explicitly model "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled… ▽ More Current computational approaches for analysing or generating code-mixed sentences do not explicitly model "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, consisting of samples sourced from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, Burstines, which are used to filter/curate/compare code-mixed corpora have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, XLM-Roberta and Bernice outperform IndicBERT across different configurations in challenging data settings. Comparison with ChatGPT's zero and fewshot capabilities shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from English-Hindi to English-Telugu acceptability judgments using our model checkpoints proves superior to random baselines, enabling application to other code-mixed language pairs and providing further avenues of research. We publicly release our human-annotated dataset, trained checkpoints, code-mix corpus, and code for data generation and model training. △ Less

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.05378 [pdf, other]

"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations

Authors: Preetam Prabhu Srikar Dammu, Hayoung Jung, Anjali Singh, Monojit Choudhury, Tanushree Mitra

Abstract: Large language models (LLMs) have emerged as an integral part of modern societies, powering user-facing applications such as personal assistants and enterprise applications like recruitment tools. Despite their utility, research indicates that LLMs perpetuate systemic biases. Yet, prior works on LLM harms predominantly focus on Western concepts like race and gender, often overlooking cultural conc… ▽ More Large language models (LLMs) have emerged as an integral part of modern societies, powering user-facing applications such as personal assistants and enterprise applications like recruitment tools. Despite their utility, research indicates that LLMs perpetuate systemic biases. Yet, prior works on LLM harms predominantly focus on Western concepts like race and gender, often overlooking cultural concepts from other parts of the world. Additionally, these studies typically investigate "harm" as a singular dimension, ignoring the various and subtle forms in which harms manifest. To address this gap, we introduce the Covert Harms and Social Threats (CHAST), a set of seven metrics grounded in social science literature. We utilize evaluation models aligned with human assessments to examine the presence of covert harms in LLM-generated conversations, particularly in the context of recruitment. Our experiments reveal that seven out of the eight LLMs included in this study generated conversations riddled with CHAST, characterized by malign views expressed in seemingly neutral language unlikely to be detected by existing methods. Notably, these LLMs manifested more extreme views and opinions when dealing with non-Western concepts like caste, compared to Western ones such as race. △ Less

Submitted 8 May, 2024; originally announced May 2024.

arXiv:2404.18460 [pdf, other]

Ethical Reasoning and Moral Value Alignment of LLMs Depend on the Language we Prompt them in

Authors: Utkarsh Agarwal, Kumar Tanmay, Aditi Khandelwal, Monojit Choudhury

Abstract: Ethical reasoning is a crucial skill for Large Language Models (LLMs). However, moral values are not universal, but rather influenced by language and culture. This paper explores how three prominent LLMs -- GPT-4, ChatGPT, and Llama2-70B-Chat -- perform ethical reasoning in different languages and if their moral judgement depend on the language in which they are prompted. We extend the study of et… ▽ More Ethical reasoning is a crucial skill for Large Language Models (LLMs). However, moral values are not universal, but rather influenced by language and culture. This paper explores how three prominent LLMs -- GPT-4, ChatGPT, and Llama2-70B-Chat -- perform ethical reasoning in different languages and if their moral judgement depend on the language in which they are prompted. We extend the study of ethical reasoning of LLMs by Rao et al. (2023) to a multilingual setup following their framework of probing LLMs with ethical dilemmas and policies from three branches of normative ethics: deontology, virtue, and consequentialism. We experiment with six languages: English, Spanish, Russian, Chinese, Hindi, and Swahili. We find that GPT-4 is the most consistent and unbiased ethical reasoner across languages, while ChatGPT and Llama2-70B-Chat show significant moral value bias when we move to languages other than English. Interestingly, the nature of this bias significantly vary across languages for all LLMs, including GPT-4. △ Less

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.14548 [pdf, ps, other]

Advancing a Consent-Forward Paradigm for Digital Mental Health Data

Authors: Sachin R. Pendse, Logan Stapleton, Neha Kumar, Munmun De Choudhury, Stevie Chancellor

Abstract: The field of digital mental health is advancing at a rapid pace. Passively collected data from user engagements with digital tools and services continue to contribute new insights into mental health and illness. As the field of digital mental health grows, a concerning norm has been established -- digital service users are given little say over how their data is collected, shared, or used to gener… ▽ More The field of digital mental health is advancing at a rapid pace. Passively collected data from user engagements with digital tools and services continue to contribute new insights into mental health and illness. As the field of digital mental health grows, a concerning norm has been established -- digital service users are given little say over how their data is collected, shared, or used to generate revenue for private companies. Given a long history of service user exclusion from data collection practices, we propose an alternative approach that is attentive to this history: the consent-forward paradigm. This paradigm embeds principles of affirmative consent in the design of digital mental health tools and services, strengthening trust through designing around individual choices and needs, and proactively protecting users from unexpected harm. In this perspective, we outline practical steps to implement this paradigm, toward ensuring that people searching for care have the safest experiences possible. △ Less

Submitted 22 April, 2024; originally announced April 2024.

Comments: 15 pages with 2 tables

arXiv:2403.15412 [pdf, other]

Towards Measuring and Modeling "Culture" in LLMs: A Survey

Authors: Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Singh, Alham Fikri Aji, Jacki O'Neill, Ashutosh Modi, Monojit Choudhury

Abstract: We present a survey of more than 90 recent papers that aim to study cultural representation and inclusion in large language models (LLMs). We observe that none of the studies explicitly define "culture, which is a complex, multifaceted concept; instead, they probe the models on some specially designed datasets which represent certain aspects of "culture". We call these aspects the proxies of cultu… ▽ More We present a survey of more than 90 recent papers that aim to study cultural representation and inclusion in large language models (LLMs). We observe that none of the studies explicitly define "culture, which is a complex, multifaceted concept; instead, they probe the models on some specially designed datasets which represent certain aspects of "culture". We call these aspects the proxies of culture, and organize them across two dimensions of demographic and semantic proxies. We also categorize the probing methods employed. Our analysis indicates that only certain aspects of ``culture,'' such as values and objectives, have been studied, leaving several other interesting and important facets, especially the multitude of semantic domains (Thompson et al., 2020) and aboutness (Hershcovich et al., 2022), unexplored. Two other crucial gaps are the lack of robustness of probing techniques and situated studies on the impact of cultural mis- and under-representation in LLM-based applications. △ Less

Submitted 19 June, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

arXiv:2403.10058 [pdf, other]

RID-TWIN: An end-to-end pipeline for automatic face de-identification in videos

Authors: Anirban Mukherjee, Monjoy Narayan Choudhury, Dinesh Babu Jayagopi

Abstract: Face de-identification in videos is a challenging task in the domain of computer vision, primarily used in privacy-preserving applications. Despite the considerable progress achieved through generative vision models, there remain multiple challenges in the latest approaches. They lack a comprehensive discussion and evaluation of aspects such as realism, temporal coherence, and preservation of non-… ▽ More Face de-identification in videos is a challenging task in the domain of computer vision, primarily used in privacy-preserving applications. Despite the considerable progress achieved through generative vision models, there remain multiple challenges in the latest approaches. They lack a comprehensive discussion and evaluation of aspects such as realism, temporal coherence, and preservation of non-identifiable features. In our work, we propose RID-Twin: a novel pipeline that leverages the state-of-the-art generative models, and decouples identity from motion to perform automatic face de-identification in videos. We investigate the task from a holistic point of view and discuss how our approach addresses the pertinent existing challenges in this domain. We evaluate the performance of our methodology on the widely employed VoxCeleb2 dataset, and also a custom dataset designed to accommodate the limitations of certain behavioral variations absent in the VoxCeleb2 dataset. We discuss the implications and advantages of our work and suggest directions for future research. △ Less

Submitted 15 March, 2024; originally announced March 2024.

Comments: This work has been submitted to IEEE ICIP 2024

arXiv:2402.02135 [pdf, other]

Do Moral Judgment and Reasoning Capability of LLMs Change with Language? A Study using the Multilingual Defining Issues Test

Authors: Aditi Khandelwal, Utkarsh Agarwal, Kumar Tanmay, Monojit Choudhury

Abstract: This paper explores the moral judgment and moral reasoning abilities exhibited by Large Language Models (LLMs) across languages through the Defining Issues Test. It is a well known fact that moral judgment depends on the language in which the question is asked. We extend the work of beyond English, to 5 new languages (Chinese, Hindi, Russian, Spanish and Swahili), and probe three LLMs -- ChatGPT,… ▽ More This paper explores the moral judgment and moral reasoning abilities exhibited by Large Language Models (LLMs) across languages through the Defining Issues Test. It is a well known fact that moral judgment depends on the language in which the question is asked. We extend the work of beyond English, to 5 new languages (Chinese, Hindi, Russian, Spanish and Swahili), and probe three LLMs -- ChatGPT, GPT-4 and Llama2Chat-70B -- that shows substantial multilingual text processing and generation abilities. Our study shows that the moral reasoning ability for all models, as indicated by the post-conventional score, is substantially inferior for Hindi and Swahili, compared to Spanish, Russian, Chinese and English, while there is no clear trend for the performance of the latter four languages. The moral judgments too vary considerably by the language. △ Less

Submitted 3 February, 2024; originally announced February 2024.

Comments: Accepted to EACL 2024 (main)

arXiv:2401.14362 [pdf, ps, other]

The Ty** Cure: Experiences with Large Language Model Chatbots for Mental Health Support

Authors: Inhwa Song, Sachin R. Pendse, Neha Kumar, Munmun De Choudhury

Abstract: People experiencing severe distress increasingly use Large Language Model (LLM) chatbots as mental health support tools. Discussions on social media have described how engagements were lifesaving for some, but evidence suggests that general-purpose LLM chatbots also have notable risks that could endanger the welfare of users if not designed responsibly. In this study, we investigate the lived expe… ▽ More People experiencing severe distress increasingly use Large Language Model (LLM) chatbots as mental health support tools. Discussions on social media have described how engagements were lifesaving for some, but evidence suggests that general-purpose LLM chatbots also have notable risks that could endanger the welfare of users if not designed responsibly. In this study, we investigate the lived experiences of people who have used LLM chatbots for mental health support. We build on interviews with 21 individuals from globally diverse backgrounds to analyze how users create unique support roles for their chatbots, fill in gaps in everyday care, and navigate associated cultural limitations when seeking support from chatbots. We ground our analysis in psychotherapy literature around effective support, and introduce the concept of therapeutic alignment, or aligning AI with therapeutic values for mental health contexts. Our study offers recommendations for how designers can approach the ethical and effective use of LLM chatbots and other AI mental health support tools in mental health care. △ Less

Submitted 6 March, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: The first two authors contributed equally to this work; typos corrected

arXiv:2312.08800 [pdf, other]

Evaluating Large Language Models for Health-related Queries with Presuppositions

Authors: Navreet Kaur, Monojit Choudhury, Danish Pruthi

Abstract: As corporations rush to integrate large language models (LLMs) to their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presuppositions. Using UPHILL, we evaluate the factual accuracy and consisten… ▽ More As corporations rush to integrate large language models (LLMs) to their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presuppositions. Using UPHILL, we evaluate the factual accuracy and consistency of InstructGPT, ChatGPT, and BingChat models. We find that while model responses rarely disagree with true health claims (posed as questions), they often fail to challenge false claims: responses from InstructGPT agree with 32% of the false claims, ChatGPT 26% and BingChat 23%. As we increase the extent of presupposition in input queries, the responses from InstructGPT and ChatGPT agree with the claim considerably more often, regardless of its veracity. Responses from BingChat, which rely on retrieved webpages, are not as susceptible. Given the moderate factual accuracy, and the inability of models to consistently correct false assumptions, our work calls for a careful assessment of current LLMs for use in high-stakes scenarios. △ Less

Submitted 6 June, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: Findings of ACL 2024

arXiv:2311.14693 [pdf, other]

Benefits and Harms of Large Language Models in Digital Mental Health

Authors: Munmun De Choudhury, Sachin R. Pendse, Neha Kumar

Abstract: The past decade has been transformative for mental health research and practice. The ability to harness large repositories of data, whether from electronic health records (EHR), mobile devices, or social media, has revealed a potential for valuable insights into patient experiences, promising early, proactive interventions, as well as personalized treatment plans. Recent developments in generative… ▽ More The past decade has been transformative for mental health research and practice. The ability to harness large repositories of data, whether from electronic health records (EHR), mobile devices, or social media, has revealed a potential for valuable insights into patient experiences, promising early, proactive interventions, as well as personalized treatment plans. Recent developments in generative artificial intelligence, particularly large language models (LLMs), show promise in leading digital mental health to uncharted territory. Patients are arriving at doctors' appointments with information sourced from chatbots, state-of-the-art LLMs are being incorporated in medical software and EHR systems, and chatbots from an ever-increasing number of startups promise to serve as AI companions, friends, and partners. This article presents contemporary perspectives on the opportunities and risks posed by LLMs in the design, development, and implementation of digital mental health tools. We adopt an ecological framework and draw on the affordances offered by LLMs to discuss four application areas -- care-seeking behaviors from individuals in need of care, community care provision, institutional and medical care provision, and larger care ecologies at the societal level. We engage in a thoughtful consideration of whether and how LLM-based technologies could or should be employed for enhancing mental health. The benefits and harms our article surfaces could serve to help shape future research, advocacy, and regulatory efforts focused on creating more responsible, user-friendly, equitable, and secure LLM-based tools for mental health treatment and intervention. △ Less

Submitted 7 November, 2023; originally announced November 2023.

arXiv:2311.04262 [pdf, other]

ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations

Authors: Muntabir Hasan Choudhury, Lamia Salsabil, William A. Ingram, Edward A. Fox, Jian Wu

Abstract: Electronic theses and dissertations (ETDs) have been proposed, advocated, and generated for more than 25 years. Although ETDs are hosted by commercial or institutional digital library repositories, they are still an understudied type of scholarly big data, partially because they are usually longer than conference proceedings and journals. Segmenting ETDs will allow researchers to study sectional c… ▽ More Electronic theses and dissertations (ETDs) have been proposed, advocated, and generated for more than 25 years. Although ETDs are hosted by commercial or institutional digital library repositories, they are still an understudied type of scholarly big data, partially because they are usually longer than conference proceedings and journals. Segmenting ETDs will allow researchers to study sectional content. Readers can navigate to particular pages of interest, discover, and explore the content buried in these long documents. Most existing frameworks on document page classification are designed for classifying general documents and perform poorly on ETDs. In this paper, we propose ETDPC. Its backbone is a two-stream multimodal model with a cross-attention network to classify ETD pages into 13 categories. To overcome the challenge of imbalanced labeled samples, we augmented data for minority categories and employed a hierarchical classifier. ETDPC outperforms the state-of-the-art models in all categories, achieving an F1 of 0.84 -- 0.96 for 9 out of 13 categories. We also demonstrated its data efficiency. The code and data can be found on GitHub (https://github.com/lamps-lab/ETDMiner/tree/master/etd_segmentation). △ Less

Submitted 7 November, 2023; originally announced November 2023.

Comments: 10 pages, 3 figures, accepted to Innovative Applications of Artificial Intelligence (IAAI-24)

arXiv:2310.13132 [pdf, other]

Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries

Authors: Yiqiao **, Mohit Chandra, Gaurav Verma, Yibo Hu, Munmun De Choudhury, Srijan Kumar

Abstract: Large language models (LLMs) are transforming the ways the general public accesses and consumes information. Their influence is particularly pronounced in pivotal sectors like healthcare, where lay individuals are increasingly appropriating LLMs as conversational agents for everyday queries. While LLMs demonstrate impressive language understanding and generation proficiencies, concerns regarding t… ▽ More Large language models (LLMs) are transforming the ways the general public accesses and consumes information. Their influence is particularly pronounced in pivotal sectors like healthcare, where lay individuals are increasingly appropriating LLMs as conversational agents for everyday queries. While LLMs demonstrate impressive language understanding and generation proficiencies, concerns regarding their safety remain paramount in these high-stake domains. Moreover, the development of LLMs is disproportionately focused on English. It remains unclear how these LLMs perform in the context of non-English languages, a gap that is critical for ensuring equity in the real-world use of these systems.This paper provides a framework to investigate the effectiveness of LLMs as multi-lingual dialogue systems for healthcare queries. Our empirically-derived framework XlingEval focuses on three fundamental criteria for evaluating LLM responses to naturalistic human-authored health-related questions: correctness, consistency, and verifiability. Through extensive experiments on four major global languages, including English, Spanish, Chinese, and Hindi, spanning three expert-annotated large health Q&A datasets, and through an amalgamation of algorithmic and human-evaluation strategies, we found a pronounced disparity in LLM responses across these languages, indicating a need for enhanced cross-lingual capabilities. We further propose XlingHealth, a cross-lingual benchmark for examining the multilingual capabilities of LLMs in the healthcare context. Our findings underscore the pressing need to bolster the cross-lingual capacities of these models, and to provide an equitable information ecosystem accessible to all. △ Less

Submitted 23 October, 2023; v1 submitted 19 October, 2023; originally announced October 2023.

Comments: 18 pages, 7 figures

arXiv:2310.10928 [pdf, ps, other]

Using Audio Data to Facilitate Depression Risk Assessment in Primary Health Care

Authors: Adam Valen Levinson, Abhay Goyal, Roger Ho Chun Man, Roy Ka-Wei Lee, Koustuv Saha, Nimay Parekh, Frederick L. Altice, Lam Yin Cheung, Munmun De Choudhury, Navin Kumar

Abstract: Telehealth is a valuable tool for primary health care (PHC), where depression is a common condition. PHC is the first point of contact for most people with depression, but about 25% of diagnoses made by PHC physicians are inaccurate. Many other barriers also hinder depression detection and treatment in PHC. Artificial intelligence (AI) may help reduce depression misdiagnosis in PHC and improve ove… ▽ More Telehealth is a valuable tool for primary health care (PHC), where depression is a common condition. PHC is the first point of contact for most people with depression, but about 25% of diagnoses made by PHC physicians are inaccurate. Many other barriers also hinder depression detection and treatment in PHC. Artificial intelligence (AI) may help reduce depression misdiagnosis in PHC and improve overall diagnosis and treatment outcomes. Telehealth consultations often have video issues, such as poor connectivity or dropped calls. Audio-only telehealth is often more practical for lower-income patients who may lack stable internet connections. Thus, our study focused on using audio data to predict depression risk. The objectives were to: 1) Collect audio data from 24 people (12 with depression and 12 without mental health or major health condition diagnoses); 2) Build a machine learning model to predict depression risk. TPOT, an autoML tool, was used to select the best machine learning algorithm, which was the K-nearest neighbors classifier. The selected model had high performance in classifying depression risk (Precision: 0.98, Recall: 0.93, F1-Score: 0.96). These findings may lead to a range of tools to help screen for and treat depression. By develo** tools to detect depression risk, patients can be routed to AI-driven chatbots for initial screenings. Partnerships with a range of stakeholders are crucial to implementing these solutions. Moreover, ethical considerations, especially around data privacy and potential biases in AI models, need to be at the forefront of any AI-driven intervention in mental health care. △ Less

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.08483 [pdf, other]

Understanding the Humans Behind Online Misinformation: An Observational Study Through the Lens of the COVID-19 Pandemic

Authors: Mohit Chandra, Anush Mattapalli, Munmun De Choudhury

Abstract: The proliferation of online misinformation has emerged as one of the biggest threats to society. Considerable efforts have focused on building misinformation detection models, still the perils of misinformation remain abound. Mitigating online misinformation and its ramifications requires a holistic approach that encompasses not only an understanding of its intricate landscape in relation to the c… ▽ More The proliferation of online misinformation has emerged as one of the biggest threats to society. Considerable efforts have focused on building misinformation detection models, still the perils of misinformation remain abound. Mitigating online misinformation and its ramifications requires a holistic approach that encompasses not only an understanding of its intricate landscape in relation to the complex issue and topic-rich information ecosystem online, but also the psychological drivers of individuals behind it. Adopting a time series analytic technique and robust causal inference-based design, we conduct a large-scale observational study analyzing over 32 million COVID-19 tweets and 16 million historical timeline tweets. We focus on understanding the behavior and psychology of users disseminating misinformation during COVID-19 and its relationship with the historical inclinations towards sharing misinformation on Non-COVID domains before the pandemic. Our analysis underscores the intricacies inherent to cross-domain misinformation, and highlights that users' historical inclination toward sharing misinformation is positively associated with their present behavior pertaining to misinformation sharing on emergent topics and beyond. This work may serve as a valuable foundation for designing user-centric inoculation strategies and ecologically-grounded agile interventions for effectively tackling online misinformation. △ Less

Submitted 18 January, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

arXiv:2310.07251 [pdf, other]

Ethical Reasoning over Moral Alignment: A Case and Framework for In-Context Ethical Policies in LLMs

Authors: Abhinav Rao, Aditi Khandelwal, Kumar Tanmay, Utkarsh Agarwal, Monojit Choudhury

Abstract: In this position paper, we argue that instead of morally aligning LLMs to specific set of ethical principles, we should infuse generic ethical reasoning capabilities into them so that they can handle value pluralism at a global scale. When provided with an ethical policy, an LLM should be capable of making decisions that are ethically consistent to the policy. We develop a framework that integrate… ▽ More In this position paper, we argue that instead of morally aligning LLMs to specific set of ethical principles, we should infuse generic ethical reasoning capabilities into them so that they can handle value pluralism at a global scale. When provided with an ethical policy, an LLM should be capable of making decisions that are ethically consistent to the policy. We develop a framework that integrates moral dilemmas with moral principles pertaining to different foramlisms of normative ethics, and at different levels of abstractions. Initial experiments with GPT-x models shows that while GPT-4 is a nearly perfect ethical reasoner, the models still have bias towards the moral values of Western and English speaking societies. △ Less

Submitted 11 October, 2023; originally announced October 2023.

arXiv:2309.13356 [pdf, other]

Probing the Moral Development of Large Language Models through Defining Issues Test

Authors: Kumar Tanmay, Aditi Khandelwal, Utkarsh Agarwal, Monojit Choudhury

Abstract: In this study, we measure the moral reasoning ability of LLMs using the Defining Issues Test - a psychometric instrument developed for measuring the moral development stage of a person according to the Kohlberg's Cognitive Moral Development Model. DIT uses moral dilemmas followed by a set of ethical considerations that the respondent has to judge for importance in resolving the dilemma, and then r… ▽ More In this study, we measure the moral reasoning ability of LLMs using the Defining Issues Test - a psychometric instrument developed for measuring the moral development stage of a person according to the Kohlberg's Cognitive Moral Development Model. DIT uses moral dilemmas followed by a set of ethical considerations that the respondent has to judge for importance in resolving the dilemma, and then rank-order them by importance. A moral development stage score of the respondent is then computed based on the relevance rating and ranking. Our study shows that early LLMs such as GPT-3 exhibit a moral reasoning ability no better than that of a random baseline, while ChatGPT, Llama2-Chat, PaLM-2 and GPT-4 show significantly better performance on this task, comparable to adult humans. GPT-4, in fact, has the highest post-conventional moral reasoning score, equivalent to that of typical graduate school students. However, we also observe that the models do not perform consistently across all dilemmas, pointing to important gaps in their understanding and reasoning abilities. △ Less

Submitted 7 October, 2023; v1 submitted 23 September, 2023; originally announced September 2023.

Comments: First three authors contributed equally

arXiv:2309.07462 [pdf, other]

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Authors: Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram

Abstract: Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top $20$, remains inadequate due to existing benchmarks and metrics limitations. Employing LLMs as evaluators to rank or score other models' outputs emerges as a viable solution, addressing the constraints tied to human annotators and established benchma… ▽ More Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top $20$, remains inadequate due to existing benchmarks and metrics limitations. Employing LLMs as evaluators to rank or score other models' outputs emerges as a viable solution, addressing the constraints tied to human annotators and established benchmarks. In this study, we explore the potential of LLM-based evaluators, specifically GPT-4 in enhancing multilingual evaluation by calibrating them against $20$K human judgments across three text-generation tasks, five metrics, and eight languages. Our analysis reveals a bias in GPT4-based evaluators towards higher scores, underscoring the necessity of calibration with native speaker judgments, especially in low-resource and non-Latin script languages, to ensure accurate evaluation of LLM performance across diverse languages. △ Less

Submitted 13 February, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: Accepted to EACL 2024 findings

arXiv:2307.12402 [pdf, ps, other]

ChatGPT and Bard Responses to Polarizing Questions

Authors: Abhay Goyal, Muhammad Siddique, Nimay Parekh, Zach Schwitzky, Clara Broekaert, Connor Michelotti, Allie Wong, Lam Yin Cheung, Robin O Hanlon, Lam Yin Cheung, Munmun De Choudhury, Roy Ka-Wei Lee, Navin Kumar

Abstract: Recent developments in natural language processing have demonstrated the potential of large language models (LLMs) to improve a range of educational and learning outcomes. Of recent chatbots based on LLMs, ChatGPT and Bard have made it clear that artificial intelligence (AI) technology will have significant implications on the way we obtain and search for information. However, these tools sometime… ▽ More Recent developments in natural language processing have demonstrated the potential of large language models (LLMs) to improve a range of educational and learning outcomes. Of recent chatbots based on LLMs, ChatGPT and Bard have made it clear that artificial intelligence (AI) technology will have significant implications on the way we obtain and search for information. However, these tools sometimes produce text that is convincing, but often incorrect, known as hallucinations. As such, their use can distort scientific facts and spread misinformation. To counter polarizing responses on these tools, it is critical to provide an overview of such responses so stakeholders can determine which topics tend to produce more contentious responses -- key to develo** targeted regulatory policy and interventions. In addition, there currently exists no annotated dataset of ChatGPT and Bard responses around possibly polarizing topics, central to the above aims. We address the indicated issues through the following contribution: Focusing on highly polarizing topics in the US, we created and described a dataset of ChatGPT and Bard responses. Broadly, our results indicated a left-leaning bias for both ChatGPT and Bard, with Bard more likely to provide responses around polarizing topics. Bard seemed to have fewer guardrails around controversial topics, and appeared more willing to provide comprehensive, and somewhat human-like responses. Bard may thus be more likely abused by malicious actors. Stakeholders may utilize our findings to mitigate misinformative and/or polarizing responses from LLMs △ Less

Submitted 13 July, 2023; originally announced July 2023.

arXiv:2306.17674 [pdf, other]

X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents

Authors: Mehrad Moradshahi, Tianhao Shen, Kalika Bali, Monojit Choudhury, Gaël de Chalendar, Anmol Goel, Sungkyun Kim, Prashant Kodali, Ponnurangam Kumaraguru, Nasredine Semmar, Sina J. Semnani, Jiwon Seo, Vivek Seshadri, Manish Shrivastava, Michael Sun, Aditya Yadavalli, Chaobin You, Deyi Xiong, Monica S. Lam

Abstract: Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-H… ▽ More Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most multilingual prior work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with dictionary-based methods, along with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source. △ Less

Submitted 30 June, 2023; originally announced June 2023.

Comments: Accepted by ACL 2023 Findings

arXiv:2306.03316 [pdf, other]

CoSiNES: Contrastive Siamese Network for Entity Standardization

Authors: Jiaqing Yuan, Michele Merler, Mihir Choudhury, Raju Pavuluri, Munindar P. Singh, Maja Vukovic

Abstract: Entity standardization maps noisy mentions from free-form text to standard entities in a knowledge base. The unique challenge of this task relative to other entity-related tasks is the lack of surrounding context and numerous variations in the surface form of the mentions, especially when it comes to generalization across domains where labeled data is scarce. Previous research mostly focuses on de… ▽ More Entity standardization maps noisy mentions from free-form text to standard entities in a knowledge base. The unique challenge of this task relative to other entity-related tasks is the lack of surrounding context and numerous variations in the surface form of the mentions, especially when it comes to generalization across domains where labeled data is scarce. Previous research mostly focuses on develo** models either heavily relying on context, or dedicated solely to a specific domain. In contrast, we propose CoSiNES, a generic and adaptable framework with Contrastive Siamese Network for Entity Standardization that effectively adapts a pretrained language model to capture the syntax and semantics of the entities in a new domain. We construct a new dataset in the technology domain, which contains 640 technical stack entities and 6,412 mentions collected from industrial content management systems. We demonstrate that CoSiNES yields higher accuracy and faster runtime than baselines derived from leading methods in this domain. CoSiNES also achieves competitive performance in four standard datasets from the chemistry, medicine, and biomedical domains, demonstrating its cross-domain applicability. △ Less

Submitted 5 June, 2023; originally announced June 2023.

Comments: Accepted by Matching Workshop at ACL2023

arXiv:2305.14965 [pdf, other]

Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks

Authors: Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, Monojit Choudhury

Abstract: Recent explorations with commercial Large Language Models (LLMs) have shown that non-expert users can jailbreak LLMs by simply manipulating their prompts; resulting in degenerate output behavior, privacy and security breaches, offensive outputs, and violations of content regulator policies. Limited studies have been conducted to formalize and analyze these attacks and their mitigations. We bridge… ▽ More Recent explorations with commercial Large Language Models (LLMs) have shown that non-expert users can jailbreak LLMs by simply manipulating their prompts; resulting in degenerate output behavior, privacy and security breaches, offensive outputs, and violations of content regulator policies. Limited studies have been conducted to formalize and analyze these attacks and their mitigations. We bridge this gap by proposing a formalism and a taxonomy of known (and possible) jailbreaks. We survey existing jailbreak methods and their effectiveness on open-source and commercial LLMs (such as GPT-based models, OPT, BLOOM, and FLAN-T5-XXL). We further discuss the challenges of jailbreak detection in terms of their effectiveness against known attacks. For further analysis, we release a dataset of model outputs across 3700 jailbreak prompts over 4 tasks. △ Less

Submitted 27 March, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted at LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation

arXiv:2305.14288 [pdf, other]

LLM-powered Data Augmentation for Enhanced Cross-lingual Performance

Authors: Chenxi Whitehouse, Monojit Choudhury, Alham Fikri Aji

Abstract: This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in multilingual commonsense reasoning datasets where the available training data is extremely limited. To achieve this, we utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze. Subsequently, we evaluate the effectiveness… ▽ More This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in multilingual commonsense reasoning datasets where the available training data is extremely limited. To achieve this, we utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze. Subsequently, we evaluate the effectiveness of fine-tuning smaller multilingual models, mBERT and XLMR, using the synthesised data. We compare the performance of training with data generated in English and target languages, as well as translated English-generated data, revealing the overall advantages of incorporating data generated by LLMs, e.g. a notable 13.4 accuracy score improvement for the best case. Furthermore, we conduct a human evaluation by asking native speakers to assess the naturalness and logical coherence of the generated examples across different languages. The results of the evaluation indicate that LLMs such as ChatGPT and GPT-4 excel at producing natural and coherent text in most languages, however, they struggle to generate meaningful text in certain languages like Tamil. We also observe that ChatGPT falls short in generating plausible alternatives compared to the original dataset, whereas examples from GPT-4 exhibit competitive logical consistency. △ Less

Submitted 22 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: EMNLP 2023 Main Conference

arXiv:2305.14218 [pdf, other]

DUBLIN -- Document Understanding By Language-Image Network

Authors: Kriti Aggarwal, Aditi Khandelwal, Kumar Tanmay, Owais Mohammed Khan, Qiang Liu, Monojit Choudhury, Hardik Hansrajbhai Chauhan, Subhojit Som, Vishrav Chaudhary, Saurabh Tiwary

Abstract: Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images. Existing models often rely on manual feature engineering or domain-specific pipelines, which limit their generalization ability across different document types and languages. In this paper, we propose DUBLIN, which is pretrained on web pages using three novel objectives… ▽ More Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images. Existing models often rely on manual feature engineering or domain-specific pipelines, which limit their generalization ability across different document types and languages. In this paper, we propose DUBLIN, which is pretrained on web pages using three novel objectives: Masked Document Text Generation Task, Bounding Box Task, and Rendered Question Answering Task, that leverage both the spatial and semantic information in the document images. Our model achieves competitive or state-of-the-art results on several benchmarks, such as Web-Based Structural Reading Comprehension, Document Visual Question Answering, Key Information Extraction, Diagram Understanding, and Table Question Answering. In particular, we show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset. We also show that our model outperforms the current pixel-based SOTA models on DocVQA, InfographicsVQA, OCR-VQA and AI2D datasets by 4.6%, 6.5%, 2.6% and 21%, respectively. We also achieve competitive performance on RVL-CDIP document classification. Moreover, we create new baselines for text-based datasets by rendering them as document images to promote research in this direction. △ Less

Submitted 27 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

ACM Class: F.2.2; I.2.7

arXiv:2304.10439 [pdf, other]

A Study on Reproducibility and Replicability of Table Structure Recognition Methods

Authors: Kehinde Ajayi, Muntabir Hasan Choudhury, Sarah Rajtmajer, Jian Wu

Abstract: Concerns about reproducibility in artificial intelligence (AI) have emerged, as researchers have reported unsuccessful attempts to directly reproduce published findings in the field. Replicability, the ability to affirm a finding using the same procedures on new data, has not been well studied. In this paper, we examine both reproducibility and replicability of a corpus of 16 papers on table struc… ▽ More Concerns about reproducibility in artificial intelligence (AI) have emerged, as researchers have reported unsuccessful attempts to directly reproduce published findings in the field. Replicability, the ability to affirm a finding using the same procedures on new data, has not been well studied. In this paper, we examine both reproducibility and replicability of a corpus of 16 papers on table structure recognition (TSR), an AI task aimed at identifying cell locations of tables in digital documents. We attempt to reproduce published results using codes and datasets provided by the original authors. We then examine replicability using a dataset similar to the original as well as a new dataset, GenTSR, consisting of 386 annotated tables extracted from scientific papers. Out of 16 papers studied, we reproduce results consistent with the original in only four. Two of the four papers are identified as replicable using the similar dataset under certain IoU values. No paper is identified as replicable using the new dataset. We offer observations on the causes of irreproducibility and irreplicability. All code and data are available on Codeocean at https://codeocean.com/capsule/6680116/tree. △ Less

Submitted 20 April, 2023; originally announced April 2023.

Comments: 10 pages, 5 figures

arXiv:2304.07417 [pdf, other]

Understanding and Mitigating Mental Health Misinformation on Video Sharing Platforms

Authors: Viet Cuong Nguyen, Michael Birnbaum, Munmun De Choudhury

Abstract: Despite the ever-strong demand for mental health care globally, access to traditional mental health services remains severely limited expensive, and stifled by stigma and systemic barriers. Thus, over the last few years, young people are increasingly turning to content on video-sharing platforms (VSPs) like TikTok and YouTube to help them navigate their mental health journey. However, navigating t… ▽ More Despite the ever-strong demand for mental health care globally, access to traditional mental health services remains severely limited expensive, and stifled by stigma and systemic barriers. Thus, over the last few years, young people are increasingly turning to content on video-sharing platforms (VSPs) like TikTok and YouTube to help them navigate their mental health journey. However, navigating towards trustworthy information relating to mental health on these platforms is challenging, given the uncontrollable and unregulated growth of dedicated mental health content and content creators catering to a wide array of mental health conditions on these platforms. In this paper, we attempt to define what constitutes as "mental health misinformation" through examples. In addition, we also suggest some open questions to answer and challenges to tackle regarding this important and timely research topic △ Less

Submitted 14 April, 2023; originally announced April 2023.

Comments: 5 pages, 1 figure

arXiv:2303.17661 [pdf, other]

MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries

Authors: Muntabir Hasan Choudhury, Lamia Salsabil, Himarsha R. Jayanetti, Jian Wu, William A. Ingram, Edward A. Fox

Abstract: Metadata quality is crucial for digital objects to be discovered through digital library interfaces. However, due to various reasons, the metadata of digital objects often exhibits incomplete, inconsistent, and incorrect values. We investigate methods to automatically detect, correct, and canonicalize scholarly metadata, using seven key fields of electronic theses and dissertations (ETDs) as a cas… ▽ More Metadata quality is crucial for digital objects to be discovered through digital library interfaces. However, due to various reasons, the metadata of digital objects often exhibits incomplete, inconsistent, and incorrect values. We investigate methods to automatically detect, correct, and canonicalize scholarly metadata, using seven key fields of electronic theses and dissertations (ETDs) as a case study. We propose MetaEnhance, a framework that utilizes state-of-the-art artificial intelligence methods to improve the quality of these fields. To evaluate MetaEnhance, we compiled a metadata quality evaluation benchmark containing 500 ETDs, by combining subsets sampled using multiple criteria. We tested MetaEnhance on this benchmark and found that the proposed methods achieved nearly perfect F1-scores in detecting errors and F1-scores in correcting errors ranging from 0.85 to 1.00 for five of seven fields. △ Less

Submitted 30 March, 2023; originally announced March 2023.

Comments: 7 pages, 3 tables, and 1 figure. Accepted by 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL '23) as a short paper

arXiv:2303.02357 [pdf, other]

DiTTO: A Feature Representation Imitation Approach for Improving Cross-Lingual Transfer

Authors: Shanu Kumar, Abbaraju Soujanya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

Abstract: Zero-shot cross-lingual transfer is promising, however has been shown to be sub-optimal, with inferior transfer performance across low-resource languages. In this work, we envision languages as domains for improving zero-shot transfer by jointly reducing the feature incongruity between the source and the target language and increasing the generalization capabilities of pre-trained multilingual tra… ▽ More Zero-shot cross-lingual transfer is promising, however has been shown to be sub-optimal, with inferior transfer performance across low-resource languages. In this work, we envision languages as domains for improving zero-shot transfer by jointly reducing the feature incongruity between the source and the target language and increasing the generalization capabilities of pre-trained multilingual transformers. We show that our approach, DiTTO, significantly outperforms the standard zero-shot fine-tuning method on multiple datasets across all languages using solely unlabeled instances in the target language. Empirical results show that jointly reducing feature incongruity for multiple target languages is vital for successful cross-lingual transfer. Moreover, our model enables better cross-lingual transfer than standard fine-tuning methods, even in the few-shot setting. △ Less

Submitted 4 March, 2023; originally announced March 2023.

Comments: Accepted at EACL 2023

arXiv:2302.12578 [pdf, other]

Fairness in Language Models Beyond English: Gaps and Challenges

Authors: Krithika Ramesh, Sunayana Sitaram, Monojit Choudhury

Abstract: With language models becoming increasingly ubiquitous, it has become essential to address their inequitable treatment of diverse demographic groups and factors. Most research on evaluating and mitigating fairness harms has been concentrated on English, while multilingual models and non-English languages have received comparatively little attention. This paper presents a survey of fairness in multi… ▽ More With language models becoming increasingly ubiquitous, it has become essential to address their inequitable treatment of diverse demographic groups and factors. Most research on evaluating and mitigating fairness harms has been concentrated on English, while multilingual models and non-English languages have received comparatively little attention. This paper presents a survey of fairness in multilingual and non-English contexts, highlighting the shortcomings of current research and the difficulties faced by methods designed for English. We contend that the multitude of diverse cultures and languages across the world makes it infeasible to achieve comprehensive coverage in terms of constructing fairness datasets. Thus, the measurement and mitigation of biases must evolve beyond the current dataset-driven practices that are narrowly focused on specific dimensions and types of biases and, therefore, impossible to scale across languages and cultures. △ Less

Submitted 28 February, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

Comments: Accepted to EACL 2023 (Findings)

arXiv:2302.12388 [pdf, other]

TrafFormer: A Transformer Model for Predicting Long-term Traffic

Authors: David Alexander Tedjopurnomo, Farhana M. Choudhury, A. K. Qin

Abstract: Traffic prediction is a flourishing research field due to its importance in human mobility in the urban space. Despite this, existing studies only focus on short-term prediction of up to few hours in advance, with most being up to one hour only. Long-term traffic prediction can enable more comprehensive, informed, and proactive measures against traffic congestion and is therefore an important task… ▽ More Traffic prediction is a flourishing research field due to its importance in human mobility in the urban space. Despite this, existing studies only focus on short-term prediction of up to few hours in advance, with most being up to one hour only. Long-term traffic prediction can enable more comprehensive, informed, and proactive measures against traffic congestion and is therefore an important task to explore. In this paper, we explore the task of long-term traffic prediction; where we predict traffic up to 24 hours in advance. We note the weaknesses of existing models--which are based on recurrent structures--for long-term traffic prediction and propose a modified Transformer model "TrafFormer". Experiments comparing our model with existing hybrid neural network models show the superiority of our model. △ Less

Submitted 2 March, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

Comments: 14 pages, 6 figures

MSC Class: ACM-class: I.2.1

arXiv:2302.00799 [pdf, other]

doi 10.1145/3579467

Charting the Sociotechnical Gap in Explainable AI: A Framework to Address the Gap in XAI

Authors: Upol Ehsan, Koustuv Saha, Munmun De Choudhury, Mark O. Riedl

Abstract: Explainable AI (XAI) systems are sociotechnical in nature; thus, they are subject to the sociotechnical gap--divide between the technical affordances and the social needs. However, charting this gap is challenging. In the context of XAI, we argue that charting the gap improves our problem understanding, which can reflexively provide actionable insights to improve explainability. Utilizing two case… ▽ More Explainable AI (XAI) systems are sociotechnical in nature; thus, they are subject to the sociotechnical gap--divide between the technical affordances and the social needs. However, charting this gap is challenging. In the context of XAI, we argue that charting the gap improves our problem understanding, which can reflexively provide actionable insights to improve explainability. Utilizing two case studies in distinct domains, we empirically derive a framework that facilitates systematic charting of the sociotechnical gap by connecting AI guidelines in the context of XAI and elucidating how to use them to address the gap. We apply the framework to a third case in a new domain, showcasing its affordances. Finally, we discuss conceptual implications of the framework, share practical considerations in its operationalization, and offer guidance on transferring it to new contexts. By making conceptual and practical contributions to understanding the sociotechnical gap in XAI, the framework expands the XAI design space. △ Less

Submitted 1 February, 2023; originally announced February 2023.

Comments: Published at ACM CSCW 2023

Journal ref: ACM CSCW 2023

arXiv:2210.15184 [pdf, other]

Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Develo** Lightweight Low-Resource MT Models

Authors: Harshita Diddee, Sandipan Dandapat, Monojit Choudhury, Tanuja Ganu, Kalika Bali

Abstract: Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages. However, this performance comes at the cost of significantly bloated models which are not practically deployable. Knowledge Distillation is one popular technique to develop competitive, lightweight models: In this w… ▽ More Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages. However, this performance comes at the cost of significantly bloated models which are not practically deployable. Knowledge Distillation is one popular technique to develop competitive, lightweight models: In this work, we first evaluate its use to compress MT models focusing on languages with extremely limited training data. Through our analysis across 8 languages, we find that the variance in the performance of the distilled models due to their dependence on priors including the amount of synthetic data used for distillation, the student architecture, training hyperparameters and confidence of the teacher models, makes distillation a brittle compression mechanism. To mitigate this, we explore the use of post-training quantization for the compression of these models. Here, we find that while distillation provides gains across some low-resource languages, quantization provides more consistent performance trends for the entire range of languages, especially the lowest-resource languages in our target set. △ Less

Submitted 9 November, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: 16 Pages, 7 Figures, Accepted to WMT 2022 (Research Track)

arXiv:2210.12265 [pdf, other]

On the Calibration of Massively Multilingual Language Models

Authors: Kabir Ahuja, Sunayana Sitaram, Sandipan Dandapat, Monojit Choudhury

Abstract: Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer. While there has been much work in evaluating these models for their performance on a variety of tasks and languages, little attention has been paid on how well calibrated these models are with respect to the confidence in their predictions. We first invest… ▽ More Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer. While there has been much work in evaluating these models for their performance on a variety of tasks and languages, little attention has been paid on how well calibrated these models are with respect to the confidence in their predictions. We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages or those which are typologically diverse from English. Next, we empirically show that calibration methods like temperature scaling and label smoothing do reasonably well towards improving calibration in the zero-shot scenario. We also find that few-shot examples in the language can further help reduce the calibration errors, often substantially. Overall, our work contributes towards building more reliable multilingual models by highlighting the issue of their miscalibration, understanding what language and model specific factors influence it, and pointing out the strategies to improve the same. △ Less

Submitted 21 October, 2022; originally announced October 2022.

Comments: EMNLP 2022

arXiv:2208.14641 [pdf, other]

Generating Intermediate Steps for NLI with Next-Step Supervision

Authors: Deepanway Ghosal, Somak Aditya, Monojit Choudhury

Abstract: The Natural Language Inference (NLI) task often requires reasoning over multiple steps to reach the conclusion. While the necessity of generating such intermediate steps (instead of a summary explanation) has gained popular support, it is unclear how to generate such steps without complete end-to-end supervision and how such generated steps can be further utilized. In this work, we train a sequenc… ▽ More The Natural Language Inference (NLI) task often requires reasoning over multiple steps to reach the conclusion. While the necessity of generating such intermediate steps (instead of a summary explanation) has gained popular support, it is unclear how to generate such steps without complete end-to-end supervision and how such generated steps can be further utilized. In this work, we train a sequence-to-sequence model to generate only the next step given an NLI premise and hypothesis pair (and previous steps); then enhance it with external knowledge and symbolic search to generate intermediate steps with only next-step supervision. We show the correctness of such generated steps through automated and human verification. Furthermore, we show that such generated steps can help improve end-to-end NLI task performance using simple data augmentation strategies, across multiple public NLI datasets. △ Less

Submitted 31 August, 2022; originally announced August 2022.

arXiv:2206.15010 [pdf, other]

"Diversity and Uncertainty in Moderation" are the Key to Data Selection for Multilingual Few-shot Transfer

Authors: Shanu Kumar, Sandipan Dandapat, Monojit Choudhury

Abstract: Few-shot transfer often shows substantial gain over zero-shot transfer~\cite{lauscher2020zero}, which is a practically useful trade-off between fully supervised and unsupervised learning approaches for multilingual pretrained model-based systems. This paper explores various strategies for selecting data for annotation that can result in a better few-shot transfer. The proposed approaches rely on m… ▽ More Few-shot transfer often shows substantial gain over zero-shot transfer~\cite{lauscher2020zero}, which is a practically useful trade-off between fully supervised and unsupervised learning approaches for multilingual pretrained model-based systems. This paper explores various strategies for selecting data for annotation that can result in a better few-shot transfer. The proposed approaches rely on multiple measures such as data entropy using $n$-gram language model, predictive entropy, and gradient embedding. We propose a loss embedding method for sequence labeling tasks, which induces diversity and uncertainty sampling similar to gradient embedding. The proposed data selection strategies are evaluated and compared for POS tagging, NER, and NLI tasks for up to 20 languages. Our experiments show that the gradient and loss embedding-based strategies consistently outperform random data selection baselines, with gains varying with the initial performance of the zero-shot transfer. Furthermore, the proposed method shows similar trends in improvement even when the model is fine-tuned using a lower proportion of the original task-specific labeled training data for zero-shot transfer. △ Less

Submitted 30 June, 2022; originally announced June 2022.

Comments: NAACL 2022

arXiv:2206.10594 [pdf]

How is Va** Framed on Online Knowledge Dissemination Platforms?

Authors: Keyu Chen, Yiwen Shi, Jun Luo, Joyce Jiang, Shweta Yadav, Munmun De Choudhury, Ashiqur R. KhudaBukhsh, Marzieh Babaeianjelodar, Frederick Altice, Navin Kumar

Abstract: We analyze 1,888 articles and 1,119,453 va** posts to study how va** is framed across multiple knowledge dissemination platforms (Wikipedia, Quora, Medium, Reddit, Stack Exchange, wikiHow). We use various NLP techniques to understand these differences. For example, n-grams, emotion recognition, and question answering results indicate that Medium, Quora, and Stack Exchange are appropriate venue… ▽ More We analyze 1,888 articles and 1,119,453 va** posts to study how va** is framed across multiple knowledge dissemination platforms (Wikipedia, Quora, Medium, Reddit, Stack Exchange, wikiHow). We use various NLP techniques to understand these differences. For example, n-grams, emotion recognition, and question answering results indicate that Medium, Quora, and Stack Exchange are appropriate venues for those looking to transition from smoking to va**. Other platforms (Reddit, wikiHow) are more for va** hobbyists and may not sufficiently dissuade youth va**. Conversely, Wikipedia may exaggerate va** harms, dissuading smokers from transitioning. A strength of our work is how the different techniques we have applied validate each other. Based on our results, we provide several recommendations. Stakeholders may utilize our findings to design informational tools to reinforce or mitigate va** (mis)perceptions online. △ Less

Submitted 22 July, 2022; v1 submitted 17 June, 2022; originally announced June 2022.

Comments: arXiv admin note: text overlap with arXiv:2206.07765, arXiv:2206.09024

arXiv:2206.09024 [pdf]

Partisan US News Media Representations of Syrian Refugees

Authors: Keyu Chen, Marzieh Babaeianjelodar, Yiwen Shi, Kamila Janmohamed, Rupak Sarkar, Ingmar Weber, Thomas Davidson, Munmun De Choudhury, Jonathan Huang, Shweta Yadav, Ashique Khudabukhsh, Preslav Ivanov Nakov, Chris Bauch, Orestis Papakyriakopoulos, Kaveh Khoshnood, Navin Kumar

Abstract: We investigate how representations of Syrian refugees (2011-2021) differ across US partisan news outlets. We analyze 47,388 articles from the online US media about Syrian refugees to detail differences in reporting between left- and right-leaning media. We use various NLP techniques to understand these differences. Our polarization and question answering results indicated that left-leaning media t… ▽ More We investigate how representations of Syrian refugees (2011-2021) differ across US partisan news outlets. We analyze 47,388 articles from the online US media about Syrian refugees to detail differences in reporting between left- and right-leaning media. We use various NLP techniques to understand these differences. Our polarization and question answering results indicated that left-leaning media tended to represent refugees as child victims, welcome in the US, and right-leaning media cast refugees as Islamic terrorists. We noted similar results with our sentiment and offensive speech scores over time, which detail possibly unfavorable representations of refugees in right-leaning media. A strength of our work is how the different techniques we have applied validate each other. Based on our results, we provide several recommendations. Stakeholders may utilize our findings to intervene around refugee representations, and design communications campaigns that improve the way society sees refugees and possibly aid refugee outcomes. △ Less

Submitted 17 June, 2022; originally announced June 2022.

arXiv:2206.07765 [pdf]

US News and Social Media Framing around Va**

Authors: Keyu Chen, Marzieh Babaeianjelodar, Yiwen Shi, Rohan Aanegola, Lam Yin Cheung, Preslav Ivanov Nakov, Shweta Yadav, Angus Bancroft, Ashiqur R. KhudaBukhsh, Munmun De Choudhury, Frederick L. Altice, Navin Kumar

Abstract: In this paper, we investigate how va** is framed differently (2008-2021) between US news and social media. We analyze 15,711 news articles and 1,231,379 Facebook posts about va** to study the differences in framing between media varieties. We use word embeddings to provide two-dimensional visualizations of the semantic changes around va** for news and for social media. We detail that news me… ▽ More In this paper, we investigate how va** is framed differently (2008-2021) between US news and social media. We analyze 15,711 news articles and 1,231,379 Facebook posts about va** to study the differences in framing between media varieties. We use word embeddings to provide two-dimensional visualizations of the semantic changes around va** for news and for social media. We detail that news media framing of va** shifted over time in line with emergent regulatory trends, such as; flavored va** bans, with little discussion around va** as a smoking cessation tool. We found that social media discussions were far more varied, with transitions toward va** both as a public health harm and as a smoking cessation tool. Our cloze test, dynamic topic model, and question answering showed similar patterns, where social media, but not news media, characterizes va** as combustible cigarette substitute. We use n-grams to detail that social media data first centered on va** as a smoking cessation tool, and in 2019 moved toward narratives around va** regulation, similar to news media frames. Overall, social media tracks the evolution of va** as a social practice, while news media reflects more risk based concerns. A strength of our work is how the different techniques we have applied validate each other. Stakeholders may utilize our findings to intervene around the framing of va**, and may design communications campaigns that improve the way society sees va**, thus possibly aiding smoking cessation; and reducing youth va**. △ Less

Submitted 22 July, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

arXiv:2205.09744 [pdf, other]

Overcoming Language Disparity in Online Content Classification with Multimodal Learning

Authors: Gaurav Verma, Rohit Mujumdar, Zijie J. Wang, Munmun De Choudhury, Srijan Kumar

Abstract: Advances in Natural Language Processing (NLP) have revolutionized the way researchers and practitioners address crucial societal problems. Large language models are now the standard to develop state-of-the-art solutions for text detection and classification tasks. However, the development of advanced computational techniques and resources is disproportionately focused on the English language, side… ▽ More Advances in Natural Language Processing (NLP) have revolutionized the way researchers and practitioners address crucial societal problems. Large language models are now the standard to develop state-of-the-art solutions for text detection and classification tasks. However, the development of advanced computational techniques and resources is disproportionately focused on the English language, sidelining a majority of the languages spoken globally. While existing research has developed better multilingual and monolingual language models to bridge this language disparity between English and non-English languages, we explore the promise of incorporating the information contained in images via multimodal machine learning. Our comparative analyses on three detection tasks focusing on crisis information, fake news, and emotion recognition, as well as five high-resource non-English languages, demonstrate that: (a) detection frameworks based on pre-trained large language models like BERT and multilingual-BERT systematically perform better on the English language compared against non-English languages, and (b) including images via multimodal learning bridges this performance gap. We situate our findings with respect to existing work on the pitfalls of large language models, and discuss their theoretical and practical implications. Resources for this paper are available at https://multimodality-language-disparity.github.io/. △ Less

Submitted 19 May, 2022; originally announced May 2022.

Comments: Accepted for publication at ICWSM 2022 as a full paper

arXiv:2205.06356 [pdf, other]

Beyond Static Models and Test Sets: Benchmarking the Potential of Pre-trained Models Across Tasks and Languages

Authors: Kabir Ahuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

Abstract: Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages with little linguistic diversity. We argue that this makes the existing practices in multilingual evaluation unreliable and does not provide a full picture of the performance of MMLMs… ▽ More Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages with little linguistic diversity. We argue that this makes the existing practices in multilingual evaluation unreliable and does not provide a full picture of the performance of MMLMs across the linguistic landscape. We propose that the recent work done in Performance Prediction for NLP tasks can serve as a potential solution in fixing benchmarking in Multilingual NLP by utilizing features related to data and language typology to estimate the performance of an MMLM on different languages. We compare performance prediction with translating test data with a case study on four different multilingual datasets, and observe that these methods can provide reliable estimates of the performance that are often on-par with the translation based approaches, without the need for any additional translation as well as evaluation costs. △ Less

Submitted 14 November, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

Comments: NLP Power! Workshop, ACL 2022

arXiv:2205.06350 [pdf, other]

On the Economics of Multilingual Few-shot Learning: Modeling the Cost-Performance Trade-offs of Machine Translated and Manual Data

Authors: Kabir Ahuja, Monojit Choudhury, Sandipan Dandapat

Abstract: Borrowing ideas from {\em Production functions} in micro-economics, in this paper we introduce a framework to systematically evaluate the performance and cost trade-offs between machine-translated and manually-created labelled data for task-specific fine-tuning of massively multilingual language models. We illustrate the effectiveness of our framework through a case-study on the TyDIQA-GoldP datas… ▽ More Borrowing ideas from {\em Production functions} in micro-economics, in this paper we introduce a framework to systematically evaluate the performance and cost trade-offs between machine-translated and manually-created labelled data for task-specific fine-tuning of massively multilingual language models. We illustrate the effectiveness of our framework through a case-study on the TyDIQA-GoldP dataset. One of the interesting conclusions of the study is that if the cost of machine translation is greater than zero, the optimal performance at least cost is always achieved with at least some or only manually-created data. To our knowledge, this is the first attempt towards extending the concept of production functions to study data collection strategies for training multilingual models, and can serve as a valuable tool for other similar cost vs data trade-offs in NLP. △ Less

Submitted 14 November, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

Comments: NAACL 2022

arXiv:2205.06130 [pdf, other]

Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models

Authors: Kabir Ahuja, Shanu Kumar, Sandipan Dandapat, Monojit Choudhury

Abstract: Massively Multilingual Transformer based Language Models have been observed to be surprisingly effective on zero-shot transfer across languages, though the performance varies from language to language depending on the pivot language(s) used for fine-tuning. In this work, we build upon some of the existing techniques for predicting the zero-shot performance on a task, by modeling it as a multi-task… ▽ More Massively Multilingual Transformer based Language Models have been observed to be surprisingly effective on zero-shot transfer across languages, though the performance varies from language to language depending on the pivot language(s) used for fine-tuning. In this work, we build upon some of the existing techniques for predicting the zero-shot performance on a task, by modeling it as a multi-task learning problem. We jointly train predictive models for different tasks which helps us build more accurate predictors for tasks where we have test data in very few languages to measure the actual performance of the model. Our approach also lends us the ability to perform a much more robust feature selection and identify a common set of features that influence zero-shot performance across a variety of tasks. △ Less

Submitted 12 May, 2022; originally announced May 2022.

Comments: ACL 2022

arXiv:2204.02790 [pdf, other]

Global Readiness of Language Technology for Healthcare: What would it Take to Combat the Next Pandemic?

Authors: Ishani Mondal, Kabir Ahuja, Mohit Jain, Jacki O Neil, Kalika Bali, Monojit Choudhury

Abstract: The COVID-19 pandemic has brought out both the best and worst of language technology (LT). On one hand, conversational agents for information dissemination and basic diagnosis have seen widespread use, and arguably, had an important role in combating the pandemic. On the other hand, it has also become clear that such technologies are readily available for a handful of languages, and the vast major… ▽ More The COVID-19 pandemic has brought out both the best and worst of language technology (LT). On one hand, conversational agents for information dissemination and basic diagnosis have seen widespread use, and arguably, had an important role in combating the pandemic. On the other hand, it has also become clear that such technologies are readily available for a handful of languages, and the vast majority of the global south is completely bereft of these benefits. What is the state of LT, especially conversational agents, for healthcare across the world's languages? And, what would it take to ensure global readiness of LT before the next pandemic? In this paper, we try to answer these questions through survey of existing literature and resources, as well as through a rapid chatbot building exercise for 15 Asian and African languages with varying amount of resource-availability. The study confirms the pitiful state of LT even for languages with large speaker bases, such as Sinhala and Hausa, and identifies the gaps that could help us prioritize research and investment strategies in LT for healthcare. △ Less

Submitted 6 April, 2022; originally announced April 2022.

Comments: Under Revision

arXiv:2203.12865 [pdf, other]

Multilingual CheckList: Generation and Evaluation

Authors: Karthikeyan K, Shaily Bhatt, Pankaj Singh, Somak Aditya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

Abstract: Multilingual evaluation benchmarks usually contain limited high-resource languages and do not test models for specific linguistic capabilities. CheckList is a template-based evaluation approach that tests models for specific capabilities. The CheckList template creation process requires native speakers, posing a challenge in scaling to hundreds of languages. In this work, we explore multiple appro… ▽ More Multilingual evaluation benchmarks usually contain limited high-resource languages and do not test models for specific linguistic capabilities. CheckList is a template-based evaluation approach that tests models for specific capabilities. The CheckList template creation process requires native speakers, posing a challenge in scaling to hundreds of languages. In this work, we explore multiple approaches to generate Multilingual CheckLists. We device an algorithm - Template Extraction Algorithm (TEA) for automatically extracting target language CheckList templates from machine translated instances of a source language templates. We compare the TEA CheckLists with CheckLists created with different levels of human intervention. We further introduce metrics along the dimensions of cost, diversity, utility, and correctness to compare the CheckLists. We thoroughly analyze different approaches to creating CheckLists in Hindi. Furthermore, we experiment with 9 more different languages. We find that TEA followed by human verification is ideal for scaling Checklist-based evaluation to multiple languages while TEA gives a good estimates of model performance. △ Less

Submitted 11 October, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

Comments: Accepted to Findings of AACL-IJCNLP 2022

arXiv:2201.08277 [pdf, other]

NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

Authors: Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Saheed Salahudeen Abdullahi, Anuoluwapo Aremu, Alipio Jeorge, Pavel Brazdil

Abstract: Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá ) consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin… ▽ More Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá ) consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing and labeling methods that enable us to create datasets for these low-resource languages. We evaluate a rangeof pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptivefine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivizeresearch on sentiment analysis in under-represented languages. △ Less

Submitted 18 June, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

Comments: Submitted to LREC 2022, 13 pages, 2 figures

arXiv:2201.03074 [pdf, other]

A Survey of Passive Sensing in the Workplace

Authors: Subigya Nepal, Gonzalo J. Martinez, Arvind Pillai, Koustuv Saha, Shayan Mirjafari, Vedant Das Swain, Xuhai Xu, Pino G. Audia, Munmun De Choudhury, Anind K. Dey, Aaron Striegel, Andrew T. Campbell

Abstract: As emerging technologies increasingly integrate into all facets of our lives, the workplace stands at the forefront of potential transformative changes. A notable development in this realm is the advent of passive sensing technology, designed to enhance both cognitive and physical capabilities by monitoring human behavior. This paper reviews current research on the application of passive sensing t… ▽ More As emerging technologies increasingly integrate into all facets of our lives, the workplace stands at the forefront of potential transformative changes. A notable development in this realm is the advent of passive sensing technology, designed to enhance both cognitive and physical capabilities by monitoring human behavior. This paper reviews current research on the application of passive sensing technology in the workplace, focusing on its impact on employee wellbeing and productivity. Additionally, we explore unresolved issues and outline prospective pathways for the incorporation of passive sensing in future workplaces. △ Less

Submitted 30 March, 2024; v1 submitted 9 January, 2022; originally announced January 2022.

Comments: Added references and other minor revisions. Also udated to include relevant works published after 2022

ACM Class: H.5.0

arXiv:2112.14705 [pdf, other]

Lane Change Decision-Making through Deep Reinforcement Learning

Authors: Mukesh Ghimire, Malobika Roy Choudhury, Guna Sekhar Sai Harsha Lagudu

Abstract: Due to the complexity and volatility of the traffic environment, decision-making in autonomous driving is a significantly hard problem. In this project, we use a Deep Q-Network, along with rule-based constraints to make lane-changing decision. A safe and efficient lane change behavior may be obtained by combining high-level lateral decision-making with low-level rule-based trajectory monitoring. T… ▽ More Due to the complexity and volatility of the traffic environment, decision-making in autonomous driving is a significantly hard problem. In this project, we use a Deep Q-Network, along with rule-based constraints to make lane-changing decision. A safe and efficient lane change behavior may be obtained by combining high-level lateral decision-making with low-level rule-based trajectory monitoring. The agent is anticipated to perform appropriate lane-change maneuvers in a real-world-like udacity simulator after training it for a total of 100 episodes. The results shows that the rule-based DQN performs better than the DQN method. The rule-based DQN achieves a safety rate of 0.8 and average speed of 47 MPH △ Less

Submitted 23 December, 2021; originally announced December 2021.

Comments: 6 pages

MSC Class: 15-04 ACM Class: I.2.6

Showing 1–50 of 96 results for author: Choudhury, M