-
Beyond Metrics: Evaluating LLMs' Effectiveness in Culturally Nuanced, Low-Resource Real-World Scenarios
Authors:
Millicent Ochieng,
Varun Gumma,
Sunayana Sitaram,
Jindong Wang,
Vishrav Chaudhary,
Keshet Ronen,
Kalika Bali,
Jacki O'Neill
Abstract:
The deployment of Large Language Models (LLMs) in real-world applications presents both opportunities and challenges, particularly in multilingual and code-mixed communication settings. This research evaluates the performance of seven leading LLMs in sentiment analysis on a dataset derived from multilingual and code-mixed WhatsApp chats, including Swahili, English and Sheng. Our evaluation includes both quantitative analysis using metrics like F1 score and qualitative assessment of LLMs' explanations for their predictions. We find that, while Mistral-7b and Mixtral-8x7b achieved high F1 scores, they and other LLMs such as GPT-3.5-Turbo, Llama-2-70b, and Gemma-7b struggled with understanding linguistic and contextual nuances, and their explanations revealed a lack of transparency in their decision-making process. In contrast, GPT-4 and GPT-4-Turbo excelled in grasping diverse linguistic inputs and managing various contextual information, demonstrating high consistency with human alignment and transparency in their decision-making process. The LLMs, however, encountered difficulties in incorporating cultural nuance, especially in non-English settings, with the GPT-4 models doing so inconsistently. The findings emphasize the necessity of continuous improvement of LLMs to effectively tackle the challenges of culturally nuanced, low-resource real-world settings and the need for developing evaluation benchmarks for capturing these issues.
Submitted 13 June, 2024; v1 submitted 1 June, 2024;
originally announced June 2024.
-
Towards Measuring and Modeling "Culture" in LLMs: A Survey
Authors:
Muhammad Farid Adilazuarda,
Sagnik Mukherjee,
Pradhyumna Lavania,
Siddhant Singh,
Alham Fikri Aji,
Jacki O'Neill,
Ashutosh Modi,
Monojit Choudhury
Abstract:
We present a survey of more than 90 recent papers that aim to study cultural representation and inclusion in large language models (LLMs). We observe that none of the studies explicitly define "culture," which is a complex, multifaceted concept; instead, they probe the models on some specially designed datasets which represent certain aspects of "culture". We call these aspects the proxies of culture, and organize them across two dimensions of demographic and semantic proxies. We also categorize the probing methods employed. Our analysis indicates that only certain aspects of "culture," such as values and objectives, have been studied, leaving several other interesting and important facets, especially the multitude of semantic domains (Thompson et al., 2020) and aboutness (Hershcovich et al., 2022), unexplored. Two other crucial gaps are the lack of robustness of probing techniques and situated studies on the impact of cultural mis- and under-representation in LLM-based applications.
Submitted 19 June, 2024; v1 submitted 5 March, 2024;
originally announced March 2024.
-
A Survey on Word Meta-Embedding Learning
Authors:
Danushka Bollegala,
James O'Neill
Abstract:
Meta-embedding (ME) learning is an emerging approach that attempts to learn more accurate word embeddings given existing (source) word embeddings as the sole input.
Due to its ability to incorporate semantics from multiple source embeddings in a compact manner with superior performance, ME learning has gained popularity among practitioners in NLP.
To the best of our knowledge, there exists no prior systematic survey on ME learning and this paper attempts to fill this need.
We classify ME learning methods according to multiple factors such as whether they (a) operate on static or contextualised embeddings, (b) are trained in an unsupervised manner or (c) are fine-tuned for a particular task/domain.
Moreover, we discuss the limitations of existing ME learning methods and highlight potential future research directions.
Submitted 25 April, 2022;
originally announced April 2022.
-
FedHM: Efficient Federated Learning for Heterogeneous Models via Low-rank Factorization
Authors:
Dezhong Yao,
Wanning Pan,
Michael J O'Neill,
Yutong Dai,
Yao Wan,
Hai Jin,
Lichao Sun
Abstract:
One underlying assumption of recent federated learning (FL) paradigms is that all local models usually share the same network architecture and size, which becomes impractical for devices with different hardware resources. A scalable federated learning framework should address the heterogeneity that arises because clients have different computing capacities and communication capabilities. To this end, this paper proposes FedHM, a novel heterogeneous federated model compression framework, distributing the heterogeneous low-rank models to clients and then aggregating them into a full-rank model. Our solution enables the training of heterogeneous models with varying computational complexities and aggregates them into a single global model. Furthermore, FedHM significantly reduces the communication cost by using low-rank models. Extensive experimental results demonstrate that FedHM is superior in the performance and robustness of models of different sizes, compared with state-of-the-art heterogeneous FL methods under various FL settings. Additionally, the convergence guarantee of FL for heterogeneous devices is theoretically analyzed for the first time.
Submitted 26 May, 2022; v1 submitted 29 November, 2021;
originally announced November 2021.
-
I Wish I Would Have Loved This One, But I Didn't -- A Multilingual Dataset for Counterfactual Detection in Product Reviews
Authors:
James O'Neill,
Polina Rozenshtein,
Ryuichi Kiryo,
Motoko Kubota,
Danushka Bollegala
Abstract:
Counterfactual statements describe events that did not or cannot take place. We consider the problem of counterfactual detection (CFD) in product reviews. For this purpose, we annotate a multilingual CFD dataset from Amazon product reviews covering counterfactual statements written in English, German, and Japanese. The dataset is unique as it contains counterfactuals in multiple languages, covers a new application area of e-commerce reviews, and provides high-quality professional annotations. We train CFD models using different text representation methods and classifiers. We find that these models are robust against the selectional biases introduced due to cue phrase-based sentence selection. Moreover, our CFD dataset is compatible with prior datasets and can be merged to learn accurate CFD models. Applying machine translation to English counterfactual examples to create multilingual data performs poorly, demonstrating the language-specificity of this problem, which has been ignored so far.
Submitted 15 September, 2021; v1 submitted 14 April, 2021;
originally announced April 2021.
-
Do not let the history haunt you -- Mitigating Compounding Errors in Conversational Question Answering
Authors:
Angrosh Mandya,
James O'Neill,
Danushka Bollegala,
Frans Coenen
Abstract:
The Conversational Question Answering (CoQA) task involves answering a sequence of inter-related conversational questions about a contextual paragraph. Although existing approaches employ human-written ground-truth answers for answering conversational questions at test time, in a realistic scenario, the CoQA model will not have any access to ground-truth answers for the previous questions, compelling the model to rely upon its own previously predicted answers for answering the subsequent questions. In this paper, we find that compounding errors occur when using previously predicted answers at test time, significantly lowering the performance of CoQA systems. To solve this problem, we propose a sampling strategy that dynamically selects between target answers and model predictions during training, thereby closely simulating the situation at test time. Further, we analyse the severity of this phenomenon as a function of the question type, conversation length and domain type.
Submitted 12 May, 2020;
originally announced May 2020.
-
Automatic Taxonomy Generation - A Use-Case in the Legal Domain
Authors:
Cécile Robin,
James O'Neill,
Paul Buitelaar
Abstract:
A key challenge in the legal domain is the adaptation and representation of the legal knowledge expressed through texts, so that legal practitioners and researchers can access this information more easily and quickly to help with compliance-related issues. One way to approach this goal is in the form of a taxonomy of legal concepts. While this task usually requires a manual construction of terms and their relations by domain experts, this paper describes a methodology to automatically generate a taxonomy of legal noun concepts. We apply and compare two approaches on a corpus consisting of statutory instruments for UK, Wales, Scotland and Northern Ireland laws.
Submitted 4 October, 2017;
originally announced October 2017.