-
Llama meets EU: Investigating the European Political Spectrum through the Lens of LLMs
Authors:
Ilias Chalkidis,
Stephanie Brandl
Abstract:
Instruction-finetuned Large Language Models inherit clear political leanings that have been shown to influence downstream task performance. We expand this line of research beyond the two-party system in the US and audit Llama Chat in the context of EU politics in various settings to analyze the model's political knowledge and its ability to reason in context. We adapt, i.e., further fine-tune, Llama Chat on speeches of individual euro-parties from debates in the European Parliament to reevaluate its political leaning based on the EUandI questionnaire. Llama Chat shows considerable knowledge of national parties' positions and is capable of reasoning in context. The adapted, party-specific, models are substantially re-aligned towards respective positions which we see as a starting point for using chat-based LLMs as data-driven conversational engines to assist research in political science.
Submitted 22 March, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
-
Evaluating Webcam-based Gaze Data as an Alternative for Human Rationale Annotations
Authors:
Stephanie Brandl,
Oliver Eberle,
Tiago Ribeiro,
Anders Søgaard,
Nora Hollenstein
Abstract:
Rationales in the form of manually annotated input spans usually serve as ground truth when evaluating explainability methods in NLP. They are, however, time-consuming and often biased by the annotation process. In this paper, we debate whether human gaze, in the form of webcam-based eye-tracking recordings, poses a valid alternative when evaluating importance scores. We evaluate the additional information provided by gaze data, such as total reading times, gaze entropy, and decoding accuracy with respect to human rationale annotations. We compare WebQAmGaze, a multilingual dataset for information-seeking QA, with attention and explainability-based importance scores for 4 different multilingual Transformer-based language models (mBERT, distil-mBERT, XLMR, and XLMR-L) and 3 languages (English, Spanish, and German). Our pipeline can easily be applied to other tasks and languages. Our findings suggest that gaze data offers valuable linguistic insights that could be leveraged to infer task difficulty and further show a comparable ranking of explainability methods to that of human rationales.
Submitted 29 February, 2024;
originally announced February 2024.
-
Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models
Authors:
Laura Cabello,
Emanuele Bugliarello,
Stephanie Brandl,
Desmond Elliott
Abstract:
Pretrained machine learning models are known to perpetuate and even amplify existing biases in data, which can result in unfair outcomes that ultimately impact user experience. Therefore, it is crucial to understand the mechanisms behind those prejudicial biases to ensure that model performance does not result in discriminatory behaviour toward certain groups or populations. In this work, we define gender bias as our case study. We quantify bias amplification in pretraining and after fine-tuning on three families of vision-and-language models. We investigate the connection, if any, between the two learning stages, and evaluate how bias amplification reflects on model performance. Overall, we find that bias amplification in pretraining and after fine-tuning are independent. We then examine the effect of continued pretraining on gender-neutral data, finding that this reduces group disparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without significantly compromising task performance.
Submitted 26 October, 2023;
originally announced October 2023.
-
On the Interplay between Fairness and Explainability
Authors:
Stephanie Brandl,
Emanuele Bugliarello,
Ilias Chalkidis
Abstract:
In order to build reliable and trustworthy NLP applications, models need to be both fair across different demographics and explainable. Usually these two objectives, fairness and explainability, are optimized and/or examined independently of each other. Instead, we argue that forthcoming, trustworthy NLP systems should consider both. In this work, we perform a first study to understand how they influence each other: do fair(er) models rely on more plausible rationales? and vice versa. To this end, we conduct experiments on two English multi-class text classification datasets, BIOS and ECtHR, that provide information on gender and nationality, respectively, as well as human-annotated rationales. We fine-tune pre-trained language models with several methods for (i) bias mitigation, which aims to improve fairness; (ii) rationale extraction, which aims to produce plausible explanations. We find that bias mitigation algorithms do not always lead to fairer models. Moreover, we discover that empirical fairness and explainability are orthogonal.
Submitted 13 November, 2023; v1 submitted 25 October, 2023;
originally announced October 2023.
-
Rather a Nurse than a Physician -- Contrastive Explanations under Investigation
Authors:
Oliver Eberle,
Ilias Chalkidis,
Laura Cabello,
Stephanie Brandl
Abstract:
Contrastive explanations, where one decision is explained in contrast to another, are supposed to be closer to how humans explain a decision than non-contrastive explanations, where the decision is not necessarily referenced to an alternative. This claim has never been empirically validated. We analyze four English text-classification datasets (SST2, DynaSent, BIOS and DBpedia-Animals). We fine-tune and extract explanations from three different models (RoBERTa, GPT-2, and T5), each in three different sizes and apply three post-hoc explainability methods (LRP, GradientxInput, GradNorm). We furthermore collect and release human rationale annotations for a subset of 100 samples from the BIOS dataset for contrastive and non-contrastive settings. A cross-comparison between model-based rationales and human annotations, both in contrastive and non-contrastive settings, yields a high agreement between the two settings for models as well as for humans. Moreover, model-based explanations computed in both settings align equally well with human rationales. Thus, we empirically find that humans do not necessarily explain in a contrastive manner. 9 pages, long paper at ACL 2022 proceedings.
Submitted 18 October, 2023;
originally announced October 2023.
-
WebQAmGaze: A Multilingual Webcam Eye-Tracking-While-Reading Dataset
Authors:
Tiago Ribeiro,
Stephanie Brandl,
Anders Søgaard,
Nora Hollenstein
Abstract:
We present WebQAmGaze, a multilingual low-cost eye-tracking-while-reading dataset, designed as the first webcam-based eye-tracking corpus of reading to support the development of explainable computational language processing models. WebQAmGaze includes webcam eye-tracking data from 600 participants of a wide age range naturally reading English, German, Spanish, and Turkish texts. Each participant performs two reading tasks composed of five texts each, a normal reading and an information-seeking task, followed by a comprehension question. We compare the collected webcam data to high-quality eye-tracking recordings. The results show a moderate to strong correlation between the eye movement measures obtained with the webcam compared to those obtained with a commercial eye-tracking device. When validating the data, we find that higher fixation duration on relevant text spans accurately indicates correctness when answering the corresponding questions. This dataset advances webcam-based reading studies and opens avenues to low-cost and diverse data collection. WebQAmGaze is beneficial to learn about the cognitive processes behind question-answering and to apply these insights to computational models of language understanding.
Submitted 15 March, 2024; v1 submitted 31 March, 2023;
originally announced March 2023.
-
Every word counts: A multilingual analysis of individual human alignment with model attention
Authors:
Stephanie Brandl,
Nora Hollenstein
Abstract:
Human fixation patterns have been shown to correlate strongly with Transformer-based attention. Those correlation analyses are usually carried out without taking into account individual differences between participants and are mostly done on monolingual datasets, making it difficult to generalise findings. In this paper, we analyse eye-tracking data from speakers of 13 different languages reading both in their native language (L1) and in English as language learners (L2). We find considerable differences between languages but also that individual reading behaviour such as skipping rate, total reading time and vocabulary knowledge (LexTALE) influence the alignment between humans and models to an extent that should be considered in future studies.
Submitted 5 October, 2022;
originally announced October 2022.
-
Domain-Specific Word Embeddings with Structure Prediction
Authors:
Stephanie Brandl,
David Lassner,
Anne Baillot,
Shinichi Nakajima
Abstract:
Complementary to finding good general word embeddings, an important question for representation learning is to find dynamic word embeddings, e.g., across time or domain. Current methods do not offer a way to use or predict information on structure between sub-corpora, time or domain and dynamic embeddings can only be compared after post-alignment. We propose novel word embedding methods that provide general word representations for the whole corpus, domain-specific representations for each sub-corpus, sub-corpus structure, and embedding alignment simultaneously. We present an empirical evaluation on New York Times articles and two English Wikipedia datasets with articles on science and philosophy. Our method, called Word2Vec with Structure Prediction (W2VPred), provides better performance than baselines in terms of the general analogy tests, domain-specific analogy tests, and multiple specific word embedding evaluations as well as structure prediction performance when no structure is given a priori. As a use case in the field of Digital Humanities we demonstrate how to raise novel research questions for high literature from the German Text Archive.
Submitted 6 October, 2022;
originally announced October 2022.
-
Evaluating Deep Taylor Decomposition for Reliability Assessment in the Wild
Authors:
Stephanie Brandl,
Daniel Hershcovich,
Anders Søgaard
Abstract:
We argue that we need to evaluate model interpretability methods 'in the wild', i.e., in situations where professionals make critical decisions, and models can potentially assist them. We present an in-the-wild evaluation of token attribution based on Deep Taylor Decomposition, with professional journalists performing reliability assessments. We find that using this method in conjunction with RoBERTa-Large, fine-tuned on the Gossip Corpus, led to faster and better human decision-making, as well as a more critical attitude toward news sources among the journalists. We present a comparison of human and model rationales, as well as a qualitative analysis of the journalists' experiences with machine-in-the-loop decision making.
Submitted 3 May, 2022;
originally announced June 2022.
-
Do Transformer Models Show Similar Attention Patterns to Task-Specific Human Gaze?
Authors:
Stephanie Brandl,
Oliver Eberle,
Jonas Pilot,
Anders Søgaard
Abstract:
Learned self-attention functions in state-of-the-art NLP models often correlate with human attention. We investigate whether self-attention in large-scale pre-trained language models is as predictive of human eye fixation patterns during task-reading as classical cognitive models of human attention. We compare attention functions across two task-specific reading datasets for sentiment analysis and relation extraction. We find the predictiveness of large-scale pre-trained self-attention for human attention depends on 'what is in the tail', e.g., the syntactic nature of rare contexts. Further, we observe that task-specific fine-tuning does not increase the correlation with human task-specific reading. Through an input reduction experiment we give complementary insights on the sparsity and fidelity trade-off, showing that lower-entropy attention vectors are more faithful.
Submitted 25 April, 2022;
originally announced May 2022.
-
How Conservative are Language Models? Adapting to the Introduction of Gender-Neutral Pronouns
Authors:
Stephanie Brandl,
Ruixiang Cui,
Anders Søgaard
Abstract:
Gender-neutral pronouns have recently been introduced in many languages to a) include non-binary people and b) as a generic singular. Recent results from psycholinguistics suggest that gender-neutral pronouns (in Swedish) are not associated with human processing difficulties. This, we show, is in sharp contrast with automated processing. We show that gender-neutral pronouns in Danish, English, and Swedish are associated with higher perplexity, more dispersed attention patterns, and worse downstream performance. We argue that such conservativity in language models may limit widespread adoption of gender-neutral pronouns and must therefore be resolved.
Submitted 3 May, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
Challenges and Strategies in Cross-Cultural NLP
Authors:
Daniel Hershcovich,
Stella Frank,
Heather Lent,
Miryam de Lhoneux,
Mostafa Abdou,
Stephanie Brandl,
Emanuele Bugliarello,
Laura Cabello Piqueras,
Ilias Chalkidis,
Ruixiang Cui,
Constanza Fierro,
Katerina Margatina,
Phillip Rust,
Anders Søgaard
Abstract:
Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers and the content they produce and require, vary not just by language, but also by culture. Although language and culture are tightly linked, there are important differences. Analogous to cross-lingual and multilingual NLP, cross-cultural and multicultural NLP considers these differences in order to better serve users of NLP systems. We propose a principled framework to frame these efforts, and survey existing and potential strategies.
Submitted 18 March, 2022;
originally announced March 2022.
-
Analyzing Item Popularity Bias of Music Recommender Systems: Are Different Genders Equally Affected?
Authors:
Oleg Lesota,
Alessandro B. Melchiorre,
Navid Rekabsaz,
Stefan Brandl,
Dominik Kowald,
Elisabeth Lex,
Markus Schedl
Abstract:
Several studies have identified discrepancies between the popularity of items in user profiles and the corresponding recommendation lists. Such behavior, which concerns a variety of recommendation algorithms, is referred to as popularity bias. Existing work predominantly adopts simple statistical measures, such as the difference of mean or median popularity, to quantify popularity bias. Moreover, it does so irrespective of user characteristics other than the inclination to popular content. In this work, in contrast, we propose to investigate popularity differences (between the user profile and recommendation list) in terms of median, a variety of statistical moments, as well as similarity measures that consider the entire popularity distributions (Kullback-Leibler divergence and Kendall's tau rank-order correlation). This results in a more detailed picture of the characteristics of popularity bias. Furthermore, we investigate whether such algorithmic popularity bias affects users of different genders in the same way. We focus on music recommendation and conduct experiments on the recently released standardized LFM-2b dataset, containing listening profiles of Last.fm users. We investigate the algorithmic popularity bias of seven common recommendation algorithms (five collaborative filtering and two baselines). Our experiments show that (1) the studied metrics provide novel insights into popularity bias in comparison with only using average differences, (2) algorithms less inclined towards popularity bias amplification do not necessarily perform worse in terms of utility (NDCG), (3) the majority of the investigated recommenders intensify the popularity bias of the female users.
Submitted 16 August, 2021;
originally announced August 2021.
-
Balancing the composition of word embeddings across heterogenous data sets
Authors:
Stephanie Brandl,
David Lassner,
Maximilian Alber
Abstract:
Word embeddings capture semantic relationships based on contextual information and are the basis for a wide variety of natural language processing applications. Notably, these relationships are learned solely from the data, and consequently the data composition impacts the semantics of the embeddings, which arguably can lead to biased word vectors. Given qualitatively different data subsets, we aim to align the influence of single subsets on the resulting word vectors, while retaining their quality. In this regard, we propose a criterion to measure the shift towards a single data subset and develop approaches to meet both objectives. We find that a weighted average of the two subset embeddings balances the influence of those subsets while word similarity performance decreases. We further propose a promising optimization approach to balance influences and quality of word embeddings.
Submitted 14 January, 2020;
originally announced January 2020.