-
Weak convergence of adaptive Markov chain Monte Carlo
Authors:
Austin Brown,
Jeffrey S. Rosenthal
Abstract:
This article develops general conditions for weak convergence of adaptive Markov chain Monte Carlo processes and shows that these conditions imply a weak law of large numbers for bounded Lipschitz continuous functions. This allows an estimation theory for adaptive Markov chain Monte Carlo where previously developed theory in total variation may fail or be difficult to establish. Extensions of weak convergence to general Wasserstein distances are established along with a weak law of large numbers for possibly unbounded Lipschitz functions. Applications are given to auto-regressive processes in various settings, unadjusted Langevin processes, and adaptive Metropolis-Hastings.
Submitted 2 June, 2024;
originally announced June 2024.
-
InspectorRAGet: An Introspection Platform for RAG Evaluation
Authors:
Kshitij Fadnis,
Siva Sankalp Patel,
Odellia Boni,
Yannis Katsis,
Sara Rosenthal,
Benjamin Sznajder,
Marina Danilevsky
Abstract:
Large Language Models (LLM) have become a popular approach for implementing Retrieval Augmented Generation (RAG) systems, and a significant amount of effort has been spent on building good models and metrics. In spite of increased recognition of the need for rigorous evaluation of RAG systems, few tools exist that go beyond the creation of model output and automatic calculation. We present InspectorRAGet, an introspection platform for RAG evaluation. InspectorRAGet allows the user to analyze aggregate and instance-level performance of RAG systems, using both human and algorithmic metrics as well as annotator quality. InspectorRAGet is suitable for multiple use cases and is available publicly to the community. The demo video is available at https://youtu.be/MJhe8QIXcEc
Submitted 26 April, 2024;
originally announced April 2024.
-
CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems
Authors:
Sara Rosenthal,
Avirup Sil,
Radu Florian,
Salim Roukos
Abstract:
Retrieval Augmented Generation (RAG) has become a popular application for large language models. It is preferable that successful RAG systems provide accurate answers that are supported by being grounded in a passage without any hallucinations. While considerable work is required for building a full RAG pipeline, being able to benchmark performance is also necessary. We present ClapNQ, a benchmark Long-form Question Answering dataset for the full RAG pipeline. ClapNQ includes long answers with grounded gold passages from Natural Questions (NQ) and a corpus to perform either retrieval, generation, or the full RAG pipeline. The ClapNQ answers are concise, 3x smaller than the full passage, and cohesive, with multiple pieces of the passage that are not contiguous. RAG models must adapt to these properties to be successful at ClapNQ. We present baseline experiments and analysis for ClapNQ that highlight areas where there is still significant room for improvement in grounded RAG. CLAPNQ is publicly available at https://github.com/primeqa/clapnq
Submitted 2 April, 2024;
originally announced April 2024.
-
Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes
Authors:
Darren Liu,
Cheng Ding,
Delgersuren Bold,
Monique Bouvier,
Jiaying Lu,
Benjamin Shickel,
Craig S. Jabaley,
Wenhui Zhang,
Soo** Park,
Michael J. Young,
Mark S. Wainwright,
Gilles Clermont,
Parisa Rashidi,
Eric S. Rosenthal,
Laurie Dimisko,
Ran Xiao,
Joo Heung Yoon,
Carl Yang,
Xiao Hu
Abstract:
The field of healthcare has increasingly turned its focus towards Large Language Models (LLMs) due to their remarkable performance. However, their performance in actual clinical applications has been underexplored. Traditional evaluations based on question-answering tasks don't fully capture the nuanced contexts. This gap highlights the need for more in-depth and practical assessments of LLMs in real-world healthcare settings. Objective: We sought to evaluate the performance of LLMs in the complex clinical context of adult critical care medicine using systematic and comprehensible analytic methods, including clinician annotation and adjudication. Methods: We investigated the performance of three general LLMs in understanding and processing real-world clinical notes. Concepts from 150 clinical notes were identified by MetaMap and then labeled by 9 clinicians. Each LLM's proficiency was evaluated by identifying the temporality and negation of these concepts using different prompts for an in-depth analysis. Results: GPT-4 showed overall superior performance compared to other LLMs. In contrast, both GPT-3.5 and text-davinci-003 exhibit enhanced performance when the appropriate prompting strategies are employed. The GPT family models have demonstrated considerable efficiency, evidenced by their cost-effectiveness and time-saving capabilities. Conclusion: A comprehensive qualitative performance evaluation framework for LLMs is developed and operationalized. This framework goes beyond singular performance aspects. With expert annotations, this methodology not only validates LLMs' capabilities in processing complex medical data but also establishes a benchmark for future LLM evaluations across specialized domains.
Submitted 24 January, 2024;
originally announced January 2024.
-
Muted: Multilingual Targeted Offensive Speech Identification and Visualization
Authors:
Christoph Tillmann,
Aashka Trivedi,
Sara Rosenthal,
Santosh Borse,
Rong Zhang,
Avirup Sil,
Bishwaranjan Bhattacharjee
Abstract:
Offensive language such as hate, abuse, and profanity (HAP) occurs in various content on the web. While previous work has mostly dealt with sentence level annotations, there have been a few recent attempts to identify offensive spans as well. We build upon this work and introduce Muted, a system to identify multilingual HAP content by displaying offensive arguments and their targets using heat maps to indicate their intensity. Muted can leverage any transformer-based HAP-classification model and its attention mechanism out-of-the-box to identify toxic spans, without further fine-tuning. In addition, we use the spaCy library to identify the specific targets and arguments for the words predicted by the attention heatmaps. We present the model's performance on identifying offensive spans and their targets in existing datasets and present new annotations on German text. Finally, we demonstrate our proposed visualization tool on multilingual inputs.
Submitted 18 December, 2023;
originally announced December 2023.
-
Can Large Language Models Capture Public Opinion about Global Warming? An Empirical Assessment of Algorithmic Fidelity and Bias
Authors:
S. Lee,
T. Q. Peng,
M. H. Goldberg,
S. A. Rosenthal,
J. E. Kotcher,
E. W. Maibach,
A. Leiserowitz
Abstract:
Large language models (LLMs) have demonstrated their potential in social science research by emulating human perceptions and behaviors, a concept referred to as algorithmic fidelity. This study assesses the algorithmic fidelity and bias of LLMs by utilizing two nationally representative climate change surveys. The LLMs were conditioned on demographics and/or psychological covariates to simulate survey responses. The findings indicate that LLMs can effectively capture presidential voting behaviors but encounter challenges in accurately representing global warming perspectives when relevant covariates are not included. GPT-4 exhibits improved performance when conditioned on both demographics and covariates. However, disparities emerge in LLM estimations of the views of certain groups, with LLMs tending to underestimate worry about global warming among Black Americans. While highlighting the potential of LLMs to aid social science research, these results underscore the importance of meticulous conditioning, model selection, survey question format, and bias assessment when employing LLMs for survey simulation. Further investigation into prompt engineering and algorithm auditing is essential to harness the power of LLMs while addressing their inherent limitations.
Submitted 7 February, 2024; v1 submitted 31 October, 2023;
originally announced November 2023.
-
Bounding and estimating MCMC convergence rates using common random number simulations
Authors:
Sabrina Sixta,
Jeffrey S. Rosenthal,
Austin Brown
Abstract:
This paper explores how and when to use common random number (CRN) simulation to evaluate Markov chain Monte Carlo (MCMC) convergence rates. We discuss how CRN simulation is closely related to theoretical convergence rate techniques such as one-shot coupling and coupling from the past. We present conditions under which the CRN technique generates an unbiased estimate of the squared $L^2$-Wasserstein distance between two random variables. We also discuss how this unbiasedness over a single iteration does not extend to unbiasedness over multiple iterations. We provide an upper bound on the Wasserstein distance of a Markov chain to its stationary distribution after $N$ steps in terms of averages over CRN simulations. Finally, we apply our result to a Bayesian regression Gibbs sampler.
Submitted 22 March, 2024; v1 submitted 27 September, 2023;
originally announced September 2023.
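The common random number idea in the abstract above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy AR(1) chain, the function name `crn_sq_distance`, and the two fixed starting points. Two copies of the chain are driven by the same noise, and the average squared gap after `n_steps` is an upper bound on the squared $L^2$-Wasserstein distance between their laws, because any coupling gives such a bound.

```python
import random

def crn_sq_distance(a, x0, y0, n_steps, n_reps, seed=0):
    """Average squared gap between two AR(1) copies driven by common random numbers.

    Hypothetical toy chain: X_{k+1} = a * X_k + Z_{k+1} with standard normal Z.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        x, y = x0, y0
        for _ in range(n_steps):
            z = rng.gauss(0.0, 1.0)  # one innovation shared by both copies
            x = a * x + z
            y = a * y + z
        total += (x - y) ** 2
    return total / n_reps

# For this linear chain the shared noise cancels, so the gap shrinks by a factor a per step.
est = crn_sq_distance(a=0.5, x0=0.0, y0=4.0, n_steps=3, n_reps=1000)
```

In this toy case the gap is deterministic, `(a ** n_steps * (x0 - y0)) ** 2`, which makes the contraction easy to see; for a nonlinear sampler the same loop would return a genuinely random average.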
-
Efficiency of reversible MCMC methods: elementary derivations and applications to composite methods
Authors:
Radford M. Neal,
Jeffrey S. Rosenthal
Abstract:
We review criteria for comparing the efficiency of Markov chain Monte Carlo (MCMC) methods with respect to the asymptotic variance of estimates of expectations of functions of state, and show how such criteria can justify ways of combining improvements to MCMC methods. We say that a chain on a finite state space with transition matrix $P$ efficiency-dominates one with transition matrix $Q$ if for every function of state it has lower (or equal) asymptotic variance. We give elementary proofs of some previous results regarding efficiency dominance, leading to a self-contained demonstration that a reversible chain with transition matrix $P$ efficiency-dominates a reversible chain with transition matrix $Q$ if and only if none of the eigenvalues of $Q-P$ are negative. This allows us to conclude that modifying a reversible MCMC method to improve its efficiency will also improve the efficiency of a method that randomly chooses either this or some other reversible method, and to conclude that improving the efficiency of a reversible update for one component of state (as in Gibbs sampling) will improve the overall efficiency of a reversible method that combines this and other updates. It also explains how antithetic MCMC can be more efficient than i.i.d. sampling. We also establish conditions that can guarantee that a method is not efficiency-dominated by any other method.
Submitted 27 March, 2024; v1 submitted 29 May, 2023;
originally announced May 2023.
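As a small worked illustration of the eigenvalue criterion in the abstract above (the two matrices are chosen here for illustration and do not come from the paper), take a two-state space with uniform stationary distribution and compare a chain $P$ that tends to switch states with i.i.d. sampling $Q$:

$$
P = \begin{pmatrix} 1/4 & 3/4 \\ 3/4 & 1/4 \end{pmatrix}, \qquad
Q = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}, \qquad
Q - P = \begin{pmatrix} 1/4 & -1/4 \\ -1/4 & 1/4 \end{pmatrix}.
$$

Both chains are reversible with respect to the uniform distribution, and $Q-P$ has eigenvalues $0$ and $1/2$, neither of which is negative, so by the stated criterion $P$ efficiency-dominates $Q$. This agrees with a direct calculation: for a function with variance $\sigma^2$ under the uniform distribution, the standard spectral formula for a reversible two-state chain with non-unit eigenvalue $\lambda$ gives asymptotic variance $\sigma^2 (1+\lambda)/(1-\lambda)$, which is $\sigma^2/3$ for $P$ ($\lambda = -1/2$) and $\sigma^2$ for $Q$ ($\lambda = 0$).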
-
PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development
Authors:
Avirup Sil,
Jaydeep Sen,
Bhavani Iyer,
Martin Franz,
Kshitij Fadnis,
Mihaela Bornea,
Sara Rosenthal,
Scott McCarley,
Rong Zhang,
Vishwajeet Kumar,
Yulong Li,
Md Arafat Sultan,
Riyaz Bhat,
Radu Florian,
Salim Roukos
Abstract:
The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PRIMEQA: a one-stop and open-source QA repository with an aim to democratize QA research and facilitate easy replication of state-of-the-art (SOTA) QA methods. PRIMEQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation. It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on public benchmarks, and expanding pre-existing methods. PRIMEQA is available at https://github.com/primeqa.
Submitted 25 January, 2023; v1 submitted 23 January, 2023;
originally announced January 2023.
-
Sampling via Rejection-Free Partial Neighbor Search
Authors:
Sigeng Chen,
Jeffrey S. Rosenthal,
Aki Dote,
Hirotaka Tamura,
Ali Sheikholeslami
Abstract:
The Metropolis algorithm produces a Markov chain that converges to a specified target density $π$. In order to improve its efficiency, we can use the Rejection-Free version of the Metropolis algorithm, which avoids the inefficiency of rejections by evaluating all neighbors. Rejection-Free can be made more efficient through the use of parallel hardware. However, for some specialized hardware, such as the Digital Annealing Unit, the number of units will limit the number of neighbors being considered at each step. Hence, we propose an enhanced version of Rejection-Free known as Partial Neighbor Search, which only considers a portion of the neighbors while using the Rejection-Free technique. This method will be tested on several examples to demonstrate its effectiveness and advantages under different circumstances.
Submitted 19 October, 2022;
originally announced October 2022.
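The abstract above does not spell out the mechanics, so the following is only a rough sketch of one common way a rejection-free Metropolis move can be organized, under stated assumptions: a uniform proposal over neighbors, a jump to a neighbor chosen in proportion to its acceptance probability, and a geometric "multiplicity" recording how long an ordinary Metropolis chain would have stayed put. The function name, the five-state ring, and its weights are hypothetical, and Partial Neighbor Search itself is not implemented.

```python
import math
import random

def rejection_free_step(x, neighbors, pi, rng):
    """One rejection-free move from x: returns (next_state, multiplicity of x)."""
    nbrs = neighbors(x)
    # Metropolis acceptance probability for every neighbor (uniform proposal assumed).
    acc = [min(1.0, pi(y) / pi(x)) for y in nbrs]
    escape = sum(acc) / len(nbrs)  # chance that an ordinary Metropolis step leaves x
    # Iterations an ordinary chain would spend at x: geometric on 1, 2, ...
    if escape >= 1.0:
        m = 1
    else:
        u = 1.0 - rng.random()  # uniform on (0, 1]
        m = 1 + int(math.log(u) / math.log(1.0 - escape))
    # Choose the next state in proportion to its acceptance probability.
    y = rng.choices(nbrs, weights=acc, k=1)[0]
    return y, m

# Hypothetical toy target on a ring of five states, known up to a constant.
weights = [1.0, 2.0, 3.0, 2.0, 1.0]
pi = lambda s: weights[s]
neighbors = lambda s: [(s - 1) % 5, (s + 1) % 5]

rng = random.Random(0)
x, time_at_2, total_time = 0, 0, 0
for _ in range(20000):
    y, m = rejection_free_step(x, neighbors, pi, rng)
    if x == 2:
        time_at_2 += m  # weight each visited state by its multiplicity
    total_time += m
    x = y
estimate = time_at_2 / total_time  # should be close to 3/9 for this target
```

Weighting each visited state by its multiplicity is what keeps estimates targeted at $π$ even though the chain never stays in place.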
-
GAAMA 2.0: An Integrated System that Answers Boolean and Extractive Questions
Authors:
Scott McCarley,
Mihaela Bornea,
Sara Rosenthal,
Anthony Ferritto,
Md Arafat Sultan,
Avirup Sil,
Radu Florian
Abstract:
Recent machine reading comprehension datasets include extractive and boolean questions but current approaches do not offer integrated support for answering both question types. We present a multilingual machine reading comprehension system and front-end demo that handles boolean questions by providing both a YES/NO answer and highlighting supporting evidence, and handles extractive questions by highlighting the answer in the passage. Our system, GAAMA 2.0, is ranked first on the Tydi QA leaderboard at the time of this writing. We contrast two different implementations of our approach. The first includes several independent stacks of transformers allowing easy deployment of each component. The second is a single stack of transformers utilizing adapters to reduce GPU memory footprint in a resource-constrained environment.
Submitted 21 June, 2022; v1 submitted 16 June, 2022;
originally announced June 2022.
-
Task Transfer and Domain Adaptation for Zero-Shot Question Answering
Authors:
Xiang Pan,
Alex Sheng,
David Shimshoni,
Aditya Singhal,
Sara Rosenthal,
Avirup Sil
Abstract:
Pretrained language models have shown success in various areas of natural language processing, including reading comprehension tasks. However, when applying machine learning methods to new domains, labeled data may not always be available. To address this, we use supervised pretraining on source-domain data to reduce sample complexity on domain-specific downstream tasks. We evaluate zero-shot performance on domain-specific reading comprehension tasks by combining task transfer with domain adaptation to fine-tune a pretrained model with no labelled data from the target task. Our approach outperforms Domain-Adaptive Pretraining on downstream domain-specific reading comprehension tasks in 3 out of 4 domains.
Submitted 14 June, 2022;
originally announced June 2022.
-
Football Group Draw Probabilities and Corrections
Authors:
Gareth O. Roberts,
Jeffrey S. Rosenthal
Abstract:
This paper considers the challenge of designing football group draw mechanisms which have the uniform distribution over all valid draw assignments, but are also entertaining, practical, and transparent. We explain how to simulate the FIFA Sequential Draw method, to compute the non-uniformity of its draws by comparison to a uniform Rejection Sampler. We then propose two practical methods of achieving the uniform distribution while still using balls and bowls in a way which is suitable for a televised draw. The solutions can also be tried interactively.
Submitted 25 January, 2023; v1 submitted 12 May, 2022;
originally announced May 2022.
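The uniform Rejection Sampler mentioned in the abstract above can be sketched in a few lines. This is a toy under assumed rules, not the paper's code: the pots, team labels, and the single constraint (two teams sharing a letter may not meet) are hypothetical. A uniformly random assignment is drawn and simply redrawn until it is valid, which makes every valid draw equally likely.

```python
import random

def uniform_valid_draw(pot1, pot2, conflicts, rng):
    """Draw a uniformly random valid pairing of pot1 teams with pot2 teams.

    conflicts is a set of (pot1_team, pot2_team) pairs that may not share a group.
    """
    while True:
        perm = pot2[:]
        rng.shuffle(perm)  # uniform over all assignments, valid or not
        if all((a, b) not in conflicts for a, b in zip(pot1, perm)):
            return list(zip(pot1, perm))  # accept the first valid assignment

rng = random.Random(1)
pot1 = ["A1", "B1", "C1"]
pot2 = ["A2", "B2", "C2"]
conflicts = {("A1", "A2"), ("B1", "B2"), ("C1", "C2")}
draw = uniform_valid_draw(pot1, pot2, conflicts, rng)
```

Repeating the call many times and tabulating the results gives a uniform baseline against which the frequencies of a sequential draw procedure could be compared.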
-
Optimization via Rejection-Free Partial Neighbor Search
Authors:
Sigeng Chen,
Jeffrey S. Rosenthal,
Aki Dote,
Hirotaka Tamura,
Ali Sheikholeslami
Abstract:
Simulated Annealing using Metropolis steps at decreasing temperatures is widely used to solve complex combinatorial optimization problems. In order to improve its efficiency, we can use the Rejection-Free version of the Metropolis algorithm, which avoids the inefficiency of rejections by considering all the neighbors at every step. To prevent the algorithm from becoming stuck in local extreme areas, we propose an enhanced version of Rejection-Free called Partial Neighbor Search (PNS), which only considers random parts of the neighbors while applying Rejection-Free. We demonstrate the superior performance of the Rejection-Free PNS algorithm by applying these methods to several examples, such as the QUBO problem, the Knapsack problem, the 3R3XOR problem, and quadratic programming.
Submitted 7 October, 2022; v1 submitted 15 April, 2022;
originally announced May 2022.
-
Equivalences of Geometric Ergodicity of Markov Chains
Authors:
M. A. Gallegos-Herrada,
D. Ledvinka,
J. S. Rosenthal
Abstract:
This paper gathers together different conditions which are all equivalent to geometric ergodicity of time-homogeneous Markov chains on general state spaces. A total of 34 different conditions are presented (27 for general chains plus 7 just for reversible chains), some old and some new, in terms of such notions as convergence bounds, drift conditions, spectral properties, etc., with different assumptions about the distance metric used, finiteness of function moments, initial distribution, uniformity of bounds, and more. Proofs of the connections between the different conditions are provided, mostly self-contained but using some results from the literature where appropriate.
Submitted 3 July, 2023; v1 submitted 8 March, 2022;
originally announced March 2022.
-
Optimal Strategies and Rules for the Game of Horse
Authors:
Daniel Rosenthal,
Jeffrey S. Rosenthal
Abstract:
We investigate the probability of scoring a point when playing the basketball shooting game called "Horse". We show that under the Traditional Rules, it is optimal to choose very easy shots. We propose alternative rules called Pops Rules, and show that they lead to more difficult optimal shots, and thus to a more interesting game.
Submitted 17 January, 2022;
originally announced January 2022.
-
Do Answers to Boolean Questions Need Explanations? Yes
Authors:
Sara Rosenthal,
Mihaela Bornea,
Avirup Sil,
Radu Florian,
Scott McCarley
Abstract:
Existing datasets that contain boolean questions, such as BoolQ and TyDi QA, provide the user with a YES/NO response to the question. However, a one-word response is not sufficient for an explainable system. We promote explainability by releasing a new set of annotations marking the evidence in existing TyDi QA and BoolQ datasets. We show that our annotations can be used to train a model that extracts improved evidence spans compared to models that rely on existing resources. We confirm our findings with a user study which shows that our extracted evidence spans enhance the user experience. We also provide further insight into the challenges of answering boolean questions, such as passages containing conflicting YES and NO answers, and varying degrees of relevance of the predicted evidence.
Submitted 14 December, 2021;
originally announced December 2021.
-
Convergence rate bounds for iterative random functions using one-shot coupling
Authors:
Sabrina Sixta,
Jeffrey S. Rosenthal
Abstract:
One-shot coupling is a method of bounding the convergence rate between two copies of a Markov chain in total variation distance, which was first introduced by Roberts and Rosenthal and generalized by Madras and Sezer. The method is divided into two parts: the contraction phase, when the chains converge in expected distance and the coalescing phase, which occurs at the last iteration, when there is an attempt to couple. One-shot coupling does not require the use of any exogenous variables like a drift function or a minorization constant. In this paper, we summarize the one-shot coupling method into the One-Shot Coupling Theorem. We then apply the theorem to two families of Markov chains: the random functional autoregressive process and the autoregressive conditional heteroscedastic (ARCH) process. We provide multiple examples of how the theorem can be used on various models including ones in high dimensions. These examples illustrate how the theorem's conditions can be verified in a straightforward way. The one-shot coupling method appears to generate tight geometric convergence rate bounds.
Submitted 1 July, 2022; v1 submitted 7 December, 2021;
originally announced December 2021.
-
SalienTrack: providing salient information for semi-automated self-tracking feedback with model explanations
Authors:
Yunlong Wang,
Jiaying Liu,
Homin Park,
Jordan Schultz-McArdle,
Stephanie Rosenthal,
Judy Kay,
Brian Y. Lim
Abstract:
Self-tracking can improve people's awareness of their unhealthy behaviors and support reflection to inform behavior change. Increasingly, new technologies make tracking easier, leading to large amounts of tracked data. However, much of that information is not salient for reflection and self-awareness. To tackle this burden for reflection, we created the SalienTrack framework, which aims to 1) identify salient tracking events, 2) select the salient details of those events, 3) explain why they are informative, and 4) present the details as manually elicited or automatically shown feedback. We implemented SalienTrack in the context of nutrition tracking. To do this, we first conducted a field study to collect photo-based mobile food tracking over 1-5 weeks. We then report how we used this data to train an explainable-AI model of salience. Finally, we created interfaces to present salient information and conducted a formative user study to gain insights about how SalienTrack could be integrated into an interface for reflection. Our key contributions are the SalienTrack framework, a demonstration of its implementation for semi-automated feedback in an important and challenging self-tracking context and a discussion of the broader uses of the framework.
Submitted 16 February, 2022; v1 submitted 21 September, 2021;
originally announced September 2021.
-
Bayesian Inference of Globular Cluster Properties Using Distribution Functions
Authors:
Gwendolyn M. Eadie,
Jeremy J. Webb,
Jeffrey S. Rosenthal
Abstract:
We present a Bayesian inference approach to estimating the cumulative mass profile and mean squared velocity profile of a globular cluster given the spatial and kinematic information of its stars. Mock globular clusters with a range of sizes and concentrations are generated from lowered isothermal dynamical models, from which we test the reliability of the Bayesian method to estimate model parameters through repeated statistical simulation. We find that given unbiased star samples, we are able to reconstruct the cluster parameters used to generate the mock cluster and the cluster's cumulative mass and mean velocity squared profiles with good accuracy. We further explore how strongly biased sampling, which could be the result of observing constraints, may affect this approach. Our tests indicate that if we instead have biased samples, then our estimates can be off in certain ways that are dependent on cluster morphology. Overall, our findings motivate obtaining samples of stars that are as unbiased as possible. This may be achieved by combining information from multiple telescopes (e.g., Hubble and Gaia), but will require careful modeling of the measurement uncertainties through a hierarchical model, which we plan to pursue in future work.
Submitted 30 August, 2021;
originally announced August 2021.
-
SemEval-2021 Task 9: Fact Verification and Evidence Finding for Tabular Data in Scientific Documents (SEM-TAB-FACTS)
Authors:
Nancy X. R. Wang,
Diwakar Mahajan,
Marina Danilevsky,
Sara Rosenthal
Abstract:
Understanding tables is an important and relevant task that involves understanding table structure as well as being able to compare and contrast information within cells. In this paper, we address this challenge by presenting a new dataset and tasks that address this goal in a shared task in SemEval 2021 Task 9: Fact Verification and Evidence Finding for Tabular Data in Scientific Documents (SEM-TAB-FACTS). Our dataset contains 981 manually-generated tables and an auto-generated dataset of 1980 tables providing over 180K statements and over 16M evidence annotations. SEM-TAB-FACTS featured two sub-tasks. In sub-task A, the goal was to determine if a statement is supported, refuted or unknown in relation to a table. In sub-task B, the focus was on identifying the specific cells of a table that provide evidence for the statement. 69 teams signed up to participate in the task with 19 successful submissions to subtask A and 12 successful submissions to subtask B. We present our results and main findings from the competition.
Submitted 28 May, 2021;
originally announced May 2021.
-
Dimension-free Mixing for High-dimensional Bayesian Variable Selection
Authors:
Quan Zhou,
Jun Yang,
Dootika Vats,
Gareth O. Roberts,
Jeffrey S. Rosenthal
Abstract:
Yang et al. (2016) proved that the symmetric random walk Metropolis--Hastings algorithm for Bayesian variable selection is rapidly mixing under mild high-dimensional assumptions. We propose a novel MCMC sampler using an informed proposal scheme, which we prove achieves a much faster mixing time that is independent of the number of covariates, under the same assumptions. To the best of our knowledge, this is the first high-dimensional result which rigorously shows that the mixing rate of informed MCMC methods can be fast enough to offset the computational cost of local posterior evaluation. Motivated by the theoretical analysis of our sampler, we further propose a new approach called "two-stage drift condition" to studying convergence rates of Markov chains on general state spaces, which can be useful for obtaining tight complexity bounds in high-dimensional settings. The practical advantages of our algorithm are illustrated by both simulation studies and real data analysis.
Submitted 23 April, 2022; v1 submitted 12 May, 2021;
originally announced May 2021.
-
Sampling by Divergence Minimization
Authors:
Ameer Dharamshi,
Vivian Ngo,
Jeffrey S. Rosenthal
Abstract:
We introduce a Markov Chain Monte Carlo (MCMC) method that is designed to sample from target distributions with irregular geometry using an adaptive scheme. In cases where targets exhibit non-Gaussian behaviour, we propose that adaption should be regional rather than global. Our algorithm minimizes the information projection component of the Kullback-Leibler (KL) divergence between the proposal and target distributions to encourage proposals that are distributed similarly to the regional geometry of the target. Unlike traditional adaptive MCMC, this procedure rapidly adapts to the geometry of the target's current position as it explores the surrounding space without the need for many preexisting samples. The divergence minimization algorithms are tested on target distributions with irregularly shaped modes and we provide results demonstrating the effectiveness of our methods.
Submitted 6 May, 2022; v1 submitted 2 May, 2021;
originally announced May 2021.
-
Are Multilingual BERT models robust? A Case Study on Adversarial Attacks for Multilingual Question Answering
Authors:
Sara Rosenthal,
Mihaela Bornea,
Avirup Sil
Abstract:
Recent approaches have exploited weaknesses in monolingual question answering (QA) models by adding adversarial statements to the passage. These attacks caused a reduction in state-of-the-art performance by almost 50%. In this paper, we are the first to explore and successfully attack a multilingual QA (MLQA) system pre-trained on multilingual BERT using several attack strategies for the adversarial statement reducing performance by as much as 85%. We show that the model gives priority to English and the language of the question regardless of the other languages in the QA pair. Further, we also show that adding our attack strategies during training helps alleviate the attacks.
Submitted 15 April, 2021;
originally announced April 2021.
-
Impact of Explanation on Trust of a Novel Mobile Robot
Authors:
Stephanie Rosenthal,
Elizabeth J. Carter
Abstract:
One challenge with introducing robots into novel environments is misalignment between supervisor expectations and reality, which can greatly affect a user's trust and continued use of the robot. We performed an experiment to test whether the presence of an explanation of expected robot behavior affected a supervisor's trust in an autonomous robot. We measured trust both subjectively through surveys and objectively through a dual-task experiment design to capture supervisors' neglect tolerance (i.e., their willingness to perform their own task while the robot is acting autonomously). Our objective results show that explanations can help counteract the novelty effect of seeing a new robot perform in an unknown environment. Participants who received an explanation of the robot's behavior were more likely to focus on their own task at the risk of neglecting their robot supervision task during the first trials of the robot's behavior compared to those who did not receive an explanation. However, this effect diminished after seeing multiple trials, and participants who received explanations were equally trusting of the robot's behavior as those who did not receive explanations. Interestingly, participants were not able to identify their own changes in trust through their survey responses, demonstrating that the dual-task design measured subtler changes in a supervisor's trust.
Submitted 26 January, 2021;
originally announced January 2021.
-
Multilingual Transfer Learning for QA Using Translation as Data Augmentation
Authors:
Mihaela Bornea,
Lin Pan,
Sara Rosenthal,
Radu Florian,
Avirup Sil
Abstract:
Prior work on multilingual question answering has mostly focused on using large multilingual pre-trained language models (LM) to perform zero-shot language-wise learning: train a QA model on English and test on other languages. In this work, we explore strategies that improve cross-lingual transfer by bringing the multilingual embeddings closer in the semantic space. Our first strategy augments the original English training data with machine translation-generated data. This results in a corpus of multilingual silver-labeled QA pairs that is 14 times larger than the original training set. In addition, we propose two novel strategies, language adversarial training and language arbitration framework, which significantly improve the (zero-resource) cross-lingual transfer performance and result in LM embeddings that are less language-variant. Empirically, we show that the proposed models outperform the previous zero-shot baseline on the recently introduced multilingual MLQA and TyDiQA datasets.
Submitted 10 December, 2020;
originally announced December 2020.
-
Convergence Rates of Attractive-Repulsive MCMC Algorithms
Authors:
Yu Hang Jiang,
Tong Liu,
Zhiya Lou,
Jeffrey S. Rosenthal,
Shanshan Shangguan,
Fei Wang,
Zixuan Wu
Abstract:
We consider MCMC algorithms for certain particle systems which include both attractive and repulsive forces, making their convergence analysis challenging. We prove that a version of these algorithms on a bounded state space is uniformly ergodic with an explicit quantitative convergence rate. We also prove that a version on an unbounded state-space is still geometrically ergodic, and then use the method of shift-coupling to obtain an explicit quantitative bound on its convergence rate.
Submitted 1 September, 2021; v1 submitted 8 December, 2020;
originally announced December 2020.
-
MCMC Confidence Intervals and Biases
Authors:
Yu Hang Jiang,
Tong Liu,
Zhiya Lou,
Jeffrey S. Rosenthal,
Shanshan Shangguan,
Fei Wang,
Zixuan Wu
Abstract:
The recent paper "Simple confidence intervals for MCMC without CLTs" by J.S. Rosenthal derived a simple MCMC confidence interval using only Chebyshev's inequality, not the CLT. That result required certain assumptions about how the estimator bias and variance grow with the number of iterations $n$. In particular, the bias is $o(1/\sqrt{n})$. This assumption seemed mild. It is generally believed that the estimator bias will be $O(1/n)$ and hence $o(1/\sqrt{n})$. However, questions were raised by researchers about how to verify this assumption. Indeed, we show that this assumption might not always hold. In this paper, we seek to simplify and weaken the assumptions in the previously mentioned paper, to make MCMC confidence intervals without CLTs more widely applicable.
Submitted 29 June, 2021; v1 submitted 4 December, 2020;
originally announced December 2020.
-
Introducing a new high-resolution handwritten digits data set with writer characteristics
Authors:
Cédric Beaulac,
Jeffrey S. Rosenthal
Abstract:
The contributions in this article are two-fold. First, we introduce a new handwritten digit data set that we collected. It contains high-resolution images of handwritten digits together with various writer characteristics which are not available in the well-known MNIST database. The multiple writer characteristics gathered are a novelty of our data set and create new research opportunities. The data set is publicly available online. Second, we analyse this new data set. We begin with simple supervised tasks. We assess the predictability of the writer characteristics gathered, the effect of using some of those characteristics as predictors in classification tasks and the effect of higher resolution images on classification accuracy. We also explore semi-supervised applications; we can leverage the high quantity of handwritten digits data sets already existing online to improve the accuracy of various classification tasks with noticeable success. Finally, we also demonstrate the generative perspective offered by this new data set; we are able to generate images that mimic the writing style of specific writers. The data set has unique and distinct features and our analysis establishes benchmarks and showcases some of the new opportunities made possible with this new data set.
Submitted 13 April, 2022; v1 submitted 4 November, 2020;
originally announced November 2020.
-
Skew Brownian Motion and Complexity of the ALPS Algorithm
Authors:
Gareth O. Roberts,
Jeffrey S. Rosenthal,
Nicholas G. Tawn
Abstract:
Simulated tempering is a popular method of allowing MCMC algorithms to move between modes of a multimodal target density π. The paper [24] introduced the Annealed Leap-Point Sampler (ALPS) to allow for rapid movement between modes. In this paper, we prove that, under appropriate assumptions, a suitably scaled version of the ALPS algorithm converges weakly to skew Brownian motion. Our results show that under appropriate assumptions, the ALPS algorithm mixes in time $O(d[\log(d)]^2)$ or $O(d)$, depending on which version is used.
Submitted 12 May, 2021; v1 submitted 25 September, 2020;
originally announced September 2020.
-
The Coupling/Minorization/Drift Approach to Markov Chain Convergence Rates
Authors:
Yu Hang Jiang,
Tong Liu,
Zhiya Lou,
Jeffrey S. Rosenthal,
Shanshan Shangguan,
Fei Wang,
Zixuan Wu
Abstract:
This review paper provides an introduction to Markov chains and their convergence rates, an important and interesting mathematical topic that also has important applications to the very widely used Markov chain Monte Carlo (MCMC) algorithms. We first discuss eigenvalue analysis for Markov chains on finite state spaces. Then, using the coupling construction, we prove two quantitative bounds based on minorization and drift conditions, and provide descriptive and intuitive examples to showcase how these theorems can be implemented in practice. This paper is meant to provide a general overview of the subject and spark interest in new Markov chain research areas.
Submitted 1 September, 2021; v1 submitted 24 August, 2020;
originally announced August 2020.
-
SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)
Authors:
Marcos Zampieri,
Preslav Nakov,
Sara Rosenthal,
Pepa Atanasova,
Georgi Karadzhov,
Hamdy Mubarak,
Leon Derczynski,
Zeses Pitenis,
Çağrı Çöltekin
Abstract:
We present the results and main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2020). The task involves three subtasks corresponding to the hierarchical taxonomy of the OLID schema (Zampieri et al., 2019a) from OffensEval 2019. The task featured five languages: English, Arabic, Danish, Greek, and Turkish for Subtask A. In addition, English also featured Subtasks B and C. OffensEval 2020 was one of the most popular tasks at SemEval-2020 attracting a large number of participants across all subtasks and also across all languages. A total of 528 teams signed up to participate in the task, 145 teams submitted systems during the evaluation period, and 70 submitted system description papers.
Submitted 30 September, 2020; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Learning Bounded Koopman Observables: Results on Stability, Continuity, and Controllability
Authors:
Craig Bakker,
Thiagarajan Ramachandran,
W. Steven Rosenthal
Abstract:
The Koopman operator is a useful analytical tool for studying dynamical systems -- both controlled and uncontrolled. For example, Koopman eigenfunctions can provide non-local stability information about the underlying dynamical system. Koopman representations of nonlinear systems are commonly calculated using machine learning methods, which seek to represent the Koopman eigenfunctions as linear combinations of nonlinear state measurements. As such, it is important to understand whether, in principle, these eigenfunctions can be successfully obtained using machine learning and what eigenfunctions calculated in this way can tell us about the underlying system. To that end, this paper presents an analysis of continuity, stability and control limitations associated with Koopman eigenfunctions under minimal assumptions and provides a discussion that relates these properties to the ability to calculate Koopman representations with machine learning.
Submitted 30 April, 2020;
originally announced April 2020.
-
SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification
Authors:
Sara Rosenthal,
Pepa Atanasova,
Georgi Karadzhov,
Marcos Zampieri,
Preslav Nakov
Abstract:
The widespread use of offensive content in social media has led to an abundance of research in detecting language such as hate speech, cyberbullying, and cyber-aggression. Recent work presented the OLID dataset, which follows a taxonomy for offensive language identification that provides meaningful information for understanding the type and the target of offensive messages. However, it is limited in size and it might be biased towards offensive language as it was collected using keywords. In this work, we present SOLID, an expanded dataset, where the tweets were collected in a more principled manner. SOLID contains over nine million English tweets labeled in a semi-supervised fashion. We demonstrate that using SOLID along with OLID yields sizable performance gains on the OLID test set for two different models, especially for the lower levels of the taxonomy.
Submitted 24 September, 2021; v1 submitted 29 April, 2020;
originally announced April 2020.
-
An evaluation of machine learning techniques to predict the outcome of children treated for Hodgkin-Lymphoma on the AHOD0031 trial: A report from the Children's Oncology Group
Authors:
Cédric Beaulac,
Jeffrey S. Rosenthal,
Qinglin Pei,
Debra Friedman,
Suzanne Wolden,
David Hodgson
Abstract:
In this manuscript we analyze a data set containing information on children with Hodgkin Lymphoma (HL) enrolled on a clinical trial. Treatments received and survival status were collected together with other covariates such as demographics and clinical measurements. Our main task is to explore the potential of machine learning (ML) algorithms in a survival analysis context in order to improve over the Cox Proportional Hazard (CoxPH) model. We discuss the weaknesses of the CoxPH model we would like to improve upon and then we introduce multiple algorithms, from well-established ones to state-of-the-art models, that solve these issues. We then compare every model according to the concordance index and the Brier score. Finally, we produce a series of recommendations, based on our experience, for practitioners who would like to benefit from the recent advances in artificial intelligence.
Submitted 26 March, 2021; v1 submitted 15 January, 2020;
originally announced January 2020.
-
SemEval-2013 Task 2: Sentiment Analysis in Twitter
Authors:
Preslav Nakov,
Zornitsa Kozareva,
Alan Ritter,
Sara Rosenthal,
Veselin Stoyanov,
Theresa Wilson
Abstract:
In recent years, sentiment analysis in social media has attracted a lot of research interest and has been used for a number of applications. Unfortunately, research has been hindered by the lack of suitable datasets, complicating the comparison between approaches. To address this issue, we have proposed SemEval-2013 Task 2: Sentiment Analysis in Twitter, which included two subtasks: A, an expression-level subtask, and B, a message-level subtask. We used crowdsourcing on Amazon Mechanical Turk to label a large Twitter training dataset along with additional test sets of Twitter and SMS messages for both subtasks. All datasets used in the evaluation are released to the research community. The task attracted significant interest and a total of 149 submissions from 44 teams. The best-performing team achieved an F1 of 88.9% and 69% for subtasks A and B, respectively.
Submitted 14 December, 2019;
originally announced December 2019.
-
SemEval-2014 Task 9: Sentiment Analysis in Twitter
Authors:
Sara Rosenthal,
Preslav Nakov,
Alan Ritter,
Veselin Stoyanov
Abstract:
We describe the Sentiment Analysis in Twitter task, run as part of SemEval-2014. It is a continuation of last year's task, which ran successfully as part of SemEval-2013. As in 2013, this was the most popular SemEval task; a total of 46 teams contributed 27 submissions for subtask A (21 teams) and 50 submissions for subtask B (44 teams). This year, we introduced three new test sets: (i) regular tweets, (ii) sarcastic tweets, and (iii) LiveJournal sentences. We further tested on (iv) 2013 tweets, and (v) 2013 SMS messages. The highest F1-score on (i) was achieved by NRC-Canada at 86.63 for subtask A and by TeamX at 70.96 for subtask B.
Submitted 6 December, 2019;
originally announced December 2019.
-
SemEval-2015 Task 10: Sentiment Analysis in Twitter
Authors:
Sara Rosenthal,
Saif M Mohammad,
Preslav Nakov,
Alan Ritter,
Svetlana Kiritchenko,
Veselin Stoyanov
Abstract:
In this paper, we describe the 2015 iteration of the SemEval shared task on Sentiment Analysis in Twitter. This was the most popular sentiment analysis shared task to date with more than 40 teams participating in each of the last three years. This year's shared task competition consisted of five sentiment prediction subtasks. Two were reruns from previous years: (A) sentiment expressed by a phrase in the context of a tweet, and (B) overall sentiment of a tweet. We further included three new subtasks asking to predict (C) the sentiment towards a topic in a single tweet, (D) the overall sentiment towards a topic in a set of tweets, and (E) the degree of prior polarity of a phrase.
Submitted 5 December, 2019;
originally announced December 2019.
-
SemEval-2016 Task 4: Sentiment Analysis in Twitter
Authors:
Preslav Nakov,
Alan Ritter,
Sara Rosenthal,
Fabrizio Sebastiani,
Veselin Stoyanov
Abstract:
This paper discusses the fourth year of the "Sentiment Analysis in Twitter Task". SemEval-2016 Task 4 comprises five subtasks, three of which represent a significant departure from previous editions. The first two subtasks are reruns from prior years and ask to predict the overall sentiment, and the sentiment towards a topic in a tweet. The three new subtasks focus on two variants of the basic "sentiment classification in Twitter" task. The first variant adopts a five-point scale, which confers an ordinal character to the classification task. The second variant focuses on the correct estimation of the prevalence of each class of interest, a task which has been called quantification in the supervised learning literature. The task continues to be very popular, attracting a total of 43 teams.
Submitted 3 December, 2019;
originally announced December 2019.
-
SemEval-2017 Task 4: Sentiment Analysis in Twitter
Authors:
Sara Rosenthal,
Noura Farra,
Preslav Nakov
Abstract:
This paper describes the fifth year of the Sentiment Analysis in Twitter task. SemEval-2017 Task 4 continues with a rerun of the subtasks of SemEval-2016 Task 4, which include identifying the overall sentiment of the tweet, sentiment towards a topic with classification on a two-point and on a five-point ordinal scale, and quantification of the distribution of sentiment towards a topic across a number of tweets: again on a two-point and on a five-point ordinal scale. Compared to 2016, we made two changes: (i) we introduced a new language, Arabic, for all subtasks, and (ii) we made available information from the profiles of the Twitter users who posted the target tweets. The task continues to be very popular, with a total of 48 teams participating this year.
Submitted 2 December, 2019;
originally announced December 2019.
-
Jump Markov Chains and Rejection-Free Metropolis Algorithms
Authors:
J. S. Rosenthal,
A. Dote,
K. Dabiri,
H. Tamura,
S. Chen,
A. Sheikholeslami
Abstract:
We consider versions of the Metropolis algorithm which avoid the inefficiency of rejections. We first illustrate that a natural Uniform Selection Algorithm might not converge to the correct distribution. We then analyse the use of Markov jump chains which avoid successive repetitions of the same state. After exploring the properties of jump chains, we show how they can exploit parallelism in computer hardware to produce more efficient samples. We apply our results to the Metropolis algorithm, to Parallel Tempering, to a Bayesian model, to a two-dimensional ferromagnetic 4 x 4 Ising model, and to a pseudo-marginal MCMC algorithm.
Submitted 28 October, 2020; v1 submitted 29 October, 2019;
originally announced October 2019.
-
Koopman Representations of Dynamic Systems with Control
Authors:
Craig Bakker,
Steven Rosenthal,
Kathleen E. Nowak
Abstract:
The design and analysis of optimal control policies for dynamical systems can be complicated by nonlinear dependence in the state variables. Koopman operators have been used to simplify the analysis of dynamical systems by mapping the flow of the system onto a space of observables where the dynamics are linear (and possibly infinite). This paper focuses on the development of consistent Koopman representations for controlled dynamical systems. We introduce the concept of dynamical consistency for Koopman representations and analyze several existing and proposed representations, deriving necessary constraints on the dynamical system, observables, and Koopman operators. Our main result is a hybrid formulation which independently and jointly observes the state and control inputs. This formulation admits a relatively large space of dynamical systems compared to earlier formulations while keeping the Koopman operator independent of the state and control inputs. More generally, this work provides an analysis framework to evaluate and rank proposed simplifications to the general Koopman representation for controlled dynamical systems.
Submitted 6 August, 2019;
originally announced August 2019.
-
Optimal Scaling of Random-Walk Metropolis Algorithms on General Target Distributions
Authors:
Jun Yang,
Gareth O. Roberts,
Jeffrey S. Rosenthal
Abstract:
One main limitation of the existing optimal scaling results for Metropolis-Hastings algorithms is that the assumptions on the target distribution are unrealistic. In this paper, we consider optimal scaling of random-walk Metropolis algorithms on general target distributions in high dimensions arising from practical MCMC models from Bayesian statistics. For optimal scaling by maximizing expected squared jumping distance (ESJD), we show the asymptotically optimal acceptance rate $0.234$ can be obtained under general realistic sufficient conditions on the target distribution. The new sufficient conditions are easy to verify and may hold for some general classes of MCMC models arising from Bayesian statistics applications, which substantially generalize the product i.i.d. condition required in most existing literature on optimal scaling. Furthermore, we show one-dimensional diffusion limits can be obtained under slightly stronger conditions, which still allow dependent coordinates of the target distribution. We also connect the new diffusion limit results to complexity bounds of Metropolis algorithms in high dimensions.
Submitted 4 May, 2020; v1 submitted 27 April, 2019;
originally announced April 2019.
-
SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)
Authors:
Marcos Zampieri,
Shervin Malmasi,
Preslav Nakov,
Sara Rosenthal,
Noura Farra,
Ritesh Kumar
Abstract:
We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets. It featured three sub-tasks. In sub-task A, the goal was to discriminate between offensive and non-offensive posts. In sub-task B, the focus was on the type of offensive content in the post. Finally, in sub-task C, systems had to detect the target of the offensive posts. OffensEval attracted a large number of participants and it was one of the most popular tasks in SemEval-2019. In total, about 800 teams signed up to participate in the task, and 115 of them submitted results, which we present and analyze in this report.
Submitted 26 April, 2019; v1 submitted 19 March, 2019;
originally announced March 2019.
-
Predicting the Type and Target of Offensive Posts in Social Media
Authors:
Marcos Zampieri,
Shervin Malmasi,
Preslav Nakov,
Sara Rosenthal,
Noura Farra,
Ritesh Kumar
Abstract:
As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbullying, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we compiled the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. We further experiment with and compare the performance of different machine learning models on OLID.
Submitted 16 April, 2019; v1 submitted 25 February, 2019;
originally announced February 2019.
-
Simple Confidence Intervals for MCMC Without CLTs
Authors:
Jeffrey S. Rosenthal
Abstract:
This short note argues that 95% confidence intervals for MCMC estimates can be obtained even without establishing a CLT, by multiplying their widths by 2.3.
Submitted 30 November, 2018;
originally announced December 2018.
-
A Deep Latent-Variable Model Application to Select Treatment Intensity in Survival Analysis
Authors:
Cédric Beaulac,
Jeffrey S. Rosenthal,
David Hodgson
Abstract:
In the following short article we adapt a new and popular machine learning model for inference on medical data sets. Our method is based on the Variational AutoEncoder (VAE) framework that we adapt to survival analysis on small data sets with missing values. In our model, the true health status appears as a set of latent variables that affects the observed covariates and the survival chances. We show that this flexible model allows insightful decision-making using a predicted distribution and outperforms a classic survival analysis model.
Submitted 29 November, 2018;
originally announced November 2018.
-
Ten Simple Rules for Reproducible Research in Jupyter Notebooks
Authors:
Adam Rule,
Amanda Birmingham,
Cristal Zuniga,
Ilkay Altintas,
Shih-Cheng Huang,
Rob Knight,
Niema Moshiri,
Mai H. Nguyen,
Sara Brin Rosenthal,
Fernando Pérez,
Peter W. Rose
Abstract:
Reproducibility of computational studies is a hallmark of scientific methodology. It enables researchers to build with confidence on the methods and findings of others, reuse and extend computational pipelines, and thereby drive scientific progress. Since many experimental studies rely on computational analyses, biologists need guidance on how to set up and document reproducible data analyses or simulations.
In this paper, we address several questions about reproducibility. For example, what are the technical and non-technical barriers to reproducible computational studies? What opportunities and challenges do computational notebooks offer to overcome some of these barriers? What tools are available and how can they be used effectively?
We have developed a set of rules to serve as a guide to scientists with a specific focus on computational notebook systems, such as Jupyter Notebooks, which have become a tool of choice for many applications. Notebooks combine detailed workflows with narrative text and visualization of results. Combined with software repositories and open source licensing, notebooks are powerful tools for transparent, collaborative, reproducible, and reusable data analyses.
Submitted 13 October, 2018;
originally announced October 2018.
-
Trimmed Ensemble Kalman Filter for Nonlinear and Non-Gaussian Data Assimilation Problems
Authors:
Weixuan Li,
W. Steven Rosenthal,
Guang Lin
Abstract:
We study the ensemble Kalman filter (EnKF) algorithm for sequential data assimilation in a general situation, that is, for nonlinear forecast and measurement models with non-additive and non-Gaussian noises. Such applications traditionally force us to choose between inaccurate Gaussian assumptions that permit efficient algorithms (e.g., EnKF), or more accurate direct sampling methods which scale poorly with dimension (e.g., particle filters, or PF). We introduce a trimmed ensemble Kalman filter (TEnKF) which can interpolate between the limiting distributions of the EnKF and PF to facilitate adaptive control over both accuracy and efficiency. This is achieved by introducing a trimming function that removes non-Gaussian outliers that introduce errors in the correlation between the model and observed forecast, which otherwise prevent the EnKF from proposing accurate forecast updates. We show for specific trimming functions that the TEnKF exactly reproduces the limiting distributions of the EnKF and PF. We also develop an adaptive implementation which provides control of the effective sample size and allows the filter to overcome periods of increased model nonlinearity. This algorithm allows us to demonstrate substantial improvements over the traditional EnKF in convergence and robustness for the nonlinear Lorenz-63 and Lorenz-96 models.
Submitted 15 August, 2018;
originally announced August 2018.
-
Weight-Preserving Simulated Tempering
Authors:
Nicholas G. Tawn,
Gareth O. Roberts,
Jeffrey S. Rosenthal
Abstract:
Simulated tempering is a popular method of allowing MCMC algorithms to move between modes of a multimodal target density π. One problem with simulated tempering for multimodal targets is that the weights of the various modes change for different inverse-temperature values, sometimes dramatically so. In this paper, we provide a fix to overcome this problem, by adjusting the mode weights to be preserved (i.e., constant) over different inverse-temperature settings. We then apply simulated tempering algorithms to multimodal targets using our mode weight correction. We present simulations in which our weight-preserving algorithm mixes between modes much more successfully than traditional tempering algorithms. We also prove a diffusion limit for a version of our algorithm, which shows that under appropriate assumptions, our algorithm mixes in time O(d [log d]^2).
Submitted 11 February, 2019; v1 submitted 14 August, 2018;
originally announced August 2018.