-
FlowVQA: Map** Multimodal Logic in Visual Question Answering with Flowcharts
Authors:
Shubhankar Singh,
Purvi Chaurasia,
Yerram Varun,
Pranshu Pandya,
Vatsal Gupta,
Vivek Gupta,
Dan Roth
Abstract:
Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart im…
▽ More
Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.
△ Less
Submitted 28 June, 2024; v1 submitted 27 June, 2024;
originally announced June 2024.
-
FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation
Authors:
Bangzheng Li,
Ben Zhou,
Xingyu Fu,
Fei Wang,
Dan Roth,
Muhao Chen
Abstract:
Language models have shown impressive in-context-learning capabilities, which allow them to benefit from input prompts and perform better on downstream end tasks. Existing works investigate the mechanisms behind this observation, and propose label-agnostic prompt metrics that can better estimate end-task performances. One popular approach is using perplexity as a way to measure models' familiarity…
▽ More
Language models have shown impressive in-context-learning capabilities, which allow them to benefit from input prompts and perform better on downstream end tasks. Existing works investigate the mechanisms behind this observation, and propose label-agnostic prompt metrics that can better estimate end-task performances. One popular approach is using perplexity as a way to measure models' familiarity with the prompt. While showing consistent improvements on in-domain tasks, we found that familiarity metrics such as perplexity cannot accurately estimate performance in complicated situations such as task or domain transferring scenarios. In this work, we propose a revised measure called FamiCom, providing a more comprehensive measure for task-agnostic performance estimation. Specifically, FamiCom combines familiarity with \textit{complexity} -- the inherent difficulty of end tasks, which is an important factor missing from current metrics. Experiments show that FamiCom strongly correlates with end-task performances, producing a 0.85 Spearman's correlation, versus 0.43 of familiarity-only ones'. We further apply FamiCom to automatic prompt and demonstration selection, and outperform existing methods and baselines by more than 7.0% in accuracy.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners
Authors:
Bowen Jiang,
Yangxinyu Xie,
Zhuoqun Hao,
Xiaomeng Wang,
Tanwi Mallick,
Weijie J. Su,
Camillo J. Taylor,
Dan Roth
Abstract:
This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or primarily depend on token bias. We go beyond evaluating LLMs on accuracy; rather, we aim to investigate their token bias in solving logical reasoning tasks. Specifically, we develop carefully controlled synthetic datasets, featuring conjunction fallacy and syll…
▽ More
This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or primarily depend on token bias. We go beyond evaluating LLMs on accuracy; rather, we aim to investigate their token bias in solving logical reasoning tasks. Specifically, we develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems. Our framework outlines a list of hypotheses where token biases are readily identifiable, with all null hypotheses assuming genuine reasoning capabilities of LLMs. The findings in this study suggest, with statistical guarantee, that most LLMs still struggle with logical reasoning. While they may perform well on classic problems, their success largely depends on recognizing superficial patterns with strong token bias, thereby raising concerns about their actual reasoning and generalization abilities.
△ Less
Submitted 16 June, 2024;
originally announced June 2024.
-
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Authors:
Fei Wang,
Xingyu Fu,
James Y. Huang,
Zekun Li,
Qin Liu,
Xiaogeng Liu,
Mingyu Derek Ma,
Nan Xu,
Wenxuan Zhou,
Kai Zhang,
Tianyi Lorena Yan,
Wenjie Jacky Mo,
Hsiang-Hui Liu,
Pan Lu,
Chunyuan Li,
Chaowei Xiao,
Kai-Wei Chang,
Dan Roth,
Sheng Zhang,
Hoifung Poon,
Muhao Chen
Abstract:
We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a…
▽ More
We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.
△ Less
Submitted 1 July, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Authors:
Yushi Hu,
Weijia Shi,
Xingyu Fu,
Dan Roth,
Mari Ostendorf,
Luke Zettlemoyer,
Noah A Smith,
Ranjay Krishna
Abstract:
Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In t…
▽ More
Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn. Different from prior work, which uses text-to-image models to enable LMs to draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning. Sketchpad can also use specialist vision models during the sketching process (e.g., draw bounding boxes with object detection models, draw masks with segmentation models), to further enhance visual perception and reasoning. We experiment with a wide range of math tasks (including geometry, functions, graphs, and chess) and complex visual reasoning tasks. Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All codes and data are in https://visualsketchpad.github.io/.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?
Authors:
Xingyu Fu,
Muyu He,
Yujie Lu,
William Yang Wang,
Dan Roth
Abstract:
We present a novel task and benchmark for evaluating the ability of text-to-image(T2I) generation models to produce images that fit commonsense in real life, which we call Commonsense-T2I. Given two adversarial text prompts containing an identical set of action words with minor differences, such as "a lightbulb without electricity" v.s. "a lightbulb with electricity", we evaluate whether T2I model…
▽ More
We present a novel task and benchmark for evaluating the ability of text-to-image(T2I) generation models to produce images that fit commonsense in real life, which we call Commonsense-T2I. Given two adversarial text prompts containing an identical set of action words with minor differences, such as "a lightbulb without electricity" v.s. "a lightbulb with electricity", we evaluate whether T2I models can conduct visual-commonsense reasoning, e.g. produce images that fit "the lightbulb is unlit" vs. "the lightbulb is lit" correspondingly. Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs. The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as commonsense type and likelihood of the expected outputs, to assist analyzing model behavior. We benchmark a variety of state-of-the-art (sota) T2I models and surprisingly find that, there is still a large gap between image synthesis and real life photos--even the DALL-E 3 model could only achieve 48.92% on Commonsense-T2I, and the stable diffusion XL model only achieves 24.92% accuracy. Our experiments show that GPT-enriched prompts cannot solve this challenge, and we include a detailed analysis about possible reasons for such deficiency. We aim for Commonsense-T2I to serve as a high-quality evaluation benchmark for T2I commonsense checking, fostering advancements in real life image generation.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models
Authors:
Aparna Elangovan,
Ling Liu,
Lei Xu,
Sravan Bodapati,
Dan Roth
Abstract:
In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, a…
▽ More
In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, aesthetics, and cognitive biases. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. Furthermore, the evaluation should differentiate the capabilities and weaknesses of increasingly powerful large language models -- which requires effective test sets. The scalability of human evaluation is also crucial to wider adoption. Hence, to design an effective human evaluation system in the age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars --Consistency, Scoring Critera, Differentiating, User Experience, Responsible, and Scalability.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Devil's Advocate: Anticipatory Reflection for LLM Agents
Authors:
Haoyu Wang,
Tao Li,
Zhiwei Deng,
Dan Roth,
Yang Li
Abstract:
In this work, we introduce a novel approach that equips LLM agents with introspection, enhancing consistency and adaptability in solving complex tasks. Our approach prompts LLM agents to decompose a given task into manageable subtasks (i.e., to make a plan), and to continuously introspect upon the suitability and results of their actions. %; and when necessary, to explore ``the road not taken.'' W…
▽ More
In this work, we introduce a novel approach that equips LLM agents with introspection, enhancing consistency and adaptability in solving complex tasks. Our approach prompts LLM agents to decompose a given task into manageable subtasks (i.e., to make a plan), and to continuously introspect upon the suitability and results of their actions. %; and when necessary, to explore ``the road not taken.'' We implement a three-fold introspective intervention: 1) anticipatory reflection on potential failures and alternative remedy before action execution, 2) post-action alignment with subtask objectives and backtracking with remedy to ensure utmost effort in plan execution, and 3) comprehensive review upon plan completion for future strategy refinement. By deploying and experimenting with this methodology -- a zero-shot approach -- within WebArena for practical tasks in web environments, our agent demonstrates superior performance with a success rate of 23.5% over existing zero-shot methods by 3.5%. The experimental results suggest that our introspection-driven approach not only enhances the agent's ability to navigate unanticipated challenges through a robust mechanism of plan execution, but also improves efficiency by reducing the number of trials and plan revisions by 45% needed to achieve a task.
△ Less
Submitted 20 June, 2024; v1 submitted 25 May, 2024;
originally announced May 2024.
-
BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models
Authors:
Yu Feng,
Ben Zhou,
Weidong Lin,
Dan Roth
Abstract:
Large language models primarily rely on inductive reasoning for decision making. This results in unreliable decisions when applied to real-world tasks that often present incomplete contexts and conditions. Thus, accurate probability estimation and appropriate interpretations are required to enhance decision-making reliability. In this paper, we propose a Bayesian inference framework called BIRD fo…
▽ More
Large language models primarily rely on inductive reasoning for decision making. This results in unreliable decisions when applied to real-world tasks that often present incomplete contexts and conditions. Thus, accurate probability estimation and appropriate interpretations are required to enhance decision-making reliability. In this paper, we propose a Bayesian inference framework called BIRD for large language models. BIRD provides controllable and interpretable probability estimation for model decisions, based on abductive factors, LLM entailment, as well as learnable deductive Bayesian modeling. Experiments show that BIRD produces probability estimations that align with human judgments over 65% of the time using open-sourced Llama models, outperforming the state-of-the-art GPT-4 by 35%. We also show that BIRD can be directly used for trustworthy decision making on many real-world applications.
△ Less
Submitted 18 April, 2024;
originally announced April 2024.
-
BLINK: Multimodal Large Language Models Can See but Not Perceive
Authors:
Xingyu Fu,
Yushi Hu,
Bangzheng Li,
Yu Feng,
Haoyu Wang,
Xudong Lin,
Dan Roth,
Noah A. Smith,
Wei-Chiu Ma,
Ranjay Krishna
Abstract:
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challeng…
▽ More
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
△ Less
Submitted 3 July, 2024; v1 submitted 18 April, 2024;
originally announced April 2024.
-
Fewer Truncations Improve Language Modeling
Authors:
Hantian Ding,
Zijian Wang,
Giovanni Paolini,
Varun Kumar,
Anoop Deoras,
Dan Roth,
Stefano Soatto
Abstract:
In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and…
▽ More
In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. Our method completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation. Empirical results from both text and code pre-training show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%.
△ Less
Submitted 2 May, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Is Table Retrieval a Solved Problem? Exploring Join-Aware Multi-Table Retrieval
Authors:
Peter Baile Chen,
Yi Zhang,
Dan Roth
Abstract:
Retrieving relevant tables containing the necessary information to accurately answer a given question over tables is critical to open-domain question-answering (QA) systems. Previous methods assume the answer to such a question can be found either in a single table or multiple tables identified through question decomposition or rewriting. However, neither of these approaches is sufficient, as many…
▽ More
Retrieving relevant tables containing the necessary information to accurately answer a given question over tables is critical to open-domain question-answering (QA) systems. Previous methods assume the answer to such a question can be found either in a single table or multiple tables identified through question decomposition or rewriting. However, neither of these approaches is sufficient, as many questions require retrieving multiple tables and joining them through a join plan that cannot be discerned from the user query itself. If the join plan is not considered in the retrieval stage, the subsequent steps of reasoning and answering based on those retrieved tables are likely to be incorrect. To address this problem, we introduce a method that uncovers useful join relations for any query and database during table retrieval. We use a novel re-ranking method formulated as a mixed-integer program that considers not only table-query relevance but also table-table relevance that requires inferring join relationships. Our method outperforms the state-of-the-art approaches for table retrieval by up to 9.3% in F1 score and for end-to-end QA by up to 5.4% in accuracy.
△ Less
Submitted 5 June, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Conceptual and Unbiased Reasoning in Language Models
Authors:
Ben Zhou,
Hongming Zhang,
Sihao Chen,
Dian Yu,
Hongwei Wang,
Baolin Peng,
Dan Roth,
Dong Yu
Abstract:
Conceptual reasoning, the ability to reason in abstract and high-level perspectives, is key to generalization in human cognition. However, limited study has been done on large language models' capability to perform conceptual reasoning. In this work, we bridge this gap and propose a novel conceptualization framework that forces models to perform conceptual reasoning on abstract questions and gener…
▽ More
Conceptual reasoning, the ability to reason in abstract and high-level perspectives, is key to generalization in human cognition. However, limited study has been done on large language models' capability to perform conceptual reasoning. In this work, we bridge this gap and propose a novel conceptualization framework that forces models to perform conceptual reasoning on abstract questions and generate solutions in a verifiable symbolic space. Using this framework as an analytical tool, we show that existing large language models fall short on conceptual reasoning, drop** 9% to 28% on various benchmarks compared to direct inference methods. We then discuss how models can improve since high-level abstract reasoning is key to unbiased and generalizable decision-making. We propose two techniques to add trustworthy induction signals by generating familiar questions with similar underlying reasoning paths and asking models to perform self-refinement. Experiments show that our proposed techniques improve models' conceptual reasoning performance by 8% to 11%, achieving a more robust reasoning system that relies less on inductive biases.
△ Less
Submitted 29 March, 2024;
originally announced April 2024.
-
ASDF: Assembly State Detection Utilizing Late Fusion by Integrating 6D Pose Estimation
Authors:
Hannah Schieber,
Shiyu Li,
Niklas Corell,
Philipp Beckerle,
Julian Kreimeier,
Daniel Roth
Abstract:
In medical and industrial domains, providing guidance for assembly processes is critical to ensure efficiency and safety. Errors in assembly can lead to significant consequences such as extended surgery times, and prolonged manufacturing or maintenance times in industry. Assembly scenarios can benefit from in-situ AR visualization to provide guidance, reduce assembly times and minimize errors. To…
▽ More
In medical and industrial domains, providing guidance for assembly processes is critical to ensure efficiency and safety. Errors in assembly can lead to significant consequences such as extended surgery times, and prolonged manufacturing or maintenance times in industry. Assembly scenarios can benefit from in-situ AR visualization to provide guidance, reduce assembly times and minimize errors. To enable in-situ visualization 6D pose estimation can be leveraged. Existing 6D pose estimation techniques primarily focus on individual objects and static captures. However, assembly scenarios have various dynamics including occlusion during assembly and dynamics in the assembly objects appearance. Existing work, combining object detection/6D pose estimation and assembly state detection focuses either on pure deep learning-based approaches, or limit the assembly state detection to building blocks. To address the challenges of 6D pose estimation in combination with assembly state detection, our approach ASDF builds upon the strengths of YOLOv8, a real-time capable object detection framework. We extend this framework, refine the object pose and fuse pose knowledge with network-detected pose information. Utilizing our late fusion in our Pose2State module results in refined 6D pose estimation and assembly state detection. By combining both pose and state information, our Pose2State module predicts the final assembly state with precision. Our evaluation on our ASDF dataset shows that our Pose2State module leads to an improved assembly state detection and that the improvement of the assembly state further leads to a more robust 6D pose estimation. Moreover, on the GBOT dataset, we outperform the pure deep learning-based network, and even outperform the hybrid and pure tracking-based approaches.
△ Less
Submitted 11 April, 2024; v1 submitted 24 March, 2024;
originally announced March 2024.
-
Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering
Authors:
Bowen Jiang,
Zhijun Zhuang,
Shreyas S. Shivakumar,
Dan Roth,
Camillo J. Taylor
Abstract:
This work explores the zero-shot capabilities of foundation models in Visual Question Answering (VQA) tasks. We propose an adaptive multi-agent system, named Multi-Agent VQA, to overcome the limitations of foundation models in object detection and counting by using specialized agents as tools. Unlike existing approaches, our study focuses on the system's performance without fine-tuning it on speci…
▽ More
This work explores the zero-shot capabilities of foundation models in Visual Question Answering (VQA) tasks. We propose an adaptive multi-agent system, named Multi-Agent VQA, to overcome the limitations of foundation models in object detection and counting by using specialized agents as tools. Unlike existing approaches, our study focuses on the system's performance without fine-tuning it on specific VQA datasets, making it more practical and robust in the open world. We present preliminary experimental results under zero-shot scenarios and highlight some failure cases, offering new directions for future research.
△ Less
Submitted 21 March, 2024;
originally announced March 2024.
-
From Instructions to Constraints: Language Model Alignment with Automatic Constraint Verification
Authors:
Fei Wang,
Chao Shang,
Sarthak Jain,
Shuai Wang,
Qiang Ning,
Bonan Min,
Vittorio Castelli,
Yassine Benajiba,
Dan Roth
Abstract:
User alignment is crucial for adapting general-purpose language models (LMs) to downstream tasks, but human annotations are often not available for all types of instructions, especially those with customized constraints. We observe that user instructions typically contain constraints. While assessing response quality in terms of the whole instruction is often costly, efficiently evaluating the sat…
▽ More
User alignment is crucial for adapting general-purpose language models (LMs) to downstream tasks, but human annotations are often not available for all types of instructions, especially those with customized constraints. We observe that user instructions typically contain constraints. While assessing response quality in terms of the whole instruction is often costly, efficiently evaluating the satisfaction rate of constraints is feasible. We investigate common constraints in NLP tasks, categorize them into three classes based on the types of their arguments, and propose a unified framework, ACT (Aligning to ConsTraints), to automatically produce supervision signals for user alignment with constraints. Specifically, ACT uses constraint verifiers, which are typically easy to implement in practice, to compute constraint satisfaction rate (CSR) of each response. It samples multiple responses for each prompt and collect preference labels based on their CSR automatically. Subsequently, ACT adapts the LM to the target task through a ranking-based learning process. Experiments on fine-grained entity ty**, abstractive summarization, and temporal question answering show that ACT is able to enhance LMs' capability to adhere to different classes of constraints, thereby improving task performance. Further experiments show that the constraint-following capabilities are transferable.
△ Less
Submitted 10 March, 2024;
originally announced March 2024.
-
Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering
Authors:
Pragya Srivastava,
Manuj Malik,
Vivek Gupta,
Tanuja Ganu,
Dan Roth
Abstract:
Large Language Models (LLMs), excel in natural language understanding, but their capability for complex mathematical reasoning with an amalgamation of structured tables and unstructured text is uncertain. This study explores LLMs' mathematical reasoning on four financial tabular question-answering datasets: TATQA, FinQA, ConvFinQA, and Multihiertt. Through extensive experiments with various models…
▽ More
Large Language Models (LLMs), excel in natural language understanding, but their capability for complex mathematical reasoning with an amalgamation of structured tables and unstructured text is uncertain. This study explores LLMs' mathematical reasoning on four financial tabular question-answering datasets: TATQA, FinQA, ConvFinQA, and Multihiertt. Through extensive experiments with various models and prompting techniques, we assess how LLMs adapt to complex tables and mathematical tasks. We focus on sensitivity to table complexity and performance variations with an increasing number of arithmetic reasoning steps. The results provide insights into LLMs' capabilities and limitations in handling complex mathematical scenarios for semi-structured tables. Ultimately, we introduce a novel prompting technique tailored to semi-structured documents, matching or outperforming other baselines in performance while providing a nuanced understanding of LLMs abilities for such a task.
△ Less
Submitted 29 February, 2024; v1 submitted 17 February, 2024;
originally announced February 2024.
-
GBOT: Graph-Based 3D Object Tracking for Augmented Reality-Assisted Assembly Guidance
Authors:
Shiyu Li,
Hannah Schieber,
Niklas Corell,
Bernhard Egger,
Julian Kreimeier,
Daniel Roth
Abstract:
Guidance for assemblable parts is a promising field for augmented reality. Augmented reality assembly guidance requires 6D object poses of target objects in real time. Especially in time-critical medical or industrial settings, continuous and markerless tracking of individual parts is essential to visualize instructions superimposed on or next to the target object parts. In this regard, occlusions…
▽ More
Guidance for assemblable parts is a promising field for augmented reality. Augmented reality assembly guidance requires 6D object poses of target objects in real time. Especially in time-critical medical or industrial settings, continuous and markerless tracking of individual parts is essential to visualize instructions superimposed on or next to the target object parts. In this regard, occlusions by the user's hand or other objects and the complexity of different assembly states complicate robust and real-time markerless multi-object tracking. To address this problem, we present Graph-based Object Tracking (GBOT), a novel graph-based single-view RGB-D tracking approach. The real-time markerless multi-object tracking is initialized via 6D pose estimation and updates the graph-based assembly poses. The tracking through various assembly states is achieved by our novel multi-state assembly graph. We update the multi-state assembly graph by utilizing the relative poses of the individual assembly parts. Linking the individual objects in this graph enables more robust object tracking during the assembly process. For evaluation, we introduce a synthetic dataset of publicly available and 3D printable assembly assets as a benchmark for future work. Quantitative experiments in synthetic data and further qualitative study in real test data show that GBOT can outperform existing work towards enabling context-aware augmented reality assembly guidance. Dataset and code will be made publically available.
△ Less
Submitted 12 February, 2024;
originally announced February 2024.
-
DeAL: Decoding-time Alignment for Large Language Models
Authors:
James Y. Huang,
Sailik Sengupta,
Daniele Bonadiman,
Yi-an Lai,
Arshit Gupta,
Nikolaos Pappas,
Saab Mansour,
Katrin Kirchhoff,
Dan Roth
Abstract:
Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences. Current work focuses on alignment at model training time, through techniques such as Reinforcement Learning with Human Feedback (RLHF). However, it is unclear if such methods are an effective choice to teach alignment objectives to the model. First, the inability to incorporate multiple, custom r…
▽ More
Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences. Current work focuses on alignment at model training time, through techniques such as Reinforcement Learning with Human Feedback (RLHF). However, it is unclear if such methods are an effective choice to teach alignment objectives to the model. First, the inability to incorporate multiple, custom rewards and reliance on a model developer's view of universal and static principles are key limitations. Second, the residual gaps in model training and the reliability of such approaches are also questionable (e.g. susceptibility to jail-breaking even after safety training). To address these, we propose DeAL, a framework that allows the user to customize reward functions and enables Decoding-time Alignment of LLMs (DeAL). At its core, we view decoding as a heuristic-guided search process and facilitate the use of a wide variety of alignment objectives. Our experiments with programmatic constraints such as keyword and length constraints (studied widely in the pre-LLM era) and abstract objectives such as harmlessness and helpfulness (proposed in the post-LLM era) show that we can DeAL with fine-grained trade-offs, improve adherence to alignment objectives, and address residual gaps in LLMs. Lastly, while DeAL can be effectively paired with RLHF and prompting techniques, its generality makes decoding slower, an optimization we leave for future work.
△ Less
Submitted 20 February, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
Code Representation Learning At Scale
Authors:
Dejiao Zhang,
Wasi Ahmad,
Ming Tan,
Hantian Ding,
Ramesh Nallapati,
Dan Roth,
Xiaofei Ma,
Bing Xiang
Abstract:
Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, i.e., code generation. However, most of the existing works on code representation learning train models at a hundred million parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-st…
▽ More
Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, i.e., code generation. However, most of the existing works on code representation learning train models at a hundred million parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders via a mix that leverages both randomness in masking language modeling and the structure aspect of programming language. We then enhance the representations via contrastive learning with hard negative and hard positive constructed in an unsupervised manner. We establish an off-the-shelf encoder model that persistently outperforms the existing models on a wide variety of downstream tasks by large margins. To comprehend the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boost the cross-lingual semantic search performance; and (iv) how the pretraining schemes decide the downstream task performance scales with the model size.
△ Less
Submitted 2 February, 2024;
originally announced February 2024.
-
Open the Black Box: Step-based Policy Updates for Temporally-Correlated Episodic Reinforcement Learning
Authors:
Ge Li,
Hongyi Zhou,
Dominik Roth,
Serge Thilges,
Fabian Otto,
Rudolf Lioutikov,
Gerhard Neumann
Abstract:
Current advancements in reinforcement learning (RL) have predominantly focused on learning step-based policies that generate actions for each perceived state. While these methods efficiently leverage step information from environmental interaction, they often ignore the temporal correlation between actions, resulting in inefficient exploration and unsmooth trajectories that are challenging to impl…
▽ More
Current advancements in reinforcement learning (RL) have predominantly focused on learning step-based policies that generate actions for each perceived state. While these methods efficiently leverage step information from environmental interaction, they often ignore the temporal correlation between actions, resulting in inefficient exploration and unsmooth trajectories that are challenging to implement on real hardware. Episodic RL (ERL) seeks to overcome these challenges by exploring in parameters space that capture the correlation of actions. However, these approaches typically compromise data efficiency, as they treat trajectories as opaque \emph{black boxes}. In this work, we introduce a novel ERL algorithm, Temporally-Correlated Episodic RL (TCE), which effectively utilizes step information in episodic policy updates, opening the 'black box' in existing ERL methods while retaining the smooth and consistent exploration in parameter space. TCE synergistically combines the advantages of step-based and episodic RL, achieving comparable performance to recent ERL methods while maintaining data efficiency akin to state-of-the-art (SoTA) step-based RL.
△ Less
Submitted 21 January, 2024;
originally announced January 2024.
-
Deceptive Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination?
Authors:
Bangzheng Li,
Ben Zhou,
Fei Wang,
Xingyu Fu,
Dan Roth,
Muhao Chen
Abstract:
Despite the recent advancement in large language models (LLMs) and their high performances across numerous benchmarks, recent research has unveiled that LLMs suffer from hallucinations and unfaithful reasoning. This work studies a specific type of hallucination induced by semantic associations. Specifically, we investigate to what extent LLMs take shortcuts from certain keyword/entity biases in th…
▽ More
Despite the recent advancement in large language models (LLMs) and their high performances across numerous benchmarks, recent research has unveiled that LLMs suffer from hallucinations and unfaithful reasoning. This work studies a specific type of hallucination induced by semantic associations. Specifically, we investigate to what extent LLMs take shortcuts from certain keyword/entity biases in the prompt instead of following the correct reasoning path. To quantify this phenomenon, we propose a novel probing method and benchmark called EureQA. We start from questions that LLMs will answer correctly with utmost certainty, and mask the important entity with evidence sentence recursively, asking models to find masked entities according to a chain of evidence before answering the question.
During the construction of the evidence, we purposefully replace semantic clues (entities) that may lead to the correct answer with distractor clues (evidence) that will not directly lead to the correct answer but require a chain-like reasoning process. We evaluate if models can follow the correct reasoning chain instead of short-cutting through distractor clues. We find that existing LLMs lack the necessary capabilities to follow correct reasoning paths and resist the attempt of greedy shortcuts. We show that the distractor semantic associations often lead to model hallucination, which is strong evidence that questions the validity of current LLM reasoning.
△ Less
Submitted 5 April, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
What if you said that differently?: How Explanation Formats Affect Human Feedback Efficacy and User Perception
Authors:
Chaitanya Malaviya,
Subin Lee,
Dan Roth,
Mark Yatskar
Abstract:
Eliciting feedback from end users of NLP models can be beneficial for improving models. However, how should we present model responses to users so they are most amenable to be corrected from user feedback? Further, what properties do users value to understand and trust responses? We answer these questions by analyzing the effect of rationales (or explanations) generated by QA models to support the…
▽ More
Eliciting feedback from end users of NLP models can be beneficial for improving models. However, how should we present model responses to users so they are most amenable to be corrected from user feedback? Further, what properties do users value to understand and trust responses? We answer these questions by analyzing the effect of rationales (or explanations) generated by QA models to support their answers. We specifically consider decomposed QA models that first extract an intermediate rationale based on a context and a question and then use solely this rationale to answer the question. A rationale outlines the approach followed by the model to answer the question. Our work considers various formats of these rationales that vary according to well-defined properties of interest. We sample rationales from language models using few-shot prompting for two datasets, and then perform two user studies. First, we present users with incorrect answers and corresponding rationales in various formats and ask them to provide natural language feedback to revise the rationale. We then measure the effectiveness of this feedback in patching these rationales through in-context learning. The second study evaluates how well different rationale formats enable users to understand and trust model answers, when they are correct. We find that rationale formats significantly affect how easy it is (1) for users to give feedback for rationales, and (2) for models to subsequently execute this feedback. In addition, formats with attributions to the context and in-depth reasoning significantly enhance user-reported understanding and trust of model outputs.
△ Less
Submitted 1 April, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
On the Calibration of Multilingual Question Answering LLMs
Authors:
Yahan Yang,
Soham Dan,
Dan Roth,
Insup Lee
Abstract:
Multilingual pre-trained Large Language Models (LLMs) are incredibly effective at Question Answering (QA), a core task in Natural Language Understanding, achieving high accuracies on several multilingual benchmarks. However, little is known about how well their confidences are calibrated. In this paper, we comprehensively benchmark the calibration of several multilingual LLMs (MLLMs) on a variety…
▽ More
Multilingual pre-trained Large Language Models (LLMs) are incredibly effective at Question Answering (QA), a core task in Natural Language Understanding, achieving high accuracies on several multilingual benchmarks. However, little is known about how well their confidences are calibrated. In this paper, we comprehensively benchmark the calibration of several multilingual LLMs (MLLMs) on a variety of QA tasks. We perform extensive experiments, spanning encoder-only, encoder-decoder, and decoder-only QA models (size varying from 110M to 7B parameters) and diverse languages, including both high- and low-resource ones. We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings, and investigate strategies to improve it, including post-hoc methods and regularized fine-tuning. For decoder-only LLMs such as LlaMa2, we additionally find that in-context learning improves confidence calibration on multilingual data. We also conduct several ablation experiments to study the effect of language distances, language corpus size, and model size on calibration, and how multilingual models compare with their monolingual counterparts for diverse tasks and languages. Our experiments suggest that the multilingual QA models are poorly calibrated for languages other than English and incorporating a small set of cheaply translated multilingual samples during fine-tuning/calibration effectively enhances the calibration performance.
△ Less
Submitted 15 April, 2024; v1 submitted 14 November, 2023;
originally announced November 2023.
-
Multi-Set Inoculation: Assessing Model Robustness Across Multiple Challenge Sets
Authors:
Vatsal Gupta,
Pranshu Pandya,
Tushar Kataria,
Vivek Gupta,
Dan Roth
Abstract:
Language models, given their black-box nature, often exhibit sensitivity to input perturbations, leading to trust issues due to hallucinations. To bolster trust, it's essential to understand these models' failure modes and devise strategies to enhance their performance. In this study, we propose a framework to study the effect of input perturbations on language models of different scales, from pre…
▽ More
Language models, given their black-box nature, often exhibit sensitivity to input perturbations, leading to trust issues due to hallucinations. To bolster trust, it's essential to understand these models' failure modes and devise strategies to enhance their performance. In this study, we propose a framework to study the effect of input perturbations on language models of different scales, from pre-trained models to large language models (LLMs). We use fine-tuning to train a robust model to perturbations, and we investigate whether exposure to one perturbation improves or degrades the model's performance on other perturbations. To address multi-perturbation robustness, we suggest three distinct training strategies. We also extend the framework to LLMs via a chain of thought(COT) prompting with exemplars. We instantiate our framework for the Tabular-NLI task and show that the proposed strategies train the model robust to different perturbations without losing accuracy on a given dataset.
△ Less
Submitted 14 November, 2023;
originally announced November 2023.
-
Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations
Authors:
Sihao Chen,
Hongming Zhang,
Tong Chen,
Ben Zhou,
Wenhao Yu,
Dian Yu,
Baolin Peng,
Hongwei Wang,
Dan Roth,
Dong Yu
Abstract:
We introduce sub-sentence encoder, a contrastively-learned contextual embedding model for fine-grained semantic representation of text. In contrast to the standard practice with sentence embeddings, where the meaning of an entire sequence of text is encoded into a fixed-length vector, the sub-sentence encoder learns to produce distinct contextual embeddings corresponding to different atomic propos…
▽ More
We introduce sub-sentence encoder, a contrastively-learned contextual embedding model for fine-grained semantic representation of text. In contrast to the standard practice with sentence embeddings, where the meaning of an entire sequence of text is encoded into a fixed-length vector, the sub-sentence encoder learns to produce distinct contextual embeddings corresponding to different atomic propositions, i.e. atomic units of meaning expressed within a text sequence. The sub-sentence embeddings are contrastively learned to recognize (inferred) semantic equivalence between propositions across different text sequences. Our experiments show the effectiveness of sub-sentence encoders in applications, such as retrieving supporting facts for fine-grained text attribution or recognizing the conditional semantic similarity between texts. In practice, we demonstrate that sub-sentence encoders keep the same level of inference cost and space complexity compared to sentence encoders.
△ Less
Submitted 7 November, 2023;
originally announced November 2023.
-
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks
Authors:
Xiaodong Yu,
Hao Cheng,
Xiaodong Liu,
Dan Roth,
Jianfeng Gao
Abstract:
Despite remarkable advancements in mitigating hallucinations in large language models (LLMs) by retrieval augmentation, it remains challenging to measure the reliability of LLMs using static question-answering (QA) data. Specifically, given the potential of data contamination (e.g., leading to memorization), good static benchmark performance does not ensure that model can reliably use the provided…
▽ More
Despite remarkable advancements in mitigating hallucinations in large language models (LLMs) by retrieval augmentation, it remains challenging to measure the reliability of LLMs using static question-answering (QA) data. Specifically, given the potential of data contamination (e.g., leading to memorization), good static benchmark performance does not ensure that model can reliably use the provided evidence for responding, which is essential to avoid hallucination when the required knowledge is new or private. Inspired by adversarial machine learning, we investigate the feasibility of automatically perturbing existing static one for dynamic evaluation. Specifically, this paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases for evaluating the LLMs' reliability in using new evidence for answering.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets on a collection of LLMs under various prompting settings. Our generated data is human-readable and useful to trigger hallucination in LLM. Accurate models on static data are observed to produce unsupported answers from the perturbed evidence, with pronounced accuracy drops across LLMs including GPT-4. We find that our adversarial examples are transferable across all considered LLMs. The examples generated by a small model can be used to evaluate a much larger model, making our approach cost-effective.
△ Less
Submitted 31 May, 2024; v1 submitted 19 October, 2023;
originally announced October 2023.
-
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
Authors:
Yangruibo Ding,
Zijian Wang,
Wasi Uddin Ahmad,
Hantian Ding,
Ming Tan,
Nihal Jain,
Murali Krishna Ramanathan,
Ramesh Nallapati,
Parminder Bhatia,
Dan Roth,
Bing Xiang
Abstract:
Code completion models have made significant progress in recent years, yet current popular evaluation datasets, such as HumanEval and MBPP, predominantly focus on code completion tasks within a single file. This over-simplified setting falls short of representing the real-world software development scenario where repositories span multiple files with numerous cross-file dependencies, and accessing…
▽ More
Code completion models have made significant progress in recent years, yet current popular evaluation datasets, such as HumanEval and MBPP, predominantly focus on code completion tasks within a single file. This over-simplified setting falls short of representing the real-world software development scenario where repositories span multiple files with numerous cross-file dependencies, and accessing and understanding cross-file context is often required to complete the code correctly.
To fill in this gap, we propose CrossCodeEval, a diverse and multilingual code completion benchmark that necessitates an in-depth cross-file contextual understanding to complete the code accurately. CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively-licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. To create examples that strictly require cross-file context for accurate completion, we propose a straightforward yet efficient static-analysis-based approach to pinpoint the use of cross-file context within the current file.
Extensive experiments on state-of-the-art code language models like CodeGen and StarCoder demonstrate that CrossCodeEval is extremely challenging when the relevant cross-file context is absent, and we see clear improvements when adding these context into the prompt. However, despite such improvements, the pinnacle of performance remains notably unattained even with the highest-performing model, indicating that CrossCodeEval is also capable of assessing model's capability in leveraging extensive context to make better code completion. Finally, we benchmarked various methods in retrieving cross-file context, and show that CrossCodeEval can also be used to measure the capability of code retrievers.
△ Less
Submitted 16 November, 2023; v1 submitted 17 October, 2023;
originally announced October 2023.
-
SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation
Authors:
Hangfeng He,
Hongming Zhang,
Dan Roth
Abstract:
To comprehensively gauge the capacity of current models for complex reasoning, it is crucial to assess their step-by-step reasoning in a scalable manner. Established reference-based evaluation metrics rely on human-annotated reasoning chains as references to assess the model-derived chains. However, such "gold-standard" human-written reasoning chains may not be unique and their acquisition is ofte…
▽ More
To comprehensively gauge the capacity of current models for complex reasoning, it is crucial to assess their step-by-step reasoning in a scalable manner. Established reference-based evaluation metrics rely on human-annotated reasoning chains as references to assess the model-derived chains. However, such "gold-standard" human-written reasoning chains may not be unique and their acquisition is often labor-intensive. Existing reference-free reasoning evaluation metrics, while eliminating the need for human-crafted reasoning chains as references, often require fine-tuning with human-derived chains before evaluation, complicating the process and questioning their adaptability to other datasets. To address these challenges, we harness GPT-4 to automatically evaluate reasoning chain quality, thereby removing the dependency on human-written reasoning chains for both model fine-tuning and evaluative purposes. Leveraging the Socratic method, we develop SocREval ({\bf Soc}ratic Method-Inspired {\bf R}easoning {\bf Eval}uation), a novel approach for prompt design in reference-free reasoning evaluation. Empirical results from four human annotated datasets reveal that SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics. Beyond its demonstrated efficacy, SocREval, proves to be both cost-efficient and robust to prompt writing and example selection, as substantiated by our in-depth analysis.
△ Less
Submitted 18 April, 2024; v1 submitted 29 September, 2023;
originally announced October 2023.
-
DynaMoN: Motion-Aware Fast and Robust Camera Localization for Dynamic Neural Radiance Fields
Authors:
Nicolas Schischka,
Hannah Schieber,
Mert Asim Karaoglu,
Melih Görgülü,
Florian Grötzner,
Alexander Ladikos,
Daniel Roth,
Nassir Navab,
Benjamin Busam
Abstract:
The accurate reconstruction of dynamic scenes with neural radiance fields is significantly dependent on the estimation of camera poses. Widely used structure-from-motion pipelines encounter difficulties in accurately tracking the camera trajectory when faced with separate dynamics of the scene content and the camera movement. To address this challenge, we propose DynaMoN. DynaMoN utilizes semantic…
▽ More
The accurate reconstruction of dynamic scenes with neural radiance fields is significantly dependent on the estimation of camera poses. Widely used structure-from-motion pipelines encounter difficulties in accurately tracking the camera trajectory when faced with separate dynamics of the scene content and the camera movement. To address this challenge, we propose DynaMoN. DynaMoN utilizes semantic segmentation and generic motion masks to handle dynamic content for initial camera pose estimation and statics-focused ray sampling for fast and accurate novel-view synthesis. Our novel iterative learning scheme switches between training the NeRF and updating the pose parameters for an improved reconstruction and trajectory estimation quality. The proposed pipeline shows significant acceleration of the training process. We extensively evaluate our approach on two real-world dynamic datasets, the TUM RGB-D and the BONN RGB-D Dynamic dataset. DynaMoN improves over the state-of-the-art both in terms of reconstruction quality and trajectory accuracy. We plan to make our code public to enhance research in this area.
△ Less
Submitted 17 March, 2024; v1 submitted 16 September, 2023;
originally announced September 2023.
-
ExpertQA: Expert-Curated Questions and Attributed Answers
Authors:
Chaitanya Malaviya,
Subin Lee,
Sihao Chen,
Elizabeth Sieber,
Mark Yatskar,
Dan Roth
Abstract:
As language models are adopted by a more sophisticated and diverse set of users, the importance of guaranteeing that they provide factually correct information supported by verifiable sources is critical across fields of study. This is especially the case for high-stakes fields, such as medicine and law, where the risk of propagating false information is high and can lead to undesirable societal c…
▽ More
As language models are adopted by a more sophisticated and diverse set of users, the importance of guaranteeing that they provide factually correct information supported by verifiable sources is critical across fields of study. This is especially the case for high-stakes fields, such as medicine and law, where the risk of propagating false information is high and can lead to undesirable societal consequences. Previous work studying attribution and factuality has not focused on analyzing these characteristics of language model outputs in domain-specific scenarios. In this work, we conduct human evaluation of responses from a few representative systems along various axes of attribution and factuality, by bringing domain experts in the loop. Specifically, we collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions. In addition, we ask experts to improve upon responses from language models. The output of our analysis is ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.
△ Less
Submitted 1 April, 2024; v1 submitted 14 September, 2023;
originally announced September 2023.
-
End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining
Authors:
Saksham Bassi,
Giulio Duregon,
Siddhartha Jalagam,
David Roth
Abstract:
The SOTA in transcription of disfluent and conversational speech has in recent years favored two-stage models, with separate transcription and cleaning stages. We believe that previous attempts at end-to-end disfluency removal have fallen short because of the representational advantage that large-scale language model pretraining has given to lexical models. Until recently, the high dimensionality…
▽ More
The SOTA in transcription of disfluent and conversational speech has in recent years favored two-stage models, with separate transcription and cleaning stages. We believe that previous attempts at end-to-end disfluency removal have fallen short because of the representational advantage that large-scale language model pretraining has given to lexical models. Until recently, the high dimensionality and limited availability of large audio datasets inhibited the development of large-scale self-supervised pretraining objectives for learning effective audio representations, giving a relative advantage to the two-stage approach, which utilises pretrained representations for lexical tokens. In light of recent successes in large scale audio pretraining, we revisit the performance comparison between two-stage and end-to-end model and find that audio based language models pretrained using weak self-supervised objectives match or exceed the performance of similarly trained two-stage models, and further, that the choice of pretraining objective substantially effects a model's ability to be adapted to the disfluency removal task.
△ Less
Submitted 8 September, 2023;
originally announced September 2023.
-
Variations and Relaxations of Normalizing Flows
Authors:
Keegan Kelly,
Lorena Piedras,
Sukrit Rao,
David Roth
Abstract:
Normalizing Flows (NFs) describe a class of models that express a complex target distribution as the composition of a series of bijective transformations over a simpler base distribution. By limiting the space of candidate transformations to diffeomorphisms, NFs enjoy efficient, exact sampling and density evaluation, enabling NFs to flexibly behave as both discriminative and generative models. The…
▽ More
Normalizing Flows (NFs) describe a class of models that express a complex target distribution as the composition of a series of bijective transformations over a simpler base distribution. By limiting the space of candidate transformations to diffeomorphisms, NFs enjoy efficient, exact sampling and density evaluation, enabling NFs to flexibly behave as both discriminative and generative models. Their restriction to diffeomorphisms, however, enforces that input, output and all intermediary spaces share the same dimension, limiting their ability to effectively represent target distributions with complex topologies. Additionally, in cases where the prior and target distributions are not homeomorphic, Normalizing Flows can leak mass outside of the support of the target. This survey covers a selection of recent works that combine aspects of other generative model classes, such as VAEs and score-based diffusion, and in doing so loosen the strict bijectivity constraints of NFs to achieve a balance of expressivity, training speed, sample efficiency and likelihood tractability.
△ Less
Submitted 8 September, 2023;
originally announced September 2023.
-
A search for rare $B \rightarrow D μ^+ μ^-$ decays
Authors:
LHCb collaboration,
R. Aaij,
A. S. W. Abdelmotteleb,
C. Abellan Beteta,
F. Abudinén,
T. Ackernley,
B. Adeva,
M. Adinolfi,
P. Adlarson,
H. Afsharnia,
C. Agapopoulou,
C. A. Aidala,
Z. Ajaltouni,
S. Akar,
K. Akiba,
P. Albicocco,
J. Albrecht,
F. Alessio,
M. Alexander,
A. Alfonso Albero,
Z. Aliouche,
P. Alvarez Cartelle,
R. Amalric,
S. Amato,
J. L. Amey
, et al. (1038 additional authors not shown)
Abstract:
A search for rare $B \rightarrow D μ^+ μ^-$ decays is performed using proton-proton collision data collected by the LHCb experiment, corresponding to an integrated luminosity of 9 fb$^{-1}$. No significant signals are observed in the non-resonant $μ^+μ^-$ modes, and upper limits of $\mathcal{B}(B^0 \rightarrow \overline{D}^0 μ^+ μ^-) < 5.1 \times 10^{-8}$,…
▽ More
A search for rare $B \rightarrow D μ^+ μ^-$ decays is performed using proton-proton collision data collected by the LHCb experiment, corresponding to an integrated luminosity of 9 fb$^{-1}$. No significant signals are observed in the non-resonant $μ^+μ^-$ modes, and upper limits of $\mathcal{B}(B^0 \rightarrow \overline{D}^0 μ^+ μ^-) < 5.1 \times 10^{-8}$, $\mathcal{B}(B^+ \rightarrow D_s^+ μ^+ μ^-) < 3.2 \times 10^{-8}$, $\mathcal{B}(B_s^0 \rightarrow \overline{D}^0 μ^+ μ^-) < 1.6 \times 10^{-7}$ and $f_c/f_u \cdot \mathcal{B}(B_c^+ \rightarrow D_s^+ μ^+ μ^-) < 9.6 \times 10^{-8}$ are set at the 95\% confidence level, where $f_c$ and $f_u$ are the fragmentation fractions of a $B$ meson with a $c$ and $u$ quark respectively in proton-proton collisions. Each result is either the first such measurement or an improvement by three orders of magnitude on an existing limit. Separate upper limits are calculated when the muon pair originates from a $J/ψ\rightarrow μ^+ μ^-$ decay. The branching fraction of $B_c^+ \rightarrow D_s^+ J/ψ$ multiplied by the fragmentation-fraction ratio is measured to be $f_c/f_u \cdot \mathcal{B}(B_c^+ \rightarrow D_s^+ J/ψ) = (1.63 \pm 0.15 \pm 0.13) \times 10^{-5}$, where the first uncertainty is statistical and the second systematic.
△ Less
Submitted 26 February, 2024; v1 submitted 11 August, 2023;
originally announced August 2023.
-
Few-Shot Data-to-Text Generation via Unified Representation and Multi-Source Learning
Authors:
Alexander Hanbo Li,
Mingyue Shang,
Evangelia Spiliopoulou,
Jie Ma,
Patrick Ng,
Zhiguo Wang,
Bonan Min,
William Wang,
Kathleen McKeown,
Vittorio Castelli,
Dan Roth,
Bing Xiang
Abstract:
We present a novel approach for structured data-to-text generation that addresses the limitations of existing methods that primarily focus on specific types of structured data. Our proposed method aims to improve performance in multi-task training, zero-shot and few-shot scenarios by providing a unified representation that can handle various forms of structured data such as tables, knowledge graph…
▽ More
We present a novel approach for structured data-to-text generation that addresses the limitations of existing methods that primarily focus on specific types of structured data. Our proposed method aims to improve performance in multi-task training, zero-shot and few-shot scenarios by providing a unified representation that can handle various forms of structured data such as tables, knowledge graph triples, and meaning representations. We demonstrate that our proposed approach can effectively adapt to new structured forms, and can improve performance in comparison to current methods. For example, our method resulted in a 66% improvement in zero-shot BLEU scores when transferring models trained on table inputs to a knowledge graph dataset. Our proposed method is an important step towards a more general data-to-text generation framework.
△ Less
Submitted 9 August, 2023;
originally announced August 2023.
-
Building Interpretable and Reliable Open Information Retriever for New Domains Overnight
Authors:
Xiaodong Yu,
Ben Zhou,
Dan Roth
Abstract:
Information retrieval (IR) or knowledge retrieval, is a critical component for many down-stream tasks such as open-domain question answering (QA). It is also very challenging, as it requires succinctness, completeness, and correctness. In recent works, dense retrieval models have achieved state-of-the-art (SOTA) performance on in-domain IR and QA benchmarks by representing queries and knowledge pa…
▽ More
Information retrieval (IR) or knowledge retrieval, is a critical component for many down-stream tasks such as open-domain question answering (QA). It is also very challenging, as it requires succinctness, completeness, and correctness. In recent works, dense retrieval models have achieved state-of-the-art (SOTA) performance on in-domain IR and QA benchmarks by representing queries and knowledge passages with dense vectors and learning the lexical and semantic similarity. However, using single dense vectors and end-to-end supervision are not always optimal because queries may require attention to multiple aspects and event implicit knowledge. In this work, we propose an information retrieval pipeline that uses entity/event linking model and query decomposition model to focus more accurately on different information units of the query. We show that, while being more interpretable and reliable, our proposed pipeline significantly improves passage coverages and denotation accuracies across five IR and QA benchmarks. It will be the go-to system to use for applications that need to perform IR on a new domain without much dedicated effort, because of its superior interpretability and cross-domain performance.
△ Less
Submitted 9 August, 2023;
originally announced August 2023.
-
On Regularization and Inference with Label Constraints
Authors:
Kaifu Wang,
Hangfeng He,
Tin D. Nguyen,
Piyush Kumar,
Dan Roth
Abstract:
Prior knowledge and symbolic rules in machine learning are often expressed in the form of label constraints, especially in structured prediction problems. In this work, we compare two common strategies for encoding label constraints in a machine learning pipeline, regularization with constraints and constrained inference, by quantifying their impact on model performance. For regularization, we sho…
▽ More
Prior knowledge and symbolic rules in machine learning are often expressed in the form of label constraints, especially in structured prediction problems. In this work, we compare two common strategies for encoding label constraints in a machine learning pipeline, regularization with constraints and constrained inference, by quantifying their impact on model performance. For regularization, we show that it narrows the generalization gap by precluding models that are inconsistent with the constraints. However, its preference for small violations introduces a bias toward a suboptimal model. For constrained inference, we show that it reduces the population risk by correcting a model's violation, and hence turns the violation into an advantage. Given these differences, we further explore the use of two approaches together and propose conditions for constrained inference to compensate for the bias introduced by regularization, aiming to improve both the model complexity and optimal risk.
△ Less
Submitted 7 July, 2023;
originally announced July 2023.
-
The Integer Linear Programming Inference Cookbook
Authors:
Vivek Srikumar,
Dan Roth
Abstract:
Over the years, integer linear programs have been employed to model inference in many natural language processing problems. This survey is meant to guide the reader through the process of framing a new inference problem as an instance of an integer linear program and is structured as a collection of recipes. At the end, we will see two worked examples to illustrate the use of these recipes.
Over the years, integer linear programs have been employed to model inference in many natural language processing problems. This survey is meant to guide the reader through the process of framing a new inference problem as an instance of an integer linear program and is structured as a collection of recipes. At the end, we will see two worked examples to illustrate the use of these recipes.
△ Less
Submitted 30 June, 2023;
originally announced July 2023.
-
Towards Open-Domain Topic Classification
Authors:
Hantian Ding,
**rui Yang,
Yuqian Deng,
Hongming Zhang,
Dan Roth
Abstract:
We introduce an open-domain topic classification system that accepts user-defined taxonomy in real time. Users will be able to classify a text snippet with respect to any candidate labels they want, and get instant response from our web interface. To obtain such flexibility, we build the backend model in a zero-shot way. By training on a new dataset constructed from Wikipedia, our label-aware text…
▽ More
We introduce an open-domain topic classification system that accepts user-defined taxonomy in real time. Users will be able to classify a text snippet with respect to any candidate labels they want, and get instant response from our web interface. To obtain such flexibility, we build the backend model in a zero-shot way. By training on a new dataset constructed from Wikipedia, our label-aware text classifier can effectively utilize implicit knowledge in the pretrained language model to handle labels it has never seen before. We evaluate our model across four datasets from various domains with different label sets. Experiments show that the model significantly improves over existing zero-shot baselines in open-domain scenarios, and performs competitively with weakly-supervised models trained on in-domain data.
△ Less
Submitted 29 June, 2023;
originally announced June 2023.
-
Large Language Models as Sous Chefs: Revising Recipes with GPT-3
Authors:
Alyssa Hwang,
Bryan Li,
Zhaoyi Hou,
Dan Roth
Abstract:
With their remarkably improved text generation and prompting capabilities, large language models can adapt existing written information into forms that are easier to use and understand. In our work, we focus on recipes as an example of complex, diverse, and widely used instructions. We develop a prompt grounded in the original recipe and ingredients list that breaks recipes down into simpler steps…
▽ More
With their remarkably improved text generation and prompting capabilities, large language models can adapt existing written information into forms that are easier to use and understand. In our work, we focus on recipes as an example of complex, diverse, and widely used instructions. We develop a prompt grounded in the original recipe and ingredients list that breaks recipes down into simpler steps. We apply this prompt to recipes from various world cuisines, and experiment with several large language models (LLMs), finding best results with GPT-3.5. We also contribute an Amazon Mechanical Turk task that is carefully designed to reduce fatigue while collecting human judgment of the quality of recipe revisions. We find that annotators usually prefer the revision over the original, demonstrating a promising application of LLMs in serving as digital sous chefs for recipes and beyond. We release our prompt, code, and MTurk template for public use.
△ Less
Submitted 24 June, 2023;
originally announced June 2023.
-
On Learning Latent Models with Multi-Instance Weak Supervision
Authors:
Kaifu Wang,
Efi Tsamoura,
Dan Roth
Abstract:
We consider a weakly supervised learning scenario where the supervision signal is generated by a transition function $σ$ of labels associated with multiple input instances. We formulate this problem as \emph{multi-instance Partial Label Learning (multi-instance PLL)}, which is an extension to the standard PLL problem. Our problem is met in different fields, including latent structural learning and…
▽ More
We consider a weakly supervised learning scenario where the supervision signal is generated by a transition function $σ$ of labels associated with multiple input instances. We formulate this problem as \emph{multi-instance Partial Label Learning (multi-instance PLL)}, which is an extension to the standard PLL problem. Our problem is met in different fields, including latent structural learning and neuro-symbolic integration. Despite the existence of many learning techniques, limited theoretical analysis has been dedicated to this problem. In this paper, we provide the first theoretical study of multi-instance PLL with possibly an unknown transition $σ$. Our main contributions are as follows. Firstly, we propose a necessary and sufficient condition for the learnability of the problem. This condition non-trivially generalizes and relaxes the existing small ambiguity degree in the PLL literature, since we allow the transition to be deterministic. Secondly, we derive Rademacher-style error bounds based on a top-$k$ surrogate loss that is widely used in the neuro-symbolic literature. Furthermore, we conclude with empirical experiments for learning under unknown transitions. The empirical results align with our theoretical findings; however, they also expose the issue of scalability in the weak supervision literature.
△ Less
Submitted 23 June, 2023;
originally announced June 2023.
-
A Static Evaluation of Code Completion by Large Language Models
Authors:
Hantian Ding,
Varun Kumar,
Yuchen Tian,
Zijian Wang,
Rob Kwiatkowski,
Xiaopeng Li,
Murali Krishna Ramanathan,
Baishakhi Ray,
Parminder Bhatia,
Sudipta Sengupta,
Dan Roth,
Bing Xiang
Abstract:
Large language models trained on code have shown great potential to increase productivity of software developers. Several execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. Nevertheless, it is expensive to perform the same evaluation on complex real-world projects considering the execution cost. On the contrary,…
▽ More
Large language models trained on code have shown great potential to increase productivity of software developers. Several execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. Nevertheless, it is expensive to perform the same evaluation on complex real-world projects considering the execution cost. On the contrary, static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models. In this work, we propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees. Compared with execution-based evaluation, our method is not only more efficient, but also applicable to code in the wild. For experiments, we collect code context from open source repos to generate one million function bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors among others made by language models. Through extensive studies, we also show the impact of sampling temperature, model size, and context on static errors in code completions.
△ Less
Submitted 5 June, 2023;
originally announced June 2023.
-
Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge
Authors:
Xingyu Fu,
Sheng Zhang,
Gukyeong Kwon,
Pramuditha Perera,
Henghui Zhu,
Yuhao Zhang,
Alexander Hanbo Li,
William Yang Wang,
Zhiguo Wang,
Vittorio Castelli,
Patrick Ng,
Dan Roth,
Bing Xiang
Abstract:
The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained Language Models (PLM) such as GPT-3 have been applied to the task and shown to be powerful world knowledge sources. However, these methods suffer from low knowledge coverage caused by PLM bias -- the tendency to generate certa…
▽ More
The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained Language Models (PLM) such as GPT-3 have been applied to the task and shown to be powerful world knowledge sources. However, these methods suffer from low knowledge coverage caused by PLM bias -- the tendency to generate certain tokens over other tokens regardless of prompt changes, and high dependency on the PLM quality -- only models using GPT-3 can achieve the best result.
To address the aforementioned challenges, we propose RASO: a new VQA pipeline that deploys a generate-then-select strategy guided by world knowledge for the first time. Rather than following the de facto standard to train a multi-modal model that directly generates the VQA answer, RASO first adopts PLM to generate all the possible answers, and then trains a lightweight answer selection model for the correct answer. As proved in our analysis, RASO expands the knowledge coverage from in-domain training data by a large margin. We provide extensive experimentation and show the effectiveness of our pipeline by advancing the state-of-the-art by 4.1% on OK-VQA, without additional computation cost. Code and models are released at http://cogcomp.org/page/publication_view/1010
△ Less
Submitted 30 May, 2023;
originally announced May 2023.
-
Characterizing and Measuring Linguistic Dataset Drift
Authors:
Tyler A. Chang,
Kishaloy Halder,
Neha Anna John,
Yogarshi Vyas,
Yassine Benajiba,
Miguel Ballesteros,
Dan Roth
Abstract:
NLP models often degrade in performance when real world data distributions differ markedly from training data. However, existing dataset drift metrics in NLP have generally not considered specific dimensions of linguistic drift that affect model performance, and they have not been validated in their ability to predict model performance at the individual example level, where such metrics are often…
▽ More
NLP models often degrade in performance when real world data distributions differ markedly from training data. However, existing dataset drift metrics in NLP have generally not considered specific dimensions of linguistic drift that affect model performance, and they have not been validated in their ability to predict model performance at the individual example level, where such metrics are often used in practice. In this paper, we propose three dimensions of linguistic dataset drift: vocabulary, structural, and semantic drift. These dimensions correspond to content word frequency divergences, syntactic divergences, and meaning changes not captured by word frequencies (e.g. lexical semantic change). We propose interpretable metrics for all three drift dimensions, and we modify past performance prediction methods to predict model performance at both the example and dataset level for English sentiment classification and natural language inference. We find that our drift metrics are more effective than previous metrics at predicting out-of-domain model accuracies (mean 16.8% root mean square error decrease), particularly when compared to popular fine-tuned embedding distances (mean 47.7% error decrease). Fine-tuned embedding distances are much more effective at ranking individual examples by expected performance, but decomposing into vocabulary, structural, and semantic drift produces the best example rankings of all considered model-agnostic drift metrics (mean 6.7% ROC AUC increase).
△ Less
Submitted 26 May, 2023;
originally announced May 2023.
-
Associated production of prompt $J/ψ$ and $\mathitΥ$ mesons in $pp$ collisions at $\sqrt{s}=13\,\mathrm{TeV}$
Authors:
LHCb collaboration,
R. Aaij,
A. S. W. Abdelmotteleb,
C. Abellan Beteta,
F. Abudinén,
T. Ackernley,
B. Adeva,
M. Adinolfi,
P. Adlarson,
H. Afsharnia,
C. Agapopoulou,
C. A. Aidala,
Z. Ajaltouni,
S. Akar,
K. Akiba,
P. Albicocco,
J. Albrecht,
F. Alessio,
M. Alexander,
A. Alfonso Albero,
Z. Aliouche,
P. Alvarez Cartelle,
R. Amalric,
S. Amato,
J. L. Amey
, et al. (1037 additional authors not shown)
Abstract:
The associated production of prompt $J/ψ$ and $\mathit{\mathitΥ}$ mesons in $pp$ collisions at a centre-of-mass energy of $\sqrt{s}=13\,\mathrm{TeV}$ is studied using LHCb data, corresponding to an integrated luminosity of $4\,\mathrm{fb}^{-1}$. The measurement is performed for $J/ψ$ ($\mathitΥ$) mesons with a transverse momentum $p_{\mathrm{T}}<10\,(30)\,\mathrm{GeV}/c$ in the rapidity range…
▽ More
The associated production of prompt $J/ψ$ and $\mathit{\mathitΥ}$ mesons in $pp$ collisions at a centre-of-mass energy of $\sqrt{s}=13\,\mathrm{TeV}$ is studied using LHCb data, corresponding to an integrated luminosity of $4\,\mathrm{fb}^{-1}$. The measurement is performed for $J/ψ$ ($\mathitΥ$) mesons with a transverse momentum $p_{\mathrm{T}}<10\,(30)\,\mathrm{GeV}/c$ in the rapidity range $2.0<y<4.5$. In this kinematic range, the cross-section of the associated production of prompt $J/ψ$ and $\mathitΥ(1S)$ mesons is measured to be $133 \pm 22 \pm 7 \pm 3 \, \mathrm{pb}$, with a significance of $7.9\,σ$, and that of prompt $J/ψ$ and $\mathitΥ(2S)$ mesons to be $76\pm 21 \pm 4 \pm 7 \, \mathrm{pb}$, with a significance of $4.9\,σ$. The first uncertainty is statistical, the second systematic, and the third due to uncertainties on the used branching fractions. This is the first observation of the associated production of $J/ψ$ and $\mathitΥ(1S)$ in proton-proton collisions. Differential cross-sections are measured as functions of variables that are sensitive to kinematic correlations between the $J/ψ$ and $\mathitΥ(1S)$ mesons. The effective cross-sections of the associated production of prompt $J/ψ$ and $\mathitΥ$ mesons are obtained and found to be compatible with measurements using other particle productions.
△ Less
Submitted 29 August, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering
Authors:
Xingyu Fu,
Ben Zhou,
Sihao Chen,
Mark Yatskar,
Dan Roth
Abstract:
Recent advances in multimodal large language models (LLMs) have shown extreme effectiveness in visual question answering (VQA). However, the design nature of these end-to-end models prevents them from being interpretable to humans, undermining trust and applicability in critical domains. While post-hoc rationales offer certain insight into understanding model behavior, these explanations are not g…
▽ More
Recent advances in multimodal large language models (LLMs) have shown extreme effectiveness in visual question answering (VQA). However, the design nature of these end-to-end models prevents them from being interpretable to humans, undermining trust and applicability in critical domains. While post-hoc rationales offer certain insight into understanding model behavior, these explanations are not guaranteed to be faithful to the model. In this paper, we address these shortcomings by introducing an interpretable by design model that factors model decisions into intermediate human-legible explanations, and allows people to easily understand why a model fails or succeeds. We propose the Dynamic Clue Bottleneck Model ( (DCLUB), a method that is designed towards an inherently interpretable VQA system. DCLUB provides an explainable intermediate space before the VQA decision and is faithful from the beginning, while maintaining comparable performance to black-box systems. Given a question, DCLUB first returns a set of visual clues: natural language statements of visually salient evidence from the image, and then generates the output based solely on the visual clues. To supervise and evaluate the generation of VQA explanations within DCLUB, we collect a dataset of 1.7k reasoning-focused questions with visual clues. Evaluations show that our inherently interpretable system can improve 4.64% over a comparable black-box system in reasoning-focused questions while preserving 99.43% of performance on VQA-v2.
△ Less
Submitted 13 April, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Taxonomy Expansion for Named Entity Recognition
Authors:
Karthikeyan K,
Yogarshi Vyas,
Jie Ma,
Giovanni Paolini,
Neha Anna John,
Shuai Wang,
Yassine Benajiba,
Vittorio Castelli,
Dan Roth,
Miguel Ballesteros
Abstract:
Training a Named Entity Recognition (NER) model often involves fixing a taxonomy of entity types. However, requirements evolve and we might need the NER model to recognize additional entity types. A simple approach is to re-annotate entire dataset with both existing and additional entity types and then train the model on the re-annotated dataset. However, this is an extremely laborious task. To re…
▽ More
Training a Named Entity Recognition (NER) model often involves fixing a taxonomy of entity types. However, requirements evolve and we might need the NER model to recognize additional entity types. A simple approach is to re-annotate entire dataset with both existing and additional entity types and then train the model on the re-annotated dataset. However, this is an extremely laborious task. To remedy this, we propose a novel approach called Partial Label Model (PLM) that uses only partially annotated datasets. We experiment with 6 diverse datasets and show that PLM consistently performs better than most other approaches (0.5 - 2.5 F1), including in novel settings for taxonomy expansion not considered in prior work. The gap between PLM and all other approaches is especially large in settings where there is limited data available for the additional entity types (as much as 11 F1), thus suggesting a more cost effective approaches to taxonomy expansion.
△ Less
Submitted 22 May, 2023;
originally announced May 2023.
-
Open-Domain Event Graph Induction for Mitigating Framing Bias
Authors:
Siyi Liu,
Hongming Zhang,
Hongwei Wang,
Kaiqiang Song,
Dan Roth,
Dong Yu
Abstract:
Researchers have proposed various information extraction (IE) techniques to convert news articles into structured knowledge for news understanding. However, none of the existing methods have explicitly addressed the issue of framing bias that is inherent in news articles. We argue that studying and identifying framing bias is a crucial step towards trustworthy event understanding. We propose a nov…
▽ More
Researchers have proposed various information extraction (IE) techniques to convert news articles into structured knowledge for news understanding. However, none of the existing methods have explicitly addressed the issue of framing bias that is inherent in news articles. We argue that studying and identifying framing bias is a crucial step towards trustworthy event understanding. We propose a novel task, neutral event graph induction, to address this problem. An event graph is a network of events and their temporal relations. Our task aims to induce such structural knowledge with minimal framing bias in an open domain. We propose a three-step framework to induce a neutral event graph from multiple input sources. The process starts by inducing an event graph from each input source, then merging them into one merged event graph, and lastly using a Graph Convolutional Network to remove event nodes with biased connotations. We demonstrate the effectiveness of our framework through the use of graph prediction metrics and bias-focused metrics.
△ Less
Submitted 22 May, 2023;
originally announced May 2023.
-
Comparing Biases and the Impact of Multilingual Training across Multiple Languages
Authors:
Sharon Levy,
Neha Anna John,
Ling Liu,
Yogarshi Vyas,
Jie Ma,
Yoshinari Fu**uma,
Miguel Ballesteros,
Vittorio Castelli,
Dan Roth
Abstract:
Studies in bias and fairness in natural language processing have primarily examined social biases within a single language and/or across few attributes (e.g. gender, race). However, biases can manifest differently across various languages for individual attributes. As a result, it is critical to examine biases within each language and attribute. Of equal importance is to study how these biases com…
▽ More
Studies in bias and fairness in natural language processing have primarily examined social biases within a single language and/or across few attributes (e.g. gender, race). However, biases can manifest differently across various languages for individual attributes. As a result, it is critical to examine biases within each language and attribute. Of equal importance is to study how these biases compare across languages and how the biases are affected when training a model on multilingual data versus monolingual data. We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task to observe whether specific demographics are viewed more positively. We study bias similarities and differences across these languages and investigate the impact of multilingual vs. monolingual training data. We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender. Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture (e.g. majority religions and nationalities). Additionally, we find an increased variation in predictions across protected groups, indicating bias amplification, after multilingual finetuning in comparison to multilingual pretraining.
△ Less
Submitted 18 May, 2023;
originally announced May 2023.
-
The LHCb upgrade I
Authors:
LHCb collaboration,
R. Aaij,
A. S. W. Abdelmotteleb,
C. Abellan Beteta,
F. Abudinén,
C. Achard,
T. Ackernley,
B. Adeva,
M. Adinolfi,
P. Adlarson,
H. Afsharnia,
C. Agapopoulou,
C. A. Aidala,
Z. Ajaltouni,
S. Akar,
K. Akiba,
P. Albicocco,
J. Albrecht,
F. Alessio,
M. Alexander,
A. Alfonso Albero,
Z. Aliouche,
P. Alvarez Cartelle,
R. Amalric,
S. Amato
, et al. (1298 additional authors not shown)
Abstract:
The LHCb upgrade represents a major change of the experiment. The detectors have been almost completely renewed to allow running at an instantaneous luminosity five times larger than that of the previous running periods. Readout of all detectors into an all-software trigger is central to the new design, facilitating the reconstruction of events at the maximum LHC interaction rate, and their select…
▽ More
The LHCb upgrade represents a major change of the experiment. The detectors have been almost completely renewed to allow running at an instantaneous luminosity five times larger than that of the previous running periods. Readout of all detectors into an all-software trigger is central to the new design, facilitating the reconstruction of events at the maximum LHC interaction rate, and their selection in real time. The experiment's tracking system has been completely upgraded with a new pixel vertex detector, a silicon tracker upstream of the dipole magnet and three scintillating fibre tracking stations downstream of the magnet. The whole photon detection system of the RICH detectors has been renewed and the readout electronics of the calorimeter and muon systems have been fully overhauled. The first stage of the all-software trigger is implemented on a GPU farm. The output of the trigger provides a combination of totally reconstructed physics objects, such as tracks and vertices, ready for final analysis, and of entire events which need further offline reprocessing. This scheme required a complete revision of the computing model and rewriting of the experiment's software.
△ Less
Submitted 17 May, 2023;
originally announced May 2023.