Search | arXiv e-print repository

One Thousand and One Pairs: A "novel" challenge for long-context language models

Authors: Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, Mohit Iyyer

Abstract: Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, wr… ▽ More Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the highest accuracy at 55.8%. Further analysis reveals that (1) on average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning; (2) model-generated explanations for their decisions are often inaccurate even for correctly-labeled claims; and (3) models perform substantially worse on speculative fiction books that contain extensive world-building. The methodology proposed in NoCha allows for the evolution of the benchmark dataset and the easy analysis of future models. △ Less

Submitted 23 June, 2024; originally announced June 2024.

Comments: preprint, 29 pages

arXiv:2404.13784 [pdf, other]

Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images

Authors: Ali Naseh, Katherine Thai, Mohit Iyyer, Amir Houmansadr

Abstract: With the digital imagery landscape rapidly evolving, image stocks and AI-generated image marketplaces have become central to visual media. Traditional stock images now exist alongside innovative platforms that trade in prompts for AI-generated visuals, driven by sophisticated APIs like DALL-E 3 and Midjourney. This paper studies the possibility of employing multi-modal models with enhanced visual… ▽ More With the digital imagery landscape rapidly evolving, image stocks and AI-generated image marketplaces have become central to visual media. Traditional stock images now exist alongside innovative platforms that trade in prompts for AI-generated visuals, driven by sophisticated APIs like DALL-E 3 and Midjourney. This paper studies the possibility of employing multi-modal models with enhanced visual understanding to mimic the outputs of these platforms, introducing an original attack strategy. Our method leverages fine-tuned CLIP models, a multi-label classifier, and the descriptive capabilities of GPT-4V to create prompts that generate images similar to those available in marketplaces and from premium stock image providers, yet at a markedly lower expense. In presenting this strategy, we aim to spotlight a new class of economic and security considerations within the realm of digital imagery. Our findings, supported by both automated metrics and human assessment, reveal that comparable visual content can be produced for a fraction of the prevailing market prices ($0.23 - $0.27 per image), emphasizing the need for awareness and strategic discussions about the integrity of digital media in an increasingly AI-integrated landscape. Our work also contributes to the field by assembling a dataset consisting of approximately 19 million prompt-image pairs generated by the popular Midjourney platform, which we plan to release publicly. △ Less

Submitted 21 April, 2024; originally announced April 2024.

arXiv:2210.14250 [pdf, other]

Exploring Document-Level Literary Machine Translation with Parallel Paragraphs from World Literature

Authors: Katherine Thai, Marzena Karpinska, Kalpesh Krishna, Bill Ray, Moira Inghilleri, John Wieting, Mohit Iyyer

Abstract: Literary translation is a culturally significant task, but it is bottlenecked by the small number of qualified literary translators relative to the many untranslated works published around the world. Machine translation (MT) holds potential to complement the work of human translators by improving both training procedures and their overall efficiency. Literary translation is less constrained than m… ▽ More Literary translation is a culturally significant task, but it is bottlenecked by the small number of qualified literary translators relative to the many untranslated works published around the world. Machine translation (MT) holds potential to complement the work of human translators by improving both training procedures and their overall efficiency. Literary translation is less constrained than more traditional MT settings since translators must balance meaning equivalence, readability, and critical interpretability in the target language. This property, along with the complex discourse-level context present in literary texts, also makes literary MT more challenging to computationally model and evaluate. To explore this task, we collect a dataset (Par3) of non-English language novels in the public domain, each aligned at the paragraph level to both human and automatic English translations. Using Par3, we discover that expert literary translators prefer reference human translations over machine-translated paragraphs at a rate of 84%, while state-of-the-art automatic MT metrics do not correlate with those preferences. The experts note that MT outputs contain not only mistranslations, but also discourse-disrupting errors and stylistic inconsistencies. To address these problems, we train a post-editing model whose output is preferred over normal MT output at a rate of 69% by experts. We publicly release Par3 at https://github.com/katherinethai/par3/ to spur future research into literary MT. △ Less

Submitted 25 October, 2022; originally announced October 2022.

Comments: EMNLP 2022

arXiv:2210.13746 [pdf, other]

DEMETR: Diagnosing Evaluation Metrics for Translation

Authors: Marzena Karpinska, Nishant Raj, Katherine Thai, Yixiao Song, Ankita Gupta, Mohit Iyyer

Abstract: While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence or absence of certain words. The operations of newer learned metrics (e.g., BLEURT, COMET), which leverage pretrained language models to achieve higher correlati… ▽ More While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence or absence of certain words. The operations of newer learned metrics (e.g., BLEURT, COMET), which leverage pretrained language models to achieve higher correlations with human quality judgments than BLEU, are opaque in comparison. In this paper, we shed light on the behavior of these learned metrics by creating DEMETR, a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories. All perturbations were carefully designed to form minimal pairs with the actual translation (i.e., differ in only one aspect). We find that learned metrics perform substantially better than string-based metrics on DEMETR. Additionally, learned metrics differ in their sensitivity to various phenomena (e.g., BERTScore is sensitive to untranslated words but relatively insensitive to gender manipulation, while COMET is much more sensitive to word repetition than to aspectual changes). We publicly release DEMETR to spur more informed future development of machine translation evaluation metrics △ Less

Submitted 24 October, 2022; originally announced October 2022.

Comments: 22 pages, EMNLP 2022 (camera ready)

arXiv:2204.10878 [pdf, other]

ChapterBreak: A Challenge Dataset for Long-Range Language Models

Authors: Simeng Sun, Katherine Thai, Mohit Iyyer

Abstract: While numerous architectures for long-range language models (LRLMs) have recently been proposed, a meaningful evaluation of their discourse-level language understanding capabilities has not yet followed. To this end, we introduce ChapterBreak, a challenge dataset that provides an LRLM with a long segment from a narrative that ends at a chapter boundary and asks it to distinguish the beginning of t… ▽ More While numerous architectures for long-range language models (LRLMs) have recently been proposed, a meaningful evaluation of their discourse-level language understanding capabilities has not yet followed. To this end, we introduce ChapterBreak, a challenge dataset that provides an LRLM with a long segment from a narrative that ends at a chapter boundary and asks it to distinguish the beginning of the ground-truth next chapter from a set of negative segments sampled from the same narrative. A fine-grained human annotation reveals that our dataset contains many complex types of chapter transitions (e.g., parallel narratives, cliffhanger endings) that require processing global context to comprehend. Experiments on ChapterBreak show that existing LRLMs fail to effectively leverage long-range context, substantially underperforming a segment-level model trained directly for this task. We publicly release our ChapterBreak dataset to spur more principled future research into LRLMs. △ Less

Submitted 22 April, 2022; originally announced April 2022.

arXiv:2203.10053 [pdf, other]

RELIC: Retrieving Evidence for Literary Claims

Authors: Katherine Thai, Yapei Chang, Kalpesh Krishna, Mohit Iyyer

Abstract: Humanities scholars commonly provide evidence for claims that they make about a work of literature (e.g., a novel) in the form of quotations from the work. We collect a large-scale dataset (RELiC) of 78K literary quotations and surrounding critical analysis and use it to formulate the novel task of literary evidence retrieval, in which models are given an excerpt of literary analysis surrounding a… ▽ More Humanities scholars commonly provide evidence for claims that they make about a work of literature (e.g., a novel) in the form of quotations from the work. We collect a large-scale dataset (RELiC) of 78K literary quotations and surrounding critical analysis and use it to formulate the novel task of literary evidence retrieval, in which models are given an excerpt of literary analysis surrounding a masked quotation and asked to retrieve the quoted passage from the set of all passages in the work. Solving this retrieval task requires a deep understanding of complex literary and linguistic phenomena, which proves challenging to methods that overwhelmingly rely on lexical and semantic similarity matching. We implement a RoBERTa-based dense passage retriever for this task that outperforms existing pretrained information retrieval baselines; however, experiments and analysis by human domain experts indicate that there is substantial room for improvement over our dense retriever. △ Less

Submitted 18 March, 2022; originally announced March 2022.

Comments: ACL 2022 camera ready (19 pages)

arXiv:2009.12413 [pdf, other]

doi 10.1109/TCBB.2020.3040910

Maximum Covering Subtrees for Phylogenetic Networks

Authors: Nathan Davidov, Amanda Hernandez, Justin Jian, Patrick McKenna, K. A. Medlin, Roadra Mojumder, Megan Owen, Andrew Quijano, Amanda Rodriguez, Katherine St. John, Katherine Thai, Meliza Uraga

Abstract: Tree-based phylogenetic networks, which may be roughly defined as leaf-labeled networks built by adding arcs only between the original tree edges, have elegant properties for modeling evolutionary histories. We answer an open question of Francis, Semple, and Steel about the complexity of determining how far a phylogenetic network is from being tree-based, including non-binary phylogenetic networks… ▽ More Tree-based phylogenetic networks, which may be roughly defined as leaf-labeled networks built by adding arcs only between the original tree edges, have elegant properties for modeling evolutionary histories. We answer an open question of Francis, Semple, and Steel about the complexity of determining how far a phylogenetic network is from being tree-based, including non-binary phylogenetic networks. We show that finding a phylogenetic tree covering the maximum number of nodes in a phylogenetic network can be be computed in polynomial time via an encoding into a minimum-cost maximum flow problem. △ Less

Submitted 24 November, 2020; v1 submitted 25 September, 2020; originally announced September 2020.

arXiv:1910.06078 [pdf, other]

MUTLA: A Large-Scale Dataset for Multimodal Teaching and Learning Analytics

Authors: Fangli Xu, Lingfei Wu, KP Thai, Carol Hsu, Wei Wang, Richard Tong

Abstract: Automatic analysis of teacher and student interactions could be very important to improve the quality of teaching and student engagement. However, despite some recent progress in utilizing multimodal data for teaching and learning analytics, a thorough analysis of a rich multimodal dataset coming for a complex real learning environment has yet to be done. To bridge this gap, we present a large-sca… ▽ More Automatic analysis of teacher and student interactions could be very important to improve the quality of teaching and student engagement. However, despite some recent progress in utilizing multimodal data for teaching and learning analytics, a thorough analysis of a rich multimodal dataset coming for a complex real learning environment has yet to be done. To bridge this gap, we present a large-scale MUlti-modal Teaching and Learning Analytics (MUTLA) dataset. This dataset includes time-synchronized multimodal data records of students (learning logs, videos, EEG brainwaves) as they work in various subjects from Squirrel AI Learning System (SAIL) to solve problems of varying difficulty levels. The dataset resources include user records from the learner records store of SAIL, brainwave data collected by EEG headset devices, and video data captured by web cameras while students worked in the SAIL products. Our hope is that by analyzing real-world student learning activities, facial expressions, and brainwave patterns, researchers can better predict engagement, which can then be used to improve adaptive learning selection and student learning outcomes. An additional goal is to provide a dataset gathered from real-world educational activities versus those from controlled lab environments to benefit the educational learning community. △ Less

Submitted 6 December, 2022; v1 submitted 4 October, 2019; originally announced October 2019.

Comments: 3 pages, 1 figure, 2 tables workshop paper

arXiv:1910.03225 [pdf, other]

NGBoost: Natural Gradient Boosting for Probabilistic Prediction

Authors: Tony Duan, Anand Avati, Daisy Yi Ding, Khanh K. Thai, Sanjay Basu, Andrew Y. Ng, Alejandro Schuler

Abstract: We present Natural Gradient Boosting (NGBoost), an algorithm for generic probabilistic prediction via gradient boosting. Typical regression models return a point estimate, conditional on covariates, but probabilistic regression models output a full probability distribution over the outcome space, conditional on the covariates. This allows for predictive uncertainty estimation -- crucial in applica… ▽ More We present Natural Gradient Boosting (NGBoost), an algorithm for generic probabilistic prediction via gradient boosting. Typical regression models return a point estimate, conditional on covariates, but probabilistic regression models output a full probability distribution over the outcome space, conditional on the covariates. This allows for predictive uncertainty estimation -- crucial in applications like healthcare and weather forecasting. NGBoost generalizes gradient boosting to probabilistic regression by treating the parameters of the conditional distribution as targets for a multiparameter boosting algorithm. Furthermore, we show how the Natural Gradient is required to correct the training dynamics of our multiparameter boosting approach. NGBoost can be used with any base learner, any family of distributions with continuous parameters, and any scoring rule. NGBoost matches or exceeds the performance of existing methods for probabilistic prediction while offering additional benefits in flexibility, scalability, and usability. An open-source implementation is available at github.com/stanfordmlgroup/ngboost. △ Less

Submitted 9 June, 2020; v1 submitted 8 October, 2019; originally announced October 2019.

Comments: Accepted for ICML 2020

arXiv:1907.03627 [pdf, other]

HyperPubSub: Blockchain based Publish/Subscribe

Authors: Gewu Bu, Thanh Son Lam Nguyen, Maria Potop-Butucaru, Kim Thai

Abstract: In this paper we describe the architecture and the implementation of a broker based publish/subscribe system where the broker role is played by a private blockchain, Hy-perledger Fabric. We show the effectiveness of our architecture by implementing and deploying a photo trading plateform. Interestingly, our architecture is generic enough to be adapted to any digital asset trading. In this paper we describe the architecture and the implementation of a broker based publish/subscribe system where the broker role is played by a private blockchain, Hy-perledger Fabric. We show the effectiveness of our architecture by implementing and deploying a photo trading plateform. Interestingly, our architecture is generic enough to be adapted to any digital asset trading. △ Less

Submitted 8 July, 2019; originally announced July 2019.

arXiv:1903.08856 [pdf, other]

Impact of network delays on Hyperledger Fabric

Authors: Thanh Son Lam Nguyen, Guillaume Jourjon, Maria Potop-Butucaru, Kim Thai

Abstract: Blockchain has become one of the most attractive technologies for applications, with a large range of deployments such as production, economy, or banking. Under the hood, Blockchain technology is a type of distributed database that supports untrusted parties. In this paper we focus Hyperledger Fabric, the first blockchain in the market tailored for a private environment, allowing businesses to cre… ▽ More Blockchain has become one of the most attractive technologies for applications, with a large range of deployments such as production, economy, or banking. Under the hood, Blockchain technology is a type of distributed database that supports untrusted parties. In this paper we focus Hyperledger Fabric, the first blockchain in the market tailored for a private environment, allowing businesses to create a permissioned network. Hyperledger Fabric implements a PBFT consensus in order to maintain a non forking blockchain at the application level. We deployed this framework over an area network between France and Germany in order to evaluate its performance when potentially large network delays are observed. Overall we found that when network delay increases significantly (i.e. up to 3.5 seconds at network layer between two clouds), we observed that the blocks added to our blockchain had up to 134 seconds offset after 100 th block from one cloud to another. Thus by delaying block propagation, we demonstrated that Hyperledger Fabric does not provide sufficient consistency guaranties to be deployed in critical environments. Our work, is the fist to evidence the negative impact of network delays on a PBFT-based blockchain. △ Less

Submitted 21 March, 2019; originally announced March 2019.

arXiv:1901.10268 [pdf]

Performance comparison of an AI-based Adaptive Learning System in China

Authors: Wei Cui, Zhen Xue, Khanh-Phuong Thai

Abstract: Adaptive learning systems stand apart from traditional learning systems by offering a personalized learning experience to students according to their different knowledge states. Adaptive systems collect and analyse students' behavior data, update learner profiles, then accordingly provide timely individualized feedback to each student. Such interactions between the learning system and students can… ▽ More Adaptive learning systems stand apart from traditional learning systems by offering a personalized learning experience to students according to their different knowledge states. Adaptive systems collect and analyse students' behavior data, update learner profiles, then accordingly provide timely individualized feedback to each student. Such interactions between the learning system and students can improve the engagement of students and the efficiency of learning. This paper evaluates the effectiveness of an adaptive learning system, "Yixue Squirrel AI" (or Yixue), on English and math learning in middle school. The effectiveness of the Yixue's math and English learning systems is respectively compared against (1) traditional classroom math instruction conducted by expert human teachers and (2) BOXFiSH, another adaptive learning platform for English language learning. Results suggest that students achieved better performance using Yixue adaptive learning system than both traditional classroom instruction by expert teachers and another adaptive learning platform. △ Less

Submitted 29 January, 2019; originally announced January 2019.

Showing 1–12 of 12 results for author: Thai, K