-
Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
Authors:
Krista Opsahl-Ong,
Michael J Ryan,
Josh Purtell,
David Broman,
Christopher Potts,
Matei Zaharia,
Omar Khattab
Abstract:
Language Model Programs, i.e. sophisticated pipelines of modular language model (LM) calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly effective for all modules. We study prompt optimization for LM programs, i.e. how to update these prompts to maximize a downstream metric without access to module-level labels or gradients. To make this tractable, we factorize our problem into optimizing the free-form instructions and few-shot demonstrations of every module and introduce several strategies to craft task-grounded instructions and navigate credit assignment across modules. Our strategies include (i) program- and data-aware techniques for proposing effective instructions, (ii) a stochastic mini-batch evaluation function for learning a surrogate model of our objective, and (iii) a meta-optimization procedure in which we refine how LMs construct proposals over time. Using these insights we develop MIPRO, a novel optimizer that outperforms baselines on five of six diverse LM programs using a best-in-class open-source model (Llama-3-8B), by as high as 12.9% accuracy. We will release our new optimizers and benchmark in DSPy at https://github.com/stanfordnlp/dspy
Submitted 17 June, 2024;
originally announced June 2024.
-
Unintended Impacts of LLM Alignment on Global Representation
Authors:
Michael J. Ryan,
William Held,
Diyi Yang
Abstract:
Before being deployed for user-facing applications, developers align Large Language Models (LLMs) to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not universal, and aligning to specific preference sets may have unintended effects. We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide. Our results show that current alignment procedures create disparities between English dialects and global opinions. We find alignment improves capabilities in several languages. We conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning. We make our code and data publicly available on GitHub.
Submitted 6 June, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
-
Revisiting non-English Text Simplification: A Unified Multilingual Benchmark
Authors:
Michael J. Ryan,
Tarek Naous,
Wei Xu
Abstract:
Recent advancements in high-quality, large-scale English resources have pushed the frontier of English Automatic Text Simplification (ATS) research. However, less work has been done on multilingual text simplification due to the lack of a diverse evaluation benchmark that covers complex-simple sentence pairs in many languages. This paper introduces the MultiSim benchmark, a collection of 27 resources in 12 distinct languages containing over 1.7 million complex-simple sentence pairs. This benchmark will encourage research in developing more effective multilingual text simplification models and evaluation metrics. Our experiments using MultiSim with pre-trained multilingual language models reveal exciting performance improvements from multilingual training in non-English settings. We observe strong performance from Russian in zero-shot cross-lingual transfer to low-resource languages. We further show that few-shot prompting with BLOOM-176b achieves comparable quality to reference simplifications outperforming fine-tuned models in most languages. We validate these findings through human evaluation.
Submitted 24 May, 2023;
originally announced May 2023.
-
ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment
Authors:
Tarek Naous,
Michael J. Ryan,
Anton Lavrouk,
Mohit Chandra,
Wei Xu
Abstract:
We present a comprehensive evaluation of large language models for multilingual readability assessment. Existing evaluation resources lack domain and language diversity, limiting the ability for cross-domain and cross-lingual analyses. This paper introduces ReadMe++, a multilingual multi-domain dataset with human annotations of 9757 sentences in Arabic, English, French, Hindi, and Russian, collected from 112 different data sources. This benchmark will encourage research on developing robust multilingual readability assessment methods. Using ReadMe++, we benchmark multilingual and monolingual language models in the supervised, unsupervised, and few-shot prompting settings. The domain and language diversity in ReadMe++ enable us to test more effective few-shot prompting, and identify shortcomings in state-of-the-art unsupervised methods. Our experiments also reveal exciting results of superior domain generalization and enhanced cross-lingual transfer capabilities by models trained on ReadMe++. We will make our data publicly available and release a Python package tool for multilingual sentence readability prediction using our trained models at: https://github.com/tareknaous/readme
Submitted 8 June, 2024; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Having Beer after Prayer? Measuring Cultural Bias in Large Language Models
Authors:
Tarek Naous,
Michael J. Ryan,
Alan Ritter,
Wei Xu
Abstract:
As the reach of large language models (LMs) expands globally, their ability to cater to diverse cultural contexts becomes crucial. Despite advancements in multilingual capabilities, models are not designed with appropriate cultural nuances. In this paper, we show that multilingual and Arabic monolingual LMs exhibit bias towards entities associated with Western culture. We introduce CAMeL, a novel resource of 628 naturally-occurring prompts and 20,368 entities spanning eight types that contrast Arab and Western cultures. CAMeL provides a foundation for measuring cultural biases in LMs through both extrinsic and intrinsic evaluations. Using CAMeL, we examine the cross-cultural performance in Arabic of 16 different LMs on tasks such as story generation, NER, and sentiment analysis, where we find concerning cases of stereotyping and cultural unfairness. We further test their text-infilling performance, revealing the incapability of appropriate adaptation to Arab cultural contexts. Finally, we analyze 6 Arabic pre-training corpora and find that commonly used sources such as Wikipedia may not be best suited to build culturally aware LMs, if used as they are without adjustment. We will make CAMeL publicly available at: https://github.com/tareknaous/camel
Submitted 20 March, 2024; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Single Image Internal Distribution Measurement Using Non-Local Variational Autoencoder
Authors:
Yeahia Sarker,
Abdullah-Al-Zubaer Imran,
Md Hafiz Ahamed,
Ripon K. Chakrabortty,
Michael J. Ryan,
Sajal K. Das
Abstract:
Deep learning-based super-resolution methods have shown great promise, especially for single image super-resolution (SISR) tasks. Despite the performance gain, these methods are limited due to their reliance on copious data for model training. In addition, supervised SISR solutions rely on local neighbourhood information focusing only on the feature learning processes for the reconstruction of low-dimensional images. Moreover, they fail to capitalize on global context due to their constrained receptive field. To combat these challenges, this paper proposes a novel image-specific solution, namely non-local variational autoencoder (NLVAE), to reconstruct a high-resolution (HR) image from a single low-resolution (LR) image without the need for any prior training. To harvest maximum details for various receptive regions and high-quality synthetic images, NLVAE is introduced as a self-supervised strategy that reconstructs high-resolution images using disentangled information from the non-local neighbourhood. Experimental results from seven benchmark datasets demonstrate the effectiveness of the NLVAE model. Moreover, our proposed model outperforms a number of baseline and state-of-the-art methods as confirmed through extensive qualitative and quantitative evaluations.
Submitted 2 April, 2022;
originally announced April 2022.
-
Radiation hardness qualification of PbWO4 scintillation crystals for the CMS Electromagnetic Calorimeter
Authors:
The CMS Electromagnetic Calorimeter Group,
P. Adzic,
N. Almeida,
D. Andelin,
I. Anicin,
Z. Antunovic,
R. Arcidiacono,
M. W. Arenton,
E. Auffray,
S. Argiro,
A. Askew,
S. Baccaro,
S. Baffioni,
M. Balazs,
D. Bandurin,
D. Barney,
L. M. Barone,
A. Bartoloni,
C. Baty,
S. Beauceron,
K. W. Bell,
C. Bernet,
M. Besancon,
B. Betev,
R. Beuselinck,
et al. (245 additional authors not shown)
Abstract:
Ensuring the radiation hardness of PbWO4 crystals was one of the main priorities during the construction of the electromagnetic calorimeter of the CMS experiment at CERN. The production on an industrial scale of radiation hard crystals and their certification over a period of several years represented a difficult challenge both for CMS and for the crystal suppliers. The present article reviews the related scientific and technological problems encountered.
Submitted 21 December, 2009;
originally announced December 2009.