Search | arXiv e-print repository

Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs

Authors: Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, Omar Khattab

Abstract: Language Model Programs, i.e. sophisticated pipelines of modular language model (LM) calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly effective for all modules. We study prompt optimization for LM programs, i.e. how to update these prompts to maximize a downstream metric without access to module-level labels or gradients. To make this tractable, we fa… ▽ More Language Model Programs, i.e. sophisticated pipelines of modular language model (LM) calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly effective for all modules. We study prompt optimization for LM programs, i.e. how to update these prompts to maximize a downstream metric without access to module-level labels or gradients. To make this tractable, we factorize our problem into optimizing the free-form instructions and few-shot demonstrations of every module and introduce several strategies to craft task-grounded instructions and navigate credit assignment across modules. Our strategies include (i) program- and data-aware techniques for proposing effective instructions, (ii) a stochastic mini-batch evaluation function for learning a surrogate model of our objective, and (iii) a meta-optimization procedure in which we refine how LMs construct proposals over time. Using these insights we develop MIPRO, a novel optimizer that outperforms baselines on five of six diverse LM programs using a best-in-class open-source model (Llama-3-8B), by as high as 12.9% accuracy. We will release our new optimizers and benchmark in DSPy at https://github.com/stanfordnlp/dspy △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: Krista and Michael contributed equally to this work

arXiv:2402.15018 [pdf, other]

Unintended Impacts of LLM Alignment on Global Representation

Authors: Michael J. Ryan, William Held, Diyi Yang

Abstract: Before being deployed for user-facing applications, developers align Large Language Models (LLMs) to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not… ▽ More Before being deployed for user-facing applications, developers align Large Language Models (LLMs) to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not universal, and aligning to specific preference sets may have unintended effects. We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide. Our results show that current alignment procedures create disparities between English dialects and global opinions. We find alignment improves capabilities in several languages. We conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning. We make our code and data publicly available on Github. △ Less

Submitted 6 June, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: Accepted to ACL 2024 main conference

arXiv:2305.15678 [pdf, other]

Revisiting non-English Text Simplification: A Unified Multilingual Benchmark

Authors: Michael J. Ryan, Tarek Naous, Wei Xu

Abstract: Recent advancements in high-quality, large-scale English resources have pushed the frontier of English Automatic Text Simplification (ATS) research. However, less work has been done on multilingual text simplification due to the lack of a diverse evaluation benchmark that covers complex-simple sentence pairs in many languages. This paper introduces the MultiSim benchmark, a collection of 27 resour… ▽ More Recent advancements in high-quality, large-scale English resources have pushed the frontier of English Automatic Text Simplification (ATS) research. However, less work has been done on multilingual text simplification due to the lack of a diverse evaluation benchmark that covers complex-simple sentence pairs in many languages. This paper introduces the MultiSim benchmark, a collection of 27 resources in 12 distinct languages containing over 1.7 million complex-simple sentence pairs. This benchmark will encourage research in develo** more effective multilingual text simplification models and evaluation metrics. Our experiments using MultiSim with pre-trained multilingual language models reveal exciting performance improvements from multilingual training in non-English settings. We observe strong performance from Russian in zero-shot cross-lingual transfer to low-resource languages. We further show that few-shot prompting with BLOOM-176b achieves comparable quality to reference simplifications outperforming fine-tuned models in most languages. We validate these findings through human evaluation. △ Less

Submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted to ACL 2023 main conference

arXiv:2305.14463 [pdf, other]

ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment

Authors: Tarek Naous, Michael J. Ryan, Anton Lavrouk, Mohit Chandra, Wei Xu

Abstract: We present a comprehensive evaluation of large language models for multilingual readability assessment. Existing evaluation resources lack domain and language diversity, limiting the ability for cross-domain and cross-lingual analyses. This paper introduces ReadMe++, a multilingual multi-domain dataset with human annotations of 9757 sentences in Arabic, English, French, Hindi, and Russian, collect… ▽ More We present a comprehensive evaluation of large language models for multilingual readability assessment. Existing evaluation resources lack domain and language diversity, limiting the ability for cross-domain and cross-lingual analyses. This paper introduces ReadMe++, a multilingual multi-domain dataset with human annotations of 9757 sentences in Arabic, English, French, Hindi, and Russian, collected from 112 different data sources. This benchmark will encourage research on develo** robust multilingual readability assessment methods. Using ReadMe++, we benchmark multilingual and monolingual language models in the supervised, unsupervised, and few-shot prompting settings. The domain and language diversity in ReadMe++ enable us to test more effective few-shot prompting, and identify shortcomings in state-of-the-art unsupervised methods. Our experiments also reveal exciting results of superior domain generalization and enhanced cross-lingual transfer capabilities by models trained on ReadMe++. We will make our data publicly available and release a python package tool for multilingual sentence readability prediction using our trained models at: https://github.com/tareknaous/readme △ Less

Submitted 8 June, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

arXiv:2305.14456 [pdf, other]

Having Beer after Prayer? Measuring Cultural Bias in Large Language Models

Authors: Tarek Naous, Michael J. Ryan, Alan Ritter, Wei Xu

Abstract: As the reach of large language models (LMs) expands globally, their ability to cater to diverse cultural contexts becomes crucial. Despite advancements in multilingual capabilities, models are not designed with appropriate cultural nuances. In this paper, we show that multilingual and Arabic monolingual LMs exhibit bias towards entities associated with Western culture. We introduce CAMeL, a novel… ▽ More As the reach of large language models (LMs) expands globally, their ability to cater to diverse cultural contexts becomes crucial. Despite advancements in multilingual capabilities, models are not designed with appropriate cultural nuances. In this paper, we show that multilingual and Arabic monolingual LMs exhibit bias towards entities associated with Western culture. We introduce CAMeL, a novel resource of 628 naturally-occurring prompts and 20,368 entities spanning eight types that contrast Arab and Western cultures. CAMeL provides a foundation for measuring cultural biases in LMs through both extrinsic and intrinsic evaluations. Using CAMeL, we examine the cross-cultural performance in Arabic of 16 different LMs on tasks such as story generation, NER, and sentiment analysis, where we find concerning cases of stereoty** and cultural unfairness. We further test their text-infilling performance, revealing the incapability of appropriate adaptation to Arab cultural contexts. Finally, we analyze 6 Arabic pre-training corpora and find that commonly used sources such as Wikipedia may not be best suited to build culturally aware LMs, if used as they are without adjustment. We will make CAMeL publicly available at: https://github.com/tareknaous/camel △ Less

Submitted 20 March, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

arXiv:2204.01711 [pdf, other]

Single Image Internal Distribution Measurement Using Non-Local Variational Autoencoder

Authors: Yeahia Sarker, Abdullah-Al-Zubaer Imran, Md Hafiz Ahamed, Ripon K. Chakrabortty, Michael J. Ryan, Sajal K. Das

Abstract: Deep learning-based super-resolution methods have shown great promise, especially for single image super-resolution (SISR) tasks. Despite the performance gain, these methods are limited due to their reliance on copious data for model training. In addition, supervised SISR solutions rely on local neighbourhood information focusing only on the feature learning processes for the reconstruction of low… ▽ More Deep learning-based super-resolution methods have shown great promise, especially for single image super-resolution (SISR) tasks. Despite the performance gain, these methods are limited due to their reliance on copious data for model training. In addition, supervised SISR solutions rely on local neighbourhood information focusing only on the feature learning processes for the reconstruction of low-dimensional images. Moreover, they fail to capitalize on global context due to their constrained receptive field. To combat these challenges, this paper proposes a novel image-specific solution, namely non-local variational autoencoder (\texttt{NLVAE}), to reconstruct a high-resolution (HR) image from a single low-resolution (LR) image without the need for any prior training. To harvest maximum details for various receptive regions and high-quality synthetic images, \texttt{NLVAE} is introduced as a self-supervised strategy that reconstructs high-resolution images using disentangled information from the non-local neighbourhood. Experimental results from seven benchmark datasets demonstrate the effectiveness of the \texttt{NLVAE} model. Moreover, our proposed model outperforms a number of baseline and state-of-the-art methods as confirmed through extensive qualitative and quantitative evaluations. △ Less

Submitted 2 April, 2022; originally announced April 2022.

Comments: A Preprint Version

arXiv:0912.4300 [pdf]

doi 10.1088/1748-0221/5/03/P03010

Radiation hardness qualification of PbWO4 scintillation crystals for the CMS Electromagnetic Calorimeter

Authors: The CMS Electromagnetic Calorimeter Group, P. Adzic, N. Almeida, D. Andelin, I. Anicin, Z. Antunovic, R. Arcidiacono, M. W. Arenton, E. Auffray, S. Argiro, A. Askew, S. Baccaro, S. Baffioni, M. Balazs, D. Bandurin, D. Barney, L. M. Barone, A. Bartoloni, C. Baty, S. Beauceron, K. W. Bell, C. Bernet, M. Besancon, B. Betev, R. Beuselinck , et al. (245 additional authors not shown)

Abstract: Ensuring the radiation hardness of PbWO4 crystals was one of the main priorities during the construction of the electromagnetic calorimeter of the CMS experiment at CERN. The production on an industrial scale of radiation hard crystals and their certification over a period of several years represented a difficult challenge both for CMS and for the crystal suppliers. The present article reviews t… ▽ More Ensuring the radiation hardness of PbWO4 crystals was one of the main priorities during the construction of the electromagnetic calorimeter of the CMS experiment at CERN. The production on an industrial scale of radiation hard crystals and their certification over a period of several years represented a difficult challenge both for CMS and for the crystal suppliers. The present article reviews the related scientific and technological problems encountered. △ Less

Submitted 21 December, 2009; originally announced December 2009.

Comments: 24 pages, 19 figures, available on CMS information server at http://cms.cern.ch/iCMS/

Report number: CMS Note 2009/016

Journal ref: JINST 5:P03010,2010

Showing 1–7 of 7 results for author: Ryan, M J