Following Length Constraints in Instructions

Authors: Weizhe Yuan, Ilia Kulikov, Ping Yu, Kyunghyun Cho, Sainbayar Sukhbaatar, Jason Weston, Jing Xu

Abstract: Aligned instruction following models can better fulfill user requests than their unaligned counterparts. However, it has been shown that there is a length bias in evaluation of such models, and that training algorithms tend to exploit this bias by learning longer responses. In this work we show how to train models that can be controlled at inference time with instructions containing desired length constraints. Such models are superior in length instructed evaluations, outperforming standard instruction following models such as GPT4, Llama 3 and Mixtral.

Submitted 25 June, 2024; originally announced June 2024.

Comments: 13 pages

arXiv:2405.18719

Contextual Position Encoding: Learning to Count What's Important

Authors: Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar

Abstract: The attention mechanism is a critical component of Large Language Models (LLMs) that allows tokens in a sequence to interact with each other, but is order-invariant. Incorporating position encoding (PE) makes it possible to address by position, such as attending to the i-th token. However, current PE methods use token counts to derive position, and thus cannot generalize to higher levels of abstraction, such as attending to the i-th sentence. In this paper, we propose a new position encoding method, Contextual Position Encoding (CoPE), that allows positions to be conditioned on context by incrementing position only on certain tokens determined by the model. This allows more general position addressing such as attending to the i-th particular word, noun, or sentence. We show that CoPE can solve the selective copy, counting and Flip-Flop tasks where popular position embeddings fail, and improves perplexity on language modeling and coding tasks.

Submitted 30 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.
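
The counting idea in the CoPE abstract above (increment position only on certain tokens) can be illustrated with a small sketch. This is not the paper's implementation: it assumes hard 0/1 gates supplied by a hypothetical `gate` function, whereas the abstract says the model itself determines which tokens count. The rule used here, gating on a "." token, is a stand-in chosen so that positions count sentences.

```python
from typing import Callable, List


def contextual_positions(tokens: List[str], gate: Callable[[str], int]) -> List[int]:
    """Position of each token relative to the last token, counting only gated tokens.

    positions[j] is the sum of gate(tokens[k]) for k from j to the last index.
    Assumption: gates are fixed 0/1 values per token, not learned.
    """
    positions = [0] * len(tokens)
    running = 0
    # Walk backwards from the current (last) token, accumulating gate values.
    for j in range(len(tokens) - 1, -1, -1):
        running += gate(tokens[j])
        positions[j] = running
    return positions


def sentence_gate(token: str) -> int:
    """Hypothetical gate: count only sentence-ending periods."""
    return 1 if token == "." else 0


tokens = ["A", "cat", ".", "It", "sat", ".", "Then", "ran"]
positions = contextual_positions(tokens, sentence_gate)
print(positions)
```

With this gate, every token in the same sentence shares one position value, so addressing "position 1" means addressing the previous sentence rather than the previous token.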

arXiv:2404.19733

Iterative Reasoning Preference Optimization

Authors: Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston

Abstract: Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models not relying on additionally sourced datasets. For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.

Submitted 25 June, 2024; v1 submitted 30 April, 2024; originally announced April 2024.
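
The loss described in the abstract above, a DPO term plus a negative log-likelihood term on the winning response, might be sketched per preference pair as follows. The abstract does not give the exact form, so the weight `alpha`, the temperature `beta`, and the length normalization of the NLL term are all assumptions of this sketch. Inputs are summed sequence log-probabilities under the policy and under a frozen reference model.

```python
import math


def dpo_plus_nll_loss(
    logp_w: float,      # policy log-prob of the winning (chosen) response
    logp_l: float,      # policy log-prob of the losing (rejected) response
    ref_logp_w: float,  # reference-model log-prob of the winning response
    ref_logp_l: float,  # reference-model log-prob of the losing response
    len_w: int,         # token length of the winning response (assumed normalizer)
    beta: float = 0.1,  # assumed DPO temperature
    alpha: float = 1.0, # assumed weight on the NLL term
) -> float:
    # DPO term: -log sigmoid(beta * (winner log-ratio - loser log-ratio)).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = math.log1p(math.exp(-margin))
    # Extra NLL term on the winner, normalized by its length.
    nll = -logp_w / len_w
    return dpo + alpha * nll


loss = dpo_plus_nll_loss(-10.0, -12.0, -10.0, -12.0, len_w=5)
print(round(loss, 4))
```

When the policy equals the reference, the DPO margin is zero and the DPO term is log 2, so the NLL term is the only part that still pushes the winner's likelihood up.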

arXiv:2404.10660

Discovery of the optical and radio counterpart to the fast X-ray transient EP240315a

Authors: J. H. Gillanders, L. Rhodes, S. Srivastav, F. Carotenuto, J. Bright, M. E. Huber, H. F. Stevance, S. J. Smartt, K. C. Chambers, T. -W. Chen, R. Fender, A. Andersson, A. J. Cooper, P. G. Jonker, F. J. Cowie, T. deBoer, N. Erasmus, M. D. Fulton, H. Gao, J. Herman, C. -C. Lin, T. Lowe, E. A. Magnier, H. -Y. Miao, P. Minguez, et al. (14 additional authors not shown)

Abstract: Fast X-ray Transients (FXTs) are extragalactic bursts of soft X-rays first identified >10 years ago. Since then, nearly 40 events have been discovered, although almost all of these have been recovered from archival Chandra and XMM-Newton data. To date, optical sky surveys and follow-up searches have not revealed any multi-wavelength counterparts. The Einstein Probe, launched in January 2024, has started surveying the sky in the soft X-ray regime (0.5-4 keV) and will rapidly increase the sample of FXTs discovered in real time. Here, we report the first discovery of both an optical and radio counterpart to a distant FXT, the fourth source publicly released by the Einstein Probe. We discovered a fast-fading optical transient within the 3 arcmin localisation radius of EP240315a with the all-sky optical survey ATLAS, and our follow-up Gemini spectrum provides a redshift, z=4.859+/-0.002. Furthermore, we uncovered a radio counterpart in the S-band (3.0 GHz) with the MeerKAT radio interferometer. The optical (rest-frame UV) and radio luminosities indicate the FXT most likely originates from either a long gamma-ray burst or a relativistic tidal disruption event. This may be a fortuitous early mission detection by the Einstein Probe or may signpost a mode of discovery for high-redshift, high-energy transients through soft X-ray surveys, combined with locating multi-wavelength counterparts.

Submitted 19 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

Comments: Updated to match version accepted for publication in ApJL (17 pages, 4 figures, 2 tables)

arXiv:2403.13799

Reverse Training to Nurse the Reversal Curse

Authors: Olga Golovneva, Zeyuan Allen-Zhu, Jason Weston, Sainbayar Sukhbaatar

Abstract: Large language models (LLMs) have a surprising failure: when trained on "A has a feature B", they do not generalize to "B is a feature of A", which is termed the Reversal Curse. Even when training with trillions of tokens this issue still appears due to Zipf's law - hence even if we train on the entire internet. This work proposes an alternative training scheme, called reverse training, whereby all words are used twice, doubling the amount of available tokens. The LLM is trained in both forward and reverse directions by reversing the training strings while preserving (i.e., not reversing) chosen substrings, such as entities. We show that data-matched reverse-trained models provide superior performance to standard models on standard tasks, and compute-matched reverse-trained models provide far superior performance on reversal tasks, helping resolve the reversal curse issue.

Submitted 7 May, 2024; v1 submitted 20 March, 2024; originally announced March 2024.
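
The string transformation described in the reverse-training abstract above (reverse the training string but keep chosen substrings such as entities intact) can be sketched as below. This is a word-level illustration under stated assumptions: the entity list is given by hand, and the function name `reverse_preserving` and the example sentence are hypothetical, not taken from the paper.

```python
from typing import List


def reverse_preserving(words: List[str], entities: List[List[str]]) -> List[str]:
    """Reverse word order while keeping each listed entity in its original internal order."""
    chunks: List[List[str]] = []
    i = 0
    while i < len(words):
        # Find the first non-empty entity that starts at position i, if any.
        match = next(
            (e for e in entities if e and words[i:i + len(e)] == e),
            None,
        )
        if match is not None:
            chunks.append(match)      # keep the entity as one unit
            i += len(match)
        else:
            chunks.append([words[i]]) # ordinary word: a unit of its own
            i += 1
    # Reverse the order of units, not the words inside each unit.
    return [w for chunk in reversed(chunks) for w in chunk]


forward = "Marie Curie was born in Warsaw".split()
reverse = reverse_preserving(forward, entities=[["Marie", "Curie"]])
print(" ".join(reverse))
```

Training on both `forward` and `reverse` uses every word twice, which is the doubling of available tokens that the abstract mentions.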

arXiv:2403.07816

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Authors: Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, Xian Li

Abstract: We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Expert (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.

Submitted 12 March, 2024; originally announced March 2024.
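
The merge step in the BTX abstract above (feedforward parameters kept as separate experts, remaining parameters averaged) might look roughly like this on toy parameter dictionaries. The representation is an assumption made for illustration: parameters are plain lists of floats, and a hypothetical naming rule (names containing "ffn") marks feedforward parameters. The MoE-finetuning stage that learns token-level routing is not shown.

```python
from typing import Callable, Dict, List

Params = Dict[str, List[float]]


def btx_merge(
    experts: List[Params],
    is_feedforward: Callable[[str], bool] = lambda name: "ffn" in name,
) -> Dict[str, list]:
    """Combine separately trained expert models into one MoE-style parameter set."""
    merged: Dict[str, list] = {}
    n = len(experts)
    for name in experts[0]:
        if is_feedforward(name):
            # Feedforward weights: keep one copy per expert for the MoE layer.
            merged[name] = [list(e[name]) for e in experts]
        else:
            # All other weights: element-wise average across experts.
            columns = zip(*(e[name] for e in experts))
            merged[name] = [sum(vals) / n for vals in columns]
    return merged


code_expert = {"attn.w": [1.0, 2.0], "ffn.w": [0.0, 1.0]}
math_expert = {"attn.w": [3.0, 4.0], "ffn.w": [5.0, 6.0]}
merged = btx_merge([code_expert, math_expert])
print(merged)
```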

arXiv:2402.14158

TOOLVERIFIER: Generalization to New Tools via Self-Verification

Authors: Dheeraj Mekala, Jason Weston, Jack Lanchantin, Roberta Raileanu, Maria Lomeli, Jingbo Shang, Jane Dwivedi-Yu

Abstract: Teaching language models to use tools is an important milestone towards building general assistants, but remains an open problem. While there has been significant progress on learning to use specific tools via fine-tuning, language models still struggle with learning how to robustly use new tools from only a few demonstrations. In this work we introduce a self-verification method which distinguishes between close candidates by self-asking contrastive questions during (1) tool selection; and (2) parameter generation. We construct synthetic, high-quality, self-generated data for this goal using Llama-2 70B, which we intend to release publicly. Extensive experiments on 4 tasks from the ToolBench benchmark, consisting of 17 unseen tools, demonstrate an average improvement of 22% over few-shot baselines, even in scenarios where the distinctions between candidate tools are finely nuanced.

Submitted 13 March, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

arXiv:2401.10020

Self-Rewarding Language Models

Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston

Abstract: We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.

Submitted 8 February, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

arXiv:2401.09549

Interferometric Single-Shot Parity Measurement in an InAs-Al Hybrid Device

Authors: Morteza Aghaee, Alejandro Alcaraz Ramirez, Zulfi Alam, Rizwan Ali, Mariusz Andrzejczuk, Andrey Antipov, Mikhail Astafev, Amin Barzegar, Bela Bauer, Jonathan Becker, Umesh Kumar Bhaskar, Alex Bocharov, Srini Boddapati, David Bohn, Jouri Bommer, Leo Bourdet, Arnaud Bousquet, Samuel Boutin, Lucas Casparis, Benjamin James Chapman, Sohail Chatoor, Anna Wulff Christensen, Cassandra Chua, Patrick Codd, William Cole, et al. (137 additional authors not shown)

Abstract: The fusion of non-Abelian anyons or topological defects is a fundamental operation in measurement-only topological quantum computation. In topological superconductors, this operation amounts to a determination of the shared fermion parity of Majorana zero modes. As a step towards this, we implement a single-shot interferometric measurement of fermion parity in indium arsenide-aluminum heterostructures with a gate-defined nanowire. The interferometer is formed by tunnel-coupling the proximitized nanowire to quantum dots. The nanowire causes a state-dependent shift of these quantum dots' quantum capacitance of up to 1 fF. Our quantum capacitance measurements show flux h/2e-periodic bimodality with a signal-to-noise ratio of 1 in 3.7 μs at optimal flux values. From the time traces of the quantum capacitance measurements, we extract a dwell time in the two associated states that is longer than 1 ms at in-plane magnetic fields of approximately 2 T. These results are consistent with a measurement of the fermion parity encoded in a pair of Majorana zero modes that are separated by approximately 3 μm and subjected to a low rate of poisoning by non-equilibrium quasiparticles. The large capacitance shift and long poisoning time enable a parity measurement error probability of 1%.

Submitted 2 April, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

Comments: Added data on a second measurement of device A and a measurement of device B, expanded discussion of a trivial scenario. Refs added, author list updated

arXiv:2312.16682

Some things are more CRINGE than others: Iterative Preference Optimization with the Pairwise Cringe Loss

Authors: Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, Jason Weston

Abstract: Practitioners commonly align large language models using pairwise preferences, i.e., given labels of the type response A is preferred to response B for a given input. Perhaps less commonly, methods have also been developed for binary feedback, i.e. training models given labels of type response A is good or bad. We show how an existing performant binary feedback method, the Cringe Loss (Adolphs et al., 2022), can be generalized to the pairwise preference setting using a simple soft margin extension. Pairwise Cringe Loss is straightforward to implement and efficient to train, and we find it outperforms state-of-the-art preference optimization algorithms such as PPO and DPO on the AlpacaFarm benchmark. We show that iterations of training of our model are important for improved results, and that we can generalize DPO to Iterative DPO in the same way.

Submitted 22 April, 2024; v1 submitted 27 December, 2023; originally announced December 2023.

arXiv:2311.11829

System 2 Attention (is something you might need too)

Authors: Jason Weston, Sainbayar Sukhbaatar

Abstract: Soft attention in Transformer-based Large Language Models (LLMs) is susceptible to incorporating irrelevant information from the context into its latent representations, which adversely affects next token generations. To help rectify these issues, we introduce System 2 Attention (S2A), which leverages the ability of LLMs to reason in natural language and follow instructions in order to decide what to attend to. S2A regenerates the input context to only include the relevant portions, before attending to the regenerated context to elicit the final response. In experiments, S2A outperforms standard attention-based LLMs on three tasks containing opinion or irrelevant information, QA, math word problems and longform generation, where S2A increases factuality and objectivity, and decreases sycophancy.

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2311.07961

The ART of LLM Refinement: Ask, Refine, and Trust

Authors: Kumar Shridhar, Koustuv Sinha, Andrew Cohen, Tianlu Wang, Ping Yu, Ram Pasunuru, Mrinmaya Sachan, Jason Weston, Asli Celikyilmaz

Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable generative abilities, but can they judge the quality of their own generations? A popular concept, referred to as self-refinement, postulates that LLMs can detect and correct the errors in their generations when asked to do so. However, recent empirical evidence points in the opposite direction, suggesting that LLMs often struggle to accurately identify errors when reasoning is involved. To address this, we propose a reasoning with refinement objective called ART: Ask, Refine, and Trust, which asks necessary questions to decide when an LLM should refine its output, and either affirm or withhold trust in its refinement by ranking the refinement and the initial prediction. On two multistep reasoning tasks of mathematical word problems (GSM8K) and question answering (StrategyQA), ART achieves a performance gain of +5 points over self-refinement baselines, while using a much smaller model as the decision maker. We also demonstrate the benefit of using smaller models to make refinement decisions as a cost-effective alternative to fine-tuning a larger model.

Submitted 14 November, 2023; originally announced November 2023.

arXiv:2310.15123

Branch-Solve-Merge Improves Large Language Model Evaluation and Generation

Authors: Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, Xian Li

Abstract: Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks that involve satisfying intricate user constraints or taking into account multiple aspects and criteria. However, their performance can fall short, due to the model's lack of coherence and inability to plan and decompose the problem. We propose Branch-Solve-Merge (BSM), a Large Language Model program (Schlag et al., 2023) for tackling such challenging natural language tasks. It consists of branch, solve, and merge modules that are parameterized with specific prompts to the base LLM. These three modules plan a decomposition of the task into multiple parallel sub-tasks, independently solve them, and fuse the solutions to the sub-tasks. We apply our method to the tasks of LLM response evaluation and constrained text generation and evaluate its effectiveness with multiple LLMs, including Vicuna, LLaMA-2-chat, and GPT-4. BSM improves the evaluation correctness and consistency for each LLM by enhancing human-LLM agreement by up to 26%, reducing length and pairwise position biases by up to 50%, and allowing LLaMA2-chat to match or outperform GPT-4 on most domains. On a constraint story generation task, BSM improves the coherence of stories while also improving constraint satisfaction by 12%.

Submitted 7 June, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: NAACL 2024 (19 pages, 7 figures, 11 tables)

arXiv:2310.05029

Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading

Authors: Howard Chen, Ramakanth Pasunuru, Jason Weston, Asli Celikyilmaz

Abstract: Large language models (LLMs) have advanced in large strides due to the effectiveness of the self-attention mechanism that processes and compares all tokens at once. However, this mechanism comes with a fundamental issue -- the predetermined context window is bound to be limited. Despite attempts to extend the context window through methods like extrapolating the positional embedding, using recurrence, or selectively retrieving essential parts of the long sequence, long-text understanding continues to be a challenge. We propose an alternative approach which instead treats the LLM as an interactive agent, allowing it to decide how to read the text via iterative prompting. We introduce MemWalker, a method that first processes the long context into a tree of summary nodes. Upon receiving a query, the model navigates this tree in search of relevant information, and responds once it gathers sufficient information. On long-text question answering tasks our method outperforms baseline approaches that use long context windows, recurrence, and retrieval. We show that, beyond effective reading, MemWalker enhances explainability by highlighting the reasoning steps as it interactively reads the text; pinpointing the relevant text segments related to the query.

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2309.11495

Chain-of-Verification Reduces Hallucination in Large Language Models

Authors: Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston

Abstract: Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.

Submitted 25 September, 2023; v1 submitted 20 September, 2023; originally announced September 2023.
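
The four steps listed in the Chain-of-Verification abstract above can be sketched as a short pipeline. Everything concrete here is an assumption: `llm` is a hypothetical callable mapping a prompt string to a completion string, and the prompt wording is invented for illustration rather than taken from the paper. The one property the sketch does try to preserve is step (iii): each verification question is sent on its own, without the draft, so its answer is not biased by the draft.

```python
from typing import Callable, List, Tuple


def chain_of_verification(query: str, llm: Callable[[str], str]) -> str:
    # (i) Draft an initial response.
    draft = llm(f"Answer the question.\nQuestion: {query}")

    # (ii) Plan verification questions that fact-check the draft.
    plan = llm(
        "List verification questions, one per line, to fact-check this answer.\n"
        f"Question: {query}\nAnswer: {draft}"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # (iii) Answer each question independently: the prompt holds only the question.
    checks: List[Tuple[str, str]] = [(q, llm(q)) for q in questions]

    # (iv) Generate the final response using the verification results.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in checks)
    return llm(
        "Revise the draft using the verification results.\n"
        f"Question: {query}\nDraft: {draft}\nVerification:\n{evidence}"
    )
```

A caller would pass any function that wraps a model endpoint as `llm`; a stub that returns canned strings is enough to exercise the control flow.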

arXiv:2308.06259

Self-Alignment with Instruction Backtranslation

Authors: Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Omer Levy, Luke Zettlemoyer, Jason Weston, Mike Lewis

Abstract: We present a scalable method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions. Our approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model. Finetuning LLaMa on two iterations of our approach yields a model that outperforms all other LLaMa-based models on the Alpaca leaderboard not relying on distillation data, demonstrating highly effective self-alignment.

Submitted 12 March, 2024; v1 submitted 11 August, 2023; originally announced August 2023.

Comments: ICLR2024 camera ready

arXiv:2307.14117

Leveraging Implicit Feedback from Deployment Data in Dialogue

Authors: Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, Jason Weston

Abstract: We study improving social conversational agents by learning from natural dialogue between users and a deployed model, without extra annotations. To implicitly measure the quality of a machine-generated utterance, we leverage signals like user response length, sentiment and reaction of the future human utterances in the collected dialogue episodes. Our experiments use the publicly released deployment data from BlenderBot (Xu et al., 2023). Human evaluation indicates improvements in our new models over baseline responses; however, we find that some proxy signals can lead to more generations with undesirable properties as well. For example, optimizing for conversation length can lead to more controversial or unfriendly generations compared to the baseline, whereas optimizing for positive sentiment or reaction can decrease these behaviors.

Submitted 31 January, 2024; v1 submitted 26 July, 2023; originally announced July 2023.

Comments: EACL 2024

arXiv:2306.13588

System-Level Natural Language Feedback

Authors: Weizhe Yuan, Kyunghyun Cho, Jason Weston

Abstract: Natural language (NL) feedback offers rich insights into user experience. While existing studies focus on an instance-level approach, where feedback is used to refine specific examples, we introduce a framework for system-level use of NL feedback. We show how to use feedback to formalize system-level design decisions in a human-in-the-loop-process -- in order to produce better models. In particular this is done through: (i) metric design for tasks; and (ii) language model prompt design for refining model responses. We conduct two case studies of this approach for improving search query and dialog response generation, demonstrating the effectiveness of system-level feedback. We show the combination of system-level and instance-level feedback brings further gains, and that human written instance-level feedback results in more grounded refinements than GPT-3.5 written ones, underlining the importance of human feedback for building systems. We release our code and data at https://github.com/yyy-Apple/Sys-NL-Feedback.

Submitted 2 February, 2024; v1 submitted 23 June, 2023; originally announced June 2023.

Comments: Accepted by EACL 2024

arXiv:2306.04765

The HCI Aspects of Public Deployment of Research Chatbots: A User Study, Design Recommendations, and Open Challenges

Authors: Morteza Behrooz, William Ngan, Joshua Lane, Giuliano Morse, Benjamin Babcock, Kurt Shuster, Mojtaba Komeili, Moya Chen, Melanie Kambadur, Y-Lan Boureau, Jason Weston

Abstract: Publicly deploying research chatbots is a nuanced topic involving necessary risk-benefit analyses. While there have recently been frequent discussions on whether it is responsible to deploy such models, there has been far less focus on the interaction paradigms and design approaches that the resulting interfaces should adopt, in order to achieve their goals more effectively. We aim to pose, ground, and attempt to answer HCI questions involved in this scope, by reporting on a mixed-methods user study conducted on a recent research chatbot. We find that abstract anthropomorphic representation for the agent has a significant effect on users' perception, that offering AI explainability may have an impact on feedback rates, and that two (diegetic and extradiegetic) levels of the chat experience should be intentionally designed. We offer design recommendations and areas of further focus for the research community.

Submitted 7 June, 2023; originally announced June 2023.

arXiv:2306.04707 [pdf, other]

Improving Open Language Models by Learning from Organic Interactions

Authors: Jing Xu, Da Ju, Joshua Lane, Mojtaba Komeili, Eric Michael Smith, Megan Ung, Morteza Behrooz, William Ngan, Rashel Moritz, Sainbayar Sukhbaatar, Y-Lan Boureau, Jason Weston, Kurt Shuster

Abstract: We present BlenderBot 3x, an update on the conversational model BlenderBot 3, which is now trained using organic conversation and feedback data from participating users of the system in order to improve both its skills and safety. We are publicly releasing the participating de-identified interaction data for use by the research community, in order to spur further progress. Training models with organic data is challenging because interactions with people "in the wild" include both high quality conversations and feedback, as well as adversarial and toxic behavior. We study techniques that enable learning from helpful teachers while avoiding learning from people who are trying to trick the model into unhelpful or toxic responses. BlenderBot 3x is both preferred in conversation to BlenderBot 3, and is shown to produce safer responses in challenging situations. While our current models are still far from perfect, we believe further improvement can be achieved by continued use of the techniques explored in this work.

Submitted 7 June, 2023; originally announced June 2023.

arXiv:2305.05364 [pdf, other]

Large Language Model Programs

Authors: Imanol Schlag, Sainbayar Sukhbaatar, Asli Celikyilmaz, Wen-tau Yih, Jason Weston, Jürgen Schmidhuber, Xian Li

Abstract: In recent years, large pre-trained language models (LLMs) have demonstrated the ability to follow instructions and perform novel tasks from a few examples. The possibility to parameterise an LLM through such in-context examples widens their capability at a much lower cost than finetuning. We extend this line of reasoning and present a method which further expands the capabilities of an LLM by embedding it within an algorithm or program. To demonstrate the benefits of this approach, we present an illustrative example of evidence-supported question-answering. We obtain a 6.4% improvement over the chain of thought baseline through a more algorithmic approach without any finetuning. Furthermore, we highlight recent work from this perspective and discuss the advantages and disadvantages in comparison to the standard approaches.

Submitted 9 May, 2023; originally announced May 2023.

arXiv:2305.00833 [pdf, other]

Learning to Reason and Memorize with Self-Notes

Authors: Jack Lanchantin, Shubham Toshniwal, Jason Weston, Arthur Szlam, Sainbayar Sukhbaatar

Abstract: Large language models have been shown to struggle with multi-step reasoning, and do not retain previous reasoning steps for future use. We propose a simple method for solving both of these problems by allowing the model to take Self-Notes. Unlike recent chain-of-thought or scratchpad approaches, the model can deviate from the input context at any time to explicitly think and write down its thoughts. This allows the model to perform reasoning on the fly as it reads the context and even integrate previous reasoning steps, thus enhancing its memory with useful information and enabling multi-step reasoning. Experiments across a wide variety of tasks demonstrate that our method can outperform chain-of-thought and scratchpad methods by taking Self-Notes that interleave the input text.

Submitted 31 October, 2023; v1 submitted 1 May, 2023; originally announced May 2023.

arXiv:2304.13835 [pdf, other]

Multi-Party Chat: Conversational Agents in Group Settings with Humans and Models

Authors: Jimmy Wei, Kurt Shuster, Arthur Szlam, Jason Weston, Jack Urbanek, Mojtaba Komeili

Abstract: Current dialogue research primarily studies pairwise (two-party) conversations, and does not address the everyday setting where more than two speakers converse together. In this work, we both collect and evaluate multi-party conversations to study this more general case. We use the LIGHT environment to construct grounded conversations, where each participant has an assigned character to role-play. We thus evaluate the ability of language models to act as one or more characters in such conversations. Models require two skills that pairwise-trained models appear to lack: (1) being able to decide when to talk; (2) producing coherent utterances grounded on multiple characters. We compare models trained on our new dataset to existing pairwise-trained dialogue models, as well as large language models with few-shot prompting. We find that our new dataset, MultiLIGHT, which we will publicly release, can help bring significant improvements in the group setting.

Submitted 8 June, 2023; v1 submitted 26 April, 2023; originally announced April 2023.

arXiv:2302.06784 [pdf, other]

The Stable Entropy Hypothesis and Entropy-Aware Decoding: An Analysis and Algorithm for Robust Natural Language Generation

Authors: Kushal Arora, Timothy J. O'Donnell, Doina Precup, Jason Weston, Jackie C. K. Cheung

Abstract: State-of-the-art language generation models can degenerate when applied to open-ended generation problems such as text completion, story generation, or dialog modeling. This degeneration usually shows up in the form of incoherence, lack of vocabulary diversity, and self-repetition or copying from the context. In this paper, we postulate that "human-like" generations usually lie in a narrow and nearly flat entropy band, and violation of these entropy bounds correlates with degenerate behavior. Our experiments show that this stable narrow entropy zone exists across models, tasks, and domains and confirm the hypothesis that violations of this zone correlate with degeneration. We then use this insight to propose an entropy-aware decoding algorithm that respects these entropy bounds resulting in less degenerate, more contextual, and "human-like" language generation in open-ended text generation settings.

Submitted 13 February, 2023; originally announced February 2023.

arXiv:2301.05746 [pdf, other]

Infusing Commonsense World Models with Graph Knowledge

Authors: Alexander Gurung, Mojtaba Komeili, Arthur Szlam, Jason Weston, Jack Urbanek

Abstract: While language models have become more capable of producing compelling language, we find there are still gaps in maintaining consistency, especially when describing events in a dynamically changing world. We study the setting of generating narratives in an open world text adventure game, where a graph representation of the underlying game state can be used to train models that consume and output both grounded graph representations and natural language descriptions and actions. We build a large set of tasks by combining crowdsourced and simulated gameplays with a novel dataset of complex actions in order to construct such models. We find it is possible to improve the consistency of action narration models by training on graph contexts and targets, even if graphs are not present at test time. This is shown both in automatic metrics and human evaluations. We plan to release our code, the new set of tasks, and best performing models.

Submitted 13 January, 2023; originally announced January 2023.

arXiv:2211.05826 [pdf, other]

The CRINGE Loss: Learning what language not to model

Authors: Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, Jason Weston

Abstract: Standard language model training employs gold human documents or human-human interaction data, and treats all training data as positive examples. Growing evidence shows that even with very large amounts of positive training data, issues remain that can be alleviated with relatively small amounts of negative data -- examples of what the model should not do. In this work, we propose a novel procedure to train with such data called the CRINGE loss (ContRastive Iterative Negative GEneration). We show the effectiveness of this approach across three different experiments on the tasks of safe generation, contradiction avoidance, and open-domain dialogue. Our models outperform multiple strong baselines and are conceptually simple, easy to train and implement.

Submitted 10 November, 2022; originally announced November 2022.

arXiv:2210.15893 [pdf, other]

When Life Gives You Lemons, Make Cherryade: Converting Feedback from Bad Responses into Good Labels

Authors: Weiyan Shi, Emily Dinan, Kurt Shuster, Jason Weston, Jing Xu

Abstract: Deployed dialogue agents have the potential to integrate human feedback to continuously improve themselves. However, humans may not always provide explicit signals when the chatbot makes mistakes during interactions. In this work, we propose Juicer, a framework to make use of both binary and free-form textual human feedback. It works by: (i) extending sparse binary feedback by training a satisfaction classifier to label the unlabeled data; and (ii) training a reply corrector to map the bad replies to good ones. We find that augmenting training with model-corrected replies improves the final dialogue model, and we can further improve performance by using both positive and negative replies through the recently proposed Director model.

Submitted 28 October, 2022; originally announced October 2022.

arXiv:2208.03295 [pdf, other]

Learning from data in the mixed adversarial non-adversarial case: Finding the helpers and ignoring the trolls

Authors: Da Ju, Jing Xu, Y-Lan Boureau, Jason Weston

Abstract: The promise of interaction between intelligent conversational agents and humans is that models can learn from such feedback in order to improve. Unfortunately, such exchanges in the wild will not always involve human utterances that are benign or of high quality, and will include a mixture of engaged (helpers) and unengaged or even malicious users (trolls). In this work we study how to perform robust learning in such an environment. We introduce a benchmark evaluation, SafetyMix, which can evaluate methods that learn safe vs. toxic language in a variety of adversarial settings to test their robustness. We propose and analyze several mitigating learning algorithms that identify trolls either at the example or at the user level. Our main finding is that user-based methods, that take into account that troll users will exhibit adversarial behavior across multiple examples, work best in a variety of settings on our benchmark. We then test these methods in a further real-life setting of conversations collected during deployment, with similar results.

Submitted 5 August, 2022; originally announced August 2022.

arXiv:2208.03270 [pdf, other]

Learning New Skills after Deployment: Improving open-domain internet-driven dialogue with human feedback

Authors: Jing Xu, Megan Ung, Mojtaba Komeili, Kushal Arora, Y-Lan Boureau, Jason Weston

Abstract: Frozen models trained to mimic static datasets can never improve their performance. Models that can employ internet-retrieval for up-to-date information and obtain feedback from humans during deployment provide the promise of both adapting to new information, and improving their performance. In this work we study how to improve internet-driven conversational skills in such a learning framework. We collect deployment data, which we make publicly available, of human interactions, and collect various types of human feedback -- including binary quality measurements, free-form text feedback, and fine-grained reasons for failure. We then study various algorithms for improving from such feedback, including standard supervised learning, rejection sampling, model-guiding and reward-based learning, in order to make recommendations on which type of feedback and algorithms work best. We find the recently introduced Director model (Arora et al., '22) shows significant improvements over other existing approaches.

Submitted 16 August, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

arXiv:2208.03188 [pdf, other]

BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage

Authors: Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, Jason Weston

Abstract: We present BlenderBot 3, a 175B parameter dialogue model capable of open-domain conversation with access to the internet and a long-term memory, and having been trained on a large number of user defined tasks. We release both the model weights and code, and have also deployed the model on a public web page to interact with organic users. This technical report describes how the model was built (architecture, model and training scheme), and details of its deployment, including safety mechanisms. Human evaluations show its superiority to existing open-domain dialogue agents, including its predecessors (Roller et al., 2021; Komeili et al., 2022). Finally, we detail our plan for continual learning using the data collected from deployment, which will also be publicly released. The goal of this research program is thus to enable the community to study ever-improving responsible agents that learn through interaction.

Submitted 10 August, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

arXiv:2207.02472 [pdf, other]

doi 10.1103/PhysRevB.107.245423

InAs-Al Hybrid Devices Passing the Topological Gap Protocol

Authors: Morteza Aghaee, Arun Akkala, Zulfi Alam, Rizwan Ali, Alejandro Alcaraz Ramirez, Mariusz Andrzejczuk, Andrey E Antipov, Pavel Aseev, Mikhail Astafev, Bela Bauer, Jonathan Becker, Srini Boddapati, Frenk Boekhout, Jouri Bommer, Esben Bork Hansen, Tom Bosma, Leo Bourdet, Samuel Boutin, Philippe Caroff, Lucas Casparis, Maja Cassidy, Anna Wulf Christensen, Noah Clay, William S Cole, Fabiano Corsetti , et al. (102 additional authors not shown)

Abstract: We present measurements and simulations of semiconductor-superconductor heterostructure devices that are consistent with the observation of topological superconductivity and Majorana zero modes. The devices are fabricated from high-mobility two-dimensional electron gases in which quasi-one-dimensional wires are defined by electrostatic gates. These devices enable measurements of local and non-local transport properties and have been optimized via extensive simulations to ensure robustness against non-uniformity and disorder. Our main result is that several devices, fabricated according to the design's engineering specifications, have passed the topological gap protocol defined in Pikulin et al. [arXiv:2103.12217]. This protocol is a stringent test composed of a sequence of three-terminal local and non-local transport measurements performed while varying the magnetic field, semiconductor electron density, and junction transparencies. Passing the protocol indicates a high probability of detection of a topological phase hosting Majorana zero modes as determined by large-scale disorder simulations. Our experimental results are consistent with a quantum phase transition into a topological superconducting phase that extends over several hundred millitesla in magnetic field and several millivolts in gate voltage, corresponding to approximately one hundred micro-electron-volts in Zeeman energy and chemical potential in the semiconducting wire. These regions feature a closing and re-opening of the bulk gap, with simultaneous zero-bias conductance peaks at both ends of the devices that withstand changes in the junction transparencies. The extracted maximum topological gaps in our devices are 20-60 $\mu$eV. This demonstration is a prerequisite for experiments involving fusion and braiding of Majorana zero modes.

Submitted 8 March, 2024; v1 submitted 6 July, 2022; originally announced July 2022.

Comments: Final version

arXiv:2206.07694 [pdf, other]

DIRECTOR: Generator-Classifiers For Supervised Language Modeling

Authors: Kushal Arora, Kurt Shuster, Sainbayar Sukhbaatar, Jason Weston

Abstract: Current language models achieve low perplexity but their resulting generations still suffer from toxic responses, repetitiveness and contradictions. The standard language modeling setup fails to address these issues. In this paper, we introduce a new architecture, Director, that consists of a unified generator-classifier with both a language modeling and a classification head for each output token. Training is conducted jointly using both standard language modeling data, and data labeled with desirable and undesirable sequences. Experiments in several settings show that the model has competitive training and decoding speed compared to standard language models while yielding superior results, alleviating known issues while maintaining generation quality. It also outperforms existing model guiding approaches in terms of both accuracy and efficiency.

Submitted 25 November, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

arXiv:2203.13224 [pdf, other]

Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion

Authors: Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, Jason Weston

Abstract: Language models (LMs) have recently been shown to generate more factual responses by employing modularity (Zhou et al., 2021) in combination with retrieval (Adolphs et al., 2021). We extend the recent approach of Adolphs et al. (2021) to include internet search as a module. Our SeeKeR (Search engine->Knowledge->Response) method thus applies a single LM to three modular tasks in succession: search, generating knowledge, and generating a final response. We show that, when using SeeKeR as a dialogue model, it outperforms the state-of-the-art model BlenderBot 2 (Chen et al., 2021) on open-domain knowledge-grounded conversations for the same number of parameters, in terms of consistency, knowledge and per-turn engagingness. SeeKeR applied to topical prompt completions as a standard language model outperforms GPT2 (Radford et al., 2019) and GPT3 (Brown et al., 2020) in terms of factuality and topicality, despite GPT3 being a vastly larger model. Our code and models are made publicly available.

Submitted 29 March, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

arXiv:2202.11507 [pdf, other]

On Carbon Taxes Effectiveness to Induce a Clean Technology Transition: An Evaluation Framework Based on Optimal Strategic Capacity Planning

Authors: N. Wolf, P. Escalona, A. Angulo, J. Weston

Abstract: This paper studies the effectiveness of carbon taxes in inducing a transition to cleaner production when a firm faces different technologies and demands. To determine this effectiveness, we propose a framework based on a model of strategic capacity planning under carbon taxes that considers proper performance measures. The model, which is formulated as a mixed integer linear problem (MILP), considers issues that previous work has not studied jointly, such as machine replacement, workforce planning, and maintenance. The effectiveness measures consider levels of clean production and periods to reach a technological transition. Our computational experiments, based on a real case, have shown that carbon taxes by themselves do not necessarily induce a transition to clean production, since their effectiveness depends on the available technology relationship and the demand magnitude.

Submitted 23 February, 2022; originally announced February 2022.

arXiv:2201.04723 [pdf, other]

Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Authors: Eric Michael Smith, Orion Hsu, Rebecca Qian, Stephen Roller, Y-Lan Boureau, Jason Weston

Abstract: At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known (Liu et al., 2016, arXiv:1603.08023), with human evaluations still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing amounts of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that different methods are best depending on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice of when to use which one, and possible future directions.

Submitted 12 January, 2022; originally announced January 2022.

arXiv:2112.05843 [pdf, other]

Am I Me or You? State-of-the-Art Dialogue Models Cannot Maintain an Identity

Authors: Kurt Shuster, Jack Urbanek, Arthur Szlam, Jason Weston

Abstract: State-of-the-art dialogue models still often stumble with regard to factual accuracy and self-contradiction. Anecdotally, they have been observed to fail to maintain character identity throughout discourse; and more specifically, may take on the role of their interlocutor. In this work we formalize and quantify this deficiency, and show experimentally through human evaluations that this is indeed a problem. In contrast, we show that discriminative models trained specifically to recognize who is speaking can perform well; and further, these can be used as automated metrics. Finally, we evaluate a wide variety of mitigation methods, including changes to model architecture, training protocol, and decoding strategy. Our best models reduce mistaken identity issues by nearly 65% according to human annotators, while simultaneously improving engagingness. Despite these results, we find that maintaining character identity still remains a challenging problem.

Submitted 10 December, 2021; originally announced December 2021.

arXiv:2111.05204 [pdf, other]

Reason first, then respond: Modular Generation for Knowledge-infused Dialogue

Authors: Leonard Adolphs, Kurt Shuster, Jack Urbanek, Arthur Szlam, Jason Weston

Abstract: Large language models can produce fluent dialogue but often hallucinate factual inaccuracies. While retrieval-augmented models help alleviate this issue, they still face a difficult challenge of both reasoning to provide correct knowledge and generating conversation simultaneously. In this work, we propose a modular model, Knowledge to Response (K2R), for incorporating knowledge into conversational agents, which breaks down this problem into two easier steps. K2R first generates a knowledge sequence, given a dialogue context, as an intermediate step. After this "reasoning step", the model then attends to its own generated knowledge sequence, as well as the dialogue context, to produce a final response. In detailed experiments, we find that such a model hallucinates less in knowledge-grounded dialogue tasks, and has advantages in terms of interpretability and modularity. In particular, it can be used to fuse QA and dialogue systems together to enable dialogue agents to give knowledgeable answers, or QA models to give conversational responses in a zero-shot setting.

Submitted 9 November, 2021; originally announced November 2021.

arXiv:2110.10497 [pdf, other]

doi 10.1103/PhysRevD.105.052010

Search for dark photons using a multilayer dielectric haloscope equipped with a single-photon avalanche diode

Authors: Laura Manenti, Umang Mishra, Gianmarco Bruno, Adriano Di Giovanni, Alexander John Millar, Knut Dundas Morå, Renu Pasricha, Henry Roberts, Panos Oikonomou, Isaac Sarnoff, James Weston, Francesco Arneodo

Abstract: We report on the results of the search for dark photons with mass around 1.5$\,\rm eV/c^2$ using a multilayer dielectric haloscope equipped with an affordable and commercially available photosensor. The multilayer stack, which enables the conversion of dark photons (DP) to Standard Model photons, is made of 23 bilayers of alternating SiO$_2$ and Si$_3$N$_4$ thin films with linearly increasing thicknesses through the stack (a configuration known as a "chirped stack"). The thicknesses have been chosen according to an optimisation algorithm in order to maximise the DP-photon conversion in the energy region where the photosensor sensitivity peaks. This prototype experiment, baptised MuDHI (Multilayer Dielectric Haloscope Investigation) by the authors of this paper, has been designed, developed and run at the Astroparticle Laboratory of New York University Abu Dhabi, which marks the first time a dark matter experiment has been operated in the Middle East. No significant signal excess is observed, and the method of maximum log-likelihood is used to set exclusion limits at $90\%$ confidence level on the kinetic mixing coupling constant between dark photons and ordinary photons.

Submitted 7 January, 2023; v1 submitted 20 October, 2021; originally announced October 2021.

Journal ref: Phys. Rev. D 105, 052010 (2022)

arXiv:2110.09456 [pdf, other]

NormFormer: Improved Transformer Pretraining with Extra Normalization

Authors: Sam Shleifer, Jason Weston, Myle Ott

Abstract: During pretraining, the Pre-LayerNorm transformer suffers from a gradient magnitude mismatch: gradients at early layers are much larger than at later layers. These issues can be alleviated by our proposed NormFormer architecture, which adds three normalization operations to each layer: a Layer Norm after self attention, head-wise scaling of self-attention outputs, and a Layer Norm after the first fully connected layer. The extra operations incur negligible compute cost (+0.4% parameter increase), but improve pretraining perplexity and downstream task performance for both causal and masked language models ranging from 125 Million to 2.7 Billion parameters. For example, adding NormFormer on top of our strongest 1.3B parameter baseline can reach equal perplexity 24% faster, or converge 0.27 perplexity better in the same compute budget. This model reaches GPT3-Large (1.3B) zero shot performance 60% faster. For masked language modeling, NormFormer improves fine-tuned GLUE performance by 1.9% on average. Code to train NormFormer models is available in fairseq https://github.com/pytorch/fairseq/tree/main/examples/normformer .

Submitted 1 November, 2021; v1 submitted 18 October, 2021; originally announced October 2021.

arXiv:2107.08251 [pdf, other]

Generative Pretraining for Paraphrase Evaluation

Authors: Jack Weston, Raphael Lenain, Udeepa Meepegama, Emil Fristed

Abstract: We introduce ParaBLEU, a paraphrase representation learning model and evaluation metric for text generation. Unlike previous approaches, ParaBLEU learns to understand paraphrasis using generative conditioning as a pretraining objective. ParaBLEU correlates more strongly with human judgements than existing metrics, obtaining new state-of-the-art results on the 2017 WMT Metrics Shared Task. We show that our model is robust to data scarcity, exceeding previous state-of-the-art performance using only $50\%$ of the available training data and surpassing BLEU, ROUGE and METEOR with only $40$ labelled examples. Finally, we demonstrate that ParaBLEU can be used to conditionally generate novel paraphrases from a single demonstration, which we use to confirm our hypothesis that it learns abstract, generalized paraphrase representations.

Submitted 24 July, 2021; v1 submitted 17 July, 2021; originally announced July 2021.

Comments: Under review

arXiv:2107.08248 [pdf, other]

Learning De-identified Representations of Prosody from Raw Audio

Authors: Jack Weston, Raphael Lenain, Udeepa Meepegama, Emil Fristed

Abstract: We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal. Whereas prior work has relied on conditioning models on bottlenecks, we introduce a set of inductive biases that exploit the natural structure of prosody to minimize timbral information and decouple prosody from speaker representations. Despite aggressive downsampling of the input and having no access to linguistic information, our model performs comparably to state-of-the-art speech representations on DAMMP, a new benchmark we introduce for spoken language understanding. We use minimum description length probing to show that our representations have selectively learned the subcomponents of non-timbral prosody, and that the product quantizer naturally disentangles them without using bottlenecks. We derive an information-theoretic definition of speech de-identifiability and use it to demonstrate that our prosody representations are less identifiable than other speech representations.

Submitted 17 July, 2021; originally announced July 2021.

Comments: ICML 2021

Journal ref: Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event. Proceedings of Machine Learning Research 139, PMLR 2021

arXiv:2107.07567 [pdf, other]

Beyond Goldfish Memory: Long-Term Open-Domain Conversation

Authors: Jing Xu, Arthur Szlam, Jason Weston

Abstract: Despite recent improvements in open-domain dialogue models, state of the art models are trained and evaluated on short conversations with little context. In contrast, the long-term conversation setting has hardly been studied. In this work we collect and release a human-human dataset consisting of multiple chat sessions whereby the speaking partners learn about each other's interests and discuss the things they have learnt from past sessions. We show how existing models trained on existing datasets perform poorly in this long-term conversation setting in both automatic and human evaluations, and we study long-context models that can perform much better. In particular, we find retrieval-augmented methods and methods with an ability to summarize and recall previous conversations outperform the standard encoder-decoder architectures currently considered state of the art.

Submitted 15 July, 2021; originally announced July 2021.

arXiv:2107.07566 [pdf, other]

Internet-Augmented Dialogue Generation

Authors: Mojtaba Komeili, Kurt Shuster, Jason Weston

Abstract: The largest store of continually updating knowledge on our planet can be accessed via internet search. In this work we study giving access to this information to conversational agents. Large language models, even though they store an impressive amount of knowledge within their weights, are known to hallucinate facts when generating dialogue (Shuster et al., 2021); moreover, those facts are frozen in time at the point of model training. In contrast, we propose an approach that learns to generate an internet search query based on the context, and then conditions on the search results to finally generate a response, a method that can employ up-to-the-minute relevant information. We train and evaluate such models on a newly collected dataset of human-human conversations whereby one of the speakers is given access to internet search during knowledge-driven discussions in order to ground their responses. We find that search-query based access of the internet in conversation provides superior performance compared to existing approaches that either use no augmentation or FAISS-based retrieval (Lewis et al., 2020).

Submitted 15 July, 2021; originally announced July 2021.

arXiv:2107.06251 [pdf, other]

doi 10.3847/1538-4365/ac24ab

Classical Novae at Radio Wavelengths

Authors: Laura Chomiuk, Justin D. Linford, Elias Aydi, Keith W. Bannister, Miriam I. Krauss, Amy J. Mioduszewski, Koji Mukai, Thomas J. Nelson, Michael P. Rupen, Stuart D. Ryder, Jennifer L. Sokoloski, Kirill V. Sokolovsky, Jay Strader, Miroslav D. Filipovic, Tom Finzell, Adam Kawash, Erik C. Kool, Brian D. Metzger, Miriam M. Nyamai, Valerio A. R. M. Ribeiro, Nirupam Roy, Ryan Urquhart, Jennifer Weston

Abstract: We present radio observations (1--40 GHz) for 36 classical novae, representing data from over five decades compiled from the literature, telescope archives, and our own programs. Our targets display a striking diversity in their optical parameters (e.g., spanning optical fading timescales, t_2 = 1--263 days), and we find a similar diversity in the radio light curves. Using a brightness temperature analysis, we find that radio emission from novae is a mixture of thermal and synchrotron emission, with non-thermal emission observed at earlier times. We identify high brightness temperature emission (T_B > 5x10^4 K) as an indication of synchrotron emission in at least 9 (25%) of the novae. We find a class of synchrotron-dominated novae with mildly evolved companions, exemplified by V5589 Sgr and V392 Per, that appear to be a bridge between classical novae with dwarf companions and symbiotic binaries with giant companions. Four of the novae in our sample have two distinct radio maxima (the first dominated by synchrotron and the later by thermal emission), and in four cases the early synchrotron peak is temporally coincident with a dramatic dip in the optical light curve, hinting at a common site for particle acceleration and dust formation. We publish the light curves as tables and encourage use of these data by the broader community in multi-wavelength studies and modeling efforts.

Submitted 13 July, 2021; originally announced July 2021.

Comments: Submitted to AAS Journals

arXiv:2106.15782 [pdf, other]

doi 10.1093/mnras/stac1366

Shocks and dust formation in nova V809 Cep

Authors: Aliya-Nur Babul, Jennifer L. Sokoloski, Laura Chomiuk, Justin D. Linford, Jennifer H. S. Weston, Elias Aydi, Kirill V. Sokolovsky, Adam M. Kawash

Abstract: The discovery that many classical novae produce detectable GeV $γ$-ray emission has raised the question of the role of shocks in nova eruptions. Here we use radio observations of nova V809 Cep (Nova Cep 2013) with the Jansky Very Large Array to show that it produced non-thermal emission indicative of particle acceleration in strong shocks for more than a month starting about six weeks into the eruption, quasi-simultaneous with the production of dust. Broadly speaking, the radio emission at late times -- more than six months or so into the eruption -- is consistent with thermal emission from $10^{-4} M_\odot$ of freely expanding, $10^4$~K ejecta. At 4.6 and 7.4 GHz, however, the radio light-curves display an initial early-time peak 76 days after the discovery of the eruption in the optical ($t_0$). The brightness temperature at 4.6 GHz on day 76 was greater than $10^5$~K, an order of magnitude above what is expected for thermal emission. We argue that the brightness temperature is the result of synchrotron emission due to internal shocks within the ejecta. The evolution of the radio spectrum was consistent with synchrotron emission that peaked at high frequencies before low frequencies, suggesting that the synchrotron from the shock was initially subject to free-free absorption by optically thick ionized material in front of the shock. Dust formation began around day 37, and we suggest that internal shocks in the ejecta were established prior to dust formation and caused the nucleation of dust.

Submitted 29 June, 2021; originally announced June 2021.

arXiv:2106.04426 [pdf, other]

Hash Layers For Large Sparse Models

Authors: Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston

Abstract: We investigate the training of sparse layers that use different parameters for different inputs based on hashing in large Transformer models. Specifically, we modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence. We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods such as Switch Transformers and BASE Layers, while requiring no routing parameters or extra terms in the objective function such as a load balancing loss, and no sophisticated assignment algorithm. We study the performance of different hashing techniques, hash sizes and input features, and show that balanced and random hashes focused on the most local features work best, compared to either learning clusters or using longer-range context. We show our approach works well both on large language modeling and dialogue tasks, and on downstream fine-tuning tasks.

Submitted 20 July, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

arXiv:2106.04279 [pdf, other]

Staircase Attention for Recurrent Processing of Sequences

Authors: Da Ju, Stephen Roller, Sainbayar Sukhbaatar, Jason Weston

Abstract: Attention mechanisms have become a standard tool for sequence modeling tasks, in particular by stacking self-attention layers over the entire input sequence as in the Transformer architecture. In this work we introduce a novel attention procedure called staircase attention that, unlike self-attention, operates across the sequence (in time) recurrently processing the input by adding another step of processing. A step in the staircase consists of backward tokens (encoding the sequence so far seen) and forward tokens (ingesting a new part of the sequence), or an extreme Ladder version with a forward step of zero that simply repeats the Transformer on each step of the ladder, sharing the weights. We thus describe a family of such models that can trade off performance and compute, by either increasing the amount of recurrence through time, the amount of sequential processing via recurrence in depth, or both. Staircase attention is shown to be able to solve tasks that involve tracking that conventional Transformers cannot, due to this recurrence. Further, it is shown to provide improved modeling power for the same size model (number of parameters) compared to self-attentive Transformers on large language modeling and dialogue tasks, yielding significant perplexity gains.

Submitted 8 June, 2021; originally announced June 2021.

arXiv:2105.06548 [pdf, other]

Not All Memories are Created Equal: Learning to Forget by Expiring

Authors: Sainbayar Sukhbaatar, Da Ju, Spencer Poff, Stephen Roller, Arthur Szlam, Jason Weston, Angela Fan

Abstract: Attention mechanisms have shown promising results in sequence modeling tasks that require long-term memory. Recent work investigated mechanisms to reduce the computational cost of preserving and storing memories. However, not all content in the past is equally important to remember. We propose Expire-Span, a method that learns to retain the most important information and expire the irrelevant information. This forgetting of memories enables Transformers to scale to attend over tens of thousands of previous timesteps efficiently, as not all states from previous timesteps are preserved. We demonstrate that Expire-Span can help models identify and retain critical information and show it can achieve strong performance on reinforcement learning tasks specifically designed to challenge this functionality. Next, we show that Expire-Span can scale to memories that are tens of thousands in size, setting a new state of the art on incredibly long context tasks such as character-level language modeling and a frame-by-frame moving objects task. Finally, we analyze the efficiency of Expire-Span compared to existing approaches and demonstrate that it trains faster and uses less memory.

Submitted 13 June, 2021; v1 submitted 13 May, 2021; originally announced May 2021.

arXiv:2104.07567 [pdf, other]

Retrieval Augmentation Reduces Hallucination in Conversation

Authors: Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, Jason Weston

Abstract: Despite showing increasingly human-like conversational abilities, state-of-the-art dialogue models often suffer from factual incorrectness and hallucination of knowledge (Roller et al., 2020). In this work we explore the use of neural-retrieval-in-the-loop architectures - recently shown to be effective in open-domain QA (Lewis et al., 2020b; Izacard and Grave, 2020) - for knowledge-grounded dialogue, a task that is arguably more challenging as it requires querying based on complex multi-turn dialogue context and generating conversationally coherent responses. We study various types of architectures with multiple components - retrievers, rankers, and encoder-decoders - with the goal of maximizing knowledgeability while retaining conversational ability. We demonstrate that our best models obtain state-of-the-art performance on two knowledge-grounded conversational tasks. The models exhibit open-domain conversational capabilities, generalize effectively to scenarios not within the training data, and, as verified by human evaluations, substantially reduce the well-known problem of knowledge hallucination in state-of-the-art chatbots.

Submitted 15 April, 2021; originally announced April 2021.

arXiv:2104.00641 [pdf]

Dynamic Silos: Increased Modularity in Intra-organizational Communication Networks during the Covid-19 Pandemic

Authors: Tiona Zuzul, Emily Cox Pahnke, Jonathan Larson, Patrick Bourke, Nicholas Caurvina, Neha Parikh Shah, Fereshteh Amini, Jeffrey Weston, Youngser Park, Joshua Vogelstein, Christopher White, Carey E. Priebe

Abstract: Workplace communications around the world were drastically altered by Covid-19, related work-from-home orders, and the rise of remote work. To understand these shifts, we analyzed aggregated, anonymized metadata from over 360 billion emails within 4,361 organizations worldwide. By comparing month-to-month and year-over-year metrics, we examined changes in network community structures over 24 months before and after Covid-19. We also examined shifts across multiple communication media (email, instant messages, video calls, and calendaring software) within a single global organization, and compared them to communications shifts that were driven by changes in formal organizational structure. We found that, in 2020, organizations around the world became more siloed than in 2019, evidenced by increased modularity. This shift was concurrent with decreased stability within silos. Collectively, our analyses indicate that following the onset of Covid-19, employees began to shift more dynamically between subcommunities (teams, workgroups or functional areas). At the same time, once in a subcommunity, they limited their communication to other members of that community. We term these network changes dynamic silos. We provide initial insights into the meaning and implications of dynamic silos for the future of work.

Submitted 28 July, 2023; v1 submitted 1 April, 2021; originally announced April 2021.

Comments: 48 pages, 15 figures

Showing 1–50 of 160 results for author: Weston, J