Search | arXiv e-print repository

Teaching Large Language Models to Reason with Reinforcement Learning

Authors: Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, Roberta Raileanu

Abstract: Reinforcement Learning from Human Feedback (\textbf{RLHF}) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (\textbf{PPO}), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both spa… ▽ More Reinforcement Learning from Human Feedback (\textbf{RLHF}) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (\textbf{PPO}), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations both with and without supervised fine-tuning (\textbf{SFT}) data. Overall, we find all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of $10^6$ samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade off between maj@1 and pass@96 metric performance during SFT training and how conversely RL training improves both simultaneously. We then conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning. △ Less

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2402.10963 [pdf, other]

GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements

Authors: Alex Havrilla, Sharath Raparthy, Christoforus Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Roberta Raileanu

Abstract: State-of-the-art language models can exhibit impressive reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify \textit{when and where to refine} without access to external feedback. Outcome-based Reward Models (\textbf{ORMs}), trained to predict correctness of the final answer indicating when to refine, o… ▽ More State-of-the-art language models can exhibit impressive reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify \textit{when and where to refine} without access to external feedback. Outcome-based Reward Models (\textbf{ORMs}), trained to predict correctness of the final answer indicating when to refine, offer one convenient solution for deciding when to refine. Process Based Reward Models (\textbf{PRMs}), trained to predict correctness of intermediate steps, can then be used to indicate where to refine. But they are expensive to train, requiring extensive human annotations. In this paper, we propose Stepwise ORMs (\textbf{SORMs}) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy or $V^{\star}$. More specifically, SORMs are trained to predict the correctness of the final answer when sampling the current policy many times (rather than only once as in the case of ORMs). Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus improving downstream accuracy when doing refinements. We then train \textit{global} refinement models, which take only the question and a draft solution as input and predict a corrected solution, and \textit{local} refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53\% to 65\% when greedily sampled. △ Less

Submitted 24 June, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

arXiv:2402.04004 [pdf, other]

Understanding the Effect of Noise in LLM Training Data with Algorithmic Chains of Thought

Authors: Alex Havrilla, Maia Iyer

Abstract: During both pretraining and fine-tuning, Large Language Models (\textbf{LLMs}) are trained on trillions of tokens of text of widely varying quality. Both phases of training typically involve heuristically filtering out ``low-quality'' or \textit{noisy} training samples, yet little is known quantitatively about how the type or intensity of noise affects downstream performance. In this work, we stud… ▽ More During both pretraining and fine-tuning, Large Language Models (\textbf{LLMs}) are trained on trillions of tokens of text of widely varying quality. Both phases of training typically involve heuristically filtering out ``low-quality'' or \textit{noisy} training samples, yet little is known quantitatively about how the type or intensity of noise affects downstream performance. In this work, we study how noise in chain of thought (\textbf{CoT}) impacts task performance in the highly-controlled setting of algorithmically solvable tasks. First, we develop the Traced Integer (\textbf{TInt}) framework to generate highly customizable noised execution traces for any arithmetic function on lists of integers. We then define two types of noise: \textit{static} noise, a local form of noise which is applied after the CoT trace is computed, and \textit{dynamic} noise, a global form of noise which propagates errors in the trace as it is computed. We then evaluate the test performance of pretrained models both prompted and fine-tuned on noised datasets with varying levels of dataset contamination and intensity. We find fine-tuned models are extremely robust to high levels of static noise but struggle significantly more with lower levels of dynamic noise. In contrast, few-shot prompted models appear more sensitive to even static noise. We conclude with a discussion of how our findings impact noise filtering best-practices, in particular emphasizing the importance of removing samples containing destructive dynamic noise with global errors. △ Less

Submitted 8 February, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2401.06144 [pdf, other]

DFU: scale-robust diffusion model for zero-shot super-resolution image generation

Authors: Alex Havrilla, Kevin Rojas, Wen**g Liao, Molei Tao

Abstract: Diffusion generative models have achieved remarkable success in generating images with a fixed resolution. However, existing models have limited ability to generalize to different resolutions when training data at those resolutions are not available. Leveraging techniques from operator learning, we present a novel deep-learning architecture, Dual-FNO UNet (DFU), which approximates the score operat… ▽ More Diffusion generative models have achieved remarkable success in generating images with a fixed resolution. However, existing models have limited ability to generalize to different resolutions when training data at those resolutions are not available. Leveraging techniques from operator learning, we present a novel deep-learning architecture, Dual-FNO UNet (DFU), which approximates the score operator by combining both spatial and spectral information at multiple resolutions. Comparisons of DFU to baselines demonstrate its scalability: 1) simultaneously training on multiple resolutions improves FID over training at any single fixed resolution; 2) DFU generalizes beyond its training resolutions, allowing for coherent, high-fidelity generation at higher-resolutions with the same model, i.e. zero-shot super-resolution image-generation; 3) we propose a fine-tuning strategy to further enhance the zero-shot super-resolution image-generation capability of our model, leading to a FID of 11.3 at 1.66 times the maximum training resolution on FFHQ, which no other method can come close to achieving. △ Less

Submitted 22 January, 2024; v1 submitted 30 November, 2023; originally announced January 2024.

arXiv:2307.13692 [pdf, other]

ARB: Advanced Reasoning Benchmark for Large Language Models

Authors: Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, Aran Komatsuzaki

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance on various quantitative reasoning and knowledge benchmarks. However, many of these benchmarks are losing utility as LLMs get increasingly high scores, despite not yet reaching expert performance in these domains. We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields. ARB presents a more c… ▽ More Large Language Models (LLMs) have demonstrated remarkable performance on various quantitative reasoning and knowledge benchmarks. However, many of these benchmarks are losing utility as LLMs get increasingly high scores, despite not yet reaching expert performance in these domains. We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields. ARB presents a more challenging test than prior benchmarks, featuring problems in mathematics, physics, biology, chemistry, and law. As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge. We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks. In order to improve both automatic and assisted evaluation capabilities, we introduce a rubric-based evaluation approach, allowing GPT-4 to score its own intermediate reasoning steps. Further, we conduct a human evaluation of the symbolic subset of ARB, finding promising agreement between annotators and GPT-4 rubric evaluation scores. △ Less

Submitted 27 July, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

Comments: Submitted to NeurIPS Datasets and Benchmarks Track

arXiv:2303.09863 [pdf, other]

Deep Nonparametric Estimation of Intrinsic Data Structures by Chart Autoencoders: Generalization Error and Robustness

Authors: Hao Liu, Alex Havrilla, Rongjie Lai, Wen**g Liao

Abstract: Autoencoders have demonstrated remarkable success in learning low-dimensional latent features of high-dimensional data across various applications. Assuming that data are sampled near a low-dimensional manifold, we employ chart autoencoders, which encode data into low-dimensional latent features on a collection of charts, preserving the topology and geometry of the data manifold. Our paper establi… ▽ More Autoencoders have demonstrated remarkable success in learning low-dimensional latent features of high-dimensional data across various applications. Assuming that data are sampled near a low-dimensional manifold, we employ chart autoencoders, which encode data into low-dimensional latent features on a collection of charts, preserving the topology and geometry of the data manifold. Our paper establishes statistical guarantees on the generalization error of chart autoencoders, and we demonstrate their denoising capabilities by considering $n$ noisy training samples, along with their noise-free counterparts, on a $d$-dimensional manifold. By training autoencoders, we show that chart autoencoders can effectively denoise the input data with normal noise. We prove that, under proper network architectures, chart autoencoders achieve a squared generalization error in the order of $\displaystyle n^{-\frac{2}{d+2}}\log^4 n$, which depends on the intrinsic dimension of the manifold and only weakly depends on the ambient dimension and noise level. We further extend our theory on data with noise containing both normal and tangential components, where chart autoencoders still exhibit a denoising effect for the normal component. As a special case, our theory also applies to classical autoencoders, as long as the data manifold has a global parametrization. Our results provide a solid theoretical foundation for the effectiveness of autoencoders, which is further validated through several numerical experiments. △ Less

Submitted 25 October, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

arXiv:2302.13183 [pdf, other]

On Deep Generative Models for Approximation and Estimation of Distributions on Manifolds

Authors: Biraj Dahal, Alex Havrilla, Minshuo Chen, Tuo Zhao, Wen**g Liao

Abstract: Generative networks have experienced great empirical successes in distribution learning. Many existing experiments have demonstrated that generative networks can generate high-dimensional complex data from a low-dimensional easy-to-sample distribution. However, this phenomenon can not be justified by existing theories. The widely held manifold hypothesis speculates that real-world data sets, such… ▽ More Generative networks have experienced great empirical successes in distribution learning. Many existing experiments have demonstrated that generative networks can generate high-dimensional complex data from a low-dimensional easy-to-sample distribution. However, this phenomenon can not be justified by existing theories. The widely held manifold hypothesis speculates that real-world data sets, such as natural images and signals, exhibit low-dimensional geometric structures. In this paper, we take such low-dimensional data structures into consideration by assuming that data distributions are supported on a low-dimensional manifold. We prove statistical guarantees of generative networks under the Wasserstein-1 loss. We show that the Wasserstein-1 loss converges to zero at a fast rate depending on the intrinsic dimension instead of the ambient data dimension. Our theory leverages the low-dimensional geometric structures in data sets and justifies the practical power of generative networks. We require no smoothness assumptions on the data distribution which is desirable in practice. △ Less

Submitted 25 February, 2023; originally announced February 2023.

arXiv:2210.07792 [pdf, other]

Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning

Authors: Louis Castricato, Alexander Havrilla, Shahbuland Matiana, Michael Pieler, Anbang Ye, Ian Yang, Spencer Frazier, Mark Riedl

Abstract: Controlled automated story generation seeks to generate natural language stories satisfying constraints from natural language critiques or preferences. Existing methods to control for story preference utilize prompt engineering which is labor intensive and often inconsistent. They may also use logit-manipulation methods which require annotated datasets to exist for the desired attributes. To addre… ▽ More Controlled automated story generation seeks to generate natural language stories satisfying constraints from natural language critiques or preferences. Existing methods to control for story preference utilize prompt engineering which is labor intensive and often inconsistent. They may also use logit-manipulation methods which require annotated datasets to exist for the desired attributes. To address these issues, we first train a contrastive bi-encoder model to align stories with corresponding human critiques, named CARP, building a general purpose preference model. This is subsequently used as a reward function to fine-tune a generative language model via reinforcement learning. However, simply fine-tuning a generative language model with a contrastive reward model does not always reliably result in a story generation system capable of generating stories that meet user preferences. To increase story generation robustness we further fine-tune the contrastive reward model using a prompt-learning technique. A human participant study is then conducted comparing generations from our full system, ablations, and two baselines. We show that the full fine-tuning pipeline results in a story generator preferred over a LLM 20x as large as well as logit-based methods. This motivates the use of contrastive learning for general purpose human preference modeling. △ Less

Submitted 15 December, 2022; v1 submitted 14 October, 2022; originally announced October 2022.

arXiv:2102.09500 [pdf, ps, other]

Khinchin-type inequalities via Hadamard's factorisation

Authors: Alex Havrilla, Piotr Nayar, Tomasz Tkocz

Abstract: We prove Khinchin-type inequalities with sharp constants for type L random variables and all even moments. Our main tool is Hadamard's factorisation theorem from complex analysis, combined with Newton's inequalities for elementary symmetric functions. Besides the case of independent summands, we also treat ferromagnetic dependencies in a nonnegative external magnetic field (thanks to Newman's gene… ▽ More We prove Khinchin-type inequalities with sharp constants for type L random variables and all even moments. Our main tool is Hadamard's factorisation theorem from complex analysis, combined with Newton's inequalities for elementary symmetric functions. Besides the case of independent summands, we also treat ferromagnetic dependencies in a nonnegative external magnetic field (thanks to Newman's generalisation of the Lee-Yang theorem). Lastly, we compare the notions of type L, ultra sub-Gaussianity (introduced by Nayar and Oleszkiewicz) and strong log-concavity (introduced by Gurvits), with the latter two being equivalent. △ Less

Submitted 11 October, 2021; v1 submitted 18 February, 2021; originally announced February 2021.

Comments: Final version. To appear in Int. Math. Res. Not. IMRN

arXiv:1912.13345 [pdf, ps, other]

Sharp Khinchin-type inequalities for symmetric discrete uniform random variables

Authors: Alex Havrilla, Tomasz Tkocz

Abstract: We establish several optimal moment comparison inequalities (Khinchin-type inequalities) for weighted sums of independent identically distributed symmetric discrete random variables which are uniform on sets of consecutive integers. Specifically, we obtain sharp constants for the second moment and any moment of order at least 3 (using convex dominance by Gaussian random variables). In the case of… ▽ More We establish several optimal moment comparison inequalities (Khinchin-type inequalities) for weighted sums of independent identically distributed symmetric discrete random variables which are uniform on sets of consecutive integers. Specifically, we obtain sharp constants for the second moment and any moment of order at least 3 (using convex dominance by Gaussian random variables). In the case of only 3 atoms, we also establish a Schur-convexity result. For moments of order less than 2, we get sharp constants in two cases by exploiting Haagerup's arguments for random signs. △ Less

Submitted 7 December, 2020; v1 submitted 31 December, 2019; originally announced December 2019.

Comments: Revised (exposition shortened; L1-L2 inequality generalised to arbitrary symmetric distributions with large atom at 0; results for even moments will appear elsewhere). 12 pages

MSC Class: 60E15; 26D15

Journal ref: Israel J. Math. 246 (2021), no. 1, 281-297

Showing 1–10 of 10 results for author: Havrilla, A