Search | arXiv e-print repository

Soft Preference Optimization: Aligning Language Models to Expert Distributions

Authors: Arsalan Sharifnassab, Sina Ghiassian, Saber Salehkaleybar, Surya Kanoria, Dale Schuurmans

Abstract: We propose Soft Preference Optimization (SPO), a method for aligning generative models, such as Large Language Models (LLMs), with human preferences, without the need for a reward model. SPO optimizes model outputs directly over a preference dataset through a natural loss function that integrates preference loss with a regularization term across the model's entire output distribution rather than limiting it to the preference dataset. Although SPO does not require the assumption of an existing underlying reward model, we demonstrate that, under the Bradley-Terry (BT) model assumption, it converges to a softmax of scaled rewards, with the distribution's "softness" adjustable via the softmax exponent, an algorithm parameter. We showcase SPO's methodology, its theoretical foundation, and its comparative advantages in simplicity, computational efficiency, and alignment precision.

Submitted 27 May, 2024; v1 submitted 30 April, 2024; originally announced May 2024.
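
The abstract's statement that SPO converges to a softmax of scaled rewards can be illustrated with a small sketch. This is not the SPO loss or training procedure; it only shows the shape of the limiting distribution. The function name, the `scale` parameter, and the reward values are assumptions for illustration, and how the paper's softmax exponent maps onto `scale` is not specified in the abstract.

```python
import math

def softmax_of_scaled_rewards(rewards, scale):
    """Return probabilities proportional to exp(scale * reward).

    `scale` is a hypothetical stand-in for the softmax exponent
    mentioned in the abstract; larger values sharpen the distribution,
    and scale = 0 gives the uniform distribution.
    """
    scaled = [scale * r for r in rewards]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical rewards for three candidate responses.
rewards = [1.0, 2.0, 3.0]
soft = softmax_of_scaled_rewards(rewards, 1.0)
sharp = softmax_of_scaled_rewards(rewards, 5.0)
```

Under these assumptions, `sharp` places more mass on the highest-reward response than `soft` does, which is the sense in which the exponent controls "softness".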

arXiv:2402.02342

MetaOptimize: A Framework for Optimizing Step Sizes and Other Meta-parameters

Authors: Arsalan Sharifnassab, Saber Salehkaleybar, Richard Sutton

Abstract: This paper addresses the challenge of optimizing meta-parameters (i.e., hyperparameters) in machine learning algorithms, a critical factor influencing training efficiency and model performance. Moving away from the computationally expensive traditional meta-parameter search methods, we introduce the MetaOptimize framework, which dynamically adjusts meta-parameters, particularly step sizes (also known as learning rates), during training. More specifically, MetaOptimize can wrap around any first-order optimization algorithm, tuning step sizes on the fly to minimize a specific form of regret that accounts for the long-term effect of step sizes on training, through a discounted sum of future losses. We also introduce low-complexity variants of MetaOptimize that, in conjunction with its adaptability to multiple optimization algorithms, demonstrate performance competitive with that of the best hand-crafted learning rate schedules across various machine learning applications.

Submitted 27 May, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

arXiv:2401.17401

Step-size Optimization for Continual Learning

Authors: Thomas Degris, Khurram Javed, Arsalan Sharifnassab, Yuxin Liu, Richard Sutton

Abstract: In continual learning, a learner has to keep learning from the data over its whole lifetime. A key issue is to decide what knowledge to keep and what knowledge to let go. In a neural network, this can be implemented by using a step-size vector to scale how much gradient samples change network weights. Common algorithms, like RMSProp and Adam, use heuristics, specifically normalization, to adapt this step-size vector. In this paper, we show that those heuristics ignore the effect of their adaptation on the overall objective function, for example by moving the step-size vector away from better step-size vectors. On the other hand, stochastic meta-gradient descent algorithms, like IDBD (Sutton, 1992), explicitly optimize the step-size vector with respect to the overall objective function. On simple problems, we show that IDBD is able to consistently improve step-size vectors, where RMSProp and Adam do not. We explain the differences between the two approaches and their respective limitations. We conclude by suggesting that combining both approaches could be a promising future direction to improve the performance of neural networks in continual learning.

Submitted 30 January, 2024; originally announced January 2024.
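
For readers unfamiliar with IDBD, the meta-gradient idea the abstract refers to can be sketched for a linear learner. The update below follows the standard statement of IDBD (Sutton, 1992): each weight has its own step size `exp(beta[i])`, and `beta[i]` is adjusted along the correlation between the current gradient sample and a trace `h[i]` of recent weight updates. The toy regression data, the initial step size, and the meta step size `theta` are assumptions for illustration and are not the experimental setup of the paper.

```python
import math

def idbd_step(w, beta, h, x, y, theta):
    """Apply one in-place IDBD update for a linear predictor; return the error."""
    delta = y - sum(wi * xi for wi, xi in zip(w, x))  # prediction error
    for i, xi in enumerate(x):
        # Meta-gradient step on the log step size.
        beta[i] += theta * delta * xi * h[i]
        alpha = math.exp(beta[i])
        # Ordinary LMS step with the per-weight step size.
        w[i] += alpha * delta * xi
        # Decaying trace of recent updates to w[i].
        h[i] = h[i] * max(0.0, 1.0 - alpha * xi * xi) + alpha * delta * xi
    return delta

def run_toy_problem(rounds=500, theta=0.01, initial_alpha=0.05):
    """Fit y = 2*x0 - 1*x1 on a fixed cycle of inputs (hypothetical toy data)."""
    inputs = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
    w = [0.0, 0.0]
    beta = [math.log(initial_alpha)] * 2
    h = [0.0, 0.0]
    for _ in range(rounds):
        for x in inputs:
            y = 2.0 * x[0] - 1.0 * x[1]
            idbd_step(w, beta, h, x, y, theta)
    return w, beta

weights, log_step_sizes = run_toy_problem()
```

The design point the abstract stresses is visible in the first line of the loop: the step size changes in response to the objective itself, rather than through a normalization heuristic.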

arXiv:2301.13757

Toward Efficient Gradient-Based Value Estimation

Authors: Arsalan Sharifnassab, Richard Sutton

Abstract: Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that Mean Square Bellman Error (MSBE) is an ill-conditioned loss function in the sense that its Hessian has a large condition number. To resolve the adverse effect of poor conditioning of MSBE on gradient-based methods, we propose a low-complexity batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than the residual gradient methods while having almost the same computational complexity, and is competitive with TD on the classic problems that we tested.

Submitted 23 July, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

arXiv:2108.08677

Order Optimal Bounds for One-Shot Federated Learning over non-Convex Loss Functions

Authors: Arsalan Sharifnassab, Saber Salehkaleybar, S. Jamaloddin Golestani

Abstract: We consider the problem of federated learning in a one-shot setting in which there are $m$ machines, each observing $n$ sample functions from an unknown distribution on non-convex loss functions. Let $F:[-1,1]^d\to\mathbb{R}$ be the expected loss function with respect to this unknown distribution. The goal is to find an estimate of the minimizer of $F$. Based on its observations, each machine generates a signal of bounded length $B$ and sends it to a server. The server collects signals of all machines and outputs an estimate of the minimizer of $F$. We show that the expected loss of any algorithm is lower bounded by $\max\big(1/(\sqrt{n}(mB)^{1/d}), 1/\sqrt{mn}\big)$, up to a logarithmic factor. We then prove that this lower bound is order optimal in $m$ and $n$ by presenting a distributed learning algorithm, called Multi-Resolution Estimator for Non-Convex loss function (MRE-NC), whose expected loss matches the lower bound for large $mn$ up to polylogarithmic factors.

Submitted 6 February, 2024; v1 submitted 19 August, 2021; originally announced August 2021.
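
To get a feel for the lower bound quoted in the abstract, the expression $\max\big(1/(\sqrt{n}(mB)^{1/d}), 1/\sqrt{mn}\big)$ can be evaluated directly. The sketch below ignores the logarithmic factor and any constants, so its values indicate scaling only; the function name and the example values of $m$, $n$, $B$, and $d$ are assumptions for illustration.

```python
import math

def one_shot_lower_bound(m, n, B, d):
    """Evaluate max(1/(sqrt(n)*(m*B)**(1/d)), 1/sqrt(m*n)).

    m: number of machines, n: samples per machine,
    B: signal length per machine, d: dimension of the domain.
    Logarithmic factors and constants are deliberately omitted.
    """
    communication_term = 1.0 / (math.sqrt(n) * (m * B) ** (1.0 / d))
    statistical_term = 1.0 / math.sqrt(m * n)
    return max(communication_term, statistical_term)

# Hypothetical settings: same m, n, B, increasing dimension d.
low_dim = one_shot_lower_bound(m=100, n=4, B=1, d=1)
high_dim = one_shot_lower_bound(m=100, n=4, B=1, d=8)
```

With these example values the $1/\sqrt{mn}$ term dominates at $d=1$, while the communication-dependent term dominates at $d=8$, so the evaluated bound is larger in the higher-dimensional case.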

arXiv:1911.00731

Order Optimal One-Shot Distributed Learning

Authors: Arsalan Sharifnassab, Saber Salehkaleybar, S. Jamaloddin Golestani

Abstract: We consider distributed statistical optimization in a one-shot setting, where there are $m$ machines each observing $n$ i.i.d. samples. Based on its observed samples, each machine then sends an $O(\log(mn))$-length message to a server, at which a parameter minimizing an expected loss is to be estimated. We propose an algorithm called Multi-Resolution Estimator (MRE) whose expected error is no larger than $\tilde{O}\big(m^{-{1}/{\max(d,2)}} n^{-1/2}\big)$, where $d$ is the dimension of the parameter space. This error bound meets existing lower bounds up to poly-logarithmic factors, and is thereby order optimal. The expected error of MRE, unlike existing algorithms, tends to zero as the number of machines ($m$) goes to infinity, even when the number of samples per machine ($n$) remains upper bounded by a constant. This property of the MRE algorithm makes it applicable in new machine learning paradigms where $m$ is much larger than $n$.

Submitted 2 November, 2019; originally announced November 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1905.04634

Journal ref: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada

arXiv:1905.04634

One-Shot Federated Learning: Theoretical Limits and Algorithms to Achieve Them

Authors: Saber Salehkaleybar, Arsalan Sharifnassab, S. Jamaloddin Golestani

Abstract: We consider distributed statistical optimization in a one-shot setting, where there are $m$ machines each observing $n$ i.i.d. samples. Based on its observed samples, each machine sends a $B$-bit-long message to a server. The server then collects messages from all machines, and estimates a parameter that minimizes an expected convex loss function. We investigate the impact of the communication constraint, $B$, on the expected error and derive a tight lower bound on the error achievable by any algorithm. We then propose an estimator, which we call Multi-Resolution Estimator (MRE), whose expected error (when $B\ge\log mn$) meets the aforementioned lower bound up to poly-logarithmic factors, and is thereby order optimal. We also address the problem of learning under a tiny communication budget, and present lower and upper error bounds when $B$ is a constant. The expected error of MRE, unlike existing algorithms, tends to zero as the number of machines ($m$) goes to infinity, even when the number of samples per machine ($n$) remains upper bounded by a constant. This property of the MRE algorithm makes it applicable in new machine learning paradigms where $m$ is much larger than $n$.

Submitted 30 December, 2019; v1 submitted 11 May, 2019; originally announced May 2019.

arXiv:1810.09180

Fluctuation Bounds for the Max-Weight Policy, with Applications to State Space Collapse

Authors: Arsalan Sharifnassab, John N. Tsitsiklis, S. Jamaloddin Golestani

Abstract: We consider a multi-hop switched network operating under a Max-Weight (MW) scheduling policy, and show that the distance between the queue length process and a fluid solution remains bounded by a constant multiple of the deviation of the cumulative arrival process from its average. We then exploit this result to prove matching upper and lower bounds for the time scale over which additive state space collapse (SSC) takes place. This implies, as two special cases, an additive SSC result in diffusion scaling under non-Markovian arrivals and, for the case of i.i.d. arrivals, an additive SSC result over an exponential time scale.

Submitted 12 June, 2019; v1 submitted 22 October, 2018; originally announced October 2018.

Showing 1–8 of 8 results for author: Sharifnassab, A