
Showing 1–20 of 20 results for author: Xiong, W

Searching in archive stat.
  1. arXiv:2405.07863  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    RLHF Workflow: From Reward Modeling to Online RLHF

    Authors: Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang

    Abstract: We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill i…

    Submitted 12 June, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

  2. arXiv:2404.18922  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    DPO Meets PPO: Reinforced Token Optimization for RLHF

    Authors: Han Zhong, Guhao Feng, Wei Xiong, Li Zhao, Di He, Jiang Bian, Liwei Wang

    Abstract: In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of state-of-the-art closed-source large language models (LLMs), its open-source implementation is still larg…

    Submitted 29 April, 2024; originally announced April 2024.

  3. arXiv:2402.18571  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards

    Authors: Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, Tong Zhang

    Abstract: Fine-grained control over large language models (LLMs) remains a significant challenge, hindering their adaptability to diverse user needs. While Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning LLMs, its reliance on scalar rewards often limits its ability to capture diverse user preferences in real-world applications. To address this limitation, we introduce the Directi…

    Submitted 6 March, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: The code and model are released at https://github.com/Haoxiang-Wang/directional-preference-alignment

  4. arXiv:2402.07314  [pdf, other]

    cs.LG stat.ML

    Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

    Authors: Chenlu Ye, Wei Xiong, Yuheng Zhang, Nan Jiang, Tong Zhang

    Abstract: We study Reinforcement Learning from Human Feedback (RLHF) under a general preference oracle. In particular, we do not assume that there exists a reward function and the preference signal is drawn from the Bradley-Terry model as most of the prior works do. We consider a standard mathematical formulation, the reverse-KL regularized minimax game between two LLMs for RLHF under general preference ora…

    Submitted 25 April, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

    Comments: RLHF, Preference Learning, Alignment for LLMs

  5. arXiv:2312.15124  [pdf, other]

    quant-ph cs.ET cs.LG stat.ML

    On fundamental aspects of quantum extreme learning machines

    Authors: Weijie Xiong, Giorgio Facelli, Mehrad Sahebi, Owen Agnel, Thiparat Chotibut, Supanut Thanasilp, Zoë Holmes

    Abstract: Quantum Extreme Learning Machines (QELMs) have emerged as a promising framework for quantum machine learning. Their appeal lies in the rich feature map induced by the dynamics of a quantum substrate - the quantum reservoir - and the efficient post-measurement training via linear regression. Here we study the expressivity of QELMs by decomposing the prediction of QELMs into a Fourier series. We sho…

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: 16+17 pages, 8+2 figures

  6. arXiv:2312.11456  [pdf, other]

    cs.LG cs.AI stat.ML

    Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

    Authors: Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang

    Abstract: This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in strategical exploration of the environment. Then, to understand the mathematical principle of RLHF, we consider a standard mathematical formulation, the reverse-KL re…

    Submitted 1 May, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: 53 pages; theoretical study and algorithmic design of iterative RLHF and DPO

  7. arXiv:2306.08364  [pdf, other]

    stat.ML cs.IT cs.LG

    Provably Efficient Offline Reinforcement Learning with Perturbed Data Sources

    Authors: Chengshuai Shi, Wei Xiong, Cong Shen, Jing Yang

    Abstract: Existing theoretical studies on offline reinforcement learning (RL) mostly consider a dataset sampled directly from the target task. In practice, however, data often come from several heterogeneous but related sources. Motivated by this gap, this work aims at rigorously understanding offline RL with multiple datasets that are collected from randomly perturbed versions of the target task instead of…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: ICML 2023

  8. arXiv:2305.18258  [pdf, other]

    cs.LG cs.AI cs.GT math.OC stat.ML

    Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration

    Authors: Zhihan Liu, Miao Lu, Wei Xiong, Han Zhong, Hao Hu, Shenao Zhang, Sirui Zheng, Zhuoran Yang, Zhaoran Wang

    Abstract: In online reinforcement learning (online RL), balancing exploration and exploitation is crucial for finding an optimal policy in a sample-efficient way. To achieve this, existing sample-efficient online RL algorithms typically consist of three components: estimation, planning, and exploration. However, in order to cope with general function approximators, most of them involve impractical algorithm…

    Submitted 25 October, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

  9. arXiv:2305.02441  [pdf, other]

    stat.ML cs.IT cs.LG eess.SP

    Reward Teaching for Federated Multi-armed Bandits

    Authors: Chengshuai Shi, Wei Xiong, Cong Shen, Jing Yang

    Abstract: Most of the existing federated multi-armed bandits (FMAB) designs are based on the presumption that clients will implement the specified design to collaborate with the server. In reality, however, it may not be possible to modify the clients' existing protocols. To address this challenge, this work focuses on clients who always maximize their individual cumulative rewards, and introduces a novel i…

    Submitted 20 November, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

    Comments: Accepted to IEEE Transactions on Signal Processing

  10. arXiv:2304.06767  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV stat.ML

    RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

    Authors: Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang

    Abstract: Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-worl…

    Submitted 1 December, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: 29 pages, 12 figures, Published in Transactions on Machine Learning Research (TMLR)

  11. arXiv:2212.05949  [pdf, ps, other]

    stat.ML cs.LG

    Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes

    Authors: Chenlu Ye, Wei Xiong, Quanquan Gu, Tong Zhang

    Abstract: Despite the significant interest and progress in reinforcement learning (RL) problems with adversarial corruption, current works are either confined to the linear setting or lead to an undesired $\tilde{O}(\sqrt{T}\zeta)$ regret bound, where $T$ is the number of rounds and $\zeta$ is the total amount of corruption. In this paper, we consider the contextual bandit with general function approximation and pr…

    Submitted 10 February, 2024; v1 submitted 12 December, 2022; originally announced December 2022.

    Comments: We study the corruption-robust MDPs and contextual bandits with general function approximation

    Journal ref: ICML 2023

  12. arXiv:2211.01962  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond

    Authors: Han Zhong, Wei Xiong, Sirui Zheng, Liwei Wang, Zhaoran Wang, Zhuoran Yang, Tong Zhang

    Abstract: We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making, which includes Markov decision process (MDP), partially observable Markov decision process (POMDP), and predictive state representation (PSR) as special cases. Toward finding the minimum assumption that empowers sample efficient learning, we propose a novel complexity measure, generali…

    Submitted 30 June, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

    Comments: We changed the title from the first version. We fixed a technical issue in the first version regarding the $\ell_2$ eluder technique (Lemma D.2)

  13. arXiv:2205.15512  [pdf, ps, other]

    cs.LG cs.GT stat.ML

    Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game

    Authors: Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, Liwei Wang, Tong Zhang

    Abstract: Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected dataset without further interactions with the environment. While various algorithms have been proposed for offline RL in the previous literature, the minimax optimality has only been (nearly) established for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear funct…

    Submitted 1 March, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

  14. arXiv:2202.07511  [pdf, ps, other]

    cs.LG cs.GT stat.ML

    Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets

    Authors: Han Zhong, Wei Xiong, Jiyuan Tan, Liwei Wang, Tong Zhang, Zhaoran Wang, Zhuoran Yang

    Abstract: We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal p…

    Submitted 29 December, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

  15. arXiv:2201.03533  [pdf, other]

    cs.CL cs.AI cs.LG stat.ML

    SCROLLS: Standardized CompaRison Over Long Language Sequences

    Authors: Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, Omer Levy

    Abstract: NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets, and handpick ones where the text is naturally long, while prioritizing tasks that involve synthesizing infor…

    Submitted 11 October, 2022; v1 submitted 10 January, 2022; originally announced January 2022.

    Comments: EMNLP 2022

  16. arXiv:2110.14628  [pdf, ps, other]

    stat.ML cs.IT cs.LG

    (Almost) Free Incentivized Exploration from Decentralized Learning Agents

    Authors: Chengshuai Shi, Haifeng Xu, Wei Xiong, Cong Shen

    Abstract: Incentivized exploration in multi-armed bandits (MAB) has witnessed increasing interests and many progresses in recent years, where a principal offers bonuses to agents to do explorations on her behalf. However, almost all existing studies are confined to temporary myopic agents. In this work, we break this barrier and study incentivized exploration with multiple and long-term strategic agents, wh…

    Submitted 27 October, 2021; originally announced October 2021.

    Comments: Accepted to NeurIPS 2021, camera-ready version

  17. arXiv:2110.14622  [pdf, ps, other]

    stat.ML cs.IT cs.LG

    Heterogeneous Multi-player Multi-armed Bandits: Closing the Gap and Generalization

    Authors: Chengshuai Shi, Wei Xiong, Cong Shen, Jing Yang

    Abstract: Despite the significant interests and many progresses in decentralized multi-player multi-armed bandits (MP-MAB) problems in recent years, the regret gap to the natural centralized lower bound in the heterogeneous MP-MAB setting remains open. In this paper, we propose BEACON -- Batched Exploration with Adaptive COmmunicatioN -- that closes this gap. BEACON accomplishes this goal with novel contrib…

    Submitted 29 October, 2021; v1 submitted 27 October, 2021; originally announced October 2021.

    Comments: Accepted to NeurIPS 2021, camera-ready version

  18. arXiv:2110.00653  [pdf, ps, other]

    stat.ML cs.LG

    Sparse Deep Learning: A New Framework Immune to Local Traps and Miscalibration

    Authors: Yan Sun, Wenjun Xiong, Faming Liang

    Abstract: Deep learning has powered recent successes of artificial intelligence (AI). However, the deep neural network, as the basic model of deep learning, has suffered from issues such as local traps and miscalibration. In this paper, we provide a new framework for sparse deep learning, which has the above issues addressed in a coherent way. In particular, we lay down a theoretical foundation for sparse d…

    Submitted 2 December, 2021; v1 submitted 1 October, 2021; originally announced October 2021.

    Comments: NeurIPS 2021

  19. arXiv:2003.00162  [pdf, ps, other]

    cs.LG cs.IT stat.ML

    Decentralized Multi-player Multi-armed Bandits with No Collision Information

    Authors: Chengshuai Shi, Wei Xiong, Cong Shen, Jing Yang

    Abstract: The decentralized stochastic multi-player multi-armed bandit (MP-MAB) problem, where the collision information is not available to the players, is studied in this paper. Building on the seminal work of Boursier and Perchet (2019), we propose error correction synchronization involving communication (EC-SIC), whose regret is shown to approach that of the centralized stochastic MP-MAB with collision…

    Submitted 28 February, 2020; originally announced March 2020.

    Comments: 17 pages, 11 figures. Accepted to AISTATS 2020

  20. arXiv:1811.03970  [pdf, other]

    cs.IR cs.LG stat.ML

    Looking Deeper into Deep Learning Model: Attribution-based Explanations of TextCNN

    Authors: Wenting Xiong, Iftitahu Ni'mah, Juan M. G. Huesca, Werner van Ipenburg, Jan Veldsink, Mykola Pechenizkiy

    Abstract: Layer-wise Relevance Propagation (LRP) and saliency maps have been recently used to explain the predictions of Deep Learning models, specifically in the domain of text classification. Given different attribution-based explanations to highlight relevant words for a predicted class label, experiments based on word deleting perturbation is a common evaluation method. This word removal approach, howev…

    Submitted 2 December, 2018; v1 submitted 8 November, 2018; originally announced November 2018.

    Comments: NIPS 2018 Workshop on Challenges and Opportunities for AI in Financial Services: the Impact of Fairness, Explainability, Accuracy, and Privacy, Montréal, Canada