-
DPO Meets PPO: Reinforced Token Optimization for RLHF
Authors:
Han Zhong,
Guhao Feng,
Wei Xiong,
Li Zhao,
Di He,
Jiang Bian,
Liwei Wang
Abstract:
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of state-of-the-art closed-source large language models (LLMs), its open-source implementation is still largely sub-optimal, as widely reported by numerous research studies. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Furthermore, we provide theoretical insights that demonstrate the superiority of our MDP framework over the previous sentence-level bandit formulation. Under this framework, we introduce an algorithm, dubbed as Reinforced Token Optimization (\texttt{RTO}), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, \texttt{RTO} is proven to have the capability of finding the near-optimal policy sample-efficiently. For its practical implementation, \texttt{RTO} innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence rewards, surprisingly provides us with a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. Extensive real-world alignment experiments verify the effectiveness of the proposed approach.
Submitted 29 April, 2024;
originally announced April 2024.
-
Do Efficient Transformers Really Save Computation?
Authors:
Kai Yang,
Jan Ackermann,
Zhenyu He,
Guhao Feng,
Bohang Zhang,
Yunzhen Feng,
Qiwei Ye,
Di He,
Liwei Wang
Abstract:
As transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become very valuable. While many efficient Transformers and Transformer alternatives have been proposed, none provide theoretical guarantees that they are a suitable replacement for the standard Transformer. This makes it challenging to identify when to use a specific model and what directions to prioritize for further investigation. In this paper, we aim to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT) prompts and follow previous works to model them as Dynamic Programming (DP) problems. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. Nonetheless, we identify a class of DP problems for which these models can be more efficient than the standard Transformer. We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses.
Submitted 21 February, 2024;
originally announced February 2024.
-
Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation
Authors:
Zhenyu He,
Guhao Feng,
Shengjie Luo,
Kai Yang,
Liwei Wang,
Jingjing Xu,
Zhi Zhang,
Hongxia Yang,
Di He
Abstract:
In this work, we leverage the intrinsic segmentation of language sequences and design a new positional encoding method called Bilevel Positional Encoding (BiPE). For each position, our BiPE blends an intra-segment encoding and an inter-segment encoding. The intra-segment encoding identifies the locations within a segment and helps the model capture the semantic information therein via absolute positional encoding. The inter-segment encoding specifies the segment index, models the relationships between segments, and aims to improve extrapolation capabilities via relative positional encoding. Theoretical analysis shows this disentanglement of positional information makes learning more effective. The empirical results also show that our BiPE has superior length extrapolation capabilities across a wide range of tasks in diverse text modalities.
Submitted 17 June, 2024; v1 submitted 29 January, 2024;
originally announced January 2024.
-
Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity
Authors:
Guhao Feng,
Han Zhong
Abstract:
Reinforcement Learning (RL) encompasses diverse paradigms, including model-based RL, policy-based RL, and value-based RL, each tailored to approximate the model, optimal policy, and optimal value function, respectively. This work investigates the potential hierarchy of representation complexity -- the complexity of functions to be represented -- among these RL paradigms. We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension. However, the representation of the optimal policy and optimal value proves to be $\mathsf{NP}$-complete and unattainable by constant-layer MLPs with polynomial size. This demonstrates a significant representation complexity gap between model-based RL and model-free RL, which includes policy-based RL and value-based RL. To further explore the representation complexity hierarchy between policy-based RL and value-based RL, we introduce another general class of MDPs where both the model and optimal policy can be represented by constant-depth circuits with polynomial size or constant-layer MLPs with polynomial size. In contrast, representing the optimal value is $\mathsf{P}$-complete and intractable via a constant-layer MLP with polynomial hidden dimension. This accentuates the intricate representation complexity associated with value-based RL compared to policy-based RL. In summary, we unveil a potential representation complexity hierarchy within RL -- representing the model emerges as the easiest task, followed by the optimal policy, while representing the optimal value function presents the most intricate challenge.
Submitted 28 December, 2023;
originally announced December 2023.
-
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
Authors:
Guhao Feng,
Bohang Zhang,
Yuntian Gu,
Haotian Ye,
Di He,
Liwei Wang
Abstract:
Recent studies have discovered that Chain-of-Thought prompting (CoT) can dramatically improve the performance of Large Language Models (LLMs), particularly when dealing with complex tasks involving mathematics or reasoning. Despite the enormous empirical success, the underlying mechanisms behind CoT and how it unlocks the potential of LLMs remain elusive. In this paper, we take a first step towards theoretically answering these questions. Specifically, we examine the expressivity of LLMs with CoT in solving fundamental mathematical and decision-making problems. By using circuit complexity theory, we first give impossibility results showing that bounded-depth Transformers are unable to directly produce correct answers for basic arithmetic/equation tasks unless the model size grows super-polynomially with respect to the input length. In contrast, we then prove by construction that autoregressive Transformers of constant size suffice to solve both tasks by generating CoT derivations using a commonly used math language format. Moreover, we show LLMs with CoT can handle a general class of decision-making problems known as Dynamic Programming, thus justifying its power in tackling complex real-world tasks. Finally, an extensive set of experiments show that, while Transformers always fail to directly predict the answers, they can consistently learn to generate correct solutions step-by-step given sufficient CoT demonstrations.
Submitted 22 December, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
A General Taylor Framework for Unifying and Revisiting Attribution Methods
Authors:
Huiqi Deng,
Na Zou,
Mengnan Du,
Weifu Chen,
Guocan Feng,
Xia Hu
Abstract:
Attribution methods provide an insight into the decision-making process of machine learning models, especially deep neural networks, by assigning contribution scores to each individual feature. However, the attribution problem has not been well-defined, which lacks a unified guideline to the contribution assignment process. Furthermore, existing attribution methods often built upon various empirical intuitions and heuristics. There still lacks a general theoretical framework that not only can offer a good description of the attribution problem, but also can be applied to unifying and revisiting existing attribution methods. To bridge the gap, in this paper, we propose a Taylor attribution framework, which models the attribution problem as how to decide individual payoffs in a coalition. Then, we reformulate fourteen mainstream attribution methods into the Taylor framework and analyze these attribution methods in terms of rationale, fidelity, and limitation in the framework. Moreover, we establish three principles for a good attribution in the Taylor attribution framework, i.e., low approximation error, correct Taylor contribution assignment, and unbiased baseline selection. Finally, we empirically validate the Taylor reformulations and reveal a positive correlation between the attribution performance and the number of principles followed by the attribution method via benchmarking on real-world datasets.
Submitted 25 February, 2023; v1 submitted 28 May, 2021;
originally announced May 2021.
-
A Unified Taylor Framework for Revisiting Attribution Methods
Authors:
Huiqi Deng,
Na Zou,
Mengnan Du,
Weifu Chen,
Guocan Feng,
Xia Hu
Abstract:
Attribution methods have been developed to understand the decision-making process of machine learning models, especially deep neural networks, by assigning importance scores to individual features. Existing attribution methods often built upon empirical intuitions and heuristics. There still lacks a general and theoretical framework that not only can unify these attribution methods, but also theoretically reveal their rationales, fidelity, and limitations. To bridge the gap, in this paper, we propose a Taylor attribution framework and reformulate seven mainstream attribution methods into the framework. Based on reformulations, we analyze the attribution methods in terms of rationale, fidelity, and limitation. Moreover, We establish three principles for a good attribution in the Taylor attribution framework, i.e., low approximation error, correct contribution assignment, and unbiased baseline selection. Finally, we empirically validate the Taylor reformulations and reveal a positive correlation between the attribution performance and the number of principles followed by the attribution method via benchmarking on real-world datasets.
Submitted 13 April, 2021; v1 submitted 21 August, 2020;
originally announced August 2020.
-
Nonparametric Estimation of the Fisher Information and Its Applications
Authors:
Wei Cao,
Alex Dytso,
Michael Fauß,
H. Vincent Poor,
Gang Feng
Abstract:
This paper considers the problem of estimation of the Fisher information for location from a random sample of size $n$. First, an estimator proposed by Bhattacharya is revisited and improved convergence rates are derived. Second, a new estimator, termed a clipped estimator, is proposed. Superior upper bounds on the rates of convergence can be shown for the new estimator compared to the Bhattacharya estimator, albeit with different regularity conditions. Third, both of the estimators are evaluated for the practically relevant case of a random variable contaminated by Gaussian noise. Moreover, using Brown's identity, which relates the Fisher information and the minimum mean squared error (MMSE) in Gaussian noise, two corresponding consistent estimators for the MMSE are proposed. Simulation examples for the Bhattacharya estimator and the clipped estimator as well as the MMSE estimators are presented. The examples demonstrate that the clipped estimator can significantly reduce the required sample size to guarantee a specific confidence interval compared to the Bhattacharya estimator.
Submitted 7 May, 2020;
originally announced May 2020.
-
The Learning of Fuzzy Cognitive Maps With Noisy Data: A Rapid and Robust Learning Method With Maximum Entropy
Authors:
Guoliang Feng,
Wei Lu,
Witold Pedrycz,
Jianhua Yang,
Xiaodong Liu
Abstract:
Numerous learning methods for fuzzy cognitive maps (FCMs), such as the Hebbian-based and the population-based learning methods, have been developed for modeling and simulating dynamic systems. However, these methods are faced with several obvious limitations. Most of these models are extremely time consuming when learning the large-scale FCMs with hundreds of nodes. Furthermore, the FCMs learned by those algorithms lack robustness when the experimental data contain noise. In addition, reasonable distribution of the weights is rarely considered in these algorithms, which could result in the reduction of the performance of the resulting FCM. In this article, a straightforward, rapid, and robust learning method is proposed to learn FCMs from noisy data, especially, to learn large-scale FCMs. The crux of the proposed algorithm is to equivalently transform the learning problem of FCMs to a classic-constrained convex optimization problem in which the least-squares term ensures the robustness of the well-learned FCM and the maximum entropy term regularizes the distribution of the weights of the well-learned FCM. A series of experiments covering two frequently used activation functions (the sigmoid and hyperbolic tangent functions) are performed on both synthetic datasets with noise and real-world datasets. The experimental results show that the proposed method is rapid and robust against data containing noise and that the well-learned weights have better distribution. In addition, the FCMs learned by the proposed method also exhibit superior performance in comparison with the existing methods. Index Terms-Fuzzy cognitive maps (FCMs), maximum entropy, noisy data, rapid and robust learning.
Submitted 22 August, 2019;
originally announced August 2019.
-
Factor Investing: A Bayesian Hierarchical Approach
Authors:
Guanhao Feng,
Jingyu He
Abstract:
This paper investigates asset allocation problems when returns are predictable. We introduce a market-timing Bayesian hierarchical (BH) approach that adopts heterogeneous time-varying coefficients driven by lagged fundamental characteristics. Our approach includes a joint estimation of conditional expected returns and covariance matrix and considers estimation risk for portfolio analysis. The hierarchical prior allows modeling different assets separately while sharing information across assets. We demonstrate the performance of the U.S. equity market. Though the Bayesian forecast is slightly biased, our BH approach outperforms most alternative methods in point and interval prediction. Our BH approach in sector investment for the recent twenty years delivers a 0.92% average monthly return and a significant 0.32% Jensen's alpha. We also find technology, energy, and manufacturing are important sectors in the past decade, and size, investment, and short-term reversal factors are heavily weighted. Finally, the stochastic discount factor constructed by our BH approach explains most anomalies.
Submitted 17 September, 2020; v1 submitted 3 February, 2019;
originally announced February 2019.
-
V-CNN: When Convolutional Neural Network encounters Data Visualization
Authors:
Mao Yang,
Bo Li,
Guanxiong Feng,
Zhongjiang Yan
Abstract:
In recent years, deep learning poses a deep technical revolution in almost every field and attracts great attentions from industry and academia. Especially, the convolutional neural network (CNN), one representative model of deep learning, achieves great successes in computer vision and natural language processing. However, simply or blindly applying CNN to the other fields results in lower training effects or makes it quite difficult to adjust the model parameters. In this poster, we propose a general methodology named V-CNN by introducing data visualizing for CNN. V-CNN introduces a data visualization model prior to CNN modeling to make sure the data after processing is fit for the features of images as well as CNN modeling. We apply V-CNN to the network intrusion detection problem based on a famous practical dataset: AWID. Simulation results confirm V-CNN significantly outperforms other studies and the recall rate of each invasion category is more than 99.8%.
Submitted 12 June, 2018;
originally announced July 2018.
-
Deep Learning in Characteristics-Sorted Factor Models
Authors:
Guanhao Feng,
Jingyu He,
Nicholas G. Polson,
Jianeng Xu
Abstract:
This paper presents an augmented deep factor model that generates latent factors for cross-sectional asset pricing. The conventional security sorting on firm characteristics for constructing long-short factor portfolio weights is nonlinear modeling, while factors are treated as inputs in linear models. We provide a structural deep learning framework to generalize the complete mechanism for fitting cross-sectional returns by firm characteristics through generating risk factors -- hidden layers. Our model has an economic-guided objective function that minimizes aggregated realized pricing errors. Empirical results on high-dimensional characteristics demonstrate robust asset pricing performance and strong investment improvements by identifying important raw characteristic sources.
Submitted 19 July, 2023; v1 submitted 2 May, 2018;
originally announced May 2018.
-
Deep Learning for Predicting Asset Returns
Authors:
Guanhao Feng,
Jingyu He,
Nicholas G. Polson
Abstract:
Deep learning searches for nonlinear factors for predicting asset returns. Predictability is achieved via multiple layers of composite factors as opposed to additive ones. Viewed in this way, asset pricing studies can be revisited using multi-layer deep learners, such as rectified linear units (ReLU) or long-short-term-memory (LSTM) for time-series effects. State-of-the-art algorithms including stochastic gradient descent (SGD), TensorFlow and dropout design provide implementation and efficient factor exploration. To illustrate our methodology, we revisit the equity market risk premium dataset of Welch and Goyal (2008). We find the existence of nonlinear factors which explain predictability of returns, in particular at the extremes of the characteristic space. Finally, we conclude with directions for future research.
Submitted 26 April, 2018; v1 submitted 24 April, 2018;
originally announced April 2018.
-
Sparse Regularization in Marketing and Economics
Authors:
Guanhao Feng,
Nicholas Polson,
Yuexi Wang,
Jianeng Xu
Abstract:
Sparse alpha-norm regularization has many data-rich applications in Marketing and Economics. Alpha-norm, in contrast to lasso and ridge regularization, jumps to a sparse solution. This feature is attractive for ultra high-dimensional problems that occur in demand estimation and forecasting. The alpha-norm objective is nonconvex and requires coordinate descent and proximal operators to find the sparse solution. We study a typical marketing demand forecasting problem, grocery store sales for salty snacks, that has many dummy variables as controls. The key predictors of demand include price, equivalized volume, promotion, flavor, scent, and brand effects. By comparing with many commonly used machine learning methods, alpha-norm regularization achieves its goal of providing accurate out-of-sample estimates for the promotion lift effects. Finally, we conclude with directions for future research.
Submitted 5 February, 2018; v1 submitted 1 September, 2017;
originally announced September 2017.
-
Regularizing Bayesian Predictive Regressions
Authors:
Guanhao Feng,
Nicholas G. Polson
Abstract:
We show that regularizing Bayesian predictive regressions provides a framework for prior sensitivity analysis. We develop a procedure that jointly regularizes expectations and variance-covariance matrices using a pair of shrinkage priors. Our methodology applies directly to vector autoregressions (VAR) and seemingly unrelated regressions (SUR). The regularization path provides a prior sensitivity diagnostic. By exploiting a duality between regularization penalties and predictive prior distributions, we reinterpret two classic Bayesian analyses of macro-finance studies: equity premium predictability and forecasting macroeconomic growth rates. We find there exist plausible prior specifications for predictability in excess S&P 500 index returns using book-to-market ratios, CAY (consumption, wealth, income ratio), and T-bill rates. We evaluate the forecasts using a market-timing strategy, and we show the optimally regularized solution outperforms a buy-and-hold approach. A second empirical application involves forecasting industrial production, inflation, and consumption growth rates, and demonstrates the feasibility of our approach.
Submitted 13 September, 2017; v1 submitted 6 June, 2016;
originally announced June 2016.
-
The Market for English Premier League (EPL) Odds
Authors:
Guanhao Feng,
Nicholas G. Polson,
Jianeng Xu
Abstract:
This paper employs a Skellam process to represent real-time betting odds for English Premier League (EPL) soccer games. Given a matrix of market odds on all possible score outcomes, we estimate the expected scoring rates for each team. The expected scoring rates then define the implied volatility of an EPL game. As events in the game evolve, we re-estimate the expected scoring rates and our implied volatility measure to provide a dynamic representation of the market's expectation of the game outcome. Using a dataset of 1520 EPL games from 2012-2016, we show how our model calibrates well to the game outcome. We illustrate our methodology on real-time market odds data for a game between Everton and West Ham in the 2015-2016 season. We show how the implied volatility for the outcome evolves as goals, red cards, and corner kicks occur. Finally, we conclude with directions for future research.
Submitted 5 January, 2017; v1 submitted 12 April, 2016;
originally announced April 2016.