-
CAVACHON: a hierarchical variational autoencoder to integrate multi-modal single-cell data
Authors:
**-Han Hsieh,
Ru-Xiu Hsiao,
Katalin Ferenc,
Anthony Mathelier,
Rebekka Burkholz,
Chien-Yu Chen,
Geir Kjetil Sandve,
Tatiana Belova,
Marieke Lydia Kuijjer
Abstract:
Paired single-cell sequencing technologies enable the simultaneous measurement of complementary modalities of molecular data at single-cell resolution. Along with the advances in these technologies, many methods based on variational autoencoders have been developed to integrate these data. However, these methods do not explicitly incorporate prior biological relationships between the data modaliti…
▽ More
Paired single-cell sequencing technologies enable the simultaneous measurement of complementary modalities of molecular data at single-cell resolution. Along with the advances in these technologies, many methods based on variational autoencoders have been developed to integrate these data. However, these methods do not explicitly incorporate prior biological relationships between the data modalities, which could significantly enhance modeling and interpretation. We propose a novel probabilistic learning framework that explicitly incorporates conditional independence relationships between multi-modal data as a directed acyclic graph using a generalized hierarchical variational autoencoder. We demonstrate the versatility of our framework across various applications pertinent to single-cell multi-omics data integration. These include the isolation of common and distinct information from different modalities, modality-specific differential analysis, and integrated cell clustering. We anticipate that the proposed framework can facilitate the construction of highly flexible graphical models that can capture the complexities of biological hypotheses and unravel the connections between different biological data types, such as different modalities of paired single-cell multi-omics data. The implementation of the proposed framework can be found in the repository https://github.com/kuijjerlab/CAVACHON.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Diffusion-Reward Adversarial Imitation Learning
Authors:
Chun-Mao Lai,
Hsiang-Chun Wang,
**-Chun Hsieh,
Yu-Chiang Frank Wang,
Min-Hung Chen,
Shao-Hua Sun
Abstract:
Imitation learning aims to learn a policy from observing expert demonstrations without access to reward signals from environments. Generative adversarial imitation learning (GAIL) formulates imitation learning as adversarial learning, employing a generator policy learning to imitate expert behaviors and discriminator learning to distinguish the expert demonstrations from agent trajectories. Despit…
▽ More
Imitation learning aims to learn a policy from observing expert demonstrations without access to reward signals from environments. Generative adversarial imitation learning (GAIL) formulates imitation learning as adversarial learning, employing a generator policy learning to imitate expert behaviors and discriminator learning to distinguish the expert demonstrations from agent trajectories. Despite its encouraging results, GAIL training is often brittle and unstable. Inspired by the recent dominance of diffusion models in generative modeling, this work proposes Diffusion-Reward Adversarial Imitation Learning (DRAIL), which integrates a diffusion model into GAIL, aiming to yield more precise and smoother rewards for policy learning. Specifically, we propose a diffusion discriminative classifier to construct an enhanced discriminator; then, we design diffusion rewards based on the classifier's output for policy learning. We conduct extensive experiments in navigation, manipulation, and locomotion, verifying DRAIL's effectiveness compared to prior imitation learning methods. Moreover, additional experimental results demonstrate the generalizability and data efficiency of DRAIL. Visualized learned reward functions of GAIL and DRAIL suggest that DRAIL can produce more precise and smoother rewards.
△ Less
Submitted 25 May, 2024;
originally announced May 2024.
-
Image Deraining via Self-supervised Reinforcement Learning
Authors:
He-Hao Liao,
Yan-Tsung Peng,
Wen-Tao Chu,
**-Chun Hsieh,
Chung-Chi Tsai
Abstract:
The quality of images captured outdoors is often affected by the weather. One factor that interferes with sight is rain, which can obstruct the view of observers and computer vision applications that rely on those images. The work aims to recover rain images by removing rain streaks via Self-supervised Reinforcement Learning (RL) for image deraining (SRL-Derain). We locate rain streak pixels from…
▽ More
The quality of images captured outdoors is often affected by the weather. One factor that interferes with sight is rain, which can obstruct the view of observers and computer vision applications that rely on those images. The work aims to recover rain images by removing rain streaks via Self-supervised Reinforcement Learning (RL) for image deraining (SRL-Derain). We locate rain streak pixels from the input rain image via dictionary learning and use pixel-wise RL agents to take multiple inpainting actions to remove rain progressively. To our knowledge, this work is the first attempt where self-supervised RL is applied to image deraining. Experimental results on several benchmark image-deraining datasets show that the proposed SRL-Derain performs favorably against state-of-the-art few-shot and self-supervised deraining and denoising methods.
△ Less
Submitted 27 March, 2024;
originally announced March 2024.
-
Offline Imitation of Badminton Player Behavior via Experiential Contexts and Brownian Motion
Authors:
Kuang-Da Wang,
Wei-Yao Wang,
**-Chun Hsieh,
Wen-Chih Peng
Abstract:
In the dynamic and rapid tactic involvements of turn-based sports, badminton stands out as an intrinsic paradigm that requires alter-dependent decision-making of players. While the advancement of learning from offline expert data in sequential decision-making has been witnessed in various domains, how to rally-wise imitate the behaviors of human players from offline badminton matches has remained…
▽ More
In the dynamic and rapid tactic involvements of turn-based sports, badminton stands out as an intrinsic paradigm that requires alter-dependent decision-making of players. While the advancement of learning from offline expert data in sequential decision-making has been witnessed in various domains, how to rally-wise imitate the behaviors of human players from offline badminton matches has remained underexplored. Replicating opponents' behavior benefits players by allowing them to undergo strategic development with direction before matches. However, directly applying existing methods suffers from the inherent hierarchy of the match and the compounding effect due to the turn-based nature of players alternatively taking actions. In this paper, we propose RallyNet, a novel hierarchical offline imitation learning model for badminton player behaviors: (i) RallyNet captures players' decision dependencies by modeling decision-making processes as a contextual Markov decision process. (ii) RallyNet leverages the experience to generate context as the agent's intent in the rally. (iii) To generate more realistic behavior, RallyNet leverages Geometric Brownian Motion (GBM) to model the interactions between players by introducing a valuable inductive bias for learning player behaviors. In this manner, RallyNet links player intents with interaction models with GBM, providing an understanding of interactions for sports analytics. We extensively validate RallyNet with the largest available real-world badminton dataset consisting of men's and women's singles, demonstrating its ability to imitate player behaviors. Results reveal RallyNet's superiority over offline imitation learning methods and state-of-the-art turn-based approaches, outperforming them by at least 16% in mean rule-based agent normalization score. Furthermore, we discuss various practical use cases to highlight RallyNet's applicability.
△ Less
Submitted 18 March, 2024;
originally announced March 2024.
-
PPO-Clip Attains Global Optimality: Towards Deeper Understandings of Clip**
Authors:
Nai-Chieh Huang,
**-Chun Hsieh,
Kuo-Hao Ho,
I-Chen Wu
Abstract:
Proximal Policy Optimization algorithm employing a clipped surrogate objective (PPO-Clip) is a prominent exemplar of the policy optimization methods. However, despite its remarkable empirical success, PPO-Clip lacks theoretical substantiation to date. In this paper, we contribute to the field by establishing the first global convergence results of a PPO-Clip variant in both tabular and neural func…
▽ More
Proximal Policy Optimization algorithm employing a clipped surrogate objective (PPO-Clip) is a prominent exemplar of the policy optimization methods. However, despite its remarkable empirical success, PPO-Clip lacks theoretical substantiation to date. In this paper, we contribute to the field by establishing the first global convergence results of a PPO-Clip variant in both tabular and neural function approximation settings. Our findings highlight the $O(1/\sqrt{T})$ min-iterate convergence rate specifically in the context of neural function approximation. We tackle the inherent challenges in analyzing PPO-Clip through three central concepts: (i) We introduce a generalized version of the PPO-Clip objective, illuminated by its connection with the hinge loss. (ii) Employing entropic mirror descent, we establish asymptotic convergence for tabular PPO-Clip with direct policy parameterization. (iii) Inspired by the tabular analysis, we streamline convergence analysis by introducing a two-step policy improvement approach. This decouples policy search from complex neural policy parameterization using a regression-based update scheme. Furthermore, we gain deeper insights into the efficacy of PPO-Clip by interpreting these generalized objectives. Our theoretical findings also mark the first characterization of the influence of the clip** mechanism on PPO-Clip convergence. Importantly, the clip** range affects only the pre-constant of the convergence rate.
△ Less
Submitted 19 February, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning
Authors:
Yen-Ju Chen,
Nai-Chieh Huang,
Ching-Pei Lee,
**-Chun Hsieh
Abstract:
Various acceleration approaches for Policy Gradient (PG) have been analyzed within the realm of Reinforcement Learning (RL). However, the theoretical understanding of the widely used momentum-based acceleration method on PG remains largely open. In response to this gap, we adapt the celebrated Nesterov's accelerated gradient (NAG) method to policy optimization in RL, termed \textit{Accelerated Pol…
▽ More
Various acceleration approaches for Policy Gradient (PG) have been analyzed within the realm of Reinforcement Learning (RL). However, the theoretical understanding of the widely used momentum-based acceleration method on PG remains largely open. In response to this gap, we adapt the celebrated Nesterov's accelerated gradient (NAG) method to policy optimization in RL, termed \textit{Accelerated Policy Gradient} (APG). To demonstrate the potential of APG in achieving fast convergence, we formally prove that with the true gradient and under the softmax policy parametrization, APG converges to an optimal policy at rates: (i) $\tilde{O}(1/t^2)$ with constant step sizes; (ii) $O(e^{-ct})$ with exponentially-growing step sizes. To the best of our knowledge, this is the first characterization of the convergence rates of NAG in the context of RL. Notably, our analysis relies on one interesting finding: Regardless of the parameter initialization, APG ends up entering a locally nearly-concave regime, where APG can significantly benefit from the momentum, within finite iterations. Through numerical validation and experiments on the Atari 2600 benchmarks, we confirm that APG exhibits a $\tilde{O}(1/t^2)$ rate with constant step sizes and a linear convergence rate with exponentially-growing step sizes, significantly improving convergence over the standard PG.
△ Less
Submitted 6 June, 2024; v1 submitted 18 October, 2023;
originally announced October 2023.
-
Value-Biased Maximum Likelihood Estimation for Model-based Reinforcement Learning in Discounted Linear MDPs
Authors:
Yu-Heng Hung,
**-Chun Hsieh,
Akshay Mete,
P. R. Kumar
Abstract:
We consider the infinite-horizon linear Markov Decision Processes (MDPs), where the transition probabilities of the dynamic model can be linearly parameterized with the help of a predefined low-dimensional feature map**. While the existing regression-based approaches have been theoretically shown to achieve nearly-optimal regret, they are computationally rather inefficient due to the need for a…
▽ More
We consider the infinite-horizon linear Markov Decision Processes (MDPs), where the transition probabilities of the dynamic model can be linearly parameterized with the help of a predefined low-dimensional feature map**. While the existing regression-based approaches have been theoretically shown to achieve nearly-optimal regret, they are computationally rather inefficient due to the need for a large number of optimization runs in each time step, especially when the state and action spaces are large. To address this issue, we propose to solve linear MDPs through the lens of Value-Biased Maximum Likelihood Estimation (VBMLE), which is a classic model-based exploration principle in the adaptive control literature for resolving the well-known closed-loop identification problem of Maximum Likelihood Estimation. We formally show that (i) VBMLE enjoys $\widetilde{O}(d\sqrt{T})$ regret, where $T$ is the time horizon and $d$ is the dimension of the model parameter, and (ii) VBMLE is computationally more efficient as it only requires solving one optimization problem in each time step. In our regret analysis, we offer a generic convergence result of MLE in linear MDPs through a novel supermartingale construct and uncover an interesting connection between linear MDPs and online learning, which could be of independent interest. Finally, the simulation results show that VBMLE significantly outperforms the benchmark method in terms of both empirical regret and computation time.
△ Less
Submitted 17 October, 2023;
originally announced October 2023.
-
Towards Human-Like RL: Taming Non-Naturalistic Behavior in Deep RL via Adaptive Behavioral Costs in 3D Games
Authors:
Kuo-Hao Ho,
**-Chun Hsieh,
Chiu-Chou Lin,
You-Ren Luo,
Feng-Jian Wang,
I-Chen Wu
Abstract:
In this paper, we propose a new approach called Adaptive Behavioral Costs in Reinforcement Learning (ABC-RL) for training a human-like agent with competitive strength. While deep reinforcement learning agents have recently achieved superhuman performance in various video games, some of these unconstrained agents may exhibit actions, such as shaking and spinning, that are not typically observed in…
▽ More
In this paper, we propose a new approach called Adaptive Behavioral Costs in Reinforcement Learning (ABC-RL) for training a human-like agent with competitive strength. While deep reinforcement learning agents have recently achieved superhuman performance in various video games, some of these unconstrained agents may exhibit actions, such as shaking and spinning, that are not typically observed in human behavior, resulting in peculiar gameplay experiences. To behave like humans and retain similar performance, ABC-RL augments behavioral limitations as cost signals in reinforcement learning with dynamically adjusted weights. Unlike traditional constrained policy optimization, we propose a new formulation that minimizes the behavioral costs subject to a constraint of the value function. By leveraging the augmented Lagrangian, our approach is an approximation of the Lagrangian adjustment, which handles the trade-off between the performance and the human-like behavior. Through experiments conducted on 3D games in DMLab-30 and Unity ML-Agents Toolkit, we demonstrate that ABC-RL achieves the same performance level while significantly reducing instances of shaking and spinning. These findings underscore the effectiveness of our proposed approach in promoting more natural and human-like behavior during gameplay.
△ Less
Submitted 27 September, 2023;
originally announced September 2023.
-
Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees
Authors:
Hsin-En Su,
Yen-Ju Chen,
**-Chun Hsieh,
Xi Liu
Abstract:
We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly-used approach is to leverage the off-policy policy gradient to optimize a surrogate objective -- the total discounted in expectation return of the target policy with respect to the state distribution of the behavior policy. However, this approach has been shown to suffer from the d…
▽ More
We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly-used approach is to leverage the off-policy policy gradient to optimize a surrogate objective -- the total discounted in expectation return of the target policy with respect to the state distribution of the behavior policy. However, this approach has been shown to suffer from the distribution mismatch issue, and therefore significant efforts are needed for correcting this mismatch either via state distribution correction or a counterfactual method. In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient. This design obviates the need for distribution correction or importance sampling in the policy improvement step of off-policy policy gradient. We establish the global convergence of CAPO with general coordinate selection and then further quantify the convergence rates of several instances of CAPO with popular coordinate selection rules, including the cyclic and the randomized variants of CAPO. We then extend CAPO to neural policies for a more practical implementation. Through experiments, we demonstrate that CAPO provides a competitive approach to RL in practice.
△ Less
Submitted 10 December, 2022;
originally announced December 2022.
-
Q-Pensieve: Boosting Sample Efficiency of Multi-Objective RL Through Memory Sharing of Q-Snapshots
Authors:
Wei Hung,
Bo-Kai Huang,
**-Chun Hsieh,
Xi Liu
Abstract:
Many real-world continuous control problems are in the dilemma of weighing the pros and cons, multi-objective reinforcement learning (MORL) serves as a generic framework of learning control policies for different preferences over objectives. However, the existing MORL methods either rely on multiple passes of explicit search for finding the Pareto front and therefore are not sample-efficient, or u…
▽ More
Many real-world continuous control problems are in the dilemma of weighing the pros and cons, multi-objective reinforcement learning (MORL) serves as a generic framework of learning control policies for different preferences over objectives. However, the existing MORL methods either rely on multiple passes of explicit search for finding the Pareto front and therefore are not sample-efficient, or utilizes a shared policy network for coarse knowledge sharing among policies. To boost the sample efficiency of MORL, we propose Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots to jointly determine the policy update direction and thereby enables data sharing at the policy level. We show that Q-Pensieve can be naturally integrated with soft policy iteration with convergence guarantee. To substantiate this concept, we propose the technique of Q replay buffer, which stores the learned Q-networks from the past iterations, and arrive at a practical actor-critic implementation. Through extensive experiments and an ablation study, we demonstrate that with much fewer samples, the proposed algorithm can outperform the benchmark MORL methods on a variety of MORL benchmark tasks.
△ Less
Submitted 6 December, 2022;
originally announced December 2022.
-
Neural Frank-Wolfe Policy Optimization for Region-of-Interest Intra-Frame Coding with HEVC/H.265
Authors:
Yung-Han Ho,
Chia-Hao Kao,
Wen-Hsiao Peng,
**-Chun Hsieh
Abstract:
This paper presents a reinforcement learning (RL) framework that utilizes Frank-Wolfe policy optimization to solve Coding-Tree-Unit (CTU) bit allocation for Region-of-Interest (ROI) intra-frame coding. Most previous RL-based methods employ the single-critic design, where the rewards for distortion minimization and rate regularization are weighted by an empirically chosen hyper-parameter. Recently,…
▽ More
This paper presents a reinforcement learning (RL) framework that utilizes Frank-Wolfe policy optimization to solve Coding-Tree-Unit (CTU) bit allocation for Region-of-Interest (ROI) intra-frame coding. Most previous RL-based methods employ the single-critic design, where the rewards for distortion minimization and rate regularization are weighted by an empirically chosen hyper-parameter. Recently, the dual-critic design is proposed to update the actor by alternating the rate and distortion critics. However, its convergence is not guaranteed. To address these issues, we introduce Neural Frank-Wolfe Policy Optimization (NFWPO) in formulating the CTU-level bit allocation as an action-constrained RL problem. In this new framework, we exploit a rate critic to predict a feasible set of actions. With this feasible set, a distortion critic is invoked to update the actor to maximize the ROI-weighted image quality subject to a rate constraint. Experimental results produced with x265 confirm the superiority of the proposed method to the other baselines.
△ Less
Submitted 27 September, 2022;
originally announced September 2022.
-
Reward-Biased Maximum Likelihood Estimation for Neural Contextual Bandits
Authors:
Yu-Heng Hung,
**-Chun Hsieh
Abstract:
Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling explore-exploit trade-offs. This paper studies the stochastic contextual bandit problem with general bounded reward functions and proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration. NeuralRBMLE leverages th…
▽ More
Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling explore-exploit trade-offs. This paper studies the stochastic contextual bandit problem with general bounded reward functions and proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration. NeuralRBMLE leverages the representation power of neural networks and directly encodes exploratory behavior in the parameter space, without constructing confidence intervals of the estimated rewards. We propose two variants of NeuralRBMLE algorithms: The first variant directly obtains the RBMLE estimator by gradient ascent, and the second variant simplifies RBMLE to a simple index policy through an approximation. We show that both algorithms achieve $\widetilde{\mathcal{O}}(\sqrt{T})$ regret. Through extensive experiments, we demonstrate that the NeuralRBMLE algorithms achieve comparable or better empirical regrets than the state-of-the-art methods on real-world datasets with non-linear reward functions.
△ Less
Submitted 29 May, 2022; v1 submitted 8 March, 2022;
originally announced March 2022.
-
Neural PPO-Clip Attains Global Optimality: A Hinge Loss Perspective
Authors:
Nai-Chieh Huang,
**-Chun Hsieh,
Kuo-Hao Ho,
Hsuan-Yu Yao,
Kai-Chun Hu,
Liang-Chun Ouyang,
I-Chen Wu
Abstract:
Policy optimization is a fundamental principle for designing reinforcement learning algorithms, and one example is the proximal policy optimization algorithm with a clipped surrogate objective (PPO-Clip), which has been popularly used in deep reinforcement learning due to its simplicity and effectiveness. Despite its superior empirical performance, PPO-Clip has not been justified via theoretical p…
▽ More
Policy optimization is a fundamental principle for designing reinforcement learning algorithms, and one example is the proximal policy optimization algorithm with a clipped surrogate objective (PPO-Clip), which has been popularly used in deep reinforcement learning due to its simplicity and effectiveness. Despite its superior empirical performance, PPO-Clip has not been justified via theoretical proof up to date. In this paper, we establish the first global convergence rate of PPO-Clip under neural function approximation. We identify the fundamental challenges of analyzing PPO-Clip and address them with the two core ideas: (i) We reinterpret PPO-Clip from the perspective of hinge loss, which connects policy improvement with solving a large-margin classification problem with hinge loss and offers a generalized version of the PPO-Clip objective. (ii) Based on the above viewpoint, we propose a two-step policy improvement scheme, which facilitates the convergence analysis by decoupling policy search from the complex neural policy parameterization with the help of entropic mirror descent and a regression-based policy update scheme. Moreover, our theoretical results provide the first characterization of the effect of the clip** mechanism on the convergence of PPO-Clip. Through experiments, we empirically validate the reinterpretation of PPO-Clip and the generalized objective with various classifiers on various RL benchmark tasks.
△ Less
Submitted 31 August, 2022; v1 submitted 26 October, 2021;
originally announced October 2021.
-
NeurWIN: Neural Whittle Index Network For Restless Bandits Via Deep RL
Authors:
Khaled Nakhleh,
Santosh Ganji,
**-Chun Hsieh,
I-Hong Hou,
Srinivas Shakkottai
Abstract:
Whittle index policy is a powerful tool to obtain asymptotically optimal solutions for the notoriously intractable problem of restless bandits. However, finding the Whittle indices remains a difficult problem for many practical restless bandits with convoluted transition kernels. This paper proposes NeurWIN, a neural Whittle index network that seeks to learn the Whittle indices for any restless ba…
▽ More
Whittle index policy is a powerful tool to obtain asymptotically optimal solutions for the notoriously intractable problem of restless bandits. However, finding the Whittle indices remains a difficult problem for many practical restless bandits with convoluted transition kernels. This paper proposes NeurWIN, a neural Whittle index network that seeks to learn the Whittle indices for any restless bandits by leveraging mathematical properties of the Whittle indices. We show that a neural network that produces the Whittle index is also one that produces the optimal control for a set of Markov decision problems. This property motivates using deep reinforcement learning for the training of NeurWIN. We demonstrate the utility of NeurWIN by evaluating its performance for three recently studied restless bandit problems. Our experiment results show that the performance of NeurWIN is significantly better than other RL algorithms.
△ Less
Submitted 19 January, 2022; v1 submitted 5 October, 2021;
originally announced October 2021.
-
Reinforced Few-Shot Acquisition Function Learning for Bayesian Optimization
Authors:
Bing-**g Hsieh,
**-Chun Hsieh,
Xi Liu
Abstract:
Bayesian optimization (BO) conventionally relies on handcrafted acquisition functions (AFs) to sequentially determine the sample points. However, it has been widely observed in practice that the best-performing AF in terms of regret can vary significantly under different types of black-box functions. It has remained a challenge to design one AF that can attain the best performance over a wide vari…
▽ More
Bayesian optimization (BO) conventionally relies on handcrafted acquisition functions (AFs) to sequentially determine the sample points. However, it has been widely observed in practice that the best-performing AF in terms of regret can vary significantly under different types of black-box functions. It has remained a challenge to design one AF that can attain the best performance over a wide variety of black-box functions. This paper aims to attack this challenge through the perspective of reinforced few-shot AF learning (FSAF). Specifically, we first connect the notion of AFs with Q-functions and view a deep Q-network (DQN) as a surrogate differentiable AF. While it serves as a natural idea to combine DQN and an existing few-shot learning method, we identify that such a direct combination does not perform well due to severe overfitting, which is particularly critical in BO due to the need of a versatile sampling policy. To address this, we present a Bayesian variant of DQN with the following three features: (i) It learns a distribution of Q-networks as AFs based on the Kullback-Leibler regularization framework. This inherently provides the uncertainty required in sampling for BO and mitigates overfitting. (ii) For the prior of the Bayesian DQN, we propose to use a demo policy induced by an off-the-shelf AF for better training stability. (iii) On the meta-level, we leverage the meta-loss of Bayesian model-agnostic meta-learning, which serves as a natural companion to the proposed FSAF. Moreover, with the proper design of the Q-networks, FSAF is general-purpose in that it is agnostic to the dimension and the cardinality of the input domain. Through extensive experiments, we demonstrate that the FSAF achieves comparable or better regrets than the state-of-the-art benchmarks on a wide variety of synthetic and real-world test functions.
△ Less
Submitted 8 June, 2021;
originally announced June 2021.
-
Esca** from Zero Gradient: Revisiting Action-Constrained Reinforcement Learning via Frank-Wolfe Policy Optimization
Authors:
Jyun-Li Lin,
Wei Hung,
Shang-Hsuan Yang,
**-Chun Hsieh,
Xi Liu
Abstract:
Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications, such as scheduling in networked systems with resource constraints and control of a robot with kinematic constraints. While the existing projection-based approaches ensure zero constraint violation, they could suffer from the zero-gradient problem due to the tight coupling of the policy grad…
▽ More
Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications, such as scheduling in networked systems with resource constraints and control of a robot with kinematic constraints. While the existing projection-based approaches ensure zero constraint violation, they could suffer from the zero-gradient problem due to the tight coupling of the policy gradient and the projection, which results in sample-inefficient training and slow convergence. To tackle this issue, we propose a learning algorithm that decouples the action constraints from the policy parameter update by leveraging state-wise Frank-Wolfe and a regression-based policy update scheme. Moreover, we show that the proposed algorithm enjoys convergence and policy improvement properties in the tabular case as well as generalizes the popular DDPG algorithm for action-constrained RL in the general case. Through experiments, we demonstrate that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
△ Less
Submitted 2 August, 2021; v1 submitted 22 February, 2021;
originally announced February 2021.
-
Reward-Biased Maximum Likelihood Estimation for Linear Stochastic Bandits
Authors:
Yu-Heng Hung,
**-Chun Hsieh,
Xi Liu,
P. R. Kumar
Abstract:
Modifying the reward-biased maximum likelihood method originally proposed in the adaptive control literature, we propose novel learning algorithms to handle the explore-exploit trade-off in linear bandits problems as well as generalized linear bandits problems. We develop novel index policies that we prove achieve order-optimality, and show that they achieve empirical performance competitive with…
▽ More
Modifying the reward-biased maximum likelihood method originally proposed in the adaptive control literature, we propose novel learning algorithms to handle the explore-exploit trade-off in linear bandits problems as well as generalized linear bandits problems. We develop novel index policies that we prove achieve order-optimality, and show that they achieve empirical performance competitive with the state-of-the-art benchmark methods in extensive experiments. The new policies achieve this with low computation time per pull for linear bandits, and thereby resulting in both favorable regret as well as computational efficiency.
△ Less
Submitted 8 October, 2020;
originally announced October 2020.
-
Develo** Multi-Task Recommendations with Long-Term Rewards via Policy Distilled Reinforcement Learning
Authors:
Xi Liu,
Li Li,
**-Chun Hsieh,
Muhe Xie,
Yong Ge,
Rui Chen
Abstract:
With the explosive growth of online products and content, recommendation techniques have been considered as an effective tool to overcome information overload, improve user experience, and boost business revenue. In recent years, we have observed a new desideratum of considering long-term rewards of multiple related recommendation tasks simultaneously. The consideration of long-term rewards is str…
▽ More
With the explosive growth of online products and content, recommendation techniques have been considered as an effective tool to overcome information overload, improve user experience, and boost business revenue. In recent years, we have observed a new desideratum of considering long-term rewards of multiple related recommendation tasks simultaneously. The consideration of long-term rewards is strongly tied to business revenue and growth. Learning multiple tasks simultaneously could generally improve the performance of individual task due to knowledge sharing in multi-task learning. While a few existing works have studied long-term rewards in recommendations, they mainly focus on a single recommendation task. In this paper, we propose {\it PoDiRe}: a \underline{po}licy \underline{di}stilled \underline{re}commender that can address long-term rewards of recommendations and simultaneously handle multiple recommendation tasks. This novel recommendation solution is based on a marriage of deep reinforcement learning and knowledge distillation techniques, which is able to establish knowledge sharing among different tasks and reduce the size of a learning model. The resulting model is expected to attain better performance and lower response latency for real-time recommendation services. In collaboration with Samsung Game Launcher, one of the world's largest commercial mobile game platforms, we conduct a comprehensive experimental study on large-scale real data with hundreds of millions of events and show that our solution outperforms many state-of-the-art methods in terms of several standard evaluation metrics.
△ Less
Submitted 27 January, 2020;
originally announced January 2020.
-
Fresher Content or Smoother Playback? A Brownian-Approximation Framework for Scheduling Real-Time Wireless Video Streams
Authors:
**-Chun Hsieh,
Xi Liu,
I-Hong Hou
Abstract:
This paper presents a Brownian-approximation framework to optimize the quality of experience (QoE) for real-time video streaming in wireless networks. In real-time video streaming, one major challenge is to tackle the natural tension between the two most critical QoE metrics: playback latency and video interruption. To study this trade-off, we first propose an analytical model that precisely captu…
▽ More
This paper presents a Brownian-approximation framework to optimize the quality of experience (QoE) for real-time video streaming in wireless networks. In real-time video streaming, one major challenge is to tackle the natural tension between the two most critical QoE metrics: playback latency and video interruption. To study this trade-off, we first propose an analytical model that precisely captures all aspects of the playback process of a real-time video stream, including playback latency, video interruptions, and packet drop**. Built on this model, we show that the playback process of a real-time video can be approximated by a two-sided reflected Brownian motion. Through such Brownian approximation, we are able to study the fundamental limits of the two QoE metrics and characterize a necessary and sufficient condition for a set of QoE performance requirements to be feasible. We propose a scheduling policy that satisfies any feasible set of QoE performance requirements and then obtain simple rules on the trade-off between playback latency and the video interrupt rates, in both heavy-traffic and under-loaded regimes. Finally, simulation results verify the accuracy of the proposed approximation and show that the proposed policy outperforms other popular baseline policies.
△ Less
Submitted 23 October, 2020; v1 submitted 3 November, 2019;
originally announced November 2019.
-
Exploration Through Reward Biasing: Reward-Biased Maximum Likelihood Estimation for Stochastic Multi-Armed Bandits
Authors:
Xi Liu,
**-Chun Hsieh,
Anirban Bhattacharya,
P. R. Kumar
Abstract:
Inspired by the Reward-Biased Maximum Likelihood Estimate method of adaptive control, we propose RBMLE -- a novel family of learning algorithms for stochastic multi-armed bandits (SMABs). For a broad range of SMABs including both the parametric Exponential Family as well as the non-parametric sub-Gaussian/Exponential family, we show that RBMLE yields an index policy. To choose the bias-growth rate…
▽ More
Inspired by the Reward-Biased Maximum Likelihood Estimate method of adaptive control, we propose RBMLE -- a novel family of learning algorithms for stochastic multi-armed bandits (SMABs). For a broad range of SMABs including both the parametric Exponential Family as well as the non-parametric sub-Gaussian/Exponential family, we show that RBMLE yields an index policy. To choose the bias-growth rate $α(t)$ in RBMLE, we reveal the nontrivial interplay between $α(t)$ and the regret bound that generally applies in both the Exponential Family as well as the sub-Gaussian/Exponential family bandits. To quantify the finite-time performance, we prove that RBMLE attains order-optimality by adaptively estimating the unknown constants in the expression of $α(t)$ for Gaussian and sub-Gaussian bandits. Extensive experiments demonstrate that the proposed RBMLE achieves empirical regret performance competitive with the state-of-the-art methods, while being more computationally efficient and scalable in comparison to the best-performing ones among them.
△ Less
Submitted 23 October, 2020; v1 submitted 2 July, 2019;
originally announced July 2019.
-
Streaming Network Embedding through Local Actions
Authors:
Xi Liu,
**-Chun Hsieh,
Nick Duffield,
Rui Chen,
Muhe Xie,
Xidao Wen
Abstract:
Recently, considerable research attention has been paid to network embedding, a popular approach to construct feature vectors of vertices. Due to the curse of dimensionality and sparsity in graphical datasets, this approach has become indispensable for machine learning tasks over large networks. The majority of existing literature has considered this technique under the assumption that the network…
▽ More
Recently, considerable research attention has been paid to network embedding, a popular approach to construct feature vectors of vertices. Due to the curse of dimensionality and sparsity in graphical datasets, this approach has become indispensable for machine learning tasks over large networks. The majority of existing literature has considered this technique under the assumption that the network is static. However, networks in many applications, nodes and edges accrue to a growing network as a streaming. A small number of very recent results have addressed the problem of embedding for dynamic networks. However, they either rely on knowledge of vertex attributes, suffer high-time complexity or need to be re-trained without closed-form expression. Thus the approach of adapting the existing methods to the streaming environment faces non-trivial technical challenges.
These challenges motivate develo** new approaches to the problems of streaming network embedding. In this paper, We propose a new framework that is able to generate latent features for new vertices with high efficiency and low complexity under specified iteration rounds. We formulate a constrained optimization problem for the modification of the representation resulting from a stream arrival. We show this problem has no closed-form solution and instead develop an online approximation solution. Our solution follows three steps: (1) identify vertices affected by new vertices, (2) generate latent features for new vertices, and (3) update the latent features of the most affected vertices. The generated representations are provably feasible and not far from the optimal ones in terms of expectation. Multi-class classification and clustering on five real-world networks demonstrate that our model can efficiently update vertex representations and simultaneously achieve comparable or even better performance.
△ Less
Submitted 14 November, 2018;
originally announced November 2018.
-
Stay With Me: Lifetime Maximization Through Heteroscedastic Linear Bandits With Reneging
Authors:
**-Chun Hsieh,
Xi Liu,
Anirban Bhattacharya,
P. R. Kumar
Abstract:
Sequential decision making for lifetime maximization is a critical problem in many real-world applications, such as medical treatment and portfolio selection. In these applications, a "reneging" phenomenon, where participants may disengage from future interactions after observing an unsatisfiable outcome, is rather prevalent. To address the above issue, this paper proposes a model of heteroscedast…
▽ More
Sequential decision making for lifetime maximization is a critical problem in many real-world applications, such as medical treatment and portfolio selection. In these applications, a "reneging" phenomenon, where participants may disengage from future interactions after observing an unsatisfiable outcome, is rather prevalent. To address the above issue, this paper proposes a model of heteroscedastic linear bandits with reneging, which allows each participant to have a distinct "satisfaction level," with any interaction outcome falling short of that level resulting in that participant reneging. Moreover, it allows the variance of the outcome to be context-dependent. Based on this model, we develop a UCB-type policy, namely HR-UCB, and prove that it achieves $\mathcal{O}\big(\sqrt{{T}(\log({T}))^{3}}\big)$ regret. Finally, we validate the performance of HR-UCB via simulations.
△ Less
Submitted 15 May, 2019; v1 submitted 29 October, 2018;
originally announced October 2018.
-
Throughput-Optimal Scheduling for Multi-Hop Networked Transportation Systems With Switch-Over Delay
Authors:
**-Chun Hsieh,
Xi Liu,
Jian Jiao,
I-Hong Hou,
Yunlong Zhang,
P. R. Kumar
Abstract:
The emerging connected-vehicle technology provides a new dimension in develo** more intelligent traffic control algorithms for signalized intersections in networked transportation systems. An important challenge for the scheduling problem in networked transportation systems is the switch-over delay caused by the guard time before any traffic signal change. The switch-over delay can result in sig…
▽ More
The emerging connected-vehicle technology provides a new dimension in develo** more intelligent traffic control algorithms for signalized intersections in networked transportation systems. An important challenge for the scheduling problem in networked transportation systems is the switch-over delay caused by the guard time before any traffic signal change. The switch-over delay can result in significant loss of system capacity and hence needs to be accommodated in the scheduling design. To tackle this challenge, we propose a distributed online scheduling policy that extends the well-known Max-Pressure policy to address switch-over delay by introducing a bias factor toward the current schedule. We prove that the proposed policy is throughput-optimal with switch-over delay. Furthermore, the proposed policy remains optimal when there are both connected signalized intersections and conventional fixed-time ones in the system. With connected-vehicle technology, the proposed policy can be easily incorporated into the current transportation systems without additional infrastructure. Through extensive simulation in VISSIM, we show that our policy indeed outperforms the existing popular policies.
△ Less
Submitted 18 January, 2017; v1 submitted 14 January, 2017;
originally announced January 2017.
-
Delay-Optimal Scheduling for Queueing Systems with Switching Overhead
Authors:
**-Chun Hsieh,
I-Hong Hou,
Xi Liu
Abstract:
We study the scheduling polices for asymptotically optimal delay in queueing systems with switching overhead. Such systems consist of a single server that serves multiple queues, and some capacity is lost whenever the server switches to serve a different set of queues. The capacity loss due to this switching overhead can be significant in many emerging applications, and needs to be explicitly addr…
▽ More
We study the scheduling polices for asymptotically optimal delay in queueing systems with switching overhead. Such systems consist of a single server that serves multiple queues, and some capacity is lost whenever the server switches to serve a different set of queues. The capacity loss due to this switching overhead can be significant in many emerging applications, and needs to be explicitly addressed in the design of scheduling policies. For example, in 60GHz wireless networks with directional antennas, base stations need to train and reconfigure their beam patterns whenever they switch from one client to another. Considerable switching overhead can also be observed in many other queueing systems such as transportation networks and manufacturing systems. While the celebrated Max-Weight policy achieves asymptotically optimal average delay for systems without switching overhead, it fails to preserve throughput-optimality, let alone delay-optimality, when switching overhead is taken into account. We propose a class of Biased Max-Weight scheduling policies that explicitly takes switching overhead into account. The Biased Max-Weight policy can use either queue length or head-of-line waiting time as an indicator of the system status. We prove that our policies not only are throughput-optimal, but also can be made arbitrarily close to the asymptotic lower bound on average delay. To validate the performance of the proposed policies, we provide extensive simulation with various system topologies and different traffic patterns. We show that the proposed policies indeed achieve much better delay performance than that of the state-of-the-art policy.
△ Less
Submitted 13 January, 2017;
originally announced January 2017.