Search | arXiv e-print repository

CAVACHON: a hierarchical variational autoencoder to integrate multi-modal single-cell data

Authors: **-Han Hsieh, Ru-Xiu Hsiao, Katalin Ferenc, Anthony Mathelier, Rebekka Burkholz, Chien-Yu Chen, Geir Kjetil Sandve, Tatiana Belova, Marieke Lydia Kuijjer

Abstract: Paired single-cell sequencing technologies enable the simultaneous measurement of complementary modalities of molecular data at single-cell resolution. Along with the advances in these technologies, many methods based on variational autoencoders have been developed to integrate these data. However, these methods do not explicitly incorporate prior biological relationships between the data modaliti… ▽ More Paired single-cell sequencing technologies enable the simultaneous measurement of complementary modalities of molecular data at single-cell resolution. Along with the advances in these technologies, many methods based on variational autoencoders have been developed to integrate these data. However, these methods do not explicitly incorporate prior biological relationships between the data modalities, which could significantly enhance modeling and interpretation. We propose a novel probabilistic learning framework that explicitly incorporates conditional independence relationships between multi-modal data as a directed acyclic graph using a generalized hierarchical variational autoencoder. We demonstrate the versatility of our framework across various applications pertinent to single-cell multi-omics data integration. These include the isolation of common and distinct information from different modalities, modality-specific differential analysis, and integrated cell clustering. We anticipate that the proposed framework can facilitate the construction of highly flexible graphical models that can capture the complexities of biological hypotheses and unravel the connections between different biological data types, such as different modalities of paired single-cell multi-omics data. The implementation of the proposed framework can be found in the repository https://github.com/kuijjerlab/CAVACHON. △ Less

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.16194 [pdf, other]

Diffusion-Reward Adversarial Imitation Learning

Authors: Chun-Mao Lai, Hsiang-Chun Wang, **-Chun Hsieh, Yu-Chiang Frank Wang, Min-Hung Chen, Shao-Hua Sun

Abstract: Imitation learning aims to learn a policy from observing expert demonstrations without access to reward signals from environments. Generative adversarial imitation learning (GAIL) formulates imitation learning as adversarial learning, employing a generator policy learning to imitate expert behaviors and discriminator learning to distinguish the expert demonstrations from agent trajectories. Despit… ▽ More Imitation learning aims to learn a policy from observing expert demonstrations without access to reward signals from environments. Generative adversarial imitation learning (GAIL) formulates imitation learning as adversarial learning, employing a generator policy learning to imitate expert behaviors and discriminator learning to distinguish the expert demonstrations from agent trajectories. Despite its encouraging results, GAIL training is often brittle and unstable. Inspired by the recent dominance of diffusion models in generative modeling, this work proposes Diffusion-Reward Adversarial Imitation Learning (DRAIL), which integrates a diffusion model into GAIL, aiming to yield more precise and smoother rewards for policy learning. Specifically, we propose a diffusion discriminative classifier to construct an enhanced discriminator; then, we design diffusion rewards based on the classifier's output for policy learning. We conduct extensive experiments in navigation, manipulation, and locomotion, verifying DRAIL's effectiveness compared to prior imitation learning methods. Moreover, additional experimental results demonstrate the generalizability and data efficiency of DRAIL. Visualized learned reward functions of GAIL and DRAIL suggest that DRAIL can produce more precise and smoother rewards. △ Less

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2403.18270 [pdf, other]

Image Deraining via Self-supervised Reinforcement Learning

Authors: He-Hao Liao, Yan-Tsung Peng, Wen-Tao Chu, **-Chun Hsieh, Chung-Chi Tsai

Abstract: The quality of images captured outdoors is often affected by the weather. One factor that interferes with sight is rain, which can obstruct the view of observers and computer vision applications that rely on those images. The work aims to recover rain images by removing rain streaks via Self-supervised Reinforcement Learning (RL) for image deraining (SRL-Derain). We locate rain streak pixels from… ▽ More The quality of images captured outdoors is often affected by the weather. One factor that interferes with sight is rain, which can obstruct the view of observers and computer vision applications that rely on those images. The work aims to recover rain images by removing rain streaks via Self-supervised Reinforcement Learning (RL) for image deraining (SRL-Derain). We locate rain streak pixels from the input rain image via dictionary learning and use pixel-wise RL agents to take multiple inpainting actions to remove rain progressively. To our knowledge, this work is the first attempt where self-supervised RL is applied to image deraining. Experimental results on several benchmark image-deraining datasets show that the proposed SRL-Derain performs favorably against state-of-the-art few-shot and self-supervised deraining and denoising methods. △ Less

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.12406 [pdf, other]

Offline Imitation of Badminton Player Behavior via Experiential Contexts and Brownian Motion

Authors: Kuang-Da Wang, Wei-Yao Wang, **-Chun Hsieh, Wen-Chih Peng

Abstract: In the dynamic and rapid tactic involvements of turn-based sports, badminton stands out as an intrinsic paradigm that requires alter-dependent decision-making of players. While the advancement of learning from offline expert data in sequential decision-making has been witnessed in various domains, how to rally-wise imitate the behaviors of human players from offline badminton matches has remained… ▽ More In the dynamic and rapid tactic involvements of turn-based sports, badminton stands out as an intrinsic paradigm that requires alter-dependent decision-making of players. While the advancement of learning from offline expert data in sequential decision-making has been witnessed in various domains, how to rally-wise imitate the behaviors of human players from offline badminton matches has remained underexplored. Replicating opponents' behavior benefits players by allowing them to undergo strategic development with direction before matches. However, directly applying existing methods suffers from the inherent hierarchy of the match and the compounding effect due to the turn-based nature of players alternatively taking actions. In this paper, we propose RallyNet, a novel hierarchical offline imitation learning model for badminton player behaviors: (i) RallyNet captures players' decision dependencies by modeling decision-making processes as a contextual Markov decision process. (ii) RallyNet leverages the experience to generate context as the agent's intent in the rally. (iii) To generate more realistic behavior, RallyNet leverages Geometric Brownian Motion (GBM) to model the interactions between players by introducing a valuable inductive bias for learning player behaviors. In this manner, RallyNet links player intents with interaction models with GBM, providing an understanding of interactions for sports analytics. We extensively validate RallyNet with the largest available real-world badminton dataset consisting of men's and women's singles, demonstrating its ability to imitate player behaviors. Results reveal RallyNet's superiority over offline imitation learning methods and state-of-the-art turn-based approaches, outperforming them by at least 16% in mean rule-based agent normalization score. Furthermore, we discuss various practical use cases to highlight RallyNet's applicability. △ Less

Submitted 18 March, 2024; originally announced March 2024.

Comments: Preprint

arXiv:2312.12065 [pdf, other]

PPO-Clip Attains Global Optimality: Towards Deeper Understandings of Clip**

Authors: Nai-Chieh Huang, **-Chun Hsieh, Kuo-Hao Ho, I-Chen Wu

Abstract: Proximal Policy Optimization algorithm employing a clipped surrogate objective (PPO-Clip) is a prominent exemplar of the policy optimization methods. However, despite its remarkable empirical success, PPO-Clip lacks theoretical substantiation to date. In this paper, we contribute to the field by establishing the first global convergence results of a PPO-Clip variant in both tabular and neural func… ▽ More Proximal Policy Optimization algorithm employing a clipped surrogate objective (PPO-Clip) is a prominent exemplar of the policy optimization methods. However, despite its remarkable empirical success, PPO-Clip lacks theoretical substantiation to date. In this paper, we contribute to the field by establishing the first global convergence results of a PPO-Clip variant in both tabular and neural function approximation settings. Our findings highlight the $O(1/\sqrt{T})$ min-iterate convergence rate specifically in the context of neural function approximation. We tackle the inherent challenges in analyzing PPO-Clip through three central concepts: (i) We introduce a generalized version of the PPO-Clip objective, illuminated by its connection with the hinge loss. (ii) Employing entropic mirror descent, we establish asymptotic convergence for tabular PPO-Clip with direct policy parameterization. (iii) Inspired by the tabular analysis, we streamline convergence analysis by introducing a two-step policy improvement approach. This decouples policy search from complex neural policy parameterization using a regression-based update scheme. Furthermore, we gain deeper insights into the efficacy of PPO-Clip by interpreting these generalized objectives. Our theoretical findings also mark the first characterization of the influence of the clip** mechanism on PPO-Clip convergence. Importantly, the clip** range affects only the pre-constant of the convergence rate. △ Less

Submitted 19 February, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

arXiv:2310.11897 [pdf, other]

Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning

Authors: Yen-Ju Chen, Nai-Chieh Huang, Ching-Pei Lee, **-Chun Hsieh

Abstract: Various acceleration approaches for Policy Gradient (PG) have been analyzed within the realm of Reinforcement Learning (RL). However, the theoretical understanding of the widely used momentum-based acceleration method on PG remains largely open. In response to this gap, we adapt the celebrated Nesterov's accelerated gradient (NAG) method to policy optimization in RL, termed \textit{Accelerated Pol… ▽ More Various acceleration approaches for Policy Gradient (PG) have been analyzed within the realm of Reinforcement Learning (RL). However, the theoretical understanding of the widely used momentum-based acceleration method on PG remains largely open. In response to this gap, we adapt the celebrated Nesterov's accelerated gradient (NAG) method to policy optimization in RL, termed \textit{Accelerated Policy Gradient} (APG). To demonstrate the potential of APG in achieving fast convergence, we formally prove that with the true gradient and under the softmax policy parametrization, APG converges to an optimal policy at rates: (i) $\tilde{O}(1/t^2)$ with constant step sizes; (ii) $O(e^{-ct})$ with exponentially-growing step sizes. To the best of our knowledge, this is the first characterization of the convergence rates of NAG in the context of RL. Notably, our analysis relies on one interesting finding: Regardless of the parameter initialization, APG ends up entering a locally nearly-concave regime, where APG can significantly benefit from the momentum, within finite iterations. Through numerical validation and experiments on the Atari 2600 benchmarks, we confirm that APG exhibits a $\tilde{O}(1/t^2)$ rate with constant step sizes and a linear convergence rate with exponentially-growing step sizes, significantly improving convergence over the standard PG. △ Less

Submitted 6 June, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

Comments: 69 pages, 17 figures

arXiv:2310.11515 [pdf, ps, other]

Value-Biased Maximum Likelihood Estimation for Model-based Reinforcement Learning in Discounted Linear MDPs

Authors: Yu-Heng Hung, **-Chun Hsieh, Akshay Mete, P. R. Kumar

Abstract: We consider the infinite-horizon linear Markov Decision Processes (MDPs), where the transition probabilities of the dynamic model can be linearly parameterized with the help of a predefined low-dimensional feature map**. While the existing regression-based approaches have been theoretically shown to achieve nearly-optimal regret, they are computationally rather inefficient due to the need for a… ▽ More We consider the infinite-horizon linear Markov Decision Processes (MDPs), where the transition probabilities of the dynamic model can be linearly parameterized with the help of a predefined low-dimensional feature map**. While the existing regression-based approaches have been theoretically shown to achieve nearly-optimal regret, they are computationally rather inefficient due to the need for a large number of optimization runs in each time step, especially when the state and action spaces are large. To address this issue, we propose to solve linear MDPs through the lens of Value-Biased Maximum Likelihood Estimation (VBMLE), which is a classic model-based exploration principle in the adaptive control literature for resolving the well-known closed-loop identification problem of Maximum Likelihood Estimation. We formally show that (i) VBMLE enjoys $\widetilde{O}(d\sqrt{T})$ regret, where $T$ is the time horizon and $d$ is the dimension of the model parameter, and (ii) VBMLE is computationally more efficient as it only requires solving one optimization problem in each time step. In our regret analysis, we offer a generic convergence result of MLE in linear MDPs through a novel supermartingale construct and uncover an interesting connection between linear MDPs and online learning, which could be of independent interest. Finally, the simulation results show that VBMLE significantly outperforms the benchmark method in terms of both empirical regret and computation time. △ Less

Submitted 17 October, 2023; originally announced October 2023.

arXiv:2309.15484 [pdf, other]

Towards Human-Like RL: Taming Non-Naturalistic Behavior in Deep RL via Adaptive Behavioral Costs in 3D Games

Authors: Kuo-Hao Ho, **-Chun Hsieh, Chiu-Chou Lin, You-Ren Luo, Feng-Jian Wang, I-Chen Wu

Abstract: In this paper, we propose a new approach called Adaptive Behavioral Costs in Reinforcement Learning (ABC-RL) for training a human-like agent with competitive strength. While deep reinforcement learning agents have recently achieved superhuman performance in various video games, some of these unconstrained agents may exhibit actions, such as shaking and spinning, that are not typically observed in… ▽ More In this paper, we propose a new approach called Adaptive Behavioral Costs in Reinforcement Learning (ABC-RL) for training a human-like agent with competitive strength. While deep reinforcement learning agents have recently achieved superhuman performance in various video games, some of these unconstrained agents may exhibit actions, such as shaking and spinning, that are not typically observed in human behavior, resulting in peculiar gameplay experiences. To behave like humans and retain similar performance, ABC-RL augments behavioral limitations as cost signals in reinforcement learning with dynamically adjusted weights. Unlike traditional constrained policy optimization, we propose a new formulation that minimizes the behavioral costs subject to a constraint of the value function. By leveraging the augmented Lagrangian, our approach is an approximation of the Lagrangian adjustment, which handles the trade-off between the performance and the human-like behavior. Through experiments conducted on 3D games in DMLab-30 and Unity ML-Agents Toolkit, we demonstrate that ABC-RL achieves the same performance level while significantly reducing instances of shaking and spinning. These findings underscore the effectiveness of our proposed approach in promoting more natural and human-like behavior during gameplay. △ Less

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2212.05237 [pdf, other]

Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees

Authors: Hsin-En Su, Yen-Ju Chen, **-Chun Hsieh, Xi Liu

Abstract: We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly-used approach is to leverage the off-policy policy gradient to optimize a surrogate objective -- the total discounted in expectation return of the target policy with respect to the state distribution of the behavior policy. However, this approach has been shown to suffer from the d… ▽ More We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly-used approach is to leverage the off-policy policy gradient to optimize a surrogate objective -- the total discounted in expectation return of the target policy with respect to the state distribution of the behavior policy. However, this approach has been shown to suffer from the distribution mismatch issue, and therefore significant efforts are needed for correcting this mismatch either via state distribution correction or a counterfactual method. In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient. This design obviates the need for distribution correction or importance sampling in the policy improvement step of off-policy policy gradient. We establish the global convergence of CAPO with general coordinate selection and then further quantify the convergence rates of several instances of CAPO with popular coordinate selection rules, including the cyclic and the randomized variants of CAPO. We then extend CAPO to neural policies for a more practical implementation. Through experiments, we demonstrate that CAPO provides a competitive approach to RL in practice. △ Less

Submitted 10 December, 2022; originally announced December 2022.

Comments: 47 pages, 4 figures

arXiv:2212.03117 [pdf, ps, other]

Q-Pensieve: Boosting Sample Efficiency of Multi-Objective RL Through Memory Sharing of Q-Snapshots

Authors: Wei Hung, Bo-Kai Huang, **-Chun Hsieh, Xi Liu

Abstract: Many real-world continuous control problems are in the dilemma of weighing the pros and cons, multi-objective reinforcement learning (MORL) serves as a generic framework of learning control policies for different preferences over objectives. However, the existing MORL methods either rely on multiple passes of explicit search for finding the Pareto front and therefore are not sample-efficient, or u… ▽ More Many real-world continuous control problems are in the dilemma of weighing the pros and cons, multi-objective reinforcement learning (MORL) serves as a generic framework of learning control policies for different preferences over objectives. However, the existing MORL methods either rely on multiple passes of explicit search for finding the Pareto front and therefore are not sample-efficient, or utilizes a shared policy network for coarse knowledge sharing among policies. To boost the sample efficiency of MORL, we propose Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots to jointly determine the policy update direction and thereby enables data sharing at the policy level. We show that Q-Pensieve can be naturally integrated with soft policy iteration with convergence guarantee. To substantiate this concept, we propose the technique of Q replay buffer, which stores the learned Q-networks from the past iterations, and arrive at a practical actor-critic implementation. Through extensive experiments and an ablation study, we demonstrate that with much fewer samples, the proposed algorithm can outperform the benchmark MORL methods on a variety of MORL benchmark tasks. △ Less

Submitted 6 December, 2022; originally announced December 2022.

Comments: 17 pages, 15 figures

arXiv:2209.13210 [pdf, other]

Neural Frank-Wolfe Policy Optimization for Region-of-Interest Intra-Frame Coding with HEVC/H.265

Authors: Yung-Han Ho, Chia-Hao Kao, Wen-Hsiao Peng, **-Chun Hsieh

Abstract: This paper presents a reinforcement learning (RL) framework that utilizes Frank-Wolfe policy optimization to solve Coding-Tree-Unit (CTU) bit allocation for Region-of-Interest (ROI) intra-frame coding. Most previous RL-based methods employ the single-critic design, where the rewards for distortion minimization and rate regularization are weighted by an empirically chosen hyper-parameter. Recently,… ▽ More This paper presents a reinforcement learning (RL) framework that utilizes Frank-Wolfe policy optimization to solve Coding-Tree-Unit (CTU) bit allocation for Region-of-Interest (ROI) intra-frame coding. Most previous RL-based methods employ the single-critic design, where the rewards for distortion minimization and rate regularization are weighted by an empirically chosen hyper-parameter. Recently, the dual-critic design is proposed to update the actor by alternating the rate and distortion critics. However, its convergence is not guaranteed. To address these issues, we introduce Neural Frank-Wolfe Policy Optimization (NFWPO) in formulating the CTU-level bit allocation as an action-constrained RL problem. In this new framework, we exploit a rate critic to predict a feasible set of actions. With this feasible set, a distortion critic is invoked to update the actor to maximize the ROI-weighted image quality subject to a rate constraint. Experimental results produced with x265 confirm the superiority of the proposed method to the other baselines. △ Less

Submitted 27 September, 2022; originally announced September 2022.

Comments: Accepted by VCIP 2022. arXiv admin note: text overlap with arXiv:2203.05127

arXiv:2203.04192 [pdf, ps, other]

Reward-Biased Maximum Likelihood Estimation for Neural Contextual Bandits

Authors: Yu-Heng Hung, **-Chun Hsieh

Abstract: Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling explore-exploit trade-offs. This paper studies the stochastic contextual bandit problem with general bounded reward functions and proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration. NeuralRBMLE leverages th… ▽ More Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling explore-exploit trade-offs. This paper studies the stochastic contextual bandit problem with general bounded reward functions and proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration. NeuralRBMLE leverages the representation power of neural networks and directly encodes exploratory behavior in the parameter space, without constructing confidence intervals of the estimated rewards. We propose two variants of NeuralRBMLE algorithms: The first variant directly obtains the RBMLE estimator by gradient ascent, and the second variant simplifies RBMLE to a simple index policy through an approximation. We show that both algorithms achieve $\widetilde{\mathcal{O}}(\sqrt{T})$ regret. Through extensive experiments, we demonstrate that the NeuralRBMLE algorithms achieve comparable or better empirical regrets than the state-of-the-art methods on real-world datasets with non-linear reward functions. △ Less

Submitted 29 May, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

arXiv:2110.13799 [pdf, other]

Neural PPO-Clip Attains Global Optimality: A Hinge Loss Perspective

Authors: Nai-Chieh Huang, **-Chun Hsieh, Kuo-Hao Ho, Hsuan-Yu Yao, Kai-Chun Hu, Liang-Chun Ouyang, I-Chen Wu

Abstract: Policy optimization is a fundamental principle for designing reinforcement learning algorithms, and one example is the proximal policy optimization algorithm with a clipped surrogate objective (PPO-Clip), which has been popularly used in deep reinforcement learning due to its simplicity and effectiveness. Despite its superior empirical performance, PPO-Clip has not been justified via theoretical p… ▽ More Policy optimization is a fundamental principle for designing reinforcement learning algorithms, and one example is the proximal policy optimization algorithm with a clipped surrogate objective (PPO-Clip), which has been popularly used in deep reinforcement learning due to its simplicity and effectiveness. Despite its superior empirical performance, PPO-Clip has not been justified via theoretical proof up to date. In this paper, we establish the first global convergence rate of PPO-Clip under neural function approximation. We identify the fundamental challenges of analyzing PPO-Clip and address them with the two core ideas: (i) We reinterpret PPO-Clip from the perspective of hinge loss, which connects policy improvement with solving a large-margin classification problem with hinge loss and offers a generalized version of the PPO-Clip objective. (ii) Based on the above viewpoint, we propose a two-step policy improvement scheme, which facilitates the convergence analysis by decoupling policy search from the complex neural policy parameterization with the help of entropic mirror descent and a regression-based policy update scheme. Moreover, our theoretical results provide the first characterization of the effect of the clip** mechanism on the convergence of PPO-Clip. Through experiments, we empirically validate the reinterpretation of PPO-Clip and the generalized objective with various classifiers on various RL benchmark tasks. △ Less

Submitted 31 August, 2022; v1 submitted 26 October, 2021; originally announced October 2021.

Comments: 33 pages, 1 figure

arXiv:2110.02128 [pdf, other]

NeurWIN: Neural Whittle Index Network For Restless Bandits Via Deep RL

Authors: Khaled Nakhleh, Santosh Ganji, **-Chun Hsieh, I-Hong Hou, Srinivas Shakkottai

Abstract: Whittle index policy is a powerful tool to obtain asymptotically optimal solutions for the notoriously intractable problem of restless bandits. However, finding the Whittle indices remains a difficult problem for many practical restless bandits with convoluted transition kernels. This paper proposes NeurWIN, a neural Whittle index network that seeks to learn the Whittle indices for any restless ba… ▽ More Whittle index policy is a powerful tool to obtain asymptotically optimal solutions for the notoriously intractable problem of restless bandits. However, finding the Whittle indices remains a difficult problem for many practical restless bandits with convoluted transition kernels. This paper proposes NeurWIN, a neural Whittle index network that seeks to learn the Whittle indices for any restless bandits by leveraging mathematical properties of the Whittle indices. We show that a neural network that produces the Whittle index is also one that produces the optimal control for a set of Markov decision problems. This property motivates using deep reinforcement learning for the training of NeurWIN. We demonstrate the utility of NeurWIN by evaluating its performance for three recently studied restless bandit problems. Our experiment results show that the performance of NeurWIN is significantly better than other RL algorithms. △ Less

Submitted 19 January, 2022; v1 submitted 5 October, 2021; originally announced October 2021.

Comments: Accepted for publication in NeurIPS 2021

arXiv:2106.04335 [pdf, other]

Reinforced Few-Shot Acquisition Function Learning for Bayesian Optimization

Authors: Bing-**g Hsieh, **-Chun Hsieh, Xi Liu

Abstract: Bayesian optimization (BO) conventionally relies on handcrafted acquisition functions (AFs) to sequentially determine the sample points. However, it has been widely observed in practice that the best-performing AF in terms of regret can vary significantly under different types of black-box functions. It has remained a challenge to design one AF that can attain the best performance over a wide vari… ▽ More Bayesian optimization (BO) conventionally relies on handcrafted acquisition functions (AFs) to sequentially determine the sample points. However, it has been widely observed in practice that the best-performing AF in terms of regret can vary significantly under different types of black-box functions. It has remained a challenge to design one AF that can attain the best performance over a wide variety of black-box functions. This paper aims to attack this challenge through the perspective of reinforced few-shot AF learning (FSAF). Specifically, we first connect the notion of AFs with Q-functions and view a deep Q-network (DQN) as a surrogate differentiable AF. While it serves as a natural idea to combine DQN and an existing few-shot learning method, we identify that such a direct combination does not perform well due to severe overfitting, which is particularly critical in BO due to the need of a versatile sampling policy. To address this, we present a Bayesian variant of DQN with the following three features: (i) It learns a distribution of Q-networks as AFs based on the Kullback-Leibler regularization framework. This inherently provides the uncertainty required in sampling for BO and mitigates overfitting. (ii) For the prior of the Bayesian DQN, we propose to use a demo policy induced by an off-the-shelf AF for better training stability. (iii) On the meta-level, we leverage the meta-loss of Bayesian model-agnostic meta-learning, which serves as a natural companion to the proposed FSAF. Moreover, with the proper design of the Q-networks, FSAF is general-purpose in that it is agnostic to the dimension and the cardinality of the input domain. Through extensive experiments, we demonstrate that the FSAF achieves comparable or better regrets than the state-of-the-art benchmarks on a wide variety of synthetic and real-world test functions. △ Less

Submitted 8 June, 2021; originally announced June 2021.

Comments: 21 pages, 8 figures

arXiv:2102.11055 [pdf, ps, other]

Esca** from Zero Gradient: Revisiting Action-Constrained Reinforcement Learning via Frank-Wolfe Policy Optimization

Authors: Jyun-Li Lin, Wei Hung, Shang-Hsuan Yang, **-Chun Hsieh, Xi Liu

Abstract: Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications, such as scheduling in networked systems with resource constraints and control of a robot with kinematic constraints. While the existing projection-based approaches ensure zero constraint violation, they could suffer from the zero-gradient problem due to the tight coupling of the policy grad… ▽ More Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications, such as scheduling in networked systems with resource constraints and control of a robot with kinematic constraints. While the existing projection-based approaches ensure zero constraint violation, they could suffer from the zero-gradient problem due to the tight coupling of the policy gradient and the projection, which results in sample-inefficient training and slow convergence. To tackle this issue, we propose a learning algorithm that decouples the action constraints from the policy parameter update by leveraging state-wise Frank-Wolfe and a regression-based policy update scheme. Moreover, we show that the proposed algorithm enjoys convergence and policy improvement properties in the tabular case as well as generalizes the popular DDPG algorithm for action-constrained RL in the general case. Through experiments, we demonstrate that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks. △ Less

Submitted 2 August, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

Comments: Published in UAI 2021

arXiv:2010.04091 [pdf, ps, other]

Reward-Biased Maximum Likelihood Estimation for Linear Stochastic Bandits

Authors: Yu-Heng Hung, **-Chun Hsieh, Xi Liu, P. R. Kumar

Abstract: Modifying the reward-biased maximum likelihood method originally proposed in the adaptive control literature, we propose novel learning algorithms to handle the explore-exploit trade-off in linear bandits problems as well as generalized linear bandits problems. We develop novel index policies that we prove achieve order-optimality, and show that they achieve empirical performance competitive with… ▽ More Modifying the reward-biased maximum likelihood method originally proposed in the adaptive control literature, we propose novel learning algorithms to handle the explore-exploit trade-off in linear bandits problems as well as generalized linear bandits problems. We develop novel index policies that we prove achieve order-optimality, and show that they achieve empirical performance competitive with the state-of-the-art benchmark methods in extensive experiments. The new policies achieve this with low computation time per pull for linear bandits, and thereby resulting in both favorable regret as well as computational efficiency. △ Less

Submitted 8 October, 2020; originally announced October 2020.

arXiv:2001.09595 [pdf, other]

Develo** Multi-Task Recommendations with Long-Term Rewards via Policy Distilled Reinforcement Learning

Authors: Xi Liu, Li Li, **-Chun Hsieh, Muhe Xie, Yong Ge, Rui Chen

Abstract: With the explosive growth of online products and content, recommendation techniques have been considered as an effective tool to overcome information overload, improve user experience, and boost business revenue. In recent years, we have observed a new desideratum of considering long-term rewards of multiple related recommendation tasks simultaneously. The consideration of long-term rewards is str… ▽ More With the explosive growth of online products and content, recommendation techniques have been considered as an effective tool to overcome information overload, improve user experience, and boost business revenue. In recent years, we have observed a new desideratum of considering long-term rewards of multiple related recommendation tasks simultaneously. The consideration of long-term rewards is strongly tied to business revenue and growth. Learning multiple tasks simultaneously could generally improve the performance of individual task due to knowledge sharing in multi-task learning. While a few existing works have studied long-term rewards in recommendations, they mainly focus on a single recommendation task. In this paper, we propose {\it PoDiRe}: a \underline{po}licy \underline{di}stilled \underline{re}commender that can address long-term rewards of recommendations and simultaneously handle multiple recommendation tasks. This novel recommendation solution is based on a marriage of deep reinforcement learning and knowledge distillation techniques, which is able to establish knowledge sharing among different tasks and reduce the size of a learning model. The resulting model is expected to attain better performance and lower response latency for real-time recommendation services. In collaboration with Samsung Game Launcher, one of the world's largest commercial mobile game platforms, we conduct a comprehensive experimental study on large-scale real data with hundreds of millions of events and show that our solution outperforms many state-of-the-art methods in terms of several standard evaluation metrics. △ Less

Submitted 27 January, 2020; originally announced January 2020.

arXiv:1911.00902 [pdf, other]

Fresher Content or Smoother Playback? A Brownian-Approximation Framework for Scheduling Real-Time Wireless Video Streams

Authors: **-Chun Hsieh, Xi Liu, I-Hong Hou

Abstract: This paper presents a Brownian-approximation framework to optimize the quality of experience (QoE) for real-time video streaming in wireless networks. In real-time video streaming, one major challenge is to tackle the natural tension between the two most critical QoE metrics: playback latency and video interruption. To study this trade-off, we first propose an analytical model that precisely captu… ▽ More This paper presents a Brownian-approximation framework to optimize the quality of experience (QoE) for real-time video streaming in wireless networks. In real-time video streaming, one major challenge is to tackle the natural tension between the two most critical QoE metrics: playback latency and video interruption. To study this trade-off, we first propose an analytical model that precisely captures all aspects of the playback process of a real-time video stream, including playback latency, video interruptions, and packet drop**. Built on this model, we show that the playback process of a real-time video can be approximated by a two-sided reflected Brownian motion. Through such Brownian approximation, we are able to study the fundamental limits of the two QoE metrics and characterize a necessary and sufficient condition for a set of QoE performance requirements to be feasible. We propose a scheduling policy that satisfies any feasible set of QoE performance requirements and then obtain simple rules on the trade-off between playback latency and the video interrupt rates, in both heavy-traffic and under-loaded regimes. Finally, simulation results verify the accuracy of the proposed approximation and show that the proposed policy outperforms other popular baseline policies. △ Less

Submitted 23 October, 2020; v1 submitted 3 November, 2019; originally announced November 2019.

Comments: MobiHoc 2020

arXiv:1907.01287 [pdf, ps, other]

Exploration Through Reward Biasing: Reward-Biased Maximum Likelihood Estimation for Stochastic Multi-Armed Bandits

Authors: Xi Liu, **-Chun Hsieh, Anirban Bhattacharya, P. R. Kumar

Abstract: Inspired by the Reward-Biased Maximum Likelihood Estimate method of adaptive control, we propose RBMLE -- a novel family of learning algorithms for stochastic multi-armed bandits (SMABs). For a broad range of SMABs including both the parametric Exponential Family as well as the non-parametric sub-Gaussian/Exponential family, we show that RBMLE yields an index policy. To choose the bias-growth rate… ▽ More Inspired by the Reward-Biased Maximum Likelihood Estimate method of adaptive control, we propose RBMLE -- a novel family of learning algorithms for stochastic multi-armed bandits (SMABs). For a broad range of SMABs including both the parametric Exponential Family as well as the non-parametric sub-Gaussian/Exponential family, we show that RBMLE yields an index policy. To choose the bias-growth rate $α(t)$ in RBMLE, we reveal the nontrivial interplay between $α(t)$ and the regret bound that generally applies in both the Exponential Family as well as the sub-Gaussian/Exponential family bandits. To quantify the finite-time performance, we prove that RBMLE attains order-optimality by adaptively estimating the unknown constants in the expression of $α(t)$ for Gaussian and sub-Gaussian bandits. Extensive experiments demonstrate that the proposed RBMLE achieves empirical regret performance competitive with the state-of-the-art methods, while being more computationally efficient and scalable in comparison to the best-performing ones among them. △ Less

Submitted 23 October, 2020; v1 submitted 2 July, 2019; originally announced July 2019.

Comments: ICML 2020

arXiv:1811.05932 [pdf, ps, other]

Streaming Network Embedding through Local Actions

Authors: Xi Liu, **-Chun Hsieh, Nick Duffield, Rui Chen, Muhe Xie, Xidao Wen

Abstract: Recently, considerable research attention has been paid to network embedding, a popular approach to construct feature vectors of vertices. Due to the curse of dimensionality and sparsity in graphical datasets, this approach has become indispensable for machine learning tasks over large networks. The majority of existing literature has considered this technique under the assumption that the network… ▽ More Recently, considerable research attention has been paid to network embedding, a popular approach to construct feature vectors of vertices. Due to the curse of dimensionality and sparsity in graphical datasets, this approach has become indispensable for machine learning tasks over large networks. The majority of existing literature has considered this technique under the assumption that the network is static. However, networks in many applications, nodes and edges accrue to a growing network as a streaming. A small number of very recent results have addressed the problem of embedding for dynamic networks. However, they either rely on knowledge of vertex attributes, suffer high-time complexity or need to be re-trained without closed-form expression. Thus the approach of adapting the existing methods to the streaming environment faces non-trivial technical challenges. These challenges motivate develo** new approaches to the problems of streaming network embedding. In this paper, We propose a new framework that is able to generate latent features for new vertices with high efficiency and low complexity under specified iteration rounds. We formulate a constrained optimization problem for the modification of the representation resulting from a stream arrival. We show this problem has no closed-form solution and instead develop an online approximation solution. Our solution follows three steps: (1) identify vertices affected by new vertices, (2) generate latent features for new vertices, and (3) update the latent features of the most affected vertices. The generated representations are provably feasible and not far from the optimal ones in terms of expectation. Multi-class classification and clustering on five real-world networks demonstrate that our model can efficiently update vertex representations and simultaneously achieve comparable or even better performance. △ Less

Submitted 14 November, 2018; originally announced November 2018.

arXiv:1810.12418 [pdf, ps, other]

Stay With Me: Lifetime Maximization Through Heteroscedastic Linear Bandits With Reneging

Authors: **-Chun Hsieh, Xi Liu, Anirban Bhattacharya, P. R. Kumar

Abstract: Sequential decision making for lifetime maximization is a critical problem in many real-world applications, such as medical treatment and portfolio selection. In these applications, a "reneging" phenomenon, where participants may disengage from future interactions after observing an unsatisfiable outcome, is rather prevalent. To address the above issue, this paper proposes a model of heteroscedast… ▽ More Sequential decision making for lifetime maximization is a critical problem in many real-world applications, such as medical treatment and portfolio selection. In these applications, a "reneging" phenomenon, where participants may disengage from future interactions after observing an unsatisfiable outcome, is rather prevalent. To address the above issue, this paper proposes a model of heteroscedastic linear bandits with reneging, which allows each participant to have a distinct "satisfaction level," with any interaction outcome falling short of that level resulting in that participant reneging. Moreover, it allows the variance of the outcome to be context-dependent. Based on this model, we develop a UCB-type policy, namely HR-UCB, and prove that it achieves $\mathcal{O}\big(\sqrt{{T}(\log({T}))^{3}}\big)$ regret. Finally, we validate the performance of HR-UCB via simulations. △ Less

Submitted 15 May, 2019; v1 submitted 29 October, 2018; originally announced October 2018.

Comments: To appear in ICML 2019. More rounds of experiments are performed before being taken average of compared to versions before

arXiv:1701.03991 [pdf, other]

Throughput-Optimal Scheduling for Multi-Hop Networked Transportation Systems With Switch-Over Delay

Authors: **-Chun Hsieh, Xi Liu, Jian Jiao, I-Hong Hou, Yunlong Zhang, P. R. Kumar

Abstract: The emerging connected-vehicle technology provides a new dimension in develo** more intelligent traffic control algorithms for signalized intersections in networked transportation systems. An important challenge for the scheduling problem in networked transportation systems is the switch-over delay caused by the guard time before any traffic signal change. The switch-over delay can result in sig… ▽ More The emerging connected-vehicle technology provides a new dimension in develo** more intelligent traffic control algorithms for signalized intersections in networked transportation systems. An important challenge for the scheduling problem in networked transportation systems is the switch-over delay caused by the guard time before any traffic signal change. The switch-over delay can result in significant loss of system capacity and hence needs to be accommodated in the scheduling design. To tackle this challenge, we propose a distributed online scheduling policy that extends the well-known Max-Pressure policy to address switch-over delay by introducing a bias factor toward the current schedule. We prove that the proposed policy is throughput-optimal with switch-over delay. Furthermore, the proposed policy remains optimal when there are both connected signalized intersections and conventional fixed-time ones in the system. With connected-vehicle technology, the proposed policy can be easily incorporated into the current transportation systems without additional infrastructure. Through extensive simulation in VISSIM, we show that our policy indeed outperforms the existing popular policies. △ Less

Submitted 18 January, 2017; v1 submitted 14 January, 2017; originally announced January 2017.

Comments: 16 pages, 6 figures

arXiv:1701.03831 [pdf, ps, other]

Delay-Optimal Scheduling for Queueing Systems with Switching Overhead

Authors: **-Chun Hsieh, I-Hong Hou, Xi Liu

Abstract: We study the scheduling polices for asymptotically optimal delay in queueing systems with switching overhead. Such systems consist of a single server that serves multiple queues, and some capacity is lost whenever the server switches to serve a different set of queues. The capacity loss due to this switching overhead can be significant in many emerging applications, and needs to be explicitly addr… ▽ More We study the scheduling polices for asymptotically optimal delay in queueing systems with switching overhead. Such systems consist of a single server that serves multiple queues, and some capacity is lost whenever the server switches to serve a different set of queues. The capacity loss due to this switching overhead can be significant in many emerging applications, and needs to be explicitly addressed in the design of scheduling policies. For example, in 60GHz wireless networks with directional antennas, base stations need to train and reconfigure their beam patterns whenever they switch from one client to another. Considerable switching overhead can also be observed in many other queueing systems such as transportation networks and manufacturing systems. While the celebrated Max-Weight policy achieves asymptotically optimal average delay for systems without switching overhead, it fails to preserve throughput-optimality, let alone delay-optimality, when switching overhead is taken into account. We propose a class of Biased Max-Weight scheduling policies that explicitly takes switching overhead into account. The Biased Max-Weight policy can use either queue length or head-of-line waiting time as an indicator of the system status. We prove that our policies not only are throughput-optimal, but also can be made arbitrarily close to the asymptotic lower bound on average delay. To validate the performance of the proposed policies, we provide extensive simulation with various system topologies and different traffic patterns. We show that the proposed policies indeed achieve much better delay performance than that of the state-of-the-art policy. △ Less

Submitted 13 January, 2017; originally announced January 2017.

Comments: 37 pages

MSC Class: 60K25; 68M20; 90B22

Showing 1–24 of 24 results for author: Hsieh, P