Showing 1–48 of 48 results for author: Jiang, N

Searching in archive stat.
  1. arXiv:2405.07863  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    RLHF Workflow: From Reward Modeling to Online RLHF

    Authors: Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang

    Abstract: We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill i…

    Submitted 12 June, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

  2. arXiv:2404.09946  [pdf, other]

    cs.LG cs.AI stat.ML

    A Note on Loss Functions and Error Compounding in Model-based Reinforcement Learning

    Authors: Nan Jiang

    Abstract: This note clarifies some confusions (and perhaps throws out more) around model-based reinforcement learning and their theoretical understanding in the context of deep RL. Main topics of discussion are (1) how to reconcile model-based RL's bad empirical reputation on error compounding with its superior theoretical properties, and (2) the limitations of empirically popular losses. For the latter, co…

    Submitted 15 April, 2024; originally announced April 2024.

  3. arXiv:2402.14703  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    On the Curses of Future and History in Future-dependent Value Functions for Off-policy Evaluation

    Authors: Yuheng Zhang, Nan Jiang

    Abstract: We study off-policy evaluation (OPE) in partially observable environments with complex observations, with the goal of developing estimators whose guarantee avoids exponential dependence on the horizon. While such estimators exist for MDPs and POMDPs can be converted to history-based MDPs, their estimation errors depend on the state-density ratio for MDPs which becomes history ratios after conversi…

    Submitted 22 February, 2024; originally announced February 2024.

  4. arXiv:2402.07314  [pdf, other]

    cs.LG stat.ML

    Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

    Authors: Chenlu Ye, Wei Xiong, Yuheng Zhang, Nan Jiang, Tong Zhang

    Abstract: We study Reinforcement Learning from Human Feedback (RLHF) under a general preference oracle. In particular, we do not assume that there exists a reward function and the preference signal is drawn from the Bradley-Terry model as most of the prior works do. We consider a standard mathematical formulation, the reverse-KL regularized minimax game between two LLMs for RLHF under general preference ora…

    Submitted 25 April, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

    Comments: RLHF, Preference Learning, Alignment for LLMs

  5. arXiv:2401.09681  [pdf, other]

    cs.LG stat.ML

    Harnessing Density Ratios for Online Reinforcement Learning

    Authors: Philip Amortila, Dylan J. Foster, Nan Jiang, Ayush Sekhari, Tengyang Xie

    Abstract: The theories of offline and online reinforcement learning, despite having evolved in parallel, have begun to show signs of the possibility for a unification, with algorithms and analysis techniques for one setting often having natural counterparts in the other. However, the notion of density ratio modeling, an emerging paradigm in offline RL, has been largely absent from online RL, perhaps for goo…

    Submitted 4 June, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

    Comments: ICLR 2024

  6. arXiv:2312.11456  [pdf, other]

    cs.LG cs.AI stat.ML

    Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

    Authors: Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang

    Abstract: This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in strategical exploration of the environment. Then, to understand the mathematical principle of RLHF, we consider a standard mathematical formulation, the reverse-KL re…

    Submitted 1 May, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: 53 pages; theoretical study and algorithmic design of iterative RLHF and DPO

  7. arXiv:2307.13332  [pdf, other]

    cs.LG cs.AI stat.ML

    The Optimal Approximation Factors in Misspecified Off-Policy Value Function Estimation

    Authors: Philip Amortila, Nan Jiang, Csaba Szepesvári

    Abstract: Theoretical guarantees in reinforcement learning (RL) are known to suffer multiplicative blow-up factors with respect to the misspecification error of function approximation. Yet, the nature of such \emph{approximation factors} -- especially their optimal form in a given learning problem -- is poorly understood. In this paper we study this question in linear off-policy value function estimation, w…

    Submitted 14 December, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

    Comments: Accepted to ICML 2023. The arXiv version contains improved results

  8. arXiv:2302.02571  [pdf, other]

    cs.LG cs.AI cs.MA stat.ML

    Offline Learning in Markov Games with General Function Approximation

    Authors: Yuheng Zhang, Yu Bai, Nan Jiang

    Abstract: We study offline multi-agent reinforcement learning (RL) in Markov games, where the goal is to learn an approximate equilibrium -- such as Nash equilibrium and (Coarse) Correlated Equilibrium -- from an offline dataset pre-collected from the game. Existing works consider relatively restricted tabular or linear models and handle each equilibria separately. In this work, we provide the first framewo…

    Submitted 6 February, 2023; originally announced February 2023.

  9. arXiv:2302.02252  [pdf, other]

    cs.LG cs.AI stat.ML

    Reinforcement Learning in Low-Rank MDPs with Density Features

    Authors: Audrey Huang, Jinglin Chen, Nan Jiang

    Abstract: MDPs with low-rank transitions -- that is, the transition matrix can be factored into the product of two matrices, left and right -- is a highly representative structure that enables tractable learning. The left matrix enables expressive function approximation for value-based learning and has been studied extensively. In this work, we instead investigate sample-efficient learning with density feat…

    Submitted 4 February, 2023; originally announced February 2023.

  10. arXiv:2210.15543  [pdf, other]

    cs.LG cs.AI stat.ML

    Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions

    Authors: Audrey Huang, Nan Jiang

    Abstract: Off-policy evaluation often refers to two related tasks: estimating the expected return of a policy and estimating its value function (or other functions of interest, such as density ratios). While recent works on marginalized importance sampling (MIS) show that the former can enjoy provable guarantees under realizable function approximation, the latter is only known to be feasible under much stro…

    Submitted 27 October, 2022; originally announced October 2022.

  11. arXiv:2210.04157  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    The Role of Coverage in Online Reinforcement Learning

    Authors: Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, Sham M. Kakade

    Abstract: Coverage conditions -- which assert that the data logging distribution adequately covers the state space -- play a fundamental role in determining the sample complexity of offline reinforcement learning. While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish a new connection by showing -- somewhat surprisingly -- that the mere existence of a data…

    Submitted 8 October, 2022; originally announced October 2022.

  12. arXiv:2207.13081  [pdf, other]

    cs.LG stat.ML

    Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

    Authors: Masatoshi Uehara, Haruka Kiyohara, Andrew Bennett, Victor Chernozhukov, Nan Jiang, Nathan Kallus, Chengchun Shi, Wen Sun

    Abstract: We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs.…

    Submitted 14 November, 2023; v1 submitted 26 July, 2022; originally announced July 2022.

    Comments: This paper was accepted in NeurIPS 2023

  13. arXiv:2206.10770  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL

    Authors: Jinglin Chen, Aditya Modi, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal

    Abstract: We study reward-free reinforcement learning (RL) under general non-linear function approximation, and establish sample efficiency and hardness results under various standard structural assumptions. On the positive side, we propose the RFOLIVE (Reward-Free OLIVE) algorithm for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings…

    Submitted 22 October, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

  14. arXiv:2206.08364  [pdf, other]

    cs.LG cs.AI cs.HC stat.ML

    Interaction-Grounded Learning with Action-inclusive Feedback

    Authors: Tengyang Xie, Akanksha Saran, Dylan J. Foster, Lekan Molu, Ida Momennejad, Nan Jiang, Paul Mineiro, John Langford

    Abstract: Consider the problem setting of Interaction-Grounded Learning (IGL), in which a learner's goal is to optimally interact with the environment with no explicit reward to ground its policies. The agent observes a context vector, takes an action, and receives a feedback vector, using this information to effectively optimize a policy with respect to a latent reward function. Prior analyzed approaches f…

    Submitted 12 October, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

    Comments: Published in NeurIPS 2022

  15. arXiv:2205.12418  [pdf, other]

    cs.LG cs.AI stat.ML

    Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret

    Authors: Jiawei Huang, Li Zhao, Tao Qin, Wei Chen, Nan Jiang, Tie-Yan Liu

    Abstract: We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $π^{\text{O}}$ and $π^{\text{E}}$: $π^{\text{O}}$ ("O" for "online") interacts with m…

    Submitted 26 February, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

    Comments: 38 pages; NeurIPS 2022

  16. arXiv:2203.13935  [pdf, other]

    cs.LG stat.ML

    Offline Reinforcement Learning Under Value and Density-Ratio Realizability: The Power of Gaps

    Authors: Jinglin Chen, Nan Jiang

    Abstract: We consider a challenging theoretical problem in offline reinforcement learning (RL): obtaining sample-efficiency guarantees with a dataset lacking sufficient coverage, under only realizability-type assumptions for the function approximators. While the existing theory has addressed learning under realizability and under non-exploratory data separately, no work has been able to address both simulta…

    Submitted 14 June, 2022; v1 submitted 25 March, 2022; originally announced March 2022.

  17. arXiv:2202.06450  [pdf, other]

    cs.LG cs.AI stat.ML

    Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality

    Authors: Jiawei Huang, Jinglin Chen, Li Zhao, Tao Qin, Nan Jiang, Tie-Yan Liu

    Abstract: Deployment efficiency is an important criterion for many real-world applications of reinforcement learning (RL). Despite the community's increasing interest, there lacks a formal theoretical formulation for the problem. In this paper, we propose such a formulation for deployment-efficient RL (DE-RL) from an "optimization with constraints" perspective: we are interested in exploring an MDP and obta…

    Submitted 30 August, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

    Comments: 49 Pages; ICLR 2022

  18. arXiv:2202.04634  [pdf, ps, other]

    cs.LG stat.ML

    Offline Reinforcement Learning with Realizability and Single-policy Concentrability

    Authors: Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, Jason D. Lee

    Abstract: Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As a…

    Submitted 27 June, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

  19. arXiv:2201.01051  [pdf]

    cs.CR eess.SP stat.ML

    Open Access Dataset for Electromyography based Multi-code Biometric Authentication

    Authors: Ashirbad Pradhan, Jiayuan He, Ning Jiang

    Abstract: Recently, surface electromyogram (EMG) has been proposed as a novel biometric trait for addressing some key limitations of current biometrics, such as spoofing and liveness. The EMG signals possess a unique characteristic: they are inherently different for individuals (biometrics), and they can be customized to realize multi-length codes or passwords (for example, by performing different gestures)…

    Submitted 5 January, 2022; v1 submitted 4 January, 2022; originally announced January 2022.

    Comments: manuscript for open access dataset (paper and appendix)

    Journal ref: Sci Data 9, 733 (2022)

  20. arXiv:2111.06784  [pdf, other]

    cs.LG stat.ML

    A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes

    Authors: Chengchun Shi, Masatoshi Uehara, Jiawei Huang, Nan Jiang

    Abstract: We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose…

    Submitted 15 June, 2022; v1 submitted 12 November, 2021; originally announced November 2021.

  21. arXiv:2110.14000  [pdf, other]

    cs.LG cs.AI stat.ML

    Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning

    Authors: Siyuan Zhang, Nan Jiang

    Abstract: How to select between policies and value functions produced by different training algorithms in offline reinforcement learning (RL) -- which is crucial for hyperparameter tuning -- is an important open question. Existing approaches based on off-policy evaluation (OPE) often require additional function approximation and hence hyperparameters, creating a chicken-and-egg situation. In this paper, we…

    Submitted 2 November, 2021; v1 submitted 26 October, 2021; originally announced October 2021.

    Comments: NeurIPS 2021

  22. arXiv:2106.06926  [pdf, other]

    cs.LG cs.AI stat.ML

    Bellman-consistent Pessimism for Offline Reinforcement Learning

    Authors: Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, Alekh Agarwal

    Abstract: The use of pessimism, when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bell…

    Submitted 23 October, 2023; v1 submitted 13 June, 2021; originally announced June 2021.

    Comments: NeurIPS 2021 (Oral)

  23. arXiv:2106.04895  [pdf, ps, other]

    cs.LG stat.ML

    Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

    Authors: Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, Yu Bai

    Abstract: Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the…

    Submitted 11 February, 2022; v1 submitted 9 June, 2021; originally announced June 2021.

    Comments: Published in NeurIPS 2021

  24. arXiv:2102.07035  [pdf, other]

    cs.LG stat.ML

    Model-free Representation Learning and Exploration in Low-rank MDPs

    Authors: Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal

    Abstract: The low rank MDP has emerged as an important model for studying representation learning and exploration in reinforcement learning. With a known representation, several model-free exploration strategies exist. In contrast, all algorithms for the unknown representation setting are model-based, thereby requiring the ability to model the full dynamics. In this work, we present the first model-free rep…

    Submitted 21 June, 2022; v1 submitted 13 February, 2021; originally announced February 2021.

    Comments: Changelog v2: Significant reorganization of the paper, added an improved analysis of elliptic planner and updated discussion wrt follow-up work

  25. arXiv:2102.02981  [pdf, ps, other]

    cs.LG math.ST stat.ML

    Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency

    Authors: Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, Tengyang Xie

    Abstract: We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning using function approximation for marginal importance weights and $q$-functions when these are estimated using recent minimax methods. Under various combinations of realizability and completeness assumptions, we show that the minimax approach enables us to achieve a fast rate of convergence for weights…

    Submitted 24 July, 2022; v1 submitted 4 February, 2021; originally announced February 2021.

    Comments: Under Review

  26. arXiv:2102.02049  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function

    Authors: Gellért Weisz, Philip Amortila, Barnabás Janzer, Yasin Abbasi-Yadkori, Nan Jiang, Csaba Szepesvári

    Abstract: We consider local planning in fixed-horizon MDPs with a generative model under the assumption that the optimal value function lies close to the span of a feature map. The generative model provides a local access to the MDP: The planner can ask for random transitions from previously returned states and arbitrary actions, and features are only accessible for states that are encountered in this proce…

    Submitted 9 July, 2021; v1 submitted 3 February, 2021; originally announced February 2021.

  27. arXiv:2011.01075  [pdf, other]

    cs.LG cs.AI stat.ML

    A Variant of the Wang-Foster-Kakade Lower Bound for the Discounted Setting

    Authors: Philip Amortila, Nan Jiang, Tengyang Xie

    Abstract: Recently, Wang et al. (2020) showed a highly intriguing hardness result for batch reinforcement learning (RL) with linearly realizable value function and good feature coverage in the finite-horizon case. In this note we show that once adapted to the discounted setting, the construction can be simplified to a 2-state MDP with 1-dimensional features, such that learning is impossible even with an inf…

    Submitted 3 November, 2020; v1 submitted 2 November, 2020; originally announced November 2020.

  28. arXiv:2010.12163  [pdf, ps, other]

    cs.LG stat.ML

    Improved Worst-Case Regret Bounds for Randomized Least-Squares Value Iteration

    Authors: Priyank Agrawal, Jinglin Chen, Nan Jiang

    Abstract: This paper studies regret minimization with randomized value functions in reinforcement learning. In tabular finite-horizon Markov Decision Processes, we introduce a clipping variant of one classical Thompson Sampling (TS)-like algorithm, randomized least-squares value iteration (RLSVI). Our $\tilde{\mathrm{O}}(H^2S\sqrt{AT})$ high-probability worst-case regret bound improves the previous sharpest…

    Submitted 9 November, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

    Comments: Updated version, bug fixed

  29. arXiv:2008.04990  [pdf, ps, other]

    cs.LG stat.ML

    Batch Value-function Approximation with Only Realizability

    Authors: Tengyang Xie, Nan Jiang

    Abstract: We make progress in a long-standing problem of batch reinforcement learning (RL): learning $Q^\star$ from an exploratory and polynomial-sized dataset, using a realizable and otherwise arbitrary function class. In fact, all existing algorithms demand function-approximation assumptions stronger than realizability, and the mounting negative evidence has led to a conjecture that sample-efficient learn…

    Submitted 17 June, 2021; v1 submitted 11 August, 2020; originally announced August 2020.

    Comments: Published in ICML 2021

  30. arXiv:2003.03924  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison

    Authors: Tengyang Xie, Nan Jiang

    Abstract: We prove performance guarantees of two algorithms for approximating $Q^\star$ in batch reinforcement learning. Compared to classical iterative methods such as Fitted Q-Iteration---whose performance loss incurs quadratic dependence on horizon---these methods estimate (some forms of) the Bellman error and enjoy linear-in-horizon error propagation, a property established for the first time for algori…

    Submitted 24 August, 2020; v1 submitted 9 March, 2020; originally announced March 2020.

    Comments: Published in UAI 2020

  31. arXiv:2002.02081  [pdf, other]

    cs.LG math.OC stat.ML

    Minimax Value Interval for Off-Policy Evaluation and Policy Optimization

    Authors: Nan Jiang, Jiawei Huang

    Abstract: We study minimax methods for off-policy evaluation (OPE) using value functions and marginalized importance weights. Despite that they hold promises of overcoming the exponential variance in traditional importance sampling, several key problems remain: (1) They require function approximation and are generally biased. For the sake of trustworthy OPE, is there anyway to quantify the biases? (2) T…

    Submitted 4 November, 2020; v1 submitted 5 February, 2020; originally announced February 2020.

  32. arXiv:1911.06854  [pdf, other]

    cs.LG cs.AI cs.RO stat.ML

    Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning

    Authors: Cameron Voloshin, Hoang M. Le, Nan Jiang, Yisong Yue

    Abstract: We offer an experimental benchmark and empirical study for off-policy policy evaluation (OPE) in reinforcement learning, which is a key problem in many safety critical applications. Given the increasing interest in deploying learning-based methods, there has been a flurry of recent proposals for OPE method, leading to a need for standardized empirical analyses. Our work takes a strong focus on div…

    Submitted 27 November, 2021; v1 submitted 15 November, 2019; originally announced November 2019.

  33. arXiv:1910.12809  [pdf, other]

    cs.LG stat.ML

    Minimax Weight and Q-Function Learning for Off-Policy Evaluation

    Authors: Masatoshi Uehara, Jiawei Huang, Nan Jiang

    Abstract: We provide theoretical investigations into off-policy evaluation in reinforcement learning using function approximators for (marginalized) importance weights and value functions. Our contributions include: (1) A new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work (Liu et al., 2…

    Submitted 6 October, 2020; v1 submitted 28 October, 2019; originally announced October 2019.

  34. arXiv:1910.10597  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Sample Complexity of Reinforcement Learning using Linearly Combined Model Ensembles

    Authors: Aditya Modi, Nan Jiang, Ambuj Tewari, Satinder Singh

    Abstract: Reinforcement learning (RL) methods have been shown to be capable of learning intelligent behavior in rich domains. However, this has largely been done in simulated domains without adequate focus on the process of building the simulator. In this paper, we consider a setting where we have access to an ensemble of pre-trained and possibly inaccurate simulators (models). We approximate the real envir…

    Submitted 23 October, 2019; originally announced October 2019.

  35. arXiv:1910.09066  [pdf, other]

    cs.LG stat.ML

    From Importance Sampling to Doubly Robust Policy Gradient

    Authors: Jiawei Huang, Nan Jiang

    Abstract: We show that on-policy policy gradient (PG) and its variance reduction variants can be derived by taking finite difference of function evaluations supplied by estimators from the importance sampling (IS) family for off-policy evaluation (OPE). Starting from the doubly robust (DR) estimator (Jiang & Li, 2016), we provide a simple derivation of a very general and flexible form of PG, which subsumes…

    Submitted 23 June, 2020; v1 submitted 20 October, 2019; originally announced October 2019.

    Comments: ICML 2020

  36. arXiv:1908.10001  [pdf, other]

    cs.LG cs.CL stat.ML

    Real-world Conversational AI for Hotel Bookings

    Authors: Bai Li, Nanyi Jiang, Joey Sham, Henry Shi, Hussein Fazal

    Abstract: In this paper, we present a real-world conversational AI system to search for and book hotels through text messaging. Our architecture consists of a frame-based dialogue management system, which calls machine learning models for intent classification, named entity recognition, and information retrieval subtasks. Our chatbot has been deployed on a commercial scale, handling tens of thousands of hot…

    Submitted 26 August, 2019; originally announced August 2019.

    Comments: Accepted to IEEE AI4I 2019 (International Conference on Artificial Intelligence for Industries)

  37. arXiv:1905.13341  [pdf, other]

    cs.LG cs.AI stat.ML

    On Value Functions and the Agent-Environment Boundary

    Authors: Nan Jiang

    Abstract: When function approximation is deployed in reinforcement learning (RL), the same problem may be formulated in different ways, often by treating a pre-processing step as a part of the environment or as part of the agent. As a consequence, fundamental concepts in RL, such as (optimal) value functions, are not uniquely defined as they depend on where we draw this agent-environment boundary, causing p…

    Submitted 31 May, 2020; v1 submitted 30 May, 2019; originally announced May 2019.

    Comments: 16 pages

  38. arXiv:1905.12849  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Provably Efficient Q-Learning with Low Switching Cost

    Authors: Yu Bai, Tengyang Xie, Nan Jiang, Yu-Xiang Wang

    Abstract: We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change its exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of local switching cost. Our…

    Submitted 9 February, 2020; v1 submitted 30 May, 2019; originally announced May 2019.

    Comments: Published at NeurIPS 2019

  39. arXiv:1905.00360  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Information-Theoretic Considerations in Batch Reinforcement Learning

    Authors: Jinglin Chen, Nan Jiang

    Abstract: Value-function approximation methods that operate in batch mode have foundational importance to reinforcement learning (RL). Finite sample guarantees for these methods often crucially rely on two types of assumptions: (1) mild distribution shift, and (2) representation conditions that are stronger than realizability. However, the necessity ("why do we need them?") and the naturalness ("when do the…

    Submitted 1 May, 2019; originally announced May 2019.

    Comments: Published in ICML 2019

  40. arXiv:1901.09018  [pdf, other]

    cs.LG stat.ML

    Provably efficient RL with Rich Observations via Latent State Decoding

    Authors: Simon S. Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudík, John Langford

    Abstract: We study the exploration problem in episodic MDPs with rich observations generated from a small number of latent states. Under certain identifiability assumptions, we demonstrate how to estimate a mapping from the observations to latent states inductively through a sequence of regression and clustering steps -- where previously decoded latent states provide labels for later regression problems --…

    Submitted 9 September, 2021; v1 submitted 25 January, 2019; originally announced January 2019.

    Comments: The ICML 2019 version omitted the second constraint on $ε$ in Theorem 4.1. We thank Yonathan Efroni for calling this to our attention

  41. arXiv:1811.08540  [pdf, other]

    cs.LG stat.ML

    Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches

    Authors: Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford

    Abstract: We study the sample complexity of model-based reinforcement learning (henceforth RL) in general contextual decision processes that require strategic exploration to find a near-optimal policy. We design new algorithms for RL with a generic model class and analyze their statistical properties. Our algorithms have sample complexity governed by a new structural parameter called the witness rank, which…

    Submitted 30 May, 2019; v1 submitted 20 November, 2018; originally announced November 2018.

    Comments: COLT 2019

  42. arXiv:1803.00606  [pdf, other]

    cs.LG stat.ML

    On Oracle-Efficient PAC RL with Rich Observations

    Authors: Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, Robert E. Schapire

    Abstract: We study the computational tractability of PAC reinforcement learning with rich observations. We present new provably sample-efficient algorithms for environments with deterministic hidden state dynamics and stochastic rich observations. These methods operate in an oracle model of computation -- accessing policy and value function classes exclusively through standard optimization primitives -- and…

    Submitted 16 January, 2019; v1 submitted 1 March, 2018; originally announced March 2018.

    Comments: appeared at NeurIPS 18; full paper including appendix; updated style file

  43. arXiv:1803.00590  [pdf, other]

    cs.LG cs.AI stat.ML

    Hierarchical Imitation and Reinforcement Learning

    Authors: Hoang M. Le, Nan Jiang, Alekh Agarwal, Miroslav Dudík, Yisong Yue, Hal Daumé III

    Abstract: We study how to effectively leverage expert feedback to learn sequential decision-making policies. We focus on problems with sparse rewards and long time horizons, which typically pose significant challenges in reinforcement learning. We propose an algorithmic framework, called hierarchical guidance, that leverages the hierarchical structure of the underlying problem to integrate different modes o…

    Submitted 9 June, 2018; v1 submitted 1 March, 2018; originally announced March 2018.

    Comments: Proceedings of the 35th International Conference on Machine Learning (ICML 2018)

  44. arXiv:1711.05726  [pdf, other]

    stat.ML cs.AI cs.LG

    Markov Decision Processes with Continuous Side Information

    Authors: Aditya Modi, Nan Jiang, Satinder Singh, Ambuj Tewari

    Abstract: We consider a reinforcement learning (RL) setting in which the agent interacts with a sequence of episodic MDPs. At the start of each episode the agent has access to some side-information or context that determines the dynamics of the MDP for that episode. Our setting is motivated by applications in healthcare where baseline measurements of a patient at the start of a treatment episode form the co…

    Submitted 15 November, 2017; originally announced November 2017.

    Journal ref: PMLR Volume 83: Algorithmic Learning Theory, 7-9 April 2018

  45. arXiv:1610.09512  [pdf, other]

    cs.LG stat.ML

    Contextual Decision Processes with Low Bellman Rank are PAC-Learnable

    Authors: Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, Robert E. Schapire

    Abstract: This paper studies systematic exploration for reinforcement learning with rich observations and function approximation. We introduce a new model called contextual decision processes, that unifies and generalizes most prior settings. Our first contribution is a complexity measure, the Bellman rank, that we show enables tractable learning of near-optimal behavior in these processes and is naturally…

    Submitted 1 December, 2016; v1 submitted 29 October, 2016; originally announced October 2016.

    Comments: 42 pages, 1 figure

  46. arXiv:1609.00074   

    stat.ML cs.LG

    Neural Network Architecture Optimization through Submodularity and Supermodularity

    Authors: Junqi Jin, Ziang Yan, Kun Fu, Nan Jiang, Changshui Zhang

    Abstract: Deep learning models' architectures, including depth and width, are key factors influencing models' performance, such as test accuracy and computation time. This paper solves two problems: given computation time budget, choose an architecture to maximize accuracy, and given accuracy requirement, choose an architecture to minimize computation time. We convert this architecture optimization into a s…

    Submitted 20 February, 2018; v1 submitted 31 August, 2016; originally announced September 2016.

    Comments: Withdrawn due to incompleteness and some overlaps with existing literatures, I will resubmit adding further results

  47. arXiv:1608.07892   

    stat.ML cs.LG

    Optimizing Recurrent Neural Networks Architectures under Time Constraints

    Authors: Junqi Jin, Ziang Yan, Kun Fu, Nan Jiang, Changshui Zhang

    Abstract: Recurrent neural network (RNN)'s architecture is a key factor influencing its performance. We propose algorithms to optimize hidden sizes under running time constraint. We convert the discrete optimization into a subset selection problem. By novel transformations, the objective function becomes submodular and constraint becomes supermodular. A greedy algorithm with bounds is suggested to solve the…

    Submitted 20 February, 2018; v1 submitted 28 August, 2016; originally announced August 2016.

    Comments: Withdrawn due to incompleteness and some overlaps with existing literatures, I will resubmit adding further results

  48. arXiv:1511.03722  [pdf, other]

    cs.LG cs.AI eess.SY stat.ME stat.ML

    Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

    Authors: Nan Jiang, Lihong Li

    Abstract: We study the problem of off-policy value evaluation in reinforcement learning (RL), where one aims to estimate the value of a new policy based on data collected by a different policy. This problem is often a critical step when applying RL in real-world problems. Despite its importance, existing general methods either have uncontrolled bias or suffer high variance. In this work, we extend the doubl…

    Submitted 26 May, 2016; v1 submitted 11 November, 2015; originally announced November 2015.

    Comments: 14 pages; 4 figures; ICML 2016