Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data

arXiv:2306.14063 (cs)

[Submitted on 24 Jun 2023 (v1), last revised 1 May 2024 (this version, v2)]

Title: Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data

Authors: Sunil Madhow, Dan Qiao, Ming Yin, Yu-Xiang Wang

Abstract: Developing theoretical guarantees on the sample complexity of offline RL methods is an important step towards making data-hungry RL algorithms practically viable. Currently, most results hinge on unrealistic assumptions about the data distribution -- namely that it comprises a set of i.i.d. trajectories collected by a single logging policy. We consider a more general setting where the dataset may have been gathered adaptively. We develop theory for the TMIS Offline Policy Evaluation (OPE) estimator in this generalized setting for tabular MDPs, deriving high-probability, instance-dependent bounds on its estimation error. We also recover minimax-optimal offline learning in the adaptive setting. Finally, we conduct simulations to empirically analyze the behavior of these estimators under adaptive and non-adaptive regimes.

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2306.14063 [cs.LG]
	(or arXiv:2306.14063v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2306.14063

Submission history

From: Sunil Madhow
[v1] Sat, 24 Jun 2023 21:48:28 UTC (1,385 KB)
[v2] Wed, 1 May 2024 00:42:22 UTC (1,385 KB)
