Adding Conditional Control to Diffusion Models with Reinforcement Learning

Authors: Yulai Zhao, Masatoshi Uehara, Gabriele Scalia, Tommaso Biancalani, Sergey Levine, Ehsan Hajiramezanali

Abstract: Diffusion models are powerful generative models that allow for precise control over the characteristics of the generated samples. While these diffusion models trained on large datasets have achieved success, there is often a need to introduce additional controls in downstream fine-tuning processes, treating these powerful models as pre-trained diffusion models. This work presents a novel method based on reinforcement learning (RL) to add additional controls, leveraging an offline dataset comprising inputs and corresponding labels. We formulate this task as an RL problem, with the classifier learned from the offline dataset and the KL divergence against pre-trained models serving as the reward functions. We introduce our method, $\textbf{CTRL}$ ($\textbf{C}$onditioning pre-$\textbf{T}$rained diffusion models with $\textbf{R}$einforcement $\textbf{L}$earning), which produces soft-optimal policies that maximize the abovementioned reward functions. We formally demonstrate that our method enables sampling from the conditional distribution conditioned on additional controls during inference. Our RL-based approach offers several advantages over existing methods. Compared to commonly used classifier-free guidance, our approach improves sample efficiency, and can greatly simplify offline dataset construction by exploiting conditional independence between the inputs and additional controls. Furthermore, unlike classifier guidance, we avoid the need to train classifiers from intermediate states to additional controls.

Submitted 17 June, 2024; originally announced June 2024.

Comments: Under review

arXiv:2405.19673 [pdf, other]

Bridging Model-Based Optimization and Generative Modeling via Conservative Fine-Tuning of Diffusion Models

Authors: Masatoshi Uehara, Yulai Zhao, Ehsan Hajiramezanali, Gabriele Scalia, Gökcen Eraslan, Avantika Lal, Sergey Levine, Tommaso Biancalani

Abstract: AI-driven design problems, such as DNA/protein sequence design, are commonly tackled from two angles: generative modeling, which efficiently captures the feasible design space (e.g., natural images or biological sequences), and model-based optimization, which utilizes reward models for extrapolation. To combine the strengths of both approaches, we adopt a hybrid method that fine-tunes cutting-edge diffusion models by optimizing reward models through RL. Although prior work has explored similar avenues, they primarily focus on scenarios where accurate reward models are accessible. In contrast, we concentrate on an offline setting where a reward model is unknown, and we must learn from static offline datasets, a common scenario in scientific domains. In offline scenarios, existing approaches tend to suffer from overoptimization, as they may be misled by the reward model in out-of-distribution regions. To address this, we introduce a conservative fine-tuning approach, BRAID, by optimizing a conservative reward model, which includes additional penalization outside of offline data distributions. Through empirical and theoretical analysis, we demonstrate the capability of our approach to outperform the best designs in offline data, leveraging the extrapolation capabilities of reward models while avoiding the generation of invalid designs through pre-trained diffusion models.

Submitted 31 May, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

Comments: Under review

arXiv:2403.04236 [pdf, ps, other]

Regularized DeepIV with Model Selection

Authors: Zihao Li, Hui Lan, Vasilis Syrgkanis, Mengdi Wang, Masatoshi Uehara

Abstract: In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. While recent advancements in machine learning have introduced flexible methods for IV estimation, they often encounter one or more of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) requiring minimax computation oracle, which is highly unstable in practice; (3) absence of model selection procedure. In this paper, we present the first method and analysis that can avoid all three limitations, while still enabling general function approximation. Specifically, we propose a minimax-oracle-free method called Regularized DeepIV (RDIV) regression that can converge to the least-norm IV solution. Our method consists of two stages: first, we learn the conditional distribution of covariates, and by utilizing the learned distribution, we learn the estimator by minimizing a Tikhonov-regularized loss function. We further show that our method allows model selection procedures that can achieve the oracle rates in the misspecified regime. When extended to an iterative estimator, our method matches the current state-of-the-art convergence rate. Our method is a Tikhonov regularized variant of the popular DeepIV method with a non-parametric MLE first-stage estimator, and our results provide the first rigorous guarantees for this empirically used method, showcasing the importance of regularization which was absent from the original work.

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2402.16359 [pdf, other]

Feedback Efficient Online Fine-Tuning of Diffusion Models

Authors: Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Sergey Levine, Tommaso Biancalani

Abstract: Diffusion models excel at modeling complex data distributions, including those of images, proteins, and small molecules. However, in many cases, our goal is to model parts of the distribution that maximize certain properties: for example, we may want to generate images with high aesthetic quality, or molecules with high bioactivity. It is natural to frame this as a reinforcement learning (RL) problem, in which the objective is to fine-tune a diffusion model to maximize a reward function that corresponds to some property. Even with access to online queries of the ground-truth reward function, efficiently discovering high-reward samples can be challenging: they might have a low probability in the initial distribution, and there might be many infeasible samples that do not even have a well-defined reward (e.g., unnatural images or physically impossible molecules). In this work, we propose a novel reinforcement learning procedure that efficiently explores on the manifold of feasible samples. We present a theoretical analysis providing a regret guarantee, as well as empirical validation across three domains: images, biological sequences, and molecules.

Submitted 27 February, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

Comments: Under review (codes will be released soon)

arXiv:2402.15194 [pdf, other]

Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control

Authors: Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Tommaso Biancalani, Sergey Levine

Abstract: Diffusion models excel at capturing complex data distributions, such as those of natural images and proteins. While diffusion models are trained to represent the distribution in the training dataset, we often are more concerned with other properties, such as the aesthetic quality of the generated images or the functional properties of generated proteins. Diffusion models can be finetuned in a goal-directed way by maximizing the value of some reward function (e.g., the aesthetic quality of an image). However, these approaches may lead to reduced sample diversity, significant deviations from the training data distribution, and even poor sample quality due to the exploitation of an imperfect reward function. The last issue often occurs when the reward function is a learned model meant to approximate a ground-truth "genuine" reward, as is the case in many practical applications. These challenges, collectively termed "reward collapse," pose a substantial obstacle. To address this reward collapse, we frame the finetuning problem as entropy-regularized control against the pretrained diffusion model, i.e., directly optimizing entropy-enhanced rewards with neural SDEs. We present theoretical and empirical evidence that demonstrates our framework is capable of efficiently generating diverse samples with high genuine rewards, mitigating the overoptimization of imperfect reward models.

Submitted 28 February, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Comments: Under review (codes will be released soon)

arXiv:2401.05442 [pdf, other]

Functional Graphical Models: Structure Enables Offline Data-Driven Optimization

Authors: Jakub Grudzien Kuba, Masatoshi Uehara, Pieter Abbeel, Sergey Levine

Abstract: While machine learning models are typically trained to solve prediction problems, we might often want to use them for optimization problems. For example, given a dataset of proteins and their corresponding fluorescence levels, we might want to optimize for a new protein with the highest possible fluorescence. This kind of data-driven optimization (DDO) presents a range of challenges beyond those in standard prediction problems, since we need models that successfully predict the performance of new designs that are better than the best designs seen in the training set. It is not clear theoretically when existing approaches can even perform better than the naive approach that simply selects the best design in the dataset. In this paper, we study how structure can enable sample-efficient data-driven optimization. To formalize the notion of structure, we introduce functional graphical models (FGMs) and show theoretically how they can provide for principled data-driven optimization by decomposing the original high-dimensional optimization problem into smaller sub-problems. This allows us to derive much more practical regret bounds for DDO, and the result implies that DDO with FGMs can achieve nearly optimal designs in situations where naive approaches fail due to insufficient coverage of the offline data. We further present a data-driven optimization algorithm that infers the FGM structure itself, either over the original input variables or a latent variable representation of the inputs.

Submitted 11 January, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

arXiv:2307.13793 [pdf, ps, other]

Source Condition Double Robust Inference on Functionals of Inverse Problems

Authors: Andrew Bennett, Nathan Kallus, Xiaojie Mao, Whitney Newey, Vasilis Syrgkanis, Masatoshi Uehara

Abstract: We consider estimation of parameters defined as linear functionals of solutions to linear inverse problems. Any such parameter admits a doubly robust representation that depends on the solution to a dual linear inverse problem, where the dual solution can be thought of as a generalization of the inverse propensity function. We provide the first source condition double robust inference method that ensures asymptotic normality around the parameter of interest as long as either the primal or the dual inverse problem is sufficiently well-posed, without knowledge of which inverse problem is the more well-posed one. Our result is enabled by novel guarantees for iterated Tikhonov regularized adversarial estimators for linear inverse problems, over general hypothesis spaces, which are developments of independent interest.

Submitted 25 July, 2023; originally announced July 2023.

arXiv:2306.15098 [pdf, other]

doi 10.1145/3580305.3599447

Off-Policy Evaluation of Ranking Policies under Diverse User Behavior

Authors: Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, Yuta Saito

Abstract: Ranking interfaces are everywhere in online platforms. There is thus an ever growing interest in their Off-Policy Evaluation (OPE), aiming towards an accurate performance evaluation of ranking policies using logged data. A de-facto approach for OPE is Inverse Propensity Scoring (IPS), which provides an unbiased and consistent value estimate. However, it becomes extremely inaccurate in the ranking setup due to its high variance under large action spaces. To deal with this problem, previous studies assume either independent or cascade user behavior, resulting in some ranking versions of IPS. While these estimators are somewhat effective in reducing the variance, all existing estimators apply a single universal assumption to every user, causing excessive bias and variance. Therefore, this work explores a far more general formulation where user behavior is diverse and can vary depending on the user context. We show that the resulting estimator, which we call Adaptive IPS (AIPS), can be unbiased under any complex user behavior. Moreover, AIPS achieves the minimum variance among all unbiased estimators based on IPS. We further develop a procedure to identify the appropriate user behavior model to minimize the mean squared error (MSE) of AIPS in a data-driven fashion. Extensive experiments demonstrate that the empirical accuracy improvement can be significant, enabling effective OPE of ranking systems even under diverse user behavior.

Submitted 26 June, 2023; originally announced June 2023.

Comments: KDD2023 Research track

arXiv:2305.18505 [pdf, ps, other]

Provable Reward-Agnostic Preference-Based Reinforcement Learning

Authors: Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

Abstract: Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories, rather than explicit reward signals. While PbRL has demonstrated practical success in fine-tuning language models, existing theoretical work focuses on regret minimization and fails to capture most of the practical frameworks. In this study, we fill in such a gap between theoretical PbRL and practical algorithms by proposing a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired before collecting any human feedback. Theoretical analysis demonstrates that our algorithm requires less human feedback for learning the optimal policy under preference-based models with linear parameterization and unknown transitions, compared to the existing theoretical literature. Specifically, our framework can incorporate linear and low-rank MDPs with efficient sample complexity. Additionally, we investigate reward-agnostic RL with action-based comparison feedback and introduce an efficient querying algorithm tailored to this scenario.

Submitted 17 April, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

Comments: ICLR 2024 Spotlight

arXiv:2305.14816 [pdf, ps, other]

Provable Offline Preference-Based Reinforcement Learning

Authors: Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

Abstract: In this paper, we investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs.

Submitted 29 September, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: The first two authors contribute equally

arXiv:2302.09456 [pdf, other]

Distributional Offline Policy Evaluation with Predictive Error Guarantees

Authors: Runzhe Wu, Masatoshi Uehara, Wen Sun

Abstract: We study the problem of estimating the distribution of the return of a policy using an offline dataset that is not generated from the policy, i.e., distributional offline policy evaluation (OPE). We propose an algorithm called Fitted Likelihood Estimation (FLE), which conducts a sequence of Maximum Likelihood Estimation (MLE) and has the flexibility of integrating any state-of-the-art probabilistic generative models as long as it can be trained via MLE. FLE can be used for both finite-horizon and infinite-horizon discounted settings where rewards can be multi-dimensional vectors. Our theoretical results show that for both finite-horizon and infinite-horizon discounted settings, FLE can learn distributions that are close to the ground truth under total variation distance and Wasserstein distance, respectively. Our theoretical results hold under the conditions that the offline data covers the test policy's traces and that the supervised learning MLE procedures succeed. Experimentally, we demonstrate the performance of FLE with two generative models, Gaussian mixture models and diffusion models. For the multi-dimensional reward setting, FLE with diffusion models is capable of estimating the complicated distribution of the return of a test policy.

Submitted 29 December, 2023; v1 submitted 18 February, 2023; originally announced February 2023.

Comments: Accepted at ICML 2023

arXiv:2302.05404 [pdf, ps, other]

Minimax Instrumental Variable Regression and $L_2$ Convergence Guarantees without Identification or Closedness

Authors: Andrew Bennett, Nathan Kallus, Xiaojie Mao, Whitney Newey, Vasilis Syrgkanis, Masatoshi Uehara

Abstract: In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. Recently, many flexible machine learning methods have been developed for instrumental variable estimation. However, these methods have at least one of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) only obtaining estimation error rates in terms of pseudometrics (e.g., projected norm) rather than valid metrics (e.g., $L_2$ norm); or (3) imposing the so-called closedness condition that requires a certain conditional expectation operator to be sufficiently smooth. In this paper, we present the first method and analysis that can avoid all three limitations, while still permitting general function approximation. Specifically, we propose a new penalized minimax estimator that can converge to a fixed IV solution even when there are multiple solutions, and we derive a strong $L_2$ error rate for our estimator under lax conditions. Notably, this guarantee only needs a widely-used source condition and realizability assumptions, but not the so-called closedness condition. We argue that the source condition and the closedness condition are inherently conflicting, so relaxing the latter significantly improves upon the existing literature that requires both conditions. Our estimator can achieve this improvement because it builds on a novel formulation of the IV estimation problem as a constrained optimization problem.

Submitted 10 February, 2023; originally announced February 2023.

Comments: Under review

arXiv:2302.02392 [pdf, ps, other]

Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage

Authors: Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

Abstract: In offline reinforcement learning (RL) we have no opportunity to explore so we must make assumptions that the data is sufficient to guide picking a good policy, taking the form of assuming some coverage, realizability, Bellman completeness, and/or hard margin (gap). In this work we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage of just a single comparator policy, and realizability of soft (entropy-regularized) Q-function of the single policy and a related function defined as a saddle point of certain minimax optimization problem. This offers refined and generally more lax conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms to accurately estimate soft or vanilla Q-functions with $L^2$-convergence guarantees. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.

Submitted 13 November, 2023; v1 submitted 5 February, 2023; originally announced February 2023.

Comments: The original title of this paper was "Refined Value-Based Offline RL under Realizability and Partial Coverage," but it was later changed. This paper has been accepted for NeurIPS 2023

arXiv:2212.06355 [pdf, ps, other]

A Review of Off-Policy Evaluation in Reinforcement Learning

Authors: Masatoshi Uehara, Chengchun Shi, Nathan Kallus

Abstract: Reinforcement learning (RL) is one of the most vibrant research frontiers in machine learning and has been recently applied to solve a number of challenging problems. In this paper, we primarily focus on off-policy evaluation (OPE), one of the most fundamental topics in RL. In recent years, a number of OPE methods have been developed in the statistics and computer science literature. We provide a discussion on the efficiency bound of OPE, some of the existing state-of-the-art OPE methods, their statistical properties and some other related research directions that are currently actively explored.

Submitted 12 December, 2022; originally announced December 2022.

Comments: Still under revision

arXiv:2208.08291 [pdf, ps, other]

Inference on Strongly Identified Functionals of Weakly Identified Functions

Authors: Andrew Bennett, Nathan Kallus, Xiaojie Mao, Whitney Newey, Vasilis Syrgkanis, Masatoshi Uehara

Abstract: In a variety of applications, including nonparametric instrumental variable (NPIV) analysis, proximal causal inference under unmeasured confounding, and missing-not-at-random data with shadow variables, we are interested in inference on a continuous linear functional (e.g., average causal effects) of nuisance function (e.g., NPIV regression) defined by conditional moment restrictions. These nuisance functions are generally weakly identified, in that the conditional moment restrictions can be severely ill-posed as well as admit multiple solutions. This is sometimes resolved by imposing strong conditions that imply the function can be estimated at rates that make inference on the functional possible. In this paper, we study a novel condition for the functional to be strongly identified even when the nuisance function is not; that is, the functional is amenable to asymptotically-normal estimation at $\sqrt{n}$-rates. The condition implies the existence of debiasing nuisance functions, and we propose penalized minimax estimators for both the primary and debiasing nuisance functions. The proposed nuisance estimators can accommodate flexible function classes, and importantly they can converge to fixed limits determined by the penalization regardless of the identifiability of the nuisances. We use the penalized nuisance estimators to form a debiased estimator for the functional of interest and prove its asymptotic normality under generic high-level conditions, which provide for asymptotically valid confidence intervals. We also illustrate our method in a novel partially linear proximal causal inference problem and a partially linear instrumental variable regression problem.

Submitted 30 June, 2023; v1 submitted 17 August, 2022; originally announced August 2022.

Comments: This supersedes the previous version titled "Debiased Inference on Identified Linear Functionals of Underidentified Nuisances via Penalized Minimax Estimation"

arXiv:2207.13081 [pdf, other]

Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

Authors: Masatoshi Uehara, Haruka Kiyohara, Andrew Bennett, Victor Chernozhukov, Nan Jiang, Nathan Kallus, Chengchun Shi, Wen Sun

Abstract: We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play similar roles as classical value functions in fully-observable MDPs. We derive a new Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We obtain the PAC result, which implies our OPE estimator is consistent as long as futures and histories contain sufficient information about latent states, and the Bellman completeness. Finally, we extend our methods to learning of dynamics and establish the connection between our approach and the well-known spectral learning methods in POMDPs.

Submitted 14 November, 2023; v1 submitted 26 July, 2022; originally announced July 2022.

Comments: This paper was accepted in NeurIPS 2023

arXiv:2207.05738 [pdf, other]

PAC Reinforcement Learning for Predictive State Representations

Authors: Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

Abstract: In this paper we study online Reinforcement Learning (RL) in partially observable dynamical systems. We focus on the Predictive State Representations (PSRs) model, which is an expressive model that captures other well-known models such as Partially Observable Markov Decision Processes (POMDP). PSR represents the states using a set of predictions of future observations and is defined entirely using observable quantities. We develop a novel model-based algorithm for PSRs that can learn a near optimal policy in sample complexity scaling polynomially with respect to all the relevant parameters of the systems. Our algorithm naturally works with function approximation to extend to systems with potentially large state and observation spaces. We show that given a realizable model class, the sample complexity of learning the near optimal policy only scales polynomially with respect to the statistical complexity of the model class, without any explicit polynomial dependence on the size of the state and observation spaces. Notably, our work is the first work that shows polynomial sample complexities to compete with the globally optimal policy in PSRs. Finally, we demonstrate how our general theorem can be directly used to derive sample complexity bounds for special models including $m$-step weakly revealing and $m$-step decodable tabular POMDPs, POMDPs with low-rank latent transition, and POMDPs with linear emission and latent transition.

Submitted 13 August, 2022; v1 submitted 12 July, 2022; originally announced July 2022.

arXiv:2206.12081 [pdf, other]

Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings

Authors: Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun

Abstract: We study reinforcement learning with function approximation for large-scale Partially Observable Markov Decision Processes (POMDPs) where the state space and observation space are large or even continuous. Particularly, we consider Hilbert space embeddings of POMDP where the feature of latent states and the feature of observations admit a conditional Hilbert space embedding of the observation emission process, and the latent state transition is deterministic. Under the function approximation setup where the optimal latent state-action $Q$-function is linear in the state feature, and the optimal $Q$-function has a gap in actions, we provide a \emph{computationally and statistically efficient} algorithm for finding the \emph{exact optimal} policy. We show our algorithm's computational and statistical complexities scale polynomially with respect to the horizon and the intrinsic dimension of the feature on the observation space. Furthermore, we show both the deterministic latent transitions and gap assumptions are necessary to avoid statistical complexity exponential in horizon or dimension. Since our guarantee does not have an explicit dependence on the size of the state and observation spaces, our algorithm provably scales to large-scale POMDPs.

Submitted 24 June, 2022; originally announced June 2022.

arXiv:2206.12020 [pdf, ps, other]

Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems

Authors: Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun

Abstract: We study Reinforcement Learning for partially observable dynamical systems using function approximation. We propose a new \textit{Partially Observable Bilinear Actor-Critic framework}, that is general enough to include models such as observable tabular Partially Observable Markov Decision Processes (POMDPs), observable Linear-Quadratic-Gaussian (LQG), Predictive State Representations (PSRs), as well as a newly introduced model Hilbert Space Embeddings of POMDPs and observable POMDPs with latent low-rank transition. Under this framework, we propose an actor-critic style algorithm that is capable of performing agnostic policy learning. Given a policy class that consists of memory based policies (that look at a fixed-length window of recent observations), and a value function class that consists of functions taking both memory and future observations as inputs, our algorithm learns to compete against the best memory-based policy in the given policy class. For certain examples such as undercomplete observable tabular POMDPs, observable LQGs and observable POMDPs with latent low-rank transition, by implicitly leveraging their special properties, our algorithm is even capable of competing against the globally optimal policy without paying an exponential dependence on the horizon in its sample complexity.

Submitted 23 June, 2022; originally announced June 2022.

arXiv:2204.02718 [pdf, other]

Annotation-Scheme Reconstruction for "Fake News" and Japanese Fake News Dataset

Authors: Taichi Murayama, Shohei Hisada, Makoto Uehara, Shoko Wakamiya, Eiji Aramaki

Abstract: Fake news provokes many societal problems; therefore, there has been extensive research on fake news detection tasks to counter it. Many fake news datasets were constructed as resources to facilitate this task. Contemporary research focuses almost exclusively on the factuality aspect of the news. However, this aspect alone is insufficient to explain "fake news," which is a complex phenomenon that involves a wide range of issues. To fully understand the nature of each instance of fake news, it is important to observe it from various perspectives, such as the intention of the false news disseminator, the harmfulness of the news to our society, and the target of the news. We propose a novel annotation scheme with fine-grained labeling based on detailed investigations of existing fake news datasets to capture these various aspects of fake news. Using the annotation scheme, we construct and publish the first Japanese fake news dataset. The annotation scheme is expected to provide an in-depth understanding of fake news. We plan to build datasets for both Japanese and other languages using our scheme. Our Japanese dataset is published at https://hkefka385.github.io/dataset/fakenews-japanese/.

Submitted 6 April, 2022; originally announced April 2022.

Comments: 13th International Conference on Language Resources and Evaluation (LREC), 2022

arXiv:2202.00063 [pdf, other]

Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach

Authors: Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, Wen Sun

Abstract: We present BRIEE (Block-structured Representation learning with Interleaved Explore Exploit), an algorithm for efficient reinforcement learning in Markov Decision Processes with block-structured dynamics (i.e., Block MDPs), where rich observations are generated from a set of unknown latent states. BRIEE interleaves latent states discovery, exploration, and exploitation together, and can provably learn a near-optimal policy with sample complexity scaling polynomially in the number of latent states, actions, and the time horizon, with no dependence on the size of the potentially infinite observation space. Empirically, we show that BRIEE is more sample efficient than the state-of-the-art Block MDP algorithm HOMER and other empirical RL baselines on challenging rich-observation combination lock problems that require deep exploration.

Submitted 11 October, 2022; v1 submitted 31 January, 2022; originally announced February 2022.

arXiv:2111.06784 [pdf, other]

A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes

Authors: Chengchun Shi, Masatoshi Uehara, Jiawei Huang, Nan Jiang

Abstract: We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail.

Submitted 15 June, 2022; v1 submitted 12 November, 2021; originally announced November 2021.

arXiv:2110.04652 [pdf, other]

Representation Learning for Online and Offline RL in Low-rank MDPs

Authors: Masatoshi Uehara, Xuezhou Zhang, Wen Sun

Abstract: This work studies the question of Representation Learning in RL: how can we learn a compact low-dimensional representation such that on top of the representation we can perform RL procedures such as exploration and exploitation, in a sample efficient manner. We focus on the low-rank Markov Decision Processes (MDPs) where the transition dynamics correspond to a low-rank transition matrix. Unlike prior works that assume the representation is known (e.g., linear MDPs), here we need to learn the representation for the low-rank MDP. We study both the online RL and offline RL settings. For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et al.), the state-of-the-art algorithm for learning representations in low-rank MDPs, we propose an algorithm REP-UCB (Upper Confidence Bound driven Representation learning for RL), which significantly improves the sample complexity from $\widetilde{O}( A^9 d^7 / (ε^{10} (1-γ)^{22}))$ for FLAMBE to $\widetilde{O}( A^2 d^4 / (ε^2 (1-γ)^{5}) )$ with $d$ being the rank of the transition matrix (or dimension of the ground truth representation), $A$ being the number of actions, and $γ$ being the discount factor. Notably, REP-UCB is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation, while FLAMBE is an explore-then-commit style approach and has to perform reward-free exploration step-by-step forward in time. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition: our algorithm is able to compete against any policy as long as it is covered by the offline distribution.

Submitted 5 January, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

arXiv:2107.06226 [pdf, other]

Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage

Authors: Masatoshi Uehara, Wen Sun

Abstract: We study model-based offline Reinforcement Learning with general function approximation without a full coverage assumption on the offline data distribution. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO), which leverages a general function class and uses a constraint over the model class to encode pessimism. Under the assumption that the ground truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data only providing partial coverage, i.e., it can learn a policy that competes against any policy that is covered by the offline data. We then demonstrate that this algorithmic framework can be applied to many specialized Markov Decision Processes where additional structural assumptions can further refine the concept of partial coverage. Two notable examples are: (1) low-rank MDP with representation learning where the partial coverage condition is defined using a relative condition number measured by the unknown ground truth feature representation; (2) factored MDP where the partial coverage condition is defined using density ratio based concentrability coefficients associated with individual factors.

Submitted 9 January, 2023; v1 submitted 13 July, 2021; originally announced July 2021.

Comments: We changed the title from the first version. This is a longer version of the article accepted in ICLR 2022. The following things are added (1) a new algorithm CPPO-LR where the constraint is given in a log-likelihood form, (2) how to instantiate CPPO on (nonparametric) linear MDPs, (3) posterior sampling in a model-free way

arXiv:2106.03207 [pdf, other]

Mitigating Covariate Shift in Imitation Learning via Offline Data Without Great Coverage

Authors: Jonathan D. Chang, Masatoshi Uehara, Dhruv Sreenivas, Rahul Kidambi, Wen Sun

Abstract: This paper studies offline Imitation Learning (IL) where an agent learns to imitate an expert demonstrator without additional online environment interactions. Instead, the learner is presented with a static offline dataset of state-action-next state transition triples from a potentially less proficient behavior policy. We introduce Model-based IL from Offline data (MILO): an algorithmic framework that utilizes the static dataset to solve the offline IL problem efficiently both in theory and in practice. In theory, even if the behavior policy is highly sub-optimal compared to the expert, we show that as long as the data from the behavior policy provides sufficient coverage on the expert state-action traces (and with no necessity for a global coverage over the entire state-action space), MILO can provably combat the covariate shift issue in IL. Complementing our theory results, we also demonstrate that a practical implementation of our approach mitigates covariate shift on benchmark MuJoCo continuous control tasks. We demonstrate that with behavior policies whose performances are less than half of that of the expert, MILO still successfully imitates with an extremely low number of expert state-action pairs while traditional offline IL methods such as behavior cloning (BC) fail completely. Source code is provided at https://github.com/jdchang1/milo.

Submitted 31 January, 2022; v1 submitted 6 June, 2021; originally announced June 2021.

Comments: 42 pages, 5 figures, 7 tables

arXiv:2103.14029 [pdf, ps, other]

Causal Inference Under Unmeasured Confounding With Negative Controls: A Minimax Learning Approach

Authors: Nathan Kallus, Xiaojie Mao, Masatoshi Uehara

Abstract: We study the estimation of causal parameters when not all confounders are observed and instead negative controls are available. Recent work has shown how these can enable identification and efficient estimation via two so-called bridge functions. In this paper, we tackle the primary challenge to causal inference using negative controls: the identification and estimation of these bridge functions. Previous work has relied on completeness conditions on these functions to identify the causal parameters and required uniqueness assumptions in estimation, and they also focused on parametric estimation of bridge functions. Instead, we provide a new identification strategy that avoids the completeness condition. And, we provide new estimators for these functions based on minimax learning formulations. These estimators accommodate general function classes such as Reproducing Kernel Hilbert Spaces and neural networks. We study finite-sample convergence results both for estimating bridge functions themselves and for the final estimation of the causal parameter under a variety of combinations of assumptions. We avoid uniqueness conditions on the bridge functions as much as possible.

Submitted 9 October, 2022; v1 submitted 25 March, 2021; originally announced March 2021.

arXiv:2102.02981 [pdf, ps, other]

Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency

Authors: Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, Tengyang Xie

Abstract: We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning using function approximation for marginal importance weights and $q$-functions when these are estimated using recent minimax methods. Under various combinations of realizability and completeness assumptions, we show that the minimax approach enables us to achieve a fast rate of convergence for weights and quality functions, characterized by the critical inequality \citep{bartlett2005}. Based on this result, we analyze convergence rates for OPE. In particular, we introduce novel alternative completeness conditions under which OPE is feasible and we present the first finite-sample result with first-order efficiency in non-tabular environments, i.e., having the minimal coefficient in the leading term.

Submitted 24 July, 2022; v1 submitted 4 February, 2021; originally announced February 2021.

Comments: Under Review

arXiv:2102.00479 [pdf, other]

Fast Rates for the Regret of Offline Reinforcement Learning

Authors: Yichun Hu, Nathan Kallus, Masatoshi Uehara

Abstract: We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP). While existing analyses of common approaches, such as fitted $Q$-iteration (FQI), suggest a $O(1/\sqrt{n})$ convergence for regret, empirical behavior exhibits \emph{much} faster convergence. In this paper, we present a finer regret analysis that exactly characterizes this phenomenon by providing fast rates for the regret convergence. First, we show that given any estimate for the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate, thus speeding it up. The level of exponentiation depends on the level of noise in the \emph{decision-making} problem, rather than the estimation problem. We establish such noise levels for linear and tabular MDPs as examples. Second, we provide new analyses of FQI and Bellman residual minimization to establish the correct pointwise convergence guarantees. As specific cases, our results imply $O(1/n)$ regret rates in linear cases and $\exp(-Ω(n))$ regret rates in tabular cases. We extend our findings to general function approximation by extending our results to regret guarantees based on $L_p$-convergence rates for estimating $Q^*$ rather than pointwise rates, where $L_2$ guarantees for nonparametric $Q^*$-estimation can be ensured under mild conditions.

Submitted 12 July, 2023; v1 submitted 31 January, 2021; originally announced February 2021.

arXiv:2010.11002 [pdf, other]

Optimal Off-Policy Evaluation from Multiple Logging Policies

Authors: Nathan Kallus, Yuta Saito, Masatoshi Uehara

Abstract: We study off-policy evaluation (OPE) from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling. Previous work noted that in this setting the ordering of the variances of different importance sampling estimators is instance-dependent, which brings up a dilemma as to which importance sampling weights to use. In this paper, we resolve this dilemma by finding the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one. In particular, we establish the efficiency bound under stratified sampling and propose an estimator achieving this bound when given consistent $q$-estimates. To guard against misspecification of $q$-functions, we also provide a way to choose the control variate in a hypothesis class to minimize variance. Extensive experiments demonstrate the benefits of our methods' efficient leveraging of the stratified sampling of off-policy data from multiple loggers.

Submitted 21 October, 2020; originally announced October 2020.

Comments: Under Review

arXiv:2006.03900 [pdf, other]

Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies

Authors: Nathan Kallus, Masatoshi Uehara

Abstract: Offline reinforcement learning, wherein one uses off-policy data logged by a fixed behavior policy to evaluate and learn new policies, is crucial in applications where experimentation is limited such as medicine. We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous. Targeting deterministic policies, for which action is a deterministic function of state, is crucial since optimal policies are always deterministic (up to ties). In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist. To circumvent this issue, we propose several new doubly robust estimators based on different kernelization approaches. We analyze the asymptotic mean-squared error of each of these under mild rate conditions for nuisance estimators. Specifically, we demonstrate how to obtain a rate that is independent of the horizon length.

Submitted 6 June, 2020; originally announced June 2020.

arXiv:2006.03886 [pdf, other]

Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning

Authors: Nathan Kallus, Masatoshi Uehara

Abstract: We study the efficient off-policy evaluation of natural stochastic policies, which are defined in terms of deviations from the behavior policy. This is a departure from the literature on off-policy evaluation where most work considers the evaluation of explicitly specified policies. Crucially, offline reinforcement learning with natural stochastic policies can help alleviate issues of weak overlap, lead to policies that build upon current practice, and improve policies' implementability in practice. Compared with the classic case of a pre-specified evaluation policy, when evaluating natural stochastic policies, the efficiency bound, which measures the best-achievable estimation error, is inflated since the evaluation policy itself is unknown. In this paper, we derive the efficiency bounds of two major types of natural stochastic policies: tilting policies and modified treatment policies. We then propose efficient nonparametric estimators that attain the efficiency bounds under very lax conditions. These also enjoy a (partial) double robustness property.

Submitted 3 November, 2020; v1 submitted 6 June, 2020; originally announced June 2020.

Comments: Under review

arXiv:2002.11642 [pdf, ps, other]

Off-Policy Evaluation and Learning for External Validity under a Covariate Shift

Authors: Masahiro Kato, Masatoshi Uehara, Shota Yasui

Abstract: We consider evaluating and training a new policy for the evaluation data by using the historical data obtained from a different policy. The goal of off-policy evaluation (OPE) is to estimate the expected reward of a new policy over the evaluation data, and that of off-policy learning (OPL) is to find a new policy that maximizes the expected reward over the evaluation data. Although the standard OPE and OPL assume the same covariate distribution for the historical and evaluation data, a covariate shift often exists, i.e., the covariate distribution of the historical data is different from that of the evaluation data. In this paper, we derive the efficiency bound of OPE under a covariate shift. Then, we propose doubly robust and efficient estimators for OPE and OPL under a covariate shift by using a nonparametric estimator of the density ratio between the historical and evaluation data distributions. We also discuss other possible estimators and compare their theoretical properties. Finally, we confirm the effectiveness of the proposed estimators through experiments.

Submitted 15 October, 2020; v1 submitted 26 February, 2020; originally announced February 2020.

arXiv:2002.04014 [pdf, other]

Statistically Efficient Off-Policy Policy Gradients

Authors: Nathan Kallus, Masatoshi Uehara

Abstract: Policy gradient methods in reinforcement learning update policy parameters by taking steps in the direction of an estimated gradient of policy value. In this paper, we consider the statistically efficient estimation of policy gradients from off-policy data, where the estimation is particularly non-trivial. We derive the asymptotic lower bound on the feasible mean-squared error in both Markov and non-Markov decision processes and show that existing estimators fail to achieve it in general settings. We propose a meta-algorithm that achieves the lower bound without any parametric assumptions and exhibits a unique 3-way double robustness property. We discuss how to estimate nuisances that the algorithm relies on. Finally, we establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.

Submitted 20 February, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

arXiv:1912.12945 [pdf, other]

Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond

Authors: Nathan Kallus, Xiaojie Mao, Masatoshi Uehara

Abstract: We consider estimating a low-dimensional parameter in an estimating equation involving high-dimensional nuisances that depend on the parameter. A central example is the efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference, which involves as a nuisance the covariate-conditional cumulative distribution function evaluated at the quantile to be estimated. Debiased machine learning (DML) is a data-splitting approach to estimating high-dimensional nuisances using flexible machine learning methods, but applying it to problems with parameter-dependent nuisances is impractical. For (L)QTE, DML requires we learn the whole covariate-conditional cumulative distribution function. We instead propose localized debiased machine learning (LDML), which avoids this burdensome step and needs only estimate nuisances at a single initial rough guess for the parameter. For (L)QTE, LDML involves learning just two regression functions, a standard task for machine learning methods. We prove that under lax rate conditions our estimator has the same favorable asymptotic behavior as the infeasible estimator that uses the unknown true nuisances. Thus, LDML notably enables practically-feasible and theoretically-grounded efficient estimation of important quantities in causal inference such as (L)QTEs when we must control for many covariates and/or flexible relationships, as we demonstrate in empirical studies.

Submitted 17 August, 2022; v1 submitted 30 December, 2019; originally announced December 2019.

arXiv:1910.12809 [pdf, other]

Minimax Weight and Q-Function Learning for Off-Policy Evaluation

Authors: Masatoshi Uehara, Jiawei Huang, Nan Jiang

Abstract: We provide theoretical investigations into off-policy evaluation in reinforcement learning using function approximators for (marginalized) importance weights and value functions. Our contributions include: (1) A new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work (Liu et al., 2018). (2) Another new estimator, MQL, obtained by swapping the roles of importance weights and value-functions in MWL. MQL has an intuitive interpretation of minimizing average Bellman errors and can be combined with MWL in a doubly robust manner. (3) Several additional results that offer further insights into these methods, including the sample complexity analyses of MWL and MQL, their asymptotic optimality in the tabular setting, how the learned importance weights depend on the choice of the discriminator class, and how our methods provide a unified view of some old and new algorithms in RL.

Submitted 6 October, 2020; v1 submitted 28 October, 2019; originally announced October 2019.

arXiv:1909.05850 [pdf, other]

Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning

Authors: Nathan Kallus, Masatoshi Uehara

Abstract: Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. But, in time-invariant Markov decision processes, our bounds show that truly-off-policy evaluation is feasible, even with only just one dependent trajectory, and provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE using the efficient influence function we derive. Our DRL estimator simultaneously uses estimated stationary density ratios and $q$-functions and remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.

Submitted 15 January, 2023; v1 submitted 12 September, 2019; originally announced September 2019.

Comments: In V3, we significantly changed the derivation of the efficiency bound to follow standard (iid) semiparametric theory. We also derive the efficient influence function. In V4, we add an experiment in a continuous-state environment employing function approximation. In v6, we fixed several typos. Please refer to this version as the final version

arXiv:1908.08526 [pdf, ps, other]

Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes

Authors: Nathan Kallus, Masatoshi Uehara

Abstract: Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. We show existing OPE estimators may fail to be efficient in this setting. We develop a new estimator based on cross-fold estimation of $q$-functions and marginalized density ratios, which we term double reinforcement learning (DRL). We show that DRL is efficient when both components are estimated at fourth-root rates and is also doubly robust when only one component is consistent. We investigate these properties empirically and demonstrate the performance benefits due to harnessing memorylessness.

Submitted 5 June, 2020; v1 submitted 22 August, 2019; originally announced August 2019.

arXiv:1906.03735 [pdf, ps, other]

Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning

Authors: Nathan Kallus, Masatoshi Uehara

Abstract: Off-policy evaluation (OPE) in both contextual bandits and reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. The problem's importance has attracted many proposed solutions, including importance sampling (IS), self-normalized IS (SNIS), and doubly robust (DR) estimates. DR and its variants ensure semiparametric local efficiency if Q-functions are well-specified, but if they are not they can be worse than both IS and SNIS. It also does not enjoy SNIS's inherent stability and boundedness. We propose new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS. On the way, we categorize various properties and classify existing estimators by them. Besides the theoretical guarantees, empirical studies suggest the new estimators provide advantages.

Submitted 9 June, 2019; originally announced June 2019.

arXiv:1905.05976 [pdf, ps, other]

Information criteria for non-normalized models

Authors: Takeru Matsuda, Masatoshi Uehara, Aapo Hyvarinen

Abstract: Many statistical models are given in the form of non-normalized densities with an intractable normalization constant. Since maximum likelihood estimation is computationally intensive for these models, several estimation methods have been developed which do not require explicit computation of the normalization constant, such as noise contrastive estimation (NCE) and score matching. However, model selection methods for general non-normalized models have not been proposed so far. In this study, we develop information criteria for non-normalized models estimated by NCE or score matching. They are approximately unbiased estimators of discrepancy measures for non-normalized models. Simulation results and applications to real data demonstrate that the proposed criteria enable selection of the appropriate non-normalized model in a data-driven manner.

Submitted 27 July, 2021; v1 submitted 15 May, 2019; originally announced May 2019.

Journal ref: Journal of Machine Learning Research, 22(158):1--33, 2021

arXiv:1903.03630 [pdf, ps, other]

Imputation estimators for unnormalized models with missing data

Authors: Masatoshi Uehara, Takeru Matsuda, Jae Kwang Kim

Abstract: Several statistical models are given in the form of unnormalized densities, and calculation of the normalization constant is intractable. We propose estimation methods for such unnormalized models with missing data. The key concept is to combine imputation techniques with estimators for unnormalized models including noise contrastive estimation and score matching. In addition, we derive asymptotic distributions of the proposed estimators and construct confidence intervals. Simulation results with truncated Gaussian graphical models and the application to real data of wind direction reveal that the proposed methods effectively enable statistical inference with unnormalized models from missing data.

Submitted 8 June, 2020; v1 submitted 8 March, 2019; originally announced March 2019.

Comments: To appear (AISTATS 2020)

arXiv:1901.07710 [pdf, other]

Unified estimation framework for unnormalized models with statistical efficiency

Authors: Masatoshi Uehara, Takafumi Kanamori, Takashi Takenouchi, Takeru Matsuda

Abstract: The parameter estimation of unnormalized models is a challenging problem. The maximum likelihood estimation (MLE) is computationally infeasible for these models since normalizing constants are not explicitly calculated. Although some consistent estimators have been proposed earlier, the problem of statistical efficiency remains. In this study, we propose a unified, statistically efficient estimation framework for unnormalized models and several efficient estimators, whose asymptotic variance is the same as the MLE. The computational cost of these estimators is also reasonable and they can be employed whether the sample space is discrete or continuous. The loss functions of the proposed estimators are derived by combining the following two methods: (1) density-ratio matching using Bregman divergence, and (2) plugging-in nonparametric estimators. We also analyze the properties of the proposed estimators when the unnormalized models are misspecified. The experimental results demonstrate the advantages of our method over existing approaches.

Submitted 5 June, 2020; v1 submitted 22 January, 2019; originally announced January 2019.

Comments: To appear at AISTATS 2020

arXiv:1810.12519 [pdf, ps, other]

Semiparametric response model with nonignorable nonresponse

Authors: Masatoshi Uehara, Jae Kwang Kim

Abstract: How to deal with nonignorable response is often a challenging problem encountered in statistical analysis with missing data. Parametric model assumption for the response mechanism is often made and there is no way to validate the model assumption with missing data. We consider a semiparametric response model that relaxes the parametric model assumption in the response mechanism. Two types of efficient estimators, profile maximum likelihood estimator and profile calibration estimator, are proposed and their asymptotic properties are investigated. Two extensive simulation studies are used to compare with some existing methods. We present an application of our method using Korean Labor and Income Panel Survey data.

Submitted 30 October, 2018; originally announced October 2018.

arXiv:1808.07983 [pdf, other]

Analysis of Noise Contrastive Estimation from the Perspective of Asymptotic Variance

Authors: Masatoshi Uehara, Takeru Matsuda, Fumiyasu Komaki

Abstract: There are many models, often called unnormalized models, whose normalizing constants are not calculated in closed form. Maximum likelihood estimation is not directly applicable to unnormalized models. Score matching, contrastive divergence method, pseudo-likelihood, Monte Carlo maximum likelihood, and noise contrastive estimation (NCE) are popular methods for estimating parameters of such models. In this paper, we focus on NCE. The estimator derived from NCE is consistent and asymptotically normal because it is an M-estimator. NCE characteristically uses an auxiliary distribution to calculate the normalizing constant in the same spirit of the importance sampling. In addition, there are several candidates as objective functions of NCE. We focus on how to reduce asymptotic variance. First, we propose a method for reducing asymptotic variance by estimating the parameters of the auxiliary distribution. Then, we determine the form of the objective functions, where the asymptotic variance takes the smallest values in the original estimator class and the proposed estimator classes. We further analyze the robustness of the estimator.

Submitted 23 August, 2018; originally announced August 2018.

arXiv:1610.02920 [pdf, other]

Generative Adversarial Nets from a Density Ratio Estimation Perspective

Authors: Masatoshi Uehara, Issei Sato, Masahiro Suzuki, Kotaro Nakayama, Yutaka Matsuo

Abstract: Generative adversarial networks (GANs) are successful deep generative models. GANs are based on a two-player minimax game. However, the objective function derived in the original motivation is changed to obtain stronger gradients when learning the generator. We propose a novel algorithm that repeats the density ratio estimation and f-divergence minimization. Our algorithm offers a new perspective toward the understanding of GANs and is able to make use of multiple viewpoints obtained in the research of density ratio estimation, e.g. what divergence is stable and relative density ratio is useful.

Submitted 9 November, 2016; v1 submitted 10 October, 2016; originally announced October 2016.

Comments: Add contents especially theoretical things for ICLR 2017

arXiv:1605.06569 [pdf, other]

doi 10.3847/1538-4365/227/2/21

The Atacama Cosmology Telescope: The polarization-sensitive ACTPol instrument

Authors: R. J. Thornton, P. A. R. Ade, S. Aiola, F. E. Angile, M. Amiri, J. A. Beall, D. T. Becker, H-M. Cho, S. K. Choi, P. Corlies, K. P. Coughlin, R. Datta, M. J. Devlin, S. R. Dicker, R. Dunner, J. W. Fowler, A. E. Fox, P. A. Gallardo, J. Gao, E. Grace, M. Halpern, M. Hasselfield, S. W. Henderson, G. C. Hilton, A. D. Hincks , et al. (31 additional authors not shown)

Abstract: The Atacama Cosmology Telescope (ACT) is designed to make high angular resolution measurements of anisotropies in the Cosmic Microwave Background (CMB) at millimeter wavelengths. We describe ACTPol, an upgraded receiver for ACT, which uses feedhorn-coupled, polarization-sensitive detector arrays, a 3 degree field of view, 100 mK cryogenics with continuous cooling, and meta material anti-reflection coatings. ACTPol comprises three arrays with separate cryogenic optics: two arrays at a central frequency of 148 GHz and one array operating simultaneously at both 97 GHz and 148 GHz. The combined instrument sensitivity, angular resolution, and sky coverage are optimized for measuring angular power spectra, clusters via the thermal Sunyaev-Zel'dovich and kinetic Sunyaev-Zel'dovich signals, and CMB lensing due to large scale structure. The receiver was commissioned with its first 148 GHz array in 2013, observed with both 148 GHz arrays in 2014, and has recently completed its first full season of operations with the full suite of three arrays. This paper provides an overview of the design and initial performance of the receiver and related systems.

Submitted 20 May, 2016; originally announced May 2016.

arXiv:1405.5524 [pdf, other]

doi 10.1088/1475-7516/2014/10/007

The Atacama Cosmology Telescope: CMB Polarization at $200<\ell<9000$

Authors: Sigurd Naess, Matthew Hasselfield, Jeff McMahon, Michael D. Niemack, Graeme E. Addison, Peter A. R. Ade, Rupert Allison, Mandana Amiri, Nick Battaglia, James A. Beall, Francesco de Bernardis, J Richard Bond, Joe Britton, Erminia Calabrese, Hsiao-mei Cho, Kevin Coughlin, Devin Crichton, Sudeep Das, Rahul Datta, Mark J. Devlin, Simon R. Dicker, Joanna Dunkley, Rolando Dünner, Joseph W. Fowler, Anna E. Fox , et al. (53 additional authors not shown)

Abstract: We report on measurements of the cosmic microwave background (CMB) and celestial polarization at 146 GHz made with the Atacama Cosmology Telescope Polarimeter (ACTPol) in its first three months of observing. Four regions of sky covering a total of 270 square degrees were mapped with an angular resolution of $1.3'$. The map noise levels in the four regions are between 11 and 17 $μ$K-arcmin. We present TT, TE, EE, TB, EB, and BB power spectra from three of these regions. The observed E-mode polarization power spectrum, displaying six acoustic peaks in the range $200<\ell<3000$, is an excellent fit to the prediction of the best-fit cosmological models from WMAP9+ACT and Planck data. The polarization power spectrum, which mainly reflects primordial plasma velocity perturbations, provides an independent determination of cosmological parameters consistent with those based on the temperature power spectrum, which results mostly from primordial density perturbations. We find that without masking any point sources in the EE data at $\ell<9000$, the Poisson tail of the EE power spectrum due to polarized point sources has an amplitude less than $2.4$ $μ$K$^2$ at $\ell = 3000$ at 95\% confidence. Finally, we report that the Crab Nebula, an important polarization calibration source at microwave frequencies, has 8.7\% polarization with an angle of $150.7^\circ \pm 0.6^\circ$ when smoothed with a $5'$ Gaussian beam.

Submitted 21 September, 2014; v1 submitted 21 May, 2014; originally announced May 2014.

Comments: 16 pages, 15 figures, 5 tables

arXiv:1301.0824 [pdf, other]

doi 10.1088/1475-7516/2013/10/060

The Atacama Cosmology Telescope: Cosmological parameters from three seasons of data

Authors: Jonathan L. Sievers, Renée A. Hlozek, Michael R. Nolta, Viviana Acquaviva, Graeme E. Addison, Peter A. R. Ade, Paula Aguirre, Mandana Amiri, John William Appel, L. Felipe Barrientos, Elia S. Battistelli, Nick Battaglia, J. Richard Bond, Ben Brown, Bryce Burger, Erminia Calabrese, Jay Chervenak, Devin Crichton, Sudeep Das, Mark J. Devlin, Simon R. Dicker, W. Bertrand Doriese, Joanna Dunkley, Rolando Dünner, Thomas Essinger-Hileman , et al. (68 additional authors not shown)

Abstract: We present constraints on cosmological and astrophysical parameters from high-resolution microwave background maps at 148 GHz and 218 GHz made by the Atacama Cosmology Telescope (ACT) in three seasons of observations from 2008 to 2010. A model of primary cosmological and secondary foreground parameters is fit to the map power spectra and lensing deflection power spectrum, including contributions from both the thermal Sunyaev-Zeldovich (tSZ) effect and the kinematic Sunyaev-Zeldovich (kSZ) effect, Poisson and correlated anisotropy from unresolved infrared sources, radio sources, and the correlation between the tSZ effect and infrared sources. The power $\ell^2 C_\ell/2\pi$ of the thermal SZ power spectrum at 148 GHz is measured to be $3.4 \pm 1.4$ $μ$K$^2$ at $\ell=3000$, while the corresponding amplitude of the kinematic SZ power spectrum has a 95% confidence level upper limit of 8.6 $μ$K$^2$. Combining ACT power spectra with the WMAP 7-year temperature and polarization power spectra, we find excellent consistency with the $\Lambda$CDM model. We constrain the number of effective relativistic degrees of freedom in the early universe to be $N_{\rm eff}=2.79 \pm 0.56$, in agreement with the canonical value of $N_{\rm eff}=3.046$ for three massless neutrinos. We constrain the sum of the neutrino masses to be $\Sigma m_\nu < 0.39$ eV at 95% confidence when combining ACT and WMAP 7-year data with BAO and Hubble constant measurements. We constrain the amount of primordial helium to be $Y_p = 0.225 \pm 0.034$, and measure no variation in the fine structure constant $\alpha$ since recombination, with $\alpha/\alpha_0 = 1.004 \pm 0.005$. We also find no evidence for any running of the scalar spectral index, $dn_s/d\ln k = -0.004 \pm 0.012$.

Submitted 11 October, 2013; v1 submitted 4 January, 2013; originally announced January 2013.

Comments: 26 pages, 22 figures. This paper is a companion to Das et al. (2013) and Dunkley et al. (2013). Matches published JCAP version

arXiv:1012.1996 [pdf]

doi 10.1016/j.physc.2011.05.088

Intrinsic pinning property of FeSe0.5Te0.5

Authors: M. Migita, Y. Takikawa, M. Takeda, M. Uehara, T. Kuramoto, Y. Takano, Y. Mizuguchi, Y. Kimishima

Abstract: The intrinsic pinning properties of FeSe0.5Te0.5, which is the superconductor with Tc of about 14 K, were studied by the analysis of magnetization curves by the extended critical state model. In the magnetization measurements by SQUID magnetometer, the external magnetic fields were applied parallel and perpendicular to c-axis of the sample. The critical current density Jc's under the perpendicular field of 1 T were estimated by using the Kimishima model as about 1.6 x 10^4, 8.8 x 10^3, 4.1 x 10^3, and 1.5 x 10^3 A/cm^2 at 5, 7, 9, and 11 K, respectively, and the temperature dependence of Jc could be fitted with the exponential law of Jc(0) x exp(-αT/Tc) up to 9 K and power law of Jc(0) x (1-T/Tc)^n near Tc.

Submitted 9 December, 2010; originally announced December 2010.

Comments: 19 pages, 7 figures

arXiv:0811.3483 [pdf]

doi 10.1143/JPSJ.78.033702

New anti-perovskite-type Superconductor ZnNyNi3

Authors: Masatomo Uehara, Akira Uehara, Katsuya Kozawa, Yoshihide Kimishima

Abstract: We have synthesized a new superconductor ZnNyNi3 with Tc ~3 K. The crystal structure is of the same anti-perovskite type as MgCNi3 and CdCNi3. As far as we know, this is the third superconducting material in the Ni-based anti-perovskite series. For this material, the superconducting parameters, lower-critical field Hc1(0), upper-critical field Hc2(0), coherence length ξ(0), penetration depth λ(0), and Ginzburg-Landau parameter κ(0), have been experimentally determined.

Submitted 21 November, 2008; originally announced November 2008.

Comments: 13 pages, 3 figures, 1 table

arXiv:0810.0350 [pdf]

Carrier doping to pseudo-low-dimensional compound La2RuO5

Authors: Masatomo Uehara, Kenich Ashikawa, Yoshimasa Aka, Yoshihide Kimishima

Abstract: Hole carrier doping has been tried to pseudo-low-dimensional material La2RuO5 by substituting La3+ with Cd2+. Single phased samples of La2-xCdxRuO5 with x up to 0.5 have been successfully obtained and also high pressure O2 annealing has been performed to the x=0.5 sample. Although the formal ionic state of Ru is expected to increase from 4+ (at x=0) to 4.5+ (at x=0.5), the magnetic and electrical properties show no significant changes in as-sintered samples. In contrast, high pressure O2 annealed x=0.5 samples show a little reduction of electrical resistivity and the decrease of thermoelectric power at 260 K. From these results, it can be speculated that the doped carriers are mostly compensated by oxygen deficiency in as-sintered samples.

Submitted 2 October, 2008; originally announced October 2008.

Comments: 10 pages, 7 figures

Showing 1–50 of 71 results for author: Uehara, M