Search | arXiv e-print repository

Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning

Authors: Yuhui Wang, Qingyuan Wu, Weida Li, Dylan R. Ashley, Francesco Faccio, Chao Huang, Jürgen Schmidhuber

Abstract: The Value Iteration Network (VIN) is an end-to-end differentiable architecture that performs value iteration on a latent MDP for planning in reinforcement learning (RL). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a $100\times 100$ maze -- a task which typically requires thousands of planning steps to solve. We observe that this deficiency is due… ▽ More The Value Iteration Network (VIN) is an end-to-end differentiable architecture that performs value iteration on a latent MDP for planning in reinforcement learning (RL). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a $100\times 100$ maze -- a task which typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module's depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, introducing an "adaptive highway loss" that constructs skip connections to improve gradient flow. We evaluate our method on both 2D maze navigation environments and the ViZDoom 3D navigation benchmark. We find that our new method, named Dynamic Transition VIN (DT-VIN), easily scales to 5000 layers and casually solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in RL environments. △ Less

Submitted 12 June, 2024; originally announced June 2024.

ACM Class: I.2.6

arXiv:2406.03485 [pdf, other]

Highway Value Iteration Networks

Authors: Yuhui Wang, Weida Li, Francesco Faccio, Qingyuan Wu, Jürgen Schmidhuber

Abstract: Value iteration networks (VINs) enable end-to-end learning for planning tasks by employing a differentiable "planning module" that approximates the value iteration algorithm. However, long-term planning remains a challenge because training very deep VINs is difficult. To address this problem, we embed highway value iteration -- a recent algorithm designed to facilitate long-term credit assignment… ▽ More Value iteration networks (VINs) enable end-to-end learning for planning tasks by employing a differentiable "planning module" that approximates the value iteration algorithm. However, long-term planning remains a challenge because training very deep VINs is difficult. To address this problem, we embed highway value iteration -- a recent algorithm designed to facilitate long-term credit assignment -- into the structure of VINs. This improvement augments the "planning module" of the VIN with three additional components: 1) an "aggregate gate," which constructs skip connections to improve information flow across many layers; 2) an "exploration module," crafted to increase the diversity of information and gradient flow in spatial dimensions; 3) a "filter gate" designed to ensure safe exploration. The resulting novel highway VIN can be trained effectively with hundreds of layers using standard backpropagation. In long-term planning tasks requiring hundreds of planning steps, deep highway VINs outperform both traditional VINs and several advanced, very deep NNs. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: ICML 2024

arXiv:2405.18289 [pdf, other]

Highway Reinforcement Learning

Authors: Yuhui Wang, Miroslav Strupl, Francesco Faccio, Qingyuan Wu, Haozhe Liu, Michał Grudzień, Xiaoyang Tan, Jürgen Schmidhuber

Abstract: Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL). Approaches based on importance sampling (IS) often suffer from large variances due to products of IS ratios. Typical IS-free methods, such as $n$-step Q-learning, look ahead for $n$ time steps along the trajectory of actions (where $n$ is called the lookahead depth) and utilize… ▽ More Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL). Approaches based on importance sampling (IS) often suffer from large variances due to products of IS ratios. Typical IS-free methods, such as $n$-step Q-learning, look ahead for $n$ time steps along the trajectory of actions (where $n$ is called the lookahead depth) and utilize off-policy data directly without any additional adjustment. They work well for proper choices of $n$. We show, however, that such IS-free methods underestimate the optimal value function (VF), especially for large $n$, restricting their capacity to efficiently utilize information from distant future time steps. To overcome this problem, we introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF. At its core lies a simple but non-trivial \emph{highway gate}, which controls the information flow from the distant future by comparing it to a threshold. The highway gate guarantees convergence to the optimal VF for arbitrary $n$ and arbitrary behavioral policies. It gives rise to a novel family of off-policy RL algorithms that safely learn even when $n$ is very large, facilitating rapid credit assignment from the far future to the past. On tasks with greatly delayed rewards, including video games where the reward is given only at the end of the game, our new methods outperform many existing multi-step off-policy algorithms. △ Less

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2404.08093 [pdf, other]

Towards a Robust Soft Baby Robot With Rich Interaction Ability for Advanced Machine Learning Algorithms

Authors: Mohannad Alhakami, Dylan R. Ashley, Joel Dunham, Francesco Faccio, Eric Feron, Jürgen Schmidhuber

Abstract: Artificial intelligence has made great strides in many areas lately, yet it has had comparatively little success in general-use robotics. We believe one of the reasons for this is the disconnect between traditional robotic design and the properties needed for open-ended, creativity-based AI systems. To that end, we, taking selective inspiration from nature, build a robust, partially soft robotic l… ▽ More Artificial intelligence has made great strides in many areas lately, yet it has had comparatively little success in general-use robotics. We believe one of the reasons for this is the disconnect between traditional robotic design and the properties needed for open-ended, creativity-based AI systems. To that end, we, taking selective inspiration from nature, build a robust, partially soft robotic limb with a large action space, rich sensory data stream from multiple cameras, and the ability to connect with others to enhance the action space and data stream. As a proof of concept, we train two contemporary machine learning algorithms to perform a simple target-finding task. Altogether, we believe that this design serves as a first step to building a robot tailor-made for achieving artificial general intelligence. △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: 5 pages in main text + 1 page of references, 7 figures in main text; source code available at https://github.com/dylanashley/robot-limb-testai

ACM Class: I.2.9; I.2.6

arXiv:2404.02747 [pdf, other]

Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models

Authors: Wentian Zhang, Haozhe Liu, **heng Xie, Francesco Faccio, Mike Zheng Shou, Jürgen Schmidhuber

Abstract: This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after few inference steps. Accordingly, the time point of convergence naturally divides the entire inference process into two stages: an initial semantics-planning stage, during which, the model relies on cross-attention to plan text-… ▽ More This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after few inference steps. Accordingly, the time point of convergence naturally divides the entire inference process into two stages: an initial semantics-planning stage, during which, the model relies on cross-attention to plan text-oriented visual semantics, and a subsequent fidelity-improving stage, during which the model tries to generate images from previously planned semantics. Surprisingly, ignoring text conditions in the fidelity-improving stage not only reduces computation complexity, but also maintains model performance. This yields a simple and training-free method called TGATE for efficient generation, which caches the cross-attention output once it converges and keeps it fixed during the remaining inference steps. Our empirical study on the MS-COCO validation set confirms its effectiveness. The source code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE. △ Less

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2403.11998 [pdf, other]

Learning Useful Representations of Recurrent Neural Network Weight Matrices

Authors: Vincent Herrmann, Francesco Faccio, Jürgen Schmidhuber

Abstract: Recurrent Neural Networks (RNNs) are general-purpose parallel-sequential computers. The program of an RNN is its weight matrix. How to learn useful representations of RNN weights that facilitate RNN analysis as well as downstream tasks? While the mechanistic approach directly looks at some RNN's weights to predict its behavior, the functionalist approach analyzes its overall functionality-specific… ▽ More Recurrent Neural Networks (RNNs) are general-purpose parallel-sequential computers. The program of an RNN is its weight matrix. How to learn useful representations of RNN weights that facilitate RNN analysis as well as downstream tasks? While the mechanistic approach directly looks at some RNN's weights to predict its behavior, the functionalist approach analyzes its overall functionality-specifically, its input-output map**. We consider several mechanistic approaches for RNN weights and adapt the permutation equivariant Deep Weight Space layer for RNNs. Our two novel functionalist approaches extract information from RNN weights by 'interrogating' the RNN through probing inputs. We develop a theoretical framework that demonstrates conditions under which the functionalist approach can generate rich representations that help determine RNN behavior. We release the first two 'model zoo' datasets for RNN weight representation learning. One consists of generative models of a class of formal languages, and the other one of classifiers of sequentially processed MNIST digits.With the help of an emulation-based self-supervised learning technique we compare and evaluate the different RNN weight encoding techniques on multiple downstream applications. On the most challenging one, namely predicting which exact task the RNN was trained on, functionalist approaches show clear superiority. △ Less

Submitted 18 June, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

ACM Class: I.2.6

arXiv:2402.16823 [pdf, other]

Language Agents as Optimizable Graphs

Authors: Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, Jürgen Schmidhuber

Abstract: Various human-designed prompt engineering techniques have been proposed to improve problem solvers based on Large Language Models (LLMs), yielding many disparate code bases. We unify these approaches by describing LLM-based agents as computational graphs. The nodes implement functions to process multimodal data or query LLMs, and the edges describe the information flow between operations. Graphs c… ▽ More Various human-designed prompt engineering techniques have been proposed to improve problem solvers based on Large Language Models (LLMs), yielding many disparate code bases. We unify these approaches by describing LLM-based agents as computational graphs. The nodes implement functions to process multimodal data or query LLMs, and the edges describe the information flow between operations. Graphs can be recursively combined into larger composite graphs representing hierarchies of inter-agent collaboration (where edges connect operations of different agents). Our novel automatic graph optimizers (1) refine node-level LLM prompts (node optimization) and (2) improve agent orchestration by changing graph connectivity (edge optimization). Experiments demonstrate that our framework can be used to efficiently develop, integrate, and automatically improve various LLM agents. The code can be found at https://github.com/metauto-ai/gptswarm. △ Less

Submitted 27 February, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

Comments: Project Website: https://gptswarm.org ; Github Repo: https://github.com/metauto-ai/gptswarm ; Replace to fix typos

arXiv:2309.11197 [pdf, other]

The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute

Authors: Aleksandar Stanić, Dylan Ashley, Oleg Serikov, Louis Kirsch, Francesco Faccio, Jürgen Schmidhuber, Thomas Hofmann, Imanol Schlag

Abstract: The Languini Kitchen serves as both a research collective and codebase designed to empower researchers with limited computational resources to contribute meaningfully to the field of language modelling. We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. The number of tokens on which a model is trained is defined by the m… ▽ More The Languini Kitchen serves as both a research collective and codebase designed to empower researchers with limited computational resources to contribute meaningfully to the field of language modelling. We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. The number of tokens on which a model is trained is defined by the model's throughput and the chosen compute class. Notably, this approach avoids constraints on critical hyperparameters which affect total parameters or floating-point operations. For evaluation, we pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. On it, we compare methods based on their empirical scaling trends which are estimated through experiments at various levels of compute. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput. While the GPT baseline achieves better perplexity throughout all our levels of compute, our LSTM baseline exhibits a predictable and more favourable scaling law. This is due to the improved throughput and the need for fewer training tokens to achieve the same decrease in test perplexity. Extrapolating the scaling laws leads of both models results in an intersection at roughly 50,000 accelerator hours. We hope this work can serve as the foundation for meaningful and reproducible language modelling research. △ Less

Submitted 20 September, 2023; originally announced September 2023.

arXiv:2308.07795 [pdf, other]

Learning to Identify Critical States for Reinforcement Learning from Videos

Authors: Haozhe Liu, Mingchen Zhuge, Bing Li, Yuhui Wang, Francesco Faccio, Bernard Ghanem, Jürgen Schmidhuber

Abstract: Recent work on deep reinforcement learning (DRL) has pointed out that algorithmic information about good policies can be extracted from offline data which lack explicit information about executed actions. For example, videos of humans or robots may convey a lot of implicit information about rewarding action sequences, but a DRL machine that wants to profit from watching such videos must first lear… ▽ More Recent work on deep reinforcement learning (DRL) has pointed out that algorithmic information about good policies can be extracted from offline data which lack explicit information about executed actions. For example, videos of humans or robots may convey a lot of implicit information about rewarding action sequences, but a DRL machine that wants to profit from watching such videos must first learn by itself to identify and recognize relevant states/actions/rewards. Without relying on ground-truth annotations, our new method called Deep State Identifier learns to predict returns from episodes encoded as videos. Then it uses a kind of mask-based sensitivity analysis to extract/identify important critical states. Extensive experiments showcase our method's potential for understanding and improving agent behavior. The source code and the generated datasets are available at https://github.com/AI-Initiative-KAUST/VideoRLCS. △ Less

Submitted 15 August, 2023; originally announced August 2023.

Comments: This paper was accepted to ICCV23

arXiv:2305.17066 [pdf, other]

Mindstorms in Natural Language-Based Societies of Mind

Authors: Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, Louis Kirsch, Bing Li, Guohao Li, Shuming Liu, **jie Mai, Piotr Piękos, Aditya Ramesh, Imanol Schlag, Weimin Shi, Aleksandar Stanić, Wenyi Wang, Yuhui Wang, Mengmeng Xu, Deng-** Fan, Bernard Ghanem , et al. (1 additional authors not shown)

Abstract: Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overco… ▽ More Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents -- all communicating through the same universal symbolic language -- are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. We view this as a starting point towards much larger NLSOMs with billions of agents-some of which may be humans. And with this emergence of great societies of heterogeneous minds, many new research questions have suddenly become paramount to the future of artificial intelligence. What should be the social structure of an NLSOM? What would be the (dis)advantages of having a monarchical rather than a democratic structure? How can principles of NN economies be used to maximize the total reward of a reinforcement learning NLSOM? In this work, we identify, discuss, and try to answer some of these questions. △ Less

Submitted 26 May, 2023; originally announced May 2023.

Comments: 9 pages in main text + 7 pages of references + 38 pages of appendices, 14 figures in main text + 13 in appendices, 7 tables in appendices

MSC Class: 68T07 ACM Class: I.2.6; I.2.11

arXiv:2207.01570 [pdf, other]

Goal-Conditioned Generators of Deep Policies

Authors: Francesco Faccio, Vincent Herrmann, Aditya Ramesh, Louis Kirsch, Jürgen Schmidhuber

Abstract: Goal-conditioned Reinforcement Learning (RL) aims at learning optimal policies, given goals encoded in special command inputs. Here we study goal-conditioned neural nets (NNs) that learn to generate deep NN policies in form of context-specific weight matrices, similar to Fast Weight Programmers and other methods from the 1990s. Using context commands of the form "generate a policy that achieves a… ▽ More Goal-conditioned Reinforcement Learning (RL) aims at learning optimal policies, given goals encoded in special command inputs. Here we study goal-conditioned neural nets (NNs) that learn to generate deep NN policies in form of context-specific weight matrices, similar to Fast Weight Programmers and other methods from the 1990s. Using context commands of the form "generate a policy that achieves a desired expected return," our NN generators combine powerful exploration of parameter space with generalization across commands to iteratively find better and better policies. A form of weight-sharing HyperNetworks and policy embeddings scales our method to generate deep NNs. Experiments show how a single learned policy generator can produce policies that achieve any return seen during training. Finally, we evaluate our algorithm on a set of continuous control tasks where it exhibits competitive performance. Our code is public. △ Less

Submitted 4 July, 2022; originally announced July 2022.

Comments: Preprint. Under Review

arXiv:2207.01566 [pdf, other]

General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States

Authors: Francesco Faccio, Aditya Ramesh, Vincent Herrmann, Jean Harb, Jürgen Schmidhuber

Abstract: Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Netw… ▽ More Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Networks to learn a single value function for evaluating (and thus hel** to improve) any policy represented by a deep neural network (NN). The method yields competitive experimental results. In continuous control problems with infinitely many states, our value function minimizes its prediction error by simultaneously learning a small set of `probing states' and a map** from actions produced in probing states to the policy's return. The method extracts crucial abstract knowledge about the environment in form of very few states sufficient to fully specify the behavior of many policies. A policy improves solely by changing actions in probing states, following the gradient of the value function's predictions. Surprisingly, it is possible to clone the behavior of a near-optimal policy in Swimmer-v3 and Hopper-v3 environments only by knowing how to act in 3 and 5 such learned states, respectively. Remarkably, our value function trained to evaluate NN policies is also invariant to changes of the policy architecture: we show that it allows for zero-shot learning of linear policies competitive with the best policy seen during training. Our code is public. △ Less

Submitted 4 July, 2022; originally announced July 2022.

Comments: Preprint. Under review

arXiv:2206.01649 [pdf, other]

Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules

Authors: Kazuki Irie, Francesco Faccio, Jürgen Schmidhuber

Abstract: Neural ordinary differential equations (ODEs) have attracted much attention as continuous-time counterparts of deep residual neural networks (NNs), and numerous extensions for recurrent NNs have been proposed. Since the 1980s, ODEs have also been used to derive theoretical results for NN learning rules, e.g., the famous connection between Oja's rule and principal component analysis. Such rules are… ▽ More Neural ordinary differential equations (ODEs) have attracted much attention as continuous-time counterparts of deep residual neural networks (NNs), and numerous extensions for recurrent NNs have been proposed. Since the 1980s, ODEs have also been used to derive theoretical results for NN learning rules, e.g., the famous connection between Oja's rule and principal component analysis. Such rules are typically expressed as additive iterative update processes which have straightforward ODE counterparts. Here we introduce a novel combination of learning rules and Neural ODEs to build continuous-time sequence processing nets that learn to manipulate short-term memory in rapidly changing synaptic connections of other nets. This yields continuous-time counterparts of Fast Weight Programmers and linear Transformers. Our novel models outperform the best existing Neural Controlled Differential Equation based models on various time series classification tasks, while also addressing their fundamental scalability limitations. Our code is public. △ Less

Submitted 14 October, 2022; v1 submitted 3 June, 2022; originally announced June 2022.

Comments: Accepted to NeurIPS 2022

arXiv:2205.06595 [pdf, other]

Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets

Authors: Miroslav Štrupl, Francesco Faccio, Dylan R. Ashley, Jürgen Schmidhuber, Rupesh Kumar Srivastava

Abstract: Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching perfo… ▽ More Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size. △ Less

Submitted 13 May, 2022; originally announced May 2022.

Comments: presented at the 5th Multidisciplinary Conference on Reinforcement Learning and Decision Making; 5 pages in main text + 1 page of references + 3 pages of appendices, 1 figure in main text; source code available at https://github.com/struplm/UDRL-GCSL-counterexample.git

MSC Class: 68T05 ACM Class: I.2.6

arXiv:2107.09088 [pdf, other]

Reward-Weighted Regression Converges to a Global Optimum

Authors: Miroslav Štrupl, Francesco Faccio, Dylan R. Ashley, Rupesh Kumar Srivastava, Jürgen Schmidhuber

Abstract: Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic im… ▽ More Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum. △ Less

Submitted 23 February, 2022; v1 submitted 19 July, 2021; originally announced July 2021.

Comments: 7 pages in main text + 2 pages of references + 6 pages of appendices, 1 figure in main text + 1 figure in appendices; source code available at https://github.com/dylanashley/reward-weighted-regression

MSC Class: 68T05 ACM Class: I.2.6

arXiv:2107.05438 [pdf, other]

Bayesian brains and the Rényi divergence

Authors: Noor Sajid, Francesco Faccio, Lancelot Da Costa, Thomas Parr, Jürgen Schmidhuber, Karl Friston

Abstract: Under the Bayesian brain hypothesis, behavioural variations can be attributed to different priors over generative model parameters. This provides a formal explanation for why individuals exhibit inconsistent behavioural preferences when confronted with similar choices. For example, greedy preferences are a consequence of confident (or precise) beliefs over certain outcomes. Here, we offer an alter… ▽ More Under the Bayesian brain hypothesis, behavioural variations can be attributed to different priors over generative model parameters. This provides a formal explanation for why individuals exhibit inconsistent behavioural preferences when confronted with similar choices. For example, greedy preferences are a consequence of confident (or precise) beliefs over certain outcomes. Here, we offer an alternative account of behavioural variability using Rényi divergences and their associated variational bounds. Rényi bounds are analogous to the variational free energy (or evidence lower bound) and can be derived under the same assumptions. Importantly, these bounds provide a formal way to establish behavioural differences through an $α$ parameter, given fixed priors. This rests on changes in $α$ that alter the bound (on a continuous scale), inducing different posterior estimates and consequent variations in behaviour. Thus, it looks as if individuals have different priors, and have reached different conclusions. More specifically, $α\to 0^{+}$ optimisation leads to mass-covering variational estimates and increased variability in choice behaviour. Furthermore, $α\to + \infty$ optimisation leads to mass-seeking variational posteriors and greedy preferences. We exemplify this formulation through simulations of the multi-armed bandit task. We note that these $α$ parameterisations may be especially relevant, i.e., shape preferences, when the true posterior is not in the same family of distributions as the assumed (simpler) approximate density, which may be the case in many real-world scenarios. The ensuing departure from vanilla variational inference provides a potentially useful explanation for differences in behavioural preferences of biological (or artificial) agents under the assumption that the brain performs variational Bayesian inference. △ Less

Submitted 12 July, 2021; originally announced July 2021.

Comments: 23 pages, 5 figures

arXiv:2006.09226 [pdf, other]

Parameter-Based Value Functions

Authors: Francesco Faccio, Louis Kirsch, Jürgen Schmidhuber

Abstract: Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can ge… ▽ More Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can generalize across different policies. PBVFs can evaluate the performance of any policy given a state, a state-action pair, or a distribution over the RL agent's initial states. First we show how PBVFs yield novel off-policy policy gradient theorems. Then we derive off-policy actor-critic algorithms based on PBVFs trained by Monte Carlo or Temporal Difference methods. We show how learned PBVFs can zero-shot learn new policies that outperform any policy seen during training. Finally our algorithms are evaluated on a selection of discrete and continuous control tasks using shallow policies and deep neural networks. Their performance is comparable to state-of-the-art methods. △ Less

Submitted 13 August, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

Comments: Published as a conference paper at ICLR 2021

arXiv:1809.06098 [pdf, other]

Policy Optimization via Importance Sampling

Authors: Alberto Maria Metelli, Matteo Papini, Francesco Faccio, Marcello Restelli

Abstract: Policy optimization is an effective reinforcement learning approach to solve continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial, as it requires to account for the variance of the objective function estima… ▽ More Policy optimization is an effective reinforcement learning approach to solve continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial, as it requires to account for the variance of the objective function estimate. In this paper, we propose a novel, model-free, policy search algorithm, POIS, applicable in both action-based and parameter-based settings. We first derive a high-confidence bound for importance sampling estimation; then we define a surrogate objective function, which is optimized offline whenever a new batch of trajectories is collected. Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with state-of-the-art policy optimization methods. △ Less

Submitted 31 October, 2018; v1 submitted 17 September, 2018; originally announced September 2018.

Journal ref: 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada

arXiv:1706.00222 [pdf, other]

doi 10.1088/1748-0221/12/05/P05022

Test Beam Performance Measurements for the Phase I Upgrade of the CMS Pixel Detector

Authors: M. Dragicevic, M. Friedl, J. Hrubec, H. Steininger, A. Gädda, J. Härkönen, T. Lampén, P. Luukka, T. Peltola, E. Tuominen, E. Tuovinen, A. Winkler, P. Eerola, T. Tuuva, G. Baulieu, G. Boudoul, L. Caponetto, C. Combaret, D. Contardo, T. Dupasquier, G. Gallbit, N. Lumb, L. Mirabito, S. Perries, M. Vander Donckt , et al. (462 additional authors not shown)

Abstract: A new pixel detector for the CMS experiment was built in order to cope with the instantaneous luminosities anticipated for the Phase~I Upgrade of the LHC. The new CMS pixel detector provides four-hit tracking with a reduced material budget as well as new cooling and powering schemes. A new front-end readout chip mitigates buffering and bandwidth limitations, and allows operation at low comparator… ▽ More A new pixel detector for the CMS experiment was built in order to cope with the instantaneous luminosities anticipated for the Phase~I Upgrade of the LHC. The new CMS pixel detector provides four-hit tracking with a reduced material budget as well as new cooling and powering schemes. A new front-end readout chip mitigates buffering and bandwidth limitations, and allows operation at low comparator thresholds. In this paper, comprehensive test beam studies are presented, which have been conducted to verify the design and to quantify the performance of the new detector assemblies in terms of tracking efficiency and spatial resolution. Under optimal conditions, the tracking efficiency is $99.95\pm0.05\,\%$, while the intrinsic spatial resolutions are $4.80\pm0.25\,μ\mathrm{m}$ and $7.99\pm0.21\,μ\mathrm{m}$ along the $100\,μ\mathrm{m}$ and $150\,μ\mathrm{m}$ pixel pitch, respectively. The findings are compared to a detailed Monte Carlo simulation of the pixel detector and good agreement is found. △ Less

Submitted 1 June, 2017; originally announced June 2017.

Report number: CMS-NOTE-2017-002

arXiv:1505.01824 [pdf, other]

doi 10.1088/1748-0221/11/04/P04023

Trap** in irradiated p-on-n silicon sensors at fluences anticipated at the HL-LHC outer tracker

Authors: W. Adam, T. Bergauer, M. Dragicevic, M. Friedl, R. Fruehwirth, M. Hoch, J. Hrubec, M. Krammer, W. Treberspurg, W. Waltenberger, S. Alderweireldt, W. Beaumont, X. Janssen, S. Luyckx, P. Van Mechelen, N. Van Remortel, A. Van Spilbeeck, P. Barria, C. Caillol, B. Clerbaux, G. De Lentdecker, D. Dobur, L. Favart, A. Grebenyuk, Th. Lenzi , et al. (663 additional authors not shown)

Abstract: The degradation of signal in silicon sensors is studied under conditions expected at the CERN High-Luminosity LHC. 200 $μ$m thick n-type silicon sensors are irradiated with protons of different energies to fluences of up to $3 \cdot 10^{15}$ neq/cm$^2$. Pulsed red laser light with a wavelength of 672 nm is used to generate electron-hole pairs in the sensors. The induced signals are used to determi… ▽ More The degradation of signal in silicon sensors is studied under conditions expected at the CERN High-Luminosity LHC. 200 $μ$m thick n-type silicon sensors are irradiated with protons of different energies to fluences of up to $3 \cdot 10^{15}$ neq/cm$^2$. Pulsed red laser light with a wavelength of 672 nm is used to generate electron-hole pairs in the sensors. The induced signals are used to determine the charge collection efficiencies separately for electrons and holes drifting through the sensor. The effective trap** rates are extracted by comparing the results to simulation. The electric field is simulated using Synopsys device simulation assuming two effective defects. The generation and drift of charge carriers are simulated in an independent simulation based on PixelAV. The effective trap** rates are determined from the measured charge collection efficiencies and the simulated and measured time-resolved current pulses are compared. The effective trap** rates determined for both electrons and holes are about 50% smaller than those obtained using standard extrapolations of studies at low fluences and suggests an improved tracker performance over initial expectations. △ Less

Submitted 7 May, 2015; originally announced May 2015.

Journal ref: 2016 JINST 11 P04023

Showing 1–20 of 20 results for author: Faccio, F