-
Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning
Authors:
Yuhui Wang,
Qingyuan Wu,
Weida Li,
Dylan R. Ashley,
Francesco Faccio,
Chao Huang,
Jürgen Schmidhuber
Abstract:
The Value Iteration Network (VIN) is an end-to-end differentiable architecture that performs value iteration on a latent MDP for planning in reinforcement learning (RL). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a $100\times 100$ maze -- a task which typically requires thousands of planning steps to solve. We observe that this deficiency is due…
▽ More
The Value Iteration Network (VIN) is an end-to-end differentiable architecture that performs value iteration on a latent MDP for planning in reinforcement learning (RL). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a $100\times 100$ maze -- a task which typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module's depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, introducing an "adaptive highway loss" that constructs skip connections to improve gradient flow. We evaluate our method on both 2D maze navigation environments and the ViZDoom 3D navigation benchmark. We find that our new method, named Dynamic Transition VIN (DT-VIN), easily scales to 5000 layers and casually solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in RL environments.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Highway Value Iteration Networks
Authors:
Yuhui Wang,
Weida Li,
Francesco Faccio,
Qingyuan Wu,
Jürgen Schmidhuber
Abstract:
Value iteration networks (VINs) enable end-to-end learning for planning tasks by employing a differentiable "planning module" that approximates the value iteration algorithm. However, long-term planning remains a challenge because training very deep VINs is difficult. To address this problem, we embed highway value iteration -- a recent algorithm designed to facilitate long-term credit assignment…
▽ More
Value iteration networks (VINs) enable end-to-end learning for planning tasks by employing a differentiable "planning module" that approximates the value iteration algorithm. However, long-term planning remains a challenge because training very deep VINs is difficult. To address this problem, we embed highway value iteration -- a recent algorithm designed to facilitate long-term credit assignment -- into the structure of VINs. This improvement augments the "planning module" of the VIN with three additional components: 1) an "aggregate gate," which constructs skip connections to improve information flow across many layers; 2) an "exploration module," crafted to increase the diversity of information and gradient flow in spatial dimensions; 3) a "filter gate" designed to ensure safe exploration. The resulting novel highway VIN can be trained effectively with hundreds of layers using standard backpropagation. In long-term planning tasks requiring hundreds of planning steps, deep highway VINs outperform both traditional VINs and several advanced, very deep NNs.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Highway Reinforcement Learning
Authors:
Yuhui Wang,
Miroslav Strupl,
Francesco Faccio,
Qingyuan Wu,
Haozhe Liu,
Michał Grudzień,
Xiaoyang Tan,
Jürgen Schmidhuber
Abstract:
Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL). Approaches based on importance sampling (IS) often suffer from large variances due to products of IS ratios. Typical IS-free methods, such as $n$-step Q-learning, look ahead for $n$ time steps along the trajectory of actions (where $n$ is called the lookahead depth) and utilize…
▽ More
Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL). Approaches based on importance sampling (IS) often suffer from large variances due to products of IS ratios. Typical IS-free methods, such as $n$-step Q-learning, look ahead for $n$ time steps along the trajectory of actions (where $n$ is called the lookahead depth) and utilize off-policy data directly without any additional adjustment. They work well for proper choices of $n$. We show, however, that such IS-free methods underestimate the optimal value function (VF), especially for large $n$, restricting their capacity to efficiently utilize information from distant future time steps. To overcome this problem, we introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF. At its core lies a simple but non-trivial \emph{highway gate}, which controls the information flow from the distant future by comparing it to a threshold. The highway gate guarantees convergence to the optimal VF for arbitrary $n$ and arbitrary behavioral policies. It gives rise to a novel family of off-policy RL algorithms that safely learn even when $n$ is very large, facilitating rapid credit assignment from the far future to the past. On tasks with greatly delayed rewards, including video games where the reward is given only at the end of the game, our new methods outperform many existing multi-step off-policy algorithms.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Towards a Robust Soft Baby Robot With Rich Interaction Ability for Advanced Machine Learning Algorithms
Authors:
Mohannad Alhakami,
Dylan R. Ashley,
Joel Dunham,
Francesco Faccio,
Eric Feron,
Jürgen Schmidhuber
Abstract:
Artificial intelligence has made great strides in many areas lately, yet it has had comparatively little success in general-use robotics. We believe one of the reasons for this is the disconnect between traditional robotic design and the properties needed for open-ended, creativity-based AI systems. To that end, we, taking selective inspiration from nature, build a robust, partially soft robotic l…
▽ More
Artificial intelligence has made great strides in many areas lately, yet it has had comparatively little success in general-use robotics. We believe one of the reasons for this is the disconnect between traditional robotic design and the properties needed for open-ended, creativity-based AI systems. To that end, we, taking selective inspiration from nature, build a robust, partially soft robotic limb with a large action space, rich sensory data stream from multiple cameras, and the ability to connect with others to enhance the action space and data stream. As a proof of concept, we train two contemporary machine learning algorithms to perform a simple target-finding task. Altogether, we believe that this design serves as a first step to building a robot tailor-made for achieving artificial general intelligence.
△ Less
Submitted 11 April, 2024;
originally announced April 2024.
-
Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models
Authors:
Wentian Zhang,
Haozhe Liu,
**heng Xie,
Francesco Faccio,
Mike Zheng Shou,
Jürgen Schmidhuber
Abstract:
This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after few inference steps. Accordingly, the time point of convergence naturally divides the entire inference process into two stages: an initial semantics-planning stage, during which, the model relies on cross-attention to plan text-…
▽ More
This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after few inference steps. Accordingly, the time point of convergence naturally divides the entire inference process into two stages: an initial semantics-planning stage, during which, the model relies on cross-attention to plan text-oriented visual semantics, and a subsequent fidelity-improving stage, during which the model tries to generate images from previously planned semantics. Surprisingly, ignoring text conditions in the fidelity-improving stage not only reduces computation complexity, but also maintains model performance. This yields a simple and training-free method called TGATE for efficient generation, which caches the cross-attention output once it converges and keeps it fixed during the remaining inference steps. Our empirical study on the MS-COCO validation set confirms its effectiveness. The source code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
△ Less
Submitted 3 April, 2024;
originally announced April 2024.
-
Learning Useful Representations of Recurrent Neural Network Weight Matrices
Authors:
Vincent Herrmann,
Francesco Faccio,
Jürgen Schmidhuber
Abstract:
Recurrent Neural Networks (RNNs) are general-purpose parallel-sequential computers. The program of an RNN is its weight matrix. How to learn useful representations of RNN weights that facilitate RNN analysis as well as downstream tasks? While the mechanistic approach directly looks at some RNN's weights to predict its behavior, the functionalist approach analyzes its overall functionality-specific…
▽ More
Recurrent Neural Networks (RNNs) are general-purpose parallel-sequential computers. The program of an RNN is its weight matrix. How to learn useful representations of RNN weights that facilitate RNN analysis as well as downstream tasks? While the mechanistic approach directly looks at some RNN's weights to predict its behavior, the functionalist approach analyzes its overall functionality-specifically, its input-output map**. We consider several mechanistic approaches for RNN weights and adapt the permutation equivariant Deep Weight Space layer for RNNs. Our two novel functionalist approaches extract information from RNN weights by 'interrogating' the RNN through probing inputs. We develop a theoretical framework that demonstrates conditions under which the functionalist approach can generate rich representations that help determine RNN behavior. We release the first two 'model zoo' datasets for RNN weight representation learning. One consists of generative models of a class of formal languages, and the other one of classifiers of sequentially processed MNIST digits.With the help of an emulation-based self-supervised learning technique we compare and evaluate the different RNN weight encoding techniques on multiple downstream applications. On the most challenging one, namely predicting which exact task the RNN was trained on, functionalist approaches show clear superiority.
△ Less
Submitted 18 June, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
Language Agents as Optimizable Graphs
Authors:
Mingchen Zhuge,
Wenyi Wang,
Louis Kirsch,
Francesco Faccio,
Dmitrii Khizbullin,
Jürgen Schmidhuber
Abstract:
Various human-designed prompt engineering techniques have been proposed to improve problem solvers based on Large Language Models (LLMs), yielding many disparate code bases. We unify these approaches by describing LLM-based agents as computational graphs. The nodes implement functions to process multimodal data or query LLMs, and the edges describe the information flow between operations. Graphs c…
▽ More
Various human-designed prompt engineering techniques have been proposed to improve problem solvers based on Large Language Models (LLMs), yielding many disparate code bases. We unify these approaches by describing LLM-based agents as computational graphs. The nodes implement functions to process multimodal data or query LLMs, and the edges describe the information flow between operations. Graphs can be recursively combined into larger composite graphs representing hierarchies of inter-agent collaboration (where edges connect operations of different agents). Our novel automatic graph optimizers (1) refine node-level LLM prompts (node optimization) and (2) improve agent orchestration by changing graph connectivity (edge optimization). Experiments demonstrate that our framework can be used to efficiently develop, integrate, and automatically improve various LLM agents. The code can be found at https://github.com/metauto-ai/gptswarm.
△ Less
Submitted 27 February, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute
Authors:
Aleksandar Stanić,
Dylan Ashley,
Oleg Serikov,
Louis Kirsch,
Francesco Faccio,
Jürgen Schmidhuber,
Thomas Hofmann,
Imanol Schlag
Abstract:
The Languini Kitchen serves as both a research collective and codebase designed to empower researchers with limited computational resources to contribute meaningfully to the field of language modelling. We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. The number of tokens on which a model is trained is defined by the m…
▽ More
The Languini Kitchen serves as both a research collective and codebase designed to empower researchers with limited computational resources to contribute meaningfully to the field of language modelling. We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. The number of tokens on which a model is trained is defined by the model's throughput and the chosen compute class. Notably, this approach avoids constraints on critical hyperparameters which affect total parameters or floating-point operations. For evaluation, we pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. On it, we compare methods based on their empirical scaling trends which are estimated through experiments at various levels of compute. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput. While the GPT baseline achieves better perplexity throughout all our levels of compute, our LSTM baseline exhibits a predictable and more favourable scaling law. This is due to the improved throughput and the need for fewer training tokens to achieve the same decrease in test perplexity. Extrapolating the scaling laws leads of both models results in an intersection at roughly 50,000 accelerator hours. We hope this work can serve as the foundation for meaningful and reproducible language modelling research.
△ Less
Submitted 20 September, 2023;
originally announced September 2023.
-
Learning to Identify Critical States for Reinforcement Learning from Videos
Authors:
Haozhe Liu,
Mingchen Zhuge,
Bing Li,
Yuhui Wang,
Francesco Faccio,
Bernard Ghanem,
Jürgen Schmidhuber
Abstract:
Recent work on deep reinforcement learning (DRL) has pointed out that algorithmic information about good policies can be extracted from offline data which lack explicit information about executed actions. For example, videos of humans or robots may convey a lot of implicit information about rewarding action sequences, but a DRL machine that wants to profit from watching such videos must first lear…
▽ More
Recent work on deep reinforcement learning (DRL) has pointed out that algorithmic information about good policies can be extracted from offline data which lack explicit information about executed actions. For example, videos of humans or robots may convey a lot of implicit information about rewarding action sequences, but a DRL machine that wants to profit from watching such videos must first learn by itself to identify and recognize relevant states/actions/rewards. Without relying on ground-truth annotations, our new method called Deep State Identifier learns to predict returns from episodes encoded as videos. Then it uses a kind of mask-based sensitivity analysis to extract/identify important critical states. Extensive experiments showcase our method's potential for understanding and improving agent behavior. The source code and the generated datasets are available at https://github.com/AI-Initiative-KAUST/VideoRLCS.
△ Less
Submitted 15 August, 2023;
originally announced August 2023.
-
Mindstorms in Natural Language-Based Societies of Mind
Authors:
Mingchen Zhuge,
Haozhe Liu,
Francesco Faccio,
Dylan R. Ashley,
Róbert Csordás,
Anand Gopalakrishnan,
Abdullah Hamdi,
Hasan Abed Al Kader Hammoud,
Vincent Herrmann,
Kazuki Irie,
Louis Kirsch,
Bing Li,
Guohao Li,
Shuming Liu,
**jie Mai,
Piotr Piękos,
Aditya Ramesh,
Imanol Schlag,
Weimin Shi,
Aleksandar Stanić,
Wenyi Wang,
Yuhui Wang,
Mengmeng Xu,
Deng-** Fan,
Bernard Ghanem
, et al. (1 additional authors not shown)
Abstract:
Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overco…
▽ More
Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents -- all communicating through the same universal symbolic language -- are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. We view this as a starting point towards much larger NLSOMs with billions of agents-some of which may be humans. And with this emergence of great societies of heterogeneous minds, many new research questions have suddenly become paramount to the future of artificial intelligence. What should be the social structure of an NLSOM? What would be the (dis)advantages of having a monarchical rather than a democratic structure? How can principles of NN economies be used to maximize the total reward of a reinforcement learning NLSOM? In this work, we identify, discuss, and try to answer some of these questions.
△ Less
Submitted 26 May, 2023;
originally announced May 2023.
-
Goal-Conditioned Generators of Deep Policies
Authors:
Francesco Faccio,
Vincent Herrmann,
Aditya Ramesh,
Louis Kirsch,
Jürgen Schmidhuber
Abstract:
Goal-conditioned Reinforcement Learning (RL) aims at learning optimal policies, given goals encoded in special command inputs. Here we study goal-conditioned neural nets (NNs) that learn to generate deep NN policies in form of context-specific weight matrices, similar to Fast Weight Programmers and other methods from the 1990s. Using context commands of the form "generate a policy that achieves a…
▽ More
Goal-conditioned Reinforcement Learning (RL) aims at learning optimal policies, given goals encoded in special command inputs. Here we study goal-conditioned neural nets (NNs) that learn to generate deep NN policies in form of context-specific weight matrices, similar to Fast Weight Programmers and other methods from the 1990s. Using context commands of the form "generate a policy that achieves a desired expected return," our NN generators combine powerful exploration of parameter space with generalization across commands to iteratively find better and better policies. A form of weight-sharing HyperNetworks and policy embeddings scales our method to generate deep NNs. Experiments show how a single learned policy generator can produce policies that achieve any return seen during training. Finally, we evaluate our algorithm on a set of continuous control tasks where it exhibits competitive performance. Our code is public.
△ Less
Submitted 4 July, 2022;
originally announced July 2022.
-
General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States
Authors:
Francesco Faccio,
Aditya Ramesh,
Vincent Herrmann,
Jean Harb,
Jürgen Schmidhuber
Abstract:
Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Netw…
▽ More
Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Networks to learn a single value function for evaluating (and thus hel** to improve) any policy represented by a deep neural network (NN). The method yields competitive experimental results. In continuous control problems with infinitely many states, our value function minimizes its prediction error by simultaneously learning a small set of `probing states' and a map** from actions produced in probing states to the policy's return. The method extracts crucial abstract knowledge about the environment in form of very few states sufficient to fully specify the behavior of many policies. A policy improves solely by changing actions in probing states, following the gradient of the value function's predictions. Surprisingly, it is possible to clone the behavior of a near-optimal policy in Swimmer-v3 and Hopper-v3 environments only by knowing how to act in 3 and 5 such learned states, respectively. Remarkably, our value function trained to evaluate NN policies is also invariant to changes of the policy architecture: we show that it allows for zero-shot learning of linear policies competitive with the best policy seen during training. Our code is public.
△ Less
Submitted 4 July, 2022;
originally announced July 2022.
-
Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules
Authors:
Kazuki Irie,
Francesco Faccio,
Jürgen Schmidhuber
Abstract:
Neural ordinary differential equations (ODEs) have attracted much attention as continuous-time counterparts of deep residual neural networks (NNs), and numerous extensions for recurrent NNs have been proposed. Since the 1980s, ODEs have also been used to derive theoretical results for NN learning rules, e.g., the famous connection between Oja's rule and principal component analysis. Such rules are…
▽ More
Neural ordinary differential equations (ODEs) have attracted much attention as continuous-time counterparts of deep residual neural networks (NNs), and numerous extensions for recurrent NNs have been proposed. Since the 1980s, ODEs have also been used to derive theoretical results for NN learning rules, e.g., the famous connection between Oja's rule and principal component analysis. Such rules are typically expressed as additive iterative update processes which have straightforward ODE counterparts. Here we introduce a novel combination of learning rules and Neural ODEs to build continuous-time sequence processing nets that learn to manipulate short-term memory in rapidly changing synaptic connections of other nets. This yields continuous-time counterparts of Fast Weight Programmers and linear Transformers. Our novel models outperform the best existing Neural Controlled Differential Equation based models on various time series classification tasks, while also addressing their fundamental scalability limitations. Our code is public.
△ Less
Submitted 14 October, 2022; v1 submitted 3 June, 2022;
originally announced June 2022.
-
Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets
Authors:
Miroslav Štrupl,
Francesco Faccio,
Dylan R. Ashley,
Jürgen Schmidhuber,
Rupesh Kumar Srivastava
Abstract:
Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching perfo…
▽ More
Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.
△ Less
Submitted 13 May, 2022;
originally announced May 2022.
-
Reward-Weighted Regression Converges to a Global Optimum
Authors:
Miroslav Štrupl,
Francesco Faccio,
Dylan R. Ashley,
Rupesh Kumar Srivastava,
Jürgen Schmidhuber
Abstract:
Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic im…
▽ More
Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum.
△ Less
Submitted 23 February, 2022; v1 submitted 19 July, 2021;
originally announced July 2021.
-
Bayesian brains and the Rényi divergence
Authors:
Noor Sajid,
Francesco Faccio,
Lancelot Da Costa,
Thomas Parr,
Jürgen Schmidhuber,
Karl Friston
Abstract:
Under the Bayesian brain hypothesis, behavioural variations can be attributed to different priors over generative model parameters. This provides a formal explanation for why individuals exhibit inconsistent behavioural preferences when confronted with similar choices. For example, greedy preferences are a consequence of confident (or precise) beliefs over certain outcomes. Here, we offer an alter…
▽ More
Under the Bayesian brain hypothesis, behavioural variations can be attributed to different priors over generative model parameters. This provides a formal explanation for why individuals exhibit inconsistent behavioural preferences when confronted with similar choices. For example, greedy preferences are a consequence of confident (or precise) beliefs over certain outcomes. Here, we offer an alternative account of behavioural variability using Rényi divergences and their associated variational bounds. Rényi bounds are analogous to the variational free energy (or evidence lower bound) and can be derived under the same assumptions. Importantly, these bounds provide a formal way to establish behavioural differences through an $α$ parameter, given fixed priors. This rests on changes in $α$ that alter the bound (on a continuous scale), inducing different posterior estimates and consequent variations in behaviour. Thus, it looks as if individuals have different priors, and have reached different conclusions. More specifically, $α\to 0^{+}$ optimisation leads to mass-covering variational estimates and increased variability in choice behaviour. Furthermore, $α\to + \infty$ optimisation leads to mass-seeking variational posteriors and greedy preferences. We exemplify this formulation through simulations of the multi-armed bandit task. We note that these $α$ parameterisations may be especially relevant, i.e., shape preferences, when the true posterior is not in the same family of distributions as the assumed (simpler) approximate density, which may be the case in many real-world scenarios. The ensuing departure from vanilla variational inference provides a potentially useful explanation for differences in behavioural preferences of biological (or artificial) agents under the assumption that the brain performs variational Bayesian inference.
△ Less
Submitted 12 July, 2021;
originally announced July 2021.
-
Parameter-Based Value Functions
Authors:
Francesco Faccio,
Louis Kirsch,
Jürgen Schmidhuber
Abstract:
Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can ge…
▽ More
Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can generalize across different policies. PBVFs can evaluate the performance of any policy given a state, a state-action pair, or a distribution over the RL agent's initial states. First we show how PBVFs yield novel off-policy policy gradient theorems. Then we derive off-policy actor-critic algorithms based on PBVFs trained by Monte Carlo or Temporal Difference methods. We show how learned PBVFs can zero-shot learn new policies that outperform any policy seen during training. Finally our algorithms are evaluated on a selection of discrete and continuous control tasks using shallow policies and deep neural networks. Their performance is comparable to state-of-the-art methods.
△ Less
Submitted 13 August, 2021; v1 submitted 16 June, 2020;
originally announced June 2020.
-
Policy Optimization via Importance Sampling
Authors:
Alberto Maria Metelli,
Matteo Papini,
Francesco Faccio,
Marcello Restelli
Abstract:
Policy optimization is an effective reinforcement learning approach to solve continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial, as it requires to account for the variance of the objective function estima…
▽ More
Policy optimization is an effective reinforcement learning approach to solve continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial, as it requires to account for the variance of the objective function estimate. In this paper, we propose a novel, model-free, policy search algorithm, POIS, applicable in both action-based and parameter-based settings. We first derive a high-confidence bound for importance sampling estimation; then we define a surrogate objective function, which is optimized offline whenever a new batch of trajectories is collected. Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with state-of-the-art policy optimization methods.
△ Less
Submitted 31 October, 2018; v1 submitted 17 September, 2018;
originally announced September 2018.
-
Test Beam Performance Measurements for the Phase I Upgrade of the CMS Pixel Detector
Authors:
M. Dragicevic,
M. Friedl,
J. Hrubec,
H. Steininger,
A. Gädda,
J. Härkönen,
T. Lampén,
P. Luukka,
T. Peltola,
E. Tuominen,
E. Tuovinen,
A. Winkler,
P. Eerola,
T. Tuuva,
G. Baulieu,
G. Boudoul,
L. Caponetto,
C. Combaret,
D. Contardo,
T. Dupasquier,
G. Gallbit,
N. Lumb,
L. Mirabito,
S. Perries,
M. Vander Donckt
, et al. (462 additional authors not shown)
Abstract:
A new pixel detector for the CMS experiment was built in order to cope with the instantaneous luminosities anticipated for the Phase~I Upgrade of the LHC. The new CMS pixel detector provides four-hit tracking with a reduced material budget as well as new cooling and powering schemes. A new front-end readout chip mitigates buffering and bandwidth limitations, and allows operation at low comparator…
▽ More
A new pixel detector for the CMS experiment was built in order to cope with the instantaneous luminosities anticipated for the Phase~I Upgrade of the LHC. The new CMS pixel detector provides four-hit tracking with a reduced material budget as well as new cooling and powering schemes. A new front-end readout chip mitigates buffering and bandwidth limitations, and allows operation at low comparator thresholds. In this paper, comprehensive test beam studies are presented, which have been conducted to verify the design and to quantify the performance of the new detector assemblies in terms of tracking efficiency and spatial resolution. Under optimal conditions, the tracking efficiency is $99.95\pm0.05\,\%$, while the intrinsic spatial resolutions are $4.80\pm0.25\,μ\mathrm{m}$ and $7.99\pm0.21\,μ\mathrm{m}$ along the $100\,μ\mathrm{m}$ and $150\,μ\mathrm{m}$ pixel pitch, respectively. The findings are compared to a detailed Monte Carlo simulation of the pixel detector and good agreement is found.
△ Less
Submitted 1 June, 2017;
originally announced June 2017.
-
Trap** in irradiated p-on-n silicon sensors at fluences anticipated at the HL-LHC outer tracker
Authors:
W. Adam,
T. Bergauer,
M. Dragicevic,
M. Friedl,
R. Fruehwirth,
M. Hoch,
J. Hrubec,
M. Krammer,
W. Treberspurg,
W. Waltenberger,
S. Alderweireldt,
W. Beaumont,
X. Janssen,
S. Luyckx,
P. Van Mechelen,
N. Van Remortel,
A. Van Spilbeeck,
P. Barria,
C. Caillol,
B. Clerbaux,
G. De Lentdecker,
D. Dobur,
L. Favart,
A. Grebenyuk,
Th. Lenzi
, et al. (663 additional authors not shown)
Abstract:
The degradation of signal in silicon sensors is studied under conditions expected at the CERN High-Luminosity LHC. 200 $μ$m thick n-type silicon sensors are irradiated with protons of different energies to fluences of up to $3 \cdot 10^{15}$ neq/cm$^2$. Pulsed red laser light with a wavelength of 672 nm is used to generate electron-hole pairs in the sensors. The induced signals are used to determi…
▽ More
The degradation of signal in silicon sensors is studied under conditions expected at the CERN High-Luminosity LHC. 200 $μ$m thick n-type silicon sensors are irradiated with protons of different energies to fluences of up to $3 \cdot 10^{15}$ neq/cm$^2$. Pulsed red laser light with a wavelength of 672 nm is used to generate electron-hole pairs in the sensors. The induced signals are used to determine the charge collection efficiencies separately for electrons and holes drifting through the sensor. The effective trap** rates are extracted by comparing the results to simulation. The electric field is simulated using Synopsys device simulation assuming two effective defects. The generation and drift of charge carriers are simulated in an independent simulation based on PixelAV. The effective trap** rates are determined from the measured charge collection efficiencies and the simulated and measured time-resolved current pulses are compared. The effective trap** rates determined for both electrons and holes are about 50% smaller than those obtained using standard extrapolations of studies at low fluences and suggests an improved tracker performance over initial expectations.
△ Less
Submitted 7 May, 2015;
originally announced May 2015.