Dynamic planning in hierarchical active inference

Matteo Priorelli
Institute of Cognitive Sciences and Technologies
National Research Council of Italy
Padova, Italy
[email protected]
Ivilin Peev Stoianov
Institute of Cognitive Sciences and Technologies
National Research Council of Italy
Padova, Italy
[email protected]
Abstract

By dynamic planning, we refer to the ability of the human brain to infer and impose motor trajectories related to cognitive decisions. A recent paradigm, active inference, brings fundamental insights into the adaptation of biological organisms, constantly striving to minimize prediction errors to restrict themselves to life-compatible states. Over the past years, many studies have shown how human and animal behaviors could be explained in terms of an active inferential process – either as discrete decision-making or continuous motor control – inspiring innovative solutions in robotics and artificial intelligence. Still, the literature lacks a comprehensive outlook on how to effectively plan realistic actions in changing environments. Setting ourselves the goal of modeling tool use, we delve into the topic of dynamic planning in active inference, keeping in mind two crucial aspects of biological goal-directed behavior: the capacity to understand and exploit affordances for object manipulation, and to learn the hierarchical interactions between the self and the environment, including other agents. We start from a simple unit and gradually describe more advanced structures, comparing recently proposed design choices and providing basic examples for each section. This study distances itself from traditional views centered on neural networks and reinforcement learning, and points toward a yet unexplored direction in active inference: hybrid representations in hierarchical models.

1 Introduction

Hierarchies are found everywhere in the world. They are so pervasive that they do not just exist as causal relationships between physical properties of the environment, but are also inherent to how biological organisms act upon it. Even the most complex kinematic structures of animals follow a rigid hierarchical strategy, whereby different limbs propagate from a body-centered reference frame. The hierarchical modularity of brain functional networks is widely recognized [1, 2], as well as the representation of the body schema in somatosensory and motor areas [3], and the organization of hierarchical motor sequences concerning parietal and premotor cortices [4]. In particular, the body schema is not a static entity but changes along with the development of the human body during childhood and adolescence [5]. Surprisingly, the nervous system is able to relate external objects to the self in a way that, although not reflecting the actual causal relationships between the body and the environment, is the most suitable for operating in a specific context. Physiological studies have demonstrated that, with extensive tool use, parietal and motor areas of the monkey brain gradually adapt to make room for the tool, increasing the length of the perceived limb [6, 7]. This adaptation is highly plastic, assimilating objects in a very short time [8] and inducing altered somatosensory representations of the body morphology that persist even after tool use [9].

Why and how does this happen? One recent theory is that of predictive coding [10, 11, 12], which has been attracting increasing interest in recent years and proposes itself as a unifying paradigm of cortical function. Predictive coding posits that living beings make sense of the world by building an internal generative model that tries to imitate the hierarchical causal relationships of the external generative process. From a high-level hypothesis about the state of affairs of the world, a cascade of neural predictions takes place, eventually leading to a low-level guess about sensory evidence. Comparing the model’s guess with the sensorium triggers another cascade of prediction errors that travel back to the deepest cortical levels. The model iteratively refines its structure until all the prediction errors are minimized, that is, until it is finally able to predict what will happen next. This optimization differs from the more traditional view of deep learning, in that the message passing is local and what climbs up the hierarchy does not signal the detection of a feature, but how much the model is surprised about its prediction. Besides having stimulated cognitive and neural studies under several circumstances [13, 14, 15, 16], this theory has also influenced novel directions in machine learning: Predictive Coding Networks (PCNs) have been shown to generalize well to classification or regression tasks [17, 18], with key advantages compared to neural networks and still approximating the backpropagation algorithm [19, 20, 21].

While predictive coding can elucidate – through a rigorous computational framework [22] – illusions and visual phenomena such as binocular rivalry [23], it explains just the first (perceptual) half of the story. More specifically, it does not explain why interactions with the environment occur – a process that results, considering the above example, in the monkey brain actively distorting its body schema during tool use. Such complex tasks always involve decision-making, which the brain is known to realize via several methods [24]. Among them, one is particularly relevant here: planning for deliberation, also known as vicarious trial and error, whereby an action is selected after several alternatives have been generated and evaluated [25]. One of the most intriguing characteristics of human planning is the capacity to imagine, or endogenously generate, dynamic representations of future states, including potential trajectories and subgoals that lead to such states [26, 27]. The hippocampus is a key neural structure known to support trajectory generation, although planning is accomplished in concert with other areas implementing evaluation of options and response selection [25]. How does the human brain account for the dynamics of the self and the environment to afford purposeful planning?

On this trail, a second innovative perspective has been proposed, aspiring to unveil a unified first principle not just on cortical function, but on the behavior of all living organisms. This perspective, called active inference [28, 29, 30, 31], is grounded on the same theoretical basis as predictive coding but further assumes two key aspects of biological behavior. First, that a living being does not maintain static hypotheses about the state of affairs of the world but can also construct internal dynamics – either as instantaneous trajectories or future states – allowing it to anticipate the unfolding of events occurring at different timescales. Second, that these dynamic hypotheses can be fulfilled by movements. The latter assumption replaces models with agents, conveying a somewhat counterintuitive but insightful implication: while perception lets the agent’s hypothesis conform to the environment (as in predictive coding), action forces the environment to conform to the hypothesis, by letting the agent sample those observations that make the hypothesis true. If such hypotheses (usually called beliefs) correspond to desired states defined, e.g., by the phenotype, cycling between action and perception ultimately allows the agent to survive. This is the core of the so-called free energy principle, which states that in order to maintain homeostasis, all organisms must constantly and actively minimize the difference between their sensory states and their expectations based on a small set of life-compatible options. To give a practical example, if I believe that I have a tool in hand, I will try with all my strength to observe visual images of the tool in my hand; in doing this, a combined reaching and grasping action happens. This view distances itself from the stimulus-response mapping widely established in neuroscience, and evidence indicates that it could be more biologically plausible than optimal control and Reinforcement Learning (RL) [32, 33, 34, 35].

In principle, active inference might be key for understanding how goal-directed behavior emerges in the human brain [36]. For instance, relevant objects used for manipulation may gradually become part of one’s identity through a closed loop between motor commands and sensory evidence, meaning that the boundary of the self from the environment increases whenever the agent manages to predict the consequences of its own movements [37]. Additionally, active inference might prove fundamental for making advances with current artificial agents, taking forward a promising research area known as planning as inference [38, 39, 40]. Active inference implementations can be divided into two frameworks, which have been used to simulate human and animal behaviors under the two complementary aspects of motor control [32, 41, 42, 43, 44, 45] and decision-making [46, 47, 48, 49, 50]. The first framework – generally compared to the low-level sensorimotor loops – is defined in continuous time [51, 52] and makes use of generalized filtering [53] to model instantaneous trajectories of the self and the environment; these trajectories are inferred by minimization of a quantity called variational free energy, which is the negative of what in machine learning is known as the evidence lower bound. Unlike in optimal control, motor commands in active inference derive from proprioceptive predictions that are fulfilled by classical spinal reflex arcs [34]. This eliminates the need for cost functions – as the inverse model maps from proprioceptive (and not latent) states to actions – and replaces a control problem with an inference problem [33]. The second framework – attributed to the cerebral cortex, especially prefrontal areas [54], along with corticostriatal loops – is expressed in discrete state-space [55, 56] and exploits the structure of Partially Observable Markov Decision Processes (POMDPs) to plan abstract actions over expected future sensations. This (active) inference relies on the minimization of the expected free energy, i.e., the free energy that the agent expects to perceive in the future. The expected free energy can be unpacked into two terms resembling the two classical aspects of control theory, exploration and exploitation – which here naturally arise; these respectively correspond to an uncertainty-reducing term, and a goal-seeking term that, as before, pushes the agent to find a sequence of actions leading to its prior belief.

Three features of active inference are relevant to designing intelligent agents that can tackle real-life applications and, for the goal of this study, tasks requiring tool use. First, multiple units – composed of simple likelihood and state transition distributions – can be easily connected to adapt to complex hierarchical structures [57]. For instance, a hierarchical kinematic model can be designed in continuous time, wherein each unit encodes a certain Degree of Freedom (DoF) in intrinsic and extrinsic reference frames [44]. This allows one to realize advanced movements that involve simultaneous coordination of multiple limbs, e.g., moving with a glass in hand. This hierarchical structure can be generalized to perform homogeneous transformations between reference frames, e.g., perspective projections [58]. However, a continuous model alone lacks effective usability in the real world, since it can only deal with present sensory states and cannot perform any form of future planning.

The latter is usually possible through so-called mixed or hybrid models [48, 59], which combine the potentialities of a discrete model with the inference of continuous signals, allowing robust decision-making in uncertain and changing environments. While the theory of Bayesian model reduction [60, 61, 62, 63] provides efficient communication between the two models, this unified approach has so far seen few practical implementations [31, 48, 59, 64, 65, 66, 67]. An open issue regards how to deal with highly dynamic environments: standard hybrid models usually perform a comparison between static priors, limiting the agent to realize, e.g., multi-step reaching movements through fixed positions. One study addressed the problem of realistic robot navigation in active inference, but made use of alternative bio-inspired SLAM methods [68]. In [69], a hybrid model in which the agent’s hypotheses were generated at each time step from the system dynamics made it possible to relate continuous trajectories with discrete plans.

A third appealing characteristic of the framework is that an agent can encode beliefs not only over its own bodily states, but also over external physical variables. This has been recently done in the context of active object reconstruction [70, 71, 72, 73] – where an agent encoded independent representations for multiple elements, and used action to more accurately infer their dynamics; for simulating oculomotor behavior [74] – where the dynamics of a target belief was biased by a hidden location; or for analyzing epistemic affordance [50], i.e., the changes in affordance of different objects in relation to the agent’s beliefs. In continuous time, such affordances can be expressed in intrinsic reference frames corresponding to potential agent’s configurations, defining specific ways to interact with the objects. Manipulating these additional beliefs depending on the agent’s intentions [43] permits effectively operating in dynamic contexts, e.g., tracking a target with the eyes [74], or grasping an object on the fly [69] and placing it at a goal position [75]. However, these applications do not exploit the efficiency of deep hierarchical models, and controlling multiple limbs other than the hand is not straightforward. Crucially, they do not capture the flexibility of animal brains in remapping their neuronal activity to account for usable tools.

Based on these premises, a question arises: how to perform dynamic planning with hierarchical structures of several objects? In other words, how to combine these three features into a single view? While many studies in continuous time can be currently found in the literature [52, 76, 77], a rigorous formalism of how to realize goal-directed behavior is still lacking, with the consequence of using different solutions for similar problems – especially in contexts that demand online replanning. On the other side, promising results have been achieved by combining the capabilities of discrete-time active inference with neural networks, in a way loosely resembling deep RL. Indeed, so-called deep active inference has critical advantages in learning and solving online tasks compared to traditional methods [78, 79, 80, 81, 82]. Still, neural networks are treated as black boxes during free energy minimization, without fully enjoying the potential benefits of hierarchical and temporal depths. One of the most attractive aspects of active inference is that it prescribes a unified perspective not just for fitting to complex high-dimensional data (as neural networks or PCNs do), but also for embodying environmental dynamics and acting over them to minimize uncertainty and conform to prior beliefs.

For these reasons, in this study we explore an alternative direction to the optimal control problem that does not call for additional frameworks, in a few words, a direction toward hybrid computations in hierarchical systems. We analyze many design choices that have been applied in the motor control domain, with an in-depth look at the three characteristics mentioned above. Asking ourselves how to model tool use, we start from a simple unit and construct richer modules that can be linked in a hierarchical fashion, exhibiting interesting high-level features. In Chapter 2, we consider a single-DoF agent and explore how to realize a sort of multi-step behavior in continuous time only. In Chapter 3, we analyze the implications of combining different units in a single network, using more complex kinematic configurations and distinguishing between intrinsic and extrinsic dynamics. In Chapter 4, we describe the advantages of using discrete decision-making in continuous environments, focusing on hybrid structures and drawing some parallelisms between the two worlds. Finally, in the Discussion we elaborate on the benefits of addressing discrete and continuous representations together, and give a few suggestions for future work on this subject.

2 Flexible intentions

In this chapter, we begin by explaining the inferential mechanisms of a basic unit in continuous time. We then discuss one by one the changes and features that we introduce, in order to achieve a multi-step behavior in simple tasks that do not require deep hierarchical modeling nor online replanning.

2.1 A simple agent

Figure 1: Factor graph of a basic unit for static reaching. Variables and factors are indicated by circles and squares, respectively. Hidden states $\bm{x}$ (e.g., the hand position) generate observations $\bm{o}$ (e.g., an image of the hand) through the likelihood function $\bm{g}$, and their 1st derivatives $\bm{x}^{\prime}$ (e.g., the hand velocity) through a dynamics function $\bm{f}$. In contrast to optimal control, here action follows observation prediction errors arising from a simple attractor $\bm{\rho}$ embedded in the model dynamics, or from a prior belief $\bm{\eta}$ over the hand position.

The most elementary unit is represented in Figure 1. This is the simplest formulation of a continuous-time active inference agent, where we kept only the key nodes. This allows us to easily describe a velocity-controlled dynamic system with the following likelihood $\bm{g}$ and dynamics $\bm{f}$:

\begin{equation}
\begin{split}
\bm{o} &= \bm{g}(\bm{x}) + \bm{w}_{o} \\
\bm{x}^{\prime} &= \bm{f}(\bm{x}) + \bm{w}_{x}
\end{split}
\tag{1}
\end{equation}

where $\bm{x}$ and $\bm{o}$ are respectively called hidden states and observations, and the letter $w$ indicates noise terms sampled from Gaussian distributions. For simplicity, we considered just two temporal orders – although all the features that we elucidate in the following hold for a system of generalized coordinates [53] – and we defined a likelihood function only for a single temporal order. We assume that the corresponding generative model is factorized as follows:

\begin{equation}
p(\tilde{\bm{x}}, \bm{o}) = p(\bm{o}|\bm{x}) \, p(\bm{x}) \, p(\bm{x}^{\prime}|\bm{x})
\tag{2}
\end{equation}

where:

\begin{equation}
\begin{split}
p(\bm{o}|\bm{x}) &= \mathcal{N}(\bm{g}(\bm{x}), \bm{\pi}_{o}^{-1}) \\
p(\bm{x}) &= \mathcal{N}(\bm{\eta}, \bm{\pi}_{\eta}^{-1}) \\
p(\bm{x}^{\prime}|\bm{x}) &= \mathcal{N}(\bm{f}(\bm{x}), \bm{\pi}_{x}^{-1})
\end{split}
\tag{3}
\end{equation}

expressed in terms of precisions (or inverse variances) $\bm{\pi}$. Note that we introduced a prior $\bm{\eta}$ over the hidden states, which is not generally used in continuous-time formulations, but it is the key element connecting different levels in discrete-time active inference [59] or PCNs [16] – as will be explained later. Also note that we used a generalized notation for instantaneous trajectories or paths, i.e., $\tilde{\bm{x}} = [\bm{x}, \bm{x}^{\prime}]$, where $\bm{x}$ will be indicated in the following as the 0th order, and $\bm{x}^{\prime}$ as the 1st order. We highlighted in green and red respectively the input and output of the unit, namely the prior $\bm{\eta}$ and the observations $\bm{o}$. For the moment, we do not specify their nature, either being intermediate representations coming from other levels, or the highest and lowest levels of the hierarchy, i.e., a fixed prior and a sensory observation.
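To make the precision-based factorization of Equations 2 and 3 concrete, the following is a minimal sketch that evaluates the negative log joint $-\ln p(\tilde{\bm{x}}, \bm{o})$ for a scalar unit. The function names, the scalar setting, and the example values are illustrative assumptions, not part of the original model specification.

```python
import math

def gaussian_neg_log(value, mean, precision):
    """Return -ln N(value; mean, 1/precision) for scalar arguments."""
    error = value - mean
    return 0.5 * (precision * error ** 2
                  - math.log(precision)
                  + math.log(2.0 * math.pi))

def neg_log_joint(o, x, x_prime, eta, g, f, pi_o, pi_eta, pi_x):
    """Negative log of Equation 2 with the Gaussian factors of Equation 3.

    Scalar sketch: g and f are user-supplied callables (hypothetical here).
    """
    return (gaussian_neg_log(o, g(x), pi_o)            # likelihood p(o | x)
            + gaussian_neg_log(x, eta, pi_eta)         # prior p(x)
            + gaussian_neg_log(x_prime, f(x), pi_x))   # dynamics p(x' | x)

# Hypothetical example: identity likelihood and a belief of no motion.
identity = lambda x: x
no_motion = lambda x: 0.0
baseline = neg_log_joint(1.0, 1.0, 0.0, 1.0, identity, no_motion, 1.0, 1.0, 1.0)
shifted = neg_log_joint(2.0, 1.0, 0.0, 1.0, identity, no_motion, 1.0, 1.0, 1.0)
print(baseline, shifted)
```

With all errors at zero only the normalization constants remain, and a unit observation error under unit precision adds exactly one half, which is the precision-weighted squared-error form that the later prediction errors build on.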

Exact computation of the posterior $p(\tilde{\bm{x}}|\bm{o})$ is unfeasible since the evidence requires marginalizing over every possible outcome, i.e., $p(\bm{o}) = \int p(\tilde{\bm{x}}, \bm{o}) \, d\tilde{\bm{x}}$. For this reason, estimation of hidden states $\tilde{\bm{x}}$ is carried out through a variational approach [83], i.e., by minimizing the difference between a properly chosen approximate posterior $q(\tilde{\bm{x}})$ and the true posterior. This difference is expressed in terms of a Kullback-Leibler (KL) divergence:

\begin{equation}
D_{KL}[q(\tilde{\bm{x}}) \,||\, p(\tilde{\bm{x}}|\bm{o})] = \int_{\tilde{\bm{x}}} q(\tilde{\bm{x}}) \ln \frac{q(\tilde{\bm{x}})}{p(\tilde{\bm{x}}|\bm{o})} \, d\tilde{\bm{x}}
\tag{4}
\end{equation}

The denominator $p(\tilde{\bm{x}}|\bm{o})$ still depends on the marginal $p(\bm{o})$, but the KL divergence can be rewritten in terms of the log evidence and a quantity known as the free energy $\mathcal{F}$:

\begin{equation}
\mathcal{F} = \mathbb{E}_{q(\tilde{\bm{x}})}\left[\ln \frac{q(\tilde{\bm{x}})}{p(\tilde{\bm{x}}, \bm{o})}\right] = \mathbb{E}_{q(\tilde{\bm{x}})}\left[\ln \frac{q(\tilde{\bm{x}})}{p(\tilde{\bm{x}}|\bm{o})}\right] - \ln p(\bm{o})
\tag{5}
\end{equation}

Since the KL divergence is always nonnegative, the free energy provides an upper bound on surprise, i.e., $\mathcal{F} \geq -\ln p(\bm{o})$. Hence, minimizing the KL divergence with respect to $q(\tilde{\bm{x}})$ is equivalent to minimizing $\mathcal{F}$, and achieves the dual objective of keeping surprise low while estimating the true distribution. Assuming that the approximate posterior can be factorized into independent contributions, and further assuming that each contribution is Gaussian, the optimization process breaks down to the minimization of the prediction errors associated with the distributions of the generative model in Equation 2 – see [31] for more details:

\begin{equation}
\begin{split}
\bm{\varepsilon}_{o} &= \bm{o} - \bm{g}(\bm{\mu}) \\
\bm{\varepsilon}_{\eta} &= \bm{\mu} - \bm{\eta} \\
\bm{\varepsilon}_{x} &= \bm{\mu}^{\prime} - \bm{f}(\bm{\mu})
\end{split}
\tag{6}
\end{equation}
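The relation between free energy and surprise in Equation 5 can be checked numerically. The sketch below is an illustration only: it swaps the continuous Gaussian model of this section for a hypothetical discrete model with three hidden states, so that the evidence can be computed exactly. All numbers are made up for the example.

```python
import numpy as np

def free_energy(q, joint):
    """F = E_q[ln q(x) - ln p(x, o)] for a discrete x and one fixed observation o."""
    q = np.asarray(q, dtype=float)
    mask = q > 0  # terms with q(x) = 0 contribute nothing to the expectation
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(joint[mask]))))

# Hypothetical toy model: three hidden states, one observed outcome.
prior = np.array([0.5, 0.3, 0.2])         # p(x)
likelihood = np.array([0.9, 0.2, 0.1])    # p(o | x) for the observed o
joint = likelihood * prior                # p(x, o)
evidence = joint.sum()                    # p(o), tractable only because x is discrete
surprise = -np.log(evidence)              # -ln p(o)
posterior = joint / evidence              # exact posterior p(x | o)

f_uniform = free_energy(np.ones(3) / 3.0, joint)
f_prior = free_energy(prior, joint)
f_exact = free_energy(posterior, joint)
print(f"surprise={surprise:.4f}  F(uniform)={f_uniform:.4f}  "
      f"F(prior)={f_prior:.4f}  F(posterior)={f_exact:.4f}")
```

Any approximate posterior yields a free energy at or above the surprise, and the gap (the KL divergence of Equation 4) closes only when the approximate posterior equals the exact one.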

Then, the inference of the means (also called beliefs) of the posterior over the hidden states, denoted by $\tilde{\bm{\mu}}$, is reduced to the following message passing:

\begin{equation}
\dot{\tilde{\bm{\mu}}} = \begin{bmatrix} \dot{\bm{\mu}} \\ \dot{\bm{\mu}}^{\prime} \end{bmatrix} = \mathcal{D}\tilde{\bm{\mu}} - \partial_{\tilde{\mu}} \mathcal{F} = \begin{bmatrix} \bm{\mu}^{\prime} - \bm{\pi}_{\eta} \bm{\varepsilon}_{\eta} + \partial_{\mu} \bm{g}^{T} \bm{\pi}_{o} \bm{\varepsilon}_{o} + \partial_{\mu} \bm{f}^{T} \bm{\pi}_{x} \bm{\varepsilon}_{x} \\ -\bm{\pi}_{x} \bm{\varepsilon}_{x} \end{bmatrix}
\tag{7}
\end{equation}

where $\mathcal{D}$ is an operator that shifts every derivative by one, i.e., $\mathcal{D}\tilde{\bm{\mu}} = [\bm{\mu}^{\prime}, \bm{0}]$. This term arises because the generative model maintains a belief not over a static point, but over a dynamic trajectory, and only when the motion of the mean $\dot{\tilde{\bm{\mu}}}$ equals the mean of the motion $\mathcal{D}\tilde{\bm{\mu}}$, is the free energy minimized. In short, the inferential process does not involve matching a state (as in PCNs) but tracking a path [84]. Unpacking Equation 7, we note that the 0th order is subject to a forward error from the prior, a backward error from the likelihood, and a backward error from the dynamics function. On the other hand, the 1st order is only subject to the latter but in the form of a forward error. The belief is then updated via gradient descent, i.e., $\tilde{\bm{\mu}}_{t+1} = \tilde{\bm{\mu}}_{t} + \Delta_{t} \dot{\tilde{\bm{\mu}}}$, where $\Delta_{t}$ is a time constant.
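The message passing of Equation 7 and the subsequent update rule can be sketched as a single integration step. This is a minimal sketch under stated assumptions: precisions are scalars (isotropic), the Jacobians `dg` and `df` are supplied by the caller, and all function and argument names are hypothetical.

```python
import numpy as np

def belief_update(mu, mu_prime, o, eta, g, dg, f, df, pi_o, pi_eta, pi_x, dt):
    """One Euler step of Equation 7 for the 0th- and 1st-order beliefs.

    Assumes scalar precisions; dg(mu) and df(mu) return Jacobian matrices.
    """
    # Prediction errors of Equation 6
    eps_o = o - g(mu)
    eps_eta = mu - eta
    eps_x = mu_prime - f(mu)
    # 0th order: own velocity, prior error, likelihood and dynamics gradients
    mu_dot = (mu_prime
              - pi_eta * eps_eta
              + dg(mu).T @ (pi_o * eps_o)
              + df(mu).T @ (pi_x * eps_x))
    # 1st order: only the (forward) dynamics error
    mu_prime_dot = -pi_x * eps_x
    return mu + dt * mu_dot, mu_prime + dt * mu_prime_dot

# Hypothetical perception-only example: identity likelihood, belief of no motion,
# no prior (pi_eta = 0), and a fixed observation at 1.0.
identity = lambda m: m
identity_jac = lambda m: np.eye(m.size)
still = lambda m: np.zeros_like(m)
still_jac = lambda m: np.zeros((m.size, m.size))

mu, mu_prime = np.array([0.0]), np.array([0.0])
observation = np.array([1.0])
for _ in range(200):
    mu, mu_prime = belief_update(mu, mu_prime, observation, np.array([0.0]),
                                 identity, identity_jac, still, still_jac,
                                 pi_o=1.0, pi_eta=0.0, pi_x=1.0, dt=0.1)
print(mu, mu_prime)
```

In this example the likelihood error alone drives the 0th-order belief toward the observation, while the 1st-order belief stays at zero because the assumed dynamics predict no motion.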

How can this agent perform a simple reaching movement? As highlighted in Figure 1, we can encode the hand position and velocity as generalized hidden states. We will talk later about the relation between proprioceptive and exteroceptive domains, as this deserves a careful discussion. For now, we consider a single DoF that has a univocal mapping between the joint angle and the Cartesian position, so we can represent both of them by the same variables and factors (note however that we maintain a bold notation for generalization and consistency with the rest of the study). Indicating the target to reach by $\bm{\rho}$, we can define the following dynamics function:

\begin{equation}
\bm{f}(\bm{x}) = \bm{\rho} - \bm{x}
\tag{8}
\end{equation}

expressing a simple attractor toward the target [37, 85, 86, 87, 88]. This dynamics does not exist in the actual generative process, and it is indeed this discrepancy that forces the environment to conform to the agent’s beliefs. Specifically, Equation 8 means that the agent thinks its hand will be pulled toward the target with a strength proportional to the precision $\bm{\pi}_{x}$. In fact, the attractor affects the belief update through the dynamics prediction error $\bm{\varepsilon}_{x}$, expressing a difference between the estimated velocity $\bm{\mu}^{\prime}$ and the one predicted by the agent through the dynamics function $\bm{f}$. Note that this error appears in both temporal orders: in brief, $\bm{\varepsilon}_{x}$ imposes a trajectory at the 1st order which in turn affects the 0th order directly through $\bm{\mu}^{\prime}$, and indirectly through the gradient $\partial_{\mu} \bm{f}$.
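A scalar simulation gives a feel for how the attractor of Equation 8 can drive a reaching movement. The sketch below rests on several assumptions that are not specified in this section: the environment is noise-free and velocity-controlled (the action is the joint velocity), the likelihood is the identity, the action follows the observation prediction error as $\dot{a} = -k \, \pi_{o} \varepsilon_{o}$ with a hypothetical gain $k$ (`action_gain`), and the real angle starts at 0°. The target and initial belief reuse the values reported for the single-DoF task (120° and -40°).

```python
import numpy as np

def simulate_reach(rho=120.0, x0=0.0, mu0=-40.0, pi_o=1.0, pi_x=1.0,
                   action_gain=4.0, dt=0.01, n_steps=5000):
    """Single-DoF reaching with the attractor f(x) = rho - x and pi_eta = 0.

    Returns an array with one row per step: (real angle x, belief mu).
    The action model and all default values are illustrative assumptions.
    """
    x, a = x0, 0.0           # real angle and action (velocity command)
    mu, mu_prime = mu0, 0.0  # 0th- and 1st-order beliefs
    history = np.empty((n_steps, 2))
    for t in range(n_steps):
        o = x                                   # noise-free observation, g = identity
        eps_o = o - mu                          # likelihood prediction error
        eps_x = mu_prime - (rho - mu)           # dynamics prediction error
        # Equation 7 with dg/dmu = 1 and df/dmu = -1
        mu_dot = mu_prime + pi_o * eps_o - pi_x * eps_x
        mu_prime_dot = -pi_x * eps_x
        # Assumed action rule: suppress the observation prediction error
        a_dot = -action_gain * pi_o * eps_o
        mu += dt * mu_dot
        mu_prime += dt * mu_prime_dot
        a += dt * a_dot
        x += dt * a                             # velocity-controlled environment
        history[t] = (x, mu)
    return history

history = simulate_reach()
print("first step (x, mu):", history[0])
print("last step  (x, mu):", history[-1])
```

Because the belief starts below the real angle, the observation error first pushes the arm away from the target; the attractor then pulls the belief toward $\bm{\rho}$ and the action makes the real angle follow it.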

Figure 2: (a) In this task, the agent (a single DoF) has to reach a target angle represented by the red circle. Estimated and real arms are displayed in cyan and blue, respectively. Here, $\bm{\pi}_{\eta} = 0$, $\bm{\rho} = 120$°, and $\bm{\mu}$ was initialized to $-40$°. The time step is indicated in the bottom left corner of each frame. Since the belief was initialized at a negative value, the likelihood initially pulls the arm toward the wrong direction before adapting to the dynamics attractor. (b) The top graph shows the evolution of the real angle $\bm{x}$, the belief over it $\bm{\mu}$, and the target angle $\bm{\rho}$. The middle graph shows the evolution of the belief over the velocity $\bm{\mu}'$ and the belief derivative $\dot{\bm{\mu}}$. The bottom graph shows the evolution of all the components that comprise the belief update: the belief over the velocity $\bm{\mu}'$, the likelihood gradient $\partial_{\mu}\bm{g}^T \bm{\pi}_o \bm{\varepsilon}_o$, the dynamics gradient $\partial_{\mu}\bm{f}^T \bm{\pi}_x \bm{\varepsilon}_x$, and the weighted dynamics prediction error $-\bm{\pi}_x \bm{\varepsilon}_x$. The latter is plotted to allow comparing its magnitude with the other components, although it affects the 1st temporal order.

The interactions between these quantities can be better understood from Figure 2, showing a reaching movement with the defined dynamics function and the trajectories of the agent’s generative model. Here, the belief is subject to two different forces: a likelihood gradient pushing it toward what it is currently perceiving (i.e., the real angle), and the other components that steer it toward the biased dynamics (i.e., the target angle $\bm{\rho}$). Note how in the third plot, of the three components that comprise the belief update, the backward error $\partial_{\mu}\bm{f}^T \bm{\pi}_x \bm{\varepsilon}_x$ has the least impact on the overall direction of the update. While the exact interactions arising from the dynamics prediction error have yet to be analyzed, in the following we assume that goal-directed behavior is achieved through the forward error at the 1st order $-\bm{\pi}_x \bm{\varepsilon}_x$. An alternative would be to directly control the backward error without maintaining a belief over increasing temporal orders [85], which however requires taking a gradient into account and may be more challenging when defining appropriate attractors to reach a goal. Finally, note how, in the middle plot, the agent tries at every instant to minimize the difference between $\bm{\mu}'$ and $\dot{\bm{\mu}}$, thus tracking the actual path of the hidden states.

But how does this agent actually move? As mentioned in the Introduction, action is the other side of the coin of the free energy principle, through which the agent samples those observations that conform to its prior beliefs. In fact, in addition to the perceptual inference typical of predictive coding, active inference assumes that organisms minimize free energy also by interacting with the environment; this minimization breaks down to an even simpler update that only depends on observation prediction errors $\bm{\varepsilon}_o$. Since these prediction errors are generated from the agent’s belief, this means that whenever the latter is biased toward some preferred state, movement naturally follows. There is thus a delicate balance between perception – in which prediction errors climb up the hierarchy to bring the belief closer to the observations – and action – in which prediction errors are suppressed at a low level so that the observations are brought closer to their predictions. However, there is an open issue regarding how active inference should be practically realized in continuous time. A few studies demonstrated that using exteroceptive information directly for computing motor commands could result in smoother movements and resolution of visuo-proprioceptive conflicts [28, 41, 43], and in fact some robotic implementations effectively used this approach [85, 86]. However, evidence seems to indicate that motor commands are generated by suppression of proprioceptive information only [34, 33], which is already in the intrinsic reference frame needed for movement and thus results in an easier inverse dynamics. For this reason, in the following we assume that – indicating by the subscript $p$ the proprioceptive domain – movements are realized by minimizing the free energy with respect to proprioceptive prediction errors $\bm{\varepsilon}_p$:

$$\dot{\bm{a}} = -\partial_a \mathcal{F} = -\partial_a \bm{g}_p^T \bm{\pi}_p \bm{\varepsilon}_p \tag{9}$$

where $\partial_a \bm{g}_p$ performs an inverse dynamics from the proprioceptive predictions to the motor commands $\bm{a}$, likely to be implemented by classical spinal reflex arcs. As a last note, the actions can also depend on multiple orders – velocity, acceleration, and so on – allowing more efficient movement and control [89, 90, 91, 92], but since it is beyond our scope, we only minimize the 0th order. Nonetheless, 1st-order movements – e.g., maintaining a constant velocity – are still possible by specification of appropriate dynamics of the hidden states.
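The perception-action loop described so far can be sketched numerically. The following Python snippet is a minimal illustration and not the implementation used for Figure 2: it assumes an identity proprioceptive likelihood, unit precisions, no prior over the hidden states, an action interpreted as a joint velocity command with $\partial_a \bm{g}_p$ set to 1, a noiseless process $\dot{x} = a$, a real arm starting at 0°, and explicit Euler integration. The function name `simulate_reaching` and all parameter names are hypothetical.

```python
def simulate_reaching(rho=120.0, mu0=-40.0, x0=0.0, steps=4000, dt=0.05,
                      pi_o=1.0, pi_x=1.0, pi_p=1.0):
    """Euler-integrate a single-DoF reaching agent (angles in degrees)."""
    x = x0        # real joint angle (generative process)
    mu = mu0      # belief over the angle (0th order)
    mu_p = 0.0    # belief over the velocity (1st order)
    a = 0.0       # action, assumed here to be a joint velocity command
    xs = []
    for _ in range(steps):
        o = x                        # noiseless observation, assuming g(x) = x
        eps_o = o - mu               # observation prediction error
        eps_x = mu_p - (rho - mu)    # dynamics prediction error, f(mu) = rho - mu
        df_dmu = -1.0                # gradient of the attractor of Equation 8

        # Belief updates: velocity belief + likelihood gradient + dynamics gradient
        mu_dot = mu_p + pi_o * eps_o + df_dmu * pi_x * eps_x
        mu_p_dot = -pi_x * eps_x     # forward error acting on the 1st order

        # Equation 9, with d g_p / d a assumed equal to 1
        a_dot = -1.0 * pi_p * eps_o
        x_dot = a                    # assumed generative process

        mu += dt * mu_dot
        mu_p += dt * mu_p_dot
        a += dt * a_dot
        x += dt * x_dot
        xs.append(x)
    return xs, mu, mu_p


xs, mu, mu_p = simulate_reaching()
print(f"final angle: {xs[-1]:.2f}, final belief: {mu:.2f}, lowest angle: {min(xs):.2f}")
```

Under these assumptions, the simulated arm first moves toward the initial (negative) belief and only later settles on the target, which is qualitatively the behavior described for Figure 2; the exact trajectories depend on the chosen precisions and on the assumed process.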

2.2 Tracking objects

Figure 3: The target is now encoded in the hidden causes $\bm{v}$, generating a dynamic attractor for object tracking. In fact, both hidden states and hidden causes generate predictions through parallel likelihood functions $\bm{g}_x$ and $\bm{g}_v$, and both concur in estimating the 1st-order hidden states $\bm{x}'$.

The simple agent defined in the previous section can only realize fixed trajectories embedded in the dynamics function, so how can we let it track moving objects? This is usually done by introducing a key concept in active inference, the hidden causes $\bm{v}$, which link hierarchical levels and specify how the dynamics function evolves. In the active inference literature of motor control, they are also used to encode the target to be reached [28, 67, 74, 93], as depicted in Figure 3. Considering the target as a causal variable for the hidden states and sensory observations makes sense from an active perspective whereby “it is an object I want to reach that generates my movements”. Now, the agent’s generative model becomes:

$$p(\tilde{\bm{x}}, \bm{v}, \bm{o}) = p(\bm{o}|\bm{x},\bm{v})\, p(\bm{x})\, p(\bm{x}'|\bm{x},\bm{v})\, p(\bm{v}) \tag{10}$$

where:

$$\begin{split}
p(\bm{x}) &= \mathcal{N}(\bm{\eta}_x, \bm{\pi}_{\eta_x}^{-1}) \\
p(\bm{v}) &= \mathcal{N}(\bm{\eta}_v, \bm{\pi}_{\eta_v}^{-1}) \\
p(\bm{x}'|\bm{x},\bm{v}) &= \mathcal{N}(\bm{f}(\bm{x},\bm{v}), \bm{\pi}_x^{-1})
\end{split} \tag{11}$$

Figure 4: (a) In this task, the agent has to track a target angle rotating at a constant velocity. Estimated and real targets are displayed in purple and red, respectively. Here, $\bm{\pi}_{\eta,x} = 0$, $\bm{\pi}_{\eta,v} = 0$, $\bm{v}$ was initialized to $60$°, and both $\bm{\mu}$ and $\bm{\nu}$ were initialized to $0$°. Here, the belief over the hidden causes pulls the belief over the hidden states with it while approaching the real target angle. (b) The top graph shows the evolution of the real angle $\bm{x}$, the belief over it $\bm{\mu}$, the target angle $\bm{v}$, and the belief over it $\bm{\nu}$. The middle graph shows the evolution of the belief over the velocity $\bm{\mu}'$ and the belief derivative $\dot{\bm{\mu}}$, as before. The bottom graph shows the evolution of all the components that comprise the hidden causes update: the likelihood gradient $\partial_{\nu}\bm{g}_v^T \bm{\pi}_{o,v} \bm{\varepsilon}_{o,v}$, and the dynamics gradient $\partial_{\nu}\bm{f}^T \bm{\pi}_x \bm{\varepsilon}_x$. Note how, in the middle plot, the estimated 1st temporal order stabilizes to a non-zero value as the agent rotates with a constant angular velocity.

Note that there are two priors, one over the hidden states and another over the hidden causes, respectively denoted by $\bm{\eta}_x$ and $\bm{\eta}_v$. Note also that both dynamics and likelihood functions depend on the hidden causes, and that we assumed a further factorization for the likelihood:

$$\begin{split}
p(\bm{o}|\bm{x},\bm{v}) &= p(\bm{o}_x|\bm{x})\, p(\bm{o}_v|\bm{v}) \\
p(\bm{o}_x|\bm{x}) &= \mathcal{N}(\bm{g}_x(\bm{x}), \bm{\pi}_{o,x}^{-1}) \\
p(\bm{o}_v|\bm{v}) &= \mathcal{N}(\bm{g}_v(\bm{v}), \bm{\pi}_{o,v}^{-1})
\end{split} \tag{12}$$

where $\bm{o}_x$ and $\bm{o}_v$ denote the hand and target observations, respectively. It is this additional connection between hidden causes and observations that makes the agent able to operate in dynamic environments. In fact, we can define the following dynamics function:

$$\bm{f}(\bm{x},\bm{v}) = \bm{v} - \bm{x} \tag{13}$$

where we just replaced the static target $\bm{\rho}$ with the hidden causes. Then, the posterior belief over the hidden causes $\bm{\nu}$ is updated according to:

$$\dot{\bm{\nu}} = -\partial_{\nu}\mathcal{F} = -\bm{\pi}_{\eta_v}\bm{\varepsilon}_{\eta_v} + \partial_{\nu}\bm{g}_v^T \bm{\pi}_{o,v}\bm{\varepsilon}_{o,v} + \partial_{\nu}\bm{f}^T \bm{\pi}_x \bm{\varepsilon}_x \tag{14}$$

where we defined the following observation and prior prediction errors:

$$\begin{split}
\bm{\varepsilon}_{o,v} &= \bm{o}_v - \bm{g}_v(\bm{\nu}) \\
\bm{\varepsilon}_{\eta_v} &= \bm{\nu} - \bm{\eta}_v
\end{split} \tag{15}$$

As evident, the hidden causes are subject to a prior prediction error, a backward dynamics error, and a backward likelihood error – similar to the update of the hidden states, with the only difference that this kind of inference is over a state and not a path. Via the backward likelihood error, the agent can correctly estimate the target configuration whenever it moves, as shown in the tracking simulation of Figure 4. Concerning the dynamics prediction error, it can now flow into two different pathways: specifically, the roles of the gradients $\partial_{\mu}\bm{f}$ and $\partial_{\nu}\bm{f}$ are, respectively, to infer the positional state and the cause that may have generated a particular velocity; their actual role will be clear in Chapter 4.
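The three terms of Equation 14 can be evaluated separately for a single DoF. The snippet below is a hedged sketch rather than the code behind Figure 4: it assumes an identity likelihood for the target ($\bm{g}_v(\bm{\nu}) = \bm{\nu}$), scalar precisions, and the dynamics function of Equation 13; the function name `hidden_cause_update` and its parameters are hypothetical.

```python
def hidden_cause_update(nu, mu, mu_p, o_v, eta_v=0.0,
                        pi_eta_v=0.0, pi_o_v=1.0, pi_x=1.0):
    """Return the three terms of Equation 14 and their sum for one DoF."""
    eps_eta_v = nu - eta_v      # prior prediction error (Equation 15)
    eps_o_v = o_v - nu          # target observation error, assuming g_v(nu) = nu
    eps_x = mu_p - (nu - mu)    # dynamics error with f(mu, nu) = nu - mu (Equation 13)
    dg_dnu = 1.0                # gradient of the assumed identity likelihood
    df_dnu = 1.0                # gradient of Equation 13 with respect to nu

    prior_term = -pi_eta_v * eps_eta_v
    likelihood_term = dg_dnu * pi_o_v * eps_o_v
    dynamics_term = df_dnu * pi_x * eps_x
    nu_dot = prior_term + likelihood_term + dynamics_term
    return prior_term, likelihood_term, dynamics_term, nu_dot


# Target observed at 60 degrees while every belief is still at 0:
# only the backward likelihood error is active.
print(hidden_cause_update(nu=0.0, mu=0.0, mu_p=0.0, o_v=60.0))

# Target belief already at 60 degrees, hand belief at rest at 0:
# only the backward dynamics error is active, pulling nu toward the hand belief.
print(hidden_cause_update(nu=60.0, mu=0.0, mu_p=0.0, o_v=60.0))
```

In the first call the likelihood term drives the target belief toward the observation, while in the second the dynamics term acts in the opposite direction, which illustrates how the two backward errors compete in the update of the hidden causes.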

2.3 Intention modulation and multi-step behavior

Although capable of operating in dynamic contexts, the last approach still portrays a simple scenario in which a specific target has no internal dynamics and always has the role of a cause for a hidden state. In other words, it does not permit modeling realistic tasks such as a pick-and-place operation, where an object can either be the cause of a reaching and grasping movement, or the consequence of another cause such as a goal position, resulting in a placing movement; critically, it does not allow modeling a task wherein not only the dynamics of the self, but also the dynamics of the target must be learned (e.g., if a moving object should be grasped on the fly, the agent should infer its trajectory to anticipate where it will fall).

It follows that to operate in a complex environment, the agent must (i) maintain complete representations for each entity that it wants to interact with, and (ii) flexibly assign causes and consequences for the next movement depending on the current context – in a similar way to policies in discrete models, as will be explained later. Therefore, we first encode multiple environmental entities in the hidden states, i.e., $\bm{x} = [\bm{x}_1, \dots, \bm{x}_N]$, where $N$ is the number of entities [43]. Consequently, the factorized likelihood function generates specular observations for each element:

$$\bm{o} = [\bm{o}_1, \dots, \bm{o}_N] = [\bm{g}_1(\bm{x}_1), \dots, \bm{g}_N(\bm{x}_N)] \tag{16}$$

This structure is similar to the previous model, except that the target is now embedded in the hidden states along with the hand, and that there is no connection between hidden causes and observations. We could define a similar factorization for the hidden causes and dynamics function, i.e., $\bm{x}' = [\bm{f}_1(\bm{x}_1, \bm{v}_1), \dots, \bm{f}_N(\bm{x}_N, \bm{v}_N)]$, such that each entity would have an independent dynamics biased by a specific cause (e.g., where the hand or the target will be in the future); however, this is of limited use in a pick-and-place operation that demands interaction between entities. We hence compute a potential hidden state with a single function, such as:

$$\bm{i}(\bm{x}) = \bm{W}\bm{x} + \bm{b} \tag{17}$$

The weights $\bm{W}$ perform a linear transformation of the hidden states that combines every entity, while the bias $\bm{b}$ imposes a static configuration over them [44]. Equation 17 can be realized through simple neural connections, wherein the weights are encoded as synaptic strengths and the bias represents the threshold needed to fire a spike. An error is then computed between this potential state and the current one:

$$\bm{e}_i = \bm{i}(\bm{x}) - \bm{x} \tag{18}$$

This vector has the same role as the attractor of Equation 13, but now it points toward a function of the hidden states. Finally, we define the following dynamics function:

$$\bm{f}(\bm{x}, v) = v\,\bm{e}_i \tag{19}$$

multiplying the error by a single-value hidden cause $v$. Thus, the latter is not intended as an explicit trajectory prior over the hidden states (e.g., encoding where my hand will be in the future), whose role is now delegated to the bias $\bm{b}$, but as an attractor gain, whereby a high value implies a strong force toward the potential state. As a result, we have an additional modulation that combines with the dynamics precision $\bm{\pi}_x$; their interactions will be explained in Chapter 4. Since $\bm{i}(\bm{x})$ is used to define a path for the current hidden states aiming to produce a desired configuration, we call it an intention. Similarly, we refer to $\bm{e}_i$ as an intention prediction error; note however that this quantity is not strictly a prediction error, although the model could be designed so that it qualifies as one.
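Equations 17 to 19 can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the hidden states contain only two scalar entities (a hand angle and a target angle, in degrees), and the weights, bias, and gain values are invented for illustration. The helper names `intention`, `intention_error`, and `dynamics` are hypothetical.

```python
def intention(x, W, b):
    """Potential hidden state i(x) = W x + b (Equation 17)."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def intention_error(x, W, b):
    """Intention prediction error e_i = i(x) - x (Equation 18)."""
    return [ii - xi for ii, xi in zip(intention(x, W, b), x)]


def dynamics(x, v, W, b):
    """Dynamics function f(x, v) = v * e_i (Equation 19), with scalar gain v."""
    return [v * e for e in intention_error(x, W, b)]


# Assumed hidden states: [hand angle, target angle]
x = [10.0, 60.0]

# Tracking intention: the hand should be where the target is, the target stays put.
W_reach = [[0.0, 1.0],
           [0.0, 1.0]]
b_reach = [0.0, 0.0]
print(dynamics(x, 0.5, W_reach, b_reach))    # [25.0, 0.0]

# Static intention: the hand should reach a fixed 90-degree configuration via the bias.
W_static = [[0.0, 0.0],
            [0.0, 1.0]]
b_static = [90.0, 0.0]
print(dynamics(x, 1.0, W_static, b_static))  # [80.0, 0.0]
```

In both cases only the hand component of the trajectory is non-zero: the weights express a dependence on another entity, the bias expresses a fixed configuration, and the hidden cause simply scales the resulting attractor.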

Figure 5: (a) Factor graph of the unit with environmental beliefs and dynamic behavior. Each variable is factorized into independent components that encode the agent’s bodily states and external physical variables, e.g., objects to interact with. A single dynamics function combines the environmental beliefs to compute a specific intention. The hidden causes now define an attractor gain expressing the strength of the intention. The weights $\bm{W}$ of the intention can be used, e.g., to track moving objects, while the bias $\bm{b}$ realizes a static configuration. See [43, 75] for more details. (b) Factor graph of the unit with multiple intentions. Every hidden cause $v_m$, coupled with an independent intention of the dynamics function, is combined to produce a dynamic trajectory $\bm{\eta}_x'$, comprising all the environmental beliefs $\bm{x}_n$. The transition between intentions can be achieved by a higher-level prior, e.g., a belief over tactile sensations.

In summary, as shown in Figure 5(a), the dynamics function is not composed of segregated pathways as for the likelihood, but affects all the environmental entities at once – e.g., it computes a trajectory for the hand depending on the target. The steps performed by the agent during a reaching movement are the following: (i) the 0th order imposes a dynamic trajectory on the 1st order and generates a sensory prediction; (ii) the 0th order infers the consequences of its predictions, hence it is now biased toward both the intention and the observation; (iii) a proprioceptive prediction is generated from this new biased position, eventually driving action. This approach can be seen as a generalization of [74] where, in the context of oculomotor behavior, the target and the center of gaze were encoded as hidden states, with their own dynamics and attracted by a hidden location. Although somewhat limited compared to non-linear dynamics functions (e.g., obstacle avoidance can be realized by dynamics functions built from repulsive potentials [44]), the specific form defined above, together with the factorization of the hidden states, offers high flexibility in designing intentions for complex interactions. Further, interpreting the hidden causes as a gain still makes sense from an active inference perspective because what is represented at a higher level is the intention to move toward the target, while the target location is inferred at a lower level.

Taken alone, considering a hidden cause as an attractor gain may not seem so helpful. However, as depicted in Figure 5(b), we can combine $M$ intentions in the following way:

$$\begin{split}
\bm{\eta}_{x,m}' &= \bm{f}_m(\bm{x},\bm{v}) = v_m \bm{e}_{i,m} \\
\bm{\eta}_x' &= \bm{f}(\bm{x},\bm{v}) = \sum_m^M \bm{\eta}_{x,m}'
\end{split} \tag{20}$$

In short, trajectories $\bm{\eta}_{x,m}'$ are separately computed from each intention $\bm{i}_m$ and with their respective gains; then, the final trajectory $\bm{\eta}_x'$ is found by combining all of them. The reason why we used the prior notation for the trajectory predictions will be clear in Chapter 4. Note for the moment that, as before, there is a different structure compared to the likelihood. While an observation is generated through a parallel pathway for every environmental belief, each function $\bm{f}_m(\bm{x},\bm{v})$ combines all of them in a specific way. Since each intention prediction error is scaled by its hidden cause $v_m$, the latter lends itself to a parallelism with the policies of discrete models, as will be explained later: if $v_m$ is set to $1$ and all the others to $0$, the hidden states will be subject only to intention $m$; conversely, if multiple hidden causes are active, the hidden states will be pulled toward a combination of the corresponding intentions. This means that the hidden causes act both as attractor gains – expressing the absolute strength by which the belief is steered toward the desired direction – and as intention modulators – defining the relative strength between each desired state. The resulting dynamics prediction error:

$$\bm{\varepsilon}_x = \bm{\mu}' - \bm{\eta}_x' \tag{21}$$

will then realize an average trajectory that the agent predicts for the current situation. This approach is effective for two reasons. First, it allows defining a composite movement in terms of simpler subgoals, which can be tackled separately; this can be helpful, for instance, if one has to analyze the behavior of an agent when subject to two or more opposing priors [43]. But the main utility is that a fixed multi-step behavior can be achieved [75] which does not need to modify the dynamics function at each step but only to modulate the hidden causes, since the model already encodes all the intermediate goals that the agent will pass through. Transitions between continuous trajectories could then be realized by higher-level priors over the hidden causes, e.g., a belief over tactile sensations for multi-step reaching, such as in the simulation of Figure 6.
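The combination of intentions in Equations 20 and 21 can be sketched as follows. The example is illustrative only: it assumes three scalar entities (hand, target, and home angles), two linear intentions without bias, and a hard switch of the hidden causes triggered by a touch flag, which is a crude stand-in for the higher-level tactile prior mentioned above. All names (`combined_trajectory`, `select_hidden_causes`, and so on) are hypothetical.

```python
def intention_error(x, W, b):
    """e_{i,m} = W_m x + b_m - x for one intention."""
    i_x = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [ii - xi for ii, xi in zip(i_x, x)]


def combined_trajectory(x, v, intentions):
    """eta'_x = sum_m v_m * e_{i,m} (Equation 20)."""
    eta_p = [0.0] * len(x)
    for v_m, (W, b) in zip(v, intentions):
        e_m = intention_error(x, W, b)
        eta_p = [acc + v_m * e for acc, e in zip(eta_p, e_m)]
    return eta_p


def dynamics_error(mu_p, eta_p):
    """eps_x = mu' - eta'_x (Equation 21)."""
    return [m - e for m, e in zip(mu_p, eta_p)]


def select_hidden_causes(touched):
    """Assumed switch: reach the target until it is touched, then return home."""
    return [0.0, 1.0] if touched else [1.0, 0.0]


# Assumed hidden states: [hand, target, home] angles in degrees
x = [10.0, 60.0, -30.0]
no_bias = [0.0, 0.0, 0.0]
reach = ([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], no_bias)    # hand -> target
go_home = ([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], no_bias)  # hand -> home
intentions = [reach, go_home]

print(combined_trajectory(x, select_hidden_causes(False), intentions))  # [50.0, 0.0, 0.0]
print(combined_trajectory(x, select_hidden_causes(True), intentions))   # [-40.0, 0.0, 0.0]
print(combined_trajectory(x, [0.5, 0.5], intentions))                   # [5.0, 0.0, 0.0]
print(dynamics_error([0.0, 0.0, 0.0],
                     combined_trajectory(x, [1.0, 0.0], intentions)))   # [-50.0, 0.0, 0.0]
```

The dynamics function itself never changes between the two movements: only the hidden causes are modulated, and intermediate values blend the two attractors into a single trajectory.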

Figure 6: (a) In this task, the agent has to first touch a moving target (red circle) and then reach the home position (grey square). Additionally, both elements have to be inferred through sensory observations before the agent starts moving. This phase of pure perceptual inference is realized by setting both hidden causes to $0$. The second transition is done by a tactile belief. Note how even during the second movement, the agent continues to track the belief over the target angle. See [43, 75] for more details. (b) The first graph shows the evolution of the real arm angle $\bm{x}_{a}$ and its belief $\bm{\mu}_{a}$, the target angle $\bm{x}_{t}$ and its belief $\bm{\mu}_{t}$, the home angle $\bm{x}_{h}$ and its belief $\bm{\mu}_{h}$. The second graph shows the evolution of the hidden causes associated with the two intentions, $\nu_{reach}$ and $\nu_{return}$. The third graph shows the evolution of the intention prediction errors $\bm{e}_{i,reach}$ and $\bm{e}_{i,return}$, and the dynamics functions $\bm{f}_{reach}$ and $\bm{f}_{return}$. The last graph shows the evolution of the tactile observation $\bm{o}_{t}$, and its belief $\bm{\mu}_{t}$.

3 Hierarchical models

So far, we have introduced several units with two kinds of inputs – a prior over the hidden states and a prior over the hidden causes – and one kind of output – the 0th-order observation. In this chapter, we focus on how to combine such units in a single network to achieve a more advanced and efficient control. For this, we will make use of the first input, leaving the discussion about the second to the next chapter.

3.1 Intrinsic and extrinsic causes

The last unit presented affords a multi-step behavior in continuous time that accounts, to some extent, for dynamic elements of the environment. However, in all the previous simulations we just considered a single-DoF arm, while in real-life applications we generally deal with much more complex kinematic structures such as the human body. In this case, there is no longer a one-to-one mapping between joint angles and Cartesian positions, so we need to distinguish between proprioceptive and exteroceptive (e.g., visual) observations. As in optimal control, continuous-time active inference considers three reference frames and two inversions: an extrinsic signal (e.g., encoding the Cartesian position of a target) is first transformed into an intrinsic signal (e.g., encoding the joint angles configuration corresponding to the hand at the target) through inverse kinematics, which is in turn converted to the actual motor control signals (e.g., joint torques) through inverse dynamics [94]. These two processes are also attributed to the human brain [95, 96, 97], but there is a substantial difference between optimal control and active inference regarding how they unfold in practice. As mentioned in the previous chapter, in active inference the motor commands are replaced by proprioceptive prediction errors that are suppressed through spinal reflex arcs [34]. As a consequence, inverse dynamics becomes easier because action is put aside and the agent only has to know the mapping from proprioceptive states to motor commands – see Equation 9.

But what about inverse kinematics? Recall the perspective that we mentioned in the previous chapter, i.e., that “it is an object I want to reach that generates my movements”. Turning optimal control upside down, active inference posits that action is driven by the proprioceptive consequences (e.g., changes in limb lengths) of extrinsic causes (e.g., a target) [33]. Intuitively, one could model an extrinsic movement as in Figure 7(a), i.e., with the following dynamics and likelihood functions:

$$\begin{split}
\bm{f}(\bm{x},\bm{v})&=\bm{J}^{T}(\bm{v}-\bm{T}(\bm{x}))\\
\bm{g}_{p}(\bm{x})&=\bm{x}\\
\bm{g}_{v}(\bm{x},\bm{v})&=\begin{bmatrix}\bm{T}(\bm{x})&\bm{v}\end{bmatrix}
\end{split} \tag{22}$$

where $\bm{T}$ is the forward kinematics returning the hand position, and $\bm{J}$ is the Jacobian matrix.
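For concreteness, Equation 22 can be written out for a planar arm with two joints. The following worked instance is only an illustration: the link lengths $l_1, l_2$ and the joint angles $\theta_1, \theta_2$ are hypothetical symbols that are not used elsewhere in the text.

```latex
\begin{aligned}
\bm{x} &= \begin{bmatrix}\theta_1 & \theta_2\end{bmatrix}^T \\
\bm{T}(\bm{x}) &= \begin{bmatrix}
  l_1\cos\theta_1 + l_2\cos(\theta_1+\theta_2) \\
  l_1\sin\theta_1 + l_2\sin(\theta_1+\theta_2)
\end{bmatrix} \\
\bm{J} = \frac{\partial \bm{T}}{\partial \bm{x}} &= \begin{bmatrix}
  -l_1\sin\theta_1 - l_2\sin(\theta_1+\theta_2) & -l_2\sin(\theta_1+\theta_2) \\
  l_1\cos\theta_1 + l_2\cos(\theta_1+\theta_2) & l_2\cos(\theta_1+\theta_2)
\end{bmatrix} \\
\bm{f}(\bm{x},\bm{v}) &= \bm{J}^T(\bm{v}-\bm{T}(\bm{x}))
  = -\frac{\partial}{\partial \bm{x}}\,\tfrac{1}{2}\,\lVert \bm{v}-\bm{T}(\bm{x}) \rVert^2
\end{aligned}
```

Each component of $\bm{f}$ projects the Cartesian error $\bm{v}-\bm{T}(\bm{x})$ onto the direction in which the hand moves when the corresponding joint rotates, so the belief over joint angles follows the negative gradient of the squared distance between hand and target.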

Figure 7: (a) Factor graph of an active inference model commonly used for kinematics. Hidden causes of a single level represent a target to reach, while hidden states define the joint angles of the kinematic chain, generating proprioceptive predictions through $\bm{g}_{p}$ and visual predictions through $\bm{g}_{v}$. Note that forward and inverse kinematics are duplicated, also requiring the likelihood $\bm{g}_{v}$ to be embedded into the dynamics function. (b) Factor graph of an alternative hierarchical model for kinematics. Two different units (left and right blocks) encode information about arm joint angles and hand position, respectively generating proprioceptive and visual predictions. The two levels are linked by the likelihood function $\bm{g}_{e}$ performing forward kinematics. Inverse kinematics and goal-directed behavior arise naturally through inference; moreover, both levels can express their own dynamics, affording more advanced control.

In short, the hand – expressed in terms of joint angles of the whole arm – is embedded in the hidden states, while the target to reach – expressed as a Cartesian position – is encoded in the hidden causes. The proprioceptive states needed for movement are found by first using an inverse kinematic model directly as a forward model in the dynamics function, e.g., through a Jacobian transpose or a pseudoinverse [85, 86, 87, 37, 28, 98, 93]; and then, by generating a prediction through the proprioceptive likelihood $\bm{g}_{p}$ (which here is a simple identity mapping). The visual likelihood $\bm{g}_{v}$ is finally used to update the target location and further refine the inference of the agent’s configuration. This approach implies that an extrinsic reference frame is inverted to generate an intrinsic state, which is in turn transformed back into the first domain to be compared with visual observations. As a result, forward and inverse kinematics are performed twice, once in the dynamics function and once in the forward and backward passes of perceptual inference, when propagating the visual prediction error $\bm{\varepsilon}_{v}$:

$$\partial\bm{g}_{v}^{T}\bm{\varepsilon}_{v}=\begin{bmatrix}\bm{J}\\ \bm{1}\end{bmatrix}\left(\bm{o}_{v}-\begin{bmatrix}\bm{T}(\bm{x})&\bm{v}\end{bmatrix}\right) \tag{23}$$

If the predictions are not temporarily stored, this increases the computational and memory demands. Crucially, there is an additional issue regarding biological plausibility: using sensory-level attractors within the dynamics function means that a unit is aware of, and can use within the same level, part of the likelihood prediction – which is generally assumed to go all the way down to the sensorium – and its inverse mapping, both of which are lower-level features. Finally, the model in Figure 7(a) does not let the agent express paths in extrinsic coordinates needed, e.g., for realizing linear or circular motions, or for imposing constraints in both intrinsic and extrinsic domains such as when walking with a glass in hand.

Figure 8: (a) In this task, the agent (a 4-DoF arm) has to reach the red target while avoiding the green obstacle; this is possible by specifying two (attractive and repulsive) functions at the extrinsic level. (b) In this task, the agent has to reach the red target while maintaining the same hand orientation (such as when walking with a glass in hand); this is possible by combining intrinsic and extrinsic constraints. (c) In this task, the agent has to perform linear (top) and circular (bottom) motions, possible by defining attractors at the 1st temporal order of the extrinsic hidden states. See [44] for more details.

We can instead exploit Equation 23 and follow the natural flow of the generative process to avoid duplicated computations, as displayed in Figure 7(b). This model relies on two hierarchical levels, where an intrinsic unit (encoding the arm joint angles) is placed at the top and generates predictions through forward kinematics for an extrinsic unit (encoding the Cartesian position of the target) [44]:

$$\bm{x}_{e}=\bm{g}(\bm{x}_{i})=\bm{T}(\bm{x}_{i})+\bm{w}_{i} \tag{24}$$

The goal-directed behavior of Equation 22 arises naturally via backpropagation of the extrinsic prediction error $\bm{\varepsilon}_{e}$:

$$\partial\bm{g}_{e}^{T}\bm{\varepsilon}_{e}=\bm{J}^{T}(\bm{x}_{e}-\bm{T}(\bm{x}_{i})) \tag{25}$$

Having a complete unit that deals with extrinsic information – which is thus not embedded into the hidden causes of the intrinsic unit – allows the agent to specify their dynamics, leading to an efficient decomposition between intrinsic and extrinsic attractors, and between proprioceptive and visual observations – as exemplified in the simulations of Figure 8. Note the similarity of Equation 25 with Equations 22 and 23: if in the model of Figure 7(a) we had two different forward and inverse kinematics either for goal-directed behavior or for predicting current observations, in this case what is compared with the observations already contains a bias toward preferred states, without the need for sensory-level attractors within the dynamics function.
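The two-level scheme can be sketched numerically for a planar arm with two joints. The code below is a simplified illustration and not the implementation of [44]: it assumes unit precisions, no noise, plain Euler integration, and a simple attractor with hypothetical gain `k` that pulls the extrinsic belief toward a target in place of a full extrinsic dynamics function. Link lengths, initial beliefs, and the target are likewise made up for the example.

```python
import math

L1, L2 = 1.0, 0.8  # hypothetical link lengths


def T(mu_i):
    """Forward kinematics of a 2-link planar arm (hand position)."""
    a, b = mu_i
    return [L1 * math.cos(a) + L2 * math.cos(a + b),
            L1 * math.sin(a) + L2 * math.sin(a + b)]


def J(mu_i):
    """Jacobian of T with respect to the joint angles."""
    a, b = mu_i
    return [[-L1 * math.sin(a) - L2 * math.sin(a + b), -L2 * math.sin(a + b)],
            [L1 * math.cos(a) + L2 * math.cos(a + b), L2 * math.cos(a + b)]]


def step(mu_i, mu_e, target, dt=0.01, k=1.0):
    """One Euler step of the two-level belief update (unit precisions)."""
    pred = T(mu_i)
    eps_e = [mu_e[0] - pred[0], mu_e[1] - pred[1]]  # extrinsic prediction error
    jac = J(mu_i)
    # Intrinsic level: backpropagated error J^T eps_e, as in Equation 25.
    d_i = [jac[0][n] * eps_e[0] + jac[1][n] * eps_e[1] for n in range(2)]
    # Extrinsic level: top-down error plus an attractor toward the target.
    d_e = [-eps_e[n] + k * (target[n] - mu_e[n]) for n in range(2)]
    mu_i = [mu_i[n] + dt * d_i[n] for n in range(2)]
    mu_e = [mu_e[n] + dt * d_e[n] for n in range(2)]
    return mu_i, mu_e


mu_i = [0.3, 0.5]      # belief over joint angles
mu_e = T(mu_i)         # belief over hand position, initially consistent
target = [0.9, 1.1]    # hypothetical reachable target
for _ in range(20000):
    mu_i, mu_e = step(mu_i, mu_e, target)

hand = T(mu_i)
print(mu_i, mu_e, hand)
```

In this sketch no inverse kinematic model appears in any dynamics function: the attractor only biases the extrinsic belief, and the joint angles follow because the resulting extrinsic prediction error is propagated back through the transposed Jacobian.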

Although the generative model follows the forward flow of optimal control, the relationship between proprioceptive consequences and extrinsic causes peculiar to active inference still holds because the kinematic inversion regards a high-level process that manipulates abstract (intrinsic or extrinsic) representations, and both of them concur to generate low-level proprioceptive states. As noted by Adams and colleagues, “The key distinction is not about mapping from desired states in an extrinsic (kinematic) frame to an intrinsic (dynamic) frame of reference, but the mapping from desired states (in either frame) to motor commands” [34]. Having said this, there is a significant difference between the two models represented in Figure 7, which can be compared to the two supervised learning modes of predictive coding [16]: a forward mode that fixes the latent states to the labels and the observations to the data can generate highly accurate images of digits, while the inverse classification task is more difficult as there is no univocal mapping between labels and data; instead, a backward mode that fixes the latent states to the data and the observations to the labels achieves high performance on classification but falls short when generating images. Based on this, we can interpret the model of Figure 7(a) as a backward mode that would rapidly generate a proper kinematic configuration with the hand at the target, but that would hardly infer from proprioception the hand position needed to plan movements. Conversely, we can interpret the model of Figure 7(b) as a forward mode that would generate with high accuracy the hand position, but that would find it difficult to infer the kinematic configuration needed to actually realize movement.

3.2 A module for iterative transformations

The model in Figure 7(b) introduced a hierarchical dependency between two (intrinsic and extrinsic) levels, made possible by a connection between hidden states. Instead, the typical approach in continuous-time active inference is to let the hidden states and causes of a level exchange information with the hidden causes (and not the hidden states) of the subordinate level, as shown in Figure 9(a). While this allows one to impose a dynamic trajectory for the unit below, specifying fixed setpoints to the 0th-order hidden states is not as straightforward, since the dynamics prediction error generated from the hidden causes would have to travel back to the previous temporal orders. As clear from Figure 7(b), a connection between hidden states is of extreme importance when designing hierarchical models. In fact – as represented in Figure 9(b) – it is fundamental in defining the initial state of slower temporal scales in discrete models, e.g., in pictographic reading [48]. An analogy could be also made regarding the hierarchical connectivity of PCNs, as units are connected in a multiple-input and multiple-output system, defining static priors for the subordinate levels [16] – as shown in Figure 9(c).

Figure 9: (a) Factor graph of the most common choice of connections between two continuous levels. Hidden states $\bm{x}^{(i)}$ and hidden causes $\bm{v}^{(i)}$ of level $i$ generate – through the likelihood function $\bm{g}^{(i)}$ – the hidden causes $\bm{v}^{(i+1)}$ of level $i+1$. (b) Connections between two discrete levels. Hidden states $\bm{s}_{1}^{(i)}$ of level $i$ generate – through the prior matrix $\bm{D}^{(i)}$ – the hidden states of the present $\bm{s}_{1}^{(i+1)}$ of level $i+1$. We will cover discrete models in the next chapter. (c) Connections between two levels in PCNs. Hidden causes $\bm{v}^{(i)}$ of level $i$ receive an average of prediction errors from units at level $i-1$, and impose a prior over units at level $i+1$. Typically, the likelihood function $\bm{g}^{(i)}$ performs a combination of independent units, passed to a nonlinear activation function. Standard PCNs only permit representing the causal structure of a system, without modeling internal dynamics. (d) Factor graph of a level with multiple inputs and outputs from independent units, with a connectivity similar to that of PCNs plus model dynamics. The observation $\bm{o}^{(i,j)}$ becomes the 0th-order hidden state $\bm{x}^{(i+1,j)}$ of the level below, while the prior $\bm{\eta}_{x}^{(i,j)}$ becomes the 0th-order hidden state $\bm{x}^{(i-1,j)}$ of the level above. (e) A network of IE modules. A signal expressed in an extrinsic reference frame $\bm{x}_{e}^{(i-1)}$, along with an intrinsic signal $\bm{x}_{i}^{(i,j)}$ (e.g., angle for rotation or length for translation), is passed to a function $\bm{g}_{e}^{(i,j)}$, generating a new extrinsic signal $\bm{x}_{e}^{(i,j)}$.

Following these two examples and the previous kinematic model, we use the observation of a level to directly bias the 0th-order hidden states of the level below. As a result, the observation prediction error $\bm{\varepsilon}_{o}$ and the prior prediction error $\bm{\varepsilon}_{\eta_{x}}$ of Equation 6 are expressed by the same variable:

$$\bm{\varepsilon}_{o}^{(i)}=\bm{\mu}^{(i+1)}-\bm{g}^{(i)}(\bm{\mu}^{(i)}) \tag{26}$$

where the hierarchical level is indicated with a superscript and lower levels are denoted by increasing numbers. We can then design a multiple-input and multiple-output system wherein a level imposes and receives priors and observations to several units, as in Figure 9(d). The computation of the free energy in Equation 5 remains unchanged, and the update of the hidden states turns into the following:

$$\dot{\tilde{\bm{\mu}}}^{(i,j)}=\begin{bmatrix}
\bm{\mu}^{\prime(i,j)}-\sum_{k}\bm{\pi}_{o}^{(i-1,k)}\bm{\varepsilon}_{o}^{(i-1,k)}+\sum_{l}\partial_{\mu^{(i,j)}}\bm{g}^{(i,l)T}\bm{\pi}_{o}^{(i,l)}\bm{\varepsilon}_{o}^{(i,l)}+\partial_{\mu^{(i,j)}}\bm{f}^{(i,j)T}\bm{\pi}_{x}^{(i,j)}\bm{\varepsilon}_{x}^{(i,j)}\\
\\
-\bm{\pi}_{x}^{(i,j)}\bm{\varepsilon}_{x}^{(i,j)}
\end{bmatrix} \tag{27}$$

where the superscript notation $(i,j)$ indicates the $i$th hierarchical level and the $j$th element within the same level. As evident, this hierarchical connectivity is similar to that of PCNs – wherein we highlighted an average of forward and backward prediction errors from independent units – but with the addition of model dynamics represented by the leftmost and rightmost terms of the equation.

What advantages do deep hierarchical models carry compared to a shallow agent? If we consider the structure of Figure 7(b), although it affords more advanced control than the model in Figure 7(a), its uses are still limited to solving simple tasks, e.g., performing operations with the hand. While simultaneous coordination of multiple limbs is certainly possible, it would require complex dynamics functions, with complexity increasing with the number of joints and ramifications of the kinematic chain. Critically, a shallow agent would not be capable of capturing the hierarchical causal relationships inherent to the generative process, which allow one to predict and anticipate the local exchange of forces that would unfold whenever a biased belief over bodily states produces a movement. As mentioned in the Introduction, a deep model is also required if one has to flexibly use external tools for manipulation tasks. Besides the roto-translations occurring in forward kinematics, iterative transformations are also essential in computer vision – where an image can be subject to scaling, shearing, or projection – and, more generally, whenever changing the basis of a coordinate vector.

For these reasons, we can generalize the last model and construct an Intrinsic-Extrinsic (or IE) module [44, 58, 99]. This module is composed of two units and its role is to perform iterative transformations between reference frames. In brief, a unit $\mathcal{U}_{e}^{(i-1)}$ encodes a signal in an extrinsic reference frame, while a second unit $\mathcal{U}_{i}^{(i)}$ contains a generic intrinsic transformation. Applying the latter to the first signal returns a new extrinsic reference frame embedded in a unit $\mathcal{U}_{e}^{(i)}$. More formally, we can define a likelihood function $\bm{g}_{e}$ such that:

$$\bm{x}_{e}^{(i)}=\bm{g}_{e}^{(i)}(\bm{x}_{i}^{(i)},\bm{x}_{e}^{(i-1)})=\bm{T}^{(i)}(\bm{x}_{i}^{(i)})\cdot\bm{x}_{e}^{(i-1)}+\bm{w}_{e}^{(i)} \tag{28}$$

where $\bm{w}_{e}^{(i)}$ is a noise term and $\bm{T}^{(i)}$ is a linear transformation matrix. Backpropagating the extrinsic prediction error $\bm{\varepsilon}_{e}^{(i)}$ leads to simple belief updates:

$$\begin{split}
{\frac{\partial\bm{g}_{e}^{(i)}}{\partial\bm{\mu}_{e}^{(i-1)}}}^{T}\bm{\varepsilon}_{e}^{(i)}&=\bm{T}^{(i)T}\cdot\bm{\varepsilon}_{e}^{(i)}\\
{\frac{\partial\bm{g}_{e}^{(i)}}{\partial\bm{\mu}_{i}^{(i)}}}^{T}\bm{\varepsilon}_{e}^{(i)}&=\frac{\partial\bm{T}^{(i)}}{\partial\bm{\mu}_{i}^{(i)}}\odot[\bm{\varepsilon}_{e}^{(i)}\cdot\bm{\mu}_{e}^{(i-1)T}]
\end{split} \tag{29}$$

where $\odot$ is the element-wise product.
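The two gradients of Equation 29 can be checked numerically on a toy case. The sketch below assumes, purely for illustration, that the transformation is a plain 2D rotation parameterized by a single angle, that noise is absent, and that the entries of the element-wise product in the second line are summed to obtain the scalar gradient with respect to the angle. All function names and numerical values are hypothetical.

```python
import math


def rotation(theta):
    """T(theta): 2D rotation matrix, a simple linear transformation."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def d_rotation(theta):
    """Element-wise derivative of the rotation matrix w.r.t. theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[-s, -c], [c, -s]]


def g_e(theta, x_prev):
    """Noise-free likelihood of Equation 28: T(theta) . x_prev."""
    Tm = rotation(theta)
    return [Tm[r][0] * x_prev[0] + Tm[r][1] * x_prev[1] for r in range(2)]


def grad_extrinsic(theta, eps):
    """First line of Equation 29: T^T . eps."""
    Tm = rotation(theta)
    return [Tm[0][c] * eps[0] + Tm[1][c] * eps[1] for c in range(2)]


def grad_intrinsic(theta, x_prev, eps):
    """Second line of Equation 29, with the entries summed to a scalar."""
    dT = d_rotation(theta)
    outer = [[eps[r] * x_prev[c] for c in range(2)] for r in range(2)]
    return sum(dT[r][c] * outer[r][c] for r in range(2) for c in range(2))


theta, x_prev, eps = 0.7, [1.2, -0.4], [0.3, 0.5]


def projected(th, xp):
    """Scalar eps . g_e, whose derivatives are the two gradients above."""
    out = g_e(th, xp)
    return eps[0] * out[0] + eps[1] * out[1]


# Central finite differences as an independent reference.
h = 1e-6
num_theta = (projected(theta + h, x_prev) - projected(theta - h, x_prev)) / (2 * h)
num_x = [(projected(theta, [x_prev[0] + h, x_prev[1]])
          - projected(theta, [x_prev[0] - h, x_prev[1]])) / (2 * h),
         (projected(theta, [x_prev[0], x_prev[1] + h])
          - projected(theta, [x_prev[0], x_prev[1] - h])) / (2 * h)]

print(grad_intrinsic(theta, x_prev, eps), num_theta)
print(grad_extrinsic(theta, eps), num_x)
```

The analytical and finite-difference values should agree up to numerical precision, which illustrates that the belief updates only require the transformation matrix and its derivative with respect to the intrinsic state.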

Figure 10: (a) In this task, the agent (a 23-DoF human body) has to avoid a moving obstacle; this is possible by defining a repulsive attractor for each extrinsic level. (b) In this task, the agent (a 28-DoF kinematic tree) has to reach four target locations with the extremities of its branches. See [44] for more details. (c) In this task, the agent (the two blue eyes on the left of the top view) has to infer the depth of the red object while fixating on it. The inferred position (and its trajectory) is represented in orange, while the blue trajectory is the center of fixation of the eyes. The bottom two frames show the projection of the object in the eye planes. See [58] for more details.

These equations express the most likely intrinsic and extrinsic states that may have generated the new reference frame. As shown in Figure 9(e), modules are linked through the extrinsic units $\mathcal{U}_{e}^{(i)}$, while $\mathcal{U}_{i}^{(i)}$ performs an internal operation and does not contribute to the hierarchical connectivity. Applying this architecture to kinematics, we can realize a hierarchical model with a multiple-output system, wherein the intrinsic hidden states $\bm{x}_{i}^{(i,j)}$ of a level encode a pair of joint angle and limb length of a single DoF. Iteratively applying roto-translations to an origin (e.g., body-centered) reference frame $\bm{x}_{e}^{(0)}$ – consisting of a Cartesian position and an absolute orientation – will determine the kinematic configuration of the agent in terms of extrinsic coordinates [44]. In addition to this, and differently from PCNs, we can now easily express how every single joint and limb would evolve or, which is the same from an active inference perspective, how the agent intends to move its joints and limbs, affording a highly advanced control as demonstrated by the simulations of Figures 10(a) and 10(b). Besides modeling limb dynamics, the IE module can also be applied to non-affine transformations, e.g., perspective projections. 
As displayed in Figure 10(c), this can be useful for estimating the depth of an object via parallel predictions (e.g., from the eyes or multiple cameras) [58] – a process that active inference casts in terms of target fixation [67]. The modularity of this architecture allows the agent to define dynamic attractors in the 2D projected planes, in the 3D reference frames of the eyes, or as simple vergence-accommodation angles.
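The iterative roto-translations described above for the kinematic case can be sketched in a few lines, assuming a planar chain in which each extrinsic state is a triple (x, y, absolute orientation) and each intrinsic state is a pair (joint angle, limb length). The names `extrinsic_step` and `forward_kinematics` are hypothetical and only illustrate the generative direction of the model, not its inference.

```python
import numpy as np

def extrinsic_step(frame, angle, length):
    """Map a parent frame (x, y, phi) to the child frame after one joint and limb."""
    x, y, phi = frame
    phi_new = phi + angle                      # absolute orientation accumulates
    return np.array([x + length * np.cos(phi_new),
                     y + length * np.sin(phi_new),
                     phi_new])

def forward_kinematics(origin, intrinsic):
    """Apply one roto-translation per level, starting from the origin frame."""
    frames = [np.asarray(origin, dtype=float)]
    for angle, length in intrinsic:
        frames.append(extrinsic_step(frames[-1], angle, length))
    return np.stack(frames)                    # one row per level, origin included

# Usage: a 2-DoF arm with unit limbs, first joint at 0 rad and second at pi/2 rad.
frames = forward_kinematics([0.0, 0.0, 0.0], [(0.0, 1.0), (np.pi / 2, 1.0)])
```

Every row of `frames` is the extrinsic state of one level, so attractors or observations could in principle be attached to any of them.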

3.3 The self, the objects, and the others

Describing Figure 7(b), we passed over a critical mechanism introduced at the beginning: the characterization of objects for dynamic goal-directed behavior. Recall that the hidden states encode in parallel not only the self but also environmental entities; however, the agent’s model now describes the generative process hierarchically:

$$\bm{g}_{e}^{(i)}(\bm{x}_{i}^{(i)},\bm{x}_{e}^{(i-1)})=\begin{bmatrix}\bm{T}_{1}^{(i)}(\bm{x}_{i,1}^{(i)})\cdot\bm{x}_{e,1}^{(i-1)}&\dots&\bm{T}_{N}^{(i)}(\bm{x}_{i,N}^{(i)})\cdot\bm{x}_{e,N}^{(i-1)}\end{bmatrix} \tag{30}$$

For the self, this has a simple explanation, i.e., it just generates, one after the other, the positions of every segment of the kinematic chain depending on its joint angles. Concerning an object, attaching its visual observation to a second extrinsic hidden state of a specific level would lead the latter to encode its Cartesian position. How then should all the previous levels be interpreted? If the generative model maintains the same hierarchy for both the self and the object, backpropagating the extrinsic prediction errors of the second component will eventually infer a potential configuration of the agent in relation to the object. For instance, if the object is linked to the last (i.e., hand) level, this would represent the hand at the object location, while all the previous levels would represent appropriate intermediate positions and angles generating that final location. In other words, the additional factorizations of hidden states and likelihoods do not encode simple target angles or positions as before, but a whole configuration of the self that the agent thinks to be suitable for interacting with an object. Since each level can express a particular dynamics through its hidden causes, the inference of this potential configuration is steered to match both the object’s affordances and the agent’s intentions (e.g., grasping a cup by the handle or with the whole hand). Such inferred beliefs would be subject only to exteroceptive information coming from the objects, while proprioceptive states would be used only to update the agent’s belief over its current configuration.
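To illustrate how an object observation attached to the hand level can shape a whole potential configuration, the following sketch replaces the hierarchical message passing with plain gradient descent on the extrinsic prediction error of a planar 2-DoF arm. This is a deliberate simplification: the limb lengths, the learning rate, and the function names are assumptions made for the example, and the full model would propagate the error level by level rather than through a single Jacobian.

```python
import numpy as np

LENGTHS = np.array([1.0, 1.0])  # hypothetical limb lengths

def hand_position(angles):
    """Cartesian position of the last level of the chain."""
    cum = np.cumsum(angles)                    # absolute orientation of each limb
    return np.array([np.sum(LENGTHS * np.cos(cum)),
                     np.sum(LENGTHS * np.sin(cum))])

def jacobian(angles):
    """Derivative of the hand position with respect to each joint angle."""
    cum = np.cumsum(angles)
    J = np.zeros((2, len(angles)))
    for j in range(len(angles)):
        # joint j rotates every limb from j onward
        J[0, j] = -np.sum(LENGTHS[j:] * np.sin(cum[j:]))
        J[1, j] = np.sum(LENGTHS[j:] * np.cos(cum[j:]))
    return J

def infer_potential_configuration(object_pos, angles0, rate=0.1, steps=2000):
    """Infer joint angles that would place the hand at the observed object."""
    angles = np.array(angles0, dtype=float)
    for _ in range(steps):
        eps = object_pos - hand_position(angles)    # error of the object component
        angles += rate * jacobian(angles).T @ eps   # error sent back to the joints
    return angles
```

The returned angles play the role of the second component in Equation 30: they describe where the arm could be, not where proprioception says it currently is.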

Figure 11: Interactions between an agent (a 2-DoF light-blue arm), another agent (a 3-DoF purple arm), and an object (a red ball). Generative model of the first agent, composed of three parallel pathways representing the kinematic structures of both agents. For clarity, lateral connections among the model components of each level are not shown. Both the self and the other (or object) in relation to the self depend on the same body-centered reference frame. The first component is subject to both proprioception (yellow dotted lines) and exteroception (red dotted lines), while the other two components are only inferred via exteroception. In this case, the interaction with the second agent just depends on the observation of its last level, leading, e.g., to a hand-shaking action.

Besides modeling object dynamics, this strategy is also useful in multi-agent contexts. One could maintain a hierarchical generative model regarding the kinematic chain of another agent, which would be inferred by exteroceptive observations about all its positions and joint angles, starting from a different body-centered reference frame. As shown in Figure 11, the goal-directed method used for external objects reflects in this case as well: the agent could represent, by a parallel hierarchical pathway, a second agent in relation to itself, expressing a particular kind of interaction (e.g., the hand of the second agent in terms of its own, resulting in a shaking action). These two cases could be interpreted, from a biological perspective, as simulating the functioning of mirror neurons, firing whenever a subject executes a voluntary goal-directed action or when that action is performed by other subjects [100]. Building an internal model with the kinematic chain of the others – both per se and in relation to the self – could be critical to predict (thus, to understand) their intentions. In this view, neural activity results because the agent makes constant predictions over their kinematic structures depending on its hypotheses and the current context [101, 98].

The relationships between the self, the objects, and other agents under active inference may be better understood from the simulation of Figure 12, showing two agents with incompatible goals that depend on each other. Here, both agents are able to infer parallel representations of different kinematic chains, using an effective decomposition of potential and real configurations. Note how one’s current belief is always in between the future state to realize and the actual configuration; this speaks to one of the fundamental aspects of active inference, i.e., that our beliefs never really reflect the state of affairs of the world, but are always biased toward preferred states – eventually driving action. In general, bodily states, objects, or other agents can all be manipulated in reference frames appropriate for a specific context; this is in line with the hypothesis that cortical columns use object-centered reference frames to encode external elements and more abstract entities [102]. This approach also has some analogies with Active Predictive Coding [103] and Recursive Neural Programs [104], which addressed the part-whole hierarchy learning problem in computer vision by recursively applying reference frame transformations to parts of a scene.

Figure 12: (a) In this task, two agents with different kinematic chains (of 5 and 3 DoF, respectively) have two incompatible goals: the first agent (in red) has to reach the elbow of the second agent (in blue), while the second agent has to reach the hand of the first agent. Note that after an initial approaching phase from both, the second agent gradually retracts while trying to touch the hand of the first agent. (b) Beliefs of both agents, compared with their actual configurations (displayed in light red for the first agent, and in light blue for the second). The top and bottom graphs respectively show the beliefs of the first and second agents. Specifically, they show the belief over the actual configuration of the self (in orange or cyan), the belief over the other in relation to the self (in dark red or dark blue), and the belief over the other (in green or purple). Note the belief succession during goal-directed behavior: the potential configuration pointing toward one’s intention (either the elbow or the hand), followed by the actual belief, and then by the actual arm. Also note a slight delay in the inference of the configuration of the other.

4 The hybrid unit

All the hierarchical models we presented fall short of simulating real-life applications that involve planning actions ahead. In this chapter, we turn to the problem of how to integrate discrete decision-making into continuous motor control. In doing this, we will revisit the basic unit of the first chapter, finally using the second input – the prior over the hidden causes.

4.1 Dynamic inference by model reduction

Consider the unit in Figure 5(b): recall that some sort of multi-step behavior was achieved, which however depended on higher-level priors about different modalities. In most cases, we need to switch intentions based on lower-level information, affording a more dynamic and less uncertain behavior. Taking a pick-and-place operation as an example, an IE module would be more confident about the success of the first reaching movement if it could rely not just on a tactile belief but also on its intrinsic and extrinsic hidden states. In other words, the hidden causes $\bm{v}$ should manage to effectively use both their prior $\bm{\eta}_{v}$ and the dynamics prediction error $\bm{\varepsilon}_{x}$. The latter assumes two different roles depending on which pathway it flows into: the gradient with respect to the hidden states infers the position that is most likely to have generated the current trajectory; conversely, the gradient with respect to the hidden causes infers the most likely combination of gains $v_{m}$, signaling the current status of the trajectory and resulting in a dynamic modulation of intentions. However, this pathway is somewhat problematic because the hidden causes are generated by Gaussian distributions and do not encode proper probabilities. Thus, the gradient $\partial_{\nu}\bm{f}_{x}$ infers just one of many possible combinations of gains, and it makes sense as “inferring the most likely intention to have generated the current trajectory” only in simple contexts and if appropriate assumptions are made. 
Thus, to implement a correct intention selection, we assume that the hidden causes are generated from a categorical distribution:

$$p(\bm{v})=\mathrm{Cat}(\bm{H}_{v}) \tag{31}$$

where $\bm{H}_{v}$ is an intention preference like $\bm{\eta}_{v}$. In this way, each discrete element of $\bm{v}$ represents the probability that a specific continuous trajectory will be realized.

However, we now have the problem of how to convert discrete hidden causes to continuous hidden states, and vice versa. This can be done via Bayesian model reduction, a technique used to constrain the complexity of full posterior models into simpler and more restrictive (formally called reduced) distributions [61, 62]. Reduced means that the likelihood of some data is equal to that of the full model and the only difference rests upon the specification of the priors – hence, the posterior of a reduced model $m$ can be expressed in terms of the posterior of the full model:

$$p(\tilde{\bm{x}}|\bm{o},m)=p(\tilde{\bm{x}}|\bm{o})\frac{p(\tilde{\bm{x}}|m)p(\bm{o})}{p(\tilde{\bm{x}})p(\bm{o}|m)} \tag{32}$$

In our case, model reduction means to explain the infinite values a continuous signal may assume by a discrete set of hypotheses. In active inference, models that combine discrete and continuous signals are called hybrid or mixed [48, 59], and a simplified version is shown in Figure 13(a). We can cast this procedure into the usual message passing, where top-down and bottom-up messages between the two domains respectively perform a Bayesian Model Average (BMA) of reduced priors and a Bayesian Model Comparison (BMC) of reduced sensory evidence. In conventional hybrid models, discrete hidden states generate priors for the continuous hidden causes by weighting the probability of each discrete state with a specific reduced prior, which thus represents one among many alternatives that the agent thinks to be the cause of what it is perceiving [48]. Conversely, the hidden causes posterior is compared with such reduced priors to find which one among them could be the best explanation, taking into account their discrete probabilities before observing sensory evidence.
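For intuition about the reduction in Equation 32, consider the scalar Gaussian case, in which the posterior ratio amounts to adding and subtracting natural parameters. The sketch below assumes a Gaussian likelihood on a single hidden state; the helper names and the numerical values are illustrative only.

```python
import numpy as np

def reduce_posterior(mu, p, eta, pi, eta_m, pi_m):
    """Reduced posterior (mean, precision) obtained from the full posterior (mu, p),
    the full prior (eta, pi), and the reduced prior (eta_m, pi_m)."""
    p_m = p + pi_m - pi                        # swap the prior precision
    mu_m = (p * mu + pi_m * eta_m - pi * eta) / p_m
    return mu_m, p_m

def conjugate_posterior(obs, pi_o, eta, pi):
    """Exact posterior for a Gaussian prior and a Gaussian likelihood on the state."""
    p = pi + pi_o
    return (pi * eta + pi_o * obs) / p, p

# Usage: the full model sees the data once; the reduced posterior follows
# from it without revisiting the observation.
mu, p = conjugate_posterior(obs=2.0, pi_o=4.0, eta=0.0, pi=1.0)
mu_m, p_m = reduce_posterior(mu, p, eta=0.0, pi=1.0, eta_m=1.0, pi_m=3.0)
```

In this conjugate setting, the result coincides with the posterior that would be obtained by conditioning the reduced prior on the same observation directly.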

Figure 13: (a) A conventional hybrid architecture, composed of a discrete model at the top and a continuous model at the bottom. For simplicity, we assume that the continuous prior $\bm{\eta}_{v}$ is directly conditioned on discrete hidden states $\bm{s}$. We will cover discrete models in the following section. Here, top-down and bottom-up messages are computed by Bayesian model reduction of some static priors, without the possibility for dynamic planning. (b) Factor graph of the unit with hybrid control. The hidden causes are now generated from a categorical distribution, such that instead of inferring a combination of continuous intention gains, the model correctly infers the most likely intention associated with the current dynamic trajectory. This is done by computing the free energy $\bm{E}_{m}$ corresponding to each intention. Performing a Bayesian model reduction between discrete hidden causes and continuous hidden states lets the agent update its reduced priors at each time step. (c) A simplified graph displaying only the exchange of top-down (red) and bottom-up (blue and green) messages of the hybrid control.

Averaging and comparing continuous alternatives that are fixed and determined a priori results in the agent’s inability to correctly operate in a changing environment. For instance, if the agent expects to find an object in one of two locations, it will always reach for one of these two initial guesses, even if the object has been moved to a third location. How then to use the newly available evidence to update our reduced assumptions? By considering the hidden causes as generated from a categorical distribution – as in Equation 31 – we can compare the posterior over the hidden states with the output of the dynamics functions $\bm{f}_{m}$, which thus act as the agent’s reduced priors [69]. More formally, we define $M$ reduced prior probability distributions and a full prior model:

$$\begin{aligned}p(\bm{x}^{\prime}|\bm{x},m)&=\mathcal{N}(\bm{f}_{m}(\bm{x}),\bm{\pi}_{x,m}^{-1})\\ p(\bm{x}^{\prime}|\bm{x})&=\mathcal{N}(\bm{\eta}_{x}^{\prime},\bm{\pi}_{x}^{-1})\end{aligned} \tag{33}$$

where $\bm{\eta}_{x}^{\prime}$ is the full prior. Note here that the reduced priors have the same form as Equation 20 but are not directly conditioned on the hidden causes:

$$\bm{f}_{m}(\bm{x})=\bm{e}_{i,m} \tag{34}$$

Next, we define the corresponding posterior models:

$$\begin{aligned}q(\bm{x}^{\prime}|m)&=\mathcal{N}(\bm{\mu}_{m}^{\prime},\bm{p}_{x,m}^{-1})\\ q(\bm{x}^{\prime})&=\mathcal{N}(\bm{\mu}^{\prime},\bm{p}_{x}^{-1})\end{aligned} \tag{35}$$

Now, we can find the full prior and its prediction error by averaging the continuous trajectories with their respective discrete probabilities:

$$\begin{aligned}\bm{\eta}_{x}^{\prime}&=\sum_{m}v_{m}\bm{f}_{m}(\bm{x})=\sum_{m}v_{m}\bm{e}_{i,m}\\ \bm{\varepsilon}_{x}&=\bm{\mu}^{\prime}-\bm{\eta}_{x}^{\prime}\end{aligned} \tag{36}$$

which have the same form as Equations 20 and 21. In fact, the hidden states still perceive a single dynamics prediction error containing the total contribution of every intention. As concerns the bottom-up messages $\bm{l}$, we first write the free energy of each reduced model in terms of the full model. As before, maximizing each reduced free energy makes it approximate the log evidence:

$$\mathcal{F}(m)=\mathcal{F}-\ln\int\frac{p(\tilde{\bm{x}}|m)}{p(\tilde{\bm{x}})}q(\tilde{\bm{x}})d\tilde{\bm{x}}\approx\ln p(\bm{o}|m) \tag{37}$$

As a result, the free energy related to each dynamics function $\bm{f}_{m}$ depends on the approximate posterior $q(\tilde{\bm{x}})$ of the full model, avoiding the computation of the reduced posteriors. Under a Gaussian approximation, the $m$th reduced free energy breaks down to a simple formula and the bottom-up messages $\bm{l}$ are found by accumulating the log evidence associated with every intention for a certain amount of continuous time $T$:

$$\begin{aligned}l_{m}&=\int_{0}^{T}\mathcal{L}_{m}dt\\ \mathcal{L}_{m}&=\frac{1}{2}(\bm{\mu}_{m}^{\prime T}\bm{p}_{x,m}\bm{\mu}_{m}^{\prime}-\bm{f}_{m}(\bm{x})^{T}\bm{\pi}_{x,m}\bm{f}_{m}(\bm{x})-\bm{\mu}^{\prime T}\bm{p}_{x}\bm{\mu}^{\prime}+\bm{\eta}_{x}^{\prime T}\bm{\pi}_{x}\bm{\eta}_{x}^{\prime})\end{aligned} \tag{38}$$

Then, a BMC turns into computing the softmax of a vector comprising the free energy $E_{m}$ of every reduced model. This quantity compares the prior surprise $-\ln\bm{H}_{v}$ with the accumulated log evidence:

$$\bm{v}=\sigma(-\bm{E})=\sigma(\ln\bm{H}_{v}+\bm{l}) \tag{39}$$

See [61, 62] for a full derivation of BMC under the Laplace assumption, and [69] for more details about the presented approach. Equation 39 is the discrete analogue of Equation 14, but now the bottom-up message encodes a proper discrete distribution and can be used to infer the most likely intention associated with the current dynamic trajectory.
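Equations 38 and 39 can be tried out on scalar states with a short sketch. It assumes that the reduced posterior mean and precision follow the standard Gaussian model-reduction relations (full posterior minus full prior plus reduced prior in natural parameters); this choice, the function names, and the numbers in the usage lines are assumptions made for illustration and are not taken from [69].

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def reduced_log_evidence(mu, p, eta, pi, f_m, pi_m):
    """Integrand L_m of Eq. 38 for scalar states; f_m may hold one entry per intention.
    mu, p: full posterior mean and precision; eta, pi: full prior mean and precision."""
    p_m = p - pi + pi_m                               # assumed reduced posterior precision
    mu_m = (p * mu - pi * eta + pi_m * f_m) / p_m     # assumed reduced posterior mean
    return 0.5 * (mu_m * p_m * mu_m - f_m * pi_m * f_m - mu * p * mu + eta * pi * eta)

def compare_intentions(H_v, l):
    """Eq. 39: combine the discrete prior with the accumulated log evidence."""
    return softmax(np.log(H_v) + l)

# Usage: two intentions predicting velocities 1 and 0, believed velocity 0.9.
f = np.array([1.0, 0.0])
L = reduced_log_evidence(mu=0.9, p=2.0, eta=0.5, pi=1.0, f_m=f, pi_m=1.0)
v = compare_intentions(np.array([0.5, 0.5]), L * 1.0)   # evidence held for T = 1
```

With these values the first intention, whose predicted trajectory is closer to the believed one, receives the larger probability.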

The factor graph of this model, which we call a hybrid unit, is displayed in Figure 13(b), and its inferential process at each continuous instant is better understood if we analyze separately the three different pathways shown in Figure 13(c): (i) during the forward pass, the unit receives a discrete intention prior $\bm{H}_{v}$, performs a BMA with dynamically generated trajectories $\bm{f}_{m}(\bm{x})$ that manipulate the inferred beliefs of every environmental entity $\bm{x}_{n}$, and imposes a prior $\bm{\eta}_{x}^{\prime}$ over the 1st order; (ii) through the first backward pass, the unit accumulates evidence for the most likely intention related to the current trajectory by comparing it to the ones generated by the dynamics functions; (iii) in the second backward pass, the unit propagates the dynamics prediction error back to the 0th order to infer the most likely continuous state associated with the trajectory, eventually generating biased observations. After a period $T$, the unit finally computes the difference between the discrete prior and the accumulated evidence, generates a new combination of intentions, and the process starts over.
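A toy version of one such period can make the first two pathways concrete. The sketch below covers only pathways (i) and (ii) for a scalar state; pathway (iii) is omitted and the believed velocity is supplied by a hypothetical callable instead of being inferred. The attractor form of the reduced priors, the precision values, and all names are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def hybrid_cycle(x, velocity, targets, H_v, v,
                 pi_x=1.0, pi_m=1.0, p_x=2.0, dt=0.01, steps=100):
    """Run one period T = steps * dt and return the new hidden causes and position.
    velocity: callable giving the believed 1st-order state; targets: one per intention."""
    l = np.zeros(len(targets))
    for _ in range(steps):
        f = targets - x                        # reduced priors: pull toward each target
        eta = v @ f                            # (i) BMA, full prior over the 1st order
        mu_d = velocity(x)                     # believed velocity (given, not inferred)
        p_m = p_x - pi_x + pi_m                # assumed reduced posterior precision
        mu_m = (p_x * mu_d - pi_x * eta + pi_m * f) / p_m
        L = 0.5 * (p_m * mu_m**2 - pi_m * f**2 - p_x * mu_d**2 + pi_x * eta**2)
        l += L * dt                            # (ii) accumulate log evidence (Eq. 38)
        x = x + mu_d * dt                      # integrate the believed trajectory
    return softmax(np.log(H_v) + l), x         # Eq. 39 at the end of the period

# Usage: the state drifts toward the second target, so its intention should win.
uniform = np.array([0.5, 0.5])
v_new, x_new = hybrid_cycle(0.0, lambda x: 1.0 - x, np.array([-1.0, 1.0]), uniform, uniform)
```

In a fuller implementation, `v_new` would drive the BMA of the next period and the dynamics prediction error would also update the belief over the position.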

Figure 14: In this task, the agent has to infer which one among two objects it is following. The arm and the two objects move along a circular path, but each with a different velocity. Sequence of time frames (left), and dynamics of hidden causes (right). The two hidden causes are associated with different reaching intentions. As the hand gradually moves away from the first target and approaches the second target, the dynamic accumulation of evidence leads to an increase in the second hidden cause. See [69] for more details.

This kind of dynamic inference has several utilities, e.g., it can be used to infer which one among multiple objects an agent is following – as exemplified in Figure 14 – by generating trajectories for different objects and comparing them with the one it is perceiving [69]. As a last note, the dynamics precisions of Equations 33 and 38 have here an interesting interpretation that mirrors that of the observation precisions $\bm{\pi}_{o}$. Active inference and predictive coding assume that whenever an agent perceives high noise about a sensory modality, the precision of that generative model will decrease because it cannot be trusted for understanding the state of affairs of the world [11, 12]. In addition, the dualism between action and perception inherent to the free energy principle tells us that the optimization of precisions – which are thought to be encoded as synaptic gains – could play a crucial role in attention mechanisms that selectively sample sensory data [105, 106]. Based on this assumption, we could interpret a low precision $\bm{\pi}_{x,m}$ as the agent’s decreased confidence in that intention for minimizing the prediction errors in the current context; however, a low intention precision could also mean that the agent does not intend to rely on it for the realization of a desired goal. In short, there is a dual interpretation of intention precisions about explaining a situation (e.g., the result of a grasping action to understand an object far away from the hand) or solving a task (e.g., grasping an object when it is out of reach). 
This perspective unveils an additional mechanism besides the fast inference of hidden causes that we mentioned before: a slow learning of reduced precisions that lets the agent score – and, crucially, focus on – those intentions that would be appropriate for a specific scenario [69].

4.2 A discrete interface for dynamic planning

Numerous studies have demonstrated that the brains of athletes are marked by a higher activation of posterior and subcortical regions that involves little or no conscious thinking, producing fluid transitions between different motions; in contrast, the brain of a novice places a higher demand on prefrontal computations, which results in lower performance [107, 108, 109]. From an active inference perspective, we can compare the proficiency of athletes with the continuous model of Figure 5(b), corresponding to the subcortical sensorimotor loops. This model encodes a transition mechanism that is not very flexible, but which precisely for this reason allows it to react much more rapidly to environmental stimuli, e.g., when grasping objects moving at high speed [75]. In general, this strategy can be very effective when the environment has limited uncertainty and the task to be solved comprises a rigid sequence of actions, which the agent has already correctly learned. However, suppose that the agent is introduced to a novel task, or to a highly complex task that requires careful thinking about the imminent future. In this case, it should be capable of replanning the correct sequence of actions if something unexpected happens, and a high-level belief always producing an a-priori-determined behavior for the hidden causes would fall short in completing the task.

Figure 15: Interface between a discrete model (at the top) and several hybrid units (at the bottom). For clarity, the hidden states factorization of each unit is not displayed. The discrete hidden states $\bm{s}$ at time $\tau+1$ are computed – from the current hidden states $\bm{s}_{\tau}$ – by choosing some policy $\bm{\pi}_{\pi}$ related to a specific transition distribution encoded in $\bm{B}$. The best policy at any moment is the sequence of actions that is most likely to minimize the free energy $\mathcal{G}$ that the agent expects to perceive in the future. As a result, the agent will try to sample those observations that conform to its preference $\bm{C}$. The discrete hidden causes $\bm{v}^{(i)}$ at time $\tau$ are directly generated from discrete hidden states $\bm{s}_{\tau}$ through likelihood matrices $\bm{A}^{(i)}$, thus affording dynamic planning, synchronized behavior, and inference with multiple evidences.

Having replaced the continuous hidden causes of Figure 5(b) with discrete hidden causes in Figure 13(b), we can now endow the agent with planning capabilities through a discrete model composed of the following distributions – as shown in Figure 15:

$$p(\bm{s}_{1:T},\bm{v}_{1:T},\bm{\pi})=p(\bm{s}_{1})\,p(\bm{\pi})\prod_{\tau}p(\bm{v}_{\tau}|\bm{s}_{\tau})\,p(\bm{s}_{\tau}|\bm{s}_{\tau-1},\bm{\pi}) \tag{40}$$

where:

$$\begin{aligned}
p(\bm{s}_{1})&=Cat(\bm{D}) & p(\bm{v}_{\tau}|\bm{s}_{\tau})&=Cat(\bm{A})\\
p(\bm{\pi})&=Cat(\bm{E}) & p(\bm{s}_{\tau}|\bm{s}_{\tau-1},\bm{\pi})&=Cat(\bm{B}_{\pi,\tau})
\end{aligned} \tag{41}$$
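As a minimal sketch of the factorization in Equations 40 and 41, the joint probability of a short trajectory can be evaluated with plain Python lists. The toy dimensions, the numerical values, the action names `stay` and `move`, and the helper `joint_probability` are all hypothetical and only serve as an illustration; the first time step is assumed to use the prior $\bm{D}$ alone, since no previous state exists.

```python
# Hypothetical toy model: 2 hidden states, 2 hidden causes, 2 policies of 2 actions (T = 3).
D = [0.5, 0.5]                        # prior over initial states, p(s_1) = Cat(D)
E = [0.5, 0.5]                        # prior over policies, p(pi) = Cat(E)
A = [[0.9, 0.2],                      # A[v][s] = p(v | s); every column sums to 1
     [0.1, 0.8]]
B = {                                 # B[action][s_next][s_prev] = p(s_next | s_prev, action)
    "stay": [[1.0, 0.0], [0.0, 1.0]],
    "move": [[0.0, 1.0], [1.0, 0.0]],
}
policies = [["stay", "stay"], ["move", "stay"]]   # policies are sequences of actions


def joint_probability(states, causes, policy_index):
    """Evaluate p(s_{1:T}, v_{1:T}, pi) with the factorization of Equation 40."""
    policy = policies[policy_index]
    prob = D[states[0]] * E[policy_index]          # p(s_1) p(pi)
    for tau, (s, v) in enumerate(zip(states, causes)):
        prob *= A[v][s]                            # p(v_tau | s_tau)
        if tau > 0:                                # assumed: no transition term at the first step
            action = policy[tau - 1]
            prob *= B[action][s][states[tau - 1]]  # p(s_tau | s_{tau-1}, pi)
    return prob


# Usage: probability of moving from state 0 to state 1 and staying there (about 0.144).
example = joint_probability([0, 1, 1], [0, 1, 1], policy_index=1)
```

Because every factor is a normalized categorical distribution, summing `joint_probability` over all state sequences, cause sequences, and policies returns one, which is a quick sanity check on the matrices.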

Here, $\bm{A}$, $\bm{B}$, $\bm{D}$ are the likelihood matrix, transition matrix, and prior, $\bm{\pi}$ are the policies – which are not state-action mappings as in RL but sequences of actions – with prior $\bm{E}$, and $\bm{s}_{\tau}$ are the discrete hidden states at time $\tau$. These quantities have strict analogies with their continuous counterparts of Equation 3, i.e., the likelihood function $\bm{g}$, the dynamics function $\bm{f}$, the prior $\bm{\eta}_{x}$, and the hidden causes $\bm{v}$, with the difference that the hidden states do not encode instantaneous paths expressed in generalized coordinates, but sequences of future states defined by discrete variables. Here, we wish to infer the posterior distribution:

$$p(\bm{s}_{1:T},\bm{\pi}|\bm{v}_{1:T})=\frac{p(\bm{v}_{1:T}|\bm{s}_{1:T},\bm{\pi})\,p(\bm{s}_{1:T},\bm{\pi})}{p(\bm{v}_{1:T})} \tag{42}$$

As before, this requires computing the intractable model evidence $p(\bm{v}_{1:T})$, so we resort to a variational approach: expressing the approximate posterior by its sufficient statistics $\bm{s}_{\pi,\tau}$ and conditioning upon a specific policy:

$$\begin{aligned}
p(\bm{s}_{1:T}|\bm{v}_{1:T},\bm{\pi})&\approx q(\bm{s}_{1:T},\bm{\pi})=q(\bm{\pi})\prod_{\tau}^{T}q(\bm{s}_{\tau}|\bm{\pi})\\
q(\bm{\pi})&=Cat(\bm{\pi})\\
q(\bm{s}_{\tau}|\bm{\pi})&=Cat(\bm{s}_{\pi,\tau})
\end{aligned} \tag{43}$$

we infer the most likely discrete hidden states at time $\tau$ by computing the gradient of the related free energy $\mathcal{F}_{\pi}$ of that policy:

$$\bm{s}_{\pi,\tau}=\sigma\Big(\ln\bm{B}_{\pi,\tau-1}\bm{s}_{\pi,\tau-1}+\bm{B}_{\pi,\tau+1}^{T}\bm{s}_{\pi,\tau+1}+\sum_{i}\ln\bm{A}^{(i)^{T}}\bm{v}^{(i)}_{\tau}\Big) \tag{44}$$

where we applied a softmax function to ensure that it is a proper probability distribution. Here, $\bm{A}^{(i)}$ is the likelihood matrix that sends predictions for the $i$th hybrid unit. In fact, if we connect several units to the discrete model, each of them has an independent interface whereby the discrete model computes different signals and waits for the next step $\tau+1$, when it can infer its hidden states based on multiple accumulated evidences. Recall that in the combined structure of Figure 15, the role of the hybrid unit was to predict a dynamic trajectory from a discrete intention prior, and to infer the most likely intention in a continuous period $T$. But now the intention prior is generated from a high-level policy that decides which action to take next in the current situation:

$$\bm{v}_{\tau}=\sigma(\ln\bm{A}\bm{s}_{\tau}+\bm{l}_{\tau}) \tag{45}$$

where $\bm{l}_{\tau}$ is the bottom-up message at time $\tau$. The inference of the policies additionally considers unobserved outcomes as random variables, finding the most likely sequence of actions that will lead to some preferred outcomes. More formally, the policy posterior $q(\bm{\pi})$ is found by comparing the policy prior with the expected free energy $\mathcal{G}$, which is defined as the free energy that the agent expects to perceive in the future:

$$\bm{\pi}=\sigma(\ln\bm{E}-\mathcal{G}) \tag{46}$$
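Equations 45 and 46 share the same pattern: a log-domain prior is combined with an additional message and passed through a softmax. The following is a minimal sketch under stated assumptions: $\ln\bm{A}\bm{s}_{\tau}$ is read as the element-wise logarithm of $\bm{A}$ multiplied by $\bm{s}_{\tau}$, a small constant `eps` guards against the logarithm of zero, and the helper names `softmax`, `log_matvec`, `hidden_causes`, and `policy_posterior` are hypothetical.

```python
import math


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]


def log_matvec(M, x, eps=1e-16):
    """Return (ln M) x, with eps avoiding log(0) for zero entries (an assumption)."""
    return [sum(math.log(m_ij + eps) * x_j for m_ij, x_j in zip(row, x)) for row in M]


def hidden_causes(A, s, l):
    """Equation 45: v = softmax(ln A s + l), with l the bottom-up message."""
    top_down = log_matvec(A, s)
    return softmax([t + b for t, b in zip(top_down, l)])


def policy_posterior(E, G, eps=1e-16):
    """Equation 46: pi = softmax(ln E - G), with one entry of G per policy."""
    return softmax([math.log(e + eps) - g for e, g in zip(E, G)])


# Usage with hypothetical values: a certain state, a flat bottom-up message,
# and two policies whose expected free energies differ by one nat.
v = hidden_causes([[0.9, 0.2], [0.1, 0.8]], s=[1.0, 0.0], l=[0.0, 0.0])
pi = policy_posterior(E=[0.5, 0.5], G=[1.0, 2.0])
```

With a flat bottom-up message and a certain state, the hidden causes simply reproduce the corresponding column of the likelihood matrix; a non-zero message shifts the result toward the evidence accumulated by the hybrid unit.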

Assuming that an agent has some preference $\bm{C}$ about future outcomes, the expected free energy $\mathcal{G}_{\pi}$ under policy $\pi$ will consist of a pragmatic or goal-seeking term toward that preference, and an epistemic or uncertainty-reducing term (see [55] for more details):

$$\begin{aligned}
\mathcal{G}_{\pi}&\approx\sum_{\tau}D_{KL}[q(\bm{v}_{\tau}|\bm{\pi})\,||\,p(\bm{v}_{\tau}|\bm{C})]+\mathbb{E}_{q(\bm{s}_{\tau}|\bm{s}_{\tau-1},\bm{\pi})}[H[p(\bm{v}_{\tau}|\bm{s}_{\tau})]]\\
&=\sum_{\tau}\bm{v}_{\pi,\tau}(\ln\bm{v}_{\pi,\tau}-\bm{C}_{\tau})+\bm{s}_{\pi,\tau}\bm{H}_{A}
\end{aligned} \tag{47}$$

where:

$$\bm{v}_{\pi,\tau}=\bm{A}\bm{s}_{\pi,\tau} \qquad \bm{C}_{\tau}=\ln p(\bm{v}_{\tau}|\bm{C}) \qquad \bm{H}_{A}=-diag(\bm{A}^{T}\ln\bm{A}) \tag{48}$$

Note in the above equations that the likelihood matrix $\bm{A}$ expresses a conditional probability over the discrete hidden causes $\bm{v}_{\tau}$. As in conventional hybrid models, the discrete hidden states are linked to the hidden causes, but now the latter directly act as discrete observations generated by the likelihood matrix, which thus replaces the prior $\bm{H}_{v}$ in Equation 31. In sum, computing the posterior probability over policies $\bm{\pi}$ turns into finding the best action that makes the agent conform to the dual objective defined by $\mathcal{G}$. Here, the discrete actions are not intended as actual motor commands similar to Equation 9, but as abstract actions over high-level representations. In fact, the hierarchical nature of discrete models in active inference makes it possible to perform decision-making with a separation of temporal scales, wherein a specific level can generate and infer the states and the paths of the level below [110, 111, 112]. Further evaluating the consequences of an action for a longer time horizon affords more advanced planning called sophisticated inference [113]. Computing actions with the free energy of the future is different from the motor control of continuous models, which only minimize the free energy of present states.
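The second line of Equation 47, together with the definitions of Equation 48, can be sketched in a few lines of Python. This is an illustration under assumptions rather than a reference implementation: the future states under a policy are obtained by a simple forward pass through the transition matrices, the preference is taken as time-independent, `eps` guards against the logarithm of zero, and the names `matvec`, `ambiguity`, and `expected_free_energy` are hypothetical.

```python
import math


def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def ambiguity(A, eps=1e-16):
    """H_A = -diag(A^T ln A): the entropy of each column of A (Equation 48)."""
    n_states = len(A[0])
    return [-sum(row[j] * math.log(row[j] + eps) for row in A) for j in range(n_states)]


def expected_free_energy(A, B_seq, s0, log_C, eps=1e-16):
    """G_pi of Equation 47 for one policy, given as a list of transition matrices."""
    H_A = ambiguity(A)
    s, G = s0, 0.0
    for B in B_seq:
        s = matvec(B, s)   # predicted states s_{pi,tau} (assumed forward pass)
        v = matvec(A, s)   # predicted hidden causes v_{pi,tau} = A s_{pi,tau}
        risk = sum(vi * (math.log(vi + eps) - ci) for vi, ci in zip(v, log_C))
        G += risk + sum(si * hi for si, hi in zip(s, H_A))   # pragmatic + epistemic terms
    return G


# Usage with hypothetical values: the agent starts in state 0 and prefers the second cause.
A = [[0.9, 0.1], [0.1, 0.9]]
stay = [[1.0, 0.0], [0.0, 1.0]]
move = [[0.0, 1.0], [1.0, 0.0]]
log_C = [math.log(0.1), math.log(0.9)]
G_stay = expected_free_energy(A, [stay], [1.0, 0.0], log_C)
G_move = expected_free_energy(A, [move], [1.0, 0.0], log_C)
```

In this toy setting both policies carry the same ambiguity, so the difference between `G_move` and `G_stay` comes entirely from the pragmatic term, and the policy leading to the preferred hidden cause receives the lower expected free energy.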

Figure 16: In this task, the agent (a 4-DoF arm with an additional 4-DoF hand composed of two fingers) has to pick a moving object, and place it at a home position. The agent has a shallow structure with a single IE module computing the hand position from the arm joint angles. Note that the object belief is rapidly inferred, and as soon as the picking action is complete, the belief is gradually pulled toward the home position, resulting in a second reaching movement. The top right panel shows the hand-object distance over time, while the bottom right panel displays the dynamics of the discrete action probabilities used to infer the next discrete state. See [75] for more details.

In addition to the previous agents, it is now possible to synchronize the behavior of different continuous signals based on the same high-level policy. For instance, one can realize a pick-and-place operation with a moving object – as represented in Figure 16 – producing smooth transitions between reaching and grasping actions, respectively performed in extrinsic and intrinsic domains [75]. Note that an intermediate phase between the two actions naturally arises, corresponding to a composite approaching movement. In principle, the learning of intention precisions $\bm{\pi}_{x,m}$ (not to be confused with the policy notation) might shed light on how motor skill learning occurs, via message passing between continuous intentions and discrete policies. Moreover, through this kind of dynamic planning the agent can infer and realize instantaneous trajectories even within the same discrete period $\tau$, which is useful, e.g., for grasping the moving object without waiting for the successive replanning step. Finally, note that to correctly maintain a goal state, we now need to introduce a hidden cause loosely corresponding to the stay action commonly used in discrete tasks [56]. This hidden cause can be linked to an identity intention, i.e., $\bm{i}_{stay}(\bm{x})=\bm{x}$, which can be interpreted as the agent’s desire to maintain the current state of affairs of the world [69]. Again, the dualism between action and perception also relates this hidden cause to the initial stationary state of the task, and translates into a specular desire for a phase of pure perceptual inference – like the one shown in the simulation of Figure 6.

4.3 Deep hybrid models

Figure 17 portrays a deep hybrid model designed for solving a flexible tool-use task [99]. It combines the expressivity of a (deep) hierarchical formulation, the advantages of inferring and imposing dynamic multi-step intentions inherent to a hybrid unit, and the possibility of encoding external objects and other agents. As in Figure 15, the IE modules communicate with a discrete model at the top, but now they are combined in a hierarchical fashion recapitulating the agent’s kinematic chain. As a consequence, two different goal-directed strategies arise. Considering a simple reaching movement, an attractor imposed at the hand level would generate a cascade of extrinsic prediction errors flowing back to the previous levels and finding a suitable kinematic configuration with the hand over the target. This corresponds to a horizontal hierarchical depth occurring along the hybrid units, and can be compared to the process of motor babbling typical of infants [114], whereby random attractors are generated at different hierarchical levels to identify the correct body structure. In addition to this naive strategy, since a discrete model can now generate intentions for every IE module (in both intrinsic and extrinsic domains), a more advanced behavior can be achieved once inverse kinematics is correctly performed, which imposes a specific path to the whole kinematic chain. This corresponds to a vertical hierarchical depth with two (discrete and continuous) temporal scales, steering the lower-level inferential process in a direction that, e.g., avoids singularities or escapes from local minima generated by repulsive attractors.

Figure 17: Graphical representation of a deep hybrid model for tool use, composed of a discrete model at the top and several IE modules. Every module is factorized into three elements, which are respectively linked to the observations of the agent’s arm (in blue), a tool (in green), and a ball (in red). Note that the last level only encodes the tool’s extremity and the ball.

There is thus a delicate balance between forward and backward extrinsic likelihood, and the top-down modulation of the discrete model:

$$\dot{\bm{\mu}}_{e}^{(i)}\propto-\bm{\pi}_{e}^{(i-1)}\bm{\varepsilon}_{e}^{(i-1)}+\partial\bm{g}_{e}^{T}\bm{\pi}_{e}^{(i)}\bm{\varepsilon}_{e}^{(i)}+\partial\bm{\eta}^{\prime(i)T}_{x,e}\bm{\pi}_{x,e}^{(i)}\bm{\varepsilon}_{x,e}^{(i)} \tag{49}$$

where $\partial\bm{\eta}^{\prime(i)T}_{x,e}$ is the gradient of the trajectory prior of Equation 36. From the discrete model’s perspective, the discrete hidden states produce a specific combination of hidden causes for each hybrid unit; this combination generates a composite trajectory in the continuous domain weighting independent intentions, taking into account dynamic elements for the whole discrete step. After this period, evidence is accumulated for every hybrid unit, eventually inferring the most likely discrete state that may have generated the trajectories of the self and the environment.

A non-trivial issue exists in tasks requiring tool use, e.g., reaching a ball with the extremity of a stick. Much as other agents may have different kinematic structures than the self, a tool may have its own hierarchy (e.g., even a simple stick is represented by two Cartesian positions and an angle) that must somehow be integrated into the agent’s generative model. Specifically, reaching an object with the extremity of a tool means defining a potential kinematic chain augmented by a new virtual level, letting the agent think of the tool as an extension of its arm. This is possible by linking the two visual observations of the tool to the hand and virtual levels in a second pathway of the hidden states, as shown in Figure 17. Since the intrinsic units of the IE modules also encode information about limb lengths, the agent can infer through visual observations not only its kinematic structure, but also the actual length of the tool [115]. While this second pathway is still marked by a clear distinction between the tool and the arm since the hand level receives observations from both elements, a third pathway is constructed such that the observation of the ball is only linked to the virtual level. As a result, this new potential configuration views the arm and the tool as being part of the same kinematic chain. The interactions between these three pathways (shown in Figure 18) may shed light on how the remapping of the motor cortex gradually occurs with extensive tool use [6, 7], modifying the boundaries between the self and the environment.
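The idea of a kinematic chain augmented by a virtual level can be illustrated with a deliberately simplified sketch. It assumes a planar arm, joint angles expressed relative to the previous link, and a tool rigidly described by one relative angle and one length; none of these choices is taken from the model of Figure 17, and the functions `chain_positions` and `extend_with_tool` are hypothetical.

```python
import math


def chain_positions(joint_angles, link_lengths):
    """Planar forward kinematics: Cartesian position of every level of the chain."""
    x, y, phi = 0.0, 0.0, 0.0
    positions = []
    for angle, length in zip(joint_angles, link_lengths):
        phi += angle                     # absolute orientation of the current link
        x += length * math.cos(phi)
        y += length * math.sin(phi)
        positions.append((x, y))
    return positions


def extend_with_tool(joint_angles, link_lengths, tool_angle, tool_length):
    """Treat the tool as one more (virtual) level appended after the hand."""
    return chain_positions(joint_angles + [tool_angle], link_lengths + [tool_length])


# Usage with hypothetical values: a two-link arm holding a stick of length 0.5.
levels = extend_with_tool([math.pi / 2, -math.pi / 2], [1.0, 1.0], 0.0, 0.5)
hand, tool_extremity = levels[-2], levels[-1]
```

In this reading, a target such as the ball would be compared with `tool_extremity` rather than with `hand`, so that the same chain describes the arm and the tool as a single structure.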

Figure 18: In this task, the agent (a 4-DoF arm) has to grasp a green tool and track a moving red ball with the tool’s extremity. The real arm configuration is represented in blue, while the light green and light red arms correspond to the potential agent’s configurations in relation to the tool and the ball, respectively. Note that the last two levels of the tool’s belief gradually match the real tool, while the ball’s belief makes no distinction between the arm and the tool, and is only defined by the tool’s extremity. The (deep) hierarchical factorization allows the agent to infer a potential configuration for the ball even during the first reaching movement. See [99] for more details.

5 Discussion

Despite the many advances that have been made in this relatively new and promising research area, with increasing popularity among different scientific domains, a current drawback is that studies about low-level motor control and high-level behavior have been somewhat distinct so far, making use of two highly specular but separated frameworks. As a result, there is no consensus on how to achieve dynamic planning (i.e., how to perform decision-making in constantly changing environments), and state-of-the-art solutions to tackle complex tasks generally couple active inference with traditional machine learning methods. From a theoretical perspective, a few works prescribed an efficient and elegant way for combining the capabilities of discrete and continuous representations into a single generative model [48, 59]; however, this hybrid approach has not reached as much maturity, with the consequence that there are far fewer studies on the subject in the literature, none applied to dynamic contexts.

For this reason, we tried here to give a comprehensive view of this yet unexplored direction, comparing several design choices regarding goal-directed behavior, with the intent of bringing motor control and behavioral studies closer. As a practical goal, we decided to model tool use [99], a task that inevitably calls for both discrete and continuous frameworks, and that requires taking two additional aspects into account, i.e., object affordances and hierarchical causal relationships. In a simple scenario, considering a target to reach as the cause of some hidden states is a reasonable assumption and makes the agent able to operate in dynamic contexts. But assuming that multiple objects are present, how does the agent decide which one will be the cause of a particular action? And what if the target moves along a non-trivial path? If the hidden states are factorized into independent distributions encoding multiple entities, the hidden causes may be seen from a different angle, i.e. they would manipulate the hidden states through flexible intentions [43, 44, 75]. Each of these entities would have its own dynamics, allowing the agent to predict, e.g., the trajectory of a moving ball. Then, this unit was scaled up to construct complex (deep) hierarchical structures, e.g., for simulating human body kinematics [44], and to perform more general transformations of reference frames, e.g., perspective projections [58]. A hierarchical factorization of the hidden states now assumes a broader perspective that can also account for multi-agent interaction – an aspect that has been analyzed in the discrete framework as well [116]. Finally, we designed a hybrid unit with discrete hidden causes and continuous hidden states, affording dynamic inference via Bayesian model reduction [69] that, when coupled with a higher-level discrete model, made it possible to simulate multi-step tasks involving online planning of actions. 
This showed further parallelisms between the inference of intentions in the continuous domain, and policies in discrete models.

Still, the real question is how we can reach such performances without embedding our prior knowledge into the agent’s generative model. Although not relevant for many continuous-time implementations that focused on different aspects of motor control, a common criticism is that the structure of generative models is a-priori defined and fixed, with intricate and hardcoded dynamics functions that raise some concerns about biological plausibility. In contrast, one appealing characteristic of PCNs is that they simulate brain processing with extremely simple functions typical of the connectivity of neural networks (e.g., linear combinations of weights and biases passed to a non-linear activation function). This allows the network to easily adapt to high-dimensional data, with a few critical advantages compared to deep learning arising from a top-down modulation [16]. While much of this research involves static representations, some studies began to address how predictive coding could be used to learn temporal sequences [117, 118], or to solve RL tasks [119, 103, 120]. Here, we demonstrated how generative models in active inference could be realized by simple likelihood and dynamics functions, showing some analogies with the inferential process of PCNs. Based on these findings, a promising research direction would be to imitate their multiple-input and multiple-output architectures (as in Figure 9(d)), so that an agent could not only learn – in a biologically plausible way – its kinematic configuration and the system dynamics, but also act over them to conform to prior beliefs.

Learning policies in continuous environments is not an easy challenge, but addressing it with strategies different from traditional methods might be key for advancing current intelligent agents and realizing the full theoretical potential at the basis of active inference and the free energy principle. On this matter, the state of the art is to approximate the likelihood and transition distributions by deep neural networks [78, 121, 122, 123]. While several benefits arise compared to deep RL, this still confines the deep structure to the neural network, generally making use of a single-level active inference agent. One study used a more biologically plausible PCN as a generative model [120], but relied on a similar approach. As extensively analyzed in [51], neural networks can be seen as static generative models with infinitely precise priors at the last level and no hidden states. This architecture can be used to perform sparse coding or Principal Components Analysis (PCA); however, it fails to account for dynamic variables, as in deconvolution problems or filtering in state-space models. Temporal depth – either discrete or continuous – is thus key to inferring the most accurate representation of the environment, and indeed it seems that cortical columns are able to express model dynamics (e.g., the prefrontal cortex is constantly involved in predicting future states, and motion-sensitive neurons have been recorded in the early visual cortex as well [124]). While it is true that temporal sequences can be easily handled by deep architectures such as recurrent neural networks or transformers [125], their passive generative mechanism could still be reflected in the behavior of the active inference agent. In contrast to such a passive AI, being grounded in sensorimotor experiences and actively modifying the environment could be fundamental to the emergence of genuine understanding [126].
Taken together, these facts suggest that acting upon generalized coordinates of motion or discrete future states also for intermediate levels could bring several advantages in solving RL tasks. For instance, representing an agent in a hierarchical fashion afforded highly advanced control over its whole body structure that would not have been possible by a single level generating only the hand position [44, 115].

How then to learn dynamic planning in deep hierarchical models? The importance of being discrete when considering structure learning has been stressed in [127]. Indeed, hierarchical discrete models afford much more expressivity compared to their continuous counterparts, above all because of the simplicity of computing the expected free energy that allows an agent to plan actions over the imminent future. Nonetheless, as Friston and colleagues note, whether to use continuous or discrete representations depends on the model evidence. Specifically, the former may perform better when the evidence has contiguity properties, e.g., when dealing with time series or with Euclidean space. Indeed, the task exemplified in Figure 18 is effective because the Bayesian model reduction performs a dynamic evidence accumulation over the extrinsic space in which the agent operates. Hence, coupling the hierarchical depth of the hybrid units in Figure 17 with a hierarchical discrete architecture (and not just a single discrete level) could bring efficient structure learning also in constantly changing environments. An alternative approach would be to combine in a hierarchical fashion units composed of a joint discrete-continuous model – as in Figure 15 – which would allow dynamic planning to be performed within each single unit. While this solution may not be supported yet by empirical evidence from biological agents, it could be an encouraging direction to explore from a machine learning perspective, contrasting the hypothesis of central discrete decision-making with a distributed network of local decisions.

A third interesting topic regards motor intentionality. Although multi-step tasks are typically tackled at the discrete level, we showed here that, under appropriate assumptions, non-trivial behavior can also be achieved and analyzed at the continuous level. The flexible intentions that we defined could be compared to an advanced stage of motor skill learning, consisting of autonomous and smooth movements that do not require conscious decision-making [75]. Still, the model structure was predefined in this case as well. How do such intentions emerge during repeated exposure to the same task? How does the agent score which intentions will be appropriate for a specific context? As mentioned in the previous chapter, the optimization of intention precisions is likely to involve the free energy of reduced models (see Equation 38). This process may shed light on how discrete actions arise from low-level continuous intentions and, conversely, how the latter are generated from a composite discrete action. Finally, a few studies proposed additional connections between policies unfolding at different timescales, either directly [111, 112] or through discrete hidden states [110]. Such approaches could be adopted in hybrid and continuous contexts as well, so that flexible intentions could be propagated via local message passing between hidden causes along the whole hierarchy.

6 Acknowledgments

This research received funding from the European Union’s Horizon H2020-EIC-FETPROACT-2019 Programme for Research and Innovation under Grant Agreement 951910 to I.P.S. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • [1] David Meunier. Hierarchical modularity in human brain functional networks. Frontiers in Neuroinformatics, 3, 2009.
  • [2] Claus C. Hilgetag and Alexandros Goulas. ‘Hierarchy’ in the organization of brain networks. Philosophical Transactions of the Royal Society B: Biological Sciences, 375(1796):20190319, February 2020.
  • [3] Nicholas P. Holmes and Charles Spence. The body schema and multisensory representation(s) of peripersonal space. Cognitive Processing, 5(2):94–105, June 2004.
  • [4] Atsushi Yokoi and Jörn Diedrichsen. Neural organization of hierarchical motor sequence representations in the human neocortex. Neuron, 103(6):1178–1190.e7, September 2019.
  • [5] Christine Assaiante, F. Barlaam, F. Cignetti, and M. Vaugoyeau. Body schema building during childhood and adolescence: A neurosensory approach. Neurophysiologie Clinique = Clinical Neurophysiology, 44(1):3–12, January 2014.
  • [6] Atsushi Iriki, Michio Tanaka, and Yoshiaki Iwamura. Coding of modified body schema during tool use by macaque postcentral neurones. NeuroReport, 7(14):2325–2330, October 1996.
  • [7] Shigeru Obayashi, Tetsuya Suhara, Koichi Kawabe, Takashi Okauchi, Jun Maeda, Yoshihide Akine, Hirotaka Onoe, and Atsushi Iriki. Functional brain mapping of monkey tool use. NeuroImage, 14(4):853–861, 2001.
  • [8] Thomas A. Carlson, George Alvarez, Daw-an Wu, and Frans A.J. Verstraten. Rapid assimilation of external objects into the body schema. Psychological Science, 21(7):1000–1005, May 2010.
  • [9] Lucilla Cardinali, Francesca Frassinetti, Claudio Brozzoli, Christian Urquizar, Alice C. Roy, and Alessandro Farnè. Tool-use induces morphological updating of the body schema. Current Biology, 19(13):478, 2009.
  • [10] Rajesh P.N. Rao and Dana H. Ballard. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87, 1999.
  • [11] Jakob Hohwy. The Predictive Mind. Oxford University Press UK, 2013.
  • [12] Andy Clark. Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Oxford University Press, 2016.
  • [13] Jakob Hohwy. New directions in predictive processing. Mind and Language, 35(2):209–223, 2020.
  • [14] Andy Clark. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3):181–204, 2013.
  • [15] Stewart Shipp. Neural elements for predictive coding. Frontiers in Psychology, 7, November 2016.
  • [16] Beren Millidge, Anil Seth, and Christopher L Buckley. Predictive coding: a theoretical and experimental review, 2022.
  • [17] Alexander Ororbia and Daniel Kifer. The neural coding framework for learning generative models. Nature Communications, 13(1), 2022.
  • [18] Tommaso Salvatori, Ankur Mali, Christopher L. Buckley, Thomas Lukasiewicz, Rajesh P. N. Rao, Karl Friston, and Alexander Ororbia. Brain-inspired computational intelligence via predictive coding, 2023.
  • [19] James C. R. Whittington and Rafal Bogacz. An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation, 29(5):1229–1262, March 2017.
  • [20] James C.R. Whittington and Rafal Bogacz. Theories of Error Back-Propagation in the Brain. Trends in Cognitive Sciences, 23(3):235–250, 2019.
  • [21] Beren Millidge, Alexander Tschantz, and Christopher L. Buckley. Predictive Coding Approximates Backprop Along Arbitrary Computation Graphs. Neural Computation, 34(6):1329–1368, 2022.
  • [22] Karl Friston and Stefan Kiebel. Predictive coding under the free-energy principle. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1521):1211–1221, 2009.
  • [23] Jakob Hohwy, Andreas Roepstorff, and Karl Friston. Predictive coding explains binocular rivalry: An epistemological review. Cognition, 108(3):687–701, September 2008.
  • [24] Giovanni Pezzulo, Francesco Donnarumma, Domenico Maisto, and Ivilin Stoianov. Planning at decision time and in the background during spatial navigation. Current Opinion in Behavioral Sciences, 29:69–76, 2019.
  • [25] A David Redish. Vicarious trial and error. Nature Reviews Neuroscience, 17:147–159, 2016.
  • [26] I. Stoianov, C. Pennartz, C. Lansink, and G. Pezzulo. Model-based spatial navigation in the hippocampus-ventral striatum circuit: a computational analysis. PLoS Computational Biology, 14(9):1–28, 2018.
  • [27] Ivilin Stoianov, Domenico Maisto, and Giovanni Pezzulo. The hippocampal formation as a hierarchical generative model supporting generative replay and continual learning. Progress in Neurobiology, 217:1–20, 2022.
  • [28] Karl J. Friston, Jean Daunizeau, James Kilner, and Stefan J. Kiebel. Action and behavior: A free-energy formulation. Biological Cybernetics, 102(3):227–260, 2010.
  • [29] Karl Friston. The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.
  • [30] Christopher L. Buckley, Chang Sub Kim, Simon McGregor, and Anil K. Seth. The free energy principle for action and perception: A mathematical review. Journal of Mathematical Psychology, 81:55–79, 2017.
  • [31] Thomas Parr, Giovanni Pezzulo, and Karl J Friston. Active inference: the free energy principle in mind, brain, and behavior. Cambridge, MA: MIT Press, 2021.
  • [32] Karl J. Friston, Jean Daunizeau, and Stefan J. Kiebel. Reinforcement learning or active inference? PLoS ONE, 4(7), 2009.
  • [33] Karl Friston. What is optimal about motor control? Neuron, 72(3):488–498, 2011.
  • [34] Rick A. Adams, Stewart Shipp, and Karl J. Friston. Predictions not commands: Active inference in the motor system. Brain Structure and Function, 218(3):611–643, 2013.
  • [35] Harriet Brown, Karl Friston, and Sven Bestmann. Active inference, attention, and motor preparation. Frontiers in Psychology, 2:1–10, 2011.
  • [36] Giovanni Pezzulo, Leo D’Amato, Francesco Mannella, Matteo Priorelli, Toon Van de Maele, Ivilin Peev Stoianov, and Karl Friston. Neural representation in active inference: using generative models to interact with – and understand – the lived world. Annals of the New York Academy of Sciences, in press 2024.
  • [37] Pablo Lanillos, Jordi Pages, and Gordon Cheng. Robot self/other distinction: active inference meets neural networks learning in a mirror. ECAI, 2020.
  • [38] Marc Toussaint and Amos Storkey. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. ACM International Conference Proceeding Series, 148:945–952, 2006.
  • [39] Marc Toussaint. Probabilistic inference as a model of planned behavior. Künstliche Intelligenz, 3/09:23–29, 2009.
  • [40] Matthew Botvinick and Marc Toussaint. Planning as inference. Trends in Cognitive Sciences, 16(10):485–488, 2012.
  • [41] A. Maselli, P. Lanillos, and G. Pezzulo. Active inference unifies intentional and conflict-resolution imperatives of motor control. PLOS Comput. Biol, 18(6), 2022.
  • [42] Francesco Mannella, Federico Maggiore, Manuel Baltieri, and Giovanni Pezzulo. Active inference through whiskers. Neural Networks, 144:428–437, 2021.
  • [43] Matteo Priorelli and Ivilin Peev Stoianov. Flexible Intentions: An Active Inference Theory. Frontiers in Computational Neuroscience, 17:1 – 41, 2023.
  • [44] Matteo Priorelli, Giovanni Pezzulo, and Ivilin Peev Stoianov. Deep kinematic inference affords efficient and scalable control of bodily movements. Proceedings of the National Academy of Sciences of the United States of America, 120, 2023.
  • [45] Ajith Anil Meera, Filip Novicky, Thomas Parr, Karl Friston, Pablo Lanillos, and Noor Sajid. Reclaiming saliency: Rhythmic precision-modulated action and perception. Frontiers in Neurorobotics, 16:1–23, 2022.
  • [46] Raphael Kaplan and Karl J. Friston. Planning and navigation as active inference. Biological Cybernetics, 112(4):323–343, 2018.
  • [47] Rick A. Adams, Klaas Enno Stephan, Harriet R. Brown, Christopher D. Frith, and Karl J. Friston. The computational anatomy of psychosis. Frontiers in Psychiatry, 4, 2013.
  • [48] Karl J. Friston, Thomas Parr, and Bert de Vries. The graphical brain: Belief propagation and active inference. 1(4):381–414, 2017.
  • [49] Riccardo Proietti, Giovanni Pezzulo, and Alessia Tessari. An active inference model of hierarchical action understanding, learning and imitation. Physics of Life Reviews, 46:92–118, September 2023.
  • [50] Francesco Donnarumma, Marcello Costantini, Ettore Ambrosini, Karl Friston, and Giovanni Pezzulo. Action perception as hypothesis testing. Cortex, 89:45–60, April 2017.
  • [51] Karl Friston. Hierarchical models in the brain. PLoS Computational Biology, 4(11), 2008.
  • [52] Matteo Priorelli, Federico Maggiore, Antonella Maselli, Francesco Donnarumma, Domenico Maisto, Francesco Mannella, Ivilin Peev Stoianov, and Giovanni Pezzulo. Modeling motor control in continuous-time Active Inference: a survey. IEEE Transactions on Cognitive and Developmental Systems, pages 1–15, 2023.
  • [53] Karl Friston, Klaas Stephan, Baojuan Li, and Jean Daunizeau. Generalised filtering. Mathematical Problems in Engineering, 2010: Article ID 621670, 34 p., 2010.
  • [54] Thomas Parr, Rajeev Vijay Rikhye, Michael M. Halassa, and Karl J. Friston. Prefrontal Computation as Active Inference. Cerebral Cortex, 30(2):682–695, 2020.
  • [55] Lancelot Da Costa, Thomas Parr, Noor Sajid, Sebastijan Veselic, Victorita Neacsu, and Karl Friston. Active inference on discrete state-spaces: A synthesis. Journal of Mathematical Psychology, 99, 2020.
  • [56] Ryan Smith, Karl J. Friston, and Christopher J. Whyte. A step-by-step tutorial on active inference and its application to empirical data. Journal of Mathematical Psychology, 107:102632, 2022.
  • [57] Giovanni Pezzulo, Francesco Rigoli, and Karl J. Friston. Hierarchical Active Inference: A Theory of Motivated Control. Trends in Cognitive Sciences, 22(4):294–306, 2018.
  • [58] M. Priorelli, G. Pezzulo, and I.P. Stoianov. Active vision in binocular depth estimation: A top-down perspective. Biomimetics, 8(5), 2023.
  • [59] Karl J. Friston, Richard Rosch, Thomas Parr, Cathy Price, and Howard Bowman. Deep temporal models and active inference. Neuroscience and Biobehavioral Reviews, 77:388–402, 2017.
  • [60] K. J. Friston, L. Harrison, and Will Penny. Dynamic causal modelling. NeuroImage, 19(4):1273–1302, 2003.
  • [61] Karl Friston and Will Penny. Post hoc Bayesian model selection. NeuroImage, 56(4):2089–2099, 2011.
  • [62] Karl Friston, Thomas Parr, and Peter Zeidman. Bayesian model reduction. pages 1–32, 2018.
  • [63] M.J. Rosa, K. Friston, and W. Penny. Post-hoc selection of dynamic causal models. Journal of Neuroscience Methods, 208(1):66–78, June 2012.
  • [64] Thomas Parr and Karl J. Friston. The Discrete and Continuous Brain: From Decisions to Movement—And Back Again. Neural Computation, 30:2319–2347, 2018.
  • [65] T. Parr and K. J. Friston. The computational pharmacology of oculomotion. Psychopharmacology (Berl.), 236(8):2473–2484, August 2019.
  • [66] A. Tschantz, L. Barca, D. Maisto, C. L. Buckley, A. K. Seth, and G. Pezzulo. Simulating homeostatic, allostatic and goal-directed forms of interoceptive control using active inference. Biological Psychology, 169:108266, 2022.
  • [67] Thomas Parr and Karl J. Friston. Active inference and the anatomy of oculomotion. Neuropsychologia, 111:334–343, 2018.
  • [68] Ozan Çatal, Tim Verbelen, Toon Van de Maele, Bart Dhoedt, and Adam Safron. Robot navigation as hierarchical active inference. Neural Networks, 142:192–204, 2021.
  • [69] M. Priorelli and I.P. Stoianov. Dynamic inference by model reduction. bioRxiv, 2023.
  • [70] Stefano Ferraro, Toon Van de Maele, Pietro Mazzaglia, Tim Verbelen, and Bart Dhoedt. Disentangling shape and pose for object-centric deep active inference models, 2022.
  • [71] Ruben S. van Bergen and Pablo L. Lanillos. Object-based active inference, 2022.
  • [72] Toon Van de Maele, Tim Verbelen, Ozan Çatal, and Bart Dhoedt. Embodied object representation learning and recognition. Frontiers in Neurorobotics, 16, April 2022.
  • [73] Toon Van de Maele, Tim Verbelen, Pietro Mazzaglia, Stefano Ferraro, and Bart Dhoedt. Object-centric scene representations using active inference, 2023.
  • [74] Rick A. Adams, Eduardo Aponte, Louise Marshall, and Karl J. Friston. Active inference and oculomotor pursuit: The dynamic causal modelling of eye movements. Journal of Neuroscience Methods, 242:1–14, 2015.
  • [75] Matteo Priorelli and Ivilin Peev Stoianov. Slow but flexible or fast but rigid? discrete and continuous processes compared. bioRxiv, 2023.
  • [76] Pablo Lanillos, Cristian Meo, Corrado Pezzato, Ajith Anil Meera, Mohamed Baioumy, Wataru Ohata, Alexander Tschantz, Beren Millidge, Martijn Wisse, Christopher L. Buckley, and Jun Tani. Active inference in robotics and artificial agents: Survey and challenges. CoRR, abs/2112.01871, 2021.
  • [77] Tadahiro Taniguchi, Shingo Murata, Masahiro Suzuki, Dimitri Ognibene, Pablo Lanillos, Emre Ugur, Lorenzo Jamone, Tomoaki Nakamura, Alejandra Ciria, Bruno Lara, and Giovanni Pezzulo. World models and predictive coding for cognitive and developmental robotics: frontiers and challenges. Advanced Robotics, 37(13):780–806, June 2023.
  • [78] Kai Ueltzhöffer. Deep Active Inference. pages 1–40, 2017.
  • [79] Beren Millidge. Deep active inference as variational policy gradients. Journal of Mathematical Psychology, 96, 2020.
  • [80] Zafeirios Fountas, Noor Sajid, Pedro A.M. Mediano, and Karl Friston. Deep active inference agents using Monte-Carlo methods. Advances in Neural Information Processing Systems, 2020.
  • [81] Théophile Champion, Marek Grześ, Lisa Bonheme, and Howard Bowman. Deconstructing deep active inference. 2023.
  • [82] Aleksey Zelenov and Vladimir Krylov. Deep active inference in control tasks. In 2021 International Conference on Electrical, Communication, and Computer Engineering (ICECCE), pages 1–3, 2021.
  • [83] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, April 2017.
  • [84] Maxwell J. D. Ramstead, Dalton A. R. Sakthivadivel, Conor Heins, Magnus Koudahl, Beren Millidge, Lancelot Da Costa, Brennan Klein, and Karl J. Friston. On Bayesian mechanics: a physics of and by beliefs. Interface Focus, 13(3), April 2023.
  • [85] Cansu Sancaktar, Marcel A. J. van Gerven, and Pablo Lanillos. End-to-End Pixel-Based Deep Active Inference for Body Perception and Action. In 2020 Joint IEEE 10th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pages 1–8, 2020.
  • [86] Guillermo Oliver, Pablo Lanillos, and Gordon Cheng. An empirical study of active inference on a humanoid robot. IEEE Transactions on Cognitive and Developmental Systems, 8920(c):1–10, 2021.
  • [87] Cristian Meo and Pablo Lanillos. Multimodal VAE active inference controller. CoRR, abs/2103.04412, 2021.
  • [88] Thomas Rood, Marcel van Gerven, and Pablo Lanillos. A deep active inference model of the rubber-hand illusion. 2020.
  • [89] Mohamed Baioumy, Paul Duckworth, Bruno Lacerda, and Nick Hawes. Active inference for integrated state-estimation, control, and learning. arXiv, 2020.
  • [90] Cristian Meo, Giovanni Franzese, Corrado Pezzato, Max Spahn, and Pablo Lanillos. Adaptation through prediction: Multisensory active inference torque control. IEEE Transactions on Cognitive and Developmental Systems, 15(1):32–41, 2023.
  • [91] Fred Bos, Ajith Anil Meera, Dennis Benders, and Martijn Wisse. Free Energy Principle for State and Input Estimation of a Quadcopter Flying in Wind. Proceedings - IEEE International Conference on Robotics and Automation, pages 5389–5395, 2022.
  • [92] Ajith Anil Meera and Martijn Wisse. Dynamic expectation maximization algorithm for estimation of linear systems with colored noise. Entropy, 23(10), 2021.
  • [93] Léo Pio-Lopez, Ange Nizard, Karl Friston, and Giovanni Pezzulo. Active inference and robot control: A case study. Journal of the Royal Society Interface, 13(122), 2016.
  • [94] Emanuel Todorov. Optimality principles in sensorimotor control. Nature Neuroscience, 7:907–915, 2004.
  • [95] Mareike Floegel, Johannes Kasper, Pascal Perrier, and Christian A. Kell. How the conception of control influences our understanding of actions. Nature Reviews Neuroscience, 24:313–329, 2023.
  • [96] Giuseppe Vallar, Elie Lobel, Gaspare Galati, Alain Berthoz, Luigi Pizzamiglio, and Denis Le Bihan. A fronto-parietal system for computing the egocentric spatial frame of reference in humans. Experimental Brain Research, 124(3):281–286, January 1999.
  • [97] James R. Hinman, G. William Chapman, and Michael E. Hasselmo. Neuronal representation of environmental boundaries in egocentric coordinates. Nature Communications, 10(1), June 2019.
  • [98] Karl J. Friston, Jérémie Mattout, and James Kilner. Action understanding and active inference. Biological Cybernetics, 104(1-2):137–160, February 2011.
  • [99] Matteo Priorelli and Ivilin Peev Stoianov. Deep hybrid models: infer and plan in the real world. arXiv, 2024.
  • [100] Giacomo Rizzolatti and Laila Craighero. The mirror-neuron system. Annual Review of Neuroscience, 27:169–192, 2004.
  • [101] James M. Kilner, Karl J. Friston, and Chris D. Frith. Predictive coding: an account of the mirror neuron system. Cognitive Processing, 8(3):159–166, April 2007.
  • [102] Jeff Hawkins, Subutai Ahmad, and Yuwei Cui. A theory of how columns in the neocortex enable learning the structure of the world. Frontiers in Neural Circuits, 11, 2017.
  • [103] Rajesh P. N. Rao, Dimitrios C. Gklezakos, and Vishwas Sathish. Active predictive coding: A unified neural framework for learning hierarchical world models for perception and planning, 2022.
  • [104] Ares Fisher and Rajesh P N Rao. Recursive neural programs: A differentiable framework for learning compositional part-whole hierarchies and image grammars. PNAS Nexus, 2(11), October 2023.
  • [105] Harriet Feldman and Karl J. Friston. Attention, uncertainty, and free-energy. Frontiers in Human Neuroscience, 4, 2010.
  • [106] Thomas Parr, David A. Benrimoh, Peter Vincent, and Karl J. Friston. Precision and false perceptual inference. Frontiers in Integrative Neuroscience, 12, September 2018.
  • [107] F. Fattapposta, G. Amabile, M. V. Cordischi, D. Di Venanzio, A. Foti, F. Pierelli, C. D’Alessio, F. Pigozzi, A. Parisi, and C. Morrocutti. Long-term practice effects on a new skilled motor learning: An electrophysiological study. Electroencephalography and Clinical Neurophysiology, 99(6):495–507, 1996.
  • [108] Francesco Di Russo, Sabrina Pitzalis, Teresa Aprile, and Donatella Spinelli. Effect of practice on brain activity: An investigation in top-level rifle shooters. Medicine and Science in Sports and Exercise, 37(9):1586–1593, 2005.
  • [109] Ann M. Graybiel. Habits, rituals, and the evaluative brain. Annual Review of Neuroscience, 31:359–387, 2008.
  • [110] Karl J. Friston, Thomas Parr, Conor Heins, Axel Constant, Daniel Friedman, Takuya Isomura, Chris Fields, Tim Verbelen, Maxwell Ramstead, John Clippinger, and Christopher D. Frith. Federated inference and belief sharing. Neuroscience & Biobehavioral Reviews, 156:105500, 2024.
  • [111] Toon Van de Maele, Bart Dhoedt, Tim Verbelen, and Giovanni Pezzulo. Integrating cognitive map learning and active inference for planning in ambiguous environments. In Active Inference, pages 204–217, Cham, 2024. Springer Nature Switzerland.
  • [112] Daria de Tinguy, Toon Van de Maele, Tim Verbelen, and Bart Dhoedt. Spatial and temporal hierarchy for autonomous navigation using active inference in minigrid environment. Entropy, 26(1):83, January 2024.
  • [113] Karl Friston, Lancelot Da Costa, Danijar Hafner, Casper Hesp, and Thomas Parr. Sophisticated inference. Neural Computation, 33(3):713–763, 2021.
  • [114] Daniele Caligiore, Tomassino Ferrauto, Domenico Parisi, Neri Accornero, Marco Capozza, and Gianluca Baldassarre. Using motor babbling and Hebb rules for modeling the development of reaching with obstacles and grasping. 2008.
  • [115] Matteo Priorelli and Ivilin Peev Stoianov. Efficient motor learning through action-perception cycles in deep kinematic inference. In Active Inference, pages 59–70. Springer Nature Switzerland, 2024.
  • [116] Domenico Maisto, Francesco Donnarumma, and Giovanni Pezzulo. Interactive inference: A multi-agent model of cooperative joint actions. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 54(2):704–715, 2024.
  • [117] Linxing Preston Jiang and Rajesh P. N. Rao. Dynamic predictive coding: A model of hierarchical sequence learning and prediction in the neocortex. bioRxiv, 2023.
  • [118] Beren Millidge, Mahyar Osanlouy, and Rafal Bogacz. Predictive Coding Networks for Temporal Prediction. pages 1–59, 2023.
  • [119] Alexander Ororbia and Ankur Mali. Active Predictive Coding: Brain-Inspired Reinforcement Learning for Sparse Reward Robotic Control Problems. 2022.
  • [120] Beren Millidge. Combining Active Inference and Hierarchical Predictive Coding: a Tutorial Introduction and Case Study. PsyArXiv, 2019.
  • [121] Ozan Çatal, Johannes Nauta, Tim Verbelen, Pieter Simoens, and B. Dhoedt. Bayesian policy selection using active inference. ArXiv, abs/1904.08149, 2019.
  • [122] Stefano Ferraro, Toon Van de Maele, Tim Verbelen, and Bart Dhoedt. Symmetry and complexity in object-centric deep active inference models. Interface Focus, 13(3), April 2023.
  • [123] Kai Yuan, Karl Friston, Zhibin Li, and Noor Sajid. Hierarchical generative modelling for autonomous robots. Research Square, 2023.
  • [124] Stephen Grossberg and Praveen K. Pilly. Temporal dynamics of decision-making during motion perception in the visual cortex. Vision Research, 48(12):1345–1373, June 2008.
  • [125] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [126] Giovanni Pezzulo, Thomas Parr, Paul Cisek, Andy Clark, and Karl Friston. Generating meaning: active inference and the scope and limits of passive AI. Trends in Cognitive Sciences, 28(2):97–112, February 2024.
  • [127] Karl J. Friston, Lancelot Da Costa, Alexander Tschantz, Alex Kiefer, Tommaso Salvatori, Victorita Neacsu, Magnus Koudahl, Conor Heins, Noor Sajid, Dimitrije Markovic, Thomas Parr, Tim Verbelen, and Christopher L Buckley. Supervised structure learning. 2023.