Dynamic planning in hierarchical active inference

Matteo Priorelli
Institute of Cognitive Sciences and Technologies
National Research Council of Italy
Padova, Italy
Ivilin Peev Stoianov
Institute of Cognitive Sciences and Technologies
National Research Council of Italy
Padova, Italy

Abstract

By dynamic planning, we refer to the ability of the human brain to infer and impose motor trajectories related to cognitive decisions. A recent paradigm, active inference, brings fundamental insights into the adaptation of biological organisms, constantly striving to minimize prediction errors to restrict themselves to life-compatible states. Over the past years, many studies have shown how human and animal behaviors could be explained in terms of an active inferential process – either as discrete decision-making or continuous motor control – inspiring innovative solutions in robotics and artificial intelligence. Still, the literature lacks a comprehensive outlook on how to effectively plan realistic actions in changing environments. Setting ourselves the goal of modeling tool use, we delve into the topic of dynamic planning in active inference, keeping in mind two crucial aspects of biological goal-directed behavior: the capacity to understand and exploit affordances for object manipulation, and to learn the hierarchical interactions between the self and the environment, including other agents. We start from a simple unit and gradually describe more advanced structures, comparing recently proposed design choices and providing basic examples for each section. This study distances itself from traditional views centered on neural networks and reinforcement learning, and points toward a yet unexplored direction in active inference: hybrid representations in hierarchical models.

1 Introduction

Hierarchies are found everywhere in the world. They are so pervasive that they do not just exist as causal relationships between physical properties of the environment, but are also inherent to how biological organisms act upon it. Even the most complex kinematic structures of animals follow a rigid hierarchical strategy, whereby different limbs propagate from a body-centered reference frame. The hierarchical modularity of brain functional networks is widely recognized [1, 2], as well as the representation of the body schema in somatosensory and motor areas [3], and the organization of hierarchical motor sequences concerning parietal and premotor cortices [4]. In particular, the body schema is not a static entity but changes along with the development of the human body during childhood and adolescence [5]. Surprisingly, the nervous system is able to relate external objects to the self in a way that, although not reflecting the actual causal relationships between the body and the environment, is the most suitable for operating in a specific context. Physiological studies have demonstrated that, with extensive tool use, parietal and motor areas of the monkey brain gradually adapt to make room for the tool, increasing the length of the perceived limb [6, 7]. This adaptation is highly plastic, assimilating objects in a very short time [8] and inducing altered somatosensory representations of the body morphology that persist even after tool use [9].

Why and how does this happen? One recent theory is that of predictive coding [10, 11, 12], which has been attracting increasing interest in recent years and proposes itself as a unifying paradigm of cortical function. Predictive coding posits that living beings make sense of the world by building an internal generative model that tries to imitate the hierarchical causal relationships of the external generative process. From a high-level hypothesis about the state of affairs of the world, a cascade of neural predictions takes place, eventually leading to a low-level guess about sensory evidence. Comparing the model’s guess with the sensorium triggers another cascade of prediction errors that travel back to the deepest cortical levels. The model iteratively refines its structure until all the prediction errors are minimized, that is, until it is finally able to predict what will happen next. This optimization differs from the more traditional view of deep learning, in that the message passing is local and what climbs up the hierarchy does not signal the detection of a feature, but how much the model is surprised about its prediction. Besides having stimulated cognitive and neural studies under several circumstances [13, 14, 15, 16], this theory has also influenced novel directions in machine learning: Predictive Coding Networks (PCNs) have been shown to generalize well to classification or regression tasks [17, 18], with key advantages compared to neural networks and still approximating the backpropagation algorithm [19, 20, 21].

While predictive coding can elucidate – through a rigorous computational framework [22] – illusions and visual phenomena such as binocular rivalry [23], it explains just the first (perceptual) half of the story. More specifically, it does not explain why interactions with the environment occur – a process that results, considering the above example, in the monkey brain actively distorting its body schema during tool use. Such complex tasks always involve decision-making, which the brain is known to realize via several methods [24]. Among them, one is particularly relevant here: planning for deliberation, also known as vicarious trial and error, whereby an action is selected after several alternatives have been generated and evaluated [25]. One of the most intriguing characteristics of human planning is the capacity to imagine, or endogenously generate, dynamic representations of future states, including potential trajectories and subgoals that lead to such states [26, 27]. The hippocampus is a key neural structure known to support trajectory generation, although planning is accomplished in concert with other areas implementing evaluation of options and response selection [25]. How does the human brain account for the dynamics of the self and the environment to afford purposeful planning?

On this trail, a second innovative perspective has been proposed, aspiring to unveil a unified first principle not just on cortical function, but on the behavior of all living organisms. This perspective, called active inference [28, 29, 30, 31], is grounded in the same theoretical basis as predictive coding but further assumes two key aspects of biological behavior. First, that a living being does not maintain static hypotheses about the state of affairs of the world but can also construct internal dynamics – either as instantaneous trajectories or future states – allowing it to anticipate the unfolding of events occurring at different timescales. Second, that these dynamic hypotheses can be fulfilled by movements. The latter assumption replaces models with agents, conveying a somewhat counterintuitive but insightful implication: while perception lets the agent’s hypothesis conform to the environment (as in predictive coding), action forces the environment to conform to the hypothesis, by letting the agent sample those observations that make the hypothesis true. If such hypotheses (usually called beliefs) correspond to desired states defined, e.g., by the phenotype, cycling between action and perception ultimately allows the agent to survive. This is the core of the so-called free energy principle, which states that in order to maintain homeostasis, all organisms must constantly and actively minimize the difference between their sensory states and their expectations based on a small set of life-compatible options. To give a practical example, if I believe that I have a tool in hand, I will try with all my strength to observe visual images of the tool in my hand; in doing this, a combined reaching and grasping action happens. This view distances itself from the stimulus-response mapping widely established in neuroscience, and evidence indicates that it could be more biologically plausible than optimal control and Reinforcement Learning (RL) [32, 33, 34, 35].

In principle, active inference might be key for understanding how goal-directed behavior emerges in the human brain [36]. For instance, relevant objects used for manipulation may gradually become part of one’s identity through a closed loop between motor commands and sensory evidence, meaning that the boundary of the self from the environment increases whenever the agent manages to predict the consequences of its own movements [37]. Additionally, active inference might prove fundamental for making advances with current artificial agents, taking forward a promising research area known as planning as inference [38, 39, 40]. Active inference implementations can be divided into two frameworks, which have been used to simulate human and animal behaviors under the two complementary aspects of motor control [32, 41, 42, 43, 44, 45] and decision-making [46, 47, 48, 49, 50]. The first framework – generally compared to the low-level sensorimotor loops – is defined in continuous time [51, 52] and makes use of generalized filtering [53] to model instantaneous trajectories of the self and the environment; these trajectories are inferred by minimization of a quantity called variational free energy, which is the negative of what in machine learning is known as the evidence lower bound. Unlike in optimal control, motor commands in active inference derive from proprioceptive predictions that are fulfilled by classical spinal reflex arcs [34]. This eliminates the need for cost functions – as the inverse model maps from proprioceptive (and not latent) states to actions – and replaces a control problem with an inference problem [33]. The second framework – attributed to the cerebral cortex, especially prefrontal areas [54], along with corticostriatal loops – is expressed in discrete state-space [55, 56] and exploits the structure of Partially Observable Markov Decision Processes (POMDPs) to plan abstract actions over expected future sensations. This (active) inference relies on the minimization of the expected free energy, i.e., the free energy that the agent expects to perceive in the future. The expected free energy can be unpacked into two terms resembling the two classical aspects of control theory, exploration and exploitation – which here naturally arise; these respectively correspond to an uncertainty-reducing term, and a goal-seeking term that, as before, pushes the agent to find a sequence of actions leading to its prior belief.

Three features of active inference are relevant to designing intelligent agents that can tackle real-life applications and, for the goal of this study, tasks requiring tool use. First, multiple units – composed of simple likelihood and state transition distributions – can be easily connected to adapt to complex hierarchical structures [57]. For instance, a hierarchical kinematic model can be designed in continuous time, wherein each unit encodes a certain Degree of Freedom (DoF) in intrinsic and extrinsic reference frames [44]. This allows one to realize advanced movements that involve simultaneous coordination of multiple limbs, e.g., moving with a glass in hand. This hierarchical structure can be generalized to perform homogeneous transformations between reference frames, e.g., perspective projections [58]. However, a continuous model alone lacks effective usability in the real world, since it can only deal with present sensory states and cannot perform any form of future planning.

The latter is usually possible through so-called mixed or hybrid models [48, 59], which combine the potentialities of a discrete model with the inference of continuous signals, allowing robust decision-making in uncertain and changing environments. While the theory of Bayesian model reduction [60, 61, 62, 63] provides efficient communication between the two models, this unified approach has so far seen few practical implementations [31, 48, 59, 64, 65, 66, 67]. An open issue regards how to deal with highly dynamic environments: standard hybrid models usually perform a comparison between static priors, limiting the agent to realize, e.g., multi-step reaching movements through fixed positions. One study addressed the problem of realistic robot navigation in active inference, but by making use of alternative bio-inspired SLAM methods [68]. In [69], a hybrid model in which the agent’s hypotheses were generated at each time step from the system dynamics made it possible to relate continuous trajectories with discrete plans.

A third appealing characteristic of the framework is that an agent can encode beliefs not only over its own bodily states, but also over external physical variables. This has been recently done in the context of active object reconstruction [70, 71, 72, 73] – where an agent encoded independent representations for multiple elements, and used action to more accurately infer their dynamics; for simulating oculomotor behavior [74] – where the dynamics of a target belief was biased by a hidden location; or for analyzing epistemic affordance [50], i.e., the changes in affordance of different objects in relation to the agent’s beliefs. In continuous time, such affordances can be expressed in intrinsic reference frames corresponding to potential configurations of the agent, defining specific ways to interact with the objects. Manipulating these additional beliefs depending on the agent’s intentions [43] permits effectively operating in dynamic contexts, e.g., tracking a target with the eyes [74], or grasping an object on the fly [69] and placing it at a goal position [75]. However, these applications do not exploit the efficiency of deep hierarchical models, and controlling multiple limbs other than the hand is not straightforward. Crucially, they do not capture the flexibility of animal brains in remapping their neuronal activity to account for usable tools.

Based on these premises, a question arises: how to perform dynamic planning with hierarchical structures of several objects? In other words, how to combine these three features into a single view? While many studies in continuous time can be currently found in the literature [52, 76, 77], a rigorous formalism of how to realize goal-directed behavior is still lacking, with the consequence of using different solutions for similar problems – especially in contexts that demand online replanning. On the other hand, promising results have been achieved by combining the capabilities of discrete-time active inference with neural networks, in a way loosely resembling deep RL. Indeed, so-called deep active inference has critical advantages in learning and solving online tasks compared to traditional methods [78, 79, 80, 81, 82]. Still, neural networks are treated as black boxes during free energy minimization, without fully enjoying the potential benefits of hierarchical and temporal depths. One of the most attractive aspects of active inference is that it prescribes a unified perspective not just for fitting to complex high-dimensional data (as neural networks or PCNs do), but also for embodying environmental dynamics and acting over them to minimize uncertainty and conform to prior beliefs.

For these reasons, in this study we explore an alternative direction to the optimal control problem that does not call for additional frameworks: in short, a direction toward hybrid computations in hierarchical systems. We analyze many design choices that have been applied in the motor control domain, with an in-depth look at the three characteristics mentioned above. Asking ourselves how to model tool use, we start from a simple unit and construct richer modules that can be linked in a hierarchical fashion, exhibiting interesting high-level features. In Chapter 2, we consider a single-DoF agent and explore how to realize a sort of multi-step behavior in continuous time only. In Chapter 3, we analyze the implications of combining different units in a single network, using more complex kinematic configurations and distinguishing between intrinsic and extrinsic dynamics. In Chapter 4, we describe the advantages of using discrete decision-making in continuous environments, focusing on hybrid structures and drawing some parallels between the two worlds. Finally, in the Discussion we elaborate on the benefits of addressing discrete and continuous representations together, and give a few suggestions for future work on this subject.

2 Flexible intentions

In this chapter, we begin by explaining the inferential mechanisms of a basic unit in continuous time. We then discuss one by one the changes and features that we introduce, in order to achieve a multi-step behavior in simple tasks that require neither deep hierarchical modeling nor online replanning.

2.1 A simple agent

Figure 1: Factor graph of a basic unit for static reaching. Variables and factors are indicated by circles and squares, respectively. Hidden states $\bm{x}$ (e.g., the hand position) generate observations $\bm{o}$ (e.g., an image of the hand) through the likelihood function $\bm{g}$ , and their 1st derivatives $\bm{x}^{\prime}$ (e.g., the hand velocity) through a dynamics function $\bm{f}$ . In contrast to optimal control, here action follows observation prediction errors arising from a simple attractor $\bm{\rho}$ embedded in the model dynamics, or from a prior belief $\bm{\eta}$ over the hand position.

The most elementary unit is represented in Figure 1. This is the simplest formulation of a continuous-time active inference agent, where we kept only the key nodes. This allows us to easily describe a velocity-controlled dynamic system with the following likelihood $\bm{g}$ and dynamics $\bm{f}$ :

\displaystyle\begin{split}\bm{o}&=\bm{g}(\bm{x})+\bm{w}_{o}\\ \bm{x}^{\prime}&=\bm{f}(\bm{x})+\bm{w}_{x}\end{split}

(1)

where $\bm{x}$ and $\bm{o}$ are respectively called hidden states and observations, and the letter $w$ indicates noise terms sampled from Gaussian distributions. For simplicity, we considered just two temporal orders – although all the features that we elucidate in the following hold for a system of generalized coordinates [53] – and we defined a likelihood function only for a single temporal order. We assume that the corresponding generative model is factorized as follows:

p(\tilde{\bm{x}},\bm{o})=p(\bm{o}|\bm{x})p(\bm{x})p(\bm{x}^{\prime}|\bm{x})

(2)

where:

\displaystyle\begin{split}p(\bm{o}|\bm{x})&=\mathcal{N}(\bm{g}(\bm{x}),\bm{\pi}_{o}^{-1})\\ p(\bm{x})&=\mathcal{N}(\bm{\eta},\bm{\pi}_{\eta}^{-1})\\ p(\bm{x}^{\prime}|\bm{x})&=\mathcal{N}(\bm{f}(\bm{x}),\bm{\pi}_{x}^{-1})\end{split}

(3)

expressed in terms of precisions (or inverse variances) $\bm{\pi}$ . Note that we introduced a prior $\bm{\eta}$ over the hidden states, which is not generally used in continuous-time formulations, but it is the key element connecting different levels in discrete-time active inference [59] or PCNs [16] – as will be explained later. Also note that we used a generalized notation for instantaneous trajectories or paths, i.e., $\tilde{\bm{x}}=[\bm{x},\bm{x}^{\prime}]$ , where $\bm{x}$ will be indicated in the following as the 0th order, and $\bm{x}^{\prime}$ as the 1st order. We highlighted in green and red respectively the input and output of the unit, namely the prior $\bm{\eta}$ and the observations $\bm{o}$ . For the moment, we do not specify their nature, either being intermediate representations coming from other levels, or the highest and lowest levels of the hierarchy, i.e., a fixed prior and a sensory observation.

Exact computation of the posterior $p(\tilde{\bm{x}}|\bm{o})$ is unfeasible since the evidence requires marginalizing over every possible outcome, i.e., $p(\bm{o})=\int{p(\tilde{\bm{x}},\bm{o})d\tilde{\bm{x}}}$ . For this reason, estimation of hidden states $\tilde{\bm{x}}$ is carried out through a variational approach [83], i.e., by minimizing the difference between a properly chosen approximate posterior $q(\tilde{\bm{x}})$ and the true posterior. This difference is expressed in terms of a Kullback-Leibler (KL) divergence:

D_{KL}[q(\tilde{\bm{x}})||p(\tilde{\bm{x}}|\bm{o})]=\int_{\tilde{\bm{x}}}q(\tilde{\bm{x}})\ln\frac{q(\tilde{\bm{x}})}{p(\tilde{\bm{x}}|\bm{o})}d\tilde{\bm{x}}

(4)

The denominator $p(\tilde{\bm{x}}|\bm{o})$ still depends on the marginal $p(\bm{o})$ , but the KL divergence can be rewritten in terms of the log evidence and a quantity known as the free energy $\mathcal{F}$ :

\mathcal{F}=\operatorname*{\mathbb{E}}_{q(\tilde{\bm{x}})}\left[\ln\frac{q(\tilde{\bm{x}})}{p(\tilde{\bm{x}},\bm{o})}\right]=\operatorname*{\mathbb{E}}_{q(\tilde{\bm{x}})}\left[\ln\frac{q(\tilde{\bm{x}})}{p(\tilde{\bm{x}}|\bm{o})}\right]-\ln p(\bm{o})

(5)
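The rearrangement in Equation 5 can be checked numerically. The sketch below is only an illustration under assumptions that are not part of the model above: it uses a hypothetical binary latent variable and a single fixed observation (the probabilities in `prior` and `likelihood` are made up), so that every expectation becomes a finite sum. For any approximate posterior, the free energy minus the KL divergence should equal $-\ln p(\bm{o})$ .

```python
import math

# Hypothetical toy model (an assumption for illustration only): a binary
# latent x and one fixed observation o, so all expectations are finite sums.
prior = [0.7, 0.3]        # p(x)
likelihood = [0.2, 0.9]   # p(o | x) for the observed o

joint = [p * l for p, l in zip(prior, likelihood)]  # p(x, o)
evidence = sum(joint)                               # p(o)
posterior = [j / evidence for j in joint]           # p(x | o)
surprise = -math.log(evidence)                      # -ln p(o)


def free_energy(q):
    """F = E_q[ln q(x) - ln p(x, o)], first expression in Equation 5."""
    return sum(qk * (math.log(qk) - math.log(jk)) for qk, jk in zip(q, joint))


def kl_to_posterior(q):
    """KL[q(x) || p(x | o)], Equation 4 for a discrete latent."""
    return sum(qk * (math.log(qk) - math.log(pk)) for qk, pk in zip(q, posterior))


for q0 in (0.1, 0.5, 0.9):
    q = [q0, 1.0 - q0]
    F, kl = free_energy(q), kl_to_posterior(q)
    print(f"q0={q0:.1f}  F={F:.4f}  KL={kl:.4f}  F-KL={F - kl:.4f}  surprise={surprise:.4f}")
```

In this toy setting the gap between the free energy and the surprise is exactly the KL divergence, and it closes when the approximate posterior equals the true one.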

Since the KL divergence is always nonnegative, the free energy provides an upper bound on surprise, i.e., $\mathcal{F}\geq-\ln p(\bm{o})$ . Hence, minimizing the KL divergence with respect to $q(\tilde{\bm{x}})$ is equivalent to minimizing $\mathcal{F}$ , and achieves the dual objective of keeping surprise low while estimating the true distribution. Assuming that the approximate posterior can be factorized into independent contributions, and further assuming that each contribution is Gaussian, the optimization process breaks down to the minimization of the prediction errors associated with the distributions of the generative model in Equation 2 – see [31] for more details:

\displaystyle\begin{split}\bm{\varepsilon}_{o}&=\bm{o}-\bm{g}(\bm{\mu})\\ \bm{\varepsilon}_{\eta}&=\bm{\mu}-\bm{\eta}\\ \bm{\varepsilon}_{x}&=\bm{\mu}^{\prime}-\bm{f}(\bm{\mu})\end{split}

(6)

Then, the inference of the means (also called beliefs) of the posterior over the hidden states, denoted by $\tilde{\bm{\mu}}$ , is reduced to the following message passing:

\dot{\tilde{\bm{\mu}}}=\begin{bmatrix}\dot{\bm{\mu}}\\ \dot{\bm{\mu}}^{\prime}\end{bmatrix}=\mathcal{D}\tilde{\bm{\mu}}-\partial_{\tilde{\mu}}\mathcal{F}=\begin{bmatrix}\bm{\mu}^{\prime}-\bm{\pi}_{\eta}\bm{\varepsilon}_{\eta}+\partial_{\mu}\bm{g}^{T}\bm{\pi}_{o}\bm{\varepsilon}_{o}+\partial_{\mu}\bm{f}^{T}\bm{\pi}_{x}\bm{\varepsilon}_{x}\\ -\bm{\pi}_{x}\bm{\varepsilon}_{x}\end{bmatrix}

(7)

where $\mathcal{D}$ is an operator that shifts every derivative by one, i.e., $\mathcal{D}\tilde{\bm{\mu}}=[\bm{\mu}^{\prime},\bm{0}]$ . This term arises because the generative model maintains a belief not over a static point, but over a dynamic trajectory, and only when the motion of the mean $\dot{\tilde{\bm{\mu}}}$ equals the mean of the motion $\mathcal{D}\tilde{\bm{\mu}}$ , is the free energy minimized. In short, the inferential process does not involve matching a state (as in PCNs) but tracking a path [84]. Unpacking Equation 7, we note that the 0th order is subject to a forward error from the prior, a backward error from the likelihood, and a backward error from the dynamics function. On the other hand, the 1st order is only subject to the latter but in the form of a forward error. The belief is then updated via gradient descent, i.e., $\tilde{\bm{\mu}}_{t+1}=\tilde{\bm{\mu}}_{t}+\Delta_{t}\dot{\tilde{\bm{\mu}}}$ , where $\Delta_{t}$ is a time constant.
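As an illustration of the message passing in Equations 6 and 7, the following sketch integrates the belief update for a single scalar DoF with an Euler scheme. Several choices here are assumptions made for the example and are not taken from the text: an identity likelihood, a linear dynamics function pulling toward a fixed point `rho`, unit precisions, a flat prior (`pi_eta = 0`), and a static, noise-free observation. The function name `infer` and its parameters are hypothetical.

```python
def infer(o, rho, eta=0.0, pi_o=1.0, pi_eta=0.0, pi_x=1.0, dt=0.01, steps=2000):
    """Euler integration of Equation 7 for one scalar DoF (perception only).

    Assumed for illustration: g(mu) = mu and f(mu) = rho - mu, so that the
    gradients are dg/dmu = 1 and df/dmu = -1.
    """
    mu, mu_p = 0.0, 0.0          # 0th- and 1st-order beliefs
    dg_dmu, df_dmu = 1.0, -1.0
    for _ in range(steps):
        # Prediction errors (Equation 6)
        eps_o = o - mu                  # likelihood error
        eps_eta = mu - eta              # prior error
        eps_x = mu_p - (rho - mu)       # dynamics error
        # Belief update (Equation 7)
        mu_dot = (mu_p - pi_eta * eps_eta
                  + dg_dmu * pi_o * eps_o
                  + df_dmu * pi_x * eps_x)
        mu_p_dot = -pi_x * eps_x
        mu += dt * mu_dot
        mu_p += dt * mu_p_dot
    return mu, mu_p


mu, mu_p = infer(o=0.0, rho=1.0)
print(f"belief over position: {mu:.3f}, belief over velocity: {mu_p:.3f}")
```

Under these assumptions and without action, the belief does not reach either the observation or the fixed point of the dynamics: setting both updates to zero gives $\mu=(\rho+\pi_{o}o)/(1+\pi_{o})$ , a compromise between what is sensed and what the dynamics predict.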

How can this agent perform a simple reaching movement? As highlighted in Figure 1, we can encode the hand position and velocity as generalized hidden states. We will talk later about the relation between proprioceptive and exteroceptive domains, as this deserves a careful discussion. For now, we consider a single DoF that has a univocal mapping between the joint angle and the Cartesian position, so we can represent both of them by the same variables and factors (note however that we maintain a bold notation for generalization and consistency with the rest of the study). Indicating the target to reach by $\bm{\rho}$ , we can define the following dynamics function:

\bm{f}(\bm{x})=\bm{\rho}-\bm{x}

(8)

expressing a simple attractor toward the target [37, 85, 86, 87, 88]. This dynamics does not exist in the actual generative process, and it is indeed this discrepancy that forces the environment to conform to the agent’s beliefs. Specifically, Equation 8 means that the agent thinks its hand will be pulled toward the target with a strength proportional to the precision $\bm{\pi}_{x}$ . In fact, the attractor affects the belief update through the dynamics prediction error $\bm{\varepsilon}_{x}$ , expressing a difference between the estimated velocity $\bm{\mu}^{\prime}$ and the one predicted by the agent through the dynamics function $\bm{f}$ . Note that this error appears in both temporal orders: in brief, $\bm{\varepsilon}_{x}$ imposes a trajectory at the 1st order which in turn affects the 0th order directly through $\bm{\mu}^{\prime}$ , and indirectly through the gradient $\partial_{\mu}\bm{f}$ .

The interactions between these quantities can be better understood from Figure 2, showing a reaching movement with the defined dynamics function and the trajectories of the agent’s generative model. Here, the belief is subject to two different forces: a likelihood gradient pushing it toward what it is currently perceiving (i.e., the real angle), and the other components that steer it toward the biased dynamics (i.e., the target angle $\bm{\rho}$ ). Note how in the third plot, of the three components that comprise the belief update, the backward error $\partial_{\mu}\bm{f}^{T}\bm{\pi}_{x}\bm{\varepsilon}_{x}$ has the least impact on the overall direction of update. While the exact interactions arising from the dynamics prediction error have yet to be analyzed, in the following we assume that goal-directed behavior is achieved through the forward error at the 1st order $-\bm{\pi}_{x}\bm{\varepsilon}_{x}$ . An alternative would be to directly control the backward error without maintaining a belief over increasing temporal orders [85], which however requires taking a gradient into account and may be more challenging when defining appropriate attractors to reach a goal. Finally, see how, in the middle plot, the agent tries at every instant to minimize the difference between $\bm{\mu}^{\prime}$ and $\dot{\bm{\mu}}$ , thus tracking the actual path of the hidden states.

But how does this agent actually move? As mentioned in the Introduction, action is the other side of the coin of the free energy principle, through which the agent samples those observations that conform to its prior beliefs. In fact, in addition to the perceptual inference typical of predictive coding, active inference assumes that organisms minimize free energy also by interacting with the environment; this minimization breaks down to an even simpler update that only depends on observation prediction errors $\bm{\varepsilon}_{o}$ . Since these prediction errors are generated from the agent’s belief, this means that whenever the latter is biased toward some preferred state, movement naturally follows. There is thus a delicate balance between perception – in which prediction errors climb up the hierarchy to bring the belief closer to the observations – and action – in which prediction errors are suppressed at a low level so that the observations are brought closer to their predictions. However, there is an open issue regarding how active inference should be practically realized in continuous time. A few studies demonstrated that using exteroceptive information directly for computing motor commands could result in smoother movements and resolution of visuo-proprioceptive conflicts [28, 41, 43], and in fact some robotic implementations effectively used this approach [85, 86]. However, evidence seems to indicate that motor commands are generated by suppression of proprioceptive information only [34, 33], which is already in the intrinsic reference frame needed for movement and thus results in an easier inverse dynamics. For this reason, in the following we assume that – indicating by the subscript $p$ the proprioceptive domain – movements are realized by minimizing the free energy with respect to proprioceptive prediction errors $\bm{\varepsilon}_{p}$ :

\dot{\bm{a}}=-\partial_{a}\mathcal{F}=-\partial_{a}\bm{g}_{p}^{T}\bm{\pi}_{p}\bm{\varepsilon}_{p}

(9)

where $\partial_{a}\bm{g}_{p}$ performs an inverse dynamics from the proprioceptive predictions to the motor commands $\bm{a}$ , likely to be implemented by classical spinal reflex arcs. As a last note, the actions can also depend on multiple orders – velocity, acceleration, and so on – allowing more efficient movement and control [89, 90, 91, 92], but since it is beyond our scope, we only minimize the 0th order. Nonetheless, 1st-order movements – e.g., maintaining a constant velocity – are still possible by specification of appropriate dynamics of the hidden states.
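To make the interplay between Equations 7, 8 and 9 concrete, here is a minimal closed-loop sketch of a single-DoF reaching movement. It is an illustration under stated assumptions rather than the implementation used for the figures: the plant is taken to be velocity-controlled and noise-free (the real angle integrates the motor command), proprioception is an identity map, $\partial_{a}\bm{g}_{p}$ is approximated by 1, no prior over the hidden states is used, and all precisions are set to 1. The name `reach` and its parameters are hypothetical.

```python
def reach(rho=1.0, x0=0.0, pi_p=1.0, pi_x=1.0, dt=0.01, steps=10000):
    """Closed-loop reaching for one scalar DoF with Euler integration.

    Assumed for illustration: observation o_p = x (noise-free), g_p(mu) = mu,
    f(mu) = rho - mu (Equation 8), dg_p/da approximated by 1, flat prior.
    """
    x, a = x0, 0.0        # generative process: real angle, motor command
    mu, mu_p = x0, 0.0    # generative model: beliefs over angle and velocity
    for _ in range(steps):
        o_p = x                        # proprioceptive observation
        eps_p = o_p - mu               # proprioceptive prediction error
        eps_x = mu_p - (rho - mu)      # dynamics prediction error
        # Perception (Equation 7 with dg/dmu = 1 and df/dmu = -1)
        mu_dot = mu_p + pi_p * eps_p - pi_x * eps_x
        mu_p_dot = -pi_x * eps_x
        # Action (Equation 9): suppress the proprioceptive prediction error
        a_dot = -pi_p * eps_p
        # Explicit Euler step for the process and the model
        x += dt * a
        a += dt * a_dot
        mu += dt * mu_dot
        mu_p += dt * mu_p_dot
    return x, mu


x_final, mu_final = reach()
print(f"real angle: {x_final:.4f}, belief: {mu_final:.4f}")
```

In this sketch the attractor first biases the belief toward `rho`, which creates a proprioceptive prediction error that the action then suppresses by moving the real angle. With these particular gains the linearized loop has complex modes, so the approach to the target may show some damped oscillation; different precisions would change the transient.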

2.2 Tracking objects

The simple agent defined in the previous section can only realize fixed trajectories embedded in the dynamics function, so how can we let it track moving objects? This is usually done by introducing a key concept in active inference, the hidden causes $\bm{v}$ , which link hierarchical levels and specify how the dynamics function evolves. In the active inference literature of motor control, they are also used to encode the target to be reached [28, 67, 74, 93], as depicted in Figure 3. Considering the target as a causal variable for the hidden states and sensory observations makes sense from an active perspective whereby “it is an object I want to reach that generates my movements”. Now, the agent’s generative model becomes:

p(\tilde{\bm{x}},\bm{v},\bm{o})=p(\bm{o}|\bm{x},\bm{v})p(\bm{x})p(\bm{x}^{\prime}|\bm{x},\bm{v})p(\bm{v})

(10)

where:

\displaystyle\begin{split}p(\bm{x})&=\mathcal{N}(\bm{\eta}_{x},\bm{\pi}_{\eta_{x}}^{-1})\\ p(\bm{v})&=\mathcal{N}(\bm{\eta}_{v},\bm{\pi}_{\eta_{v}}^{-1})\\ p(\bm{x}^{\prime}|\bm{x},\bm{v})&=\mathcal{N}(\bm{f}(\bm{x},\bm{v}),\bm{\pi}_{x}^{-1})\end{split}

(11)

Note that there are two priors, one over the hidden states and another over the hidden causes, respectively denoted by $\bm{\eta}_{x}$ and $\bm{\eta}_{v}$ . Note also that both dynamics and likelihood functions depend on the hidden causes, and that we assumed a further factorization for the likelihood:

\displaystyle\begin{split}p(\bm{o}|\bm{x},\bm{v})&=p(\bm{o}_{x}|\bm{x})p(\bm{o}_{v}|\bm{v})\\ p(\bm{o}_{x}|\bm{x})&=\mathcal{N}(\bm{g}_{x}(\bm{x}),\bm{\pi}_{o,x}^{-1})\\ p(\bm{o}_{v}|\bm{v})&=\mathcal{N}(\bm{g}_{v}(\bm{v}),\bm{\pi}_{o,v}^{-1})\end{split}

(12)

where $\bm{o}_{x}$ and $\bm{o}_{v}$ denote the hand and target observations, respectively. It is this additional connection between hidden causes and observations that makes the agent able to operate in dynamic environments. In fact, we can define the following dynamics function:

\bm{f}(\bm{x},\bm{v})=\bm{v}-\bm{x}

(13)

where we just replaced the static target $\bm{\rho}$ with the hidden causes. Then, the posterior belief over the hidden causes $\bm{\nu}$ is updated according to:

\dot{\bm{\nu}}=-\partial_{\nu}\mathcal{F}=-\bm{\pi}_{\eta_{v}}\bm{\varepsilon}_{\eta_{v}}+\partial_{\nu}\bm{g}_{v}^{T}\bm{\pi}_{o,v}\bm{\varepsilon}_{o,v}+\partial_{\nu}\bm{f}^{T}\bm{\pi}_{x}\bm{\varepsilon}_{x}

(14)

where we defined the following observation and prior prediction errors:

\displaystyle\begin{split}\bm{\varepsilon}_{o,v}&=\bm{o}_{v}-\bm{g}_{v}(\bm{\nu})\\ \bm{\varepsilon}_{\eta_{v}}&=\bm{\nu}-\bm{\eta}_{v}\end{split}

(15)

As evident, the hidden causes are subject to a prior prediction error, a backward dynamics error, and a backward likelihood error – similar to the update of the hidden states, with the only difference that this kind of inference is over a state and not a path. Via the backward likelihood error, the agent can correctly estimate the target configuration whenever it moves, as shown in the tracking simulation of Figure 4. Concerning the dynamics prediction error, it can now flow into two different pathways: specifically, the roles of the gradients $\partial_{\mu}\bm{f}$ and $\partial_{\nu}\bm{f}$ are, respectively, to infer the positional state and the cause that may have generated a particular velocity; their actual role will be clear in Chapter 4.
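The updates in Equations 13 to 15 can be sketched for a single scalar DoF as follows. This is a hedged illustration and not the simulation behind Figure 4: it assumes identity likelihoods for both hand and target, flat priors over hidden states and hidden causes, unit precisions, a noise-free velocity-controlled plant, and $\partial_{a}\bm{g}_{p}$ approximated by 1. The target trajectory `ramp` (a slow drift that stops after 50 time units) and the function `track` are hypothetical names introduced for the example.

```python
def track(target, x0=0.0, pi_p=1.0, pi_v=1.0, pi_x=1.0, dt=0.01, steps=15000):
    """Tracking a moving target through hidden causes, one scalar DoF.

    Assumed for illustration: g_x and g_v are identity maps, the priors over
    x and v are flat, and f(mu, nu) = nu - mu (Equation 13), so that
    df/dmu = -1 and df/dnu = 1.
    """
    x, a = x0, 0.0                  # real hand angle and motor command
    mu, mu_p, nu = x0, 0.0, 0.0     # beliefs: hand, hand velocity, target
    for k in range(steps):
        o_x, o_v = x, target(k * dt)        # hand and target observations
        eps_ox = o_x - mu                   # hand likelihood error
        eps_ov = o_v - nu                   # target likelihood error (Equation 15)
        eps_x = mu_p - (nu - mu)            # dynamics error
        mu_dot = mu_p + pi_p * eps_ox - pi_x * eps_x   # Equation 7
        mu_p_dot = -pi_x * eps_x
        nu_dot = pi_v * eps_ov + pi_x * eps_x          # Equation 14, flat prior
        a_dot = -pi_p * eps_ox                         # Equation 9
        x += dt * a
        a += dt * a_dot
        mu += dt * mu_dot
        mu_p += dt * mu_p_dot
        nu += dt * nu_dot
    return x, nu


def ramp(t):
    """Hypothetical target: drifts from 0 to 1 over 50 time units, then stops."""
    return 0.02 * min(t, 50.0)


x_final, nu_final = track(ramp)
print(f"hand: {x_final:.4f}, estimated target: {nu_final:.4f}")
```

Here the backward likelihood error keeps the belief over the hidden cause close to the observed target, while the dynamics error couples that belief to the hand belief and, through action, to the real hand. While the target is moving the hand follows with some lag; once the target stops, both the hand and the estimated target settle on its final position.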

2.3 Intention modulation and multi-step behavior

Although capable of operating in dynamic contexts, the last approach still portrays a simple scenario in which a specific target has no internal dynamics and always has the role of a cause for a hidden state. In other words, it does not permit modeling realistic tasks such as a pick-and-place operation, where an object can either be the cause of a reaching and grasping movement, or the consequence of another cause such as a goal position, resulting in a placing movement; critically, it does not allow modeling a task wherein not only the dynamics of the self, but also the dynamics of the target must be learned (e.g., if a moving object should be grasped on the fly, the agent should infer its trajectory to anticipate where it will fall).

It follows that to operate in a complex environment, the agent must (i) maintain complete representations for each entity that it wants to interact with, and (ii) flexibly assign causes and consequences for the next movement depending on the current context – in a similar way to policies in discrete models, as will be explained later. Therefore, we first encode multiple environmental entities in the hidden states, i.e., $\bm{x}=[\bm{x}_{1},\dots,\bm{x}_{N}]$ , where $N$ is the number of entities [43]. Consequently, the factorized likelihood function generates specular observations for each element:

\bm{o}=[\bm{o}_{1},\dots,\bm{o}_{N}]=[\bm{g}_{1}(\bm{x}_{1}),\dots,\bm{g}_{N}(\bm{x}_{N})]

(16)

This structure is similar to the previous model, except that the target is now embedded in the hidden states along with the hand, and that there is no connection between hidden causes and observations. We could define a similar factorization for the hidden causes and dynamics function, i.e., $\bm{x}^{\prime}=[\bm{f}_{1}(\bm{x}_{1},\bm{v}_{1}),\dots,\bm{f}_{N}(\bm{x}_{N},\bm{v}_{N})]$ , such that each entity would have an independent dynamics biased by a specific cause (e.g., where the hand or the target will be in the future); however, this is of limited use in a pick-and-place operation that demands interaction between entities. We hence compute a potential hidden state with a single function, such as:

\bm{i}(\bm{x})=\bm{W}\bm{x}+\bm{b}

(17)

The weights $\bm{W}$ perform a linear transformation of the hidden states that combines every entity, while the bias $\bm{b}$ imposes a static configuration over them [44]. Equation 17 can be realized through simple neural connections, wherein the weights are encoded as synaptic strengths and the bias represents the threshold needed to fire a spike. An error is then computed between this potential state and the current one:

\bm{e}_{i}=\bm{i}(\bm{x})-\bm{x}

(18)

This vector has the same role as the attractor of Equation 13, but now it points toward a function of the hidden states. Finally, we define the following dynamics function:

\bm{f}(\bm{x},v)=v\bm{e}_{i}

(19)

multiplying the error by a single-value hidden cause $v$ . Thus, the latter is not intended as an explicit trajectory prior over the hidden states (e.g., encoding where my hand will be in the future), whose role is now delegated to the bias $\bm{b}$ ; but as an attractor gain, whereby a high value implies a strong force toward the potential state. As a result, we have an additional modulation that combines with the dynamics precision $\bm{\pi}_{x}$ ; their interactions will be explained in Chapter 4. Since $\bm{i}(\bm{x})$ is used to define a path for the current hidden states aiming to produce a desired configuration, we call it an intention. Similarly, we refer to $\bm{e}_{i}$ as an intention prediction error; note however that this quantity is not strictly a prediction error, although the model could be designed so that it qualifies as one.

In summary, as shown in Figure 5(a), the dynamics function is not composed of segregated pathways as for the likelihood, but affects all the environmental entities at once – e.g., it computes a trajectory for the hand depending on the target. The steps performed by the agent during a reaching movement are the following: (i) the 0th order imposes a dynamic trajectory to the 1st order and generates a sensory prediction; (ii) the 0th order infers the consequences of its predictions, hence it is now biased toward both the intention and the observation; (iii) a proprioceptive prediction is generated from this new biased position, eventually driving action. This approach can be seen as a generalization of [74] where, in the context of oculomotor behavior, the target and the center of gaze were encoded as hidden states, with their own dynamics and attracted by a hidden location. Although somewhat limited compared to non-linear dynamics functions (e.g., obstacle avoidance can be realized by dynamics functions built from repulsive potentials [44]), the specific form defined above – along with the hidden states factorization – affords high flexibility in designing intentions for complex interactions. Further, interpreting the hidden causes as a gain still makes sense from an active inference perspective because what is represented at a higher level is the intention to move toward the target, while the target location is inferred at a lower level.

Taken alone, considering a hidden cause as an attractor gain may not seem so helpful. However, as depicted in Figure 5(b), we can combine $M$ intentions in the following way:

\displaystyle\begin{split}\bm{\eta}_{x,m}^{\prime}&=\bm{f}_{m}(\bm{x},\bm{v})=v_{m}\bm{e}_{i,m}\\ \bm{\eta}_{x}^{\prime}&=\bm{f}(\bm{x},\bm{v})=\sum_{m}^{M}\bm{\eta}_{x,m}^{\prime}\end{split}

(20)

In short, trajectories $\bm{\eta}_{x,m}^{\prime}$ are separately computed from each intention $\bm{i}_{m}$ and with their respective gains; then, the final trajectory $\bm{\eta}_{x}^{\prime}$ is found by combining all of them. The reason why we used the prior notation for the trajectory predictions will be clear in Chapter 4. Note for the moment that, as before, there is a different structure compared to the likelihood. While an observation is generated through a parallel pathway for every environmental belief, each function $\bm{f}_{m}(\bm{x},\bm{v})$ combines all of them in a specific way. Since each intention prediction error is scaled by its hidden cause $v_{m}$ , the latter lends itself to a parallelism with the policies of discrete models, as will be explained later: if $v_{m}$ is set to $1$ and all the others to $0$ , the hidden states will be subject only to intention $m$ ; conversely, if multiple hidden causes are active, the hidden states will be pulled toward a combination of the corresponding intentions. This means that the hidden causes act both as attractor gains – expressing the absolute strength by which the belief is steered toward the desired direction – and as intention modulators – defining the relative strength between each desired state. The resulting dynamics prediction error:

\bm{\varepsilon}_{x}=\bm{\mu}^{\prime}-\bm{\eta}_{x}^{\prime}

(21)

will then realize an average trajectory that the agent predicts for the current situation. This approach is effective for two reasons. First, it allows defining a composite movement in terms of simpler subgoals, which can be tackled separately; this can be helpful, for instance, if one has to analyze the behavior of an agent when subject to two or more opposing priors [43]. But the main utility is that a fixed multi-step behavior can be achieved [75] which does not need to modify the dynamics function at each step but only to modulate the hidden causes, since the model already encodes all the intermediate goals that the agent will pass through. Transitions between continuous trajectories could then be realized by higher-level priors over the hidden causes, e.g., a belief over tactile sensations for multi-step reaching, such as in the simulation of Figure 6.
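The combination of intentions in Equations 17 to 20 can be made concrete with a short sketch. The one-dimensional pick-and-place layout, the weight matrices, and the function names below are hypothetical illustrations chosen for readability, not the configuration used in the simulations of this study.

```python
import numpy as np

def intention_errors(x, weights, biases):
    """Stack e_{i,m} = W_m x + b_m - x for every intention m (Equations 17 and 18)."""
    return np.stack([W @ x + b - x for W, b in zip(weights, biases)])

def combined_dynamics(x, v, weights, biases):
    """Full trajectory eta'_x = sum_m v_m e_{i,m} (Equation 20)."""
    return v @ intention_errors(x, weights, biases)

# Hypothetical hidden states: [hand, object, goal] positions along one axis.
x = np.array([0.0, 1.0, 3.0])

# Intention "reach": the hand should be where the object is.
W_reach = np.array([[0.0, 1.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
# Intention "place": hand and object should both be at the goal.
W_place = np.array([[0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0]])
weights = [W_reach, W_place]
biases = [np.zeros(3), np.zeros(3)]  # no static configuration imposed here

only_reach = combined_dynamics(x, np.array([1.0, 0.0]), weights, biases)
blend = combined_dynamics(x, np.array([0.5, 0.5]), weights, biases)
print(only_reach)  # pulls only the hand, toward the object
print(blend)       # pulls hand and object toward a mix of both intentions
```

In this sketch, switching from reaching to placing requires changing only the vector of hidden causes, while the weights and biases that encode the intermediate goals stay fixed.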

3 Hierarchical models

So far, we have introduced several units with two kinds of inputs – a prior over the hidden states and a prior over the hidden causes – and one kind of output – the 0th-order observation. In this chapter, we focus on how to combine such units in a single network to achieve a more advanced and efficient control. For this, we will make use of the first input, leaving the discussion about the second to the next chapter.

3.1 Intrinsic and extrinsic causes

The last unit presented affords a multi-step behavior in continuous time that accounts, to some extent, for dynamic elements of the environment. However, in all the previous simulations we just considered a single-DoF arm, while in real-life applications we generally deal with much more complex kinematic structures such as the human body. In this case, there is no longer a one-to-one mapping between joint angles and Cartesian positions, so we need to distinguish between proprioceptive and exteroceptive (e.g., visual) observations. As in optimal control, continuous-time active inference considers three reference frames and two inversions: an extrinsic signal (e.g., encoding the Cartesian position of a target) is first transformed into an intrinsic signal (e.g., encoding the joint angles configuration corresponding to the hand at the target) through inverse kinematics, which is in turn converted to the actual motor control signals (e.g., joint torques) through inverse dynamics [94]. These two processes are also attributed to the human brain [95, 96, 97], but there is a substantial difference between optimal control and active inference regarding how they unfold in practice. As mentioned in the previous chapter, in active inference the motor commands are replaced by proprioceptive prediction errors that are suppressed through spinal reflex arcs [34]. As a consequence, inverse dynamics becomes easier because action is put aside and the agent has just to know the mapping from proprioceptive states to motor commands – see Equation 9.

But what about inverse kinematics? Recall the perspective that we mentioned in the previous chapter, i.e., that “it is an object I want to reach that generates my movements”. Turning optimal control upside down, active inference posits that action is driven by the proprioceptive consequences (e.g., changes in limb lengths) of extrinsic causes (e.g., a target) [33]. Intuitively, one could model an extrinsic movement as in Figure 7(a), i.e., with the following dynamics and likelihood functions:

\displaystyle\begin{split}\bm{f}(\bm{x},\bm{v})&=\bm{J}^{T}(\bm{v}-\bm{T}(\bm{x}))\\ \bm{g}_{p}(\bm{x})&=\bm{x}\\ \bm{g}_{v}(\bm{x},\bm{v})&=\begin{bmatrix}\bm{T}(\bm{x})&\bm{v}\end{bmatrix}\end{split}

(22)

where $\bm{T}$ is the forward kinematics returning the hand position, and $\bm{J}$ is the Jacobian matrix.

In short, the hand – expressed in terms of joint angles of the whole arm – is embedded in the hidden states, while the target to reach – expressed as a Cartesian position – is encoded in the hidden causes. The proprioceptive states needed for movement are found by first using an inverse kinematic model directly as a forward model into the dynamics function, e.g., through a Jacobian transpose or a pseudoinverse [85, 86, 87, 37, 28, 98, 93]; and then, by generating a prediction through the proprioceptive likelihood $\bm{g}_{p}$ (which here is a simple identity mapping). The visual likelihood $\bm{g}_{v}$ is finally used to update the target location and further refine the inference of the agent’s configuration. This approach implies that an extrinsic reference frame is inverted to generate an intrinsic state, which is in turn transformed back into the first domain to be compared with visual observations. As a result, forward and inverse kinematics are performed twice, once in the dynamics function and once in the forward and backward passes of perceptual inference, when propagating the visual prediction error $\bm{\varepsilon}_{v}$ :

\partial\bm{g}_{v}^{T}\bm{\varepsilon}_{v}=\begin{bmatrix}\bm{J}\\ \bm{1}\end{bmatrix}(\bm{o}_{v}-\begin{bmatrix}\bm{T}(\bm{x})&\bm{v}\end{bmatrix})

(23)

If the predictions are not temporarily stored, this requires increased computational demand and memory. Crucially, there is an additional issue regarding biological plausibility: using sensory-level attractors within the dynamics function means that a unit is aware of, and can use within the same level, part of the likelihood prediction – which is generally assumed to go all the way down to the sensorium – and its inverse mapping, which are lower-level features. Finally, the model in Figure 7(a) does not let the agent express paths in extrinsic coordinates needed, e.g., for realizing linear or circular motions, or for imposing constraints in both intrinsic and extrinsic domains such as when walking with a glass in hand.

We can instead exploit Equation 23 and follow the natural flow of the generative process to avoid duplicated computations, as displayed in Figure 7(b). This model relies on two hierarchical levels, where an intrinsic unit (encoding the arm joint angles) is placed at the top and generates predictions through forward kinematics for an extrinsic unit (encoding the Cartesian position of the target) [44]:

\bm{x}_{e}=\bm{g}(\bm{x}_{i})=\bm{T}(\bm{x}_{i})+\bm{w}_{i}

(24)

The goal-directed behavior of Equation 22 arises naturally via backpropagation of the extrinsic prediction error $\bm{\varepsilon}_{e}$ :

\partial\bm{g}_{e}^{T}\bm{\varepsilon}_{e}=\bm{J}^{T}(\bm{x}_{e}-\bm{T}(\bm{x}_{i}))

(25)

Having a complete unit that deals with extrinsic information – which is thus not embedded into the hidden causes of the intrinsic unit – allows the agent to specify their dynamics, leading to an efficient decomposition between intrinsic and extrinsic attractors, and between proprioceptive and visual observations – as exemplified in the simulations of Figure 8. Note the similarity of Equation 25 with Equations 22 and 23: if in the model of Figure 7(a) we had two different forward and inverse kinematics either for goal-directed behavior or for predicting current observations, in this case what is compared with the observations already contains a bias toward preferred states, without the need for sensory-level attractors within the dynamics function.
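As an illustration of Equation 25, the sketch below shows how repeatedly backpropagating the extrinsic prediction error through the Jacobian transpose drives an intrinsic belief toward a configuration that is consistent with a given extrinsic belief. It assumes a planar 2-DoF arm with unit link lengths, a fixed extrinsic belief, no dynamics or precisions, and plain Euler steps; all names and values are hypothetical.

```python
import numpy as np

LENGTHS = np.array([1.0, 1.0])  # assumed link lengths of a planar 2-DoF arm

def forward_kinematics(theta):
    """T(x_i): Cartesian hand position from the two joint angles."""
    a1, a12 = theta[0], theta[0] + theta[1]
    return np.array([LENGTHS[0] * np.cos(a1) + LENGTHS[1] * np.cos(a12),
                     LENGTHS[0] * np.sin(a1) + LENGTHS[1] * np.sin(a12)])

def jacobian(theta):
    """J: partial derivatives of the hand position w.r.t. the joint angles."""
    a1, a12 = theta[0], theta[0] + theta[1]
    return np.array([
        [-LENGTHS[0] * np.sin(a1) - LENGTHS[1] * np.sin(a12), -LENGTHS[1] * np.sin(a12)],
        [LENGTHS[0] * np.cos(a1) + LENGTHS[1] * np.cos(a12), LENGTHS[1] * np.cos(a12)],
    ])

def backward_extrinsic_error(theta, x_e):
    """J^T (x_e - T(x_i)), as in Equation 25."""
    return jacobian(theta).T @ (x_e - forward_kinematics(theta))

def infer_configuration(theta0, x_e, steps=2000, lr=0.1):
    """Integrate the intrinsic belief under the backward extrinsic error alone."""
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta = theta + lr * backward_extrinsic_error(theta, x_e)
    return theta

x_e = np.array([1.0, 1.0])  # extrinsic belief, held fixed for this sketch
theta_hat = infer_configuration([0.3, 1.0], x_e)
```

In the full model the extrinsic belief is not fixed but is itself updated by its own dynamics and by visual observations, so this loop isolates only the bottom-up message of the two-level structure.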

Although the generative model follows the forward flow of optimal control, the relationship between proprioceptive consequences and extrinsic causes peculiar to active inference still holds because the kinematic inversion regards a high-level process that manipulates abstract (intrinsic or extrinsic) representations, and both of them concur to generate low-level proprioceptive states. As noted by Adams and colleagues, “The key distinction is not about mapping from desired states in an extrinsic (kinematic) frame to an intrinsic (dynamic) frame of reference, but the mapping from desired states (in either frame) to motor commands” [34]. Having said this, there is a significant difference between the two models represented in Figure 7, which can be compared to the two supervised learning modes of predictive coding [16]: a forward mode that fixes the latent states to the labels and the observations to the data can generate highly accurate images of digits, while the inverse classification task is more difficult as there is no univocal mapping between labels and data; instead, a backward mode that fixes the latent states to the data and the observations to the labels achieves high performance on classification but falls short when generating images. Based on this, we can interpret the model of Figure 7(a) as a backward mode that would rapidly generate a proper kinematic configuration with the hand at the target, but that would hardly infer from proprioception the hand position needed to plan movements. Conversely, we can interpret the model of Figure 7(b) as a forward mode that would generate with high accuracy the hand position, but that would find it difficult to infer the kinematic configuration needed to actually realize movement.

3.2 A module for iterative transformations

The model in Figure 7(b) introduced a hierarchical dependency between two (intrinsic and extrinsic) levels, made possible by a connection between hidden states. Instead, the typical approach in continuous-time active inference is to let the hidden states and causes of a level exchange information with the hidden causes (and not the hidden states) of the subordinate level, as shown in Figure 9(a). While this allows one to impose a dynamic trajectory for the unit below, specifying fixed setpoints to the 0th-order hidden states is not as straightforward, since the dynamics prediction error generated from the hidden causes would have to travel back to the previous temporal orders. As clear from Figure 7(b), a connection between hidden states is of extreme importance when designing hierarchical models. In fact – as represented in Figure 9(b) – it is fundamental in defining the initial state of slower temporal scales in discrete models, e.g., in pictographic reading [48]. An analogy could be also made regarding the hierarchical connectivity of PCNs, as units are connected in a multiple-input and multiple-output system, defining static priors for the subordinate levels [16] – as shown in Figure 9(c).

Following these two examples and the previous kinematic model, we use the observation of a level to directly bias the 0th-order hidden states of the level below. As a result, the observation prediction error $\bm{\varepsilon}_{o}$ and the prior prediction error $\bm{\varepsilon}_{\eta_{x}}$ of Equation 6 are expressed by the same variable:

\bm{\varepsilon}_{o}^{(i)}=\bm{\mu}^{(i+1)}-\bm{g}^{(i)}(\bm{\mu}^{(i)})

(26)

where the hierarchical level is indicated with a superscript and lower levels are denoted by increasing numbers. We can then design a multiple-input and multiple-output system wherein a level imposes and receives priors and observations to several units, as in Figure 9(d). The computation of the free energy in Equation 5 remains unchanged, and the update of the hidden states turns into the following:

\dot{\tilde{\bm{\mu}}}^{(i,j)}=\begin{bmatrix}\bm{\mu}^{\prime(i,j)}-\sum_{k}\bm{\pi}_{o}^{(i-1,k)}\bm{\varepsilon}_{o}^{(i-1,k)}+\sum_{l}\partial_{\mu^{(i,j)}}\bm{g}^{(i,l)T}\bm{\pi}_{o}^{(i,l)}\bm{\varepsilon}_{o}^{(i,l)}+\partial_{\mu^{(i,j)}}\bm{f}^{(i,j)T}\bm{\pi}^{(i,j)}_{x}\bm{\varepsilon}^{(i,j)}_{x}\\ -\bm{\pi}^{(i,j)}_{x}\bm{\varepsilon}^{(i,j)}_{x}\end{bmatrix}

(27)

where the superscript notation $(i,j)$ indicates the $i$ th hierarchical level and the $j$ th element within the same level. As evident, this is similar to the hierarchical connectivity of PCNs – wherein we highlighted an average of forward and backward prediction errors from independent units – but with the addition of model dynamics represented by the leftmost and rightmost terms of the equation.
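The static part of this connectivity can be sketched with two units linked through their hidden states. The example below is deliberately reduced and rests on several assumptions: a linear likelihood $\bm{g}(\bm{\mu})=\bm{A}\bm{\mu}$ between the levels, an identity likelihood from the lower unit to a fixed sensory observation, scalar precisions, and no dynamics terms (the leftmost and rightmost terms of Equation 27 are dropped). Names and values are hypothetical.

```python
import numpy as np

def settle_two_levels(o, A, pi_link=1.0, pi_obs=1.0, steps=1000, dt=0.05):
    """Euler integration of two hidden-state beliefs coupled as in Equation 26."""
    mu_top = np.zeros(A.shape[1])  # belief of the upper unit
    mu_low = np.zeros(A.shape[0])  # belief of the unit below
    for _ in range(steps):
        # Equation 26 with a linear likelihood: the upper unit predicts the lower belief.
        eps_link = mu_low - A @ mu_top
        # Sensory error of the lower unit (identity likelihood assumed).
        eps_obs = o - mu_low
        # Upper unit: backward error through the likelihood gradient A^T.
        d_top = A.T @ (pi_link * eps_link)
        # Lower unit: the same error acts as a prior error, plus its own sensory error.
        d_low = -pi_link * eps_link + pi_obs * eps_obs
        mu_top = mu_top + dt * d_top
        mu_low = mu_low + dt * d_low
    return mu_top, mu_low

A = 2.0 * np.eye(2)                 # assumed linear likelihood between levels
observation = np.array([2.0, 4.0])  # assumed fixed sensory observation
mu_top, mu_low = settle_two_levels(observation, A)
```

The single error `eps_link` enters the lower unit as a prior prediction error and the upper unit as an observation prediction error, which is the dual role described for Equation 26.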

What advantages do deep hierarchical models carry compared to a shallow agent? If we consider the structure of Figure 7(b), although affording a more advanced control with respect to the model in Figure 7(a), its uses are still limited to solving simple tasks, e.g., performing operations with the hand. While simultaneous coordination of multiple limbs is certainly possible, it would require complex dynamics functions, with complexity increasing with the number of joints and ramifications of the kinematic chain. Critically, a shallow agent would not be capable of capturing the hierarchical causal relationships inherent to the generative process, which are what allow one to predict and anticipate the local exchange of forces that would unfold whenever a biased belief over bodily states produces a movement. As mentioned in the Introduction, a deep model is also required if one has to flexibly use external tools for manipulation tasks. Besides the roto-translations occurring in forward kinematics, iterative transformations are also essential in computer vision – where an image can be subject to scaling, shearing, or projection – and, more generally, whenever changing the basis of a coordinate vector.

For these reasons, we can generalize the last model and construct an Intrinsic-Extrinsic (or IE) module [44, 58, 99]. This module is composed of two units and its role is to perform iterative transformations between reference frames. In brief, a unit $\mathcal{U}_{e}^{(i-1)}$ encodes a signal in an extrinsic reference frame, while a second unit $\mathcal{U}_{i}^{(i)}$ contains a generic intrinsic transformation. Applying the latter to the first signal returns a new extrinsic reference frame embedded in a unit $\mathcal{U}_{e}^{(i)}$ . More formally, we can define a likelihood function $\bm{g}_{e}$ such that:

\bm{x}_{e}^{(i)}=\bm{g}_{e}^{(i)}(\bm{x}_{i}^{(i)},\bm{x}_{e}^{(i-1)})=\bm{T}^{(i)}(\bm{x}_{i}^{(i)})\cdot\bm{x}_{e}^{(i-1)}+\bm{w}_{e}^{(i)}

(28)

where $\bm{w}_{e}^{(i)}$ is a noise term and $\bm{T}^{(i)}$ is a linear transformation matrix. Backpropagating the extrinsic prediction error $\bm{\varepsilon}_{e}^{(i)}$ leads to simple belief updates:

\displaystyle\begin{split}{\frac{\partial\bm{g}_{e}^{(i)}}{\partial\bm{\mu}_{e}^{(i-1)}}}^{T}\bm{\varepsilon}_{e}^{(i)}&=\bm{T}^{(i)T}\cdot\bm{\varepsilon}_{e}^{(i)}\\ {\frac{\partial\bm{g}_{e}^{(i)}}{\partial\bm{\mu}_{i}^{(i)}}}^{T}\bm{\varepsilon}_{e}^{(i)}&=\frac{\partial\bm{T}^{(i)}}{\partial\bm{\mu}_{i}^{(i)}}\odot[\bm{\varepsilon}_{e}^{(i)}\cdot\bm{\mu}_{e}^{(i-1)T}]\end{split}

(29)

where $\odot$ is the element-wise product.

These equations express the most likely intrinsic and extrinsic states that may have generated the new reference frame. As shown in Figure 9(e), modules are linked through the extrinsic units $\mathcal{U}_{e}^{(i)}$ , while $\mathcal{U}_{i}^{(i)}$ performs an internal operation and does not contribute to the hierarchical connectivity. Applying this architecture to kinematics, we can realize a hierarchical model with a multiple-output system, wherein the intrinsic hidden states $\bm{x}_{i}^{(i,j)}$ of a level encode a pair of joint angle and limb length of a single DoF. Iteratively applying roto-translations to an origin (e.g., body-centered) reference frame $\bm{x}_{e}^{(0)}$ – consisting of a Cartesian position and an absolute orientation – will determine the kinematic configuration of the agent in terms of extrinsic coordinates [44]. In addition to this, and unlike PCNs, we can now easily express how every single joint and limb would evolve or – which is the same from an active inference perspective – how the agent intends to move its joints and limbs, affording a highly advanced control as demonstrated by the simulations of Figures 10(a) and 10(b). Besides modeling limb dynamics, the IE module can also be applied to non-affine transformations, e.g., perspective projections. As displayed in Figure 10(c), this can be useful for estimating the depth of an object via parallel predictions (e.g., from the eyes or multiple cameras) [58] – a process that active inference casts in terms of target fixation [67]. The modularity of this architecture allows the agent to define dynamic attractors in the 2D projected planes, in the 3D reference frames of the eyes, or as simple vergence-accommodation angles.
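The two belief updates of Equation 29 can be checked numerically on a minimal case. The sketch below assumes that the intrinsic state is a single rotation angle and that $\bm{T}^{(i)}$ is a 2D rotation matrix, omits the noise term and the precisions, and sums the element-wise product over all entries to obtain the scalar gradient for the angle; the function names are hypothetical.

```python
import numpy as np

def rotation(theta):
    """T(x_i): 2D rotation matrix for a single intrinsic angle."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def rotation_derivative(theta):
    """Element-wise derivative of the rotation matrix w.r.t. the angle."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[-s, -c], [c, -s]])

def ie_backward_messages(theta, mu_e_prev, mu_e):
    """Backward messages of Equation 29 for one IE module (unit precisions)."""
    # Extrinsic prediction error between the new frame and its prediction (Equation 28).
    eps_e = mu_e - rotation(theta) @ mu_e_prev
    # Message to the previous extrinsic unit: T^T eps_e.
    grad_extrinsic = rotation(theta).T @ eps_e
    # Message to the intrinsic unit: dT/dtheta (element-wise) [eps_e mu_e_prev^T], summed.
    grad_intrinsic = np.sum(rotation_derivative(theta) * np.outer(eps_e, mu_e_prev))
    return eps_e, grad_extrinsic, grad_intrinsic

theta = 0.4
mu_e_prev = np.array([1.0, 0.5])  # belief over the previous extrinsic frame
mu_e = np.array([0.2, 1.3])       # belief over the new extrinsic frame
eps_e, grad_extrinsic, grad_intrinsic = ie_backward_messages(theta, mu_e_prev, mu_e)
```

Both messages equal the negative gradient of the squared error $\frac{1}{2}\|\bm{\varepsilon}_{e}\|^{2}$ with respect to the corresponding belief, which can be confirmed with finite differences.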

3.3 The self, the objects, and the others

Describing Figure 7(b), we passed over a critical mechanism introduced at the beginning: the characterization of objects for dynamic goal-directed behavior. Recall that the hidden states encode in parallel not only the self but also environmental entities; however, the agent’s model now describes the generative process hierarchically:

\bm{g}_{e}^{(i)}(\bm{x}_{i}^{(i)},\bm{x}_{e}^{(i-1)})=\begin{bmatrix}\bm{T}_{1}^{(i)}(\bm{x}_{i,1}^{(i)})\cdot\bm{x}_{e,1}^{(i-1)}&\dots&\bm{T}_{N}^{(i)}(\bm{x}_{i,N}^{(i)})\cdot\bm{x}_{e,N}^{(i-1)}\end{bmatrix}

(30)

For the self, this has a simple explanation, i.e., it just generates, one after the other, the positions of every segment of the kinematic chain depending on its joint angles. Concerning an object, attaching its visual observation to a second extrinsic hidden state of a specific level would lead the latter to encode its Cartesian position. How then should all the previous levels be interpreted? If the generative model maintains the same hierarchy for both the self and the object, backpropagating the extrinsic prediction errors of the second component will eventually infer a potential agent’s configuration in relation to the object. For instance, if the object is linked to the last (i.e., hand) level, this would represent the hand at the object location, while all the previous levels would represent appropriate intermediate positions and angles generating that final location. In other words, the additional factorizations of hidden states and likelihoods do not encode simple target angles or positions as before, but a whole configuration of the self that the agent thinks to be suitable for interacting with an object. Since each level can express a particular dynamics through its hidden causes, the inference of this potential configuration is steered to match both the object’s affordances and the agent’s intentions (e.g., grasping a cup by the handle or with the whole hand). Such inferred beliefs would be subject only to exteroceptive information coming from the objects, while proprioceptive states would be used only to update the agent’s belief over its current configuration.

Besides modeling object dynamics, this strategy is also useful in multi-agent contexts. One could maintain a hierarchical generative model regarding the kinematic chain of another agent, which would be inferred by exteroceptive observations about all its positions and joint angles, starting from a different body-centered reference frame. As shown in Figure 11, the goal-directed method used for external objects reflects in this case as well: the agent could represent, by a parallel hierarchical pathway, a second agent in relation to itself, expressing a particular kind of interaction (e.g., the hand of the second agent in terms of its own, resulting in a shaking action). These two cases could be interpreted, from a biological perspective, as simulating the functioning of mirror neurons, firing whenever a subject executes a voluntary goal-directed action or when that action is performed by other subjects [100]. Building an internal model with the kinematic chain of the others – both per se and in relation to the self – could be critical to predict (thus, to understand) their intentions. In this view, neural activity results because the agent makes constant predictions over their kinematic structures depending on its hypotheses and the current context [101, 98].

The relationships between the self, the objects, and other agents under active inference may be better understood from the simulation of Figure 12, showing two agents with incompatible goals that depend on each other. Here, both agents are able to infer parallel representations of different kinematic chains, using an effective decomposition of potential and real configurations. Note how one’s current belief is always in between the future state to realize and the actual configuration; this speaks to one of the fundamental aspects of active inference, i.e., that our beliefs never really reflect the state of affairs of the world, but are always biased toward preferred states – eventually driving action. In general, bodily states, objects, or other agents can all be manipulated in reference frames appropriate for a specific context; this is in line with the hypothesis that cortical columns use object-centered reference frames to encode external elements and more abstract entities [102]. This approach also has some analogies with Active Predictive Coding [103] and Recursive Neural Programs [104], which addressed the part-whole hierarchy learning problem in computer vision by recursively applying reference frame transformations to parts of a scene.

4 The hybrid unit

All the hierarchical models we presented fall short of simulating real-life applications that involve planning actions ahead. In this chapter, we turn to the problem of how to integrate discrete decision-making into continuous motor control. In doing this, we will revisit the basic unit of the first chapter, finally using the second input – the prior over the hidden causes.

4.1 Dynamic inference by model reduction

Consider the unit in Figure 5(b): recall that some sort of multi-step behavior was achieved, which however depended on higher-level priors about different modalities. In most cases, we need to switch intentions based on lower-level information, affording a more dynamic and less uncertain behavior. Taking a pick-and-place operation as an example, an IE module would be more confident about the success of the first reaching movement if it could rely not just on a tactile belief but also on its intrinsic and extrinsic hidden states. In other words, the hidden causes $\bm{v}$ should effectively use both their prior $\bm{\eta}_{v}$ and the dynamics prediction error $\bm{\varepsilon}_{x}$ . The latter assumes two different roles depending on which pathway it flows into: the gradient with respect to the hidden states infers the position that is most likely to have generated the current trajectory; conversely, the gradient with respect to the hidden causes infers the most likely combination of gains $v_{m}$ , signaling the current status of the trajectory and resulting in a dynamic modulation of intentions. However, this pathway is somewhat problematic because the hidden causes are generated by Gaussian distributions and do not encode proper probabilities. Thus, the gradient $\partial_{\nu}\bm{f}_{x}$ infers just one of many possible combinations of gains, and it makes sense as “inferring the most likely intention to have generated the current trajectory” only in simple contexts and if appropriate assumptions are made. Hence, to implement a correct intention selection, we assume that the hidden causes are generated from a categorical distribution:

p(\bm{v})=Cat(\bm{H}_{v})

(31)

where $\bm{H}_{v}$ is an intention preference like $\bm{\eta}_{v}$ . In this way, each discrete element of $\bm{v}$ represents the probability that a specific continuous trajectory will be realized.

However, we have now the problem of how to convert discrete hidden causes to continuous hidden states, and vice versa. This can be done via Bayesian model reduction, a technique used to constrain the complexity of full posterior models into simpler and more restrictive (formally called reduced) distributions [61, 62]. Reduced means that the likelihood of some data is equal to that of the full model and the only difference rests upon the specification of the priors – hence, the posterior of a reduced model $m$ can be expressed in terms of the posterior of the full model:

p(\tilde{\bm{x}}|\bm{o},m)=p(\tilde{\bm{x}}|\bm{o})\frac{p(\tilde{\bm{x}}|m)p(\bm{o})}{p(\tilde{\bm{x}})p(\bm{o}|m)}

(32)
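For completeness, the form of Equation 32 follows in a few lines from Bayes' rule, under the assumption stated above that reduced and full models share the same likelihood, i.e., $p(\bm{o}|\tilde{\bm{x}},m)=p(\bm{o}|\tilde{\bm{x}})$ . The following is a sketch of that step, written with the symbols already in use:

```latex
\begin{split}
p(\tilde{\bm{x}}|\bm{o},m)&=\frac{p(\bm{o}|\tilde{\bm{x}})p(\tilde{\bm{x}}|m)}{p(\bm{o}|m)}\\
p(\tilde{\bm{x}}|\bm{o})&=\frac{p(\bm{o}|\tilde{\bm{x}})p(\tilde{\bm{x}})}{p(\bm{o})}\\
\frac{p(\tilde{\bm{x}}|\bm{o},m)}{p(\tilde{\bm{x}}|\bm{o})}&=\frac{p(\tilde{\bm{x}}|m)p(\bm{o})}{p(\tilde{\bm{x}})p(\bm{o}|m)}
\end{split}
```

The shared likelihood cancels in the ratio of the first two lines, and multiplying both sides of the last line by the full posterior $p(\tilde{\bm{x}}|\bm{o})$ recovers Equation 32.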

In our case, model reduction means to explain the infinite values a continuous signal may assume by a discrete set of hypotheses. In active inference, models that combine discrete and continuous signals are called hybrid or mixed [48, 59], and a simplified version is shown in Figure 13(a). We can cast this procedure into the usual message passing, where top-down and bottom-up messages between the two domains respectively perform a Bayesian Model Average (BMA) of reduced priors and a Bayesian Model Comparison (BMC) of reduced sensory evidence. In conventional hybrid models, discrete hidden states generate priors for the continuous hidden causes by weighting the probability of each discrete state with a specific reduced prior, which thus represents one among many alternatives that the agent thinks to be the cause of what it is perceiving [48]. Conversely, the hidden causes posterior is compared with such reduced priors to find which one among them could be the best explanation, taking into account their discrete probabilities before observing sensory evidence.

Averaging and comparing continuous alternatives that are fixed and determined a priori results in the agent’s inability to operate correctly in a changing environment. For instance, if the agent expects to find an object in one of two locations, it will always reach one of its two initial guesses, even if the object has been moved to a third location. How then to use the newly available evidence to update our reduced assumptions? By considering the hidden causes as generated from a categorical distribution – as in Equation 31 – we can compare the posterior over the hidden states with the output of the dynamics functions $\bm{f}_{m}$, which thus act as the agent’s reduced priors [69]. More formally, we define $M$ reduced prior probability distributions and a full prior model:

\displaystyle\begin{split}p(\bm{x}^{\prime}|\bm{x},m)&=\mathcal{N}(\bm{f}_{m}(\bm{x}),\bm{\pi}^{-1}_{x,m})\\ p(\bm{x}^{\prime}|\bm{x})&=\mathcal{N}(\bm{\eta}_{x}^{\prime},\bm{\pi}_{x}^{-1})\end{split}

(33)

where $\bm{\eta}_{x}^{\prime}$ is the full prior. Note here that the reduced priors have the same form as Equation 20 but are not directly conditioned on the hidden causes:

\bm{f}_{m}(\bm{x})=\bm{e}_{i,m}

(34)

Next, we define the corresponding posterior models:

\displaystyle\begin{split}q(\bm{x}^{\prime}|m)&=\mathcal{N}(\bm{\mu}_{m}^{\prime},\bm{p}_{x,m}^{-1})\\ q(\bm{x}^{\prime})&=\mathcal{N}(\bm{\mu}^{\prime},\bm{p}^{-1}_{x})\end{split}

(35)

Now, we can find the full prior and its prediction error by averaging the continuous trajectories with their respective discrete probabilities:

\displaystyle\begin{split}\bm{\eta}_{x}^{\prime}&=\sum_{m}v_{m}\bm{f}_{m}(\bm{x})=\sum_{m}v_{m}\bm{e}_{i,m}\\ \bm{\varepsilon}_{x}&=\bm{\mu}^{\prime}-\bm{\eta}_{x}^{\prime}\end{split}

(36)

which have the same form as Equations 20 and 21. In fact, the hidden states still perceive a single dynamics prediction error containing the total contribution of every intention. As concerns the bottom-up messages $\bm{l}$, we first write the free energy of each reduced model in terms of the full model. As before, maximizing each reduced free energy makes it approximate the log evidence:

\mathcal{F}(m)=\mathcal{F}-\ln\int\frac{p(\tilde{\bm{x}}|m)}{p(\tilde{\bm{x}})}q(\tilde{\bm{x}})d\tilde{\bm{x}}\approx\ln p(\bm{o}|m)

(37)

As a result, the free energy related to each dynamics function $\bm{f}_{m}$ depends on the approximate posterior $q(\tilde{\bm{x}})$ of the full model, avoiding the computation of the reduced posteriors. Under a Gaussian approximation, the $m$th reduced free energy breaks down to a simple formula and the bottom-up messages $\bm{l}$ are found by accumulating the log evidence associated with every intention for a certain amount of continuous time $T$:

\displaystyle\begin{split}l_{m}&=\int_{0}^{T}\mathcal{L}_{m}dt\\ \mathcal{L}_{m}&=\frac{1}{2}(\bm{\mu}_{m}^{\prime T}\bm{p}_{x,m}\bm{\mu}_{m}^{\prime}-\bm{f}_{m}(\bm{x})^{T}\bm{\pi}_{x,m}\bm{f}_{m}(\bm{x})-\bm{\mu}^{\prime T}\bm{p}_{x}\bm{\mu}^{\prime}+\bm{\eta}_{x}^{\prime T}\bm{\pi}_{x}\bm{\eta}_{x}^{\prime})\end{split}

(38)

Then, a BMC turns into computing the softmax of a vector comprising the free energy $E_{m}$ of every reduced model. This quantity compares the prior surprise $-\ln\bm{H}_{v}$ with the accumulated log evidence:

\bm{v}=\sigma(-\bm{E})=\sigma(\ln\bm{H}_{v}+\bm{l})

(39)

See [61, 62] for a full derivation of BMC under the Laplace assumption, and [69] for more details about the presented approach. Equation 39 is the discrete analogue of Equation 14, but now the bottom-up message encodes a proper discrete distribution and can be used to infer the most likely intention associated with the current dynamic trajectory.
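The evidence accumulation of Equation 38 and the comparison of Equation 39 can be sketched for a one-dimensional belief over velocity. This is a rough illustration under stated assumptions rather than the implementation of [69]: the reduced posterior precision and mean are taken from the standard Gaussian form of Bayesian model reduction (see [61, 62]), namely $p_{x,m}=p_{x}+\pi_{x,m}-\pi_{x}$ and $\mu_{m}^{\prime}=(p_{x}\mu^{\prime}+\pi_{x,m}f_{m}-\pi_{x}\eta_{x}^{\prime})/p_{x,m}$, which the text does not state explicitly. The two intentions (move right, move left), the precisions, and the constant inferred velocity are hypothetical values.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    top = max(z)
    e = [math.exp(x - top) for x in z]
    total = sum(e)
    return [x / total for x in e]


def log_evidence_rate(mu, p, eta, pi, f_m, pi_m):
    """Integrand L_m of Equation 38 for scalar beliefs."""
    p_m = p + pi_m - pi                            # reduced posterior precision (assumed form)
    mu_m = (p * mu + pi_m * f_m - pi * eta) / p_m  # reduced posterior mean (assumed form)
    return 0.5 * (mu_m * p_m * mu_m - f_m * pi_m * f_m
                  - mu * p * mu + eta * pi * eta)


def bayesian_model_comparison(H_v, l):
    """Equation 39: v = softmax(ln H_v + l)."""
    return softmax([math.log(h) + l_m for h, l_m in zip(H_v, l)])


H_v = [0.5, 0.5]     # discrete intention prior
f = [1.0, -1.0]      # velocities proposed by the two dynamics functions
pi_m = [1.0, 1.0]    # reduced prior precisions
pi, p = 1.0, 2.0     # full prior and full posterior precisions
mu = 0.8             # inferred velocity, held constant for simplicity
eta = sum(h * f_m for h, f_m in zip(H_v, f))  # full prior, Equation 36

dt, steps = 0.1, 10  # Euler integration over a period T = 1
l = [0.0, 0.0]
for _ in range(steps):
    for m in range(len(f)):
        l[m] += log_evidence_rate(mu, p, eta, pi, f[m], pi_m[m]) * dt

v = bayesian_model_comparison(H_v, l)
```

Because the inferred velocity is closer to the first intention, that intention accumulates more log evidence and receives the larger posterior probability, even though the prior was uniform.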

The factor graph of this model, which we call a hybrid unit, is displayed in Figure 13(b), and its inferential process at each continuous instant is better understood if we analyze separately the three different pathways shown in Figure 13(c): (i) during the forward pass, the unit receives a discrete intention prior $\bm{H}_{v}$ , performs a BMA with dynamically generated trajectories $\bm{f}_{m}(\bm{x})$ that manipulate the inferred beliefs of every environmental entity $\bm{x}_{n}$ , and imposes a prior $\bm{\eta}_{x}^{\prime}$ over the 1st order; (ii) through the first backward pass, the unit accumulates the most likely intention related to the current trajectory by comparing it to the ones generated by the dynamics functions; (iii) in the second backward pass, the unit propagates the dynamics prediction error back to the 0th order to infer the most likely continuous state associated with the trajectory, eventually generating biased observations. After a period $T$ , the unit finally computes the difference between the discrete prior and the accumulated evidence, generates a new combination of intentions, and the process starts over.
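The forward pass (i) amounts to the Bayesian model average of Equation 36. The sketch below assumes, for illustration only, that each dynamics function returns a velocity pointing from the current belief to a target (the helper `reach` and the target positions are hypothetical); the actual form of $\bm{f}_{m}$ follows Equations 20 and 34.

```python
def reach(target):
    """Hypothetical dynamics function: velocity pointing from x to a target."""
    return lambda x: [t - x_k for t, x_k in zip(target, x)]


def full_prior(v, dynamics, x):
    """Equation 36: eta' = sum_m v_m f_m(x)."""
    eta = [0.0] * len(x)
    for v_m, f_m in zip(v, dynamics):
        for k, f_k in enumerate(f_m(x)):
            eta[k] += v_m * f_k
    return eta


def prediction_error(mu_prime, eta_prime):
    """Equation 36: dynamics prediction error on the 1st-order belief."""
    return [a - b for a, b in zip(mu_prime, eta_prime)]


x = [0.0, 0.0]                                    # current 0th-order belief
dynamics = [reach([1.0, 0.0]), reach([0.0, 2.0])]  # two candidate intentions
v = [0.75, 0.25]                                  # discrete hidden causes
eta_prime = full_prior(v, dynamics, x)            # weighted trajectory prior
eps_x = prediction_error([0.0, 0.0], eta_prime)   # belief at rest vs. prior
```

Since the dynamics functions are re-evaluated at every instant, moving a target changes the reduced priors on the fly, which is what distinguishes this scheme from averaging fixed alternatives.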

This kind of dynamic inference has several uses, e.g., it can be used to infer which one among multiple objects an agent is following – as exemplified in Figure 14 – by generating trajectories for different objects and comparing them with the one it is perceiving [69]. As a last note, the dynamics precisions of Equations 33 and 38 have here an interesting interpretation specular to the observation precisions $\bm{\pi}_{o}$. Active inference and predictive coding assume that whenever an agent perceives high noise about a sensory modality, the precision of that generative model will decrease because it cannot be trusted for understanding the state of affairs of the world [11, 12]. In addition, the dualism between action and perception inherent to the free energy principle tells us that the optimization of precisions – which are thought to be encoded as synaptic gains – could play a crucial role in attention mechanisms that selectively sample sensory data [105, 106]. Based on this assumption, we could interpret a low precision $\bm{\pi}_{x,m}$ as the agent’s decreased confidence in that intention for minimizing the prediction errors in the current context; however, a low intention precision could also mean that the agent does not intend to rely on it for the realization of a desired goal. In short, there is a dual interpretation of intention precisions about explaining a situation (e.g., the result of a grasping action to understand an object far away from the hand) or solving a task (e.g., grasping an object when it is out of reach). This perspective unveils an additional mechanism besides the fast inference of hidden causes that we mentioned before: a slow learning of reduced precisions that lets the agent score – and, crucially, focus on – those intentions that would be appropriate for a specific scenario [69].

4.2 A discrete interface for dynamic planning

Numerous studies have demonstrated that the brains of athletes are marked by a higher activation of posterior and subcortical regions that involves little or no conscious thinking, producing fluid transitions between different motions; in contrast, the brain of a novice places a higher demand on prefrontal computations, which results in lower performance [107, 108, 109]. From an active inference perspective, we can compare the proficiency of athletes with the continuous model of Figure 5(b), corresponding to the subcortical sensorimotor loops. This model encodes a transition mechanism that is not very flexible, but which precisely for this reason allows it to react much more rapidly to environmental stimuli, e.g., when grasping objects moving at high speed [75]. In general, this strategy can be very effective when the environment has limited uncertainty and the task to be solved comprises a rigid sequence of actions, which the agent has already correctly learned. However, suppose that the agent is introduced to a novel task, or to a highly complex task that requires careful thinking about the imminent future. In this case, it should be capable of replanning the correct sequence of actions if something unexpected happens, and a high-level belief always producing a behavior determined a priori for the hidden causes would fall short in completing the task.

Having replaced the continuous hidden causes of Figure 5(b) with discrete hidden causes in Figure 13(b), we can now endow the agent with planning capabilities through a discrete model composed of the following distributions – as shown in Figure 15:

p(\bm{s}_{1:T},\bm{v}_{1:T},\bm{\pi})=p(\bm{s}_{1})p(\bm{\pi})\prod_{\tau}p(\bm{v}_{\tau}|\bm{s}_{\tau})p(\bm{s}_{\tau}|\bm{s}_{\tau-1},\bm{\pi})

(40)

where:

\displaystyle\begin{aligned} p(\bm{s}_{1})&=Cat(\bm{D})\\ p(\bm{\pi})&=Cat(\bm{E})\end{aligned}

\displaystyle\begin{aligned} p(\bm{v}_{\tau}|\bm{s}_{\tau})&=Cat(\bm{A})\\ p(\bm{s}_{\tau}|\bm{s}_{\tau-1},\bm{\pi})&=Cat(\bm{B}_{\pi,\tau})\end{aligned}

(41)

Here, $\bm{A}$, $\bm{B}$, $\bm{D}$ are the likelihood matrix, transition matrix, and prior, $\bm{\pi}$ are the policies – which are not state-action mappings as in RL but sequences of actions – with prior $\bm{E}$, and $\bm{s}_{\tau}$ are the discrete hidden states at time $\tau$. These quantities have strict analogies with their continuous counterparts of Equation 3, i.e., the likelihood function $\bm{g}$, the dynamics function $\bm{f}$, the prior $\bm{\eta}_{x}$, and the hidden causes $\bm{v}$, with the difference that the hidden states do not encode instantaneous paths expressed in generalized coordinates, but sequences of future states defined by discrete variables. Here, we wish to infer the posterior distribution:

p(\bm{s}_{1:T},\bm{\pi}|\bm{v}_{1:T})=\frac{p(\bm{v}_{1:T}|\bm{s}_{1:T},\bm{\pi})p(\bm{s}_{1:T},\bm{\pi})}{p(\bm{v}_{1:T})}

(42)

As before, this requires computing the intractable model evidence $p(\bm{v}_{1:T})$, so we resort to a variational approach. Expressing the approximate posterior by its sufficient statistics $\bm{s}_{\pi,\tau}$ and conditioning upon a specific policy:

\displaystyle\begin{split}p(\bm{s}_{1:T}|\bm{v}_{1:T},\bm{\pi})&\approx q(\bm{s}_{1:T},\bm{\pi})=q(\bm{\pi})\prod_{\tau}^{T}q(\bm{s}_{\tau}|\bm{\pi})\\ q(\bm{\pi})&=Cat(\bm{\pi})\\ q(\bm{s}_{\tau}|\bm{\pi})&=Cat(\bm{s}_{\pi,\tau})\end{split}

(43)

we infer the most likely discrete hidden states at time $\tau$ by computing the gradient of the related free energy $\mathcal{F}_{\pi}$ of that policy:

\bm{s}_{\pi,\tau}=\sigma(\ln\bm{B}_{\pi,\tau-1}\bm{s}_{\pi,\tau-1}+\bm{B}_{\pi,\tau+1}^{T}\bm{s}_{\pi,\tau+1}+\sum_{i}\ln\bm{A}^{(i)^{T}}\bm{v}^{(i)}_{\tau})

(44)

where we applied a softmax function to ensure that it is a proper probability distribution. Here, $\bm{A}^{(i)}$ is the likelihood matrix that sends predictions for the $i$th hybrid unit. In fact, if we connect several units to the discrete model, each of them has an independent interface whereby the discrete model computes different signals and waits for the next step $\tau+1$, when it can infer its hidden states based on the evidence accumulated by every unit. Recall that in the combined structure of Figure 15, the role of the hybrid unit was to predict a dynamic trajectory from a discrete intention prior, and to infer the most likely intention in a continuous period $T$. But now the intention prior is generated from a high-level policy that decides which action to take next in the current situation:

\bm{v}_{\tau}=\sigma(\ln\bm{A}\bm{s}_{\tau}+\bm{l}_{\tau})

(45)

where $\bm{l}_{\tau}$ is the bottom-up message at time $\tau$ . The inference of the policies additionally considers unobserved outcomes as random variables, finding the most likely sequence of actions that will lead to some preferred outcomes. More formally, the policy posterior $q(\bm{\pi})$ is found by comparing the policy prior with the expected free energy $\mathcal{G}$ , which is defined as the free energy that the agent expects to perceive in the future:

\bm{\pi}=\sigma(\ln\bm{E}-\mathcal{G})

(46)
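Returning briefly to Equation 45, the interface between discrete states and intentions can be sketched as follows. The snippet reads the equation as $\sigma(\ln(\bm{A}\bm{s}_{\tau})+\bm{l}_{\tau})$, i.e., the prediction $\bm{A}\bm{s}_{\tau}$ takes the place of $\bm{H}_{v}$ in Equation 39; this reading, the two-state likelihood matrix, and the evidence values are assumptions made for illustration. All entries of $\bm{A}\bm{s}_{\tau}$ are assumed strictly positive so that the logarithm is defined.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    top = max(z)
    e = [math.exp(x - top) for x in z]
    total = sum(e)
    return [x / total for x in e]


def intention_posterior(A, s, l):
    """Equation 45, read as v = softmax(ln(A s) + l)."""
    prediction = [sum(a * s_j for a, s_j in zip(row, s)) for row in A]
    return softmax([math.log(p) + l_m for p, l_m in zip(prediction, l)])


A = [[0.9, 0.1],   # rows: intentions, columns: discrete hidden states
     [0.1, 0.9]]
s = [0.8, 0.2]     # belief over discrete hidden states at step tau

prior_only = intention_posterior(A, s, [0.0, 0.0])      # no accumulated evidence
with_evidence = intention_posterior(A, s, [-1.0, 1.0])  # evidence favors intention 2
```

With a null bottom-up message the hidden causes reduce to the top-down prediction $\bm{A}\bm{s}_{\tau}$, whereas sufficiently strong continuous evidence can overturn the intention suggested by the discrete model.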

Assuming that an agent has some preference $\bm{C}$ about future outcomes, the expected free energy $\mathcal{G}_{\pi}$ under policy $\pi$ will consist of a pragmatic or goal-seeking term toward that preference, and an epistemic or uncertainty-reducing term (see [55] for more details):

\displaystyle\begin{split}\mathcal{G}_{\pi}&\approx\sum_{\tau}D_{KL}[q(\bm{v}_{\tau}|\bm{\pi})||p(\bm{v}_{\tau}|\bm{C})]+\operatorname*{\mathbb{E}}_{q(\bm{s}_{\tau}|\bm{s}_{\tau-1},\bm{\pi})}[H[p(\bm{v}_{\tau}|\bm{s}_{\tau})]]\\ &=\sum_{\tau}\bm{v}_{\pi,\tau}(\ln\bm{v}_{\pi,\tau}-\bm{C}_{\tau})+\bm{s}_{\pi,\tau}\bm{H}_{A}\end{split}

(47)

where:

\displaystyle\bm{v}_{\pi,\tau}=\bm{A}\bm{s}_{\pi,\tau}

\displaystyle\bm{C}_{\tau}=\ln p(\bm{v}_{\tau}|\bm{C})

\displaystyle\bm{H}_{A}=-diag(\bm{A}^{T}\ln\bm{A})

(48)

Note in the above equations that the likelihood matrix $\bm{A}$ expresses a conditional probability over the discrete hidden causes $\bm{v}_{\tau}$ . As in conventional hybrid models, the discrete hidden states are linked to the hidden causes, but now the latter directly act as discrete observations generated by the likelihood matrix, which thus replaces the prior $\bm{H}_{v}$ in Equation 31. In sum, computing the posterior probability over policies $\bm{\pi}$ turns into finding the best action that makes the agent conform to the dual objective defined by $\mathcal{G}$ . Here, the discrete actions are not intended as actual motor commands similar to Equation 9, but as abstract actions over high-level representations. In fact, the hierarchical nature of discrete models in active inference makes it possible to perform decision-making with a separation of temporal scales, wherein a specific level can generate and infer the states and the paths of the level below [110, 111, 112]. Further evaluating the consequences of an action for a longer time horizon affords more advanced planning called sophisticated inference [113]. Computing actions with the free energy of the future is different from the motor control of continuous models, which only minimize the free energy of present states.
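As a minimal sketch of Equations 46 to 48, the snippet below scores two hypothetical two-step policies in a two-state world. Several simplifications are assumed: the future states $\bm{s}_{\pi,\tau}$ are obtained by rolling the initial belief forward through the policy's transition matrices, the preference $\bm{C}_{\tau}$ is the same at every step, and all entries of $\bm{A}$ are strictly positive. The matrices and preferences are illustrative values.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    top = max(z)
    e = [math.exp(x - top) for x in z]
    total = sum(e)
    return [x / total for x in e]


def matvec(M, x):
    """Matrix-vector product for nested lists."""
    return [sum(m * x_j for m, x_j in zip(row, x)) for row in M]


def column_entropies(A):
    """H_A = -diag(A^T ln A): entropy of p(v | s) for each state s."""
    return [-sum(row[j] * math.log(row[j]) for row in A)
            for j in range(len(A[0]))]


def expected_free_energy(A, B_sequence, s0, log_C):
    """Equation 47 for one policy, given its sequence of transition matrices."""
    H_A = column_entropies(A)
    s, G = s0, 0.0
    for B in B_sequence:
        s = matvec(B, s)   # predicted state s_{pi,tau}
        v = matvec(A, s)   # predicted hidden causes v_{pi,tau} = A s_{pi,tau}
        pragmatic = sum(v_k * (math.log(v_k) - c_k) for v_k, c_k in zip(v, log_C))
        epistemic = sum(s_j * h_j for s_j, h_j in zip(s, H_A))
        G += pragmatic + epistemic
    return G


A = [[0.9, 0.1], [0.1, 0.9]]
stay = [[1.0, 0.0], [0.0, 1.0]]     # transition that keeps the state
switch = [[0.0, 1.0], [1.0, 0.0]]   # transition that swaps the state
s0 = [1.0, 0.0]
log_C = [math.log(0.1), math.log(0.9)]  # preference for the second hidden cause

policies = [[stay, stay], [switch, stay]]
G = [expected_free_energy(A, B_seq, s0, log_C) for B_seq in policies]
E = [0.5, 0.5]                                                     # policy prior
pi = softmax([math.log(e) - g for e, g in zip(E, G)])              # Equation 46
```

The policy that switches state makes the predicted hidden causes match the preference, so its pragmatic term vanishes and it dominates the policy posterior; the epistemic term is identical for both policies here because the two columns of $\bm{A}$ have the same entropy.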

In addition to the previous agents, it is now possible to synchronize the behavior of different continuous signals based on the same high-level policy. For instance, one can realize a pick-and-place operation with a moving object – as represented in Figure 16 – producing smooth transitions between reaching and grasping actions, respectively performed in extrinsic and intrinsic domains [75]. Note that an intermediate phase between the two actions naturally arises, corresponding to a composite approaching movement. In principle, the learning of intention precisions $\bm{\pi}_{x,m}$ (not to be confused with the policy notation) might shed light on how motor skill learning occurs, via message passing between continuous intentions and discrete policies. Moreover, through this kind of dynamic planning the agent can infer and realize instantaneous trajectories even within the same discrete period $\tau$, which is useful, e.g., for grasping the moving object without waiting for the successive replanning step. Finally, note that to correctly maintain a goal state, we now need to introduce a hidden cause loosely corresponding to the stay action commonly used in discrete tasks [56]. This hidden cause can be linked to an identity intention, i.e., $\bm{i}_{stay}(\bm{x})=\bm{x}$, which can be interpreted as the agent’s desire to maintain the current state of affairs of the world [69]. Again, the dualism between action and perception also relates this hidden cause to the initial stationary state of the task, and translates into a specular desire for a phase of pure perceptual inference – as the one shown in the simulation of Figure 6.

4.3 Deep hybrid models

Figure 17 portrays a deep hybrid model designed for solving a flexible tool-use task [99]. It combines the expressivity of a (deep) hierarchical formulation, the advantages of inferring and imposing dynamic multi-step intentions inherent to a hybrid unit, and the possibility of encoding external objects and other agents. As in Figure 15, the IE modules communicate with a discrete model at the top, but now they are combined in a hierarchical fashion recapitulating the agent’s kinematic chain. As a consequence, two different goal-directed strategies arise. Considering a simple reaching movement, an attractor imposed at the hand level would generate a cascade of extrinsic prediction errors flowing back to the previous levels and finding a suitable kinematic configuration with the hand over the target. This corresponds to a horizontal hierarchical depth occurring along the hybrid units, and can be compared to the process of motor babbling typical of infants [114], whereby random attractors are generated at different hierarchical levels to identify the correct body structure. In addition to this naive strategy, since a discrete model can now generate intentions for every IE module (in both intrinsic and extrinsic domains), a more advanced behavior can be achieved once inverse kinematics is correctly performed, which imposes a specific path to the whole kinematic chain. This corresponds to a vertical hierarchical depth with two (discrete and continuous) temporal scales, steering the lower-level inferential process in a direction that, e.g., avoids singularities or escapes local minima generated by repulsive attractors.

There is thus a delicate balance between forward and backward extrinsic likelihood, and the top-down modulation of the discrete model:

\dot{\bm{\mu}}_{e}^{(i)}\propto-\bm{\pi}_{e}^{(i-1)}\bm{\varepsilon}_{e}^{(i-1)}+\partial\bm{g}_{e}^{T}\bm{\pi}_{e}^{(i)}\bm{\varepsilon}_{e}^{(i)}+\partial\bm{\eta}^{\prime(i)T}_{x,e}\bm{\pi}_{x,e}^{(i)}\bm{\varepsilon}_{x,e}^{(i)}

(49)

where $\partial\bm{\eta}^{\prime(i)T}_{x,e}$ is the gradient of the trajectory prior of Equation 36. From the discrete model’s perspective, the discrete hidden states produce a specific combination of hidden causes for each hybrid unit; this combination generates a composite trajectory in the continuous domain weighting independent intentions, taking into account dynamic elements for the whole discrete step. After this period, evidence is accumulated for every hybrid unit, eventually inferring the most likely discrete state that may have generated the trajectories of the self and the environment.

A non-trivial issue exists in tasks requiring tool use, e.g., reaching a ball with the extremity of a stick. Much as other agents may have different kinematic structures than the self, a tool may have its own hierarchy (e.g., even a simple stick is represented by two Cartesian positions and an angle) that must somehow be integrated into the agent’s generative model. Specifically, reaching an object with the extremity of a tool means defining a potential kinematic chain augmented by a new virtual level, letting the agent think of the tool as an extension of its arm. This is possible by linking the two visual observations of the tool to the hand and virtual levels in a second pathway of the hidden states, as shown in Figure 17. Since the intrinsic units of the IE modules also encode information about limb lengths, the agent can infer through visual observations not only its kinematic structure, but also the actual length of the tool [115]. While this second pathway is still marked by a clear distinction between the tool and the arm since the hand level receives observations from both elements, a third pathway is constructed such that the observation of the ball is only linked to the virtual level. As a result, this new potential configuration views the arm and the tool as being part of the same kinematic chain. The interactions between these three pathways (shown in Figure 18) may shed light on how the remapping of the motor cortex gradually occurs with extensive tool use [6, 7], modifying the boundaries between the self and the environment.

5 Discussion

Despite the many advances that have been made in this relatively new and promising research area, with increasing popularity among different scientific domains, a current drawback is that studies about low-level motor control and high-level behavior have been somewhat distinct so far, making use of two highly specular but separated frameworks. As a result, there is no consensus on how to achieve dynamic planning (i.e., how to perform decision-making in constantly changing environments), and state-of-the-art solutions to tackle complex tasks generally couple active inference with traditional machine learning methods. From a theoretical perspective, a few works prescribed an efficient and elegant way for combining the capabilities of discrete and continuous representations into a single generative model [48, 59]; however, this hybrid approach has not reached as much maturity, with the consequence that there are far fewer studies on the subject in the literature, none applied to dynamic contexts.

For this reason, we tried here to give a comprehensive view of this yet unexplored direction, comparing several design choices regarding goal-directed behavior, with the intent of bringing motor control and behavioral studies closer. As a practical goal, we decided to model tool use [99], a task that inevitably calls for both discrete and continuous frameworks, and that requires taking two additional aspects into account, i.e., object affordances and hierarchical causal relationships. In a simple scenario, considering a target to reach as the cause of some hidden states is a reasonable assumption and makes the agent able to operate in dynamic contexts. But assuming that multiple objects are present, how does the agent decide which one will be the cause of a particular action? And what if the target moves along a non-trivial path? If the hidden states are factorized into independent distributions encoding multiple entities, the hidden causes may be seen from a different angle, i.e. they would manipulate the hidden states through flexible intentions [43, 44, 75]. Each of these entities would have its own dynamics, allowing the agent to predict, e.g., the trajectory of a moving ball. Then, this unit was scaled up to construct complex (deep) hierarchical structures, e.g., for simulating human body kinematics [44], and to perform more general transformations of reference frames, e.g., perspective projections [58]. A hierarchical factorization of the hidden states now assumes a broader perspective that can also account for multi-agent interaction – an aspect that has been analyzed in the discrete framework as well [116]. Finally, we designed a hybrid unit with discrete hidden causes and continuous hidden states, affording dynamic inference via Bayesian model reduction [69] that, when coupled with a higher-level discrete model, made it possible to simulate multi-step tasks involving online planning of actions. 
This showed further parallelisms between the inference of intentions in the continuous domain, and policies in discrete models.

Still, the real question is how we can reach such performance without embedding our prior knowledge into the agent’s generative model. Although not relevant for many continuous-time implementations that focused on different aspects of motor control, a common criticism is that the structure of generative models is defined a priori and fixed, with intricate and hardcoded dynamics functions that raise some concerns about biological plausibility. In contrast, one appealing characteristic of PCNs is that they simulate brain processing with extremely simple functions typical of the connectivity of neural networks (e.g., linear combinations of weights and biases passed to a non-linear activation function). This allows the network to easily adapt to high-dimensional data, with a few critical advantages compared to deep learning arising from a top-down modulation [16]. While much of this research involves static representations, some studies began to address how predictive coding could be used to learn temporal sequences [117, 118], or to solve RL tasks [119, 103, 120]. Here, we demonstrated how generative models in active inference could be realized by simple likelihood and dynamics functions, showing some analogies with the inferential process of PCNs. Based on these findings, a promising research direction would be to imitate their multiple-input and multiple-output architectures (as in Figure 9(d)), so that an agent could not only learn – in a biologically plausible way – its kinematic configuration and the system dynamics, but also act over them to conform to prior beliefs.

Learning policies in continuous environments is not an easy challenge, but addressing it with strategies different from traditional methods might be key for advancing current intelligent agents, realizing the full theoretical potential at the basis of active inference and the free energy principle. On this matter, the state of the art is to approximate the likelihood and transition distributions by deep neural networks [78, 121, 122, 123]. While several benefits arise compared to deep RL, this still relegates the deep structure within the neural network, generally making use of a single-level active inference agent. One study used a more biologically plausible PCN as a generative model [120], but relied on a similar approach. As extensively analyzed in [51], neural networks can be seen as static generative models with infinitely precise priors at the last level and no hidden states. This architecture can be used to perform sparse coding or Principal Component Analysis (PCA); however, it fails to account for dynamic variables, as in deconvolution problems or filtering in state-space models. Temporal depth – either discrete or continuous – is thus key to inferring the most accurate representation of the environment, and indeed it seems that cortical columns are able to express model dynamics (e.g., the prefrontal cortex is constantly involved in predicting future states, and motion-sensitive neurons have been recorded in the early visual cortex as well [124]). While it is true that temporal sequences can be easily handled by deep architectures such as recurrent neural networks or transformers [125], their passive generative mechanism could still be reflected in the behavior of the active inference agent. In contrast to such a passive AI, being grounded on sensorimotor experiences and actively modifying the environment could be fundamental to the emergence of genuine understanding [126]. 
Taken together, these facts suggest that acting upon generalized coordinates of motion or discrete future states also for intermediate levels could bring several advantages in solving RL tasks. For instance, representing an agent in a hierarchical fashion afforded highly advanced control over its whole body structure that would not have been possible by a single level generating only the hand position [44, 115].

How then to learn dynamic planning in deep hierarchical models? The importance of being discrete when considering structure learning has been stressed in [127]. Indeed, hierarchical discrete models afford much more expressivity compared to their continuous counterparts, deriving above all from the simplicity of computing the expected free energy that allows an agent to plan actions over the imminent future. Nonetheless, as Friston and colleagues note, whether to use continuous or discrete representations depends on the model evidence. Specifically, the former may perform better when the evidence has contiguity properties, e.g., when dealing with time series or with Euclidean space. Indeed, the task exemplified in Figure 18 is effective because the Bayesian model reduction performs a dynamic evidence accumulation over the extrinsic space in which the agent operates. Hence, coupling the hierarchical depth of the hybrid units in Figure 17 with a hierarchical discrete architecture (and not just a single discrete level) could bring efficient structure learning also in constantly changing environments. An alternative approach would be to combine in a hierarchical fashion units composed of a joint discrete-continuous model – as in Figure 15 – which would allow dynamic planning to be performed within each single unit. While this solution may not be supported yet by empirical evidence from biological agents, it could be an encouraging direction to explore from a machine learning perspective, contrasting the hypothesis of central discrete decision-making with a distributed network of local decisions.

A third interesting topic regards motor intentionality. Although multi-step tasks are typically tackled at the discrete level, we showed here that, under appropriate assumptions, a non-trivial behavior could be achieved and analyzed also at the continuous level. The flexible intentions that we defined could be compared to an advanced stage of motor skill learning, consisting of autonomous and smooth movements that do not necessitate conscious decision-making [75]. Still, the model structure was predefined in this case as well. How do such intentions emerge during repeated exposure to the same task? How does the agent score which intentions will be appropriate for a specific context? As mentioned in the last chapter, the optimization of intention precisions is likely to involve the free energy of reduced models (see Equation 38). This process may shed light on how discrete actions arise from low-level continuous intentions and, conversely, how the latter are generated from a composite discrete action. Last, a few studies proposed additional connections between policies unfolding at different timescales, either directly [111, 112] or through discrete hidden states [110]. Such approaches could be adopted in hybrid and continuous contexts as well, so that flexible intentions could be propagated via local message passing between hidden causes along the whole hierarchy.

6 Acknowledgments

This research received funding from the European Union’s Horizon H2020-EIC-FETPROACT-2019 Programme for Research and Innovation under Grant Agreement 951910 to I.P.S. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

[1] David Meunier. Hierarchical modularity in human brain functional networks. Frontiers in Neuroinformatics, 3, 2009.
[2] Claus C. Hilgetag and Alexandros Goulas. ‘hierarchy’ in the organization of brain networks. Philosophical Transactions of the Royal Society B: Biological Sciences, 375(1796):20190319, February 2020.
[3] Nicholas P. Holmes and Charles Spence. The body schema and multisensory representation(s) of peripersonal space. Cognitive Processing, 5(2):94–105, June 2004.
[4] Atsushi Yokoi and Jörn Diedrichsen. Neural organization of hierarchical motor sequence representations in the human neocortex. Neuron, 103(6):1178–1190.e7, September 2019.
[5] Christine Assaiante, F. Barlaam, F. Cignetti, and M. Vaugoyeau. Body schema building during childhood and adolescence: A neurosensory approach. Neurophysiologie Clinique = Clinical Neurophysiology, 44(1):3–12, January 2014.
[6] Atsushi Iriki, Michio Tanaka, and Yoshiaki Iwamura. Coding of modified body schema during tool use by macaque postcentral neurones. NeuroReport, 7(14):2325–2330, October 1996.
[7] Shigeru Obayashi, Tetsuya Suhara, Koichi Kawabe, Takashi Okauchi, Jun Maeda, Yoshihide Akine, Hirotaka Onoe, and Atsushi Iriki. Functional brain mapping of monkey tool use. NeuroImage, 14(4):853–861, 2001.
[8] Thomas A. Carlson, George Alvarez, Daw-an Wu, and Frans A.J. Verstraten. Rapid assimilation of external objects into the body schema. Psychological Science, 21(7):1000–1005, May 2010.
[9] Lucilla Cardinali, Francesca Frassinetti, Claudio Brozzoli, Christian Urquizar, Alice C. Roy, and Alessandro Farnè. Tool-use induces morphological updating of the body schema. Current Biology, 19(13):478, 2009.
[10] Rajesh P.N. Rao and Dana H. Ballard. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87, 1999.
[11] Jakob Hohwy. The Predictive Mind. Oxford University Press UK, 2013.
[12] Andy Clark. Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Oxford University Press, 2016.
[13] Jakob Hohwy. New directions in predictive processing. Mind and Language, 35(2):209–223, 2020.
[14] Andy Clark. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3):181–204, 2013.
[15] Stewart Shipp. Neural elements for predictive coding. Frontiers in Psychology, 7, November 2016.
[16] Beren Millidge, Anil Seth, and Christopher L Buckley. Predictive coding: a theoretical and experimental review, 2022.
[17] Alexander Ororbia and Daniel Kifer. The neural coding framework for learning generative models. Nature Communications, 13(1), 2022.
[18] Tommaso Salvatori, Ankur Mali, Christopher L. Buckley, Thomas Lukasiewicz, Rajesh P. N. Rao, Karl Friston, and Alexander Ororbia. Brain-inspired computational intelligence via predictive coding, 2023.
[19] James C. R. Whittington and Rafal Bogacz. An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation, 29(5):1229–1262, March 2017.
[20] James C.R. Whittington and Rafal Bogacz. Theories of Error Back-Propagation in the Brain. Trends in Cognitive Sciences, 23(3):235–250, 2019.
[21] Beren Millidge, Alexander Tschantz, and Christopher L. Buckley. Predictive Coding Approximates Backprop Along Arbitrary Computation Graphs. Neural Computation, 34(6):1329–1368, 2022.
[22] Karl Friston and Stefan Kiebel. Predictive coding under the free-energy principle. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1521):1211–1221, 2009.
[23] Jakob Hohwy, Andreas Roepstorff, and Karl Friston. Predictive coding explains binocular rivalry: An epistemological review. Cognition, 108(3):687–701, September 2008.
[24] Giovanni Pezzulo, Francesco Donnarumma, Domenico Maisto, and Ivilin Stoianov. Planning at decision time and in the background during spatial navigation. Current Opinion in Behavioral Sciences, 29:69–76, 2019.
[25] A David Redish. Vicarious trial and error. Nature Reviews Neuroscience, 17:147–159, 2016.
[26] I. Stoianov, C. Pennartz, C. Lansink, and G. Pezzulo. Model-based spatial navigation in the hippocampus-ventral striatum circuit: a computational analysis. PLOS Computational Biology, 14(9):1–28, 2018.
[27] Ivilin Stoianov, Domenico Maisto, and Giovanni Pezzulo. The hippocampal formation as a hierarchical generative model supporting generative replay and continual learning. Progress in Neurobiology, 217:1–20, 2022.
[28] Karl J. Friston, Jean Daunizeau, James Kilner, and Stefan J. Kiebel. Action and behavior: A free-energy formulation. Biological Cybernetics, 102(3):227–260, 2010.
[29] Karl Friston. The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.
[30] Christopher L. Buckley, Chang Sub Kim, Simon McGregor, and Anil K. Seth. The free energy principle for action and perception: A mathematical review. Journal of Mathematical Psychology, 81:55–79, 2017.
[31] Thomas Parr, Giovanni Pezzulo, and Karl J Friston. Active inference: the free energy principle in mind, brain, and behavior. Cambridge, MA: MIT Press, 2021.
[32] Karl J. Friston, Jean Daunizeau, and Stefan J. Kiebel. Reinforcement learning or active inference? PLoS ONE, 4(7), 2009.
[33] Karl Friston. What is optimal about motor control? Neuron, 72(3):488–498, 2011.
[34] Rick A. Adams, Stewart Shipp, and Karl J. Friston. Predictions not commands: Active inference in the motor system. Brain Structure and Function, 218(3):611–643, 2013.
[35] Harriet Brown, Karl Friston, and Sven Bestmann. Active inference, attention, and motor preparation. Frontiers in Psychology, 2:1–10, 2011.
[36] Giovanni Pezzulo, Leo D’Amato, Francesco Mannella, Matteo Priorelli, Toon Van de Maele, Ivilin Peev Stoianov, and Karl Friston. Neural representation in active inference: using generative models to interact with – and understand – the lived world. Annals of the New York Academy of Sciences, in press 2024.
[37] Pablo Lanillos, Jordi Pages, and Gordon Cheng. Robot self/other distinction: active inference meets neural networks learning in a mirror. (Ecai), 2020.
[38] Marc Toussaint and Amos Storkey. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. ACM International Conference Proceeding Series, 148:945–952, 2006.
[39] Marc Toussaint. Probabilistic inference as a model of planned behavior. Künstliche Intelligenz, 3/09:23–29, 2009.
[40] Matthew Botvinick and Marc Toussaint. Planning as inference. Trends in Cognitive Sciences, 16(10):485–488, 2012.
[41] A. Maselli, P. Lanillos, and G. Pezzulo. Active inference unifies intentional and conflict-resolution imperatives of motor control. PLOS Computational Biology, 18(6), 2022.
[42] Francesco Mannella, Federico Maggiore, Manuel Baltieri, and Giovanni Pezzulo. Active inference through whiskers. Neural Networks, 144:428–437, 2021.
[43] Matteo Priorelli and Ivilin Peev Stoianov. Flexible Intentions: An Active Inference Theory. Frontiers in Computational Neuroscience, 17:1–41, 2023.
[44] Matteo Priorelli, Giovanni Pezzulo, and Ivilin Peev Stoianov. Deep kinematic inference affords efficient and scalable control of bodily movements. Proceedings of the National Academy of Sciences of the United States of America, 120, 2023.
[45] Ajith Anil Meera, Filip Novicky, Thomas Parr, Karl Friston, Pablo Lanillos, and Noor Sajid. Reclaiming saliency: Rhythmic precision-modulated action and perception. Frontiers in Neurorobotics, 16:1–23, 2022.
[46] Raphael Kaplan and Karl J. Friston. Planning and navigation as active inference. Biological Cybernetics, 112(4):323–343, 2018.
[47] Rick A. Adams, Klaas Enno Stephan, Harriet R. Brown, Christopher D. Frith, and Karl J. Friston. The computational anatomy of psychosis. Frontiers in Psychiatry, 4, 2013.
[48] Karl J. Friston, Thomas Parr, and Bert de Vries. The graphical brain: Belief propagation and active inference. 1(4):381–414, 2017.
[49] Riccardo Proietti, Giovanni Pezzulo, and Alessia Tessari. An active inference model of hierarchical action understanding, learning and imitation. Physics of Life Reviews, 46:92–118, September 2023.
[50] Francesco Donnarumma, Marcello Costantini, Ettore Ambrosini, Karl Friston, and Giovanni Pezzulo. Action perception as hypothesis testing. Cortex, 89:45–60, April 2017.
[51] Karl Friston. Hierarchical models in the brain. PLoS Computational Biology, 4(11), 2008.
[52] Matteo Priorelli, Federico Maggiore, Antonella Maselli, Francesco Donnarumma, Domenico Maisto, Francesco Mannella, Ivilin Peev Stoianov, and Giovanni Pezzulo. Modeling motor control in continuous-time Active Inference: a survey. IEEE Transactions on Cognitive and Developmental Systems, pages 1–15, 2023.
[53] Karl Friston, Klaas Stephan, Baojuan Li, and Jean Daunizeau. Generalised filtering. Mathematical Problems in Engineering, 2010:Article ID 621670, 34 p., 2010.
[54] Thomas Parr, Rajeev Vijay Rikhye, Michael M. Halassa, and Karl J. Friston. Prefrontal Computation as Active Inference. Cerebral Cortex, 30(2):682–695, 2020.
[55] Lancelot Da Costa, Thomas Parr, Noor Sajid, Sebastijan Veselic, Victorita Neacsu, and Karl Friston. Active inference on discrete state-spaces: A synthesis. Journal of Mathematical Psychology, 99, 2020.
[56] Ryan Smith, Karl J. Friston, and Christopher J. Whyte. A step-by-step tutorial on active inference and its application to empirical data. Journal of Mathematical Psychology, 107:102632, 2022.
[57] Giovanni Pezzulo, Francesco Rigoli, and Karl J. Friston. Hierarchical Active Inference: A Theory of Motivated Control. Trends in Cognitive Sciences, 22(4):294–306, 2018.
[58] M. Priorelli, G. Pezzulo, and I.P. Stoianov. Active vision in binocular depth estimation: A top-down perspective. Biomimetics, 8(5), 2023.
[59] Karl J. Friston, Richard Rosch, Thomas Parr, Cathy Price, and Howard Bowman. Deep temporal models and active inference. Neuroscience and Biobehavioral Reviews, 77:388–402, 2017.
[60] K. J. Friston, L. Harrison, and Will Penny. Dynamic causal modelling. NeuroImage, 19(4):1273–1302, 2003.
[61] Karl Friston and Will Penny. Post hoc Bayesian model selection. NeuroImage, 56(4):2089–2099, 2011.
[62] Karl Friston, Thomas Parr, and Peter Zeidman. Bayesian model reduction. pages 1–32, 2018.
[63] M.J. Rosa, K. Friston, and W. Penny. Post-hoc selection of dynamic causal models. Journal of Neuroscience Methods, 208(1):66–78, June 2012.
[64] Thomas Parr and Karl J. Friston. The Discrete and Continuous Brain: From Decisions to Movement – And Back Again. Neural Computation, 30:2319–2347, 2018.
[65] T. Parr and K. J. Friston. The computational pharmacology of oculomotion. Psychopharmacology (Berl.), 236(8):2473–2484, August 2019.
[66] A. Tschantz, L. Barca, D. Maisto, C. L. Buckley, A. K. Seth, and G. Pezzulo. Simulating homeostatic, allostatic and goal-directed forms of interoceptive control using active inference. Biological Psychology, 169:108266, 2022.
[67] Thomas Parr and Karl J. Friston. Active inference and the anatomy of oculomotion. Neuropsychologia, 111:334–343, 2018.
[68] Ozan Çatal, Tim Verbelen, Toon Van de Maele, Bart Dhoedt, and Adam Safron. Robot navigation as hierarchical active inference. Neural Networks, 142:192–204, 2021.
[69] M. Priorelli and I.P. Stoianov. Dynamic inference by model reduction. bioRxiv, 2023.
[70] Stefano Ferraro, Toon Van de Maele, Pietro Mazzaglia, Tim Verbelen, and Bart Dhoedt. Disentangling shape and pose for object-centric deep active inference models, 2022.
[71] Ruben S. van Bergen and Pablo L. Lanillos. Object-based active inference, 2022.
[72] Toon Van de Maele, Tim Verbelen, Ozan Çatal, and Bart Dhoedt. Embodied object representation learning and recognition. Frontiers in Neurorobotics, 16, April 2022.
[73] Toon Van de Maele, Tim Verbelen, Pietro Mazzaglia, Stefano Ferraro, and Bart Dhoedt. Object-centric scene representations using active inference, 2023.
[74] Rick A. Adams, Eduardo Aponte, Louise Marshall, and Karl J. Friston. Active inference and oculomotor pursuit: The dynamic causal modelling of eye movements. Journal of Neuroscience Methods, 242:1–14, 2015.
[75] Matteo Priorelli and Ivilin Peev Stoianov. Slow but flexible or fast but rigid? discrete and continuous processes compared. bioRxiv, 2023.
[76] Pablo Lanillos, Cristian Meo, Corrado Pezzato, Ajith Anil Meera, Mohamed Baioumy, Wataru Ohata, Alexander Tschantz, Beren Millidge, Martijn Wisse, Christopher L. Buckley, and Jun Tani. Active inference in robotics and artificial agents: Survey and challenges. CoRR, abs/2112.01871, 2021.
[77] Tadahiro Taniguchi, Shingo Murata, Masahiro Suzuki, Dimitri Ognibene, Pablo Lanillos, Emre Ugur, Lorenzo Jamone, Tomoaki Nakamura, Alejandra Ciria, Bruno Lara, and Giovanni Pezzulo. World models and predictive coding for cognitive and developmental robotics: frontiers and challenges. Advanced Robotics, 37(13):780–806, June 2023.
[78] Kai Ueltzhöffer. Deep Active Inference. pages 1–40, 2017.
[79] Beren Millidge. Deep active inference as variational policy gradients. Journal of Mathematical Psychology, 96, 2020.
[80] Zafeirios Fountas, Noor Sajid, Pedro A.M. Mediano, and Karl Friston. Deep active inference agents using Monte-Carlo methods. Advances in Neural Information Processing Systems (NeurIPS), 2020.
[81] Théophile Champion, Marek Grześ, Lisa Bonheme, and Howard Bowman. Deconstructing deep active inference. 2023.
[82] Aleksey Zelenov and Vladimir Krylov. Deep active inference in control tasks. In 2021 International Conference on Electrical, Communication, and Computer Engineering (ICECCE), pages 1–3, 2021.
[83] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, April 2017.
[84] Maxwell J. D. Ramstead, Dalton A. R. Sakthivadivel, Conor Heins, Magnus Koudahl, Beren Millidge, Lancelot Da Costa, Brennan Klein, and Karl J. Friston. On Bayesian mechanics: a physics of and by beliefs. Interface Focus, 13(3), April 2023.
[85] Cansu Sancaktar, Marcel A. J. van Gerven, and Pablo Lanillos. End-to-End Pixel-Based Deep Active Inference for Body Perception and Action. In 2020 Joint IEEE 10th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pages 1–8, 2020.
[86] Guillermo Oliver, Pablo Lanillos, and Gordon Cheng. An empirical study of active inference on a humanoid robot. IEEE Transactions on Cognitive and Developmental Systems, 8920(c):1–10, 2021.
[87] Cristian Meo and Pablo Lanillos. Multimodal VAE active inference controller. CoRR, abs/2103.04412, 2021.
[88] Thomas Rood, Marcel van Gerven, and Pablo Lanillos. A deep active inference model of the rubber-hand illusion. 2020.
[89] Mohamed Baioumy, Paul Duckworth, Bruno Lacerda, and Nick Hawes. Active inference for integrated state-estimation, control, and learning. arXiv, 2020.
[90] Cristian Meo, Giovanni Franzese, Corrado Pezzato, Max Spahn, and Pablo Lanillos. Adaptation through prediction: Multisensory active inference torque control. IEEE Transactions on Cognitive and Developmental Systems, 15(1):32–41, 2023.
[91] Fred Bos, Ajith Anil Meera, Dennis Benders, and Martijn Wisse. Free Energy Principle for State and Input Estimation of a Quadcopter Flying in Wind. Proceedings - IEEE International Conference on Robotics and Automation, pages 5389–5395, 2022.
[92] Ajith Anil Meera and Martijn Wisse. Dynamic expectation maximization algorithm for estimation of linear systems with colored noise. Entropy, 23(10), 2021.
[93] Léo Pio-Lopez, Ange Nizard, Karl Friston, and Giovanni Pezzulo. Active inference and robot control: A case study. Journal of the Royal Society Interface, 13(122), 2016.
[94] Emanuel Todorov. Optimality principles in sensorimotor control. Nature Neuroscience, 7:907–915, 2004.
[95] Mareike Floegel, Johannes Kasper, Pascal Perrier, and Christian A. Kell. How the conception of control influences our understanding of actions. Nature Reviews Neuroscience, 24:313–329, 2023.
[96] Giuseppe Vallar, Elie Lobel, Gaspare Galati, Alain Berthoz, Luigi Pizzamiglio, and Denis Le Bihan. A fronto-parietal system for computing the egocentric spatial frame of reference in humans. Experimental Brain Research, 124(3):281–286, January 1999.
[97] James R. Hinman, G. William Chapman, and Michael E. Hasselmo. Neuronal representation of environmental boundaries in egocentric coordinates. Nature Communications, 10(1), June 2019.
[98] Karl J. Friston, Jérémie Mattout, and James Kilner. Action understanding and active inference. Biological Cybernetics, 104(1–2):137–160, February 2011.
[99] Matteo Priorelli and Ivilin Peev Stoianov. Deep hybrid models: infer and plan in the real world. arXiv, 2024.
[100] Giacomo Rizzolatti and Laila Craighero. The mirror-neuron system. Annual Review of Neuroscience, 27:169–192, 2004.
[101] James M. Kilner, Karl J. Friston, and Chris D. Frith. Predictive coding: an account of the mirror neuron system. Cognitive Processing, 8(3):159–166, April 2007.
[102] Jeff Hawkins, Subutai Ahmad, and Yuwei Cui. A theory of how columns in the neocortex enable learning the structure of the world. Frontiers in Neural Circuits, 11, 2017.
[103] Rajesh P. N. Rao, Dimitrios C. Gklezakos, and Vishwas Sathish. Active predictive coding: A unified neural framework for learning hierarchical world models for perception and planning, 2022.
[104] Ares Fisher and Rajesh P N Rao. Recursive neural programs: A differentiable framework for learning compositional part-whole hierarchies and image grammars. PNAS Nexus, 2(11), October 2023.
[105] Harriet Feldman and Karl J. Friston. Attention, uncertainty, and free-energy. Frontiers in Human Neuroscience, 4, 2010.
[106] Thomas Parr, David A. Benrimoh, Peter Vincent, and Karl J. Friston. Precision and false perceptual inference. Frontiers in Integrative Neuroscience, 12, September 2018.
[107] F. Fattapposta, G. Amabile, M. V. Cordischi, D. Di Venanzio, A. Foti, F. Pierelli, C. D’Alessio, F. Pigozzi, A. Parisi, and C. Morrocutti. Long-term practice effects on a new skilled motor learning: An electrophysiological study. Electroencephalography and Clinical Neurophysiology, 99(6):495–507, 1996.
[108] Francesco Di Russo, Sabrina Pitzalis, Teresa Aprile, and Donatella Spinelli. Effect of practice on brain activity: An investigation in top-level rifle shooters. Medicine and Science in Sports and Exercise, 37(9):1586–1593, 2005.
[109] Ann M. Graybiel. Habits, rituals, and the evaluative brain. Annual Review of Neuroscience, 31:359–387, 2008.
[110] Karl J. Friston, Thomas Parr, Conor Heins, Axel Constant, Daniel Friedman, Takuya Isomura, Chris Fields, Tim Verbelen, Maxwell Ramstead, John Clippinger, and Christopher D. Frith. Federated inference and belief sharing. Neuroscience & Biobehavioral Reviews, 156:105500, 2024.
[111] Toon Van de Maele, Bart Dhoedt, Tim Verbelen, and Giovanni Pezzulo. Integrating cognitive map learning and active inference for planning in ambiguous environments. In Active Inference, pages 204–217, Cham, 2024. Springer Nature Switzerland.
[112] Daria de Tinguy, Toon Van de Maele, Tim Verbelen, and Bart Dhoedt. Spatial and temporal hierarchy for autonomous navigation using active inference in minigrid environment. Entropy, 26(1):83, January 2024.
[113] Karl Friston, Lancelot Da Costa, Danijar Hafner, Casper Hesp, and Thomas Parr. Sophisticated inference. Neural Computation, 33(3):713–763, 2021.
[114] Daniele Caligiore, Tomassino Ferrauto, Domenico Parisi, Neri Accornero, Marco Capozza, and Gianluca Baldassarre. Using motor babbling and Hebb rules for modeling the development of reaching with obstacles and grasping. 2008.
[115] Matteo Priorelli and Ivilin Peev Stoianov. Efficient motor learning through action-perception cycles in deep kinematic inference. In Active Inference, pages 59–70. Springer Nature Switzerland, 2024.
[116] Domenico Maisto, Francesco Donnarumma, and Giovanni Pezzulo. Interactive inference: A multi-agent model of cooperative joint actions. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 54(2):704–715, 2024.
[117] Linxing Preston Jiang and Rajesh P. N. Rao. Dynamic predictive coding: A model of hierarchical sequence learning and prediction in the neocortex. bioRxiv, 2023.
[118] Beren Millidge, Mahyar Osanlouy, and Rafal Bogacz. Predictive Coding Networks for Temporal Prediction. pages 1–59, 2023.
[119] Alexander Ororbia and Ankur Mali. Active Predicting Coding: Brain-Inspired Reinforcement Learning for Sparse Reward Robotic Control Problems. 2022.
[120] Beren Millidge. Combining Active Inference and Hierarchical Predictive Coding: a Tutorial Introduction and Case Study. PsyArXiv, 2019.
[121] Ozan Çatal, Johannes Nauta, Tim Verbelen, Pieter Simoens, and Bart Dhoedt. Bayesian policy selection using active inference. arXiv, abs/1904.08149, 2019.
[122] Stefano Ferraro, Toon Van de Maele, Tim Verbelen, and Bart Dhoedt. Symmetry and complexity in object-centric deep active inference models. Interface Focus, 13(3), April 2023.
[123] Kai Yuan, Karl Friston, Zhibin Li, and Noor Sajid. Hierarchical generative modelling for autonomous robots. Research Square, 2023.
[124] Stephen Grossberg and Praveen K. Pilly. Temporal dynamics of decision-making during motion perception in the visual cortex. Vision Research, 48(12):1345–1373, June 2008.
[125] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[126] Giovanni Pezzulo, Thomas Parr, Paul Cisek, Andy Clark, and Karl Friston. Generating meaning: active inference and the scope and limits of passive AI. Trends in Cognitive Sciences, 28(2):97–112, February 2024.
[127] Karl J. Friston, Lancelot Da Costa, Alexander Tschantz, Alex Kiefer, Tommaso Salvatori, Victorita Neacsu, Magnus Koudahl, Conor Heins, Noor Sajid, Dimitrije Markovic, Thomas Parr, Tim Verbelen, and Christopher L Buckley. Supervised structure learning. 2023.