
Recent Trends in Insect and Robot Navigation through the Lens of Reinforcement Learning

Stephan Lochner, Institute of Biology I, University of Freiburg, Germany. Correspondence: [email protected]
Daniel Honerkamp, Department of Computer Science, University of Freiburg, Germany
Abhinav Valada, Department of Computer Science, University of Freiburg, Germany
Andrew D. Straw, Institute of Biology I, University of Freiburg, Germany; Bernstein Center Freiburg, University of Freiburg, Germany

Abstract

Bees are among the master navigators of the insect world. Despite impressive advances in robot navigation research, the performance of these insects is still unrivaled by any artificial system in terms of training efficiency and generalization capabilities, particularly considering the limited computational capacity. On the other hand, computational principles underlying these extraordinary feats are still only partially understood. The theoretical framework of reinforcement learning (RL) provides an ideal focal point to bring the two fields together for mutual benefit. While RL has long been at the core of robot navigation research, current computational theories of insect navigation are not commonly formulated within this framework, but largely as an associative learning process implemented in the insect brain, especially in a region called the mushroom body. Here we argue that this neural substrate can be understood as implementing a certain class of relatively simple RL algorithms, capable of integrating distinct components of a navigation task, reminiscent of hierarchical RL models used in robot navigation. The efficiency of insect navigation is likely rooted in an efficient and robust internal representation of space, linking retinotopic (egocentric) visual input with the geometry of the environment. We discuss how current models of insect and robot navigation are exploring representations beyond classical, complete map-like representations, with spatial information being embedded in the respective latent representations to varying degrees.

1 Introduction

The purpose of this paper is to offer a perspective that links the current understanding of spatial navigation among insect navigation researchers with that of robotics researchers. We do this largely with the help of the theoretical framework of reinforcement learning (RL), which is a central theme in modern robotics research but has so far had relatively little impact on the field of insect navigation or, more broadly, on insect learning. We begin with an introduction to relevant topics, suggest how the anatomy and physiology of the insect brain may implement learning through the lens of reinforcement learning, and finally offer a perspective on how recent models of insect navigation can be viewed as hierarchical RL models.

Reliably navigating the world in order to acquire essential resources while avoiding potentially catastrophic threats is an existential skill for many animals. In the insect world, central-place foragers like many bee and ant species stake the survival of the entire colony on individuals’ ability to return to the nest after extensive foraging trips. Their remarkable navigational capabilities allow them to do so after only a few learning flights or walks under vastly varying environmental conditions. This is so far unrivaled by any artificial autonomous system. Recent years have seen substantial advances in understanding the underlying mechanisms of insect navigation (INav), tentatively converging on what was coined the ‘insect navigation base model’ (INBM) in a comprehensive review by Webb (2019).

This model and its components, rooted in a rich history of behavioral experiments, modeling, and the neuroanatomy of the insect brain, possess substantial explanatory power and offer a mechanistic, bottom-up picture of navigation. Nevertheless, the model is only implicitly related to the high-level objective of efficiently exploiting the resources provided by the environment. On the other hand, robot navigation (RNav) research is driven by the practical goal of enabling robots to perform specific spatial tasks, making reinforcement learning a dominant theoretical framework: An RL agent is trained to optimize its interaction with the environment by accumulating positive rewards while avoiding punishment (negative rewards), which it achieves by learning a specific policy: what is the optimal action to take, given the agent’s current state? In contrast to other training paradigms, RL requires no additional external supervision. If the task involves a spatial component, this implies learning a navigational strategy which is optimal for achieving the high-level task.

Successful and reliable navigation depends on a robust and efficient choice of the agent’s internal representation of its environment (mapping) and its own relative pose therein (localization), which is derived from sensory input but can otherwise be arbitrarily complex. This spatial representation of sensory input then serves as the basis to determine a sequence of suitable actions to accomplish a certain objective (planning). Until recently, the dominant approach in RNav decoupled the question of finding a suitable spatial representation from the planning phase: a fixed, feed-forward architecture (usually some variety of ‘simultaneous localization and mapping’, SLAM, see Fuentes-Pacheco et al. 2015 for a review) is used to infer an explicit spatial representation from sensory input, based on which a policy is learned dynamically using e.g. RL methods. More current research, however, is shifting towards end-to-end learning approaches, where differentiable neural network architectures are trained to learn policies directly from the sensory input. In order to do so efficiently, these networks usually form latent spatial representations within hidden layers of the network as an intermediate step. These latent representations are not pre-determined but learned in order to most efficiently solve the navigation task within the constraints of a specific network architecture.¹

¹ Although other training paradigms with varying degrees of supervision are being used as well, most of the models discussed can be trained in an end-to-end RL fashion, and we will interpret their latent representations within a common RL framework for conceptual clarity.

Spatial representations can be further characterized by their ‘geometric content’, i.e. how much of the geometric structure of the environment is encoded in the spatial representation: as a biological example, hippocampal place-cell activity (O’Keefe, 1979), which can be interpreted as a high-level representation of multi-modal sensory input, exhibits characteristics of a grid-like spatial map representation. On the other hand, similarity gradients on a retinotopic (pixel-by-pixel) level, as proposed for visual navigation in insects (Zeil et al., 2003), carry no geometric information at all. This mirrors the long-standing debate among insect navigation researchers whether insects use cognitive maps for navigation (Dhein, 2023). Following a common negative characterization of an animal without a cognitive map: "At any one time, the animal knows where to go rather than where it is […]" (Hoinville and Wehner, 2018), we can restate the question in the language of RL as follows:

What is the geometric content (‘where the animal is’) of the - latent or explicit - spatial representation (‘what the animal knows’) of the RL agent?

Free of anatomical and physiological constraints, recent RNav research has produced a plethora of end-to-end learned navigation models with different architectures and policy optimization routines. The resulting pool of ‘experimentally validated’ latent spatial representations can serve as theoretical guidance when thinking about the way space is represented in the insect brain for successful navigation - both in terms of behavioral modeling and in the experimental search for neural correlates of such representations. Conversely, evidence about certain components of spatial representations in insects, like the existence of spatial vectors encoded in the brain, may guide the design of network architectures for artificial agents. To this end, we will analyze what geometric information is represented (Sec. 2) and how it is represented in recently successful robot navigation models (Sec. 3) and in the ‘insect navigation base model’ (Sec. 4).

Going beyond the conceptual considerations outlined above, the question naturally arises whether a link between insect navigation and RL can be established on a more fundamental level. After a brief formal introduction to RL (Sec. 5), we will investigate how the neuroanatomical components involved in the insect navigation base model, the mushroom bodies (MB) and central complex (CX), could support computations similar to certain simple RL algorithms like SARSA or Q-learning (Sec. 6).

2 Representations of Space from an RL Perspective

In robot navigation, the problem of navigation has traditionally been partitioned into the subtasks of localization, mapping and planning. Mapping and localization operations take (potentially multimodal) sensory input to infer a map of the environment and the agent’s pose. It has long been acknowledged that the localization problem is most easily solved by reference to locations of salient landmarks in the world - i.e. a map - and conversely, constructing a coherent map requires accurate estimates of the agent’s pose. This led to the breakthrough of a suite of techniques collectively known as ‘simultaneous localization and mapping’ (SLAM) (Mur-Artal and Tardós, 2017; Engel et al., 2014a; Endres et al., 2012; Fuentes-Pacheco et al., 2015). Most SLAM techniques combine landmark/feature recognition with odometry to maintain a joint (often probabilistic) representation of the environment and the agent’s pose therein, which we will refer to as the spatial representation $\phi \in \Phi$ of the sensory input $v \in \mathcal{V}$. We denote sensory input with $v$ since, for simplicity, the paper focuses on visual navigation. Multimodal input spaces are of course possible and highly relevant for a realistic understanding of insect navigation.² Based on $\phi$, the planning stage then determines a sequence of actions $a \in \mathcal{A}$ in order to achieve the objective of the navigation task. This stage can in turn use planning-based or learned methods. In the following, we will analyze the spatial representations $\Phi$ found in current robot and insect navigation models.
Since these representations differ in many aspects, we first define two dimensions along which our analysis is structured: ‘What is represented?’ and ‘How is it represented?’.

² Note that the definition of ‘visual input’ may differ between INav and RNav and go beyond pure RGB pixel values. For example, insects can visually infer compass cues from celestial polarization patterns, while depth information obtained from RGB-D cameras is often used in RNav.

Figure 1: (A) representations of space with varying degrees of geometric accuracy (geometric content) (B-H) explicit and latent spatial representations in robot and insect navigation. (B-D) visual SLAM methods use a combination of odometry and landmark/feature tracking and matching to construct explicit topological maps (B), vector/landmark (C), or grid-based (D) maps. (E) the insect navigation base model uses two distinct mechanisms which result in two spatial representations: an explicit vector map, built from visual odometry alone without the need for landmark recognition and map**, and latent directional cues learned from a view memory. It is not clear if and how these two are linked to form a unified representation of space. (F-H) more recent approaches in robot navigation use latent representations. (F) topological latent representations (e.g. RECON Shah et al., 2023) (G) grid-based latent representations can be built from an explicit view-memory based architecture using RNNs (e.g. MapNet Henriques and Vedaldi, 2018) or a learnable map** module as in CMP (Gupta et al., 2019). (H) Unstructured memory based approaches like SMT (Fang et al., 2019) learn a completely abstract latent spatial representation.

2.1 What is Represented? The Geometric Content of the Spatial Representation $\phi$

Confining our discussion to 2D, the simplest, ‘geometrically perfect’ representation of the environment could be imagined as an infinitely extended and infinitesimally spaced grid, filled with binary ‘occupancy’ values. While it may prove useful to enrich the map with layers of meaning (object categories, valuations, etc.) by adding semantic channels, the geometric information is captured fully by this single layer.³ Any practical representation of space, however, must be an abstraction of this ideal to varying degrees, trading off geometric accuracy for improved coding efficiency and storage capacity:

³ Depending on the choice of origin and orientation of the grid axes, this map could be either allocentric or egocentric. Any sensory (visual) input is by definition egocentric.

Grid-based Maps.

The most straightforward simplification is a grid with finite extent and resolution. Grid-based occupancy maps have a long history in SLAM approaches to robot navigation (e.g. Gutmann et al., 2008; Mur-Artal and Tardós, 2017; Engel et al., 2014a; Endres et al., 2012) where a probabilistic occupancy grid map is predicted from a time series of observations, as a joint estimate for both the agent’s position (localization) and layout of the environment (mapping). More recent methods use hierarchical multi-scale approaches (Zhu et al., 2021), neural radiance fields (Rosinol et al., 2023) or Gaussian splatting (Matsuki et al., 2023) for highly accurate reconstructions. Numerous studies have demonstrated the existence of neural activity consistent with this kind of spatial representation in the so-called place-cells of the mammalian hippocampus (O’Keefe, 1979). Although their activity likely does not perfectly and exclusively encode position, and other cell types like head-direction cells, edge cells and entorhinal grid-cells with different spatial receptive fields exist, the collective activity of populations of these different cell types could be interpreted as a rich latent representation of sensory input encoding a high degree of spatial information. However, it is currently unknown whether insects also possess neuronal populations with similar place-cell like activity. We discuss potential candidate cell types in the insect brain in Sec. 5.

While more classical approaches focus on binary or probabilistic occupancy encodings as inputs to motion planners, learning-based methods have also incorporated higher-dimensional context such as semantics (Wani et al., 2020; Schmalstieg et al., 2022; Younes et al., 2023) or potential functions (Ramakrishnan et al., 2022) as additional channels in these maps.
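As a minimal illustration of how a probabilistic occupancy grid can be accumulated from a time series of observations, the following sketch applies the standard log-odds form of a Bayesian update to individual cells. The one-dimensional corridor, the observation sequence, and the inverse sensor model values `p_hit` and `p_miss` are assumptions made purely for illustration and are not taken from any of the cited systems.

```python
import math


def logit(p):
    """Convert a probability into log-odds."""
    return math.log(p / (1.0 - p))


def update_cell(log_odds, hit, p_hit=0.7, p_miss=0.4):
    """Log-odds Bayesian update of one cell for one binary observation.

    p_hit and p_miss are assumed inverse sensor model values: the occupancy
    probability implied by a single 'occupied' or 'free' reading.
    """
    return log_odds + logit(p_hit if hit else p_miss)


def probability(log_odds):
    """Convert log-odds back into an occupancy probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical 1D corridor of five cells with an uninformative prior (p = 0.5).
grid = [0.0] * 5

# Assumed observation sequence: (cell index, sensor reported 'occupied'?).
observations = [(2, True), (2, True), (1, False), (2, True), (1, False)]
for cell, hit in observations:
    grid[cell] = update_cell(grid[cell], hit)

occupancy = [probability(value) for value in grid]
```

Repeated consistent readings push a cell towards 0 or 1, while unobserved cells stay at the prior, which is one way of seeing why such maps tolerate individual noisy measurements.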

Scene-Graphs.

Beyond dense maps, scene graphs have arisen as sparse environment representations that disassemble large scenes into objects, regions, etc., and represent them as nodes (Hughes et al., 2022; Gu et al., 2023; Werby et al., 2024). The resulting representation provides a hierarchical and object-centric abstraction that has proven useful in particular in higher-level reasoning and planning (Rana et al., 2023; Honerkamp et al., 2024). In contrast to pure geometric representations, edges mainly focus on semantic or relational attributes, falling back on grid-based maps for more detailed distance calculations.

Vector Maps.

As a next level of abstraction useful for sparsely populated maps, one could store only the grid indices of occupied cells, instead of an occupancy value for every cell. Increasing the accuracy by replacing grid indices with actual (Cartesian) coordinates with respect to some common origin, we arrive at a vector map, in which geometric relations in the world are represented by relative vectors between salient locations (vector nodes). Banino et al. (2018) propose a biologically inspired, vector-based navigation model, where vectors to navigational goals are represented as a ‘grid-code’ resembling grid- and head-direction cells in the mammalian entorhinal cortex. As we discuss below, the insect navigation base model assumes a vector-based representation of the global geometry of the environment.

Topological Maps.

If the vector information between connected nodes becomes inaccurate, the vector map gradually loses geometric information and transforms into a topological map, to the extreme case where nodes are connected only by binary ‘reachability’ or ‘traversability’ values. A less extreme case would be a ‘weighted graph’ representation, where edge weights could represent the Euclidean (or temporal) distance between nodes, preserving some geometric information, but not enough to uniquely reconstruct the map. Besides the obvious advantage of memory efficiency, a topological representation may be preferred over geometric maps (as argued for by Warren et al. 2017 in humans) for a different reason: It is more robust to inaccurate or corrupted measurements and therefore a more reliable representation of the coarse structure of the environment, which can then be combined with other mechanisms for local goal finding. Many outdoor navigation approaches in RNav construct topological maps of the environment (e.g. Shah et al., 2023; Shah and Levine, 2022; Engel et al., 2014b). The gradual transition between the map types described above is illustrated in Fig. 1 A.

All of the above representations establish a relation between multiple salient locations in the world, including the agent’s own position, and therefore represent knowledge about where the agent is.
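The gradual loss of geometric content from grid to vector to topological map can be made concrete with a small sketch. The grid layout, the cell size, and the reachability threshold `max_range` below are hypothetical values chosen only to illustrate the successive abstractions, not parameters of any model discussed here.

```python
import math

# Hypothetical 5x5 occupancy grid (1 = salient/occupied cell); illustrative only.
grid = [
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
CELL_SIZE = 2.0  # assumed metres per grid cell


def grid_to_vector_map(grid, cell_size):
    """Keep only occupied cells, stored as Cartesian coordinates (x, y)."""
    return [
        (col * cell_size, row * cell_size)
        for row, cells in enumerate(grid)
        for col, occupied in enumerate(cells)
        if occupied
    ]


def vector_map_to_weighted_graph(nodes):
    """Replace relative vectors by scalar Euclidean distances between nodes."""
    return {
        (i, j): math.dist(nodes[i], nodes[j])
        for i in range(len(nodes))
        for j in range(i + 1, len(nodes))
    }


def weighted_graph_to_topology(edges, max_range):
    """Reduce edge weights to binary reachability within an assumed range."""
    return {pair: dist <= max_range for pair, dist in edges.items()}


nodes = grid_to_vector_map(grid, CELL_SIZE)
edges = vector_map_to_weighted_graph(nodes)
topology = weighted_graph_to_topology(edges, max_range=6.0)
```

Each step discards information that cannot be recovered from its output: the weighted graph no longer fixes the directions between nodes, and the binary topology no longer fixes their distances.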

Directional Cues Relative to Salient Location(s).

On the other hand, one could imagine a spatial representation of sensory input that encodes a relation between the agent and salient locations, without knowledge about how these relate to each other. For example, the insect navigation base model proposes the use of view memories, which are not attached to any specific location, as discussed in more detail in Sec. 4. One can interpret visual similarity as a proxy for the distance to the stored view and the similarity gradient as a directional cue (Zeil et al., 2003) towards the location of the snapshot. More recent models based on visual familiarity (Baddeley et al., 2012; Ardin et al., 2016) allow visual homing based on stored view memories regardless of the temporal sequence or locations of the stored views. Wystrach (2023) proposes a visual steering model that categorizes current views into left/right facing with regard to a specific location. These models demonstrate that spatial representations that tell the agent where to go, rather than where it is, are sufficient to support surprisingly complex navigation behavior.
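The idea of using visual similarity as a proxy for distance, and its gradient as a directional cue, can be sketched in a toy world. Everything below is an assumption made for illustration: the landmark positions, the rendering of a compass-aligned one-dimensional panorama, and the greedy eight-direction descent rule. It is not the procedure used by Zeil et al. (2003) or by the familiarity-based models.

```python
import numpy as np

# Hypothetical landmark positions (x, y); all values are illustrative assumptions.
LANDMARKS = np.array([[0.0, 10.0], [10.0, 0.0], [-8.0, -6.0]])
AZIMUTHS = np.linspace(-np.pi, np.pi, 72, endpoint=False)  # 5 degree "pixels"


def panoramic_view(position, sigma=0.3):
    """Render a toy compass-aligned 1D panorama seen from `position`."""
    offsets = LANDMARKS - np.asarray(position, dtype=float)
    bearings = np.arctan2(offsets[:, 1], offsets[:, 0])
    distances = np.linalg.norm(offsets, axis=1)
    # Wrapped angular difference between every pixel and every landmark bearing.
    delta = np.angle(np.exp(1j * (AZIMUTHS[:, None] - bearings[None, :])))
    # Nearer landmarks appear brighter (assumed rendering rule).
    blobs = np.exp(-0.5 * (delta / sigma) ** 2) / (1.0 + distances)
    return blobs.sum(axis=1)


def image_difference(view_a, view_b):
    """Root-mean-square pixel difference between two views."""
    return float(np.sqrt(np.mean((view_a - view_b) ** 2)))


def home_by_descent(start, snapshot, step=0.5, max_steps=200):
    """Greedy descent: test 8 headings, move only if the difference drops."""
    headings = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
    moves = step * np.stack([np.cos(headings), np.sin(headings)], axis=1)
    position = np.asarray(start, dtype=float)
    path = [position.copy()]
    for _ in range(max_steps):
        current = image_difference(panoramic_view(position), snapshot)
        candidates = position + moves
        diffs = [image_difference(panoramic_view(c), snapshot) for c in candidates]
        best = int(np.argmin(diffs))
        if diffs[best] >= current:
            break  # no neighbouring view is more similar: stop
        position = candidates[best]
        path.append(position.copy())
    return np.array(path)


nest = (0.0, 0.0)
snapshot = panoramic_view(nest)  # view stored at the goal location
path = home_by_descent(start=(3.0, -2.0), snapshot=snapshot)
```

The agent in this sketch never estimates its own position. It only compares the current view with the stored one, so the image difference decreases along the path by construction, although reaching the snapshot location is not guaranteed in general.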

2.2 How is it Represented? Explicit and Latent Representations of Space: $\phi_{exp}$ and $\phi_{lat}$

We introduce some formal definitions to pose the navigation task as a reinforcement learning problem. Note that while we illustrate the following considerations in the context of RL, they equally apply to other learning paradigms used in the robot navigation literature. RL is usually formalized as a Markov Decision Process (MDP),⁴ which is specified as a 4-tuple $(\mathcal{S}, \mathcal{A}, P, R)$: State space $\mathcal{S}$ and action space $\mathcal{A}$ characterize the agent, while the environment⁵ is specified by the (probabilistic) transition function

$$P(s'|s,a), \quad s, s' \in \mathcal{S},\ a \in \mathcal{A} \tag{1}$$

between states $s$ and $s'$, given action $a$, and a reward function

$$R(s,a), \quad s \in \mathcal{S},\ a \in \mathcal{A}. \tag{2}$$

At each timestep, the agent moves across the state space by choosing an action, which determines the next state according to Eq. (1), and receives rewards according to Eq. (2). The agent’s objective is to learn a (probabilistic) policy

$$\pi(a|s), \quad s \in \mathcal{S},\ a \in \mathcal{A} \tag{3}$$

over actions given the agent’s current state, such that under repeated applications of $\pi$, starting from any state $s$, it maximizes the expected discounted cumulative reward, which we will discuss in more detail in Sec. 5.1. Note that our notation is meant to implicitly include policies over temporal sequences of states, e.g. eligibility traces (Sutton and Barto, 2018).

⁴ Technically, most classical RL navigation problems are formulated as partially observable MDPs (POMDP), which operate on probabilistic belief states about the unobservable true state.

⁵ ‘Environment’ in this context may also include the sensory processing apparatus.
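To make the 4-tuple concrete, the following sketch casts a tiny grid world as an MDP and learns a policy with tabular Q-learning, one of the simple RL algorithms mentioned in the introduction. The grid size, the reward values, the deterministic transitions, and the hyperparameters (`alpha`, `gamma`, `epsilon`) are all assumptions chosen for illustration, and the explicit grid position used as state corresponds to the simplest possible spatial representation.

```python
import random

# Hypothetical 4x4 grid world: states are (row, col) cells, goal in one corner.
SIZE = 4
GOAL = (3, 3)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def transition(state, action):
    """Deterministic special case of P(s'|s,a): move unless blocked by a wall."""
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), SIZE - 1)
    col = min(max(state[1] + dc, 0), SIZE - 1)
    return (row, col)


def reward(state, action):
    """R(s,a): +1 for stepping onto the goal, small step cost otherwise (assumed)."""
    return 1.0 if transition(state, action) == GOAL else -0.01


def q_learning(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0 for r in range(SIZE) for c in range(SIZE) for a in ACTIONS}
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt = transition(state, action)
            r = reward(state, action)
            # The goal is terminal, so no future value is bootstrapped from it.
            future = 0.0 if nxt == GOAL else gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (r + future - q[(state, action)])
            state = nxt
            if state == GOAL:
                break
    return q


def greedy_policy(q):
    """Deterministic policy pi(s): the action with the highest learned value."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for (s, _) in q}


def rollout(policy, start=(0, 0), max_steps=SIZE * SIZE):
    """Follow the policy from `start` and return the visited states."""
    path = [start]
    while path[-1] != GOAL and len(path) <= max_steps:
        path.append(transition(path[-1], policy[path[-1]]))
    return path


q = q_learning()
path = rollout(greedy_policy(q))
```

Here the policy is learned purely from rewards, without supervision about which action is correct in any state, which is the property of RL emphasized above.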

In the context of a navigation task, as outlined in Sec. 2.1, there are now two possible choices for the state space $\mathcal{S}$ of the agent: the conventional approach was to use modular SLAM methods to construct a spatial representation from the sensory input, and then use this explicit representation as the state space of the agent: $\mathcal{S} \equiv \Phi_{exp}$. The agent effectively only solves the planning sub-problem by either planning or learning a policy over a space of spatial representations whose geometric content is pre-determined by the specific SLAM implementation (e.g. Fig. 1 B-D).

Figure 2: Navigation as a Markov Decision Process (MDP). (A) The three sub-problems of navigation: Mapping and localization operations take sensory input from an input space $\mathcal{V}$ to infer a map of the environment and the agent’s pose, in a space $\Phi$ of joint spatial representations. These tasks are usually solved together using simultaneous localization and mapping (SLAM). Based on the spatial representation, the agent plans a sequence of actions from action space $\mathcal{A}$ to achieve the objective of the navigation task. (B) In modular robot navigation, only the planning stage is represented as an MDP: existing SLAM methods are used to construct an explicit spatial representation $\Phi_{exp}$, which serves as the state space $\mathcal{S}$ of the MDP. The learned policy $\pi(\mathcal{A}|\Phi_{exp})$ then ‘plans’ the action. (C) End-to-end RL navigation directly uses the input space $\mathcal{V}$ as the MDP state space $\mathcal{S}$. Localization, mapping, and planning are jointly solved by learning policy $\pi(\mathcal{A}|\mathcal{V})$. The spatial representation $\Phi_{lat}$ is now latent in the hidden layers of the learnable (deep) policy network.

The other possibility is to model the navigation task as an end-to-end RL (or generally end-to-end learning) problem. This more recent approach takes the raw sensory input as the RL state space: $\mathcal{S} \equiv \mathcal{V}$. The policy learned in this case represents a joint implementation of all the sub-problems of the navigation task. In order to achieve this, a sufficiently expressive network architecture needs to be chosen for learning $\pi$. For example, Deep Reinforcement Learning (DRL) leverages deep neural networks as trainable function approximators (see Zhu and Zhang 2021; Zeng et al. 2020 for reviews of DRL for navigation tasks). Crucially, successful end-to-end learning of a navigation task implies the existence of an implicit, or latent, spatial representation $\Phi_{lat} \not\equiv \mathcal{S}$ within the network architecture for $\pi$ (Fig. 1 E-H). Notably, this representation is not (fully) determined prior to training but learned dynamically, ideally converging towards a representation that most efficiently encodes the spatial features relevant for the navigation task. However, the space of possible representations is constrained by the network architecture, which allows imposing certain structural characteristics.

From a biological perspective, the distinction between latent and explicit spatial representation is largely a question of plasticity, recurrent connections, and time scales: In order for a latent - i.e. learned - representation to emerge, sufficient synaptic plasticity is required in the neuronal populations that encode it. Modulation of the synaptic connections based on a training error would require some recurrent connectivity to convey that signal (see Sec. 5). Spatial representations in biological agents likely contain both explicit and latent components. For example, visual processing in the insect optic lobes shows relatively little experience-dependent plasticity, but the mushroom body, which receives high-level sensory input from the visual and other sensory systems, is known as the locus of much insect learning. From a more conceptual point of view, one could argue that the entire neuronal circuit is plastic on an evolutionary timescale. The full spatial representation can then be interpreted as a latent representation shaped by ecological constraints over many generations (see Sec. 4.6). While modeling explicit neuronal implementations of learning mechanisms would be nonsensical in this case, the approach may well serve as a normative model for a unified spatial representation in navigating insects.

3 Spatial Representations in Robot Navigation

Having discussed what is represented, here we discuss how space is represented in robots.

Explicit Spatial Representations: Variations of SLAM.

For completeness, we will very briefly discuss traditional robot navigation models that construct explicit spatial representations using some variation of SLAM. Fuentes-Pacheco et al. (2015) provide a concise review of popular visual SLAM approaches, which operate on different modalities (monocular, stereo, multi-camera or RGB-D vision) and use different probabilistic (Extended Kalman Filter (EKF), Maximum Likelihood (ML) or Expectation Maximization (EM)) or purely geometric (‘Structure from Motion’) approaches to maintain a joint representation of the agent’s pose within a map of the environment. These map representations come in any of the flavors discussed above. Egocentric occupancy grid maps (Gutmann et al., 2008; Xiao et al., 2022; Schmalstieg et al., 2022) are common for dense indoor environments where obstacle avoidance is paramount.

Vector maps (e.g. Klein and Murray, 2007) only encode the relative locations of salient features (landmarks) which are tracked in the (retinotopic) camera view across time. Typically, these methods (Mur-Artal and Tardós, 2017) use a loop of (visual) odometry based on these landmarks⁶ and landmark prediction based on self-motion estimates to maintain a joint probabilistic estimate of robot and landmark positions. Long-baseline feature matching allows for drift correction by loop-closure - the recognition of previously visited locations. In contrast to the vector memory in the INBM discussed below, this concept of landmark based maps goes beyond our earlier conceptual definition of vector maps: not only are the physical landmark locations stored in the vector map, they are also linked to their retinotopic locations in the camera frame.

⁶ The landmarks encoded in the map and the ones used for odometry need not be identical.

Topological SLAM methods (Konolige et al., 2009; Engel et al., 2014b; Greve et al., 2023; Vödisch et al., 2022) are particularly useful for mapping larger areas: the world is represented as a graph in which nodes are key-frames (‘sensor snapshots’) representing the camera pose. Nodes are connected by edges which represent the relationship between poses (pose-pose constraints) obtained from odometry or loop-closure. Global optimization (e.g. Pose Graph Optimization (PGO)) ensures convergence of the topological map. Nevertheless, Fuentes-Pacheco et al. (2015) state that due to ‘the lack of metric information, […] it is impossible to use the map for the purpose of guiding a robot’, a limitation which has been overcome by using latent topological representations as in Shah et al. (2023), discussed below.

Latent Representations.

Recently, the attention of RNav research has shifted towards end-to-end learning approaches. While these offer the possibility of abstract spatial representations, some implementations choose architectures that constrain the spatial representation to known templates. In the following, we will map out (see Fig. 1 F-H) the space of possible latent representations along a (non-exhaustive) selection of instructive examples:

The Cognitive Mapper and Planner (CMP) model (Gupta et al., 2019) uses a fully differentiable encoder-decoder architecture to create a grid map of the environment. Instead of occupancy (or pre-defined semantic) values, however, ‘The model learns to store inside the map whatever information is most useful for generating successful plans’, making the map a latent representation.⁷ Earlier, the same authors (Gupta et al., 2017) suggested a latent representation that combines grid-based with vector (landmark) based maps by synthesizing a global allocentric grid-map from multiple local egocentric grid maps at salient locations. Learning a map from egocentric observations can be viewed as storing encoded egocentric views in a map-like memory. Explicit memory-based models like MapNet (Henriques and Vedaldi, 2018) use a Long Short-Term Memory (LSTM) type Recurrent Neural Network (RNN) with convolutional layers to encode and continually update a grid-map-like state vector from egocentric observations.

⁷ This is in contrast to e.g. neural SLAM (Chaplot et al., 2020), which is explicitly trained to produce an occupancy map against a ground truth.

The RECON (Rapid Exploration Controllers for Outcome-driven Navigation) model by Shah et al. (2023) uses a network architecture whose latent representations capture the topology of a large-scale environment. The map is represented as a graph with egocentric views (‘goal images’) at specific locations as nodes, which are determined by a goal-directed exploration algorithm.⁸ The model employs a variation of the information bottleneck architecture (Alemi et al., 2019): an encoder-decoder pair, conditioned on the current egocentric view, learns to compress the goal image into a latent representation (conditional encoder), which is predictive of both the (temporal) distance to the goal and the best action to reach it (conditional decoder). The encoder and decoder are trained together in a self-supervised fashion to learn the optimal (most predictive) latent representation, with the actual time to reach the goal as ground truth. Crucially, the resulting conditional latent representation now encodes the relative distance to the nodes, and thus the topology of the environment. In contrast to topological SLAM models, goal-directed actions are learned alongside the topology, enabling successful robot navigation.

⁸ Akin to the ‘key-frames’ in pose graphs (Engel et al., 2014b).

Memory-based approaches like MapNet are based on the insight that all spatial information is inherently contained in the history of previous observations. The Scene Memory Transformer (SMT) architecture (Fang et al., 2019) exploits this fact to learn an abstract representation free of inductive biases about the memory structure (like a grid of fixed dimensions in MapNet). Instead of updating an RNN state vector with each observation, an efficient embedding of every observation is stored in an unstructured scene memory. This serves as the state space for an attention-based policy network based on the Transformer architecture (Vaswani et al., 2017), which enables the model to transform the embedding of each memory item according to a specific context. In a nutshell, the transformer blocks are used to ‘[…] first encode the memory by transforming each memory element in the context of all other elements. This step has the potential to capture the spatio-temporal dependencies in the environment.’ Thus, the encoded scene memory contains a completely abstract latent spatial representation without any preimposed structure. A second attention block is then used to decode the current observation in the context of the transformed (encoded) scene memory into a distribution over actions. The lack of prior assumptions about spatial representation makes this model very versatile and allows applications in a variety of navigation domains. Wani et al. (2020) compare models using map-based and map-less spatial representations on a multi-object navigation task.
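The two attention steps described for the SMT can be illustrated with a minimal sketch in plain Python. It is not the architecture of Fang et al. (2019): it assumes unparameterized scaled dot-product attention in which keys and values are the memory items themselves, and it omits learned projections, multiple heads, feed-forward layers, and the final mapping to an action distribution. The names `attend`, `encode_memory`, and `decode_observation` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, memory):
    """Scaled dot-product attention with keys = values = memory (assumption)."""
    d = len(memory[0])
    out = []
    for q in queries:
        scores = [sum(qi * mi for qi, mi in zip(q, m)) / math.sqrt(d) for m in memory]
        weights = softmax(scores)
        out.append([sum(w * m[j] for w, m in zip(weights, memory)) for j in range(d)])
    return out

def encode_memory(memory):
    """Transform each memory element in the context of all other elements."""
    return attend(memory, memory)

def decode_observation(observation, encoded_memory):
    """Summarize the encoded memory in the context of the current observation."""
    return attend([observation], encoded_memory)[0]

# Toy scene memory of two 2-d observation embeddings (illustrative values).
scene_memory = [[1.0, 0.0], [0.0, 1.0]]
context = decode_observation([1.0, 0.0], encode_memory(scene_memory))
```

Because the memory is an unordered list, nothing in this computation presupposes a grid or any other spatial layout, which is the property the text emphasizes.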

4 Spatial Representations of the Insect Navigation Base Model

In this section, we discuss how space is represented in insects. We describe constituent components of the proposed insect navigation base model INBM (Webb, 2019) and analyze inherent spatial representations in the light of the previous discussion for artificial agents. Current INav research has identified three main mechanisms as the minimal set of assumptions that may be sufficient to explain observed navigation behavior.

4.1 Path Integration

Central place foraging insects are able to maintain a reasonably accurate estimate of their position with respect to a central nest location as a vector-like representation, known as the path integration (PI) home vector. Stone et al. (2017) propose an anatomically constrained model for path integration in the central complex (CX) region of the bee brain: a self-stabilizing representation of the current heading direction is maintained in the ring-attractor architecture of the protocerebral bridge (PB). Neuronal activity of EPG neurons in eight subpopulations of the PB encodes heading direction relative to the sky compass, projected onto eight axes shifted by $360/8=45^{\circ}$, leading to a periodic, sinusoidal activity pattern (footnote 9: This can be interpreted as a redundant, and thus more robust, generalization of simple Cartesian encoding along two axes). Another population of CX neurons (CPU4) accumulates a speed signal derived from optic flow, modulated by the current heading direction signal from the PB neurons. As a result, the PI home vector is again (redundantly) encoded by its projection along eight axes. This representation essentially amounts to a (discrete) phasor representation of the home vector, with the amplitude and phase of the periodic signal representing its length and angle, respectively. The home vector can then be used to drive the animal back towards the nest. In the context of this work, we want to stress two important aspects of PI in flying navigators: First, it must rely to a large extent on vision alone, since proprioceptive modalities used by walking insects like desert ants are highly unreliable due to wind drift and other atmospheric parameters. We can therefore interpret the PI home vector as a purely visual representation of space (footnote 10: The role of vestibular-like inertial sensory information in PI is largely unknown).
Secondly, for the same reason, basing PI on heading direction is an oversimplification since heading and traveling direction will often differ. Lyu et al. (2022) proposed a circuit model of the fly (Drosophila melanogaster) CX, demonstrating how a representation of the allocentric traveling direction can be computed from heading direction and egocentric directional optic flow cues by a phasor-based neural implementation of vector addition. However, this important implementation detail does not invalidate the general PI mechanism outlined above.
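The eight-axis phasor code can be made concrete with a small sketch. It is an idealized illustration, not the circuit model of Stone et al. (2017): activities are unbounded signed numbers rather than firing rates, heading is assumed to equal travel direction, and the accumulated code represents the displacement from the nest (the home vector is its negation). The function names are hypothetical.

```python
import math

N_AXES = 8
# Preferred directions of the eight compartments, 45 degrees apart.
AXES = [2 * math.pi * i / N_AXES for i in range(N_AXES)]

def heading_code(theta):
    """Sinusoidal activity pattern across compartments for heading theta (radians)."""
    return [math.cos(theta - a) for a in AXES]

def integrate_path(steps):
    """Accumulate speed-weighted heading activity; steps = [(heading, speed), ...]."""
    memory = [0.0] * N_AXES
    for theta, speed in steps:
        for i, h in enumerate(heading_code(theta)):
            memory[i] += speed * h
    return memory

def decode_vector(memory):
    """Read out amplitude (length) and phase (angle) of the sinusoidal pattern."""
    x = 2 / N_AXES * sum(m * math.cos(a) for m, a in zip(memory, AXES))
    y = 2 / N_AXES * sum(m * math.sin(a) for m, a in zip(memory, AXES))
    return math.hypot(x, y), math.atan2(y, x)

# Outbound path: 3 units along 0 degrees, then 4 units along 90 degrees.
outbound = [(0.0, 3.0), (math.pi / 2, 4.0)]
length, angle = decode_vector(integrate_path(outbound))
print(round(length, 6), round(math.degrees(angle), 2))
```

For this path the decoded length should be close to 5 and the angle close to 53 degrees, i.e. the same result as ordinary Cartesian vector addition, but carried redundantly by eight projections.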

4.2 Vector Memory

Efficient navigation entails more than returning to the nest: Foragers need to be able to reliably revisit known food sources. The INBM posits that whenever insects visit a salient location, they store the current state of the home vector in a vector memory. Le Moël et al. (2019) suggest a mechanism where an individual vector memory is stored in the synaptic weights of a memory neuron, which forms tangential inhibitory synapses onto all directional compartments of the CPU4 population. Activation of this neuron when the home vector is zero would leave a negative imprint of the memorized PI vector (i.e. the vector from the nest to the remembered location) in CPU4 activity, driving the animal to recover a CPU4 activity corresponding to a zero vector (which is now the case at the remembered location, where vector memory and home vector are equal). If the initial home vector is not zero, this mechanism for vector addition effectively computes the direct shortcut to the remembered location, an ability frequently cited as strong evidence for the existence of a cognitive map (footnote 11: Another implementation of a vector memory may be sustained neuronal ‘phasor type’ activity of (unknown) cell types in the CBU, performing vector arithmetic as in Lyu et al., 2022). The neuronal mechanisms for storing, retrieving, and choosing between multiple vector memories remain speculative. For the former, the authors suggest dopaminergic synaptic modulation directly at the CPU4 dendrites, providing direct reinforcement from extrinsic rewards (food).

This combination of accurate path integration and vector memories constitutes a vector map, i.e. a geometrically accurate (within limits of PI accuracy) representation of the world, which the insect can access for navigation, as long as the PI vector is not corrupted or manipulated. Unlike landmark-based maps in vSLAM, the vector locations are not associated with any visual landmarks or features.
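The shortcut computation can be sketched as compartment-wise arithmetic on the eight-axis code. This is a simplified illustration under stated assumptions, not the synaptic model of Le Moël et al. (2019): both the current PI state and the stored memory are taken to encode positions relative to the nest, activities are signed, and the names `encode`, `decode`, and `steering_vector` are hypothetical.

```python
import math

N_AXES = 8
AXES = [2 * math.pi * i / N_AXES for i in range(N_AXES)]

def encode(x, y):
    """Project a 2-d vector onto eight axes spaced 45 degrees apart."""
    return [x * math.cos(a) + y * math.sin(a) for a in AXES]

def decode(code):
    """Recover the Cartesian vector from its eight projections."""
    x = 2 / N_AXES * sum(c * math.cos(a) for c, a in zip(code, AXES))
    y = 2 / N_AXES * sum(c * math.sin(a) for c, a in zip(code, AXES))
    return x, y

def steering_vector(current_pi, vector_memory):
    """Subtract the current PI state from the stored memory, compartment by compartment.

    The result is zero exactly when the animal is at the remembered location.
    """
    return [m - c for m, c in zip(vector_memory, current_pi)]

# Vector memory stored at a food site; the animal is currently elsewhere.
food_memory = encode(4.0, 5.0)
current_state = encode(1.0, 1.0)
shortcut = decode(steering_vector(current_state, food_memory))
```

Here `shortcut` is approximately (3, 4), the direct vector from the current position to the food site, even though that particular path was never traveled.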

4.3 View Memory

A large body of INav research has been concerned with the ability to return to the nest when an accurate PI vector is not accessible to the animal, making it reliant on visual homing and route-following mechanisms. Originating from the snapshot model (Cartwright and Collett, 1983), which matched the retinal positions of landmarks between the current view and stored snapshots, more recent models suggest retinotopic representations of a low-resolution panoramic view, with only elementary processing like edge filters, and without the need for explicit landmark recognition. Zeil et al. (2003) showed that similarity gradients based on pixel-by-pixel intensity differences are sufficient for successful visual homing. Combining multiple view memories along frequently traveled routes allows for complex route following toward the nest. Webb (2019) emphasizes that no information about the location or temporal sequence of the stored views is necessary: Baddeley et al. (2012) proposed a computational familiarity model, which encodes the entire view memory in an InfoMax (Lee et al., 1999; Lulham et al., 2011) neural network architecture. From a scan of the environment, the agent can then infer the most familiar viewing direction over all stored memories. If the view memories are acquired during inbound routes (i.e. linked to a homing motivational state), this will guide the agent towards the nest. Ardin et al. (2016) proposed a biological implementation of a familiarity model based on the insect mushroom body (MB), a learning-associated region of the insect brain discussed in more detail in Sec. 5.
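The scanning step can be illustrated with a toy familiarity measure. The sketch below uses a plain mean squared pixel difference over an explicit list of stored views, in the spirit of the pixel-by-pixel comparison mentioned above; it is not the InfoMax network of Baddeley et al. (2012), which stores no individual views. Views are assumed to be one-dimensional panoramic intensity rings, a rotation of the agent is modeled as a circular shift, and all names and values are illustrative.

```python
def rotate(view, shift):
    """Circularly shift a panoramic view, mimicking a rotation on the spot."""
    shift %= len(view)
    return view[shift:] + view[:shift]

def image_difference(view_a, view_b):
    """Mean squared pixel-by-pixel intensity difference (lower = more familiar)."""
    return sum((a - b) ** 2 for a, b in zip(view_a, view_b)) / len(view_a)

def most_familiar_direction(current_view, memories):
    """Scan all rotations and return (shift, difference) of the best match over all memories."""
    best_shift, best_diff = 0, float("inf")
    for shift in range(len(current_view)):
        rotated = rotate(current_view, shift)
        diff = min(image_difference(rotated, m) for m in memories)
        if diff < best_diff:
            best_shift, best_diff = shift, diff
    return best_shift, best_diff

# Two stored views; the agent sees the first one, but rotated by three pixels.
stored_views = [[0, 1, 3, 7, 2, 0, 5, 1], [5, 5, 5, 5, 5, 5, 5, 5]]
current = rotate(stored_views[0], 3)
shift, diff = most_familiar_direction(current, stored_views)
```

Note that the function never uses where or in which order the views were stored, matching the point that neither location nor temporal sequence is needed for this kind of guidance.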

4.4 Discussion: Spatial Representation in the Insect Navigation Base Model

As presented thus far, the base model entails two independent spatial representations: A vector map, which is not linked to specific egocentric views, and directional cues based on view memories, which are not linked to any geometric information from the vector memory. According to our previous classification, the vector map is an explicit spatial representation: It is evolutionarily pre-determined by the path integration circuitry, just like classical SLAM architecture by the pre-defined inference algorithm. The construction of the vector representation differs from SLAM methods in that exact localization is assumed, based on PI, and the map is constructed based on that ground truth, obviating the need to maintain correspondences between retinotopic and geometric locations of features and landmarks. On the other hand, a crude latent spatial representation in terms of directional cues is implicit in the view memory.

For example, the visual features used by the InfoMax architecture for familiarity discrimination are latent in the learned network weights. Fig. 1 E illustrates how the INBM aligns with our classification of spatial representations used for robot navigation. As mentioned, the base model explicitly does not link view memories to vector memories.

4.5 Beyond the Base Model

How these two distinct representations are linked is an active research question, covering two major aspects: First, how do insects balance conflicting information from the two systems? Sun et al. (2020) proposed a unified model inspired by joint MB/CX neuroanatomy, combining PI, visual homing, and visual route following. The model balances off-route (PI and visual homing) with on-route (visual route following) steering outputs based on visual novelty and uncertainty of the PI signal. Conceptually more interesting is a second aspect: is the view memory truly independent of the geometry of the vector memory? This is closely related to a question not thoroughly studied in the work cited above: When and where does an animal form a view memory? Most models just assume that views are stored regularly along a homeward-bound route. Ardin et al. (2016) suggest that ‘the home reinforcement signal could […] be generated by decreases in home vector length’. Note that this already associates the stored views with a specific node in the vector map. Wystrach (2023) recently proposed a neuroanatomically constrained model for visual homing which obviates the need for storing individual view memories: during learning, views are continuously associated with facing left or right with respect to the nest, using the difference between PI vector and current heading direction for reinforcement. We will discuss this model in detail in Sec. 5. The spatial representation of egocentric views is now decidedly conditioned on a specific vector. One could easily imagine an extension of this model by vector memories, enabling learnable visual guidance along arbitrary vectors encoded in the vector map, essentially using each vector as a motivational state. Note that such a mechanism would be different from ‘reloading’ a PI state from the view memory, although the expected behavior is similar: Insects would be able to recover previously known ‘shortcuts’ based on visual guidance alone. 
The joint spatial representation would be a topological latent representation similar to Shah et al. (2023), see the discussion in Sec. 7.

4.6 An Evolutionary Perspective: Insect Inspired RL as a Normative Model

Conceptually, such a unified spatial representation could itself be viewed as a single, latent embedding of visual input, learned over evolutionary time to best adapt to ecological constraints, i.e. reap the largest long-term reward from the environment. The dichotomy between static, explicit and plastic, latent components of the representation would then be relaxed to a continuum of plasticity for different model components, realized via differential learning rates. We propose to design an end-to-end RL-learnable navigation model constrained by the insect navigation base model, in the sense that the resulting spatial representation is compatible with its basic assumptions. This will be instructive for the fields of both insect and robot navigation: For the former, it can serve as a normative model for a possible unified spatial representation that goes beyond the base model, providing theoretical guidance for how vector and view-based representations may interact to support efficient navigation.

On the level of sensory processing, having the network learn representations that match, e.g. PI based vector maps, may yield valuable insights into which visual features are useful in intermediate processing steps to reliably support such computations in a variety of visual conditions and environments. These model predictions would yield testable hypotheses for further neuroanatomical, physiological, and behavioral experiments. (See the discussion in Sec. 7.) Given the superior performance of insect navigators in terms of training efficiency, robustness, and generalization capability, robot navigation may profit from this biologically inspired and constrained spatial representation. Pretraining such a network extensively under varying conditions and then freezing the slow components may yield a highly robust, adaptive spatial representation for applications similar to natural insect task spaces, i.e. visual outdoor navigation for ground or aerial autonomous agents. Implementing network architectures that support phasor representations may be a useful avenue for robotic navigation research.

5 Reinforcement Learning with an Insect Brain

Given the success of reinforcement learning as a framework for robot navigation, it seems reasonable to ask if and how navigation could be implemented based on actual RL-type computations in the insect brain. Furthermore, extensive literature involving dopamine, learning, and reward prediction errors exists in the mammalian neuroscience community, but despite these topics being relevant in insect learning and navigation, the discussion of potential connections is limited. To explore this line of thought, we will first continue our formal treatment of RL (Sec. 5.1). In Sec. 5.2 we will discuss how current computational models of the MB, the prominent learning-associated region of the insect brain, could be augmented to support simple RL algorithms. This will allow us to discuss the recent MB/CX based visual homing model by Wystrach (2023) in the context of RL and extrapolate it to roughly outline a neuroanatomically inspired end-to-end RL model for insect navigation in Sec. 6.

5.1 RL Formalism

Starting from the definitions from Sec. 2.2, RL methods seek the optimal policy, Eq. (3), which maximizes the expectation of the temporally discounted sum of instantaneous rewards, Eq. (2), over time:

\begin{align}
V^{\pi}(s) &= \mathbb{E}_{\pi}\left[\sum_{k=0}^{T}\gamma^{k}R_{t+k+1}\,\middle|\,S_{t}=s\right] \tag{4}\\
&= \mathbb{E}_{\pi}\left[R_{t+1}+\gamma V^{\pi}(S_{t+1})\,\middle|\,S_{t}=s\right] \tag{5}\\
&= \sum_{a}\pi(a|s)\left[R(s,a)+\gamma\sum_{s'}p(s'|s,a)\,V^{\pi}(s')\right] \tag{6}
\end{align}

for all initial states $s$. This function is therefore called the value function of $s$ under policy $\pi$, with a temporal discount factor $\gamma\in[0,1]$. Eq. (5) and Eq. (6), versions of the Bellman equation, illustrate the recursive nature of the value function: it can be decomposed into the average immediate reward from the current state $s$ under policy $\pi$, plus the discounted value of the subsequent state, averaged over all possible successor states $s'$. Maximization of $V$ with respect to $\pi$ can now be understood intuitively: For a single time step, the optimal policy $\pi^{*}$ for the recursion Eq. (6) would be simply choosing the action which maximizes the term in square brackets. Iterating through the recursion then leads to the Bellman optimality equation for the optimal state value function

\begin{align}
V^{*}(s) &= \max_{a}\left[R(s,a)+\gamma\sum_{s'}p(s'|s,a)\,V^{*}(s')\right] \tag{7}\\
&= \max_{a}\mathbb{E}\left[R_{t+1}+\gamma V^{*}(S_{t+1})\,\middle|\,S_{t}=s,A_{t}=a\right] \tag{8}
\end{align}
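The fixed-point character of the Bellman optimality equation, Eq. (7), can be illustrated with value iteration, a standard dynamic-programming method that is not part of the discussion above and that, unlike the TD methods introduced later, assumes the transition function and rewards are known. The three-state corridor MDP and all names below are illustrative assumptions.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """Repeatedly apply the right-hand side of Eq. (7) until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:  # terminal state: no actions, value stays 0
                continue
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

def greedy_policy(P, R, V, gamma=0.9):
    """Choose, in each state, the action maximizing the bracket in Eq. (7)."""
    return {
        s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
        for s in P if P[s]
    }

# Toy corridor: states 0 and 1, terminal goal state 2; P[s][a] = [(probability, next_state)].
P = {
    0: {"left": [(1.0, 0)], "right": [(1.0, 1)]},
    1: {"left": [(1.0, 0)], "right": [(1.0, 2)]},
    2: {},
}
R = {
    0: {"left": 0.0, "right": 0.0},
    1: {"left": 0.0, "right": 1.0},
    2: {},
}
V_star = value_iteration(P, R)
pi_star = greedy_policy(P, R, V_star)
```

In this toy problem the value of each state is the discounted reward along the shortest path to the goal, and the greedy policy moves right in both non-terminal states.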

Another way to interpret Eq. (6) would be as a policy average over a state-action value function $Q^{\pi}(s,a)$:

\begin{equation}
V^{\pi}(s)\equiv\sum_{a}\pi(a|s)\,Q^{\pi}(s,a). \tag{9}
\end{equation}

By the same logic, recursive (optimality) relations can be derived for $Q$:

\begin{align}
Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\left[R_{t+1}+\gamma Q^{\pi}(S_{t+1},A_{t+1})\,\middle|\,S_{t}=s,A_{t}=a\right] \tag{10}\\
Q^{*}(s,a) &= \mathbb{E}\left[R_{t+1}+\gamma\max_{a'}Q^{*}(S_{t+1},a')\,\middle|\,S_{t}=s,A_{t}=a\right] \tag{11}
\end{align}

Value-based, Policy-based, and Actor-Critic Methods.

One way to find the optimal policy $\pi^{*}$ is by trying to solve for it directly. This is most commonly done using policy gradient methods, which parameterize the policy $\pi(a|s;\theta)$ and then perform gradient ascent on a suitable performance metric, like the average reward per timestep: $\Delta\theta\propto\nabla_{\theta}\mathbb{E}[R_{t}]$. We will not dwell on pure policy-based methods further; for more detail see Sutton et al. (1999); Williams (1992). Alternatively, one can estimate the value function $V(s)$ or $Q(s,a)$ and infer an optimal policy indirectly. For example, assuming an optimal $Q^{*}$ is found, the optimal policy is simply

\begin{equation}
\pi^{*}=\text{argmax}_{a}\,Q^{*}(s,a) \tag{12}
\end{equation}

These value-based approaches have the strongest connection to insect neuroscience and will therefore feature prominently in the rest of this paper.
Last, so-called actor-critic methods constitute a hybrid approach, involving both a policy (actor) and value (critic) estimation. The policy gradient is then computed to maximize the advantage of an action derived from the policy over the estimated baseline value. Actor-critic methods play an important part in robot navigation.

Temporal Difference (TD) Methods: SARSA and Q-learning

TD methods approach the problem of value estimation by deriving single-timestep update rules from the recursive relations Eq. (10), Eq. (11) for $Q$ or Eq. (5), Eq. (8) for $V$: for each timestep, the squared difference between the LHS and RHS is treated as a prediction error to be minimized. For Eq. (10), this leads to the update rule

\begin{align}
Q^{\pi}(S_{t},A_{t}) &\leftarrow Q^{\pi}(S_{t},A_{t})+\alpha\cdot\Delta_{t}, \tag{13}\\
\Delta_{t} &= \left[R_{t+1}+\gamma Q^{\pi}(S_{t+1},A_{t+1})-Q^{\pi}(S_{t},A_{t})\right] \notag
\end{align}

with learning rate $\alpha\in[0,1]$. This update rule is known as SARSA due to the tuple $(S_{t},A_{t},R_{t+1},S_{t+1},A_{t+1})$ required to compute the update. In order to accommodate exploration, the agent follows a non-deterministic policy based on the current estimate of $Q$, making SARSA an on-policy algorithm. The most common choices are $\epsilon$-greedy (choose the $\text{argmax}_{a}(Q)$ with probability $1-\epsilon$, random action with probability $\epsilon$) and softmax policies (choose an action from a Boltzmann distribution based on $Q$, with inverse temperature $\beta$). These policies converge to the optimal policy $\pi^{*}$, if the stochasticity is reduced systematically towards a deterministic (greedy) policy ($\epsilon\rightarrow 0$ or $\beta\rightarrow\infty$) during learning (see Sutton and Barto, 2018).
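A tabular version of the SARSA update, Eq. (13), with an $\epsilon$-greedy policy can be sketched as follows. The five-state corridor task, the linear $\epsilon$ schedule (decaying to zero after half of the episodes), the step cap per episode, and all parameter values are illustrative assumptions rather than settings taken from any cited work.

```python
import random

N_STATES, GOAL = 5, 4   # corridor with states 0..4; state 4 is terminal
ACTIONS = (0, 1)        # 0 = left, 1 = right

def step(state, action):
    """Deterministic corridor dynamics; reward 1 only on entering the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0)

def epsilon_greedy(Q, state, eps, rng):
    """Random action with probability eps, otherwise argmax_a Q (ties broken randomly)."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])

def sarsa(episodes=400, alpha=0.5, gamma=0.9, seed=0, max_steps=10_000):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for ep in range(episodes):
        eps = max(0.0, 1.0 - 2.0 * ep / episodes)  # systematic reduction of stochasticity
        s = 0
        a = epsilon_greedy(Q, s, eps, rng)
        for _ in range(max_steps):
            s2, r = step(s, a)
            a2 = epsilon_greedy(Q, s2, eps, rng)  # on-policy: next action from the same policy
            Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])  # Eq. (13)
            if s2 == GOAL:
                break
            s, a = s2, a2
    return Q

Q = sarsa()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
```

The update uses the action `a2` that the agent will actually take next, which is what makes the method on-policy; the learned values therefore reflect the exploratory policy until $\epsilon$ has decayed.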

Finally, a TD error without reference to a specific policy can be derived to estimate $Q^{*}$ directly from Eq. (11):

\begin{equation}
\Delta_{t}=\left[R_{t+1}+\gamma\max_{a}Q^{*}(S_{t+1},a)-Q^{*}(S_{t},A_{t})\right] \tag{14}
\end{equation}

This is known as Q-learning. Note that in contrast to Eq. (13), this off-policy update is now independent of the consecutive action $A_{t+1}$ prescribed by the policy. However, the policy still determines which state-action pair receives the update. This effectively decouples the learned policy from the policy employed during learning, which can be chosen to be very exploratory in order to improve convergence. In particular, this also allows for the agent to perform off-line updates, i.e. updates which are not based on the current state transition, but e.g. sampled from a replay buffer of previous experiences. This property was key to the success of the Deep Q-Network (DQN) by Mnih et al. (2015), an early milestone of Deep Reinforcement Learning, which has since found numerous applications in robot navigation.
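The decoupling of behavior and learned policy can be illustrated with tabular Q-learning from a replay buffer. In this sketch, experience is first collected with a purely random behavior policy and the off-line updates of Eq. (14) are then applied to transitions sampled from the buffer. It is a toy tabular example, not the DQN of Mnih et al. (2015); the corridor task, function names, and parameter values are assumptions.

```python
import random

N_STATES, GOAL = 5, 4   # corridor with states 0..4; state 4 is terminal
ACTIONS = (0, 1)        # 0 = left, 1 = right

def step(state, action):
    """Deterministic corridor dynamics; reward 1 only on entering the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0)

def collect_experience(episodes, rng):
    """Fill a replay buffer using a uniformly random (fully exploratory) behavior policy."""
    buffer = []
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = rng.choice(ACTIONS)
            s2, r = step(s, a)
            buffer.append((s, a, r, s2))
            s = s2
    return buffer

def q_learning_from_replay(buffer, updates=20_000, alpha=0.5, gamma=0.9, seed=1):
    """Off-line Q-learning: each update uses a stored transition, not the current one."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(updates):
        s, a, r, s2 = rng.choice(buffer)
        delta = r + gamma * max(Q[s2]) - Q[s][a]  # Eq. (14): max over next actions
        Q[s][a] += alpha * delta
    return Q

replay = collect_experience(50, random.Random(0))
Q = q_learning_from_replay(replay)
```

Although the agent never behaved greedily while collecting data, the learned values of the rightward actions should approach $\gamma^{3-s}$ for $s=0,\dots,3$, i.e. the optimal values for this corridor.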

As we will show, the neural substrate of the insect mushroom body has the potential to support $Q$-based TD computations like SARSA or Q-learning to solve navigational tasks. Conversely, (Deep) Reinforcement Learning as a key toolset for robot navigation provides a useful framework to think about the neural computations underlying insect navigation.

5.2 The Mushroom Bodies as a Neural Substrate for RL

The canonical MB learning model for classical conditioning.

The mushroom bodies are bilateral neuropils in the insect brain, with homologous structures largely conserved across different species, whose crucial role in learning and memory has long been established. Extensively studied in the context of olfactory learning in Drosophila melanogaster (reviewed in Cognigni et al., 2018), inputs from other sensory modalities, in particular vision (e.g. Ehmer and Gronenberg, 2002; Strube-Bloss and Rössler, 2018; Vogt et al., 2014) likely support general behavioral learning tasks, including visual-spatial navigation. The main intrinsic anatomical components of the MB are Kenyon cells (KC), whose dendrites form the calyx, while the axons constitute the lobes of the MB. They receive sensory inputs via projection neurons (PN), which are thought to form non-plastic, sparse and random (Caron et al., 2013) synapses onto the KCs at the calyx. KC activity is transmitted to mushroom body output neurons (MBONs) at the MB lobes. Dopaminergic neurons (DANs), which also target the MB lobes, induce (usually depressive) modulation of the KC-MBON synapses and thus enable an adaptive response to the sensory stimulus. MBON activity is integrated downstream by (pre)motor neurons (MN) to produce an action. Conventionally, the reinforcement signal mediated by the DANs is assumed to encode a direct extrinsic reward. Fig. 3 A shows the ‘canonical’ MB circuitry (without the dashed MBON$\rightarrow$DAN synapses).

Prediction Targets and Reinforcement Signals: RL vs. Associative Learning.

At first glance, all the ingredients for reinforcement learning seem to be there: KC activity defines the state space $\mathcal{S}$, based on which MBON activity encodes some value prediction over an action space $\mathcal{A}$. DAN activity encodes a reward function $R(s,a)$. However, there is a crucial difference in the kind of value prediction which is computed: So far, MB-based learning has been studied in the context of trial-by-trial classical conditioning, a form of associative learning (AL) where the agent is presented with isolated stimulus-reward pairs. This is not a full MDP, since neither future states nor rewards are contingent upon the current state and action, i.e. the transition function $P(s'|s,a)$ is not specified. The prediction target of RL, the long-term cumulative reward, is therefore ill-defined, since it is determined by the experimenter’s choice of stimulus/reward pairs, and not by (only) the agent’s action. Instead, the agent is trained to predict the immediate external reward following an action. This highlights a fundamental difference in the interpretation of rewards in AL vs. reinforcement learning: In the former, it serves as an immediate feedback signal used to evaluate individual actions. The agent does not learn to maximize future rewards, but merely to react to a stimulus according to the associated reward. In the latter, rewards define a long-term objective which the agent learns to achieve by a series of optimal actions.

Furthermore, the models differ in how the value prediction is learned: in the canonical MB model, the DAN reinforcement signal directly encodes the absolute value of the external reward $R$ (direct reinforcement), while for TD-RL methods, $\Delta_{t}$ in Eq. (13) and Eq. (14) encode a prediction error of $Q(s,a)$ (which is a proxy for the prediction target $V(s)$). Crucially, an MDP cannot be learned using direct reinforcement, since there is no directly provided ground truth for the prediction target. A neural mechanism for computing prediction errors is therefore a prerequisite to reconcile RL with MB based computations.
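The difference between direct reinforcement and a prediction-error signal can be made concrete in a single-stimulus conditioning sketch. It is a deliberately abstract delta rule and not a biophysical MB model: KC activity is a fixed sparse binary vector, one MBON reads out a linear prediction, and the weight change is signed rather than purely depressive. All names and values are illustrative assumptions.

```python
def mbon_prediction(weights, kc):
    """Linear MBON readout of KC activity: the current reward prediction."""
    return sum(w * k for w, k in zip(weights, kc))

def train(kc, reward, trials, alpha=0.1, use_rpe=True):
    """Repeatedly pair one stimulus (KC pattern) with the same external reward.

    use_rpe=False: the reinforcement signal is the reward itself (direct reinforcement).
    use_rpe=True:  the reinforcement signal is reward minus prediction (prediction error).
    """
    weights = [0.0] * len(kc)
    n_active = sum(kc)
    for _ in range(trials):
        prediction = mbon_prediction(weights, kc)
        delta = reward - prediction if use_rpe else reward
        for i, k in enumerate(kc):
            weights[i] += alpha * delta * k / n_active  # only active KC synapses change
    return mbon_prediction(weights, kc)

kc_pattern = [1, 0, 0, 1, 0, 1, 0, 0]  # sparse stimulus code
with_rpe = train(kc_pattern, reward=1.0, trials=200)
direct = train(kc_pattern, reward=1.0, trials=200, use_rpe=False)
```

With the prediction error, the MBON output settles at the reward magnitude and learning stops once the reward is predicted; with direct reinforcement, the output keeps growing with every pairing (here to about 20 after 200 trials), so it cannot serve as a calibrated estimate of a quantity that is never presented directly, such as a long-term value.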

Figure 3: (A,B) Anatomical (A) and computational (B) components of the MB model for AL (adapted from Bennett et al. 2021): KCs receive sensory input $v_t$ from the PNs at the MB calyx, encoding a decorrelated and sparse representation of the sensory environment $s_t$. KCs synapse onto distinct sets of approach/avoid MBONs ($M_{\pm}$) driving opposing responses, which are integrated in downstream (pre)motor neurons (MN) to produce an effective action, according to a ‘policy’ $\pi$. MBON activity can also be interpreted as a prediction $\widetilde{R}$ of the external reward $R$, stored in the KC$\rightarrow$MBON synaptic weights. These are depressively modulated by reinforcement signals $\Delta_{\pm}$ from distinct sets of aversive/appetitive ($D_{\mp}$) DANs of opposite valence in the MB lobes. In the canonical model, DANs encode direct (external) positive or negative reinforcement $\Delta = R_{\pm}$. Including recurrent MBON$\rightarrow$DAN connections (dashed) enables DANs to compute an RPE signal to be used for reinforcement: $\Delta = R - \widetilde{R}$. For more detail see Sec. 5.2. The opposing valences of rewards and actions can have other interpretations, see Fig. 4. (C) The insect MB as a neural substrate for RL. When the state-action loop is closed, $s_t$ are valid states of an MDP.
The MB circuit can support TD-RL with the additional assumption that recurrent MBON$\rightarrow$DAN connections carry an activity trace. This allows DANs to ‘compute’ a genuine TD error. MBON activity can then encode the agent’s current estimate of the $Q$-function. Reinterpreting the 2d action space as canonical translation, this model can learn rudimentary navigation.

Prediction Errors in the MB.

Recent computational studies proposed rate-based (Bennett et al., 2021) and spiking (Jürgensen et al., 2024) models of the MB which employ recurrent MBON$\rightarrow$DAN connections to compute a reward prediction error (RPE). They show that the resulting behavior of the agents in a classical conditioning paradigm aligns with experimental evidence. The postulated recurrent connections are supported by anatomical evidence (see references 17-20 in Bennett et al., 2021) and are indicated by the dashed arrows in Fig. 3A. The figure illustrates the simplest iteration of the models investigated by Bennett et al. (2021): agent behavior is the net result of approach and avoid opponent processes within the MB. The two antagonistic behaviors are encoded by the activity of two distinct sets of MBONs ($M_{\pm}$), whose outputs are integrated by downstream descending neurons. KC$\rightarrow$MBON synapses of these distinct sets are targeted by two sets of valence-specific, i.e. appetitive and aversive, DANs ($D_{\pm}$), now encoding prediction errors $\Delta R_{\pm}$ for positive and negative external reward $R_{\pm}$, respectively\footnote{Somewhat confusingly, rewards of opposite valence are often termed ‘reward’ and ‘punishment’ in the AL literature.}. The major innovation of this model lies in the interpretation of $M_{\pm}$ activity as predictions $\widetilde{R}_{\pm}$ of the corresponding reward $R_{\pm}$.
$\Delta R_{\pm}$ are then computed indirectly by recurrent excitatory MBON$\rightarrow$DAN connections of opposite valence, i.e. the prediction of negative reward is added to the direct positive reward, and vice versa: $\Delta R_{\pm} = R_{\pm} + \widetilde{R}_{\mp}$. The difference between $D_{+}$ and $D_{-}$ activity then encodes the full RPE: $\Delta R = \Delta R_{+} - \Delta R_{-} = (R_{+} - \widetilde{R}_{+}) - (R_{-} - \widetilde{R}_{-})$. Since the synaptic modulation by DANs is depressive, reinforcement is achieved by inhibiting MBONs of the opposite valence, i.e. appetitive DANs inhibit aversive MBONs and vice versa. In contrast to the temporal difference errors from SARSA, Eq. (13), and Q-learning, Eq. (14), here the RPE reflects the prediction error of the single-timestep (or ‘timeless’) total external reward of the Rescorla-Wagner type (Rescorla, 1972). We propose that only a single additional assumption can turn this MB circuit into a neural implementation of a TD-RL agent.
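To make the opponent computation concrete, the following minimal Python sketch evaluates the two DAN channels and their difference for scalar rewards and predictions of a single stimulus. The function names, the learning rate, and in particular the per-channel Rescorla-Wagner update are illustrative assumptions and not the depressive plasticity rule of Bennett et al. (2021). The sketch only illustrates that the difference of the two channel signals equals the net error $(R_{+} - \widetilde{R}_{+}) - (R_{-} - \widetilde{R}_{-})$ and that it vanishes once the predictions match the rewards.

```python
# Minimal sketch of the opponent RPE computation (illustrative only).
# All names and the update rule below are assumptions, not the published model.

def dan_signals(r_plus, r_minus, pred_plus, pred_minus):
    """Return (dR_plus, dR_minus): each DAN channel adds the direct reward of
    its own valence to the MBON prediction of the opposite valence."""
    d_plus = r_plus + pred_minus    # Delta R_+ = R_+ + R~_-
    d_minus = r_minus + pred_plus   # Delta R_- = R_- + R~_+
    return d_plus, d_minus


def net_rpe(r_plus, r_minus, pred_plus, pred_minus):
    """Full RPE as the difference between the two DAN channels."""
    d_plus, d_minus = dan_signals(r_plus, r_minus, pred_plus, pred_minus)
    return d_plus - d_minus


def train(r_plus, r_minus, n_trials=200, lr=0.1):
    """Hypothetical per-channel Rescorla-Wagner learning of both predictions
    for a single stimulus; returns the predictions and the RPE on each trial."""
    pred_plus, pred_minus = 0.0, 0.0
    history = []
    for _ in range(n_trials):
        history.append(net_rpe(r_plus, r_minus, pred_plus, pred_minus))
        pred_plus += lr * (r_plus - pred_plus)
        pred_minus += lr * (r_minus - pred_minus)
    return pred_plus, pred_minus, history


if __name__ == "__main__":
    p_plus, p_minus, rpe = train(r_plus=1.0, r_minus=0.2)
    print(f"first RPE {rpe[0]:.3f}, last RPE {rpe[-1]:.3f}")
```

Because the target in this sketch is the immediate reward only, nothing is propagated between successive states, which is the sense in which the error is ‘timeless’.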

Can the MB Support Temporal Difference Learning?

Once we change the experimental paradigm to an interaction task that can be modeled as a full MDP, optimizing cumulative rewards becomes a meaningful objective. The agent’s learning objective is now no longer the immediate reward $R(s,a)$ following an action, but the long-term value of a state-action pair $Q(s,a)$. The recursive TD update rules Eq. (13) and Eq. (14) are prediction errors of the current estimate of $Q(s,a)$, but the prediction target is $R(s,a)$ plus the agent’s (discounted) estimate for $Q(s',a')$ in the next timestep. This temporal link is the core principle allowing TD methods to compute estimates of cumulative rewards over time. As before, we can interpret MBON activity as a prediction, this time for $Q$ (instead of $R$). In order for DANs to compute a TD error, recurrent MBON$\rightarrow$DAN connections would have to mediate a kind of activity trace (instead of the instantaneous MBON activity) which captures these temporal dynamics\footnote{Technically, Eq. (13) looks forward in time, while an activity trace only gives access to previous states. This could be reconciled by simply shifting the time index such that $Q(S_{t-1}, A_{t-1})$ is updated at time $t$. In the neural model, this could be captured by the temporal dynamics of synaptic modulation.}. Different neural implementations of this connectivity could support either SARSA or Q-learning type computations.
In any case, however, they would need to account for the opposite signs of $Q(s,a)$ and $Q(s',a')$ in Eq. (13) and Eq. (14). In this model, the agent’s current estimate of $Q$ is fully captured by the KC$\rightarrow$MBON synaptic weights, learned by synaptic modulation through the DANs. The explicit policy is determined by the downstream integration of MBON output which ultimately generates the action. Fig. 3 illustrates how a full TD model can be obtained by augmenting the canonical classical conditioning model of the MB, and how anatomical and algorithmic components map onto each other. If rewards are given only externally at sparse locations, successful learning in such an RL agent will require training over many episodes.
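As a rough illustration of the difference between the two TD errors, the sketch below stores $Q(s,a)$ as a table of ‘KC$\rightarrow$MBON weights’ over one-hot states of a toy corridor with a single sparse reward at its end. It assumes that Eq. (13) and Eq. (14) correspond to the standard textbook SARSA and Q-learning errors; the corridor task, the parameter values, and all function names are hypothetical and not part of any of the MB models discussed here.

```python
import random

# Illustrative sketch, not the authors' model: Q-values stored as
# "KC->MBON weights" w[a][s] over one-hot KC states of a toy corridor.
N_STATES, LEFT, RIGHT = 5, 0, 1   # state N_STATES - 1 is the rewarded terminal
GAMMA, LR = 0.9, 0.5              # assumed discount factor and learning rate


def step(state, action):
    """Deterministic corridor: move left/right, reward 1 on reaching the end."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1


def sarsa_error(r, q_sa, q_next_sa, gamma=GAMMA):
    """Standard SARSA TD error: bootstraps on the action actually taken next."""
    return r + gamma * q_next_sa - q_sa


def q_learning_error(r, q_sa, q_next_all, gamma=GAMMA):
    """Standard Q-learning TD error: bootstraps on the best next action."""
    return r + gamma * max(q_next_all) - q_sa


def train_q_learning(n_episodes=300, seed=0):
    """Learn w[a][s] = Q(s, a) from random exploration (Q-learning is off-policy)."""
    rng = random.Random(seed)
    w = [[0.0] * N_STATES for _ in (LEFT, RIGHT)]
    for _ in range(n_episodes):
        s, done = 0, False
        while not done:
            a = rng.choice((LEFT, RIGHT))
            s_next, r, done = step(s, a)
            q_next = [0.0, 0.0] if done else [w[LEFT][s_next], w[RIGHT][s_next]]
            delta = q_learning_error(r, w[a][s], q_next)
            w[a][s] += LR * delta   # only the synapse of the active KC changes
            s = s_next
    return w


if __name__ == "__main__":
    weights = train_q_learning()
    for s in range(N_STATES - 1):
        print(s, round(weights[LEFT][s], 3), round(weights[RIGHT][s], 3))
```

In this toy task the reward information reaches the start state only through repeated bootstrapping over the intermediate states, which is why sparse external rewards make learning slow.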

Figure 4: (A) The MB/CX model for vision-based homing (adapted from Wystrach 2023) can be viewed as two competing MB AL agents (Fig. 3 A,B) in the left ($L$) and right ($R$) hemisphere, receiving internal rewards provided by vector computations in the CX (Panel B). Instead of positive/negative external rewards, the set of DANs now encodes whether the nest is located to the left or to the right ($D_{l/r}$) as a direct reinforcement signal. No RPE computation is assumed in this model. This ‘ground truth’ is computed in the CX based on the current compass heading $\kappa$ and a PI home vector (whose compass angle we denote as $\Pi$, such that $\Pi < \kappa$ means that the nest is located to the left). In each hemisphere, competition between ‘steer left’ and ‘steer right’ MBONs ($M_{l/r}$) is integrated ipsilaterally by neurons in the Superior Intermediate Protocerebrum (SIP) to generate a steer left/right command in the left/right hemibrain, respectively. Opposing valence of the steering commands between hemispheres is achieved by inverting excitation/inhibition of the SIP inputs. Note that visual steering is learned ‘off-policy’, i.e. the agent does not use the policy it learns during learning but is instead driven by an ‘off-policy’ metacontroller. (In Wystrach 2023 the agent was simply made to retrace experimental trajectories of learning walks.)

6 Reinforcement Learning for Insect Navigation

6.1 Visual Homing with Vector-based Internal Rewards

Wystrach (2023) proposed an MB/CX-based visual homing model that uses an internal reward signal to alleviate this problem. Fig. 4A illustrates how the model consists of two antagonistic copies (one in each hemisphere) of the ‘canonical’ MB circuit in Fig. 3 which receive internal reward signals computed in the CX: comparing the agent’s current compass heading $\kappa$ and PI home vector (with compass angle $\Pi$) by a mechanism similar to the one described by Lyu et al. (2022), a set of two reward signals $r_{l/r}$ is computed, encoding whether the nest is located to the left/right of the agent’s current heading direction. These provide input for two sets of dopaminergic neurons $D_{l/r}$, assuming the role of external rewards in the canonical model (Fig. 3 A). However, they don’t encode a rewarding experience coming from the environment, like the agent reaching a food source, but relate to an internal state of the agent, the home vector. Since the latter is continually updated, the rewards are no longer sparse, which makes learning considerably more efficient. DANs convey copies of the respective reward signal to MBs in both hemispheres, giving rise to double opponent processes: in each hemisphere, $r_{r/l}$ serve as direct reinforcement to associate representations of the current view (encoded in KC activity) with populations of MBONs, $M_{l/r}$, corresponding to ‘steer left/steer right’ responses respectively.
In either hemisphere, $M_{l/r}$ activity is integrated by downstream neurons, but with inverted signs, leading to competing ‘steer left’ and ‘steer right’ premotor commands from the left/right hemisphere, respectively\footnote{The implementation of the actual steering mechanism is again located in the CX, but we will not discuss this further here. For more detail refer to the original paper Wystrach (2023).}. This double opponent architecture increases performance robustness as there are now two sets of MBONs that independently encode the appropriate behavioral response.
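The left/right internal reward described above can be sketched in a few lines of Python. The sign convention follows the Fig. 4 caption (a home-vector angle smaller than the heading means the nest lies to the left). The graded magnitude used here, the angular offset normalised by $\pi$, is an assumption made for illustration and is not taken from Wystrach (2023), and the function names are hypothetical.

```python
import math

# Illustrative sketch of a left/right internal reward computed from the
# compass heading and the PI home-vector angle. The graded magnitude
# (normalised angular offset) is an assumption, not the published rule.

def wrap_angle(angle):
    """Wrap an angle in radians into the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def internal_rewards(kappa, home_angle):
    """Return (r_l, r_r) for heading `kappa` and home-vector angle `home_angle`.

    Convention as in the Fig. 4 caption: a home angle smaller than the
    heading means the nest lies to the left.
    """
    offset = wrap_angle(home_angle - kappa)
    magnitude = abs(offset) / math.pi
    if offset < 0:
        return magnitude, 0.0   # nest to the left: input to D_l
    return 0.0, magnitude       # nest to the right (or dead ahead): input to D_r


if __name__ == "__main__":
    print(internal_rewards(kappa=0.5, home_angle=0.0))
    print(internal_rewards(kappa=-3.0, home_angle=3.0))  # wraps across +/- pi
```

Wrapping the difference matters near the $\pm\pi$ boundary: a heading of $-3.0$ rad and a home angle of $3.0$ rad differ by only about $0.28$ rad, not by $6$ rad. Because the signal is available at every timestep for which the home vector is valid, it is dense in the sense used in the text.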

Evidently, the model implements an associative learning algorithm: It does not involve computation of a TD error as proposed above, or even an RPE for the internal reward. While it may be interesting to extend the model to use RPEs for reinforcement, a computation of TD errors would serve to achieve an erroneous objective: The formulation of the underlying MDP would imply that the agent’s goal is to maximize the cumulative internal reward. Since the internal reward is higher (e.g. for steering right) the further off-target the agent is heading, maximizing it over time would lead to the opposite of the desired behavior. It would be interesting to investigate if a model with inverted reward valences could be extended to an RL model for visual homing. In the following, however, we will explore a different line of thought, sketching an MB/CX inspired RL model that integrates all three components of the insect navigation base model from Sec. 4 - path integration, vector memories and view memories - and links them to the behavioral objective of optimizing external rewards. Finally, note that - in a liberal interpretation of RL terminology - we can classify the visual homing model as an off-policy learning algorithm: During the learning phase, the agent’s actions are assumed to follow an exploration strategy in agreement with observations (using data from jayatilakaChoreographyLearningWalks2018; wystrachVisualScanningBehaviours2014) of learning flights/walks performed by insects after emerging from their nest for the first time (see collettInstinctLearningLearning2023, for a recent review).

Figure 5: The visual homing model of Fig. 4 can conceptually be turned into a hierarchical MB/CX-based RL full navigation model which roughly matches the insect navigation base model. This can be done by replacing the off-policy controller in Fig. 4B with an MB RL agent as an ‘RL meta-controller’. As one additional component, it relies on a store of vector memories $m$, ‘snapshots’ of the PI home vector $\Pi$, learned via direct associative reinforcement with an external reward at salient locations (see Sec. 4 and Sec. 6.2 for possible anatomical implementations), which now form the (discrete) action space of the TD-RL agent. It learns a policy $\pi(m|s)$ from the same external reward as the vector memories on a state space of processed visual input $s_t$. The internal reward now encodes whether the agent is facing left/right with respect to the relative vector to the currently active vector memory selected by the meta-controller. This vector computation based on current $\Pi$, current heading $\kappa$, and active $m$ is assumed to be performed in the CX. The rest of the visual homing circuit is identical to Fig. 4. Note that $s_t$ and $\Pi$ could conceptually be viewed as components of a joint representation of the visual input, with different components of the algorithm acting only on subspaces of this representation.

6.2 Towards an MB/CX based RL Model for (Insect) Navigation

The visual homing model discussed above will serve as a representative of models for view memory. It obviates the need to decide when to store a view memory by learning continuous associations, as described above. It makes a very explicit connection between view memories and the home vector, while other models only implicitly associate view memories with a motivational state (Webb, 2019). However, taking into account the proposed mechanism for storing and recalling vector memories (Le Moël et al., 2019), one could broadly interpret the presently loaded vector memory as a motivational state (feeder vector loaded = ‘foraging motivation’, no vector loaded = ‘homing motivation’). Loading the vector memory of a specific location effectively replaces the home vector in the visual homing model with a relative vector towards that location, theoretically allowing the agent to associatively learn visual homing relative to every location in the vector memory, using internal rewards. Vector memories on the other hand, are learned by one-shot associative learning from external reward, e.g. the presence of food. As suggested by Le Moël et al. (2019), this could be achieved by direct dopaminergic modulation of synapses between ‘vector-memory neurons’ and the CPU4 integrator neurons in the CX. Alternatively, specific populations of neurons, each of which conveys a specific vector in a phasor-like representation, might be activated. If these hypothetical ‘vector-memory neurons’ are, or receive excitatory input from, a subpopulation of MBONs, the pool of vector memories could in turn serve as the action space for an MB-based implementation of a TD-RL algorithm discussed above (Fig. 4B). This RL meta-controller learns a policy over sub-goals from the current view and external reward, which may serve two purposes: Steering the agent towards the selected sub-goal using PI, and providing sub-goal-directed internal reward for the low-level AL steering/homing controller. 
Interestingly, however, due to the off-policy architecture of the AL controller, view association with respect to any (latent) vector memory could be learned while homing towards another (active) one (using either views or PI), assuming a dedicated set of visual homing MBONs associated with each vector memory. This would be consistent with our interpretation of vector memories as motivational states, which are often modeled by distinct MBON populations in the MB literature (see Webb and Wystrach (2016)). This would theoretically enable the agent to continually learn view associations with respect to all vectors stored in memory while using a specific one, or PI for steering. If we further assume an anatomical link between ‘vector memory MBONs’ and their corresponding set of ‘visual homing MBONs’, the RL-meta controller could select previously visited vector locations as sub-goals for visual homing. We argue that a unified MB/CX RL model along these lines would give rise to a remarkably versatile spatial representation for navigation.
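The statement that loading a vector memory replaces the home vector with a relative vector towards the stored location is plain vector subtraction, sketched below. The sketch assumes that the home vector points from the agent to the nest and that a vector memory is a snapshot of the home vector taken at the goal; the Cartesian tuples and function names are illustrative and say nothing about how the CX encodes these quantities.

```python
import math

# Illustrative sketch: vectors are (x, y) tuples in a common world frame.
# The subtraction is plain vector geometry; its mapping onto CX circuitry
# is an assumption following the text.

def relative_vector(home_vector, vector_memory):
    """Vector from the agent to the remembered location.

    home_vector   = nest - agent   (current PI state)
    vector_memory = nest - goal    (PI snapshot stored at the goal)
    difference    = goal - agent
    """
    return (home_vector[0] - vector_memory[0],
            home_vector[1] - vector_memory[1])


def goal_direction(home_vector, vector_memory=None):
    """Direction to the active goal in radians (atan2 convention).
    With no vector memory loaded, the goal is the nest itself."""
    if vector_memory is None:
        vector_memory = (0.0, 0.0)
    dx, dy = relative_vector(home_vector, vector_memory)
    return math.atan2(dy, dx)


if __name__ == "__main__":
    nest, agent, feeder = (0.0, 0.0), (2.0, 0.0), (2.0, 3.0)
    home = (nest[0] - agent[0], nest[1] - agent[1])
    memory = (nest[0] - feeder[0], nest[1] - feeder[1])
    print(relative_vector(home, memory))   # equals feeder - agent
```

In the hierarchical picture, the meta-controller’s choice of $m$ only changes which vector is subtracted; the downstream left/right comparison against the heading can stay the same.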

Fig. 5 illustrates how the proposed model is built from the components discussed above.

Table 1: Rosetta Stone of Insect Navigation and Reinforcement Learning
Part
Name        Description        Size ($\mu$m)
Dendrite    Input terminal     $\sim$100
Axon        Output terminal    $\sim$10
Soma        Cell body          up to $10^{6}$

7 Discussion

In this paper, we presented reinforcement learning as a common framework to compare representations of space (Sec. 2) in current models for robot (Sec. 3) and insect navigation (Sec. 4). We argue that the insect navigation base model combines an explicit vector map with a latent spatial representation based on view memories. This combination of explicit and latent spatial representations, evidently very successful for insects, may provide inspiration for future approaches for robot navigation. We further proposed reinforcement learning as a novel framework for interpreting and expanding existing models for insect navigation. In Sec. 5, we analyzed how existing models of the insect MB circuit could implement temporal-difference (TD) RL algorithms, and argued that activity traces of recurrent MBON$\rightarrow$DAN connections would support TD-like reinforcement at the KC$\rightarrow$MBON synapse. In Sec. 6, we sketched how this could be used to build a full MB/CX-inspired TD-RL navigation model, which has the potential to link the disjoint spatial representations of the insect navigation base model. To conclude, we will discuss aspects of such a hypothetical model with respect to spatial representations and RL-based robot navigation models more generally.

Task Hierarchies, Intrinsic Rewards, and Exploration

The model we outline here is, in a sense, a hierarchical RL (HRL) model: A metacontroller selects a sub-goal for a low-level controller, to provide the latter with dense internal rewards instead of operating on the sparse external reward landscape. For similar reasons, hierarchical RL architectures also play an important role in robot navigation (Schmalstieg et al., 2023; Nachum et al., 2018; Haarnoja et al., 2018), where they help to mitigate the curse of dimensionality. There are, however, crucial differences: In RL, the meta-controller usually provides intrinsic rewards for the low-level agent, designed to facilitate the exploration of complex environments with sparse external rewards. Baldassarre (2011) defines a situation as ‘intrinsically motivating […] if its interest depends primarily on the collation or comparison of information from different stimuli and independently of their semantics, […] understood in an information theoretic perspective, in which what is considered is the intrinsic mathematical structure of the values of stimuli, independently of their meaning’, as opposed to extrinsic motivation. Intrinsic rewards can be viewed as a formalization of curiosity, motivating the agent to explore unfamiliar terrain, and various approaches to model it exist in the RL literature (Pathak et al., 2017; Savinov et al., 2019). In our model, the internal reward - facing left/right of a location related to external rewards - is inherently extrinsic and does not motivate the agent to explore, and the low-level controller is not an RL agent. However, there are biologically plausible ways to include intrinsic reward in the model. For example, Sun et al. (2020) propose a mechanism for switching between on-route and off-route navigation strategies based on visual novelty. Such a signal could theoretically serve as an intrinsic reward in our model.

A Unified Spatial Representation.

The proposed CX/MB inspired RL navigation model would fuse the disjoint spatial representations of the insect navigation base model - an explicit vector map and latent directional cues (Fig. 1) - into a unified latent graph-like representation which could be used in the absence of accurate PI information (while preserving the vector map for pure PI-based navigation). The agent learns to select a high-level subgoal - a vector memory - based on the current view, from the high-level objective of external reward optimization using an MB/CX RL circuit. This association defines a latent topological relation between the location of the current view and nodes represented by the vector memories: the learned policy will likely reflect distance since learning to visit distant goals would be less rewarding in the long run. The low-level MB/CX AL homing circuit learns rough directional cues to steer the agent toward the selected goal. Although both directional and distance information are therefore available to the agent, both are indirect and inaccurate, making the representation topological rather than geometrical.

In light of the cognitive map debate, we can characterize an agent with such a spatial representation as ‘knowing where to go, on different spatial scales’: It can infer from the current view both where it wants to go, i.e. which vector memory to load, and how to get there. This would enable behavior commonly associated with a cognitive map. For example, the agent could visually infer the direction of a novel shortcut from the present location A to a previously visited location B stored in the vector memory, since view memories in the vicinity of A have also been associated with left/right steering commands with respect to B. A change in the policy over vector memories, for example because a previously closer food source C has been depleted, could then incentivize the agent to steer along a novel route (‘shortcut’) towards B.

Predictions, Memory Traces, and Internal World Models.

The previous example shows that our model could support surprisingly flexible navigation, adapting behavior locally depending on distant changes in the environment. In RL, this kind of flexibility is traditionally associated with model-based algorithms: Unlike the model-free algorithms discussed above, they learn a model of the environment - i.e. the transition Eq. (1) and reward functions Eq. (2) - directly and then infer the optimal policy from them via Eq. (5). Explicit knowledge about the environment allows them to plan out actions virtually in order to optimize the policy. (See Hafner et al. 2022 for a current application using Deep Hierarchical Planning.) This is particularly useful to adapt flexibly to changes in the environment, like changes in reward magnitudes (e.g. an empty feeder) or new navigational obstacles. Model-based algorithms don’t need to slowly and incrementally update their state evaluation by repeated exposure to the change in the environment, but can simply update their model of the environment and virtually plan a new route based on that update. A neural implementation of such a predictive model of the environment presupposes a state representation in the form of recurrent neural activity, like the (p)replay of spatial sequences observed in the mammalian hippocampus (dragoiPreplayFuturePlace2011). The likely insect analog, spontaneous KC activity in the MB, would be challenging to access experimentally and has not yet been observed.

On the other hand, a suitably predictive (latent) representation can produce similarly flexible behavior. For example, Russek et al. (2017) show that the successor representation (Dayan, 2000) can link model-free TD methods to model-based behavior. Abstract latent representations like in the SMT model (Fang et al. 2019, see Sec. 3) can also adapt flexibly to environmental changes without an explicit model: Adding an observation of a change in the environment to the scene memory would globally change the embedding of all other scene memories, and the latent representation could quickly adapt to reflect the new environment. Another memory mechanism linking model-free TD-RL algorithms to seemingly predictive behavior is eligibility traces (Sutton, 1988). Instead of updating only the value of the current (single-timestep) state-action pair with the current reward, a trace of previous experiences is updated as well. A current negative reward, e.g. related to an obstacle, would for example affect the agent’s evaluation of an earlier state, causing it to adapt its behavior early. However, while the behavior looks predictive, it is in fact reactive: in order to learn to avoid a new obstacle, the agent would still have to experience the novel situation a couple of times. Wystrach et al. (2020) used a very similar concept of memory trace learning\footnote{This is essentially the AL analog to eligibility traces.}, mediated by KC activity traces in the MB, to show how desert ants learn to avoid new obstacles.
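The effect of eligibility traces on earlier states can be illustrated with a minimal tabular TD($\lambda$) sketch using accumulating traces, in the standard textbook form. The four-state episode, the parameter values, and the function name are arbitrary assumptions chosen for readability, and the sketch operates on state values rather than on the KC activity traces of Wystrach et al. (2020).

```python
# Illustrative tabular TD(lambda) with accumulating eligibility traces.
# Parameter values are arbitrary assumptions chosen for readability.

def td_lambda_episode(values, states, rewards, alpha=0.5, gamma=1.0, lam=0.9):
    """Update `values` in place from one episode and return it.

    states  : visited state indices s_0 ... s_T (s_T is terminal)
    rewards : rewards received on each of the T transitions
    """
    traces = [0.0] * len(values)
    for t, reward in enumerate(rewards):
        s, s_next = states[t], states[t + 1]
        delta = reward + gamma * values[s_next] - values[s]   # TD error
        traces = [gamma * lam * e for e in traces]            # decay all traces
        traces[s] += 1.0                                      # mark current state
        for i, e in enumerate(traces):
            values[i] += alpha * delta * e
    return values


if __name__ == "__main__":
    episode, penalties = [0, 1, 2, 3], [0.0, 0.0, -1.0]   # obstacle at the end
    print(td_lambda_episode([0.0] * 4, episode, penalties, lam=0.9))
    print(td_lambda_episode([0.0] * 4, episode, penalties, lam=0.0))
```

With $\lambda = 0$ only the state immediately preceding the negative reward is devalued after one episode, whereas with $\lambda > 0$ all earlier states on the path are devalued at once, in proportion to their decayed trace. The update is still driven by an experienced reward, which is why the resulting behavior is reactive rather than predictive.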

Our MB/CX RL model has the potential to enable fast adaptation to changes: as discussed above, a small change in the global high-level policy due to environmental changes (e.g. an empty feeder) can lead to a sudden change in the local, low-level behavior (steering towards a different goal). The question is then how fast the agent can adapt the high-level policy to the environmental change. Since local steering is handled by the low-level controller, and the agent learns a policy over vector memories on a relatively large spatial scale, the agent doesn’t need to constantly reassess its policy, but can do so at larger intervals. This would effectively increase the temporal scale of the MDP, and therefore reduce the number of intermediate states between rewarding experiences. In this ‘sped-up’ MDP with much denser rewards, policy adaptation due to a local change in the environment is propagated faster to distant states, through fewer intermediate states. This effect could be further enhanced by a memory trace mechanism. Taken together, this would enable the agent to quickly adapt local behavior based on distant changes in the environment, by updating its hierarchical, topological spatial representation of the world, without the need for a predictive internal world model.

RL as a Normative Framework: Place-Cell like Activity in KCs?

So far, we have treated the visual system as a static component of the model. While this assumption is largely consistent with current knowledge about plasticity in the insect visual system, we can again expand the temporal horizon to evolutionary timescales and view the anatomical components of the insect MB, CX, and visual system in the light of end-to-end RL, as outlined in the discussion of Sec. 4. Without going into anatomical details of the insect visual system here, a sensor-motor signal transduction pathway (retina $\rightarrow$ lamina $\rightarrow$ medulla $\rightarrow$ lobula / lobula plate $\rightarrow$ visual PNs $\rightarrow$ KCs $\rightarrow$ MBONs $\rightarrow$ (pre)motor neurons) could be modeled as a deep neural network implementing a DRL architecture, whose latent spatial representation would then correspond to KC activity (the input layer to the actual policy network, represented by the MB). The learned representations could be indicative of actual spatial representations used by navigating insects, and provide guidance for experimental work. For example, it would be interesting to see if such a model learns a grid-like spatial representation which would correspond to place-cell like KC activity. This hypothesis is compatible with known sparse firing patterns of KCs, but otherwise speculative given current neurophysiological data. It is also unclear how such a representation would fit into the existing model for insect navigation. Since CX-based PI is firmly established as a key component for insect navigation, it seems imperative to eventually include the CX in such an end-to-end RL model. This would enable representations that embed PI-like components in a more complex latent space. Since current insect navigation models do not include CX$\rightarrow$KC connections, it is not straightforward how the CX would be integrated into an MB-based end-to-end RL model.
It will be challenging to strike a balance between expressive power of the network architecture - essential for gaining new insights about possible representations - and necessary (anatomical) constraints to match empirically known components of these representations, like the PI home vector.

Acknowledgements

We thank Joschka Boedecker, Paulina Friemann, and Christian Leibold for helpful discussions. We gratefully acknowledge financial support from the BrainWorlds Initiative at the University of Freiburg and the VolkswagenFoundation Momentum Program (AZ 98692 to ADS).

References

  • Webb (2019) Barbara Webb. The internal maps of insects. Journal of Experimental Biology, 222(Suppl_1):jeb188094, February 2019. ISSN 0022-0949. doi: 10.1242/jeb.188094.
  • Fuentes-Pacheco et al. (2015) Jorge Fuentes-Pacheco, José Ruiz-Ascencio, and Juan Manuel Rendón-Mancha. Visual simultaneous localization and mapping: A survey. Artificial Intelligence Review, 43(1):55–81, January 2015. ISSN 1573-7462. doi: 10.1007/s10462-012-9365-8.
  • O’Keefe (1979) John O’Keefe. A review of the hippocampal place cells. Progress in Neurobiology, 13(4):419–439, January 1979. ISSN 0301-0082. doi: 10.1016/0301-0082(79)90005-4.
  • Zeil et al. (2003) Jochen Zeil, Martin I. Hofmann, and Javaan S. Chahl. Catchment areas of panoramic snapshots in outdoor scenes. Journal of the Optical Society of America A, 20(3):450, March 2003. ISSN 1084-7529, 1520-8532. doi: 10.1364/JOSAA.20.000450.
  • Dhein (2023) Kelle Dhein. The cognitive map debate in insects: A historical perspective on what is at stake. Studies in History and Philosophy of Science, 98:62–79, April 2023. ISSN 0039-3681. doi: 10.1016/j.shpsa.2022.12.008.
  • Hoinville and Wehner (2018) Thierry Hoinville and Rüdiger Wehner. Optimal multiguidance integration in insect navigation. Proceedings of the National Academy of Sciences, 115(11):2824–2829, March 2018. doi: 10.1073/pnas.1721668115.
  • Mur-Artal and Tardós (2017) Raul Mur-Artal and Juan D Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
  • Engel et al. (2014a) Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision, pages 834–849. Springer, 2014a.
  • Endres et al. (2012) Felix Endres, Jürgen Hess, Nikolas Engelhard, Jürgen Sturm, Daniel Cremers, and Wolfram Burgard. An evaluation of the RGB-D SLAM system. In 2012 IEEE International Conference on Robotics and Automation, pages 1691–1696. IEEE, 2012.
  • Shah et al. (2023) Dhruv Shah, Benjamin Eysenbach, Gregory Kahn, Nicholas Rhinehart, and Sergey Levine. Rapid Exploration for Open-World Navigation with Latent Goal Models, October 2023.
  • Henriques and Vedaldi (2018) João F. Henriques and Andrea Vedaldi. MapNet: An Allocentric Spatial Memory for Mapping Environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8476–8484, 2018.
  • Gupta et al. (2019) Saurabh Gupta, Varun Tolani, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive Mapping and Planning for Visual Navigation, February 2019.
  • Fang et al. (2019) Kuan Fang, Alexander Toshev, Li Fei-Fei, and Silvio Savarese. Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 538–547, June 2019. doi: 10.1109/CVPR.2019.00063.
  • Gutmann et al. (2008) Jens-Steffen Gutmann, Masaki Fukuchi, and Masahiro Fujita. 3D Perception and Environment Map Generation for Humanoid Robot Navigation. The International Journal of Robotics Research, 27(10):1117–1134, October 2008. ISSN 0278-3649, 1741-3176. doi: 10.1177/0278364908096316.
  • Zhu et al. (2021) Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R Oswald, and Marc Pollefeys. NICE-SLAM: Neural implicit scalable encoding for SLAM. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12776–12786, 2021.
  • Rosinol et al. (2023) Antoni Rosinol, John J Leonard, and Luca Carlone. NeRF-SLAM: Real-time dense monocular SLAM with neural radiance fields. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3437–3444. IEEE, 2023.
  • Matsuki et al. (2023) Hidenobu Matsuki, Riku Murai, Paul HJ Kelly, and Andrew J Davison. Gaussian splatting SLAM. arXiv preprint arXiv:2312.06741, 2023.
  • Wani et al. (2020) Saim Wani, Shivansh Patel, Unnat Jain, Angel X. Chang, and Manolis Savva. MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation, December 2020.
  • Schmalstieg et al. (2022) Fabian Schmalstieg, Daniel Honerkamp, Tim Welschehold, and Abhinav Valada. Learning long-horizon robot exploration strategies for multi-object search in continuous action spaces. In The International Symposium of Robotics Research, pages 52–66. Springer, 2022.
  • Younes et al. (2023) Abdelrahman Younes, Daniel Honerkamp, Tim Welschehold, and Abhinav Valada. Catch me if you hear me: Audio-visual navigation in complex unmapped environments with moving sounds. IEEE Robotics and Automation Letters, 8(2):928–935, 2023.
  • Ramakrishnan et al. (2022) Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. Poni: Potential functions for objectgoal navigation with interaction-free learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18890–18900, 2022.
  • Hughes et al. (2022) Nathan Hughes, Yun Chang, and Luca Carlone. Hydra: A real-time spatial perception system for 3D scene graph construction and optimization. arXiv preprint arXiv:2201.13360, 2022.
  • Gu et al. (2023) Qiao Gu, Alihusein Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning. arXiv preprint arXiv:2309.16650, 2023.
  • Werby et al. (2024) Abdelrhman Werby, Chenguang Huang, Martin Büchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open-vocabulary 3D scene graphs for language-grounded robot navigation. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024.
  • Rana et al. (2023) Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, and Niko Suenderhauf. SayPlan: Grounding large language models using 3D scene graphs for scalable task planning. arXiv preprint arXiv:2307.06135, 2023.
  • Honerkamp et al. (2024) Daniel Honerkamp, Martin Buchner, Fabien Despinoy, Tim Welschehold, and Abhinav Valada. Language-grounded dynamic scene graphs for interactive object search with mobile manipulation. arXiv preprint arXiv:2403.08605, 2024.
  • Banino et al. (2018) Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr Mirowski, Alexander Pritzel, Martin J. Chadwick, Thomas Degris, Joseph Modayil, Greg Wayne, Hubert Soyer, Fabio Viola, Brian Zhang, Ross Goroshin, Neil Rabinowitz, Razvan Pascanu, Charlie Beattie, Stig Petersen, Amir Sadik, Stephen Gaffney, Helen King, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, and Dharshan Kumaran. Vector-based navigation using grid-like representations in artificial agents. Nature, 557(7705):429–433, May 2018. ISSN 1476-4687. doi: 10.1038/s41586-018-0102-6.
  • Warren et al. (2017) William H. Warren, Daniel B. Rothman, Benjamin H. Schnapp, and Jonathan D. Ericson. Wormholes in virtual space: From cognitive maps to cognitive graphs. Cognition, 166:152–163, September 2017. ISSN 00100277. doi: 10.1016/j.cognition.2017.05.020.
  • Shah and Levine (2022) Dhruv Shah and Sergey Levine. ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints. In Robotics: Science and Systems XVIII, June 2022. doi: 10.15607/RSS.2022.XVIII.019.
  • Engel et al. (2014b) Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-Scale Direct Monocular SLAM. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 834–849, Cham, 2014b. Springer International Publishing. ISBN 978-3-319-10605-2. doi: 10.1007/978-3-319-10605-2_54.
  • Baddeley et al. (2012) Bart Baddeley, Paul Graham, Philip Husbands, and Andrew Philippides. A Model of Ant Route Navigation Driven by Scene Familiarity. PLoS Computational Biology, 8(1):e1002336, January 2012. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1002336.
  • Ardin et al. (2016) Paul Ardin, Fei Peng, Michael Mangan, Konstantinos Lagogiannis, and Barbara Webb. Using an Insect Mushroom Body Circuit to Encode Route Memory in Complex Natural Environments. PLOS Computational Biology, 12(2):e1004683, February 2016. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1004683.
  • Wystrach (2023) Antoine Wystrach. Neurons from pre-motor areas to the Mushroom bodies can orchestrate latent visual learning in navigating insects. Preprint, Neuroscience, March 2023.
  • Sutton and Barto (2018) Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, Massachusetts, second edition, 2018. ISBN 978-0-262-03924-6.
  • Zhu and Zhang (2021) Kai Zhu and Tao Zhang. Deep reinforcement learning based mobile robot navigation: A review. Tsinghua Science and Technology, 26(5):674–691, October 2021. ISSN 1007-0214. doi: 10.26599/TST.2021.9010012.
  • Zeng et al. (2020) Fanyu Zeng, Chen Wang, and Shuzhi Sam Ge. A Survey on Visual Navigation for Artificial Agents With Deep Reinforcement Learning. IEEE Access, 8:135426–135442, 2020. ISSN 2169-3536. doi: 10.1109/ACCESS.2020.3011438.
  • Xiao et al. (2022) Xuesu Xiao, Zifan Xu, Zizhao Wang, Yunlong Song, Garrett Warnell, Peter Stone, Tingnan Zhang, Shravan Ravi, Gary Wang, Haresh Karnan, et al. Autonomous ground navigation in highly constrained spaces: Lessons learned from the benchmark autonomous robot navigation challenge at ICRA 2022 [competitions]. IEEE Robotics & Automation Magazine, 29(4):148–156, 2022.
  • Klein and Murray (2007) Georg Klein and David Murray. Parallel Tracking and Mapping for Small AR Workspaces. In 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 1–10, Nara, Japan, November 2007. IEEE. ISBN 978-1-4244-1749-0. doi: 10.1109/ISMAR.2007.4538852.
  • Mur-Artal and Tardós (2017) Raúl Mur-Artal and Juan D. Tardós. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Transactions on Robotics, 33(5):1255–1262, October 2017. ISSN 1941-0468. doi: 10.1109/TRO.2017.2705103.
  • Konolige et al. (2009) K. Konolige, J. Bowman, J. D. Chen, P. Mihelich, M. Calonder, V. Lepetit, and P. Fua. View-based maps. In Robotics: Science and Systems V, volume 05, June 2009. ISBN 978-0-262-51463-7.
  • Greve et al. (2023) Elias Greve, Martin Büchner, Niclas Vödisch, Wolfram Burgard, and Abhinav Valada. Collaborative dynamic 3D scene graphs for automated driving. arXiv preprint arXiv:2309.06635, 2023.
  • Vödisch et al. (2022) Niclas Vödisch, Daniele Cattaneo, Wolfram Burgard, and Abhinav Valada. Continual SLAM: Beyond lifelong simultaneous localization and mapping through continual learning. In The International Symposium of Robotics Research, pages 19–35. Springer, 2022.
  • Chaplot et al. (2020) Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to Explore using Active Neural SLAM. arXiv preprint arXiv:2004.05155, 2020.
  • Gupta et al. (2017) Saurabh Gupta, David Fouhey, Sergey Levine, and Jitendra Malik. Unifying Map and Landmark Based Representations for Visual Navigation, December 2017.
  • Alemi et al. (2019) Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep Variational Information Bottleneck, October 2019.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Stone et al. (2017) Thomas Stone, Barbara Webb, Andrea Adden, Nicolai Ben Weddig, Anna Honkanen, Rachel Templin, William Wcislo, Luca Scimeca, Eric Warrant, and Stanley Heinze. An Anatomically Constrained Model for Path Integration in the Bee Brain. Current Biology, 27(20):3069–3085.e11, October 2017. ISSN 09609822. doi: 10.1016/j.cub.2017.08.052.
  • Lyu et al. (2022) Cheng Lyu, L. F. Abbott, and Gaby Maimon. Building an allocentric travelling direction signal via vector computation. Nature, 601(7891):92–97, January 2022. ISSN 1476-4687. doi: 10.1038/s41586-021-04067-0.
  • Le Moël et al. (2019) Florent Le Moël, Thomas Stone, Mathieu Lihoreau, Antoine Wystrach, and Barbara Webb. The Central Complex as a Potential Substrate for Vector Based Navigation. Frontiers in Psychology, 10, 2019. ISSN 1664-1078.
  • Cartwright and Collett (1983) B. A. Cartwright and T. S. Collett. Landmark learning in bees. Journal of comparative physiology, 151(4):521–543, December 1983. ISSN 1432-1351. doi: 10.1007/BF00605469.
  • Lee et al. (1999) T. W. Lee, M. Girolami, and T. J. Sejnowski. Independent component analysis using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural Computation, 11(2):417–441, February 1999. ISSN 0899-7667. doi: 10.1162/089976699300016719.
  • Lulham et al. (2011) Andrew Lulham, Rafal Bogacz, Simon Vogt, and Malcolm W. Brown. An Infomax algorithm can perform both familiarity discrimination and feature extraction in a single network. Neural Computation, 23(4):909–926, April 2011. ISSN 1530-888X. doi: 10.1162/NECO_a_00097.
  • Sun et al. (2020) Xuelong Sun, Shigang Yue, and Michael Mangan. A decentralised neural model explaining optimal integration of navigational strategies in insects. eLife, 9:e54026, June 2020. ISSN 2050-084X. doi: 10.7554/eLife.54026.
  • Sutton et al. (1999) Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999.
  • Williams (1992) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, May 1992. ISSN 1573-0565. doi: 10.1007/BF00992696.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. ISSN 1476-4687. doi: 10.1038/nature14236.
  • Cognigni et al. (2018) Paola Cognigni, Johannes Felsenberg, and Scott Waddell. Do the right thing: neural network mechanisms of memory formation, expression and update in Drosophila. Curr. Opin. Neurobiol., 49:51–58, April 2018.
  • Ehmer and Gronenberg (2002) Birgit Ehmer and Wulfila Gronenberg. Segregation of visual input to the mushroom bodies in the honeybee (Apis mellifera). J. Comp. Neurol., 451(4):362–373, September 2002.
  • Strube-Bloss and Rössler (2018) Martin F Strube-Bloss and Wolfgang Rössler. Multimodal integration and stimulus categorization in putative mushroom body output neurons of the honeybee. R. Soc. Open Sci., 5(2):171785, February 2018.
  • Vogt et al. (2014) Katrin Vogt, Christopher Schnaitmann, Kristina V Dylla, Stephan Knapek, Yoshinori Aso, Gerald M Rubin, and Hiromu Tanimoto. Shared mushroom body circuits underlie visual and olfactory memories in Drosophila. eLife, 3:e02395, August 2014.
  • Caron et al. (2013) Sophie J C Caron, Vanessa Ruta, L F Abbott, and Richard Axel. Random convergence of olfactory inputs in the Drosophila mushroom body. Nature, 497(7447):113–117, May 2013.
  • Bennett et al. (2021) James E. M. Bennett, Andrew Philippides, and Thomas Nowotny. Learning with reinforcement prediction errors in a model of the Drosophila mushroom body. Nature Communications, 12(1):2569, May 2021. ISSN 2041-1723. doi: 10.1038/s41467-021-22592-4.
  • Jürgensen et al. (2024) Anna-Maria Jürgensen, Panagiotis Sakagiannis, Michael Schleyer, Bertram Gerber, and Martin Paul Nawrot. Prediction error drives associative learning and conditioned behavior in a spiking model of Drosophila larva. iScience, 27(1):108640, January 2024. ISSN 25890042. doi: 10.1016/j.isci.2023.108640.
  • Rescorla (1972) Robert A Rescorla. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. Classical conditioning, Current research and theory, 2:64–69, 1972.
  • Webb and Wystrach (2016) Barbara Webb and Antoine Wystrach. Neural mechanisms of insect navigation. Current Opinion in Insect Science, 15:27–39, June 2016. ISSN 22145745. doi: 10.1016/j.cois.2016.02.011.
  • Schmalstieg et al. (2023) Fabian Schmalstieg, Daniel Honerkamp, Tim Welschehold, and Abhinav Valada. Learning Hierarchical Interactive Multi-Object Search for Mobile Manipulation. IEEE Robotics and Automation Letters, 8(12):8549–8556, December 2023. ISSN 2377-3766, 2377-3774. doi: 10.1109/LRA.2023.3329619.
  • Nachum et al. (2018) Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-Efficient Hierarchical Reinforcement Learning, October 2018.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent Space Policies for Hierarchical Reinforcement Learning, September 2018.
  • Baldassarre (2011) Gianluca Baldassarre. What are intrinsic motivations? A biological perspective. In 2011 IEEE International Conference on Development and Learning (ICDL), volume 2, pages 1–8. IEEE, 2011.
  • Pathak et al. (2017) Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven Exploration by Self-supervised Prediction, May 2017.
  • Savinov et al. (2019) Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. Episodic Curiosity through Reachability, August 2019.
  • Hafner et al. (2022) Danijar Hafner, Kuang-Huei Lee, Ian Fischer, and Pieter Abbeel. Deep Hierarchical Planning from Pixels, June 2022.
  • Russek et al. (2017) Evan M. Russek, Ida Momennejad, Matthew M. Botvinick, Samuel J. Gershman, and Nathaniel D. Daw. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLOS Computational Biology, 13(9):e1005768, September 2017. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1005768.
  • Dayan (2000) Peter Dayan. Improving Generalisation for Temporal Difference Learning: The Successor Representation. Neural computation, August 2000.
  • Sutton (1988) Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, August 1988. ISSN 1573-0565. doi: 10.1007/BF00115009.
  • Wystrach et al. (2020) Antoine Wystrach, Cornelia Buehlmann, Sebastian Schwarz, Ken Cheng, and Paul Graham. Rapid Aversive and Memory Trace Learning during Route Navigation in Desert Ants. Current Biology, 30(10):1927–1933.e2, May 2020. ISSN 0960-9822. doi: 10.1016/j.cub.2020.02.082.