Benchmarking Predictive Coding Networks
– Made Simple

Luca Pinchetti1, Chang Qi2, Oleh Lokshyn2, Gaspard Oliviers3, Cornelius Emde1,
Mufeng Tang3, Amine M’Charrak1, Simon Frieder1, Bayar Menzat2,
Rafal Bogacz3, Thomas Lukasiewicz2,1, Tommaso Salvatori4,2,∗
1Department of Computer Science, University of Oxford, Oxford, UK
2Institute of Logic and Computation, Vienna University of Technology, Vienna, Austria
3MRC Brain Network Dynamics Unit, University of Oxford, UK
4VERSES AI Research Lab, Los Angeles, US
Abstract

In this work, we tackle the problems of efficiency and scalability for predictive coding networks in machine learning. To do so, we first propose a library called PCX, which focuses on performance and simplicity and provides a user-friendly, deep-learning-oriented interface. Second, we use PCX to implement a large set of benchmarks for the community to use in their experiments. As most works propose their own tasks and architectures, do not compare against one another, and focus on small-scale tasks, a simple and fast open-source library adopted by the whole community would address all of these concerns. Third, we perform extensive benchmarks using multiple algorithms, setting new state-of-the-art results in multiple tasks and datasets, as well as highlighting limitations inherent to PC that should be addressed. Thanks to the efficiency of PCX, we are able to analyze larger architectures than commonly used, providing baselines to galvanize community efforts towards one of the main open problems in the field: scalability. The code for PCX is available at https://github.com/liukidar/pcax.

1 Introduction

The history of predictive coding is long and spans a large number of disciplines (Friston, 2018; Spratling, 2017). It first appeared as a computational framework in the 1950s, when electronic engineers realized that sending compressed representations of prediction errors in time series data was cheaper than sending the data itself (Elias, 1955). A similar algorithm was then used in the neurosciences, first to describe inhibitory signals in the retina (Srinivasan et al., 1982), and then as a general theory of information processing in different brain regions (Mumford, 1992). Rao and Ballard (1999) later proposed a hierarchical formulation of predictive coding (PC) as a model of visual processing. Recently, researchers realized that this framework could be used to train neural networks using a bio-plausible learning rule (Whittington and Bogacz, 2017). This has led to different directions of research that either explored interesting properties of PC networks, such as their robustness (Song et al., 2024; Alonso et al., 2022) and flexibility (Salvatori et al., 2022), or proposed variations to improve the performance on specific tasks (Salvatori et al., 2024). While interesting and important for the progress of the field, these lines of research tend not to compare their results against other papers or against those of related fields, and tend to focus on small-scale experiments. The field is hence avoiding what we believe to be the most important open problem: scaling such results to large-scale tasks.

There are multiple reasons why such an important problem has been overlooked. First, it is a hard problem, and it is still unclear why PC is able to perform as well as classical gradient descent with backprop only up to a certain scale, namely that of small convolutional models trained to classify the CIFAR10 dataset (Salvatori et al., 2024). Understanding the reason for this limitation would allow us to develop regularization techniques that stabilize learning, and hence allow better performance on more complex tasks, similarly to what dropout and batch normalization have done for deep learning (Srivastava et al., 2014; Ioffe and Szegedy, 2015). Second, the lack of specialized libraries makes PC models extremely slow: a full hyperparameter search on a small convolutional network can take several hours. Third, the lack of a common framework makes reproducibility and iterative contributions impossible, as implementation details or code are rarely provided. In this work, we take first steps toward addressing these problems with three contributions, which we call tool, benchmarking, and analysis.

Tool.

We release an open-source library for accelerated training of predictive coding models, called PCX. This library runs in JAX (Bradbury et al., 2018) and offers a user-friendly interface with a minimal learning curve, through familiar syntax inspired by PyTorch and extensive tutorials. It is also fully compatible with Equinox (Kidger and Garcia, 2021), a popular deep-learning-oriented extension of JAX, ensuring reliability, extensibility, and compatibility with ongoing research developments. It also supports JAX's Just-In-Time (JIT) compilation, making it efficient and allowing both easy development and execution of PC networks. We empirically show the gain in efficiency with respect to an existing library.

Benchmarking.

We propose a uniform set of tasks, datasets, metrics, and architectures that should be used as a skeleton to test the performance of future variations of PC. The tasks that we propose are the standard ones in computer vision: image classification and generation (with the corresponding metrics). The models that we use, as well as the datasets, are picked according to two criteria. First, to allow researchers to test their algorithm from the easiest model (feedforward network on MNIST) to more complex ones (deep convolutional models), where we failed to obtain acceptable results and which should hence pave the way for future research. Second, to favor the comparison against related fields in the literature, such as equilibrium propagation (Scellier and Bengio, 2017). To this end, we have picked some of the models that are consistently used in other research papers. All the source files of the proposed benchmarks, as well as tutorials explaining how to implement them, will be present in the library, to make it easier for researchers to use them in their studies.

Analysis.

We provide the baselines for future research by performing an extensive comparative study between different hyperparameters and PC algorithms on multiple tasks. We considered standard PC, incremental PC (Salvatori et al., 2024), PC with Langevin dynamics (Oliviers et al., 2024; Zahid et al., 2023), and nudged PC, as done in the Eqprop literature (Scellier and Bengio, 2017; Scellier et al., 2024). This is also the first time nudging algorithms were applied in PC models. In terms of quantitative contributions, we get state-of-the-art results for PC on multiple benchmarks and show for the first time that it is able to perform well on more complex datasets, such as CIFAR100 and Tiny ImageNet, where we get results comparable to those of backprop. In image generation tasks, we present experiments on datasets of colored images, going beyond MNIST and FashionMNIST as performed in previous works. We conclude with an analysis of the credit assignment of PC, which tries to shed light on problems that we will need to solve to scale up such models even more. To this end, we believe that future effort should aim towards developing algorithms and models that improve the numbers that we show in this work, as they represent the new state of the art in the field.

2 Related Works

Rao and Ballard’s PC.

The most related works are those that explore different properties or optimization algorithms of standard PC in the deep learning regime, using formulations inspired by Rao and Ballard's original work (Rao and Ballard, 1999). Examples are works that study their associative memory capabilities (Salvatori et al., 2021; Yoo and Wood, 2022; Tang et al., 2023, 2024), their ability to train Bayesian networks (Salvatori et al., 2022, 2023b), and theoretical results that explain, or improve, their optimization process (Millidge et al., 2022a, b; Alonso et al., 2022). Results in this field have made it possible either to improve the performance of such models in different tasks, or to study different properties that could benefit from the use of PCNs.

Variations of PC.

In the literature, there are multiple variations of PC algorithms, which differ from Rao and Ballard's original formulation in the way they update the neural activities. Important examples of such variations are biased competition and divisive input modulation (Spratling, 2008), or the neural generative coding framework (Ororbia and Kifer, 2022). The latter is already used in multiple reinforcement learning and control tasks (Ororbia and Mali, 2023; Ororbia et al., 2023). For a review on how different PC algorithms evolved through time, from signal processing to neuroscience, we refer to Spratling (2017); for a more recent review specific to machine learning applications, to Salvatori et al. (2023a). It is also worth mentioning that the original literature on PC in the neurosciences, which does not intersect with ours as it is not related to deep neural networks, has evolved from Rao and Ballard's work into a general theory that models information processing in the brain using probability and variational inference, called the free energy principle (Friston, 2005; Friston and Kiebel, 2009; Friston, 2010).

Neuroscience-inspired deep learning.

Another line of related works is that of neuroscience methods applied to machine learning, like equilibrium propagation (Scellier and Bengio, 2017), which is the most similar to PC (Laborieux and Zenke, 2022; Millidge et al., 2022a). Other methods able to train models of similar sizes are target propagation (Bengio, 2014; Ernoult et al., 2022; Millidge et al., 2022b) and SoftHebb (Moraitis et al., 2022; Journé et al., 2022). The first two communities, those of target propagation and equilibrium propagation, consistently use similar architectures in different research papers to test the performance of their methods. In our benchmarking effort, some of the proposed architectures are the same ones they use, to favor a more direct comparison. There are also methods that differ more from PC, such as forward-only methods (Kohan et al., 2023; Nøkland, 2016; Hinton, 2022), methods that back-propagate the errors using a designated set of weights (Lillicrap et al., 2014; Launay et al., 2020), and other Hebbian methods (Moraitis et al., 2022; Journé et al., 2022).

3 Background and Notation

Predictive coding networks (PCNs) are hierarchical Gaussian generative models with $L$ levels and parameters $\theta=\{\theta_{0},\theta_{1},\theta_{2},\ldots,\theta_{L}\}$, in which each level models a multivariate distribution parameterized by the activation of the preceding level. Let $h_{l}$ be the realization of the vector of random variables $H_{l}$ of level $l$; the likelihood then factorizes as

$$P_{\theta}(h_{0},h_{1},\ldots,h_{L})=P_{\theta_{0}}(h_{0})\,P_{\theta_{1}}(h_{1}|h_{0})\cdots P_{\theta_{L}}(h_{L}|h_{L-1}).$$

For simplicity, we write $P_{\theta_{l}}(h_{l})$ instead of $P_{\theta_{l}}(H_{l}=h_{l})$. We refer to each of the scalar random variables of $H_{l}$ as a neuron. PC assumes that both the prior on $h_{0}$ and the relationships between levels are governed by a normal distribution parameterized as follows:

$$P_{\theta_{0}}(h_{0})=\mathcal{N}(h_{0};\mu_{0},\Sigma_{0}),\qquad \mu_{0}=\theta_{0},$$
$$P_{\theta_{l}}(h_{l}|h_{l-1})=\mathcal{N}(h_{l};\mu_{l},\Sigma_{l}),\qquad \mu_{l}=f_{l}(h_{l-1},\theta_{l}),$$

where $\theta_{l}$ are the learnable weights parametrizing the transformation $f_{l}$, and $\Sigma_{l}$ is a covariance matrix. As is standard practice, $\Sigma_{l}$ will be fixed to the identity matrix throughout this work (Whittington and Bogacz, 2017). If, for example, $\theta_{l}=(W_{l},b_{l})$ and $f_{l}(h_{l-1},\theta_{l})=\sigma_{l}(W_{l}h_{l-1}+b_{l})$, then the neurons in level $l-1$ are connected to the neurons in level $l$ via a linear operation followed by a nonlinear map, which is analogous to a fully connected layer in standard deep learning.
Intuitively, $\theta$ is the set of learnable weights of the model, while $h=\{h_{0},h_{1},\ldots,h_{L}\}$ is the data-point-dependent latent state, containing the abstract representations for the given observations.
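As a minimal sketch of this parameterization, the NumPy snippet below builds a small fully connected PCN, computes the predictions $\mu_{l}=f_{l}(h_{l-1},\theta_{l})$, and draws one ancestral sample with $\Sigma_{l}=I$. The layer sizes, the tanh nonlinearity, the weight initialization, and the helper names (init_params, predict, sample) are illustrative assumptions; none of them is taken from PCX.

```python
import numpy as np

def init_params(sizes, rng):
    """theta_0 is the prior mean; theta_l = (W_l, b_l) for l >= 1 (assumed init)."""
    theta0 = np.zeros(sizes[0])
    layers = [
        (rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]
    return theta0, layers

def predict(h_prev, W, b):
    """mu_l = f_l(h_{l-1}, theta_l) = tanh(W_l h_{l-1} + b_l)."""
    return np.tanh(W @ h_prev + b)

def sample(theta0, layers, rng):
    """Ancestral sampling with Sigma_l = I: h_l ~ N(mu_l, I) for every level."""
    h = [theta0 + rng.standard_normal(theta0.shape)]
    for W, b in layers:
        mu = predict(h[-1], W, b)
        h.append(mu + rng.standard_normal(mu.shape))
    return h

rng = np.random.default_rng(0)
theta0, layers = init_params([4, 8, 3], rng)
h = sample(theta0, layers, rng)
print([h_l.shape for h_l in h])  # one state vector per level
```

Each entry of h plays the role of one $h_{l}$, and each (W, b) pair plays the role of one $\theta_{l}$.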

Training.

In supervised settings, training consists of learning the relationship between given pairs of input-output observations $(x,y)$. In PC, this is performed by maximizing the joint likelihood of our generative model with the latent vectors $h_{0}$ and $h_{L}$ respectively fixed to the input and label of the provided data-point: $P_{\theta}(h\rvert_{h_{0}=x,h_{L}=y})=P_{\theta}(h_{L}=y,\ldots,h_{1},h_{0}=x)$. This is achieved by minimizing the so-called variational free energy $\mathcal{F}$ (Friston et al., 2007):

$$\mathcal{F}(h,\theta)=-\ln P_{\theta}(h)=-\ln\left(\mathcal{N}(h_{0};\mu_{0})\prod_{l=1}^{L}\mathcal{N}(h_{l};f_{l}(h_{l-1},\theta_{l}))\right)=\sum_{l=0}^{L}\frac{1}{2}(h_{l}-\mu_{l})^{2}+k. \tag{1}$$
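For concreteness, the last equality in Eq. (1) can be checked level by level. The short derivation below assumes the identity covariance used throughout this work, reads $(h_{l}-\mu_{l})^{2}$ as a squared Euclidean norm, and writes $d_{l}$ for the number of neurons in level $l$ (a symbol introduced here only for illustration).

```latex
-\ln \mathcal{N}(h_{l};\mu_{l},I)
  = \frac{1}{2}(h_{l}-\mu_{l})^{\top}(h_{l}-\mu_{l}) + \frac{d_{l}}{2}\ln(2\pi),
\qquad\text{so}\qquad
k = \sum_{l=0}^{L}\frac{d_{l}}{2}\ln(2\pi).
```

Since $k$ depends on neither $h$ nor $\theta$, it plays no role in the minimization.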

The quantity $\epsilon_{l}=(h_{l}-\mu_{l})$ is often referred to as the prediction error of layer $l$, being the difference between the predicted activation $\mu_{l}$ and the current state $h_{l}$. Refer to the appendix for a full derivation of Eq. (1). To minimize $\mathcal{F}$, the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is used, iteratively optimizing first the state $h$ and then the weights $\theta$ according to the equations

$$h^{*}=\underset{h}{\arg\min}\,\mathcal{F}(h,\theta),\qquad \theta^{*}=\underset{\theta}{\arg\min}\,\mathcal{F}(h^{*},\theta). \tag{2}$$

We refer to the first step described by Eq. (2) as the inference phase and to the second as the learning phase. In practice, we do not train on a single pair $(x,y)$ but on a dataset split into mini-batches that are subsequently used to train the model parameters. Furthermore, both inference and learning are approximated via gradient descent on the variational free energy. In the inference phase, $h$ is first initialized to a value $h^{(0)}$ and then optimized for $T$ iterations. Then, during the learning phase, we use the newly computed values to perform a single update on the weights $\theta$. The gradients of the variational free energy with respect to both $h$ and $\theta$ are as follows:

$$\nabla h_{l}=\frac{\partial\mathcal{F}}{\partial h_{l}}=\frac{1}{2}\left(\frac{\partial\epsilon_{l}^{2}}{\partial h_{l}}+\frac{\partial\epsilon_{l+1}^{2}}{\partial h_{l}}\right),\qquad \nabla\theta_{l}=\frac{\partial\mathcal{F}}{\partial\theta_{l}}=\frac{1}{2}\frac{\partial\epsilon_{l}^{2}}{\partial\theta_{l}}. \tag{3}$$

Then, a new batch of data points is provided to the model and the process is repeated until convergence. As highlighted by Eq. (3), each state and each parameter is updated using local information, as the gradients depend exclusively on the pre- and post-synaptic errors $\epsilon_{l}$ and $\epsilon_{l+1}$. This is the main reason why, in contrast to BP, PC is a local algorithm and is considered more biologically plausible. In Appendix A, we provide an algorithmic description of the concepts illustrated in these paragraphs, highlighting how each equation is translated to code in PCX.
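The inference and learning phases can be sketched in plain NumPy for a single data point. This is not the PCX implementation: the tanh layers, the step sizes lr_h and lr_theta, the forward-pass initialization of the hidden states, the omission of the prior term on the clamped $h_{0}$, and all function names are assumptions made for illustration. For $f_{l}=\tanh(W_{l}h_{l-1}+b_{l})$, Eq. (3) becomes $\partial\mathcal{F}/\partial h_{l}=\epsilon_{l}-W_{l+1}^{\top}\big((1-\mu_{l+1}^{2})\odot\epsilon_{l+1}\big)$ and $\partial\mathcal{F}/\partial W_{l}=-\big((1-\mu_{l}^{2})\odot\epsilon_{l}\big)h_{l-1}^{\top}$.

```python
import numpy as np

def errors(h, layers):
    """Predictions mu_l and errors eps_l = h_l - mu_l for l = 1..L."""
    mus = [np.tanh(W @ h_prev + b) for (W, b), h_prev in zip(layers, h[:-1])]
    eps = [h_l - mu for h_l, mu in zip(h[1:], mus)]
    return mus, eps

def free_energy(h, layers):
    """F = sum_l 0.5 * ||eps_l||^2 (constant k and the h_0 prior term dropped)."""
    _, eps = errors(h, layers)
    return 0.5 * sum(float(e @ e) for e in eps)

def init_states(x, y, layers):
    """Clamp h_0 = x and h_L = y; hidden states start from a forward pass."""
    h = [x]
    for W, b in layers[:-1]:
        h.append(np.tanh(W @ h[-1] + b))
    h.append(y)
    return h

def pc_step(x, y, layers, T=20, lr_h=0.1, lr_theta=0.01):
    """One PC update: T inference steps on h, then a single step on theta."""
    h = init_states(x, y, layers)
    # Inference: gradient descent on the hidden states h_1..h_{L-1}.
    for _ in range(T):
        mus, eps = errors(h, layers)
        for l in range(1, len(h) - 1):
            W_next = layers[l][0]
            # eps[l - 1] holds eps_l, while eps[l] and mus[l] belong to level l + 1.
            back = W_next.T @ ((1.0 - mus[l] ** 2) * eps[l])
            h[l] = h[l] - lr_h * (eps[l - 1] - back)
    # Learning: one gradient step on every theta_l = (W_l, b_l).
    mus, eps = errors(h, layers)
    new_layers = []
    for (W, b), h_prev, mu, e in zip(layers, h[:-1], mus, eps):
        delta = (1.0 - mu ** 2) * e  # equals -dF/db_l
        new_layers.append((W + lr_theta * np.outer(delta, h_prev),
                           b + lr_theta * delta))
    return new_layers, h

rng = np.random.default_rng(0)
sizes = [4, 8, 3]
layers = [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
x, y = rng.standard_normal(4), np.array([0.0, 1.0, 0.0])
f_before = free_energy(init_states(x, y, layers), layers)
new_layers, h = pc_step(x, y, layers)
f_after = free_energy(h, layers)
print(f_before, f_after)  # inference lowers F when lr_h is small enough
```

Both updates only touch quantities of adjacent levels, which is the locality property discussed above. A mini-batch version would average the weight gradients over the batch before applying them.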

Figure 1: Left: Generative and discriminative modes, Right: Pseudocode of one parameter update of PC in supervised learning, as well as an informal description of the different algorithms considered in this work.

Evaluation.

This phase is similar to the inference phase, with the difference that we perform it on a test point $\bar{x}$, used to infer the label $\bar{y}$. This is achieved by fixing $h_{0}=\bar{x}$ and computing the most likely value of the latent states $h^{*}\rvert_{h_{0}=\bar{x}}$, again using the state gradients of Eq. (3). We refer to this as discriminative mode. In practice, for discriminative networks, the values of the latent states computed this way are equivalent to those obtained via a forward pass, that is, setting $h_{l}^{(0)}=\mu_{l}^{(0)}$ for every $l\neq 0$, as it corresponds to the global minimum of $\mathcal{F}$ (Frieder and Lukasiewicz, 2022).
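The equivalence with a forward pass can be illustrated with a small NumPy sketch: when every state is set to its own prediction, all prediction errors vanish and the free energy (up to the constant $k$) is zero. The tanh layers, the sizes, the random weights, and the function names are assumptions made for illustration, and the prior term on the clamped $h_{0}$ is omitted.

```python
import numpy as np

def forward_pass(x, layers):
    """Clamp h_0 = x and set h_l = mu_l for every l >= 1."""
    h = [x]
    for W, b in layers:
        h.append(np.tanh(W @ h[-1] + b))
    return h

def free_energy(h, layers):
    """F = sum_l 0.5 * ||h_l - mu_l||^2 for l = 1..L (constant k dropped)."""
    eps = [h_l - np.tanh(W @ h_prev + b)
           for (W, b), h_prev, h_l in zip(layers, h[:-1], h[1:])]
    return 0.5 * sum(float(e @ e) for e in eps)

rng = np.random.default_rng(0)
sizes = [4, 8, 3]
layers = [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
x_bar = rng.standard_normal(4)
h_star = forward_pass(x_bar, layers)
y_bar = int(np.argmax(h_star[-1]))  # predicted class index
f_star = free_energy(h_star, layers)  # vanishes: every prediction error is zero
print(y_bar, f_star)
```

Because the squared errors are non-negative, a configuration with zero free energy is a global minimum, so no further inference steps are needed at test time in this setting.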

Generative Mode.

So far, we have only described how to use PCNs to perform supervised training. However, as we will see in the experimental section, such models can also be used (and were initially developed to be used) to perform unsupervised learning tasks. Given a datapoint $x$, the goal is to use PCNs to compress the information of $x$ into a latent representation, conceptually similar to how variational autoencoders work (Kingma and Welling, 2013). Such a compression, which should contain all the information needed to generate $x$, is computed by fixing the state vector $h_{L}$ to the data-point and running inference (that is, we maximize $P_{\theta}(h\rvert_{h_{L}=x})$ via gradient descent on $h$) until the process has converged. The compressed representation will then be the value of $h_{0}$ at convergence (or, in practice, after $T$ steps). If we are training the model, we then perform a gradient update on the parameters to minimize the variational free energy of Eq. (1), as we do in supervised learning. A sketch of the discriminative and generative ways of training PCNs is represented in Fig. 1 (left).
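A minimal sketch of generative inference, under simplifying assumptions, is given below for a two-level network ($L=1$): the data level $h_{1}$ is clamped to $x$ and the latent code $h_{0}$ is found by gradient descent on $\mathcal{F}=\frac{1}{2}\|h_{0}-\theta_{0}\|^{2}+\frac{1}{2}\|x-\tanh(Wh_{0}+b)\|^{2}$. The tanh decoder, the sizes, the step size, the number of steps, and the function names are illustrative choices and not the PCX API.

```python
import numpy as np

def free_energy(h0, x, theta0, W, b):
    """F for a two-level PCN with h_1 clamped to x (constant k dropped)."""
    eps0 = h0 - theta0                 # error against the prior mean
    eps1 = x - np.tanh(W @ h0 + b)     # reconstruction error at the data level
    return 0.5 * float(eps0 @ eps0 + eps1 @ eps1)

def generative_inference(x, theta0, W, b, T=50, lr_h=0.1):
    """Infer the latent code h_0 by gradient descent on F."""
    h0 = theta0.copy()                 # start from the prior mean
    for _ in range(T):
        mu1 = np.tanh(W @ h0 + b)
        eps0 = h0 - theta0
        eps1 = x - mu1
        grad = eps0 - W.T @ ((1.0 - mu1 ** 2) * eps1)
        h0 = h0 - lr_h * grad
    return h0

rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.5, size=(6, 2))  # decoder weights: 2 latents, 6 pixels
b = np.zeros(6)
theta0 = np.zeros(2)
x = np.tanh(W @ np.array([1.0, -1.0]))  # toy data point
code = generative_inference(x, theta0, W, b)
f_before = free_energy(theta0, x, theta0, W, b)
f_after = free_energy(code, x, theta0, W, b)
print(code, f_before, f_after)
```

The vector code is the compressed representation after $T$ steps. During training, this inference would be followed by one gradient step on $(\theta_{0},W,b)$, as in the supervised case.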

4 Experiments and Benchmarks

This section is divided into two parts, which correspond to discriminative and generative inference tasks. The first focuses on classification tasks; the second on unsupervised generation. A sketch illustrating the two modes is provided in Fig. 1. To provide a comprehensive evaluation, we test our models on multiple computer vision datasets (MNIST, FashionMNIST, CIFAR10/100, CelebA, and Tiny ImageNet); on models of increasing complexity, with both feedforward and convolutional layers; and with multiple learning algorithms present in the literature.

Algorithms.

We consider various learning algorithms present in the literature: (1) Standard PC, already discussed in the background section; (2) Incremental PC (iPC), a simple and recently proposed modification where the weight parameters are updated alongside the latent variables at every time step; (3) Monte Carlo PC (MCPC), obtained by applying unadjusted Langevin dynamics to the inference process; (4) Positive nudging (PN), where the target used is obtained by a small perturbation of the output towards the original one-hot label; (5) Negative nudging (NN), where the target is obtained by a small perturbation of the output away from the label, and the weights are updated in the opposite direction; (6) Centered nudging (CN), where we alternate epochs of positive and negative nudging. Among these, PC, iPC, and MCPC will be used for the generative mode, and PC, iPC, PN, NN, and CN for the discriminative mode. See Fig. 1, and the supplementary material, for a more detailed description.
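The informal descriptions of nudging and MCPC can be made concrete with a small NumPy sketch. It is only one possible reading of the text above: the linear interpolation used for the nudged target, the nudging strength beta, the choice of which epochs are positive in CN, the step size, and all function names are assumptions. The Langevin update is the standard unadjusted form, a gradient step plus Gaussian noise of variance $2\times$ step.

```python
import numpy as np

def nudged_target(output, label, beta):
    """Move the output a fraction beta toward the label (beta > 0)
    or away from it (beta < 0)."""
    return output + beta * (label - output)

def nudging_schedule(algorithm, epoch, beta=0.1):
    """Return (signed beta, sign applied to the weight update)."""
    if algorithm == "PN":
        return beta, 1.0
    if algorithm == "NN":
        return -beta, -1.0
    if algorithm == "CN":  # alternate positive and negative epochs
        return (beta, 1.0) if epoch % 2 == 0 else (-beta, -1.0)
    raise ValueError(f"unknown algorithm: {algorithm}")

def langevin_step(h, grad_h, step, rng):
    """Unadjusted Langevin update of the states, as used conceptually in MCPC."""
    noise = rng.standard_normal(h.shape)
    return h - step * grad_h + np.sqrt(2.0 * step) * noise

output = np.array([0.2, 0.5, 0.3])
label = np.array([0.0, 1.0, 0.0])
beta_pn, sign_pn = nudging_schedule("PN", epoch=0)
beta_nn, sign_nn = nudging_schedule("NN", epoch=0)
target_pn = nudged_target(output, label, beta_pn)  # approx. [0.18, 0.55, 0.27]
target_nn = nudged_target(output, label, beta_nn)  # approx. [0.22, 0.45, 0.33]

rng = np.random.default_rng(0)
h_new = langevin_step(np.zeros(3), np.zeros(3), 0.01, rng)
print(target_pn, target_nn, h_new)
```

In this reading, the nudged target replaces the label as the clamped value of $h_{L}$ during inference, and the returned sign multiplies the weight update. iPC needs no extra helper: it would apply the weight update inside the inference loop at every step instead of once after $T$ steps.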

4.1 Discriminative Mode

Table 1: Test accuracies of the different algorithms on different datasets.
| % Accuracy | PC-CE | PC-SE | PN | NN | CN | iPC | BP-CE | BP-SE |
|---|---|---|---|---|---|---|---|---|
| MLP | | | | | | | | |
| MNIST | $98.11^{\pm 0.03}$ | $98.26^{\pm 0.04}$ | $98.36^{\pm 0.06}$ | $98.26^{\pm 0.07}$ | $98.23^{\pm 0.09}$ | $\mathbf{98.45^{\pm 0.09}}$ | $98.07^{\pm 0.06}$ | $98.29^{\pm 0.08}$ |
| FashionMNIST | $89.16^{\pm 0.08}$ | $89.58^{\pm 0.13}$ | $89.57^{\pm 0.08}$ | $89.46^{\pm 0.08}$ | $89.56^{\pm 0.05}$ | $\mathbf{89.90^{\pm 0.06}}$ | $89.04^{\pm 0.08}$ | $89.48^{\pm 0.07}$ |
| VGG-5 | | | | | | | | |
| CIFAR-10 | $88.06^{\pm 0.13}$ | $87.98^{\pm 0.11}$ | $88.42^{\pm 0.66}$ | $88.83^{\pm 0.04}$ | $\mathbf{89.47^{\pm 0.13}}$ | $85.51^{\pm 0.12}$ | $88.11^{\pm 0.13}$ | $89.43^{\pm 0.12}$ |
| CIFAR-100 (Top-1) | $60.00^{\pm 0.19}$ | $54.08^{\pm 1.66}$ | $64.70^{\pm 0.25}$ | $65.46^{\pm 0.05}$ | $\mathbf{67.19^{\pm 0.24}}$ | $56.07^{\pm 0.16}$ | $60.82^{\pm 0.10}$ | $66.28^{\pm 0.23}$ |
| CIFAR-100 (Top-5) | $84.97^{\pm 0.19}$ | $78.70^{\pm 1.00}$ | $84.74^{\pm 0.38}$ | $85.15^{\pm 0.16}$ | $\mathbf{86.60^{\pm 0.18}}$ | $78.91^{\pm 0.23}$ | $85.84^{\pm 0.14}$ | $85.85^{\pm 0.27}$ |
| Tiny ImageNet (Top-1) | $41.29^{\pm 0.2}$ | $30.28^{\pm 0.2}$ | $34.61^{\pm 0.2}$ | $\mathbf{46.40^{\pm 0.1}}$ | $46.38^{\pm 0.11}$ | $29.94^{\pm 0.47}$ | $43.72^{\pm 0.1}$ | $44.90^{\pm 0.2}$ |
| Tiny ImageNet (Top-5) | $66.68^{\pm 0.09}$ | $57.31^{\pm 0.21}$ | $59.91^{\pm 0.24}$ | $68.50^{\pm 0.18}$ | $69.06^{\pm 0.10}$ | $54.73^{\pm 0.52}$ | $\mathbf{69.23^{\pm 0.23}}$ | $65.26^{\pm 0.37}$ |
VGG-7
CIFAR-100 (Top-1) 56.80±0.14superscript56.80plus-or-minus0.1456.80^{\pm 0.14}56.80 start_POSTSUPERSCRIPT ± 0.14 end_POSTSUPERSCRIPT 37.52±2.60superscript37.52plus-or-minus2.6037.52^{\pm 2.60}37.52 start_POSTSUPERSCRIPT ± 2.60 end_POSTSUPERSCRIPT 56.56±0.13superscript56.56plus-or-minus0.1356.56^{\pm 0.13}56.56 start_POSTSUPERSCRIPT ± 0.13 end_POSTSUPERSCRIPT 59.97±0.41superscript59.97plus-or-minus0.4159.97^{\pm 0.41}59.97 start_POSTSUPERSCRIPT ± 0.41 end_POSTSUPERSCRIPT 64.76±0.17superscript64.76plus-or-minus0.1764.76^{\pm 0.17}64.76 start_POSTSUPERSCRIPT ± 0.17 end_POSTSUPERSCRIPT 43.99±0.30superscript43.99plus-or-minus0.3043.99^{\pm 0.30}43.99 start_POSTSUPERSCRIPT ± 0.30 end_POSTSUPERSCRIPT 59.96±0.10superscript59.96plus-or-minus0.1059.96^{\pm 0.10}59.96 start_POSTSUPERSCRIPT ± 0.10 end_POSTSUPERSCRIPT 65.36±0.15superscript65.36plus-or-minus0.15\mathbf{65.36^{\pm 0.15}}bold_65.36 start_POSTSUPERSCRIPT ± bold_0.15 end_POSTSUPERSCRIPT
CIFAR-100 (Top-5) 83.00±0.09superscript83.00plus-or-minus0.0983.00^{\pm 0.09}83.00 start_POSTSUPERSCRIPT ± 0.09 end_POSTSUPERSCRIPT 66.73±2.37superscript66.73plus-or-minus2.3766.73^{\pm 2.37}66.73 start_POSTSUPERSCRIPT ± 2.37 end_POSTSUPERSCRIPT 81.52±0.17superscript81.52plus-or-minus0.1781.52^{\pm 0.17}81.52 start_POSTSUPERSCRIPT ± 0.17 end_POSTSUPERSCRIPT 81.50±0.41superscript81.50plus-or-minus0.4181.50^{\pm 0.41}81.50 start_POSTSUPERSCRIPT ± 0.41 end_POSTSUPERSCRIPT 84.65±0.18superscript84.65plus-or-minus0.1884.65^{\pm 0.18}84.65 start_POSTSUPERSCRIPT ± 0.18 end_POSTSUPERSCRIPT 73.23±0.30superscript73.23plus-or-minus0.3073.23^{\pm 0.30}73.23 start_POSTSUPERSCRIPT ± 0.30 end_POSTSUPERSCRIPT 85.61±0.10superscript85.61plus-or-minus0.10\mathbf{85.61^{\pm 0.10}}bold_85.61 start_POSTSUPERSCRIPT ± bold_0.10 end_POSTSUPERSCRIPT 84.41±0.26superscript84.41plus-or-minus0.2684.41^{\pm 0.26}84.41 start_POSTSUPERSCRIPT ± 0.26 end_POSTSUPERSCRIPT
Tiny ImageNet (Top-1) 41.15±0.14superscript41.15plus-or-minus0.1441.15^{\pm 0.14}41.15 start_POSTSUPERSCRIPT ± 0.14 end_POSTSUPERSCRIPT 21.28±0.46superscript21.28plus-or-minus0.4621.28^{\pm 0.46}21.28 start_POSTSUPERSCRIPT ± 0.46 end_POSTSUPERSCRIPT 25.53±0.77superscript25.53plus-or-minus0.7725.53^{\pm 0.77}25.53 start_POSTSUPERSCRIPT ± 0.77 end_POSTSUPERSCRIPT 39.49±2.69superscript39.49plus-or-minus2.6939.49^{\pm 2.69}39.49 start_POSTSUPERSCRIPT ± 2.69 end_POSTSUPERSCRIPT 35.59±7.69superscript35.59plus-or-minus7.6935.59^{\pm 7.69}35.59 start_POSTSUPERSCRIPT ± 7.69 end_POSTSUPERSCRIPT 19.76±0.15superscript19.76plus-or-minus0.1519.76^{\pm 0.15}19.76 start_POSTSUPERSCRIPT ± 0.15 end_POSTSUPERSCRIPT 45.32±0.11superscript45.32plus-or-minus0.1145.32^{\pm 0.11}45.32 start_POSTSUPERSCRIPT ± 0.11 end_POSTSUPERSCRIPT 46.08±0.15superscript46.08plus-or-minus0.15\mathbf{46.08^{\pm 0.15}}bold_46.08 start_POSTSUPERSCRIPT ± bold_0.15 end_POSTSUPERSCRIPT
Tiny ImageNet (Top-5) 66.25±0.11superscript66.25plus-or-minus0.1166.25^{\pm 0.11}66.25 start_POSTSUPERSCRIPT ± 0.11 end_POSTSUPERSCRIPT 44.92±0.27superscript44.92plus-or-minus0.2744.92^{\pm 0.27}44.92 start_POSTSUPERSCRIPT ± 0.27 end_POSTSUPERSCRIPT 50.06±0.84superscript50.06plus-or-minus0.8450.06^{\pm 0.84}50.06 start_POSTSUPERSCRIPT ± 0.84 end_POSTSUPERSCRIPT 64.66±1.95superscript64.66plus-or-minus1.9564.66^{\pm 1.95}64.66 start_POSTSUPERSCRIPT ± 1.95 end_POSTSUPERSCRIPT 59.63±6.00superscript59.63plus-or-minus6.0059.63^{\pm 6.00}59.63 start_POSTSUPERSCRIPT ± 6.00 end_POSTSUPERSCRIPT 40.36±0.22superscript40.36plus-or-minus0.2240.36^{\pm 0.22}40.36 start_POSTSUPERSCRIPT ± 0.22 end_POSTSUPERSCRIPT 69.64±0.18superscript69.64plus-or-minus0.18\mathbf{69.64^{\pm 0.18}}bold_69.64 start_POSTSUPERSCRIPT ± bold_0.18 end_POSTSUPERSCRIPT 66.65±0.20superscript66.65plus-or-minus0.2066.65^{\pm 0.20}66.65 start_POSTSUPERSCRIPT ± 0.20 end_POSTSUPERSCRIPT

Here, we test the performance of PCNs on image classification tasks. We compare PC against BP, using both Squared Error (SE) and Cross Entropy (CE) losses, adapting the energy function as described in Pinchetti et al. (2022). For the experiments on MNIST and FashionMNIST, we use feedforward models with 3 hidden layers of 128 neurons each, while for CIFAR10/100 and Tiny ImageNet, we compare VGG-like models. We performed a large hyperparameter search over learning rates, optimizers, activation functions, and algorithm-specific parameters. All the details needed to reproduce the experiments, as well as a discussion of the ‘lessons learned’ during such a large search, are in Appendix B. Results, averaged over 5 seeds, are reported in Tab. 1.
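As an illustration of how the Top-1 and Top-5 entries of such a table can be computed, the following is a minimal sketch in plain Python. The function names, the toy scores, and the use of the sample standard deviation as the reported spread are assumptions made for this example; they are not taken from PCX or from the experimental code.

```python
import statistics


def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the largest score for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


def summarize(per_seed_values):
    """Mean and sample standard deviation over seeds (assumed spread measure)."""
    return statistics.mean(per_seed_values), statistics.stdev(per_seed_values)


# Hypothetical class scores for three samples and three classes.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
print(summarize([89.1, 89.2, 89.3]))
```

In practice, `top_k_accuracy` would be evaluated once per seed on the test set, and `summarize` applied to the resulting list of per-seed accuracies.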

Discussion.

The results show that the best performing algorithms, at least on the most complex tasks, are the ones where the target is nudged towards the real label, namely PN and NN. This is in line with previous findings in the Eqprop literature (Scellier et al., 2024). The recently proposed iPC, on the other hand, performs well on small architectures, as it is the best performing algorithm on MNIST and FashionMNIST, but its performance worsens when training large architectures. More broadly, the performance is comparable to that of backprop, except on the largest model. An interesting observation is that all the best results for PC have been achieved using VGG-5, which has always outperformed the deeper VGG-7. Future work should investigate the reason for this phenomenon, as scaling up to more complex datasets will require the use of much deeper architectures, such as ResNets (He et al., 2016). In Section 5, we analyze possible causes and compare the wall-clock time of the different algorithms.

Table 2: MSE loss for image reconstruction of BP, PC, and iPC on different datasets.
MSE ($\times 10^{-3}$) PC iPC BP
MNIST $9.25^{\pm 0.00}$ $9.09^{\pm 0.00}$ $\mathbf{9.08^{\pm 0.00}}$
FashionMNIST $10.56^{\pm 0.01}$ $10.11^{\pm 0.01}$ $\mathbf{10.04^{\pm 0.00}}$
CIFAR-10 $6.67^{\pm 0.10}$ $\mathbf{5.50^{\pm 0.01}}$ $6.17^{\pm 0.46}$
CelebA $2.35^{\pm 0.12}$ $\mathbf{1.30^{\pm 0.12}}$ $3.34^{\pm 0.30}$
Figure 2: CIFAR10 image reconstruction via autoencoding convolutional networks. In order: original, PC, iPC, BP, BP (half of the parameters).
Figure 3: Generative samples obtained by MCPC. Left: Contour plot of learned generative distribution compared to Iris data samples (x). Right: Samples obtained for a PCN trained on MNIST. In order: unconditional generation, conditional generation (odd), conditional generation (even). For more samples, please refer to Appendix C.2.

4.2 Generative Mode

We test the performance of PCNs on image generation tasks. We perform three different kinds of experiments: (1) generation from a posterior distribution; (2) generation via sampling from the learned joint distribution; and (3) associative memory retrieval. In the first case, we provide a test image $y$ to a trained model, run inference to compute a compressed representation $\bar{x}$ (stored in the latent vector $h_0$ at convergence), and produce a reconstruction $\bar{y} = h_L$ by performing a forward pass with $h_0 = \bar{x}$. The models we consider are three-layer networks. As this is an autoencoding task, we compare against autoencoders with a three-layer encoder/decoder structure (so, six layers in total). For MNIST and FashionMNIST, we use feedforward layers; for CIFAR10 and CelebA, we use deconvolutional layers (and convolutional ones for the encoder). The results in Tab. 2 and Fig. 2 report comparable performance, with a small advantage for PC compared to BP on the more complex convolutional tasks. In this case, iPC is the best performing algorithm, probably due to the small size of the considered models, which allows for better stability. Furthermore, note that the BP architectures have double the number of parameters, since the PC networks are decoder-only. If we halve the number of features in each layer of the autoencoder architecture (while keeping the bottleneck dimension unchanged), we get significantly reduced performance for BP (Fig. 2, bottom), achieving a loss of $10.66^{\pm 0.94} \times 10^{-3}$ on CIFAR10. Details about these and the following experiments are provided in Appendix C.
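The reconstruction procedure described above (infer a latent $\bar{x}$ for a given $y$, then run a forward pass from it) can be sketched on a toy decoder. The example below is a simplification under stated assumptions: it uses a single linear layer instead of a three-layer network, a fixed hand-written weight matrix instead of trained weights, and plain gradient descent on the squared prediction error as the inference rule. All names (`decode`, `infer_latent`, `gamma`) are hypothetical and not part of PCX.

```python
def decode(W, x):
    """Forward pass of a single linear decoder layer: y_hat = W x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def mse(a, b):
    """Mean squared error between two equally sized vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def infer_latent(W, y, n_steps=500, gamma=0.1):
    """Find the latent x that minimizes F(x) = 0.5 * ||y - W x||^2."""
    x = [0.0] * len(W[0])
    for _ in range(n_steps):
        # Prediction error at the output layer, which is clamped to y.
        err = [yi - pi for yi, pi in zip(y, decode(W, x))]
        # dF/dx_j = -sum_i W[i][j] * err[i]
        grad = [-sum(W[i][j] * err[i] for i in range(len(W)))
                for j in range(len(x))]
        x = [xj - gamma * gj for xj, gj in zip(x, grad)]
    return x


# Toy decoder mapping a 2-dimensional latent to a 3-dimensional "image".
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [0.5, -1.0, -0.5]

x_bar = infer_latent(W, y)   # compressed representation
y_bar = decode(W, x_bar)     # reconstruction via a forward pass
print(x_bar, mse(y, y_bar))
```

Because the decoder is the only set of weights involved, no separate encoder is needed: the encoding step is replaced by the iterative inference loop.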

Table 3: MSE ($\times 10^{-4}$) of associative memory tasks. Columns indicate the number of hidden neurons, while rows show the number of training images to memorize. Results over 5 seeds.
Noise 512 1024 2048
50 $6.06^{\pm 0.11}$ $5.91^{\pm 0.14}$ $5.95^{\pm 0.06}$
100 $6.99^{\pm 0.19}$ $6.76^{\pm 0.23}$ $6.16^{\pm 0.07}$
250 $9.95^{\pm 0.05}$ $10.14^{\pm 0.06}$ $8.90^{\pm 0.06}$
Mask 512 1024 2048
50 $0.06^{\pm 0.02}$ $0.01^{\pm 0.00}$ $0.00^{\pm 0.00}$
100 $1.15^{\pm 0.78}$ $1.01^{\pm 0.79}$ $0.11^{\pm 0.03}$
250 $39.1^{\pm 10.8}$ $3.74^{\pm 0.73}$ $0.22^{\pm 0.06}$
Figure 4: Memory recalled images. Top: Original images. Left: Noisy input (Gaussian noise, $\sigma = 0.2$) and reconstruction. Right: Masked input (bottom half removed) and reconstruction.

For the second category of experiments, we tested the capability of PCNs to learn, and sample from, a complex probability distribution. MCPC extends PC by adding Gaussian noise to the activity updates of each neuron. This change shifts the inference of PCNs from a variational approximation to Monte Carlo sampling of the posterior using Langevin dynamics, and enables a PCN to learn and generate samples analogously to a variational autoencoder (VAE). Data samples can be generated from the learned joint $P_{\theta}(h)$ by leaving all states $h_l$ free and performing noisy inference updates. Figure 3 illustrates MCPC’s ability to learn non-linear multimodal distributions using the Iris dataset (Pedregosa et al., 2011) and shows generative samples for MNIST. When comparing MCPC to a VAE on MNIST, both models produced samples of similar quality despite the VAE having twice the number of parameters. MCPC achieved a lower FID score (MCPC: $2.53^{\pm 0.17}$ vs. VAE: $4.19^{\pm 0.38}$), whereas the VAE attained a higher inception score (VAE: $7.91^{\pm 0.03}$ vs. MCPC: $7.13^{\pm 0.10}$).
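To make the idea of noisy inference updates concrete, the following minimal sketch runs unadjusted Langevin dynamics on a single scalar state with a hand-written Gaussian energy $F(h) = (h - \mu)^2 / (2\sigma^2)$. It is only meant to show that adding suitably scaled Gaussian noise to a gradient step turns inference into sampling. The energy, the noise scaling $\sqrt{2\gamma}$, and all names are assumptions for this illustration and do not reproduce the MCPC implementation.

```python
import math
import random
import statistics


def langevin_samples(grad_energy, h0, gamma, n_steps, burn_in, rng):
    """Sample from p(h) proportional to exp(-F(h)) with noisy gradient steps."""
    h = h0
    samples = []
    for t in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Gradient step on the energy plus Gaussian noise (Langevin update).
        h = h - gamma * grad_energy(h) + math.sqrt(2.0 * gamma) * noise
        if t >= burn_in:
            samples.append(h)
    return samples


MU, SIGMA = 3.0, 1.0


def grad_energy(h):
    """Gradient of F(h) = (h - MU)^2 / (2 * SIGMA^2)."""
    return (h - MU) / SIGMA ** 2


rng = random.Random(0)
samples = langevin_samples(grad_energy, h0=0.0, gamma=0.1,
                           n_steps=51_000, burn_in=1_000, rng=rng)
print(statistics.mean(samples), statistics.variance(samples))
```

With a finite step size the sampled variance is slightly larger than $\sigma^2$, which is the usual discretization bias of unadjusted Langevin dynamics.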

In the associative memory (AM) experiments, we test how well the model is able to reconstruct an image already present in the training set after it is provided with an incomplete or corrupted version of it, as done in previous work (Salvatori et al., 2021). Fig. 4 shows the results obtained by a PCN with 2 hidden layers of 512 neurons given noise- or mask-corrupted images. In Tab. 3, we study the memory capacity as the number of hidden neurons and the number of training images increase. No visual difference between the recalled and original images can be observed for MSE up to $0.005$. To evaluate efficiency, we then trained a PCN with 5 hidden layers of 512 neurons on 500 Tiny ImageNet samples, with a batch size of 50 and 50 inference iterations during training. Training takes $0.40 \pm 0.005$ seconds per epoch on an Nvidia V100 GPU.
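The two corruption types and the recall metric used in these experiments can be sketched as follows. The example covers only the input corruption and the MSE computation, not the PCN recall itself. Images are assumed to be flattened row-major lists of pixel intensities, and masked pixels are assumed to be set to zero; both choices, like the function names, are illustrative assumptions.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Corrupt every pixel with zero-mean Gaussian noise of std sigma."""
    return [p + rng.gauss(0.0, sigma) for p in image]


def mask_bottom_half(image, height, width):
    """Keep the top half of a row-major image and zero out the bottom half."""
    keep = (height // 2) * width
    return image[:keep] + [0.0] * (len(image) - keep)


def mse(a, b):
    """Mean squared error between a recalled image and the original."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


rng = random.Random(0)
original = [1.0] * 16                         # hypothetical 4x4 image
noisy = add_gaussian_noise(original, sigma=0.2, rng=rng)
masked = mask_bottom_half(original, height=4, width=4)

print(mse(noisy, original))    # error of the noisy cue itself
print(mse(masked, original))   # error of the masked cue itself
```

A memory model is then scored by feeding it `noisy` or `masked` and computing `mse` between its output and `original`; a perfect recall gives an MSE of zero.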

Discussion.

The results show that PC is able to perform generative tasks, as well as associative memory ones, using decoder-only architectures. Via inference, PCNs are able to encode complex probability distributions in their latent states, which can be used to perform a variety of different tasks, as we have shown. Thus, compared to the BP-trained autoencoders and VAEs considered here, PCNs are more flexible and require only half the parameters to achieve similar performance. This comes at a higher computational cost due to the number of inference steps to perform. Future work should look into this issue, aiming at reducing the inference time by propagating the information more efficiently through the network.

5 Analysis and metrics

In this section, we report several metrics that we believe are important to understand the current state and challenges of training networks with PC, and compare them with standard models trained with gradient descent and backprop when suitable. In more detail, we discuss how the energy propagates through the model, and how stable training is when changing parameters, initializations, and optimizers. A better understanding of such phenomena would allow us to solve the current problems of PCNs and, hence, scale up to the training of larger models on more complex datasets.

5.1 Energy and stability

The first study we perform regards the initialization of the network states $h$, and how this influences the performance of the model. In the literature, the states have been initialized to zero, sampled randomly from a Gaussian prior (Whittington and Bogacz, 2017), or set via a forward pass. This last technique has been the preferred option in machine learning papers, as it sets the errors $\epsilon_{l \neq L} = 0$ at every internal layer of the model. This allows the prediction error to be concentrated in the output layer only, and hence be equivalent to the SE. To provide a comparison among the three methods, we have trained a 3-layer feedforward model on FashionMNIST. The results, plotted in Fig. 5(a), show that forward initialization is indeed the better method, although the gap in performance shrinks the more iterations $T$ are performed.
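The effect of the three initialization schemes on the layer-wise errors can be illustrated on a toy chain with one scalar neuron per layer. The sketch below assumes predictions of the form $w_l \tanh(h_{l-1})$, an output state clamped to the target, and hand-picked weights; these choices and all names are hypothetical and only serve to show that forward initialization leaves a non-zero error at the output layer only.

```python
import math
import random


def predict(w, h_prev):
    """Prediction of a layer's state from the state of the layer below."""
    return w * math.tanh(h_prev)


def init_states(x, y, weights, mode, rng=None):
    """Initialize hidden states; the input is fixed to x and the output to y."""
    states = [x]
    for w in weights[:-1]:
        if mode == "zeros":
            h = 0.0
        elif mode == "gaussian":
            h = rng.gauss(0.0, 1.0)
        elif mode == "forward":
            h = predict(w, states[-1])
        else:
            raise ValueError(f"unknown mode: {mode}")
        states.append(h)
    states.append(y)  # output layer clamped to the target
    return states


def layer_errors(states, weights):
    """Prediction error of every layer: state minus its prediction."""
    return [states[l + 1] - predict(w, states[l])
            for l, w in enumerate(weights)]


weights = [0.8, -1.2, 0.5]   # hypothetical weights of a 3-layer chain
x, y = 1.0, 1.0
rng = random.Random(0)

for mode in ("zeros", "gaussian", "forward"):
    states = init_states(x, y, weights, mode, rng)
    print(mode, layer_errors(states, weights))
```

With `mode="forward"`, every hidden state equals its own prediction, so the internal errors vanish and the whole energy sits in the last layer.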

Figure 5: (a) Highest test accuracy reported for different initialization methods and iteration steps $T$ used during training. (b) Energies per layer during inference of the best performing model (which has $\gamma = 0.003$). (c) Decay in accuracy when increasing the learning rate of the states $\gamma$, tested using both SGD and Adam. (d) Imbalance between energies in the layers. All figures are obtained using a three-layer model trained on FashionMNIST.

Energy propagation.

Concentrating the total error of the model, and hence its energy, in the last layer, as done when performing forward initialization, makes it hard for the model to then propagate this energy back to the first layers. As reported in Fig. 5(b), we observe that the energy in the last layer is orders of magnitude larger than the one in the input layer, even after performing several inference steps. This behavior raises the question of whether better initialization or optimization techniques could result in a more balanced energy distribution and thus better weight updates, as learning in this unbalanced energy regime has been shown to be problematic for more complex models (Alonso et al., 2024). An easy way of quickly propagating the energy through the network is to use learning rates equal to $1.0$ for the updates of the states. However, both the results reported in Fig. 5(c) and our large experimental analysis of Section 4 show that the best performance was consistently achieved for state learning rates $\gamma$ significantly smaller than $1.0$.

We hypothesize that the current training setup for PCNs favors small state learning rates that are sub-optimal for scaling to deeper architectures. Fig. 5(d) shows the energy ratios for different state learning rates: when $\gamma \ll 1.0$, the ratios of energies between layers are small, $\nicefrac{\epsilon_{l+1}^{2}}{\epsilon_{l}^{2}} \ll 1$. The energy in the first hidden layer is on average 6 orders of magnitude below that of the last layer for $\gamma = 0.01$. While models trained with large $\gamma$ values achieve better energy propagation, they achieve lower accuracy, as shown in Fig. 5(c). Note that the decay in performance as a function of increasing $\gamma$ is stronger for Adam, despite it being the overall better optimizer in our experiments. This suggests limitations in the current training techniques and possible directions for future improvements aimed at reducing the energy imbalance between layers. We provide implementation details and results on other datasets in Appendix D.
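The dependence of energy propagation on the state learning rate can be reproduced qualitatively on a toy linear chain. The sketch below is an illustration under strong assumptions: one scalar neuron per layer, identity activations, unit weights, forward initialization, and plain gradient descent on the states with energy $\frac{1}{2}\sum_l \epsilon_l^2$. The larger learning rate is set to $0.5$ rather than $1.0$ because a plain gradient step of size $1.0$ diverges on this particular chain. All names are hypothetical.

```python
def errors(states, weights):
    """eps[l - 1] is the prediction error of layer l (identity activations)."""
    return [states[l + 1] - w * states[l] for l, w in enumerate(weights)]


def run_inference(x, y, weights, gamma, n_steps):
    """Forward-initialize the states, run inference, return per-layer energies."""
    states = [x]
    for w in weights[:-1]:
        states.append(w * states[-1])
    states.append(y)  # output clamped to the target

    n_layers = len(weights)
    for _ in range(n_steps):
        eps = errors(states, weights)
        new_states = list(states)
        for l in range(1, n_layers):  # hidden layers only
            # dF/dh_l = eps_l - w_{l+1} * eps_{l+1}
            grad = eps[l - 1] - weights[l] * eps[l]
            new_states[l] = states[l] - gamma * grad
        states = new_states  # all hidden states are updated simultaneously
    return [0.5 * e * e for e in errors(states, weights)]


weights = [1.0, 1.0, 1.0]
for gamma in (0.01, 0.5):
    energies = run_inference(x=1.0, y=2.0, weights=weights,
                             gamma=gamma, n_steps=8)
    # Ratio between the energy of the first hidden layer and the last layer.
    print(gamma, energies, energies[0] / energies[-1])
```

After the same number of inference steps, the small learning rate leaves almost all the energy in the last layer, while the larger one spreads it nearly evenly across layers.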

Training stability.

Figure 6: Updating weights with AdamW becomes unstable for wide layers, as the accuracy plummets to random guessing for progressively smaller state learning rates as the network’s width increases. Contrary to what happens with SGD, the optimal state learning rate depends on the width of the layers.

We observed a further link between the weight optimizer and the structure of a PCN that might hinder the scalability of PC, that is, the influence of the hidden dimension on the performance of the model. To better study this, we trained feedforward PCNs with different hidden dimensions, state learning rates $\gamma$, and optimizers, and reported the results in Fig. 6. The results show that, when using Adam, the width strongly affects the values of the learning rate $\gamma$ for which the training process is stable. Interestingly, this phenomenon appears neither when using the SGD optimizer nor in standard networks trained with backprop. This behavioral difference with BP is unexpected and suggests the need for better optimization strategies for PCNs, as Adam was still the best choice in our experiments, but could be a bottleneck for larger architectures.

5.2 Properties of predictive coding networks

With PCX, it is straightforward to inspect and analyze several properties of PCNs. Here, we use the energy $\mathcal{F}$ to differentiate between in-distribution (ID) and out-of-distribution (OOD) samples under a semantic distribution shift (Liu et al., 2020), as well as to compute the likelihood of a dataset (Grathwohl et al., 2020). Such a shift occurs when samples are drawn from different, unseen classes, such as FashionMNIST samples under an MNIST setup (Hendrycks and Gimpel, 2017). To understand the confidence of the predictions of a PCN, we compare the distribution of the probability $p(x, \hat{y}; \theta)$ for ID and OOD samples against the distribution of softmax values of the PCN classifier, and compute their negative log-likelihoods (NLL) according to the assumptions stated in Section 3, that is,

$$\mathcal{F} = -\ln p(x, y; \theta) \implies p(x, y; \theta) = e^{-\mathcal{F}}. \tag{4}$$

Our results in Fig. 7(a) demonstrate that a trained PCN classifier can effectively (1) assess OOD samples out-of-the-box, without requiring specific training for that purpose (Yang et al., 2021), and (2) produce energy scores for ID and OOD samples that initially correlate with softmax values, prior to the optimization of the state variables $h$. However, after optimizing the states for $T$ inference steps, the scores for ID and OOD samples become decorrelated, especially for samples with lower softmax values, as shown in Fig. 7(b). To corroborate this observation, we also present ROC curves for the most challenging samples, including only the lowest $25\%$ of the scores. As shown in Fig. 7(c), the probability (i.e., energy-based) scores provide a more reliable assessment of whether samples are OOD. Experiment details and results on other datasets are provided in Appendix E.
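As a small worked example of how energies can be turned into probabilities via Eq. (4) and used as OOD scores, the sketch below computes $p = e^{-\mathcal{F}}$ and the area under the ROC curve for a handful of made-up energy values. The numbers and function names are purely illustrative assumptions, and the AUROC is computed through its pairwise-ranking definition rather than with any particular library.

```python
import math


def energy_to_probability(energy):
    """Eq. (4): p(x, y; theta) = exp(-F)."""
    return math.exp(-energy)


def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample scores higher than a random ID one.

    Ties count as one half, which matches the area under the ROC curve.
    """
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Hypothetical energies: OOD samples tend to have a higher energy.
id_energies = [0.1, 0.2, 0.3]
ood_energies = [0.25, 0.5, 0.9]

print([energy_to_probability(f) for f in id_energies])
print(auroc(id_energies, ood_energies))
```

Using the energy directly as the score is equivalent to using the negative log-probability, since the mapping in Eq. (4) is monotonic.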

Figure 7: (a) Energy and NLL of ID/OOD data before and after state optimization. (b) Nonlinearity between energy and softmax post-convergence. (c) ROC curve of OOD detection at the 100th and 25th percentiles of scores. In all plots, “ID” refers to MNIST and “OOD” to FashionMNIST.
Table 4: Comparison of the training times of BP against PC on different architectures and datasets.
Epoch time (seconds) BP PC (ours) PC (Song)
MLP - FashionMNIST $1.82^{\pm 0.01}$ $1.94^{\pm 0.07}$ $5.94^{\pm 0.55}$
AlexNet - CIFAR-10 $1.04^{\pm 0.08}$ $3.86^{\pm 0.06}$ $17.93^{\pm 0.37}$
VGG-5 - CIFAR-100 $1.61^{\pm 0.04}$ $5.33^{\pm 0.02}$ $13.49^{\pm 0.05}$
VGG-7 - Tiny ImageNet $7.59^{\pm 0.63}$ $54.60^{\pm 0.10}$ $137.58^{\pm 0.08}$
Figure 8: Training time for different network configurations.

6 Computational Resources and Implementations Details

PCX is developed on top of JAX, focusing on performance and versatility. We plan to further improve its efficiency and expand its capabilities to support new developments in the field of backprop-free training techniques that align with the core principles of PC. In relation to current alternatives, however, PCX is a very competitive choice. We measured the wall-clock time of our PCN implementation against another existing open-source library (Song, 2024) used in many PC works (Song et al., 2024; Salvatori et al., 2021, 2022; Tang et al., 2023), and also compared it with equivalent BP-trained networks (also developed with PCX for a fair comparison). Tab. 4 reports the measured time per epoch, averaged over 5 trials, using an A100 GPU. Although it is a sub-optimal architecture in terms of classification performance, we also tested on AlexNet (Krizhevsky et al., 2012), showing that our library can efficiently train models of more than 100 million parameters (AlexNet has $\approx 160$ million parameters). We also outperform alternative methods such as Eqprop: using the same architecture on CIFAR100, the authors report that one epoch takes $\approx 110$ seconds, while we take $\approx 5.5$ seconds on the same hardware (Scellier et al., 2024). However, we stress that this is not an apples-to-apples comparison, as the authors are more concerned with simulations on analog circuits than with achieving optimal GPU usage.
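A per-epoch wall-clock measurement of this kind can be sketched with a small timing harness. The code below is a generic illustration, not the benchmarking script used for Tab. 4: `run_epoch` is a placeholder callable, the warm-up run is an assumption meant to exclude one-off costs such as JIT compilation, and the spread is reported as the sample standard deviation.

```python
import statistics
import time


def time_epochs(run_epoch, n_trials=5, warmup=1):
    """Return mean and sample std of the wall-clock time of run_epoch."""
    for _ in range(warmup):
        run_epoch()  # excluded from timing (e.g. compilation, cache warm-up)
    durations = []
    for _ in range(n_trials):
        start = time.perf_counter()
        run_epoch()
        # With JAX, the epoch function should wait for its results here
        # (e.g. by calling .block_until_ready() on an output array),
        # because computations are dispatched asynchronously.
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


def dummy_epoch():
    """Stand-in workload used only to make the example runnable."""
    return sum(i * i for i in range(10_000))


mean_s, std_s = time_epochs(dummy_epoch)
print(f"{mean_s:.6f} s per epoch (std {std_s:.6f} s)")
```

Replacing `dummy_epoch` with a function that performs one full pass over the training set yields numbers comparable in spirit to those in the table.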

Limitations.

The efficiency of PCX could be further increased by fully parallelizing all the computations occurring within a PCN. In fact, in its current state, JIT seems to be unable to parallelize the execution of the layers; this problem can be addressed with the JAX primitive vmap, but only in the impractical case where all the layers have the same dimension. To test how different hyperparameters of the model influence the training speed, we have taken a feedforward model and trained it multiple times, each time increasing a specific hyperparameter by a multiplicative factor. The results, reported in Fig. 8, show that the two parameters that increase the training time are the number of layers $L$ (when the calculations are not explicitly vmapped) and the number of steps $T$. Ideally, only $T$ should affect the training time, as inference is an inherently sequential process that cannot be parallelized, but this is not the case, as the time scales linearly with the number of layers. Details are reported in Appendix F.

7 Discussion

The main contribution of this work is the introduction and open-source release of PCX, a library that can be used to perform deep learning tasks using PCNs. Its efficiency relies on JAX’s Just-In-Time compilation and on carefully structured primitives built to take advantage of it. A second advantage of our library is its intuitive setup, tailored to users already familiar with other deep learning frameworks such as PyTorch. Together with the large number of tutorials we release, this will make it easy for new users to train networks using PC.

We have also performed a large-scale comparison study on image classification and image generation tasks, unifying under the same computational framework multiple optimization algorithms for PCNs present in the literature. In addition, we have tried multiple popular optimizers and training techniques, as well as an extremely large range of hyperparameters. For CIFAR100 alone, for example, we have conducted thousands of individual experiments, which have been used to obtain new state-of-the-art results and to provide insights on what works and what does not. These insights will be useful to researchers tackling deep learning problems with PCNs in the future.

Acknowledgments

This work was supported by Medical Research Council grant MC_UU_00003/1 awarded to RB.

References

  • Alonso et al. (2024) Nicholas Alonso, Jeffrey Krichmar, and Emre Neftci. Understanding and improving optimization in predictive coding networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10812–10820, 2024.
  • Alonso et al. (2022) Nick Alonso, Beren Millidge, Jeffrey Krichmar, and Emre O Neftci. A theoretical framework for inference learning. Advances in Neural Information Processing Systems, 35:37335–37348, 2022.
  • Bengio (2014) Yoshua Bengio. How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv:1407.7906, 2014.
  • Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  • Dempster et al. (1977) Arthur Dempster, Nan Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • Elias (1955) Peter Elias. Predictive coding–I. IRE Transactions on Information Theory, 1(1):16–24, 1955.
  • Ernoult et al. (2022) Maxence M Ernoult, Fabrice Normandin, Abhinav Moudgil, Sean Spinney, Eugene Belilovsky, Irina Rish, Blake Richards, and Yoshua Bengio. Towards scaling difference target propagation by learning backprop targets. In International Conference on Machine Learning, pages 5968–5987. PMLR, 2022.
  • Frieder and Lukasiewicz (2022) Simon Frieder and Thomas Lukasiewicz. (non-) convergence results for predictive coding networks. In International Conference on Machine Learning, pages 6793–6810. PMLR, 2022.
  • Friston et al. (2007) K. Friston, J. Mattout, N. Trujillo-Barreto, J. Ashburner, and W. Penny. Variational free energy and the Laplace approximation. Neuroimage, 2007.
  • Friston (2005) Karl Friston. A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1456), 2005.
  • Friston (2010) Karl Friston. The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.
  • Friston (2018) Karl Friston. Does predictive coding have a future? Nature neuroscience, 21(8):1019–1021, 2018.
  • Friston and Kiebel (2009) Karl Friston and Stefan Kiebel. Predictive coding under the free-energy principle. Philosophical transactions of the Royal Society B: Biological sciences, 364(1521):1211–1221, 2009.
  • Grathwohl et al. (2020) Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations, 2020.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • Hendrycks and Gimpel (2017) Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017.
  • Hinton (2022) Geoffrey Hinton. The forward-forward algorithm: Some preliminary investigations. arXiv preprint arXiv:2212.13345, 2022.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. pmlr, 2015.
  • Journé et al. (2022) Adrien Journé, Hector Garcia Rodriguez, Qinghai Guo, and Timoleon Moraitis. Hebbian deep learning without feedback. arXiv preprint arXiv:2209.11883, 2022.
  • Kidger and Garcia (2021) Patrick Kidger and Cristian Garcia. Equinox: neural networks in JAX via callable PyTrees and filtered transformations. Differentiable Programming workshop at Neural Information Processing Systems 2021, 2021.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kohan et al. (2023) Adam Kohan, Edward A Rietman, and Hava T Siegelmann. Signal propagation: The framework for learning and inference in a forward pass. IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In 26th Annual Conference on Neural Information Processing Systems (NIPS) 2012, 2012.
  • Laborieux and Zenke (2022) Axel Laborieux and Friedemann Zenke. Holomorphic equilibrium propagation computes exact gradients through finite size oscillations. Advances in Neural Information Processing Systems, 35:12950–12963, 2022.
  • Launay et al. (2020) Julien Launay, Iacopo Poli, François Boniface, and Florent Krzakala. Direct feedback alignment scales to modern deep learning tasks and architectures. Advances in neural information processing systems, 33:9346–9360, 2020.
  • Lillicrap et al. (2014) Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random feedback weights support learning in deep neural networks. arXiv preprint arXiv:1411.0247, 2014.
  • Liu et al. (2020) Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in neural information processing systems, 33:21464–21475, 2020.
  • Millidge et al. (2022a) Beren Millidge, Yuhang Song, Tommaso Salvatori, Thomas Lukasiewicz, and Rafal Bogacz. Backpropagation at the infinitesimal inference limit of energy-based models: unifying predictive coding, equilibrium propagation, and contrastive hebbian learning. arXiv preprint arXiv:2206.02629, 2022a.
  • Millidge et al. (2022b) Beren Millidge, Yuhang Song, Tommaso Salvatori, Thomas Lukasiewicz, and Rafal Bogacz. A theoretical framework for inference and learning in predictive coding networks. arXiv preprint arXiv:2207.12316, 2022b.
  • Moraitis et al. (2022) Timoleon Moraitis, Dmitry Toichkin, Adrien Journé, Yansong Chua, and Qinghai Guo. Softhebb: Bayesian inference in unsupervised hebbian soft winner-take-all networks. Neuromorphic Computing and Engineering, 2(4):044017, 2022.
  • Mumford (1992) David Mumford. On the computational architecture of the neocortex. Biological Cybernetics, 66(3):241–251, 1992.
  • Nøkland (2016) Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, 2016.
  • Oliviers et al. (2024) Gaspard Oliviers, Rafal Bogacz, and Alexander Meulemans. Learning probability distributions of sensory inputs with monte carlo predictive coding. bioRxiv, pages 2024–02, 2024.
  • Ororbia and Kifer (2022) Alexander Ororbia and Daniel Kifer. The neural coding framework for learning generative models. Nature communications, 13(1):2064, 2022.
  • Ororbia and Mali (2023) Alexander Ororbia and Ankur Mali. Active predictive coding: Brain-inspired reinforcement learning for sparse reward robotic control problems. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 3015–3021. IEEE, 2023.
  • Ororbia et al. (2023) Alexander G Ororbia, Ankur Mali, Daniel Kifer, and C Lee Giles. Backpropagation-free deep learning with recursive local representation alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9327–9335, 2023.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Pinchetti et al. (2022) Luca Pinchetti, Tommaso Salvatori, Beren Millidge, Yuhang Song, Yordan Yordanov, and Thomas Lukasiewicz. Predictive coding beyond Gaussian distributions. 36th Conference on Neural Information Processing Systems, 2022.
  • Rao and Ballard (1999) Rajesh P. N. Rao and Dana H. Ballard. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87, 1999.
  • Salvatori et al. (2021) Tommaso Salvatori, Yuhang Song, Yujian Hong, Lei Sha, Simon Frieder, Zhenghua Xu, Rafal Bogacz, and Thomas Lukasiewicz. Associative memories via predictive coding. In Advances in Neural Information Processing Systems, volume 34, 2021.
  • Salvatori et al. (2022) Tommaso Salvatori, Luca Pinchetti, Beren Millidge, Yuhang Song, Tianyi Bao, Rafal Bogacz, and Thomas Lukasiewicz. Learning on arbitrary graph topologies via predictive coding. arXiv:2201.13180, 2022.
  • Salvatori et al. (2023a) Tommaso Salvatori, Ankur Mali, Christopher L Buckley, Thomas Lukasiewicz, Rajesh PN Rao, Karl Friston, and Alexander Ororbia. Brain-inspired computational intelligence via predictive coding. arXiv preprint arXiv:2308.07870, 2023a.
  • Salvatori et al. (2023b) Tommaso Salvatori, Luca Pinchetti, Amine M’Charrak, Beren Millidge, and Thomas Lukasiewicz. Causal inference via predictive coding. arXiv preprint arXiv:2306.15479, 2023b.
  • Salvatori et al. (2024) Tommaso Salvatori, Yuhang Song, Beren Millidge, Zhenghua Xu, Lei Sha, Cornelius Emde, Rafal Bogacz, and Thomas Lukasiewicz. Incremental predictive coding: A parallel and fully automatic learning algorithm. International Conference on Learning Representations 2024, 2024.
  • Scellier and Bengio (2017) Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11:24, 2017.
  • Scellier et al. (2024) Benjamin Scellier, Maxence Ernoult, Jack Kendall, and Suhas Kumar. Energy-based learning algorithms for analog computing: a comparative study. Advances in Neural Information Processing Systems, 36, 2024.
  • Song (2024) Yuhang Song. Prospective-configuration. https://github.com/YuhangSong/Prospective-Configuration, 2024.
  • Song et al. (2024) Yuhang Song, Beren Millidge, Tommaso Salvatori, Thomas Lukasiewicz, Zhenghua Xu, and Rafal Bogacz. Inferring neural activity before plasticity as a foundation for learning beyond backpropagation. Nature Neuroscience, pages 1–11, 2024.
  • Spratling (2008) Michael W Spratling. Reconciling predictive coding and biased competition models of cortical function. Frontiers in Computational Neuroscience, 2:4, 2008.
  • Spratling (2017) Michael W Spratling. A review of predictive coding algorithms. Brain and Cognition, 112:92–97, 2017.
  • Srinivasan et al. (1982) Mandyam Veerambudi Srinivasan, Simon Barry Laughlin, and Andreas Dubs. Predictive coding: A fresh view of inhibition in the retina. Proceedings of the Royal Society of London. Series B. Biological Sciences, 216(1205):427–459, 1982.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 2014.
  • Tang et al. (2023) Mufeng Tang, Tommaso Salvatori, Beren Millidge, Yuhang Song, Thomas Lukasiewicz, and Rafal Bogacz. Recurrent predictive coding models for associative memory employing covariance learning. PLOS Computational Biology, 19(4):e1010719, 2023.
  • Tang et al. (2024) Mufeng Tang, Helen Barron, and Rafal Bogacz. Sequential memory with temporal predictive coding. Advances in Neural Information Processing Systems, 36, 2024.
  • Whittington and Bogacz (2017) James C. R. Whittington and Rafal Bogacz. An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation, 29(5), 2017.
  • Yang et al. (2021) Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334, 2021.
  • Yoo and Wood (2022) Jinsoo Yoo and Frank Wood. Bayespcn: A continually learnable predictive coding associative memory. Advances in Neural Information Processing Systems, 35:29903–29914, 2022.
  • Zahid et al. (2023) Umais Zahid, Qinghai Guo, and Zafeirios Fountas. Sample as you infer: Predictive coding with langevin dynamics. arXiv preprint arXiv:2311.13664, 2023.

Appendix

The code for PCX is available at https://github.com/liukidar/pcax. The experiments conducted in the main body and in the appendix can be found in the examples folder.

Here we provide the details on how experiments were conducted and results obtained. We opt for a more descriptive approach to convey the fundamental concepts, and leave all details for reproducibility in the provided code, as well as in the next sections. There, each section will link to the exact directory corresponding to the described experiments.

Appendix A PCX – A Brief Introduction

In this section, we illustrate the core ideas of PCX by describing the main building blocks necessary to train and evaluate a feedforward classifier in predictive coding. For more detailed and complete explanations, please refer to the tutorial notebooks in the examples folder of the library.

In Section 3, we defined PCNs as models with parameters $\theta=\{\theta_{0},\ldots,\theta_{L}\}$ and state $h=\{h_{0},\ldots,h_{L}\}$. In PCX, we divide a model into components of two main categories: layers (i.e., the traditional deep-learning transformations such as ’Linear’ or ’Conv2D’) and vodes (i.e., vectorized nodes that store the array of neurons representing state $h_{l}$). A PCN is defined as follows:

```python
import jax.nn as jnn
import pcx.predictive_coding as pxc
import pcx.nn as pxnn


class MLP(pxc.EnergyModule):
    def __init__(self, in_dim, h_dim, out_dim):
        # Layers: the usual deep-learning transformations.
        self.layers = [
            pxnn.Linear(in_dim, h_dim),
            pxnn.Linear(h_dim, h_dim),
            pxnn.Linear(h_dim, out_dim)
        ]

        # Vodes: one per layer output, each storing the state h_l.
        self.vodes = [
            pxc.Vode((dim,)) for dim in (h_dim, h_dim, out_dim)
        ]

    def __call__(self, x, y = None):
        for layer, vode in zip(self.layers, self.vodes):
            u = jnn.leaky_relu(layer(x))
            x = vode(u)

        # During training, clamp the last vode to the label.
        if y is not None:
            self.vodes[-1].set("h", y)

        return u
```

In the __call__ method, we forward the input $x$ through the network. Note that every time we call a vode, we are effectively storing in it the activation $u_{l}$ (so that we can later compute the energy $\epsilon_{l}^{2}$ associated with the vode) and returning its state $h_{l}$ (i.e., x = vode(u) corresponds to vode.set("u", u); x = vode.get("h")). During training, the label $y$ is provided to the model and fixed to the last vode by overwriting its state $h$. Note that, since during both training and evaluation the state of the first vode would be fixed to the input $x$, we avoid defining it (i.e., we avoid computing $P_{\theta_{0}}(h_{0})$, since it would be constant) and directly forward $x$ to the first layer transformation.

The class pxc.EnergyModule provides a .energy() function that computes the variational free energy $\mathcal{F}$ as per Eq. (1). We can compute the state and parameter gradients as per Eq. (3) by calling pxf.value_and_grad, a wrapper around the JAX function of the same name. Having defined two optimizers, optim_w and optim_h, for parameters and state respectively, we can define training on a pair $(x,y)$ as follows:

```python
import pcx.predictive_coding as pxc
import pcx.utils as pxu
import pcx.functional as pxf


def energy(x, y, *, model):
    model(x, y)
    return model.energy()


# Gradient of the energy with respect to the state (vodes).
grad_h = pxf.value_and_grad(
    pxu.Mask(pxc.VodeParam, [False, True])
)(energy)

# Gradient of the energy with respect to the weights (layers).
grad_w = pxf.value_and_grad(
    pxu.Mask(pxc.LayerParam, [False, True])
)(energy)


def train(T, x, y, *, model, optim_h, optim_w):
    model.train()

    # Initialization
    with pxu.step(model, pxc.STATUS.INIT, clear_params=pxc.VodeParam.Cache):
        model(x)

    # Inference steps
    for i in range(T):
        with pxu.step(model, clear_params=pxc.VodeParam.Cache):
            _, g_h = grad_h(x, y, model=model)
        optim_h.step(model, g_h["model"], True)

    # Learning step
    with pxu.step(model, clear_params=pxc.VodeParam.Cache):
        _, g_w = grad_w(x, y, model=model)
    optim_w.step(model, g_w["model"])
```

A few notes on the above code:

  • JAX [Bradbury et al., 2018] is a functional library; PCX is not. Modules in PCX are PyTrees, using the same philosophy as another popular JAX library, equinox [Kidger and Garcia, 2021], with which PCX modules are fully compatible. However, their state is managed by PCX so that each parameter transformation is automatically tracked. The user can opt in to this behavior by passing arguments as keyword arguments (as in the above example). Positional function arguments, instead, are ignored by PCX, and it is the user’s duty to track their state as done in JAX or equinox.

  • pxf.value_and_grad allows the user to specify a Mask object that identifies which parameters to target with the given transformation. In the case above, we first compute the gradient of $\mathcal{F}$ with respect to the state (VodeParam) and then with respect to the weights (LayerParam) of the model.

  • In the train function, we use pxu.step to set the model status to pxc.STATUS.INIT to perform the state initialization. In PCX, forward initialization is the default method; however, other methods can easily be specified. pxu.step is also used to clear the PCN’s cache, which is used to store intermediate values such as the activations $u_{l}$.

  • The actual examples in the library are on mini-batches of data, so all transformations above are vmapped in the actual experiments.

For the evaluation function, being in discriminative mode, we simply perform a forward pass through the PCN, which sets $\epsilon_{l}=0$ for all layers.

```python
def eval(x, *, model):
    with pxu.step(model, pxc.STATUS.INIT, clear_params=pxc.VodeParam.Cache):
        return model(x)
```
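To make the three phases of the train function above (initialization, inference, learning) more concrete, the following framework-free sketch reproduces them on a toy scalar chain $x \rightarrow h_{1} \rightarrow y$ with linear layers. It does not use PCX: the names w1, w2, lr_h, and lr_w, the plain gradient-descent updates, and the squared-error energy are illustrative assumptions, not the library's implementation.

```python
def energy(x, h1, y, w1, w2):
    """Sum of squared prediction errors for the chain x -> h1 -> y."""
    e1 = h1 - w1 * x      # error at the hidden node
    e2 = y - w2 * h1      # error at the output node (clamped to y)
    return 0.5 * (e1 ** 2 + e2 ** 2)


def train_step(x, y, w1, w2, T=8, lr_h=0.1, lr_w=0.05):
    # Initialization: a forward pass, so the hidden error starts at zero.
    h1 = w1 * x
    energies = [energy(x, h1, y, w1, w2)]

    # Inference: T gradient steps on the state h1 with the weights fixed.
    for _ in range(T):
        e1 = h1 - w1 * x
        e2 = y - w2 * h1
        grad_h1 = e1 - w2 * e2          # dF/dh1
        h1 -= lr_h * grad_h1
        energies.append(energy(x, h1, y, w1, w2))

    # Learning: one gradient step on the weights with the state fixed.
    e1 = h1 - w1 * x
    e2 = y - w2 * h1
    w1 += lr_w * e1 * x                 # -dF/dw1 = e1 * x
    w2 += lr_w * e2 * h1                # -dF/dw2 = e2 * h1
    return w1, w2, energies


w1, w2 = 0.5, 0.5
for _ in range(300):
    w1, w2, energies = train_step(x=1.0, y=2.0, w1=w1, w2=w2)
prediction = w2 * (w1 * 1.0)            # forward pass, as in evaluation
```

In this toy setting the energy does not increase during the inference steps, and the forward prediction approaches the target after repeated training steps. In PCX the same structure is expressed through grad_h, grad_w, and the two optimizers.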

Appendix B Discriminative experiments

Model.

We conducted experiments on three models: MLP, VGG-5, and VGG-7. The detailed architectures of these models are presented in Table 5.

Table 5: Detailed architectures of base models

| | MLP | VGG-5 | VGG-7 |
|---|---|---|---|
| Channel Sizes | [128, 128] | [128, 256, 512, 512] | [128, 128, 256, 256, 512, 512] |
| Kernel Sizes | - | [3, 3, 3, 3] | [3, 3, 3, 3, 3, 3] |
| Strides | - | [1, 1, 1, 1] | [1, 1, 1, 1, 1, 1] |
| Paddings | - | [1, 1, 1, 0] | [1, 1, 1, 0, 1, 0] |
| Pool window | - | 2 × 2 | 2 × 2 |
| Pool stride | - | 2 | 2 |

For each model, we conducted experiments with seven different algorithms:

  1. Standard PC with Cross-Entropy Loss (PC-CE) / Squared Error Loss (PC-SE): already discussed in the background section.

  2. PC with Positive Nudging (PC-PN):

    Unlike standard Predictive Coding with Squared Error Loss (PC-SE), where the output is clamped to the target, in PC with nudging we “nudge” the output towards the target. This is achieved by fixing the representation of the last layer, $h_{L}$, to $\mu_{L}+\beta(y-\mu_{L})$, where $\mu_{L}$ is the predicted activation of the last layer after forward initialisation, $y$ is the target, and $\beta\in(0,1)$ is a scalar parameter that controls the strength of nudging. Note that when $\beta=1$, PC with nudging is equivalent to standard PC.

    During training, as the model output gradually approaches the target, we employ a strategy of increasing $\beta$. At the end of each epoch, the value of $\beta$ is incremented by a fixed rate $\beta_{ir}$. When $\beta$ becomes greater than or equal to 1, we set it to 1. This strategy allows the model to learn and explore in the early stages of training, while gradually transitioning to standard PC in the later stages.

  3. PC with Negative Nudging (PC-NN):

    In this algorithm, we do the opposite of positive nudging: we push the output away from the target. Therefore, we fix the representation $h_{L}$ of the last layer to $\mu_{L}-\beta(y-\mu_{L})$. We use the same strategy of dynamically increasing $\beta$. When $\beta$ becomes greater than or equal to 1, we set it to -1.

    In the learning stage, to ensure that the direction of the weight update is consistent with the target (since we fixed $h_{L}$ in the opposite direction), we invert the weight update: $\theta_{l}\leftarrow\theta_{l}-\Delta\theta_{l}$, where $\Delta\theta_{l}$ is defined in Eq. (3).

  4. PC with Center Nudging (PC-CN):

    Center Nudging [Scellier et al., 2024] is used in equilibrium propagation to improve and stabilize performance compared to both positive and negative nudging, and it is obtained as an average of the gradients produced by the two methods. Here, we approximate this behavior by randomly alternating between epochs in which we train with either negative or positive nudging. In this way, the model being trained can benefit from both methods without any extra computational cost.

  5. Incremental PC (iPC), a simple and recently proposed modification where the weight parameters are updated alongside the latent variables at every time step [Salvatori et al., 2024].

  6. Standard Backpropagation with Cross-Entropy Loss (BP-CE) / Squared Error Loss (BP-SE): the most popular way to perform credit assignment in neural networks. The model is trained by computing the gradients of the loss function with respect to the weights of the network using the chain rule.
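The nudging variants in the list above differ only in the value to which the last-layer state is fixed and in how $\beta$ evolves across epochs. The following minimal sketch illustrates the positive and negative nudging targets, the end-of-epoch increase of $\beta$ described for positive nudging, and the per-epoch random choice used to approximate center nudging. All function names are hypothetical, and the equal-probability choice between the two modes is an assumption, since the text only states that epochs alternate randomly.

```python
import random


def nudged_target(mu_L, y, beta, positive=True):
    """Value to which the last-layer state h_L is fixed under nudging."""
    sign = 1.0 if positive else -1.0
    return [m + sign * beta * (t - m) for m, t in zip(mu_L, y)]


def increase_beta(beta, beta_ir):
    """End-of-epoch schedule for positive nudging: add beta_ir, cap at 1."""
    return min(1.0, beta + beta_ir)


def center_nudging_is_positive(rng):
    """PC-CN approximation: choose positive or negative nudging per epoch.

    Assumption: both modes are drawn with probability 0.5.
    """
    return rng.random() < 0.5


mu_L = [0.2, 0.8]   # prediction after forward initialisation
y = [0.0, 1.0]      # one-hot target
pos = nudged_target(mu_L, y, beta=0.5, positive=True)    # moves towards y
neg = nudged_target(mu_L, y, beta=0.5, positive=False)   # moves away from y
```

With negative nudging, the weight update is additionally inverted, as described in item 3; that step is omitted here because it depends on the weight gradients.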

Experiments.

The benchmark results for the MLP are obtained on MNIST and Fashion-MNIST; those for VGG-5 on CIFAR-10, CIFAR-100, and Tiny ImageNet; and those for VGG-7 on CIFAR-100 and Tiny ImageNet. The data is normalized as in Table 6.

Table 6: Data normalization

| Dataset | Mean ($\mu$) | Std ($\sigma$) |
|---|---|---|
| MNIST | 0.5 | 0.5 |
| Fashion-MNIST | 0.5 | 0.5 |
| CIFAR-10 | [0.4914, 0.4822, 0.4465] | [0.2023, 0.1994, 0.2010] |
| CIFAR-100 | [0.5071, 0.4867, 0.4408] | [0.2675, 0.2565, 0.2761] |
| Tiny ImageNet | [0.485, 0.456, 0.406] | [0.229, 0.224, 0.225] |

For data augmentation on the training sets of CIFAR-10, CIFAR-100, and Tiny ImageNet, we apply random horizontal flipping with a probability of 50%. Additionally, we employ random cropping with different settings for each dataset. For CIFAR-10 and CIFAR-100, images are randomly cropped to 32×32 resolution with a padding of 4 pixels on each side. In the case of Tiny ImageNet, random cropping is performed to obtain 56×56 resolution images without any padding. On the test set of Tiny ImageNet, we use center cropping to extract 56×56 resolution images, also without padding, since the original resolution of Tiny ImageNet is 64×64.
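The crop geometry described above can be checked with a few lines of arithmetic. The sketch below computes where a crop window may be placed for each setting and applies the horizontal flip to an image stored as a list of rows. The helper names are hypothetical, and the code only illustrates the arithmetic; it is not the data pipeline used in the experiments.

```python
import random


def crop_offset(image_size, crop_size, padding=0, train=True, rng=random):
    """Top-left (row, col) offset of a square crop inside the padded image."""
    padded = image_size + 2 * padding
    slack = padded - crop_size            # how far the window can slide
    if train:
        # Random crop: each offset is drawn uniformly from 0..slack.
        return rng.randint(0, slack), rng.randint(0, slack)
    # Center crop: deterministic, used at test time.
    return slack // 2, slack // 2


def maybe_flip(image, rng=random, p=0.5):
    """Horizontally flip the whole image (a list of rows) with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image


rng = random.Random(0)
cifar_offset = crop_offset(32, 32, padding=4, train=True, rng=rng)   # in 0..8
tiny_train_offset = crop_offset(64, 56, train=True, rng=rng)         # in 0..8
tiny_test_offset = crop_offset(64, 56, train=False)                  # center
```

In both training settings the window can slide by 8 pixels in each direction: 32 + 2·4 − 32 for CIFAR and 64 − 56 for Tiny ImageNet.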

The model hyperparameters are determined using the search space shown in Table 7. The results presented in Table 1 were obtained using 5 seeds with the optimal hyperparameters, which are stored in the YAML files located in the "examples/s4_1_discriminative_mode/" subdirectories of the PCX library.

As for the optimizer and scheduler, we use mini-batch stochastic gradient descent (SGD) with momentum as the optimizer for the state $h$, and AdamW with weight decay as the optimizer for the weights $\theta$. Additionally, we apply a warmup-cosine-annealing scheduler without restarts to the learning rate of $\theta$. We also tried SGD with momentum for the weights $\theta$, but we did not perform complete hyperparameter searches on all combinations of architectures and datasets, as its performance was suboptimal compared to AdamW in all the cases tested.
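For readers unfamiliar with these components, the following sketch shows one common formulation of a warmup-cosine-annealing schedule without restarts and of an SGD-with-momentum update. The parameter names (peak_lr, warmup_steps, total_steps, end_lr), the linear warmup starting from zero, and the specific momentum convention are assumptions made for illustration; the exact settings used in the experiments are in the YAML configuration files mentioned above.

```python
import math


def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, end_lr=0.0):
    """Linear warmup to peak_lr, then a single cosine decay to end_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


def sgd_momentum_step(h, grad, velocity, lr, momentum):
    """One SGD-with-momentum update on a scalar state h."""
    velocity = momentum * velocity + grad   # accumulate past gradients
    return h - lr * velocity, velocity


# Learning rate for theta at a few points of a 1000-step run.
schedule = [warmup_cosine_lr(s, 3e-4, 100, 1000) for s in (0, 50, 100, 550, 1000)]

# One inference-style update of a state value.
h_new, v_new = sgd_momentum_step(h=1.0, grad=0.5, velocity=0.0, lr=0.1, momentum=0.9)
```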

Table 7: Hyperparameters search configuration
Parameter | PC | iPC | BP
Epoch (MLP) | 25
Epoch (VGG-5 and VGG-7) | 50
Batch Size | 128
Activation | [leaky relu, gelu, hard tanh] | [leaky relu, gelu, hard tanh, relu]
$\beta$ | [0.0, 1.0], 0.05¹ | - | -
$\beta_{ir}$ | [0.02, 0.0] | - | -
$lr_{h}$ | (1e-2, 5e-1)² | (1e-2, 1.0)² | -
$lr_{\theta}$ | (1e-5, 3e-4)² | (3e-5, 3e-4)²
$momentum_{h}$ | [0.0, 1.0], 0.05¹ | -
$weightdecay_{\theta}$ | (1e-5, 1e-2)² | (1e-5, 1e-1)² | (1e-5, 1e-2)²
T (MLP and VGG-5) | [4,5,6,7,8] | -
T (VGG-7) | [8,9,10,11,12] | -

1: “[a, b], c” denotes a sequence of values from a to b with a step size of c.

2: “(a, b)” represents a log-uniform distribution between a and b.
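The two footnote notations can be read as the following sampling rules. This is a minimal sketch that draws one configuration from the PC column of Table 7; the use of independent random sampling and the function names are assumptions, as the search procedure itself is not specified here.

```python
import math
import random


def grid(a, b, step):
    """Footnote 1: values from a to b (inclusive) with a fixed step size."""
    n = int(round((b - a) / step))
    return [round(a + i * step, 10) for i in range(n + 1)]


def log_uniform(a, b, rng):
    """Footnote 2: a sample whose logarithm is uniform on [log a, log b]."""
    return math.exp(rng.uniform(math.log(a), math.log(b)))


rng = random.Random(0)
config = {
    "beta": rng.choice(grid(0.0, 1.0, 0.05)),
    "lr_h": log_uniform(1e-2, 5e-1, rng),
    "lr_theta": log_uniform(1e-5, 3e-4, rng),
    "momentum_h": rng.choice(grid(0.0, 1.0, 0.05)),
    "weight_decay_theta": log_uniform(1e-5, 1e-2, rng),
    "T": rng.choice([4, 5, 6, 7, 8]),
}
```

Sampling on a logarithmic scale gives each order of magnitude the same probability mass, which is the usual motivation for using it with learning rates and weight decay.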

Results.

All the results presented in this study were obtained using forward initialization, a technique that initializes the model’s parameters by performing a forward pass on a zero tensor with the same shape as the input data. In addition, in our experiments, we limited the range of $T$ to ensure a fair comparison with BP in terms of training time. A higher $T$ corresponds to a greater number of optimization rounds of $h$, which can lead to improved model performance but also to increased computational costs and longer training durations. To maintain comparability with BP, we restricted the search space of $T$ to values that resulted in training times similar to those observed in BP-based training.

Momentum helps significantly. In Figure 9, we present the accuracy of the VGG-7 model trained on CIFAR-100 using different momentum values, both without nudging (Figure 9(a)) and with nudging (Figure 9(b)). It is evident from Figure 9 that selecting an appropriate momentum value can substantially improve model accuracy. By comparing Figures 9(a) and 9(b), we can observe that different training algorithms have different optimal momentum values. The optimal momentum for training with nudging is generally higher than that for training without nudging. Furthermore, the optimal momentum for negative nudging is larger than that for positive nudging. These differences highlight the importance of carefully tuning the momentum hyperparameter based on the specific training algorithm and nudging method employed. For reference, the optimal model parameters and momentum values for various tasks and models can be found in the example/discriminative_experiments folder of the PCX library.

Figure 9: Comparison of the accuracy of the VGG-7 model trained on CIFAR-100 using different momentum values: (a) without nudging; (b) with nudging.

Activation function also plays a crucial role in improving model accuracy. For models using Cross-Entropy Loss, the “HardTanh” activation function is a better choice. In the case of models using Squared Error Loss without nudging, the “LeakyReLU” activation function tends to perform better. When using Positive Nudging, the optimal activation function varies depending on the model architecture. For Negative Nudging, the “GeLU” activation function is the most suitable choice.

Nudging improves performance. Fig. 11 illustrates the relationship between the learning rate of $h$ and accuracy with or without nudging. From the plot, we can observe that when nudging is not used (red dots), the model achieves better results at lower learning rates. However, when nudging is employed (purple and blue dots), regardless of whether it is positive or negative nudging, the model can attain better accuracy at higher learning rates compared to the case without nudging. Additionally, Fig. 9(b) shows the relationship between momentum and accuracy. We can see that after applying nudging, the model can achieve better results at higher momentum values. We believe this is the reason why nudging can improve performance. The ability to use higher learning rates and momentum values without sacrificing accuracy is a significant advantage of nudging, as it can lead to faster convergence and improved generalization performance.

Figure 10: Different training runs for VGG-7 on Tiny ImageNet using CN. Most of the seeds diverge, resulting in poor average performance.

Training instability and model size. Similarly to what we show in Section 5.1, we noticed that the largest architectures exhibit significantly more training instability. iPC is stable and produces optimal results only for the smallest, fully connected architectures, while PC achieves its peak performance on VGG-5. In particular, we noticed that VGG-7 is able to reach a similar peak accuracy (e.g., $\approx 43\%$ on Tiny ImageNet) but, on average, performs notably worse, as most seeds result in a diverged model (Fig. 10).

Figure 11: Comparison of the accuracy of the (a) VGG-7 and (b) VGG-5 models trained on CIFAR-100 using different learning rates for $h$.

Appendix C Generative experiments

C.1 Autoencoder

An Autoencoder is a network that learns to compress a high-dimensional input into a much lower-dimensional space, called the bottleneck dimension or the hidden dimension, as accurately as possible. A backpropagation-based Autoencoder thus consists of two parts: an encoder, which compresses the input from the original high-dimensional space into the bottleneck dimension, and a decoder, which reconstructs the original input from the bottleneck dimension. The mean-squared error (MSE) between the original and the reconstructed input is used as a loss to train the Autoencoder network in an unsupervised manner.

Predictive Coding (PC) removes the need for the encoder part of an Autoencoder. Specifically, only the decoder part of an Autoencoder is used, with a PC layer acting as the bottleneck dimension and as the input to the decoder. Moreover, PC layers are inserted after each layer of the decoder.
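The idea that inference can take over the role of the encoder can be illustrated with a single linear decoder. In the sketch below, the bottleneck state $z$ is found by gradient descent on the reconstruction energy instead of being produced by an encoder network. The one-layer decoder, the zero initialisation of $z$, plain gradient descent without momentum, and all names are simplifying assumptions; weight learning and the intermediate PC layers are omitted.

```python
def decode(W, z):
    """Linear decoder: maps the bottleneck state z to a reconstruction."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) for row in W]


def infer_latent(W, x, latent_dim, T=20, lr_h=0.5):
    """Run T inference steps on z to reduce 0.5 * ||x - W z||^2."""
    z = [0.0] * latent_dim                      # assumed initial state
    for _ in range(T):
        err = [x_i - p_i for x_i, p_i in zip(x, decode(W, z))]
        # Gradient of the energy with respect to z is -W^T err.
        grad = [-sum(W[i][j] * err[i] for i in range(len(x)))
                for j in range(latent_dim)]
        z = [z_j - lr_h * g_j for z_j, g_j in zip(z, grad)]
    return z


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3-dim input, 2-dim bottleneck
x = [1.0, 2.0, 3.0]                        # equals W applied to z = [1, 2]
z = infer_latent(W, x, latent_dim=2)
reconstruction = decode(W, z)
```

Here the inferred state recovers the code that generated the input, which is the quantity an encoder would otherwise have to predict in a single forward pass.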

Refer to caption
Figure 12: Left: An Autoencoder implemented with backpropagation consists of both an encoder and a decoder. The encoder compresses the input data into the bottleneck dimension, and the decoder restores the original image. Right: An Autoencoder implemented with Predictive Coding. The state of the first PC layer is the bottleneck dimension. The state of the last PC layer is the original input, and the predicted state of the last PC layer is the predicted input. Inference steps update the bottleneck dimension to make it a good compressed representation.
Table 8: Hyperparameters and search spaces for deconvolution-based autoencoders.
Parameter | PC | iPC | BP
Number of layers | 3 | 3 | conv layers: 3, deconv layers: 3
Internal state dimension | 4x4 | 4x4 | 4x4
Internal state channels | 8 | 8 | 8
Kernel size | [3, 4, 5, 7] | [3, 4, 5, 7] | [3, 4, 5, 7]
Activation function | [relu, leaky_relu, gelu, tanh, hard_tanh] | [relu, leaky_relu, gelu, tanh, hard_tanh] | [relu, leaky_relu, gelu, tanh, hard_tanh]
Batch size | 200 | 200 | 200
Epochs | 30 | 30 | 30
$T$ | 20 | 20 | -
Optim $h$ | SGD+momentum | SGD+momentum | -
$lr_h$ | (1e-2, 5e-1)² | (1e-2, 1.0)² | -
$momentum_h$ | [0.0, 0.95] | [0.0, 0.95] | -
Optim $\theta$ | AdamW | AdamW | AdamW
$lr_\theta$ | (3e-5, 1e-3)² | (3e-5, 1e-3)² | (3e-5, 1e-3)²
$weightdecay_\theta$ | (1e-5, 1e-2)² | (1e-5, 1e-1)² | (1e-5, 1e-2)²
Table 9: Hyperparameters and search spaces for linear-based autoencoders.
Parameter | PC | iPC | BP
Number of layers | 3 | 3 | encoder: 3, decoder: 3
Internal state dimension | 64 | 64 | 64
Activation function | [relu, leaky_relu, gelu, tanh, hard_tanh] | [relu, leaky_relu, gelu, tanh, hard_tanh] | [relu, leaky_relu, gelu, tanh, hard_tanh]
Batch size | 200 | 200 | 200
Epochs | 30 | 30 | 30
$T$ | 20 | 20 | -
Optim $h$ | SGD+momentum | SGD+momentum | -
$lr_h$ | (1e-2, 5e-1)² | (1e-2, 1.0)² | -
$momentum_h$ | [0.0, 0.95] | [0.0, 0.95] | -
Optim $\theta$ | AdamW | AdamW | AdamW
$lr_\theta$ | (3e-5, 1e-3)² | (3e-5, 1e-3)² | (3e-5, 1e-3)²
$weightdecay_\theta$ | (1e-5, 1e-2)² | (1e-5, 1e-1)² | (1e-5, 1e-2)²

A PC-based Autoencoder works as follows:

  1. The energy function of the last PC layer is set to the MSE upon its creation. In PCX, the squared error is the default energy function. The squared error is summed across all dimensions of the input and averaged over the batch, which equals the MSE up to a multiplicative constant.

  2. The current state of the last PC layer $L$, $h_L$, is fixed to the original input data, which means that $h_L$ is not changed during the inference steps.

  3. Since the energy of the last layer $L$ now encodes the MSE loss between the predicted image $\mu_L$ and the original input stored as $h_L$, the inference steps update the current states $h_l$ of all PC layers but the last one, including the one that represents the bottleneck dimension, to minimize this MSE loss.

  4. Once the inference steps are done, the state of the bottleneck PC layer has converged to the compressed representation of the original input.
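The four steps above can be illustrated with a minimal NumPy sketch. It is not PCX code: the two-layer linear decoder, the function names `energy` and `pc_autoencoder_inference`, the weight scale, and the state learning rate are all assumptions chosen to keep the example short. The last layer is clamped to the input `x`, and only the bottleneck state `h0` and the hidden state `h1` are updated by gradient descent on the summed squared-error energy.

```python
import numpy as np


def energy(h0, h1, x, W1, W2):
    """Summed squared-error energy (factor 0.5 included) of a two-layer linear decoder."""
    e1 = h1 - h0 @ W1  # prediction error at the hidden PC layer
    e2 = x - h1 @ W2   # prediction error at the last PC layer, whose state is clamped to x
    return 0.5 * (np.sum(e1 ** 2) + np.sum(e2 ** 2))


def pc_autoencoder_inference(x, W1, W2, steps=20, lr=0.1):
    """Run inference with the last layer clamped to x.

    Returns the bottleneck state and the energy recorded before each step
    and after the final one.
    """
    h0 = np.zeros((x.shape[0], W1.shape[0]))  # bottleneck state (free)
    h1 = h0 @ W1                              # hidden state (free), forward-initialised
    trace = [energy(h0, h1, x, W1, W2)]
    for _ in range(steps):
        e1 = h1 - h0 @ W1
        e2 = x - h1 @ W2
        grad_h0 = -e1 @ W1.T       # dF/dh0
        grad_h1 = e1 - e2 @ W2.T   # dF/dh1
        h0 = h0 - lr * grad_h0     # only free states move; x is never updated
        h1 = h1 - lr * grad_h1
        trace.append(energy(h0, h1, x, W1, W2))
    return h0, trace


rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((8, 16))   # bottleneck (8) -> hidden (16), toy sizes
W2 = 0.1 * rng.standard_normal((16, 32))  # hidden (16) -> input space (32)
x = rng.standard_normal((4, 32))          # a batch of 4 toy inputs
code, trace = pc_autoencoder_inference(x, W1, W2)
print(code.shape, trace[0], trace[-1])
```

With small weights and a small learning rate, the recorded energy decreases over the inference steps while `x` stays fixed, and `code` plays the role of the compressed representation.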

C.2 MCPC

Model. Monte Carlo predictive coding (MCPC) is a version of predictive coding that can be used for generative learning. MCPC differs from PC by its noisy neural dynamics. Unlike PC where the neural activity converges to a mode of the free-energy, the neural activity of MCPC performs noisy gradient descent which is used for Monte Carlo sampling. When an input is provided, the noisy neural activity samples the posterior distribution of the generative model given the sensory input. When no input is provided the neural activity samples the generative model encoded in the model parameters. Specifically, the neural dynamics of MCPC leverage the following Langevin dynamics:

$$\Delta h_l = -\gamma \nabla_{h_l} \mathcal{F}_{h_l}(h, \theta) + \sqrt{2\gamma}\, N \qquad (5)$$

where $N$ is a Gaussian random variable with variance $\sigma_{mcpc}^2$. These neural dynamics can be extended to 2nd-order Langevin dynamics for faster sampling:

$$\Delta h_l = \gamma r_l \qquad (6)$$
$$\Delta r_l = \gamma \nabla_{h_l} \mathcal{F}(h, \theta) - \gamma (1 - m) r_l + \sqrt{2 (1 - m) \gamma}\, N \qquad (7)$$

where $m$ is a momentum constant.

An MCPC model is trained following a Monte Carlo expectation maximisation scheme, which iterates over the following two steps: (i) MCPC’s neural activity samples the model’s posterior distribution for the given data, and (ii) the model parameters are updated to increase the model log-likelihood under the samples of the posterior. In practice, we run MCPC inference for a limited number of steps, after which we update the model parameters with a single sample of the posterior, similarly to how model parameters are updated in variational autoencoders.

After training, samples of a trained model are generated by leaving all neurons unclamped and recording the activity of input neurons (the neurons clamped to data during training). The activity is recorded after a limited number of activity update steps. This process is repeated for each data sample.

MCPC’s implementation in PCX utilizes a noisy SGD optimizer for the state $h$. Whereas PC uses an SGD or Adam optimizer, MCPC uses an optimizer that combines SGD with noise added to the model’s gradients. The variance of this noise must scale appropriately with the learning rate and the momentum, as shown in Eqs. (5) to (7).
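As an illustration of the first-order update in Eq. (5), the following NumPy sketch applies a noisy SGD step to a toy one-dimensional energy $\mathcal{F}(h) = h^2 / 2$, whose associated density $e^{-\mathcal{F}}$ is a standard Gaussian. The function names `noisy_sgd_step` and `sample_gaussian`, the toy energy, and all numeric settings are assumptions made for this example; this is not the PCX optimizer.

```python
import numpy as np


def noisy_sgd_step(h, grad, lr, noise_std, rng):
    """One first-order Langevin update: gradient step plus sqrt(2 * lr)-scaled noise."""
    noise = noise_std * rng.standard_normal(h.shape)
    return h - lr * grad + np.sqrt(2.0 * lr) * noise


def sample_gaussian(num_chains=10000, steps=2000, lr=0.01, seed=0):
    """Run independent chains on the toy energy F(h) = h^2 / 2 (an assumption)."""
    rng = np.random.default_rng(seed)
    h = np.full(num_chains, 5.0)  # start far from the mode on purpose
    for _ in range(steps):
        grad = h                  # gradient of h^2 / 2
        h = noisy_sgd_step(h, grad, lr, noise_std=1.0, rng=rng)
    return h


samples = sample_gaussian()
print(samples.mean(), samples.var())
```

With unit noise variance, the chains forget their initial value and their empirical mean and variance approach those of the standard Gaussian, up to a small discretisation error that shrinks with the learning rate.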

Experiments. All the MCPC experiments use feedforward models with Squared Error (SE) loss. The SE loss of the state layer $h_L$ is also scaled by a variance parameter $\sigma^2_{h_L}$. This additional parameter is introduced to prevent the Gaussian layer $h_L$ from having a variance much larger than the variance of the data, which would prevent learning. Moreover, for unconditional learning and generation, the layer $h_0$ is left unclamped during both training and generation. In contrast, for the conditional learning task on MNIST, the layer $h_0$ is clamped to labels during training and generation.

For the Iris dataset, we train a model with layer dimensions [2 x 64 x 2], tanh activation function and default parameter values (state learning rate $\gamma = 0.01$, state momentum = 0.9, noise state variance $\sigma_{mcpc}^2 = 1$, parameter learning rate $lr_\theta$, parameter decay = 0.0001, Adam parameter optimizer, layer variance $\sigma^2_{h_L} = 0.01$ and a batch size of 150). We use 500 state update steps during learning and 10000 for generation.

For the unconditional learning task on MNIST, we train models with layer dimensions [30 x 256 x 256 x 256 x 784]. The model hyperparameters for MCPC and VAE were determined using the hyperparameter search shown in Table 10 to optimize the FID and the inception score separately. Refer to the code for exact optimal parameter values. We use 1000 state update steps during learning and 10000 for generation.

For the conditional learning task on MNIST, we train models with layer dimensions [2 x 256 x 256 x 256 x 784]. The labels used in this task, clamped to $h_0$, specify whether an image corresponds to an even or odd number. The model hyperparameters are determined using the search space shown in Table 10. We use 1000 state update steps during learning and 10000 for generation.

Results. Figure 13 shows samples generated by the trained models for hyperparameters that maximize the inception score.

Table 10: Bayes hyperparameter search configuration for MCPC and VAE (where applicable) on MNIST.
Parameter | Value
activation | {ReLU, Silu, Tanh, Leaky-ReLU, Hard-Tanh}
$\gamma$ | log-uniform(0.0001, 0.05)
momentum | {0.0, 0.9}
$\sigma^2_{mcpc}$ | {1.0, 0.3, 0.01, 0.001}
$lr_\theta$ | log-uniform(0.0001, 0.1)
parameter decay | {0.0, 0.1, 0.01, 0.001, 0.0001}
$\sigma^2_{h_L}$ | log-uniform(0.03, 1.0)
batch size | {150, 300, 600, 900}
Figure 13: Samples generated by trained models that optimize the inception score under the unconditional and conditional learning regimes.

C.3 Associative memories

This section describes the experimental setup of associative memory tasks.

Model.

A generative PCN is first trained on $n$ images sampled from the Tiny ImageNet dataset until its parameters have converged. Then, a corrupted version of the training images is presented to the sensory layer of the model ($h_L$) and we run inference ($\nabla h_l$) on all layers, including the sensory layer, until convergence. Note that in masked experiments, the intact top half of the images is kept fixed during inference. Intuitively, if the model has minimized its free energy with its sensory layer fixed at each of the $n$ training examples during training, it has formed attractors defined by these training examples and would thus tend to “refine” the corrupted images so that they fall back into the energy attractors.

Experiments.

Here, the benchmark results are obtained with Tiny ImageNet, corrupted with either Gaussian noise with 0.2 standard deviation, or a mask on the bottom half of the images (examples shown in Fig. 4). We vary the model size and the number of training examples to memorize, to study the capacity of the models. Specifically, we use a generative PCN with architecture $[512, d, d, 12288]$ where $d = [512, 1024, 2048]$ ($12288$ being the dimension of the flattened Tiny ImageNet images) and vary $n = [50, 100, 250]$. We performed a hyperparameter search for each $d$ and $n$ over the parameter learning rate $lr_\theta \in \{1 \times 10^{-4} + k \cdot 5 \times 10^{-5} \mid k \in \mathbb{Z}, 0 \leq k \leq 18\}$, the state learning rate $\gamma \in \{0.1 + k \cdot 0.05 \mid k \in \mathbb{Z}, 0 \leq k \leq 18\}$, training inference steps $T_{\text{train}} \in [20, 50, 100]$ and recall inference steps $T_{\text{recall}} \in [50000, 100000]$. We fix the activation function of the model to Tanh, the number of training epochs to $500$ and the batch size to $50$.
The results in Table 3 are obtained with $5$ seeds using the optimal hyperparameters found by the search, which are stored in the hps.json file under the examples/s4_2_generative_mode/associative_memory folder in the PCX library.
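The two corruption schemes and the two learning-rate grids described above can be sketched in NumPy as follows. This is only an illustration: the helper names `add_gaussian_noise` and `mask_bottom_half`, the channel-last image layout, the fill value used for masked pixels, and the random stand-in images are assumptions, and the grids read the index range as $0 \leq k \leq 18$.

```python
import numpy as np


def add_gaussian_noise(images, std=0.2, seed=0):
    """Return a copy of `images` corrupted by zero-mean Gaussian noise."""
    rng = np.random.default_rng(seed)
    return images + rng.normal(0.0, std, size=images.shape)


def mask_bottom_half(images, fill_value=0.0):
    """Return a copy with the bottom half of every image overwritten.

    `images` has shape (n, height, width, channels); `fill_value` is an assumption.
    """
    masked = images.copy()
    masked[:, images.shape[1] // 2:] = fill_value
    return masked


# Search grids for the parameter and state learning rates (19 values each).
lr_theta_grid = 1e-4 + 5e-5 * np.arange(19)  # 1e-4, 1.5e-4, ..., 1e-3
gamma_grid = 0.1 + 0.05 * np.arange(19)      # 0.1, 0.15, ..., 1.0

# Random stand-in for n = 50 images of size 64 x 64 x 3 (not real data).
images = np.random.default_rng(0).random((50, 64, 64, 3))
noisy = add_gaussian_noise(images)
masked = mask_bottom_half(images)
flat = masked.reshape(len(masked), -1)  # one 12288-dimensional vector per image
print(flat.shape, lr_theta_grid[-1], gamma_grid[-1])
```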

Appendix D Energy and Stability

This section describes the experimental setup of Section 5.1, provides replications on other datasets and ablations.

The code can be found in: examples/s5_analysis_and_metrics/energy_and_stability.

D.1 Energy propagation

We test a grid of models on multiple datasets to examine the energy propagation in the models. We test on the FashionMNIST, Two Moons, and Two Circles datasets. The Two Circles dataset is particularly interesting, as poor energy distribution intuitively results in a linear inductive bias (we primarily learn a one-layer network). This linear inductive bias harms the performance on Two Circles (linear model accuracy $\approx 50$%) more than on FashionMNIST ($\approx 83$%) and Two Moons ($\approx 86$%).

Experimental Setup.

We train a grid of feedforward PCNs with 2 hidden layers. We train on three datasets: FashionMNIST (as reported in the main body) and additionally Two Moons and Two Circles. For all models, we train for 8 epochs with $T = 8$ inference steps. States are optimized with SGD and forward initialization. The grid is formed over weight learning rates $lr_\theta \in \{1 \times 10^{-5}, 1 \times 10^{-4}, \dots, 1\}$, state learning rates $\gamma \in \{1 \times 10^{-3}, 3 \times 10^{-3}, 1 \times 10^{-2}, 3 \times 10^{-2}, 1 \times 10^{-1}, 3 \times 10^{-1}, 1\}$, activation functions $f \in \{\text{LeakyReLU}, \text{HardTanh}\}$ (the former is unbounded, the latter is bounded), optimization with AdamW or SGD with momentum $m \in \{0.0, 0.5, 0.9, 0.95\}$, and hidden widths of $\{512, 1024, 2048, 4096\}$ for FashionMNIST and $\{128, 256, 512, 1024\}$ for Two Moons and Two Circles. We replicate all experiments on 3 seeds for FashionMNIST and 10 seeds for the other datasets.
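To make the measured quantity concrete, the following NumPy sketch records the per-layer energies of a small feedforward PCN over $T = 8$ inference steps, with SGD on the states and forward initialization. It is a simplified illustration, not the PCX implementation: the prediction rule $\mu_{l+1} = f(h_l) W_l$, the squared-error energy, the function names, and the toy sizes are assumptions.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)


def leaky_relu_grad(x, slope=0.01):
    return np.where(x > 0, 1.0, slope)


def layer_energies(x, y, weights, T=8, gamma=0.1):
    """Return an array of shape (T + 1, num_layers) with the energy of each layer.

    Row t holds the energies after t state updates. The input state is
    clamped to x and the output state to y; only hidden states are updated.
    """
    num_layers = len(weights)
    h = [x]
    for W in weights[:-1]:
        h.append(leaky_relu(h[-1]) @ W)  # forward initialisation of hidden states
    h.append(y)                          # output state clamped to the label
    history = []
    for t in range(T + 1):
        # errors[l] is the prediction error of layer l + 1
        errors = [h[l + 1] - leaky_relu(h[l]) @ weights[l] for l in range(num_layers)]
        history.append([0.5 * np.sum(e ** 2) for e in errors])
        if t < T:
            for l in range(1, num_layers):  # hidden states only
                grad = errors[l - 1] - leaky_relu_grad(h[l]) * (errors[l] @ weights[l].T)
                h[l] = h[l] - gamma * grad
    return np.array(history)


rng = np.random.default_rng(0)
dims = [4, 16, 16, 3]  # input, two hidden layers, output (toy sizes)
weights = [rng.standard_normal((a, b)) / np.sqrt(a) for a, b in zip(dims, dims[1:])]
x = rng.standard_normal((5, 4))
y = np.eye(3)[rng.integers(0, 3, size=5)]  # one-hot toy labels
print(layer_energies(x, y, weights))
```

Under forward initialization, all hidden-layer energies are zero before the first update, so the entire energy initially sits in the last layer; in this sketch it reaches one additional earlier layer per inference step.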

Results.

Fig. 5(b) in the main paper shows the average energy across the last batch at the end of training for the best-performing model on the grid. Fig. 5(c) compares SGD with momentum $0.9$ and AdamW. It is obtained for activation function “HardTanh” and a width of 1024. We replicate this figure for the other combinations of activation functions and widths below in Fig. 14. We observe that across all conditions, small to medium state learning rates are generally preferred by SGD, while AdamW has a stronger preference for smaller state learning rates. Given the uneven distribution of energies across layers, Adam, in particular, may not scale to deeper architectures. We further observe a larger variance in performance for Adam, especially for wider layers, which we discuss in the paragraph “Training Instability” in Sec. 5.1 and below. Fig. 5(d) is based on all models trained with AdamW. Since many models with high state learning rates diverge, we only plot models achieving accuracy $> 0.5$.

Figure 14: Model accuracies for a range of combinations of activation functions and model widths. Adam prefers small learning rates and tends to be less stable than SGD. Obtained on FashionMNIST.
Figure 15: Energy propagation on the Two Moons dataset. 15(a) shows the imbalance between layers across $T$ steps. 15(b) shows the model performance across state learning rates and 15(c) the energy distribution across state learning rates.
Figure 16: Energy propagation on the Two Circles dataset. 16(a) shows the imbalance between layers across $T$ steps. 16(b) shows the model performance across state learning rates and 16(c) the energy distribution across state learning rates.

Here we present the results of experiments on the Two Moons and Two Circles datasets. Fig. 15(b), 15(a), and 15(c) replicate Fig. 5(b), 5(c), and 5(d) for Two Moons, and Fig. 16(b), 16(a), and 16(c) do so for Two Circles. Results are very similar to FashionMNIST: the energy is concentrated in the last layer, even after $T$ inference steps. However, in the example for Two Circles, we actually observe a training effect for earlier layers: while the energy first increases due to error propagation (still orders of magnitude below later layers), it is reduced afterwards. Energy ratios consistently indicate poor energy propagation for the state learning rates $\gamma$ that perform well. As predicted, the variance in results is significantly larger for Two Circles, especially for small state learning rates.

D.2 Training Stability

We test a grid of PCNs to analyze the interaction between model width, state learning rates and weight optimizers.

Experimental Setup.

We train models on FashionMNIST (as reported above) and Two Moons. We train feedforward PCNs (2 hidden layers) with “LeakyReLU” activations over a grid of parameters. All models are trained for 8 epochs. The widths of the hidden layers are $\{32, 64, \ldots, 4096\}$. State variables are trained for $T = 8$ steps with SGD and learning rates $\gamma \in \{1 \times 10^{-5}, 3 \times 10^{-5}, \ldots, 0.3\}$. The weights are updated with SGD or the Adam optimizer, with a learning rate of 0.01 for FashionMNIST and 0.03 for Two Moons. Both optimizers use 0.9 momentum for weights. We further train baseline BP models with the same hyperparameters. For FashionMNIST, we replicate each run over 3 random initializations, for Two Moons over 10.

Results.

We replicate Fig. 6 (FashionMNIST) here for the Two Moons dataset, see Fig. 17. We observe effects for Two Moons that are analogous to those for FashionMNIST presented above: the stability of optimization strongly depends on the width of the hidden layers for Adam. This effect is not observed for SGD on either dataset. This further supports our conclusion in Sec. 5.1: while Adam is the better optimizer, this interaction effect (width $\times$ $\gamma$) can hinder the scaling of PCNs with Adam. Optimization methods for PCNs require further attention from the research community.

Figure 17: The instability of optimization with Adam given architectural choices can be observed for Two Moons.
Figure 18: The instability of optimization as a result of an optimizer-architecture interaction can (at least partially) be attributed to the absolute size of layers.

Ablation.

We further provide an ablation on FashionMNIST. In the experiments above, the hidden layer width is altered, which changes the absolute size of the hidden layers (i.e., the number of neurons) but also the relative size of the hidden layers in the network, as the input and output layers remain the same size across all experiments. Hence, we provide another experiment on FashionMNIST, where we increase the image size and pad the label vector with $0$s, such that the width of all layers is equal. All other experimental variables remain as described above. The results are shown in Fig. 18 and follow the trend observed in Fig. 6 and 17: we find that there exists an interaction between the optimization and the width of the network, as described above. Hence, relative changes in layer width do not sufficiently explain the problem, and we conclude that the absolute size of the layers plays a role in the stability of optimization with AdamW.

D.3 Integrating Skip Connections into VGG19 to Enhance PC Networks for CIFAR10 Classification

Skip connections.

We investigate the integration of skip connections into the VGG19 architecture to enhance its performance on the CIFAR10 image classification task, showing a significant increase in maximum achieved test accuracy from 25.32% to 73.95%. The vanishing gradient problem, a notable challenge in deep Predictive Coding (PC) models, becomes pronounced with increased network depth, hindering error transmission to earlier layers and impacting learning efficacy. To address this, we introduce skip connections that allow gradients to bypass multiple layers, enhancing gradient flow and overall learning performance.

Table 11: Hyperparameter configuration and best accuracy for VGG19 with and without skip connections on CIFAR10.
Parameter | Range | Best Value
With Skip Connections
Epochs | 30 | 30
Batch size | 128 | 128
Activation functions | {GELU, Leaky ReLU} | Leaky ReLU
Optimizer for network parameters - Learning rate | {5e-2, 1e-1, 5e-1} | 0.5
Optimizer for network parameters - Momentum | {0.0, 0.5, 0.9, 0.99} | 0.5
Optimizer for weight parameters - Learning rate | 1e-4 | 1e-4
Optimizer for weight parameters - Weight decay | {5e-4, 1e-4, 5e-5} | 5e-4
Number of inference steps (T) | {24, 36} | 24
Best Accuracy | - | 73.95%
Without Skip Connections
Epochs | 30 | 30
Batch size | 128 | 128
Activation functions | {GELU, Leaky ReLU} | GELU (default)
Optimizer for network parameters - Learning rate | {5e-2, 1e-1, 5e-1} | 0.1
Optimizer for network parameters - Momentum | {0.0, 0.5, 0.9, 0.99} | 0.99
Optimizer for weight parameters - Learning rate | 1e-4 | 1e-4
Optimizer for weight parameters - Weight decay | {5e-4, 1e-4, 5e-5} | 1e-4
Number of inference steps (T) | {24, 36} | 24
Best Accuracy | - | 25.32%
Figure 19: Performance comparison of VGG19 with and without skip connections on the CIFAR-10 dataset over 30 epochs. The plot shows the mean test accuracy along with the shaded area representing the variability across three different seeds.

Results.

Our modified VGG19 model includes a skip connection from an early layer within the feature extraction stage, with the output flattened and adjusted using a linear layer before being reintegrated during the classification stage. The model was trained and evaluated on the CIFAR10 dataset, employing standard preprocessing techniques such as normalization and data augmentation (horizontal flips and rotations). Hyperparameter tuning over optimizers, learning rates, momentum values, and weight decay settings identified the best configuration for each model, with and without skip connections; the model with skip connections performs significantly better, as summarized in Table 11. Figure 19 shows the test accuracy progression over 30 epochs for the VGG19 model with and without skip connections on the CIFAR10 dataset, using three different seed values and identical hyperparameters for both simulations. All experiments and scripts used in these experiments can be found in the examples/s5_analysis_and_metrics/energy_and_stability/skip_connections folder of the PCX library.
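The data flow of such a skip connection can be sketched in NumPy as below. This is a hypothetical illustration rather than the architecture used in the experiments: the function name `classifier_input_with_skip`, the feature-map shapes, and in particular the choice to merge the projected early features by addition are assumptions, since the text only states that the early output is flattened, passed through a linear layer, and reintegrated in the classification stage.

```python
import numpy as np


def classifier_input_with_skip(early_features, late_features, W_skip, b_skip):
    """Combine late features with a linearly projected early feature map.

    early_features: (batch, C1, H1, W1) output of an early convolutional layer
    late_features:  (batch, C2, H2, W2) output of the last feature-extraction layer
    W_skip, b_skip: linear layer mapping flattened early features to the
                    flattened late-feature dimension
    """
    flat_early = early_features.reshape(early_features.shape[0], -1)
    flat_late = late_features.reshape(late_features.shape[0], -1)
    projected = flat_early @ W_skip + b_skip
    return flat_late + projected  # assumption: merge by addition


rng = np.random.default_rng(0)
early = rng.standard_normal((2, 8, 16, 16))  # toy early feature map
late = rng.standard_normal((2, 32, 1, 1))    # toy final feature map
W_skip = 0.01 * rng.standard_normal((8 * 16 * 16, 32))
b_skip = np.zeros(32)
print(classifier_input_with_skip(early, late, W_skip, b_skip).shape)
```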

Appendix E Properties of predictive coding networks

This section describes the experimental setup of Section 5.2 and displays the utility of using the free energy of a PCN classifier to differentiate between in-distribution (ID) and out-of-distribution (OOD) data Liu et al. [2020]. We show how one can compute the negative log-likelihood of various datasets Grathwohl et al. [2020] under the PCN. We further provide analyses on the relationship between maximum softmax values and energy values before convergence and after convergence at the state optimum. We compare results across multiple datasets to corroborate our results as well as to show how PCNs can be used for OOD detection out of the box based on a single trained PCN classifier for which we study the receiver operating characteristic (ROC) curve based on different percentiles of the softmax and energy scores.

E.1 Free energy and out-of-distribution data.

Experimental Setup.

We train a PCN classifier on MNIST using a feedforward PCN with 3 hidden layers, each of size $H = 512$, with “GELU” activation and cross-entropy loss in the output layer. We train the model until test error convergence using early stopping at epoch 75. During training, the state variables are optimized for $T = 10$ steps with SGD and state learning rate $\gamma = 0.01$ without momentum. The weights are optimized using the SGD optimizer with a momentum of $m_\theta = 0.9$ and a weight learning rate of $lr_\theta = 0.01$. During test-time inference, we optimize the state variables until convergence for $T = 100$ steps. To understand the confidence of a PCN’s predictions, we compare the distribution of energy for ID and OOD samples against the distribution of the softmax scores that the classifier generates. We compute negative log-likelihoods for ID and OOD samples under the PCN classifier via:

$$\mathcal{F} = -\ln p(x, y; \theta) \implies p(x, y; \theta) = e^{-\mathcal{F}}. \qquad (8)$$

We conduct the experiments on MNIST as the in-distribution (ID) dataset and we compare it against various out-of-distribution datasets such as notMNIST, KMNIST, EMNIST (letters) as well as FashionMNIST.

Results.

In the following, we briefly interpret the additional results, each of which is supported by a figure.

Figure 20: Energy distributions before and after state optimization.

In Fig. 20, we see how the energy is distributed at test time before and after state optimization. All OOD datasets have significantly larger initial and final energies than the ID dataset (MNIST).

Figure 21: Energy histograms against ID data before and after state optimization.

In Fig. 21, we then compare the energy distribution of each OOD dataset against that of the in-distribution dataset by overlaying the histograms of the energies before and after state optimization. The histograms reveal that a majority of the OOD samples do not overlap with the ID samples, which supports the idea that energy can be used for OOD detection.

Figure 22: Softmax histograms overlapped with ID dataset.

Next, in Fig. 22, we show what this pattern looks like when comparing the softmax scores of the ID and OOD datasets. The softmax scores are less informative for determining whether samples are OOD, as shown by the larger overlap in the range of softmax values that ID and OOD samples have in common.

Figure 23: Non-linear relationship between energy and softmax scores.

In Fig. 23, we further study the relationship between softmax scores and energy values before and after state convergence. The plot shows that while the energy and softmax scores are strongly correlated before inference, a non-linear relationship is evident after convergence, especially for smaller values where the model is more uncertain. This indicates that softmax scores and energy values do not fully agree on which samples we should have less confidence in.

Figure 24: Energy and NLL for various OOD datasets before and after inference.

In Fig. 24, we show the energy distributions for all datasets before and after inference. Each box plot represents a different scenario and a different dataset. In addition, we compute the NLL of each dataset and display it as part of the box plot labels. We observe that across all OOD datasets, the initial and final energy values are significantly higher than for the MNIST (ID) dataset. Furthermore, the variance of the energy scores is smaller for the in-distribution data, as shown by the absence of outlier samples for MNIST beyond the whiskers of the box plot. Finally, the NLL values for each scenario confirm this observation, with the likelihood of the MNIST data being significantly higher than that of the OOD distributions.

Figure 25: Performing OOD detection with PCN energy and classifier softmax scores.

Finally, in Fig. 25, we show how the PCN can be used to classify samples as belonging to the ID or some OOD data. We use the PCN classifier’s energy to perform OOD detection and show that the ROC curves for energy-based detection are superior to the ROC curves obtained from softmax scores. This becomes even clearer when looking at the most challenging samples, selected by taking the 25th percentile of the scores and energies, i.e., the samples that the PCN model is least confident about, as reflected by small energy or softmax values.
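A detector of this kind can be summarised by the area under the ROC curve. The NumPy sketch below computes it directly from two sets of scores, treating OOD samples as the positive class and a higher score as more OOD-like. The function name `auroc`, the pairwise formulation, and the toy numbers are assumptions for illustration only; they are not results from the experiments. For an energy-based detector the score would be the energy itself, and for a softmax-based detector one could use, for example, one minus the maximum softmax value.

```python
import numpy as np


def auroc(ood_scores, id_scores):
    """Area under the ROC curve for separating OOD (positive) from ID samples.

    Equals the probability that a random OOD sample receives a higher score
    than a random ID sample, with ties counted as one half.
    """
    ood = np.asarray(ood_scores, dtype=float)[:, None]
    ind = np.asarray(id_scores, dtype=float)[None, :]
    return float(np.mean((ood > ind) + 0.5 * (ood == ind)))


# Synthetic toy scores, purely to show the call signature.
toy_id_energy = [0.2, 0.4, 0.5, 0.9]
toy_ood_energy = [0.8, 1.5, 2.0, 3.1]
print(auroc(toy_ood_energy, toy_id_energy))
```

The pairwise comparison uses memory proportional to the product of the two sample counts, which is acceptable for a sketch but would be replaced by a rank-based computation for large evaluation sets.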

Appendix F Computational Resources

Fig. 8 was obtained by taking a small feedforward PCN made of 2 layers of 64 neurons each and training it on batches of 32 elements (generated as random noise so as to avoid any overhead due to loading training data to the GPU) for $T = 8$ steps. Then, each parameter was scaled independently to measure its effect on the total training time. Each model obtained this way was trained for 5 epochs and the mean time was reported. In all our timing measurements, we skip the first epoch to avoid including the JIT compilation time. Results were obtained on a GTX TITAN X, showing that parallelization is potentially achievable also on consumer GPUs.
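The timing protocol can be sketched as follows. This is a minimal illustration, not the benchmarking script used for Fig. 8: the helper name `mean_epoch_time` and the callable `train_epoch` are hypothetical, and the sketch assumes that the skipped warm-up epoch is run in addition to the 5 timed epochs.

```python
import time


def mean_epoch_time(train_epoch, num_epochs=5, num_warmup=1):
    """Mean wall-clock time per epoch, excluding the warm-up epochs.

    `train_epoch` is any zero-argument callable that runs one epoch. With a
    JIT-compiled JAX function, it should wait for the results to be ready
    (e.g. via `block_until_ready()`), since dispatch is asynchronous.
    """
    durations = []
    for _ in range(num_warmup + num_epochs):
        start = time.perf_counter()
        train_epoch()
        durations.append(time.perf_counter() - start)
    timed = durations[num_warmup:]  # drop the epochs that include compilation
    return sum(timed) / len(timed)
```

A call such as `mean_epoch_time(lambda: run_one_epoch(model, batches))`, where `run_one_epoch` stands for the user's own training loop, then returns the average duration of the timed epochs in seconds.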