Learning Sequence Attractors in Recurrent Networks with Hidden Neurons

Yao Lu Si Wu

Abstract

The brain is geared toward processing temporal sequence information. It remains largely unclear how the brain learns to store and retrieve sequence memories. Here, we study how recurrent networks of binary neurons learn sequence attractors to store predefined pattern sequences and retrieve them robustly. We show that to store arbitrary pattern sequences, it is necessary for the network to include hidden neurons, even though their role in displaying sequence memories is indirect. We develop a local learning algorithm to learn sequence attractors in networks with hidden neurons. The algorithm is proven to converge and lead to sequence attractors. We demonstrate that the network model can store and retrieve sequences robustly on synthetic and real-world datasets. We hope that this study provides new insights into sequence memory and temporal information processing in the brain.

Keywords: Recurrent networks, attractor networks, sequence memory, sequence attractors

Journal: Neural Networks

Affiliation: School of Psychological and Cognitive Sciences, IDG/McGovern Institute for Brain Research, Beijing Key Laboratory of Behavior and Mental Health, Peking-Tsinghua Center for Life Sciences, Center of Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, China

1 Introduction

The brain is geared toward processing temporal sequence information. Taking visual recognition as an example, the conventional setting of static image processing never occurs in the brain. Starting from the retina, the visual inputs of an image arrive in the form of optical flow, which is transformed into spike trains of retinal ganglion cells and then transmitted through LGN, V1, V2, V4 and higher cortical regions until the image is recognized. Along the whole pathway, the computations performed by the brain take the form of temporal sequence processing rather than static processing. As another example, when we recall an episodic memory, a sequence of events represented by neuronal responses flows into our mind; these events do not come in disorder or isolation, but unfold in time, as we experience “mental time travel” [Tulving, 2002]. Physiological and behavioral studies have revealed that the hippocampus is essential for sequence memories. In animals, sequences of neural activity patterns are observed in the hippocampus during memory replay and memory-related tasks [Nádasdy et al., 1999; Lee and Wilson, 2002; Foster and Wilson, 2006; Pastalkova et al., 2008; Davidson et al., 2009; Pfeiffer and Foster, 2013, 2015]. The discovery of time cells in the hippocampus shows that the brain has specialized neurons encoding the temporal structure of events [Eichenbaum, 2014]. Overall, the processing of temporal sequences is critical to the brain, but computational modeling of it lags far behind that of static information processing.

Attractor neural networks are promising computational models for elucidating how the brain represents, memorizes and processes information [Amari, 1972; Hopfield, 1982; Amit, 1989]. An attractor network is a type of recurrent network in which information is stored as stable states (attractors) of the network. Once stored, a memory can be retrieved robustly by the network dynamics from a noisy or incomplete cue. Experimental evidence indicates that the brain employs attractor networks for memory-related tasks [Khona and Fiete, 2022]. By considering a simplified neuron model and threshold dynamics, the classical Hopfield networks have successfully elucidated how recurrent networks learn to store static memory patterns [Hopfield, 1982]. Recurrent networks of binary neurons can also generate pattern sequences by employing asymmetric weight connections [Amari, 1972; Hopfield, 1982; Kleinfeld, 1986; Sompolinsky and Kanter, 1986; Bressloff and Taylor, 1992], which helps explain the sequential neural activities widely observed in the brain (e.g., during memory retrieval in the hippocampus). In this paper, we follow and extend the standard form of recurrent networks of binary neurons with threshold dynamics, as this enables us to pursue theoretical analysis, and we investigate how the networks learn to store sequence attractors. By a sequence attractor, we mean that the state of a recurrent network evolves in the order of the stored pattern sequence and that this evolution is robust to noise.

In Section 3, we first show that recurrent networks containing only visible binary neurons [Amari, 1972; Hopfield, 1982; Kleinfeld, 1986; Sompolinsky and Kanter, 1986; Bressloff and Taylor, 1992] are inadequate for storing arbitrary pattern sequences. In Section 4, we then argue that it is necessary for the networks to include hidden neurons. These neurons are not directly involved in expressing pattern sequences, but they are indispensable for the networks to store and retrieve arbitrary pattern sequences. In Section 5, we develop a local learning algorithm to learn sequence attractors in the networks with hidden neurons. The algorithm is proven to converge and lead to sequence attractors. In Section 6, we demonstrate that our network model can learn to store and retrieve pattern sequences robustly on synthetic and real-world datasets.

2 Related Work and Our Contributions

Learning temporal sequences in recurrent networks has been studied previously in the field of computational neuroscience. These works employ different forms of recurrent networks and have different focuses of investigation. Specifically, [Amari, 1972; Hopfield, 1982; Kleinfeld, 1986; Sompolinsky and Kanter, 1986; Bressloff and Taylor, 1992; Fiete et al., 2010] investigated recurrent networks of binary neurons and simple threshold dynamics. This approach takes advantage of simplified models that capture the essential features of neural dynamics and allows theoretical analysis. [Brea et al., 2013; Tully et al., 2016] investigated recurrent networks of spiking neurons, which are more biologically realistic but hard to analyze theoretically. [Laje and Buonomano, 2013; Rajan et al., 2016; Gillett et al., 2020; Rajakumar et al., 2021] investigated recurrent networks of firing-rate neurons (e.g., sigmoid and linear-threshold), whose complexity lies between that of binary neurons and that of spiking neurons. More recently, [Karuvally et al., 2023; Chaudhry et al., 2023] employed modern Hopfield networks [Krotov and Hopfield, 2016] and [Tang et al., 2023] employed predictive coding networks for modeling sequence memory.

In this paper, we study and extend the classical recurrent network model [Amari, 1972; Hopfield, 1982; Kleinfeld, 1986; Sompolinsky and Kanter, 1986], with a focus on theoretical analysis. The main contributions of our work, in comparison to related work, are summarized below.

1. We highlight the importance of hidden neurons in recurrent networks of binary neurons for learning arbitrary pattern sequences. [Amari, 1972; Hopfield, 1982; Sompolinsky and Kanter, 1986; Bressloff and Taylor, 1992; Fiete et al., 2010] considered only visible neurons, and hence the pattern sequences they can generate are limited. Although [Laje and Buonomano, 2013; Brea et al., 2013; Rajakumar et al., 2021; Chaudhry et al., 2023; Tang et al., 2023] also employed hidden neurons in sequence learning, they are based on different network architectures or neuron models.

2. We give a clear theoretical characterization of the sequences that can be generated by our networks (Theorem 1), a result that is lacking in the other related work. Although this conclusion comes from the analysis of the simple model we use, it lays a foundation for future work to test it in biologically more realistic networks. Although [Chaudhry et al., 2023] also provided a theoretical characterization of the network capacity, it is based on Rademacher (random and uniformly distributed) sequence patterns and several approximations.

3. Our learning algorithm is proven to converge and lead to sequence attractors, while most previous work only provided empirical evidence for the effectiveness of their learning algorithms [Laje and Buonomano, 2013; Brea et al., 2013; Rajan et al., 2016; Rajakumar et al., 2021]. Although [Amari, 1972; Bressloff and Taylor, 1992] gave provable results on sequence attractors, they did not include hidden neurons, and hence the results only hold for a restricted class of sequences.

4. Our learning algorithm only requires local information between neurons, which is believed to be biologically plausible. [Rajakumar et al., 2021] used backpropagation, which is often criticized for its biological implausibility as it requires gradient computation and has the weight transport problem [Lillicrap et al., 2020].

3 Limitation of Networks without Hidden Neurons

We first consider recurrent networks of $N$ visible binary neurons [Amari, 1972; Hopfield, 1982; Sompolinsky and Kanter, 1986; Bressloff and Taylor, 1992]. All the neurons are bidirectionally connected and their weight matrix is $\mathbf{W}$, of which $W_{ij}$ denotes the synaptic weight from the $j$-th neuron to the $i$-th neuron. Let $\bm{\xi}(t)=(\xi_{1}(t),...,\xi_{N}(t))^{\top}\in\{-1,1\}^{N}$ be the states of the neurons at time $t$. These states are synchronously updated according to the threshold dynamics, for $i=1,...,N$,

$$\xi_{i}(t+1)=\text{sign}\Big(\sum_{j=1}^{N}W_{ij}\xi_{j}(t)\Big), \tag{1}$$

where $\text{sign}(x)=1$ if $x\geq 0$ and $\text{sign}(x)=-1$ otherwise. The bias parameters are omitted here as they can be absorbed into the equation. Given a pair of successive network states $\bm{\xi}(t)$ and $\bm{\xi}(t+1)$ , the dynamics of the network can be unfolded in time and viewed as a feedforward network, in which each output neuron is a perceptron of the inputs.

Given a sequence in the form of $\mathbf{x}(1),...,\mathbf{x}(T)\in\{-1,1\}^{N}$, one can use a learning algorithm to adjust $\mathbf{W}$ such that the evolution of the network state matches the pattern sequence. Although the networks can generate some sequences of maximal length $2^{N}$ [Muscinelli et al., 2017], they are fundamentally limited in the class of sequences that they can generate. Since each neuron can be regarded as a perceptron, the condition for the sequence $\mathbf{x}(1),...,\mathbf{x}(T)$ to be generated by the network is that, for each $i$, the dataset $\{(\mathbf{x}(t),x_{i}(t+1))\}_{t=1}^{T-1}$ is linearly separable [LeCun, 1986; Bressloff and Taylor, 1992; Brea et al., 2013; Muscinelli et al., 2017].

For a simple example of sequences which cannot be generated by the networks, consider the sequence

$$\begin{pmatrix}1\\ 1\end{pmatrix},\ \begin{pmatrix}1\\ -1\end{pmatrix},\ \begin{pmatrix}-1\\ 1\end{pmatrix},\ \begin{pmatrix}-1\\ -1\end{pmatrix},\ \begin{pmatrix}1\\ 1\end{pmatrix}$$

with $N=2$ and $T=5$. To generate this sequence, the first neuron of the network needs to map $(1,1)^{\top}$ to $1$, $(1,-1)^{\top}$ to $-1$, $(-1,1)^{\top}$ to $-1$ and $(-1,-1)^{\top}$ to $1$. This mapping is essentially the XOR operation, which cannot be performed by a perceptron [Minsky and Papert, 1969].
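The non-separability can be checked by hand. As a sketch, suppose the first neuron has weights $w_1$, $w_2$ and, for generality, an explicit bias $b$ (these symbols are introduced here for illustration; Eq. (1) absorbs the bias). With $\text{sign}(x)=1$ for $x\geq 0$, the four required input-output pairs give:

```latex
\begin{align}
 w_1 + w_2 + b &\geq 0, & -w_1 - w_2 + b &\geq 0, \\
 w_1 - w_2 + b &< 0,    & -w_1 + w_2 + b &< 0 .
\end{align}
```

Adding the two inequalities in the first row gives $2b\geq 0$, while adding the two in the second row gives $2b<0$. The two cannot hold together, so no choice of $w_1$, $w_2$ and $b$ realizes the mapping.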

In Figure 1, we show additional, synthetically constructed examples of sequences that cannot be generated by the networks. We test whether the perceptron learning algorithm can learn these sequences. Since the algorithm converges whenever the linear separability condition is met [Minsky and Papert, 1969], its failure to converge implies that the sequences cannot be generated by the networks.
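A test of this kind can be sketched in a few lines of NumPy. The code below is an illustration rather than the authors' experimental script: the function name `perceptron_can_learn`, the epoch cutoff `max_epochs`, the appended constant input that plays the role of the bias, and the second (separable) example sequence are all assumptions made here.

```python
import numpy as np

def sign(a):
    """Threshold function of Eq. (1): +1 where a >= 0, otherwise -1."""
    return np.where(a >= 0, 1, -1)

def perceptron_can_learn(seq, max_epochs=1000):
    """Return True if every neuron learns its next-state map within max_epochs.

    seq has shape (T, N) with entries in {-1, +1}. A constant input of 1 is
    appended so that the bias is learned as an ordinary weight (an assumption
    of this sketch).
    """
    seq = np.asarray(seq)
    inputs = np.hstack([seq[:-1], np.ones((len(seq) - 1, 1), dtype=int)])
    targets = seq[1:]
    for i in range(seq.shape[1]):            # one perceptron per visible neuron
        w = np.zeros(inputs.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for x, y in zip(inputs, targets[:, i]):
                if sign(w @ x) != y:          # classic mistake-driven update
                    w += y * x
                    mistakes += 1
            if mistakes == 0:
                break                         # this neuron fits all transitions
        else:
            return False                      # no error-free epoch was reached
    return True

xor_like = [[1, 1], [1, -1], [-1, 1], [-1, -1], [1, 1]]     # sequence from the text
gray_cycle = [[1, 1], [1, -1], [-1, -1], [-1, 1], [1, 1]]   # hypothetical separable example
print(perceptron_can_learn(xor_like), perceptron_can_learn(gray_cycle))
```

Hitting the epoch cutoff is only evidence of non-separability, not a proof; for the two-neuron sequence in the text, the impossibility follows from the XOR argument.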

Figure 1: Two example sequences which cannot be generated by networks without hidden neurons. White squares denote positive ones and black squares denote negative ones.

4 Networks with Hidden Neurons

To overcome the limitation of networks with only visible neurons, we consider including a group of hidden neurons in the network. The visible and hidden neurons are bidirectionally connected, and there are no connections among the visible neurons or among the hidden neurons. Let $\mathbf{U}$ be the weight matrix from visible neurons to hidden neurons, of which $U_{ij}$ denotes the synaptic weight from the $j$-th visible neuron to the $i$-th hidden neuron, and let $\mathbf{V}$ be the weight matrix from hidden neurons to visible neurons, of which $V_{ji}$ denotes the synaptic weight from the $i$-th hidden neuron to the $j$-th visible neuron. Let $\bm{\xi}(t)=(\xi_{1}(t),...,\xi_{N}(t))^{\top}\in\{-1,1\}^{N}$ be the states of visible neurons and $\bm{\zeta}(t)=(\zeta_{1}(t),...,\zeta_{M}(t))^{\top}\in\{-1,1\}^{M}$ be the states of hidden neurons at time $t$. These states are synchronously updated according to, for $i=1,...,M$ and $j=1,...,N$,

\begin{align}
\zeta_{i}(t) &= \text{sign}\Big(\sum_{k=1}^{N}U_{ik}\xi_{k}(t)\Big), \tag{2}\\
\xi_{j}(t+1) &= \text{sign}\Big(\sum_{k=1}^{M}V_{jk}\zeta_{k}(t)\Big), \tag{3}
\end{align}

where we omit the bias parameters as they can be absorbed into the equations. As illustrated in Figure 2, given a pair of successive network states $\bm{\xi}(t)$ and $\bm{\xi}(t+1)$ , the dynamics of the network can be unfolded in time and viewed as a feedforward network with a hidden layer of neurons.

Figure 2: Recurrent network with hidden neurons. The red circles denote visible neurons and the white circles denote hidden neurons.

Networks with $M$ hidden neurons can generate arbitrary sequences with the Markov property, of length at least $M$, as stated in Theorem 1. In the Appendix, we provide a constructive proof based on a one-hot encoding by the hidden neurons.

Theorem 1.

Let $\mathbf{x}(1),...,\mathbf{x}(T)\in\{-1,1\}^{N}$ such that $\mathbf{x}(i)\neq\mathbf{x}(j)$ for $i\neq j$, except that $\mathbf{x}(1)=\mathbf{x}(T)$. Then $\mathbf{x}(1),...,\mathbf{x}(T)$ can be generated by the network defined in (2)-(3) with $M=T-1$.
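One way to picture the one-hot idea is the following NumPy sketch, in which hidden neuron $i$ responds only to pattern $\mathbf{x}(i)$ and drives the visible neurons toward $\mathbf{x}(i+1)$. It is written in the spirit of the theorem and may differ in detail from the proof in the Appendix. The explicit bias vectors `b_hidden` and `b_visible` are assumptions of this sketch, written out here although (2)-(3) absorb them, and the function names are illustrative.

```python
import numpy as np

def sign(a):
    """Threshold function: +1 where a >= 0, otherwise -1."""
    return np.where(a >= 0, 1, -1)

def one_hot_network(seq):
    """Build (U, b_hidden, V, b_visible) that replays seq, of shape (T, N).

    Assumes the first T-1 patterns are distinct and seq[-1] equals seq[0].
    """
    seq = np.asarray(seq)
    T, N = seq.shape
    U = seq[:-1].copy()                    # hidden neuron i is tuned to pattern x(i)
    b_hidden = -(N - 1) * np.ones(T - 1)   # overlap is N on a match, at most N-2 otherwise
    V = seq[1:].T.copy()                   # column i holds the successor pattern x(i+1)
    b_visible = V.sum(axis=1)              # cancels the -1 states of the inactive hidden neurons
    return U, b_hidden, V, b_visible

def step(xi, U, b_hidden, V, b_visible):
    """One synchronous update, Eqs. (2)-(3) with explicit biases."""
    zeta = sign(U @ xi + b_hidden)         # +1 for the matching hidden neuron, -1 elsewhere
    return sign(V @ zeta + b_visible)

# Draw T-1 distinct random patterns, then close the cycle with x(T) = x(1).
rng = np.random.default_rng(0)
N, T = 8, 6
patterns = []
while len(patterns) < T - 1:
    x = rng.choice([-1, 1], size=N)
    if not any((x == p).all() for p in patterns):
        patterns.append(x)
seq = np.array(patterns + [patterns[0]])

params = one_hot_network(seq)
xi, replayed = seq[0], [seq[0]]
for _ in range(T - 1):
    xi = step(xi, *params)
    replayed.append(xi)
replayed = np.array(replayed)
print(np.array_equal(replayed, seq))
```

The construction uses exactly $M=T-1$ hidden neurons, one per distinct pattern. It is exact for clean inputs but makes no attempt at robustness, which is what the learning algorithm of the next section addresses.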

5 Learning

To learn the weight matrices, one can first unfold the recurrent network with hidden neurons in time such that it becomes a feedforward network with a hidden layer, and the pairs of successive patterns in the sequence constitute the training examples, as illustrated in Figure 2. However, learning in the unfolded feedforward network is difficult: the backpropagation algorithm cannot be applied because the threshold activation of the neurons is not differentiable.

We propose a new learning algorithm to learn the weight matrices in the unfolded feedforward networks, which draws inspiration from three ideas: feedback alignment [Lillicrap et al., 2016], target propagation [LeCun, 1987; Bengio, 2014; Litwin-Kumar et al., 2017] and three-factor rules [Frémaux and Gerstner, 2016; Kuśmierz et al., 2017]. As in feedback alignment, it requires a random matrix $\mathbf{P}$, which is fixed during the learning process, to backpropagate signals. As in target propagation, it propagates targets rather than errors, in order to create surrogate targets for the hidden neurons. Each weight parameter is updated by a three-factor rule, in which the presynaptic activation, the postsynaptic activation and an error term acting as neuromodulation are multiplied. The three-factor rule is similar to the one in [Bressloff and Taylor, 1992] and is known as the margin perceptron in the machine learning literature [Collobert and Bengio, 2004].

The algorithm works as follows. Given a pair of successive patterns $\mathbf{x}(t)$ and $\mathbf{x}(t+1)$ , for $i=1,...,M$ and $j=1,...,N$ in parallel,

Update $\mathbf{U}$ by

\begin{align}
z_{i}(t+1) &= \text{sign}\Big(\sum_{k=1}^{N}P_{ik}x_{k}(t+1)\Big), \tag{4}\\
\mu_{i}(t) &= H\Big(\kappa-z_{i}(t+1)\sum_{k=1}^{N}U_{ik}x_{k}(t)\Big), \tag{5}\\
U_{ij} &\leftarrow U_{ij}+\eta\mu_{i}(t)z_{i}(t+1)x_{j}(t). \tag{6}
\end{align}

Update $\mathbf{V}$ by

\begin{align}
y_{i}(t) &= \text{sign}\Big(\sum_{k=1}^{N}U_{ik}x_{k}(t)\Big), \tag{7}\\
\nu_{j}(t) &= H\Big(\kappa-x_{j}(t+1)\sum_{k=1}^{M}V_{jk}y_{k}(t)\Big), \tag{8}\\
V_{ji} &\leftarrow V_{ji}+\eta\nu_{j}(t)x_{j}(t+1)y_{i}(t), \tag{9}
\end{align}

where $P_{ik}$ denotes the $(i,k)$ entry of the fixed random matrix $\mathbf{P}$ , $H(\cdot)$ is the Heaviside function ( $H(x)=1$ if $x\geq 0$ and $H(x)=0$ otherwise), $\kappa>0$ is the robustness hyperparameter and $\eta>0$ is the learning rate hyperparameter. $\mu_{i}(t)$ and $\nu_{j}(t)$ can be interpreted as the error terms for the hidden and the visible neurons, respectively. $z_{i}(t+1)$ can be interpreted as the synaptic input from an external neuron. The above procedure is then repeated for each $t$ .
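The update rules (4)-(9) can be written compactly in vectorized form. The following NumPy sketch is an illustration under stated assumptions, not the authors' MATLAB or PyTorch code: the function names `train` and `replay` and the parameter `init_std` are introduced here, the default hyperparameters follow the values reported in Section 6, $\mathbf{y}(t)$ in (7) is computed after $\mathbf{U}$ has been updated (the text does not fix this order), and the demonstration uses a larger learning rate than the paper to keep the run short.

```python
import numpy as np

def sign(a):
    """Threshold function: +1 where a >= 0, otherwise -1."""
    return np.where(a >= 0, 1, -1)

def heaviside(a):
    """H(a) = 1 where a >= 0, otherwise 0."""
    return (a >= 0).astype(float)

def train(seq, M, epochs=500, eta=1e-3, kappa=1.0, init_std=1e-3, seed=0):
    """Learn U (M x N) and V (N x M) from seq of shape (T, N), entries in {-1, +1}."""
    seq = np.asarray(seq, dtype=float)
    N = seq.shape[1]
    rng = np.random.default_rng(seed)
    U = rng.normal(0.0, init_std, (M, N))
    V = rng.normal(0.0, init_std, (N, M))
    P = rng.normal(0.0, init_std, (M, N))             # fixed random matrix, never updated
    for _ in range(epochs):
        for x_t, x_next in zip(seq[:-1], seq[1:]):
            z = sign(P @ x_next)                      # Eq. (4): surrogate hidden target
            mu = heaviside(kappa - z * (U @ x_t))     # Eq. (5): hidden error term
            U += eta * np.outer(mu * z, x_t)          # Eq. (6): three-factor update
            y = sign(U @ x_t)                         # Eq. (7), with the updated U (assumed order)
            nu = heaviside(kappa - x_next * (V @ y))  # Eq. (8): visible error term
            V += eta * np.outer(nu * x_next, y)       # Eq. (9): three-factor update
    return U, V

def replay(U, V, xi, steps):
    """Run the dynamics of Eqs. (2)-(3) from state xi for the given number of steps."""
    states = [xi]
    for _ in range(steps):
        xi = sign(V @ sign(U @ xi))
        states.append(xi)
    return np.array(states)

# Small demonstration on a random periodic sequence (sizes chosen for speed).
rng = np.random.default_rng(1)
N, T, M = 16, 5, 64
seq = rng.choice([-1, 1], size=(T, N))
seq[-1] = seq[0]                                      # periodic: x(T) = x(1)
U, V = train(seq, M, eta=1e-2)                        # larger eta than in the paper, for a quick demo
replayed = replay(U, V, seq[0], T - 1)
print(np.array_equal(replayed, seq))
```

Each update touches a weight only through its presynaptic state, its postsynaptic target and the error term of the postsynaptic neuron, which is the locality property emphasized in the text.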

5.1 Analysis

In this section, we provide a theoretical analysis of the algorithm. The proofs are left to the Appendix. First, we provide a convergence guarantee for the algorithm.

Theorem 2.

Given the definitions in (4)(5)(7)(8), for all $i$ , $j$ and $t$ , if a solution exists such that $\mu_{i}(t)=0$ and $\nu_{j}(t)=0$ , then the algorithm (4)-(9) converges in finite steps given $U_{ij}$ and $V_{ji}$ are initialized to zero.

Next, we show that the algorithm can reduce the error $\mu_{i}(t)$ in a single step of updating $\mathbf{U}$. The theorem can be trivially extended to $\nu_{j}(t)$ and $\mathbf{V}$ by a similar proof.

Theorem 3.

Given the definitions in (4)(5), let

\begin{align}
U_{ik}^{\prime} &= U_{ik}+\eta\mu_{i}(t)z_{i}(t+1)x_{k}(t), \tag{10}\\
\mu_{i}^{\prime}(t) &= H\Big(\kappa-z_{i}(t+1)\sum_{k=1}^{N}U_{ik}^{\prime}x_{k}(t)\Big). \tag{11}
\end{align}

Then $\mu_{i}^{\prime}(t)=0$ for sufficiently large $\eta>0$ .
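A short calculation indicates why a large enough learning rate suffices. This is a sketch consistent with (10)-(11), not a substitute for the proof in the Appendix; it uses only $z_{i}(t+1)^{2}=1$ and $x_{k}(t)^{2}=1$:

```latex
\begin{align}
z_{i}(t+1)\sum_{k=1}^{N}U_{ik}^{\prime}x_{k}(t)
 &= z_{i}(t+1)\sum_{k=1}^{N}U_{ik}x_{k}(t)
    + \eta\,\mu_{i}(t)\,z_{i}(t+1)^{2}\sum_{k=1}^{N}x_{k}(t)^{2} \\
 &= z_{i}(t+1)\sum_{k=1}^{N}U_{ik}x_{k}(t) + \eta\,\mu_{i}(t)\,N .
\end{align}
```

If $\mu_{i}(t)=0$, the weights are unchanged and $\mu_{i}^{\prime}(t)=0$ already. If $\mu_{i}(t)=1$, the argument of $H$ in (11) becomes negative, and hence $\mu_{i}^{\prime}(t)=0$, as soon as $\eta>\big(\kappa-z_{i}(t+1)\sum_{k}U_{ik}x_{k}(t)\big)/N$.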

To understand why reducing the errors $\mu_{i}(t)$ and $\nu_{j}(t)$ leads to sequence attractors, we present the following result.

Theorem 4.

Given the definitions in (7)(8), let $\hat{\mathbf{y}}(t)=(\hat{y}_{1}(t),...,\hat{y}_{M}(t))^{\top}\in\{-1,1\}^{M}$ such that $\sum_{k}|\hat{y}_{k}(t)-y_{k}(t)|<\epsilon$ . If $\nu_{j}(t)=0$ and

$$\epsilon\cdot\max_{k}|V_{jk}|<\kappa, \tag{12}$$

then

$$x_{j}(t+1)=\text{sign}\Big(\sum_{k=1}^{M}V_{jk}\hat{y}_{k}(t)\Big). \tag{13}$$

Theorem 4 shows that when the errors are zero, given perturbed hidden neuron states $\hat{\mathbf{y}}(t)$ , we have $\mathbf{x}(t+1)=\text{sign}(\mathbf{V}\hat{\mathbf{y}}(t))$ . The result can be trivially extended to show that given perturbed visible neuron states $\hat{\mathbf{x}}(t)$ we have $\mathbf{y}(t)=\text{sign}(\mathbf{U}\hat{\mathbf{x}}(t))$ by a similar proof. Therefore, the network can generate sequence $\mathbf{x}(1),...,\mathbf{x}(T)$ as an attractor. From Theorem 4, we can also see that $\kappa$ acts as the robustness hyperparameter as it controls the level of perturbation $\epsilon$ for inequality (12) to hold.
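The reasoning behind Theorem 4 can be sketched in one chain of inequalities. This is an outline under the theorem's assumptions, with the full proof left to the Appendix. Since $\nu_{j}(t)=0$ means $x_{j}(t+1)\sum_{k}V_{jk}y_{k}(t)>\kappa$, we have:

```latex
\begin{align}
x_{j}(t+1)\sum_{k=1}^{M}V_{jk}\hat{y}_{k}(t)
 &= x_{j}(t+1)\sum_{k=1}^{M}V_{jk}y_{k}(t)
    + x_{j}(t+1)\sum_{k=1}^{M}V_{jk}\big(\hat{y}_{k}(t)-y_{k}(t)\big) \\
 &> \kappa - \max_{k}|V_{jk}|\sum_{k=1}^{M}\big|\hat{y}_{k}(t)-y_{k}(t)\big| \\
 &\geq \kappa - \epsilon\cdot\max_{k}|V_{jk}| \;>\; 0 .
\end{align}
```

A strictly positive product means that the sign of the perturbed input equals $x_{j}(t+1)$, which is (13). Note that flipping one hidden neuron changes $|\hat{y}_{k}(t)-y_{k}(t)|$ from $0$ to $2$, so $\epsilon$ bounds twice the number of flipped hidden neurons.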

To understand why the algorithm works even though $\mathbf{P}$ is a random matrix that is fixed during learning, consider the following. If the update of $\mathbf{U}$ converges, then $\mu_{i}(t)=0$ for all $i$. Therefore,

\begin{align}
y_{i}(t) &= \text{sign}\Big(\sum_{k=1}^{N}U_{ik}x_{k}(t)\Big) \tag{14}\\
 &= z_{i}(t+1) \tag{15}\\
 &= \text{sign}\Big(\sum_{k=1}^{N}P_{ik}x_{k}(t+1)\Big). \tag{16}
\end{align}

The update of $\mathbf{V}$ aims at making the condition $\text{sign}\Big{(}\sum_{k=1}^{M}V_{jk}y_{k}(t)\Big{)}=x_{j}(t+1)$ hold, which is

$$\text{sign}(\mathbf{V}\,\text{sign}(\mathbf{P}\mathbf{x}(t+1)))=\mathbf{x}(t+1), \tag{17}$$

in matrix form when $y_{k}(t)$ is substituted by (16). For large $M$, a solution $\mathbf{V}$ exists, namely the pseudo-inverse of $\mathbf{P}$ or the transpose of $\mathbf{P}$. The numerical result is shown in Figure 3. The phenomenon might be explained by high-dimensional probability theory [Vershynin, 2018].
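The claim that the transpose of $\mathbf{P}$ solves (17) for large $M$ can be probed numerically. The sketch below is an independent illustration and not a reproduction of Figure 3: the function name `reconstruction_accuracy`, the unit-variance Gaussian $\mathbf{P}$, and the sizes are assumptions chosen here.

```python
import numpy as np

def sign(a):
    """Threshold function: +1 where a >= 0, otherwise -1."""
    return np.where(a >= 0, 1, -1)

def reconstruction_accuracy(N, M, n_patterns=100, seed=0):
    """Fraction of entries where sign(P^T sign(P x)) equals x, over random x."""
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(M, N))                    # fixed random matrix, as in Eq. (4)
    X = rng.choice([-1, 1], size=(n_patterns, N))  # one random pattern per row
    hidden = sign(X @ P.T)                         # rows are sign(P x)
    recovered = sign(hidden @ P)                   # rows are sign(P^T sign(P x)), i.e. V = P^T
    return float(np.mean(recovered == X))

for M in (10, 100, 4000):
    print(M, reconstruction_accuracy(N=20, M=M))
```

Each entry of $\mathbf{P}^{\top}\text{sign}(\mathbf{P}\mathbf{x})$ is a sum of $M$ terms that are positively correlated with the corresponding entry of $\mathbf{x}$, so the accuracy is expected to approach $1$ as $M$ grows relative to $N$.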

5.2 Robustness Hyperparameter

Having a hyperparameter $\kappa$ in the algorithm is not problematic in practice. One can simply set $\kappa=1$, as we did for all the experiments in the next section, and adjust the scale of the initial weights and the learning rate. In margin perceptron, the margin learned is disproportional to the learning rate [Collobert and Bengio, 2004]. The margin is defined as the reciprocal of the weight magnitude, which is related to the robustness hyperparameter, as shown in Theorem 4. Therefore, it can be interpreted that the robustness hyperparameter is automatically adjusted during learning.

6 Experiments

We ran experiments on synthetic and real-world sequence datasets, in which networks with hidden neurons learned sequence attractors with the algorithm proposed in the previous section. All the experiments were carried out in MATLAB and PyTorch. In all the experiments, each weight parameter of $\mathbf{U}$, $\mathbf{V}$ and $\mathbf{P}$ was sampled i.i.d. from a Gaussian distribution with mean zero and variance $1\times 10^{-6}$, and we set the learning rate $\eta=1\times 10^{-3}$ and the robustness $\kappa=1$. In each experiment, we ran the algorithm for $500$ epochs. In each epoch, the algorithm ran on $(\mathbf{x}(t),\mathbf{x}(t+1))$ from the start to the end of each sequence. No noise was added during learning. Noise was added only at retrieval. We also provide additional experiments in the Appendix.

6.1 Toy Examples

To show that recurrent networks with hidden neurons can overcome the limitation of networks without hidden neurons, we conducted experiments on the examples in Figure 1. We constructed a network with $N=10$ visible neurons and $M=50$ hidden neurons for each example. After learning, we tested the robustness of the networks in retrieval by flipping the states of two out of ten neurons (salt-and-pepper noise) in the first pattern of a sequence and setting the result as the initial network state.

The results are shown in Figure 4, from which we can see that the networks with hidden neurons can generate sequences which cannot be generated by the networks without hidden neurons and retrieve them robustly under moderate level of noise.

6.2 Random Sequences

We generated periodic sequences of random patterns $\mathbf{x}(1),...,\mathbf{x}(T)\in\{-1,1\}^{N}$ . In each sequence, $\mathbf{x}(i)\neq\mathbf{x}(j)$ for $i\neq j$ except that $\mathbf{x}(1)=\mathbf{x}(T)$ for the periodicity. We set $N=100$ and varied period length $T$ . We sampled each $\mathbf{x}(t)$ independently from the uniform distribution of $\{-1,1\}^{N}$ for $t=1,...,T-1$ and then resampled it if it is identical to a previous pattern. Finally, we set $\mathbf{x}(T)=\mathbf{x}(1)$ .

For each random sequence, we constructed a network with hidden neurons and applied the proposed learning algorithm. To evaluate the effectiveness of the learning algorithm, we compared learning only $\mathbf{V}$ (with $\mathbf{U}$ fixed during learning) and learning both $\mathbf{U}$ and $\mathbf{V}$. Once the learning was done, we tested whether the network could retrieve the sequence robustly when the initial network state $\bm{\xi}(1)$ was $\mathbf{x}(1)$ perturbed by salt-and-pepper noise that flipped $10$ entries. We define the retrieval to be successful if $\bm{\xi}(\tau+t)=\mathbf{x}(t)$ for some $\tau$ and all $t=1,...,T$. We ran $100$ trials for each $T$ or $M$ setting and counted the successful retrievals.
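The perturbation and the success criterion can be made concrete as follows. This is a sketch with assumed names (`flip_entries`, `retrieval_succeeds`) and an assumed cutoff `max_offset` on $\tau$, since the text only requires the match to hold for some $\tau$. A cyclic shift stands in for a trained network so that the example stays self-contained.

```python
import numpy as np

def flip_entries(x, n_flips, rng):
    """Salt-and-pepper noise: flip the sign of n_flips distinct entries of x."""
    noisy = np.array(x, copy=True)
    idx = rng.choice(len(noisy), size=n_flips, replace=False)
    noisy[idx] *= -1
    return noisy

def retrieval_succeeds(step, xi_init, seq, max_offset):
    """True if xi(tau + t) == x(t) for some 0 <= tau <= max_offset and all t = 1..T.

    step is any function mapping a network state to the next state;
    max_offset is a cutoff assumed in this sketch.
    """
    seq = np.asarray(seq)
    T = len(seq)
    states = [np.asarray(xi_init)]
    for _ in range(T + max_offset - 1):
        states.append(step(states[-1]))
    states = np.array(states)
    return any(np.array_equal(states[tau:tau + T], seq)
               for tau in range(max_offset + 1))

# Toy stand-in for a trained network: a cyclic shift with period 5, so T = 6.
base = np.array([1, -1, -1, -1, -1])
seq = np.array([np.roll(base, k) for k in range(6)])   # seq[5] equals seq[0]

def shift(xi):
    """Hypothetical step function used only for this illustration."""
    return np.roll(xi, 1)

rng = np.random.default_rng(0)
noisy = flip_entries(seq[0], 2, rng)
print(int(np.sum(noisy != seq[0])))                       # number of flipped entries
print(retrieval_succeeds(shift, seq[2], seq, max_offset=5))
```

The shift map has no error correction, so it recovers the sequence only from states on the cycle; a learned sequence attractor is expected to pass the same check from the noisy start as well.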

In Figure 5 (a), we show the results with various period lengths $T$ for $M=500$ . In Figure 5 (b), we show the results with various numbers of hidden neurons $M$ for $T=70$ . We can see learning both $\mathbf{U}$ and $\mathbf{V}$ is more effective than learning only $\mathbf{V}$ . However, in both cases, the algorithm failed for large $T$ , even if we increased the number of hidden neurons, which might be due to the suboptimality of the algorithm.

6.3 Real-World Sequences

We tested networks with hidden neurons, trained by our algorithm, on real-world sequences from a silhouette sequence dataset (OU-ISIR gait database large population [Iwama et al., 2012]) and a handwriting sequence dataset (Moving MNIST [Srivastava et al., 2015]). The patterns in the sequences are strongly correlated, since adjacent image frames are similar. To adapt the datasets for the networks to learn, we converted the image intensity values to $\pm 1$.

For the OU-ISIR gait dataset, we used a network with $M=200$ hidden neurons to learn a single image sequence of length $103$, in which each image has size $88\times 128$. The images were flattened to vectors of size $88\times 128=11264$. For the Moving MNIST dataset, we used a network with $M=1000$ hidden neurons to learn $20$ image sequences of length $20$, in which each image has size $64\times 64$. The images were flattened to vectors of size $64\times 64=4096$. In Figures 6 and 7, we visualize robust retrieval by the learned networks, in which the first image of a sequence was corrupted and set to be the initial state of the network.

In Figure 8, we show the average errors $\frac{1}{M}\sum_{t}\sum_{i}\mu_{i}(t)$ and $\frac{1}{N}\sum_{t}\sum_{j}\nu_{j}(t)$ during the learning process, from which we can see that both errors reduce to zero smoothly. This result demonstrates that $\mathbf{U}$ and $\mathbf{V}$ can be cooperatively learned instead of conflicting with each other during learning.

7 Discussion and Conclusion

In this paper, we have investigated how recurrent networks of binary neurons learn sequence attractors to represent temporal sequence information. We showed that to store arbitrary sequence patterns, it is necessary for the networks to include hidden neurons. We developed a local learning algorithm and demonstrated that our model works well on synthetic and real-world datasets. Thus, our work provides a possible, biologically plausible mechanism for sequence memory in the brain. In our model, hidden neurons are not directly involved in expressing pattern sequences. Instead, their contribution lies in facilitating the storing and retrieving of pattern sequences. The indirect but indispensable role of hidden neurons may have far-reaching implications for neural information processing.

Modern Hopfield networks also employ hidden neurons for sequence learning [Chaudhry et al., 2023]. Our approach differs in several aspects. First, our network model requires only the threshold activation function, while modern Hopfield networks require a polynomial or exponential activation function for the hidden neurons. Second, the weights in our network model are learned from scratch in an online manner, while modern Hopfield networks need to store the sequence patterns explicitly as the weights in a predefined manner. Third, our learning algorithm requires only local information between neurons, while modern Hopfield networks require the pseudo-inverse computation. Thus, our approach provides an alternative mechanism of sequence memory to modern Hopfield networks.

To pursue theoretical analysis, we have employed a very simple network model with binary neurons and threshold dynamics. From this simple model, we can gain some insights into the neural mechanisms of sequence processing in the brain (as the classical Hopfield networks did for static memories), but this simplification also incurs limitations. To fully validate our results, further research with biologically more plausible models is needed, including, for instance, biologically more realistic neuron models, synapse models, connection structures, learning rules and forms of pattern sequences. Additionally, we look forward to seeing improvements of our learning algorithm that yield larger network capacity and robustness.

Acknowledgements

This work was supported by the Science and Technology Innovation 2030-Brain Science and Brain-inspired Intelligence Project (No. 2021ZD0200204).

References

Amari, [1972] Amari, S.-I. (1972). Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions on Computers.
Amit, [1989] Amit, D. J. (1989). Modeling Brain Function: The World of Attractor Neural Networks. Cambridge University Press.
Bengio, [2014] Bengio, Y. (2014). How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906.
Brea et al., [2013] Brea, J., Senn, W., and Pfister, J.-P. (2013). Matching recall and storage in sequence learning with spiking neural networks. Journal of Neuroscience.
Bressloff and Taylor, [1992] Bressloff, P. C. and Taylor, J. G. (1992). Perceptron-like learning in time-summating neural networks. Journal of Physics A: Mathematical and General.
Chaudhry et al., [2023] Chaudhry, H. T., Zavatone-Veth, J. A., Krotov, D., and Pehlevan, C. (2023). Long sequence Hopfield memory. Advances in Neural Information Processing Systems.
Collobert and Bengio, [2004] Collobert, R. and Bengio, S. (2004). Links between perceptrons, MLPs and SVMs. International Conference on Machine Learning.
Davidson et al., [2009] Davidson, T. J., Kloosterman, F., and Wilson, M. A. (2009). Hippocampal replay of extended experience. Neuron.
Eichenbaum, [2014] Eichenbaum, H. (2014). Time cells in the hippocampus: a new dimension for mapping memories. Nature Reviews Neuroscience.
Fiete et al., [2010] Fiete, I. R., Senn, W., Wang, C. Z., and Hahnloser, R. H. (2010). Spike-time-dependent plasticity and heterosynaptic competition organize networks to produce long scale-free sequences of neural activity. Neuron.
Foster and Wilson, [2006] Foster, D. J. and Wilson, M. A. (2006). Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature.
Frémaux and Gerstner, [2016] Frémaux, N. and Gerstner, W. (2016). Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Frontiers in Neural Circuits.
Gardner, [1988] Gardner, E. (1988). The space of interactions in neural network models. Journal of Physics A: Mathematical and General.
Gillett et al., [2020] Gillett, M., Pereira, U., and Brunel, N. (2020). Characteristics of sequential activity in networks with temporally asymmetric Hebbian learning. Proceedings of the National Academy of Sciences.
Hopfield, [1982] Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences.
Iwama et al., [2012] Iwama, H., Okumura, M., Makihara, Y., and Yagi, Y. (2012). The OU-ISIR gait database comprising the large population dataset and performance evaluation of gait recognition. IEEE Transactions on Information Forensics and Security.
Karuvally et al., [2023] Karuvally, A., Sejnowski, T., and Siegelmann, H. T. (2023). General sequential episodic memory model. International Conference on Machine Learning.
Khona and Fiete, [2022] Khona, M. and Fiete, I. R. (2022). Attractor and integrator networks in the brain. Nature Reviews Neuroscience.
Kleinfeld, [1986] Kleinfeld, D. (1986). Sequential state generation by model neural networks. Proceedings of the National Academy of Sciences.
Krotov and Hopfield, [2016] Krotov, D. and Hopfield, J. J. (2016). Dense associative memory for pattern recognition. Advances in Neural Information Processing Systems.
Kuśmierz et al., [2017] Kuśmierz, Ł., Isomura, T., and Toyoizumi, T. (2017). Learning with three factors: modulating Hebbian plasticity with errors. Current Opinion in Neurobiology.
Laje and Buonomano, [2013] Laje, R. and Buonomano, D. V. (2013). Robust timing and motor patterns by taming chaos in recurrent neural networks. Nature Neuroscience.
LeCun, [1986] LeCun, Y. (1986). Learning process in an asymmetric threshold network. Disordered Systems and Biological Organization.
LeCun, [1987] LeCun, Y. (1987). Modèles connexionnistes de l'apprentissage. PhD thesis, Thèse de Doctorat, Université Paris.
Lee and Wilson, [2002] Lee, A. K. and Wilson, M. A. (2002). Memory of sequential experience in the hippocampus during slow wave sleep. Neuron.
Lillicrap et al., [2016] Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications.
Lillicrap et al., [2020] Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., and Hinton, G. (2020). Backpropagation and the brain. Nature Reviews Neuroscience.
Litwin-Kumar et al., [2017] Litwin-Kumar, A., Harris, K. D., Axel, R., Sompolinsky, H., and Abbott, L. (2017). Optimal degrees of synaptic connectivity. Neuron.
Minsky and Papert, [1969] Minsky, M. and Papert, S. A. (1969). Perceptrons. MIT Press.
Muscinelli et al., [2017] Muscinelli, S. P., Gerstner, W., and Brea, J. (2017). Exponentially long orbits in Hopfield neural networks. Neural Computation.
Nádasdy et al., [1999] Nádasdy, Z., Hirase, H., Czurkó, A., Csicsvari, J., and Buzsáki, G. (1999). Replay and time compression of recurring spike sequences in the hippocampus. Journal of Neuroscience.
Pastalkova et al., [2008] Pastalkova, E., Itskov, V., Amarasingham, A., and Buzsáki, G. (2008). Internally generated cell assembly sequences in the rat hippocampus. Science.
Pfeiffer and Foster, [2013] Pfeiffer, B. E. and Foster, D. J. (2013). Hippocampal place-cell sequences depict future paths to remembered goals. Nature.
Pfeiffer and Foster, [2015] Pfeiffer, B. E. and Foster, D. J. (2015). Autoassociative dynamics in the generation of sequences of hippocampal place cells. Science.
Rajakumar et al., [2021] Rajakumar, A., Rinzel, J., and Chen, Z. S. (2021). Stimulus-driven and spontaneous dynamics in excitatory-inhibitory recurrent neural networks for sequence representation. Neural Computation.
Rajan et al., [2016] Rajan, K., Harvey, C. D., and Tank, D. W. (2016). Recurrent network models of sequence generation and memory. Neuron.
Sompolinsky and Kanter, [1986] Sompolinsky, H. and Kanter, I. (1986). Temporal association in asymmetric neural networks. Physical Review Letters.
Srivastava et al., [2015] Srivastava, N., Mansimov, E., and Salakhudinov, R. (2015). Unsupervised learning of video representations using LSTMs. International Conference on Machine Learning.
Tang et al., [2023] Tang, M., Barron, H., and Bogacz, R. (2023). Sequential memory with temporal predictive coding. Advances in Neural Information Processing Systems.
Tully et al., [2016] Tully, P. J., Lindén, H., Hennig, M. H., and Lansner, A. (2016). Spike-based Bayesian-Hebbian learning of temporal sequences. PLoS Computational Biology.
Tulving, [2002] Tulving, E. (2002). Episodic memory: From mind to brain. Annual Review of Psychology.
Vershynin, [2018] Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.

Appendix A Proof of Theorem 1

We construct a network such that, given $\bm{\xi}(t)=\mathbf{x}(i)$ for $i=1,...,T-1$ , the hidden neurons provide a one-hot encoding of the successive pattern $\mathbf{x}(i+1)$ , which is then decoded to be $\bm{\xi}(t+1)$ .

To store $\mathbf{x}(1),...,\mathbf{x}(T)\in\{-1,1\}^{N}$ in (2)(3), assuming $\mathbf{x}(i)\neq\mathbf{x}(j)$ for $i\neq j$ except that $\mathbf{x}(1)=\mathbf{x}(T)$ , let $M=T-1$ and construct weight matrix $\mathbf{U}$ as

\displaystyle\mathbf{U}=(\mathbf{x}(1),\mathbf{x}(2),...,\mathbf{x}(T-1))^{\top}

(18)

and hidden neurons $\bm{\zeta}(t)=(\zeta_{1}(t),...,\zeta_{M}(t))^{\top}$ as

	$\displaystyle\zeta_{i}(t)$	$\displaystyle=\text{sign}\Big{(}\sum_{k=1}^{N}U_{ik}\xi_{k}(t)-N\Big{)}$		(19)
		$\displaystyle=\text{sign}\big{(}\mathbf{x}(i)^{\top}\bm{\xi}(t)-N\big{)}$		(20)

such that given $\bm{\xi}(t)=\mathbf{x}(i)$ for $i=1,...,T-1$ , we have

\displaystyle\zeta_{j}(t)=\begin{cases}\ \ \ 1,&\text{if }j=i,\\ -1,&\text{otherwise}.\end{cases}

(21)

Next, we construct the weight matrix $\mathbf{V}$ as

\displaystyle\mathbf{V}=(\mathbf{x}(2),\mathbf{x}(3),...,\mathbf{x}(T))

(22)

and visible neurons $\bm{\xi}(t+1)=(\xi_{1}(t+1),...,\xi_{N}(t+1))^{\top}$ as

\displaystyle\bm{\xi}(t+1)=\text{sign}(\mathbf{V}\bm{\zeta}(t)+\bm{\theta})

(23)

where $\bm{\theta}=\sum_{j=2}^{T}\mathbf{x}(j)$ such that given the one-hot vector $\bm{\zeta}(t)$ we have

$\displaystyle\bm{\xi}(t+1)$	$\displaystyle=\text{sign}\Big{(}\mathbf{x}(i+1)-\sum_{j\neq i+1}\mathbf{x}(j)+\sum_{j=2}^{T}\mathbf{x}(j)\Big{)}$	(24)
	$\displaystyle=\text{sign}(2\cdot\mathbf{x}(i+1))$	(25)
	$\displaystyle=\mathbf{x}(i+1)$	(26)
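
The construction above can be checked numerically. The following is a minimal sketch in plain Python, not the authors' code. It assumes the convention sign(0) = +1 (so that $\mathbf{x}(i)^{\top}\mathbf{x}(i)-N=0$ maps to $+1$ in (20)), and the helper names `build`, `step` and `random_cycle` are hypothetical.

```python
import random

def sign(v):
    # Assumption: sign(0) = +1, so a pattern matching its own row of U fires.
    return 1 if v >= 0 else -1

def build(patterns):
    """patterns = [x(1), ..., x(T)], entries +/-1, x(1) == x(T), others distinct."""
    N = len(patterns[0])
    U = [list(p) for p in patterns[:-1]]   # rows x(1), ..., x(T-1), eq. (18)
    V = [list(p) for p in patterns[1:]]    # V[j] holds column j of eq. (22)
    theta = [sum(p[k] for p in patterns[1:]) for k in range(N)]
    return U, V, theta

def step(xi, U, V, theta):
    """One network update: visible -> hidden (19), hidden -> visible (23)."""
    N = len(xi)
    zeta = [sign(sum(u * x for u, x in zip(row, xi)) - N) for row in U]
    return [sign(sum(V[j][k] * zeta[j] for j in range(len(zeta))) + theta[k])
            for k in range(N)]

def random_cycle(T, N, seed=0):
    """Hypothetical helper: T-1 distinct random patterns, closed by x(T) = x(1)."""
    rng = random.Random(seed)
    seen = []
    while len(seen) < T - 1:
        p = [rng.choice((-1, 1)) for _ in range(N)]
        if p not in seen:
            seen.append(p)
    return seen + [list(seen[0])]

patterns = random_cycle(T=6, N=8)
U, V, theta = build(patterns)
retrieved = [step(patterns[i], U, V, theta) for i in range(len(patterns) - 1)]
print(retrieved == patterns[1:])
```

Each stored pattern $\mathbf{x}(i)$ is mapped to its successor $\mathbf{x}(i+1)$ in a single step, which is the content of (24)-(26).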

Appendix B Proof of Theorem 2

Note that the update of $\mathbf{U}$ (4)(5)(6) in Section 5 does not depend on $\mathbf{V}$ . Therefore, we first prove the convergence of updating $\mathbf{U}$ for $\eta>0$ and $\kappa>0$ . The proof follows from [Gardner,, 1988]. Assume $\mathbf{U}^{*}$ exists such that, for all $t$ and $i$ ,

\displaystyle z_{i}(t+1)\sum_{k}U^{*}_{ik}x_{k}(t)\geq\kappa.

(27)

Define the $p$ -th update of $\mathbf{U}$ with $\mu_{i}(t_{p})=1$ by

\displaystyle U_{ij}^{(p+1)}=U_{ij}^{(p)}+\eta z_{i}(t_{p}+1)x_{j}(t_{p})

(28)

for some $t_{p}\in\{1,...,T-1\}$ and all $j$ in parallel. We assume zero-initialization, that is, $U_{ij}^{(1)}=0$ for simplicity but the result holds if $|U_{ij}^{(1)}|$ is sufficiently small. Let

\displaystyle X_{i}^{(p+1)}=\frac{\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}}{\sqrt{\sum_{j}\big{(}U_{ij}^{(p+1)}\big{)}^{2}}\sqrt{\sum_{j}\big{(}U_{ij}^{*}\big{)}^{2}}}.

(29)

By the Cauchy-Schwarz inequality, we have $X_{i}^{(p+1)}\leq 1$ . Now we prove the convergence of updating $\mathbf{U}$ by contradiction. Assuming the update of $\mathbf{U}$ does not converge, we will show that $X_{i}^{(p+1)}>1$ for sufficiently large $p$ . First, we have

\displaystyle\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}-\sum_{j}U_{ij}^{(p)}U_{ij}^{*}=\eta\sum_{j}z_{i}(t_{p}+1)U_{ij}^{*}x_{j}(t_{p})\geq\eta\kappa

(30)

due to (27) and therefore

	$\displaystyle\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}$	$\displaystyle=\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}-\sum_{j}U_{ij}^{(p)}U_{ij}^{*}+...+\sum_{j}U_{ij}^{(2)}U_{ij}^{*}-\sum_{j}U_{ij}^{(1)}U_{ij}^{*}+\sum_{j}U_{ij}^{(1)}U_{ij}^{*}$		(31)
		$\displaystyle\geq\eta\kappa p$		(32)

since we assumed $U_{ij}^{(1)}=0$ . Next, we have

$\displaystyle\sum_{j}\big{(}U_{ij}^{(p+1)}\big{)}^{2}-\sum_{j}\big{(}U_{ij}^{(p)}\big{)}^{2}$	$\displaystyle=\sum_{j}\big{(}U_{ij}^{(p)}+\eta z_{i}(t_{p}+1)x_{j}(t_{p})\big{)}^{2}-\sum_{j}\big{(}U_{ij}^{(p)}\big{)}^{2}$	(33)
	$\displaystyle=2\eta\sum_{j}U_{ij}^{(p)}z_{i}(t_{p}+1)x_{j}(t_{p})+N\eta^{2}$	(34)
	$\displaystyle=2\eta z_{i}(t_{p}+1)\sum_{j}U_{ij}^{(p)}x_{j}(t_{p})+N\eta^{2}$	(35)
	$\displaystyle<2\eta\kappa+N\eta^{2}$	(36)

since we assumed $\mu_{i}(t_{p})=1$ and therefore $z_{i}(t_{p}+1)\sum_{j}U_{ij}^{(p)}x_{j}(t_{p})<\kappa$ . Then, we have

	$\displaystyle\sqrt{\sum_{j}(U_{ij}^{(p+1)})^{2}}-\sqrt{\sum_{j}(U_{ij}^{(p)})^{2}}$	(37)
$\displaystyle=$	$\displaystyle\Big{(}\sum_{j}\big{(}U_{ij}^{(p+1)}\big{)}^{2}-\sum_{j}\big{(}U_{ij}^{(p)}\big{)}^{2}\Big{)}\Big{/}\Big{(}\sqrt{\sum_{j}\big{(}U_{ij}^{(p+1)}\big{)}^{2}}+\sqrt{\sum_{j}\big{(}U_{ij}^{(p)}\big{)}^{2}}\Big{)}$	(38)
$\displaystyle<$	$\displaystyle(2\eta\kappa+N\eta^{2})\Big{/}\Big{(}\sqrt{\sum_{j}\big{(}U_{ij}^{(p+1)}\big{)}^{2}}+\sqrt{\sum_{j}\big{(}U_{ij}^{(p)}\big{)}^{2}}\Big{)}.$	(39)

By the Cauchy-Schwarz inequality, we have

\displaystyle\sqrt{\sum_{j}(U_{ij}^{(p+1)})^{2}}\sqrt{\sum_{j}(U_{ij}^{*})^{2}}\geq\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}\geq\eta\kappa p

(40)

and therefore

\displaystyle\sqrt{\sum_{j}(U_{ij}^{(p+1)})^{2}}\geq\frac{\eta\kappa p}{\sqrt{\sum_{j}(U_{ij}^{*})^{2}}}.

(41)

Also,

$\displaystyle\sqrt{\sum_{j}(U_{ij}^{(p+1)})^{2}}$	$\displaystyle=\sqrt{\sum_{j}(U_{ij}^{(p+1)})^{2}}-\sqrt{\sum_{j}(U_{ij}^{(p)})^{2}}+...+\sqrt{\sum_{j}(U_{ij}^{(2)})^{2}}-\sqrt{\sum_{j}(U_{ij}^{(1)})^{2}}$	(42)
	$\displaystyle+\sqrt{\sum_{j}(U_{ij}^{(1)})^{2}}$	(43)
	$\displaystyle<\sum_{q=1}^{p}(2\eta\kappa+N\eta^{2})\Big{/}\Big{(}\sqrt{\sum_{j}\big{(}U_{ij}^{(q+1)}\big{)}^{2}}+\sqrt{\sum_{j}\big{(}U_{ij}^{(q)}\big{)}^{2}}\Big{)}$	(44)
	$\displaystyle<\sum_{q=1}^{p}(2\eta\kappa+N\eta^{2})\sqrt{\sum_{j}(U_{ij}^{*})^{2}}\frac{1}{\eta\kappa(2q-1)}$	(45)
	$\displaystyle=\frac{\eta\kappa+N\eta^{2}/2}{\eta\kappa}\sqrt{\sum_{j}(U_{ij}^{*})^{2}}\sum_{q=1}^{p}\frac{1}{q-1/2}$	(46)

due to (39), (41) and $U_{ij}^{(1)}=0$ . Note that for $q>1$

\displaystyle\frac{1}{q-1/2}\leq\int_{q-3/2}^{q-1/2}\frac{1}{x}dx=\log(q-1/2)-\log(q-3/2)

(47)

and

\displaystyle\sum_{q=1}^{p}\frac{1}{q-1/2}=2+\sum_{q=2}^{p}\frac{1}{q-1/2}\leq 2+\int_{1/2}^{p-1/2}\frac{1}{x}dx=2+\log(p-1/2)-\log(1/2).

(48)

Therefore,

\displaystyle\sqrt{\sum_{j}(U_{ij}^{(p+1)})^{2}}=O(\log(p))

(49)

and

\displaystyle\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}=\Omega(p)

(50)

as $p\to\infty$ . Hence,

\displaystyle X_{i}^{(p+1)}=\frac{\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}}{\sqrt{\sum_{j}(U_{ij}^{(p+1)})^{2}}\sqrt{\sum_{j}(U_{ij}^{*})^{2}}}>1

(51)

for some $p$ . This contradicts $X_{i}^{(p+1)}\leq 1$ . Thus, the update of $\mathbf{U}$ converges.

Once the update of $\mathbf{U}$ has converged, we can prove the convergence of $\mathbf{V}$ if there exists $\mathbf{V}^{*}$ such that for all $t$ and $i$ ,

\displaystyle x_{i}(t+1)\sum_{k}V^{*}_{ik}y_{k}(t)\geq\kappa

(52)

by a similar proof.
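
The convergence argument for a single row of $\mathbf{U}$ can be illustrated with a small simulation. This is a sketch in plain Python under stated assumptions, not the authors' implementation: the targets $z_{i}(t+1)$ are generated by a hypothetical `teacher` vector so that a feasible $\mathbf{U}^{*}$ with margin $\kappa$ is known to exist, and the function name `train_row` and the values of $\eta$, $\kappa$, $N$ and $T$ are illustrative choices.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_row(xs, zs, eta=0.1, kappa=1.0, max_sweeps=10_000):
    """Learn one row u of U. xs: inputs x(t); zs: targets z_i(t+1) in {-1, 1}."""
    u = [0.0] * len(xs[0])             # zero initialisation, as in the proof
    updates = 0
    for _ in range(max_sweeps):
        changed = False
        for x, z in zip(xs, zs):
            if z * dot(u, x) < kappa:  # error signal mu_i(t) = 1
                u = [uk + eta * z * xk for uk, xk in zip(u, x)]  # eq. (28)
                updates += 1
                changed = True
        if not changed:                # a full sweep with no error: converged
            return u, updates
    raise RuntimeError("no convergence within max_sweeps")

rng = random.Random(0)
N, T = 15, 11
# Hypothetical teacher: kappa * teacher plays the role of U* in eq. (27).
teacher = [rng.choice((-1, 1)) for _ in range(N)]
xs = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(T - 1)]
# N is odd, so teacher . x is an odd integer and never zero.
zs = [1 if dot(teacher, x) > 0 else -1 for x in xs]

u, updates = train_row(xs, zs)
print(updates, min(z * dot(u, x) for x, z in zip(xs, zs)))
```

Combining (32) with the telescoped form of (36) gives $p<(2\eta\kappa+N\eta^{2})\sum_{j}(U_{ij}^{*})^{2}/(\eta\kappa)^{2}$ , so in this setting the number of updates is finite; with the values above the bound evaluates to 525.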

Appendix C Proof of Theorem 3

If $\mu_{i}(t)=0$ , then $U_{ik}^{\prime}=U_{ik}$ and $\mu_{i}^{\prime}(t)=\mu_{i}(t)=0$ . If $\mu_{i}(t)=1$ , then

$\displaystyle\mu_{i}^{\prime}(t)$	$\displaystyle=H\Big{(}\kappa-z_{i}(t+1)\sum_{k=1}^{N}\big{(}U_{ik}+\eta z_{i}(t+1)x_{k}(t)\big{)}x_{k}(t)\Big{)}$	(53)
	$\displaystyle=H\Big{(}\kappa-z_{i}(t+1)\sum_{k=1}^{N}U_{ik}x_{k}(t)-\eta\big{(}z_{i}(t+1)\big{)}^{2}\sum_{k=1}^{N}\big{(}x_{k}(t)\big{)}^{2}\Big{)}$	(54)
	$\displaystyle=H\Big{(}\kappa-z_{i}(t+1)\sum_{k=1}^{N}U_{ik}x_{k}(t)-\eta N\Big{)}=0$	(55)

for sufficiently large $\eta>0$ given $x_{k}(t)=\pm 1$ , $z_{i}(t+1)=\pm 1$ and the property of Heaviside function.
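
As a worked check of (53)-(55), the sketch below applies one update to a single row and re-evaluates the error signal. It is an illustration in plain Python, assuming $H(v)=1$ for $v>0$ and $0$ otherwise; the function name `error_after_update` and the numbers are hypothetical. From (55), a single update removes the error whenever $\eta N\geq\kappa-z_{i}(t+1)\sum_{k}U_{ik}x_{k}(t)$ .

```python
def heaviside(v):
    # Assumption: H(v) = 1 for v > 0 and 0 otherwise.
    return 1 if v > 0 else 0

def error_after_update(u, x, z, eta, kappa):
    """Return mu'_i(t): the error signal re-evaluated after one update of row u."""
    margin = z * sum(uk * xk for uk, xk in zip(u, x))
    if heaviside(kappa - margin) == 0:
        return 0                                          # no update, no error
    u_new = [uk + eta * z * xk for uk, xk in zip(u, x)]   # three-factor update
    new_margin = z * sum(uk * xk for uk, xk in zip(u_new, x))
    return heaviside(kappa - new_margin)                  # eq. (53)

u = [0.5, -1.0, 0.25, 0.0]    # current row of U (illustrative values)
x = [1, 1, -1, 1]             # x(t), so N = 4
z, kappa = 1, 1.0             # target z_i(t+1) and margin

# Margin is -0.75, so the threshold is eta >= (1.0 + 0.75) / 4 = 0.4375.
print(error_after_update(u, x, z, eta=0.5, kappa=kappa))
print(error_after_update(u, x, z, eta=0.1, kappa=kappa))
```

With $\eta=0.5$ the new margin is $-0.75+0.5\cdot 4=1.25\geq\kappa$ and the error vanishes, while $\eta=0.1$ is below the threshold and leaves the error in place.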

Appendix D Proof of Theorem 4

If $\nu_{j}(t)=0$ , then we have

\displaystyle x_{j}(t+1)\sum_{k=1}^{M}V_{jk}y_{k}(t)\geq\kappa.

(56)

Next,

$\displaystyle x_{j}(t+1)\sum_{k=1}^{M}V_{jk}\hat{y}_{k}(t)$	$\displaystyle=x_{j}(t+1)\sum_{k=1}^{M}V_{jk}\big{(}{y}_{k}(t)+\hat{y}_{k}(t)-{y}_{k}(t)\big{)}$	(57)
	$\displaystyle=x_{j}(t+1)\sum_{k=1}^{M}V_{jk}{y}_{k}(t)+x_{j}(t+1)\sum_{k=1}^{M}V_{jk}\big{(}\hat{y}_{k}(t)-{y}_{k}(t)\big{)}$	(58)
	$\displaystyle\geq\kappa+x_{j}(t+1)\sum_{k=1}^{M}V_{jk}\big{(}\hat{y}_{k}(t)-{y}_{k}(t)\big{)}$	(59)
	$\displaystyle\geq\kappa-\Big{\|}\sum_{k=1}^{M}V_{jk}\big{(}\hat{y}_{k}(t)-{y}_{k}(t)\big{)}\Big{\|}$	(60)
	$\displaystyle\geq\kappa-\max_{k}\|V_{jk}\|\sum_{k=1}^{M}\|\hat{y}_{k}(t)-{y}_{k}(t)\|$	(61)
	$\displaystyle>\kappa-\max_{k}\|V_{jk}\|\cdot\epsilon>0$	(62)

since $x_{j}(t+1)=\pm 1$ , which implies

\displaystyle x_{j}(t+1)=\text{sign}\Big{(}\sum_{k=1}^{M}V_{jk}\hat{y}_{k}(t)\Big{)}.

(63)
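
The chain (57)-(61) can be checked on a small example. The sketch below, in plain Python, compares the margin obtained with a perturbed hidden state $\hat{y}$ against the lower bound of (61); it uses the clean margin in place of $\kappa$ , which is at least as tight because the clean margin is at least $\kappa$ by (56). The function name `perturbed_margin` and the numbers are illustrative assumptions.

```python
def perturbed_margin(v_row, y, y_hat, x_next):
    """Return (margin with y_hat, lower bound following eqs. (58)-(61))."""
    clean = x_next * sum(v * yk for v, yk in zip(v_row, y))
    actual = x_next * sum(v * yk for v, yk in zip(v_row, y_hat))
    l1_error = sum(abs(a - b) for a, b in zip(y_hat, y))   # sum_k |y_hat_k - y_k|
    bound = clean - max(abs(v) for v in v_row) * l1_error  # eq. (61) with clean margin
    return actual, bound

v_row = [0.5, -0.5, 1.0]      # row j of V (illustrative values)
y = [1, -1, 1]                # unperturbed hidden state y(t)
y_hat = [0.75, -1, 0.5]       # perturbed hidden state
x_next = 1                    # target x_j(t+1)

actual, bound = perturbed_margin(v_row, y, y_hat, x_next)
print(actual, bound)
```

Whenever the bound is positive, the sign of $\sum_{k}V_{jk}\hat{y}_{k}(t)$ agrees with $x_{j}(t+1)$ , which is statement (63).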

Appendix E Numerical Results for Figure 5 (a) and (b)

In the main paper, we only showed bar charts (Figure 5 (a) and (b)) of the results in Section 6.2. Here, for more information, we provide the numerical results for Figure 5 (a) in Table 1 and Figure 5 (b) in Table 2.

$T$	10	20	30	40	50	60	70	80	90	100	110	120	130	140	150
Learning only $\mathbf{V}$	100	100	100	91	66	19	8	2	0	0	0	0	0	0	0
Learning $\mathbf{U}$ and $\mathbf{V}$	100	100	100	100	100	99	88	52	20	1	0	0	0	0	0

Table 1: Successful retrievals out of 100 trials with different sequence period lengths $T$

$M$	100	200	300	400	500	600	700	800	900	1000
Learning only $\mathbf{V}$	0	0	1	3	6	16	14	27	30	37
Learning $\mathbf{U}$ and $\mathbf{V}$	10	52	85	90	94	88	95	96	96	97

Table 2: Successful retrievals out of 100 trials with different numbers of hidden neurons $M$

Appendix F Ablation Experiments: Joint Learning of $\mathbf{U}$ and $\mathbf{V}$

To verify the effectiveness of the learning algorithm proposed in Section 5, we show additional experimental results comparing three methods for training networks with hidden neurons on the sequences of Section 6.3.

1. Fixing $\mathbf{U}$ and learning $\mathbf{V}$ by the temporal asymmetric Hebbian algorithm

V_{ji}=\sum_{t}x_{j}(t+1)y_{i}(t)

where

y_{i}(t)=\text{sign}\big{(}\sum_{k=1}^{N}U_{ik}x_{k}(t)\big{)}.

2. Fixing $\mathbf{U}$ and learning $\mathbf{V}$ with the three-factor rule (7)(8)(9) in Section 5.

3. Learning both $\mathbf{U}$ and $\mathbf{V}$ with the three-factor rule (4)(5)(6)(7)(8)(9) in Section 5.

The experimental settings are the same as in Section 6.3. The results are shown in Figure 1-5, from which we can see the algorithm proposed in Section 5 is indeed effective.
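
For concreteness, the first baseline (fixed $\mathbf{U}$ , temporal asymmetric Hebbian learning of $\mathbf{V}$ ) can be sketched in plain Python as follows. This is an illustration only: the sizes, the Gaussian initialisation of $\mathbf{U}$ , the random patterns and the convention sign(0) = +1 are assumptions made for the example and are not taken from the experimental setup.

```python
import random

def sign(v):
    return 1 if v >= 0 else -1    # assumption: sign(0) = +1

def hidden(U, x):
    """y_i(t) = sign(sum_k U_ik x_k(t)) for a fixed projection U."""
    return [sign(sum(u * xk for u, xk in zip(row, x))) for row in U]

def hebbian_V(patterns, U):
    """V_ji = sum_t x_j(t+1) y_i(t), accumulated over consecutive pattern pairs."""
    N, M = len(patterns[0]), len(U)
    V = [[0] * M for _ in range(N)]
    for x_now, x_next in zip(patterns, patterns[1:]):
        y = hidden(U, x_now)
        for j in range(N):
            for i in range(M):
                V[j][i] += x_next[j] * y[i]
    return V

rng = random.Random(0)
N, M, T = 8, 40, 4                 # illustrative sizes
U = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]
patterns = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(T)]
V = hebbian_V(patterns, U)
print(len(V), len(V[0]))
```

Unlike the three-factor rule, this one-shot rule has no error signal, so nothing in it enforces the margin condition used in Appendix B.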

Appendix G Ablation Experiments: Sparsity

We provide a further ablation study of our algorithm on the effect of sparsity, under the experimental settings of Section 6.2 in the main paper.

Figure 14: Sparse Random Projected Inputs

We compare our method (learning both $\mathbf{U}$ and $\mathbf{V}$ with the three-factor rule) with a baseline that uses a fixed random $\mathbf{U}$ , whose elements are sampled i.i.d. from the standard Gaussian distribution, and learns only $\mathbf{V}$ with the three-factor rule. The sparse random projected inputs are defined as

y_{i}(t)=\text{sign}\Big{(}\sum_{k=1}^{N}U_{ik}x_{k}(t)-\theta\Big{)}

where $\theta>0$ controls the sparsity level.
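
The effect of $\theta$ on sparsity can be seen in a short sketch. The plain-Python example below is illustrative only: the sizes $N$ and $M$ , the values of $\theta$ and the convention sign(0) = +1 are assumptions, and `sparse_projection` is a hypothetical name.

```python
import random

def sparse_projection(U, x, theta):
    """y_i = sign(sum_k U_ik x_k - theta); a larger theta leaves fewer +1 entries."""
    return [1 if sum(u * xk for u, xk in zip(row, x)) - theta >= 0 else -1
            for row in U]

rng = random.Random(0)
N, M = 100, 500                       # illustrative sizes
# Fixed random U with i.i.d. standard Gaussian entries, as in the baseline.
U = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]
x = [rng.choice((-1, 1)) for _ in range(N)]

for theta in (0.0, 5.0, 10.0, 20.0):
    y = sparse_projection(U, x, theta)
    active = sum(1 for v in y if v == 1) / M
    print(theta, active)              # fraction of hidden units equal to +1
```

For a fixed $\mathbf{U}$ and input, the fraction of active units cannot increase as $\theta$ grows, since each pre-activation is compared against a higher threshold.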

Figure 15: Sparse Random Projected Targets

We test different levels of sparsity in the random projected targets defined as

z_{i}(t+1)=\text{sign}\Big{(}\sum_{k=1}^{N}P_{ik}x_{k}(t+1)-\theta\Big{)}