Learning Sequence Attractors in Recurrent Networks with Hidden Neurons

Yao Lu, Si Wu
Abstract

The brain is specialized for processing temporal sequence information. It remains largely unclear how the brain learns to store and retrieve sequence memories. Here, we study how recurrent networks of binary neurons learn sequence attractors to store predefined pattern sequences and retrieve them robustly. We show that to store arbitrary pattern sequences, it is necessary for the network to include hidden neurons, even though their role in displaying sequence memories is indirect. We develop a local learning algorithm to learn sequence attractors in networks with hidden neurons. The algorithm is proven to converge and lead to sequence attractors. We demonstrate that the network model can store and retrieve sequences robustly on synthetic and real-world datasets. We hope that this study provides new insights into sequence memory and temporal information processing in the brain.

Keywords: recurrent networks, attractor networks, sequence memory, sequence attractors
Journal: Neural Networks
Affiliation: School of Psychological and Cognitive Sciences, IDG/McGovern Institute for Brain Research, Beijing Key Laboratory of Behavior and Mental Health, Peking-Tsinghua Center for Life Sciences, Center of Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, China

1 Introduction

The brain is specialized for processing temporal sequence information. Taking visual recognition as an example, the conventional setting of static image processing never happens in the brain. Starting from the retina, the visual inputs of an image arrive in the form of optical flow, which is transformed into spike trains of retinal ganglion cells and then transmitted through LGN, V1, V2, V4 and higher cortical regions until the image is recognized. Along the whole pathway, the computations performed by the brain take the form of temporal sequence processing rather than static processing. As another example, when we recall an episodic memory, a sequence of events represented by neuronal responses flows into our mind; these events do not come in disorder or isolation but unfold in time, as we experience “mental time travel” [Tulving, 2002]. Physiological and behavioral studies have revealed that the hippocampus is essential for sequence memories. In animals, sequences of neural activity patterns are observed in the hippocampus during memory replay and memory-related tasks [Nádasdy et al., 1999; Lee and Wilson, 2002; Foster and Wilson, 2006; Pastalkova et al., 2008; Davidson et al., 2009; Pfeiffer and Foster, 2013, 2015]. The discovery of time cells in the hippocampus shows that the brain has specialized neurons encoding the temporal structure of events [Eichenbaum, 2014]. Overall, the processing of temporal sequences is critical to the brain, but computational modeling of this issue lags far behind that of static information processing.

Attractor neural networks are promising computational models for elucidating how the brain represents, memorizes and processes information [Amari, 1972; Hopfield, 1982; Amit, 1989]. An attractor network is a type of recurrent network in which information is stored as stable states (attractors) of the network. Once stored, a memory can be retrieved robustly from noisy or incomplete cues as the network dynamics evolve. Experimental evidence indicates that the brain employs attractor networks for memory-related tasks [Khona and Fiete, 2022]. By considering a simplified neuron model and threshold dynamics, the classical Hopfield networks have successfully elucidated how recurrent networks learn to store static memory patterns [Hopfield, 1982]. Recurrent networks of binary neurons can also generate pattern sequences by employing asymmetric weight connections [Amari, 1972; Hopfield, 1982; Kleinfeld, 1986; Sompolinsky and Kanter, 1986; Bressloff and Taylor, 1992], which can explain the sequential neural activities widely observed in the brain (e.g., in memory retrieval in the hippocampus). In this paper, we follow and extend the standard form of recurrent networks of binary neurons with threshold dynamics, as this enables us to pursue theoretical analysis, and we investigate how the networks learn to store sequence attractors. By a sequence attractor, we mean that the state of a recurrent network evolves in the order of the stored pattern sequence and that this evolution is robust to noise.

We first show in Section 3 that recurrent networks containing only visible binary neurons [Amari, 1972; Hopfield, 1982; Kleinfeld, 1986; Sompolinsky and Kanter, 1986; Bressloff and Taylor, 1992] are inadequate for storing arbitrary pattern sequences. We then argue in Section 4 that it is necessary for the networks to include hidden neurons. These neurons are not directly involved in expressing pattern sequences, but they are indispensable for the networks to store and retrieve arbitrary pattern sequences. In Section 5, we develop a local learning algorithm to learn sequence attractors in networks with hidden neurons. The algorithm is proven to converge and lead to sequence attractors. In Section 6, we demonstrate that our network model can learn to store and retrieve pattern sequences robustly on synthetic and real-world datasets.

2 Related Work and Our Contributions

Learning temporal sequences in recurrent networks has been studied previously in the field of computational neuroscience. These works employ different forms of recurrent networks and have different focuses of investigation. Specifically, [Amari, 1972; Hopfield, 1982; Kleinfeld, 1986; Sompolinsky and Kanter, 1986; Bressloff and Taylor, 1992; Fiete et al., 2010] investigated recurrent networks of binary neurons and simple threshold dynamics. This approach takes advantage of simplified models that capture the essential features of neural dynamics and permit theoretical analysis. [Brea et al., 2013; Tully et al., 2016] investigated recurrent networks of spiking neurons, which are more biologically realistic but hard to analyze theoretically. [Laje and Buonomano, 2013; Rajan et al., 2016; Gillett et al., 2020; Rajakumar et al., 2021] investigated recurrent networks of firing-rate neurons (e.g., sigmoid and linear-threshold), whose complexity lies between that of binary neurons and that of spiking neurons. More recently, [Karuvally et al., 2023; Chaudhry et al., 2023] employed modern Hopfield networks [Krotov and Hopfield, 2016] and [Tang et al., 2023] employed predictive coding networks for modeling sequence memory.

In this paper, we study and extend the classical recurrent network model [Amari, 1972; Hopfield, 1982; Kleinfeld, 1986; Sompolinsky and Kanter, 1986], with a focus on theoretical analysis. The main contributions of our work, in comparison to related work, are summarized below.

  1. We highlight the importance of hidden neurons in recurrent networks of binary neurons for learning arbitrary pattern sequences. [Amari, 1972; Hopfield, 1982; Sompolinsky and Kanter, 1986; Bressloff and Taylor, 1992; Fiete et al., 2010] considered only visible neurons, and hence the pattern sequences they can generate are limited. Although [Laje and Buonomano, 2013; Brea et al., 2013; Rajakumar et al., 2021; Chaudhry et al., 2023; Tang et al., 2023] also employed hidden neurons in sequence learning, their models are based on different network architectures or neuron models.

  2. We provide a clear theoretical characterization of the sequences that can be generated by our networks (Theorem 1), a result that is lacking in the other related work. Although this conclusion comes from the analysis of the simple model we use, it lays a foundation for future work to test it in biologically more realistic networks. [Chaudhry et al., 2023] also provided a theoretical characterization of the network capacity, but it is based on Rademacher (random and uniformly distributed) sequence patterns and several approximations.

  3. Our learning algorithm is proven to converge and lead to sequence attractors, while most previous work only provided empirical evidence of the effectiveness of their learning algorithms [Laje and Buonomano, 2013; Brea et al., 2013; Rajan et al., 2016; Rajakumar et al., 2021]. Although [Amari, 1972; Bressloff and Taylor, 1992] gave provable results on sequence attractors, they did not include hidden neurons, and hence their results only hold for a restricted class of sequences.

  4. Our learning algorithm only requires local information between neurons, which is believed to be biologically plausible. [Rajakumar et al., 2021] used backpropagation, which is often criticized for its biological implausibility, as it requires gradient computation and suffers from the weight transport problem [Lillicrap et al., 2020].

3 Limitation of Networks without Hidden Neurons

We first consider recurrent networks of $N$ visible binary neurons [Amari, 1972; Hopfield, 1982; Sompolinsky and Kanter, 1986; Bressloff and Taylor, 1992]. All the neurons are bidirectionally connected, with weight matrix $\mathbf{W}$, where $W_{ij}$ denotes the synaptic weight from the $j$-th neuron to the $i$-th neuron. Let $\bm{\xi}(t)=(\xi_1(t),...,\xi_N(t))^\top\in\{-1,1\}^N$ be the states of the neurons at time $t$. These states are synchronously updated according to the threshold dynamics, for $i=1,...,N$,

\begin{align}
\xi_i(t+1)=\text{sign}\Big(\sum_{j=1}^{N}W_{ij}\xi_j(t)\Big), \tag{1}
\end{align}

where $\text{sign}(x)=1$ if $x\geq 0$ and $\text{sign}(x)=-1$ otherwise. The bias parameters are omitted here as they can be absorbed into the equation. Given a pair of successive network states $\bm{\xi}(t)$ and $\bm{\xi}(t+1)$, the dynamics of the network can be unfolded in time and viewed as a feedforward network, in which each output neuron is a perceptron of the inputs.
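As a minimal sketch of the synchronous update in Eq. (1), the following plain-Python code iterates the threshold dynamics. The function names (`step`, `run`) and the 2-neuron weight matrix are hypothetical choices made for illustration only; they do not come from the paper.

```python
def sign(x):
    """Sign convention of Eq. (1): +1 for x >= 0, -1 otherwise."""
    return 1 if x >= 0 else -1


def step(W, xi):
    """One synchronous update: xi_i(t+1) = sign(sum_j W[i][j] * xi_j(t))."""
    return [sign(sum(w_ij * s_j for w_ij, s_j in zip(row, xi))) for row in W]


def run(W, xi0, n_steps):
    """Return the trajectory [xi(0), xi(1), ..., xi(n_steps)]."""
    trajectory = [list(xi0)]
    for _ in range(n_steps):
        trajectory.append(step(W, trajectory[-1]))
    return trajectory


# Hypothetical 2-neuron example with asymmetric weights (illustration only).
W = [[0, -1],
     [1, 0]]
trajectory = run(W, [1, 1], 4)
for state in trajectory:
    print(state)
```

With these asymmetric weights the state visits all four patterns before returning to its starting point, which illustrates how asymmetric connections can produce a sequence rather than a fixed point.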

Given a sequence in the form of $\mathbf{x}(1),...,\mathbf{x}(T)\in\{-1,1\}^N$, one can use a learning algorithm to adjust $\mathbf{W}$ such that the evolution of the network state matches the pattern sequence. Although the networks can generate some sequences of maximal length $2^N$ [Muscinelli et al., 2017], they are fundamentally limited in the class of sequences that can be generated. Since each neuron can be regarded as a perceptron, the condition for the sequence $\mathbf{x}(1),...,\mathbf{x}(T)$ to be generated by the network is that, for each $i$, the dataset $\{(\mathbf{x}(t),x_i(t+1))\}_{t=1}^{T-1}$ is linearly separable [LeCun, 1986; Bressloff and Taylor, 1992; Brea et al., 2013; Muscinelli et al., 2017].

For a simple example of a sequence that cannot be generated by the networks, consider the sequence

\begin{align*}
\begin{pmatrix}1\\ 1\end{pmatrix},\begin{pmatrix}1\\ -1\end{pmatrix},\begin{pmatrix}-1\\ 1\end{pmatrix},\begin{pmatrix}-1\\ -1\end{pmatrix},\begin{pmatrix}1\\ 1\end{pmatrix}
\end{align*}

with $N=2$ and $T=5$. To generate this sequence, the first neuron of the network needs to map $(1,1)^\top$ to $1$, $(1,-1)^\top$ to $-1$, $(-1,1)^\top$ to $-1$ and $(-1,-1)^\top$ to $1$. This mapping is essentially the XOR operation, which cannot be performed by a perceptron [Minsky and Papert, 1969].
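The impossibility can be verified by hand. As a short worked derivation, suppose the first neuron had weights $w_1, w_2$ and an explicit bias $b$ (notation introduced here for illustration; the text absorbs the bias into the weights). The four required input-output pairs give:

```latex
\begin{align*}
(1,1)^\top \mapsto 1:   &\quad  w_1 + w_2 + b \geq 0, \\
(1,-1)^\top \mapsto -1: &\quad  w_1 - w_2 + b < 0, \\
(-1,1)^\top \mapsto -1: &\quad -w_1 + w_2 + b < 0, \\
(-1,-1)^\top \mapsto 1: &\quad -w_1 - w_2 + b \geq 0.
\end{align*}
```

Adding the first and last inequalities gives $2b \geq 0$, whereas adding the two middle inequalities gives $2b < 0$. The contradiction shows that no choice of $(w_1, w_2, b)$ realizes the mapping.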

In Figure 1, we show additional examples of sequences that cannot be generated by the network. The sequences are synthetically constructed. We then test whether the perceptron learning algorithm can learn the sequences. Since the algorithm converges if the linear separability condition is met [Minsky and Papert, 1969], the failure of the algorithm to converge implies that the sequences cannot be generated by the networks.
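The test described above can be sketched as follows. This is an illustrative approximation, not the authors' experimental code: non-convergence is detected with an epoch cap (`max_epochs`, an assumed parameter), which suggests but does not strictly prove non-separability, and the function names are hypothetical.

```python
def perceptron_fits(inputs, targets, max_epochs=1000):
    """Run the perceptron rule with a bias term.

    Returns True as soon as a full epoch produces no mistakes. Reaching
    max_epochs only suggests that the data are not linearly separable.
    """
    w = [0.0] * (len(inputs[0]) + 1)               # last entry is the bias
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(inputs, targets):
            xb = list(x) + [1]                      # append constant bias input
            out = 1 if sum(wi * v for wi, v in zip(w, xb)) >= 0 else -1
            if out != y:
                w = [wi + y * v for wi, v in zip(w, xb)]
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def sequence_is_learnable(seq, max_epochs=1000):
    """Check each neuron i on the dataset {(x(t), x_i(t+1))}, t = 1..T-1."""
    inputs = seq[:-1]
    return all(
        perceptron_fits(inputs, [nxt[i] for nxt in seq[1:]], max_epochs)
        for i in range(len(seq[0]))
    )


xor_seq = [[1, 1], [1, -1], [-1, 1], [-1, -1], [1, 1]]    # sequence from Section 3
cycle_seq = [[1, 1], [-1, 1], [-1, -1], [1, -1], [1, 1]]  # hypothetical separable cycle
print(sequence_is_learnable(xor_seq), sequence_is_learnable(cycle_seq))
```

The second sequence is an assumed example added for contrast: each of its neurons copies (up to sign) one input coordinate, so the perceptron rule finds weights for it, while the XOR-type sequence exhausts the epoch budget.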

Figure 1: Two example sequences which cannot be generated by networks without hidden neurons. White squares denote positive ones and black squares denote negative ones.

4 Networks with Hidden Neurons

To overcome the limitation of networks with only visible neurons, we consider including a group of hidden neurons in the network. The visible and hidden neurons are bidirectionally connected, and there are no connections among the visible neurons or among the hidden neurons. Let $\mathbf{U}$ be the weight matrix from visible neurons to hidden neurons, where $U_{ij}$ denotes the synaptic weight from the $j$-th visible neuron to the $i$-th hidden neuron, and let $\mathbf{V}$ be the weight matrix from hidden neurons to visible neurons, where $V_{ji}$ denotes the synaptic weight from the $i$-th hidden neuron to the $j$-th visible neuron. Let $\bm{\xi}(t)=(\xi_1(t),...,\xi_N(t))^\top\in\{-1,1\}^N$ be the states of the visible neurons and $\bm{\zeta}(t)=(\zeta_1(t),...,\zeta_M(t))^\top\in\{-1,1\}^M$ be the states of the hidden neurons at time $t$. These states are synchronously updated according to, for $i=1,...,M$ and $j=1,...,N$,

\begin{align}
\zeta_i(t) &= \text{sign}\Big(\sum_{k=1}^{N}U_{ik}\xi_k(t)\Big), \tag{2}\\
\xi_j(t+1) &= \text{sign}\Big(\sum_{k=1}^{M}V_{jk}\zeta_k(t)\Big), \tag{3}
\end{align}

where we omit the bias parameters as they can be absorbed into the equations. As illustrated in Figure 2, given a pair of successive network states $\bm{\xi}(t)$ and $\bm{\xi}(t+1)$, the dynamics of the network can be unfolded in time and viewed as a feedforward network with a hidden layer of neurons.
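The two-stage update of Eqs. (2) and (3) can be sketched in a few lines of plain Python. The weights below are hypothetical values with $N=M=3$, chosen only so that the example runs; they are not taken from the paper.

```python
def sign(x):
    """Sign convention used in the text: +1 for x >= 0, -1 otherwise."""
    return 1 if x >= 0 else -1


def matvec_sign(A, v):
    """Apply sign() to each component of the matrix-vector product A v."""
    return [sign(sum(a * s for a, s in zip(row, v))) for row in A]


def step(U, V, xi):
    """Eq. (2): zeta(t) = sign(U xi(t)); Eq. (3): xi(t+1) = sign(V zeta(t))."""
    zeta = matvec_sign(U, xi)
    return matvec_sign(V, zeta), zeta


# Hypothetical weights for illustration (3 visible, 3 hidden neurons).
U = [[1, 1, 1], [1, -1, 1], [-1, 1, 1]]
V = [[1, 1, -1], [1, -1, 1], [-1, -1, -1]]

xi = [1, 1, 1]
visible_states = [xi]
for _ in range(2):
    xi, zeta = step(U, V, xi)
    visible_states.append(xi)
    print("hidden:", zeta, "next visible:", xi)
```

Each call to `step` corresponds to one pass through the unfolded feedforward network: visible layer, hidden layer, visible layer.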

[Figure 2 labels: "Unfolding in time"; states $\bm{\xi}(t)$, $\bm{\zeta}(t)$, $\bm{\xi}(t+1)$; weights $\mathbf{U}$, $\mathbf{V}$]
Figure 2: Recurrent network with hidden neurons. The red circles denote visible neurons and the white circles denote hidden neurons.

Networks with $M$ hidden neurons can generate arbitrary sequences with the Markov property and of length at least $M$, as stated in Theorem 1. We provide a constructive proof, based on a one-hot encoding by the hidden neurons, in the Appendix.

Theorem 1.

Let $\mathbf{x}(1),...,\mathbf{x}(T)\in\{-1,1\}^N$ such that $\mathbf{x}(i)\neq\mathbf{x}(j)$ for $i\neq j$, except that $\mathbf{x}(1)=\mathbf{x}(T)$. Then $\mathbf{x}(1),...,\mathbf{x}(T)$ can be generated by the network defined in (2)(3) for $M=T-1$.
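To make the statement concrete, the following sketch builds one possible one-hot network for a given sequence. It is an illustrative construction under stated assumptions and may differ from the proof in the Appendix: biases are kept explicit (the text absorbs them into the weights), and the names `build_one_hot_network`, `b_hidden` and `b_visible` are hypothetical.

```python
def sign(x):
    return 1 if x >= 0 else -1


def build_one_hot_network(seq):
    """Weights for x(1..T) with distinct x(1..T-1) and x(T) = x(1).

    Hidden neuron t is active (+1) only when the visible state equals x(t).
    Explicit biases are an assumption of this sketch.
    """
    patterns, successors = seq[:-1], seq[1:]        # M = T - 1 transitions
    n = len(seq[0])
    U = [list(p) for p in patterns]                  # row t of U is x(t)
    # x(t).xi equals n only if xi == x(t), and is at most n - 2 otherwise.
    b_hidden = [-(n - 1)] * len(patterns)
    # V[j][t] = x_j(t+1); the bias cancels the inactive (-1) hidden units.
    V = [[s[j] for s in successors] for j in range(n)]
    b_visible = [sum(row) for row in V]
    return U, b_hidden, V, b_visible


def step(net, xi):
    """One update of the visible state through the one-hot hidden layer."""
    U, b_hidden, V, b_visible = net
    zeta = [sign(sum(u * s for u, s in zip(row, xi)) + b)
            for row, b in zip(U, b_hidden)]
    return [sign(sum(v * z for v, z in zip(row, zeta)) + b)
            for row, b in zip(V, b_visible)]


def replay(net, x0, n_steps):
    states = [list(x0)]
    for _ in range(n_steps):
        states.append(step(net, states[-1]))
    return states


# The XOR-type sequence of Section 3, which visible-only networks cannot generate.
seq = [[1, 1], [1, -1], [-1, 1], [-1, -1], [1, 1]]
net = build_one_hot_network(seq)
generated = replay(net, seq[0], len(seq) - 1)
print(generated == seq)
```

With hidden unit $t$ active, the visible input sum for neuron $j$ reduces to $2x_j(t+1)$, so the next visible state is the successor pattern. The sketch uses $M=T-1=4$ hidden neurons for this sequence.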

5 Learning

To learn the weight matrices, one can first unfold the recurrent network with hidden neurons in time such that it becomes a feedforward network with a hidden layer, and the pairs of successive patterns in the sequence constitute the training examples, as illustrated in Figure 2. However, learning in the unfolded feedforward network is difficult, since the backpropagation algorithm cannot be applied: the sign activation of the binary neurons is not differentiable.

We propose a new learning algorithm to learn the weight matrices in the unfolded feedforward network, which draws inspiration from three ideas: feedback alignment [Lillicrap et al., 2016], target propagation [LeCun, 1987; Bengio, 2014; Litwin-Kumar et al., 2017] and three-factor rules [Frémaux and Gerstner, 2016; Kuśmierz et al., 2017]. As in feedback alignment, it requires a random matrix $\mathbf{P}$, fixed during the learning process, to backpropagate signals. As in target propagation, it propagates targets rather than errors, creating surrogate targets for the hidden neurons. Each weight parameter is updated by a three-factor rule, in which the presynaptic activation, the postsynaptic activation and an error term acting as neuromodulation are multiplied. The three-factor rule is similar to the one in [Bressloff and Taylor, 1992] and is known as the margin perceptron in the machine learning literature [Collobert and Bengio, 2004].

The algorithm works as follows. Given a pair of successive patterns $\mathbf{x}(t)$ and $\mathbf{x}(t+1)$, for $i=1,...,M$ and $j=1,...,N$ in parallel,

  1. Update $\mathbf{U}$ by

    \begin{align}
    z_i(t+1) &= \text{sign}\Big(\sum_{k=1}^{N}P_{ik}x_k(t+1)\Big), \tag{4}\\
    \mu_i(t) &= H\Big(\kappa-z_i(t+1)\sum_{k=1}^{N}U_{ik}x_k(t)\Big), \tag{5}\\
    U_{ij} &\leftarrow U_{ij}+\eta\mu_i(t)z_i(t+1)x_j(t). \tag{6}
    \end{align}

  2. Update $\mathbf{V}$ by

    \begin{align}
    y_i(t) &= \text{sign}\Big(\sum_{k=1}^{N}U_{ik}x_k(t)\Big), \tag{7}\\
    \nu_j(t) &= H\Big(\kappa-x_j(t+1)\sum_{k=1}^{M}V_{jk}y_k(t)\Big), \tag{8}\\
    V_{ji} &\leftarrow V_{ji}+\eta\nu_j(t)x_j(t+1)y_i(t), \tag{9}
    \end{align}

where $P_{ik}$ denotes the $(i,k)$ entry of the fixed random matrix $\mathbf{P}$, $H(\cdot)$ is the Heaviside function ($H(x)=1$ if $x\geq 0$ and $H(x)=0$ otherwise), $\kappa>0$ is the robustness hyperparameter and $\eta>0$ is the learning rate hyperparameter. $\mu_i(t)$ and $\nu_j(t)$ can be interpreted as the error terms for the hidden and the visible neurons, respectively. $z_i(t+1)$ can be interpreted as the synaptic input from an external neuron. The above procedure is then repeated for each $t$.
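The update rules (4)-(9) can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation. Several details are assumptions of the sketch: the pattern pairs are swept repeatedly in epochs up to a cap (`max_epochs`), Eq. (7) is evaluated with the already-updated $\mathbf{U}$, the default values of `kappa` and `eta` are arbitrary, and the toy sequence and Gaussian $\mathbf{P}$ at the bottom are hypothetical.

```python
import random


def sign(x):
    return 1 if x >= 0 else -1


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def train(seq, P, kappa=1.0, eta=0.1, max_epochs=500):
    """Learn U (M x N) and V (N x M) from successive pattern pairs.

    P is the fixed M x N random matrix. Returns (U, V, converged), where
    converged means a full sweep produced no nonzero error term.
    """
    M, N = len(P), len(seq[0])
    U = [[0.0] * N for _ in range(M)]
    V = [[0.0] * M for _ in range(N)]
    for _ in range(max_epochs):
        n_errors = 0
        for x, x_next in zip(seq[:-1], seq[1:]):
            # Step 1: update U.
            for i in range(M):
                z = sign(dot(P[i], x_next))                        # Eq. (4)
                mu = 1 if kappa - z * dot(U[i], x) >= 0 else 0     # Eq. (5)
                if mu:
                    U[i] = [u + eta * z * xk for u, xk in zip(U[i], x)]  # Eq. (6)
                    n_errors += 1
            # Step 2: update V, using the hidden states given by the current U.
            y = [sign(dot(U[i], x)) for i in range(M)]             # Eq. (7)
            for j in range(N):
                nu = 1 if kappa - x_next[j] * dot(V[j], y) >= 0 else 0   # Eq. (8)
                if nu:
                    V[j] = [v + eta * x_next[j] * yk for v, yk in zip(V[j], y)]  # Eq. (9)
                    n_errors += 1
        if n_errors == 0:
            return U, V, True
    return U, V, False


def recall(U, V, x0, n_steps):
    """Run the network dynamics of Eqs. (2)-(3) from the cue x0."""
    states = [list(x0)]
    for _ in range(n_steps):
        zeta = [sign(dot(row, states[-1])) for row in U]
        states.append([sign(dot(row, zeta)) for row in V])
    return states


# Hypothetical toy run: four mutually orthogonal patterns closed into a cycle.
patterns = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
seq = patterns + [patterns[0]]
rng = random.Random(0)
P = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]   # M = 16, N = 4
U, V, converged = train(seq, P)
if converged:
    print(recall(U, V, seq[0], len(seq) - 1) == seq)
```

Each weight change uses only the presynaptic activity, the postsynaptic target and the local error term, which is the locality property emphasized in the text. Whether a given run converges depends on the sequence, $\mathbf{P}$ and $M$, so the sketch reports the flag instead of assuming success.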

5.1 Analysis

In this section, we provide a theoretical analysis of the algorithm. The proofs are left to the Appendix. First, we provide a convergence guarantee for the algorithm.

Theorem 2.

Given the definitions in (4)(5)(7)(8), if a solution exists such that $\mu_i(t)=0$ and $\nu_j(t)=0$ for all $i$, $j$ and $t$, then the algorithm (4)-(9) converges in a finite number of steps, provided that $U_{ij}$ and $V_{ji}$ are initialized to zero.

Next, we show that the algorithm can reduce the error $\mu_i(t)$ in a single step of updating $\mathbf{U}$. The theorem can be trivially extended to $\nu_j(t)$ and $\mathbf{V}$ by a similar proof.

Theorem 3.

Given the definitions in (4)(5), let

\begin{align}
U_{ik}' &= U_{ik}+\eta\mu_i(t)z_i(t+1)x_k(t), \tag{10}\\
\mu_i'(t) &= H\Big(\kappa-z_i(t+1)\sum_{k=1}^{N}U_{ik}'x_k(t)\Big). \tag{11}
\end{align}

Then $\mu_i'(t)=0$ for sufficiently large $\eta>0$.
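The reason a large enough $\eta$ suffices can be seen from a short calculation, given here as an illustrative sketch that may differ from the proof in the Appendix. Substituting (10) into the argument of (11) and using $z_i(t+1)^2=1$ and $x_k(t)^2=1$ gives:

```latex
\begin{align*}
z_i(t+1)\sum_{k=1}^{N}U_{ik}'x_k(t)
&= z_i(t+1)\sum_{k=1}^{N}U_{ik}x_k(t)
   + \eta\,\mu_i(t)\,z_i(t+1)^2\sum_{k=1}^{N}x_k(t)^2 \\
&= z_i(t+1)\sum_{k=1}^{N}U_{ik}x_k(t) + \eta\,\mu_i(t)\,N .
\end{align*}
```

If $\mu_i(t)=0$, the weights are unchanged and $\mu_i'(t)=\mu_i(t)=0$. If $\mu_i(t)=1$, the update raises the margin by $\eta N$, so $\mu_i'(t)=0$ whenever $\eta > \big(\kappa - z_i(t+1)\sum_{k}U_{ik}x_k(t)\big)/N$.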

To understand why reducing the errors $\mu_i(t)$ and $\nu_j(t)$ leads to sequence attractors, we present the following result.

Theorem 4.

Given the definitions in (7)(8), let $\hat{\mathbf{y}}(t)=(\hat{y}_1(t),...,\hat{y}_M(t))^\top\in\{-1,1\}^M$ such that $\sum_k|\hat{y}_k(t)-y_k(t)|<\epsilon$. If $\nu_j(t)=0$ and

\begin{align}
\epsilon\cdot\max_k|V_{jk}|<\kappa, \tag{12}
\end{align}

then

\begin{align}
x_j(t+1)=\text{sign}\Big(\sum_{k=1}^{M}V_{jk}\hat{y}_k(t)\Big). \tag{13}
\end{align}

Theorem 4 shows that when the errors are zero, given perturbed hidden neuron states $\hat{\mathbf{y}}(t)$, we have $\mathbf{x}(t+1)=\text{sign}(\mathbf{V}\hat{\mathbf{y}}(t))$. The result can be trivially extended to show that, given perturbed visible neuron states $\hat{\mathbf{x}}(t)$, we have $\mathbf{y}(t)=\text{sign}(\mathbf{U}\hat{\mathbf{x}}(t))$ by a similar proof. Therefore, the network can generate the sequence $\mathbf{x}(1),...,\mathbf{x}(T)$ as an attractor. From Theorem 4, we can also see that $\kappa$ acts as the robustness hyperparameter, as it controls the level of perturbation $\epsilon$ for which inequality (12) holds.
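The role of the margin $\kappa$ can be made explicit with a short worked derivation, offered as a sketch that may differ from the proof in the Appendix. By (8), $\nu_j(t)=0$ means $x_j(t+1)\sum_k V_{jk}y_k(t)>\kappa$. For a perturbed hidden state $\hat{\mathbf{y}}(t)$:

```latex
\begin{align*}
x_j(t+1)\sum_{k=1}^{M}V_{jk}\hat{y}_k(t)
&= x_j(t+1)\sum_{k=1}^{M}V_{jk}y_k(t)
   + x_j(t+1)\sum_{k=1}^{M}V_{jk}\big(\hat{y}_k(t)-y_k(t)\big) \\
&\geq x_j(t+1)\sum_{k=1}^{M}V_{jk}y_k(t)
   - \max_k|V_{jk}|\sum_{k=1}^{M}\big|\hat{y}_k(t)-y_k(t)\big| \\
&> \kappa - \epsilon\cdot\max_k|V_{jk}| \;>\; 0 .
\end{align*}
```

The last inequality uses (12). Since $x_j(t+1)\in\{-1,1\}$ and its product with the input sum is positive, the sign of the sum equals $x_j(t+1)$, which is (13).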

To understand why the algorithm works even though $\mathbf{P}$ is a random matrix that stays fixed during learning, consider the following. If the update of $\mathbf{U}$ converges, then $\mu_i(t)=0$ for all $i$. Therefore,

\begin{align}
y_i(t) &= \text{sign}\Big(\sum_{k=1}^{N} U_{ik}x_k(t)\Big) \tag{14} \\
&= z_i(t+1) \tag{15} \\
&= \text{sign}\Big(\sum_{k=1}^{N} P_{ik}x_k(t+1)\Big). \tag{16}
\end{align}

The update of $\mathbf{V}$ aims to make the condition $\text{sign}\Big(\sum_{k=1}^{M}V_{jk}y_k(t)\Big)=x_j(t+1)$ hold. Substituting (16) for $y_k(t)$, this condition reads in matrix form

\begin{align}
\text{sign}(\mathbf{V}\,\text{sign}(\mathbf{P}\mathbf{x}(t+1)))=\mathbf{x}(t+1). \tag{17}
\end{align}

For large $M$, a solution $\mathbf{V}$ exists: the pseudo-inverse of $\mathbf{P}$ or the transpose of $\mathbf{P}$. The numerical result is shown in Figure 3. The phenomenon might be explained by high-dimensional probability theory [Vershynin, 2018].

Figure 3: Reconstruction error $\|\mathbf{x}-\text{sign}(\mathbf{V}\,\text{sign}(\mathbf{P}\mathbf{x}))\|_1$, where $\mathbf{P}$ is an $M\times N$ random matrix and $\mathbf{x}$ is a random vector uniformly sampled from $\{-1,1\}^N$. $\mathbf{P}^{+}$ denotes the pseudo-inverse of $\mathbf{P}$. $\mathbf{P}^{\top}$ denotes the transpose of $\mathbf{P}$. (a) The entries of $\mathbf{P}$ are sampled i.i.d. from the standard Gaussian distribution. (b) The entries of $\mathbf{P}$ are sampled i.i.d. from the uniform distribution on $[-1,1]$. In (a) and (b), $N=100$ and the results are averaged over 100 trials. The results are similar in (a) and (b). The error bars are not displayed for visual clarity.
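The reconstruction experiment behind Figure 3 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used for the figure: the function name `reconstruction_error`, its parameters, and the convention that `sign` maps 0 to +1 are assumptions made here.

```python
import numpy as np

def sign(a):
    # Threshold activation; mapping 0 to +1 is an assumed tie-breaking convention.
    return np.where(a >= 0, 1.0, -1.0)

def reconstruction_error(M, N=100, trials=100, decoder="pinv", dist="gaussian", seed=0):
    """Mean of ||x - sign(V sign(P x))||_1 over random draws of P and x.

    decoder="pinv" uses V = P^+, decoder="transpose" uses V = P^T.
    dist selects Gaussian or uniform[-1, 1] entries for P.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        if dist == "gaussian":
            P = rng.standard_normal((M, N))
        else:
            P = rng.uniform(-1.0, 1.0, size=(M, N))
        x = rng.choice([-1.0, 1.0], size=N)
        V = np.linalg.pinv(P) if decoder == "pinv" else P.T
        total += np.abs(x - sign(V @ sign(P @ x))).sum()
    return total / trials

if __name__ == "__main__":
    for M in (50, 200, 1000):
        print(M, reconstruction_error(M, trials=20, decoder="transpose"))
```

Each mismatched entry contributes 2 to the $\ell_1$ error, so the returned value lies between 0 and $2N$.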

5.2 Robustness Hyperparameter

Having a hyperparameter $\kappa$ in the algorithm is not problematic in practice. One can simply set $\kappa=1$, as we did for all the experiments in the next section, and adjust the scale of the initial weights and the learning rate. In the margin perceptron, the margin learned is disproportional to the learning rate [Collobert and Bengio, 2004]. The margin is defined as the reciprocal of the weight magnitude, which is related to the robustness hyperparameter, as shown in Theorem 4. Therefore, it can be interpreted that the robustness hyperparameter is automatically adjusted during learning.

6 Experiments

We ran experiments on synthetic and real-world sequence datasets, training networks with hidden neurons by the algorithm proposed in the previous section to learn sequence attractors. All the experiments were carried out in MATLAB and PyTorch. In all the experiments, each weight of $\mathbf{U}$, $\mathbf{V}$ and $\mathbf{P}$ was sampled i.i.d. from a Gaussian distribution with mean zero and variance $1\times 10^{-6}$, with learning rate $\eta=1\times 10^{-3}$ and robustness $\kappa=1$. In each experiment, we ran the algorithm for 500 epochs. In each epoch, the algorithm ran on the pairs $(\mathbf{x}(t),\mathbf{x}(t+1))$ from the start to the end of each sequence. No noise was added during learning; noise was added only at retrieval. We also provide additional experiments in the Appendix.
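The training setup described above can be sketched as follows. This is a NumPy illustration under stated assumptions, not the MATLAB or PyTorch code used for the experiments: the update rule is written in the margin-perceptron form suggested by (27) and (28), the analogous rule for $\mathbf{V}$ is assumed, the hidden state is computed after the update of $\mathbf{U}$, `sign` maps 0 to +1, and the names `train`, `init_std` and `history` are hypothetical.

```python
import numpy as np

def sign(a):
    # Threshold activation; mapping 0 to +1 is an assumed convention.
    return np.where(a >= 0, 1.0, -1.0)

def train(X, M, epochs=500, eta=1e-3, kappa=1.0, init_std=1e-3, seed=0):
    """X: array of shape (T, N) with entries in {-1, +1}.

    Returns U, V, P and the per-epoch average errors for hidden and visible neurons.
    init_std=1e-3 corresponds to the variance 1e-6 quoted in the text.
    """
    rng = np.random.default_rng(seed)
    T, N = X.shape
    U = rng.normal(0.0, init_std, size=(M, N))
    V = rng.normal(0.0, init_std, size=(N, M))
    P = rng.normal(0.0, init_std, size=(M, N))  # fixed random projection
    history = []
    for _ in range(epochs):
        mu_sum, nu_sum = 0.0, 0.0
        for t in range(T - 1):
            x, x_next = X[t], X[t + 1]
            z = sign(P @ x_next)                        # hidden targets z(t+1)
            mu = (z * (U @ x) < kappa).astype(float)    # hidden margin violations
            U += eta * np.outer(mu * z, x)              # local update of U
            y = sign(U @ x)                             # hidden state y(t)
            nu = (x_next * (V @ y) < kappa).astype(float)  # visible margin violations
            V += eta * np.outer(nu * x_next, y)         # local update of V
            mu_sum += float(mu.sum())
            nu_sum += float(nu.sum())
        history.append((mu_sum / M, nu_sum / N))
    return U, V, P, history
```

Once an epoch finishes with both errors at zero, every stored transition satisfies its margin condition, so one noiseless step `sign(V @ sign(U @ X[t]))` reproduces `X[t + 1]`.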

Figure 4: Recurrent networks with hidden neurons can generate the two sequences in Figure 1, which cannot be generated without hidden neurons, despite noisy initial states. Note that in the first column of each diagram, two salt-and-pepper noises are added to test the robustness of the retrieval.

6.1 Toy Examples

To show that recurrent networks with hidden neurons can overcome the limitation of networks without hidden neurons, we conducted experiments on the examples in Figure 1. For each example, we constructed a network with $N=10$ visible neurons and $M=50$ hidden neurons. After learning, we tested the robustness of the networks in retrieval by adding two salt-and-pepper noises (flipping the states of two out of ten neurons) to the first pattern of a sequence and setting it as the initial network state.

The results are shown in Figure 4. The networks with hidden neurons can generate sequences that cannot be generated by networks without hidden neurons, and they retrieve these sequences robustly under a moderate level of noise.

6.2 Random Sequences

We generated periodic sequences of random patterns $\mathbf{x}(1),\dots,\mathbf{x}(T)\in\{-1,1\}^N$. In each sequence, $\mathbf{x}(i)\neq\mathbf{x}(j)$ for $i\neq j$, except that $\mathbf{x}(1)=\mathbf{x}(T)$ for periodicity. We set $N=100$ and varied the period length $T$. We sampled each $\mathbf{x}(t)$ independently from the uniform distribution on $\{-1,1\}^N$ for $t=1,\dots,T-1$, resampling it if it was identical to a previous pattern. Finally, we set $\mathbf{x}(T)=\mathbf{x}(1)$.
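The sampling procedure above amounts to rejection sampling of distinct patterns followed by closing the loop. A minimal sketch, assuming NumPy and a hypothetical helper name `random_periodic_sequence`:

```python
import numpy as np

def random_periodic_sequence(T, N=100, seed=0):
    """Return an array of shape (T, N) with entries in {-1, +1}.

    Rows 0..T-2 are pairwise distinct and the last row repeats the first,
    so the sequence is periodic.
    """
    if T - 1 > 2 ** N:
        raise ValueError("not enough distinct patterns of length N")
    rng = np.random.default_rng(seed)
    patterns, seen = [], set()
    while len(patterns) < T - 1:
        x = rng.choice([-1, 1], size=N)
        key = x.tobytes()
        if key in seen:          # resample if identical to a previous pattern
            continue
        seen.add(key)
        patterns.append(x)
    patterns.append(patterns[0].copy())  # x(T) = x(1)
    return np.stack(patterns)
```

For $N=100$ a repeated pattern is extremely unlikely, so the rejection step matters mainly for small $N$.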

For each random sequence, we constructed a network with hidden neurons and applied the proposed learning algorithm. To evaluate the effectiveness of the learning algorithm, we compared learning only $\mathbf{V}$ (with $\mathbf{U}$ fixed during learning) and learning both $\mathbf{U}$ and $\mathbf{V}$. Once learning was done, we tested whether the network could retrieve the sequence robustly, using $\mathbf{x}(1)$ perturbed with 10 salt-and-pepper noises as the initial network state $\bm{\xi}(1)$. We define the retrieval as successful if $\bm{\xi}(\tau+t)=\mathbf{x}(t)$ for some $\tau$ and all $t=1,\dots,T$. We ran 100 trials for each setting of $T$ or $M$ and counted the successful retrievals.
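The retrieval test can be made concrete with a short sketch. It assumes that a salt-and-pepper noise on a $\pm 1$ pattern means a sign flip (as described for the toy examples), that the network runs without biases as $\bm{\xi}(t+1)=\text{sign}(\mathbf{V}\,\text{sign}(\mathbf{U}\bm{\xi}(t)))$, and that `sign` maps 0 to +1. The helper names are hypothetical.

```python
import numpy as np

def sign(a):
    # Threshold activation; mapping 0 to +1 is an assumed convention.
    return np.where(a >= 0, 1, -1)

def flip_noise(x, n_flips, rng):
    """Flip the sign of n_flips distinct entries of x."""
    noisy = x.copy()
    idx = rng.choice(x.size, size=n_flips, replace=False)
    noisy[idx] *= -1
    return noisy

def run_network(U, V, xi1, steps):
    """Iterate the visible state for `steps` time steps, starting from xi1."""
    traj = [np.asarray(xi1)]
    for _ in range(steps - 1):
        traj.append(sign(V @ sign(U @ traj[-1])))
    return np.stack(traj)

def retrieval_successful(traj, X):
    """True if some window traj[tau : tau + T] equals the stored sequence X."""
    T = X.shape[0]
    for tau in range(traj.shape[0] - T + 1):
        if np.array_equal(traj[tau:tau + T], X):
            return True
    return False
```

Because the offset $\tau$ is free, the trajectory needs to be run for more than $T$ steps so that a transient before the attractor is tolerated.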

In Figure 5 (a), we show the results for various period lengths $T$ with $M=500$. In Figure 5 (b), we show the results for various numbers of hidden neurons $M$ with $T=70$. We can see that learning both $\mathbf{U}$ and $\mathbf{V}$ is more effective than learning only $\mathbf{V}$. However, in both cases, the algorithm failed for large $T$, even when we increased the number of hidden neurons, which might be due to the suboptimality of the algorithm.

Figure 5: Successful retrievals out of 100 trials under noise, comparing learning only $\mathbf{V}$ and learning both $\mathbf{U}$ and $\mathbf{V}$. (a) Varying sequence length. (b) Varying number of hidden neurons.

6.3 Real-World Sequences

We tested networks with hidden neurons, trained by our algorithm, on real-world sequences from a silhouette sequence dataset (OU-ISIR gait database large population [Iwama et al., 2012]) and a handwriting sequence dataset (Moving MNIST [Srivastava et al., 2015]). The patterns in the sequences are rather correlated since adjacent image frames are similar. To adapt the datasets for the networks to learn, we converted the image intensity values to $\pm 1$.

For the OU-ISIR gait dataset, we used a network with $M=200$ hidden neurons to learn a single image sequence of length 103, in which each image has size $88\times 128$. The images were flattened to vectors of size $88\times 128=11264$. For the Moving MNIST dataset, we used a network with $M=1000$ hidden neurons to learn 20 image sequences of length 20, in which each image has size $64\times 64$. The images were flattened to vectors of size $64\times 64=4096$. In Figures 6 and 7, we show visualizations of robust retrieval by the learned networks, in which the first image of a sequence was corrupted and set as the initial state of the network.
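The preprocessing described above (binarizing intensities to $\pm 1$ and flattening each frame) can be sketched as below. The threshold value and the helper name `images_to_patterns` are assumptions; the text does not state how intensities were thresholded.

```python
import numpy as np

def images_to_patterns(frames, threshold=0.5):
    """Convert frames of shape (T, H, W) to +/-1 patterns of shape (T, H*W).

    Pixels above `threshold` map to +1 and all others to -1. The threshold
    is an assumed choice for intensities scaled to [0, 1].
    """
    frames = np.asarray(frames, dtype=float)
    flat = frames.reshape(frames.shape[0], -1)
    return np.where(flat > threshold, 1, -1)
```

A Moving MNIST sequence of 20 frames of size $64\times 64$ would then become an array of shape (20, 4096).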

In Figure 8, we show the average errors $\frac{1}{M}\sum_t\sum_i\mu_i(t)$ and $\frac{1}{N}\sum_t\sum_j\nu_j(t)$ during the learning process, from which we can see that both errors decrease to zero smoothly. This result demonstrates that $\mathbf{U}$ and $\mathbf{V}$ can be learned cooperatively instead of conflicting with each other during learning.

Figure 6: Retrieval of sequences under noise on the OU-ISIR gait dataset. An image sequence of length 103 is learned. Each image has size $88\times 128$. In (a) and (b), $\mathbf{x}(t)$ and $\bm{\xi}(t)$ are shown respectively for $t=1,\dots,10$. In (b), 2000 salt-and-pepper noises are added to the first image. The corrupted image is set as the initial state of the network.
Figure 7: Retrieval of sequences under noise on the Moving MNIST dataset. 20 image sequences of length 20 are learned. Due to space limitations, only one image sequence is displayed here. Each image has size $64\times 64$. In (a) and (b), $\mathbf{x}(t)$ and $\bm{\xi}(t)$ are shown respectively for $t=1,\dots,8$. In (b), 300 salt-and-pepper noises are added to the first image. The corrupted image is set as the initial state of the network.
Figure 8: Errors during learning on (a) OU-ISIR gait and (b) Moving MNIST. $\frac{1}{M}\sum_t\sum_i\mu_i(t)$ is the average error for the hidden neurons. $\frac{1}{N}\sum_t\sum_j\nu_j(t)$ is the average error for the visible neurons.

7 Discussion and Conclusion

In this paper, we have investigated how recurrent networks of binary neurons learn sequence attractors to represent temporal sequence information. We showed that to store arbitrary sequence patterns, it is necessary for the networks to include hidden neurons. We developed a local learning algorithm and demonstrated that our model works well on synthetic and real-world datasets. Thus, our work provides a possible biologically plausible mechanism for elucidating sequence memory in the brain. In our model, hidden neurons are not directly involved in expressing pattern sequences. Instead, their contribution lies in facilitating the storage and retrieval of pattern sequences. The indirect but indispensable role of hidden neurons may have far-reaching implications for neural information processing.

Modern Hopfield networks also employ hidden neurons for sequence learning [Chaudhry et al., 2023]. Our approach differs in several aspects. First, our network model requires only the threshold activation function, while modern Hopfield networks require a polynomial or exponential activation function for the hidden neurons. Second, the weights in our network model are learned from scratch in an online manner, while modern Hopfield networks store the sequence patterns explicitly as the weights in a predefined manner. Third, our learning algorithm requires only local information between neurons, while modern Hopfield networks require the pseudo-inverse computation. Thus, our approach provides an alternative mechanism of sequence memory to modern Hopfield networks.

To pursue theoretical analysis, we have employed a very simple network model with binary neurons and threshold dynamics. From this simple model, we can gain some insights into the neural mechanisms of sequence processing in the brain (as classical Hopfield networks did for static memories), but this simplification also incurs limitations. To fully validate our results, further research with biologically more plausible models is needed, including, for instance, biologically more realistic neuron models, synapse models, connection structures, learning rules and forms of pattern sequences. Additionally, we look forward to seeing improvements of our learning algorithm for larger network capacity and robustness.

Acknowledgements

This work was supported by the Science and Technology Innovation 2030-Brain Science and Brain-inspired Intelligence Project (No. 2021ZD0200204).

References

  • Amari, [1972] Amari, S.-I. (1972). Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions on Computers.
  • Amit, [1989] Amit, D. J. (1989). Modeling Brain Function: The World of Attractor Neural Networks. Cambridge University Press.
  • Bengio, [2014] Bengio, Y. (2014). How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906.
  • Brea et al., [2013] Brea, J., Senn, W., and Pfister, J.-P. (2013). Matching recall and storage in sequence learning with spiking neural networks. Journal of Neuroscience.
  • Bressloff and Taylor, [1992] Bressloff, P. C. and Taylor, J. G. (1992). Perceptron-like learning in time-summating neural networks. Journal of Physics A: Mathematical and General.
  • Chaudhry et al., [2023] Chaudhry, H. T., Zavatone-Veth, J. A., Krotov, D., and Pehlevan, C. (2023). Long sequence Hopfield memory. Advances in Neural Information Processing Systems.
  • Collobert and Bengio, [2004] Collobert, R. and Bengio, S. (2004). Links between perceptrons, MLPs and SVMs. International Conference on Machine Learning.
  • Davidson et al., [2009] Davidson, T. J., Kloosterman, F., and Wilson, M. A. (2009). Hippocampal replay of extended experience. Neuron.
  • Eichenbaum, [2014] Eichenbaum, H. (2014). Time cells in the hippocampus: a new dimension for mapping memories. Nature Reviews Neuroscience.
  • Fiete et al., [2010] Fiete, I. R., Senn, W., Wang, C. Z., and Hahnloser, R. H. (2010). Spike-time-dependent plasticity and heterosynaptic competition organize networks to produce long scale-free sequences of neural activity. Neuron.
  • Foster and Wilson, [2006] Foster, D. J. and Wilson, M. A. (2006). Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature.
  • Frémaux and Gerstner, [2016] Frémaux, N. and Gerstner, W. (2016). Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Frontiers in Neural Circuits.
  • Gardner, [1988] Gardner, E. (1988). The space of interactions in neural network models. Journal of Physics A: Mathematical and General.
  • Gillett et al., [2020] Gillett, M., Pereira, U., and Brunel, N. (2020). Characteristics of sequential activity in networks with temporally asymmetric Hebbian learning. Proceedings of the National Academy of Sciences.
  • Hopfield, [1982] Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences.
  • Iwama et al., [2012] Iwama, H., Okumura, M., Makihara, Y., and Yagi, Y. (2012). The OU-ISIR gait database comprising the large population dataset and performance evaluation of gait recognition. IEEE Transactions on Information Forensics and Security.
  • Karuvally et al., [2023] Karuvally, A., Sejnowski, T., and Siegelmann, H. T. (2023). General sequential episodic memory model. International Conference on Machine Learning.
  • Khona and Fiete, [2022] Khona, M. and Fiete, I. R. (2022). Attractor and integrator networks in the brain. Nature Reviews Neuroscience.
  • Kleinfeld, [1986] Kleinfeld, D. (1986). Sequential state generation by model neural networks. Proceedings of the National Academy of Sciences.
  • Krotov and Hopfield, [2016] Krotov, D. and Hopfield, J. J. (2016). Dense associative memory for pattern recognition. Advances in Neural Information Processing Systems.
  • Kuśmierz et al., [2017] Kuśmierz, Ł., Isomura, T., and Toyoizumi, T. (2017). Learning with three factors: modulating Hebbian plasticity with errors. Current Opinion in Neurobiology.
  • Laje and Buonomano, [2013] Laje, R. and Buonomano, D. V. (2013). Robust timing and motor patterns by taming chaos in recurrent neural networks. Nature Neuroscience.
  • LeCun, [1986] LeCun, Y. (1986). Learning process in an asymmetric threshold network. Disordered Systems and Biological Organization.
  • LeCun, [1987] LeCun, Y. (1987). Modèles connexionnistes de l'apprentissage. PhD thesis, Thèse de Doctorat, Université Paris.
  • Lee and Wilson, [2002] Lee, A. K. and Wilson, M. A. (2002). Memory of sequential experience in the hippocampus during slow wave sleep. Neuron.
  • Lillicrap et al., [2016] Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications.
  • Lillicrap et al., [2020] Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., and Hinton, G. (2020). Backpropagation and the brain. Nature Reviews Neuroscience.
  • Litwin-Kumar et al., [2017] Litwin-Kumar, A., Harris, K. D., Axel, R., Sompolinsky, H., and Abbott, L. (2017). Optimal degrees of synaptic connectivity. Neuron.
  • Minsky and Papert, [1969] Minsky, M. and Papert, S. A. (1969). Perceptrons. MIT Press.
  • Muscinelli et al., [2017] Muscinelli, S. P., Gerstner, W., and Brea, J. (2017). Exponentially long orbits in Hopfield neural networks. Neural Computation.
  • Nádasdy et al., [1999] Nádasdy, Z., Hirase, H., Czurkó, A., Csicsvari, J., and Buzsáki, G. (1999). Replay and time compression of recurring spike sequences in the hippocampus. Journal of Neuroscience.
  • Pastalkova et al., [2008] Pastalkova, E., Itskov, V., Amarasingham, A., and Buzsaki, G. (2008). Internally generated cell assembly sequences in the rat hippocampus. Science.
  • Pfeiffer and Foster, [2013] Pfeiffer, B. E. and Foster, D. J. (2013). Hippocampal place-cell sequences depict future paths to remembered goals. Nature.
  • Pfeiffer and Foster, [2015] Pfeiffer, B. E. and Foster, D. J. (2015). Autoassociative dynamics in the generation of sequences of hippocampal place cells. Science.
  • Rajakumar et al., [2021] Rajakumar, A., Rinzel, J., and Chen, Z. S. (2021). Stimulus-driven and spontaneous dynamics in excitatory-inhibitory recurrent neural networks for sequence representation. Neural Computation.
  • Rajan et al., [2016] Rajan, K., Harvey, C. D., and Tank, D. W. (2016). Recurrent network models of sequence generation and memory. Neuron.
  • Sompolinsky and Kanter, [1986] Sompolinsky, H. and Kanter, I. (1986). Temporal association in asymmetric neural networks. Physical Review Letters.
  • Srivastava et al., [2015] Srivastava, N., Mansimov, E., and Salakhutdinov, R. (2015). Unsupervised learning of video representations using LSTMs. International Conference on Machine Learning.
  • Tang et al., [2023] Tang, M., Barron, H., and Bogacz, R. (2023). Sequential memory with temporal predictive coding. Advances in Neural Information Processing Systems.
  • Tully et al., [2016] Tully, P. J., Lindén, H., Hennig, M. H., and Lansner, A. (2016). Spike-based Bayesian-Hebbian learning of temporal sequences. PLoS Computational Biology.
  • Tulving, [2002] Tulving, E. (2002). Episodic memory: From mind to brain. Annual Review of Psychology.
  • Vershynin, [2018] Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.

Appendix A Proof of Theorem 1

We construct a network such that, given $\bm{\xi}(t)=\mathbf{x}(i)$ for $i=1,\dots,T-1$, the hidden neurons provide a one-hot encoding of the successive pattern $\mathbf{x}(i+1)$, which is then decoded to be $\bm{\xi}(t+1)$.

To store $\mathbf{x}(1),\dots,\mathbf{x}(T)\in\{-1,1\}^N$ in (2) and (3), assuming $\mathbf{x}(i)\neq\mathbf{x}(j)$ for $i\neq j$ except that $\mathbf{x}(1)=\mathbf{x}(T)$, let $M=T-1$ and construct the weight matrix $\mathbf{U}$ as

\begin{align}
\mathbf{U}=(\mathbf{x}(1),\mathbf{x}(2),\dots,\mathbf{x}(T-1))^{\top} \tag{18}
\end{align}

and hidden neurons $\bm{\zeta}(t)=(\zeta_1(t),\dots,\zeta_M(t))^{\top}$ as

\begin{align}
\zeta_i(t) &= \text{sign}\Big(\sum_{k=1}^{N} U_{ik}\xi_k(t)-N\Big) \tag{19} \\
&= \text{sign}\big(\mathbf{x}(i)^{\top}\bm{\xi}(t)-N\big) \tag{20}
\end{align}

such that given $\bm{\xi}(t)=\mathbf{x}(i)$ for $i=1,\dots,T-1$, we have

\begin{align}
\zeta_j(t)=\begin{cases} \ \ \ 1, & \text{if } j=i, \\ -1, & \text{otherwise}. \end{cases} \tag{21}
\end{align}

Next, we construct the weight matrix $\mathbf{V}$ as

\begin{align}
\mathbf{V}=(\mathbf{x}(2),\mathbf{x}(3),\dots,\mathbf{x}(T)) \tag{22}
\end{align}

and visible neurons $\bm{\xi}(t+1)=(\xi_1(t+1),\dots,\xi_N(t+1))^{\top}$ as

\begin{align}
\bm{\xi}(t+1)=\text{sign}(\mathbf{V}\bm{\zeta}(t)+\bm{\theta}) \tag{23}
\end{align}

where $\bm{\theta}=\sum_{j=2}^{T}\mathbf{x}(j)$ such that given the one-hot vector $\bm{\zeta}(t)$ we have

\begin{align}
\bm{\xi}(t+1) &= \text{sign}\Big(\mathbf{x}(i+1)-\sum_{j\neq i+1}\mathbf{x}(j)+\sum_{j=2}^{T}\mathbf{x}(j)\Big) \tag{24} \\
&= \text{sign}(2\cdot\mathbf{x}(i+1)) \tag{25} \\
&= \mathbf{x}(i+1). \tag{26}
\end{align}
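The construction in (18) to (26) can be checked numerically. The sketch below builds $\mathbf{U}$, $\mathbf{V}$ and $\bm{\theta}$ from a given periodic sequence and returns the one-step map. It assumes that `sign` maps 0 to +1, which is needed for the matching hidden neuron in (19) to fire. The helper name `one_hot_network` is hypothetical.

```python
import numpy as np

def sign(a):
    # Mapping 0 to +1 is an assumption required by (19) when xi(t) = x(i).
    return np.where(a >= 0, 1, -1)

def one_hot_network(X):
    """X: array of shape (T, N), first T-1 rows distinct, X[0] == X[-1].

    Returns a function mapping xi(t) to xi(t+1) using M = T-1 hidden neurons.
    """
    T, N = X.shape
    U = X[:-1]                 # (T-1, N): row i is x(i), eq. (18)
    V = X[1:].T                # (N, T-1): column i is x(i+1), eq. (22)
    theta = X[1:].sum(axis=0)  # bias of the visible neurons, eq. (23)

    def step(xi):
        zeta = sign(U @ xi - N)        # one-hot code in {-1, +1}, eqs. (19)-(21)
        return sign(V @ zeta + theta)  # decodes to the next pattern, eqs. (24)-(26)

    return step

if __name__ == "__main__":
    X = np.array([[1, 1, 1, 1], [1, -1, 1, -1], [-1, -1, 1, 1],
                  [1, 1, -1, -1], [1, 1, 1, 1]])
    step = one_hot_network(X)
    print(all(np.array_equal(step(X[t]), X[t + 1]) for t in range(4)))
```

The hidden layer here has one neuron per stored transition, so the number of hidden neurons grows linearly with the sequence length.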

Appendix B Proof of Theorem 2

Note that the update of $\mathbf{U}$, given by (4), (5) and (6) in Section 5, does not depend on $\mathbf{V}$. Therefore, we first prove the convergence of updating $\mathbf{U}$ for $\eta>0$ and $\kappa>0$. The proof follows [Gardner, 1988]. Assume $\mathbf{U}^{*}$ exists such that, for all $t$ and $i$,

\begin{align}
z_i(t+1)\sum_{k}U^{*}_{ik}x_k(t)\geq\kappa. \tag{27}
\end{align}

Define the $p$-th update of $\mathbf{U}$ with $\mu_i(t_p)=1$ by

\begin{align}
U_{ij}^{(p+1)}=U_{ij}^{(p)}+\eta z_i(t_p+1)x_j(t_p) \tag{28}
\end{align}

for some $t_p\in\{1,\dots,T-1\}$ and all $j$ in parallel. We assume zero initialization, that is, $U_{ij}^{(1)}=0$, for simplicity, but the result holds if $|U_{ij}^{(1)}|$ is sufficiently small. Let

\begin{align}
X_i^{(p+1)}=\frac{\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}}{\sqrt{\sum_{j}\big(U_{ij}^{(p+1)}\big)^2}\sqrt{\sum_{j}\big(U_{ij}^{*}\big)^2}}. \tag{29}
\end{align}

By the Cauchy-Schwarz inequality, we have $X_i^{(p+1)}\leq 1$. Now we prove the convergence of updating $\mathbf{U}$ by contradiction. Assuming the update of $\mathbf{U}$ does not converge, we will show that $X_i^{(p+1)}>1$ as $p\to\infty$. First, we have

\begin{align}
\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}-\sum_{j}U_{ij}^{(p)}U_{ij}^{*}=\eta\sum_{j}z_i(t_p+1)U_{ij}^{*}x_j(t_p)\geq\eta\kappa \tag{30}
\end{align}

due to (27) and therefore

\begin{align}
\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*} &= \sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}-\sum_{j}U_{ij}^{(p)}U_{ij}^{*}+\dots+\sum_{j}U_{ij}^{(2)}U_{ij}^{*}-\sum_{j}U_{ij}^{(1)}U_{ij}^{*}+\sum_{j}U_{ij}^{(1)}U_{ij}^{*} \tag{31} \\
&\geq \eta\kappa p \tag{32}
\end{align}

since we assumed $U_{ij}^{(1)}=0$. Next, we have

\begin{align}
\sum_{j}\big(U_{ij}^{(p+1)}\big)^{2}-\sum_{j}\big(U_{ij}^{(p)}\big)^{2} &= \sum_{j}\big(U_{ij}^{(p)}+\eta z_{i}(t_{p}+1)x_{j}(t_{p})\big)^{2}-\sum_{j}\big(U_{ij}^{(p)}\big)^{2} \tag{33} \\
&= 2\eta\sum_{j}U_{ij}^{(p)}z_{i}(t_{p}+1)x_{j}(t_{p})+N\eta^{2} \tag{34} \\
&= 2\eta z_{i}(t_{p}+1)\sum_{j}U_{ij}^{(p)}x_{j}(t_{p})+N\eta^{2} \tag{35} \\
&< 2\eta\kappa+N\eta^{2} \tag{36}
\end{align}

since we assumed $\mu_{i}(t_{p})=1$ and therefore $z_{i}(t_{p}+1)\sum_{j}U_{ij}^{(p)}x_{j}(t_{p})<\kappa$. Then, we have

\begin{align}
& \sqrt{\sum_{j}\big(U_{ij}^{(p+1)}\big)^{2}}-\sqrt{\sum_{j}\big(U_{ij}^{(p)}\big)^{2}} \tag{37} \\
={}& \Big(\sum_{j}\big(U_{ij}^{(p+1)}\big)^{2}-\sum_{j}\big(U_{ij}^{(p)}\big)^{2}\Big)\Big/\Big(\sqrt{\sum_{j}\big(U_{ij}^{(p+1)}\big)^{2}}+\sqrt{\sum_{j}\big(U_{ij}^{(p)}\big)^{2}}\Big) \tag{38} \\
<{}& (2\eta\kappa+N\eta^{2})\Big/\Big(\sqrt{\sum_{j}\big(U_{ij}^{(p+1)}\big)^{2}}+\sqrt{\sum_{j}\big(U_{ij}^{(p)}\big)^{2}}\Big). \tag{39}
\end{align}

By the Cauchy-Schwarz inequality, we have

\begin{equation}
\sqrt{\sum_{j}\big(U_{ij}^{(p+1)}\big)^{2}}\,\sqrt{\sum_{j}\big(U_{ij}^{*}\big)^{2}}\geq\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}\geq\eta\kappa p \tag{40}
\end{equation}

and therefore

\begin{equation}
\sqrt{\sum_{j}\big(U_{ij}^{(p+1)}\big)^{2}}\geq\frac{\eta\kappa p}{\sqrt{\sum_{j}\big(U_{ij}^{*}\big)^{2}}}. \tag{41}
\end{equation}

Also,

\begin{align}
\sqrt{\sum_{j}\big(U_{ij}^{(p+1)}\big)^{2}} &= \sqrt{\sum_{j}\big(U_{ij}^{(p+1)}\big)^{2}}-\sqrt{\sum_{j}\big(U_{ij}^{(p)}\big)^{2}}+\dots+\sqrt{\sum_{j}\big(U_{ij}^{(2)}\big)^{2}}-\sqrt{\sum_{j}\big(U_{ij}^{(1)}\big)^{2}} \tag{42} \\
&\quad +\sqrt{\sum_{j}\big(U_{ij}^{(1)}\big)^{2}} \tag{43} \\
&< \sum_{q=1}^{p}(2\eta\kappa+N\eta^{2})\Big/\Big(\sqrt{\sum_{j}\big(U_{ij}^{(q+1)}\big)^{2}}+\sqrt{\sum_{j}\big(U_{ij}^{(q)}\big)^{2}}\Big) \tag{44} \\
&< \sum_{q=1}^{p}(2\eta\kappa+N\eta^{2})\sqrt{\sum_{j}\big(U_{ij}^{*}\big)^{2}}\,\frac{1}{\eta\kappa(2q-1)} \tag{45} \\
&= \frac{\eta\kappa+N\eta^{2}/2}{\eta\kappa}\sqrt{\sum_{j}\big(U_{ij}^{*}\big)^{2}}\,\sum_{q=1}^{p}\frac{1}{q-1/2} \tag{46}
\end{align}

due to (39), (41), and $U_{ij}^{(1)}=0$. Note that for $q>1$

\begin{equation}
\frac{1}{q-1/2}\leq\int_{q-3/2}^{q-1/2}\frac{1}{x}\,dx=\log(q-1/2)-\log(q-3/2) \tag{47}
\end{equation}

and

\begin{equation}
\sum_{q=1}^{p}\frac{1}{q-1/2}=2+\sum_{q=2}^{p}\frac{1}{q-1/2}\leq 2+\int_{1/2}^{p-1/2}\frac{1}{x}\,dx=2+\log(p-1/2)-\log(1/2). \tag{48}
\end{equation}
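The logarithmic bound in (48) is easy to check numerically. The following is a minimal sketch, with illustrative function names that are not part of the paper; it compares the partial sums $\sum_{q=1}^{p} 1/(q-1/2)$ against the closed-form right-hand side $2+\log(p-1/2)-\log(1/2)$ for a few values of $p$.

```python
import math


def half_shifted_harmonic(p: int) -> float:
    """Left-hand side of (48): sum_{q=1}^{p} 1 / (q - 1/2)."""
    return sum(1.0 / (q - 0.5) for q in range(1, p + 1))


def log_bound(p: int) -> float:
    """Right-hand side of (48): 2 + log(p - 1/2) - log(1/2).

    The two logarithms are merged into one so that p = 1 gives exactly 2.
    """
    return 2.0 + math.log((p - 0.5) / 0.5)


for p in (1, 10, 1000, 100000):
    lhs, rhs = half_shifted_harmonic(p), log_bound(p)
    print(f"p={p:>6}  sum={lhs:.4f}  bound={rhs:.4f}  holds={lhs <= rhs}")
```

The bound holds with equality at $p=1$ and grows only logarithmically afterwards, which is the growth rate used in (49).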

Therefore,

\begin{equation}
\sqrt{\sum_{j}\big(U_{ij}^{(p+1)}\big)^{2}}=O(\log(p)) \tag{49}
\end{equation}

and

\begin{equation}
\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}=\Omega(p) \tag{50}
\end{equation}

as $p\to\infty$. We have

\begin{equation}
X_{i}^{(p+1)}=\frac{\sum_{j}U_{ij}^{(p+1)}U_{ij}^{*}}{\sqrt{\sum_{j}\big(U_{ij}^{(p+1)}\big)^{2}}\,\sqrt{\sum_{j}\big(U_{ij}^{*}\big)^{2}}}>1 \tag{51}
\end{equation}

for some $p$. This contradicts $X_{i}^{(p+1)}\leq 1$. Thus, the update of $\mathbf{U}$ converges.
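The convergence argument can also be observed in simulation. The sketch below is a minimal illustration under stated assumptions: a single row $U_i$ initialized at zero, random $\pm 1$ patterns, and targets generated by a hypothetical teacher row that plays the role of $U_i^{*}$ up to a positive scale. The sizes, seed, and function names are illustrative and are not the paper's experimental settings. The row is updated by $U_{ij}\leftarrow U_{ij}+\eta z_i(t+1)x_j(t)$ whenever the margin $z_i(t+1)\sum_j U_{ij}x_j(t)$ falls below $\kappa$, and training stops after a full pass with no update.

```python
import math
import random


def train_row(patterns, targets, kappa=1.0, eta=0.1, max_epochs=10_000):
    """Margin-based update of one row U_i, started from zero.

    patterns[t] is x(t) in {-1,+1}^N and targets[t] is z_i(t+1) in {-1,+1}.
    Returns the learned row and the number of updates p.
    """
    n = len(patterns[0])
    u = [0.0] * n
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for x, z in zip(patterns, targets):
            margin = z * sum(uj * xj for uj, xj in zip(u, x))
            if margin < kappa:  # mu_i(t) = 1: the margin condition is violated
                u = [uj + eta * z * xj for uj, xj in zip(u, x)]
                updates += 1
                changed = True
        if not changed:  # a full pass without any update: converged
            return u, updates
    raise RuntimeError("no convergence within max_epochs")


def cosine(a, b):
    """Cosine similarity, the quantity X_i in (29)."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
N, T = 50, 20  # illustrative sizes
teacher = [rng.gauss(0.0, 1.0) for _ in range(N)]  # hypothetical U_i^*, up to scale
patterns = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(T)]
targets = [1 if sum(w * xj for w, xj in zip(teacher, x)) > 0 else -1
           for x in patterns]

u, p = train_row(patterns, targets)
print("updates:", p, " cosine with teacher:", round(cosine(u, teacher), 3))
```

On termination every pattern satisfies the margin condition, and the cosine with the teacher row stays at most 1, as the Cauchy-Schwarz inequality requires.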

Upon the convergence of updating $\mathbf{U}$, we can prove the convergence of $\mathbf{V}$ if there exists $\mathbf{V}^{*}$ such that for all $t$ and $i$,

\begin{equation}
x_{i}(t+1)\sum_{k}V^{*}_{ik}y_{k}(t)\geq\kappa \tag{52}
\end{equation}

by a similar proof.

Appendix C Proof of Theorem 3

If $\mu_{i}(t)=0$, then $U_{ik}^{\prime}=U_{ik}$ and $\mu_{i}^{\prime}(t)=\mu_{i}(t)=0$. If $\mu_{i}(t)=1$, then

\begin{align}
\mu_{i}^{\prime}(t) &= H\Big(\kappa-z_{i}(t+1)\sum_{k=1}^{N}\big(U_{ik}+\eta z_{i}(t+1)x_{k}(t)\big)x_{k}(t)\Big) \tag{53} \\
&= H\Big(\kappa-z_{i}(t+1)\sum_{k=1}^{N}U_{ik}x_{k}(t)-\eta\big(z_{i}(t+1)\big)^{2}\sum_{k=1}^{N}\big(x_{k}(t)\big)^{2}\Big) \tag{54} \\
&= H\Big(\kappa-z_{i}(t+1)\sum_{k=1}^{N}U_{ik}x_{k}(t)-\eta N\Big)=0 \tag{55}
\end{align}

for sufficiently large $\eta>0$, given $x_{k}(t)=\pm 1$, $z_{i}(t+1)=\pm 1$, and the property of the Heaviside function.
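Reading (55) directly, the error signal vanishes after one update once $\eta N\geq\kappa-z_{i}(t+1)\sum_{k}U_{ik}x_{k}(t)$. The sketch below illustrates this on a toy example with hand-picked numbers; the function names are illustrative, and the Heaviside convention $H(a)=1$ for $a>0$ and $H(a)=0$ otherwise is an assumption inferred from how $\mu_i(t)$ is used in the proofs.

```python
def heaviside(a: float) -> int:
    """H(a) = 1 if a > 0 else 0 (assumed convention)."""
    return 1 if a > 0 else 0


def error_signal(u, x, z, kappa):
    """mu_i(t) = H(kappa - z_i(t+1) * sum_k U_ik x_k(t))."""
    return heaviside(kappa - z * sum(uk * xk for uk, xk in zip(u, x)))


def one_step(u, x, z, kappa, eta):
    """Apply one update U' = U + eta * mu * z * x; return (mu, U', mu')."""
    mu = error_signal(u, x, z, kappa)
    u_new = [uk + eta * mu * z * xk for uk, xk in zip(u, x)]
    return mu, u_new, error_signal(u_new, x, z, kappa)


# Toy example: N = 4, current margin z * (u . x) = -0.75, kappa = 1.
u = [0.5, -1.0, 0.25, 0.0]
x = [1, 1, -1, 1]
z = 1
kappa = 1.0

# eta * N must reach kappa - margin = 1.75, i.e. eta >= 0.4375 here.
for eta in (0.1, 0.5):
    mu, _, mu_after = one_step(u, x, z, kappa, eta)
    print(f"eta={eta}: mu before={mu}, mu after={mu_after}")
```

With the smaller step the error signal stays on after the update, while the larger step clears it in a single update, matching the "sufficiently large $\eta$" condition.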

Appendix D Proof of Theorem 4

If $\nu_{j}(t)=0$, then we have

\begin{equation}
x_{j}(t+1)\sum_{k=1}^{M}V_{jk}y_{k}(t)\geq\kappa. \tag{56}
\end{equation}

Next,

\begin{align}
x_{j}(t+1)\sum_{k=1}^{M}V_{jk}\hat{y}_{k}(t) &= x_{j}(t+1)\sum_{k=1}^{M}V_{jk}\big(y_{k}(t)+\hat{y}_{k}(t)-y_{k}(t)\big) \tag{57} \\
&= x_{j}(t+1)\sum_{k=1}^{M}V_{jk}y_{k}(t)+x_{j}(t+1)\sum_{k=1}^{M}V_{jk}\big(\hat{y}_{k}(t)-y_{k}(t)\big) \tag{58} \\
&\geq \kappa+x_{j}(t+1)\sum_{k=1}^{M}V_{jk}\big(\hat{y}_{k}(t)-y_{k}(t)\big) \tag{59} \\
&\geq \kappa-\Big|\sum_{k=1}^{M}V_{jk}\big(\hat{y}_{k}(t)-y_{k}(t)\big)\Big| \tag{60} \\
&\geq \kappa-\max_{k}|V_{jk}|\sum_{k=1}^{M}|\hat{y}_{k}(t)-y_{k}(t)| \tag{61} \\
&> \kappa-\max_{k}|V_{jk}|\cdot\epsilon>0 \tag{62}
\end{align}

since $x_{j}(t+1)=\pm 1$, which implies

\begin{equation}
x_{j}(t+1)=\text{sign}\Big(\sum_{k=1}^{M}V_{jk}\hat{y}_{k}(t)\Big). \tag{63}
\end{equation}
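The chain (57)-(62) can be illustrated on a small hand-picked example. The sketch below is only an illustration: the row $V_j$, the clean hidden state $y(t)$, and the function names are hypothetical, and the budget is chosen so that $\max_k|V_{jk}|\cdot\epsilon=\kappa$. It applies the worst-case perturbation, which spends the whole $L_1$ budget on the largest $|V_{jk}|$ and pushes against $x_j(t+1)$.

```python
def sign(a: float) -> int:
    """sign(a), with sign(0) taken as +1 (assumed convention)."""
    return 1 if a >= 0 else -1


def retrieved_sign(v, y_hat):
    """sign(sum_k V_jk * y_hat_k), the right-hand side of (63)."""
    return sign(sum(vk * yk for vk, yk in zip(v, y_hat)))


def worst_case_perturbation(v, y, x_next, budget):
    """Spend the whole L1 budget on the largest |V_jk|, pushing against x_next."""
    k = max(range(len(v)), key=lambda idx: abs(v[idx]))
    y_hat = list(y)
    y_hat[k] -= x_next * sign(v[k]) * budget
    return y_hat


v = [0.8, -0.5, 0.3, -0.2]   # hypothetical row V_j
y = [1, -1, 1, 1]            # clean hidden state y(t)
field = sum(vk * yk for vk, yk in zip(v, y))
x_next = sign(field)         # target x_j(t+1)
kappa = x_next * field       # margin attained by this pattern, as in (56)
eps = kappa / max(abs(vk) for vk in v)  # so that max|V_jk| * eps = kappa

for scale in (0.9, 1.1):
    y_hat = worst_case_perturbation(v, y, x_next, scale * eps)
    ok = retrieved_sign(v, y_hat) == x_next
    print(f"L1 perturbation = {scale} * eps -> retrieval correct: {ok}")
```

In this example a perturbation inside the budget leaves the retrieved sign unchanged, while one slightly beyond it flips the sign, so the bound cannot be loosened for this row.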

Appendix E Numerical Results for Figure 5 (a) and (b)

In the main paper, we only showed bar charts (Figure 5 (a) and (b)) of the results in Section 6.2. Here, for more information, we provide the numerical results for Figure 5 (a) in Table 1 and Figure 5 (b) in Table 2.

| $T$ | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | 110 | 120 | 130 | 140 | 150 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Learning only $\mathbf{V}$ | 100 | 100 | 100 | 91 | 66 | 19 | 8 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Learning $\mathbf{U}$ and $\mathbf{V}$ | 100 | 100 | 100 | 100 | 100 | 99 | 88 | 52 | 20 | 1 | 0 | 0 | 0 | 0 | 0 |

Table 1: Successful retrievals out of 100 trials with different sequence period lengths $T$.

| $M$ | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Learning only $\mathbf{V}$ | 0 | 0 | 1 | 3 | 6 | 16 | 14 | 27 | 30 | 37 |
| Learning $\mathbf{U}$ and $\mathbf{V}$ | 10 | 52 | 85 | 90 | 94 | 88 | 95 | 96 | 96 | 97 |

Table 2: Successful retrievals out of 100 trials with different numbers of hidden neurons $M$.

Appendix F Ablation Experiments: Joint Learning of 𝐔𝐔\mathbf{U}bold_U and 𝐕𝐕\mathbf{V}bold_V

To verify the effectiveness of the learning algorithm proposed in Section 5, we present additional experimental results comparing three methods by which networks with hidden neurons learn the sequences in Section 5.3.

  1. Fixing $\mathbf{U}$ and learning $\mathbf{V}$ by the temporal asymmetric Hebbian algorithm

    $$V_{ji}=\sum_{t}x_{j}(t+1)\,y_{i}(t),$$

    where

    $$y_{i}(t)=\text{sign}\Big(\sum_{k=1}^{N}U_{ik}x_{k}(t)\Big).$$

  2. Fixing $\mathbf{U}$ and learning $\mathbf{V}$ with the three-factor rule (7)(8)(9) in Section 5.

  3. Learning both $\mathbf{U}$ and $\mathbf{V}$ with the three-factor rule (4)(5)(6)(7)(8)(9) in Section 5.
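
As an illustration of the first baseline only, the following is a minimal NumPy sketch of the temporal asymmetric Hebbian rule with a fixed random $\mathbf{U}$. The function names, the network sizes, the random seed, the periodic wrap-around of the sequence, the tie-breaking of sign at zero, and the retrieval update $\mathbf{x}(t+1)=\text{sign}(\mathbf{V}\,\text{sign}(\mathbf{U}\mathbf{x}(t)))$ are assumptions made for this sketch. They are not the settings used in the experiments.

```python
import numpy as np

def sign(a):
    # Map to {-1, +1}; sending ties at zero to +1 is an assumption of this sketch.
    return np.where(a >= 0, 1, -1)

def hebbian_V(X, U):
    """Temporal asymmetric Hebbian weights.

    X: (T, N) array of +/-1 patterns, treated here as a periodic sequence.
    U: (M, N) fixed projection matrix.
    Returns V with shape (N, M).
    """
    Y = sign(X @ U.T)                # y(t) = sign(U x(t)), shape (T, M)
    X_next = np.roll(X, -1, axis=0)  # x(t+1); the last pattern wraps to the first
    return X_next.T @ Y              # V_ji = sum_t x_j(t+1) y_i(t)

def retrieve(x0, U, V, steps):
    """Iterate x -> sign(V sign(U x)) from x0 and return all visited states."""
    states = [x0]
    for _ in range(steps):
        states.append(sign(V @ sign(U @ states[-1])))
    return np.array(states)

rng = np.random.default_rng(0)
N, M, T = 100, 500, 10               # hypothetical sizes, chosen for illustration
X = rng.choice([-1, 1], size=(T, N))
U = rng.standard_normal((M, N))
V = hebbian_V(X, U)
traj = retrieve(X[0], U, V, steps=T)
match = np.mean(traj[1:] == np.roll(X, -1, axis=0))
print(f"fraction of correctly retrieved bits: {match:.3f}")
```

Methods 2 and 3 replace the one-shot sum in `hebbian_V` with the iterative three-factor updates of Section 5, which are not reproduced here.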

The experimental settings are the same as in Section 6.3. The results are shown in Figures 1-5, from which we can see that the algorithm proposed in Section 5 is indeed effective.

Appendix G Ablation Experiments: Sparsity

We provide a further ablation study of our algorithm, examining the effect of sparsity under the experimental settings of Section 6.2 in the main paper.

Figure 14: Sparse Random Projected Inputs

We compare our method (learning both $\mathbf{U}$ and $\mathbf{V}$ with the three-factor rule) with using a fixed random $\mathbf{U}$, whose elements are sampled i.i.d. from the standard Gaussian distribution, and learning only $\mathbf{V}$ with the three-factor rule. The sparse random projected inputs are defined as

$$y_{i}(t)=\text{sign}\Big(\sum_{k=1}^{N}U_{ik}x_{k}(t)-\theta\Big)$$

where $\theta>0$ controls the sparsity level.
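
To make the role of $\theta$ concrete, here is a small NumPy sketch under stated assumptions. For a fixed $\pm 1$ input and i.i.d. standard Gaussian $U_{ik}$, each pre-activation $\sum_k U_{ik}x_k(t)$ is distributed as $\mathcal{N}(0,N)$, so the expected fraction of units with $y_i(t)=+1$ is $1-\Phi(\theta/\sqrt{N})$, where $\Phi$ is the standard normal CDF. The function name, the sizes, the seed, and the $\theta$ values below are hypothetical. $\theta=0$ is included only as a dense baseline.

```python
import math
import numpy as np

def sparse_projection(x, U, theta):
    """y_i = sign(sum_k U_ik x_k - theta), with values in {-1, +1}."""
    return np.where(U @ x - theta > 0, 1, -1)

rng = np.random.default_rng(0)
N, M = 100, 20000                    # hypothetical sizes, chosen for illustration
x = rng.choice([-1, 1], size=N)
U = rng.standard_normal((M, N))      # i.i.d. standard Gaussian entries

for theta in (0.0, 5.0, 10.0, 20.0):
    y = sparse_projection(x, U, theta)
    empirical = np.mean(y == 1)
    # Pre-activations are N(0, N), so P(y_i = +1) = 1 - Phi(theta / sqrt(N)).
    predicted = 0.5 * math.erfc(theta / math.sqrt(2 * N))
    print(f"theta={theta:5.1f}  active={empirical:.3f}  predicted={predicted:.3f}")
```

Larger $\theta$ leaves fewer units at $+1$, which is the sense in which it sets the sparsity level.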

Figure 15: Sparse Random Projected Targets

We test different levels of sparsity in the random projected targets defined as

$$z_{i}(t+1)=\text{sign}\Big(\sum_{k=1}^{N}P_{ik}x_{k}(t+1)-\theta\Big)$$

where $\theta>0$ controls the sparsity level.

From both sets of experiments, we do not find that sparsity significantly enlarges the capacity of the networks for learning sequences as attractors.

Figure 9 (panels (a)-(d)): OU-ISIR gait sequence for $t=1,\dots,10$.
Figure 10 (panels (a)-(d)): Moving MNIST sequence 1 for $t=1,\dots,8$.
Figure 11 (panels (a)-(d)): Moving MNIST sequence 6 for $t=1,\dots,8$.
Figure 12 (panels (a)-(d)): Moving MNIST sequence 11 for $t=1,\dots,8$.
Figure 13 (panels (a)-(d)): Moving MNIST sequence 16 for $t=1,\dots,8$.
Figure 14: Successful retrievals out of 100 trials with different sequence period lengths (sparse random projected inputs).
Figure 15: Successful retrievals out of 100 trials with different sequence period lengths (sparse random projected targets).