Biology-inspired joint distribution neurons
based on Hierarchical Correlation Reconstruction
allowing for multidirectional neural networks

Jarek Duda
Jagiellonian University, Golebia 24, 31-007 Krakow, Poland, Email: [email protected]
Abstract

Biological neural networks seem qualitatively superior (e.g. in learning, flexibility, robustness) to current artificial ones such as Multi-Layer Perceptron (MLP) or Kolmogorov-Arnold Network (KAN). In contrast to these, they have fundamentally multidirectional signal propagation [1], also of probability distributions, e.g. for uncertainty estimation, and are believed to be unable to use standard backpropagation training [2]. We propose novel artificial neurons based on HCR (Hierarchical Correlation Reconstruction) that remove the above low-level differences: each neuron contains a local joint distribution model (of its connections), representing the joint density on normalized variables as just a linear combination of orthonormal polynomials $(f_{\mathbf{j}})$: $\rho(\mathbf{x})=\sum_{\mathbf{j}\in B}a_{\mathbf{j}}f_{\mathbf{j}}(\mathbf{x})$ for $\mathbf{x}\in[0,1]^{d}$ and $B$ some chosen basis, with basis growth approaching a complete description of the joint distribution. By various index summations of such an $(a_{\mathbf{j}})$ tensor as neuron parameters, we get simple formulas for e.g. conditional expected values for propagation in any direction, like $E[x|y,z]$, $E[y|x]$, which degenerate to a KAN-like parametrization when restricted to pairwise dependencies.
Such an HCR network can also propagate probability distributions (also joint) like $\rho(y,z|x)$. It also allows for additional training approaches, like direct $(a_{\mathbf{j}})$ estimation, through tensor decomposition, or the more biologically plausible information bottleneck training: layers directly influencing only their neighbors, optimizing content to maximize information about the next layer, and to minimize information about the previous one in order to reduce noise.

Keywords: machine learning, neural networks, Kolmogorov-Arnold Network, joint distribution, conditional distribution, Bayesian Neural Networks, tensor decomposition, mutual information, information bottleneck approach

I Introduction

Figure 1: The proposed HCR neuron and neural network (HCRN, HCRNN) containing a local joint distribution model represented by the $(a_{\mathbf{j}})$ tensor, e.g. $(a_{ijk})$ for $d=3$ connections. Top: orthonormal polynomial basis assuming normalization to uniform distribution in $[0,1]$. Middle: HCR neuron containing and applying a joint distribution model for $d=3$ variables, with gathered formulas for direct estimation/model update and for its application to propagate entire distributions or expected values alone. Such a density parametrization can drop below 0, which is usually repaired by calibration, e.g. using the normalized density $\max(\rho,0.1)$; however, for neural networks with inter-layer normalization this issue seems negligible, which essentially simplifies calculations to the shown formulas. Propagating only expected values and normalizing, we can use only the marked numerators - as in KAN, optimizing nonlinear functions (polynomials here) by including only pairwise dependencies ($a$ with two nonzero indexes), extending to their products to consciously include higher order dependencies. Bottom: schematic HCR neural network and some training approaches for intermediate layers - which in HCR can be treated as values or their distributions (replacing $f_{i}(u)$ with its $i$-th moment: $\int_{0}^{1}\rho(u)f_{i}(u)du$).
The tensor decomposition approach is also visualized: estimate dependencies (e.g. pairwise) for multiple variables and try to automatically decompose them into multiple dependencies of smaller numbers of variables with algebraic methods.

Biological neurons use complex propagation of action potentials, travelling in both directions of e.g. axons: "it is not uncommon for axonal propagation of action potentials to happen in both directions" [1]. They have access to information from connected neurons, whose complete statistical description is their joint distribution - beside value dependence, also describing dependencies of e.g. variances (so-called heteroscedasticity, common in financial data) and of other moments (into which we will decompose it with the HCR approach, as in Fig. 1). Biological neurons should be evolutionarily optimized to include such additional information whenever it is beneficial, and are often required to work on distributions, e.g. to estimate uncertainties.

In contrast, arbitrarily chosen popular parametrization types like Multi-Layer Perceptron (MLP) [3], a trained linear combination followed by a fixed nonlinear activation function, or Kolmogorov-Arnold Network (KAN) [4], which additionally trains the activation functions, are optimized for propagation in one direction and work only on values, not distributions. Additionally, they are mainly trained by backpropagation, which is rather inaccessible for biological neurons [2]. These differences leave them far from the learning capabilities, flexibility and robustness of biological neural networks - summarized in Fig. 2. To reduce this difference, we should search for more powerful artificial neurons, like the HCRNN proposed in this article.

To reach both multidirectional propagation and working with distributions, e.g. for uncertainty estimation, one could use Bayes' theorem, e.g. in Bayesian Neural Networks [5] - which in practice use relatively simple models. To include a more detailed description of complex dependencies, also of continuous variables, a neuron could model the entire joint distribution of its connections, then substitute and normalize to get conditional distributions without Bayes' theorem. However, joint distributions of continuous variables become quite complicated, difficult to describe and handle. Classical approaches are copulas [6], which are low-parametric, or kernel density estimation (KDE) [7], which is impractical in higher dimensions.

Figure 2: Summary of differences between (unidirectional) artificial (ANN) and (fundamentally multidirectional) biological neural networks (BNN, from https://www.geeksforgeeks.org/difference-between-ann-and-bnn/) - BNNs are qualitatively superior in terms of learning, flexibility, robustness - just increasing the number of neurons might be insufficient to reach it. To get ANNs closer to capabilities of BNNs, we should include their low level differences, summarized in Fig. 3.

Hierarchical Correlation Reconstruction (HCR) [8, 9, 10, 11, 12, 13] (HCR introduction: https://community.wolfram.com/groups/-/m/t/3017754), used for such artificial neurons as in Fig. 1, allows us to overcome these difficulties by representing the joint distribution of $d$ normalized variables as just a linear combination $\rho(\mathbf{x})=\sum_{\mathbf{j}}a_{\mathbf{j}}\prod_{i=1}^{d}f_{j_{i}}(x_{i})$, where bold font denotes vectors. Using orthonormal polynomials: $\int_{0}^{1}f_{i}(x)f_{j}(x)dx=\delta_{ij}$, the $(a_{\mathbf{j}})$ tensor of coefficients is inexpensive to estimate and update, literally providing hierarchical correlation decomposition with (mixed) moments as in Fig. 4. While generally such a density as a linear combination can get below 0, which is usually repaired by calibration, for neural networks with normalization between layers this issue could just be neglected, essentially reducing computational costs.

Having such an $(a_{\mathbf{j}})$ tensor as neuron parameters, we can for example propagate values or probabilities in various directions by various index summations, as in Fig. 1. Restricting to pairwise dependencies by using only $(a_{\mathbf{j}})$ with two nonzero indexes, such propagation formulas become sums of trained polynomials, as in the KAN parametrization - but additionally allowing e.g. multidirectional propagation, also of probability distributions.

Additionally, HCRNN allows for many further ways of learning, such as estimation of joint distributions, which can be linearized into products/decomposition of tensors. For biological plausibility, information bottleneck training seems very promising, as in Fig. 3 - by directly optimizing the content of intermediate layers: to increase mutual information with the next layer, and decrease it with the previous one to remove unnecessary information like noise. It is related to the HSIC [14, 15] approach, replacing the kernel-width-dependent local basis with a global polynomial basis for normalized variables.

This article introduces HCR from the perspective of neural network applications, earlier suggested in [8]. This is an early version of the article - to be extended in the future e.g. with practical realizations and examples, e.g. replacing MLP, KAN.

Figure 3: MLP, KAN are arbitrarily chosen parametrizations, trained for unidirectional value propagation with backpropagation - believed to be inaccessible for biological NN [2], which are qualitatively superior e.g. in learning, flexibility, robustness as summarized in Fig. 2. While our understanding of such high level superiority is far from complete, it should come from low level differences, like fundamentally multidirectional propagation, also of probability distributions, and e.g. information bottleneck training (here similar to HSIC [14, 15]) - all included in the proposed HCRNN, which can also be degenerated to KAN-like.
Figure 4: Visualized HCR polynomial $[0,1]$ basis in $d=1$ dimension and product bases for $d=2,3$. For $d=3$ we assume $\rho(x,y,z)=\sum_{ijk}a_{ijk}f_{i}(x)f_{j}(y)f_{k}(z)$. As $f_{0}=1$, a zero index in $a_{ijk}$ means independence from the given variable, hence $a_{000}=1$ corresponds to normalization, and $a_{i00},a_{0i0},a_{00i}$ for $i\geq 1$ describe marginal distributions through $i$-th moments.
Then $a_{ij0},a_{i0j},a_{0ij}$ for $i,j\geq 1$ describe pairwise joint distributions through mixed moments, and finally $a_{ijk}$ for $i,j,k\geq 1$ describe triplewise dependencies - we literally get hierarchical correlation reconstruction by moments relating distributions of increasing numbers of variables, with clear interpretation of the coefficients of e.g. a trained HCR-based neural network.

II HCR neural networks (HCRNN)

This main Section introduces HCR and discusses it as a basic building block for neural networks.

II-A Introduction to Hierarchical Correlation Reconstruction

As in copula theory [6], it is convenient to normalize variables to a nearly uniform distribution in $[0,1]$. This requires transformation through the cumulative distribution function (CDF): $x\leftrightarrow\textrm{CDF}(x)$, so that $x$ becomes its estimated quantile. This CDF can be modeled with some parametric distribution using parameters estimated from the dataset, or be the empirical distribution function: $x$ becomes its position in the dataset rescaled to $(0,1)$. For neural networks, normalization is popularly performed in batches [16]; here it needs to be used between layers (but can be skipped in further linearization). In practice it should be nearly constant between layers, and can be approximated, parameterized e.g. by a Gaussian, put into tables, or inverted for backward propagation.
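The empirical distribution function variant of this normalization can be sketched as below. This is only a minimal illustration: the function name `normalize_to_uniform` and the +0.5 rank offset (used to keep values strictly inside $(0,1)$) are assumptions of this sketch, not taken from the article.

```python
import numpy as np

def normalize_to_uniform(values):
    """Empirical-CDF normalization: each value becomes its rank rescaled to (0, 1)."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values))   # rank 0..n-1 of every entry
    # Assumed convention: the +0.5 offset keeps results strictly inside (0, 1).
    return (ranks + 0.5) / len(values)

sample = np.array([10.0, -3.0, 7.0, 100.0])
u = normalize_to_uniform(sample)             # smallest value maps closest to 0
```

A parametric alternative would replace the ranks with the CDF of a fitted distribution, which is also easier to invert for backward propagation.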

For $d$ normalized variables: $\mathbf{x}\in[0,1]^{d}$, in HCR we represent the joint distribution as a linear combination, conveniently in some product basis $B=B^{+}\cup\{\mathbf{0}\}$ with $f_{0}(x)=f_{\mathbf{0}}(\mathbf{x})=1$:

$$\rho(\mathbf{x})=1+\sum_{\mathbf{j}\in B^{+}}a_{\mathbf{j}}f_{\mathbf{j}}(\mathbf{x})=\sum_{\mathbf{j}\in B}a_{\mathbf{j}}f_{\mathbf{j}}(\mathbf{x})=\sum_{\mathbf{j}\in B}a_{\mathbf{j}}\prod_{i=1}^{d}f_{j_{i}}(x_{i}) \tag{1}$$

where $B^{+}=B\backslash\{\mathbf{0}\}$ removes the zero index corresponding to normalization, as $f_{0}=1$, and bold fonts denote vectors: $\mathbf{j}=(j_{1},..,j_{d})$.

Assuming an orthonormal basis: $\int_{0}^{1}f_{i}(x)f_{j}(x)dx=\delta_{ij}$, static estimation [17] (minimizing the mean-squared error between the kernel density estimation smoothed sample and the parametrization) from a dataset $\bar{X}$ becomes just:

$$a_{\mathbf{j}}=\frac{1}{|\bar{X}|}\sum_{\mathbf{x}\in\bar{X}}f_{\mathbf{j}}(\mathbf{x})=\frac{1}{|\bar{X}|}\sum_{\mathbf{x}\in\bar{X}}\prod_{i=1}^{d}f_{j_{i}}(x_{i}) \tag{2}$$

We assume here an orthonormal polynomial basis (rescaled Legendre), allowing us to interpret coefficients as moments of normalized variables, becoming approximately expected value, variance, skewness, kurtosis. Independent estimation of each $a_{\mathbf{j}}$ allows the basis to be freely modified - e.g. including information from an additional mixed basis. Estimation as just an average allows the uncertainty of the found parameters to be controlled.
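As an illustration of the rescaled Legendre basis and of the averaging estimator (2), here is a minimal NumPy sketch for $d=2$. The helper names `basis` and `estimate_coefficients`, the degree cutoff `m`, and the toy data are assumptions made for this example only.

```python
import numpy as np
from numpy.polynomial import legendre

def basis(x, m):
    """Rescaled Legendre f_0..f_m, orthonormal on [0,1]; returns shape (len(x), m+1)."""
    t = 2.0 * np.atleast_1d(np.asarray(x, dtype=float)) - 1.0
    return legendre.legvander(t, m) * np.sqrt(2.0 * np.arange(m + 1) + 1.0)

def estimate_coefficients(X, m):
    """Eq. (2) for d=2: a[j1, j2] = mean over samples of f_j1(x1) * f_j2(x2)."""
    F1, F2 = basis(X[:, 0], m), basis(X[:, 1], m)
    return np.einsum('ni,nj->ij', F1, F2) / len(X)

# Orthonormality check with Gauss-Legendre quadrature mapped from [-1,1] to [0,1].
nodes, weights = legendre.leggauss(20)
xq, wq = (nodes + 1.0) / 2.0, weights / 2.0
gram = basis(xq, 4).T @ (wq[:, None] * basis(xq, 4))   # should be the identity

# Hypothetical toy data: two positively dependent variables, rank-normalized to (0,1).
rng = np.random.default_rng(0)
z = rng.normal(size=(5000, 2))
z[:, 1] += z[:, 0]
X = (np.argsort(np.argsort(z, axis=0), axis=0) + 0.5) / len(z)

a = estimate_coefficients(X, m=4)
# a[0,0] is the normalization; a[1,0], a[0,1] vanish after normalization;
# a[1,1] is the first mixed moment and is positive for this positively dependent pair.
```

For larger $d$ the same average would be taken over products of $d$ basis values, restricted to the chosen multi-indices in $B$.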

Alternatively, we could use various trigonometric bases (e.g. discrete cosine/sine transform), localized ones like the B-splines used by KAN, or bases from wavelets or finite element methods. Also, instead of normalization to uniform in $[0,1]$, we could use a different normalization e.g. to Gaussian distribution, times Hermite polynomials for an orthonormal basis. For discrete variables we can use one-hot encoding, or optimize it as in [13].

As in Fig. 4, since $f_{0}=1$, the $a_{\mathbf{j}}$ coefficients are mixed moments of the variables with nonzero indexes, $\{i:j_{i}\geq 1\}$, independent from the variables with zero indexes, allowing for a literally hierarchical decomposition of statistical dependencies: start with $a_{0..0}=1$ for normalization, add single nonzero index coefficients to describe marginal distributions, then add pairwise dependencies with two nonzero indexes, then triplewise, and so on. For example, the $a_{2010}$ coefficient would describe dependence between the 2nd moment of the first variable and the 1st moment of the 3rd variable among $d=4$ variables. Generally the selection of the basis $B$ is a difficult question, e.g. to use only pairwise dependencies up to a fixed moment $m$, preferably optimized during training, maybe separately for each neuron or layer. Such decomposition also allows us to work efficiently with missing data, by using for estimation/update/propagation only the $a$ coefficients with zero indexes for the missing variables, as a zero index means independence from the given variable.
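One simple way to enumerate such a hierarchical basis is sketched below: all multi-indices with at most `max_order` nonzero entries, each up to moment `m`. The function `select_basis` and this particular selection rule are illustrative assumptions; the article leaves the choice of $B$ open.

```python
from itertools import combinations, product

def select_basis(d, m, max_order):
    """Multi-indices j in {0..m}^d with at most max_order nonzero entries."""
    B = [(0,) * d]                                   # a_{0..0} = 1: normalization
    for order in range(1, max_order + 1):            # 1: marginals, 2: pairwise, ...
        for variables in combinations(range(d), order):
            for degrees in product(range(1, m + 1), repeat=order):
                j = [0] * d
                for v, deg in zip(variables, degrees):
                    j[v] = deg
                B.append(tuple(j))
    return B

B = select_basis(d=4, m=3, max_order=2)              # pairwise-only basis

# Missing data: if the second variable (position 1) is unknown, keep only the
# coefficients whose index at that position is zero.
missing = {1}
usable = [j for j in B if all(j[i] == 0 for i in missing)]
```

The index `(2, 0, 1, 0)` from the example in the text belongs to this pairwise basis, while any index with three nonzero entries does not.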

While static estimation averages over the dataset with equal weights, for dynamic updating we should increase the weights of recent values, e.g. using the computationally convenient exponential moving average: for some small memory parameter $\lambda$

$$a_{\mathbf{j}}\xrightarrow{\mathbf{x}}(1-\lambda)a_{\mathbf{j}}+\lambda\prod_{i}f_{j_{i}}(x_{i}) \tag{3}$$
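The exponential moving average update (3) can be sketched for $d=2$ as follows. The names `basis` and `ema_update`, the value of `lam`, and the toy data stream are assumptions of this sketch.

```python
import numpy as np
from numpy.polynomial import legendre

def basis(x, m):
    """Rescaled Legendre f_0..f_m, orthonormal on [0,1]; returns shape (len(x), m+1)."""
    t = 2.0 * np.atleast_1d(np.asarray(x, dtype=float)) - 1.0
    return legendre.legvander(t, m) * np.sqrt(2.0 * np.arange(m + 1) + 1.0)

def ema_update(a, x, lam):
    """Eq. (3) for d=2 and one sample x = (x1, x2)."""
    m = a.shape[0] - 1
    F = basis(x, m)                          # row i holds f_0..f_m evaluated at x_i
    return (1.0 - lam) * a + lam * np.outer(F[0], F[1])

m = 3
a = np.zeros((m + 1, m + 1))
a[0, 0] = 1.0                                # start from the uniform density
rng = np.random.default_rng(1)
for _ in range(2000):
    u = rng.uniform()
    a = ema_update(a, (u, u), lam=0.01)      # hypothetical stream with x2 = x1
# a[0,0] stays 1, while a[1,1] drifts towards the first mixed moment of the stream.
```

A smaller `lam` gives a longer memory and less noisy coefficients, at the cost of slower adaptation.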
Figure 5: Basic formulas and example for the $d=2$ dimensional case. The neuron contains the $a_{ij}$ matrix (generally an order $d$ tensor of mainly zeros), allowing for propagation in various directions e.g. as expected values by various index summations - if there is further normalization, we can use just the marked polynomial (summed for multiple variables like in KAN, e.g. (6)). We can see that a change of propagation direction here needs just transposition of the $a_{ij}$ matrix (then normalization); also for larger tensors it is a matter of index permutation and finally normalizing.

However, modelling a (joint) probability density as a linear combination can sometimes lead to negative densities - to avoid this issue, calibration is usually used: instead of the modelled density $\rho$, use e.g. $\max(\rho,0.1)$ and divide by its integral to remain a normalized density. However, this makes computations much more complex, especially in higher dimensions - for neural network applications we should be able to ignore this issue to simplify calculations, especially when working on expected values and normalizing between layers. Therefore, we ignore this issue/calibration in this article; however, it should be remembered, maybe adding calibration for some applications.
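For reference, the calibration step can be sketched in one dimension as below, using a midpoint-rule grid for the integral. The function `calibrated_density`, the grid size, and the example coefficients are assumptions of this sketch; in higher dimensions the grid becomes the expensive part mentioned above.

```python
import numpy as np
from numpy.polynomial import legendre

def basis(x, m):
    """Rescaled Legendre f_0..f_m, orthonormal on [0,1]; returns shape (len(x), m+1)."""
    t = 2.0 * np.atleast_1d(np.asarray(x, dtype=float)) - 1.0
    return legendre.legvander(t, m) * np.sqrt(2.0 * np.arange(m + 1) + 1.0)

def calibrated_density(a, grid_size=1000, floor=0.1):
    """Return grid, raw rho, and max(rho, floor) renormalized to integrate to 1."""
    x = (np.arange(grid_size) + 0.5) / grid_size     # midpoints of [0,1]
    rho = basis(x, len(a) - 1) @ a
    rho_cal = np.maximum(rho, floor)
    rho_cal = rho_cal / rho_cal.mean()               # midpoint rule: integral = mean
    return x, rho, rho_cal

# rho(x) = 1 + 0.9*sqrt(3)*(2x - 1), which is negative near x = 0
a = np.array([1.0, 0.9, 0.0, 0.0])
x, rho, rho_cal = calibrated_density(a)
```

Note that after renormalization the effective floor is slightly different from 0.1, since the whole density is rescaled.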

II-B Conditional distributions and expected value propagation

Having the model (1) of the joint distribution, to get a conditional distribution we need to substitute the known variables and normalize, dividing by the integral:

$$\rho(x_{1}|x_{2},\ldots,x_{d})=\frac{\sum_{\mathbf{j}}a_{\mathbf{j}}f_{j_{1}}(x_{1})f_{j_{2}}(x_{2})\ldots f_{j_{d}}(x_{d})}{\int_{0}^{1}\sum_{\mathbf{j}}a_{\mathbf{j}}f_{j_{1}}(x_{1})f_{j_{2}}(x_{2})\ldots f_{j_{d}}(x_{d})dx_{1}}=\sum_{j_{1}}f_{j_{1}}(x_{1})\frac{\sum_{j_{2}\ldots j_{d}}a_{j_{1}j_{2}..j_{d}}\,f_{j_{2}}(x_{2})\ldots f_{j_{d}}(x_{d})}{\sum_{j_{2}\ldots j_{d}}a_{0j_{2}..j_{d}}\,f_{j_{2}}(x_{2})\ldots f_{j_{d}}(x_{d})} \tag{4}$$

as $\int_{0}^{1}f_{i}(x)dx=\delta_{i0}$. Such sums for pairwise dependencies use only two nonzero $j_{i}$ indexes (input-output), three for triplewise, and so on. The denominator corresponds to normalization; indeed the fraction becomes 1 for $j_{1}=0$. Examples for $d=2,3$ are shown in Figs. 5 and 1 - generally the numerator sums over all indexes with the current indexes of the predicted variables. The denominator replaces the indexes of the predicted variables with zeros for normalization, and could be removed if there is further (inter-layer) normalization.
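Formula (4) amounts to a single tensor contraction followed by a division. A minimal sketch for $d=3$ is given below; the helper names and the random test tensor are assumptions of this example, and the tensor is assumed to have the same degree cutoff in every dimension.

```python
import numpy as np
from numpy.polynomial import legendre

def basis(x, m):
    """Rescaled Legendre f_0..f_m, orthonormal on [0,1]; returns shape (len(x), m+1)."""
    t = 2.0 * np.atleast_1d(np.asarray(x, dtype=float)) - 1.0
    return legendre.legvander(t, m) * np.sqrt(2.0 * np.arange(m + 1) + 1.0)

def conditional_coeffs(a, y, z):
    """Coefficients c[j1] with rho(x|y,z) = sum_j1 c[j1] f_j1(x), as in eq. (4)."""
    m = a.shape[0] - 1
    fy, fz = basis(y, m)[0], basis(z, m)[0]
    numer = np.einsum('ijk,j,k->i', a, fy, fz)   # sum over j2, j3 for every j1
    return numer / numer[0]                      # numer[0] is the denominator (j1 = 0)

def conditional_density(a, x, y, z):
    c = conditional_coeffs(a, y, z)
    return basis(x, len(c) - 1) @ c

# Hypothetical small tensor: uniform density plus weak random dependencies.
rng = np.random.default_rng(2)
a = rng.normal(scale=0.05, size=(3, 3, 3))
a[0, 0, 0] = 1.0
c = conditional_coeffs(a, y=0.3, z=0.8)

# The conditional density integrates to 1 over x (Gauss-Legendre quadrature on [0,1]).
nodes, weights = legendre.leggauss(10)
xq, wq = (nodes + 1.0) / 2.0, weights / 2.0
total = float(wq @ conditional_density(a, xq, 0.3, 0.8))

# Changing the propagation direction is an index permutation: rho(y | x=0.3, z=0.8).
c_y = conditional_coeffs(np.transpose(a, (1, 0, 2)), 0.3, 0.8)
```

Dropping the final division gives the unnormalized numerators, which is enough when a normalization step follows anyway.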

Here is an example of analogous prediction of conditional joint distributions for multiple (2) variables:

$$\rho(y,z|x)=\sum_{j_{y}j_{z}}f_{j_{y}}(y)f_{j_{z}}(z)\frac{\sum_{j_{x}}a_{j_{x}j_{y}j_{z}}\,f_{j_{x}}(x)}{\sum_{j_{x}}a_{j_{x}00}\,f_{j_{x}}(x)}$$

However, working on expected values would remove $y-z$ mixed moments, making $E[y|x],E[z|x]$ independent.
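The loss of the $y-z$ mixed moments can be seen directly in the coefficients of $\rho(y,z|x)$. In the sketch below (helper names and the random tensor are assumptions), the expected values only read the entries `c[1, 0]` and `c[0, 1]`, while the dependence between $y$ and $z$ sits in the block `c[1:, 1:]`.

```python
import numpy as np
from numpy.polynomial import legendre

def basis(x, m):
    """Rescaled Legendre f_0..f_m, orthonormal on [0,1]; returns shape (len(x), m+1)."""
    t = 2.0 * np.atleast_1d(np.asarray(x, dtype=float)) - 1.0
    return legendre.legvander(t, m) * np.sqrt(2.0 * np.arange(m + 1) + 1.0)

def joint_conditional_coeffs(a, x):
    """c[jy, jz] with rho(y,z|x) = sum_{jy,jz} c[jy,jz] f_jy(y) f_jz(z)."""
    fx = basis(x, a.shape[0] - 1)[0]
    numer = np.einsum('ijk,i->jk', a, fx)    # sum over j_x
    return numer / numer[0, 0]               # numer[0,0] = sum_jx a[jx,0,0] f_jx(x)

rng = np.random.default_rng(3)
a = rng.normal(scale=0.05, size=(3, 3, 3))   # hypothetical weakly dependent model
a[0, 0, 0] = 1.0
c = joint_conditional_coeffs(a, x=0.4)

# Expected values use only the first-moment marginal coefficients ...
E_y = 0.5 + c[1, 0] / (2.0 * np.sqrt(3.0))
E_z = 0.5 + c[0, 1] / (2.0 * np.sqrt(3.0))
# ... so the mixed block below never enters E_y or E_z.
mixed = c[1:, 1:]
```

Propagating the whole coefficient matrix `c` instead of the two expected values keeps this dependence available to later layers.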

Having such conditional distribution, we can for example calculate expected value e.g. to be propagated by neural networks. For polynomial basis expected values contributions are: 01xf0(x)𝑑x=1/2,01xf1(x)𝑑x=1/12formulae-sequencesuperscriptsubscript01𝑥subscript𝑓0𝑥differential-d𝑥12superscriptsubscript01𝑥subscript𝑓1𝑥differential-d𝑥112\int_{0}^{1}xf_{0}(x)dx=1/2,\int_{0}^{1}xf_{1}(x)dx=1/\sqrt{12}∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT italic_x italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_x ) italic_d italic_x = 1 / 2 , ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT italic_x italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_x ) italic_d italic_x = 1 / square-root start_ARG 12 end_ARG, and zero for higher, leading to formulas including only i=1𝑖1i=1italic_i = 1 the first moment as in Fig. 1, e.g.:

E[x|y,z]=12+123jka1jkfj(y)fk(z)jka0jkfj(y)fk(z)𝐸delimited-[]conditional𝑥𝑦𝑧12123subscript𝑗𝑘subscript𝑎1𝑗𝑘subscript𝑓𝑗𝑦subscript𝑓𝑘𝑧subscript𝑗𝑘subscript𝑎0𝑗𝑘subscript𝑓𝑗𝑦subscript𝑓𝑘𝑧E[x|y,z]=\frac{1}{2}+\frac{1}{2\sqrt{3}}\frac{\sum_{jk}a_{1jk}f_{j}(y)f_{k}(z)% }{\sum_{jk}a_{0jk}f_{j}(y)f_{k}(z)}italic_E [ italic_x | italic_y , italic_z ] = divide start_ARG 1 end_ARG start_ARG 2 end_ARG + divide start_ARG 1 end_ARG start_ARG 2 square-root start_ARG 3 end_ARG end_ARG divide start_ARG ∑ start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT 1 italic_j italic_k end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_y ) italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_z ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT 0 italic_j italic_k end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_y ) italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_z ) end_ARG (5)

Since a CDF normalization, which both shifts and rescales, is usually required afterwards, in practice it is sufficient to work with the numerators alone.
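As an illustration of formula (5), here is a minimal NumPy sketch. It assumes the orthonormal (shifted Legendre) polynomial basis up to degree 2 and a coefficient tensor stored as `a[i, j, k]`; the function names and the toy coefficient values are hypothetical and chosen only for this example.

```python
import numpy as np

def legendre_basis(x, m=2):
    """Orthonormal polynomials f_0..f_m on [0,1] (this sketch supports m <= 2)."""
    x = np.asarray(x, dtype=float)
    f = [np.ones_like(x),
         np.sqrt(3.0) * (2.0 * x - 1.0),
         np.sqrt(5.0) * (6.0 * x**2 - 6.0 * x + 1.0)]
    return np.stack(f[: m + 1], axis=-1)

def conditional_mean_x(a, y, z):
    """E[x | y, z] following Eq. (5) for a coefficient tensor a[i, j, k]."""
    fy = legendre_basis(y, a.shape[1] - 1)
    fz = legendre_basis(z, a.shape[2] - 1)
    numerator = np.einsum("jk,j,k->", a[1], fy, fz)    # first-moment terms a_1jk
    denominator = np.einsum("jk,j,k->", a[0], fy, fz)  # normalization terms a_0jk
    return 0.5 + numerator / (2.0 * np.sqrt(3.0) * denominator)

# toy tensor: uniform density plus a single x-y dependency term
a = np.zeros((3, 3, 3))
a[0, 0, 0] = 1.0   # normalization coefficient
a[1, 1, 0] = 0.3   # hypothetical x-y mixed moment
print(conditional_mean_x(a, y=0.5, z=0.2))  # 0.5, since f_1(0.5) = 0
```

With this toy tensor the prediction does not depend on $z$, and moving $y$ from 0.5 to 1 shifts the expected value of $x$ from 0.5 to 0.65.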

Restricting it to pairwise dependencies (a single input variable with a single output variable), similarly to KAN we get a sum of trained single-variable functions: here polynomials (other choices are possible, e.g. B-splines as in KAN), plus e.g. an approximate fixed CDF for normalization:

$$\sum_{jk}a_{1jk}f_{j}(y)f_{k}(z)\ \xrightarrow[\textrm{KAN-like}]{\textrm{pairwise only}}\ \sum_{j}a_{1j0}f_{j}(y)+a_{10j}f_{j}(z) \tag{6}$$

However, in comparison to KAN, using the proposed HCRNN parametrization we get multiple advantages:

  • it can propagate in any direction (as BNNs),

  • propagate values or probability distributions (as BNNs),

  • interpretation of parameters as mixed moments,

  • consciously add triplewise and higher dependencies,

  • inexpensive evaluation of modeled mutual information,

  • additional training ways (needed for BNNs), e.g. direct estimation, tensor decomposition, information bottleneck.

II-C Propagation of probability distribution

Let us start with a simple example: suppose we would like to calculate a conditional density as previously:

$$\rho(x|y)=\sum_{i}f_{i}(x)\,\frac{\sum_{j}a_{ij}f_{j}(y)}{\sum_{j}a_{0j}f_{j}(y)} \tag{7}$$

but with $y$ itself drawn from a probability density $\rho(y)=\sum_{k}b_{k}\,f_{k}(y)$. The propagated probability density of $x$ should then be $\int_0^1\rho(x|y)\rho(y)\,dy$. Approximating the denominator as constant, using $\int_0^1 f_{j}(y)f_{k}(y)\,dy=\delta_{jk}$ and finally normalizing, we get:

$$\rho(x)\leftarrow\int_0^1\rho(x|y)\rho(y)\,dy\approx\sum_{i}f_{i}(x)\,\frac{\sum_{j}a_{ij}b_{j}}{\sum_{j}a_{0j}b_{j}} \tag{8}$$

This constant denominator approximation allows HCR neurons to propagate (in any direction) not only values but also entire probability distributions, by just replacing $f_{j}(y)$ for a concrete value of $y$ with the coefficients $b_{j}$ describing its probability distribution. It generalizes easily: e.g. for $\rho(x|y,z)$ we can replace $f_{j}(y)f_{k}(z)$ with $b_{jk}$ when $\rho(y,z)=\sum_{jk}b_{jk}f_{j}(y)f_{k}(z)$:

$$\sum_{i}f_{i}(x)\,\frac{\sum_{jk}a_{ijk}f_{j}(y)f_{k}(z)}{\sum_{jk}a_{0jk}f_{j}(y)f_{k}(z)}\quad\underset{\textrm{density}}{\overset{\textrm{value}}{\leftrightarrows}}\quad\sum_{i}f_{i}(x)\,\frac{\sum_{jk}a_{ijk}b_{jk}}{\sum_{jk}a_{0jk}b_{jk}}$$
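The replacement of basis values by density coefficients in (8) amounts to a matrix-vector product followed by normalization. A minimal sketch, assuming the pairwise coefficients are stored as a matrix `a[i, j]` and the incoming density as a vector `b[j]`; the function name and the numerical values are hypothetical:

```python
import numpy as np

def propagate_density(a, b):
    """Coefficients c_i of the propagated rho(x) = sum_i c_i f_i(x), following Eq. (8).

    a[i, j]: joint-density coefficients of (x, y); b[j]: coefficients of rho(y).
    """
    c = a @ b          # sum_j a_ij b_j for every i
    return c / c[0]    # divide by sum_j a_0j b_j, so that c_0 = 1

a = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.3, 0.0],
              [0.0, 0.0, 0.1]])   # hypothetical joint coefficients
b = np.array([1.0, 0.5, 0.0])     # hypothetical rho(y) = 1 + 0.5 f_1(y)
c = propagate_density(a, b)       # coefficients of the propagated rho(x)
```

In this toy case $a_{0j}=\delta_{0j}$, so the denominator in (7) equals 1 for every $y$ and the constant denominator step introduces no error; in general it is only an approximation.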

II-D Tensor decomposition and linearization

Analogously for intermediate layers, as at the bottom of Fig. 1: integrating over the intermediate variables, approximating with constant denominator and normalizing at the end, thanks to basis orthogonality we get Kronecker deltas enforcing equality of the intermediate indices. This leads to a condition for approximating higher-order tensors with lower-order ones:

$$A^{k}_{i_{1}i_{2}i_{3}i_{4}}\approx\sum_{j_{1},j_{2}}a^{j_{1}}_{i_{1}i_{2}}\,b^{j_{2}}_{i_{3}i_{4}}\,c^{k}_{j_{1}j_{2}}\qquad\textrm{as}\qquad\begin{array}{c}c^{k}\\ \Sigma_{j_{1}}\wedge\Sigma_{j_{2}}\\ a_{i_{1}i_{2}}\quad b_{i_{3}i_{4}}\end{array}$$

Such approximations are studied in general by the field of tensor decomposition [18], hopefully leading to better training approaches.
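The contraction of two lower-order tensors with a combining tensor, approximating one higher-order tensor, can be sketched with `numpy.einsum`. The sizes `m` and `r` and the random values below are hypothetical and serve only to show the index pattern and the reduction in parameter count:

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 3, 2                          # basis size per variable, size of intermediate indices
a = rng.standard_normal((m, m, r))   # a[i1, i2, j1]
b = rng.standard_normal((m, m, r))   # b[i3, i4, j2]
c = rng.standard_normal((r, r, m))   # c[j1, j2, k]

# A[i1, i2, i3, i4, k] = sum_{j1, j2} a[i1, i2, j1] * b[i3, i4, j2] * c[j1, j2, k]
A = np.einsum("pqu,rsv,uvw->pqrsw", a, b, c)

full_size = A.size                         # 3**5 = 243 entries in the full tensor
factored_size = a.size + b.size + c.size   # 18 + 18 + 12 = 48 parameters
```

Fitting `a`, `b`, `c` to a given tensor is the actual decomposition problem and is not attempted in this sketch.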

While neural networks require nonlinearity, such a tensor approach allows one to linearize their intrinsic behavior: calculate the nonlinearities in some basis, e.g. polynomials, only for the outer inputs/outputs, $(f_{\mathbf{j}}(\mathbf{x}):\mathbf{j}\in B,\ \mathbf{x}\in\bar{X})$, including multivariate dependencies. Then treat the entire neural network as a linear transformation of such features (with normalization only at the end, no inter-layer normalization needed), e.g. just changing indices (like a transposition) to modify the propagation direction. However, this relies on the constant denominator approximation, and the tensor would grow exponentially in size if all dependencies were included. We should therefore combine it with reductions such as tensor decomposition or information bottleneck, working on linear approximations to reduce dimension.

II-E Basis optimization and selection

Another direction is to use the found coefficients $a_{\mathbf{j}}$, for example to optimize the arbitrarily chosen basis $\{f_{i}\}$ in order to reduce the number of considered coefficients and also to reduce overfitting, as discussed e.g. in [13, 19]. For this purpose we can for example treat the current coefficients as a rectangular matrix $M_{j_{1},\,j_{2}..j_{d}}:=a_{\mathbf{j}}$, with the remaining indices $j_{2}..j_{d}$ grouped into a single block index running over all considered coefficients in the basis. Now we can use SVD (singular value decomposition): find the orthonormal eigenbasis of $MM^{T}=\sum_{i}\sigma_{i}\mathbf{u}_{i}\mathbf{u}_{i}^{T}$ and use $g_{i}=\sum_{j}u_{ij}f_{j}$ as the new basis for one or a few dominant eigenvectors. We can proceed similarly for the remaining variables, getting separate or common optimized bases for them.
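A minimal sketch of this basis optimization for one variable, assuming the coefficients are stored as a dense NumPy tensor. The function name `optimized_basis` is hypothetical, and whether the constant index $j=0$ should be kept or excluded is not specified in the text; this sketch keeps all indices.

```python
import numpy as np

def optimized_basis(a, axis=0, keep=1):
    """Dominant directions u_i for one variable; new basis g_i = sum_j u[i, j] f_j."""
    # rows: index of the chosen variable, columns: all remaining indices flattened
    M = np.moveaxis(a, axis, 0).reshape(a.shape[axis], -1)
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    # columns of U are eigenvectors of M M^T, with eigenvalues s**2
    return U[:, :keep].T, s[:keep] ** 2

# toy rank-1 tensor: all dependence on the first variable goes through one direction v
v = np.array([0.0, 0.6, 0.8])
w = np.array([0.2, 0.1])
a = np.einsum("i,j,k->ijk", v, w, w)
u, eig = optimized_basis(a, axis=0, keep=1)   # u[0] equals v up to sign
```

Because the rows of `u` are orthonormal and the $f_j$ are orthonormal, the resulting $g_i$ are orthonormal as well.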

A more difficult question is basis selection: which indices $\mathbf{j}\in B$ to use in the considered linear combinations for each neuron. Extending all to the $m$-th moment/degree for $d$ variables, we would need $(m+1)^{d}=\sum_{k=0}^{d}\binom{d}{k}m^{k}$ coefficients: 1 for normalization, $dm$ for marginal distributions, $d(d-1)m^{2}/2$ for pairwise dependencies, and so on. With proper normalization the coefficients for marginal distributions should be close to 0 and can be neglected. To reduce the number of coefficients, we can restrict e.g. to at most pairwise dependencies ($\approx$ KAN). More generally, we can e.g. calculate more coefficients and during training discard those close to zero. Using optimized bases as above should allow their size to be reduced.
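The coefficient count above is easy to check numerically. In the sketch below, the function name `count_coefficients` and its `max_order` parameter (the largest number of variables allowed to interact) are hypothetical:

```python
from math import comb

def count_coefficients(d, m, max_order=None):
    """Number of basis indices with at most max_order nonzero entries among d variables."""
    if max_order is None:
        max_order = d
    return sum(comb(d, k) * m**k for k in range(max_order + 1))

full = count_coefficients(d=4, m=3)                   # (3 + 1)**4 = 256
pairwise = count_coefficients(d=4, m=3, max_order=2)  # 1 + 12 + 54 = 67
```

Already for $d=4$, $m=3$ the pairwise restriction keeps roughly a quarter of the coefficients, and the gap widens quickly with $d$.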

II-F Some HCRNN training approaches

A single HCR neuron models a multidimensional joint distribution, which is already quite powerful. However, for neural networks the main difficulty is training the intermediate layers. Here are some approaches:

  • Treat HCRNN as just a parametrization and use standard backpropagation like for other ANNs e.g. MLP, KAN. It can be mixed with other techniques, e.g. static parameter estimation/update from recent values, online basis optimization and selection, or just optimizing initial parameters to improve further main optimization.

  • Maybe find initial intermediate values by dimensionality reduction, like PCA, of the vectors of features $\{f_{\mathbf{j}}(x):\mathbf{j}\in B\}$, being (nonlinear) products of functions of inputs; this is further extended into the information bottleneck approach.

  • Maybe use propagation as e.g. expected values in both directions, combine with coefficient estimation/update.

  • Maybe use some tensor decomposition techniques - start with estimation of e.g. pairwise dependencies for a larger set of variables, and use algebraic methods to try to approximate it with multiple lower order tensors.

While backpropagation is available for various parametrizations, HCRNNs with neurons containing joint distribution models bring some additional possibilities, hopefully allowing for faster training, especially the information bottleneck approach discussed further, which looks the most promising.

Coefficients of such trained HCRNN remain mixed moments - providing dependency interpretation between input/output and hidden intermediate variables, allowing for multidirectional propagation of values or distributions like in Fig. 6, and its parameters can be further continuously updated e.g. using (3).

III Information bottleneck based training

Let us consider a neural network with $l$ hidden layers, intended to predict $Y$ from $X$, with parameter sets $\{\theta^{i}\}_{i=0..l}$:

$$Y\quad\textrm{label of}\quad X\xrightarrow{\theta^{0}}T^{1}\xrightarrow{\theta^{1}}\ldots\xrightarrow{\theta^{l-1}}T^{l}\xrightarrow{\theta^{l}}\hat{Y} \tag{9}$$

While the standard training approach focuses on optimizing the neuron parameters $\theta$ used to process the data, alternatively we could try to directly optimize the distributions $T^{i}$ of the hidden/intermediate layers, in practice the dataset processed through some of the layers (here possibly in both directions). This is like in image recognition, where the first layers extract low-level features like edges, then some intermediate features, and finally e.g. faces.

The information bottleneck approach [20, 21] suggests how to directly optimize the intermediate layers. It uses information theory, offering nearly objective evaluation that is invariant to variable permutations and bijections. The mutual information $I(X;Y)=H(X)+H(Y)-H(X,Y)$ is the number of bits (or nits) that $X$ on average gives about $Y$, or $Y$ about $X$.

When optimizing an intermediate layer $T^{i}$, it should obviously maximize information about the predicted $Y$: $\max_{T}I(T;Y)$. However, focusing only on prediction would rather lead to overfitting. To prevent that, it turns out that neural networks also perform compression: $\min_{T}I(X;T)$, i.e. removal of unnecessary information/noise from the input. Finally, the information bottleneck approach, for some $\beta>0$, assumes optimization of:

$$\inf_{T}\left(I(X;T)-\beta I(T;Y)\right) \tag{10}$$

III-A Information theory view on HCR

Mutual information has excellent properties; however, it is relatively difficult to calculate, as it requires a model of the joint distribution. HCR assumes the $\rho(\mathbf{x})=\sum_{\mathbf{j}\in B}a_{\mathbf{j}}\,f_{\mathbf{j}}(\mathbf{x})$ model of joint distribution. Using the natural logarithm (information in nits) with the first order approximation $\ln(1+a)\approx a$, and orthogonality of the basis, we get a simple practical approximation:

$$H(X)=-\int_{[0,1]^{d}}\rho(\mathbf{x})\ln(\rho(\mathbf{x}))\,d\mathbf{x}\approx-\sum_{\mathbf{j}\in B^{+}}(a_{\mathbf{j}})^{2} \tag{11}$$
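The step from the integral to the sum in (11) can be spelled out as follows. This is a sketch assuming that $B^{+}$ denotes the basis without the constant index (with $f_{\mathbf{0}}=1$), so that $\rho=1+g$ with $g=\sum_{\mathbf{j}\in B^{+}}a_{\mathbf{j}}f_{\mathbf{j}}$; the symbol $g$ is introduced here only for illustration.

```latex
\begin{align*}
\int_{[0,1]^d} g \, d\mathbf{x} &= 0, \qquad
\int_{[0,1]^d} g^2 \, d\mathbf{x} = \sum_{\mathbf{j}\in B^+} (a_{\mathbf{j}})^2
  && \text{(orthonormality, } f_{\mathbf{0}} = 1 \text{)} \\
H(X) &= -\int_{[0,1]^d} (1+g)\ln(1+g)\, d\mathbf{x}
  \approx -\int_{[0,1]^d} (1+g)\, g \, d\mathbf{x}
  = -\sum_{\mathbf{j}\in B^+} (a_{\mathbf{j}})^2
\end{align*}
```

Since $\ln(1+g)\approx g$ is a first order approximation, the result should be expected to be accurate only when the coefficients in $B^{+}$ are small, i.e. for densities close to uniform.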

For the joint distribution of (multivariate) variables $(X,Y)$, denoting $B_{X},B_{Y}$ as the bases used for the individual variables, the approximate mutual information formula becomes:

$$I(X;Y)\approx\sum_{\mathbf{j}_{x}\in B_{X}^{+}}\ \sum_{\mathbf{j}_{y}\in B_{Y}^{+}}\left(a_{(\mathbf{j}_{x},\mathbf{j}_{y})}\right)^{2} \tag{12}$$

Hence this $\ln(1+a)\approx a$ approximation allows one to evaluate mutual information by just summing the squares of the nontrivial coefficients: the mixed moments between the two variables.
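To see why only mixed coefficients survive, one can apply the entropy approximation (11) to each term of $I(X;Y)=H(X)+H(Y)-H(X,Y)$. The sketch below assumes that the marginal density of $X$ is described by the coefficients $a_{(\mathbf{j}_x,\mathbf{0})}$ and that of $Y$ by $a_{(\mathbf{0},\mathbf{j}_y)}$, which follows from integrating out the other variable with an orthonormal basis containing $f_{\mathbf{0}}=1$.

```latex
\begin{align*}
H(X) &\approx -\sum_{\mathbf{j}_x \in B_X^+} \big(a_{(\mathbf{j}_x,\mathbf{0})}\big)^2, \qquad
H(Y) \approx -\sum_{\mathbf{j}_y \in B_Y^+} \big(a_{(\mathbf{0},\mathbf{j}_y)}\big)^2, \\
H(X,Y) &\approx -\sum_{(\mathbf{j}_x,\mathbf{j}_y) \neq (\mathbf{0},\mathbf{0})}
  \big(a_{(\mathbf{j}_x,\mathbf{j}_y)}\big)^2, \\
I(X;Y) &= H(X) + H(Y) - H(X,Y)
  \approx \sum_{\mathbf{j}_x \in B_X^+} \sum_{\mathbf{j}_y \in B_Y^+}
  \big(a_{(\mathbf{j}_x,\mathbf{j}_y)}\big)^2 .
\end{align*}
```

The purely marginal terms cancel, leaving exactly the coefficients with nonzero indices for both variables.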

Let us now add estimation of these coefficients/mixed moments, which will become the neuron parameters $\theta$. Taking $\{\mathbf{x}^{i}\}_{i=1..n}$, $\{\mathbf{y}^{i}\}_{i=1..n}$ as the current batch from the dataset, denote

$$\bar{X}=\frac{1}{\sqrt{n}}\left(f_{\mathbf{j}}(\mathbf{x}^{i})\right)_{i=1..n,\ \mathbf{j}\in B^{+}_{X}}\qquad\bar{Y}=\frac{1}{\sqrt{n}}\left(f_{\mathbf{j}}(\mathbf{y}^{i})\right)_{i=1..n,\ \mathbf{j}\in B^{+}_{Y}} \tag{13}$$

as $n\times|B^{+}_{X}|$, $n\times|B^{+}_{Y}|$ matrices containing features of vectors from the batch (values in the chosen basis). Alternatively, they could be interpreted as a description of the propagated probability distribution, e.g. $\rho(\mathbf{x})=1+\sum_{\mathbf{j}\in B^{+}_{X}}a_{\mathbf{j}}\,f_{\mathbf{j}}(\mathbf{x})$.

Having the above feature matrices $\bar{X},\bar{Y}$, we can directly obtain the MSE estimates (2) of the parameters for all $B_{X}^{+}\times B_{Y}^{+}$ pairs, and the mutual information:

$$\left(a_{(j_{x},j_{y})}:j_{x}\in B_{X}^{+},\ j_{y}\in B_{Y}^{+}\right)=\bar{X}^{T}\bar{Y} \tag{14}$$

$$I(X;Y)\approx\textrm{Tr}\left(\bar{X}^{T}\bar{Y}(\bar{X}^{T}\bar{Y})^{T}\right)=\textrm{Tr}(C_{X}C_{Y})$$

where $C_{X}=\bar{X}\bar{X}^{T}$, $C_{Y}=\bar{Y}\bar{Y}^{T}$ are $n\times n$ matrices containing scalar products of the points in the batch, represented as vectors of features. If the means are subtracted beforehand, they resemble covariance matrices of the used features; this subtraction might be worth including, but generally the means should be close to zero. Note also that, rather than a covariance matrix, each is an $n\times n$ similarity matrix between the points of the size-$n$ batch, using the chosen features.
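The estimation (14) and the two equivalent forms of the mutual information approximation can be sketched in NumPy as follows. The rank-based `to_uniform` helper is an assumed stand-in for the CDF normalization, the features use only $f_1, f_2$ of single variables, and the toy data are hypothetical:

```python
import numpy as np

def to_uniform(v):
    """Empirical-CDF normalization to (0, 1) via ranks (an assumed stand-in)."""
    ranks = np.argsort(np.argsort(v))
    return (ranks + 0.5) / len(v)

def features(u):
    """n x 2 matrix with rows (f_1(u_i), f_2(u_i)) / sqrt(n), as in Eq. (13)."""
    f1 = np.sqrt(3.0) * (2.0 * u - 1.0)
    f2 = np.sqrt(5.0) * (6.0 * u**2 - 6.0 * u + 1.0)
    return np.stack([f1, f2], axis=1) / np.sqrt(len(u))

rng = np.random.default_rng(0)
n = 500
x_raw = rng.standard_normal(n)
y_raw = x_raw + 0.5 * rng.standard_normal(n)   # toy dependent pair
X_bar = features(to_uniform(x_raw))
Y_bar = features(to_uniform(y_raw))

a = X_bar.T @ Y_bar                      # Eq. (14): matrix of estimated mixed moments
mi_from_coefficients = np.sum(a**2)      # Eq. (12): sum of squared coefficients
C_X = X_bar @ X_bar.T                    # n x n similarity matrices
C_Y = Y_bar @ Y_bar.T
mi_from_trace = np.trace(C_X @ C_Y)      # trace form, equal up to rounding
```

The coefficient form costs $O(n\,|B_X^+|\,|B_Y^+|)$ operations, while the trace form works on $n\times n$ matrices; the latter is the one that carries over to optimizing an intermediate layer through its similarity matrix.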

III-B Information bottleneck training

Using the above approximation, information bottleneck training of $X\to T\to Y$ for some hidden intermediate layer $T$, as trace is linear and cyclic, becomes (similar to nHSIC in [15]):

$$\inf_{C_{T}\geq 0}\left(\textrm{Tr}(C_{X}C_{T})-\beta\,\textrm{Tr}(C_{T}C_{Y})\right)=\inf_{C_{T}\geq 0}\textrm{Tr}\left(C_{T}(C_{X}-\beta C_{Y})\right) \tag{15}$$

for some symmetric $C_{T}=\bar{T}\bar{T}^{T}$ with nonnegative spectrum, possibly after subtracting the means to make it a covariance matrix. Decomposing it as $C_{T}=ODO^{T}$ with orthogonal $O$ and $D=\textrm{diag}(\Lambda_{1},\ldots,\Lambda_{n})$, the minimized expression becomes:

$$\textrm{Tr}\left(D\,O^{T}(C_{X}-\beta C_{Y})O\right)=\sum_{i=1}^{n}\Lambda_{i}\,\left(O^{T}(C_{X}-\beta C_{Y})O\right)_{ii}$$

Denoting $M=O^{T}(C_{X}-\beta C_{Y})O$, the above becomes just $\sum_{i=1}^{n}\Lambda_{i}M_{ii}$. Minimizing it over $\Lambda_{i}\geq 0$, we could get to minus infinity for negative $M_{ii}$. Hence it is crucial to add regularization, e.g. $l^{2}$: adding $\textrm{Tr}(C_{T}^{2})=\sum_{i}\Lambda_{i}^{2}$ times some $\frac{1}{2\eta}>0$ to the minimized expression:

$$\inf_{(\Lambda_{i})}\ \sum_{i=1}^{n}\Lambda_{i}M_{ii}+\frac{1}{2\eta}\sum_{i}\Lambda_{i}^{2} \qquad (16)$$

For nonnegative $M_{ii}$, i.e. $\{i:M_{ii}\geq 0\}$, minimization requires $\Lambda_{i}=0$. For negative $M_{ii}$, i.e. $\{i:M_{ii}<0\}$, minimization gives $\Lambda_{i}=-\eta M_{ii}$.
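The step behind these two cases can be written out explicitly. Since (16) separates over $i$, each $\Lambda_i$ can be minimized independently. The derivation below is a reconstruction of that step under the stated constraints ($\Lambda_i\geq 0$, $\eta>0$); the name $g_i$ is introduced here only for illustration:

```latex
\begin{aligned}
g_i(\Lambda_i) &= \Lambda_i M_{ii} + \frac{1}{2\eta}\Lambda_i^{2},
  \qquad \Lambda_i \geq 0,\ \eta > 0 \\
g_i'(\Lambda_i) &= M_{ii} + \frac{\Lambda_i}{\eta} = 0
  \ \Rightarrow\ \Lambda_i = -\eta M_{ii}
  \quad (\text{admissible, i.e. } \geq 0, \text{ iff } M_{ii} \leq 0) \\
M_{ii} \geq 0 &:\quad g_i' \geq 0 \text{ on } \Lambda_i \geq 0
  \ \Rightarrow\ \Lambda_i = 0,\quad g_i = 0 \\
M_{ii} < 0 &:\quad g_i(-\eta M_{ii}) = -\eta M_{ii}^{2} + \frac{\eta}{2} M_{ii}^{2}
  = -\frac{\eta}{2} M_{ii}^{2}
\end{aligned}
```

Both cases combine into $\Lambda_i=\max(0,-\eta M_{ii})$, and summing the second case over $i$ gives the value $-\frac{\eta}{2}\sum_{i:M_{ii}<0}M_{ii}^{2}$ that appears in (17).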

What remains is the optimization of the rotation $O$. With the above optimal $\Lambda_{i}$, the minimized (16) becomes:

$$\inf_{O:O^{T}O=I}\ -\frac{\eta}{2}\sum_{i:(O^{T}(C_{X}-\beta C_{Y})O)_{ii}<0}\left((O^{T}(C_{X}-\beta C_{Y})O)_{ii}\right)^{2} \qquad (17)$$

The eigendecomposition $C_{X}-\beta C_{Y}=O\,\textrm{diag}(\lambda_{i})\,O^{T}$ allows choosing $O$ for the optimized $C_{T}=O\,\textrm{diag}(\Lambda_{i})\,O^{T}$, which makes $M_{ii}=\lambda_{i}$. As above, its optimal eigenvalues should be chosen as $\Lambda_{i}=\max(0,-\eta\lambda_{i})$.

Choosing $\bar{T}=(O\,\textrm{diag}(\sqrt{\max(0,-\eta\lambda_{i})}))_{i=1..n,\,j=1..k}$ for sorted $\lambda_{1}\leq\ldots\leq\lambda_{n}$ and $\lambda_{k}<0$, $\lambda_{k+1}\geq 0$, the matrix $C_{T}=\bar{T}\bar{T}^{T}$ will be as required, with the hidden layer represented as size-$k$ vectors of features.
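As a numerical sanity check of this construction, the following minimal sketch uses a random symmetric matrix `A` as a stand-in for $C_X-\beta C_Y$ (an assumption made only for illustration, as are the variable names and the value of `eta`). It builds $\bar{T}$ as above and evaluates the regularized objective $\textrm{Tr}(C_T A)+\frac{1}{2\eta}\textrm{Tr}(C_T^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, eta = 6, 0.5

# Stand-in for C_X - beta * C_Y: any symmetric matrix works for this check.
S = rng.standard_normal((n, n))
A = (S + S.T) / 2

lam, O = np.linalg.eigh(A)              # eigenvalues in ascending order
Lam = np.maximum(0.0, -eta * lam)       # Lambda_i = max(0, -eta * lambda_i)
k = int(np.sum(lam < 0))                # number of negative eigenvalues

T_bar = O[:, :k] * np.sqrt(Lam[:k])     # first k columns of O diag(sqrt(Lambda))
C_T = T_bar @ T_bar.T                   # equals O diag(Lambda) O^T


def objective(C):
    """Tr(C A) + Tr(C^2) / (2 eta), the regularized quantity being minimized."""
    return np.trace(C @ A) + np.trace(C @ C) / (2 * eta)


closed_form = -eta / 2 * np.sum(lam[lam < 0] ** 2)   # value from (17) at this O
print(objective(C_T), closed_form)
```

The two printed numbers should agree up to rounding, and any other positive semidefinite candidate for $C_T$ should give a value that is not lower.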

For such $\bar{T}$ content of the hidden layer, we can estimate the transition parameters using (10): $\bar{X}^{T}\bar{T}$ for $X\to T$ and $\bar{T}^{T}\bar{Y}$ for $T\to Y$. Multiplying both, we get $\bar{X}^{T}\bar{T}\bar{T}^{T}\bar{Y}$ for the indirect connection, while for the direct connection we got $\bar{X}^{T}\bar{Y}$. Both should be comparable, suggesting to use just $\Lambda_{i}=1$ for $i=1..k$ and 0 for the rest: the content of the $T$ layer is then just the projection onto the eigenvectors of $C_{X}-\beta C_{Y}$ corresponding to negative eigenvalues.

However, such a number $k$ of features could be comparable to the batch size $n$, which can be impractically large. We can decrease it by reducing $\beta$. Alternatively, we could fix $\beta$ and use a smaller $k$: e.g. a fixed $k$, only $\lambda$ below some negative threshold, or $\lambda$ up to some large jump in the eigenspectrum.
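The three ways of limiting $k$ mentioned above can be sketched as one small helper. The function name `choose_k` and its parameters are hypothetical, and reading "large jump in the eigenspectrum" as the largest gap between consecutive negative eigenvalues is only one possible interpretation:

```python
import numpy as np


def choose_k(lam, k_max=None, threshold=None, use_gap=False):
    """Pick the number k of lowest eigenvalues to keep (hypothetical helper).

    lam       : eigenvalues of C_X - beta * C_Y
    k_max     : optional fixed upper bound on k
    threshold : optional negative threshold, keep only lam < threshold
    use_gap   : if True, cut at the largest gap among the kept eigenvalues
    """
    lam = np.sort(np.asarray(lam, dtype=float))
    kept = lam[lam < 0]                      # default: all negative eigenvalues
    if threshold is not None:
        kept = kept[kept < threshold]
    k = len(kept)
    if use_gap and k > 1:
        gaps = np.diff(kept)                 # nonnegative, since kept is sorted
        k = int(np.argmax(gaps)) + 1         # keep everything before the largest jump
    if k_max is not None:
        k = min(k, k_max)
    return k


lam = [-5.0, -4.5, -4.0, -0.3, -0.2, 0.1, 2.0]
print(choose_k(lam))                   # 5: all negative eigenvalues
print(choose_k(lam, threshold=-1.0))   # 3: only eigenvalues below -1
print(choose_k(lam, use_gap=True))     # 3: cut at the jump from -4.0 to -0.3
print(choose_k(lam, k_max=2))          # 2: fixed upper bound
```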

To summarize the suggested procedure, information bottleneck HCR training for $X\to T\to Y$ of the intermediate layer $T$, with parameter $\beta$ and $k$ features:

  1. Prepare matrices of features (if not done earlier):

    $$\bar{X}=\frac{1}{\sqrt{n}}(f_{\mathbf{j}}(\mathbf{x}^{i}))_{i=1..n,\,\mathbf{j}\in B^{+}_{X}},\quad \bar{Y}=\frac{1}{\sqrt{n}}(f_{\mathbf{j}}(\mathbf{y}^{i}))_{i=1..n,\,\mathbf{j}\in B^{+}_{Y}}$$

  2. Perform the eigendecomposition, with sorted eigenvalues $\lambda_{1}\leq\lambda_{2}\leq\ldots\leq\lambda_{n}$:

    $$\bar{X}\bar{X}^{T}-\beta\bar{Y}\bar{Y}^{T}=O\,\textrm{diag}(\lambda_{i})\,O^{T}$$

  3. Take $\bar{T}=O_{i=1..n,\,j=1..k}$, the projection onto the $k$ lowest eigenvectors, as the content of the $T$ layer: $k$ features for a size-$n$ batch.

  4. If needed, neuron weights can be calculated/estimated using (10) as $\bar{X}^{T}\bar{T}$ for $X\to T$ and $\bar{T}^{T}\bar{Y}$ for $T\to Y$.
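The four steps above can be sketched in NumPy as follows. This is a minimal illustration and not the author's implementation. It assumes one-dimensional $x$ and $y$ already normalized to $[0,1]$, takes $f_j$ to be Legendre polynomials rescaled to be orthonormal on $[0,1]$ with the constant $f_0$ left out, and uses hypothetical function names and toy data:

```python
import numpy as np
from numpy.polynomial.legendre import Legendre


def features(u, degree):
    """Step 1: n x degree matrix (1/sqrt(n)) * f_j(u_i), j = 1..degree.

    Assumption: f_j(u) = sqrt(2j+1) * P_j(2u-1), i.e. Legendre polynomials
    rescaled to be orthonormal on [0, 1]; the constant f_0 is omitted.
    """
    u = np.asarray(u, dtype=float)
    cols = [np.sqrt(2 * j + 1) * Legendre.basis(j)(2 * u - 1)
            for j in range(1, degree + 1)]
    return np.stack(cols, axis=1) / np.sqrt(len(u))


def bottleneck_layer(X_bar, Y_bar, beta, k):
    """Steps 2-4: eigendecomposition, k lowest eigenvectors, weight estimates."""
    A = X_bar @ X_bar.T - beta * (Y_bar @ Y_bar.T)   # symmetric n x n matrix
    lam, O = np.linalg.eigh(A)                       # eigenvalues in ascending order
    T_bar = O[:, :k]                                 # content of the T layer
    W_xt = X_bar.T @ T_bar                           # estimate for X -> T
    W_ty = T_bar.T @ Y_bar                           # estimate for T -> Y
    return lam, T_bar, W_xt, W_ty


# Toy batch (hypothetical data): x uniform on [0, 1], y a noisy copy wrapped to [0, 1).
rng = np.random.default_rng(0)
n, degree, beta, k = 200, 4, 2.0, 3
x = rng.random(n)
y = (x + 0.1 * rng.standard_normal(n)) % 1.0

X_bar, Y_bar = features(x, degree), features(y, degree)
lam, T_bar, W_xt, W_ty = bottleneck_layer(X_bar, Y_bar, beta, k)
print(T_bar.shape, W_xt.shape, W_ty.shape)   # (200, 3) (4, 3) (3, 4)
```

Here `np.linalg.eigh` is used because the matrix is symmetric, and it returns eigenvalues in ascending order, so the first `k` columns of `O` are the eigenvectors for the `k` lowest eigenvalues.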

III-C Remarks, applications of information bottleneck training

  • The found $k$ features on a size-$n$ batch are abstract: there is no need to choose a basis. However, choosing a size-$k$ basis would give an interpretation of the content of this intermediate layer as values, or rather as a probability distribution. As such abstract values are no longer normalized, it seems we can skip CDF normalization layers.

  • The diagonalized $\bar{X}\bar{X}^{T}-\beta\bar{Y}\bar{Y}^{T}$ is an $n\times n$ matrix of distances in the size-$n$ batch, given as scalar products of feature vectors. The cost of operating on it grows with the batch size, so in practice we should split the dataset into batches and e.g. estimate network parameters and average them over batches, or update the previous ones using an exponential moving average: $\theta=\mu\theta+(1-\mu)\bar{X}^{T}\bar{T}$, for example visiting various layers/neurons in some order using a given batch, maybe varying the $\beta,\eta,k$ parameters.

  • For practical training we can optimize/update intermediate layers $T$ in various ways, e.g. start with $X$ as input and $Y$ as output to optimize an intermediate layer $T$, then optimize another intermediate layer between $T$ and $Y$, adding e.g. pairwise features of $T$ (extending $\bar{T}$ to add nonlinearity), and so on recursively. Or we can optimize for succeeding layers $T_{i-1}\to T_{i}\to T_{i+1}$. Standard techniques like convolution or pooling can also be added. The finally found parameters can be further improved with a different technique like backpropagation.

  • The found approach is different from the one for Gaussian variables [22], which is close to canonical correlation analysis. Maybe it is worth considering some intermediate approach, e.g. subtracting means to make $C_{X}$ closer to a covariance matrix, improving on the use of the $\ln(1+a)\approx a$ approximation, etc.
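The per-batch exponential moving average $\theta=\mu\theta+(1-\mu)\bar{X}^{T}\bar{T}$ from the remarks above can be sketched as follows. The feature basis (rescaled Legendre polynomials), the toy data, and all function names are assumptions made for illustration. The sign alignment step is also an addition of this sketch and not part of the described procedure: eigenvectors are defined only up to sign, so estimates from different batches are flipped to agree with the running $\theta$ before averaging:

```python
import numpy as np
from numpy.polynomial.legendre import Legendre


def features(u, degree):
    """(1/sqrt(n)) * f_j(u_i), j = 1..degree; f_j orthonormal Legendre on [0, 1] (assumed basis)."""
    u = np.asarray(u, dtype=float)
    cols = [np.sqrt(2 * j + 1) * Legendre.basis(j)(2 * u - 1)
            for j in range(1, degree + 1)]
    return np.stack(cols, axis=1) / np.sqrt(len(u))


def batch_estimate(x, y, degree, beta, k):
    """Per-batch estimate of the X -> T weights, X_bar^T T_bar."""
    X_bar, Y_bar = features(x, degree), features(y, degree)
    _, O = np.linalg.eigh(X_bar @ X_bar.T - beta * (Y_bar @ Y_bar.T))
    return X_bar.T @ O[:, :k]


def align_signs(estimate, theta):
    """Flip columns of `estimate` so each has a nonnegative scalar product with `theta`."""
    signs = np.sign(np.sum(estimate * theta, axis=0))
    signs[signs == 0] = 1.0
    return estimate * signs


rng = np.random.default_rng(0)
n, degree, beta, k, mu = 100, 4, 2.0, 2, 0.9
theta = None
for _ in range(5):                                   # five toy batches
    x = rng.random(n)
    y = (x + 0.1 * rng.standard_normal(n)) % 1.0
    est = batch_estimate(x, y, degree, beta, k)
    if theta is None:
        theta = est                                  # initialize from the first batch
    else:
        theta = mu * theta + (1 - mu) * align_signs(est, theta)
print(theta.shape)   # (4, 2)
```

Sign alignment does not handle eigenvectors with close eigenvalues mixing or swapping order between batches, so a practical version would likely need a more careful matching of features across batches.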

IV Conclusions and further work

Neurons with joint distribution models seem a powerful, agnostic improvement over currently popular guessed parametrizations like MLP or KAN, and are practically accessible with HCR, up to omnidirectional neurons like in Fig. 6, allowing to freely choose inference directions and to propagate both values and probability distributions, with clear coefficient interpretations.

As in Fig. 2, 3, BNNs are qualitatively superior to current ANNs, have multidirectional propagation, including of probabilities, and need training different from standard backpropagation. The proposed new ANNs allow catching up with such low level behavior while still looking biologically plausible, e.g. KAN-like, hopefully also allowing to get closer to high level behavior, maybe recreating the mathematics hidden in BNN behavior.

However, mastering such new neural network architecture will require a lot of work, planned also for future versions of this article. Here are some basic research directions:

  • Search for practical applications, from replacement of standard ANN, for multidirectional inference e.g. in Bayes-like scenarios, as neural networks propagating probability distributions, up to exploration of similarity/replacement for biological neurons.

  • Practical implementation, optimization especially of training and update, basis optimization and selection techniques, exploration of tensor decomposition approach.

  • Working on probability distributions makes it natural for information theoretic approaches like information bottleneck [20] optimization of intermediate layers, also hopefully leading to better understanding e.g. of information propagation during learning/inference, information held by intermediate layers, etc.

  • Adding time dependence like model update, also for similarity with biological neurons, e.g. long term potentiation, connection to various periodic processes/clocks.

  • While the discussed neurons containing joint distribution models seem very powerful and flexible, when working directly in high dimensions they have various issues. This suggests directly predicting conditional distributions with HCR parametrization instead ([9, 10, 12, 13]), which might also be worth including in neural networks, e.g. as a part of the training process, to be decomposed into single neurons.

Figure 6: Omnidirectional HCR neuron proposed in [8] - getting any subset $S$ of connections as input, it can update the model for $a_{\mathbf{j}}$ coefficients positive only in this subset: $\{i:j_{i}\geq 1\}\subset S$, and predict/propagate to output as the remaining connections, e.g. expected values for these inputs, for example accumulated up to some threshold, including sign for excitatory/inhibitory.

References

  • [1] R. Follmann, E. Rosa Jr, and W. Stein, “Dynamics of signal propagation and collision in axons,” Physical Review E, vol. 92, no. 3, p. 032707, 2015.
  • [2] B. A. Richards, T. P. Lillicrap, P. Beaudoin, Y. Bengio, R. Bogacz, A. Christensen, C. Clopath, R. P. Costa, A. de Berker, S. Ganguli et al., “A deep learning framework for neuroscience,” Nature neuroscience, vol. 22, no. 11, pp. 1761–1770, 2019.
  • [3] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural networks, vol. 2, no. 5, pp. 359–366, 1989.
  • [4] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark, “Kan: Kolmogorov-arnold networks,” arXiv preprint arXiv:2404.19756, 2024.
  • [5] I. Kononenko, “Bayesian neural networks,” Biological Cybernetics, vol. 61, no. 5, pp. 361–370, 1989.
  • [6] F. Durante and C. Sempi, “Copula theory: an introduction,” in Copula theory and its applications.   Springer, 2010, pp. 3–31.
  • [7] G. R. Terrell and D. W. Scott, “Variable kernel density estimation,” The Annals of Statistics, pp. 1236–1265, 1992.
  • [8] J. Duda, “Hierarchical correlation reconstruction with missing data, for example for biology-inspired neuron,” arXiv preprint arXiv:1804.06218, 2018.
  • [9] J. Duda and A. Szulc, “Social benefits versus monetary and multidimensional poverty in poland: Imputed income exercise,” in International Conference on Applied Economics.   Springer, 2019, pp. 87–102, preprint: https://arxiv.longhoe.net/abs/1812.08040.
  • [10] J. Duda, H. Gurgul, and R. Syrek, “Modelling bid-ask spread conditional distributions using hierarchical correlation reconstruction,” Statistics in Transition New Series, vol. 21, no. 5, 2020, preprint: https://arxiv.longhoe.net/abs/1911.02361.
  • [11] J. Duda and G. Bhatta, “Gamma-ray blazar variability: new statistical methods of time-flux distributions,” Monthly Notices of the Royal Astronomical Society, vol. 508, no. 1, pp. 1446–1458, 2021.
  • [12] J. Duda and S. Podlewska, “Prediction of probability distributions of molecular properties: towards more efficient virtual screening and better understanding of compound representations,” Molecular Diversity, pp. 1–12, 2022.
  • [13] J. Duda and G. Bhatta, “Predicting conditional probability distributions of redshifts of active galactic nuclei using hierarchical correlation reconstruction,” Monthly Notices of the Royal Astronomical Society, p. stae963, 2024.
  • [14] R. Pogodin and P. Latham, “Kernelized information bottleneck leads to biologically plausible 3-factor hebbian learning in deep networks,” Advances in Neural Information Processing Systems, vol. 33, pp. 7296–7307, 2020.
  • [15] W.-D. K. Ma, J. Lewis, and W. B. Kleijn, “The hsic bottleneck: Deep learning without back-propagation,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, 2020, pp. 5085–5092.
  • [16] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning.   pmlr, 2015, pp. 448–456.
  • [17] J. Duda, “Rapid parametric density estimation,” arXiv preprint arXiv:1702.02144, 2017.
  • [18] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
  • [19] J. Duda, “Fast optimization of common basis for matrix set through common singular value decomposition,” arXiv preprint arXiv:2204.08242, 2022.
  • [20] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000.
  • [21] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in 2015 ieee information theory workshop (itw).   IEEE, 2015, pp. 1–5.
  • [22] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss, “Information bottleneck for gaussian variables,” Advances in Neural Information Processing Systems, vol. 16, 2003.