Biology-inspired joint distribution neurons
based on Hierarchical Correlation Reconstruction
allowing for multidirectional neural networks

Jarek Duda
Jagiellonian University, Golebia 24, 31-007 Krakow, Poland, Email: [email protected]

Abstract

Biological neural networks seem qualitatively superior (e.g. in learning, flexibility, robustness) to current artificial ones like Multi-Layer Perceptron (MLP) or Kolmogorov-Arnold Network (KAN). In contrast to these, they also have fundamentally multidirectional signal propagation [1], including propagation of probability distributions e.g. for uncertainty estimation, and are believed to be unable to use standard backpropagation training [2]. This article proposes novel artificial neurons based on HCR (Hierarchical Correlation Reconstruction) that remove the above low level differences: each neuron contains a local joint distribution model (of its connections), representing the joint density on normalized variables as just a linear combination of orthonormal polynomials $(f_{\mathbf{j}})$: $\rho(\mathbf{x})=\sum_{\mathbf{j}\in B}a_{\mathbf{j}}f_{\mathbf{j}}(\mathbf{x})$ for $\mathbf{x}\in[0,1]^{d}$ and $B$ some chosen basis, with basis growth approaching a complete description of the joint distribution. By various index summations of such $(a_{\mathbf{j}})$ tensor as neuron parameters, we get simple formulas for e.g. conditional expected values for propagation in any direction, like $E[x|y,z]$, $E[y|x]$, which degenerate to a KAN-like parametrization if restricted to pairwise dependencies. Such an HCR network can also propagate probability distributions (also joint) like $\rho(y,z|x)$. It also allows for additional training approaches, like direct $(a_{\mathbf{j}})$ estimation, tensor decomposition, or the more biologically plausible information bottleneck training: layers directly influence only their neighbors, optimizing their content to maximize information about the next layer and to minimize information about the previous layer in order to reduce noise.

Keywords: machine learning, neural networks, Kolmogorov-Arnold Network, joint distribution, conditional distribution, Bayesian Neural Networks, tensor decomposition, mutual information, information bottleneck approach

I Introduction

Figure 1: The proposed HCR neuron and neural network (HCRN, HCRNN) containing a local joint distribution model represented by the $(a_{\mathbf{j}})$ tensor, e.g. $(a_{ijk})$ for $d=3$ connections. Top: orthonormal polynomial basis assuming normalization to uniform distribution in $[0,1]$. Middle: HCR neuron containing and applying a joint distribution model for $d=3$ variables, with gathered formulas for direct estimation/model update, and for its application to propagate entire distributions or expected values alone. Such density parametrization can drop below 0, which is usually repaired by calibration, e.g. using the normalized $\max(\rho,0.1)$ density; however, for neural networks with inter-layer normalization this issue seems negligible, which substantially simplifies calculations to the shown formulas. Propagating only expected values and normalizing, we can use only the marked numerators: as in KAN, optimizing nonlinear functions (polynomials here) by including only pairwise dependencies ($a$ with two nonzero indexes), and extending to their products to consciously include higher order dependencies. Bottom: schematic HCR neural network and some training approaches for intermediate layers, which in HCR can be treated as values or their distributions (replacing $f_{i}(u)$ with its $i$-th moment: $\int_{0}^{1}\rho(u)f_{i}(u)du$). The tensor decomposition approach is also visualized: estimate dependencies (e.g. pairwise) for multiple variables and try to automatically decompose them into multiple dependencies of smaller numbers of variables with algebraic methods.

Biological neurons use complex propagation of action potentials, travelling in both directions of e.g. axons: "it is not uncommon for axonal propagation of action potentials to happen in both directions" [1]. They have access to information from connected neurons, whose complete statistical description is their joint distribution: beside value dependence, it also describes dependencies of e.g. variances (so-called heteroscedasticity, common in financial data) and of other moments, into which we will decompose it with the HCR approach as in Fig. 1. Biological neurons should be evolutionarily optimized to include such additional information whenever beneficial, and are often required to work on distributions, e.g. to estimate uncertainties.

In contrast, arbitrarily chosen popular parametrization types like Multi-Layer Perceptron (MLP) [3], a trained linear combination followed by a fixed nonlinear activation function, or Kolmogorov-Arnold Network (KAN) [4], which additionally trains activation functions, are optimized for propagation in one direction and work only on values, not distributions. Additionally, they are mainly trained by backpropagation, which is rather inaccessible for biological neurons [2]. These differences leave them far from the learning capabilities, flexibility and robustness of biological neural networks, as summarized in Fig. 2. To reduce this difference, we should search for more powerful artificial neurons, like the HCRNN proposed in this article.

To reach both multidirectional propagation and working also with distributions, e.g. for uncertainty estimation, Bayes theorem could be used, e.g. in Bayesian Neural Networks [5], which in practice use relatively simple models. To include a more detailed description of complex dependencies, also of continuous variables, a neuron could model the entire joint distribution of its connections, then substitute and normalize to get conditional distributions without Bayes theorem. However, joint distributions of continuous variables become quite complicated, difficult to describe and handle. Classical approaches are copulas [6], which are low parametric, or kernel density estimation (KDE) [7], which is impractical in higher dimensions.

Hierarchical Correlation Reconstruction (HCR) [8, 9, 10, 11, 12, 13] (HCR introduction: https://community.wolfram.com/groups/-/m/t/3017754), used for such artificial neurons as in Fig. 1, allows overcoming these difficulties by representing the joint distribution of $d$ normalized variables as just a linear combination $\rho(\mathbf{x})=\sum_{\mathbf{j}}a_{\mathbf{j}}\prod_{i=1}^{d}f_{j_{i}}(x_{i})$, where bold font denotes vectors. Using orthonormal polynomials: $\int_{0}^{1}f_{i}(x)f_{j}(x)dx=\delta_{ij}$, the $(a_{\mathbf{j}})$ tensor of coefficients is inexpensive to estimate and update, literally providing hierarchical correlation decomposition with (mixed) moments as in Fig. 4. While generally such a density as a linear combination can get below 0, which is usually repaired by calibration, for neural networks with normalization between layers this issue could be just neglected, substantially reducing computational costs.

Having such $(a_{\mathbf{j}})$ tensor as neuron parameters, we can for example propagate values or probabilities in various directions by various index summations, as in Fig. 1. Restricting to pairwise dependencies by using only $(a_{\mathbf{j}})$ with two nonzero indexes, such propagation formulas become sums of trained polynomials, as in KAN parametrization, but additionally allowing e.g. multidirectional propagation, also of probability distributions.

HCRNN also allows for many additional ways of learning, such as estimation of joint distributions, which can be linearized into products/decompositions of tensors. For biological plausibility, information bottleneck training as in Fig. 3 seems very promising: it directly optimizes the content of intermediate layers to increase mutual information with the next layer, and to decrease it with the previous layer to remove unnecessary information like noise. It is related to the HSIC [14, 15] approach, replacing the kernel-width-dependent local basis with a global polynomial basis for normalized variables.

This article introduces HCR from the perspective of neural network applications, earlier suggested in [8]. This is an early version of the article, to be extended in the future e.g. with practical realizations and examples, e.g. replacing MLP, KAN.

II HCR neural networks (HCRNN)

This main Section introduces HCR and discusses it as a basic building block for neural networks.

II-A Introduction to Hierarchical Correlation Reconstruction

As in copula theory [6], it is convenient to normalize variables to nearly uniform distribution in $[0,1]$. This requires a transformation through the cumulative distribution function (CDF): $x\leftrightarrow\textrm{CDF}(x)$, so that the value becomes its estimated quantile. This CDF can be modeled with some parametric distribution using parameters estimated from the dataset, or be the empirical distribution function: $x$ becomes its position in the dataset rescaled to $(0,1)$. For neural networks, normalization is popularly performed in batches [16]; here it needs to be used between layers (but can be skipped in further linearization). In practice it should be nearly constant between layers, and can be approximated, parameterized e.g. by a Gaussian, put into tables, or inverted for backward propagation.
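As a minimal sketch of the empirical distribution variant of this normalization, the snippet below maps each value to its rescaled rank. The function name `normalize_to_uniform` and the `(rank + 0.5) / n` rescaling are assumptions made for illustration (the text only requires positions rescaled to $(0,1)$), and ties are broken arbitrarily by sort order.

```python
import random

def normalize_to_uniform(values):
    """Empirical CDF normalization: replace each value by its rescaled rank in (0, 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indexes sorted by value
    u = [0.0] * n
    for rank, i in enumerate(order):
        u[i] = (rank + 0.5) / n  # assumed rescaling, keeps results strictly inside (0, 1)
    return u

random.seed(0)
sample = [random.gauss(3.0, 2.0) for _ in range(1000)]  # synthetic, non-uniform data
u = normalize_to_uniform(sample)
```

A parametric CDF (e.g. a fitted Gaussian) could be substituted when the mapping has to be stored or inverted for backward propagation.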

For $d$ normalized variables: $\mathbf{x}\in[0,1]^{d}$ , in HCR we represent joint distribution as a linear combination, conveniently in some product basis $B=B^{+}\cup\{\textbf{0}\}$ with $f_{0}(x)=f_{\mathbf{0}}(\mathbf{x})=1$ :

\rho(\mathbf{x})=1+\sum_{\mathbf{j}\in B^{+}}a_{\mathbf{j}}f_{\mathbf{j}}(\mathbf{x})=\sum_{\mathbf{j}\in B}a_{\mathbf{j}}f_{\mathbf{j}}(\mathbf{x})=\sum_{\mathbf{j}\in B}a_{\mathbf{j}}\prod_{i=1}^{d}f_{j_{i}}(x_{i})

(1)

where $B^{+}=B\backslash\{\textbf{0}\}$ removes zero corresponding to normalization as $f_{0}=1$ , bold fonts denote vectors: $\mathbf{j}=(j_{1},..,j_{d})$ .

Assuming orthonormal basis: $\int_{0}^{1}f_{i}(x)f_{j}(x)dx=\delta_{ij}$ , static estimation [17] (minimizing mean-squared error between kernel density estimation smoothed sample and parametrization) from $\bar{X}$ dataset becomes just:

a_{\mathbf{j}}=\frac{1}{|\bar{X}|}\sum_{\mathbf{x}\in\bar{X}}f_{\mathbf{j}}(\mathbf{x})=\frac{1}{|\bar{X}|}\sum_{\mathbf{x}\in\bar{X}}\prod_{i=1}^{d}f_{j_{i}}(x_{i})

(2)

We assume here an orthonormal polynomial basis (rescaled Legendre), which allows interpreting coefficients as moments of normalized variables, approximately corresponding to expected value, variance, skewness, kurtosis. Independent estimation of each $a_{\mathbf{j}}$ allows freely modifying the basis, e.g. including information from an additional mixed basis. Estimation as just an average allows controlling the uncertainty of the found parameters.
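The static estimation (2) can be sketched as follows, assuming the rescaled Legendre polynomials up to degree 3 as the basis and the full product basis up to degree $m$. The names `f` and `estimate`, and the synthetic dataset, are illustrative assumptions rather than part of the original method description.

```python
import itertools
import math
import random

SQ3, SQ5, SQ7 = math.sqrt(3), math.sqrt(5), math.sqrt(7)

def f(j, x):
    """Rescaled Legendre polynomials, orthonormal on [0, 1], for degrees 0..3."""
    if j == 0:
        return 1.0
    if j == 1:
        return SQ3 * (2 * x - 1)
    if j == 2:
        return SQ5 * (6 * x * x - 6 * x + 1)
    if j == 3:
        return SQ7 * (20 * x**3 - 30 * x * x + 12 * x - 1)
    raise ValueError("only degrees 0..3 are implemented in this sketch")

def estimate(data, m):
    """Eq. (2): a_j is the average of prod_i f_{j_i}(x_i) over the dataset."""
    d = len(data[0])
    a = {}
    for j in itertools.product(range(m + 1), repeat=d):
        total = sum(math.prod(f(ji, xi) for ji, xi in zip(j, x)) for x in data)
        a[j] = total / len(data)
    return a

# Synthetic dependent pair with uniform marginals (an assumption for illustration):
# y copies x with probability 0.7, otherwise it is drawn independently.
random.seed(1)
data = []
for _ in range(20000):
    x = random.random()
    y = x if random.random() < 0.7 else random.random()
    data.append((x, y))

a = estimate(data, m=2)
```

For this synthetic data the exact value of $a_{11}$ is 0.7, since $E[f_{1}(x)f_{1}(y)]=0.7\,E[f_{1}(x)^{2}]$, while the marginal coefficients $a_{10},a_{01}$ should be close to zero.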

Alternatively we could use various trigonometric bases (e.g. discrete cosine/sine transform), localized like B-splines used by KAN, from wavelets, finite elements methods. Alternatively, instead of normalization to uniform in $[0,1]$ , we could use a different normalization e.g. to Gaussian distribution, times Hermite polynomials for orthonormal basis. For discrete variables we can use one-hot encoding, or optimize it like in [13].

As in Fig. 4 and $f_{0}=1$ , $a_{\mathbf{j}}$ coefficients are mixed moments of $\{i:j_{i}\geq 1\}$ variables of nonzero indexes, independent from variables of zero indexes, allowing for literally hierarchical decomposition of statistical dependencies: start with $a_{0..0}=1$ for normalization, add single nonzero index coefficients to describe marginal distributions, then add pairwise dependencies with two nonzero indexes, then triplewise, and so on. For example $a_{2010}$ coefficient would describe dependence between 2nd moment of first variable and 1st moment of 3rd variable among $d=4$ variables. Generally the selection of basis $B$ is a difficult question, e.g. to use only pairwise dependencies up to a fixed moment $m$ , preferably optimized during training, maybe separately for each neuron or layer. Such decomposition also allows to efficiently work with missing data by using to estimate/update/propagate only $a$ coefficients with zero indexes for the missing variables, as zero index means independence from given variable.

While static estimation averages over dataset with equal weights, for dynamic updating we should increase weights of recent values, e.g. using computationally convenient exponential moving average: for some small memory parameter $\lambda$

a_{\mathbf{j}}\xrightarrow{\mathbf{x}}(1-\lambda)a_{\mathbf{j}}+\lambda\prod_{i}f_{j_{i}}(x_{i})

(3)
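A minimal sketch of the update (3), assuming coefficients stored in a dictionary keyed by index tuples and a basis truncated at degree 2; the name `ema_update` and the example points are hypothetical.

```python
import math

def f(j, x):
    """First three rescaled Legendre polynomials, orthonormal on [0, 1]."""
    return (1.0,
            math.sqrt(3) * (2 * x - 1),
            math.sqrt(5) * (6 * x * x - 6 * x + 1))[j]

def ema_update(a, x, lam):
    """Eq. (3): move every coefficient a_j toward prod_i f_{j_i}(x_i) with weight lam."""
    for j in a:
        target = math.prod(f(ji, xi) for ji, xi in zip(j, x))
        a[j] = (1 - lam) * a[j] + lam * target

# Start from the independent uniform model: a_00 = 1 and all other coefficients 0.
a = {(i, j): 0.0 for i in range(3) for j in range(3)}
a[(0, 0)] = 1.0
for point in [(0.2, 0.3), (0.8, 0.9), (0.5, 0.4)]:  # hypothetical stream of normalized points
    ema_update(a, point, lam=0.05)
```

Note that the normalization coefficient $a_{0..0}$ stays equal to 1 under this update, as $f_{0}=1$.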

However, modelling a (joint) probability density as a linear combination can sometimes lead to negative densities. To avoid this issue, calibration is usually used: instead of the modelled density $\rho$, use e.g. $\max(\rho,0.1)$ and divide by its integral to keep the density normalized. However, this makes computations much more complex, especially in higher dimensions; for neural network applications we should be able to ignore this issue to simplify calculations, especially when working on expected values and normalizing between layers. Therefore, we ignore this issue/calibration in this article; however, it should be remembered, and calibration may need to be added for some applications.

II-B Conditional distributions and expected value propagation

Having (1) model of joint distribution, to get conditional distribution we need to substitute known variables and normalize dividing by integral:

\rho(x_{1}|x_{2},\ldots,x_{d})=\frac{\sum_{\mathbf{j}}a_{\mathbf{j}}f_{j_{1}}(x_{1})f_{j_{2}}(x_{2})\ldots f_{j_{d}}(x_{d})}{\int_{0}^{1}\sum_{\mathbf{j}}a_{\mathbf{j}}f_{j_{1}}(x_{1})f_{j_{2}}(x_{2})\ldots f_{j_{d}}(x_{d})dx_{1}}=

=\sum_{j_{1}}f_{j_{1}}(x_{1})\frac{\sum_{j_{2}\ldots j_{d}}a_{j_{1}j_{2}..j_{d}}\,f_{j_{2}}(x_{2})\ldots f_{j_{d}}(x_{d})}{\sum_{j_{2}\ldots j_{d}}a_{0j_{2}..j_{d}}\,f_{j_{2}}(x_{2})\ldots f_{j_{d}}(x_{d})}

(4)

as $\int_{0}^{1}f_{i}(x)dx=\delta_{i0}$. Such sums for pairwise dependencies use only two nonzero $j_{i}$ indexes (input-output), three for triplewise, and so on. The denominator corresponds to normalization: indeed, the fraction becomes 1 for $j_{1}=0$. Examples for $d=2,3$ are shown in Fig. 5, 1. Generally, the numerator sums over all indexes of the known variables, with the current indexes of the predicted variables fixed. The denominator replaces the indexes of the predicted variables with zeros for normalization, and could be removed if there is further (inter-layer) normalization.

Here is example of analogous prediction of conditional joint distributions for multiple (2) variables:

\rho(y,z|x)=\sum_{j_{y}j_{z}}f_{j_{y}}(y)f_{j_{z}}(z)\frac{\sum_{j_{x}}a_{j_{x}j_{y}j_{z}}\,f_{j_{x}}(x)}{\sum_{j_{x}}a_{j_{x}00}\,f_{j_{x}}(x)}

However, working on expected values would remove $y-z$ mixed moments, making $E[y|x],E[z|x]$ independent.

Having such conditional distribution, we can for example calculate expected value e.g. to be propagated by neural networks. For polynomial basis expected values contributions are: $\int_{0}^{1}xf_{0}(x)dx=1/2,\int_{0}^{1}xf_{1}(x)dx=1/\sqrt{12}$ , and zero for higher, leading to formulas including only $i=1$ the first moment as in Fig. 1, e.g.:

E[x|y,z]=\frac{1}{2}+\frac{1}{2\sqrt{3}}\frac{\sum_{jk}a_{1jk}f_{j}(y)f_{k}(z)}{\sum_{jk}a_{0jk}f_{j}(y)f_{k}(z)}

(5)

As a CDF normalization, which both shifts and rescales, is rather required afterwards, in practice it is sufficient to work on such marked numerators.
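Formula (5) can be sketched directly, assuming coefficients stored in a dictionary keyed by $(i,j,k)$ with missing entries treated as zero. The function name `expected_x_given_yz` and the example coefficients are hypothetical, and calibration is ignored as in the text.

```python
import math

def f(j, x):
    """First three rescaled Legendre polynomials, orthonormal on [0, 1]."""
    return (1.0,
            math.sqrt(3) * (2 * x - 1),
            math.sqrt(5) * (6 * x * x - 6 * x + 1))[j]

def expected_x_given_yz(a, y, z, m=2):
    """Eq. (5): E[x|y,z] from a dict a[(i, j, k)]; missing keys count as 0."""
    def contract(i):
        # sum_{jk} a_{ijk} f_j(y) f_k(z) for a fixed first index i
        return sum(a.get((i, j, k), 0.0) * f(j, y) * f(k, z)
                   for j in range(m + 1) for k in range(m + 1))
    numerator, denominator = contract(1), contract(0)
    return 0.5 + numerator / (2 * math.sqrt(3) * denominator)

# Hypothetical coefficients: normalization plus one pairwise x-y dependence.
a = {(0, 0, 0): 1.0, (1, 1, 0): 0.5}
e_high = expected_x_given_yz(a, y=1.0, z=0.3)
e_low = expected_x_given_yz(a, y=0.0, z=0.3)
```

With these coefficients the prediction moves from 0.25 at $y=0$ to 0.75 at $y=1$ and does not depend on $z$; with only the normalization coefficient it stays at $1/2$.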

Restricting it to pairwise dependencies: (single variable of input - single variable of output), similarly to KAN we get summation of trained 1-parameter functions: here polynomials (could be different e.g. B-splines like in KAN) + e.g. approximate fixed CDF for normalization:

{\sum_{jk}a_{1jk}f_{j}(y)f_{k}(z)}\xrightarrow[\textrm{KAN-like}]{\textrm{pairwise only}}\sum_{j}a_{1j0}f_{j}(y)+a_{10j}f_{j}(z)

(6)

However, in comparison to KAN, using the proposed HCRNN parametrization we get multiple advantages:

• it can propagate in any direction (as biological neural networks, BNNs),
• it can propagate values or probability distributions (as BNNs),
• its parameters have an interpretation as mixed moments,
• triplewise and higher dependencies can be consciously added,
• the modeled mutual information is inexpensive to evaluate,
• it offers additional training ways (needed for BNNs), e.g. direct estimation, tensor decomposition, information bottleneck.

II-C Propagation of probability distribution

Let us start with a simple example: that we would like to calculate conditional density like previously:

\rho(x|y)=\sum_{i}f_{i}(x)\frac{\sum_{j}a_{ij}f_{j}(y)}{\sum_{j}a_{0j}f_{j}(y)}

(7)

but for $y$ being from $\rho(y)=\sum_{k}b_{k}\,f_{k}(y)$ probability density. So the propagated probability density of $x$ should be $\int_{0}^{1}\rho(x|y)\rho(y)dy$ . Approximating with constant denominator, using $\int_{0}^{1}f_{j}(y)f_{k}(y)dy=\delta_{jk}$ and finally normalizing, we get:

\rho(x)\leftarrow\int_{0}^{1}\rho(x|y)\rho(y)dy\approx\sum_{i}f_{i}(x)\frac{\sum_{j}a_{ij}b_{j}}{\sum_{j}a_{0j}b_{j}}

(8)

Such constant denominator approximation allows to propagate (in any direction) through HCR neurons not only values, but also entire probability distributions - by just replacing $f_{j}(y)$ for concrete value of $y$ , with $b_{j}$ describing its probability distribution. It is easy to generalize, e.g. for $\rho(x|y,z)$ we could replace $f_{j}(y)f_{k}(z)$ with $b_{jk}$ when $\rho(y,z)=\sum_{jk}b_{jk}f_{j}(y)f_{k}(z)$ :

\sum_{i}f_{i}(x)\frac{\sum_{jk}a_{ijk}f_{j}(y)f_{k}(z)}{\sum_{jk}a_{0jk}f_{j}(y)f_{k}(z)}\quad\underset{\textrm{density}}{\overset{\textrm{value}}{\leftrightarrows}}\quad\sum_{i}f_{i}(x)\frac{\sum_{jk}a_{ijk}b_{jk}}{\sum_{jk}a_{0jk}b_{jk}}
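The single-variable case (8) reduces to a matrix-vector product followed by a division. The sketch below assumes coefficients stored as nested lists; the name `propagate_density` and the numerical values are hypothetical.

```python
def propagate_density(a, b):
    """Eq. (8): coefficients c_i of rho(x) = sum_i c_i f_i(x), given a[i][j] and rho(y) = sum_j b_j f_j(y).

    Uses the constant denominator approximation from the text.
    """
    denominator = sum(a0j * bj for a0j, bj in zip(a[0], b))
    return [sum(aij * bj for aij, bj in zip(row, b)) / denominator for row in a]

# Hypothetical 3x3 coefficient matrix a_ij with uniform marginals (a_00 = 1, a_i0 = a_0j = 0).
a = [[1.0, 0.0, 0.0],
     [0.0, 0.5, 0.1],
     [0.0, 0.1, 0.2]]
b = [1.0, 0.4, -0.1]          # rho(y) = 1 + 0.4 f_1(y) - 0.1 f_2(y)
c = propagate_density(a, b)   # coefficients of the propagated rho(x)
```

Passing the uniform density `b = [1, 0, 0]` returns the marginal of $x$, here also uniform, while passing `b[j] = f_j(y)` for a concrete $y$ reproduces the value-based formula (7).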

II-D Tensor decomposition and linearization

Analogously for intermediate layers like in bottom of Fig. 1:

A^{k}_{i_{1}i_{2}i_{3}i_{4}}\approx\sum_{j_{1},j_{2}}a^{j_{1}}_{i_{1}i_{2}}\,b^{j_{2}}_{i_{3}i_{4}}c^{k}_{j_{1}j_{2}}\qquad\textrm{as}\qquad\begin{array}{c}c^{k}\\ \Sigma_{j_{1}}\wedge\Sigma_{j_{2}}\\ a_{i_{1}i_{2}}\quad b_{i_{3}i_{4}}\end{array}

integrating over the intermediate variables, approximating with constant denominator and normalizing at the end, thanks to basis orthogonality we get Kronecker deltas enforcing equality of intermediate indexes, leading to condition for approximation of higher order tensors with lower order ones, which is generally studied by tensor decomposition field [18] - hopefully leading to better training approaches.

While neural networks require nonlinearity, such tensor approach allows to linearize its intrinsic behavior: calculate nonlinearities in some basis e.g. polynomials only for the outer inputs/outputs $(f_{\mathbf{j}}(\mathbf{x}):\mathbf{j}\in B,\mathbf{x}\in\bar{X})$ , including multivariate dependencies. Then treat the entire neural network as a linear transformation of such features (normalization only at the end - no need for inter-layer), e.g. just changing indexes (like transposition) to modify propagation direction. However, it contains this constant denominator approximation, and it would become a tensor of exponentially increasing size if including all dependencies - we should combine it with reductions like tensor decomposition, information bottleneck - working on linear approximations to reduce dimension.

II-E Basis optimization and selection

Another direction is application of the found $a_{\mathbf{j}}$ coefficients, for example to optimize the arbitrarily chosen $\{f_{i}\}$ basis in order to reduce the number of considered coefficients and also overfitting issues, as discussed e.g. in [13, 19]. For this purpose we can for example treat the current coefficients as a rectangular matrix $M_{j_{1},j_{2}..j_{d}}:=a_{\mathbf{j}}$, with the first index $j_{1}$ as the row and the remaining indexes grouped together as the column, over all considered coefficients in the basis. Now we can use SVD (singular value decomposition): find the orthonormal eigenbasis of $MM^{T}=\sum_{i}\sigma_{i}\mathbf{u}_{i}\mathbf{u}_{i}^{T}$ and use $g_{i}=\sum_{j}u_{ij}f_{j}$ as the new basis for one or a few dominant eigenvectors. We can proceed similarly for the remaining variables, getting separate or common optimized bases for them.
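A possible NumPy sketch of this basis optimization for the first variable is given below. The random tensor stands in for estimated coefficients, and the names `W`, `new_basis` and the choice `r = 2` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 3
# Hypothetical coefficient tensor a_j for d = 3 variables, degrees 0..m (random stand-in for estimates).
a = rng.normal(scale=0.1, size=(m + 1,) * d)

# Matrix M: first index as rows, remaining indexes flattened into columns.
M = a.reshape(m + 1, -1)

# Left singular vectors of M are eigenvectors of M M^T; squared singular values are its eigenvalues.
U, s, _ = np.linalg.svd(M, full_matrices=False)

r = 2                 # number of dominant directions to keep (an assumed choice)
W = U[:, :r].T        # W[i, j] = u_ij, so that g_i = sum_j u_ij f_j

def new_basis(fx):
    """Map old basis values (f_0(x), ..., f_m(x)) to the new ones (g_1(x), ..., g_r(x))."""
    return W @ fx
```

As the rows of `W` are orthonormal, the new functions $g_{i}$ remain orthonormal on $[0,1]$, so the estimation formula (2) can still be used with them.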

A more difficult question is basis selection - which $\mathbf{j}\in B$ indexes to use in considered linear combinations for each neuron. Extending all to $m$ -th moment/degree for $d$ variables, we would need $(m+1)^{d}=\sum_{k=0}^{d}{d\choose k}m^{k}$ coefficients: 1 for normalization, $dm$ for marginal distributions, $d(d-1)m^{2}/2$ for pairwise, and so on. With proper normalization the coefficients for marginal distributions should be close to 0 - can be neglected. To reduce the number of coefficients, we can restrict e.g. up to pairwise dependencies ( $\approx$ KAN). Generally we can e.g. calculate more coefficients and during training discard those close to zero. Using optimized bases as above, should allow to reduce their size.
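The coefficient counts above can be checked with a few lines; the helper name `coefficient_counts` is hypothetical.

```python
from math import comb

def coefficient_counts(d, m):
    """Number of coefficients involving exactly k variables, for k = 0..d."""
    return [comb(d, k) * m**k for k in range(d + 1)]

counts = coefficient_counts(d=4, m=3)
total = sum(counts)
```

For $d=4$, $m=3$ this gives 1, 12, 54, 108, 81 coefficients for $k=0..4$, totalling $256=4^{4}$; restricting to pairwise dependencies and neglecting marginals leaves only the 54 pairwise ones plus normalization.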

II-F Some HCRNN training approaches

A single HCR neuron models multidimensional joint distribution, what is already quite powerful. However, for neural networks the main difficulty is training the intermediate layers. Here are some approaches:

• Treat HCRNN as just a parametrization and use standard backpropagation as for other ANNs, e.g. MLP, KAN. It can be mixed with other techniques, e.g. static parameter estimation/update from recent values, online basis optimization and selection, or just optimizing initial parameters to improve the further main optimization.
• Maybe find initial intermediate values by dimensionality reduction like PCA of $\{f_{\mathbf{j}}(x):\mathbf{j}\in B\}$ vectors of features as (nonlinear) products of functions of inputs, further extended into the information bottleneck approach.
• Maybe use propagation of e.g. expected values in both directions, combined with coefficient estimation/update.
• Maybe use some tensor decomposition techniques: start with estimation of e.g. pairwise dependencies for a larger set of variables, and use algebraic methods to try to approximate it with multiple lower order tensors.

While backpropagation is available for various parametrizations, such HCRNN with neurons containing joint distribution models brings some additional possibilities, hopefully allowing for faster training, especially with the information bottleneck approach discussed further, which looks the most promising.

Coefficients of such trained HCRNN remain mixed moments - providing dependency interpretation between input/output and hidden intermediate variables, allowing for multidirectional propagation of values or distributions like in Fig. 6, and its parameters can be further continuously updated e.g. using (3).

III Information bottleneck based training

Let us consider $l$ hidden layer neural network intended to predict $Y$ from $X$ with $\{\theta^{i}\}_{i=0..l}$ sets of parameters:

Y\quad\textrm{label of}\quad X\xrightarrow{\theta^{0}}T^{1}\xrightarrow{\theta^{1}}\ldots\xrightarrow{\theta^{l-1}}T^{l}\xrightarrow{\theta^{l}}\hat{Y}

(9)

While the standard training approach is focused on optimization of neuron parameters $\theta$ used to process the data, alternatively we could try to directly optimize $T^{i}$ distributions of hidden/intermediate layers, in practice: dataset processed through some of layers (here could be in both directions) - like in image recognition: first layers extract low level features like edges, then some intermediate features, and finally e.g. faces.

Information bottleneck approach [20, 21] suggests how to directly optimize the intermediate layers. It uses information theory - offering nearly objective evaluation, being invariant to variable permutations, bijections. $I(X;Y)=H(X)+H(Y)-H(X,Y)$ mutual information is the number of bits (or nits) $X$ on average gives about $Y$ , or $Y$ about $X$ .

Optimizing an intermediate layer $T^{i}$, obviously it should maximize information about the predicted $Y$: $\max_{T}I(T;Y)$. However, focusing only on prediction would rather lead to overfitting. To prevent that, it turns out that neural networks also perform compression: $\min_{T}I(X;T)$, the removal of unnecessary information/noise from the input. Finally, the information bottleneck approach, for some $\beta>0$, assumes optimization of:

\inf_{T}\left(I(X;T)-\beta I(T;Y)\right)

(10)

III-A Information theory view on HCR

Mutual information has excellent properties; however, it is relatively difficult to calculate, as it requires a model of the joint distribution. HCR assumes the $\rho(\mathbf{x})=\sum_{\mathbf{j}\in B}a_{\mathbf{j}}\,f_{\mathbf{j}}(\mathbf{x})$ model of joint distribution. Using natural logarithm (information in nits) with the first order approximation $\ln(1+a)\approx a$, and orthogonality of the basis, we get a simple practical approximation:

H(X)=-\int_{[0,1]^{d}}\rho(\mathbf{x})\ln(\rho(\mathbf{x}))d\mathbf{x}\approx-\sum_{\mathbf{j}\in B^{+}}(a_{\mathbf{j}})^{2}

(11)

For joint distribution of $(X,Y)$ (multivariate) variables, denoting $B_{X},B_{Y}$ as bases used for individual variables, the approximate mutual information formula becomes:

I(X;Y)\approx\sum_{\mathbf{j}_{x}\in B_{X}^{+}}\ \sum_{\mathbf{j}_{y}\in B_{Y}^{+}}\left(a_{(\mathbf{j}_{x},\mathbf{j}_{y})}\right)^{2}

(12)

Hence this $\ln(1+a)\approx a$ approximation allows to evaluate mutual information by just summing squared nontrivial coefficients, mixed moments between the two variables.

Let us now add estimation of these coefficients/mixed moments which will become $\theta$ neuron parameters. Denoting $\{\mathbf{x}^{i}\}_{i=1..n},\{\mathbf{y}^{i}\}_{i=1..n}$ as the current batch from dataset, denote

\bar{X}=\frac{1}{\sqrt{n}}(f_{\mathbf{j}}(\mathbf{x}^{i}))_{i=1..n,\mathbf{j}\in B^{+}_{X}}\qquad\bar{Y}=\frac{1}{\sqrt{n}}(f_{\mathbf{j}}(\mathbf{y}^{i}))_{i=1..n,\mathbf{j}\in B^{+}_{Y}}

(13)

as $n\times|B^{+}_{X}|,n\times|B^{+}_{Y}|$ matrices containing features of vectors from the batch - values in the chosen basis, or alternatively they could be interpreted as description of propagated probability distribution, e.g. $\rho(\mathbf{x})=1+\sum_{\mathbf{j}\in B^{+}_{X}}a_{\mathbf{j}}\,f_{\mathbf{j}}(\mathbf{x})$.

Having above $\bar{X},\bar{Y}$ matrices of features, we can directly MSE estimate (2) parameters for all $B_{X}^{+}\times B_{Y}^{+}$ pairs, mutual information:

(a_{(j_{x},j_{y})}:j_{x}\in B_{X}^{+},j_{y}\in B_{Y}^{+})=\bar{X}^{T}\bar{Y}

(14)

I(X;Y)\approx\textrm{Tr}\left(\bar{X}^{T}\bar{Y}(\bar{X}^{T}\bar{Y})^{T}\right)=\textrm{Tr}(C_{X}C_{Y})

for $C_{X}=\bar{X}\bar{X}^{T},C_{Y}=\bar{Y}\bar{Y}^{T}$ matrices of size $n\times n$ , containing scalar products of points in batch as vectors of features. If subtracting the means earlier, they resemble covariance matrices of the used features - this subtraction might be worth including, but generally means should be close to zero. Also, instead of covariance matrix, this is rather $n\times n$ similarity matrix inside the size $n$ batch using the chosen features.
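The equivalence of the coefficient form (12) and the trace form can be checked numerically. The sketch below assumes single-variable rescaled Legendre features up to degree 2 and a synthetic dependent pair; the helper name `features` is hypothetical.

```python
import numpy as np

def features(u, m):
    """Columns: f_1..f_m (rescaled Legendre, m <= 2 here) of each column of u, divided by sqrt(n)."""
    n = u.shape[0]
    polys = [np.sqrt(3) * (2 * u - 1), np.sqrt(5) * (6 * u**2 - 6 * u + 1)][:m]
    return np.hstack(polys) / np.sqrt(n)

rng = np.random.default_rng(0)
n = 1000
x = rng.random((n, 1))
# Synthetic dependent pair: y copies x with probability 0.7, otherwise it is independent.
y = np.where(rng.random((n, 1)) < 0.7, x, rng.random((n, 1)))

Xb, Yb = features(x, 2), features(y, 2)
A = Xb.T @ Yb                                    # eq. (14): estimated mixed moments
mi_coeff = np.sum(A**2)                          # eq. (12): sum of squared coefficients
mi_trace = np.trace((Xb @ Xb.T) @ (Yb @ Yb.T))   # Tr(C_X C_Y) on n x n matrices
```

Both values agree up to rounding; the coefficient form costs $O(n|B_{X}^{+}||B_{Y}^{+}|)$, while the trace form builds $n\times n$ matrices and is mainly useful for the derivation that follows.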

III-B Information bottleneck training

Using the above approximation, information bottleneck training of $X\to T\to Y$ for some hidden intermediate layer $T$ , as trace is linear and cyclic, becomes (similar to nHSIC in [15]):

\inf_{C_{T}\geq 0}(\textrm{Tr}(C_{X}C_{T})-\beta\textrm{Tr}(C_{T}C_{Y}))=\inf_{C_{T}\geq 0}\textrm{Tr}(C_{T}(C_{X}-\beta C_{Y}))

(15)

for some symmetric $C_{T}=\bar{T}\bar{T}^{T}$ with nonnegative spectrum, maybe earlier subtracting the means to make it a covariance matrix. Decomposing it as $C_{T}=ODO^{T}$ for orthogonal $O$ and $D=\textrm{diag}(\Lambda_{1},\ldots,\Lambda_{n})$, the optimized expression becomes:

\textrm{Tr}\left(D\,O^{T}(C_{X}-\beta C_{Y})O\right)=\sum_{i=1}^{n}\Lambda_{i}\,(O^{T}(C_{X}-\beta C_{Y})O)_{ii}

Denoting $M=O^{T}(C_{X}-\beta C_{Y})O$, the above becomes just $\sum_{i=1}^{n}\Lambda_{i}M_{ii}$. Minimizing it over $\Lambda_{i}\geq 0$, we could get to minus infinity for negative $M_{ii}$. Hence it is crucial to add regularization, e.g. $l^{2}$: adding $\textrm{Tr}(C_{T}^{2})=\sum_{i}\Lambda_{i}^{2}$ times some $1/(2\eta)>0$ to the minimized expression:

\inf_{(\Lambda_{i})}\ \sum_{i=1}^{n}\Lambda_{i}M_{ii}+\frac{1}{2\eta}\sum_{i}\Lambda_{i}^{2}

(16)

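For nonnegative $M_{ii}$, i.e. $\{i:M_{ii}\geq 0\}$, minimization requires $\Lambda_{i}=0$. For negative $M_{ii}$, i.e. $\{i:M_{ii}<0\}$, setting the derivative $M_{ii}+\Lambda_{i}/\eta$ to zero gives $\Lambda_{i}=-\eta M_{ii}$, contributing $-\eta M_{ii}^{2}/2$ to the minimum.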

It remains to optimize the rotation $O$. With the above $\Lambda_{i}$ optimization, the minimized (16) becomes:

\inf_{O:O^{T}O=I}-\frac{\eta}{2}\sum_{i:(O^{T}(C_{X}-\beta C_{Y})O)_{ii}<0}((O^{T}(C_{X}-\beta C_{Y})O)_{ii})^{2}

(17)

Eigendecomposition $C_{X}-\beta C_{Y}=O\,\textrm{diag}(\lambda_{i})\,O^{T}$ allows to choose $O$ for optimized $C_{T}=O\,\textrm{diag}(\Lambda_{i})\,O^{T}$ , which makes $M_{ii}=\lambda_{i}$ . As above, its optimal eigenvalues should be chosen as $\Lambda_{i}=\max(0,-\eta\lambda_{i})$ .

Choosing $\bar{T}=(O\,\textrm{diag}(\sqrt{\max(0,-\eta\lambda_{i})}))_{i=1..n,j=1..k}$ for sorted $\lambda_{1}\leq\ldots\leq\lambda_{n}$ and $\lambda_{k}<0,\lambda_{k+1}\geq 0$ , the matrix $C_{T}=\bar{T}\bar{T}^{T}$ will be as required, making hidden layer represented as size $k$ vectors of features.

For such $\bar{T}$ content of the hidden layer, we can estimate the transition parameters as in (14): $\bar{X}^{T}\bar{T}$ for $X\to T$ and $\bar{T}^{T}\bar{Y}$ for $T\to Y$. Multiplying both we get $\bar{X}^{T}\bar{T}\bar{T}^{T}\bar{Y}$ for the indirect connection, while for the direct connection we got $\bar{X}^{T}\bar{Y}$. Both should be comparable, suggesting to use just $\Lambda_{i}=1$ for $i=1..k$ and 0 for the rest: content of the $T$ layer as just a projection to eigenvectors of $C_{X}-\beta C_{Y}$ corresponding to negative eigenvalues.

However, such $k$ number of features could be comparable to $n$ batch size - can be impractically large. We can decrease it by reducing $\beta$ . Alternatively we could fix $\beta$ and use a smaller e.g. fixed $k$ , or $\lambda$ below some negative threshold, or up to some large jump in eigenspectrum.

To summarize the suggested procedure: information bottleneck HCR training for $X\to T\to Y$ of the $T$ intermediate layer with parameter $\beta$ and $k$ features:

1. Prepare matrices of features (if not done earlier):

\bar{X}=\frac{1}{\sqrt{n}}(f_{\mathbf{j}}(\mathbf{x}^{i}))_{i=1..n,\mathbf{j}\in B^{+}_{X}},\quad\bar{Y}=\frac{1}{\sqrt{n}}(f_{\mathbf{j}}(\mathbf{y}^{i}))_{i=1..n,\mathbf{j}\in B^{+}_{Y}}

2. Perform eigendecomposition with $\lambda_{1}\leq\lambda_{2}\leq\ldots\leq\lambda_{n}$:

\bar{X}\bar{X}^{T}-\beta\bar{Y}\bar{Y}^{T}=O\,\textrm{diag}(\lambda_{i})\,O^{T}

3. Take $\bar{T}=O_{i=1..n,j=1..k}$, the projection to the $k$ lowest eigenvectors, as content of the $T$ layer: $k$ features for the size $n$ batch.

4. If needed, neuron weights can be calculated/estimated as in (14): $\bar{X}^{T}\bar{T}$ for $X\to T$ and $\bar{T}^{T}\bar{Y}$ for $T\to Y$.
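The four steps above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: single-variable rescaled Legendre features up to degree 2 as $B^{+}$, synthetic data, rank-based normalization of the target, and arbitrary $\beta=5$, $k=2$; the names `features` and `ib_layer` are hypothetical.

```python
import numpy as np

def features(u, m=2):
    """Single-variable features f_1..f_m (rescaled Legendre) of each column, over sqrt(n)."""
    n = u.shape[0]
    polys = [np.sqrt(3) * (2 * u - 1), np.sqrt(5) * (6 * u**2 - 6 * u + 1)][:m]
    return np.hstack(polys) / np.sqrt(n)

def ib_layer(Xb, Yb, beta, k):
    """Steps 2-4: eigendecompose C_X - beta C_Y and keep the k lowest eigenvectors."""
    C = Xb @ Xb.T - beta * (Yb @ Yb.T)
    lam, O = np.linalg.eigh(C)      # eigenvalues in ascending order
    Tb = O[:, :k]                   # content of the T layer: k features per batch element
    return Tb, Xb.T @ Tb, Tb.T @ Yb, lam

rng = np.random.default_rng(0)
n = 200
x = rng.random((n, 3))                                          # synthetic normalized inputs
y_raw = x[:, :1] + x[:, 1:2] + 0.2 * rng.normal(size=(n, 1))    # synthetic target
y = (np.argsort(np.argsort(y_raw, axis=0), axis=0) + 0.5) / n   # rank normalization to (0, 1)

Xb, Yb = features(x), features(y)                               # step 1
Tb, w_xt, w_ty, lam = ib_layer(Xb, Yb, beta=5.0, k=2)
```

Here `w_xt` and `w_ty` play the role of the $X\to T$ and $T\to Y$ transition parameters, and inspecting `lam` helps to choose $k$, e.g. by counting negative eigenvalues.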

III-C Remarks, applications of information bottleneck training

•

The found $k$ features on size $n$ batch are abstract - there is no need to choose basis, however, choosing a size $k$ basis we would get interpretation of content of this intermediate layer as values or rather probability distribution. Working on such abstract values, they are no longer normalized, seems we can skip CDF normalization layers.
•

Diagonalized $\bar{X}\bar{X}^{T}-\beta\bar{Y}\bar{Y}^{T}$ is $n\times n$ matrix of distances in size $n$ batch as scalar products of feature vectors. Cost of its operations grows with the batch size - in practice we should split dataset into batches and e.g. estimate network parameters and average them over batches, or update the previous ones using exponential moving average: $\theta=\mu\theta+(1-\mu)\bar{X}^{T}\bar{T}$ , for example visiting various layers/neurons in some order using a given batch, maybe varying $\beta,\eta,k$ parameters.
•

For practical training we can optimize/update intermediate layers for $T$ in various ways, e.g. start with $X$ input, $Y$ output to optimize intermediate $T$ layer, then optimize another intermediate layer between $T$ and $Y$ adding e.g. pairwise features of $T$ (increase $\bar{T}$ to add nonlinearity), and so on recursively. Or we can optimize for succeeding layers $T_{i-1}\to T_{i}\to T_{i+1}$ . Also standard techniques like convolution, pooling can be added. The finally found parameters can be further improved with a different technique like backpropagation.
•

The approach found here differs from the one for Gaussian variables [22], which is close to canonical correlation analysis. Maybe it is worth considering some intermediate approach, e.g. subtracting means to make $C_{X}$ closer to a covariance matrix, improving on the use of the $\ln(1+a)\approx a$ approximation, etc.
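
The exponential moving average update over batches mentioned in the remarks above can be sketched as follows. This is a hedged illustration only: `estimate_weights` is a hypothetical callback assumed to return the per-batch estimate $\bar{X}^{T}\bar{T}$ , and the sign alignment step is an assumption of this sketch (eigenvectors are defined only up to sign, so per-batch estimates could otherwise cancel when averaged); it is not part of the procedure described in the text.

```python
import numpy as np

def ema_update(theta, estimate, mu):
    """One step theta = mu*theta + (1-mu)*estimate; first batch just initializes."""
    if theta is None:
        return np.array(estimate, dtype=float)
    return mu * theta + (1.0 - mu) * estimate

def align_signs(estimate, reference):
    """Flip columns of `estimate` whose scalar product with `reference` is negative.

    Assumption of this sketch, not taken from the text.
    """
    signs = np.sign(np.sum(estimate * reference, axis=0))
    signs[signs == 0] = 1.0  # leave orthogonal or empty columns untouched
    return estimate * signs

def train_over_batches(batches, estimate_weights, mu=0.9):
    """Update the parameters theta batch by batch with an exponential moving average."""
    theta = None
    for batch_x, batch_y in batches:
        estimate = estimate_weights(batch_x, batch_y)  # hypothetical: X^T T for this batch
        if theta is not None:
            estimate = align_signs(estimate, theta)
        theta = ema_update(theta, estimate, mu)
    return theta
```

A larger $\mu$ gives a slower, smoother update; plain averaging over batches is the alternative mentioned in the same remark.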

IV Conclusions and further work

Neurons with joint distribution models seem a powerful, agnostic improvement over currently popular guessed parametrizations like MLP or KAN, and are practically accessible with HCR, up to omnidirectional neurons as in Fig. 6, which allow one to freely choose inference directions and to propagate both values and probability distributions, with clear coefficient interpretations.

As in Fig. 2, 3, BNNs are qualitatively superior to current ANNs, also have multidirectional propagation, including of probabilities, and need different training than standard backpropagation. The proposed new ANNs allow to catch up with such low-level behavior while still looking biologically plausible, e.g. KAN-like, hopefully allowing to also get closer to high-level behavior, maybe recreating the mathematics hidden in BNN behavior.

However, mastering such a new neural network architecture will require a lot of work, planned also for future versions of this article. Here are some basic research directions:

•

Search for practical applications: from replacement of standard ANNs, through multidirectional inference e.g. in Bayes-like scenarios and neural networks propagating probability distributions, up to exploration of similarity to, or replacement for, biological neurons.
•

Practical implementation, optimization especially of training and update, basis optimization and selection techniques, exploration of tensor decomposition approach.
•

Working on probability distributions makes information-theoretic approaches natural, like information bottleneck [20] optimization of intermediate layers, also hopefully leading to better understanding e.g. of information propagation during learning/inference, information held by intermediate layers, etc.
•

Adding time dependence like model update, also for similarity with biological neurons, e.g. long term potentiation, connection to various periodic processes/clocks.
•

While the discussed neurons containing joint distribution models seem very powerful and flexible, they have various issues when working directly in high dimensions. This suggests directly predicting conditional distributions with HCR parametrization instead ([9, 10, 12, 13]), which might also be worth including in neural networks, e.g. as a part of the training process, to be decomposed into single neurons.

References

[1] R. Follmann, E. Rosa Jr, and W. Stein, “Dynamics of signal propagation and collision in axons,” Physical Review E, vol. 92, no. 3, p. 032707, 2015.
[2] B. A. Richards, T. P. Lillicrap, P. Beaudoin, Y. Bengio, R. Bogacz, A. Christensen, C. Clopath, R. P. Costa, A. de Berker, S. Ganguli et al., “A deep learning framework for neuroscience,” Nature neuroscience, vol. 22, no. 11, pp. 1761–1770, 2019.
[3] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural networks, vol. 2, no. 5, pp. 359–366, 1989.
[4] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark, “KAN: Kolmogorov-Arnold networks,” arXiv preprint arXiv:2404.19756, 2024.
[5] I. Kononenko, “Bayesian neural networks,” Biological Cybernetics, vol. 61, no. 5, pp. 361–370, 1989.
[6] F. Durante and C. Sempi, “Copula theory: an introduction,” in Copula theory and its applications. Springer, 2010, pp. 3–31.
[7] G. R. Terrell and D. W. Scott, “Variable kernel density estimation,” The Annals of Statistics, pp. 1236–1265, 1992.
[8] J. Duda, “Hierarchical correlation reconstruction with missing data, for example for biology-inspired neuron,” arXiv preprint arXiv:1804.06218, 2018.
[9] J. Duda and A. Szulc, “Social benefits versus monetary and multidimensional poverty in Poland: Imputed income exercise,” in International Conference on Applied Economics. Springer, 2019, pp. 87–102, preprint: https://arxiv.longhoe.net/abs/1812.08040.
[10] J. Duda, H. Gurgul, and R. Syrek, “Modelling bid-ask spread conditional distributions using hierarchical correlation reconstruction,” Statistics in Transition New Series, vol. 21, no. 5, 2020, preprint: https://arxiv.longhoe.net/abs/1911.02361.
[11] J. Duda and G. Bhatta, “Gamma-ray blazar variability: new statistical methods of time-flux distributions,” Monthly Notices of the Royal Astronomical Society, vol. 508, no. 1, pp. 1446–1458, 2021.
[12] J. Duda and S. Podlewska, “Prediction of probability distributions of molecular properties: towards more efficient virtual screening and better understanding of compound representations,” Molecular Diversity, pp. 1–12, 2022.
[13] J. Duda and G. Bhatta, “Predicting conditional probability distributions of redshifts of active galactic nuclei using hierarchical correlation reconstruction,” Monthly Notices of the Royal Astronomical Society, p. stae963, 2024.
[14] R. Pogodin and P. Latham, “Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks,” Advances in Neural Information Processing Systems, vol. 33, pp. 7296–7307, 2020.
[15] W.-D. K. Ma, J. Lewis, and W. B. Kleijn, “The HSIC bottleneck: Deep learning without back-propagation,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, 2020, pp. 5085–5092.
[16] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning. PMLR, 2015, pp. 448–456.
[17] J. Duda, “Rapid parametric density estimation,” arXiv preprint arXiv:1702.02144, 2017.
[18] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
[19] J. Duda, “Fast optimization of common basis for matrix set through common singular value decomposition,” arXiv preprint arXiv:2204.08242, 2022.
[20] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000.
[21] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in 2015 IEEE Information Theory Workshop (ITW). IEEE, 2015, pp. 1–5.
[22] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss, “Information bottleneck for Gaussian variables,” Advances in Neural Information Processing Systems, vol. 16, 2003.
