The Impact of Feature Representation
on the Accuracy of Photonic Neural Networks

Mauricio Gomes de Queiroz, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France
Paul Jimenez, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France
Raphael Cardoso, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France
Mateus Vidaletti Costa, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France; School of Engineering, RMIT University, Melbourne, VIC 3000, Australia
Mohab Abdalla, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France; School of Engineering, RMIT University, Melbourne, VIC 3000, Australia
Ian O’Connor, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France
Alberto Bosio, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France
Fabio Pavanello, Univ. Grenoble Alpes, Univ. Savoie Mont Blanc, CNRS, Grenoble INP, CROMA, 38000, Grenoble, France

(June 28, 2024)

Abstract

Photonic Neural Networks (PNNs) are gaining significant interest in the research community due to their potential for high parallelization, low latency, and energy efficiency. PNNs compute using light, which leads to several differences in implementation when compared to electronics, such as the need to represent input features in the photonic domain before feeding them into the network. In this encoding process, it is common to combine multiple features into a single input to reduce the number of inputs and associated devices, leading to smaller and more energy-efficient PNNs. Although this alters the network’s handling of input data, its impact on PNNs remains understudied. This paper addresses this open question, investigating the effect of commonly used encoding strategies that combine features on the performance and learning capabilities of PNNs. Here, using the concept of feature importance, we develop a mathematical methodology for analyzing feature combination. Through this methodology, we demonstrate that encoding multiple features together in a single input determines their relative importance, thus limiting the network’s ability to learn from the data. Given some prior knowledge of the data, however, this can also be leveraged for higher accuracy. By selecting an optimal encoding method, we achieve up to a 12.3% improvement in accuracy of PNNs trained on the Iris dataset compared to other encoding techniques, surpassing the performance of networks where features are not combined. These findings highlight the importance of carefully choosing the encoding to the accuracy and decision-making strategies of PNNs, particularly in size or power constrained applications.

I Introduction

Figure 1: (a) Representation of a generic neural network. An arbitrary layer $l$ is highlighted. (b) Schematic of the photonic implementation of a neural network layer using meshes of Mach-Zehnder Interferometers (MZIs). (c) An MZI and its two phase shifters $(\phi,2\theta)$ are illustrated alongside the transfer matrix representation of the transformation it performs over the field amplitude.

Artificial Intelligence (AI) systems have gained widespread relevance in recent years Dong, Wang, and Abbas (2021), finding diverse applications ranging from image classification Simonyan and Zisserman (2014) to speech recognition Purwins et al. (2019). These systems have traditionally been implemented on electronic hardware, benefiting from the steady performance improvements driven by the miniaturization of electronic integrated circuits. However, with components now shrinking to the atomic scale, the limitations of this platform are becoming apparent Theis and Wong (2017). At this size, for example, quantum effects may disrupt functionality Powell (2008), and the heat from densely packed devices becomes hard to dissipate Leiserson et al. (2020). In response, new technologies are being explored to enable further improvements in AI. These emerging technologies are often not subject to the same constraints as their electronic counterparts, and thus might offer more efficient alternatives for certain applications Waldrop (2016).

Photonic Neural Networks (PNNs) are hardware implementations of AI systems that perform computations on optical signals, rather than on electronic ones. By computing with light, they can leverage several of its properties to potentially enable high parallelization, low latency, and reduced power consumption Shen et al. (2017). For example, PNNs have been demonstrated to perform sub-nanosecond image classification Ashtiani, Geers, and Aflatouni (2022) and to achieve up to $10^{12}$ Multiply-Accumulate operations per second Xu et al. (2021). However, transitioning from electronics to photonics remains challenging. Practical applications of medium to large-scale systems are currently limited by the large physical footprint of photonic circuits Shibata et al. (2008); Xiao et al. (2021), their loss accumulation, and the high power consumption of some of their electro-optic devices Tait (2022).

One way of alleviating these issues is by optimizing circuits Mourgias-Alexandris et al. (2022); de Queiroz et al. (2023), or carefully designing PNNs to minimize circuit size. A common practice found in the literature involves taking advantage of the complex representation of light (using amplitude and phase) to represent multiple features in a single input, thus combining multiple real-valued features into fewer complex-valued inputs. By using fewer inputs, a circuit requires fewer components and a smaller network, which leads to a reduction in overall footprint. This technique aligns well with the capabilities of photonic circuits, which are able to process complex inputs through complex transformations Zhang et al. (2021).

However, the way we represent features in Neural Networks (NNs) greatly influences the difficulty of the problems they solve. For example, in tasks with radial symmetry centered around the origin, opting for a polar coordinate system can emphasize the relevant feature relationships necessary for accurately solving the task, short-cutting the network’s need to learn it. This approach can significantly reduce the computational complexity required for achieving high accuracy. Moreover, the choice of feature representation also shapes the network’s approaches to solve tasks, as NNs tend to rely on the most straightforward cues available within the data Geirhos et al. (2020). This highlights the need to understand which feature relationships are emphasized by the representation strategies used in PNNs. By doing so, we can ensure that these networks not only achieve high accuracy, but also adopt desirable decision-making strategies.

In this paper, we explore the role of feature representation in the accuracy and decision-making strategies of PNNs. We investigate the common practice of combining various features into a single input, using eXplainable AI (XAI) methods to compare the relative importances of the combined features. To our knowledge, only one work has investigated this practice as a means of improving accuracy in PNNs Qiu et al. (2024). However, the consequences of the feature combination itself are still unknown, and different feature representations have not been explored. Our work tackles these open questions with a mathematical analysis of feature combination focused on photonic implementations, where networks and circuits are constrained by size. We point out how different data representations and hardware implementations can be exploited for higher accuracy and lower complexity, as well as the shortcomings of current solutions.

The rest of this paper is structured as follows: in Sections II and III we review the basics of photonic implementations of AI and feature importance metrics. In Section IV, we calculate the relative importance of features that share the same input. Sections V and VI discuss practical examples and simulations of Artificial Neural Networks (ANNs) and PNNs. Finally, Section VII concludes the discussions brought up in this paper.

II Photonic Neural Networks

In this section, we provide a review of ANNs and their photonic implementations. We also address common strategies of representing features in light, which will be used in our further discussions in Section IV.

II.1 Artificial Neural Networks

Artificial Neural Networks (ANNs), first proposed in the 1940s McCulloch and Pitts (1943), are mathematical functions loosely inspired by how the human brain processes information. These functions are known to be universal approximators Hornik, Stinchcombe, and White (1989), hence their ability to handle a wide variety of tasks. The network’s behavior, i.e. the way it processes inputs, is determined by its connection strengths (called “weights”) and non-linearities (referred to as “activation functions”) Sze et al. (2017). Typically, these parameters are obtained through training, approximating the ANN to a probability function associated with the given task. For instance, in classification tasks, ANNs are designed to assign a class to an input, by approximating a function that calculates the likelihood of belonging to each class Goodfellow, Bengio, and Courville (2016).

Consider a fully connected, feed forward (FF) NN, consisting of $L$ layers and designed with $N$ inputs and $N$ outputs, as depicted in Fig. 1(a). The process by which a given layer $l$ transforms its inputs is described as:

	$\displaystyle\vec{z}^{\,(l)}$	$\displaystyle=\mathbf{W}^{\,(l)}\cdot\vec{y}^{\,(l-1)}+\vec{b}^{\,(l)}\,,$		(1)
	$\displaystyle\vec{y}^{\,(l)}$	$\displaystyle=\sigma^{\,(l)}(\vec{z}^{\,(l)})\,.$		(2)

Initially, inputs are combined through weighted sums by a weight matrix $\mathbf{W}^{\,(l)}$ to obtain $\vec{z}^{\,(l)}$. Then, the element-wise application of an activation function $\sigma^{\,(l)}(\cdot)$ to $\vec{z}^{\,(l)}$ introduces non-linearity and yields the output of the layer, where $\vec{y}^{\,(0)}$ is the input of the network and $\vec{y}^{\,(L)}$ the output. A bias $\vec{b}^{\,(l)}$ might be added before the activation function to allow the network to better adjust to the data. The entire network, from first to last layer, can be seen as a sequence of such transformations, written as $\vec{y}^{\,(L)}=f(\vec{y}^{\,(0)})$.
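As an illustration of Eqs. (1) and (2), the layer-by-layer transformation can be sketched in a few lines of Python. This is a minimal sketch and not the implementation used in this work: the logistic activation, the random weights, and the names `forward`, `weights`, and `biases` are assumptions made only for the example.

```python
import numpy as np

def logistic(z):
    """Element-wise logistic activation, used here only as an example sigma."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(y0, weights, biases, sigma=logistic):
    """Apply Eqs. (1)-(2) layer by layer and return y^(L)."""
    y = np.asarray(y0, dtype=float)
    for W, b in zip(weights, biases):
        z = W @ y + b   # Eq. (1): weighted sum plus bias
        y = sigma(z)    # Eq. (2): element-wise non-linearity
    return y

# Hypothetical network with N = 3 inputs/outputs and L = 2 layers.
rng = np.random.default_rng(0)
N, L = 3, 2
weights = [rng.normal(size=(N, N)) for _ in range(L)]
biases = [rng.normal(size=N) for _ in range(L)]
y_out = forward([0.1, -0.4, 0.7], weights, biases)
```

Replacing the real arrays by complex ones would turn the same loop into a complex-valued forward pass, provided the activation is defined for complex arguments.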

Thus, ANNs implement input-output mappings that can be either real or complex. Real-Valued Neural Networks (RVNNs) are characterized by real parameters and inputs, with $f:\mathbb{R}^{N}\mapsto\mathbb{R}^{N}$. In these networks, each layer scales and combines inputs before non-linearly transforming them. Complex-Valued Neural Networks (CVNNs), on the other hand, operate in the complex domain, meaning that both the input vector and the network’s parameters are complex-valued and $f:\mathbb{C}^{N}\mapsto\mathbb{C}^{N}$ Bassey, Qian, and Li (2021). In that case, each layer has the ability to not only scale and combine, but also rotate inputs in the complex plane. This rotation, inherent to complex algebra, makes CVNNs more suitable for tasks where phase information is important, such as in audio processing Kim, Han, and Ko (2024) or optical communications Masaad et al. (2023).

II.2 Photonic Implementations

Photonic computing is emerging as a promising approach to improve ANN implementations for specific applications by computing with light. This allows us to leverage its unique characteristics to potentially enable faster and more energy-efficient AI systems. For example, in the optical domain, linear transformations can be done passively Reck et al. (1994) and information can be easily parallelized and processed at high speeds Xu et al. (2021).

PNNs are implementations of ANNs through photonic inputs, components, and transformations Shastri et al. (2021). Although no single photonic component acts as an artificial neuron, a circuit can be designed to perform the mathematical operations of an ANN. This is achieved by using several components such as waveguides, interferometers, and modulators which guide and manipulate light signals. These circuits operate on complex signals and implement complex transformations, meaning that PNNs can act as RVNNs and CVNNs, depending on the task at hand.

Several PNN circuits have been proposed and demonstrated experimentally. They can be broadly categorized by how different inputs are distinguished, whether through spatial, wavelength, or time domains Bai et al. (2023).

In this study, we focus on PNNs that use spatial differentiation of inputs. These networks assign a separate input to each optical signal, and implement weight matrix multiplications by making different inputs interfere with each other. Most notably, this is achieved by using meshes of Mach-Zehnder Interferometers (MZIs) Reck et al. (1994); Clements et al. (2016). The interference, and hence the specific mathematical operation performed by the mesh, can be selected by adjusting the phase shifters found in these devices. Activation functions, on the other hand, can be implemented by using any of the devices and circuits that exhibit optical non-linearity Williamson et al. (2020); Jha, Huang, and Prucnal (2020). The schematics of an ANN implementation and an MZI are shown in Fig. 1(b) and Fig. 1(c), respectively.
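To make the role of the phase shifters concrete, the sketch below builds the transfer matrix of a single MZI from idealized components. It is an illustration under stated assumptions, not the convention of Fig. 1(c): the couplers are assumed lossless and perfectly balanced, both phase shifters are assumed to act on the upper arm, and the names `COUPLER`, `phase_shifter`, and `mzi` are hypothetical.

```python
import numpy as np

# Ideal 50:50 directional coupler (assumed lossless and perfectly balanced).
COUPLER = np.array([[1, 1j], [1j, 1]]) / np.sqrt(2)

def phase_shifter(angle):
    """Phase shift applied to the upper arm only (an assumed placement)."""
    return np.array([[np.exp(1j * angle), 0], [0, 1]])

def mzi(phi, theta):
    """MZI transfer matrix: external phase phi, internal phase 2*theta."""
    return COUPLER @ phase_shifter(2 * theta) @ COUPLER @ phase_shifter(phi)

T = mzi(phi=0.3, theta=0.8)
fields_in = np.array([1.0 + 0j, 0.0])  # light injected in the upper port only
fields_out = T @ fields_in
```

Under these assumptions the matrix is unitary for any pair of phases, so optical power is conserved, and the internal phase sets the power splitting between the two outputs while the external phase only changes the relative phase of the inputs.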

If PNNs use coherent light inputs, they can be represented in the complex domain. In these networks, the $i^{\text{th}}$ input is characterized by an amplitude $A_{i}$ and phase $\phi_{i}$. Thus, the input vector can be expressed as $\vec{y}^{\,(0)}=\left[A_{1}e^{i\phi_{1}},\cdots,A_{N}e^{i\phi_{N}}\right]^{\intercal}\in\mathbb{C}^{N}$. Given the two degrees of freedom available for each input, feature encoding can be achieved using various methodologies. We divide common approaches found in the literature into two distinct groups: real and complex encoding.

Real encoding simplifies the input representation by encoding data solely in the amplitude of the optical signals, maintaining a uniform initial phase across all inputs (in practice having $\phi_{i}=0\,\forall\,i$ and thus $\vec{y}^{\,(0)}\in\mathbb{R}^{N}$). Several researchers employ this encoding method for its compatibility with RVNNs used in electronic computers Shen et al. (2017); Mojaver et al. (2023). It allows for an easy mapping of weights from electronically trained networks to photonic transformations. In these networks, while the nature of the transformations of individual MZIs is inherently complex, the overall behaviour can effectively be real-valued. Since no phase information is used, only the amplitude of the outputs is of interest, which simplifies the detection scheme. However, it is important to ensure that different inputs experience the same phase before reaching the network to maintain phase consistency, which might not be simple to achieve experimentally.

In contrast, complex encoding uses both amplitude and phase at the same time, having inputs that lie in the complex plane, that is $\vec{y}^{\,(0)}\in\mathbb{C}^{N}$ . The transformations in the PNN in this case are complex, and thus, detection of both intensity and phase in the outputs might be used, adding to the electronic complexity of the circuit. In image classification tasks, for example, real-valued input images can be transformed into Fourier space representation to obtain phase and amplitude information Banerjee, Nikdast, and Chakrabarty (2023); Hamerly, Bandyopadhyay, and Englund (2022); Wang et al. (2022), or have different sections mapped to the real and imaginary parts of complex numbers Fang et al. (2019); Qiu et al. (2024), which reduces by half the number of inputs.
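The difference between the two groups can be sketched as follows. This is only an illustration of the input counts involved: the helper names and the choice of pairing consecutive features into real and imaginary parts are assumptions, and the feature values are arbitrary.

```python
import cmath

def real_encoding(features):
    """One input per feature: amplitude carries the value, phase is zero."""
    return [complex(x, 0.0) for x in features]

def complex_encoding_pairs(features):
    """Map consecutive feature pairs to the real and imaginary parts of one input."""
    if len(features) % 2:
        raise ValueError("expected an even number of features")
    return [complex(features[i], features[i + 1])
            for i in range(0, len(features), 2)]

features = [0.2, 0.5, 0.9, 0.1]
y_real = real_encoding(features)              # 4 real-valued inputs
y_complex = complex_encoding_pairs(features)  # 2 complex-valued inputs

# Amplitude and phase that the first complex input would carry optically.
amplitude, phase = cmath.polar(y_complex[0])
```

The complex variant halves the number of inputs, at the cost of tying two features to the amplitude and phase of the same optical signal.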

The encoding choice for PNNs influences the network’s behaviour, the type of information that is detected at the output, and the overall size of the circuit, as it may imply the use of additional peripheral devices. Beyond hardware specifications, this choice might also impact how features are processed within the network. When two features share the same input, the network may process them differently from the way they would be processed individually. Understanding these dynamics is crucial for optimizing PNN performance.

III Feature Importance

In this section, we look to the field of XAI for methods of evaluating feature importance in ANNs, to later study the impact of combining features in PNNs. We focus on gradient-based techniques, particularly sensitivity analysis.

ANNs, especially those with several layers, are highly non-linear models that use numerous parameters. Their complexity often leads them to be regarded as opaque or “black-box” systems, since their decision-making processes are difficult to grasp intuitively. That is, while we can mathematically describe how a given output is obtained, it is difficult to specify “why” with an intuitive explanation.

Nonetheless, being able to explain the decision-making strategies of a model has a number of practical applications. Clear explanations can, for instance, enhance our understanding of a problem or be used to demonstrate fair treatment. In photonics research, XAI is currently used to explain the inverse design of circuits Jia et al. (2023), or to aid in the description of physical models Yeung et al. (2020). The concept of “explainability” is still the subject of ongoing debate Lipton (2018); Miller (2019) and, consequently, a variety of methods have been proposed to attain it Samek et al. (2021); Montavon, Samek, and Müller (2018). Highlighting which input features are considered important by an ANN is a common way to explain its outputs. Several methods estimate such feature importance, of which we emphasize sensitivity analysis.

Sensitivity analysis quantifies feature importance by examining how sensitive the output of the model is to small variations in each feature Sung (1998); Fu and Chen (1993); Dimopoulos et al. (1999). The underlying principle is that if small changes in an input lead to significant changes in the output, then that input is likely to be important for the network, i.e. it contributes to the prediction of this output. In this case, the importance of the $i^{\text{th}}$ input, $y^{\,(0)}_{i}$, to the $c^{\text{th}}$ output of the network, $y^{\,(L)}_{c}$, is denoted by $R_{i\to c}$:

R_{i\to c}=\left|\frac{\partial y^{\,(L)}_{c}}{\partial y^{\,(0)}_{i}}\right|\,.

(3)

Gradient-based explanations are frequently used in image classification tasks to generate saliency maps Simonyan, Vedaldi, and Zisserman (2014), and also show fair performance in matching feature importance in simulated data Olden, Joy, and Death (2004). Over time, other methods have built upon sensitivity analysis, addressing some of its drawbacks by suggesting additional forms of estimating feature importance Ancona et al. (2017). For example, adding Gaussian noise to the input and averaging the resulting gradients helps generate more consistent saliency maps Smilkov et al. (2017). These techniques are often easy to implement, given that the necessary partial derivatives can be computed through back-propagation.
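The sensitivity of Eq. (3) and its noise-averaged variant can be sketched numerically as below. This is a rough illustration only: the toy `model` stands in for a trained network, central finite differences replace back-propagation, and the noise level and sample count are arbitrary assumptions.

```python
import math
import random

def model(x):
    """Toy stand-in for a trained network with two inputs and one output."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - 0.5 * x[1])))

def gradient(f, x, i, h=1e-6):
    """Central finite-difference estimate of d f / d x_i."""
    up, down = list(x), list(x)
    up[i] += h
    down[i] -= h
    return (f(up) - f(down)) / (2 * h)

def sensitivity(f, x, i):
    """Eq. (3): modulus of the partial derivative."""
    return abs(gradient(f, x, i))

def smoothed_sensitivity(f, x, i, noise_std=0.1, samples=200, seed=0):
    """Average gradients over Gaussian-perturbed copies of x, then take the modulus."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        noisy = [v + rng.gauss(0.0, noise_std) for v in x]
        total += gradient(f, noisy, i)
    return abs(total / samples)

x = [0.2, -0.1]
r0, r1 = sensitivity(model, x, 0), sensitivity(model, x, 1)
s0, s1 = smoothed_sensitivity(model, x, 0), smoothed_sensitivity(model, x, 1)
```

For this toy model the first input is weighted six times more strongly than the second, and both estimates rank the inputs accordingly.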

IV Analytical Derivation of Relative Importance

Here, we use the concepts elaborated in previous sections to investigate how the importance of features is shaped in PNNs. Initially, we employ the sensitivity analysis shown in Section III to obtain the importance of an arbitrary feature encoded in one input. Then, we introduce the concept of encoding functions to describe the different feature encoding processes and representations in photonics, seen in Section II.

Consider a set of features $\mathcal{X}=\{x_{1},\cdots,x_{n}\}\subset\mathbb{R}$. Assume that we want all of the elements in $\mathcal{X}$ to be used by our model. However, due to either a prohibitively large quantity of features or size restrictions on our network, we also wish to use a number of inputs that is less than the number of elements in $\mathcal{X}$. To achieve both objectives, we combine features into complex inputs, as seen in Section II. In this context, we calculate the relative importance of such combined features, to understand what relationships are highlighted by our inputs.

Our first objective is to obtain the importance of an arbitrary feature, say $x_{j}\in\mathcal{X}$ , with respect to an arbitrary output of the model. This feature is represented only in input $y^{\,(0)}_{i}$ , and its importance is assessed in relation to the $c^{\text{th}}$ output of the network. Applying the chain rule to Eq. (3), we write this importance $R_{j\to c}$ as:

R_{j\to c}=\left|\frac{\partial y^{\,(L)}_{c}}{\partial x_{j}}\right|=\left|\frac{\partial y^{\,(L)}_{c}}{\partial y_{i}^{\,(0)}}\frac{\partial y_{i}^{\,(0)}}{\partial x_{j}}\right|\,.

(4)

The partial derivative $\partial y_{i}^{\,(0)}/\partial x_{j}$ of Eq. (4) relates to the way $x_{j}$ is represented in the input. The process of creating an input from elements of $\mathcal{X}$ is what we term feature encoding. An input $y_{i}^{\,(0)}$ obtained from the feature $x_{j}$ is hence written as $y_{i}^{\,(0)}=g_{i}(x_{j})$, where $g_{i}$ is the encoding function for the $i^{\text{th}}$ input. Considering the encoding process as such, we can write:

R_{j\to c}=\left|\frac{\partial y^{\,(L)}_{c}}{\partial y_{i}^{\,(0)}}\frac{\partial g_{i}(x_{j})}{\partial x_{j}}\right|\,.

(5)

Notice how the importance depends on both the network, represented in the derivative from output to input, and the feature encoding process, given the presence of the encoding function. The modulus operation ensures that the importance is always non-negative and real-valued. Here, we assume the network to be differentiable in the vicinity of the current input, which might not be the case for some CVNN architectures.

Next, we examine the scenario where two arbitrary features, $x_{j}$ and $x_{k}$ , are represented using a single input $y_{i}^{\,(0)}$ , comparing their relevance. We define the relative importance between $x_{j}$ and $x_{k}$ to the $c^{\text{th}}$ output, $R_{j,k\to c}$ as the following ratio:

R_{j,k\to c}=\frac{R_{j\to c}}{R_{k\to c}}=\left|\frac{\frac{\partial g_{i}(x_{j},x_{k})}{\partial x_{j}}}{\frac{\partial g_{i}(x_{j},x_{k})}{\partial x_{k}}}\right|\,.

(6)

We see that the component of Eq. (4) related to the network cancels out, leaving only the derivatives of the encoding function. Thus, $R_{j,k\to c}$ is solely determined by the way features are encoded into $y_{i}^{\,(0)}$, and hence it is independent of the considered output. To simplify the notation, we drop the subscript indicating the output for the rest of this paper. One of the consequences of Eq. (6) is that the encoding function chosen to combine $x_{j}$ and $x_{k}$ defines how these features are perceived by the model relative to one another.
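The cancellation of the network term can be checked numerically with a small sketch. Everything here is hypothetical and chosen only for illustration: the real-valued encoding `g`, the two single-input "networks" `net_a` and `net_b`, and the evaluation point. Derivatives are estimated with central finite differences.

```python
import math

def g(xj, xk):
    """Hypothetical encoding function chosen only for illustration."""
    return xj ** 2 + 3.0 * xk

def net_a(y):
    """First toy network acting on the single encoded input."""
    return math.tanh(0.7 * y - 0.2)

def net_b(y):
    """Second, unrelated toy network acting on the same input."""
    return 1.0 / (1.0 + math.exp(-(-1.3 * y + 0.4)))

def relative_importance(net, xj, xk, h=1e-6):
    """R_j / R_k for the composed model net(g(xj, xk))."""
    d_j = (net(g(xj + h, xk)) - net(g(xj - h, xk))) / (2 * h)
    d_k = (net(g(xj, xk + h)) - net(g(xj, xk - h))) / (2 * h)
    return abs(d_j / d_k)

xj, xk = 0.6, -0.3
r_a = relative_importance(net_a, xj, xk)
r_b = relative_importance(net_b, xj, xk)
r_encoding = abs(2 * xj / 3.0)  # ratio of the encoding derivatives, Eq. (6)
```

Both networks yield the same ratio as the encoding derivatives alone, which is what Eq. (6) predicts for any differentiable network placed after the encoder.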

Although encoding functions are a method of pre-processing features, in the context of PNNs they can also be implemented in hardware. The incorporation of encoding functions in the circuit is particularly interesting for low-latency applications, as the speed at which inputs are transformed and combined would be limited only by the reconfigurability of the driving electronics. We now explore two types of complex encoding functions to see how they dictate relative feature importances. We also point out how they could be implemented in hardware.

IV.1 Exponential Encoding

Since we are dealing with complex-valued inputs, one intuitive way to encode two features $x_{j}$ and $x_{k}$ into a single input would be to encode $x_{j}$ in its amplitude and $x_{k}$ in its phase. This encoding function can be written as:

g(x_{j},x_{k})=x_{j}e^{ix_{k}}\,.

(7)

The relative importance between these two features is calculated as:

R_{j,k\to c}=\left|\frac{e^{ix_{k}}}{ix_{j}e^{ix_{k}}}\right|=\frac{1}{\left|x_{j}\right|}\,.

(8)

In this case, the relative importance between the two features is dynamic, establishing an amplitude-dependent relation between the importances of amplitude and phase.
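As a quick numerical check of Eq. (8), the ratio of Eq. (6) can be evaluated for the exponential encoding with finite differences. The helper names and sample amplitudes below are assumptions made for the example.

```python
import cmath

def g_exp(xj, xk):
    """Exponential encoding of Eq. (7): amplitude xj, phase xk."""
    return xj * cmath.exp(1j * xk)

def relative_importance(g, xj, xk, h=1e-6):
    """Eq. (6) evaluated with central finite differences on a complex-valued g."""
    d_j = (g(xj + h, xk) - g(xj - h, xk)) / (2 * h)
    d_k = (g(xj, xk + h) - g(xj, xk - h)) / (2 * h)
    return abs(d_j) / abs(d_k)

amplitudes = (0.25, 1.0, 4.0)
ratios = [relative_importance(g_exp, xj, 0.7) for xj in amplitudes]
expected = [1.0 / abs(xj) for xj in amplitudes]  # Eq. (8)
```

For amplitudes below one the amplitude-encoded feature dominates, while for amplitudes above one the phase-encoded feature becomes the more important of the two.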

A hardware version of an exponential encoding function is shown in Fig. 2(a), where a balanced MZI and a phase shifter are used to modulate the amplitude and phase of an input, respectively. The encodings and importances are not exactly the same as those of Eq. (7), since this amplitude modulation scheme is mediated by a sine function: it implements $g(x_{j},x_{k})=i\sin(x_{j})\exp(ix_{k})$. Eq. (7) can nevertheless be recovered, up to a global phase shift, by mapping $x_{j}$ to $\arcsin(x_{j})$.

IV.2 Linear Encoding

Another way to combine two features is by encoding one in the real part and the other in the imaginary part of a complex input. This can be represented by the function:

g(x_{j},x_{k})=x_{j}+ix_{k}\,.

(9)

Here, we find their relative importance to be:

R_{j,k}=\frac{1}{\left|i\right|}=1\,,

(10)

thus indicating that both features will be considered to have the same importance for the network. Since relative importances are independent of the weights of the network, they cannot be unlearned, i.e. they cannot be modified by further training. This might pose problems when the chosen encoding leads to relative importances that do not match the data.

An encoding function similar to that of Eq. (9), implemented in hardware, can be seen in Fig. 2(b). There, two MZIs are used as amplitude modulators, while one of their outputs has its phase shifted by $\pi/2$ to encode the respective input in the imaginary axis. In that case, $g(x_{j},x_{k})=i(\sin(x_{j})+i\sin(x_{k}))$. Eq. (9) can be recovered, up to a global phase shift, by mapping $x_{j}$ and $x_{k}$ to $\arcsin(x_{j})$ and $\arcsin(x_{k})$.
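The arcsin pre-distortion mentioned for the hardware encoders of this section can be verified with a short sketch. The two functions below simply transcribe the hardware expressions quoted in the text; their names are hypothetical, and the features are assumed to be scaled to the interval $[-1,1]$ so that the arcsine is defined.

```python
import cmath
import math

def hw_exponential(aj, ak):
    """Encoder of Fig. 2(a) as given in the text: i*sin(aj)*exp(i*ak)."""
    return 1j * math.sin(aj) * cmath.exp(1j * ak)

def hw_linear(aj, ak):
    """Encoder of Fig. 2(b) as given in the text: i*(sin(aj) + i*sin(ak))."""
    return 1j * (math.sin(aj) + 1j * math.sin(ak))

xj, xk = 0.4, -0.8  # features assumed to lie in [-1, 1]

# Pre-distort the amplitude-modulated features with arcsin before driving the MZIs.
exp_out = hw_exponential(math.asin(xj), xk)
lin_out = hw_linear(math.asin(xj), math.asin(xk))

# Ideal encodings of Eqs. (7) and (9); they differ from the above by the factor i.
exp_target = xj * cmath.exp(1j * xk)
lin_target = complex(xj, xk)
```

Dividing either hardware output by the common factor $i$, i.e. removing a global phase of $\pi/2$, reproduces the ideal encodings.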

V On the Impact of Encoding Functions to ANNs

In this section, we address the practical implications of the discussions brought up in Section IV. Here, our objective is to demonstrate how a well-engineered encoding function can significantly improve the accuracy of an ANN on a test task. We begin by defining this task and studying the relative feature importances found in a solution to it. Later, we create an encoding function that reproduces these importances on trained ANNs, finally comparing its use against others.

Consider a simple classification problem with a known solution in the real domain: determining whether points lie inside or outside an $n$-sphere. An $n$-sphere is the generalization of a circle to $n+1$ dimensions, similar to how hyperplanes generalize planes. It is defined by a set of points $S^{(n)}$ that are equidistant from a central point $c_{0}=(c_{1},\cdots,c_{n+1})$ by a radius $r_{0}$. The distance of a point $P=(x_{1},\cdots,x_{n+1})$ to $c_{0}$ is:

r_{p}(x_{1},...,x_{n+1})=\sqrt{\sum_{i=1}^{n+1}(x_{i}-c_{i})^{2}}\,.

(11)

Naturally, $P$ is considered outside of the $n$-sphere if $r_{p}$ exceeds $r_{0}$, and inside otherwise. In this context, a mathematical model that outputs a probability of $P$ being outside of $S^{(n)}$ can be constructed using a logistic function. The logistic function $\sigma(x)=1/(1+e^{-x})$ is bounded between 0 and 1 with a smooth sigmoid transition, and is typically used in binary classification problems. Given the coordinates of $P$, this model can be expressed as:

y=f(x_{1},...,x_{n+1})=\sigma\left(r_{p}-r_{0}\right)\,.

(12)

Here, $y$ represents the probability that $r_{p}>r_{0}$ given the coordinates of $P$ . When $y=0.5$ , Eq. (12) delineates the boundary defined by $S^{(n)}$ , allowing for accurate classification of points based on this threshold.

Since Eq. (12) can be used to accurately classify any point $P$, we conjecture that its relative feature importances are also desirable for other models that aim to solve the same task. Thus, we examine the sensitivity of $y$ to an arbitrary feature $x_{j}$, which can be calculated according to Eq. (4) as:

\displaystyle\begin{split}R_{j}&=\left|\frac{\partial y}{\partial x_{j}}\right|\\ &=\left|\sigma(r_{p}-r_{0})(1-\sigma(r_{p}-r_{0}))\frac{x_{j}-c_{j}}{r_{p}}\right|\quad\forall\,r_{p}\neq 0\,.\end{split}

(13)

As can be seen in Fig. 2(c), the importance of a feature peaks when it is at the $n$-sphere’s boundary. Exactly at that point, small variations in $x_{j}$ cause the largest deviations of the probability of $P$ being outside of $S^{(n)}$. The relative feature importance between two features $x_{j}$ and $x_{k}$ is:

R_{j,k}=\left|\frac{x_{j}-c_{j}}{x_{k}-c_{k}}\right|\,.

(14)
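Eqs. (12) to (14) can be cross-checked with a few lines of Python. This is a minimal sketch under assumed values: the sample point, the center at the origin, and the unit radius are arbitrary, and the function names are hypothetical. The analytic importance of Eq. (13) is compared against a finite-difference estimate.

```python
import math

def radius(point, center):
    """Eq. (11): Euclidean distance from the point to the center."""
    return math.sqrt(sum((x - c) ** 2 for x, c in zip(point, center)))

def model(point, center, r0):
    """Eq. (12): logistic function of the signed distance to the sphere."""
    return 1.0 / (1.0 + math.exp(-(radius(point, center) - r0)))

def importance(point, center, r0, j):
    """Eq. (13): analytic sensitivity of the model to feature j."""
    rp = radius(point, center)
    s = model(point, center, r0)
    return abs(s * (1.0 - s) * (point[j] - center[j]) / rp)

def importance_fd(point, center, r0, j, h=1e-6):
    """Finite-difference estimate of the same sensitivity."""
    up, down = list(point), list(point)
    up[j] += h
    down[j] -= h
    return abs((model(up, center, r0) - model(down, center, r0)) / (2 * h))

center, r0 = (0.0, 0.0, 0.0, 0.0), 1.0
p = (0.3, -0.5, 0.2, 0.6)
r_01 = importance(p, center, r0, 0) / importance(p, center, r0, 1)  # Eq. (14)
```

The logistic factor is common to all features and drops out of the ratio, leaving only the distances of each coordinate to the center.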

Now assume that, as in Section IV, we are constrained by size, and thus wish to combine different features to reduce the number of inputs. However, for this particular example (and by design), we have prior knowledge of the relative importances of features before combining them. Taking advantage of this, we propose an encoding function $g(x_{j},x_{k})$ that achieves the desired reduction in dimensionality while preserving the relationships given by Eq. (14). Given that after combining features, their relative importances should follow Eq. (6), we can obtain one such $g(x_{j},x_{k})$ by solving the following system of partial derivatives:

\left\{\begin{matrix}\left|\frac{\partial g(x_{j},x_{k})}{\partial x_{j}}\right|=\left|x_{j}-c_{j}\right|\,,\\ \\ \left|\frac{\partial g(x_{j},x_{k})}{\partial x_{k}}\right|=\left|x_{k}-c_{k}\right|\,.\end{matrix}\right.

(15)

This system can lead to one of many solutions of the form

g(x_{j},x_{k})=\frac{1}{2}(x_{j}^{2}+x_{k}^{2})-c_{j}x_{j}-c_{k}x_{k}+C\,,

(16)

where $C$ is a constant.
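That Eq. (16) satisfies the system in Eq. (15) can be confirmed numerically, as in the sketch below. The center coordinates and the evaluation point are arbitrary assumptions, and the function names are hypothetical.

```python
def engineered(xj, xk, cj=0.0, ck=0.0, const=0.0):
    """Engineered encoding function of Eq. (16)."""
    return 0.5 * (xj ** 2 + xk ** 2) - cj * xj - ck * xk + const

def partials(g, xj, xk, h=1e-6):
    """Central finite-difference estimates of dg/dxj and dg/dxk."""
    d_j = (g(xj + h, xk) - g(xj - h, xk)) / (2 * h)
    d_k = (g(xj, xk + h) - g(xj, xk - h)) / (2 * h)
    return d_j, d_k

cj, ck = 0.5, -1.0   # assumed center coordinates
xj, xk = 1.2, 0.4    # assumed evaluation point
d_j, d_k = partials(lambda a, b: engineered(a, b, cj, ck), xj, xk)
relative = abs(d_j / d_k)  # should match Eq. (14)
```

The additive constant has no effect on the derivatives, which is why any value of $C$ leads to the same relative importances.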

In the same manner, we can obtain other functions that express different relative importances. A constant $R_{j,k}=1$, for instance, is achieved by using $g(x_{j},x_{k})=(x_{j}+x_{k})^{n}$ $\forall\,n\in\mathbb{R}$. Alternatively, $g(x_{j},x_{k})=(x_{j}\times x_{k})^{n}$ $\forall\,n\in\mathbb{R}$ leads to $R_{j,k}=|x_{k}/x_{j}|$, which is the inverse of Eq. (14) when $c_{j}$ and $c_{k}$ are zero. To compare the use of these encoding functions, we trained several ANNs, benchmarking them against networks that do not combine inputs (called “independent” here). We are particularly interested in the performance of Eq. (16), which we call the “engineered” encoding function. The description of the training procedures is as follows.

A dataset of $1000$ points in 4 dimensions was created, where each coordinate value was randomly chosen between $-2$ and $2$. Points were labeled as either inside or outside of a $3$-sphere $S^{\,(3)}$ of radius $1$, centered at the origin $c_{0}=(0,0,0,0)$, according to their position. In order to obtain a balanced dataset, we generated the same number of points inside and outside of the sphere. The networks trained to solve this task were composed of an input layer containing either $2$ or $4$ neurons (depending on whether features were combined), a hidden layer of $6$ neurons and an output layer with a single neuron. A logistic activation function $\sigma$ was used for every layer. Each encoding function was used to train $100$ different networks, thus accounting for the random initialization of weights and random shuffling of the dataset prior to training. The networks were trained on $70\%$ of the available data for $100$ epochs with a learning rate of $0.001$, and tested on the remaining data.
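A dataset with these properties could be generated along the following lines. This is only a sketch under assumptions: the text does not state how the class balance was enforced, so rejection sampling is used here, and the pairing of features $(x_{1},x_{2})$ and $(x_{3},x_{4})$ as well as all function names are hypothetical.

```python
import random

def make_dataset(n_points=1000, dim=4, low=-2.0, high=2.0, radius=1.0, seed=0):
    """Balanced inside/outside dataset via rejection sampling (an assumed procedure)."""
    rng = random.Random(seed)
    per_class = n_points // 2
    inside, outside = [], []
    while len(inside) < per_class or len(outside) < per_class:
        p = [rng.uniform(low, high) for _ in range(dim)]
        bucket = outside if sum(x * x for x in p) > radius ** 2 else inside
        if len(bucket) < per_class:
            bucket.append(p)
    data = [(p, 0) for p in inside] + [(p, 1) for p in outside]
    rng.shuffle(data)
    return data

def engineered(xj, xk):
    """Eq. (16) with c_j = c_k = 0 and C = 0."""
    return 0.5 * (xj ** 2 + xk ** 2)

dataset = make_dataset()
split = int(0.7 * len(dataset))
train_set, test_set = dataset[:split], dataset[split:]

# Four features become two inputs; the pairing below is an assumption.
encoded_train = [([engineered(p[0], p[1]), engineered(p[2], p[3])], label)
                 for p, label in train_set]
```

With the sphere centered at the origin, the two engineered inputs sum to $r_{p}^{2}/2$, so the class boundary becomes a straight line in the encoded space, which may help explain why this encoding is easy for a small network to exploit.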

The results of these experiments are shown in Fig. 2(d). We notice that some representations can render the task harder to solve, while others maintain, to some extent, the accuracy achieved by the use of independent inputs. The engineered encoding function in Eq. (16) outperformed all others. With this example, we show that the way we combine features plays a role in the accuracy of ANNs. Given prior knowledge on how features relate to one another, which may come from domain-specific knowledge or from inspecting the data (noticing symmetries or class distributions, for example), we could estimate relative feature importances and obtain an encoding function that aligns with them. Combining features with said encoding function could improve the network performance.

VI Application of Encoding Functions in PNNs

In this section, we return to the main subject of this study and explore the use of different encoding functions in PNNs trained on the Iris dataset Fisher (1936), a standard benchmark for classification algorithms. Our goal is to show how carefully chosen encoding functions might lead to higher accuracies in PNNs. To this end, we compare the performance of several encoding functions by means of simulations of PNNs, which differ significantly from the ANNs of the previous section in terms of their complex-valued inputs and transformations.

The Iris flower classification task involves categorizing three different Iris species (Setosa, Versicolour, and Virginica) based on four features: the lengths and widths of sepals and petals. The dataset, consisting of $150$ labeled data points, has considerable class overlaps, such that no single feature alone can distinguish all species, making this an ideal candidate for our experiments. Visualizations of feature distributions and class overlaps are presented in Fig. 3(a).

Our experimental design involves training PNNs by combining features in pairs as illustrated in Fig. 3(b). We assess their performance by averaging the accuracy of $100$ trained PNNs, benchmarking them against a PNN that does not combine features. This sample size was chosen to allow for convergence in the average values obtained for accuracy, given the variability in the training process. The architecture of the PNNs consists of a single hidden layer with $6$ neurons. Depending on the configuration, the number of input neurons is either $3$ (when combining features) or $5$ (when using independent inputs), where one input acts as a bias in both configurations. The output layer has $3$ neurons, matching the number of classes. All configurations use the same underlying circuit, where NN layers are implemented using meshes of MZIs with trainable phase shifters, as represented in Fig. 1(b). Every layer is followed by a softplus activation function, which can be implemented in integrated photonic circuits Campo and Pérez-López (2022). Although its hardware implementation would change both the modulus and the phase of the signals, we model it by applying $\mathrm{softplus}(x)=\log(1+\exp(x))$ solely to the modulus of the complex numbers Banerjee, Nikdast, and Chakrabarty (2023). This approach allows us to simulate the gain and activation behavior while simplifying the model by avoiding additional phase changes. These phase changes can make the simulation and training more challenging and are less critical to the primary function of the softplus activation in this context.
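The modulus-only activation described above can be sketched in a few lines. This is a minimal stand-alone illustration rather than the simulation code used in this work: the function name `modulus_softplus` is hypothetical, and the numerically stable rewriting of softplus is a choice made for this example.

```python
import cmath
import math


def modulus_softplus(z):
    """Apply softplus to |z| and leave the phase of z unchanged."""
    r, phi = cmath.polar(z)
    # For r >= 0, log(1 + exp(r)) == r + log(1 + exp(-r)); the second form
    # avoids overflow for large moduli.
    activated = r + math.log1p(math.exp(-r))
    return cmath.rect(activated, phi)


z_in = 1.0 + 1.0j
z_out = modulus_softplus(z_in)
print(abs(z_in), cmath.phase(z_in))    # input modulus and phase
print(abs(z_out), cmath.phase(z_out))  # larger modulus, same phase
```

Because softplus satisfies $\log(1+\exp(r))>r$ for every $r\geq 0$ , the output modulus always exceeds the input modulus, which reflects the gain behavior mentioned in the text.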

The circuits were simulated using the Photontorch Python package Laporte, Dambre, and Bienstman (2019). The simulations were performed under ideal conditions, excluding noise and component imperfections. They were trained for $300$ epochs on $70\%$ of the dataset, reserving the remaining $30\%$ for testing. The dataset was divided into five shuffled batches per epoch to enhance training stability. A softmax function was used to convert the output light intensity values into class probabilities Goodfellow, Bengio, and Courville (2016). Weight updates were performed using a cross-entropy loss function combined with stochastic gradient descent. The initial learning rate was set at $0.01$ and adjusted at learning plateaus. While higher accuracies may be achieved by further optimizing the training process to each specific case, we opted for a constant training procedure across all circuits to isolate the effects of different encoding functions.
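The readout and loss described above can be illustrated with a short stand-alone sketch. It assumes that the detected intensity at each output is the squared modulus of the complex field, in arbitrary normalized units; the function names and the example field values are hypothetical and are not taken from the Photontorch simulations.

```python
import math


def output_intensities(fields):
    """Squared modulus of each complex output field (assumed detection model)."""
    return [abs(z) ** 2 for z in fields]


def softmax(values):
    """Convert real values to probabilities, shifting by the max for stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[true_class])


# Hypothetical fields at the three output ports of a PNN for one sample.
fields = [0.2 + 0.1j, 1.0 - 0.5j, 0.3j]
probs = softmax(output_intensities(fields))
loss = cross_entropy(probs, true_class=1)
print(probs, loss)
```

In a training loop, this loss would be averaged over each batch and minimized with respect to the phase-shifter settings by stochastic gradient descent, as described in the text.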

Here, we investigate the use of the encoding functions detailed in Section II: linear and exponential encoding. Given the anisotropic nature of the Iris classification task, unlike the $n$ -sphere problem, we also consider which features to combine. To explore the impacts of this choice on the obtained accuracy, we use two combination strategies for features: grouping by the lengths and widths $(l/w)$ or by petal and sepal information $(p/s)$ . We benchmarked the performance of PNNs using different encoding functions and groupings of features against the independent case, where features were not combined.
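The two grouping strategies can be made concrete with a small sketch. The groupings follow the description above: $(l/w)$ pairs the two lengths and the two widths, while $(p/s)$ pairs the two petal features and the two sepal features. Everything else is an assumption made for illustration: the forms used here for linear encoding ( $a+ib$ ) and exponential encoding ( $a\,e^{ib}$ ) are common conventions that may differ from the definitions in Section II, and the order of features within each pair, the bias value, and all names are hypothetical.

```python
import cmath

# Which features share an input under each strategy. The order inside each
# pair is an assumption.
GROUPINGS = {
    "l/w": (("sepal_length", "petal_length"), ("sepal_width", "petal_width")),
    "p/s": (("petal_length", "petal_width"), ("sepal_length", "sepal_width")),
}


def linear_encoding(a, b):
    """Assumed form: a + i*b (features as real and imaginary parts)."""
    return complex(a, b)


def exponential_encoding(a, b):
    """Assumed form: a * exp(i*b) (features as modulus and phase)."""
    return cmath.rect(a, b)


def encode_sample(sample, grouping, encoding, bias=1.0):
    """Map four real features to two complex inputs plus one bias input."""
    inputs = [encoding(sample[f1], sample[f2]) for f1, f2 in GROUPINGS[grouping]]
    inputs.append(complex(bias, 0.0))  # bias value is an assumption
    return inputs


# Illustrative feature values in cm; real use would normalize them first.
sample = {"sepal_length": 5.1, "sepal_width": 3.5,
          "petal_length": 1.4, "petal_width": 0.2}

for grouping in GROUPINGS:
    for encoding in (linear_encoding, exponential_encoding):
        print(grouping, encoding.__name__, encode_sample(sample, grouping, encoding))
```

Each combination yields three complex inputs, matching the three input neurons used when features are combined.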

The results of these experiments, shown in Fig. 4, are summarized as follows. Exponential encoding exhibited the lowest performance, with a mean accuracy up to $11\%$ lower than that of the independent benchmark. In contrast, linear encoding, commonly used in the photonics community Banerjee, Nikdast, and Chakrabarty (2023); Hamerly, Bandyopadhyay, and Englund (2022); Wang et al. (2022); Fang et al. (2019); Qiu et al. (2024), was able to match the performance of the independent case. The difference between the best and worst performing encoding functions was $12.3\%$ . These results highlight that both the manner in which features are combined and the combination of features itself play significant roles in the final accuracy of PNNs. When comparing different feature groupings, we found that $(l/w)$ consistently performed worse than $(p/s)$ , demonstrating that the choice of which features to combine can also impact accuracy for some tasks.

These findings are supported by heuristics found in the data. A closer inspection of Fig. 3(a) reveals that petal length and petal width together are highly discriminative of the different classes. These features separate different species in a similar fashion, as evidenced by the distribution of classes along the diagonal of the plot, suggesting that they may have similar importance. Thus, the combination $(p/s)$ with linear encoding would combine petal length and width with an equal relative importance, expressing such relationships.

VII Conclusions

Combining features into single inputs in PNNs can reduce the number of inputs and associated devices, enabling the use of smaller and more energy-efficient NNs. These benefits would help make some circuits more feasible to simulate, fabricate, test or deploy. However, this method of feature combination imposes predefined relationships among the features that may not necessarily reflect the nature of the data or the task at hand. Nonetheless, selecting or designing encoding functions based on an understanding of the dataset or on domain-specific knowledge can lead to improved accuracy. We have illustrated this first on a simple idealized example, and then for simulated PNNs.

In the scenarios shown here, as is common in the literature, features are combined into a single input. As an alternative, we could distribute features across many inputs, circumventing the limitations discussed here and making it possible to learn other relative feature importances. For instance, Principal Component Analysis (PCA) can be used for dimensionality reduction, distributing features across many inputs simultaneously Mojaver et al. (2023). Expanding on this concept, a learnable encoding function that uses every feature available would be a fully connected layer of an NN Wang et al. (2022), which is more complex and less efficient than what is explored in our work. Besides, the approach used here could be applied directly at a hardware level, using integrated photonics and CMOS-compatible platforms for volume production.

The discussions here highlight that there is no neutral way of using this feature combination strategy in PNNs. Combining features in this manner will necessarily emphasize certain feature relationships. Sometimes, a PNN might achieve good performance metrics despite combinations that are not ideal. However, even if high accuracy is achieved, these combinations can also introduce or amplify biases in the model outputs, depending on the specific features and their encoded interactions. Rather than leaving this to chance, we suggest carefully assessing how to encode features given the nature of the problem and data.

Acknowledgements.

The authors would like to acknowledge and thank Peter Bienstman and Thomas Van Vaerenbergh for their valuable advice and discussions during the early stages of this work, as well as for reviewing the paper before submission. This project received funding from École Centrale de Lyon and ANR (No. ANR-20-THIA-0007-01). Paul Jimenez and Fabio Pavanello thank ANR’s support (No. ANR-20-CE39-0004), and Fabio Pavanello acknowledges support by the European Union’s Horizon Europe research and innovation program (No.101070238). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

Author Declarations

Conflict of Interest Statement

The authors have no conflicts to disclose.

Author Contributions

Mauricio Gomes de Queiroz: Conceptualization (lead); Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (lead); Software (lead); Visualization (lead); Writing/original draft preparation (lead); Writing/review & editing (equal). Paul Jimenez: Formal analysis (supporting); Methodology (supporting); Writing/review & editing (equal). Raphael Cardoso: Methodology (supporting); Writing/review & editing (equal). Mateus Vidaletti da Costa: Methodology (supporting); Writing/review & editing (equal). Mohab Abdalla: Methodology (supporting); Writing/review & editing (equal). Ian O’Connor: Supervision (supporting); Writing/review & editing (equal). Alberto Bosio: Funding Acquisition (equal); Supervision (supporting); Writing/review & editing (equal). Fabio Pavanello: Conceptualization (supporting); Formal analysis (supporting); Methodology (supporting); Funding Acquisition (equal); Supervision (lead); Writing/review & editing (equal).

Data Availability Statement

The code that reproduces the experiments and the data that support the findings of this study are available at github.com/mgomesq/feature_representation_pnns.

References

Dong, Wang, and Abbas (2021) S. Dong, P. Wang, and K. Abbas, “A survey on deep learning and its applications,” Computer Science Review 40, 100379 (2021).
Simonyan and Zisserman (2014) K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556 (2014).
Purwins et al. (2019) H. Purwins, B. Li, T. Virtanen, J. Schlüter, S.-Y. Chang, and T. Sainath, “Deep learning for audio signal processing,” IEEE Journal of Selected Topics in Signal Processing 13, 206–219 (2019).
Theis and Wong (2017) T. N. Theis and H.-S. P. Wong, “The end of moore’s law: A new beginning for information technology,” Computing in Science & Engineering 19, 41–50 (2017).
Powell (2008) J. R. Powell, “The quantum limit to moore’s law,” Proceedings of the IEEE 96, 1247–1248 (2008).
Leiserson et al. (2020) C. E. Leiserson, N. C. Thompson, J. S. Emer, B. C. Kuszmaul, B. W. Lampson, D. Sanchez, and T. B. Schardl, “There’s plenty of room at the top: What will drive computer performance after moore’s law?” Science 368, eaam9744 (2020).
Waldrop (2016) M. M. Waldrop, “The chips are down for moore’s law,” Nature News 530, 144 (2016).
Shen et al. (2017) Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, et al., “Deep learning with coherent nanophotonic circuits,” Nature photonics 11, 441–446 (2017).
Ashtiani, Geers, and Aflatouni (2022) F. Ashtiani, A. J. Geers, and F. Aflatouni, “An on-chip photonic deep neural network for image classification,” Nature 606, 501–506 (2022).
Xu et al. (2021) X. Xu, M. Tan, B. Corcoran, J. Wu, A. Boes, T. G. Nguyen, S. T. Chu, B. E. Little, D. G. Hicks, R. Morandotti, et al., “11 tops photonic convolutional accelerator for optical neural networks,” Nature 589, 44–51 (2021).
Shibata et al. (2008) T. Shibata, S. Kamei, T. Kitoh, T. Tanaka, and M. Kohtoku, “Compact and low insertion loss (~ 1.0 db) mach-zehnder interferometer-synchronized arrayed-waveguide grating multiplexer with flat-top frequency response,” Optics express 16, 16546–16551 (2008).
Xiao et al. (2021) X. Xiao, M. B. On, T. Van Vaerenbergh, D. Liang, R. G. Beausoleil, and S. Yoo, “Large-scale and energy-efficient tensorized optical neural networks on iii–v-on-silicon moscap platform,” Apl Photonics 6 (2021).
Tait (2022) A. N. Tait, “Quantifying power in silicon photonic neural networks,” Physical Review Applied 17, 054029 (2022).
Mourgias-Alexandris et al. (2022) G. Mourgias-Alexandris, M. Moralis-Pegios, A. Tsakyridis, S. Simos, G. Dabos, A. Totovic, N. Passalis, M. Kirtas, T. Rutirawut, F. Gardes, et al., “Noise-resilient and high-speed deep learning with coherent silicon photonics,” Nature communications 13, 5572 (2022).
de Queiroz et al. (2023) M. G. de Queiroz, R. Cardoso, P. Jimenez, M. Abdalla, I. O’Connor, A. Bosio, and F. Pavanello, “Power reduction in photonic meshes by mzi optimization,” in Frontiers in Optics (Optica Publishing Group, 2023) pp. JW4A–7.
Zhang et al. (2021) H. Zhang, M. Gu, X. Jiang, J. Thompson, H. Cai, S. Paesani, R. Santagati, A. Laing, Y. Zhang, M. Yung, et al., “An optical neural chip for implementing complex-valued neural network,” Nature communications 12, 457 (2021).
Geirhos et al. (2020) R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, “Shortcut learning in deep neural networks,” Nature Machine Intelligence 2, 665–673 (2020).
Qiu et al. (2024) R. Qiu, A. Eldebiky, L. Zhang, X. Yin, C. Zhuo, U. Schlichtmann, and B. Li, “Oplixnet: Towards area-efficient optical split-complex networks with real-to-complex data assignment and knowledge distillation,” in Design, Automation and Test in Europe (DATE) (2024).
McCulloch and Pitts (1943) W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The bulletin of mathematical biophysics 5, 115–133 (1943).
Hornik, Stinchcombe, and White (1989) K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural networks 2, 359–366 (1989).
Sze et al. (2017) V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE 105, 2295–2329 (2017).
Goodfellow, Bengio, and Courville (2016) I. Goodfellow, Y. Bengio, and A. Courville, Deep learning (MIT press, 2016).
Bassey, Qian, and Li (2021) J. Bassey, L. Qian, and X. Li, “A survey of complex-valued neural networks,” arXiv preprint arXiv:2101.12249 (2021).
Kim, Han, and Ko (2024) G. Kim, D. K. Han, and H. Ko, “Sound source localization using complex-valued deep neural networks,” in 2024 IEEE International Conference on Consumer Electronics (ICCE) (IEEE, 2024) pp. 1–4.
Masaad et al. (2023) S. Masaad, E. Gooskens, S. Sackesyn, J. Dambre, and P. Bienstman, “Photonic reservoir computing for nonlinear equalization of 64-qam signals with a kramers–kronig receiver,” Nanophotonics 12, 925–935 (2023).
Reck et al. (1994) M. Reck, A. Zeilinger, H. J. Bernstein, and P. Bertani, “Experimental realization of any discrete unitary operator,” Physical review letters 73, 58 (1994).
Shastri et al. (2021) B. J. Shastri, A. N. Tait, T. Ferreira de Lima, W. H. Pernice, H. Bhaskaran, C. D. Wright, and P. R. Prucnal, “Photonics for artificial intelligence and neuromorphic computing,” Nature Photonics 15, 102–114 (2021).
Bai et al. (2023) Y. Bai, X. Xu, M. Tan, Y. Sun, Y. Li, J. Wu, R. Morandotti, A. Mitchell, K. Xu, and D. J. Moss, “Photonic multiplexing techniques for neuromorphic computing,” Nanophotonics 12, 795–817 (2023).
Clements et al. (2016) W. R. Clements, P. C. Humphreys, B. J. Metcalf, W. S. Kolthammer, and I. A. Walmsley, “Optimal design for universal multiport interferometers,” Optica 3, 1460 (2016).
Williamson et al. (2020) I. A. D. Williamson, T. W. Hughes, M. Minkov, B. Bartlett, S. Pai, and S. Fan, “Reprogrammable electro-optic nonlinear activation functions for optical neural networks,” IEEE Journal of Selected Topics in Quantum Electronics 26, 1–12 (2020).
Jha, Huang, and Prucnal (2020) A. Jha, C. Huang, and P. R. Prucnal, “Reconfigurable all-optical nonlinear activation functions for neuromorphic photonics,” Opt. Lett. 45, 4819–4822 (2020).
Mojaver et al. (2023) K. H. R. Mojaver, B. Zhao, E. Leung, S. M. R. Safaee, and O. Liboiron-Ladouceur, “Addressing the programming challenges of practical interferometric mesh based optical processors,” Optics Express 31, 23851–23866 (2023).
Banerjee, Nikdast, and Chakrabarty (2023) S. Banerjee, M. Nikdast, and K. Chakrabarty, “Characterizing coherent integrated photonic neural networks under imperfections,” Journal of Lightwave Technology 41, 1464–1479 (2023).
Hamerly, Bandyopadhyay, and Englund (2022) R. Hamerly, S. Bandyopadhyay, and D. Englund, “Asymptotically fault-tolerant programmable photonics,” Nature Communications 13, 6831 (2022).
Wang et al. (2022) R. Wang, P. Wang, C. Lyu, G. Luo, H. Yu, X. Zhou, Y. Zhang, and J. Pan, “Multicore photonic complex-valued neural network with transformation layer,” Photonics 9 (2022), 10.3390/photonics9060384.
Fang et al. (2019) M. Y.-S. Fang, S. Manipatruni, C. Wierzynski, A. Khosrowshahi, and M. R. DeWeese, “Design of optical neural networks with component imprecisions,” Optics Express 27, 14009 (2019).
Jia et al. (2023) Z. Jia, W. Qarony, J. Park, S. Hooten, D. Wen, Y. Zhiyenbayev, M. Seclì, W. Redjem, S. Dhuey, A. Schwartzberg, E. Yablonovitch, and B. Kanté, “Interpretable inverse-designed cavity for on-chip nonlinear photon pair generation,” Optica 10, 1529–1534 (2023).
Yeung et al. (2020) C. Yeung, J.-M. Tsai, B. King, Y. Kawagoe, D. Ho, M. W. Knight, and A. P. Raman, “Elucidating the behavior of nanophotonic structures through explainable machine learning algorithms,” ACS Photonics 7, 2309–2318 (2020).
Lipton (2018) Z. C. Lipton, “The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery.” Queue 16, 31–57 (2018).
Miller (2019) T. Miller, “Explanation in artificial intelligence: Insights from the social sciences,” Artificial intelligence 267, 1–38 (2019).
Samek et al. (2021) W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders, and K.-R. Müller, “Explaining deep neural networks and beyond: A review of methods and applications,” Proceedings of the IEEE 109, 247–278 (2021).
Montavon, Samek, and Müller (2018) G. Montavon, W. Samek, and K.-R. Müller, “Methods for interpreting and understanding deep neural networks,” Digital signal processing 73, 1–15 (2018).
Sung (1998) A. H. Sung, “Ranking importance of input parameters of neural networks,” Expert systems with Applications 15, 405–411 (1998).
Fu and Chen (1993) L. Fu and T. Chen, “Sensitivity analysis for input vector in multilayer feedforward neural networks,” in IEEE international conference on neural networks (IEEE, 1993) pp. 215–218.
Dimopoulos et al. (1999) I. Dimopoulos, J. Chronopoulos, A. Chronopoulou-Sereli, and S. Lek, “Neural network models to study relationships between lead concentration in grasses and permanent urban descriptors in athens city (greece),” Ecological modelling 120, 157–165 (1999).
Simonyan, Vedaldi, and Zisserman (2014) K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” in Workshop at International Conference on Learning Representations (2014).
Olden, Joy, and Death (2004) J. D. Olden, M. K. Joy, and R. G. Death, “An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data,” Ecological Modelling 178, 389–397 (2004).
Ancona et al. (2017) M. Ancona, E. Ceolini, C. Öztireli, and M. Gross, “Towards better understanding of gradient-based attribution methods for deep neural networks,” arXiv preprint arXiv:1711.06104 (2017).
Smilkov et al. (2017) D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, “Smoothgrad: removing noise by adding noise,” arXiv preprint arXiv:1706.03825 (2017).
Fisher (1936) R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of eugenics 7, 179–188 (1936).
Campo and Pérez-López (2022) J. R. R. Campo and D. Pérez-López, “Reconfigurable activation functions in integrated optical neural networks,” IEEE Journal of Selected Topics in Quantum Electronics 28, 1–13 (2022).
Laporte, Dambre, and Bienstman (2019) F. Laporte, J. Dambre, and P. Bienstman, “Highly parallel simulation and optimization of photonic circuits in time and frequency domain based on the deep-learning framework pytorch,” Scientific reports 9, 5918 (2019).
