The Impact of Feature Representation
on the Accuracy of Photonic Neural Networks

Mauricio Gomes de Queiroz, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France
Paul Jimenez, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France
Raphael Cardoso, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France
Mateus Vidaletti Costa, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France; School of Engineering, RMIT University, Melbourne, VIC 3000, Australia
Mohab Abdalla, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France; School of Engineering, RMIT University, Melbourne, VIC 3000, Australia
Ian O’Connor, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France
Alberto Bosio, Ecole Centrale de Lyon, INSA Lyon, CNRS, Universite Claude Bernard Lyon 1, CPE Lyon, INL, UMR5270, 69130 Ecully, France
Fabio Pavanello, Univ. Grenoble Alpes, Univ. Savoie Mont Blanc, CNRS, Grenoble INP, CROMA, 38000, Grenoble, France
(June 28, 2024)
Abstract

Photonic Neural Networks (PNNs) are gaining significant interest in the research community due to their potential for high parallelization, low latency, and energy efficiency. PNNs compute using light, which leads to several differences in implementation when compared to electronics, such as the need to represent input features in the photonic domain before feeding them into the network. In this encoding process, it is common to combine multiple features into a single input to reduce the number of inputs and associated devices, leading to smaller and more energy-efficient PNNs. Although this alters the network’s handling of input data, its impact on PNNs remains understudied. This paper addresses this open question, investigating the effect of commonly used encoding strategies that combine features on the performance and learning capabilities of PNNs. Here, using the concept of feature importance, we develop a mathematical methodology for analyzing feature combination. Through this methodology, we demonstrate that encoding multiple features together in a single input determines their relative importance, thus limiting the network’s ability to learn from the data. Given some prior knowledge of the data, however, this can also be leveraged for higher accuracy. By selecting an optimal encoding method, we achieve up to a 12.3% improvement in accuracy of PNNs trained on the Iris dataset compared to other encoding techniques, surpassing the performance of networks where features are not combined. These findings highlight the importance of carefully choosing the encoding for the accuracy and decision-making strategies of PNNs, particularly in size- or power-constrained applications.

I Introduction

Figure 1: (a) Representation of a generic neural network. An arbitrary layer $l$ is highlighted. (b) Schematic of the photonic implementation of a neural network layer using meshes of Mach-Zehnder Interferometers (MZIs). (c) An MZI and its two phase shifters $(\phi, 2\theta)$ are illustrated alongside the transfer matrix representation of the transformation it performs over the field amplitude.

Artificial Intelligence (AI) systems have gained widespread relevance in recent years Dong, Wang, and Abbas (2021), finding diverse applications ranging from image classification Simonyan and Zisserman (2014) to speech recognition Purwins et al. (2019). These systems have traditionally been implemented on electronic hardware, benefiting from the steady performance improvements driven by the miniaturization of electronic integrated circuits. However, with components now shrinking to the atomic scale, the limitations of this platform are becoming apparent Theis and Wong (2017). At this size, for example, quantum effects may disrupt functionality Powell (2008), and the heat from densely packed devices becomes hard to dissipate Leiserson et al. (2020). In response, new technologies are being explored to enable further improvements in AI. These emerging technologies are often not subject to the same constraints as their electronic counterparts, and thus might offer more efficient alternatives for certain applications Waldrop (2016).

Photonic Neural Networks (PNNs) are hardware implementations of AI systems that perform computations on optical signals, rather than on electronic ones. Using light, they are able to leverage several of its properties to potentially enable high parallelization, low latency, and reduced power consumption Shen et al. (2017). For example, PNNs have been demonstrated to perform sub-nanosecond image classification Ashtiani, Geers, and Aflatouni (2022) and to achieve up to $10^{12}$ Multiply-Accumulate operations per second Xu et al. (2021). However, transitioning from electronics to photonics remains challenging. Practical applications of medium to large-scale systems are currently limited by the large physical footprint of photonic circuits Shibata et al. (2008); Xiao et al. (2021), their loss accumulation, and the high power consumption of some of their electro-optic devices Tait (2022).

One way of alleviating these issues is by optimizing circuits Mourgias-Alexandris et al. (2022); de Queiroz et al. (2023), or carefully designing PNNs to minimize circuit size. A common practice in the literature takes advantage of the complex representation of light (using amplitude and phase) to represent multiple features in a single input, thus combining multiple real-valued features into fewer complex-valued inputs. By using fewer inputs, a circuit requires fewer components and a smaller network, which leads to a reduction in overall footprint. Such a technique aligns well with the capabilities of photonic circuits, which are able to process complex inputs through complex transformations Zhang et al. (2021).

However, the way we represent features in Neural Networks (NNs) greatly influences the difficulty of the problems they solve. For example, in tasks with radial symmetry centered around the origin, opting for a polar coordinate system can emphasize the relevant feature relationships necessary for accurately solving the task, short-cutting the network’s need to learn it. This approach can significantly reduce the computational complexity required for achieving high accuracy. Moreover, the choice of feature representation also shapes the network’s approaches to solve tasks, as NNs tend to rely on the most straightforward cues available within the data Geirhos et al. (2020). This highlights the need to understand which feature relationships are emphasized by the representation strategies used in PNNs. By doing so, we can ensure that these networks not only achieve high accuracy, but also adopt desirable decision-making strategies.

In this paper, we explore the role of feature representation in the accuracy and decision-making strategies of PNNs. We investigate the common practice of combining various features into a single input, using eXplainable AI (XAI) methods to compare the relative importance of the combined features. To our knowledge, only one work has investigated this practice as a means of improving accuracy in PNNs Qiu et al. (2024). However, the consequences of the feature combination itself are still unknown, and different feature representations have not been explored. Our work tackles these open questions with a mathematical analysis of feature combination focused on photonic implementations, where networks and circuits are constrained by size. We point out how different data representations and hardware implementations can be exploited for higher accuracy and lower complexity, as well as the shortcomings of current solutions.

The rest of this paper is structured as follows: in Sections II and III we review the basics of photonic implementations of AI and feature importance metrics. In Section IV, we calculate the relative importance of features that share the same input. Sections V and VI discuss practical examples and simulations of Artificial Neural Networks (ANNs) and PNNs. Finally, Section VII concludes the discussions brought up in this paper.

II Photonic Neural Networks

In this section, we provide a review of ANNs and their photonic implementations. We also address common strategies of representing features in light, which will be used in our further discussions in Section IV.

II.1 Artificial Neural Networks

Artificial Neural Networks (ANNs), first proposed in the 1940s McCulloch and Pitts (1943), are mathematical functions loosely inspired by how the human brain processes information. These functions are known to be universal approximators Hornik, Stinchcombe, and White (1989), hence their ability to handle a wide variety of tasks. The network’s behavior, i.e. the way it processes inputs, is determined by its connection strengths (called “weights”) and non-linearities (referred to as “activation functions”) Sze et al. (2017). Typically, the weights are obtained through training, so that the ANN approximates a probability function associated with the given task. For instance, in classification tasks, ANNs are designed to assign a class to an input, by approximating a function that calculates the likelihood of belonging to each class Goodfellow, Bengio, and Courville (2016).

Consider a fully connected, feed-forward (FF) NN, consisting of $L$ layers and designed with $N$ inputs and $N$ outputs, as depicted in Fig. 1(a). The process by which a given layer $l$ transforms its inputs is described as:

$$\vec{z}^{\,(l)} = \mathbf{W}^{\,(l)} \cdot \vec{y}^{\,(l-1)} + \vec{b}^{\,(l)}\,, \tag{1}$$
$$\vec{y}^{\,(l)} = \sigma^{\,(l)}(\vec{z}^{\,(l)})\,. \tag{2}$$

Initially, inputs are combined through weighted sums by a weight matrix $\mathbf{W}^{\,(l)}$ to obtain $z^{\,(l)}$. Then, the element-wise application of an activation function $\sigma(\cdot)$ to $z^{\,(l)}$ introduces non-linearity and yields the output of the layer, where $y^{\,(0)}$ is the input of the network and $y^{\,(L)}$ the output. A bias $\vec{b}$ might be added before the activation function to allow the network to better adjust to the data. The entire network, from first to last layer, can be seen as a sequence of such transformations, written as $\vec{y}^{\,(L)} = f(\vec{y}^{\,(0)})$.
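As a minimal sketch, the layer transformation of Eqs. (1) and (2) and the chaining of layers can be written in plain Python. The helper names (`matvec`, `layer`, `forward`), the choice of `tanh` as activation, and all weight values are illustrative assumptions and are not taken from this work.

```python
import cmath

def matvec(W, y):
    """Matrix-vector product W . y for nested lists (real or complex)."""
    return [sum(w * v for w, v in zip(row, y)) for row in W]

def layer(W, b, y_prev, sigma):
    """One layer: weighted sum plus bias, Eq. (1), then activation, Eq. (2)."""
    z = [s + bias for s, bias in zip(matvec(W, y_prev), b)]  # Eq. (1)
    return [sigma(v) for v in z]                              # Eq. (2)

def forward(params, y0, sigma):
    """Chain all layers: y^(L) = f(y^(0)). `params` is a list of (W, b) pairs."""
    y = y0
    for W, b in params:
        y = layer(W, b, y, sigma)
    return y

# Hypothetical network with N = 2 inputs and L = 2 layers (illustrative values).
params = [
    ([[1.0, 0.5j], [-0.5j, 1.0]], [0.0, 0.1]),
    ([[0.3, -0.2], [0.2, 0.3]], [0.0, 0.0]),
]
y_out = forward(params, [1.0 + 0.0j, 0.5j], cmath.tanh)
```

The same code covers real-valued and complex-valued networks, since Python's arithmetic operators accept both kinds of numbers.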

Thus, ANNs implement input-output mappings that can be either real or complex. Real-Valued Neural Networks (RVNNs) are characterized by real parameters and inputs, with $f:\mathbb{R}^{N}\mapsto\mathbb{R}^{N}$. In these networks, each layer scales and combines inputs before non-linearly transforming them. Complex-Valued Neural Networks (CVNNs), on the other hand, operate in the complex domain, meaning that both the input vector and the network’s parameters are complex-valued and $f:\mathbb{C}^{N}\mapsto\mathbb{C}^{N}$ Bassey, Qian, and Li (2021). In that case, each layer has the ability to not only scale and combine, but also rotate inputs in the complex plane. This rotation, inherent to complex algebra, makes CVNNs more suitable for tasks where phase information is important, such as in audio processing Kim, Han, and Ko (2024) or optical communications Masaad et al. (2023).
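The scaling and rotation performed by a complex weight can be illustrated with a single multiplication. The weight and input values below are hypothetical and chosen only to make the effect easy to read.

```python
import cmath
import math

# Hypothetical complex weight with magnitude 2 and phase pi/2.
w = cmath.rect(2.0, math.pi / 2)
# Hypothetical input with amplitude 1 and phase pi/6.
y_in = cmath.rect(1.0, math.pi / 6)

y_out = w * y_in
amplitude, phase = cmath.polar(y_out)
# The amplitude is the product of the magnitudes (scaling),
# and the phase is the sum of the two phases (rotation).
```

A real-valued weight corresponds to the special case where the weight's phase is 0 or pi, so it can only scale the input or flip its sign.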

II.2 Photonic Implementations

Photonic computing is emerging as a promising approach to improve ANN implementations for specific applications by computing with light. This allows us to leverage its unique characteristics to potentially enable faster and more energy-efficient AI systems. For example, in the optical domain, linear transformations can be done passively Reck et al. (1994) and information can be easily parallelized and processed at high speeds Xu et al. (2021).

PNNs are implementations of ANNs through photonic inputs, components, and transformations Shastri et al. (2021). Although no single photonic component acts as an artificial neuron, a circuit can be designed to perform the mathematical operations of an ANN. This is achieved by using several components such as waveguides, interferometers, and modulators which guide and manipulate light signals. These circuits operate on complex signals and implement complex transformations, meaning that PNNs can act as RVNNs and CVNNs, depending on the task at hand.

Several PNN circuits have been proposed and demonstrated experimentally. They can be broadly categorized by how different inputs are distinguished, whether through spatial, wavelength, or time domains Bai et al. (2023).

In this study, we focus on PNNs that use spatial differentiation of inputs. These networks assign a separate input to each optical signal, and implement weight matrix multiplications by making different inputs interfere with each other. Most notably, this is achieved by using meshes of Mach-Zehnder Interferometers (MZIs) Reck et al. (1994); Clements et al. (2016). The interference, and hence the specific mathematical operation performed by the mesh, can be selected by adjusting the phase shifters found in these devices. Activation functions, on the other hand, can be implemented by using any of the devices and circuits that exhibit optical non-linearity Williamson et al. (2020); Jha, Huang, and Prucnal (2020). The schematics of an ANN implementation and an MZI are shown in Fig. 1(b) and Fig. 1(c), respectively.
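To make the role of the phase shifters concrete, the sketch below builds the transfer matrix of a single MZI from its parts. It assumes one common convention, which may differ from the one used in Fig. 1(c): ideal 50:50 couplers, an external phase shifter $\phi$ before the first coupler, and an internal phase shifter $2\theta$ between the couplers, both on the upper arm. The function names are illustrative.

```python
import cmath
import math

def matmul2(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mzi(theta, phi):
    """MZI transfer matrix under the assumed convention:
    external shifter (phi) -> coupler -> internal shifter (2*theta) -> coupler."""
    s = 1 / math.sqrt(2)
    coupler = [[s, 1j * s], [1j * s, s]]             # ideal 50:50 coupler
    internal = [[cmath.exp(2j * theta), 0], [0, 1]]  # phase 2*theta, upper arm
    external = [[cmath.exp(1j * phi), 0], [0, 1]]    # phase phi, upper arm
    return matmul2(coupler, matmul2(internal, matmul2(coupler, external)))

def is_unitary(T, tol=1e-12):
    """Check that T^dagger T equals the identity for a 2x2 matrix."""
    Td = [[T[j][i].conjugate() for j in range(2)] for i in range(2)]
    P = matmul2(Td, T)
    return all(abs(P[i][j] - (1 if i == j else 0)) < tol
               for i in range(2) for j in range(2))

T = mzi(theta=math.pi / 4, phi=0.0)  # theta = pi/4 splits power equally
```

Under this convention, `theta` sets how power is split between the two outputs and `phi` sets the relative phase of the upper input, so a mesh of such devices can be tuned to implement different weight matrices.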

If PNNs use coherent light inputs, they can be represented in the complex domain. In these networks, the $i^{\text{th}}$ input is characterized by an amplitude $A_i$ and phase $\phi_i$. Thus, the input vector can be expressed as $\vec{y}^{\,(0)} = \left[A_1 e^{i\phi_1}, \cdots, A_N e^{i\phi_N}\right]^{\intercal} \in \mathbb{C}^{N}$. Given the two degrees of freedom available for each input, feature encoding can be achieved using various methodologies. We divide common approaches found in the literature into two distinct groups: real and complex encoding.

Real encoding simplifies the input representation by encoding data solely in the amplitude of the optical signals, maintaining a uniform initial phase across all inputs (in practice having $\phi_i = 0 \,\forall\, i$ and thus $\vec{y}^{\,(0)} \in \mathbb{R}^{N}$). Several researchers employ this encoding method for its compatibility with RVNNs used in electronic computers Shen et al. (2017); Mojaver et al. (2023). It allows for an easy mapping of weights from electronically trained networks to photonic transformations. In these networks, while the nature of the transformations of individual MZIs is inherently complex, the overall behaviour can effectively be real-valued. Since no phase information is used, only the amplitude of the outputs is of interest, which simplifies the detection scheme. However, it is important to ensure that different inputs experience the same phase before reaching the network to maintain phase consistency, which might not be simple to achieve experimentally.

In contrast, complex encoding uses both amplitude and phase at the same time, having inputs that lie in the complex plane, that is $\vec{y}^{\,(0)} \in \mathbb{C}^{N}$. The transformations in the PNN in this case are complex, and thus, detection of both intensity and phase in the outputs might be used, adding to the electronic complexity of the circuit. In image classification tasks, for example, real-valued input images can be transformed into Fourier space representation to obtain phase and amplitude information Banerjee, Nikdast, and Chakrabarty (2023); Hamerly, Bandyopadhyay, and Englund (2022); Wang et al. (2022), or have different sections mapped to the real and imaginary parts of complex numbers Fang et al. (2019); Qiu et al. (2024), which halves the number of inputs.
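The difference in input count between the two groups can be sketched as follows. The pairing rule used in `complex_encoding` (first half of the features to real parts, second half to imaginary parts) is an assumption made for illustration; the cited works may pair features differently. The feature values are arbitrary.

```python
def real_encoding(features):
    """Amplitude-only encoding: one real-valued input per feature (phase 0)."""
    return [complex(x, 0.0) for x in features]

def complex_encoding(features):
    """Map the first half of the features to real parts and the second half
    to imaginary parts, halving the number of inputs (assumes an even count)."""
    half = len(features) // 2
    return [complex(re, im) for re, im in zip(features[:half], features[half:])]

features = [5.1, 3.5, 1.4, 0.2]           # four illustrative real features
inputs_real = real_encoding(features)     # 4 inputs
inputs_cplx = complex_encoding(features)  # 2 inputs
```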

The encoding choice for PNNs influences the network’s behaviour, the type of information that is detected at the output, and the overall size of the circuit, as it may imply the use of additional peripheral devices. Beyond hardware specifications, this choice might also impact how features are processed within the network. When two features share the same input, the network may process them differently from the way they would be processed individually. Understanding these dynamics is crucial for optimizing PNN performance.

III Feature Importance

In this section, we look to the field of XAI for methods of evaluating feature importance in ANNs, to later study the impact of combining features in PNNs. We focus on gradient-based techniques, particularly sensitivity analysis.

ANNs, especially those with several layers, are highly non-linear models that use numerous parameters. This complexity often leads them to be regarded as opaque or “black-box” systems, since their decision-making processes are difficult to grasp intuitively. That is, while we can mathematically describe how a given output is obtained, it is difficult to specify “why” with an intuitive explanation.

Nonetheless, being able to explain the decision-making strategies of a model has a number of practical applications. Clear explanations can, for instance, enhance our understanding of a problem or be used to demonstrate fair treatment. In photonics research, XAI is currently used to explain the inverse design of circuits Jia et al. (2023), or to aid in the description of physical models Yeung et al. (2020). The concept of “explainability” is still the subject of ongoing debate Lipton (2018); Miller (2019) and, consequently, a variety of methods have been proposed to attain it Samek et al. (2021); Montavon, Samek, and Müller (2018). Highlighting which input features an ANN considers important is a common way to explain its outputs. Several methods estimate such feature importance; among them, we focus on sensitivity analysis.

Sensitivity analysis quantifies feature importance by examining how sensitive the output of the model is to small variations in each feature Sung (1998); Fu and Chen (1993); Dimopoulos et al. (1999). The underlying principle is that if small changes in an input lead to significant changes in the output, then that input is likely to be important for the network, i.e. it contributes to the prediction of this output. In this case, the importance of the $i^{\text{th}}$ input, $y^{\,(0)}_{i}$, to the $c^{\text{th}}$ output of the network, $y^{\,(L)}_{c}$, is denoted by $R_{i\to c}$:

$$R_{i\to c} = \left|\frac{\partial y^{\,(L)}_{c}}{\partial y^{\,(0)}_{i}}\right|\,. \tag{3}$$
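Eq. (3) can be estimated numerically without access to the internals of the model. The sketch below uses central finite differences instead of back-propagation, which is a substitution made to keep the example self-contained. The model `toy_network` and its coefficients are hypothetical.

```python
import math

def sensitivity(f, y0, i, c, h=1e-6):
    """Central finite-difference estimate of R_{i->c} = |d y_c / d y_i|, Eq. (3)."""
    up = list(y0)
    up[i] += h
    down = list(y0)
    down[i] -= h
    return abs((f(up)[c] - f(down)[c]) / (2 * h))

def toy_network(y):
    """Hypothetical real-valued model with a single output that depends
    strongly on y[0] and weakly on y[1]."""
    return [math.tanh(2.0 * y[0] + 0.1 * y[1])]

y0 = [0.1, 0.3]
R0 = sensitivity(toy_network, y0, i=0, c=0)
R1 = sensitivity(toy_network, y0, i=1, c=0)
# R0 / R1 is close to 2.0 / 0.1 = 20: the first input matters more here.
```

Note that the estimate is local: it describes the importance of each input around the specific point `y0`, and may change for other inputs.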

Gradient-based explanations are frequently used in image classification tasks to generate saliency maps Simonyan, Vedaldi, and Zisserman (2014), and also show fair performance in matching feature importance in simulated data Olden, Joy, and Death (2004). Over time, other methods have built upon sensitivity analysis, addressing some of its drawbacks by suggesting additional ways of estimating feature importance Ancona et al. (2017). For example, adding Gaussian noise to the input and averaging the resulting gradients helps generate more consistent saliency maps Smilkov et al. (2017). These techniques are often easy to implement, given that the necessary partial derivatives can be computed through back-propagation.
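The noise-averaging idea can be sketched in a few lines. This is a simplified illustration in the spirit of the method cited above, not a reproduction of it: gradients are estimated by finite differences, and the noise level, sample count, seed, and the model `toy_model` are all assumptions.

```python
import math
import random

def gradient(f, y, i, h=1e-6):
    """Central finite-difference estimate of d f / d y_i for a scalar f."""
    up = list(y)
    up[i] += h
    down = list(y)
    down[i] -= h
    return (f(up) - f(down)) / (2 * h)

def smoothed_sensitivity(f, y, i, sigma=0.1, samples=200, seed=0):
    """Average the gradient over Gaussian-perturbed copies of the input,
    then take the magnitude of the average."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    total = 0.0
    for _ in range(samples):
        noisy = [v + rng.gauss(0.0, sigma) for v in y]
        total += gradient(f, noisy, i)
    return abs(total / samples)

def toy_model(y):
    """Hypothetical scalar model used only for illustration."""
    return math.tanh(2.0 * y[0] + 0.1 * y[1])

smoothed = smoothed_sensitivity(toy_model, [0.1, 0.3], i=0)
```

For a linear model the gradient is the same everywhere, so averaging changes nothing; the smoothing only matters where the gradient varies quickly around the input.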

IV Analytical Derivation of Relative Importance

Here, we use the concepts elaborated in previous sections to investigate how the importance of features is shaped in PNNs. Initially, we employ the sensitivity analysis shown in Section III to obtain the importance of an arbitrary feature encoded in one input. Then, we introduce the concept of encoding functions to describe the different feature encoding processes and representations in photonics, seen in Section II.

Consider a set of features $\mathcal{X} = \{x_1, \cdots, x_n\} \subset \mathbb{R}$. Assume that we want all of the elements in $\mathcal{X}$ to be used by our model. However, due to either a prohibitively large quantity of features or size restrictions on our network, we also wish to use a number of inputs that is less than the number of elements in $\mathcal{X}$. To achieve both objectives, we combine features into complex inputs, as seen in Section II. In this context, we calculate the relative importance of such combined features, to understand what relationships are highlighted by our inputs.

Figure 2: (a) and (b) show photonic circuits that implement encoding functions. By modulating the phase shifters to the indicated value, one can achieve functions similar to exponential encoding (a) or linear encoding (b). (c) Importance of a feature $x_j$ when $r_p = |x_j|$, which is equal to moving $P$ alongside the $x_j$ axis. (d) Mean accuracy of $100$ ANNs trained with several encoding functions. “Independent” refers to the situation where features are not combined, and “engineered” to the implementation of Eq. (16).

Our first objective is to obtain the importance of an arbitrary feature, say $x_j \in \mathcal{X}$, with respect to an arbitrary output of the model. This feature is represented only in input $y^{\,(0)}_{i}$, and its importance is assessed in relation to the $c^{\text{th}}$ output of the network. Applying the chain rule to Eq. (3), we write this importance $R_{j\to c}$ as:

$$R_{j\to c} = \left|\frac{\partial y^{\,(L)}_{c}}{\partial x_{j}}\right| = \left|\frac{\partial y^{\,(L)}_{c}}{\partial y_{i}^{\,(0)}}\frac{\partial y_{i}^{\,(0)}}{\partial x_{j}}\right|\,. \tag{4}$$

The partial derivative $\partial y_{i}^{\,(0)} / \partial x_{j}$ of Eq. (4) relates to the way $x_j$ is represented in the input. The process of creating an input from elements of $\mathcal{X}$ is what we term feature encoding. An input $y_{i}^{\,(0)}$ obtained from the feature $x_j$ is hence written as $y_{i}^{\,(0)} = g_{i}(x_{j})$, where $g_i$ is the encoding function for the $i^{\text{th}}$ input. Considering the encoding process as such, we can write:

$$R_{j\to c} = \left|\frac{\partial y^{\,(L)}_{c}}{\partial y_{i}^{\,(0)}}\frac{\partial g_{i}(x_{j})}{\partial x_{j}}\right|\,. \tag{5}$$

Notice how the importance depends on both the network, represented in the derivative from output to input, and the feature encoding process, given the presence of the encoding function. The modulus operation ensures that the importance is always non-negative and real-valued. Here, we assume the network to be differentiable in the vicinity of the current input, which might not be the case for some CVNN architectures.

Next, we examine the scenario where two arbitrary features, $x_j$ and $x_k$, are represented using a single input $y_{i}^{\,(0)}$, comparing their relevance. We define the relative importance between $x_j$ and $x_k$ to the $c^{\text{th}}$ output, $R_{j,k\to c}$, as the following ratio:

$$R_{j,k\to c} = \frac{R_{j\to c}}{R_{k\to c}} = \left|\frac{\dfrac{\partial g_{i}(x_{j},x_{k})}{\partial x_{j}}}{\dfrac{\partial g_{i}(x_{j},x_{k})}{\partial x_{k}}}\right|\,. \tag{6}$$

We see that the component of Eq. (4) related to the network is canceled, leaving only the derivatives of the encoding function. Thus, $R_{j,k\to c}$ is solely determined by the way features are encoded into $y_i$, and hence it is independent of the considered output. To simplify the notation, we drop the subscript indicating the output for the rest of this paper. One of the consequences of Eq. (6) is that the encoding function chosen to combine $x_j$ and $x_k$ defines how these features are perceived by the model relative to one another.

Although encoding functions are a method of pre-processing features, in the context of PNNs they can also be implemented in hardware. The incorporation of encoding functions in the circuit is particularly interesting for low-latency applications, as the speed at which inputs are transformed and combined would be limited only by the reconfigurability of the driving electronics. We now explore two types of complex encoding functions to see how they dictate relative feature importances. We also point out how they could be implemented in hardware.

IV.1 Exponential Encoding

Since we are dealing with complex-valued inputs, one intuitive way to encode two features $x_j$ and $x_k$ into a single input would be to encode $x_j$ in its amplitude and $x_k$ in its phase. This encoding function can be written as:

$$g(x_{j},x_{k})=x_{j}e^{ix_{k}}\,. \tag{7}$$

The relative importance between these two features is calculated as:

$$R_{j,k\to c}=\left|\frac{e^{ix_{k}}}{ix_{j}e^{ix_{k}}}\right|=\frac{1}{\left|x_{j}\right|}\,. \tag{8}$$

In this case, the relative importance between the two features is dynamic, establishing an amplitude-dependent relation between the importances of amplitude and phase.
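Eq. (8) can be checked numerically. The following is a minimal sketch in plain Python that estimates both partial derivatives of Eq. (7) with central finite differences. The helper names `exponential_encoding` and `relative_importance`, the step size `h`, and the sample values are illustrative assumptions, not part of the original code base.

```python
import cmath


def exponential_encoding(xj: float, xk: float) -> complex:
    """Eq. (7): x_j sets the amplitude, x_k sets the phase."""
    return xj * cmath.exp(1j * xk)


def relative_importance(g, xj: float, xk: float, h: float = 1e-6) -> float:
    """Estimate Eq. (6), |dg/dx_j| / |dg/dx_k|, by central differences."""
    dg_dxj = (g(xj + h, xk) - g(xj - h, xk)) / (2 * h)
    dg_dxk = (g(xj, xk + h) - g(xj, xk - h)) / (2 * h)
    return abs(dg_dxj) / abs(dg_dxk)


# Illustrative amplitudes; the phase feature is held fixed at 0.3.
for xj in (0.25, 0.5, 1.0, 2.0):
    r = relative_importance(exponential_encoding, xj, 0.3)
    print(f"x_j = {xj:4.2f}  R_jk = {r:.4f}  1/|x_j| = {1 / abs(xj):.4f}")
```

In this sketch, small amplitudes make the amplitude feature dominate, while large amplitudes shift the importance toward the phase feature.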

A hardware version of an exponential encoding function is shown in Fig. 2(a), where a balanced MZI and a phase shifter are used to modulate the amplitude and phase of an input, respectively. The encodings and importances are not exactly the same as those of Eq. (7), since this amplitude modulation scheme is mediated by a sine function: it implements $g(x_j,x_k)=i\sin(x_j)\exp(ix_k)$. Here, Eq. (7) can be achieved, up to a global phase shift, by mapping $x_j$ to $\arcsin(x_j)$.
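The pre-mapping described above can be sketched as follows. The code assumes the transfer function quoted in the text and an amplitude feature restricted to $[-1, 1]$ so that the arcsine is defined; the function names and sample values are hypothetical.

```python
import cmath
import math


def ideal_exponential(xj: float, xk: float) -> complex:
    """Eq. (7)."""
    return xj * cmath.exp(1j * xk)


def hardware_exponential(theta: float, phi: float) -> complex:
    """Transfer function quoted in the text: i*sin(theta)*exp(i*phi)."""
    return 1j * math.sin(theta) * cmath.exp(1j * phi)


def drive_mzi(xj: float) -> float:
    """Pre-map the amplitude feature; requires -1 <= x_j <= 1."""
    return math.asin(xj)


xj, xk = 0.6, 1.2  # illustrative values
encoded = hardware_exponential(drive_mzi(xj), xk)
# The quotient is the global phase factor, here i (a pi/2 phase shift).
print(encoded / ideal_exponential(xj, xk))
```

The quotient does not depend on the inputs, which is what makes the phase shift global.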

IV.2 Linear Encoding

Another way to combine two features is by encoding one in the real part and the other in the imaginary part of a complex input. This can be represented by the function:

$$g(x_{j},x_{k})=x_{j}+ix_{k}\,. \tag{9}$$

Here, we find their relative importance to be:

$$R_{j,k}=\frac{1}{\left|i\right|}=1\,, \tag{10}$$

thus indicating that both features will be considered to have the same importance for the network. Since they are independent from the weights of the network, their relative importance cannot be unlearned, i.e. it cannot be modified by further training. This might pose problems when the chosen encoding leads to relative importances that do not match the data.
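The independence from the weights can be illustrated with a toy example. The sketch below passes the linear encoding of Eq. (9) through a hypothetical one-input "network" with two complex weights and measures the ratio of output sensitivities. The toy network is holomorphic by construction, an assumption made so that the chain rule applies in its simplest form; it is not the PNN architecture studied later.

```python
import cmath
import random


def linear_encoding(xj: float, xk: float) -> complex:
    """Eq. (9): x_j on the real axis, x_k on the imaginary axis."""
    return complex(xj, xk)


def toy_network(y: complex, w1: complex, w2: complex) -> complex:
    """Hypothetical one-input stand-in for a network (holomorphic in y)."""
    return w2 * cmath.tanh(w1 * y)


def output_sensitivity_ratio(w1, w2, xj, xk, h=1e-6):
    """|d out/d x_j| / |d out/d x_k| through encoding and toy network."""
    def out(a, b):
        return toy_network(linear_encoding(a, b), w1, w2)
    d_j = (out(xj + h, xk) - out(xj - h, xk)) / (2 * h)
    d_k = (out(xj, xk + h) - out(xj, xk - h)) / (2 * h)
    return abs(d_j) / abs(d_k)


rng = random.Random(0)
for _ in range(3):
    # Random complex "weights", as if sampled at different training stages.
    w1 = complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
    w2 = complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
    print(output_sensitivity_ratio(w1, w2, xj=0.4, xk=-0.7))
```

Whatever weights are drawn, the ratio stays at the value fixed by the encoding, which is why training cannot change it.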

An encoding function similar to that of Eq. (9), implemented in hardware, can be seen in Fig. 2(b). There, two MZIs are used as amplitude modulators, while one of their outputs has its phase shifted by $\pi/2$ to encode the respective input in the imaginary axis. In that case, $g(x_j,x_k)=i(\sin(x_j)+i\sin(x_k))$. Eq. (9) can be achieved, up to a global phase shift, by mapping $x_j$ and $x_k$ to $\arcsin(x_j)$ and $\arcsin(x_k)$.
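For reference, applying Eq. (6) directly to the two hardware transfer functions quoted in this section, without the arcsine pre-mapping, gives the relative importances below. This short derivation is added for illustration and uses only the expressions given in the text; the labels $g_{\exp}$ and $g_{\text{lin}}$ are introduced here for convenience.

```latex
\begin{align*}
g_{\exp}(x_j,x_k) = i\sin(x_j)\,e^{ix_k}
  \;&\Rightarrow\;
  R_{j,k} = \left|\frac{i\cos(x_j)\,e^{ix_k}}{-\sin(x_j)\,e^{ix_k}}\right|
          = \left|\cot(x_j)\right| ,\\[4pt]
g_{\text{lin}}(x_j,x_k) = i\left(\sin(x_j)+i\sin(x_k)\right)
  \;&\Rightarrow\;
  R_{j,k} = \left|\frac{i\cos(x_j)}{-\cos(x_k)}\right|
          = \left|\frac{\cos(x_j)}{\cos(x_k)}\right| .
\end{align*}
```

For small arguments these expressions approach $1/|x_j|$ and $1$, respectively, in line with Eqs. (8) and (10).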

V On the Impact of Encoding Functions to ANNs

In this section, we address the practical implications of the discussions brought up in Section IV. Here, our objective is to demonstrate how a well-engineered encoding function can significantly improve the accuracy of an ANN on a test task. We begin by defining such a task and studying the relative feature importances found in a solution to it. We then create an encoding function that reproduces these importances in trained ANNs, and finally compare its use against that of other encoding functions.

Consider a simple classification problem with a known solution in the real domain: determining whether points lie inside or outside an $n$-sphere. An $n$-sphere is the generalization of a circle to $n+1$ dimensions, similar to how hyperplanes generalize planes. It is defined by a set of points $S^{(n)}$ that are equidistant from a central point $c_0=(c_1,\cdots,c_{n+1})$ by a radius $r_0$. The distance of a point $P=(x_1,\cdots,x_{n+1})$ to $c_0$ is:

$$r_{p}(x_{1},\ldots,x_{n+1})=\sqrt{\sum_{i=1}^{n+1}(x_{i}-c_{i})^{2}}\,. \tag{11}$$

Naturally, $P$ is considered outside of the $n$-sphere if $r_p$ exceeds $r_0$, and inside otherwise. In this context, a mathematical model that outputs a probability of $P$ being outside of $S^{(n)}$ can be constructed using a logistic function. The logistic function $\sigma(x)=1/(1+e^{-x})$ is bounded between 0 and 1 with a smooth sigmoid transition, and is typically used in binary classification problems. Given the coordinates of $P$, this model can be expressed as:

$$y=f(x_{1},\ldots,x_{n+1})=\sigma\left(r_{p}-r_{0}\right)\,. \tag{12}$$

Here, $y$ represents the probability that $r_p>r_0$ given the coordinates of $P$. When $y=0.5$, Eq. (12) delineates the boundary defined by $S^{(n)}$, allowing for accurate classification of points based on this threshold.
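Eqs. (11) and (12) translate directly into code. The following minimal sketch uses hypothetical function names and illustrative points; the centre and radius match the 3-sphere used in the experiments of this section.

```python
import math


def distance(point, center):
    """Eq. (11): Euclidean distance from the point P to the centre c_0."""
    return math.sqrt(sum((x - c) ** 2 for x, c in zip(point, center)))


def prob_outside(point, center, r0):
    """Eq. (12): logistic model of the probability that r_p > r_0."""
    return 1.0 / (1.0 + math.exp(-(distance(point, center) - r0)))


def is_outside(point, center, r0):
    """Threshold the model at y = 0.5, i.e. at the n-sphere boundary."""
    return prob_outside(point, center, r0) > 0.5


center, r0 = (0.0, 0.0, 0.0, 0.0), 1.0
print(prob_outside((1.0, 0.0, 0.0, 0.0), center, r0))  # point on the boundary
print(is_outside((0.2, 0.1, 0.0, 0.3), center, r0))    # point near the centre
print(is_outside((1.5, 1.0, 0.0, 0.0), center, r0))    # point far from it
```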

Since Eq. (12) can be used to accurately classify any point $P$, we conjecture that its relative feature importances are desirable to other models that wish to do it as well. Thus, we examine the sensitivity of $y$ to an arbitrary feature $x_j$, which can be calculated according to Eq. (4) as:

$$\begin{split}R_{j}&=\left|\frac{\partial y}{\partial x_{j}}\right|\\ &=\left|\sigma(r_{p}-r_{0})\,(1-\sigma(r_{p}-r_{0}))\,\frac{x_{j}-c_{j}}{r_{p}}\right|\quad\forall\,r_{p}\neq 0\,.\end{split} \tag{13}$$

As can be seen in Fig. 2(c), the importance of a feature peaks when it is at the $n$-sphere’s boundary. Exactly at that point, small variations in $x_j$ cause the largest deviations of the probability of $P$ being outside of $S^{(n)}$. The relative feature importance between two features $x_j$ and $x_k$ is:

$$R_{j,k}=\left|\frac{x_{j}-c_{j}}{x_{k}-c_{k}}\right|\,. \tag{14}$$

Figure 3: (a) Class distribution of normalized features of the Iris dataset. (b) Schematic of the process of combining features utilized in the training of PNNs.
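Returning to the $n$-sphere model, Eqs. (13) and (14) can be evaluated with a few lines of Python. This is a sketch with hypothetical function names and an illustrative point; it relies on `math.dist`, available from Python 3.8.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def importance(point, center, r0, j):
    """Eq. (13): |dy/dx_j| for the logistic n-sphere model (r_p != 0)."""
    rp = math.dist(point, center)
    s = sigmoid(rp - r0)
    return abs(s * (1.0 - s) * (point[j] - center[j]) / rp)


def relative_importance(point, center, r0, j, k):
    """Ratio of two importances; the common factors cancel as in Eq. (14)."""
    return importance(point, center, r0, j) / importance(point, center, r0, k)


center, r0 = (0.0, 0.0, 0.0, 0.0), 1.0
point = (0.8, -0.4, 0.3, 0.1)  # illustrative point
print(relative_importance(point, center, r0, 0, 1))          # from Eq. (13)
print(abs((point[0] - center[0]) / (point[1] - center[1])))  # Eq. (14)
```

Along a coordinate axis, `importance` is largest for the point lying on the sphere itself, consistent with the peak described for Fig. 2(c).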

Now assume that, as in Section IV, we are constrained by size, and thus wish to combine different features to reduce the number of inputs. However, for this particular example (and by design), we have prior knowledge of the relative importances of features before combining them. Taking advantage of this, we propose an encoding function $g(x_j,x_k)$ that achieves the desired reduction in dimensionality while preserving the relationships given by Eq. (14). Given that, after combining features, their relative importances should follow Eq. (6), we can obtain one such $g(x_j,x_k)$ by solving the following system of partial derivatives:

$$\left\{\begin{matrix}\left|\frac{\partial g(x_{j},x_{k})}{\partial x_{j}}\right|=\left|x_{j}-c_{j}\right|\,,\\ \\ \left|\frac{\partial g(x_{j},x_{k})}{\partial x_{k}}\right|=\left|x_{k}-c_{k}\right|\,.\end{matrix}\right. \tag{15}$$

This system can lead to one of many solutions of the form

$$g(x_{j},x_{k})=\frac{1}{2}(x_{j}^{2}+x_{k}^{2})-c_{j}x_{j}-c_{k}x_{k}+C\,, \tag{16}$$

where C is a constant.
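As a sanity check, the partial derivatives of Eq. (16) can be compared numerically against the system in Eq. (15). The sketch below uses central finite differences; the function names, the keyword `const` standing for $C$, and the sample values are illustrative assumptions.

```python
def engineered_encoding(xj, xk, cj=0.0, ck=0.0, const=0.0):
    """Eq. (16), with the integration constant C exposed as `const`."""
    return 0.5 * (xj ** 2 + xk ** 2) - cj * xj - ck * xk + const


def partials(g, xj, xk, h=1e-6, **kw):
    """Central-difference estimates of dg/dx_j and dg/dx_k."""
    d_j = (g(xj + h, xk, **kw) - g(xj - h, xk, **kw)) / (2 * h)
    d_k = (g(xj, xk + h, **kw) - g(xj, xk - h, **kw)) / (2 * h)
    return d_j, d_k


xj, xk, cj, ck = 1.3, -0.4, 0.5, 0.2   # illustrative values
d_j, d_k = partials(engineered_encoding, xj, xk, cj=cj, ck=ck)
print(abs(d_j), abs(xj - cj))                      # first line of Eq. (15)
print(abs(d_k), abs(xk - ck))                      # second line of Eq. (15)
print(abs(d_j / d_k), abs((xj - cj) / (xk - ck)))  # Eq. (14)
```

The constant $C$ drops out of both derivatives, so it does not affect the relative importance.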

In the same manner, we can obtain other functions that express different relative importances. A constant $R_{j,k}=1$, for instance, is achieved by using $g(x_j,x_k)=(x_j+x_k)^n$ $\forall\,n\in\mathbb{R}\setminus\{0\}$. Alternatively, $g(x_j,x_k)=(x_j\times x_k)^n$ $\forall\,n\in\mathbb{R}\setminus\{0\}$ leads to $R_{j,k}=|x_k/x_j|$, which is the inverse of Eq. (14) when $c_j$ and $c_k$ are zero. To compare the use of these encoding functions, we trained several ANNs, benchmarking them against networks that do not combine inputs (called “independent” here). We are particularly interested in the performance of Eq. (16), which we call the “engineered” encoding function. The description of the training procedures is as follows.

A dataset of $1000$ points in 4 dimensions was created, where each coordinate value was randomly chosen between $-2$ and $2$. Points were labeled as either inside or outside of a $3$-sphere $S^{(3)}$ of radius $1$, centered at the origin $c_0=(0,0,0,0)$, according to their position. In order to obtain a balanced dataset, we generated the same number of points inside and outside of the sphere. The networks trained to solve this task were composed of an input layer containing either $2$ or $4$ neurons (depending on whether features are combined), a hidden layer of $6$ neurons, and an output layer with a single neuron. A logistic activation function $\sigma$ was used for every layer. Each encoding function was used to train $100$ different networks, thus accounting for the random initialization of weights and random shuffling of the dataset prior to training. The networks were trained on $70\%$ of the available data for $100$ epochs with a learning rate of $0.001$, and tested on the remaining data.
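One possible way to build such a dataset is sketched below. The text does not state how the classes were balanced, so the rejection sampling used here is an assumption, as are the function name, the seed, and the uniform sampling of coordinates.

```python
import random


def make_balanced_dataset(n_points=1000, dim=4, low=-2.0, high=2.0,
                          radius=1.0, seed=0):
    """Return (points, labels) with equal counts inside (0) and outside (1)."""
    rng = random.Random(seed)
    per_class = n_points // 2
    inside, outside = [], []
    while len(inside) < per_class or len(outside) < per_class:
        p = tuple(rng.uniform(low, high) for _ in range(dim))
        # The sphere is centred at the origin, so r_p^2 is a sum of squares.
        is_out = sum(x * x for x in p) > radius ** 2
        bucket = outside if is_out else inside
        if len(bucket) < per_class:       # reject once a class is full
            bucket.append(p)
    data = [(p, 0) for p in inside] + [(p, 1) for p in outside]
    rng.shuffle(data)
    points, labels = zip(*data)
    return list(points), list(labels)


points, labels = make_balanced_dataset()
split = (7 * len(points)) // 10           # 70 % train, 30 % test
train = (points[:split], labels[:split])
test = (points[split:], labels[split:])
print(len(train[0]), len(test[0]), sum(labels))
```

Only a small fraction of uniformly drawn points falls inside the unit 3-sphere, so most draws are rejected once the outside class is full; this is why some form of balancing is needed.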

The results of these experiments are shown in Fig. 2(d). We notice that some representations can render the task harder to solve, while others maintain, to some extent, the accuracy achieved by the use of independent inputs. The engineered encoding function in Eq. (16) outperformed all others. With this example, we show that the way we combine features plays a role in the accuracy of ANNs. Given prior knowledge on how features relate to one another, which may come from domain-specific knowledge or from inspecting the data (noticing symmetries or class distributions, for example), we could estimate relative feature importances and obtain an encoding function that aligns with them. Combining features with said encoding function could improve the network performance.

VI Application of Encoding Functions in PNNs

In this section, we return to the main subject of this study and explore the use of different encoding functions in PNNs trained on the Iris dataset Fisher (1936), a standard benchmark for classification algorithms. Our goal is to show how carefully chosen encoding functions might lead to higher accuracies in PNNs. To this end, we compare the performance of several encoding functions by means of simulations of PNNs, which differ significantly from the ANNs of the previous section in terms of their complex-valued inputs and transformations.

The Iris flower classification task involves categorizing three different Iris species (Setosa, Versicolour, and Virginica) based on four features: the lengths and widths of sepals and petals. The dataset, consisting of $150$ labeled data points, has considerable class overlaps, such that no single feature alone can distinguish all species, making this an ideal candidate for our experiments. Visualizations of feature distributions and class overlaps are presented in Fig. 3(a).

Figure 4: (a) Accuracy of PNNs for different encoding functions. “independent” represents the accuracy of a PNN trained without combining features. (b) Average training loss of the trained PNNs per epoch. The color scheme is the same as the one defined in (a).

Our experimental design involves training PNNs by combining features in pairs as illustrated in Fig. 3(b). We assess their performance by averaging the accuracy of $100$ trained PNNs, benchmarking them against a PNN that does not combine features. This sample size was chosen to allow for convergence in the average values obtained for accuracy, given the variability in the training process. The architecture of the PNNs consists of a single hidden layer with $6$ neurons. Depending on the configuration, the number of input neurons varies between $3$ (when combining features) and $5$ (when using independent inputs), where one input acts as a bias for both configurations. The output layer has $3$ neurons, matching the number of classes. All configurations use the same underlying circuit, where NN layers are implemented using meshes of MZIs with trainable phase shifters, as represented in Fig. 1(b). Every layer is followed by a softplus activation function, which can be implemented in integrated photonic circuits Campo and Pérez-López (2022). Although its hardware implementation would change both the modulus and the phase of the signals, we model it by applying $\mathrm{softplus}(x)=\log(1+\exp(x))$ solely to the modulus of the complex numbers Banerjee, Nikdast, and Chakrabarty (2023). This approach allows us to simulate the gain and activation behavior while simplifying the model by avoiding additional phase changes. These phase changes can make the simulation and training more challenging and are less critical to the primary function of the softplus activation in this context.
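The modulus-only activation can be sketched in plain Python as follows. That the phase is passed through unchanged is our reading of "solely to the modulus"; the function names and the sample value are hypothetical, and this is not the Photontorch implementation.

```python
import cmath
import math


def softplus(x: float) -> float:
    """softplus(x) = log(1 + exp(x)), written to stay stable for large x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def modulus_softplus(z: complex) -> complex:
    """Apply softplus to |z| and leave the phase of z unchanged."""
    return cmath.rect(softplus(abs(z)), cmath.phase(z))


z = complex(0.3, -0.4)                    # illustrative field amplitude
out = modulus_softplus(z)
print(abs(out), math.log(1 + math.exp(abs(z))))   # same modulus
print(cmath.phase(out), cmath.phase(z))           # same phase
```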

The circuits were simulated using the Photontorch Python package Laporte, Dambre, and Bienstman (2019). The simulations were performed under ideal conditions, excluding noise and component imperfections. They were trained for $300$ epochs on $70\%$ of the dataset, reserving the remaining $30\%$ for testing. The dataset was divided into five shuffled batches per epoch to enhance training stability. A softmax function was used to convert the output light intensity values into class probabilities Goodfellow, Bengio, and Courville (2016). Weight updates were performed using a cross-entropy loss function combined with stochastic gradient descent. The initial learning rate was set at $0.01$ and adjusted at learning plateaus. While higher accuracies may be achieved by further optimizing the training process to each specific case, we opted for a constant training procedure across all circuits to isolate the effects of different encoding functions.
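The readout and loss described above can be illustrated with a short sketch. It assumes that the detected intensity at each output is the squared modulus of the complex field; the function names and the example fields are hypothetical and independent of Photontorch.

```python
import math


def intensities(fields):
    """Detected optical power |z|^2 at each output port (assumed readout)."""
    return [abs(z) ** 2 for z in fields]


def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[target_index])


# Illustrative complex output fields for the three Iris classes.
fields = [0.9 + 0.2j, 0.1 - 0.3j, -0.2 + 0.1j]
probs = softmax(intensities(fields))
print(probs, cross_entropy(probs, target_index=0))
```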

Here, we investigate the use of the encoding functions detailed in Section II: linear and exponential encoding. Given the anisotropic nature of the Iris classification task, unlike the $n$-sphere problem, we also consider which features to combine. To explore the impacts of this choice on the obtained accuracy, we use two combination strategies for features: grouping by the lengths and widths $(l/w)$ or by petal and sepal information $(p/s)$. We benchmarked the performance of PNNs using different encoding functions and groupings of features against the independent case, where features were not combined.
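The two grouping strategies and the two encodings can be combined as in the sketch below. The exact pairings, and which feature of a pair takes the first slot of the encoding, are assumptions made for illustration: $(p/s)$ is taken to pair the two petal features and the two sepal features, and $(l/w)$ to pair the two lengths and the two widths. The sample values are not taken from the dataset.

```python
import cmath

FEATURES = ("sepal_length", "sepal_width", "petal_length", "petal_width")

# Assumed pairings; the order inside each pair is also an assumption.
GROUPINGS = {
    "p/s": (("petal_length", "petal_width"), ("sepal_length", "sepal_width")),
    "l/w": (("sepal_length", "petal_length"), ("sepal_width", "petal_width")),
}

ENCODINGS = {
    "linear": lambda a, b: complex(a, b),               # Eq. (9)
    "exponential": lambda a, b: a * cmath.exp(1j * b),  # Eq. (7)
}


def encode_sample(sample, grouping, encoding):
    """Map four real features to two complex inputs."""
    g = ENCODINGS[encoding]
    return [g(sample[a], sample[b]) for a, b in GROUPINGS[grouping]]


# Illustrative normalized sample, not taken from the dataset.
sample = dict(zip(FEATURES, (0.22, 0.63, 0.07, 0.04)))
for grouping in GROUPINGS:
    for encoding in ENCODINGS:
        print(grouping, encoding, encode_sample(sample, grouping, encoding))
```

Each of the four grouping and encoding combinations halves the number of feature inputs from four to two, to which the bias input described earlier is then added.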

The results of these experiments, shown in Fig. 4, are summarized as follows. Exponential encoding exhibited the lowest performance, falling up to $11\%$ in mean accuracy compared to the independent benchmark. In contrast, linear encoding, commonly used in the photonics community Banerjee, Nikdast, and Chakrabarty (2023); Hamerly, Bandyopadhyay, and Englund (2022); Wang et al. (2022); Fang et al. (2019); Qiu et al. (2024), was able to match the performance of the independent case. The difference between the best and worst performing encoding functions was $12.3\%$. These results highlight that both the manner in which features are combined and the combination of features itself play significant roles in the final accuracy of PNNs. When comparing different feature groupings, we found that $(l/w)$ consistently performed worse than $(p/s)$, demonstrating that the choice of which features to combine can also impact accuracy for some tasks.

These findings are supported by heuristics found in the data. A closer inspection of Fig. 3(a) reveals that petal length and petal width together are highly discriminative of the different classes. These features separate different species in a similar fashion, as evidenced by the distribution of classes along the diagonal of the plot, suggesting that they may have similar importance. Thus, the combination $(p/s)$ with linear encoding would combine petal length and width with an equal relative importance, expressing such relationships.

VII Conclusions

Combining features into single inputs in PNNs can reduce the number of inputs and associated devices, enabling the use of smaller and more energy-efficient NNs. These benefits would help make some circuits more feasible to simulate, fabricate, test, or deploy. However, this method of feature combination imposes predefined relationships among the features that may not necessarily reflect the nature of the data or the task at hand. Nonetheless, selecting or designing encoding functions based on an understanding of the dataset or on domain-specific knowledge can lead to improved accuracy. We have illustrated this first on a simple, idealized example, and then for simulated PNNs.

In the scenarios shown here, as is common in the literature, features are combined into a single input. As an alternative, we could distribute features across many inputs, circumventing the limitations discussed here and making it possible to learn other relative feature importances. For instance, Principal Component Analysis (PCA) can be used for dimensionality reduction, distributing features across many inputs simultaneously Mojaver et al. (2023). Expanding on this concept, a learnable encoding function that uses every feature available would be a fully connected layer of an NN Wang et al. (2022), which is more complex and less efficient than what is explored in our work. In addition, the approach used here could be applied directly at the hardware level, using integrated photonics and CMOS-compatible platforms for volume production.

The discussions presented here highlight that there is no neutral way of using this feature combination strategy in PNNs. Combining features in this manner will necessarily emphasize certain feature relationships. Sometimes, a PNN might achieve good performance metrics despite combinations that are not ideal. However, even if high accuracy is achieved, these combinations can also introduce or amplify biases in the model outputs, depending on the specific features and their encoded interactions. Rather than leaving this to chance, we suggest carefully assessing how to encode features given the nature of the problem and data.

Acknowledgements.
The authors would like to acknowledge and thank Peter Bienstman and Thomas Van Vaerenbergh for their valuable advice and discussions during the early stages of this work, as well as for reviewing the paper before submission. This project received funding from École Centrale de Lyon and ANR (No. ANR-20-THIA-0007-01). Paul Jimenez and Fabio Pavanello thank ANR’s support (No. ANR-20-CE39-0004), and Fabio Pavanello acknowledges support by the European Union’s Horizon Europe research and innovation program (No. 101070238). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

Author Declarations

Conflict of Interest Statement

The authors have no conflicts to disclose.

Author Contributions

Mauricio Gomes de Queiroz: Conceptualization (lead); Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (lead); Software (lead); Visualization (lead); Writing/original draft preparation (lead); Writing/review & editing (equal). Paul Jimenez: Formal analysis (supporting); Methodology (supporting); Writing/review & editing (equal). Raphael Cardoso: Methodology (supporting); Writing/review & editing (equal). Mateus Vidaletti da Costa: Methodology (supporting); Writing/review & editing (equal). Mohab Abdalla: Methodology (supporting); Writing/review & editing (equal). Ian O’Connor: Supervision (supporting); Writing/review & editing (equal). Alberto Bosio: Funding Acquisition (equal); Supervision (supporting); Writing/review & editing (equal). Fabio Pavanello: Conceptualization (supporting); Formal analysis (supporting); Methodology (supporting); Funding Acquisition (equal); Supervision (lead); Writing/review & editing (equal).

Data Availability Statement

The code that reproduces the experiments, and the data that support the findings of this study are available at github.com/mgomesq/feature_representation_pnns.

References

  • Dong, Wang, and Abbas (2021) S. Dong, P. Wang,  and K. Abbas, “A survey on deep learning and its applications,” Computer Science Review 40, 100379 (2021).
  • Simonyan and Zisserman (2014) K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556  (2014).
  • Purwins et al. (2019) H. Purwins, B. Li, T. Virtanen, J. Schlüter, S.-Y. Chang,  and T. Sainath, “Deep learning for audio signal processing,” IEEE Journal of Selected Topics in Signal Processing 13, 206–219 (2019).
  • Theis and Wong (2017) T. N. Theis and H.-S. P. Wong, “The end of moore’s law: A new beginning for information technology,” Computing in Science & Engineering 19, 41–50 (2017).
  • Powell (2008) J. R. Powell, “The quantum limit to moore’s law,” Proceedings of the IEEE 96, 1247–1248 (2008).
  • Leiserson et al. (2020) C. E. Leiserson, N. C. Thompson, J. S. Emer, B. C. Kuszmaul, B. W. Lampson, D. Sanchez, and T. B. Schardl, “There’s plenty of room at the top: What will drive computer performance after moore’s law?” Science 368, eaam9744 (2020), https://www.science.org/doi/pdf/10.1126/science.aam9744.
  • Waldrop (2016) M. M. Waldrop, “The chips are down for moore’s law,” Nature News 530, 144 (2016).
  • Shen et al. (2017) Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, et al., “Deep learning with coherent nanophotonic circuits,” Nature photonics 11, 441–446 (2017).
  • Ashtiani, Geers, and Aflatouni (2022) F. Ashtiani, A. J. Geers,  and F. Aflatouni, “An on-chip photonic deep neural network for image classification,” Nature 606, 501–506 (2022).
  • Xu et al. (2021) X. Xu, M. Tan, B. Corcoran, J. Wu, A. Boes, T. G. Nguyen, S. T. Chu, B. E. Little, D. G. Hicks, R. Morandotti, et al., “11 TOPS photonic convolutional accelerator for optical neural networks,” Nature 589, 44–51 (2021).
  • Shibata et al. (2008) T. Shibata, S. Kamei, T. Kitoh, T. Tanaka, and M. Kohtoku, “Compact and low insertion loss (~1.0 dB) Mach-Zehnder interferometer-synchronized arrayed-waveguide grating multiplexer with flat-top frequency response,” Optics Express 16, 16546–16551 (2008).
  • Xiao et al. (2021) X. Xiao, M. B. On, T. Van Vaerenbergh, D. Liang, R. G. Beausoleil, and S. Yoo, “Large-scale and energy-efficient tensorized optical neural networks on III–V-on-silicon MOSCAP platform,” APL Photonics 6 (2021).
  • Tait (2022) A. N. Tait, “Quantifying power in silicon photonic neural networks,” Physical Review Applied 17, 054029 (2022).
  • Mourgias-Alexandris et al. (2022) G. Mourgias-Alexandris, M. Moralis-Pegios, A. Tsakyridis, S. Simos, G. Dabos, A. Totovic, N. Passalis, M. Kirtas, T. Rutirawut, F. Gardes, et al., “Noise-resilient and high-speed deep learning with coherent silicon photonics,” Nature Communications 13, 5572 (2022).
  • de Queiroz et al. (2023) M. G. de Queiroz, R. Cardoso, P. Jimenez, M. Abdalla, I. O’Connor, A. Bosio, and F. Pavanello, “Power reduction in photonic meshes by MZI optimization,” in Frontiers in Optics (Optica Publishing Group, 2023) pp. JW4A–7.
  • Zhang et al. (2021) H. Zhang, M. Gu, X. Jiang, J. Thompson, H. Cai, S. Paesani, R. Santagati, A. Laing, Y. Zhang, M. Yung, et al., “An optical neural chip for implementing complex-valued neural network,” Nature Communications 12, 457 (2021).
  • Geirhos et al. (2020) R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge,  and F. A. Wichmann, “Shortcut learning in deep neural networks,” Nature Machine Intelligence 2, 665–673 (2020).
  • Qiu et al. (2024) R. Qiu, A. Eldebiky, L. Zhang, X. Yin, C. Zhuo, U. Schlichtmann, and B. Li, “OplixNet: Towards area-efficient optical split-complex networks with real-to-complex data assignment and knowledge distillation,” in Design, Automation and Test in Europe (DATE) (2024).
  • McCulloch and Pitts (1943) W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The Bulletin of Mathematical Biophysics 5, 115–133 (1943).
  • Hornik, Stinchcombe, and White (1989) K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks 2, 359–366 (1989).
  • Sze et al. (2017) V. Sze, Y.-H. Chen, T.-J. Yang,  and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE 105, 2295–2329 (2017).
  • Goodfellow, Bengio, and Courville (2016) I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
  • Bassey, Qian, and Li (2021) J. Bassey, L. Qian,  and X. Li, “A survey of complex-valued neural networks,” arXiv preprint arXiv:2101.12249  (2021).
  • Kim, Han, and Ko (2024) G. Kim, D. K. Han,  and H. Ko, “Sound source localization using complex-valued deep neural networks,” in 2024 IEEE International Conference on Consumer Electronics (ICCE) (IEEE, 2024) pp. 1–4.
  • Masaad et al. (2023) S. Masaad, E. Gooskens, S. Sackesyn, J. Dambre, and P. Bienstman, “Photonic reservoir computing for nonlinear equalization of 64-QAM signals with a Kramers–Kronig receiver,” Nanophotonics 12, 925–935 (2023).
  • Reck et al. (1994) M. Reck, A. Zeilinger, H. J. Bernstein, and P. Bertani, “Experimental realization of any discrete unitary operator,” Physical Review Letters 73, 58 (1994).
  • Shastri et al. (2021) B. J. Shastri, A. N. Tait, T. Ferreira de Lima, W. H. Pernice, H. Bhaskaran, C. D. Wright,  and P. R. Prucnal, “Photonics for artificial intelligence and neuromorphic computing,” Nature Photonics 15, 102–114 (2021).
  • Bai et al. (2023) Y. Bai, X. Xu, M. Tan, Y. Sun, Y. Li, J. Wu, R. Morandotti, A. Mitchell, K. Xu,  and D. J. Moss, “Photonic multiplexing techniques for neuromorphic computing,” Nanophotonics 12, 795–817 (2023).
  • Clements et al. (2016) W. R. Clements, P. C. Humphreys, B. J. Metcalf, W. S. Kolthammer, and I. A. Walmsley, “Optimal design for universal multiport interferometers,” Optica 3, 1460 (2016).
  • Williamson et al. (2020) I. A. D. Williamson, T. W. Hughes, M. Minkov, B. Bartlett, S. Pai,  and S. Fan, “Reprogrammable electro-optic nonlinear activation functions for optical neural networks,” IEEE Journal of Selected Topics in Quantum Electronics 26, 1–12 (2020).
  • Jha, Huang, and Prucnal (2020) A. Jha, C. Huang,  and P. R. Prucnal, “Reconfigurable all-optical nonlinear activation functions for neuromorphic photonics,” Opt. Lett. 45, 4819–4822 (2020).
  • Mojaver et al. (2023) K. H. R. Mojaver, B. Zhao, E. Leung, S. M. R. Safaee,  and O. Liboiron-Ladouceur, “Addressing the programming challenges of practical interferometric mesh based optical processors,” Optics Express 31, 23851–23866 (2023).
  • Banerjee, Nikdast, and Chakrabarty (2023) S. Banerjee, M. Nikdast,  and K. Chakrabarty, “Characterizing coherent integrated photonic neural networks under imperfections,” Journal of Lightwave Technology 41, 1464–1479 (2023).
  • Hamerly, Bandyopadhyay, and Englund (2022) R. Hamerly, S. Bandyopadhyay,  and D. Englund, “Asymptotically fault-tolerant programmable photonics,” Nature Communications 13, 6831 (2022).
  • Wang et al. (2022) R. Wang, P. Wang, C. Lyu, G. Luo, H. Yu, X. Zhou, Y. Zhang, and J. Pan, “Multicore photonic complex-valued neural network with transformation layer,” Photonics 9 (2022), doi:10.3390/photonics9060384.
  • Fang et al. (2019) M. Y.-S. Fang, S. Manipatruni, C. Wierzynski, A. Khosrowshahi,  and M. R. DeWeese, “Design of optical neural networks with component imprecisions,” Optics Express 27, 14009 (2019).
  • Jia et al. (2023) Z. Jia, W. Qarony, J. Park, S. Hooten, D. Wen, Y. Zhiyenbayev, M. Seclì, W. Redjem, S. Dhuey, A. Schwartzberg, E. Yablonovitch,  and B. Kanté, “Interpretable inverse-designed cavity for on-chip nonlinear photon pair generation,” Optica 10, 1529–1534 (2023).
  • Yeung et al. (2020) C. Yeung, J.-M. Tsai, B. King, Y. Kawagoe, D. Ho, M. W. Knight,  and A. P. Raman, “Elucidating the behavior of nanophotonic structures through explainable machine learning algorithms,” ACS Photonics 7, 2309–2318 (2020).
  • Lipton (2018) Z. C. Lipton, “The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery.” Queue 16, 31–57 (2018).
  • Miller (2019) T. Miller, “Explanation in artificial intelligence: Insights from the social sciences,” Artificial Intelligence 267, 1–38 (2019).
  • Samek et al. (2021) W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders,  and K.-R. Müller, “Explaining deep neural networks and beyond: A review of methods and applications,” Proceedings of the IEEE 109, 247–278 (2021).
  • Montavon, Samek, and Müller (2018) G. Montavon, W. Samek, and K.-R. Müller, “Methods for interpreting and understanding deep neural networks,” Digital Signal Processing 73, 1–15 (2018).
  • Sung (1998) A. H. Sung, “Ranking importance of input parameters of neural networks,” Expert Systems with Applications 15, 405–411 (1998).
  • Fu and Chen (1993) L. Fu and T. Chen, “Sensitivity analysis for input vector in multilayer feedforward neural networks,” in IEEE International Conference on Neural Networks (IEEE, 1993) pp. 215–218.
  • Dimopoulos et al. (1999) I. Dimopoulos, J. Chronopoulos, A. Chronopoulou-Sereli, and S. Lek, “Neural network models to study relationships between lead concentration in grasses and permanent urban descriptors in Athens city (Greece),” Ecological Modelling 120, 157–165 (1999).
  • Simonyan, Vedaldi, and Zisserman (2014) K. Simonyan, A. Vedaldi,  and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” in Workshop at International Conference on Learning Representations (2014).
  • Olden, Joy, and Death (2004) J. D. Olden, M. K. Joy,  and R. G. Death, “An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data,” Ecological Modelling 178, 389–397 (2004).
  • Ancona et al. (2017) M. Ancona, E. Ceolini, C. Öztireli,  and M. Gross, “Towards better understanding of gradient-based attribution methods for deep neural networks,” arXiv preprint arXiv:1711.06104  (2017).
  • Smilkov et al. (2017) D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, “SmoothGrad: removing noise by adding noise,” arXiv preprint arXiv:1706.03825 (2017).
  • Fisher (1936) R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics 7, 179–188 (1936).
  • Campo and Pérez-López (2022) J. R. R. Campo and D. Pérez-López, “Reconfigurable activation functions in integrated optical neural networks,” IEEE Journal of Selected Topics in Quantum Electronics 28, 1–13 (2022).
  • Laporte, Dambre, and Bienstman (2019) F. Laporte, J. Dambre, and P. Bienstman, “Highly parallel simulation and optimization of photonic circuits in time and frequency domain based on the deep-learning framework PyTorch,” Scientific Reports 9, 5918 (2019).