peerRTF: Robust MVDR Beamforming Using Graph Convolutional Network

Amit Sofer, Daniel Levi and Sharon Gannot

Abstract

Accurate and reliable identification of the relative transfer functions (RTFs) between microphones with respect to a desired source is an essential component in the design of microphone array beamformers, specifically those based on the minimum variance distortionless response (MVDR) criterion. Since accurate estimation of the RTF in a noisy and reverberant environment is a cumbersome task, we aim to leverage prior knowledge of the acoustic enclosure to robustify the RTF estimation by learning the RTF manifold. In this paper, we present a novel robust RTF identification method, trained and tested with real recordings, which relies on learning the RTF manifold using a graph convolutional network (GCN) to infer a robust representation of the RTFs in a confined area, and consequently enhance the beamformer’s performance.

¹ The authors are with Bar-Ilan University, Israel. e-mail: {amit.sofer,daniel.levi1,sharon.gannot}@biu.ac.il. The work was partially supported by grant #3-16416 from the Ministry of Science & Technology, Israel, and from the European Union’s Horizon 2020 Research and Innovation Programme, Grant Agreement No. 871245. Amit Sofer and Daniel Levi equally contributed to the paper. Project Page: https://peerrtf.github.io/

Index Terms:

robust MVDR beamformer, manifold learning, graph convolutional network

I Introduction

Modern acoustic beamformers outperform conventional Direction of Arrival (DOA)-based beamformers, due to their ability to consider the entire acoustic propagation path, rather than only the direct path. However, an estimate of the acoustic impulse responses (AIRs) relating the source and the microphones (or of their corresponding acoustic transfer functions (ATFs)) is essential. Given that ATF estimation is a blind problem, the approach in [1] suggests replacing the ATFs with RTFs in the beamformer design.

While various algorithms for estimating RTFs can be found in the literature, such as those proposed in [1, 2, 3, 4, 5], they often face degradation in low signal-to-noise ratio (SNR) and high reverberation conditions. The literature extensively covers approaches to enhance beamforming robustness, commonly achieved through techniques like beam widening, as discussed in [6, 7, 8, 9, 10]. In this work, our approach focuses on improving the estimated RTF by leveraging a pre-learned set of RTFs and learning the RTFs manifold.

Despite their intricate structure, [11] demonstrated that the RTFs are primarily influenced by a limited set of parameters, such as the size and geometry of the room, the positions of the source and the microphones, and the reflective properties of the walls. Consequently, acoustic paths exhibit geometric structures of low dimensionality, commonly referred to as manifolds, and can be analyzed using manifold learning methods. In a fixed room with a static microphone array location, the only degree of freedom is the source location, causing the RTF to vary only based on the speaker’s position. Consequently, RTFs from different locations lie on a manifold. By assembling a clean set of RTFs as a training dataset, we can explore the RTF manifold and derive a more robust estimate of the RTF from noisy recordings.

Several manifold learning approaches, such as those proposed by [12, 13, 14], typically follow a standard framework. In this framework, manifold samples are initially represented as a graph. Subsequently, a low-dimensional representation (embedding) of the data is inferred, preserving its structure meaningfully. This representation effectively "flattens" the original non-Euclidean structure of the manifold into a Euclidean space, simplifying subsequent analysis. Post-inference, an algorithm is applied to the low-dimensional embedding to accomplish the desired task.

The MVDR beamformer is a spatial filter designed to minimize the noise power in its output while preserving the desired source without distortion. There is some justification for using the RTFs as the steering vector for calculating the MVDR weights [1, 15, 16]. In our research, we adopt this approach.

In recent years, geometric deep learning (GDL), a term describing techniques that extend deep neural models to non-Euclidean inputs such as graphs and manifolds, has been widely adopted in fields such as social sciences (e.g., analyzing social networks using graphs), chemistry (where molecules can be represented as graphs), biology (where biomolecular interactions form graph structures), 3D point cloud manifold learning, and computer vision. These methods usually focus on classification, segmentation, clustering, and recommendation tasks rather than on regression tasks. A particular type of graph neural network (GNN) is the GCN, which is based on the principle of learning through shared weights, similar to convolutional neural networks (CNNs) [17, 18, 19, 20, 21, 22].

Previous efforts to learn the manifold of RTFs [23, 24] have employed a graph representation, utilizing the Gaussian heat kernel to determine edge weights. Spectral graph theory is then applied to infer a low-dimensional embedding of the manifold in Euclidean space. The Euclidean distance between samples in this transformed space reflects the diffusion distance on the manifold surface. Subsequently, leveraging geometric harmonics, an algorithm is employed to extend the training data and estimate the RTF based on the acquired manifold and the noisy signals. This algorithm effectively projects the noisy RTF onto the learned manifold of potential RTFs, resulting in a more robust estimation of the RTF.

Drawing inspiration from recent developments in the GNN field, which demonstrate that graphs naturally emerge when learning a manifold, this paper aims to enhance the traditional manifold learning blueprint. The conventional blueprint involves flattening the non-Euclidean manifold into a Euclidean space. Instead, we harness the power of GCNs to learn the high-dimensional RTF manifold and to infer a robust RTF estimate from a noisy RTF, directly from the graph representing the manifold.

Our contribution is threefold: 1) a novel robust RTF estimation algorithm that infers the RTF manifold using a GCN and leverages it to robustify the RTF estimation; 2) a comprehensive assessment of the proposed scheme and its performance advantages as compared with competing methods at various SNR levels and in real-world acoustic scenarios; 3) a demonstration of how this framework can be used in further research to expand the capabilities of speech enhancement and localization algorithms.

The remainder of this paper is organized as follows. In Section II we formalize the problem. Section III describes the relations between manifold learning and graphs and presents the GNN framework, and in particular, the GCN variant. Section IV explains a general robust beamforming approach, which includes the vanilla RTF estimation and RTF-based beamforming. Section V elaborates on our approach, in particular, the creation of the graph data, the architecture of our network, and the objective functions. Section VI describes the experimental setup and presents the results together with a comparison to other methods. Section VII concludes the paper.

II Problem Formulation

An $M$ -microphone array is positioned in a reverberant enclosure. We assume that the desired source location is confined to a known region. Examples of such environments include conference rooms, where the microphone array is placed at a fixed location on the table, and speakers occupy designated positions around it. Similarly, in office setups, the microphone array is fixed on the desk or computer screen, with the speaker typically seated behind the desk. In a car, the microphone array is positioned at a fixed location at the visor, while the speaker occupies one of the seats.

Let $r_{m}[n],m=0,\ldots,M-1$ , denote the measured signal at the $m$ th microphone. Here, $s[n]$ represents the desired speech signal, and $v_{m}[n]$ represents the contribution of all noise sources captured by the $m$ th microphone. The signal captured by the $m$ th microphone can be modeled as:

r_{m}[n]=\{s*a_{m}\}[n]+v_{m}[n].

(1)

Here, ${a}_{m}[n]$ stands for the AIR from the source to the $m$ th microphone at time $n$ , and $*$ denotes the convolution operator. In scenarios where the speaker remains static, the AIR remains constant over time. The time-domain convolution in (1) can be approximated by multiplication in the short-time Fourier transform (STFT) domain. All $M$ equations can then be written in a single vector form as:

\mathbf{r}(l,k)=s(l,k)\mathbf{a}(k)+\mathbf{v}(l,k).

(2)

Here, $l$ and $k$ represent the time-frame and frequency-bin indexes, respectively, with $l\in\{0,\ldots,L-1\}$ and $k\in\{0,\ldots,K-1\}$ . The vector $\mathbf{a}(k)=[a_{0}(k),\ldots,a_{M-1}(k)]^{\top}$ , comprises all ATFs from the source to the microphone array. We define $a_{\textrm{ref}}(k)$ as the component of the vector $\mathbf{a}(k)$ that corresponds to the reference microphone. Equation (2) can also be reformulated as a function of $\tilde{s}(l,k)=s(l,k)a_{\textrm{ref}}(k)$ , representing the source signal as captured by the reference microphone:

\mathbf{r}(l,k)=\tilde{s}(l,k)\mathbf{h}(k)+\mathbf{v}(l,k),

(3)

where $\mathbf{h}(k)$ is the vector of RTFs:

\mathbf{h}(k)\triangleq\frac{\mathbf{a}(k)}{a_{\textrm{ref}}(k)}.

(4)
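As a minimal numerical sketch of (2)-(4), the snippet below forms the RTF vector from a synthetic ATF vector and checks that the two signal models (2) and (3) coincide. The array size, the number of bins, and the helper name `rtf_from_atf` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def rtf_from_atf(a, ref=0):
    """Divide the ATFs by the reference-microphone ATF, as in (4).

    a: complex array of shape (M, K), one row per microphone and one
       column per frequency bin (hypothetical layout).
    """
    return a / a[ref]

rng = np.random.default_rng(0)
M, K = 4, 8  # hypothetical number of microphones and frequency bins
a = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))
s = rng.standard_normal(K) + 1j * rng.standard_normal(K)  # one STFT frame of the source

h = rtf_from_atf(a, ref=0)
s_tilde = s * a[0]                 # source as captured by the reference microphone
speech_via_atf = s * a             # speech term of (2)
speech_via_rtf = s_tilde * h       # speech term of (3)
```

By construction, the reference entry of `h` equals one in every bin, which is why only $M-1$ RTFs per location carry information.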

Playing a precisely defined signal, such as a chirp or white noise, without any background noise, from different locations within the desired acoustic environment enables us to use standard system identification methods. This process yields a collection of RTFs. Our objective is to glean insights into the RTF manifold using this ensemble of clean RTFs. This, in turn, can be leveraged to enhance estimates of noisy RTFs from the same enclosure, thereby robustifying the beamformer’s design.

III Manifold learning & Graph neural networks

III-A Graphs in manifold learning

In many areas, there is a need to comprehend a manifold. Typically, when attempting to learn a manifold, there is no existing mathematical model, and only a limited number of samples are available. Most manifold learning algorithms follow a standard blueprint. First, they generate a data representation by constructing a neighbor graph. Second, they compute a low-dimensional representation (embedding) of the data, preserving a specific aspect of the original manifold structure. For instance, Locally Linear Embedding [13], Isomap [12], and Laplacian eigenmaps [14] use different techniques to compute this embedding. Variational autoencoders (VAEs) [25] introduce a distribution in the embedding through the encoder. Extensions like conditional VAEs [26] and adversarial autoencoders [27] aim for a structured data representation. This new embedding "flattens" the original non-Euclidean structure of the manifold, making it more manageable. Third, a task-dependent algorithm (classification, clustering, or regression) is applied after inferring the representation.

Manifold learning has found various applications in audio, including localization [28, 29, 30, 31] and speech enhancement [24, 23, 32]. In [23], the RTF manifold is initially represented by a graph where the RTFs serve as graph nodes, and the edges’ weights are defined using the heat kernel function. A Markov process is established on the graph by constructing a transition matrix representing the manifold diffusion process. Subsequently, leveraging spectral graph theory, a low-dimensional embedding of the dataset in Euclidean space is derived. In this space, the Euclidean distance between samples reflects the diffusion distance across the high-dimensional manifold surface. Once this low-dimensional embedding is obtained, geometric harmonics [33], a method extending low-dimensional embeddings to new data points, is employed to create a supervised RTF identification estimator. In [32], a VAE-based manifold model for RTFs is proposed to robustify RTF estimation. Unlike linear methods, this approach provides a high degree of expressiveness by avoiding constraints associated with linearity. The VAE is trained unsupervised using data collected under benign acoustic conditions, enabling it to reconstruct RTFs within the specified enclosure. The method introduces a least-squares (LS)-based RTF estimator that is regularized by the trained VAE. This regularization significantly improves the quality of RTF estimates compared to traditional VAE-based denoising methods. The result is a hybrid model that combines classic RTF estimation with the capabilities of the trained VAE.

In [34], the relation between the graph structure, manifold learning (ML), and GNNs is established, emphasizing how the graph structure contributes to the model’s accuracy. Building upon these established foundations, we propose to harness the ML capabilities of GNNs to obtain a more accurate and robust estimator of RTFs in noisy and reverberant environments.

III-B Graph Convolution Networks

In this section, we first define the mathematical representation of a graph, then describe the two different types of GCNs, explain the differences between them, and finally, formally define the spatial GCNs.

A graph $\mathcal{G}=(\mathcal{E},\mathcal{V})$ consists of a set of nodes $\mathcal{V}=\{v_{1},\ldots,v_{N}\}$ and edges $\mathcal{E}$ , where the edges are assumed to be scalars denoted as $e_{j,i}\in\mathbb{R}$ connecting the $j$ th node to the $i$ th node. Alternatively, the graph can be represented as $\mathcal{G}=(\mathbf{V},\mathbf{A})$ , where $\mathbf{V}\in\mathbb{R}^{N\times d}$ is the nodes’ feature matrix, with $d$ the dimension of the features, and $\mathbf{A}\in\mathbb{R}^{N\times N}$ is the graph adjacency matrix. Denote by $\mathcal{N}(i)$ the neighborhood of nodes connected to node $i$ . The adjacency matrix should then satisfy $\mathbf{A}_{i,j}=1$ if $j\in\mathcal{N}(i)$ (in the general case, $\mathbf{A}_{i,j}$ may be less than 1).

GNNs are a generalization of conventional neural networks designed to process non-Euclidean inputs represented as graphs. Graphs offer considerable flexibility in data representation, and GNNs extend neural network methods to graph-structured data. They achieve this by iteratively propagating information through nodes and edges of the graph, enabling them to capture and exploit the inherent information encoded in the graph structure. A specific variant of GNN is the GCN, inspired by the principles of learning through shared weights, similar to the approach used in CNNs for image analysis and computer vision. Designing local operations with shared weights is the key to efficient learning on graphs. This involves message passing between each node and its neighbors, facilitated by the shared weights. The use of shared weights, implying parameter sharing across different parts of the graph, enhances the efficiency and scalability of the learning process.

Current GCN algorithms can be categorized into spectral-based and spatial-based. Spectral GCNs rely on the principles of spectral graph theory [35, 36]. This involves processing the graph through the eigendecomposition of the graph Laplacian, which is used to compute the Fourier transform of a graph signal. This, in turn, defines graph filtering operations [37, 38, 39, 17]. In contrast, spatial-based GCNs operate on the principle of message passing and directly define convolutions on the graph itself. They aim to capture information by aggregating features from neighboring nodes through shared weights [40, 41, 42, 43, 20, 44, 21, 34]. Spatial GCNs can operate locally on each node without considering the entire graph, making them well-suited for node-specific tasks. Given that our problem, as indicated by the graph construction, involves a node regression task, we will focus on spatial-based GCNs from this point onward.

III-C Spatial GCN

Much like the convolutional operation employed by conventional CNNs for image processing, spatial-based methods extend this concept to define graph convolutions based on the spatial relations among nodes. In this analogy, images can be seen as a specific type of graph, where each pixel serves as a node, and direct connections exist between each pixel and its adjacent counterparts.

In a CNN, the operation involves computing the weighted average of pixel values for the central node and its neighbors across each channel. Similarly, in spatial-based graph convolutions, the representation of the central node is convolved with the representations of its neighboring nodes to formulate an updated representation for the central node.

The permutation invariance observed in graph operations significantly differs from classical deep neural networks designed for grid-structured data. This invariance implies independence from the order of neighboring nodes, given the absence of a canonical way to arrange them. Consequently, a substantial distinction emerges between the kernels of CNNs, which leverage the knowledge of neighbor ordering by assigning varying weights during convolution, and GCN kernels. The latter lack this knowledge, resulting in weights being shared across all neighboring nodes and the entire graph. Fig. 1 compares 2D convolution and graph convolution.

Figure 1: 2D convolution vs. graph convolution. Left: In conventional 2D convolution on a Euclidean input, such as an image, the central pixel (depicted in red) of the next layer is calculated as a weighted average of itself and its neighbors, determined by the kernel size. The input is ordered and numbered accordingly. Right: In spatial graph convolution, the representation of the central node in the next layer is computed by aggregating features from neighboring nodes, with no regard to the order of neighbors or fixed graph size (inspired by [45]).

A spatial GCN comprises a sequence of graph convolution layers. The nodes’ representations undergo two fundamental steps within each layer: aggregating features from neighboring nodes and a subsequent nonlinear transformation. The transformation in each convolutional layer is often implemented as a multi-layer perceptron (MLP). The initial representation of the nodes at the input to the first convolutional layer relies on their input features.
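The two steps of a spatial graph convolution layer, aggregation followed by a nonlinear transformation, can be sketched as follows. This is one generic form (mean aggregation over a node and its neighbors, then a single linear map with a ReLU) chosen for illustration; it is an assumption and not the specific layer used later in this paper. The function name and the toy graph are hypothetical.

```python
import numpy as np

def spatial_gcn_layer(V, A, W, b):
    """One generic spatial graph-convolution layer.

    V: (N, d_in) node features, A: (N, N) binary adjacency matrix,
    W: (d_in, d_out) shared weights, b: (d_out,) shared bias.
    """
    A_hat = A + np.eye(A.shape[0])           # let each node also see itself
    deg = A_hat.sum(axis=1, keepdims=True)   # number of nodes aggregated per row
    aggregated = (A_hat @ V) / deg           # step 1: mean over the neighborhood
    return np.maximum(aggregated @ W + b, 0.0)  # step 2: shared transform + ReLU

# Toy path graph 0 - 1 - 2 with 2-dimensional features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
V = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, -1.0]])
W, b = np.eye(2), np.zeros(2)
out = spatial_gcn_layer(V, A, W, b)

# Relabeling the nodes only relabels the output rows, since the weights
# are shared and the aggregation ignores neighbor order.
perm = [2, 0, 1]
out_perm = spatial_gcn_layer(V[perm], A[perm][:, perm], W, b)
```

Because the same `W` and `b` are applied at every node, the parameter count does not depend on the number of nodes in the graph.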

Before training a GCN, a crucial factor to consider is the network’s depth, determining how many neighbor layers are used for information aggregation. As the depth increases, information gathering expands exponentially [46]. We did not aggregate information from second-order neighbors, which will be elaborated on later.

While the description is quite general, most existing algorithms primarily focus on tasks like classification, segmentation, and clustering, rather than regression. We aim to leverage the capabilities of GCNs for a more intricate form of regression directly on the manifold. This involves predicting a highly precise continuous vector, operating on a high-dimensional abstract manifold, and developing a supervised manifold learning algorithm.

IV RTF-Based MVDR Beamformer

This section introduces a general robust speech enhancement framework using a microphone array, covering fundamental concepts such as beamforming and classical RTF estimation.

IV-A Optimizing Beamforming Through Robust RTF Estimation

A general framework for robust microphone array speech enhancement is depicted in Fig. 2. The framework comprises several blocks. First, the vanilla steering vector (the RTF in our case) is estimated from the noisy input signals. Then, the RTF estimates are refined using additional data on the acoustic environment, beyond the noisy input signals, to produce a more robust estimate of the RTF. This additional information is collected under ideal acoustic conditions. Finally, using these RTFs, a beamformer is constructed and applied to the noisy input signals to obtain an estimate of the desired source signal.

We utilize the generalized eigenvalue decomposition (GEVD) to estimate the RTFs from the noisy input signals and then use the MVDR as the optimization criterion to construct the beamformer. A concise explanation of these methods will be provided in the next section.

Figure 2: A general block diagram of robust RTF-based beamforming. First, the noisy signal is used to compute the correlation matrices. Then, using the GEVD, we estimate the vanilla RTFs. These vanilla RTFs are then robustified using the architecture and the additional information. Finally, the MVDR beamformer is applied to estimate the enhanced recording.

IV-A1 GEVD-Based RTF Estimation: A Concise Overview

In [47, 48], it was demonstrated that the RTF could be estimated through the GEVD of the spatial correlation matrices of the noisy signal segments $\boldsymbol{\Phi}_{rr}(k)$ and of the noise-only signal segments $\boldsymbol{\Phi}_{vv}(k)$ . (In the more general form, the RTF can be time-varying, but here we assume that it is time-invariant, and can therefore be estimated by averaging over all active-speech time segments.) The matrix $\boldsymbol{\Phi}_{vv}(k)$ is estimated from noise-only segments, which are assumed to be available. The RTF is determined by solving

\boldsymbol{\Phi}_{rr}(k)\boldsymbol{\varphi}(k)=\mu(k)\boldsymbol{\Phi}_{vv}(k)\boldsymbol{\varphi}(k).

(5)

Using $\boldsymbol{\varphi}(k)$ , the generalized eigenvector corresponding to the largest generalized eigenvalue $\mu(k)$ , we can obtain the vector of RTFs

\hat{\mathbf{h}}_{\textrm{GEVD}}(k)\triangleq[\hat{h}_{\textrm{GEVD}}^{0}(k),\ldots,\hat{h}_{\textrm{GEVD}}^{M-1}(k)]^{\top}

(6)

using the following normalization:

\hat{\mathbf{h}}_{\textrm{GEVD}}(k)=\frac{\boldsymbol{\Phi}_{vv}(k)\boldsymbol{\varphi}(k)}{\left(\boldsymbol{\Phi}_{vv}(k)\boldsymbol{\varphi}(k)\right)_{\textrm{ref}}}.

(7)
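The estimation steps in (5)-(7) can be sketched for a single frequency bin as below. The generalized eigenproblem is solved here by whitening with a Cholesky factor of the noise matrix, which is one standard route and an implementation choice of this sketch rather than something specified in the paper. The function name `gevd_rtf` and the synthetic rank-one test data are illustrative assumptions.

```python
import numpy as np

def gevd_rtf(phi_rr, phi_vv, ref=0):
    """Estimate the RTF vector of one frequency bin following (5)-(7).

    phi_rr, phi_vv: (M, M) Hermitian matrices; phi_vv must be positive definite.
    """
    # Whitening turns the generalized problem (5) into an ordinary Hermitian EVD.
    L = np.linalg.cholesky(phi_vv)              # phi_vv = L L^H
    L_inv = np.linalg.inv(L)
    whitened = L_inv @ phi_rr @ L_inv.conj().T
    _, eigvecs = np.linalg.eigh(whitened)       # eigenvalues in ascending order
    u = eigvecs[:, -1]                          # principal eigenvector
    phi = L_inv.conj().T @ u                    # generalized eigenvector of (5)
    g = phi_vv @ phi                            # numerator of (7)
    return g / g[ref]                           # normalize by the reference entry

# Synthetic check: rank-one speech component plus a known noise matrix.
rng = np.random.default_rng(0)
M = 4
h_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
h_true = h_true / h_true[0]
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_vv = B @ B.conj().T + np.eye(M)
phi_rr = 2.0 * np.outer(h_true, h_true.conj()) + phi_vv
h_est = gevd_rtf(phi_rr, phi_vv)
```

With exact correlation matrices, as in this synthetic case, the estimate coincides with the true RTF; with matrices estimated from finite noisy data it degrades, which motivates the robust estimation developed in the paper.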

The next step is constructing the graph by calculating distances between the feature vectors. We use the RTFs estimated in noiseless environments as the features of the graph vertices. Using the clean RTFs to construct the graph may facilitate the enhancement of noisy features acquired in the same environment but in noisy conditions.

Define ${h}_{\ell}^{m}(k)$ , the RTF associated with the $\ell$ th training location, where $\ell\in\{1,\ldots,N_{\textrm{train}}\}$ , $m$ represents the $m$ th microphone, and $k$ the frequency bin. Further, define $\mathbf{h}_{\ell}^{m}$ as the corresponding vector formed by concatenating all frequencies.

The training set of all RTFs associated with the $m$ th microphone, denoted $\mathcal{H}^{m}=\{\mathbf{h}_{\ell}^{m}\}_{\ell=1}^{N_{\textrm{train}}}$ , is obtained by applying the GEVD procedure to the noiseless training recordings. In the absence of noise, $\boldsymbol{\Phi}_{vv}(k)$ in (5) and (7) is substituted by an identity matrix; consequently, (5) simplifies to an eigenvalue decomposition (EVD) problem. $\mathcal{H}^{m}$ is referred to as the $m$ th RTF manifold.

IV-A2 The MVDR Beamformer

Let $\hat{\mathbf{h}}(k)$ represent an RTF from our dataset at a specific position, whether before or after the GCN. Define $\boldsymbol{\Phi}_{vv}(k)$ as the $M\times M$ power spectral density (PSD) matrix of the received noise signals at the $k$ th frequency bin for the same position, and $\boldsymbol{\Phi}_{vv}^{-1}(k)$ as its inverse. It is assumed that noise-only segments are available and can be identified, e.g., by applying a voice activity detector (VAD).

The MVDR beamformer is a spatial filter designed to minimize the noise power at its output while maintaining a distortionless response toward the desired source. Its optimal weights are given by:

\mathbf{w}_{\textrm{MVDR}}(k)=\frac{\boldsymbol{\Phi}_{vv}^{-1}(k)\hat{\mathbf{h}}(k)}{\hat{\mathbf{h}}(k)^{\mathsf{H}}\boldsymbol{\Phi}_{vv}^{-1}(k)\hat{\mathbf{h}}(k)}.

(8)

Here, we follow [1] and subsequent publications and use the RTF as the steering vector of the MVDR beamformer. It was shown in multiple works (see, e.g., [15, 16]) that implementing an MVDR beamformer with the RTF rather than with a steering vector solely based on the direction of the sources, yields significantly improved results in reverberant environments.
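As a small illustration of (8), the sketch below computes the MVDR weights for one frequency bin and verifies the distortionless constraint numerically. The noise PSD matrix and the RTF are synthetic, and the function name `mvdr_weights` is a hypothetical helper, not part of the paper.

```python
import numpy as np

def mvdr_weights(phi_vv, h):
    """MVDR weights of (8) for one frequency bin.

    phi_vv: (M, M) Hermitian positive-definite noise PSD matrix.
    h: (M,) RTF vector used as the steering vector.
    """
    num = np.linalg.solve(phi_vv, h)   # phi_vv^{-1} h without forming the inverse
    return num / (h.conj() @ num)      # divide by h^H phi_vv^{-1} h

rng = np.random.default_rng(1)
M = 4
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_vv = B @ B.conj().T + np.eye(M)
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
h = h / h[0]

w = mvdr_weights(phi_vv, h)
response = np.vdot(w, h)                       # w^H h, should equal 1
noise_power = np.vdot(w, phi_vv @ w).real      # w^H phi_vv w
min_noise_power = 1.0 / np.vdot(h, np.linalg.solve(phi_vv, h)).real
```

The beamformer output for a noisy STFT vector `r` would then be `np.vdot(w, r)`. Any error in `h` breaks the unit response toward the true source, which is why the quality of the RTF estimate directly affects the beamformer.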

V peerRTF: A GCN-based Robust RTF Estimation

This section introduces our method for robust RTF estimation. We delve into the preprocessing of the data, the construction of a feature vector, and the associated graph data. Finally, we explore the derived GCN architecture and our objective functions.

Our method aims to achieve robust RTF estimation. Inspired by manifold-learning methods such as those proposed in [24, 23], we propose a modern DNN-based approach that leverages prior knowledge of the acoustic environment to project noisy examples onto the manifold. Given that our data is represented as a graph, we utilize message-passing techniques to achieve this goal.

V-A Graph Construction

The learning process involves understanding the relations between neighboring entities. In our case, this requires learning the GNN weights. To learn these weights, we train the GNN on known nodes, i.e., known speaker locations in the room. We then evaluate the performance on unknown nodes corresponding to different speaker locations. This section outlines the dataset preparation procedure, which involves creating the feature vectors, constructing the graphs, and the training procedure.

V-A1 Feature Vector

Beamforming using the RTFs is typically performed in the frequency domain. However, in the time domain, the RTFs display a distinct shape, characterized by a prominent peak around zero and rapid decay on both sides. This characteristic allows us to simplify the estimation process by truncating the time-domain RTFs around their central region, thereby reducing the number of data points that need to be estimated. An example from our training set of a time-domain representation of an RTF recorded in a room with $T_{60}=300\text{ms}$ is depicted in Fig. 3. This example represents an RTF estimated using the GEVD procedure in (7) in noiseless conditions, i.e., the identity matrix substitutes the spatial correlation matrix of the noise. The clean signal is obtained by convolving an AIR from the MIRaGe dataset [49] with pink noise. We truncate the time-domain RTF $l_{\textrm{uncausal}}$ taps left of the peak and $l_{\textrm{causal}}$ taps right of the peak. Applying the GCN to the time-domain representation rather than the frequency-domain representation of the RTFs circumvents the need to work with complex-valued neural networks.

When dealing with an array of $M$ microphones, each speaker location has $M-1$ RTFs, as the RTF between the reference microphone and itself is trivial. These $M-1$ components are usually estimated independently. Truncating the RTFs reduces the feature dimension, thereby enhancing learning compared to using the full RTF. In total, we have the feature dimension: $d=l_{\textrm{uncausal}}+l_{\textrm{causal}}$ .
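The truncation step can be sketched as follows. The sketch assumes a one-sided frequency-domain RTF as input, brings the non-causal taps to the left of the peak with an FFT shift, and counts the peak tap as part of the causal segment so that the feature length is exactly $d=l_{\textrm{uncausal}}+l_{\textrm{causal}}$; these conventions and the function name `truncate_rtf` are assumptions of this illustration.

```python
import numpy as np

def truncate_rtf(h_freq, l_uncausal, l_causal):
    """Return a real feature vector of length l_uncausal + l_causal.

    h_freq: one-sided spectrum of a single RTF (output layout of np.fft.rfft).
    """
    # Negative lags wrap to the end of the IFFT buffer; shift them left of center.
    h_time = np.fft.fftshift(np.fft.irfft(h_freq))
    peak = int(np.argmax(np.abs(h_time)))
    start, stop = peak - l_uncausal, peak + l_causal
    if start < 0 or stop > h_time.size:
        raise ValueError("truncation window exceeds the RTF length")
    return h_time[start:stop]

# Toy RTF: dominant tap at lag 0 with small causal and non-causal taps.
h_full = np.zeros(64)
h_full[0], h_full[1], h_full[-1] = 1.0, 0.5, 0.3
feature = truncate_rtf(np.fft.rfft(h_full), l_uncausal=3, l_causal=5)
```

The resulting vector is real-valued, so a standard real-valued network can process it directly.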

V-A2 Graph Dataset Construction

We assume that we have two types of feature vectors: $N_{\textrm{train}}$ ideal RTFs estimated in a noiseless scenario, and $N_{\textrm{test}}$ vanilla RTFs estimated in a noisy scenario. The first step is to construct the graph of clean nodes. For each microphone, we construct a separate graph that contains the $N_{\textrm{train}}$ clean RTFs, denoted $\mathbf{H}^{m}\in\mathbb{R}^{N_{\textrm{train}}\times d}$ ; using a separate graph per microphone helps to capture specific relationships and dependencies within the data. The graphs are built by applying $\mathcal{K}$ Nearest Neighbors ( $\mathcal{K}$ NN). We chose $\mathcal{K}$ NN since it selects the most similar RTFs from the dataset, allowing us to effectively robustify the noisy feature vectors. For evaluation, we connect one noisy feature vector to the clean nodes using $\mathcal{K}$ NN. In total, we have $\mathbf{H}\in\mathbb{R}^{(N_{\textrm{train}}+1)\times d}$ for each microphone.
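A minimal sketch of the graph construction for one microphone is given below. It assumes Euclidean distance between feature vectors, which the text does not specify, and it simply recomputes the neighbor lists after appending the noisy node; both choices, as well as the helper names, are assumptions made for illustration.

```python
import numpy as np

def knn_neighbors(H, k):
    """Return an (N, k) array with the indices of the k nearest rows of H
    (squared Euclidean distance) for every row, excluding the row itself."""
    sq = np.sum(H**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (H @ H.T)
    np.fill_diagonal(dist2, np.inf)        # a node is not its own neighbor
    return np.argsort(dist2, axis=1)[:, :k]

def attach_noisy_node(H_clean, h_noisy, k):
    """Append one noisy feature vector to the clean nodes and rebuild the graph."""
    H = np.vstack([H_clean, h_noisy])      # shape (N_train + 1, d)
    return H, knn_neighbors(H, k)

# Toy example with one-dimensional features.
H_clean = np.array([[0.0], [1.0], [3.0], [10.0]])
clean_neighbors = knn_neighbors(H_clean, k=1)
H, neighbors = attach_noisy_node(H_clean, np.array([2.9]), k=2)
```

In this toy case the noisy node (last row) is linked to the clean nodes with values 3 and 1, i.e., to the most similar clean feature vectors.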

V-A3 Training Procedure

For the training setup, our goal is to learn optimal weights to enhance the noisy feature vectors during evaluation. Starting with $\mathbf{H}^{m}\in\mathbb{R}^{N_{\textrm{train}}\times d}$ , we select a training position (represented by a node in the graph) and exclude the clean feature vector associated with this grid position from all $M-1$ graphs. Next, we incorporate noisy feature vectors associated with this speaker position into the graphs using $\mathcal{K}$ NN. Consequently, each training example comprises $N_{\textrm{train}}-1$ clean feature vectors and one noisy vector.
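The construction of a single training example for one microphone can be sketched as follows: the clean feature vector of the selected position is held out as the regression target and replaced by its noisy counterpart. The function name and the random data are hypothetical, and the subsequent $\mathcal{K}$ NN connection step is omitted here.

```python
import numpy as np

def make_training_example(H_clean, h_noisy, idx):
    """Replace the clean node at position idx by a noisy one.

    Returns the node matrix (N_train - 1 clean rows plus one noisy row)
    and the held-out clean vector that serves as the target.
    """
    target = H_clean[idx].copy()
    H = H_clean.copy()
    H[idx] = h_noisy
    return H, target

rng = np.random.default_rng(0)
N_train, d = 5, 3                      # hypothetical sizes
H_clean = rng.standard_normal((N_train, d))
h_noisy = H_clean[2] + 0.1 * rng.standard_normal(d)
H, target = make_training_example(H_clean, h_noisy, idx=2)
```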

V-B The GCN Architecture

A key factor in the success of CNNs is the ability to design and reliably train deep models that extract higher-level features at each layer. In contrast, training deep GCN architectures is not as straightforward, and several works have studied their limitations [45, 22, 50, 51]. Stacking more layers into a GCN leads to the common vanishing gradient and over-smoothing problems. Due to these limitations, most state-of-the-art GCNs are not deeper than four layers [22]. In each layer, the transformation function is usually a single shallow, fully-connected (FC) layer followed by a non-linearity. Such shallow architectures are sufficient for classification, segmentation, clustering, and recommendation tasks. However, these shallow networks lack the expressive power to perform more challenging tasks, such as regression on high-dimensional data. In our problem setting, where nodes correspond to different positions, we opt not to aggregate information from second-order neighbors. Instead, to obtain sufficient expressive power for a regression task on the high-dimensional abstract manifold, we use a transformation function that is deeper than a single FC layer. Drawing inspiration from [21], which learns 3D manifolds from point clouds, we structured our architecture accordingly. Drawing an analogy to convolution in images, we consider $\mathbf{h}_{i}$ as the central pixel and $\mathbf{h}_{i}^{(j)},j\in\mathcal{N}(i)$ as a patch around it. To calculate the contribution of each neighbor $\mathbf{h}_{i}^{(j)}$ within each graph, we concatenate the feature vector of the central node $\mathbf{h}_{i}$ with the feature vector of the neighbor $\mathbf{h}_{i}^{(j)}$ and pass this concatenated vector through the neural network. The neural network outputs are then aggregated over all the neighbors $\mathcal{N}(i)$ of $\mathbf{h}_{i}$ .

When deliberating on selecting an aggregation function, it is essential to consider the essence of our regression task on the manifold. Given that our objective is to predict a continuous value falling within the range of the input values, this criterion guides our choice of aggregation functions. In this context, sum and max are not optimal choices. Instead, we opt for the mean operation as $\frac{1}{|\mathcal{N}(i)|}\sum_{j\in\mathcal{N}(i)}$ .

Figure 4 details the selected architecture.

Figure 4: Left: The message $\mathbf{m}_{i,j}$ passed from the $j$ th neighbor of the $i$ th node is calculated by concatenating $\mathbf{h}_{i}$ and $\mathbf{h}_{i}^{(j)}$ and passing this concatenation through the neural network. Right: The representation of the $i$ th node at the output is calculated by aggregating the messages from all the nodes in $\mathcal{N}(i)$ . For each microphone, there is a separate graph and the neighbors are arbitrarily numbered (inspired by [21]).

We utilized message passing, one of several commonly used methods in GNNs. As mentioned, this process involves information exchange between nodes and their neighbors on the graph, enabling them to update their knowledge based on local interactions. Message passing facilitates effective learning and inference in graph-based models. In our graphs, $\mathcal{K}$ denotes the number of neighbors of each node.

Our neural network architecture consists of three FC layers. The input to the network is a concatenated vector of length $2d$ , and the architecture can be represented as follows: $2d\xrightarrow{}2d\xrightarrow{}2d\Rightarrow d$ . Here, each $\xrightarrow{}$ represents a single FC layer followed by a rectified linear unit (ReLU) activation function, while $\Rightarrow$ denotes an FC layer without an activation function. Our architecture uses shared weights across all microphones. No significant changes were observed when exploring a variant with $M-1$ individual sub-GCNs. Consequently, we opted for a less complex architecture with fewer parameters, making it flexible for various numbers of microphones.
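The forward pass described above (concatenation of the central node with each neighbor, a shared $2d\rightarrow 2d\rightarrow 2d\rightarrow d$ network, and mean aggregation) can be sketched in NumPy as follows. The weights are random and untrained, the initialization scheme and the function names are assumptions, and the neighbor lists are arbitrary; the sketch only illustrates the data flow and is not the authors' implementation.

```python
import numpy as np

def init_mlp(d, rng):
    """Random weights for the 2d -> 2d -> 2d -> d network (illustrative init)."""
    sizes = [(2 * d, 2 * d), (2 * d, 2 * d), (2 * d, d)]
    return [(rng.standard_normal(s) * np.sqrt(2.0 / s[0]), np.zeros(s[1]))
            for s in sizes]

def message_mlp(x, params):
    """Two FC + ReLU layers followed by one FC layer without activation."""
    (W1, b1), (W2, b2), (W3, b3) = params
    x = np.maximum(x @ W1 + b1, 0.0)
    x = np.maximum(x @ W2 + b2, 0.0)
    return x @ W3 + b3

def gcn_forward(H, neighbors, params):
    """H: (N, d) node features; neighbors: (N, K) neighbor indices per node."""
    K = neighbors.shape[1]
    center = np.repeat(H[:, None, :], K, axis=1)         # (N, K, d)
    neigh = H[neighbors]                                  # (N, K, d)
    pairs = np.concatenate([center, neigh], axis=-1)      # (N, K, 2d)
    messages = message_mlp(pairs, params)                 # (N, K, d)
    return messages.mean(axis=1)                          # mean over neighbors

rng = np.random.default_rng(0)
N, d, K = 6, 4, 3                                         # hypothetical sizes
H = rng.standard_normal((N, d))
neighbors = np.array([[(i + j) % N for j in range(1, K + 1)] for i in range(N)])
params = init_mlp(d, rng)
out = gcn_forward(H, neighbors, params)
out_reordered = gcn_forward(H, neighbors[:, ::-1], params)
```

Reordering the neighbors of a node leaves the output unchanged, reflecting the permutation invariance of the mean aggregation. Since the same `params` can be applied to the graph of every microphone, the parameter count does not grow with $M$.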

Algorithm 1 succinctly summarizes the procedural steps for GCN-based RTF estimation. Figure 5 describes the full architecture.

Training Stage:

1. Given the clean RTFs matrix $\mathbf{H}^{m}\in\mathbb{R}^{N_{\textrm{train}}\times d}$, build the graphs using $\mathcal{K}$NN for each microphone.
2. Select one grid position, remove the clean feature vectors associated with this location, replace them with noisy feature vectors, and connect them to the graph using $\mathcal{K}$NN.
3. Train the GCN for a robust representation of the noisy RTFs.
Repeat for the entire dataset until convergence.

Inference Stage:

1. Randomly choose a test location and add out of grid (OOG) noise to clean speech at a random SNR.
2. Use (7) to estimate the noisy RTFs.
3. Build graphs using $\mathcal{K}$NN for each microphone with $N_{\textrm{train}}$ clean feature vectors and one noisy feature vector.
4. Pass the graphs through the trained GCN to improve the estimate at the noisy RTF nodes.
Repeat for all test positions.

Algorithm 1: Enhancing RTF Robustness on GNN
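The graph-construction step shared by both stages can be sketched as follows. This is a toy illustration only: the Euclidean distance, the helper name `knn_indices`, and the random stand-in data are assumptions, not details taken from the algorithm above.

```python
import numpy as np

def knn_indices(query, bank, k):
    """Indices of the k rows of `bank` closest to `query` (Euclidean distance assumed)."""
    dist = np.linalg.norm(bank - query, axis=1)
    return np.argsort(dist)[:k]

rng = np.random.default_rng(0)
n_train, d, K = 50, 16, 5                        # toy sizes, not the values used in the paper
H_clean = rng.standard_normal((n_train, d))      # stand-in for the clean RTF matrix H^m
# Stand-in for a noisy RTF estimate: clean node 7 plus a small perturbation.
h_noisy = H_clean[7] + 0.05 * rng.standard_normal(d)

# Connect the single noisy node to its K nearest clean nodes.
nbrs = knn_indices(h_noisy, H_clean, K)
print(nbrs)
```

In this toy setting the closest clean node is the one from which the noisy vector was generated, which is the kind of neighborhood the GCN can exploit.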

V-C Objective Functions

To efficiently train the model, we experimented with several objective functions, which fall into two alternatives. In the first alternative, we directly optimize the outcome of the GCN, namely the RTF estimate. In the second alternative, we optimize the output of the MVDR beamformer by adjusting the RTF estimate. Both alternatives are illustrated in Fig. 6.

V-C1 Direct Optimization of the RTF

Define the Signal Blocking Factor (SBF) as:

\textrm{SBF}=\frac{1}{M-1}\sum_{m=0,m\neq\textrm{ref}}^{M-1}10\log_{10}\left(\frac{\sum_{n}x^{2}[n]}{\sum_{n}d_{m}^{2}[n]}\right)

(9)

where $x[n]=\{\hat{h}_{\textrm{oracle}}^{m}\ast\tilde{s}\}[n]$ and $d_{m}[n]=\{\hat{h}_{\textrm{oracle}}^{m}\ast\tilde{s}\}[n]-\{\hat{h}_{\textrm{GCN}}^{m}\ast\tilde{s}\}[n]$.

Here, $\tilde{s}[n]$ is the reference signal, $\hat{h}_{\textrm{oracle}}^{m}[n]$ is the oracle RTF corresponding to the $m$th microphone, and $\hat{h}_{\textrm{GCN}}^{m}[n]$ is the robust RTF of the $m$th microphone. The term $d_{m}[n]$ is the difference between the convolution of $\hat{h}_{\textrm{oracle}}^{m}[n]$ with $\tilde{s}[n]$ and the convolution of $\hat{h}_{\textrm{GCN}}^{m}[n]$ with $\tilde{s}[n]$. Maximizing this function encourages the robust RTF to be as close as possible to the oracle RTF. This measure is inspired by [52].
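As an illustration of (9), the SBF can be evaluated with a few lines of NumPy. The sketch below assumes the RTFs are given as time-domain filters stacked in arrays of shape $(M, L)$; the function name `sbf_db`, the toy sizes, and the random data are hypothetical.

```python
import numpy as np

def sbf_db(h_oracle, h_gcn, s_ref, ref=0):
    """Signal Blocking Factor of (9), averaged over the non-reference microphones.

    h_oracle, h_gcn: arrays of shape (M, L) holding time-domain RTFs (assumed layout).
    s_ref: reference signal samples.
    """
    M = h_oracle.shape[0]
    total = 0.0
    for m in range(M):
        if m == ref:
            continue
        x = np.convolve(h_oracle[m], s_ref)        # oracle RTF applied to the reference
        d = x - np.convolve(h_gcn[m], s_ref)       # residual left by the estimated RTF
        total += 10 * np.log10(np.sum(x**2) / np.sum(d**2))
    return total / (M - 1)

rng = np.random.default_rng(0)
M, L = 5, 32                                       # toy sizes
h_oracle = rng.standard_normal((M, L))
s_ref = rng.standard_normal(256)
h_gcn = 0.9 * h_oracle                             # estimate with a 10% gain error
val = sbf_db(h_oracle, h_gcn, s_ref, ref=2)
print(val)  # approximately 20 dB, since d_m = 0.1 x in this toy case
```

When the SBF is used as a training objective, its negative would serve as the loss, so that a larger blocking factor corresponds to a smaller loss.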

V-C2 RTF Estimation via Beamformer Output Optimization

Here we optimize the Scale-Invariant Source-to-Distortion Ratio (SI-SDR) at the output of the beamformer. The SI-SDR is defined as:

\textrm{SI-SDR}\left(\tilde{\mathbf{s}},\hat{\mathbf{s}}\right)=10\log_{10}\left(\frac{\left\|\frac{\langle\tilde{\mathbf{s}},\hat{\mathbf{s}}\rangle}{\langle\tilde{\mathbf{s}},\tilde{\mathbf{s}}\rangle}\tilde{\mathbf{s}}\right\|^{2}}{\left\|\frac{\langle\tilde{\mathbf{s}},\hat{\mathbf{s}}\rangle}{\langle\tilde{\mathbf{s}},\tilde{\mathbf{s}}\rangle}\tilde{\mathbf{s}}-\hat{\mathbf{s}}\right\|^{2}}\right)

(10)

where $\tilde{\mathbf{s}}$ is the vector of reference source samples and $\hat{\mathbf{s}}$ is the vector of beamformer output samples. The SI-SDR is a metric commonly used to evaluate the quality of source separation or speech enhancement algorithms [53]. It measures the enhancement quality of the estimated source signal relative to the true source signal, considering both the distortion and the interference introduced during the enhancement process. This objective aims to bring the beamformer output closer to the clean reference signal, and the RTF estimate is adjusted accordingly.
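A direct NumPy transcription of (10) is given below as a sketch. The function name `si_sdr_db` and the toy signals are illustrative assumptions; a training loss would typically use the negative of this value.

```python
import numpy as np

def si_sdr_db(s_ref, s_est):
    """SI-SDR of (10) for real-valued signal vectors."""
    alpha = np.dot(s_ref, s_est) / np.dot(s_ref, s_ref)   # optimal scaling of the reference
    target = alpha * s_ref                                 # scaled reference
    residual = target - s_est                              # everything not explained by the reference
    return 10 * np.log10(np.sum(target**2) / np.sum(residual**2))

# Toy example: the error is orthogonal to the reference and has 1/100 of its energy.
s_ref = np.array([1.0, 0.0, 1.0, 0.0])
err = np.array([0.0, 0.1, 0.0, 0.1])
s_est = s_ref + err
val = si_sdr_db(s_ref, s_est)
print(val)  # approximately 20 dB

# Rescaling the estimate leaves the measure unchanged (scale invariance).
val_scaled = si_sdr_db(s_ref, 3.0 * s_est)
```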

Additionally, we explore an alternative approach by computing the SI-SDR with respect to the output of the oracle-RTF beamformer. Here, we compute the MVDR weights using the RTFs estimated under ideal conditions (the oracle scenario) and evaluate the resulting SI-SDR against this signal. This approach aligns more closely with a supervised paradigm, akin to the RTF-level loss. Importantly, it eliminates the need for a clean reference signal in the loss function, addressing a common limitation in scenarios where such a reference signal is unavailable. Still, this objective requires the oracle RTFs to be available, which is another limitation. We designate the first version as SI-SDR I and the second as SI-SDR II.
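To illustrate how the SI-SDR II target signal is formed, the sketch below computes MVDR weights for a single frequency bin from an RTF vector and a noise covariance matrix, assuming the standard expression $\mathbf{w}=\mathbf{\Phi}_{v}^{-1}\mathbf{h}/(\mathbf{h}^{H}\mathbf{\Phi}_{v}^{-1}\mathbf{h})$. The RTF vector, the covariance matrix, and the function name `mvdr_weights` are hypothetical stand-ins.

```python
import numpy as np

def mvdr_weights(h, phi_v):
    """Per-bin MVDR weights: w = Phi_v^{-1} h / (h^H Phi_v^{-1} h)."""
    num = np.linalg.solve(phi_v, h)          # Phi_v^{-1} h without an explicit inverse
    return num / (h.conj() @ num)

rng = np.random.default_rng(0)
M, ref = 5, 2                                # five microphones, central one as reference
# Hypothetical oracle RTF vector for one bin, normalized so the reference entry equals 1.
h_oracle = rng.standard_normal(M) + 1j * rng.standard_normal(M)
h_oracle = h_oracle / h_oracle[ref]
# Hypothetical noise covariance matrix (Hermitian, positive definite).
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_v = A @ A.conj().T + M * np.eye(M)

w_oracle = mvdr_weights(h_oracle, phi_v)
response = w_oracle.conj() @ h_oracle        # w^H h
print(np.allclose(response, 1.0))  # True: distortionless toward the oracle RTF
```

Applying `w_oracle` to the microphone signals in every bin yields the oracle beamformer output, which then replaces the clean reference as the first argument of the SI-SDR in the second version of the objective.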

Short-Time Objective Intelligibility (STOI)

We also incorporate an implementation of STOI as a loss function (adopted from https://github.com/mpariente/pytorch_stoi). This implementation is integrated with VAD so that it matches the original measure. STOI is a metric that evaluates the intelligibility of speech signals.

VI Experimental Setup & Results

We use the MIRaGe dataset [49], consisting of real multichannel recordings acquired at Bar-Ilan acoustic lab. We evaluate the proposed GCN method using various objective and subjective performance measures. Additionally, we explore the impact of the graph structure on the results and compare several objective functions.

VI-A Experimental Setup:

The database was created by placing a loudspeaker on a grid of points in a cube-shaped volume with dimensions $46\times 36\times 32$ cm. The loudspeaker positions were sampled every $2$ cm along the x and y axes and every $4$ cm along the z-axis, totaling $24\times 19\times 9=4104$ possible source positions (grid vertices). In addition, 16 other positions, referred to as OOG, were designated as possible locations for noise sources. A chirp signal was played from each position in the grid and from each OOG position. The setup was recorded using six static linear microphone arrays, each consisting of $M=5$ microphones positioned at $-13,-5,0,+5,$ and $+13$ cm relative to the central microphone (the reference microphone). Recordings were made at three different reverberation levels: $100,$ $300,$ and $600$ ms.

For our experiments, we utilized microphone array #2, positioned directly in front of the grid at a distance of $2$ m from its center. The recordings were randomly divided into $N_{\textrm{train}}=3500$ training positions, $N_{\textrm{validation}}=100$ validation positions, and $N_{\textrm{test}}=504$ test positions. We use $2048$ frequency bins, which is also the number of time-domain RTF samples. Additionally, we set $l_{\textrm{uncausal}}=128$ and $l_{\textrm{causal}}=256$.
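The grid size and the data split quoted above can be cross-checked with a short calculation. The variable names are illustrative; the numbers are those stated in the text.

```python
# Grid extent [cm] and sampling step [cm] along the x, y, and z axes.
extent = (46, 36, 32)
step = (2, 2, 4)

# Number of grid vertices per axis: one more than the number of intervals.
points = [e // s + 1 for e, s in zip(extent, step)]
n_grid = points[0] * points[1] * points[2]

# Random split of the grid positions used in the experiments.
split = {"train": 3500, "validation": 100, "test": 504}

print(points, n_grid, sum(split.values()))  # [24, 19, 9] 4104 4104
```

The three subsets therefore cover every grid vertex exactly once.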

The RTF estimation involved the following procedure: 1) using chirp signals recorded in the MIRaGe database, the AIRs from the source position to the microphone arrays were estimated using LS; 2) for the clean RTFs, pink noise signals covering all relevant frequencies were convolved with the AIRs to generate the desired signal, whereas for the noisy signals, a speech signal was convolved with the AIRs and pink noise from an OOG location was added; and 3) the RTFs were estimated using (7), and for the clean signals, we used EVD.

To generate a sufficiently large training set, we mix each of the 3500 clean training speech signals with three different noise signals, each at a random SNR and originating from one of the $16$ OOG locations, resulting in $3500\times 3=10500$ noisy training examples. For our graphs, we selected $\mathcal{K}=5$ as the $\mathcal{K}$NN parameter.

The network was trained for 100 epochs using a linear scheduler with a warmup ratio of 0.1, a learning rate of $1\times 10^{-4}$, and a dropout rate of 0.5. We chose the second version of the SI-SDR (SI-SDR II) as the objective function for all reverberation times.

TABLE I: Parameters

parameter	value
M	5
Frequency bins	2048
$l_{\textrm{uncausal}}$	128
$l_{\textrm{causal}}$	256
$\mathcal{K}$	5

All the parameters are listed in Table I.

VI-B Quality Measure:

The results are analyzed using several quality metrics: the SNR at the beamformer output, calculated as:

\textrm{SNR}\left(\hat{\mathbf{s}},\hat{\mathbf{v}}\right)=10\log_{10}\left(\frac{\left\|\hat{\mathbf{s}}\right\|^{2}}{\left\|\hat{\mathbf{v}}\right\|^{2}}\right)

(11)

Here, $\hat{\mathbf{s}}$ is the vector of the desired-signal component at the beamformer output for all samples, and $\hat{\mathbf{v}}$ is the vector of the noise component at the beamformer output for all samples. Additionally, we employ the STOI measure [54] to assess the intelligibility of speech signals, along with the Deep Noise Suppression Mean Opinion Score (DNSMOS) metric [55]. We also examine the SI-SDR (10) in its first version, computed with respect to the reference signal, to measure the distortion of the outputs.
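Because a beamformer is a linear filter, its output splits into a desired-signal component and a noise component, which is what makes (11) computable in simulation. The sketch below illustrates this with a fixed weight vector in the time domain; the uniform weights, the toy signals, and the function name `snr_db` are assumptions for illustration and do not represent the MVDR weights used in the experiments.

```python
import numpy as np

def snr_db(s_hat, v_hat):
    """Output SNR of (11) from the signal and noise components at the beamformer output."""
    return 10 * np.log10(np.sum(np.abs(s_hat) ** 2) / np.sum(np.abs(v_hat) ** 2))

rng = np.random.default_rng(0)
M, N = 5, 1000                              # microphones and samples (toy sizes)
speech = rng.standard_normal((M, N))        # stand-in for the speech component at the microphones
noise = 0.5 * rng.standard_normal((M, N))   # stand-in for the noise component at the microphones
w = np.full(M, 1.0 / M)                     # hypothetical fixed weights

s_hat = w @ speech                          # speech component at the output
v_hat = w @ noise                           # noise component at the output
out = w @ (speech + noise)                  # actual beamformer output

print(np.allclose(out, s_hat + v_hat))  # True: linearity lets us separate the components
print(snr_db(s_hat, v_hat))
```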

VI-C Results:

For each metric we present, we calculate averages across all testing locations at various input SNR levels. To assess performance, we compare these measures with other MVDR beamformers. This involves using either the traditional GEVD estimate of the RTFs or the RTF estimate derived from the method introduced in [23], known as manifold projection learning (MP), alongside the oracle RTF. The oracle RTF is the RTF of the test location, estimated under noise-free conditions (similar to the training set). For MP, two parameters need to be chosen: $\epsilon$, the kernel scale parameter, and $\mathfrak{l}$, the number of dominant eigenvalues in the algorithm. Specifically, we chose $\epsilon=0.3$ for all reverberation times, while $\mathfrak{l}$ varied: $\mathfrak{l}=12$ for $T_{\textrm{60}}=100~\text{ms}$, $\mathfrak{l}=5$ for $T_{\textrm{60}}=300~\text{ms}$, and $\mathfrak{l}=15$ for $T_{\textrm{60}}=600~\text{ms}$. Figs. 8, 9, and 10 depict the $\textrm{SNR}_{\textrm{out}}$, STOI, and DNSMOS quality measures for $T_{\textrm{60}}=100,300,600~\text{ms}$, comparing the proposed method (peerRTF) with the GEVD, the oracle, and the MP algorithm.

As can be seen, the proposed method outperforms the vanilla GEVD-based beamformer in terms of speech intelligibility across all SNR and reverberation levels. Compared to the MP-based beamformer, there is an improvement in almost all conditions. The SNR at the beamformer output is consistently higher than that of the vanilla GEVD and MP-based beamformers across all SNR levels and reverberation conditions. Furthermore, our method outperforms the oracle RTF in certain regions.

These advantages are also demonstrated subjectively by assessing the sonograms of a randomly chosen test example at $\textrm{SNR}_{\textrm{in}}=-10$ dB and $T_{60}=600$ ms, shown in Fig. 7.

In addition to the overall sonograms, we zoom in on the lower frequencies to assess them more precisely. When comparing the output of the beamformer to the reference signal, it is evident that the peerRTF not only preserves more of the original signal's frequency content but also generates fewer artifacts. This can be observed in the left red rectangle, where the reference sonogram exhibits a slight break. This break is also noticeable in the peerRTF output, whereas the GEVD output shows no break. In the right rectangle, a distinct frequency component appears in the GEVD output that is absent in both the reference and the peerRTF output. Furthermore, the peerRTF output exhibits less noise.

An example of the results can be heard on the project page.

VI-D Alternative Loss Functions and Network Architecture

In this section, we compare graph structures to underscore the importance of the neighbors and evaluate different objective functions. In the tables, we compare the different techniques for $T_{60}=600$ ms and $\textrm{SNR}=-10$ dB, since the contribution of the proposed method is most significant in this condition, where the GEVD is less effective due to the high reverberation time and low SNR. Our goal is to examine the significance of the graph. Because it is often unclear what a neural network actually exploits, we aim to verify that the improvement genuinely stems from the neighbors together with the network, and not from processing the noisy RTFs alone. In Table II, we compare our method to the self-RTFs graph structure, in which the noisy RTF node is connected only to itself and does not receive information from clean nodes. All other factors were kept the same during training. The results demonstrate that our method consistently outperforms the self-RTFs structure across all quality measures. In certain metrics, namely SI-SDR, DNSMOS, and SNR, the outcomes of the self-RTFs structure were even inferior to those of the GEVD, which represents the noisy RTF that enters the network. This observation highlights the crucial role played by the neighbors in this procedure.

TABLE II: Importance of the neighbors

Model	STOI [%]	ESTOI [%]	SI-SDR [dB]	DNSMOS (P.808 MOS)	SNR [dB]
Unprocessed	23.85	9.32	-10.2	2.22	-10
Reference	-	-	-	2.94	-
Oracle	72.07	55.95	0.57	2.53	16.83
GEVD	66.52	49.5	-3.33	2.52	14.21
MP	70.23	54.21	-1.77	2.52	15.65
peerRTF	71.63	55.53	0.46	2.62	17.3
Self-RTFs	68.06	51.54	-6.2	2.48	12.29

In Table III, we train and evaluate the GCN with different objective functions. As seen in the table, different objective functions emphasize different quality measures. For instance, the STOI loss yields the highest STOI and extended short-time objective intelligibility (ESTOI) scores, in line with its purpose. Similarly, the SI-SDR I loss yields the highest SI-SDR and SNR, reflecting their shared concept. In contrast, the SI-SDR II loss, which we discussed earlier as being more supervised, yields the highest DNSMOS. We chose SI-SDR II as our objective, since it improves all the metrics relative to the GEVD.

TABLE III: Comparison of objective functions

Model	STOI [%]	ESTOI [%]	SI-SDR [dB]	DNSMOS (P.808 MOS)	SNR [dB]
Unprocessed	23.85	9.32	-10.2	2.22	-10
Reference	-	-	-	2.94	-
Oracle	72.07	55.95	0.57	2.53	16.83
GEVD	66.52	49.5	-3.33	2.52	14.21
MP	70.23	54.21	-1.77	2.52	15.65
SI-SDR I	72.52	57.21	2.34	2.57	17.96
SI-SDR II	71.63	55.53	0.46	2.62	17.3
SBF	71.32	55.53	-0.5	2.52	15.5
STOI	72.87	57.32	-2.23	2.51	15.12

VII Conclusion

In this paper, we have presented a novel RTF identification method that relies on learning the RTF manifold using a GCN to infer a robust estimate of the RTF in a noisy and reverberant environment. This opens the door of robust acoustic beamforming to a new field of learning methods. As seen in recent years, learning-based methods have significantly improved classical algorithms and offer great flexibility and room for improvement. There are still many improvements to be made to the GCN model, both in the model architecture, which can be made stronger and more effective, and in the graph inference itself. The graph can be constructed in a more informative manner, for example by weighting its edges according to some similarity measure or by including a model that infers the graph directly from the RTFs.

The results shown here verify the robustness of this approach and show the benefit of applying a learning method directly on the graph representing the manifold, as opposed to learning a projection of the graph into a Euclidean space, thereby flattening the manifold, and operating in that space.

References

[1] S. Gannot, D. Burshtein, and E. Weinstein, “Signal enhancement using beamforming and nonstationarity with applications to speech,” IEEE Tran. on Sig. Proc., vol. 49, no. 8, pp. 1614–1626, 2001.
[2] S. Markovich-Golan and S. Gannot, “Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 544–548.
[3] X. Li, L. Girin, R. Horaud, and S. Gannot, “Estimation of relative transfer function in the presence of stationary noise based on segmental power spectral density matrix subtraction,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 320–324.
[4] O. Shalvi and E. Weinstein, “System identification using nonstationary signals,” IEEE Tran. on Sig. Proc., vol. 44, no. 8, pp. 2055–2063, 1996.
[5] Z. Koldovskỳ, J. Málek, and S. Gannot, “Spatial source subtraction based on incomplete measurements of relative transfer function,” IEEE/ACM Tran. on Au., Sp., and Lang. Proc., vol. 23, no. 8, pp. 1335–1347, 2015.
[6] Y. R. Zheng, R. A. Goubran, and M. El-Tanany, “Robust near-field adaptive beamforming with distance discrimination,” IEEE Tran. on Sp. and Au. Proc., vol. 12, no. 5, pp. 478–488, 2004.
[7] S. Doclo, S. Gannot, M. Moonen, A. Spriet, S. Haykin, and K. R. Liu, “Acoustic beamforming for hearing aid applications,” Handbook on array processing and sensor networks, pp. 269–302, 2010.
[8] H. Cox, R. Zeskind, and M. Owen, “Robust adaptive beamforming,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 10, pp. 1365–1376, 1987.
[9] J. Li, P. Stoica, and Z. Wang, “On robust Capon beamforming and diagonal loading,” IEEE Tran. on Sig. Proc., vol. 51, no. 7, pp. 1702–1715, 2003.
[10] ——, “Doubly constrained robust Capon beamformer,” IEEE Tran. on Sig. Proc., vol. 52, no. 9, pp. 2407–2423, 2004.
[11] B. Laufer-Goldshtein, R. Talmon, and S. Gannot, “A study on manifolds of acoustic responses,” in Int. Conf. on Latent Variable Analysis and Sig. Separation. Springer, 2015, pp. 203–210.
[12] J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[13] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[14] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in NIPS, vol. 14, no. 14, 2001, pp. 585–591.
[15] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Tran. on Au., Sp. and Lang. Proc., vol. 25, no. 4, pp. 692–730, 2017.
[16] O. Shmaryahu and S. Gannot, “On the importance of acoustic reflections in beamforming,” in International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 2022.
[17] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[18] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in European semantic web conference. Springer, 2018, pp. 593–607.
[19] S. Zhang, H. Tong, J. Xu, and R. Maciejewski, “Graph convolutional networks: a comprehensive review,” Computational Social Networks, vol. 6, no. 1, pp. 1–23, 2019.
[20] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
[21] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph CNN for learning on point clouds,” ACM Transactions on Graphics (TOG), vol. 38, no. 5, pp. 1–12, 2019.
[22] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, “Graph neural networks: A review of methods and applications,” AI Open, vol. 1, pp. 57–81, 2020.
[23] A. Sofer, T. Kounovskỳ, J. Čmejla, Z. Koldovskỳ, and S. Gannot, “Robust relative transfer function identification on manifolds for speech enhancement,” in 2021 29th European Signal Processing Conference (EUSIPCO). IEEE, 2021, pp. 401–405.
[24] R. Talmon and S. Gannot, “Relative transfer function identification on manifolds for supervised GSC beamformers,” in 21st European Signal Processing Conference (EUSIPCO 2013). IEEE, 2013, pp. 1–5.
[25] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[26] K. Sohn, H. Lee, and X. Yan, “Learning structured output representation using deep conditional generative models,” Advances in neural information processing systems, vol. 28, 2015.
[27] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
[28] B. Laufer-Goldshtein, R. Talmon, and S. Gannot, “Semi-supervised sound source localization based on manifold regularization,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1393–1407, 2016.
[29] A. Deleforge and R. Horaud, “2d sound-source localization on the binaural manifold,” in 2012 IEEE International Workshop on Machine Learning for Signal Processing. IEEE, 2012, pp. 1–6.
[30] B. Laufer-Goldshtein, R. Talmon, and S. Gannot, “Semi-supervised source localization on multiple manifolds with distributed microphones,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1477–1491, 2017.
[31] A. Deleforge, F. Forbes, and R. Horaud, “Acoustic space learning for sound-source separation and localization on binaural manifolds,” International journal of neural systems, vol. 25, no. 01, p. 1440003, 2015.
[32] A. Brendel, J. Zeitler, and W. Kellermann, “Manifold learning-supported estimation of relative transfer functions for spatial filtering,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 8792–8796.
[33] R. R. Coifman and S. Lafon, “Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions,” Applied and Computational Harmonic Analysis, vol. 21, no. 1, pp. 31–52, 2006.
[34] J. Svoboda, J. Masci, F. Monti, M. M. Bronstein, and L. Guibas, “Peernets: Exploiting peer wisdom against adversarial attacks,” arXiv preprint arXiv:1806.00088, 2018.
[35] F. R. Chung and F. C. Graham, Spectral graph theory. American Mathematical Soc., 1997, no. 92.
[36] D. Spielman, “Spectral graph theory,” Combinatorial scientific computing, vol. 18, 2012.
[37] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
[38] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
[39] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” Advances in neural information processing systems, vol. 29, pp. 3844–3852, 2016.
[40] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” in Advances in neural information processing systems, 2016, pp. 1993–2001.
[41] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in International conference on machine learning. PMLR, 2016, pp. 2014–2023.
[42] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in International conference on machine learning. PMLR, 2017, pp. 1263–1272.
[43] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1025–1035.
[44] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model cnns,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5115–5124.
[45] W. Cao, Z. Yan, Z. He, and Z. He, “A comprehensive survey on geometric deep learning,” IEEE Access, vol. 8, pp. 35 929–35 949, 2020.
[46] U. Alon and E. Yahav, “On the bottleneck of graph neural networks and its practical implications,” arXiv preprint arXiv:2006.05205, 2020.
[47] S. Markovich, S. Gannot, and I. Cohen, “Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals,” IEEE Tran. on Au., Sp., and Lang. Proc., vol. 17, no. 6, pp. 1071–1086, 2009.
[48] S. Markovich-Golan and S. Gannot, “Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method,” in IEEE Int. Conf. on Acous., Sp. and Sig. Proc. (ICASSP), 2015, pp. 544–548.
[49] J. Čmejla, T. Kounovský, S. Gannot, Z. Koldovský, and P. Tandeitnik, “MIRaGe: multichannel database of room impulse responses measured on high-resolution cube-shaped grid,” in 28th European Signal Processing Conference (EUSIPCO), 2021, pp. 56–60.
[50] G. Li, M. Muller, A. Thabet, and B. Ghanem, “Deepgcns: Can gcns go as deep as cnns?” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9267–9276.
[51] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in Thirty-Second AAAI conference on artificial intelligence, 2018.
[52] R. Talmon and S. Gannot, “Relative transfer function identification on manifolds for supervised GSC beamformers,” in 21st Euro. Sig. Proc. Conf. (EUSIPCO), 2013.
[53] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630.
[54] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Tran. on Au., Sp., and Lang. Proc., vol. 19, no. 7, pp. 2125–2136, 2011.
[55] C. K. Reddy, V. Gopal, and R. Cutler, “DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 886–890.