peerRTF: Robust MVDR Beamforming Using Graph Convolutional Network

Amit Sofer, Daniel Levi and Sharon Gannot
Abstract

Accurate and reliable identification of the relative transfer functions (RTFs) between microphones with respect to a desired source is an essential component in the design of microphone array beamformers, specifically those based on the minimum variance distortionless response (MVDR) criterion. Since accurate estimation of the RTF in a noisy and reverberant environment is a cumbersome task, we aim at leveraging prior knowledge of the acoustic enclosure to robustify the RTF estimation by learning the RTF manifold. In this paper, we present a novel robust RTF identification method, trained and tested with real recordings, which relies on learning the RTF manifold using a graph convolutional network (GCN) to infer a robust representation of the RTFs in a confined area, and consequently enhance the beamformer’s performance.

The authors are with Bar-Ilan University, Israel. E-mail: {amit.sofer,daniel.levi1,sharon.gannot}@biu.ac.il. The work was partially supported by grant #3-16416 from the Ministry of Science & Technology, Israel, and from the European Union’s Horizon 2020 Research and Innovation Programme, Grant Agreement No. 871245. Amit Sofer and Daniel Levi contributed equally to the paper. Project page: https://peerrtf.github.io/

Index Terms:
robust MVDR beamformer, manifold learning, graph convolutional network

I Introduction

Modern acoustic beamformers outperform conventional direction of arrival (DOA)-based beamformers due to their ability to consider the entire acoustic propagation path rather than only the direct path. However, an estimate of the acoustic impulse responses (AIRs) relating the source and the microphones (or of their corresponding acoustic transfer functions (ATFs)) is essential. Given that ATF estimation poses a blind problem, the approach in [1] suggests replacing the ATFs with relative transfer functions (RTFs) in the beamformer design.

While various algorithms for estimating RTFs can be found in the literature, such as those proposed in [1, 2, 3, 4, 5], their performance often degrades in low signal-to-noise ratio (SNR) and high-reverberation conditions. The literature extensively covers approaches to enhance beamforming robustness, commonly achieved through techniques like beam widening, as discussed in [6, 7, 8, 9, 10]. In this work, our approach focuses on improving the estimated RTF by leveraging a pre-learned set of RTFs and learning the RTF manifold.

Despite their intricate structure, [11] demonstrated that the RTFs are primarily influenced by a limited set of parameters, such as the size and geometry of the room, the positions of the source and the microphones, and the reflective properties of the walls. Consequently, acoustic paths exhibit geometric structures of low dimensionality, commonly referred to as manifolds, and can be analyzed using manifold learning methods. In a fixed room with a static microphone array location, the only degree of freedom is the source location, causing the RTF to vary only based on the speaker’s position. Consequently, RTFs from different locations lie on a manifold. By assembling a clean set of RTFs as a training dataset, we can explore the RTF manifold and derive a more robust estimate of the RTF from noisy recordings.

Several manifold learning approaches, such as those proposed by [12, 13, 14], typically follow a standard framework. In this framework, manifold samples are initially represented as a graph. Subsequently, a low-dimensional representation (embedding) of the data is inferred, preserving its structure meaningfully. This representation effectively “flattens” the original non-Euclidean structure of the manifold into a Euclidean space, simplifying subsequent analysis. Post-inference, an algorithm is applied to the low-dimensional embedding to accomplish the desired task.

The MVDR beamformer is a spatial filter designed to minimize the noise power in its output while preserving the desired source without distortion. Using the RTFs as the steering vector for calculating the MVDR weights is supported by several works [1, 15, 16]. In our research, we adopt this approach.

In recent years, geometric deep learning (GDL), a term describing techniques that extend deep neural models to non-Euclidean inputs such as graphs and manifolds, has seen significant adoption in fields such as the social sciences (e.g., analyzing social networks using graphs), chemistry (where molecules can be represented as graphs), biology (where biomolecular interactions form graph structures), 3D point cloud manifold learning, computer vision, and others. These methods usually focus on classification, segmentation, clustering, and recommendation tasks, but not on regression tasks. A particular type of graph neural network (GNN) is the GCN, which is based on the principle of learning through shared weights, similar to convolutional neural networks (CNNs) [17, 18, 19, 20, 21, 22].

Previous efforts to learn the manifold of RTFs [23, 24] have employed a graph representation, utilizing the Gaussian heat kernel to determine edge weights. Spectral graph theory is then applied to infer a low-dimensional embedding of the manifold in Euclidean space. The Euclidean distance between samples in this transformed space reflects the diffusion distance on the manifold surface. Subsequently, leveraging geometric harmonics, an algorithm is employed to extend the training data and estimate the RTF based on the acquired manifold and the noisy signals. This algorithm effectively projects the noisy RTF onto the learned manifold of potential RTFs, resulting in a more robust estimation of the RTF.

Drawing inspiration from recent developments in the GNN field, which demonstrate that graphs naturally emerge when learning a manifold, this paper aims to enhance the traditional manifold learning blueprint. The conventional blueprint involves flattening the non-Euclidean manifold into a Euclidean space. Instead, we harness the power of the GCN to learn the high-dimensional RTF manifold and infer a robust estimate of an RTF from a noisy RTF, directly from the graph representing the manifold.

Our contribution is threefold: 1) a novel robust RTF estimation algorithm that infers the RTF manifold using a GCN and leverages it to robustify the RTF estimation; 2) a comprehensive assessment of the proposed scheme and its performance advantages as compared with competing methods at various SNR levels and in real-world acoustic scenarios; 3) a demonstration of how this framework can be used in further research to expand the capabilities of speech enhancement and localization algorithms.

The remainder of this paper is organized as follows. In Section II we formalize the problem. Section III describes the relations between manifold learning and graphs and presents the GNN framework, and in particular, the GCN variant. Section IV explains a general robust beamforming approach, which includes the vanilla RTF estimation and RTF-based beamforming. Section V elaborates on our approach, in particular, the creation of the graph data, the architecture of our network, and the objective functions. Section VI describes the experimental setup and presents the results together with a comparison to other methods. Section VII concludes the paper.

II Problem Formulation

An $M$-microphone array is positioned in a reverberant enclosure. We assume that the desired source location is confined to a known region. Examples of such environments include conference rooms, where the microphone array is placed at a fixed location on the table, and speakers occupy designated positions around it. Similarly, in office setups, the microphone array is fixed on the desk or computer screen, with the speaker typically seated behind the desk. In a car, the microphone array is positioned at a fixed location at the visor, while the speaker occupies one of the seats.

Let $r_{m}[n]$, $m=0,\ldots,M-1$, denote the measured signal at the $m$th microphone. Here, $s[n]$ represents the desired speech signal, and $v_{m}[n]$ represents the contribution of all noise sources captured by the $m$th microphone. The signal captured by the $m$th microphone can be modeled as:

$$r_{m}[n]=\{s*a_{m}\}[n]+v_{m}[n]. \tag{1}$$

Here, $a_{m}[n]$ stands for the AIR from the source to the $m$th microphone at time $n$, and $*$ denotes the convolution operator. In scenarios where the speaker remains static, the AIR remains constant over time. The time-domain convolution in (1) can be approximated by multiplication in the short-time Fourier transform (STFT) domain. All $M$ equations can then be written in a single vector form as:

$$\mathbf{r}(l,k)=s(l,k)\mathbf{a}(k)+\mathbf{v}(l,k). \tag{2}$$

Here, $l$ and $k$ represent the time-frame and frequency-bin indexes, respectively, with $l\in\{0,\ldots,L-1\}$ and $k\in\{0,\ldots,K-1\}$. The vector $\mathbf{a}(k)=[a_{0}(k),\ldots,a_{M-1}(k)]^{\top}$ comprises all ATFs from the source to the microphone array. We define $a_{\textrm{ref}}(k)$ as the component of the vector $\mathbf{a}(k)$ that corresponds to the reference microphone. Equation (2) can also be reformulated as a function of $\tilde{s}(l,k)=s(l,k)a_{\textrm{ref}}(k)$, representing the source signal as captured by the reference microphone:

$$\mathbf{r}(l,k)=\tilde{s}(l,k)\mathbf{h}(k)+\mathbf{v}(l,k), \tag{3}$$

where $\mathbf{h}(k)$ is the vector of RTFs:

$$\mathbf{h}(k)\triangleq\frac{\mathbf{a}(k)}{a_{\textrm{ref}}(k)}. \tag{4}$$
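As an illustration of (2)-(4), the following minimal NumPy sketch draws random ATFs and checks that the ATF-based model (2) and the RTF-based model (3) describe the same microphone signals. The function name `rtf_from_atf`, the toy dimensions, and the random data are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
import numpy as np

def rtf_from_atf(a, ref=0):
    """Return h(k) = a(k) / a_ref(k) for ATFs `a` of shape (M, K), as in (4)."""
    return a / a[ref]

rng = np.random.default_rng(0)
M, K, L = 4, 8, 5  # microphones, frequency bins, time frames (toy sizes)

# Hypothetical random ATFs a(k), source STFT s(l, k), and noise v(l, k).
a = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))
s = rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))
v = 0.1 * (rng.standard_normal((L, K, M)) + 1j * rng.standard_normal((L, K, M)))

h = rtf_from_atf(a, ref=0)   # RTF vector, shape (M, K)
s_tilde = s * a[0]           # source as captured by the reference microphone

# Model (2): r = s * a + v, and model (3): r = s_tilde * h + v.
r_atf = s[:, :, None] * a.T[None, :, :] + v
r_rtf = s_tilde[:, :, None] * h.T[None, :, :] + v
```

Both arrays have shape `(L, K, M)` and coincide up to rounding, and the reference entry of `h` equals one at every frequency bin, which is the normalization implied by (4).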

Playing a precisely defined signal, such as a chirp or white noise, without any background noise, from different locations within the desired acoustic environment enables us to use standard methods for identifying the system. This process yields a collection of RTFs. Our objective is to glean insights into the RTF manifold using this ensemble of clean RTFs. This, in turn, can be leveraged to enhance estimates of noisy RTFs from the same enclosure, thereby robustifying the beamformer’s design.

III Manifold learning & Graph neural networks

III-A Graphs in manifold learning

In many areas, there is a need to comprehend a manifold. Typically, when attempting to learn a manifold, there is no existing mathematical model, and only a limited number of samples are available. Manifold learning algorithms typically follow a standard blueprint. First, they generate a data representation by constructing a neighbor graph. Second, they compute a low-dimensional representation (embedding) of the data, preserving a specific aspect of the original manifold structure. For instance, Locally Linear Embedding [13], Isomap [12], and Laplacian eigenmaps [14] use different techniques. Variational autoencoders (VAEs) [25] introduce a distribution in the embedding through the encoder. Extensions like conditional VAEs [26] and adversarial autoencoders [27] aim for a structured data representation. This new embedding “flattens” the original non-Euclidean structure of the manifold, making it more manageable. Third, a task-dependent algorithm (classification, clustering, or regression) is applied after inferring the representation.

Manifold learning has found various applications in audio, including localization [28, 29, 30, 31] and speech enhancement [24, 23, 32]. In [23], the RTF manifold is initially represented by a graph where the RTFs serve as graph nodes, and the edges’ weights are defined using the heat kernel function. A Markov process is established on the graph by constructing a transition matrix representing the manifold diffusion process. Subsequently, leveraging spectral graph theory, a low-dimensional embedding of the dataset in Euclidean space is derived. In this space, the Euclidean distance between samples reflects the diffusion distance across the high-dimensional manifold surface. Once this low-dimensional embedding is obtained, geometric harmonics [33], a method extending low-dimensional embeddings to new data points, is employed to create a supervised RTF identification estimator. In [32], a VAE-based manifold model for RTFs is proposed to robustify RTF estimation. Unlike linear methods, this approach provides a high degree of expressiveness by avoiding constraints associated with linearity. The VAE is trained in an unsupervised manner using data collected under benign acoustic conditions, enabling it to reconstruct RTFs within the specified enclosure. The method introduces a least squares (LS)-based RTF estimator that is regularized by the trained VAE. This regularization significantly improves the quality of RTF estimates compared to traditional VAE-based denoising methods. A hybrid model is proposed, combining classic RTF estimation with the capabilities of the trained VAE.

In [34], the relation between the graph structure, manifold learning (ML), and GNNs is established, emphasizing how the graph structure contributes to the model’s accuracy. Building upon these established foundations, we propose to harness the ML capabilities of GNNs to obtain a more accurate and robust estimator of RTFs in noisy and reverberant environments.

III-B Graph Convolution Networks

In this section, we first define the mathematical representation of a graph, then describe the two different types of GCNs, explain the differences between them, and finally, formally define the spatial GCNs.

A graph $\mathcal{G}=(\mathcal{E},\mathcal{V})$ consists of a set of nodes $\mathcal{V}=\{v_{1},\ldots,v_{N}\}$ and edges $\mathcal{E}$, where the edges are assumed to be scalars denoted as $e_{j,i}\in\mathbb{R}$ connecting the $j$th node to the $i$th node. Alternatively, the graph can be represented as $\mathcal{G}=(\mathbf{V},\mathbf{A})$, where $\mathbf{V}\in\mathbb{R}^{N\times d}$ is the nodes’ feature matrix, with $d$ the feature dimension, and $\mathbf{A}\in\mathbb{R}^{N\times N}$ is the graph adjacency matrix. Denote by $\mathcal{N}(i)$ the neighborhood of nodes connected to node $i$. The adjacency matrix should then satisfy $\mathbf{A}_{i,j}=1$ if $j\in\mathcal{N}(i)$ (in the general case, $\mathbf{A}_{i,j}$ may be less than 1).
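The two graph representations above can be related with a short sketch. The helper names `adjacency_from_edges` and `neighborhood`, as well as the four-node toy graph, are hypothetical and serve only to illustrate the convention that the adjacency entry at row $i$, column $j$ equals 1 when node $j$ belongs to the neighborhood of node $i$.

```python
import numpy as np

def adjacency_from_edges(num_nodes, edges):
    """Build A with A[i, j] = 1 when node j is a neighbor of node i.

    `edges` is an iterable of (j, i) pairs, each meaning that an edge
    connects the j-th node to the i-th node.
    """
    A = np.zeros((num_nodes, num_nodes))
    for j, i in edges:
        A[i, j] = 1.0
    return A

def neighborhood(A, i):
    """Return the set N(i) of nodes connected to node i."""
    return set(np.flatnonzero(A[i]).tolist())

# Toy undirected path graph 0 - 1 - 2 - 3, listed as directed pairs.
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
A = adjacency_from_edges(4, edges)
```

For this toy graph, `neighborhood(A, 1)` contains nodes 0 and 2, and `A` is symmetric because every edge is listed in both directions.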

GNNs are a generalization of conventional neural networks designed to process non-Euclidean inputs represented as graphs. Graphs offer considerable flexibility in data representation, and GNNs extend neural network methods to graph-structured data. They achieve this by iteratively propagating information through nodes and edges of the graph, enabling them to capture and exploit the inherent information encoded in the graph structure. A specific variant of GNN is the GCN, inspired by the principles of learning through shared weights, similar to the approach used in CNNs for image analysis and computer vision. Designing local operations with shared weights is the key to efficient learning on graphs. This involves message passing between each node and its neighbors, facilitated by the shared weights. The use of shared weights, implying parameter sharing across different parts of the graph, enhances the efficiency and scalability of the learning process.

Current GCN algorithms can be categorized into spectral-based and spatial-based. Spectral GCNs rely on the principles of spectral graph theory [35, 36]. This involves processing the graph through the eigendecomposition of the graph Laplacian, which is used to compute the Fourier transform of a graph signal. This, in turn, defines graph filtering operations [37, 38, 39, 17]. In contrast, spatial-based GCNs operate on the principle of message passing and directly define convolutions on the graph itself. They aim to capture information by aggregating features from neighboring nodes through shared weights [40, 41, 42, 43, 20, 44, 21, 34]. Spatial GCNs can operate locally on each node without considering the entire graph, making them well-suited for node-specific tasks. Given that our problem, as indicated by the graph construction, involves a node regression task, we will focus on spatial-based GCNs from this point onward.

III-C Spatial GCN

Much like the convolutional operation employed by conventional CNNs for image processing, spatial-based methods extend this concept to define graph convolutions based on the spatial relations among nodes. In this analogy, images can be seen as a specific type of graph, where each pixel serves as a node, and direct connections exist between each pixel and its adjacent counterparts.

In a CNN, the operation involves computing the weighted average of pixel values for the central node and its neighbors across each channel. Similarly, in spatial-based graph convolutions, the representation of the central node is convolved with the representations of its neighboring nodes to formulate an updated representation for the central node.

The permutation invariance observed in graph operations differs significantly from the behavior of classical deep neural networks designed for grid-structured data. This invariance implies independence from the order of neighboring nodes, given the absence of a canonical way to arrange them. Consequently, a substantial distinction emerges between the kernels of CNNs, which leverage the knowledge of neighbor ordering by assigning varying weights during convolution, and GCN kernels. The latter lack this knowledge, resulting in weights being shared across all neighboring nodes and the entire graph. Fig. 1 compares 2D convolution and graph convolution.

Figure 1: 2D convolution vs. graph convolution. Left: In conventional 2D convolution on a Euclidean input, such as an image, the central pixel (depicted in red) of the next layer is calculated as a weighted average of itself and its neighbors, determined by the kernel size. The input is ordered and numbered accordingly. Right: In spatial graph convolution, the representation of the central node in the next layer is computed by aggregating features from neighboring nodes, with no regard to the order of neighbors or to a fixed graph size (inspired by [45]).

A spatial GCN comprises a sequence of graph convolution layers. The nodes’ representations undergo two fundamental steps within each layer: aggregation of features from neighboring nodes and a subsequent nonlinear transformation. The transformation in each convolutional layer is often implemented as a multi-layer perceptron (MLP). The initial representation of the nodes at the input to the first convolutional layer relies on their input features.
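The two steps of a spatial graph convolution layer, aggregation followed by a nonlinear transformation, can be sketched as follows. This is a generic illustration and not the architecture used in this paper: the mean aggregator, the separate weight matrices `W_self` and `W_nbr`, the single linear map in place of an MLP, and the ReLU nonlinearity are all assumptions chosen for brevity.

```python
import numpy as np

def spatial_gcn_layer(V, A, W_self, W_nbr, b):
    """One generic message-passing layer (illustrative assumptions only).

    V: (N, d_in) node features; A: (N, N) adjacency with A[i, j] = 1
    when node j is a neighbor of node i. The same weights are shared
    by all nodes and all neighbors.
    """
    deg = A.sum(axis=1, keepdims=True)
    agg = (A @ V) / np.maximum(deg, 1.0)   # mean over each node's neighbors
    return np.maximum(0.0, V @ W_self + agg @ W_nbr + b)  # ReLU

rng = np.random.default_rng(0)
N, d_in, d_out = 5, 3, 2
V = rng.standard_normal((N, d_in))
A = (rng.random((N, N)) < 0.5).astype(float)
np.fill_diagonal(A, 0.0)                   # no self-loops in this toy graph
W_self = rng.standard_normal((d_in, d_out))
W_nbr = rng.standard_normal((d_in, d_out))
b = np.zeros(d_out)

out = spatial_gcn_layer(V, A, W_self, W_nbr, b)

# Relabeling the nodes only relabels the outputs, since the aggregation
# does not depend on the order in which neighbors are listed.
perm = rng.permutation(N)
out_perm = spatial_gcn_layer(V[perm], A[np.ix_(perm, perm)], W_self, W_nbr, b)
```

Comparing `out_perm` with `out[perm]` shows the order-independence discussed above: because the weights are shared and the aggregator is symmetric, no canonical ordering of the neighbors is required.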

Before training a GCN, a crucial factor to consider is the network’s depth, which determines how many orders of neighbors are used for information aggregation. As the depth increases, the number of nodes from which information is gathered can grow exponentially [46]. We did not aggregate information from second-order neighbors; this choice will be elaborated on later.
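The growth of the aggregation region with depth can be illustrated on a toy graph. The sketch below counts the nodes that can influence a given node after a number of message-passing layers; the function name `receptive_field` and the binary-tree example are hypothetical and are used only to show the trend.

```python
def receptive_field(neighbors, node, depth):
    """Return the set of nodes within `depth` hops of `node`.

    `neighbors` maps each node to the set of its direct neighbors.
    Each message-passing layer extends the reach by one hop.
    """
    reached = {node}
    frontier = {node}
    for _ in range(depth):
        frontier = {j for i in frontier for j in neighbors[i]} - reached
        reached |= frontier
    return reached

# Toy example: a complete binary tree with 15 nodes, rooted at node 0.
num_nodes = 15
neighbors = {i: set() for i in range(num_nodes)}
for parent in range(7):
    for child in (2 * parent + 1, 2 * parent + 2):
        neighbors[parent].add(child)
        neighbors[child].add(parent)

sizes = [len(receptive_field(neighbors, 0, depth)) for depth in range(4)]
# sizes is [1, 3, 7, 15]: the region roughly doubles with every layer.
```

On this tree, each additional layer roughly doubles the number of contributing nodes, which is why the depth must be chosen with care.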

While the description is quite general, most existing algorithms primarily focus on tasks like classification, segmentation, and clustering, rather than regression. We aim to leverage the capabilities of GCNs for a more intricate form of regression directly on the manifold. This involves predicting a highly precise continuous vector, operating on a high-dimensional abstract manifold, and developing a supervised manifold learning algorithm.

IV RTF-based MVDR Beamformer

This section introduces a general robust speech enhancement framework using a microphone array, covering fundamental concepts such as beamforming and classical RTF estimation.

IV-A Optimizing Beamforming Through Robust RTF Estimation

A general framework for robust microphone array speech enhancement is depicted in Fig. 2. The framework consists of several blocks. First, the vanilla steering vector (the RTF in our case) is estimated from the noisy input signals. Then, the RTF estimates are refined using additional information about the acoustic environment, beyond the noisy input signals, to produce a more robust estimate of the RTF. This additional information is collected under ideal acoustic conditions. Finally, using these RTFs, a beamformer is constructed and applied to the noisy input signals to obtain an estimate of the desired source signal.

We utilize the generalized eigenvalue decomposition (GEVD) to estimate the RTFs from the noisy input signals and then use the MVDR as the optimization criterion to construct the beamformer. A concise explanation of these methods is provided in the following subsections.

Figure 2: A general block diagram of robust RTF-based beamforming. First, the noisy signal is used to compute the correlation matrices. Then, using the GEVD, we estimate the vanilla RTFs. These vanilla RTFs are then robustified using the architecture and additional information. Finally, the MVDR beamformer is applied to estimate the enhanced recording.

IV-A1 GEVD-Based RTF Estimation: A Concise Overview

In [47, 48], it was demonstrated that the RTF could be estimated through the GEVD of the spatial correlation matrices of the noisy signal segments $\boldsymbol{\Phi}_{rr}(k)$ and of the noise-only signal segments $\boldsymbol{\Phi}_{vv}(k)$. (In the more general form, the former can be time-varying, but here we assume that the RTF is time-invariant, and it can therefore be estimated by averaging over all active-speech time segments.) The latter is estimated from noise-only segments assumed to be available. The RTF is determined by solving

$$\boldsymbol{\Phi}_{rr}(k)\boldsymbol{\varphi}(k)=\mu(k)\boldsymbol{\Phi}_{vv}(k)\boldsymbol{\varphi}(k). \tag{5}$$

Using $\boldsymbol{\varphi}(k)$, the generalized eigenvector corresponding to the largest generalized eigenvalue $\mu(k)$, we can obtain the vector of RTFs

$$\hat{\mathbf{h}}_{\textrm{GEVD}}(k)\triangleq[\hat{h}_{\textrm{GEVD}}^{0}(k),\ldots,\hat{h}_{\textrm{GEVD}}^{M-1}(k)]^{\top} \tag{6}$$

using the following normalization:

$$\hat{\mathbf{h}}_{\textrm{GEVD}}(k)=\frac{\boldsymbol{\Phi}_{vv}(k)\boldsymbol{\varphi}(k)}{\left(\boldsymbol{\Phi}_{vv}(k)\boldsymbol{\varphi}(k)\right)_{\textrm{ref}}}. \tag{7}$$
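The estimator in (5)-(7) can be sketched for a single frequency bin as follows. The sketch solves the generalized eigenvalue problem by whitening with a Cholesky factor of the noise correlation matrix, which is one possible numerical route and not necessarily the one used by the authors. The function name `gevd_rtf` and the synthetic rank-one test data are assumptions made for illustration.

```python
import numpy as np

def gevd_rtf(phi_rr, phi_vv, ref=0):
    """Estimate the RTF vector at one frequency bin via (5)-(7).

    phi_rr, phi_vv: (M, M) Hermitian matrices, phi_vv positive definite.
    """
    L = np.linalg.cholesky(phi_vv)           # phi_vv = L L^H
    L_inv = np.linalg.inv(L)
    B = L_inv @ phi_rr @ L_inv.conj().T      # whitened noisy correlation
    B = 0.5 * (B + B.conj().T)               # enforce Hermitian symmetry
    _, U = np.linalg.eigh(B)                 # eigenvalues in ascending order
    u = U[:, -1]                             # principal eigenvector of B
    phi = np.linalg.solve(L.conj().T, u)     # generalized eigenvector in (5)
    num = phi_vv @ phi                       # numerator of (7)
    return num / num[ref]                    # normalize by reference entry

# Synthetic check: rank-one speech component plus a random noise matrix.
rng = np.random.default_rng(0)
M = 4
h_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
h_true = h_true / h_true[0]                  # reference microphone is index 0
N_mat = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_vv = N_mat @ N_mat.conj().T + 0.1 * np.eye(M)
phi_rr = 2.0 * np.outer(h_true, h_true.conj()) + phi_vv

h_hat = gevd_rtf(phi_rr, phi_vv, ref=0)
```

With exact correlation matrices, as in this synthetic check, `h_hat` recovers `h_true` up to rounding. With matrices estimated from finite, noisy data, the estimate degrades, which is the situation the proposed method targets.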

The next step is constructing the graph by calculating distances between the feature vectors. We use the RTFs estimated in noiseless environments as the features of the graph vertices. Using the clean RTFs to construct the graph may facilitate the enhancement of noisy features acquired in the same environment but in noisy conditions.

Define $h_{\ell}^{m}(k)$, the RTF associated with the $\ell$th training location, where $\ell\in\{1,\ldots,N_{\textrm{train}}\}$, $m$ represents the $m$th microphone, and $k$ the frequency bin. Further, define $\mathbf{h}_{\ell}^{m}$ as the corresponding vector formed by concatenating all frequencies.

The training set of all RTFs associated with the $m$th microphone, denoted $\mathcal{H}^{m}=\{\mathbf{h}_{\ell}^{m}\}_{\ell=1}^{N_{\textrm{train}}}$, is obtained by applying the GEVD procedure to the noiseless training recordings. In the absence of noise, $\boldsymbol{\Phi}_{vv}(k)$ in (5) and (7) is substituted by an identity matrix; consequently, (5) simplifies to the eigenvalue decomposition (EVD) problem. $\mathcal{H}^{m}$ is referred to as the $m$th RTF manifold.

IV-A2 The MVDR Beamformer

Let $\hat{\mathbf{h}}(k)$ represent an RTF from our dataset at a specific position, whether before or after the GCN. Define $\boldsymbol{\Phi}_{vv}(k)$ as the $M\times M$ power spectral density (PSD) matrix of the received noise signals at the $k$th frequency bin for the same position, and $\boldsymbol{\Phi}_{vv}^{-1}(k)$ as its inverse. It is assumed that noise-only segments are available and can be identified, e.g., by applying a voice activity detector (VAD).
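Given a frame-wise speech-activity decision, the spatial correlation matrices can be estimated by averaging outer products over the selected frames. The sketch below assumes an STFT array of shape `(L, K, M)` and a boolean per-frame mask; the function name `spatial_correlation`, the array layout, and the random data are hypothetical choices, and the VAD decision itself is taken as given.

```python
import numpy as np

def spatial_correlation(X, frame_mask):
    """Average r(l, k) r(l, k)^H over the frames selected by `frame_mask`.

    X: (L, K, M) multichannel STFT; frame_mask: (L,) boolean array.
    Returns an array of shape (K, M, M), one matrix per frequency bin.
    """
    Xs = X[frame_mask]
    return np.einsum('lkm,lkn->kmn', Xs, Xs.conj()) / Xs.shape[0]

rng = np.random.default_rng(0)
L, K, M = 50, 3, 4
X = rng.standard_normal((L, K, M)) + 1j * rng.standard_normal((L, K, M))

# Hypothetical VAD output: the first 20 frames are noise-only.
speech_active = np.arange(L) >= 20

phi_vv = spatial_correlation(X, ~speech_active)  # noise-only frames
phi_rr = spatial_correlation(X, speech_active)   # speech-active frames
```

Each returned matrix is Hermitian by construction, which is what the GEVD and the MVDR weight computation rely on.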

The MVDR beamformer is a spatial filter designed to minimize the noise power at its output while maintaining a distortionless response toward the desired source. Its optimal weights are given by:

$$\mathbf{w}_{\textrm{MVDR}}(k)=\frac{\boldsymbol{\Phi}_{vv}^{-1}(k)\hat{\mathbf{h}}(k)}{\hat{\mathbf{h}}(k)^{\mathsf{H}}\boldsymbol{\Phi}_{vv}^{-1}(k)\hat{\mathbf{h}}(k)}. \tag{8}$$

Here, we follow [1] and subsequent publications and use the RTF as the steering vector of the MVDR beamformer. It was shown in multiple works (see, e.g., [15, 16]) that implementing an MVDR beamformer with the RTF rather than with a steering vector solely based on the direction of the sources, yields significantly improved results in reverberant environments.
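The weights in (8) can be sketched for a single frequency bin as follows. The function name `mvdr_weights` and the random test data are assumptions made for illustration; a linear solve is used in place of an explicit matrix inverse purely as a numerical convenience.

```python
import numpy as np

def mvdr_weights(phi_vv, h):
    """MVDR weights of (8) at one frequency bin.

    phi_vv: (M, M) noise PSD matrix; h: (M,) RTF steering vector.
    """
    num = np.linalg.solve(phi_vv, h)   # phi_vv^{-1} h without explicit inverse
    return num / (h.conj() @ num)      # divide by h^H phi_vv^{-1} h

rng = np.random.default_rng(0)
M = 4
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
h = h / h[0]                           # RTF: unit response at reference mic 0
N_mat = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_vv = N_mat @ N_mat.conj().T + 0.1 * np.eye(M)

w = mvdr_weights(phi_vv, h)
response = np.vdot(w, h)                        # w^H h, distortionless response
noise_out = np.real(w.conj() @ phi_vv @ w)      # output noise power
noise_ref = np.real(phi_vv[0, 0])               # noise power at reference mic
```

The response toward the desired source, `response`, equals one, and `noise_out` does not exceed `noise_ref`, since selecting the reference microphone alone is itself a distortionless beamformer when the RTF is normalized to that microphone. The enhanced signal at each time-frequency bin is then obtained as `np.vdot(w, r)` for a microphone vector `r`.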

V peerRTF: A GCN-based Robust RTF Estimation

This section introduces our method for robust RTF estimation. We delve into the preprocessing of the data, the construction of a feature vector, and the associated graph data. Finally, we explore the derived GCN architecture and our objective functions.

Our method aims to achieve robust RTF estimation. Inspired by manifold-learning methods such as those proposed in [24, 23], we propose a modern deep neural network (DNN)-based approach that leverages prior knowledge of the acoustic environment to project noisy examples onto the manifold. Given that our data is represented as a graph, we utilize message-passing techniques to achieve this goal.

V-A Graph Construction

The learning process involves understanding the relations between neighboring entities. In our case, this requires learning the GNN weights. To learn these weights, we train the GNN on known nodes, i.e., known speaker locations in the room. After learning the weights, we evaluate the performance on unknown nodes corresponding to different speaker locations. This section outlines the dataset preparation procedure, which involves creating the feature vectors, constructing the graphs, and the training procedure.

V-A1 Feature Vector

Beamforming using the RTFs is typically performed in the frequency domain. However, in the time domain, the RTFs display a distinct shape, characterized by a prominent peak around zero and rapid decay on both sides. This characteristic allows us to simplify the estimation process by truncating the time-domain RTFs around their central region, thereby reducing the number of data points that need to be estimated. An example from our training set of a time-domain representation of an RTF recorded in a room with $T_{60}=300\,\text{ms}$ is depicted in Fig. 3. This example represents an RTF estimated using the GEVD procedure in (7) in noiseless conditions, i.e., the identity matrix substitutes the spatial correlation matrix of the noise. The clean signal is obtained by convolving an AIR from the MIRaGe dataset [49] with pink noise. We truncate the time-domain RTF, retaining $l_{\textrm{uncausal}}$ taps to the left of the peak and $l_{\textrm{causal}}$ taps to the right of the peak. Applying the GCN to the time-domain representation rather than the frequency-domain representation of the RTFs circumvents the need to work with complex-valued neural networks.

Figure 3: Typical time-domain representation of the RTF, here recorded in an acoustic environment with reverberation time $T_{60}=300$ ms.

When dealing with an array of $M$ microphones, each speaker location has $M-1$ RTFs, as the RTF between the reference microphone and itself is trivial. These $M-1$ components are usually estimated independently. Truncating the RTFs reduces the feature dimension, thereby enhancing learning compared to using the full RTF. In total, the feature dimension is $d=l_{\textrm{uncausal}}+l_{\textrm{causal}}$.
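The truncation described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `truncate_rtf` is hypothetical, the peak is assumed to count as the first causal tap, and circular indexing is assumed so that non-causal taps of an IFFT-derived RTF (which wrap to the end of the buffer) are picked up correctly.

```python
import numpy as np

def truncate_rtf(rtf_time, l_uncausal, l_causal):
    """Keep l_uncausal taps left of the main peak and l_causal taps from the peak onward.

    Assumptions (not stated in the paper): the peak is the first "causal" tap,
    and indices wrap around circularly.
    """
    peak = int(np.argmax(np.abs(rtf_time)))
    # Circular window of length l_uncausal + l_causal around the peak.
    idx = np.arange(peak - l_uncausal, peak + l_causal) % len(rtf_time)
    return rtf_time[idx]

# Toy RTF with its peak at index 0 and one non-causal tap wrapped to the end.
rtf = np.zeros(2048)
rtf[0], rtf[1], rtf[-1] = 1.0, 0.5, 0.3
feature = truncate_rtf(rtf, l_uncausal=128, l_causal=256)
```

With this convention the resulting feature vector has length $d=l_{\textrm{uncausal}}+l_{\textrm{causal}}$, and the peak sits at index $l_{\textrm{uncausal}}$ of the truncated vector.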

V-A2 Graph Dataset Construction

We assume that we have two types of feature vectors: $N_{\textrm{train}}$ ideal RTFs estimated in a noiseless scenario, and $N_{\textrm{test}}$ vanilla RTFs estimated in a noisy scenario. The first step is to construct the graph of clean nodes. For each microphone, we build a separate graph that contains $N_{\textrm{train}}$ RTFs, denoted by $\mathbf{H}^{m}\in\mathbb{R}^{N_{\textrm{train}}\times d}$. The graphs are built by applying $\mathcal{K}$ Nearest Neighbors ($\mathcal{K}$NN). We chose $\mathcal{K}$NN since it selects the most similar RTFs from the dataset, allowing us to effectively robustify the RTFs for the noisy feature vectors. Constructing a separate graph per microphone helps to capture specific relationships and dependencies within the data. For evaluation, we connect one noisy feature vector to the clean nodes using $\mathcal{K}$NN. In total, we have $\mathbf{H}\in\mathbb{R}^{(N_{\textrm{train}}+1)\times d}$ for each microphone.
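A minimal sketch of attaching one noisy feature vector to the clean nodes of a single microphone graph is given below. The distance metric is not specified in the text, so Euclidean distance is assumed here; the helper name `knn_neighbors` and the random stand-in data are illustrative only.

```python
import numpy as np

def knn_neighbors(clean, query, k):
    """Indices of the k clean feature vectors closest to `query`.

    Euclidean distance is an assumption; the paper does not name the metric.
    """
    dist = np.linalg.norm(clean - query, axis=1)
    return np.argsort(dist)[:k]

rng = np.random.default_rng(0)
n_train, d, k = 50, 384, 5   # small n_train for illustration only
H_m = rng.standard_normal((n_train, d))          # stand-in for clean RTF features
noisy = H_m[7] + 0.1 * rng.standard_normal(d)    # stand-in for a noisy estimate
neighbors = knn_neighbors(H_m, noisy, k)         # clean nodes linked to the noisy node
H = np.vstack([H_m, noisy])                      # (n_train + 1) x d node-feature matrix
```

The same routine would be run independently for each of the $M-1$ microphone graphs.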

V-A3 Training Procedure

For the training setup, our goal is to learn optimal weights to enhance the noisy feature vectors during evaluation. Starting with $\mathbf{H}^{m}\in\mathbb{R}^{N_{\textrm{train}}\times d}$, we select a training position (represented by a node in the graph) and exclude the clean feature vector associated with this grid position from all $M-1$ graphs. Next, we incorporate the noisy feature vectors associated with this speaker position into the graphs using $\mathcal{K}$NN. Consequently, each training example comprises $N_{\textrm{train}}-1$ clean feature vectors and one noisy vector.

V-B The GCN Architecture

A key factor in the success of CNNs is the ability to design and reliably train very deep models that extract higher-level features at each layer. In contrast, training deep GCN architectures is not as straightforward, and several works have studied their limitations [45, 22, 50, 51]. Stacking more layers into a GCN leads to the common vanishing gradient and over-smoothing problems. Due to these limitations, most state-of-the-art GCNs are not deeper than four layers [22]. In each layer, the transformation function is usually a single shallow, fully-connected (FC) layer followed by a non-linearity. Such shallow architectures are sufficient for classification, segmentation, clustering, and recommendation tasks. However, these shallow networks lack the expressive power to perform more challenging tasks, such as regression on high-dimensional data. In our problem setting, where nodes correspond to different positions, we opt not to aggregate information from second-order neighbors. Instead, we make the transformation function deeper than a single FC layer, since we need a network with sufficient expressive power to execute a regression task on the high-dimensional abstract manifold. Drawing inspiration from [21], which learns 3D manifolds from point clouds, we structured our architecture accordingly. Drawing an analogy to convolution in images, we consider $\mathbf{h}_{i}$ as the central pixel and $\mathbf{h}_{i}^{(j)},\ j\in\mathcal{N}(i)$, as a patch around it.

To calculate the contribution of each neighbor $\mathbf{h}_{i}^{(j)}$ within each graph, we concatenate the feature vector of the central node $\mathbf{h}_{i}$ with the feature vector of the neighbor $\mathbf{h}_{i}^{(j)}$ and pass this concatenated vector through the neural network. The neural network outputs are then aggregated over all neighbors $\mathcal{N}(i)$ of $\mathbf{h}_{i}$.

When selecting an aggregation function, it is essential to consider the nature of our regression task on the manifold. Our objective is to predict a continuous value falling within the range of the input values, and this criterion guides our choice. In this context, sum and max are not optimal choices. Instead, we opt for the mean operation, $\frac{1}{|\mathcal{N}(i)|}\sum_{j\in\mathcal{N}(i)}$.
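Putting the concatenation and the mean aggregation together, the node update described in this subsection can be summarized as follows. The symbols $f_{\theta}$ (the shared neural network), $[\cdot\,\|\,\cdot]$ (concatenation), and $\hat{\mathbf{h}}_{i}$ (the updated node representation) are introduced here only for illustration and are not the paper's notation:

```latex
\begin{align*}
\mathbf{m}_{i,j} &= f_{\theta}\!\left(\left[\mathbf{h}_{i}\,\|\,\mathbf{h}_{i}^{(j)}\right]\right), \qquad j\in\mathcal{N}(i),\\
\hat{\mathbf{h}}_{i} &= \frac{1}{|\mathcal{N}(i)|}\sum_{j\in\mathcal{N}(i)}\mathbf{m}_{i,j}.
\end{align*}
```

Because the mean is a convex combination of the messages, each entry of the output stays within the range spanned by the corresponding message entries, whereas a sum would scale with the number of neighbors.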

Figure 4 details the selected architecture.

Figure 4: Left (a): The message $\mathbf{m}_{i,j}$ passed from the $j$th neighbor of the $i$th node is calculated by concatenating $\mathbf{h}_{i}$ and $\mathbf{h}_{i}^{(j)}$ and passing this concatenation through the neural network. Right (b): The representation of the $i$th node at the output is calculated by aggregating the messages $\mathbf{m}_{i,1},\ldots,\mathbf{m}_{i,5}$ from all the nodes in $\mathcal{N}(i)$, here $\mathbf{h}_{i}^{(1)},\ldots,\mathbf{h}_{i}^{(5)}$. For each microphone, there is a separate graph, and the neighbors are arbitrarily numbered. Inspired by [21].

We use message passing, one of several commonly used methods in GNNs. As mentioned, this process involves information exchange between nodes and their neighbors on the graph, enabling them to update their knowledge based on local interactions. Message passing facilitates effective learning and inference in graph-based models. In our graphs, $\mathcal{K}$ denotes the number of neighbors.

Our neural network architecture consists of three FC layers, the first two of which are followed by an activation function. The input to the network is a concatenated vector of length $2d$, and the architecture can be represented as $2d\xrightarrow{}2d\xrightarrow{}2d\Rightarrow d$. Here, each $\xrightarrow{}$ represents a single FC layer followed by a rectified linear unit (ReLU) activation function, while $\Rightarrow$ denotes an FC layer only. Our architecture uses shared weights across all microphones. No significant changes were observed when exploring a variant with $M-1$ individual sub-GCNs. Consequently, we opted for the less complex architecture with fewer parameters, which also makes it flexible with respect to the number of microphones.
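The forward pass of this layer can be sketched in plain NumPy as below. This is only an illustration under stated assumptions: the weight initialization is arbitrary, dropout and training are omitted, and the helper names (`init_mlp`, `mlp`, `update_node`) are hypothetical rather than taken from the authors' implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(d, rng):
    """Random weights for the 2d -> 2d -> 2d => d network (init scheme is an assumption)."""
    sizes = [(2 * d, 2 * d), (2 * d, 2 * d), (2 * d, d)]
    return [(rng.standard_normal(s) / np.sqrt(s[0]), np.zeros(s[1])) for s in sizes]

def mlp(x, params):
    (w1, b1), (w2, b2), (w3, b3) = params
    x = relu(x @ w1 + b1)   # FC + ReLU
    x = relu(x @ w2 + b2)   # FC + ReLU
    return x @ w3 + b3      # final FC, no activation

def update_node(h_i, h_neighbors, params):
    """Mean-aggregate the messages computed from [h_i, h_j] for every neighbor j."""
    tiled = np.tile(h_i, (len(h_neighbors), 1))
    pairs = np.concatenate([tiled, h_neighbors], axis=1)   # shape (K, 2d)
    messages = mlp(pairs, params)                          # shape (K, d)
    return messages.mean(axis=0)                           # shape (d,)

rng = np.random.default_rng(0)
d, K = 384, 5
params = init_mlp(d, rng)                        # shared across all microphone graphs
h_noisy = rng.standard_normal(d)                 # stand-in for a noisy RTF node
h_clean_neighbors = rng.standard_normal((K, d))  # stand-in for its K clean neighbors
h_robust = update_node(h_noisy, h_clean_neighbors, params)
```

Since the same `params` are reused for every microphone graph, the parameter count does not depend on $M$, which matches the shared-weights design choice described above.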

Algorithm  1 succinctly summarizes the procedural steps for GCN-based RTF estimation. Figure 5 describes the full architecture.

Figure 5: The full architecture. A noisy RTF enters the $\mathcal{K}$NN algorithm along with the clean RTFs to construct the graphs. Subsequently, the graphs enter the GCN, which generates the robustified RTFs.
Training Stage:
  1. Given the clean RTF matrices $\mathbf{H}^{m}\in\mathbb{R}^{N_{\textrm{train}}\times d}$, build the graphs using $\mathcal{K}$NN for each microphone.
  2. Select one grid position, remove the clean feature vectors associated with this location, replace them with noisy feature vectors, and connect them to the graphs using $\mathcal{K}$NN.
  3. Train the GCN for robust representation of noisy RTFs.
  4. Repeat for the entire dataset until convergence.

Inference Stage:
  1. Randomly choose a test location and add out of grid (OOG) noise to clean speech at a random SNR.
  2. Use (7) to estimate the noisy RTFs.
  3. Build graphs using $\mathcal{K}$NN for each microphone with $N_{\textrm{train}}$ clean feature vectors and one noisy feature vector.
  4. Pass the graphs through the trained GCN to improve the estimates at the noisy RTF nodes.
  5. Repeat for all test positions.

Algorithm 1 Enhancing RTF Robustness on GNN

V-C Objective Functions

To efficiently train the model, we experimented with several objective functions, organized around two alternatives. In the first alternative, we directly optimize the outcome of the GCN, namely the RTF estimate. In the second alternative, we optimize the output of the MVDR beamformer by adjusting the RTF estimate. Both alternatives are illustrated in Fig. 6.

Figure 6: The architecture of the training objectives.

V-C1 Direct Optimization of the RTF

Define the Signal Blocking Factor (SBF) as:

$$\textrm{SBF}=\frac{1}{M-1}\sum_{m=0,\,m\neq\textrm{ref}}^{M-1}10\log_{10}\left(\frac{\sum_{n}x^{2}[n]}{\sum_{n}d_{m}^{2}[n]}\right) \tag{9}$$

where

$$x[n]=\{\hat{h}_{\textrm{oracle}}^{m}\ast\tilde{s}\}[n]$$

and

$$d_{m}[n]=\{\hat{h}_{\textrm{oracle}}^{m}\ast\tilde{s}\}[n]-\{\hat{h}_{\textrm{GCN}}^{m}\ast\tilde{s}\}[n].$$

Here, $\tilde{s}[n]$ is the reference signal, $\hat{h}_{\textrm{oracle}}^{m}[n]$ is the oracle RTF corresponding to the $m$th microphone, and $\hat{h}_{\textrm{GCN}}^{m}[n]$ is the robust RTF of the $m$th microphone. The term $d_{m}[n]$ is the difference between the convolution of $\hat{h}_{\textrm{oracle}}^{m}[n]$ with $\tilde{s}[n]$ and the convolution of $\hat{h}_{\textrm{GCN}}^{m}[n]$ with $\tilde{s}[n]$. This objective, inspired by [52], encourages the robust RTF to be as close as possible to the oracle RTF.
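For concreteness, the SBF in (9) could be computed along the following lines. This is a NumPy sketch, not the training code: the function name `sbf_db` and the small constant `eps` (added to avoid division by zero when the two RTFs coincide) are assumptions, and in training one would presumably maximize this quantity, e.g., by minimizing its negative.

```python
import numpy as np

def sbf_db(h_oracle, h_gcn, s_ref, eps=1e-12):
    """Average SBF in dB over the M-1 non-reference channels.

    h_oracle, h_gcn: arrays of shape (M-1, d) holding time-domain RTFs.
    s_ref: 1-D reference signal.
    eps: numerical safeguard (an assumption, not part of (9)).
    """
    values = []
    for h_o, h_g in zip(h_oracle, h_gcn):
        x = np.convolve(h_o, s_ref)            # oracle RTF applied to the reference
        d_m = x - np.convolve(h_g, s_ref)      # residual caused by the RTF mismatch
        ratio = (np.sum(x ** 2) + eps) / (np.sum(d_m ** 2) + eps)
        values.append(10.0 * np.log10(ratio))
    return float(np.mean(values))

rng = np.random.default_rng(0)
h_true = rng.standard_normal((4, 16))   # toy oracle RTFs for M - 1 = 4 channels
s = rng.standard_normal(100)            # toy reference signal
sbf_value = sbf_db(h_true, 0.9 * h_true, s)   # estimate with a 10% amplitude error
```

A larger SBF means a smaller residual $d_{m}[n]$, i.e., a GCN output closer to the oracle RTF.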

V-C2 RTF Estimation via Beamformer Output Optimization

Here we optimize the Scale-Invariant Source-to-Distortion Ratio (SI-SDR) at the output of the beamformer. The SI-SDR is defined as:

$$\textrm{SI-SDR}\left(\tilde{\mathbf{s}},\hat{\mathbf{s}}\right)=10\log_{10}\left(\frac{\left\|\frac{\langle\tilde{\mathbf{s}},\hat{\mathbf{s}}\rangle}{\langle\tilde{\mathbf{s}},\tilde{\mathbf{s}}\rangle}\tilde{\mathbf{s}}\right\|^{2}}{\left\|\frac{\langle\tilde{\mathbf{s}},\hat{\mathbf{s}}\rangle}{\langle\tilde{\mathbf{s}},\tilde{\mathbf{s}}\rangle}\tilde{\mathbf{s}}-\hat{\mathbf{s}}\right\|^{2}}\right) \tag{10}$$

where $\tilde{\mathbf{s}}$ is the vector of all samples of the reference source signal, and $\hat{\mathbf{s}}$ is the vector of all samples of the beamformer output. The SI-SDR is a metric commonly used to evaluate the quality of source separation and speech enhancement algorithms [53]. It measures how closely the estimated source signal matches the true source signal, considering both the distortion and the interference introduced during the enhancement process. Using it as a loss aims to bring the beamformer output closer to the clean reference signal, and the RTF estimate is adjusted accordingly.
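The definition in (10) translates almost line by line into code. The sketch below uses NumPy for readability; the function name `si_sdr_db` is illustrative, and an actual training loss would need a differentiable framework and would typically minimize the negative SI-SDR (both are assumptions about the implementation).

```python
import numpy as np

def si_sdr_db(s_ref, s_est):
    """Scale-invariant SDR in dB between a reference and an estimate, following (10)."""
    alpha = np.dot(s_ref, s_est) / np.dot(s_ref, s_ref)   # optimal scaling of the reference
    target = alpha * s_ref                                # projection of the estimate onto the reference
    residual = target - s_est                             # everything not explained by the reference
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(residual ** 2))

s_ref = np.array([1.0, 0.0])
s_est = np.array([1.0, 0.1])           # reference plus a small orthogonal error
value = si_sdr_db(s_ref, s_est)
value_scaled = si_sdr_db(s_ref, 2.0 * s_est)   # rescaling the estimate leaves the score unchanged
```

The scale invariance is what makes the measure insensitive to an arbitrary gain at the beamformer output.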

Additionally, we explore an alternative approach in which the SI-SDR is computed with respect to the output of the oracle-RTF beamformer. Here, we compute the MVDR weights using the RTFs estimated under ideal conditions (the oracle scenario) and use the resulting beamformer output as the reference signal in the SI-SDR. This approach aligns more closely with a supervised paradigm, akin to the RTF-level loss. Importantly, it eliminates the need for a clean reference signal in the loss function, addressing a common limitation in scenarios where such a reference signal is unavailable. Still, this objective requires the oracle RTFs to be available, which is another limitation. We designate the first version as SI-SDR I and the second as SI-SDR II.

Short-Time Objective Intelligibility (STOI)

We also incorporate an implementation of STOI as a loss function (adopted from https://github.com/mpariente/pytorch_stoi). This implementation is integrated with VAD, which keeps it consistent with the original STOI measure. STOI serves as a metric evaluating the intelligibility of speech signals.

VI Experimental Setup & Results

We use the MIRaGe dataset [49], consisting of real multichannel recordings acquired at Bar-Ilan acoustic lab. We evaluate the proposed GCN method using various objective and subjective performance measures. Additionally, we explore the impact of the graph structure on the results and compare several objective functions.

VI-A Experimental Setup:

The database was created by placing a loudspeaker on a grid of points in a cuboid-shaped volume with dimensions $46\times 36\times 32$ cm. The loudspeaker positions were sampled every $2$ cm along the x and y axes and every $4$ cm along the z-axis, totaling $24\times 19\times 9=4104$ possible source positions (grid vertices). In addition, 16 other positions, referred to as OOG, were designated as possible locations for noise sources. A chirp signal was played from each position in the grid and from each OOG position. The setup was recorded using six static linear microphone arrays, each consisting of $M=5$ microphones located at $-13,-5,0,+5,$ and $+13$ cm relative to the central microphone (the reference microphone). Recordings were made at three different reverberation levels: $100$, $300$, and $600$ ms.

For our experiments, we used microphone array #2, positioned directly in front of the grid at a distance of $2$ m from its center. The recordings were randomly divided into $N_{\textrm{train}}=3500$ training positions, $N_{\textrm{validation}}=100$ validation positions, and $N_{\textrm{test}}=504$ test positions. We use $2048$ frequency bins, which is also the number of time-domain RTF samples. Additionally, we set $l_{\textrm{uncausal}}=128$ and $l_{\textrm{causal}}=256$.
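As a quick sanity check of the numbers above (assuming the grid includes both end points along each axis, which is our reading and is not stated explicitly), the position counts and the feature dimension work out as follows:

```latex
\begin{align*}
\left(\tfrac{46}{2}+1\right)\times\left(\tfrac{36}{2}+1\right)\times\left(\tfrac{32}{4}+1\right) &= 24\times 19\times 9 = 4104,\\
N_{\textrm{train}}+N_{\textrm{validation}}+N_{\textrm{test}} &= 3500+100+504 = 4104,\\
d = l_{\textrm{uncausal}}+l_{\textrm{causal}} &= 128+256 = 384.
\end{align*}
```

The split therefore uses every grid position exactly once, and each node feature keeps $384$ of the $2048$ time-domain RTF samples.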

The RTF estimation involved the following procedure: 1) using the chirp signals recorded in the MIRaGe database, the AIRs from the source position to the microphone array were estimated using LS; 2) to estimate the clean RTFs, pink noise signals covering all relevant frequencies were convolved with the AIRs to generate the desired signal, whereas for the noisy signals, a speech signal was convolved with the AIRs and pink noise from an OOG location was added; and finally, 3) the RTFs were estimated using (7), where for the clean signals we used EVD.

To generate a sufficiently large training set, we add each of the 3500 clean training speech signals to three different noise signals with random SNR, taken from the $16$ different OOG locations, resulting in $3500\times 3=10500$ noisy training examples. For our graphs, we selected $\mathcal{K}=5$ as the $\mathcal{K}$NN parameter.

The network was trained for 100 epochs using a linear scheduler with a warmup ratio of 0.1 and a learning rate of $1\times 10^{-4}$, with a dropout rate of 0.5 during the training stage. We chose the second version of the SI-SDR (SI-SDR II) as the objective function for all reverberation times.
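One common reading of a "linear scheduler with a warmup ratio of 0.1" is sketched below. The exact schedule is not given in the text, so the details are assumptions: the rate ramps linearly from zero to the peak value over the first 10% of the steps and then decays linearly to zero, and the function name `linear_warmup_lr` and the step-based (rather than epoch-based) granularity are illustrative.

```python
def linear_warmup_lr(step, total_steps, peak_lr=1e-4, warmup_ratio=0.1):
    """Learning rate at a given step for linear warmup followed by linear decay.

    The decay-to-zero shape and per-step granularity are assumptions.
    """
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr during the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr to 0 over the remaining steps.
    remaining = max(0.0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

schedule = [linear_warmup_lr(t, total_steps=1000) for t in range(1001)]
```

Under these assumptions the rate peaks at $1\times 10^{-4}$ exactly when the warmup phase ends.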

TABLE I: Parameters
Parameter                   Value
$M$                         5
Frequency bins              2048
$l_{\textrm{uncausal}}$     128
$l_{\textrm{causal}}$       256
$\mathcal{K}$               5

All the parameters are listed in Table I.

VI-B Quality Measure:

The results are analyzed using several quality metrics: the SNR at the beamformer output, calculated as:

$$\textrm{SNR}\left(\hat{\mathbf{s}},\hat{\mathbf{v}}\right)=10\log_{10}\left(\frac{\left\|\hat{\mathbf{s}}\right\|^{2}}{\left\|\hat{\mathbf{v}}\right\|^{2}}\right) \tag{11}$$

Here, $\hat{\mathbf{s}}$ is the vector of all samples of the reference (desired) signal component at the beamformer output, and $\hat{\mathbf{v}}$ is the vector of all samples of the noise component at the beamformer output. Additionally, we employ the STOI measure [54] to assess the intelligibility of the speech signals, along with the Deep Noise Suppression Mean Opinion Score (DNSMOS) metric [55]. We also examine the SI-SDR (10) in its first version (SI-SDR I), i.e., computed with respect to the reference signal, to measure the distortion of the outputs.
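The output SNR in (11) is straightforward to compute once the two components are available. In the sketch below it is assumed, as is common for linear beamformers but not stated explicitly in the text, that the components are obtained by applying the same beamformer weights separately to the speech-only and noise-only microphone signals; the function name `snr_db` and the toy vectors are illustrative.

```python
import numpy as np

def snr_db(s_hat, v_hat):
    """SNR in dB between the speech and noise components at the beamformer output, as in (11)."""
    return 10.0 * np.log10(np.sum(s_hat ** 2) / np.sum(v_hat ** 2))

# Toy components: the noise has one tenth of the speech amplitude.
s_hat = np.array([3.0, 4.0])
v_hat = np.array([0.3, 0.4])
snr_out = snr_db(s_hat, v_hat)
```

A tenfold amplitude ratio corresponds to a hundredfold power ratio, which is the quantity inside the logarithm.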

VI-C Results:

For each metric, we report averages across all test locations at various input SNR levels. To assess performance, we compare the proposed method with other MVDR beamformers, which use either the traditional GEVD estimate of the RTFs, the RTF estimate derived from the method introduced in [23], known as manifold projection learning (MP), or the oracle RTF. The oracle RTF is the RTF of the test location, estimated under noise-free conditions (as in the training set). For MP, two parameters need to be chosen: $\epsilon$, the kernel scale parameter, and $\mathfrak{l}$, the number of dominant eigenvalues in the algorithm. We chose $\epsilon=0.3$ for all reverberation times, while $\mathfrak{l}$ varied: $\mathfrak{l}=12$ for $T_{60}=100$ ms, $\mathfrak{l}=5$ for $T_{60}=300$ ms, and $\mathfrak{l}=15$ for $T_{60}=600$ ms. Figs. 8, 9, and 10 depict the $\textrm{SNR}_{\textrm{out}}$, STOI, and DNSMOS quality measures for $T_{60}=100$, $300$, and $600$ ms, respectively, comparing the proposed method (peerRTF) with the GEVD, Oracle, and MP algorithms.

As can be seen, the proposed method outperforms the vanilla GEVD-based beamformer in terms of speech intelligibility across all SNR and reverberation levels. Compared to the MP-based beamformer, there is an improvement in almost all conditions. The SNR at the beamformer output is consistently higher than that of the vanilla GEVD and MP-based beamformers across all SNR levels and reverberation conditions. Furthermore, our method outperforms the oracle RTF in certain regions.

These advantages are also demonstrated subjectively by inspecting the sonograms of a randomly chosen test example at $\textrm{SNR}_{\textrm{in}}=-10$ dB and $T_{60}=600$ ms, shown in Fig. 7.

Figure 7: Sonograms at $\textrm{SNR}_{\textrm{in}}=-10$ dB and $T_{60}=600$ ms: (a) reference signal, (b) noisy signal, (c) GEVD output, (d) GCN output.

The sonograms are zoomed in on the lower frequencies to allow a more precise assessment. When comparing the beamformer outputs to the reference signal, it is evident that peerRTF not only preserves more of the original signal's frequency content but also generates fewer artifacts. This can be observed in the left red rectangle, where the reference sonogram exhibits a slight break. This break is also noticeable in the peerRTF output, whereas the GEVD output shows no break. In the right rectangle, a distinct frequency component appears in the GEVD output that is absent from both the reference and the peerRTF output. Furthermore, the peerRTF output exhibits less noise.

Figure 8: $\textrm{SNR}_{\textrm{out}}$ [dB] (left), STOI [%] (middle), and DNSMOS (right) for $T_{60}=100$ ms.
Figure 9: $\textrm{SNR}_{\textrm{out}}$ [dB] (left), STOI [%] (middle), and DNSMOS (right) for $T_{60}=300$ ms.
Figure 10: $\textrm{SNR}_{\textrm{out}}$ [dB] (left), STOI [%] (middle), and DNSMOS (right) for $T_{60}=600$ ms.

An example of the results can be heard on the project page.

VI-D Alternative Loss Functions and Network Architecture

In this section, we compare graph structures to underscore the importance of the neighbors, and we evaluate different objective functions. In the tables, we compare the different techniques for $T_{60}=600$ ms and $\textrm{SNR}=-10$ dB, a condition in which the proposed method contributes significantly because the GEVD is less effective at high reverberation times and low SNR. Our goal is to examine the significance of the graph, since with neural networks it is often unclear which architectural component is actually responsible for the observed behavior. Consequently, we aim to verify that the improvement genuinely stems from both the neighbors and the network, and not from processing the noisy RTFs alone. In Table II, we compare our method to the self-RTFs graph structure. In the self-RTFs structure, the noisy RTF node is connected only to itself and does not receive information from clean nodes. All other training settings were kept the same. The results demonstrate that our method consistently outperforms the self-RTFs structure across all quality measures. In certain metrics, namely SI-SDR, DNSMOS, and SNR, the self-RTFs outcomes were even inferior to those of the GEVD, i.e., of the noisy RTF that enters the network. This observation highlights the crucial role played by the neighbors in this procedure.

TABLE II: Importance of the neighbors
Model         STOI [%]   ESTOI [%]   SI-SDR [dB]   P808 MOS   SNR [dB]
Unprocessed   23.85      9.32        -10.2         2.22       -10
Reference     -          -           -             2.94       -
Oracle        72.07      55.95       0.57          2.53       16.83
GEVD          66.52      49.5        -3.33         2.52       14.21
MP            70.23      54.21       -1.77         2.52       15.65
peerRTF       71.63      55.53       0.46          2.62       17.3
Self-RTFs     68.06      51.54       -6.2          2.48       12.29

In Table III, we train and evaluate the GCN with different objective functions. As seen in the table, each objective emphasizes different quality measures. For instance, the STOI loss yields the highest STOI and extended short-time objective intelligibility (ESTOI) scores, in line with its purpose. Similarly, the SI-SDR I loss yields the highest SI-SDR and SNR, reflecting their shared concept. In contrast, the SI-SDR II loss, which we discussed earlier as being more supervised, yields the highest DNSMOS. We chose SI-SDR II, which improves all the metrics over the GEVD baseline, as our objective.

TABLE III: Comparison of objective functions
Model         STOI [%]   ESTOI [%]   SI-SDR [dB]   P808 MOS   SNR [dB]
Unprocessed   23.85      9.32        -10.2         2.22       -10
Reference     -          -           -             2.94       -
Oracle        72.07      55.95       0.57          2.53       16.83
GEVD          66.52      49.5        -3.33         2.52       14.21
MP            70.23      54.21       -1.77         2.52       15.65
SI-SDR I      72.52      57.21       2.34          2.57       17.96
SI-SDR II     71.63      55.53       0.46          2.62       17.3
SBF           71.32      55.53       -0.5          2.52       15.5
STOI          72.87      57.32       -2.23         2.51       15.12

VII Conclusion

In this paper, we have presented a novel RTF identification method, which relies on learning the RTF manifold using a GCN to infer a robust estimate of the RTF in a noisy and reverberant environment. This opens the door for robust acoustic beamforming to a new field of learning methods. As seen in recent years, learning-based methods have significantly improved classical algorithms and offer great flexibility and room for improvement. Many improvements can still be made to the GCN model, both in the model architecture, which can be made stronger and more effective, and in the graph inference itself. The graph can be constructed in a more informative manner, for example by weighting its edges according to some similarity measure, or by including a model that infers the graph directly from the RTFs.

The results shown here verify the robustness of this approach. They also show the benefit of applying a learning method directly to the graph representing the manifold, as opposed to learning a projection of the graph into a Euclidean space, which flattens the manifold, and then operating in that space.

References

  • [1] S. Gannot, D. Burshtein, and E. Weinstein, “Signal enhancement using beamforming and nonstationarity with applications to speech,” IEEE Tran. on Sig. Proc., vol. 49, no. 8, pp. 1614–1626, 2001.
  • [2] S. Markovich-Golan and S. Gannot, “Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 544–548.
  • [3] X. Li, L. Girin, R. Horaud, and S. Gannot, “Estimation of relative transfer function in the presence of stationary noise based on segmental power spectral density matrix subtraction,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 320–324.
  • [4] O. Shalvi and E. Weinstein, “System identification using nonstationary signals,” IEEE Tran. on Sig. Proc., vol. 44, no. 8, pp. 2055–2063, 1996.
  • [5] Z. Koldovskỳ, J. Málek, and S. Gannot, “Spatial source subtraction based on incomplete measurements of relative transfer function,” IEEE/ACM Tran. on Au., Sp., and Lang. Proc., vol. 23, no. 8, pp. 1335–1347, 2015.
  • [6] Y. R. Zheng, R. A. Goubran, and M. El-Tanany, “Robust near-field adaptive beamforming with distance discrimination,” IEEE Tran. on Sp. and Au. Proc., vol. 12, no. 5, pp. 478–488, 2004.
  • [7] S. Doclo, S. Gannot, M. Moonen, A. Spriet, S. Haykin, and K. R. Liu, “Acoustic beamforming for hearing aid applications,” Handbook on array processing and sensor networks, pp. 269–302, 2010.
  • [8] H. Cox, R. Zeskind, and M. Owen, “Robust adaptive beamforming,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 10, pp. 1365–1376, 1987.
  • [9] J. Li, P. Stoica, and Z. Wang, “On robust Capon beamforming and diagonal loading,” IEEE Tran. on Sig. Proc., vol. 51, no. 7, pp. 1702–1715, 2003.
  • [10] ——, “Doubly constrained robust Capon beamformer,” IEEE Tran. on Sig. Proc., vol. 52, no. 9, pp. 2407–2423, 2004.
  • [11] B. Laufer-Goldshtein, R. Talmon, and S. Gannot, “A study on manifolds of acoustic responses,” in Int. Conf. on Latent Variable Analysis and Sig. Separation.   Springer, 2015, pp. 203–210.
  • [12] J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
  • [13] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
  • [14] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in NIPS, vol. 14, no. 14, 2001, pp. 585–591.
  • [15] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Tran. on Au., Sp. and Lang. Proc., vol. 25, no. 4, pp. 692–730, 2017.
  • [16] O. Shmaryahu and S. Gannot, “On the importance of acoustic reflections in beamforming,” in International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 2022.
  • [17] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [18] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in European semantic web conference.   Springer, 2018, pp. 593–607.
  • [19] S. Zhang, H. Tong, J. Xu, and R. Maciejewski, “Graph convolutional networks: a comprehensive review,” Computational Social Networks, vol. 6, no. 1, pp. 1–23, 2019.
  • [20] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
  • [21] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph CNN for learning on point clouds,” ACM Transactions on Graphics (TOG), vol. 38, no. 5, pp. 1–12, 2019.
  • [22] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, “Graph neural networks: A review of methods and applications,” AI Open, vol. 1, pp. 57–81, 2020.
  • [23] A. Sofer, T. Kounovskỳ, J. Čmejla, Z. Koldovskỳ, and S. Gannot, “Robust relative transfer function identification on manifolds for speech enhancement,” in 2021 29th European Signal Processing Conference (EUSIPCO).   IEEE, 2021, pp. 401–405.
  • [24] R. Talmon and S. Gannot, “Relative transfer function identification on manifolds for supervised GSC beamformers,” in 21st European Signal Processing Conference (EUSIPCO 2013).   IEEE, 2013, pp. 1–5.
  • [25] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [26] K. Sohn, H. Lee, and X. Yan, “Learning structured output representation using deep conditional generative models,” Advances in neural information processing systems, vol. 28, 2015.
  • [27] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
  • [28] B. Laufer-Goldshtein, R. Talmon, and S. Gannot, “Semi-supervised sound source localization based on manifold regularization,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1393–1407, 2016.
  • [29] A. Deleforge and R. Horaud, “2D sound-source localization on the binaural manifold,” in 2012 IEEE International Workshop on Machine Learning for Signal Processing.   IEEE, 2012, pp. 1–6.
  • [30] B. Laufer-Goldshtein, R. Talmon, and S. Gannot, “Semi-supervised source localization on multiple manifolds with distributed microphones,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1477–1491, 2017.
  • [31] A. Deleforge, F. Forbes, and R. Horaud, “Acoustic space learning for sound-source separation and localization on binaural manifolds,” International journal of neural systems, vol. 25, no. 01, p. 1440003, 2015.
  • [32] A. Brendel, J. Zeitler, and W. Kellermann, “Manifold learning-supported estimation of relative transfer functions for spatial filtering,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 8792–8796.
  • [33] R. R. Coifman and S. Lafon, “Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions,” Applied and Computational Harmonic Analysis, vol. 21, no. 1, pp. 31–52, 2006.
  • [34] J. Svoboda, J. Masci, F. Monti, M. M. Bronstein, and L. Guibas, “PeerNets: Exploiting peer wisdom against adversarial attacks,” arXiv preprint arXiv:1806.00088, 2018.
  • [35] F. R. Chung and F. C. Graham, Spectral graph theory.   American Mathematical Soc., 1997, no. 92.
  • [36] D. Spielman, “Spectral graph theory,” Combinatorial scientific computing, vol. 18, 2012.
  • [37] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
  • [38] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
  • [39] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” Advances in neural information processing systems, vol. 29, pp. 3844–3852, 2016.
  • [40] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” in Advances in neural information processing systems, 2016, pp. 1993–2001.
  • [41] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in International conference on machine learning.   PMLR, 2016, pp. 2014–2023.
  • [42] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in International conference on machine learning.   PMLR, 2017, pp. 1263–1272.
  • [43] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1025–1035.
  • [44] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model CNNs,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5115–5124.
  • [45] W. Cao, Z. Yan, Z. He, and Z. He, “A comprehensive survey on geometric deep learning,” IEEE Access, vol. 8, pp. 35 929–35 949, 2020.
  • [46] U. Alon and E. Yahav, “On the bottleneck of graph neural networks and its practical implications,” arXiv preprint arXiv:2006.05205, 2020.
  • [47] S. Markovich, S. Gannot, and I. Cohen, “Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals,” IEEE Tran. on Au., Sp., and Lang. Proc., vol. 17, no. 6, pp. 1071–1086, 2009.
  • [48] S. Markovich-Golan and S. Gannot, “Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method,” in IEEE Int. Conf. on Acous., Sp. and Sig. Proc. (ICASSP), 2015, pp. 544–548.
  • [49] J. Čmejla, T. Kounovský, S. Gannot, Z. Koldovský, and P. Tandeitnik, “MIRaGe: multichannel database of room impulse responses measured on high-resolution cube-shaped grid,” in 28th European Signal Processing Conference (EUSIPCO), 2021, pp. 56–60.
  • [50] G. Li, M. Muller, A. Thabet, and B. Ghanem, “DeepGCNs: Can GCNs go as deep as CNNs?” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9267–9276.
  • [51] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in Thirty-Second AAAI conference on artificial intelligence, 2018.
  • [52] R. Talmon and S. Gannot, “Relative transfer function identification on manifolds for supervised GSC beamformers,” in 21st Euro. Sig. Proc. Conf. (EUSIPCO), 2013.
  • [53] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–half-baked or well done?” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 626–630.
  • [54] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Tran. on Au., Sp., and Lang. Proc., vol. 19, no. 7, pp. 2125–2136, 2011.
  • [55] C. K. Reddy, V. Gopal, and R. Cutler, “DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 886–890.