-
Injectivity of ReLU-layers: Perspectives from Frame Theory
Authors:
Daniel Haider,
Martin Ehler,
Peter Balazs
Abstract:
Injectivity is the defining property of a map** that ensures no information is lost and any input can be perfectly reconstructed from its output. By performing hard thresholding, the ReLU function naturally interferes with this property, making the injectivity analysis of ReLU-layers in neural networks a challenging yet intriguing task that has not yet been fully solved. This article establishes…
▽ More
Injectivity is the defining property of a map** that ensures no information is lost and any input can be perfectly reconstructed from its output. By performing hard thresholding, the ReLU function naturally interferes with this property, making the injectivity analysis of ReLU-layers in neural networks a challenging yet intriguing task that has not yet been fully solved. This article establishes a frame theoretic perspective to approach this problem. The main objective is to develop the most general characterization of the injectivity behavior of ReLU-layers in terms of all three involved ingredients: (i) the weights, (ii) the bias, and (iii) the domain where the data is drawn from. Maintaining a focus on practical applications, we limit our attention to bounded domains and present two methods for numerically approximating a maximal bias for given weights and data domains. These methods provide sufficient conditions for the injectivity of a ReLU-layer on those domains and yield a novel practical methodology for studying the information loss in ReLU layers. Finally, we derive explicit reconstruction formulas based on the duality concept from frame theory.
△ Less
Submitted 22 June, 2024;
originally announced June 2024.
-
Instabilities in Convnets for Raw Audio
Authors:
Daniel Haider,
Vincent Lostanlen,
Martin Ehler,
Peter Balazs
Abstract:
What makes waveform-based deep learning so hard? Despite numerous attempts at training convolutional neural networks (convnets) for filterbank design, they often fail to outperform hand-crafted baselines. These baselines are linear time-invariant systems: as such, they can be approximated by convnets with wide receptive fields. Yet, in practice, gradient-based optimization leads to suboptimal appr…
▽ More
What makes waveform-based deep learning so hard? Despite numerous attempts at training convolutional neural networks (convnets) for filterbank design, they often fail to outperform hand-crafted baselines. These baselines are linear time-invariant systems: as such, they can be approximated by convnets with wide receptive fields. Yet, in practice, gradient-based optimization leads to suboptimal approximations. In our article, we approach this phenomenon from the perspective of initialization. We present a theory of large deviations for the energy response of FIR filterbanks with random Gaussian weights. We find that deviations worsen for large filters and locally periodic input signals, which are both typical for audio signal processing applications. Numerical simulations align with our theory and suggest that the condition number of a convolutional layer follows a logarithmic scaling law between the number and length of the filters, which is reminiscent of discrete wavelet bases.
△ Less
Submitted 26 April, 2024; v1 submitted 11 September, 2023;
originally announced September 2023.
-
Fitting Auditory Filterbanks with Multiresolution Neural Networks
Authors:
Vincent Lostanlen,
Daniel Haider,
Han Han,
Mathieu Lagrange,
Peter Balazs,
Martin Ehler
Abstract:
Waveform-based deep learning faces a dilemma between nonparametric and parametric approaches. On one hand, convolutional neural networks (convnets) may approximate any linear time-invariant system; yet, in practice, their frequency responses become more irregular as their receptive fields grow. On the other hand, a parametric model such as LEAF is guaranteed to yield Gabor filters, hence an optima…
▽ More
Waveform-based deep learning faces a dilemma between nonparametric and parametric approaches. On one hand, convolutional neural networks (convnets) may approximate any linear time-invariant system; yet, in practice, their frequency responses become more irregular as their receptive fields grow. On the other hand, a parametric model such as LEAF is guaranteed to yield Gabor filters, hence an optimal time-frequency localization; yet, this strong inductive bias comes at the detriment of representational capacity. In this paper, we aim to overcome this dilemma by introducing a neural audio model, named multiresolution neural network (MuReNN). The key idea behind MuReNN is to train separate convolutional operators over the octave subbands of a discrete wavelet transform (DWT). Since the scale of DWT atoms grows exponentially between octaves, the receptive fields of the subsequent learnable convolutions in MuReNN are dilated accordingly. For a given real-world dataset, we fit the magnitude response of MuReNN to that of a well-established auditory filterbank: Gammatone for speech, CQT for music, and third-octave for urban sounds, respectively. This is a form of knowledge distillation (KD), in which the filterbank ''teacher'' is engineered by domain knowledge while the neural network ''student'' is optimized from data. We compare MuReNN to the state of the art in terms of goodness of fit after KD on a hold-out set and in terms of Heisenberg time-frequency localization. Compared to convnets and Gabor convolutions, we find that MuReNN reaches state-of-the-art performance on all three optimization problems.
△ Less
Submitted 25 July, 2023;
originally announced July 2023.
-
Convex Geometry of ReLU-layers, Injectivity on the Ball and Local Reconstruction
Authors:
Daniel Haider,
Martin Ehler,
Peter Balazs
Abstract:
The paper uses a frame-theoretic setting to study the injectivity of a ReLU-layer on the closed ball of $\mathbb{R}^n$ and its non-negative part. In particular, the interplay between the radius of the ball and the bias vector is emphasized. Together with a perspective from convex geometry, this leads to a computationally feasible method of verifying the injectivity of a ReLU-layer under reasonable…
▽ More
The paper uses a frame-theoretic setting to study the injectivity of a ReLU-layer on the closed ball of $\mathbb{R}^n$ and its non-negative part. In particular, the interplay between the radius of the ball and the bias vector is emphasized. Together with a perspective from convex geometry, this leads to a computationally feasible method of verifying the injectivity of a ReLU-layer under reasonable restrictions in terms of an upper bound of the bias vector. Explicit reconstruction formulas are provided, inspired by the duality concept from frame theory. All this gives rise to the possibility of quantifying the invertibility of a ReLU-layer and a concrete reconstruction algorithm for any input vector on the ball.
△ Less
Submitted 18 July, 2023;
originally announced July 2023.
-
Fast Matching Pursuit with Multi-Gabor Dictionaries
Authors:
Zdeněk Průša,
Nicki Holighaus,
Peter Balazs
Abstract:
Finding the best K-sparse approximation of a signal in a redundant dictionary is an NP-hard problem. Suboptimal greedy matching pursuit (MP) algorithms are generally used for this task. In this work, we present an acceleration technique and an implementation of the matching pursuit algorithm acting on a multi-Gabor dictionary, i.e., a concatenation of several Gabor-type time-frequency dictionaries…
▽ More
Finding the best K-sparse approximation of a signal in a redundant dictionary is an NP-hard problem. Suboptimal greedy matching pursuit (MP) algorithms are generally used for this task. In this work, we present an acceleration technique and an implementation of the matching pursuit algorithm acting on a multi-Gabor dictionary, i.e., a concatenation of several Gabor-type time-frequency dictionaries, each of which consisting of translations and modulations of a possibly different window and time and frequency shift parameters. The technique is based on pre-computing and thresholding inner products between atoms and on updating the residual directly in the coefficient domain, i.e., without the round-trip to the signal domain. Since the proposed acceleration technique involves an approximate update step, we provide theoretical and experimental results illustrating the convergence of the resulting algorithm. The implementation is written in C (compatible with C99 and C++11) and we also provide Matlab and GNU Octave interfaces. For some settings, the implementation is up to 70 times faster than the standard Matching Pursuit Toolkit (MPTK).
△ Less
Submitted 16 February, 2022;
originally announced February 2022.
-
Phase-Based Signal Representations for Scattering
Authors:
Daniel Haider,
Peter Balazs,
Nicki Holighaus
Abstract:
The scattering transform is a non-linear signal representation method based on cascaded wavelet transform magnitudes. In this paper we introduce phase scattering, a novel approach where we use phase derivatives in a scattering procedure. We first revisit phase-related concepts for representing time-frequency information of audio signals, in particular, the partial derivatives of the phase in the t…
▽ More
The scattering transform is a non-linear signal representation method based on cascaded wavelet transform magnitudes. In this paper we introduce phase scattering, a novel approach where we use phase derivatives in a scattering procedure. We first revisit phase-related concepts for representing time-frequency information of audio signals, in particular, the partial derivatives of the phase in the time-frequency domain. By putting analytical and numerical results in a new light, we set the basis to extend the phase-based representations to higher orders by means of a scattering transform, which leads to well localized signal representations of large-scale structures. All the ideas are introduced in a general way and then applied using the STFT.
△ Less
Submitted 15 February, 2022;
originally announced February 2022.
-
Audio Inpainting via $\ell_1$-Minimization and Dictionary Learning
Authors:
Shristi Rajbamshi,
Georg Tauböck,
Peter Balazs,
Nicki Holighaus
Abstract:
Audio inpainting refers to signal processing techniques that aim at restoring missing or corrupted consecutive samples in audio signals. Prior works have shown that $\ell_1$- minimization with appropriate weighting is capable of solving audio inpainting problems, both for the analysis and the synthesis models. These models assume that audio signals are sparse with respect to some redundant diction…
▽ More
Audio inpainting refers to signal processing techniques that aim at restoring missing or corrupted consecutive samples in audio signals. Prior works have shown that $\ell_1$- minimization with appropriate weighting is capable of solving audio inpainting problems, both for the analysis and the synthesis models. These models assume that audio signals are sparse with respect to some redundant dictionary and exploit that sparsity for inpainting purposes. Remaining within the sparsity framework, we utilize dictionary learning to further increase the sparsity and combine it with weighted $\ell_1$-minimization adapted for audio inpainting to compensate for the loss of energy within the gap after restoration. Our experiments demonstrate that our approach is superior in terms of signal-to-distortion ratio (SDR) and objective difference grade (ODG) compared with its original counterpart.
△ Less
Submitted 15 February, 2022;
originally announced February 2022.
-
Image-based material characterization of complex microarchitectured additively manufactured structures
Authors:
N. Korshunova,
J. Jomo,
G. Lékó,
D. Reznik,
P. Balázs,
S. Kollmannsberger
Abstract:
Significant developments in the field of additive manufacturing (AM) allowed the fabrication of complex microarchitectured components with varying porosity across different scales. However, due to the high complexity of this process, the final parts can exhibit significant variations in the nominal geometry. Computer tomographic images of 3D printed components provide extensive information about t…
▽ More
Significant developments in the field of additive manufacturing (AM) allowed the fabrication of complex microarchitectured components with varying porosity across different scales. However, due to the high complexity of this process, the final parts can exhibit significant variations in the nominal geometry. Computer tomographic images of 3D printed components provide extensive information about these microstructural variations, such as process-induced porosity, surface roughness, and other undesired morphological discrepancies. Yet, techniques to incorporate these imperfect AM geometries into the numerical material characterization analysis are computationally demanding. In this contribution, an efficient image-to-material-characterization framework using the high-order parallel Finite Cell Method is proposed. In this way, a flexible non-geometry-conforming discretization facilitates mesh generation for very complex microstructures at hand and allows a direct analysis of the images stemming from CT-scans. Numerical examples including a comparison to the experiments illustrate the potential of the proposed framework in the field of additive manufacturing product simulation.
△ Less
Submitted 22 July, 2020; v1 submitted 16 December, 2019;
originally announced December 2019.
-
Harmonic-aligned Frame Mask Based on Non-stationary Gabor Transform with Application to Content-dependent Speaker Comparison
Authors:
Feng Huang,
Peter Balazs
Abstract:
We propose harmonic-aligned frame mask for speech signals using non-stationary Gabor transform (NSGT). A frame mask operates on the transfer coefficients of a signal and consequently converts the signal into a counterpart signal. It depicts the difference between the two signals. In preceding studies, frame masks based on regular Gabor transform were applied to single-note instrumental sound analy…
▽ More
We propose harmonic-aligned frame mask for speech signals using non-stationary Gabor transform (NSGT). A frame mask operates on the transfer coefficients of a signal and consequently converts the signal into a counterpart signal. It depicts the difference between the two signals. In preceding studies, frame masks based on regular Gabor transform were applied to single-note instrumental sound analysis. This study extends the frame mask approach to speech signals. For voiced speech, the fundamental frequency is usually changing consecutively over time. We employ NSGT with pitch-dependent and therefore time-varying frequency resolution to attain harmonic alignment in the transform domain and hence yield harmonic-aligned frame masks for speech signals. We propose to apply the harmonic-aligned frame mask to content-dependent speaker comparison. Frame masks, computed from voiced signals of a same vowel but from different speakers, were utilized as similarity measures to compare and distinguish the speaker identities (SID). Results obtained with deep neural networks demonstrate that the proposed frame mask is valid in representing speaker characteristics and shows a potential for SID applications in limited data scenarios.
△ Less
Submitted 23 April, 2019;
originally announced April 2019.
-
Frame Theory for Signal Processing in Psychoacoustics
Authors:
Peter Balazs,
Nicki Holighaus,
Thibaud Necciari,
Diana Stoeva
Abstract:
This review chapter aims to strengthen the link between frame theory and signal processing tasks in psychoacoustics. On the one side, the basic concepts of frame theory are presented and some proofs are provided to explain those concepts in some detail. The goal is to reveal to hearing scientists how this mathematical theory could be relevant for their research. In particular, we focus on frame th…
▽ More
This review chapter aims to strengthen the link between frame theory and signal processing tasks in psychoacoustics. On the one side, the basic concepts of frame theory are presented and some proofs are provided to explain those concepts in some detail. The goal is to reveal to hearing scientists how this mathematical theory could be relevant for their research. In particular, we focus on frame theory in a filter bank approach, which is probably the most relevant view-point for audio signal processing. On the other side, basic psychoacoustic concepts are presented to stimulate mathematicians to apply their knowledge in this field.
△ Less
Submitted 3 November, 2016;
originally announced November 2016.
-
A Non-iterative Method for (Re)Construction of Phase from STFT Magnitude
Authors:
Zdeněk Průša,
Peter Balazs,
Peter L. Søndergaard
Abstract:
A non-iterative method for the construction of the Short-Time Fourier Transform (STFT) phase from the magnitude is presented. The method is based on the direct relationship between the partial derivatives of the phase and the logarithm of the magnitude of the un-sampled STFT with respect to the Gaussian window. Although the theory holds in the continuous setting only, the experiments show that the…
▽ More
A non-iterative method for the construction of the Short-Time Fourier Transform (STFT) phase from the magnitude is presented. The method is based on the direct relationship between the partial derivatives of the phase and the logarithm of the magnitude of the un-sampled STFT with respect to the Gaussian window. Although the theory holds in the continuous setting only, the experiments show that the algorithm performs well even in the discretized setting (Discrete Gabor transform) with low redundancy using the sampled Gaussian window, the truncated Gaussian window and even other compactly supported windows like the Hann window.
Due to the non-iterative nature, the algorithm is very fast and it is suitable for long audio signals. Moreover, solutions of iterative phase reconstruction algorithms can be improved considerably by initializing them with the phase estimate provided by the present algorithm.
We present an extensive comparison with the state-of-the-art algorithms in a reproducible manner.
△ Less
Submitted 1 September, 2016;
originally announced September 2016.
-
Inpainting of long audio segments with similarity graphs
Authors:
Nathanael Perraudin,
Nicki Holighaus,
Piotr Majdak,
Peter Balazs
Abstract:
We present a novel method for the compensation of long duration data loss in audio signals, in particular music. The concealment of such signal defects is based on a graph that encodes signal structure in terms of time-persistent spectral similarity. A suitable candidate segment for the substitution of the lost content is proposed by an intuitive optimization scheme and smoothly inserted into the…
▽ More
We present a novel method for the compensation of long duration data loss in audio signals, in particular music. The concealment of such signal defects is based on a graph that encodes signal structure in terms of time-persistent spectral similarity. A suitable candidate segment for the substitution of the lost content is proposed by an intuitive optimization scheme and smoothly inserted into the gap, i.e. the lost or distorted signal region. Extensive listening tests show that the proposed algorithm provides highly promising results when applied to a variety of real-world music signals.
△ Less
Submitted 23 February, 2018; v1 submitted 22 July, 2016;
originally announced July 2016.
-
A Perceptually Motivated Filter Bank with Perfect Reconstruction for Audio Signal Processing
Authors:
Thibaud Necciari,
Nicki Holighaus,
Peter Balazs,
Zdenek Prusa
Abstract:
Many audio applications rely on filter banks (FBs) to analyze, process, and re-synthesize sounds. To approximate the auditory frequency resolution in the signal chain, some applications rely on perceptually motivated FBs, the gammatone FB being a popular example. However, most perceptually motivated FBs only allow partial signal reconstruction at high redundancies and/or do not have good resistanc…
▽ More
Many audio applications rely on filter banks (FBs) to analyze, process, and re-synthesize sounds. To approximate the auditory frequency resolution in the signal chain, some applications rely on perceptually motivated FBs, the gammatone FB being a popular example. However, most perceptually motivated FBs only allow partial signal reconstruction at high redundancies and/or do not have good resistance to sub-channel processing. This paper introduces an oversampled perceptually motivated FB enabling perfect reconstruction, efficient FB design, and adaptable redundancy. The filters are directly constructed in the frequency domain and linearly distributed on a perceptual frequency scale (e.g. ERB, Bark, or Mel scale). The proposed design allows for various filter shapes, uniform or non-uniform FB setting, and large down-sampling factors. For redundancies $\geq$ 3 perfect reconstruction is achieved by computing the canonical dual FB analytically. For lower redundancies perfect reconstruction is achieved using an iterative method. Experiments show performance improvements of the proposed approach when compared to the gammatone FB in terms of reconstruction error and resistance to sub-channel processing, especially at low redundancies.
△ Less
Submitted 25 January, 2016;
originally announced January 2016.
-
Sound Analysis and Synthesis Adaptive in Time and Two Frequency Bands
Authors:
Marco Liuni,
Peter Balazs,
Axel Röbel
Abstract:
We present an algorithm for sound analysis and resynthesis with local automatic adaptation of time-frequency resolution. There exists several algorithms allowing to adapt the analysis window depending on its time or frequency location; in what follows we propose a method which select the optimal resolution depending on both time and frequency. We consider an approach that we denote as analysis-wei…
▽ More
We present an algorithm for sound analysis and resynthesis with local automatic adaptation of time-frequency resolution. There exists several algorithms allowing to adapt the analysis window depending on its time or frequency location; in what follows we propose a method which select the optimal resolution depending on both time and frequency. We consider an approach that we denote as analysis-weighting, from the point of view of Gabor frame theory. We analyze in particular the case of different adaptive time-varying resolutions within two complementary frequency bands; this is a typical case where perfect signal reconstruction cannot in general be achieved with fast algorithms, causing a certain error to be minimized. We provide examples of adaptive analyses of a music sound, and outline several possibilities that this work opens.
△ Less
Submitted 29 September, 2011;
originally announced September 2011.