Search | arXiv e-print repository

arXiv:2310.19067 [pdf, other]

Expanding memory in recurrent spiking networks

Authors: Ismael Balafrej, Fabien Alibart, Jean Rouat

Abstract: Recurrent spiking neural networks (RSNNs) are notoriously difficult to train because of the vanishing gradient problem that is enhanced by the binary nature of the spikes. In this paper, we review the ability of the current state-of-the-art RSNNs to solve long-term memory tasks, and show that they have strong constraints both in performance, and for their implementation on hardware analog neuromorphic processors. We present a novel spiking neural network that circumvents these limitations. Our biologically inspired neural network uses synaptic delays, branching factor regularization and a novel surrogate derivative for the spiking function. The proposed network proves to be more successful in using the recurrent connections on memory tasks.

Submitted 29 October, 2023; originally announced October 2023.

arXiv:2308.12075 [pdf, other]

Stabilizing RNN Gradients through Pre-training

Authors: Luca Herranz-Celotti, Jean Rouat

Abstract: Numerous theories of learning propose to prevent the gradient from exponential growth with depth or time, to stabilize and improve training. Typically, these analyses are conducted on feed-forward fully-connected neural networks or simple single-layer recurrent neural networks, given their mathematical tractability. In contrast, this study demonstrates that pre-training the network to local stability can be effective whenever the architectures are too complex for an analytical initialization. Furthermore, we extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution, a theory we call the Local Stability Condition (LSC). Our investigation reveals that the classical Glorot, He, and Orthogonal initialization schemes satisfy the LSC when applied to feed-forward fully-connected neural networks. However, analysing deep recurrent networks, we identify a new additive source of exponential explosion that emerges from counting gradient paths in a rectangular grid in depth and time. We propose a new approach to mitigate this issue, which consists of giving a weight of a half to the time and depth contributions to the gradient, instead of the classical weight of one. Our empirical results confirm that pre-training both feed-forward and recurrent networks, for differentiable, neuromorphic and state-space models to fulfill the LSC, often results in improved final performance. This study contributes to the field by providing a means to stabilize networks of any complexity. Our approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to finding stable initializations analytically.

Submitted 4 January, 2024; v1 submitted 23 August, 2023; originally announced August 2023.
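
The path-counting argument in the abstract above can be illustrated with a small sketch. It is only one possible reading of the abstract, not the authors' derivation: it assumes that a gradient path is a monotone walk across a grid of `depth` layers and `time` steps, and that every step contributes the same scalar factor `weight`. The function names are hypothetical.

```python
from math import comb


def count_gradient_paths(depth: int, time: int) -> int:
    """Count monotone paths across a depth x time grid.

    Assumption for illustration: each step moves one unit either in
    depth or in time, so a path has depth + time steps in total.
    """
    return comb(depth + time, depth)


def weighted_path_sum(depth: int, time: int, weight: float = 0.5) -> float:
    """Sum over all paths when every step contributes a factor `weight`."""
    return count_gradient_paths(depth, time) * weight ** (depth + time)


if __name__ == "__main__":
    for size in (2, 10, 50):
        paths = count_gradient_paths(size, size)
        print(size, paths, weighted_path_sum(size, size, 1.0),
              weighted_path_sum(size, size, 0.5))
```

Under these assumptions, a per-step weight of one lets the sum grow with the number of paths, while a weight of a half keeps it at or below one, because the number of paths with `depth + time` steps can never exceed `2 ** (depth + time)`.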

arXiv:2308.10831 [pdf, other]

doi 10.3389/fncom.2023.1223258

Excitatory/Inhibitory Balance Emerges as a Key Factor for RBN Performance, Overriding Attractor Dynamics

Authors: Emmanuel Calvet, Jean Rouat, Bertrand Reulet

Abstract: Reservoir computing provides a time- and cost-efficient alternative to traditional learning methods. Critical regimes, known as the "edge of chaos," have been found to optimize computational performance in binary neural networks. However, little attention has been devoted to studying reservoir-to-reservoir variability when investigating the link between connectivity, dynamics, and performance. As physical reservoir computers become more prevalent, developing a systematic approach to network design is crucial. In this article, we examine Random Boolean Networks (RBNs) and demonstrate that specific distribution parameters can lead to diverse dynamics near critical points. We identify distinct dynamical attractors and quantify their statistics, revealing that most reservoirs possess a dominant attractor. We then evaluate performance in two challenging tasks, memorization and prediction, and find that a positive excitatory balance produces a critical point with higher memory performance. In comparison, a negative inhibitory balance delivers another critical point with better prediction performance. Interestingly, we show that the intrinsic attractor dynamics have little influence on performance in either case.

Submitted 2 August, 2023; originally announced August 2023.

Comments: 22 pages, 6 figures

Journal ref: Front. Comput. Neurosci. Volume 17 - 2023

arXiv:2304.05462 [pdf, other]

Evaluation of short range depth sonifications for visual-to-auditory sensory substitution

Authors: Louis Commère, Jean Rouat

Abstract: Visual to auditory sensory substitution devices convert visual information into sound and can provide valuable assistance for blind people. Recent iterations of these devices rely on depth sensors. Rules for converting depth into sound (i.e. the sonifications) are often designed arbitrarily, with no strong evidence for choosing one over another. The purpose of this work is to compare and understand the effectiveness of five depth sonifications in order to assist the design process of future visual to auditory systems for blind people which rely on depth sensors. The frequency, amplitude and reverberation of the sound as well as the repetition rate of short high-pitched sounds and the signal-to-noise ratio of a mixture between pure sound and noise are studied. We conducted positioning experiments with twenty-eight sighted blindfolded participants. Stage 1 incorporates learning phases followed by depth estimation tasks. Stage 2 adds the additional challenge of azimuth estimation to the first stage's protocol. Stage 3 tests learning retention by incorporating a 10-minute break before re-testing depth estimation. The best depth estimates in stage 1 were obtained with the sound frequency and the repetition rate of beeps. In stage 2, the beep repetition rate yielded the best depth estimation and no significant difference was observed for the azimuth estimation. Results of stage 3 showed that the beep repetition rate was the easiest sonification to memorize. Based on statistical analysis of the results, we discuss the effectiveness of each sonification and compare with other studies that encode depth into sounds. Finally, we provide recommendations for the design of depth encoding.

Submitted 11 April, 2023; originally announced April 2023.

arXiv:2207.07073 [pdf, other]

doi 10.1145/3546790.3546803

Efficient spike encoding algorithms for neuromorphic speech recognition

Authors: Sidi Yaya Arnaud Yarga, Jean Rouat, Sean U. N. Wood

Abstract: Spiking Neural Networks (SNN) are known to be very effective for neuromorphic processor implementations, achieving orders of magnitude improvements in energy efficiency and computational latency over traditional deep learning approaches. Comparable algorithmic performance was recently made possible as well with the adaptation of supervised training algorithms to the context of SNN. However, information including audio, video, and other sensor-derived data are typically encoded as real-valued signals that are not well-suited to SNN, preventing the network from leveraging spike timing information. Efficient encoding from real-valued signals to spikes is therefore critical and significantly impacts the performance of the overall system. To efficiently encode signals into spikes, both the preservation of information relevant to the task at hand as well as the density of the encoded spikes must be considered. In this paper, we study four spike encoding methods in the context of a speaker independent digit classification system: Send on Delta, Time to First Spike, Leaky Integrate and Fire Neuron and Ben's Spiker Algorithm. We first show that all encoding methods yield higher classification accuracy using significantly fewer spikes when encoding a bio-inspired cochleagram as opposed to a traditional short-time Fourier transform. We then show that two Send On Delta variants result in classification results comparable with a state of the art deep convolutional neural network baseline, while simultaneously reducing the encoded bit rate. Finally, we show that several encoding methods result in improved performance over the conventional deep learning baseline in certain cases, further demonstrating the power of spike encoding algorithms in the encoding of real-valued signals and that neuromorphic implementation has the potential to outperform state of the art techniques.

Submitted 14 July, 2022; originally announced July 2022.

Comments: Accepted to International Conference on Neuromorphic Systems (ICONS 2022)
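
For readers unfamiliar with Send on Delta, one of the four encoders named in the abstract above, the general idea can be sketched as follows. This is a generic textbook-style variant and not necessarily either of the two variants evaluated in the paper. The function name, the `(index, polarity)` event format, and the choice to reset the reference to the current sample are assumptions made for illustration.

```python
def send_on_delta(signal, delta):
    """Encode a sampled signal as a list of (index, polarity) events.

    An event is emitted whenever the signal has moved at least `delta`
    away from the value held at the previous event. Assumption: the
    reference is reset to the current sample after each event (other
    variants move the reference by exactly `delta` instead).
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    events = []
    if len(signal) == 0:
        return events
    reference = signal[0]
    for index in range(1, len(signal)):
        diff = signal[index] - reference
        if diff >= delta:
            events.append((index, 1))    # upward change: positive spike
            reference = signal[index]
        elif diff <= -delta:
            events.append((index, -1))   # downward change: negative spike
            reference = signal[index]
    return events


if __name__ == "__main__":
    samples = [0.0, 0.25, 0.75, 0.875, 0.0]
    print(send_on_delta(samples, delta=0.5))
```

A larger `delta` produces fewer events, which is the trade-off between spike density and preserved information that the abstract refers to.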

arXiv:2204.06063 [pdf, other]

Sonified distance in sensory substitution does not always improve localization: comparison with a 2D and 3D handheld device

Authors: Louis Commère, Jean Rouat

Abstract: Early visual to auditory substitution devices encode 2D monocular images into sounds while more recent devices use distance information from 3D sensors. This study assesses whether the addition of sound-encoded distance in recent systems helps to convey the "where" information. This is important to the design of new sensory substitution devices. We conducted experiments for object localization and navigation tasks with a handheld visual to audio substitution system. It comprises 2D and 3D modes. Both encode in real-time the position of objects in images captured by a camera. The 3D mode encodes in addition the distance between the system and the object. Experiments have been conducted with 16 blindfolded sighted participants. For the localization, participants were quicker to understand the scene with the 3D mode that encodes distances. On the other hand, with the 2D-only mode, they were able to compensate for the lack of distance encoding after a short training. For the navigation, participants were as good with the 2D-only mode as with the 3D mode encoding distance.

Submitted 12 April, 2022; originally announced April 2022.

Comments: 14 pages, 9 figures

arXiv:2204.00094 [pdf, other]

doi 10.1007/11520153_14

Perceptive, non-linear Speech Processing and Spiking Neural Networks

Authors: Jean Rouat, Ramin Pichevar, Stéphane Loiselle

Abstract: Source separation and speech recognition are very difficult in the context of noisy and corrupted speech. Most conventional techniques need huge databases to estimate speech (or noise) density probabilities to perform separation or recognition. We discuss the potential of perceptive speech analysis and processing in combination with biologically plausible neural network processors. We illustrate the potential of such non-linear processing of speech on a source separation system inspired by an Auditory Scene Analysis paradigm. We also discuss a potential application in speech recognition.

Submitted 31 March, 2022; originally announced April 2022.

Comments: preprint of the 2005 published paper: Perceptive, Non-linear Speech Processing and Spiking Neural Networks. In: Chollet, G., Esposito, A., Faundez-Zanuy, M., Marinaro, M. (eds) Nonlinear Speech Modeling and Applications. NN 2004. Lecture Notes in Computer Science, vol 3445. Springer, Berlin, Heidelberg

arXiv:2203.11022 [pdf]

doi 10.3389/fnins.2022.983950

Voltage-Dependent Synaptic Plasticity (VDSP): Unsupervised probabilistic Hebbian plasticity rule based on neurons membrane potential

Authors: Nikhil Garg, Ismael Balafrej, Terrence C. Stewart, Jean Michel Portal, Marc Bocquet, Damien Querlioz, Dominique Drouin, Jean Rouat, Yann Beilliard, Fabien Alibart

Abstract: This study proposes voltage-dependent synaptic plasticity (VDSP), a novel brain-inspired unsupervised local learning rule for the online implementation of Hebb's plasticity mechanism on neuromorphic hardware. The proposed VDSP learning rule updates the synaptic conductance on the spike of the postsynaptic neuron only, which reduces by a factor of two the number of updates with respect to standard spike-timing-dependent plasticity (STDP). This update is dependent on the membrane potential of the presynaptic neuron, which is readily available as part of neuron implementation and hence does not require additional memory for storage. Moreover, the update is also regularized on synaptic weight and prevents explosion or vanishing of weights on repeated stimulation. Rigorous mathematical analysis is performed to draw an equivalence between VDSP and STDP. To validate the system-level performance of VDSP, we train a single-layer spiking neural network (SNN) for the recognition of handwritten digits. We report 85.01 $\pm$ 0.76% (Mean $\pm$ S.D.) accuracy for a network of 100 output neurons on the MNIST dataset. The performance improves when scaling the network size (89.93 $\pm$ 0.41% for 400 output neurons, 90.56 $\pm$ 0.27% for 500 neurons), which validates the applicability of the proposed learning rule for spatial pattern recognition tasks. Future work will consider more complicated tasks. Interestingly, the learning rule adapts better than STDP to the frequency of the input signal and does not require hand-tuning of hyperparameters.

Submitted 22 October, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

Comments: Front. Neurosci., 21 October 2022 Sec. Neuromorphic Engineering

Journal ref: Front. Neurosci. 16:983950 (2022)

arXiv:2202.09619 [pdf, other]

Evaluation of Neuromorphic Spike Encoding of Sound Using Information Theory

Authors: Ahmad El Ferdaoussi, Éric Plourde, Jean Rouat

Abstract: The problem of spike encoding of sound consists in transforming a sound waveform into spikes. It is of interest in many domains, including the development of audio-based spiking neural networks, where it is the first and most crucial stage of processing. Many algorithms have been proposed to perform spike encoding of sound. However, a systematic approach to quantitatively evaluate their performance is currently lacking. We propose the use of an information-theoretic framework to solve this problem. Specifically, we evaluate the coding efficiency of four spike encoding algorithms on two coding tasks that consist of coding the fundamental characteristics of sound: frequency and amplitude. The algorithms investigated are: Independent Spike Coding, Send-on-Delta coding, Ben's Spiker Algorithm, and Leaky Integrate-and-Fire coding. Using the tools of information theory, we estimate the information that the spikes carry on relevant aspects of an input stimulus. We find disparities in the coding efficiencies of the algorithms, where Leaky Integrate-and-Fire coding performs best. The information-theoretic analysis of their performance on these coding tasks provides insight on the encoding of richer and more complex sound stimuli.

Submitted 15 February, 2023; v1 submitted 19 February, 2022; originally announced February 2022.

Comments: 10 pages, 7 figures, internal report

arXiv:2202.00282 [pdf, other]

Stabilizing Spiking Neuron Training

Authors: Luca Herranz-Celotti, Jean Rouat

Abstract: Stability arguments are often used to prevent learning algorithms from having ever increasing activity and weights that hinder generalization. However, stability conditions can clash with the sparsity required to augment the energy efficiency of spiking neurons. Nonetheless, it can also provide solutions. In fact, spiking Neuromorphic Computing uses binary activity to improve Artificial Intelligence energy efficiency. However, its non-smoothness requires approximate gradients, known as Surrogate Gradients (SG), to close the performance gap with Deep Learning. Several SG have been proposed in the literature, but it remains unclear how to determine the best SG for a given task and network. Thus, we aim to theoretically define the best SG, through stability arguments, to reduce the need for grid search. In fact, we show that more complex tasks and networks need more careful choice of SG, even if overall the derivative of the fast sigmoid tends to outperform the others, for a wide range of learning rates. We therefore design a stability based theoretical method to choose initialization and SG shape before training on the most common spiking neuron, the Leaky Integrate and Fire (LIF). Since our stability method suggests the use of high firing rates at initialization, which is non-standard in the neuromorphic literature, we show that high initial firing rates, combined with a sparsity encouraging loss term introduced gradually, can lead to better generalization, depending on the SG shape. Our stability based theoretical solution finds a SG and initialization that experimentally result in improved accuracy. We show how it can be used to reduce the need for extensive grid-search of dampening, sharpness and tail-fatness of the SG. We also show that our stability concepts can be extended to be applicable on different LIF variants, such as DECOLLE and fluctuations-driven initializations.

Submitted 4 January, 2024; v1 submitted 1 February, 2022; originally announced February 2022.
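
The surrogate gradient idea mentioned in the abstract above can be sketched in a few lines. This is a generic illustration, not the authors' code: the forward pass is a hard threshold, and the backward pass substitutes a smooth curve shaped like the derivative of the fast sigmoid x / (1 + |x|), which is 1 / (1 + |x|)^2. The parameter names `sharpness` and `dampening` echo the terms in the abstract, but how they enter the formula here is an assumption, as are the function names and the default threshold.

```python
def spike(voltage: float, threshold: float = 1.0) -> float:
    """Forward pass: emit a binary spike when the voltage reaches threshold."""
    return 1.0 if voltage >= threshold else 0.0


def fast_sigmoid_surrogate(voltage: float, threshold: float = 1.0,
                           sharpness: float = 1.0,
                           dampening: float = 1.0) -> float:
    """Backward pass: stand-in for the derivative of the spike function.

    Assumed form: dampening / (1 + sharpness * |voltage - threshold|) ** 2,
    i.e. the fast-sigmoid derivative shape, peaking at the threshold.
    """
    distance = abs(voltage - threshold)
    return dampening / (1.0 + sharpness * distance) ** 2


if __name__ == "__main__":
    for v in (0.0, 0.5, 1.0, 1.5, 2.0):
        print(v, spike(v), fast_sigmoid_surrogate(v))
```

In this sketch, a larger `sharpness` concentrates the surrogate around the threshold and `dampening` scales its peak, which are the kinds of shape choices the abstract says would otherwise require grid search.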

arXiv:2109.14705 [pdf, ps, other]

doi 10.1109/MLSP52302.2021.9596348

Adaptive Approach For Sparse Representations Using The Locally Competitive Algorithm For Audio

Authors: Soufiyan Bahadi, Jean Rouat, Éric Plourde

Abstract: Gammachirp filterbank has been used to approximate the cochlea in sparse coding algorithms. An oriented grid search optimization was applied to adapt the gammachirp's parameters and improve the Matching Pursuit (MP) algorithm's sparsity along with the reconstruction quality. However, this combination of a greedy algorithm with a grid search at each iteration is computationally demanding and not suitable for real-time applications. This paper presents an adaptive approach to optimize the gammachirp's parameters but in the context of the Locally Competitive Algorithm (LCA) that requires much fewer computations than MP. The proposed method consists of taking advantage of the LCA's neural architecture to automatically adapt the gammachirp's filterbank using the backpropagation algorithm. Results demonstrate an improvement in the LCA's performance with our approach in terms of sparsity, reconstruction quality, and convergence time. This approach can yield a significant advantage over existing approaches for real-time applications.

Submitted 29 September, 2021; originally announced September 2021.

Comments: To be published at IEEE Machine Learning for Signal Processing 2021

Journal ref: 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP)

arXiv:2106.11169 [pdf, other]

doi 10.1145/3477145.3477267

Signals to Spikes for Neuromorphic Regulated Reservoir Computing and EMG Hand Gesture Recognition

Authors: Nikhil Garg, Ismael Balafrej, Yann Beilliard, Dominique Drouin, Fabien Alibart, Jean Rouat

Abstract: Surface electromyogram (sEMG) signals result from muscle movement and hence they are an ideal candidate for benchmarking event-driven sensing and computing. We propose a simple yet novel approach for optimizing the spike encoding algorithm's hyper-parameters inspired by the readout layer concept in reservoir computing. Using a simple machine learning algorithm after spike encoding, we report performance higher than the state-of-the-art spiking neural networks on two open-source datasets for hand gesture recognition. The spike encoded data is processed through a spiking reservoir with a biologically inspired topology and neuron model. When trained with the unsupervised activity regulation CRITICAL algorithm to operate at the edge of chaos, the reservoir yields better performance than state-of-the-art convolutional neural networks. The reservoir performance with regulated activity was found to be 89.72% for the Roshambo EMG dataset and 70.6% for the EMG subset of sensor fusion dataset. Therefore, the biologically-inspired computing paradigm, which is known for being power efficient, also proves to have a great potential when compared with conventional AI algorithms.

Submitted 3 August, 2021; v1 submitted 9 June, 2021; originally announced June 2021.

Comments: Accepted to International Conference on Neuromorphic Systems (ICONS 2021)

arXiv:2106.06736 [pdf, other]

Multi-level Attention Fusion Network for Audio-visual Event Recognition

Authors: Mathilde Brousmiche, Jean Rouat, Stéphane Dupont

Abstract: Event classification is inherently sequential and multimodal. Therefore, deep neural models need to dynamically focus on the most relevant time window and/or modality of a video. In this study, we propose the Multi-level Attention Fusion network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition. Inspired by prior studies in neuroscience, we couple both modalities at different levels of visual and audio paths. Furthermore, the network dynamically highlights a modality at a given time window relevant to classify events. Experimental results in AVE (Audio-Visual Event), UCF51, and Kinetics-Sounds datasets show that the approach can effectively improve the accuracy in audio-visual event classification. Code is available at: https://github.com/numediart/MAFnet

Submitted 12 June, 2021; originally announced June 2021.

Comments: Preprint submitted to the Information Fusion journal in August 2020

arXiv:2106.06147 [pdf, other]

doi 10.1109/TPAMI.2022.3194311

NAAQA: A Neural Architecture for Acoustic Question Answering

Authors: Jerome Abdelnour, Jean Rouat, Giampiero Salvi

Abstract: The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs. These include handling of variable duration scenes, and scenes built with elementary sounds that differ between training and test set. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. The use of 1D convolutions in time and frequency to process 2D spectro-temporal representations of acoustic content shows promising results and enables reductions in model complexity. We show that time coordinate maps augment temporal localization capabilities which enhance performance of the network by ~17 percentage points. On the other hand, frequency coordinate maps have little influence on this task. NAAQA achieves 79.5% accuracy on the AQA task with ~4 times fewer parameters than the previously explored VQA model. We evaluate the performance of NAAQA on an independent data set reconstructed from DAQA. We also test the addition of a MALiMo module in our model on both CLEAR2 and DAQA. We provide a detailed analysis of the results for the different question types. We release the code to produce CLEAR2 as well as NAAQA to foster research in this newly emerging machine learning task.

Submitted 12 January, 2024; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) in April 2021 (first revision February 2022)

ACM Class: I.2.7; I.2.10; I.5.0

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, Volume: 45 Issue: 4, Page(s): 4997-5009

arXiv:2011.15096 [pdf, other]

doi 10.1007/s00779-020-01388-1

A proposal and evaluation of new timbre visualisation methods for audio sample browsers

Authors: Etienne Richan, Jean Rouat

Abstract: Searching through vast libraries of sound samples can be a daunting and time-consuming task. Modern audio sample browsers use mappings between acoustic properties and visual attributes to visually differentiate displayed items. There are few studies focused on how well these mappings help users search for a specific sample. We propose new methods for generating textural labels and positioning samples based on perceptual representations of timbre. We perform a series of studies to evaluate the benefits of using shape, color or texture as labels in a known-item search task. We describe the motivation and implementation of the study, and present an in-depth analysis of results. We find that shape significantly improves task performance, while color and texture have little effect. We also compare results between in-person and online participants and propose research directions for further studies.

Submitted 30 November, 2020; originally announced November 2020.

Comments: 14 pages. Personal and Ubiquitous Computing (2020)

ACM Class: H.5.2; H.5.5

arXiv:2011.01018 [pdf, other]

AVECL-UMONS database for audio-visual event classification and localization

Authors: Mathilde Brousmiche, Stéphane Dupont, Jean Rouat

Abstract: We introduce the AVECL-UMons dataset for audio-visual event classification and localization in the context of office environments. The audio-visual dataset is composed of 11 event classes recorded at several realistic positions in two different rooms. Two types of sequences are recorded according to the number of events in the sequence. The dataset comprises 2662 unilabel sequences and 2724 multilabel sequences corresponding to a total of 5.24 hours. The dataset is publicly accessible online: https://zenodo.org/record/3965492#.X09wsobgrCI.

Submitted 2 October, 2020; originally announced November 2020.

arXiv:2010.09041 [pdf, ps, other]

Evaluation of a Vision-to-Audition Substitution System that Provides 2D WHERE Information and Fast User Learning

Authors: Louis Commère, Sean U. N. Wood, Jean Rouat

Abstract: Vision to audition substitution devices are designed to convey visual information through auditory input. The acceptance of such systems depends heavily on their ease of use, training time, reliability and on the amount of coverage of online auditory perception of current auditory scenes. Existing devices typically require extensive training time or complex and computationally demanding technology. The purpose of this work is to investigate the learning curve for a vision to audition substitution system that provides simple location features. Forty-two blindfolded users participated in experiments involving location and navigation tasks. Participants had no prior experience with the system. For the location task, participants had to locate 3 objects on a table after a short familiarisation period (10 minutes). Then, once they understood the manipulation of the device, they proceeded to the navigation task: participants had to walk through a large corridor without colliding with obstacles randomly placed on the floor. Participants were asked to repeat the task 5 times. At the end of the experiment, each participant had to fill out a questionnaire to provide feedback. They were able to perform localisation and navigation effectively after a short training time with an average of 10 minutes. Their navigation skills greatly improved across the trials.

Submitted 18 October, 2020; originally announced October 2020.

Comments: 15 pages, 10 figures

arXiv:2009.05593 [pdf, other]

doi 10.1088/2634-4386/ac6533

P-CRITICAL: A Reservoir Autoregulation Plasticity Rule for Neuromorphic Hardware

Authors: Ismael Balafrej, Jean Rouat

Abstract: Backpropagation algorithms on recurrent artificial neural networks require an unfolding of accumulated states over time. These states must be kept in memory for an undefined period of time which is task-dependent. This paper uses the reservoir computing paradigm where an untrained recurrent neural network layer is used as a preprocessor stage to learn temporal and limited data. These so-called reservoirs require either extensive fine-tuning or neuroplasticity with unsupervised learning rules. We propose a new local plasticity rule named P-CRITICAL designed for automatic reservoir tuning that translates well to Intel's Loihi research chip, a recent neuromorphic processor. We compare our approach on well-known datasets from the machine learning community while using a spiking neuronal architecture. We observe an improved performance on tasks coming from various modalities without the need to tune parameters. Such algorithms could be a key to end-to-end energy-efficient neuromorphic-based machine learning on edge devices.

Submitted 11 September, 2020; originally announced September 2020.

Comments: 10 pages, 5 figures

Journal ref: Neuromorphic Computing and Engineering, IOP Publishing Ltd, 2022

arXiv:2003.13600 [pdf, other]

AriEL: volume coding for sentence generation

Authors: Luca Celotti, Simon Brodeur, Jean Rouat

Abstract: Mapping sequences of discrete data to a point in a continuous space makes it difficult to retrieve those sequences via random sampling. Mapping the input to a volume would make it easier to retrieve at test time, and that is the strategy followed by the family of approaches based on the Variational Autoencoder (VAE). However, the fact that they optimize for prediction and for smoothness of representation at the same time forces them to trade off between the two. We improve on the performance of some of the standard methods in deep learning to generate sentences by uniformly sampling a continuous space. We do it by proposing AriEL, which constructs volumes in a continuous space without the need to encourage the creation of volumes through the loss function. We first benchmark on a toy grammar that allows automatic evaluation of the language learned and generated by the models. Then, we benchmark on a real dataset of human dialogues. Our results indicate that the random access to the stored information is dramatically improved, and our method AriEL is able to generate a wider variety of correct language by randomly sampling the latent space. VAE follows in performance for the toy dataset, while AE and Transformer follow for the real dataset. This partially supports the hypothesis that encoding information into volumes instead of into points can lead to improved retrieval of learned information with random sampling. This can lead to better generators, and we also discuss potential disadvantages.

Submitted 21 April, 2020; v1 submitted 30 March, 2020; originally announced March 2020.

arXiv:1911.02002 [pdf, ps, other]

Language coverage and generalization in RNN-based continuous sentence embeddings for interacting agents

Authors: Luca Celotti, Simon Brodeur, Jean Rouat

Abstract: Continuous sentence embeddings using recurrent neural networks (RNNs), where variable-length sentences are encoded into fixed-dimensional vectors, are often the main building blocks of architectures applied to language tasks such as dialogue generation. While it is known that those embeddings are able to learn some structures of language (e.g. grammar) in a purely data-driven manner, there is very little work on the objective evaluation of their ability to cover the whole language space and to generalize to sentences outside the language bias of the training data. Using a manually designed context-free grammar (CFG) to generate a large-scale dataset of sentences related to the content of realistic 3D indoor scenes, we evaluate the language coverage and generalization abilities of the most common continuous sentence embeddings based on RNNs. We also propose a new embedding method based on arithmetic coding, AriEL, that is not data-driven and that efficiently encodes in continuous space any sentence from the CFG. We find that RNN-based embeddings underfit the training data and cover only a small subset of the language defined by the CFG. They also fail to learn the underlying CFG and generalize to unbiased sentences from that same CFG. We found that AriEL provides an insightful baseline.

Submitted 5 November, 2019; originally announced November 2019.

arXiv:1904.03130 [pdf, other]

doi 10.1109/JSTSP.2019.2909193

Unsupervised Low Latency Speech Enhancement with RT-GCC-NMF

Authors: Sean U. N. Wood, Jean Rouat

Abstract: In this paper, we present RT-GCC-NMF: a real-time (RT), two-channel blind speech enhancement algorithm that combines the non-negative matrix factorization (NMF) dictionary learning algorithm with the generalized cross-correlation (GCC) spatial localization method. Using a pre-learned universal NMF dictionary, RT-GCC-NMF operates in a frame-by-frame fashion by associating individual dictionary atoms to target speech or background interference based on their estimated time delay of arrival (TDOA). We evaluate RT-GCC-NMF on two-channel mixtures of speech and real-world noise from the Signal Separation and Evaluation Campaign (SiSEC). We demonstrate that this approach generalizes to new speakers, acoustic environments, and recording setups from very little training data, and outperforms all but one of the algorithms from the SiSEC challenge in terms of overall Perceptual Evaluation methods for Audio Source Separation (PEASS) scores and compares favourably to the ideal binary mask baseline. Over a wide range of input SNRs, we show that this approach simultaneously improves the PEASS and signal to noise ratio (SNR)-based Blind Source Separation (BSS) Eval objective quality metrics as well as the short-time objective intelligibility (STOI) and extended STOI (ESTOI) objective speech intelligibility metrics. A flexible, soft masking function in the space of NMF activation coefficients offers real-time control of the trade-off between interference suppression and target speaker fidelity. Finally, we use an asymmetric short-time Fourier transform (STFT) to reduce the inherent algorithmic latency of RT-GCC-NMF from 64 ms to 2 ms with no loss in performance. We demonstrate that latencies within the tolerable range for hearing aids are possible on current hardware platforms.

Submitted 5 April, 2019; originally announced April 2019.

Comments: Accepted for publication in the IEEE JSTSP Special Issue on Data Science: Machine Learning for Audio Signal Processing

arXiv:1902.11280 [pdf, ps, other]

From Visual to Acoustic Question Answering

Authors: Jerome Abdelnour, Giampiero Salvi, Jean Rouat

Abstract: We introduce the new task of Acoustic Question Answering (AQA) to promote research in acoustic reasoning. The AQA task consists of analyzing an acoustic scene composed of a combination of elementary sounds and answering questions that relate the position and properties of these sounds. The kind of relational questions asked requires that the models perform non-trivial reasoning in order to answer correctly. Although similar problems have been extensively studied in the domain of visual reasoning, we are not aware of any previous studies addressing the problem in the acoustic domain. We propose a method for generating the acoustic scenes from elementary sounds and a number of relevant questions for each scene using templates. We also present preliminary results obtained with two models (FiLM and MAC) that have been shown to work for visual reasoning.

Submitted 28 February, 2019; originally announced February 2019.

arXiv:1811.10561 [pdf, other]

CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning

Authors: Jerome Abdelnour, Giampiero Salvi, Jean Rouat

Abstract: We introduce the task of acoustic question answering (AQA) in the area of acoustic reasoning. In this task an agent learns to answer questions on the basis of acoustic context. In order to promote research in this area, we propose a data generation paradigm adapted from CLEVR (Johnson et al. 2017). We generate acoustic scenes by leveraging a bank of elementary sounds. We also provide a number of functional programs that can be used to compose questions and answers that exploit the relationships between the attributes of the elementary sounds in each scene. We provide AQA datasets of various sizes as well as the data generation code. As a preliminary experiment to validate our data, we report the accuracy of current state-of-the-art visual question answering models when they are applied to the AQA task without modifications. Although there is a plethora of question answering tasks based on text, image or video data, to our knowledge, we are the first to propose answering questions directly on audio streams. We hope this contribution will facilitate the development of research in the area.

Submitted 26 November, 2018; originally announced November 2018.

Comments: NeurIPS 2018 Visually Grounded Interaction and Language (ViGIL) Workshop

arXiv:1804.10322 [pdf, other]

Classification of auditory stimuli from EEG signals with a regulated recurrent neural network reservoir

Authors: Marc-Antoine Moinnereau, Thomas Brienne, Simon Brodeur, Jean Rouat, Kevin Whittingstall, Eric Plourde

Abstract: The use of electroencephalogram (EEG) as the main input signal in brain-machine interfaces has been widely proposed due to the non-invasive nature of the EEG. Here we are specifically interested in interfaces that extract information from the auditory system and more specifically in the task of classifying heard speech from EEGs. To do so, we propose to limit the preprocessing of the EEGs and use machine learning approaches to automatically extract their meaningful characteristics. More specifically, we use a regulated recurrent neural network (RNN) reservoir, which has been shown to outperform classic machine learning approaches when applied to several different bio-signals, and we compare it with a deep neural network approach. Moreover, we also investigate the classification performance as a function of the number of EEG electrodes. A set of 8 subjects were presented randomly with 3 different auditory stimuli (English vowels a, i and u). We obtained an excellent classification rate of 83.2% with the RNN when considering all 64 electrodes. A rate of 81.7% was achieved with only 10 electrodes.

Submitted 26 April, 2018; originally announced April 2018.

arXiv:1801.10214 [pdf, other]

doi 10.21227/H2M94J

CREATE: Multimodal Dataset for Unsupervised Learning, Generative Modeling and Prediction of Sensory Data from a Mobile Robot in Indoor Environments

Authors: Simon Brodeur, Simon Carrier, Jean Rouat

Abstract: The CREATE database is composed of 14 hours of multimodal recordings from a mobile robotic platform based on the iRobot Create. The various sensors cover vision, audition, motors and proprioception. The dataset has been designed in the context of a mobile robot that can learn multimodal representations of its environment, thanks to its ability to navigate the environment. This ability can also be used to learn the dependencies and relationships between the different modalities of the robot (e.g. vision, audition), as they reflect both the external environment and the internal state of the robot. The provided multimodal dataset is expected to have multiple usages, such as multimodal unsupervised object learning, multimodal prediction and egomotion/causality detection.

Submitted 30 January, 2018; originally announced January 2018.

Comments: The CREATE dataset is Open access and available on IEEE Dataport (https://ieee-dataport.org/open-access/create-multimodal-dataset-unsupervised-learning-and-generative-modeling-sensory-data)

arXiv:1711.11017 [pdf, other]

HoME: a Household Multimodal Environment

Authors: Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, Aaron Courville

Abstract: We introduce HoME: a Household Multimodal Environment for artificial agents to learn from vision, audio, semantics, physics, and interaction with objects and other agents, all within a realistic context. HoME integrates over 45,000 diverse 3D house layouts based on the SUNCG dataset, a scale which may facilitate learning, generalization, and transfer. HoME is an open-source, OpenAI Gym-compatible platform extensible to tasks in reinforcement learning, language grounding, sound-based navigation, robotics, multi-agent learning, and more. We hope HoME better enables artificial agents to learn as humans do: in an interactive, multimodal, and richly contextualized setting.

Submitted 29 November, 2017; originally announced November 2017.

Comments: Presented at NIPS 2017's Visually-Grounded Interaction and Language Workshop

arXiv:1607.00359 [pdf, other]

Moving Toward High Precision Dynamical Modelling in Hidden Markov Models

Authors: Sébastien Gagnon, Jean Rouat

Abstract: The Hidden Markov Model (HMM) is often regarded as the dynamical model of choice in many fields and applications. It has also been at the heart of most state-of-the-art speech recognition systems since the 1970s. However, from Gaussian mixture model HMMs (GMM-HMM) to deep neural network HMMs (DNN-HMM), the underlying Markovian chain of state-of-the-art models has not changed much. The "left-to-right" topology is almost always employed because very few alternatives exist. In this paper, we propose that finely-tuned HMM topologies are essential for precise temporal modelling and that this approach should be investigated in state-of-the-art HMM systems. As such, we propose a proof-of-concept framework for learning efficient topologies by pruning down complex generic models. Speech recognition experiments that were conducted indicate that complex time dependencies can be better learned by this approach than with classical "left-to-right" models.

Submitted 1 July, 2016; originally announced July 2016.

arXiv:1604.01642 [pdf, ps, other]

doi 10.1109/ICASSP.2006.1661100

Robust 3D Localization and Tracking of Sound Sources Using Beamforming and Particle Filtering

Authors: Jean-Marc Valin, François Michaud, Jean Rouat

Abstract: In this paper we present a new robust sound source localization and tracking method using an array of eight microphones (US patent pending). The method uses a steered beamformer based on the reliability-weighted phase transform (RWPHAT) along with a particle filter-based tracking algorithm. The proposed system is able to estimate both the direction and the distance of the sources. In a videoconferencing context, the direction was estimated with an accuracy better than one degree while the distance was accurate within 10% RMS. Tracking of up to three simultaneous moving speakers is demonstrated in a noisy environment.

Submitted 27 February, 2016; originally announced April 2016.

Comments: 4 pages. arXiv admin note: substantial text overlap with arXiv:1602.08139

Journal ref: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 841-844, 2006

arXiv:1603.03215 [pdf, other]

doi 10.1109/ICASSP.2004.1325962

Microphone array post-filter for separation of simultaneous non-stationary sources

Authors: Jean-Marc Valin, Jean Rouat, François Michaud

Abstract: Microphone array post-filters have demonstrated their ability to greatly reduce noise at the output of a beamformer. However, current techniques only consider a single source of interest, most of the time assuming stationary background noise. We propose a microphone array post-filter that enhances the signals produced by the separation of simultaneous sources using common source separation algorithms. Our method is based on a loudness-domain optimal spectral estimator and on the assumption that the noise can be described as the sum of a stationary component and of a transient component that is due to leakage between the channels of the initial source separation algorithm. The system is evaluated in the context of mobile robotics and is shown to produce better results than current post-filtering techniques, greatly reducing interference while causing little distortion to the signal of interest, even at very low SNR.

Submitted 10 March, 2016; originally announced March 2016.

Comments: 4 pages. arXiv admin note: substantial text overlap with arXiv:1603.02341

Journal ref: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 221-224, 2004

arXiv:1603.02341 [pdf, ps, other]

doi 10.1109/IROS.2004.1389723

Enhanced Robot Audition Based on Microphone Array Source Separation with Post-Filter

Authors: Jean-Marc Valin, Jean Rouat, François Michaud

Abstract: We propose a system that gives a mobile robot the ability to separate simultaneous sound sources. A microphone array is used along with a real-time dedicated implementation of Geometric Source Separation and a post-filter that gives us a further reduction of interferences from other sources. We present results and comparisons for separation of multiple non-stationary speech sources combined with noise sources. The main advantage of our approach for mobile robots resides in the fact that both the frequency-domain Geometric Source Separation algorithm and the post-filter are able to adapt rapidly to new sources and non-stationarity. Separation results are presented for three simultaneous interfering speakers in the presence of noise. A reduction of log spectral distortion (LSD) and increase of signal-to-noise ratio (SNR) of approximately 10 dB and 14 dB are observed.

Submitted 7 March, 2016; originally announced March 2016.

Comments: 6 pages

Journal ref: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1820-1825, 2004

arXiv:1602.08629 [pdf, ps, other]

doi 10.1109/ROBOT.2004.1307286

Localization of Simultaneous Moving Sound Sources for Mobile Robot Using a Frequency-Domain Steered Beamformer Approach

Authors: Jean-Marc Valin, François Michaud, Brahim Hadjou, Jean Rouat

Abstract: Mobile robots in real-life settings would benefit from being able to localize sound sources. Such a capability can nicely complement vision to help localize a person or an interesting event in the environment, and also to provide enhanced processing for other capabilities such as speech recognition. In this paper we present a robust sound source localization method in three-dimensional space using an array of 8 microphones. The method is based on a frequency-domain implementation of a steered beamformer along with a probabilistic post-processor. Results show that a mobile robot can localize in real time multiple moving sources of different types over a range of 5 meters with a response time of 200 ms.

Submitted 27 February, 2016; originally announced February 2016.

Comments: 6 pages. arXiv admin note: substantial text overlap with arXiv:1602.08139

Journal ref: Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pp. 1033-1038, 2004

arXiv:1602.08213 [pdf, ps, other]

doi 10.1109/IROS.2003.1248813

Robust Sound Source Localization Using a Microphone Array on a Mobile Robot

Authors: Jean-Marc Valin, François Michaud, Jean Rouat, Dominic Létourneau

Abstract: The hearing sense on a mobile robot is important because it is omnidirectional and it does not require direct line-of-sight with the sound source. Such capabilities can nicely complement vision to help localize a person or an interesting event in the environment. To do so the robot auditory system must be able to work in noisy, unknown and diverse environmental conditions. In this paper we present a robust sound source localization method in three-dimensional space using an array of 8 microphones. The method is based on time delay of arrival estimation. Results show that a mobile robot can localize in real time different types of sound sources over a range of 3 meters and with a precision of 3 degrees.

Submitted 26 February, 2016; originally announced February 2016.

Comments: 6 pages

Journal ref: Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1228-1233, 2003

arXiv:1602.08139 [pdf, ps, other]

doi 10.1016/j.robot.2006.08.004

Robust Localization and Tracking of Simultaneous Moving Sound Sources Using Beamforming and Particle Filtering

Authors: Jean-Marc Valin, François Michaud, Jean Rouat

Abstract: Mobile robots in real-life settings would benefit from being able to localize and track sound sources. Such a capability can help localize a person or an interesting event in the environment, and also provides enhanced processing for other capabilities such as speech recognition. To give this capability to a robot, the challenge is not only to localize simultaneous sound sources, but to track them over time. In this paper we propose a robust sound source localization and tracking method using an array of eight microphones. The method is based on a frequency-domain implementation of a steered beamformer along with a particle filter-based tracking algorithm. Results show that a mobile robot can localize and track in real-time multiple moving sources of different types over a range of 7 meters. These new capabilities allow a mobile robot to interact using more natural means with people in real life settings.

Submitted 25 February, 2016; originally announced February 2016.

Comments: 26 pages

Journal ref: Robotics and Autonomous Systems Journal (Elsevier), Vol. 55, No. 3, pp. 216-228, 2007

arXiv:1602.06442 [pdf, ps, other]

doi 10.1109/TRO.2007.900612

Robust Recognition of Simultaneous Speech By a Mobile Robot

Authors: Jean-Marc Valin, Shun'ichi Yamamoto, Jean Rouat, Francois Michaud, Kazuhiro Nakadai, Hiroshi G. Okuno

Abstract: This paper describes a system that gives a mobile robot the ability to perform automatic speech recognition with simultaneous speakers. A microphone array is used along with a real-time implementation of Geometric Source Separation and a post-filter that gives a further reduction of interference from other sources. The post-filter is also used to estimate the reliability of spectral features and compute a missing feature mask. The mask is used in a missing feature theory-based speech recognition system to recognize the speech from simultaneous Japanese speakers in the context of a humanoid robot. Recognition rates are presented for three simultaneous speakers located at 2 meters from the robot. The system was evaluated on a 200 word vocabulary at different azimuths between sources, ranging from 10 to 90 degrees. Compared to the use of the microphone array source separation alone, we demonstrate an average reduction in relative recognition error rate of 24% with the post-filter and of 42% when the missing features approach is combined with the post-filter. We demonstrate the effectiveness of our multi-source microphone array post-filter and the improvement it provides when used in conjunction with the missing features theory.

Submitted 20 February, 2016; originally announced February 2016.

Comments: 12 pages

Journal ref: IEEE Transactions on Robotics, Vol. 23, No. 4, pp. 742-752, 2007

arXiv:1311.5924 [pdf, other]

Objets Sonores: Une Représentation Bio-Inspirée Hiérarchique Parcimonieuse À Très Grandes Dimensions Utilisable En Reconnaissance; Auditory Objects: Bio-Inspired Hierarchical Sparse High Dimensional Representation for Recognition

Authors: Simon Brodeur, Jean Rouat

Abstract: This article emphasizes the hierarchical structure, independence and sparseness of auditory signal representations in high-dimensional spaces, so as to define the components of auditory objects. The concept of an auditory object and its neural representation is introduced. An illustrative application then follows, consisting of the analysis of various auditory signals: speech, music and natural outdoor environments. A new automatic speech recognition (ASR) system is then proposed and compared to a conventional statistical system. The proposed system clearly shows that an object-based analysis introduces great flexibility and robustness for the task of speech recognition. The integration of knowledge from neuroscience and acoustic signal processing brings new ways of thinking to the field of classification of acoustic signals.

Submitted 22 November, 2013; originally announced November 2013.

Comments: 16 pages, Invited journal paper

Journal ref: Canadian Acoustics / Acoustique Canadienne, Vol. 41, No. 2, June 2013, pp. 33-48

arXiv:1304.0640 [pdf, ps, other]

doi 10.1016/j.neunet.2013.02.005

Event management for large scale event-driven digital hardware spiking neural networks

Authors: Louis-Charles Caron, Michiel D'Haene, Frédéric Mailhot, Benjamin Schrauwen, Jean Rouat

Abstract: The interest in brain-like computation has led to the design of a plethora of innovative neuromorphic systems. Individually, spiking neural networks (SNNs), event-driven simulation and digital hardware neuromorphic systems get a lot of attention. Despite the popularity of event-driven SNNs in software, very few digital hardware architectures are found. This is because existing hardware solutions for event management scale badly with the number of events. This paper introduces the structured heap queue, a pipelined digital hardware data structure, and demonstrates its suitability for event management. The structured heap queue scales gracefully with the number of events, allowing the efficient implementation of large scale digital hardware event-driven SNNs. The scaling is linear for memory, logarithmic for logic resources and constant for processing time. The use of the structured heap queue is demonstrated on a field-programmable gate array (FPGA) with an image segmentation experiment and an SNN of 65,536 neurons and 513,184 synapses. Events can be processed at the rate of 1 every 7 clock cycles and a 406×158 pixel image is segmented in 200 ms.

Submitted 2 April, 2013; originally announced April 2013.

Showing 1–36 of 36 results for author: Rouat, J