-
Rethinking Non-Negative Matrix Factorization with Implicit Neural Representations
Authors:
Krishna Subramani,
Paris Smaragdis,
Takuya Higuchi,
Mehrez Souden
Abstract:
Non-negative Matrix Factorization (NMF) is a powerful technique for analyzing regularly-sampled data, i.e., data that can be stored in a matrix. For audio, this has led to numerous applications using time-frequency (TF) representations like the Short-Time Fourier Transform. However, extending these applications to irregularly-spaced TF representations, like the Constant-Q transform, wavelets, or sinusoidal analysis models, has not been possible since these representations cannot be directly stored in matrix form. In this paper, we formulate NMF in terms of continuous functions (instead of fixed vectors) and show that NMF can be extended to a wider variety of signal classes that need not be regularly sampled.
Submitted 5 April, 2024;
originally announced April 2024.
-
Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?
Authors:
Zakaria Aldeneh,
Takuya Higuchi,
Jee-weon Jung,
Skyler Seto,
Tatiana Likhomanenko,
Stephen Shum,
Ahmed Hussen Abdelaziz,
Shinji Watanabe,
Barry-John Theobald
Abstract:
Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-bank features as inputs, and thus, training them on top of self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised speech features inherently include information required for the downstream speaker verification task, and therefore, we can simplify the downstream model without sacrificing performance. To this end, we revisit the design of the downstream model for speaker verification using self-supervised features. We show that we can simplify the model to use 97.51% fewer parameters while achieving a 29.93% average improvement in performance on SUPERB. Consequently, we show that the simplified downstream model is more data efficient compared to the baseline: it achieves better performance with only 60% of the training data.
Submitted 13 June, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models
Authors:
Jee-weon Jung,
Wangyou Zhang,
Jiatong Shi,
Zakaria Aldeneh,
Takuya Higuchi,
Barry-John Theobald,
Ahmed Hussen Abdelaziz,
Shinji Watanabe
Abstract:
This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also aspire to bridge developed models with other domains, facilitating the broad research community to effortlessly incorporate state-of-the-art embedding extractors. Pre-trained embedding extractors can be accessed in an off-the-shelf manner and we demonstrate the toolkit's versatility by showcasing its integration with two tasks. Another goal is to integrate with diverse self-supervised learning features. We release a reproducible recipe that achieves an equal error rate of 0.39% on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN.
Submitted 13 June, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
Does Single-channel Speech Enhancement Improve Keyword Spotting Accuracy? A Case Study
Authors:
Avamarie Brueggeman,
Takuya Higuchi,
Masood Delfarah,
Stephen Shum,
Vineet Garg
Abstract:
Noise robustness is a key aspect of successful speech applications. Speech enhancement (SE) has been investigated to improve automatic speech recognition accuracy; however, its effectiveness for keyword spotting (KWS) is still under-investigated. In this paper, we conduct a comprehensive study on single-channel speech enhancement for keyword spotting on the Google Speech Command (GSC) dataset. To investigate robustness to noise, the GSC dataset is augmented with noise signals from the WSJ0 Hipster Ambient Mixtures (WHAM!) noise dataset. Our investigation includes not only applying SE before KWS but also performing joint training of the SE frontend and KWS backend models. Moreover, we explore audio injection, a common approach to reduce distortions by using a weighted average of the enhanced and original signals. Audio injection is then further optimized by using another model that predicts the weight for each utterance. Our investigation reveals that SE can improve KWS accuracy on noisy speech when the backend model is trained on clean speech; however, despite our extensive exploration, it is difficult to improve the KWS accuracy with SE when the backend is trained on noisy speech.
Submitted 21 February, 2024; v1 submitted 27 September, 2023;
originally announced September 2023.
-
Multichannel Voice Trigger Detection Based on Transform-average-concatenate
Authors:
Takuya Higuchi,
Avamarie Brueggeman,
Masood Delfarah,
Stephen Shum
Abstract:
Voice triggering (VT) enables users to activate their devices by just speaking a trigger phrase. A front-end system is typically used to perform speech enhancement and/or separation, and produces multiple enhanced and/or separated signals. Since conventional VT systems take only single-channel audio as input, channel selection is performed. A drawback of this approach is that unselected channels are discarded, even if the discarded channels could contain useful information for VT. In this work, we propose multichannel acoustic models for VT, where the multichannel output from the front-end is fed directly into a VT model. We adopt a transform-average-concatenate (TAC) block and modify the TAC block by incorporating the channel from the conventional channel selection so that the model can attend to a target speaker when multiple speakers are present. The proposed approach achieves up to 30% reduction in the false rejection rate compared to the baseline channel selection approach.
Submitted 13 February, 2024; v1 submitted 27 September, 2023;
originally announced September 2023.
-
Improving Voice Trigger Detection with Metric Learning
Authors:
Prateeth Nayak,
Takuya Higuchi,
Anmol Gupta,
Shivesh Ranjan,
Stephen Shum,
Siddharth Sigtia,
Erik Marchi,
Varun Lakshminarasimhan,
Minsik Cho,
Saurabh Adya,
Chandra Dhir,
Ahmed Tewfik
Abstract:
Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task. However, such a speaker independent voice trigger detector typically suffers from performance degradation on speech from underrepresented groups, such as accented speakers. In this work, we propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy. Our proposed model employs an encoder-decoder architecture. While the encoder performs speaker independent voice trigger detection, similar to the conventional detector, the decoder predicts a personalized embedding for each utterance. A personalized voice trigger score is then obtained as a similarity score between the embeddings of enrollment utterances and a test utterance. The personalized embedding allows adapting to the target speaker's speech when computing the voice trigger score, hence improving voice trigger detection accuracy. Experimental results show that the proposed approach achieves a 38% relative reduction in the false rejection rate (FRR) compared to a baseline speaker independent voice trigger model.
Submitted 13 September, 2022; v1 submitted 5 April, 2022;
originally announced April 2022.
-
Multi-task Learning with Cross Attention for Keyword Spotting
Authors:
Takuya Higuchi,
Anmol Gupta,
Chandra Dhir
Abstract:
Keyword spotting (KWS) is an important technique for speech applications, which enables users to activate devices by speaking a keyword phrase. Although a phoneme classifier can be used for KWS, exploiting a large amount of transcribed data for automatic speech recognition (ASR), there is a mismatch between the training criterion (phoneme recognition) and the target task (KWS). Recently, multi-task learning has been applied to KWS to exploit both ASR and KWS training data. In this approach, an output of an acoustic model is split into two branches for the two tasks, one for phoneme transcription trained with the ASR data and one for keyword classification trained with the KWS data. In this paper, we introduce a cross attention decoder in the multi-task learning framework. Unlike the conventional multi-task learning approach with the simple split of the output layer, the cross attention decoder summarizes information from a phonetic encoder by performing cross attention between the encoder outputs and a trainable query sequence to predict a confidence score for the KWS task. Experimental results on KWS tasks show that the proposed approach achieves a 12% relative reduction in the false reject ratios compared to the conventional multi-task learning with split branches and a bi-directional long short-term memory decoder.
Submitted 22 September, 2021; v1 submitted 15 July, 2021;
originally announced July 2021.
-
Dynamic curriculum learning via data parameters for noise robust keyword spotting
Authors:
Takuya Higuchi,
Shreyas Saxena,
Mehrez Souden,
Tien Dung Tran,
Masood Delfarah,
Chandra Dhir
Abstract:
We propose dynamic curriculum learning via data parameters for noise robust keyword spotting. Data parameter learning has recently been introduced for image processing, where weight parameters, so-called data parameters, for target classes and instances are introduced and optimized along with model parameters. The data parameters scale logits and control importance over classes and instances during training, which enables automatic curriculum learning without additional annotations for training data. Similarly, in this paper, we propose using this curriculum learning approach for acoustic modeling, and train an acoustic model on clean and noisy utterances with the data parameters. The proposed approach automatically learns the difficulty of the classes and instances, e.g. due to low speech to noise ratio (SNR), in the gradient descent optimization and performs curriculum learning. This curriculum learning leads to overall improvement of the accuracy of the acoustic model. We evaluate the effectiveness of the proposed approach on a keyword spotting task. Experimental results show a 7.7% relative reduction in false reject ratio with the data parameters compared to a baseline model which is simply trained on the multiconditioned dataset.
Submitted 18 February, 2021;
originally announced February 2021.
-
Hybrid Vehicular and Cloud Distributed Computing: A Case for Cooperative Perception
Authors:
Enes Krijestorac,
Agon Memedi,
Takamasa Higuchi,
Seyhan Ucar,
Onur Altintas,
Danijela Cabric
Abstract:
In this work, we propose the use of hybrid offloading of computing tasks simultaneously to edge servers (vertical offloading) via LTE communication and to nearby cars (horizontal offloading) via V2V communication, in order to increase the rate at which tasks are processed compared to local processing. Our main contribution is an optimized resource assignment and scheduling framework for hybrid offloading of computing tasks. The framework optimally utilizes the computational resources in the edge and in the micro cloud, while taking into account communication constraints and task requirements. While cooperative perception is the primary use case of our framework, the framework is applicable to other cooperative vehicular applications with high computing demand and significant transmission overhead. The framework is tested in a simulated environment built on top of car traces and communication rates exported from the Veins vehicular networking simulator. We observe a significant increase in the processing rate of cooperative perception sensor frames when hybrid offloading with optimized resource assignment is adopted. Furthermore, the processing rate increases with V2V connectivity as more computing tasks can be offloaded horizontally.
Submitted 8 October, 2020;
originally announced October 2020.
-
Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection
Authors:
Takuya Higuchi,
Mohammad Ghasemzadeh,
Kisun You,
Chandra Dhir
Abstract:
We propose a stacked 1D convolutional neural network (S1DCNN) for end-to-end small footprint voice trigger detection in a streaming scenario. Voice trigger detection is an important speech application, with which users can activate their devices by simply saying a keyword or phrase. Due to privacy and latency reasons, a voice trigger detection system should run on an always-on processor on device. Therefore, having small memory and compute cost is crucial for a voice trigger detection system. Recently, singular value decomposition filters (SVDFs) have been used for end-to-end voice trigger detection. The SVDFs approximate a fully-connected layer with a low rank approximation, which reduces the number of model parameters. In this work, we propose S1DCNN as an alternative approach for end-to-end small-footprint voice trigger detection. An S1DCNN layer consists of a 1D convolution layer followed by a depth-wise 1D convolution layer. We show that the SVDF can be expressed as a special case of the S1DCNN layer. Experimental results show that the S1DCNN achieves 19.0% relative false reject ratio (FRR) reduction with a similar model size and a similar time delay compared to the SVDF. By using longer time delays, the S1DCNN further improves the FRR up to 12.2% relative.
Submitted 7 August, 2020;
originally announced August 2020.
-
Cooperative Perception with Deep Reinforcement Learning for Connected Vehicles
Authors:
Shunsuke Aoki,
Takamasa Higuchi,
Onur Altintas
Abstract:
Sensor-based perception on vehicles is becoming prevalent and important to enhance road safety. Autonomous driving systems use cameras, LiDAR, and radar to detect surrounding objects, while human-driven vehicles use them to assist the driver. However, the environmental perception by individual vehicles has limitations in coverage and/or detection accuracy. For example, a vehicle cannot detect objects occluded by other moving/static obstacles. In this paper, we present a cooperative perception scheme with deep reinforcement learning to enhance the detection accuracy for the surrounding objects. By using deep reinforcement learning to select the data to transmit, our scheme mitigates the network load in vehicular communication networks and enhances the communication reliability. To design, test, and verify the cooperative perception scheme, we develop a Cooperative & Intelligent Vehicle Simulation (CIVS) Platform, which integrates three software components: traffic simulator, vehicle simulator, and object classifier. Our evaluation shows that our scheme decreases packet loss and thereby increases the detection accuracy by up to 12%, compared to the baseline protocol.
Submitted 22 April, 2020;
originally announced April 2020.
-
Path Loss Models for V2V mmWave Communication: Performance Evaluation and Open Challenges
Authors:
Marco Giordani,
Takayuki Shimizu,
Andrea Zanella,
Takamasa Higuchi,
Onur Altintas,
Michele Zorzi
Abstract:
Recently, millimeter wave (mmWave) bands have been investigated as a means to enhance automated driving and address the challenging data rate and latency demands of emerging automotive applications. For the development of those systems to operate in bands above 6 GHz, there is a need to have accurate channel models able to predict the peculiarities of the vehicular propagation at these bands, especially as far as Vehicle-to-Vehicle (V2V) communications are concerned. In this paper, we validate the channel model that the 3GPP has proposed for NR-V2X systems, which (i) supports deployment scenarios for urban/highway propagation, and (ii) incorporates the effects of path loss, shadowing, line of sight probability, and static/dynamic blockage attenuation. We also exemplify the impact of several automotive-specific parameters on the overall network performance considering realistic system-level simulation assumptions for typical scenarios. Finally, we highlight potential inconsistencies of the model and provide recommendations for future measurement campaigns in vehicular environments.
Submitted 23 July, 2019;
originally announced July 2019.
-
A Framework to Assess Value of Information in Future Vehicular Networks
Authors:
Marco Giordani,
Takamasa Higuchi,
Andrea Zanella,
Onur Altintas,
Michele Zorzi
Abstract:
Vehicles are becoming increasingly intelligent and connected, incorporating more and more sensors to support safer and more efficient driving. The large volume of data generated by such sensors, however, will likely saturate the capacity of vehicular communication technologies, making it challenging to guarantee the required quality of service. In this perspective, it is essential to assess the value of information (VoI) provided by each data source, to prioritize the transmissions that have the greatest importance for the target applications. In this paper, we propose and evaluate a framework that uses analytic hierarchy multicriteria decision processes to predict VoI based on space, time, and quality attributes. Our results shed light on the impact of the propagation scenario, the sensor resolution, the type of observation, and the communication distance on the value assessment performance. In particular, we show that VoI evolves at different rates as a function of the target application's characteristics.
Submitted 22 May, 2019;
originally announced May 2019.