Showing 1–35 of 35 results for author: Siniscalchi, S M

  1. arXiv:2406.15862  [pdf, other]

    cs.CL

    Speech Analysis of Language Varieties in Italy

    Authors: Moreno La Quatra, Alkis Koudounas, Elena Baralis, Sabato Marco Siniscalchi

    Abstract: Italy exhibits rich linguistic diversity across its territory due to the distinct regional languages spoken in different areas. Recent advances in self-supervised learning provide new opportunities to analyze Italy's linguistic varieties using speech data alone. This includes the potential to leverage representations learned from large amounts of data to better examine nuances between closely rela…

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: Accepted to LREC-COLING 2024 - https://aclanthology.org/2024.lrec-main.1317/

  2. arXiv:2406.02488  [pdf, other]

    eess.AS cs.CL cs.SD

    Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition

    Authors: Hao Yen, Pin-Jui Ku, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose a novel language-universal approach to end-to-end automatic spoken keyword recognition (SKR) leveraging upon (i) a self-supervised pre-trained model, and (ii) a set of universal speech attributes (manner and place of articulation). Specifically, Wav2Vec2.0 is used to generate robust speech representations, followed by a linear output layer to produce attribute sequences. A non-trainable…

    Submitted 4 June, 2024; originally announced June 2024.

  3. arXiv:2405.06573  [pdf, other]

    cs.SD cs.AI eess.AS

    An Investigation of Incorporating Mamba for Speech Enhancement

    Authors: Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao

    Abstract: This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. We exploit a Mamba-based regression model to characterize speech signals and build an SE system upon Mamba, termed SEMamba. We explore the properties of Mamba by integrating it as the core model in both basic and advanced SE systems, along with utilizing signal-level distances as well as metric…

    Submitted 10 May, 2024; originally announced May 2024.

  4. arXiv:2405.00934  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Benchmarking Representations for Speech, Music, and Acoustic Events

    Authors: Moreno La Quatra, Alkis Koudounas, Lorenzo Vaiani, Elena Baralis, Luca Cagliero, Paolo Garza, Sabato Marco Siniscalchi

    Abstract: Limited diversity in standardized benchmarks for evaluating audio representation learning (ARL) methods may hinder systematic comparison of current methods' capabilities. We present ARCH, a comprehensive benchmark for evaluating ARL methods on diverse audio classification domains, covering acoustic events, music, and speech. ARCH comprises 12 datasets, that allow us to thoroughly assess pre-traine…

    Submitted 1 May, 2024; originally announced May 2024.

  5. arXiv:2402.05457  [pdf, other]

    cs.CL cs.AI cs.MM cs.SD eess.AS

    It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

    Authors: Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Ensiong Chng, Chao-Han Huck Yang

    Abstract: Recent studies have successfully shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output. Specifically, an LLM is utilized to carry out a direct mapping from the N-best hypotheses list generated by an ASR system to the predicted output transcription. However, despite its effectiveness, GER introd…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted to ICLR 2024, 17 pages. This work will be open sourced under MIT license

  6. arXiv:2401.13766  [pdf, ps, other]

    eess.AS cs.SD

    Bayesian adaptive learning to latent variables via Variational Bayes and Maximum a Posteriori

    Authors: Hu Hu, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: In this work, we aim to establish a Bayesian adaptive learning framework by focusing on estimating latent variables in deep neural network (DNN) models. Latent variables indeed encode both transferable distributional information and structural relationships. Thus the distributions of the source latent variables (prior) can be combined with the knowledge learned from the target data (likelihood) to…

    Submitted 24 January, 2024; originally announced January 2024.

    Comments: ASRU2023 Bayesian Symposium. arXiv admin note: text overlap with arXiv:2110.08598

  7. arXiv:2310.13013  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Generative error correction for code-switching speech recognition using large language models

    Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Hexin Liu, Sabato Marco Siniscalchi, Eng Siong Chng

    Abstract: Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), CS-ASR is still a challenging task owing to the grammatical structure complexity of the phenomenon and the data scarcity of specific training corpus. In this work, we propose to leverage large language models (LLMs) and lis…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP2024

  8. arXiv:2309.15701  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

    Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, Eng Siong Chng

    Abstract: Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuit…

    Submitted 16 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to NeurIPS 2023, 24 pages. Datasets and Benchmarks Track. Added the first Mandarin and code-switching (zh-cn and en-us) results from the LLM-based generative ASR error correction to Table 8 on Page 21

  9. arXiv:2309.08828  [pdf, other]

    eess.AS cs.SD

    Boosting End-to-End Multilingual Phoneme Recognition through Exploiting Universal Speech Attributes Constraints

    Authors: Hao Yen, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose a first step toward multilingual end-to-end automatic speech recognition (ASR) by integrating knowledge about speech articulators. The key idea is to leverage a rich set of fundamental units that can be defined "universally" across all spoken languages, referred to as speech attributes, namely manner and place of articulation. Specifically, several deterministic attribute-to-phoneme map…

    Submitted 15 September, 2023; originally announced September 2023.

  10. arXiv:2309.08348  [pdf, other]

    eess.AS cs.SD

    The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

    Authors: Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang, Hongbo Lan, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao

    Abstract: Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures

  11. arXiv:2307.06701  [pdf, other]

    cs.CV cs.AI cs.LG

    S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction

    Authors: Mohammad Adiban, Kalin Stefanov, Sabato Marco Siniscalchi, Giampiero Salvi

    Abstract: We address the video prediction task by putting forth a novel model that combines (i) our recently proposed hierarchical residual vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel spatiotemporal PixelCNN (ST-PixelCNN). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic capab…

    Submitted 11 June, 2024; v1 submitted 13 July, 2023; originally announced July 2023.

    Comments: 14 pages, 7 figures, 3 tables. Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence on 2023-07-12

    ACM Class: I.2.10; I.4.10; I.4.5; I.4.2; I.2.6

  12. How word semantics and phonology affect handwriting of Alzheimer's patients: a machine learning based analysis

    Authors: Nicole Dalia Cilia, Claudio De Stefano, Francesco Fontanella, Sabato Marco Siniscalchi

    Abstract: Using kinematic properties of handwriting to support the diagnosis of neurodegenerative disease is a real challenge: non-invasive detection techniques combined with machine learning approaches promise big steps forward in this research field. In the literature, the tasks proposed focused on different cognitive skills to elicit handwriting movements. In particular, the meaning and phonology of words…

    Submitted 6 July, 2023; originally announced July 2023.

    Journal ref: Computers in Biology and Medicine 169 (2024) 107891

  13. arXiv:2306.00331  [pdf, other]

    eess.AS cs.AI cs.SD eess.SP eess.SY

    A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models

    Authors: Pin-Jui Ku, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose a multi-dimensional structured state space (S4) approach to speech enhancement. To better capture the spectral dependencies across the frequency axis, we focus on modifying the multi-dimensional S4 layer with whitening transformation to build new small-footprint models that also achieve good performance. We explore several S4-based deep architectures in time (T) and time-frequency (TF)…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech 2023. Code will be released at https://github.com/Kuray107/S4ND-U-Net_speech_enhancement

  14. arXiv:2305.11360  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Differentially Private Adapters for Parameter Efficient Acoustic Modeling

    Authors: Chun-Wei Ho, Chao-Han Huck Yang, Sabato Marco Siniscalchi

    Abstract: In this work, we devise a parameter-efficient solution to bring differential privacy (DP) guarantees into adaptation of a cross-lingual speech classifier. We investigate a new frozen pre-trained adaptation framework for DP-preserving speech modeling without full model fine-tuning. First, we introduce a noisy teacher-student ensemble into a conventional adaptation scheme leveraging a frozen pre-tra…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023. Code will be available at: https://github.com/Chun-wei-Ho/Private-Speech-Adapter. The authors would like to express their gratitude to Prof. Chin-Hui Lee from Georgia Tech for providing helpful insights and suggestions

  15. arXiv:2211.01263  [pdf, other]

    cs.SD cs.LG eess.AS quant-ph

    A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition

    Authors: Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose a quantum kernel learning (QKL) framework to address the inherent data sparsity issues often encountered in training large-scale acoustic models in low-resource scenarios. We project acoustic features based on classical-to-quantum feature encoding. Different from existing quantum convolution techniques, we utilize QKL with features in the quantum space to design kernel-based classifiers…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  16. arXiv:2211.01189  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    Inference and Denoise: Causal Inference-based Neural Speech Enhancement

    Authors: Tsun-An Hsieh, Chao-Han Huck Yang, Pin-Yu Chen, Sabato Marco Siniscalchi, Yu Tsao

    Abstract: This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention. Based on the potential outcome framework, the proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement module…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  17. arXiv:2210.06382  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    An Ensemble Teacher-Student Learning Approach with Poisson Sub-sampling to Differential Privacy Preserving Speech Recognition

    Authors: Chao-Han Huck Yang, Jun Qi, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose an ensemble learning framework with Poisson sub-sampling to effectively train a collection of teacher models to issue some differential privacy (DP) guarantee for training data. Through boosting under DP, a student model derived from the training data suffers little model degradation from the models trained with no privacy protection. Our proposed solution leverages upon two mechanisms,…

    Submitted 12 October, 2022; originally announced October 2022.

    Comments: Accepted to ISCA, ISCSLP 2022, Singapore. 5 Pages

  18. arXiv:2210.05614  [pdf, other]

    cs.SD cs.LG cs.NE eess.AS

    An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition

    Authors: Chao-Han Huck Yang, I-Fan Chen, Andreas Stolcke, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: Differential privacy (DP) is one data protection avenue to safeguard user information used for training deep models by imposing noisy distortion on privacy data. Such a noise perturbation often results in a severe performance degradation in automatic speech recognition (ASR) in order to meet a privacy budget $\varepsilon$. Private aggregation of teacher ensemble (PATE) utilizes ensemble probabilit…

    Submitted 13 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: 5 pages. Accepted to IEEE SLT 2022. A first version draft was finished in Aug 2021

  19. arXiv:2208.04554  [pdf, other]

    cs.CV cs.LG

    Hierarchical Residual Learning Based Vector Quantized Variational Autoencoder for Image Reconstruction and Generation

    Authors: Mohammad Adiban, Kalin Stefanov, Sabato Marco Siniscalchi, Giampiero Salvi

    Abstract: We propose a multi-layer variational autoencoder method, we call HR-VQVAE, that learns hierarchical discrete representations of the data. By utilizing a novel objective function, each layer in HR-VQVAE learns a discrete representation of the residual from previous layers through a vector quantized encoder. Furthermore, the representations at each layer are hierarchically linked to those at previou…

    Submitted 9 August, 2022; originally announced August 2022.

    Comments: 12 pages plus supplementary material. Submitted to BMVC 2022

    ACM Class: I.4; I.2

  20. arXiv:2203.04114  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification

    Authors: Qing Wang, Jun Du, Siyuan Zheng, Yunqing Li, Yajian Wang, Yuzhong Wu, Hu Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Yannan Wang, Chin-Hui Lee

    Abstract: In this paper, we propose two techniques, namely joint modeling and data augmentation, to improve system performances for audio-visual scene classification (AVSC). We employ pre-trained networks trained only on image data sets to extract video embedding; whereas for audio embedding models, we decide to train them from scratch. We explore different neural network architectures for joint modeling to…

    Submitted 31 August, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

    Comments: 5 pages, 1 figure

  21. arXiv:2110.08598  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    A Variational Bayesian Approach to Learning Latent Variables for Acoustic Knowledge Transfer

    Authors: Hu Hu, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Chin-Hui Lee

    Abstract: We propose a variational Bayesian (VB) approach to learning distributions of latent variables in deep neural network (DNN) models for cross-domain knowledge transfer, to address acoustic mismatches between training and testing conditions. Instead of carrying out point estimation in conventional maximum a posteriori estimation with a risk of having a curse of dimensionality in estimating a huge num…

    Submitted 20 February, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Comments: Accepted to ICASSP 2022. Code is available at https://github.com/MihawkHu/ASC_Knowledge_Transfer

  22. arXiv:2110.03894  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition

    Authors: Hao Yen, Pin-Jui Ku, Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Yu Tsao

    Abstract: In this study, we propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR), and build an AR-SCR system. The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model (from the source domain). To solve the label mismatches between source and target domains, and further improve the stability of AR, w…

    Submitted 30 October, 2023; v1 submitted 8 October, 2021; originally announced October 2021.

    Comments: Accepted to Interspeech 2023. Code is available at: https://github.com/dodohow1011/SpeechAdvReprogram. Selected as Best Student Paper Candidate

  23. Exploring Retraining-Free Speech Recognition for Intra-sentential Code-Switching

    Authors: Zhen Huang, Xiaodan Zhuang, Daben Liu, Xiaoqiang Xiao, Yuchen Zhang, Sabato Marco Siniscalchi

    Abstract: In this paper, we present our initial efforts for building a code-switching (CS) speech recognition system leveraging existing acoustic models (AMs) and language models (LMs), i.e., no training required, and specifically targeting intra-sentential switching. To achieve such an ambitious goal, new mechanisms for foreign pronunciation generation and language model (LM) enrichment have been devised.…

    Submitted 27 August, 2021; originally announced September 2021.

    Journal ref: ICASSP2019 12-17 May 2019

  24. arXiv:2107.01461  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    A Lottery Ticket Hypothesis Framework for Low-Complexity Device-Robust Neural Acoustic Scene Classification

    Authors: Hao Yen, Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Qing Wang, Yuyang Wang, Xianjun Xia, Yuanjun Zhao, Yuzhong Wu, Yannan Wang, Jun Du, Chin-Hui Lee

    Abstract: We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC). Specifically, we tackle the ASC task in a low-resource environment leveraging a recently proposed advanced neural network pruning mechanism, namely Lottery Ticket Hypothesis (LTH), to find a sub-network neural model a…

    Submitted 1 May, 2022; v1 submitted 3 July, 2021; originally announced July 2021.

    Comments: 5 figures. DCASE 2021. The project started in November 2020. Revised version

  25. arXiv:2104.01271  [pdf, other]

    cs.SD cs.AI cs.LG cs.NE eess.AS

    PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification

    Authors: Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose using an adversarial autoencoder (AAE) to replace generative adversarial network (GAN) in the private aggregation of teacher ensembles (PATE), a solution for ensuring differential privacy in speech applications. The AAE architecture allows us to obtain good synthetic speech leveraging upon a discriminative training of latent vectors. Such synthetic speech is used to build a privacy-pres…

    Submitted 15 June, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

    Comments: Accepted to Interspeech 2021

    Journal ref: Proc. Interspeech 2021

  26. arXiv:2011.01447  [pdf, other]

    cs.SD cs.AI cs.LG cs.NE eess.AS

    A Two-Stage Approach to Device-Robust Acoustic Scene Classification

    Authors: Hu Hu, Chao-Han Huck Yang, Xianjun Xia, Xue Bai, Xin Tang, Yajian Wang, Shutong Niu, Li Chai, Juanjuan Li, Hongning Zhu, Feng Bao, Yuanjun Zhao, Sabato Marco Siniscalchi, Yannan Wang, Jun Du, Chin-Hui Lee

    Abstract: To improve device robustness, a highly desirable key feature of a competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed. Our two-stage system leverages on an ad-hoc score combination based on two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (i…

    Submitted 2 November, 2020; originally announced November 2020.

    Comments: Submitted to ICASSP 2021. Code available: https://github.com/MihawkHu/DCASE2020_task1

    Report number: 845--849

    Journal ref: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  27. arXiv:2010.13309  [pdf, other]

    cs.SD cs.LG cs.NE eess.AS quant-ph

    Decentralizing Feature Extraction with Quantum Convolutional Neural Network for Automatic Speech Recognition

    Authors: Chao-Han Huck Yang, Jun Qi, Samuel Yen-Chi Chen, Pin-Yu Chen, Sabato Marco Siniscalchi, Xiaoli Ma, Chin-Hui Lee

    Abstract: We propose a novel decentralized feature extraction approach in federated learning to address privacy-preservation issues for speech recognition. It is built upon a quantum convolutional neural network (QCNN) composed of a quantum circuit encoder for feature extraction, and a recurrent neural network (RNN) based end-to-end acoustic model (AM). To enhance model parameter protection in a decentraliz…

    Submitted 12 February, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: Accepted to IEEE ICASSP 2021. Code is available: https://github.com/huckiyang/QuantumSpeech-QCNN

    Journal ref: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  28. arXiv:2008.07281  [pdf, ps, other]

    eess.AS cs.LG cs.SD eess.SP stat.ML

    On Mean Absolute Error for Deep Neural Network Based Vector-to-Vector Regression

    Authors: Jun Qi, Jun Du, Sabato Marco Siniscalchi, Xiaoli Ma, Chin-Hui Lee

    Abstract: In this paper, we exploit the properties of mean absolute error (MAE) as a loss function for the deep neural network (DNN) based vector-to-vector regression. The goal of this work is two-fold: (i) presenting performance bounds of MAE, and (ii) demonstrating new properties of MAE that make it more appropriate than mean squared error (MSE) as a loss function for DNN based vector-to-vector regression…

    Submitted 12 August, 2020; originally announced August 2020.

    Journal ref: IEEE Signal Processing Letters, 2020

  29. arXiv:2008.05459  [pdf, other]

    cs.LG eess.SP stat.ML

    Analyzing Upper Bounds on Mean Absolute Errors for Deep Neural Network Based Vector-to-Vector Regression

    Authors: Jun Qi, Jun Du, Sabato Marco Siniscalchi, Xiaoli Ma, Chin-Hui Lee

    Abstract: In this paper, we show that, in vector-to-vector regression utilizing deep neural networks (DNNs), a generalized loss of mean absolute error (MAE) between the predicted and expected feature vectors is upper bounded by the sum of an approximation error, an estimation error, and an optimization error. Leveraging upon error decomposition techniques in statistical learning theory and non-convex optimi…

    Submitted 4 August, 2020; originally announced August 2020.

    Journal ref: IEEE Transactions on Signal Processing, Vol 68, pp. 3411-3422, 2020

  30. arXiv:2008.00110  [pdf, other]

    eess.AS cs.CL cs.SD

    Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification

    Authors: Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Chin-Hui Lee

    Abstract: In this paper, we propose a domain adaptation framework to address the device mismatch issue in acoustic scene classification leveraging upon neural label embedding (NLE) and relational teacher student learning (RTSL). Taking into account the structural relationships between acoustic scene classes, our proposed framework captures such relationships which are intrinsically device-independent. In th…

    Submitted 31 July, 2020; originally announced August 2020.

    Comments: Accepted by Interspeech 2020

  31. arXiv:2008.00107  [pdf, other]

    eess.AS cs.CL cs.SD

    An Acoustic Segment Model Based Segment Unit Selection Approach to Acoustic Scene Classification with Partial Utterances

    Authors: Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Xue Bai, Jun Du, Chin-Hui Lee

    Abstract: In this paper, we propose a sub-utterance unit selection framework to remove acoustic segments in audio recordings that carry little information for acoustic scene classification (ASC). Our approach is built upon a universal set of acoustic segment units covering the overall acoustic scene space. First, those units are modeled with acoustic segment models (ASMs) used to tokenize acoustic scene utt…

    Submitted 31 July, 2020; originally announced August 2020.

    Comments: Accepted by Interspeech 2020

  32. arXiv:2007.13024  [pdf, other]

    eess.AS cs.CL cs.LG cs.NE cs.SD

    Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement

    Authors: Jun Qi, Hu Hu, Yannan Wang, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: This paper investigates different trade-offs between the number of model parameters and enhanced speech qualities by employing several deep tensor-to-vector regression models for speech enhancement. We find that a hybrid architecture, namely CNN-TT, is capable of maintaining a good quality performance with a reduced model parameter size. CNN-TT is composed of several convolutional layers at the bo…

    Submitted 2 August, 2020; v1 submitted 25 July, 2020; originally announced July 2020.

    Comments: Accepted to InterSpeech 2020

  33. arXiv:2007.08389  [pdf, other]

    eess.AS cs.LG cs.SD

    Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

    Authors: Hu Hu, Chao-Han Huck Yang, Xianjun Xia, Xue Bai, Xin Tang, Yajian Wang, Shutong Niu, Li Chai, Juanjuan Li, Hongning Zhu, Feng Bao, Yuanjun Zhao, Sabato Marco Siniscalchi, Yannan Wang, Jun Du, Chin-Hui Lee

    Abstract: In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns with cla…

    Submitted 26 August, 2020; v1 submitted 16 July, 2020; originally announced July 2020.

    Comments: Revised Technical Report. Proposed systems attain 2nd place in both Task-1a and Task-1b in the official DCASE challenge 2020

  34. arXiv:2002.00544  [pdf, other]

    eess.AS cs.CL cs.LG cs.NE cs.SD

    Tensor-to-Vector Regression for Multi-channel Speech Enhancement based on Tensor-Train Network

    Authors: Jun Qi, Hu Hu, Yannan Wang, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose a tensor-to-vector regression approach to multi-channel speech enhancement in order to address the issue of input size explosion and hidden-layer size expansion. The key idea is to cast the conventional deep neural network (DNN) based vector-to-vector regression formulation under a tensor-train network (TTN) framework. TTN is a recently emerged solution for compact representation of dee…

    Submitted 2 February, 2020; originally announced February 2020.

    Comments: Accepted to ICASSP 2020. Update reproducible code

    Journal ref: IEEE ICASSP 2020

  35. arXiv:1503.02108  [pdf, other]

    cs.LG cs.CL cs.NE

    Maximum a Posteriori Adaptation of Network Parameters in Deep Models

    Authors: Zhen Huang, Sabato Marco Siniscalchi, I-Fan Chen, Jiadong Wu, Chin-Hui Lee

    Abstract: We present a Bayesian approach to adapting parameters of a well-trained context-dependent, deep-neural-network, hidden Markov model (CD-DNN-HMM) to improve automatic speech recognition performance. Given an abundance of DNN parameters but with only a limited amount of data, the effectiveness of the adapted DNN model can often be compromised. We formulate maximum a posteriori (MAP) adaptation of pa…

    Submitted 12 August, 2015; v1 submitted 6 March, 2015; originally announced March 2015.