Showing 1–50 of 94 results for author: Tan, Z

  1. arXiv:2406.06295  [pdf, other]

    cs.SD eess.AS

    Zero-Shot Audio Captioning Using Soft and Hard Prompts

    Authors: Yiming Zhang, Xuenan Xu, Ruoyi Du, Haohe Liu, Yuan Dong, Zheng-Hua Tan, Wenwu Wang, Zhanyu Ma

    Abstract: In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these model…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing

  2. arXiv:2406.06160  [pdf, other]

    eess.AS

    The Effect of Training Dataset Size on Discriminative and Diffusion-Based Speech Enhancement Systems

    Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

    Abstract: The performance of deep neural network-based speech enhancement systems typically increases with the training dataset size. However, studies that investigated the effect of training dataset size on speech enhancement performance did not consider recent approaches, such as diffusion-based generative models. Diffusion models are typically trained with massive datasets for image generation tasks, but…

    Submitted 10 June, 2024; originally announced June 2024.

  3. arXiv:2406.05270  [pdf]

    physics.med-ph cs.CV cs.LG eess.IV

    fastMRI Breast: A publicly available radial k-space dataset of breast dynamic contrast-enhanced MRI

    Authors: Eddy Solomon, Patricia M. Johnson, Zhengguo Tan, Radhika Tibrewala, Yvonne W. Lui, Florian Knoll, Linda Moy, Sungheon Gene Kim, Laura Heacock

    Abstract: This data curation work introduces the first large-scale dataset of radial k-space and DICOM data for breast DCE-MRI acquired in diagnostic breast MRI exams. Our dataset includes case-level labels indicating patient age, menopause status, lesion status (negative, benign, and malignant), and lesion type for each case. The public availability of this dataset and accompanying reconstruction code will…

    Submitted 7 June, 2024; originally announced June 2024.

  4. arXiv:2406.02178  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations

    Authors: Sarthak Yadav, Zheng-Hua Tan

    Abstract: Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations. One such approach is selective state space models, which have demonstrated promising results for language modelling. However, their feasibility for learning self-supervised, general-purpose audio representations is yet to be investigated. T…

    Submitted 7 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  5. Misaka: Interactive Swarm Testbed for Smart Grid Distributed Algorithm Test and Evaluation

    Authors: Tingliang Zhang, Haiwang Zhong, Zhenfei Tan, Xinfei Yan

    Abstract: In this paper, we present Misaka, a visualized swarm testbed for smart grid algorithm evaluation, also an extendable open-source open-hardware platform for developing tabletop tangible swarm interfaces. The platform consists of a collection of custom-designed 3 omni-directional wheels robots each 10 cm in diameter, high accuracy localization through a microdot pattern overlaid on top of the activi…

    Submitted 25 April, 2024; originally announced April 2024.

    Journal ref: 2020 IEEE/IAS Industrial and Commercial Power System Asia (I&CPS Asia)

  6. arXiv:2403.18560  [pdf, other]

    eess.AS cs.LG cs.SD

    Noise-Robust Keyword Spotting through Self-supervised Pretraining

    Authors: Jacob Mørk, Holger Severin Bovbjerg, Gergely Kiss, Zheng-Hua Tan

    Abstract: Voice assistants are now widely available, and to activate them a keyword spotting (KWS) algorithm is used. Modern KWS systems are mainly trained using supervised learning methods and require a large amount of labelled data to achieve a good performance. Leveraging unlabelled data through self-supervised learning (SSL) has been shown to increase the accuracy in clean conditions. This paper explore…

    Submitted 27 March, 2024; originally announced March 2024.

    MSC Class: 68T10 ACM Class: I.2.6

  7. arXiv:2403.17701   

    eess.IV cs.CV cs.LG

    Rotate to Scan: UNet-like Mamba with Triplet SSM Module for Medical Image Segmentation

    Authors: Hao Tang, Lianglun Cheng, Guoheng Huang, Zhengguang Tan, Junhao Lu, Kaihong Wu

    Abstract: Image segmentation holds a vital position in the realms of diagnosis and treatment within the medical domain. Traditional convolutional neural networks (CNNs) and Transformer models have made significant advancements in this realm, but they still encounter challenges because of limited receptive field or high computing complexity. Recently, State Space Models (SSMs), particularly Mamba and its var…

    Submitted 3 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: Experimental method encountered errors, undergoing experiment again

  8. How to train your ears: Auditory-model emulation for large-dynamic-range inputs and mild-to-severe hearing losses

    Authors: Peter Leer, Jesper Jensen, Zheng-Hua Tan, Jan Østergaard, Lars Bramsløw

    Abstract: Advanced auditory models are useful in designing signal-processing algorithms for hearing-loss compensation or speech enhancement. Such auditory models provide rich and detailed descriptions of the auditory pathway, and might allow for individualization of signal-processing strategies, based on physiological measurements. However, these auditory models are often computationally demanding, requirin…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. This version is the authors' version and may vary from the final publication in details

  9. arXiv:2403.10420  [pdf, other]

    eess.AS

    Neural Networks Hear You Loud And Clear: Hearing Loss Compensation Using Deep Neural Networks

    Authors: Peter Leer, Jesper Jensen, Laurel Carney, Zheng-Hua Tan, Jan Østergaard, Lars Bramsløw

    Abstract: This article investigates the use of deep neural networks (DNNs) for hearing-loss compensation. Hearing loss is a prevalent issue affecting millions of people worldwide, and conventional hearing aids have limitations in providing satisfactory compensation. DNNs have shown remarkable performance in various auditory tasks, including speech recognition, speaker identification, and music classificatio…

    Submitted 15 March, 2024; originally announced March 2024.

  10. arXiv:2403.03675  [pdf, other]

    cs.IT eess.SP

    ZF Beamforming Tensor Compression for Massive MIMO Fronthaul

    Authors: Libin Zheng, Zihao Wang, Minru Bai, Zhenjie Tan

    Abstract: In the rapidly evolving landscape of 5G and beyond 5G (B5G) mobile cellular communications, efficient data compression and reconstruction strategies become paramount, especially in massive multiple-input multiple-output (MIMO) systems. A critical challenge in these systems is the capacity-limited fronthaul, particularly in the context of the Ethernet-based common public radio interface (eCPRI) con…

    Submitted 6 March, 2024; originally announced March 2024.

  11. arXiv:2402.02327  [pdf, other]

    cs.CV cs.SD eess.AS

    Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

    Authors: Tianxiang Chen, Zhentao Tan, Tao Gong, Qi Chu, Yue Wu, Bin Liu, Le Lu, Jieping Ye, Nenghai Yu

    Abstract: How to effectively interact audio with vision has garnered considerable interest within the multi-modality research field. Recently, a novel audio-visual segmentation (AVS) task has been proposed, aiming to segment the sounding objects in video frames under the guidance of audio cues. However, most existing AVS methods are hindered by a modality imbalance where the visual features tend to dominate…

    Submitted 6 February, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

  12. arXiv:2312.16613  [pdf, other]

    cs.SD cs.LG eess.AS

    Self-supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions

    Authors: Holger Severin Bovbjerg, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan

    Abstract: In this paper, we propose the use of self-supervised pretraining on a large unlabelled data set to improve the performance of a personalized voice activity detection (VAD) model in adverse conditions. We pretrain a long short-term memory (LSTM)-encoder using the autoregressive predictive coding (APC) framework and fine-tune it for personalized VAD. We also propose a denoising variant of APC, with…

    Submitted 23 January, 2024; v1 submitted 27 December, 2023; originally announced December 2023.

    Comments: To be published at ICASSP2024, 14th of April 2024, Seoul, South Korea. Copyright (c) 2023 IEEE. 5 pages, 2 figures, 5 tables

    MSC Class: 68T10 ACM Class: I.2.6

  13. arXiv:2312.04370  [pdf, other]

    eess.AS cs.LG cs.SD

    Investigating the Design Space of Diffusion Models for Speech Enhancement

    Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

    Abstract: Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals…

    Submitted 7 December, 2023; originally announced December 2023.

  14. arXiv:2312.02683  [pdf, other]

    eess.AS cs.LG cs.SD

    Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler

    Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

    Abstract: Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models. However, this was investigated with a single database for training and another one for testing, which makes the results highly dependent on t…

    Submitted 16 January, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2024

  15. arXiv:2310.04369  [pdf, other]

    cs.SD cs.LG eess.AS

    MBTFNet: Multi-Band Temporal-Frequency Neural Network For Singing Voice Enhancement

    Authors: Weiming Xu, Zhouxuan Chen, Zhili Tan, Shubo Lv, Runduo Han, Wenjiang Zhou, Weifeng Zhao, Lei Xie

    Abstract: A typical neural speech enhancement (SE) approach mainly handles speech and noise mixtures, which is not optimal for singing voice enhancement scenarios. Music source separation (MSS) models treat vocals and various accompaniment components equally, which may reduce performance compared to the model that only considers vocal enhancement. In this paper, we propose a novel multi-band temporal-freque…

    Submitted 6 October, 2023; originally announced October 2023.

  16. arXiv:2309.16390  [pdf, other]

    cs.CV cs.RO eess.IV

    An Enhanced Low-Resolution Image Recognition Method for Traffic Environments

    Authors: Zongcai Tan, Zhenhai Gao

    Abstract: Currently, low-resolution image recognition is confronted with a significant challenge in the field of intelligent traffic perception. Compared to high-resolution images, low-resolution images suffer from small size, low quality, and lack of detail, leading to a notable decrease in the accuracy of traditional neural network recognition algorithms. The key to low-resolution image recognition lies i…

    Submitted 28 September, 2023; originally announced September 2023.

  17. arXiv:2309.11243  [pdf, other]

    eess.AS cs.SD

    Joint Minimum Processing Beamforming and Near-end Listening Enhancement

    Authors: Andreas J. Fuglsig, Jesper Jensen, Zheng-Hua Tan, Lars S. Bertelsen, Jens Christian Lindof, Jan Østergaard

    Abstract: We consider speech enhancement for signals picked up in one noisy environment that must be rendered to a listener in another noisy environment. For both far-end noise reduction and near-end listening enhancement, it has been shown that excessive focus on noise suppression or intelligibility maximization may lead to excessive speech distortions and quality degradations in favorable noise conditions…

    Submitted 5 February, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

    Comments: Accepted at IEEE ICASSP 2024 Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA) 2024

  18. arXiv:2307.10495  [pdf, other]

    cs.LG cs.CV eess.SP

    Novel Batch Active Learning Approach and Its Application to Synthetic Aperture Radar Datasets

    Authors: James Chapman, Bohan Chen, Zheng Tan, Jeff Calder, Kevin Miller, Andrea L. Bertozzi

    Abstract: Active learning improves the performance of machine learning methods by judiciously selecting a limited number of unlabeled data points to query for labels, with the aim of maximally improving the underlying classifier's performance. Recent gains have been made using sequential active learning for synthetic aperture radar (SAR) data (arXiv:2204.00005). In each iteration, sequential active learning s…

    Submitted 19 July, 2023; originally announced July 2023.

    Comments: 16 pages, 7 figures, Preprint

    ACM Class: I.2.6; I.2.10; I.4.0; I.4.9

    Journal ref: Proc. SPIE. Algorithms for Synthetic Aperture Radar Imagery XXX (Vol. 12520, pp. 96-111). 13 June 2023

  19. arXiv:2307.05893  [pdf, ps, other]

    eess.SP cs.LG

    Deep Unrolling for Nonconvex Robust Principal Component Analysis

    Authors: Elizabeth Z. C. Tan, Caroline Chaux, Emmanuel Soubies, Vincent Y. F. Tan

    Abstract: We design algorithms for Robust Principal Component Analysis (RPCA) which consists in decomposing a matrix into the sum of a low rank matrix and a sparse matrix. We propose a deep unrolled algorithm based on an accelerated alternating projection algorithm which aims to solve RPCA in its nonconvex form. The proposed procedure combines benefits of deep neural networks and the interpretability of the…

    Submitted 11 July, 2023; originally announced July 2023.

    Comments: 7 pages, 3 figures; Accepted to the 2023 IEEE International Workshop on Machine Learning for Signal Processing

  20. arXiv:2306.00561  [pdf, other]

    cs.SD cs.AI eess.AS

    Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners

    Authors: Sarthak Yadav, Sergios Theodoridis, Lars Kai Hansen, Zheng-Hua Tan

    Abstract: In this work, we propose a Multi-Window Masked Autoencoder (MW-MAE) fitted with a novel Multi-Window Multi-Head Attention (MW-MHA) module that facilitates the modelling of local-global interactions in every decoder transformer block through attention heads of several distinct local and global windows. Empirical results on ten downstream audio tasks show that MW-MAEs consistently outperform standar…

    Submitted 1 October, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

  21. arXiv:2306.00489  [pdf, other]

    cs.SD cs.AI eess.AS

    Speech inpainting: Context-based speech synthesis guided by video

    Authors: Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen

    Abstract: Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. Specifically, this paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted in Interspeech23

  22. arXiv:2303.15474  [pdf]

    cs.LG cs.AR cs.NE eess.SY

    A Heterogeneous Parallel Non-von Neumann Architecture System for Accurate and Efficient Machine Learning Molecular Dynamics

    Authors: Zhuoying Zhao, Ziling Tan, Pinghui Mo, Xiaonan Wang, Dan Zhao, Xin Zhang, Ming Tao, Jie Liu

    Abstract: This paper proposes a special-purpose system to achieve high-accuracy and high-efficiency machine learning (ML) molecular dynamics (MD) calculations. The system consists of field programmable gate array (FPGA) and application specific integrated circuit (ASIC) working in heterogeneous parallelization. To be specific, a multiplication-less neural network (NN) is deployed on the non-von Neumann (NvN…

    Submitted 26 March, 2023; originally announced March 2023.

  23. arXiv:2303.12360  [pdf]

    cs.CV eess.IV

    Automatically Predict Material Properties with Microscopic Image Example Polymer Compatibility

    Authors: Zhilong Liang, Zhenzhi Tan, Ruixin Hong, Wanli Ouyang, Jinying Yuan, Changshui Zhang

    Abstract: Many material properties are manifested in the morphological appearance and characterized with microscopic image, such as scanning electron microscopy (SEM). Polymer miscibility is a key physical quantity of polymer material and commonly and intuitively judged by SEM images. However, human observation and judgement for the images is time-consuming, labor-intensive and hard to be quantified. Comput…

    Submitted 3 August, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

  24. Non-Iterative Solution for Coordinated Optimal Dispatch via Equivalent Projection-Part II: Method and Applications

    Authors: Zhenfei Tan, Zheng Yan, Haiwang Zhong, Qing Xia

    Abstract: This two-part paper develops a non-iterative coordinated optimal dispatch framework, i.e., free of iterative information exchange, via the innovation of the equivalent projection (EP) theory. The EP eliminates internal variables from technical and economic operation constraints of the subsystem and obtains an equivalent model with reduced scale, which is the key to the non-iterative coordinated op…

    Submitted 26 February, 2023; originally announced February 2023.

  25. Non-Iterative Solution for Coordinated Optimal Dispatch via Equivalent Projection-Part I: Theory

    Authors: Zhenfei Tan, Zheng Yan, Haiwang Zhong, Qing Xia

    Abstract: Coordinated optimal dispatch is of utmost importance for the efficient and secure operation of hierarchically structured power systems. Conventional coordinated optimization methods, such as the Lagrangian relaxation and Benders decomposition, require iterative information exchange among subsystems. Iterative coordination methods have drawbacks including slow convergence, risk of oscillation and d…

    Submitted 26 February, 2023; originally announced February 2023.

  26. arXiv:2211.10565  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting

    Authors: Iván López-Espejo, Ram C. M. C. Shekar, Zheng-Hua Tan, Jesper Jensen, John H. L. Hansen

    Abstract: In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS whenever the number of filterbank channels is severely decreased. Reducing the number of channels might yield certain KWS performance drop, but…

    Submitted 23 February, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

  27. arXiv:2211.08191  [pdf, other]

    eess.AS cs.LG

    Improved disentangled speech representations using contrastive learning in factorized hierarchical variational autoencoder

    Authors: Yuying Xie, Thomas Arildsen, Zheng-Hua Tan

    Abstract: Leveraging the fact that speaker identity and content vary on different time scales, factorized hierarchical variational autoencoder (FHVAE) uses different latent variables to symbolize these two attributes. Disentanglement of these attributes is carried out by different prior settings of the corresponding latent variables. For the prior of speaker identity variable, FHVAE assumes it is a Gaussian distribut…

    Submitted 14 June, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

    Comments: accepted by EUSIPCO 2023

  28. arXiv:2211.01621  [pdf, other]

    eess.AS cs.CR cs.LG cs.SD

    Leveraging Domain Features for Detecting Adversarial Attacks Against Deep Speech Recognition in Noise

    Authors: Christian Heider Nielsen, Zheng-Hua Tan

    Abstract: In recent years, significant progress has been made in deep model-based automatic speech recognition (ASR), leading to its widespread deployment in the real world. At the same time, adversarial attacks against deep ASR systems are highly successful. Various methods have been proposed to defend ASR systems from these attacks. However, existing classification based methods focus on the design of dee…

    Submitted 3 November, 2022; originally announced November 2022.

  29. arXiv:2210.17154  [pdf, other]

    eess.AS cs.SD eess.SP

    Minimum Processing Near-end Listening Enhancement

    Authors: Andreas Jonas Fuglsig, Jesper Jensen, Zheng-Hua Tan, Lars Søndergaard Bertelsen, Jens Christian Lindof, Jan Østergaard

    Abstract: The intelligibility and quality of speech from a mobile phone or public announcement system are often affected by background noise in the listening environment. By pre-processing the speech signal it is possible to improve the speech intelligibility and quality -- this is known as near-end listening enhancement (NLE). Although existing NLE techniques are able to greatly increase intelligibility i…

    Submitted 30 May, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

  30. arXiv:2210.01703  [pdf, other]

    cs.SD cs.HC cs.LG eess.AS

    Improving Label-Deficient Keyword Spotting Through Self-Supervised Pretraining

    Authors: Holger Severin Bovbjerg, Zheng-Hua Tan

    Abstract: Keyword Spotting (KWS) models are becoming increasingly integrated into various systems, e.g. voice assistants. To achieve satisfactory performance, these models typically rely on a large amount of labelled data, limiting their applications only to situations where such data is available. Self-supervised Learning (SSL) methods can mitigate such a reliance by leveraging readily-available unlabelled…

    Submitted 24 May, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

    Comments: To be published at ICASSP2023 Workshop on Self-supervision in Audio, Speech and Beyond, 10th of June 2023, Rhodes, Greece. Copyright (c) 2023 IEEE. 5 pages, 3 figures, 3 tables

    MSC Class: 68T10 ACM Class: I.2.6

  31. arXiv:2207.01691  [pdf, other]

    eess.AS

    Adversarial Multi-Task Deep Learning for Noise-Robust Voice Activity Detection with Low Algorithmic Delay

    Authors: Claus Meyer Larsen, Peter Koch, Zheng-Hua Tan

    Abstract: Voice Activity Detection (VAD) is an important pre-processing step in a wide variety of speech processing systems. VAD should in a practical application be able to detect speech in both noisy and noise-free environments, while not introducing significant latency. In this work we propose using an adversarial multi-task learning method when training a supervised VAD. The method has been applied to t…

    Submitted 4 July, 2022; originally announced July 2022.

  32. arXiv:2206.10750  [pdf, other]

    eess.SP cs.CV

    Floor Map Reconstruction Through Radio Sensing and Learning By a Large Intelligent Surface

    Authors: Cristian J. Vaca-Rubio, Roberto Pereira, Xavier Mestre, David Gregoratti, Zheng-Hua Tan, Elisabeth de Carvalho, Petar Popovski

    Abstract: Environmental scene reconstruction is of great interest for autonomous robotic applications, since an accurate representation of the environment is necessary to ensure safe interaction with robots. Equally important, it is also vital to ensure reliable communication between the robot and its controller. Large Intelligent Surface (LIS) is a technology that has been extensively studied due to its co…

    Submitted 21 June, 2022; originally announced June 2022.

  33. arXiv:2205.10321  [pdf, other]

    eess.SP cs.CV cs.LG

    User Localization using RF Sensing: A Performance comparison between LIS and mmWave Radars

    Authors: Cristian J. Vaca-Rubio, Dariush Salami, Petar Popovski, Elisabeth de Carvalho, Zheng-Hua Tan, Stephan Sigg

    Abstract: Since electromagnetic signals are omnipresent, Radio Frequency (RF)-sensing has the potential to become a universal sensing mechanism with applications in localization, smart-home, retail, gesture recognition, intrusion detection, etc. Two emerging technologies in RF-sensing, namely sensing through Large Intelligent Surfaces (LISs) and mmWave Frequency-Modulated Continuous-Wave (FMCW) radars, have…

    Submitted 17 May, 2022; originally announced May 2022.

  34. arXiv:2205.03229  [pdf]

    eess.SP physics.optics

    Multi-core fiber enabled fading noise suppression in φ-OFDR based quantitative distributed vibration sensing

    Authors: Yuxiang Feng, Weilin Xie, Yinxia Meng, Jiang Yang, Qiang Yang, Yan Ren, Tianwai Bo, Zhongwei Tan, Wei Wei, Yi Dong

    Abstract: Coherent fading has been regarded as a critical issue in phase-sensitive optical frequency domain reflectometry (φ-OFDR) based distributed fiber-optic sensing. Here, we report on an approach for fading noise suppression in φ-OFDR with multi-core fiber. By exploiting the independent nature of the randomness in the distribution of reflective index in each of the cores, the drastic phase fluctuations…

    Submitted 3 May, 2022; originally announced May 2022.

    Comments: 4 pages

  35. arXiv:2204.04456  [pdf, other]

    eess.SY

    Approximation-free control based on the bioinspired reference model for suspension systems with uncertainty and unknown nonlinearity

    Authors: Xiaoyan Hu, Guilin Wen, Shan Yin, Zhao Tan, Zebang Pan

    Abstract: Uncertainty and unknown nonlinearity are often inevitable in the suspension systems, which were often solved using fuzzy logic system (FLS) or neural networks (NNs). However, these methods are restricted by the structural complexity of the controller and the huge computing cost. Meanwhile, the estimation error of such approximators is affected by adopted adaptive laws and learning gains. Thus, in…

    Submitted 9 April, 2022; originally announced April 2022.

  36. arXiv:2204.02195  [pdf, other]

    eess.AS

    Complex Recurrent Variational Autoencoder with Application to Speech Enhancement

    Authors: Yuying Xie, Thomas Arildsen, Zheng-Hua Tan

    Abstract: As an extension of variational autoencoder (VAE), complex VAE uses complex Gaussian distributions to model latent variables and data. This work proposes a complex recurrent VAE framework, specifically in which complex-valued recurrent neural network and L1 reconstruction loss are used. Firstly, to account for the temporal property of speech signals, this work introduces complex-valued recurrent ne…

    Submitted 12 May, 2023; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  37. Disentangled Speech Representation Learning Based on Factorized Hierarchical Variational Autoencoder with Self-Supervised Objective

    Authors: Yuying Xie, Thomas Arildsen, Zheng-Hua Tan

    Abstract: Disentangled representation learning aims to extract explanatory features or factors and retain salient information. Factorized hierarchical variational autoencoder (FHVAE) presents a way to disentangle a speech signal into sequential-level and segmental-level features, which represent speaker identity and speech content information, respectively. As a self-supervised objective, autoregressive pre…

    Submitted 5 April, 2022; originally announced April 2022.

    Comments: Published in: 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP)

  38. openFEAT: Improving Speaker Identification by Open-set Few-shot Embedding Adaptation with Transformer

    Authors: Kishan K C, Zhenning Tan, Long Chen, Minho Jin, Eunjung Han, Andreas Stolcke, Chul Lee

    Abstract: Household speaker identification with few enrollment utterances is an important yet challenging problem, especially when household members share similar voice characteristics and room acoustics. A common embedding space learned from a large number of speakers is not universally applicable for the optimal identification of every speaker in a household. In this work, we first formulate household spe…

    Submitted 24 February, 2022; originally announced February 2022.

    Comments: To appear in Proc. IEEE ICASSP 2022

    Journal ref: Proc. IEEE ICASSP, May 2022, pp. 7062-7066

  39. arXiv:2202.05492  [pdf, other]

    eess.IV cs.CV

    Entroformer: A Transformer-based Entropy Model for Learned Image Compression

    Authors: Yichen Qian, Ming Lin, Xiuyu Sun, Zhiyu Tan, Rong Jin

    Abstract: One critical component in lossy deep image compression is the entropy model, which predicts the probability distribution of the quantized latent representation in the encoding and decoding modules. Previous works build entropy models upon convolutional neural networks which are inefficient in capturing global dependencies. In this work, we propose a novel transformer-based entropy model, termed En…

    Submitted 14 March, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

    Comments: Accepted at ICLR 2022 for poster. Camera ready version

    Journal ref: International Conference on Learning Representations (2022)

  40. arXiv:2202.03647  [pdf, other]

    cs.SD eess.AS

    Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

    Authors: Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, Siqi Zheng, Weilong Huang, Lei Xie, Zheng-Hua Tan, DeLiang Wang, Yanmin Qian, Kong Aik Lee, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu

    Abstract: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies. The M2MeT challenge has particularly set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Ma…

    Submitted 25 February, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

    Comments: Accepted by ICASSP 2022

  41. arXiv:2201.06426  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification

    Authors: Achintya kr. Sarkar, Zheng-Hua Tan

    Abstract: Deep representation learning has gained significant momentum in advancing text-dependent speaker verification (TD-SV) systems. When designing deep neural networks (DNN) for extracting bottleneck features, key considerations include training targets, activation functions, and loss functions. In this paper, we systematically study the impact of these choices on the performance of TD-SV. For training…

    Submitted 17 January, 2022; originally announced January 2022.

  42. arXiv:2111.10592  [pdf, other]

    cs.SD cs.HC cs.LG eess.AS

    Deep Spoken Keyword Spotting: An Overview

    Authors: Iván López-Espejo, Zheng-Hua Tan, John Hansen, Jesper Jensen

    Abstract: Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in te…

    Submitted 20 November, 2021; originally announced November 2021.

  43. Joint Far- and Near-End Speech Intelligibility Enhancement based on the Approximated Speech Intelligibility Index

    Authors: Andreas Jonas Fuglsig, Jan Østergaard, Jesper Jensen, Lars Søndergaard Bertelsen, Peter Mariager, Zheng-Hua Tan

    Abstract: This paper considers speech enhancement of signals picked up in one noisy environment which must be presented to a listener in another noisy environment. Recently, it has been shown that an optimal solution to this problem requires the consideration of the noise sources in both environments jointly. However, the existing optimal mutual information based method requires a complicated system model t…

    Submitted 15 November, 2021; originally announced November 2021.

  44. arXiv:2111.06099  [pdf]

    cs.DL eess.SY

    On Novel Peer Review System for Academic Journals: Experimental Study Based on Social Computing

    Authors: Li Liu, Qian Wang, Zong-Yuan Tan, Ning Cai

    Abstract: For improving the performance and effectiveness of peer review, a novel review system is proposed, based on analysis of peer review process for academic journals under a parallel model built via Monte Carlo method. The model can simulate the review, application and acceptance activities of the review systems, in a distributed manner. According to simulation experiments on two distinct review syste…

    Submitted 11 November, 2021; originally announced November 2021.

  45. arXiv:2111.02783  [pdf, other]

    eess.SP

    Radio Sensing with Large Intelligent Surface for 6G

    Authors: Cristian J. Vaca-Rubio, Pablo Ramirez-Espinosa, Kimmo Kansanen, Zheng-Hua Tan, Elisabeth de Carvalho

    Abstract: This paper leverages the potential of Large Intelligent Surface (LIS) for radio sensing in 6G wireless networks. Major research has been undergone about its communication capabilities but it can be exploited as a formidable tool for radio sensing. By taking advantage of arbitrary communication signals occurring in the scenario, we apply direct processing to the output signal from the LIS to obtain…

    Submitted 3 February, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

  46. Interpolation variable rate image compression

    Authors: Zhenhong Sun, Zhiyu Tan, Xiuyu Sun, Fangyi Zhang, Yichen Qian, Dongyang Li, Hao Li

    Abstract: Compression standards have been used to reduce the cost of image storage and transmission for decades. In recent years, learned image compression methods have been proposed and achieved compelling performance to the traditional standards. However, in these methods, a set of different networks are used for various compression rates, resulting in a high cost in model storage and training. Although s…

    Submitted 19 September, 2021; originally announced September 2021.

  47. Improving Speaker Identification for Shared Devices by Adapting Embeddings to Speaker Subsets

    Authors: Zhenning Tan, Yuguang Yang, Eunjung Han, Andreas Stolcke

    Abstract: Speaker identification typically involves three stages. First, a front-end speaker embedding model is trained to embed utterance and speaker profiles. Second, a scoring function is applied between a runtime utterance and each speaker profile. Finally, the speaker is identified using nearest neighbor according to the scoring metric. To better distinguish speakers sharing a device within the same ho…

    Submitted 6 September, 2021; originally announced September 2021.

    Comments: Submitted to ASRU 2021

    Journal ref: Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2021, pp. 1124-1131

  48. arXiv:2107.12794  [pdf, other]

    cs.LG eess.SY

    Short-Term Electricity Price Forecasting based on Graph Convolution Network and Attention Mechanism

    Authors: Yuyun Yang, Zhenfei Tan, Haitao Yang, Guangchun Ruan, Haiwang Zhong

    Abstract: In electricity markets, locational marginal price (LMP) forecasting is particularly important for market participants in making reasonable bidding strategies, managing potential trading risks, and supporting efficient system planning and operation. Unlike existing methods that only consider LMPs' temporal features, this paper tailors a spectral graph convolutional network (GCN) to greatly improve…

    Submitted 26 July, 2021; originally announced July 2021.

    Comments: Submitted to IET RPG. 9 pages, 15 figures, 6 tables

  49. arXiv:2105.14826  [pdf, other]

    eess.AS cs.SD

    PF-Net: Personalized Filter for Speaker Recognition from Raw Waveform

    Authors: Wencheng Li, Zhenhua Tan, Jingyu Ning, Zhenche Xia, Danke Wu

    Abstract: Speaker recognition using i-vector has been replaced by speaker recognition using deep learning. Speaker recognition based on Convolutional Neural Networks (CNNs) has been widely used in recent years, which learn low-level speech representations from raw waveforms. On this basis, a CNN architecture called SincNet proposes a kind of unique convolutional layer, which has achieved band-pass filters.…

    Submitted 18 June, 2022; v1 submitted 31 May, 2021; originally announced May 2021.

  50. arXiv:2104.06083  [pdf, other]

    eess.IV cs.CV

    Spatiotemporal Entropy Model is All You Need for Learned Video Compression

    Authors: Zhenhong Sun, Zhiyu Tan, Xiuyu Sun, Fangyi Zhang, Dongyang Li, Yichen Qian, Hao Li

    Abstract: The framework of dominant learned video compression methods is usually composed of motion prediction modules as well as motion vector and residual image compression modules, suffering from its complex structure and error propagation problem. Approaches have been proposed to reduce the complexity by replacing motion prediction modules with implicit flow networks. Error propagation aware training st…

    Submitted 13 April, 2021; originally announced April 2021.