Search | arXiv e-print repository

Zero-Shot Audio Captioning Using Soft and Hard Prompts

Authors: Yiming Zhang, Xuenan Xu, Ruoyi Du, Haohe Liu, Yuan Dong, Zheng-Hua Tan, Wenwu Wang, Zhanyu Ma

Abstract: In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these model… ▽ More In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, which, however, has received little attention. We propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model to address these issues. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic space.In the inference stage, the model generates the descriptive text for the given audio from the audio feature by leveraging the audio-text alignment from CLAP.We devise two strategies to mitigate the discrepancy between text and audio embeddings: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. These approaches are designed to enhance the generalization performance of our proposed model, facilitating the model to generate captions more robustly and accurately. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing

arXiv:2406.06160 [pdf, other]

The Effect of Training Dataset Size on Discriminative and Diffusion-Based Speech Enhancement Systems

Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

Abstract: The performance of deep neural network-based speech enhancement systems typically increases with the training dataset size. However, studies that investigated the effect of training dataset size on speech enhancement performance did not consider recent approaches, such as diffusion-based generative models. Diffusion models are typically trained with massive datasets for image generation tasks, but… ▽ More The performance of deep neural network-based speech enhancement systems typically increases with the training dataset size. However, studies that investigated the effect of training dataset size on speech enhancement performance did not consider recent approaches, such as diffusion-based generative models. Diffusion models are typically trained with massive datasets for image generation tasks, but whether this is also required for speech enhancement is unknown. Moreover, studies that investigated the effect of training dataset size did not control for the data diversity. It is thus unclear whether the performance improvement was due to the increased dataset size or diversity. Therefore, we systematically investigate the effect of training dataset size on the performance of popular state-of-the-art discriminative and diffusion-based speech enhancement systems. We control for the data diversity by using a fixed set of speech utterances, noise segments and binaural room impulse responses to generate datasets of different sizes. We find that the diffusion-based systems do not benefit from increasing the training dataset size as much as the discriminative systems. They perform the best relative to the discriminative systems with datasets of 10 h or less, but they are outperformed by the discriminative systems with datasets of 100 h or more. △ Less

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.05270 [pdf]

fastMRI Breast: A publicly available radial k-space dataset of breast dynamic contrast-enhanced MRI

Authors: Eddy Solomon, Patricia M. Johnson, Zhengguo Tan, Radhika Tibrewala, Yvonne W. Lui, Florian Knoll, Linda Moy, Sungheon Gene Kim, Laura Heacock

Abstract: This data curation work introduces the first large-scale dataset of radial k-space and DICOM data for breast DCE-MRI acquired in diagnostic breast MRI exams. Our dataset includes case-level labels indicating patient age, menopause status, lesion status (negative, benign, and malignant), and lesion type for each case. The public availability of this dataset and accompanying reconstruction code will… ▽ More This data curation work introduces the first large-scale dataset of radial k-space and DICOM data for breast DCE-MRI acquired in diagnostic breast MRI exams. Our dataset includes case-level labels indicating patient age, menopause status, lesion status (negative, benign, and malignant), and lesion type for each case. The public availability of this dataset and accompanying reconstruction code will support research and development of fast and quantitative breast image reconstruction and machine learning methods. △ Less

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.02178 [pdf, other]

Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations

Authors: Sarthak Yadav, Zheng-Hua Tan

Abstract: Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations. One such approach is selective state space models, which have demonstrated promising results for language modelling. However, their feasibility for learning self-supervised, general-purpose audio representations is yet to be investigated. T… ▽ More Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations. One such approach is selective state space models, which have demonstrated promising results for language modelling. However, their feasibility for learning self-supervised, general-purpose audio representations is yet to be investigated. This work proposes Audio Mamba, a selective state space model for learning general-purpose audio representations from randomly masked spectrogram patches through self-supervision. Empirical results on ten diverse audio recognition downstream tasks show that the proposed models, pretrained on the AudioSet dataset, consistently outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by a considerable margin and demonstrate better performance in dataset size, sequence length and model size comparisons. △ Less

Submitted 7 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted at INTERSPEECH 2024

arXiv:2404.17125 [pdf, other]

doi 10.1109/ICPSAsia48933.2020.9208421

Misaka: Interactive Swarm Testbed for Smart Grid Distributed Algorithm Test and Evaluation

Authors: Tingliang Zhang, Haiwang Zhong, Zhenfei Tan, Xinfei Yan

Abstract: In this paper, we present Misaka, a visualized swarm testbed for smart grid algorithm evaluation, also an extendable open-source open-hardware platform for develo** tabletop tangible swarm interfaces. The platform consists of a collection of custom-designed 3 omni-directional wheels robots each 10 cm in diameter, high accuracy localization through a microdot pattern overlaid on top of the activi… ▽ More In this paper, we present Misaka, a visualized swarm testbed for smart grid algorithm evaluation, also an extendable open-source open-hardware platform for develo** tabletop tangible swarm interfaces. The platform consists of a collection of custom-designed 3 omni-directional wheels robots each 10 cm in diameter, high accuracy localization through a microdot pattern overlaid on top of the activity sheets, and a software framework for application development and control, while remaining affordable (per unit cost about 30 USD at the prototype stage). We illustrate the potential of tabletop swarm user interfaces through a set of smart grid algorithm application scenarios developed with Misaka. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Journal ref: 2020 IEEE/IAS Industrial and Commercial Power System Asia (I&CPS Asia)

arXiv:2403.18560 [pdf, other]

Noise-Robust Keyword Spotting through Self-supervised Pretraining

Authors: Jacob Mørk, Holger Severin Bovbjerg, Gergely Kiss, Zheng-Hua Tan

Abstract: Voice assistants are now widely available, and to activate them a keyword spotting (KWS) algorithm is used. Modern KWS systems are mainly trained using supervised learning methods and require a large amount of labelled data to achieve a good performance. Leveraging unlabelled data through self-supervised learning (SSL) has been shown to increase the accuracy in clean conditions. This paper explore… ▽ More Voice assistants are now widely available, and to activate them a keyword spotting (KWS) algorithm is used. Modern KWS systems are mainly trained using supervised learning methods and require a large amount of labelled data to achieve a good performance. Leveraging unlabelled data through self-supervised learning (SSL) has been shown to increase the accuracy in clean conditions. This paper explores how SSL pretraining such as Data2Vec can be used to enhance the robustness of KWS models in noisy conditions, which is under-explored. Models of three different sizes are pretrained using different pretraining approaches and then fine-tuned for KWS. These models are then tested and compared to models trained using two baseline supervised learning methods, one being standard training using clean data and the other one being multi-style training (MTR). The results show that pretraining and fine-tuning on clean data is superior to supervised learning on clean data across all testing conditions, and superior to supervised MTR for testing conditions of SNR above 5 dB. This indicates that pretraining alone can increase the model's robustness. Finally, it is found that using noisy data for pretraining models, especially with the Data2Vec-denoising approach, significantly enhances the robustness of KWS models in noisy conditions. △ Less

Submitted 27 March, 2024; originally announced March 2024.

MSC Class: 68T10 ACM Class: I.2.6

arXiv:2403.17701

Rotate to Scan: UNet-like Mamba with Triplet SSM Module for Medical Image Segmentation

Authors: Hao Tang, Lianglun Cheng, Guoheng Huang, Zhengguang Tan, Junhao Lu, Kaihong Wu

Abstract: Image segmentation holds a vital position in the realms of diagnosis and treatment within the medical domain. Traditional convolutional neural networks (CNNs) and Transformer models have made significant advancements in this realm, but they still encounter challenges because of limited receptive field or high computing complexity. Recently, State Space Models (SSMs), particularly Mamba and its var… ▽ More Image segmentation holds a vital position in the realms of diagnosis and treatment within the medical domain. Traditional convolutional neural networks (CNNs) and Transformer models have made significant advancements in this realm, but they still encounter challenges because of limited receptive field or high computing complexity. Recently, State Space Models (SSMs), particularly Mamba and its variants, have demonstrated notable performance in the field of vision. However, their feature extraction methods may not be sufficiently effective and retain some redundant structures, leaving room for parameter reduction. Motivated by previous spatial and channel attention methods, we propose Triplet Mamba-UNet. The method leverages residual VSS Blocks to extract intensive contextual features, while Triplet SSM is employed to fuse features across spatial and channel dimensions. We conducted experiments on ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir-SEG, CVC-ColonDB, and Kvasir-Instrument datasets, demonstrating the superior segmentation performance of our proposed TM-UNet. Additionally, compared to the previous VM-UNet, our model achieves a one-third reduction in parameters. △ Less

Submitted 3 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

Comments: Experimental method encountered errors, undergoing experiment again

arXiv:2403.10428 [pdf, other]

doi 10.1109/TASLP.2024.3378099

How to train your ears: Auditory-model emulation for large-dynamic-range inputs and mild-to-severe hearing losses

Authors: Peter Leer, Jesper Jensen, Zheng-Hua Tan, Jan Østergaard, Lars Bramsløw

Abstract: Advanced auditory models are useful in designing signal-processing algorithms for hearing-loss compensation or speech enhancement. Such auditory models provide rich and detailed descriptions of the auditory pathway, and might allow for individualization of signal-processing strategies, based on physiological measurements. However, these auditory models are often computationally demanding, requirin… ▽ More Advanced auditory models are useful in designing signal-processing algorithms for hearing-loss compensation or speech enhancement. Such auditory models provide rich and detailed descriptions of the auditory pathway, and might allow for individualization of signal-processing strategies, based on physiological measurements. However, these auditory models are often computationally demanding, requiring significant time to compute. To address this issue, previous studies have explored the use of deep neural networks to emulate auditory models and reduce inference time. While these deep neural networks offer impressive efficiency gains in terms of computational time, they may suffer from uneven emulation performance as a function of auditory-model frequency-channels and input sound pressure level, making them unsuitable for many tasks. In this study, we demonstrate that the conventional machine-learning optimization objective used in existing state-of-the-art methods is the primary source of this limitation. Specifically, the optimization objective fails to account for the frequency- and level-dependencies of the auditory model, caused by a large input dynamic range and different types of hearing losses emulated by the auditory model. To overcome this limitation, we propose a new optimization objective that explicitly embeds the frequency- and level-dependencies of the auditory model. Our results show that this new optimization objective significantly improves the emulation performance of deep neural networks across relevant input sound levels and auditory-model frequency channels, without increasing the computational load during inference. Addressing these limitations is essential for advancing the application of auditory models in signal-processing tasks, ensuring their efficacy in diverse scenarios. △ Less

Submitted 15 March, 2024; originally announced March 2024.

Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. This version is the authors' version and may vary from the final publication in details

arXiv:2403.10420 [pdf, other]

Neural Networks Hear You Loud And Clear: Hearing Loss Compensation Using Deep Neural Networks

Authors: Peter Leer, Jesper Jensen, Laurel Carney, Zheng-Hua Tan, Jan Østergaard, Lars Bramsløw

Abstract: This article investigates the use of deep neural networks (DNNs) for hearing-loss compensation. Hearing loss is a prevalent issue affecting millions of people worldwide, and conventional hearing aids have limitations in providing satisfactory compensation. DNNs have shown remarkable performance in various auditory tasks, including speech recognition, speaker identification, and music classificatio… ▽ More This article investigates the use of deep neural networks (DNNs) for hearing-loss compensation. Hearing loss is a prevalent issue affecting millions of people worldwide, and conventional hearing aids have limitations in providing satisfactory compensation. DNNs have shown remarkable performance in various auditory tasks, including speech recognition, speaker identification, and music classification. In this study, we propose a DNN-based approach for hearing-loss compensation, which is trained on the outputs of hearing-impaired and normal-hearing DNN-based auditory models in response to speech signals. First, we introduce a framework for emulating auditory models using DNNs, focusing on an auditory-nerve model in the auditory pathway. We propose a linearization of the DNN-based approach, which we use to analyze the DNN-based hearing-loss compensation. Additionally we develop a simple approach to choose the acoustic center frequencies of the auditory model used for the compensation strategy. Finally, we evaluate the DNN-based hearing-loss compensation strategies using listening tests with hearing impaired listeners. The results demonstrate that the proposed approach results in feasible hearing-loss compensation strategies. Our proposed approach was shown to provide an increase in speech intelligibility and was found to outperform a conventional approach in terms of perceived speech quality. △ Less

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.03675 [pdf, other]

ZF Beamforming Tensor Compression for Massive MIMO Fronthaul

Authors: Libin Zheng, Zihao Wang, Minru Bai, Zhenjie Tan

Abstract: In the rapidly evolving landscape of 5G and beyond 5G (B5G) mobile cellular communications, efficient data compression and reconstruction strategies become paramount, especially in massive multiple-input multiple-output (MIMO) systems. A critical challenge in these systems is the capacity-limited fronthaul, particularly in the context of the Ethernet-based common public radio interface (eCPRI) con… ▽ More In the rapidly evolving landscape of 5G and beyond 5G (B5G) mobile cellular communications, efficient data compression and reconstruction strategies become paramount, especially in massive multiple-input multiple-output (MIMO) systems. A critical challenge in these systems is the capacity-limited fronthaul, particularly in the context of the Ethernet-based common public radio interface (eCPRI) connecting baseband units (BBUs) and remote radio units (RRUs). This capacity limitation hinders the effective handling of increased traffic and data flows. We propose a novel two-stage compression approach to address this bottleneck. The first stage employs sparse Tucker decomposition, targeting the weight tensor's low-rank components for compression. The second stage further compresses these components using complex givens decomposition and run-length encoding, substantially improving the compression ratio. Our approach specifically targets the Zero-Forcing (ZF) beamforming weights in BBUs. By reconstructing these weights in RRUs, we significantly alleviate the burden on eCPRI traffic, enabling a higher number of concurrent streams in the radio access network (RAN). Through comprehensive evaluations, we demonstrate the superior effectiveness of our method in Channel State Information (CSI) compression, paving the way for more efficient 5G/B5G fronthaul links. △ Less

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2402.02327 [pdf, other]

Bootstrap** Audio-Visual Segmentation by Strengthening Audio Cues

Authors: Tianxiang Chen, Zhentao Tan, Tao Gong, Qi Chu, Yue Wu, Bin Liu, Le Lu, Jie** Ye, Nenghai Yu

Abstract: How to effectively interact audio with vision has garnered considerable interest within the multi-modality research field. Recently, a novel audio-visual segmentation (AVS) task has been proposed, aiming to segment the sounding objects in video frames under the guidance of audio cues. However, most existing AVS methods are hindered by a modality imbalance where the visual features tend to dominate… ▽ More How to effectively interact audio with vision has garnered considerable interest within the multi-modality research field. Recently, a novel audio-visual segmentation (AVS) task has been proposed, aiming to segment the sounding objects in video frames under the guidance of audio cues. However, most existing AVS methods are hindered by a modality imbalance where the visual features tend to dominate those of the audio modality, due to a unidirectional and insufficient integration of audio cues. This imbalance skews the feature representation towards the visual aspect, impeding the learning of joint audio-visual representations and potentially causing segmentation inaccuracies. To address this issue, we propose AVSAC. Our approach features a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges, enhancing audio cues and fostering continuous interplay between audio and visual modalities. This bidirectional interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations. Additionally, we present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD. This strategy enhances the share of auditory components in visual features, contributing to a more balanced audio-visual representation learning. Extensive experiments show that our method attains new benchmarks in AVS performance. △ Less

Submitted 6 February, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

arXiv:2312.16613 [pdf, other]

Self-supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions

Authors: Holger Severin Bovbjerg, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan

Abstract: In this paper, we propose the use of self-supervised pretraining on a large unlabelled data set to improve the performance of a personalized voice activity detection (VAD) model in adverse conditions. We pretrain a long short-term memory (LSTM)-encoder using the autoregressive predictive coding (APC) framework and fine-tune it for personalized VAD. We also propose a denoising variant of APC, with… ▽ More In this paper, we propose the use of self-supervised pretraining on a large unlabelled data set to improve the performance of a personalized voice activity detection (VAD) model in adverse conditions. We pretrain a long short-term memory (LSTM)-encoder using the autoregressive predictive coding (APC) framework and fine-tune it for personalized VAD. We also propose a denoising variant of APC, with the goal of improving the robustness of personalized VAD. The trained models are systematically evaluated on both clean speech and speech contaminated by various types of noise at different SNR-levels and compared to a purely supervised model. Our experiments show that self-supervised pretraining not only improves performance in clean conditions, but also yields models which are more robust to adverse conditions compared to purely supervised learning. △ Less

Submitted 23 January, 2024; v1 submitted 27 December, 2023; originally announced December 2023.

MSC Class: 68T10 ACM Class: I.2.6

arXiv:2312.04370 [pdf, other]

Investigating the Design Space of Diffusion Models for Speech Enhancement

Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

Abstract: Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals… ▽ More Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously laid in image generation literature did not account for such a transformation towards the system input, which prevents from relating the existing diffusion-based speech enhancement systems with the aforementioned diffusion model framework. To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), or the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows to outperform a popular diffusion-based speech enhancement system in terms of perceptual metrics while using fewer sampling steps, thus reducing the computational cost by a factor of four. △ Less

Submitted 7 December, 2023; originally announced December 2023.

arXiv:2312.02683 [pdf, other]

Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler

Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

Abstract: Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the art discriminative models. However, this was investigated with a single database for training and another one for testing, which makes the results highly dependent on t… ▽ More Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the art discriminative models. However, this was investigated with a single database for training and another one for testing, which makes the results highly dependent on the particular databases. Moreover, recent developments from the image generation literature remain largely unexplored for speech enhancement. These include several design aspects of diffusion models, such as the noise schedule or the reverse sampler. In this work, we systematically assess the generalization performance of a diffusion-based speech enhancement model by using multiple speech, noise and binaural room impulse response (BRIR) databases to simulate mismatched acoustic conditions. We also experiment with a noise schedule and a sampler that have not been applied to speech enhancement before. We show that the proposed system substantially benefits from using multiple databases for training, and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions. We also show that a Heun-based sampler achieves superior performance at a smaller computational cost compared to a sampler commonly used for speech enhancement. △ Less

Submitted 16 January, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

Comments: Accepted to ICASSP 2024

arXiv:2310.04369 [pdf, other]

MBTFNet: Multi-Band Temporal-Frequency Neural Network For Singing Voice Enhancement

Authors: Weiming Xu, Zhouxuan Chen, Zhili Tan, Shubo Lv, Runduo Han, Wenjiang Zhou, Weifeng Zhao, Lei Xie

Abstract: A typical neural speech enhancement (SE) approach mainly handles speech and noise mixtures, which is not optimal for singing voice enhancement scenarios. Music source separation (MSS) models treat vocals and various accompaniment components equally, which may reduce performance compared to the model that only considers vocal enhancement. In this paper, we propose a novel multi-band temporal-freque… ▽ More A typical neural speech enhancement (SE) approach mainly handles speech and noise mixtures, which is not optimal for singing voice enhancement scenarios. Music source separation (MSS) models treat vocals and various accompaniment components equally, which may reduce performance compared to the model that only considers vocal enhancement. In this paper, we propose a novel multi-band temporal-frequency neural network (MBTFNet) for singing voice enhancement, which particularly removes background music, noise and even backing vocals from singing recordings. MBTFNet combines inter and intra-band modeling for better processing of full-band signals. Dual-path modeling are introduced to expand the receptive field of the model. We propose an implicit personalized enhancement (IPE) stage based on signal-to-noise ratio (SNR) estimation, which further improves the performance of MBTFNet. Experiments show that our proposed model significantly outperforms several state-of-the-art SE and MSS models. △ Less

Submitted 6 October, 2023; originally announced October 2023.

arXiv:2309.16390 [pdf, other]

An Enhanced Low-Resolution Image Recognition Method for Traffic Environments

Authors: Zongcai Tan, Zhenhai Gao

Abstract: Currently, low-resolution image recognition is confronted with a significant challenge in the field of intelligent traffic perception. Compared to high-resolution images, low-resolution images suffer from small size, low quality, and lack of detail, leading to a notable decrease in the accuracy of traditional neural network recognition algorithms. The key to low-resolution image recognition lies i… ▽ More Currently, low-resolution image recognition is confronted with a significant challenge in the field of intelligent traffic perception. Compared to high-resolution images, low-resolution images suffer from small size, low quality, and lack of detail, leading to a notable decrease in the accuracy of traditional neural network recognition algorithms. The key to low-resolution image recognition lies in effective feature extraction. Therefore, this paper delves into the fundamental dimensions of residual modules and their impact on feature extraction and computational efficiency. Based on experiments, we introduce a dual-branch residual network structure that leverages the basic architecture of residual networks and a common feature subspace algorithm. Additionally, it incorporates the utilization of intermediate-layer features to enhance the accuracy of low-resolution image recognition. Furthermore, we employ knowledge distillation to reduce network parameters and computational overhead. Experimental results validate the effectiveness of this algorithm for low-resolution image recognition in traffic environments. △ Less

Submitted 28 September, 2023; originally announced September 2023.

arXiv:2309.11243 [pdf, other]

Joint Minimum Processing Beamforming and Near-end Listening Enhancement

Authors: Andreas J. Fuglsig, Jesper Jensen, Zheng-Hua Tan, Lars S. Bertelsen, Jens Christian Lindof, Jan Østergaard

Abstract: We consider speech enhancement for signals picked up in one noisy environment that must be rendered to a listener in another noisy environment. For both far-end noise reduction and near-end listening enhancement, it has been shown that excessive focus on noise suppression or intelligibility maximization may lead to excessive speech distortions and quality degradations in favorable noise conditions… ▽ More We consider speech enhancement for signals picked up in one noisy environment that must be rendered to a listener in another noisy environment. For both far-end noise reduction and near-end listening enhancement, it has been shown that excessive focus on noise suppression or intelligibility maximization may lead to excessive speech distortions and quality degradations in favorable noise conditions, where intelligibility is already at ceiling level. Recently [1,2] propose to remedy this with a minimum processing framework that either reduces noise or enhances listening a minimum amount given that a certain intelligibility criterion is still satisfied Additionally, it has been shown that joint consideration of both environments improves speech enhancement performance. In this paper, we formulate a joint far- and near-end minimum processing framework, that improves intelligibility while limiting speech distortions in favorable noise conditions. We provide closed-form solutions to specific boundary scenarios and investigate performance for the general case using numerical optimization. We also show concatenating existing minimum processing far- and near-end enhancement methods preserves the effects of the initial methods. Results show that the joint optimization can further improve performance compared to the concatenated approach. △ Less

Submitted 5 February, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

Comments: Accepted at IEEE ICASSP 2024 Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA) 2024

arXiv:2307.10495 [pdf, other]

doi 10.1117/12.2662393

Novel Batch Active Learning Approach and Its Application to Synthetic Aperture Radar Datasets

Authors: James Chapman, Bohan Chen, Zheng Tan, Jeff Calder, Kevin Miller, Andrea L. Bertozzi

Abstract: Active learning improves the performance of machine learning methods by judiciously selecting a limited number of unlabeled data points to query for labels, with the aim of maximally improving the underlying classifier's performance. Recent gains have been made using sequential active learning for synthetic aperture radar (SAR) data arXiv:2204.00005. In each iteration, sequential active learning s… ▽ More Active learning improves the performance of machine learning methods by judiciously selecting a limited number of unlabeled data points to query for labels, with the aim of maximally improving the underlying classifier's performance. Recent gains have been made using sequential active learning for synthetic aperture radar (SAR) data arXiv:2204.00005. In each iteration, sequential active learning selects a query set of size one while batch active learning selects a query set of multiple datapoints. While batch active learning methods exhibit greater efficiency, the challenge lies in maintaining model accuracy relative to sequential active learning methods. We developed a novel, two-part approach for batch active learning: Dijkstra's Annulus Core-Set (DAC) for core-set generation and LocalMax for batch sampling. The batch active learning process that combines DAC and LocalMax achieves nearly identical accuracy as sequential active learning but is more efficient, proportional to the batch size. As an application, a pipeline is built based on transfer learning feature embedding, graph learning, DAC, and LocalMax to classify the FUSAR-Ship and OpenSARShip datasets. Our pipeline outperforms the state-of-the-art CNN-based methods. △ Less

Submitted 19 July, 2023; originally announced July 2023.

Comments: 16 pages, 7 figures, Preprint

ACM Class: I.2.6; I.2.10; I.4.0; I.4.9

Journal ref: Proc. SPIE. Algorithms for Synthetic Aperture Radar Imagery XXX (Vol. 12520, pp. 96-111). 13 June 2023

arXiv:2307.05893 [pdf, ps, other]

Deep Unrolling for Nonconvex Robust Principal Component Analysis

Authors: Elizabeth Z. C. Tan, Caroline Chaux, Emmanuel Soubies, Vincent Y. F. Tan

Abstract: We design algorithms for Robust Principal Component Analysis (RPCA) which consists in decomposing a matrix into the sum of a low rank matrix and a sparse matrix. We propose a deep unrolled algorithm based on an accelerated alternating projection algorithm which aims to solve RPCA in its nonconvex form. The proposed procedure combines benefits of deep neural networks and the interpretability of the… ▽ More We design algorithms for Robust Principal Component Analysis (RPCA) which consists in decomposing a matrix into the sum of a low rank matrix and a sparse matrix. We propose a deep unrolled algorithm based on an accelerated alternating projection algorithm which aims to solve RPCA in its nonconvex form. The proposed procedure combines benefits of deep neural networks and the interpretability of the original algorithm and it automatically learns hyperparameters. We demonstrate the unrolled algorithm's effectiveness on synthetic datasets and also on a face modeling problem, where it leads to both better numerical and visual performances. △ Less

Submitted 11 July, 2023; originally announced July 2023.

Comments: 7 pages, 3 figures; Accepted to the 2023 IEEE International Workshop on Machine Learning for Signal Processing

arXiv:2306.00561 [pdf, other]

Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners

Authors: Sarthak Yadav, Sergios Theodoridis, Lars Kai Hansen, Zheng-Hua Tan

Abstract: In this work, we propose a Multi-Window Masked Autoencoder (MW-MAE) fitted with a novel Multi-Window Multi-Head Attention (MW-MHA) module that facilitates the modelling of local-global interactions in every decoder transformer block through attention heads of several distinct local and global windows. Empirical results on ten downstream audio tasks show that MW-MAEs consistently outperform standar… ▽ More In this work, we propose a Multi-Window Masked Autoencoder (MW-MAE) fitted with a novel Multi-Window Multi-Head Attention (MW-MHA) module that facilitates the modelling of local-global interactions in every decoder transformer block through attention heads of several distinct local and global windows. Empirical results on ten downstream audio tasks show that MW-MAEs consistently outperform standard MAEs in overall performance and learn better general-purpose audio representations, along with demonstrating considerably better scaling characteristics. Investigating attention distances and entropies reveals that MW-MAE encoders learn heads with broader local and global attention. Analyzing attention head feature representations through Projection Weighted Canonical Correlation Analysis (PWCCA) shows that attention heads with the same window sizes across the decoder layers of the MW-MAE learn correlated feature representations which enables each block to independently capture local and global information, leading to a decoupled decoder feature hierarchy. Code for feature extraction and downstream experiments along with pre-trained models will be released publically. △ Less

Submitted 1 October, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

arXiv:2306.00489 [pdf, other]

Speech inpainting: Context-based speech synthesis guided by video

Authors: Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen

Abstract: Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. Specifically, this paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing… ▽ More Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. Specifically, this paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment in a way that it is consistent with the corresponding visual content and the uncorrupted audio context. We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio. It outperforms the previous state-of-the-art audio-visual model and audio-only baselines. We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech. △ Less

Submitted 1 June, 2023; originally announced June 2023.

Comments: Accepted in Interspeech23

arXiv:2303.15474 [pdf]

doi 10.1109/TCSI.2023.3255199

A Heterogeneous Parallel Non-von Neumann Architecture System for Accurate and Efficient Machine Learning Molecular Dynamics

Authors: Zhuoying Zhao, Ziling Tan, **hui Mo, Xiaonan Wang, Dan Zhao, Xin Zhang, Ming Tao, Jie Liu

Abstract: This paper proposes a special-purpose system to achieve high-accuracy and high-efficiency machine learning (ML) molecular dynamics (MD) calculations. The system consists of field programmable gate array (FPGA) and application specific integrated circuit (ASIC) working in heterogeneous parallelization. To be specific, a multiplication-less neural network (NN) is deployed on the non-von Neumann (NvN… ▽ More This paper proposes a special-purpose system to achieve high-accuracy and high-efficiency machine learning (ML) molecular dynamics (MD) calculations. The system consists of field programmable gate array (FPGA) and application specific integrated circuit (ASIC) working in heterogeneous parallelization. To be specific, a multiplication-less neural network (NN) is deployed on the non-von Neumann (NvN)-based ASIC (SilTerra 180 nm process) to evaluate atomic forces, which is the most computationally expensive part of MD. All other calculations of MD are done using FPGA (Xilinx XC7Z100). It is shown that, to achieve similar-level accuracy, the proposed NvN-based system based on low-end fabrication technologies (180 nm) is 1.6x faster and 10^2-10^3x more energy efficiency than state-of-the-art vN based MLMD using graphics processing units (GPUs) based on much more advanced technologies (12 nm), indicating superiority of the proposed NvN-based heterogeneous parallel architecture. △ Less

Submitted 26 March, 2023; originally announced March 2023.

arXiv:2303.12360 [pdf]

Automatically Predict Material Properties with Microscopic Image Example Polymer Compatibility

Authors: Zhilong Liang, Zhenzhi Tan, Ruixin Hong, Wanli Ouyang, **ying Yuan, Changshui Zhang

Abstract: Many material properties are manifested in the morphological appearance and characterized with microscopic image, such as scanning electron microscopy (SEM). Polymer miscibility is a key physical quantity of polymer material and commonly and intuitively judged by SEM images. However, human observation and judgement for the images is time-consuming, labor-intensive and hard to be quantified. Comput… ▽ More Many material properties are manifested in the morphological appearance and characterized with microscopic image, such as scanning electron microscopy (SEM). Polymer miscibility is a key physical quantity of polymer material and commonly and intuitively judged by SEM images. However, human observation and judgement for the images is time-consuming, labor-intensive and hard to be quantified. Computer image recognition with machine learning method can make up the defects of artificial judging, giving accurate and quantitative judgement. We achieve automatic miscibility recognition utilizing convolution neural network and transfer learning method, and the model obtains up to 94% accuracy. We also put forward a quantitative criterion for polymer miscibility with this model. The proposed method can be widely applied to the quantitative characterization of the microstructure and properties of various materials. △ Less

Submitted 3 August, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

arXiv:2302.13280 [pdf, other]

doi 10.48550/arXiv.2302.13280

Non-Iterative Solution for Coordinated Optimal Dispatch via Equivalent Projection-Part II: Method and Applications

Authors: Zhenfei Tan, Zheng Yan, Haiwang Zhong, Qing Xia

Abstract: This two-part paper develops a non-iterative coordinated optimal dispatch framework, i.e., free of iterative information exchange, via the innovation of the equivalent projection (EP) theory. The EP eliminates internal variables from technical and economic operation constraints of the subsystem and obtains an equivalent model with reduced scale, which is the key to the non-iterative coordinated op… ▽ More This two-part paper develops a non-iterative coordinated optimal dispatch framework, i.e., free of iterative information exchange, via the innovation of the equivalent projection (EP) theory. The EP eliminates internal variables from technical and economic operation constraints of the subsystem and obtains an equivalent model with reduced scale, which is the key to the non-iterative coordinated optimization. In Part II of this paper, a novel projection algorithm with the explicit error guarantee measured by the Hausdorff distance is proposed, which characterizes the EP model by the convex hull of its vertices. This algorithm is proven to yield a conservative approximation within the prespecified error tolerance and can obtain the exact EP model if the error tolerance is set to zero, which provides flexibility to balance the computation accuracy and effort. Applications of the EP-based coordinated dispatch are demonstrated based on the multi-area coordination and transmission-distribution coordination. Case studies with a wide range of system scales verify the superiority of the proposed projection algorithm in terms of computational efficiency and scalability, and validate the effectiveness of the EP-based coordinated dispatch in comparison with the joint optimization. △ Less

Submitted 26 February, 2023; originally announced February 2023.

arXiv:2302.13278 [pdf, other]

doi 10.48550/arXiv.2302.13278

Non-Iterative Solution for Coordinated Optimal Dispatch via Equivalent Projection-Part I: Theory

Authors: Zhenfei Tan, Zheng Yan, Haiwang Zhong, Qing Xia

Abstract: Coordinated optimal dispatch is of utmost importance for the efficient and secure operation of hierarchically structured power systems. Conventional coordinated optimization methods, such as the Lagrangian relaxation and Benders decomposition, require iterative information exchange among subsystems. Iterative coordination methods have drawbacks including slow convergence, risk of oscillation and d… ▽ More Coordinated optimal dispatch is of utmost importance for the efficient and secure operation of hierarchically structured power systems. Conventional coordinated optimization methods, such as the Lagrangian relaxation and Benders decomposition, require iterative information exchange among subsystems. Iterative coordination methods have drawbacks including slow convergence, risk of oscillation and divergence, and incapability of multi-level optimization problems. To this end, this paper aims at the non-iterative coordinated optimization method for hierarchical power systems. The theory of the equivalent projection (EP) is proposed, which makes external equivalence of the optimal dispatch model of the subsystem. Based on the EP theory, a coordinated optimization framework is developed, where each subsystem submits the EP model as a substitute for its original model to participate in the cross-system coordination. The proposed coordination framework is proven to guarantee the same optimality as the joint optimization, with additional benefits of avoiding iterative information exchange, protecting privacy, compatibility with practical dispatch scheme, and capability of multi-level problems. △ Less

Submitted 26 February, 2023; originally announced February 2023.

arXiv:2211.10565 [pdf, other]

Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting

Authors: Iván López-Espejo, Ram C. M. C. Shekar, Zheng-Hua Tan, Jesper Jensen, John H. L. Hansen

Abstract: In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS whenever the number of filterbank channels is severely decreased. Reducing the number of channels might yield certain KWS performance drop, but… ▽ More In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS whenever the number of filterbank channels is severely decreased. Reducing the number of channels might yield certain KWS performance drop, but also a substantial energy consumption reduction, which is key when deploying common always-on KWS on low-resource devices. Experimental results on a noisy version of the Google Speech Commands Dataset show that filterbank learning adapts to noise characteristics to provide a higher degree of robustness to noise, especially when dropout is integrated. Thus, switching from typically used 40-channel log-Mel features to 8-channel learned features leads to a relative KWS accuracy loss of only 3.5% while simultaneously achieving a 6.3x energy consumption reduction. △ Less

Submitted 23 February, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

arXiv:2211.08191 [pdf, other]

Improved disentangled speech representations using contrastive learning in factorized hierarchical variational autoencoder

Authors: Yuying Xie, Thomas Arildsen, Zheng-Hua Tan

Abstract: Leveraging the fact that speaker identity and content vary on different time scales, \acrlong{fhvae} (\acrshort{fhvae}) uses different latent variables to symbolize these two attributes. Disentanglement of these attributes is carried out by different prior settings of the corresponding latent variables. For the prior of speaker identity variable, \acrshort{fhvae} assumes it is a Gaussian distribut… ▽ More Leveraging the fact that speaker identity and content vary on different time scales, \acrlong{fhvae} (\acrshort{fhvae}) uses different latent variables to symbolize these two attributes. Disentanglement of these attributes is carried out by different prior settings of the corresponding latent variables. For the prior of speaker identity variable, \acrshort{fhvae} assumes it is a Gaussian distribution with an utterance-scale varying mean and a fixed variance. By setting a small fixed variance, the training process promotes identity variables within one utterance gathering close to the mean of their prior. However, this constraint is relatively weak, as the mean of the prior changes between utterances. Therefore, we introduce contrastive learning into the \acrshort{fhvae} framework, to make the speaker identity variables gathering when representing the same speaker, while distancing themselves as far as possible from those of other speakers. The model structure has not been changed in this work but only the training process, thus no additional cost is needed during testing. Voice conversion has been chosen as the application in this paper. Latent variable evaluations include speaker verification and identification for the speaker identity variable, and speech recognition for the content variable. Furthermore, assessments of voice conversion performance are on the grounds of fake speech detection experiments. Results show that the proposed method improves both speaker identity and content feature extraction compared to \acrshort{fhvae}, and has better performance than baseline on conversion. △ Less

Submitted 14 June, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: accepted by EUSIPCO 2023

arXiv:2211.01621 [pdf, other]

Leveraging Domain Features for Detecting Adversarial Attacks Against Deep Speech Recognition in Noise

Authors: Christian Heider Nielsen, Zheng-Hua Tan

Abstract: In recent years, significant progress has been made in deep model-based automatic speech recognition (ASR), leading to its widespread deployment in the real world. At the same time, adversarial attacks against deep ASR systems are highly successful. Various methods have been proposed to defend ASR systems from these attacks. However, existing classification based methods focus on the design of dee… ▽ More In recent years, significant progress has been made in deep model-based automatic speech recognition (ASR), leading to its widespread deployment in the real world. At the same time, adversarial attacks against deep ASR systems are highly successful. Various methods have been proposed to defend ASR systems from these attacks. However, existing classification based methods focus on the design of deep learning models while lacking exploration of domain specific features. This work leverages filter bank-based features to better capture the characteristics of attacks for improved detection. Furthermore, the paper analyses the potentials of using speech and non-speech parts separately in detecting adversarial attacks. In the end, considering adverse environments where ASR systems may be deployed, we study the impact of acoustic noise of various types and signal-to-noise ratios. Extensive experiments show that the inverse filter bank features generally perform better in both clean and noisy environments, the detection is effective using either speech or non-speech part, and the acoustic noise can largely degrade the detection performance. △ Less

Submitted 3 November, 2022; originally announced November 2022.

arXiv:2210.17154 [pdf, other]

doi 10.1109/TASLP.2023.3282094

Minimum Processing Near-end Listening Enhancement

Authors: Andreas Jonas Fuglsig, Jesper Jensen, Zheng-Hua Tan, Lars Søndergaard Bertelsen, Jens Christian Lindof, Jan Østergaard

Abstract: The intelligibility and quality of speech from a mobile phone or public announcement system are often affected by background noise in the listening environment. By pre-processing the speech signal it is possible to improve the speech intelligibility and quality -- this is known as near-end listening enhancement (NLE). Although, existing NLE techniques are able to greatly increase intelligibility i… ▽ More The intelligibility and quality of speech from a mobile phone or public announcement system are often affected by background noise in the listening environment. By pre-processing the speech signal it is possible to improve the speech intelligibility and quality -- this is known as near-end listening enhancement (NLE). Although, existing NLE techniques are able to greatly increase intelligibility in harsh noise environments, in favorable noise conditions the intelligibility of speech reaches a ceiling where it cannot be further enhanced. Actually, the focus of existing methods solely on improving the intelligibility causes unnecessary processing of the speech signal and leads to speech distortions and quality degradations. In this paper, we provide a new rationale for NLE, where the target speech is minimally processed in terms of a processing penalty, provided that a certain performance constraint, e.g., intelligibility, is satisfied. We present a closed-form solution for the case where the performance criterion is an intelligibility estimator based on the approximated speech intelligibility index and the processing penalty is the mean-square error between the processed and the clean speech. This produces an NLE method that adapts to changing noise conditions via a simple gain rule by limiting the processing to the minimum necessary to achieve a desired intelligibility, while at the same time focusing on quality in favorable noise situations by minimizing the amount of speech distortions. Through simulation studies, we show the proposed method attains speech quality on par or better than existing methods in both objective measurements and subjective listening tests, whilst still sustaining objective speech intelligibility performance on par with existing methods. △ Less

Submitted 30 May, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

arXiv:2210.01703 [pdf, other]

Improving Label-Deficient Keyword Spotting Through Self-Supervised Pretraining

Authors: Holger Severin Bovbjerg, Zheng-Hua Tan

Abstract: Keyword Spotting (KWS) models are becoming increasingly integrated into various systems, e.g. voice assistants. To achieve satisfactory performance, these models typically rely on a large amount of labelled data, limiting their applications only to situations where such data is available. Self-supervised Learning (SSL) methods can mitigate such a reliance by leveraging readily-available unlabelled… ▽ More Keyword Spotting (KWS) models are becoming increasingly integrated into various systems, e.g. voice assistants. To achieve satisfactory performance, these models typically rely on a large amount of labelled data, limiting their applications only to situations where such data is available. Self-supervised Learning (SSL) methods can mitigate such a reliance by leveraging readily-available unlabelled data. Most SSL methods for speech have primarily been studied for large models, whereas this is not ideal, as compact KWS models are generally required. This paper explores the effectiveness of SSL on small models for KWS and establishes that SSL can enhance the performance of small KWS models when labelled data is scarce. We pretrain three compact transformer-based KWS models using Data2Vec, and fine-tune them on a label-deficient setup of the Google Speech Commands data set. It is found that Data2Vec pretraining leads to a significant increase in accuracy, with label-deficient scenarios showing an improvement of 8.22% 11.18% absolute accuracy. △ Less

Submitted 24 May, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

MSC Class: 68T10 ACM Class: I.2.6

arXiv:2207.01691 [pdf, other]

Adversarial Multi-Task Deep Learning for Noise-Robust Voice Activity Detection with Low Algorithmic Delay

Authors: Claus Meyer Larsen, Peter Koch, Zheng-Hua Tan

Abstract: Voice Activity Detection (VAD) is an important pre-processing step in a wide variety of speech processing systems. VAD should in a practical application be able to detect speech in both noisy and noise-free environments, while not introducing significant latency. In this work we propose using an adversarial multi-task learning method when training a supervised VAD. The method has been applied to t… ▽ More Voice Activity Detection (VAD) is an important pre-processing step in a wide variety of speech processing systems. VAD should in a practical application be able to detect speech in both noisy and noise-free environments, while not introducing significant latency. In this work we propose using an adversarial multi-task learning method when training a supervised VAD. The method has been applied to the state-of-the-art VAD Waveform-based Voice Activity Detection. Additionally the performance of the VADis investigated under different algorithmic delays, which is an important factor in latency. Introducing adversarial multi-task learning to the model is observed to increase performance in terms of Area Under Curve (AUC), particularly in noisy environments, while the performance is not degraded at higher SNR levels. The adversarial multi-task learning is only applied in the training phase and thus introduces no additional cost in testing. Furthermore the correlation between performance and algorithmic delays is investigated, and it is observed that the VAD performance degradation is only moderate when lowering the algorithmic delay from 398 ms to 23 ms. △ Less

Submitted 4 July, 2022; originally announced July 2022.

arXiv:2206.10750 [pdf, other]

Floor Map Reconstruction Through Radio Sensing and Learning By a Large Intelligent Surface

Authors: Cristian J. Vaca-Rubio, Roberto Pereira, Xavier Mestre, David Gregoratti, Zheng-Hua Tan, Elisabeth de Carvalho, Petar Popovski

Abstract: Environmental scene reconstruction is of great interest for autonomous robotic applications, since an accurate representation of the environment is necessary to ensure safe interaction with robots. Equally important, it is also vital to ensure reliable communication between the robot and its controller. Large Intelligent Surface (LIS) is a technology that has been extensively studied due to its co… ▽ More Environmental scene reconstruction is of great interest for autonomous robotic applications, since an accurate representation of the environment is necessary to ensure safe interaction with robots. Equally important, it is also vital to ensure reliable communication between the robot and its controller. Large Intelligent Surface (LIS) is a technology that has been extensively studied due to its communication capabilities. Moreover, due to the number of antenna elements, these surfaces arise as a powerful solution to radio sensing. This paper presents a novel method to translate radio environmental maps obtained at the LIS to floor plans of the indoor environment built of scatterers spread along its area. The usage of a Least Squares (LS) based method, U-Net (UN) and conditional Generative Adversarial Networks (cGANs) were leveraged to perform this task. We show that the floor plan can be correctly reconstructed using both local and global measurements. △ Less

Submitted 21 June, 2022; originally announced June 2022.

arXiv:2205.10321 [pdf, other]

User Localization using RF Sensing: A Performance comparison between LIS and mmWave Radars

Authors: Cristian J. Vaca-Rubio, Dariush Salami, Petar Popovski, Elisabeth de Carvalho, Zheng-Hua Tan, Stephan Sigg

Abstract: Since electromagnetic signals are omnipresent, Radio Frequency (RF)-sensing has the potential to become a universal sensing mechanism with applications in localization, smart-home, retail, gesture recognition, intrusion detection, etc. Two emerging technologies in RF-sensing, namely sensing through Large Intelligent Surfaces (LISs) and mmWave Frequency-Modulated Continuous-Wave (FMCW) radars, have… ▽ More Since electromagnetic signals are omnipresent, Radio Frequency (RF)-sensing has the potential to become a universal sensing mechanism with applications in localization, smart-home, retail, gesture recognition, intrusion detection, etc. Two emerging technologies in RF-sensing, namely sensing through Large Intelligent Surfaces (LISs) and mmWave Frequency-Modulated Continuous-Wave (FMCW) radars, have been successfully applied to a wide range of applications. In this work, we compare LIS and mmWave radars for localization in real-world and simulated environments. In our experiments, the mmWave radar achieves 0.71 Intersection Over Union (IOU) and 3cm error for bounding boxes, while LIS has 0.56 IOU and 10cm distance error. Although the radar outperforms the LIS in terms of accuracy, LIS features additional applications in communication in addition to sensing scenarios. △ Less

Submitted 17 May, 2022; originally announced May 2022.

arXiv:2205.03229 [pdf]

Multi-core fiber enabled fading noise suppression in φ-OFDR based quantitative distributed vibration sensing

Authors: Yuxiang Feng, Weilin Xie, Yinxia Meng, Jiang Yang, Qiang Yang, Yan Ren, Tianwai Bo, Zhongwei Tan, Wei Wei, Yi Dong

Abstract: Coherent fading has been regarded as a critical issue in phase-sensitive optical frequency domain reflectometry (φ-OFDR) based distributed fiber-optic sensing. Here, we report on an approach for fading noise suppression in φ-OFDR with multi-core fiber. By exploiting the independent nature of the randomness in the distribution of reflective index in each of the cores, the drastic phase fluctuations… ▽ More Coherent fading has been regarded as a critical issue in phase-sensitive optical frequency domain reflectometry (φ-OFDR) based distributed fiber-optic sensing. Here, we report on an approach for fading noise suppression in φ-OFDR with multi-core fiber. By exploiting the independent nature of the randomness in the distribution of reflective index in each of the cores, the drastic phase fluctuations due to the fading phenomina can be effectively alleviated by applying weighted vectorial averaging for the Rayleigh backscattering traces from each of the cores with distinct fading distributions. With the consistent linear response with respect to external excitation of interest for each of the cores, demonstration for the propsoed φ-OFDR with a commercial seven-core fiber has achieved highly sensitive quantitative distributed vibration sensing with about 2.2 nm length precision and 2 cm sensing resolution along the 500 m fiber, corresponding to a range resolution factor as high as about about 4E-5. Featuring long distance, high sensitivity, high resolution, and fading robustness, this approach has shown promising potentials in various sensing techniques for a wide range of practical scenarios. △ Less

Submitted 3 May, 2022; originally announced May 2022.

Comments: 4 pages

arXiv:2204.04456 [pdf, other]

Approximation-free control based on the bioinspired reference model for suspension systems with uncertainty and unknown nonlinearity

Authors: Xiaoyan Hu, Guilin Wen, Shan Yin, Zhao Tan, Zebang Pan

Abstract: Uncertainty and unknown nonlinearity are often inevitable in the suspension systems, which were often solved using fuzzy logic system (FLS) or neural networks (NNs). However, these methods are restricted by the structural complexity of the controller and the huge computing cost. Meanwhile, the estimation error of such approximators is affected by adopted adaptive laws and learning gains. Thus, in… ▽ More Uncertainty and unknown nonlinearity are often inevitable in the suspension systems, which were often solved using fuzzy logic system (FLS) or neural networks (NNs). However, these methods are restricted by the structural complexity of the controller and the huge computing cost. Meanwhile, the estimation error of such approximators is affected by adopted adaptive laws and learning gains. Thus, in view of the above problem, this paper proposes the approximation-free control based on the bioinspired reference model for a class of uncertain suspension systems with unknown nonlinearity. The proposed method integrates the superior vibration suppression of the bioinspired reference model and the structural advantage of the prescribed performance function (PPF) in approximation-free control. Then, the vibration suppression performance is improved, the calculation burden is relieved, and the transient performance is improved, which is analyzed theoretically in this paper. Finally, the simulation results validate the approach, and the comparisons show the advantages of the proposed control method in terms of good vibration suppression, fast convergence, and less calculation burden. △ Less

Submitted 9 April, 2022; originally announced April 2022.

arXiv:2204.02195 [pdf, other]

Complex Recurrent Variational Autoencoder with Application to Speech Enhancement

Authors: Yuying Xie, Thomas Arildsen, Zheng-Hua Tan

Abstract: As an extension of variational autoencoder (VAE), complex VAE uses complex Gaussian distributions to model latent variables and data. This work proposes a complex recurrent VAE framework, specifically in which complex-valued recurrent neural network and L1 reconstruction loss are used. Firstly, to account for the temporal property of speech signals, this work introduces complex-valued recurrent ne… ▽ More As an extension of variational autoencoder (VAE), complex VAE uses complex Gaussian distributions to model latent variables and data. This work proposes a complex recurrent VAE framework, specifically in which complex-valued recurrent neural network and L1 reconstruction loss are used. Firstly, to account for the temporal property of speech signals, this work introduces complex-valued recurrent neural network in the complex VAE framework. Besides, L1 loss is used as the reconstruction loss in this framework. To exemplify the use of the complex generative model in speech processing, we choose speech enhancement as the specific application in this paper. Experiments are based on the TIMIT dataset. The results show that the proposed method offers improvements on objective metrics in speech intelligibility and signal quality. △ Less

Submitted 12 May, 2023; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2204.02166 [pdf, other]

doi 10.1109/MLSP52302.2021.9596320

Disentangled Speech Representation Learning Based on Factorized Hierarchical Variational Autoencoder with Self-Supervised Objective

Authors: Yuying Xie, Thomas Arildsen, Zheng-Hua Tan

Abstract: Disentangled representation learning aims to extract explanatory features or factors and retain salient information. Factorized hierarchical variational autoencoder (FHVAE) presents a way to disentangle a speech signal into sequential-level and segmental-level features, which represent speaker identity and speech content information, respectively. As a self-supervised objective, autoregressive pre… ▽ More Disentangled representation learning aims to extract explanatory features or factors and retain salient information. Factorized hierarchical variational autoencoder (FHVAE) presents a way to disentangle a speech signal into sequential-level and segmental-level features, which represent speaker identity and speech content information, respectively. As a self-supervised objective, autoregressive predictive coding (APC), on the other hand, has been used in extracting meaningful and transferable speech features for multiple downstream tasks. Inspired by the success of these two representation learning methods, this paper proposes to integrate the APC objective into the FHVAE framework aiming at benefiting from the additional self-supervision target. The main proposed method requires neither more training data nor more computational cost at test time, but obtains improved meaningful representations while maintaining disentanglement. The experiments were conducted on the TIMIT dataset. Results demonstrate that FHVAE equipped with the additional self-supervised objective is able to learn features providing superior performance for tasks including speech recognition and speaker recognition. Furthermore, voice conversion, as one application of disentangled representation learning, has been applied and evaluated. The results show performance similar to baseline of the new framework on voice conversion. △ Less

Submitted 5 April, 2022; originally announced April 2022.

Comments: Published in: 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP)

arXiv:2202.12349 [pdf, other]

doi 10.1109/ICASSP43922.2022.9747613

openFEAT: Improving Speaker Identification by Open-set Few-shot Embedding Adaptation with Transformer

Authors: Kishan K C, Zhenning Tan, Long Chen, Minho **, Eunjung Han, Andreas Stolcke, Chul Lee

Abstract: Household speaker identification with few enrollment utterances is an important yet challenging problem, especially when household members share similar voice characteristics and room acoustics. A common embedding space learned from a large number of speakers is not universally applicable for the optimal identification of every speaker in a household. In this work, we first formulate household spe… ▽ More Household speaker identification with few enrollment utterances is an important yet challenging problem, especially when household members share similar voice characteristics and room acoustics. A common embedding space learned from a large number of speakers is not universally applicable for the optimal identification of every speaker in a household. In this work, we first formulate household speaker identification as a few-shot open-set recognition task and then propose a novel embedding adaptation framework to adapt speaker representations from the given universal embedding space to a household-specific embedding space using a set-to-set function, yielding better household speaker identification performance. With our algorithm, Open-set Few-shot Embedding Adaptation with Transformer (openFEAT), we observe that the speaker identification equal error rate (IEER) on simulated households with 2 to 7 hard-to-discriminate speakers is reduced by 23% to 31% relative. △ Less

Submitted 24 February, 2022; originally announced February 2022.

Comments: To appear in Proc. IEEE ICASSP 2022

Journal ref: Proc. IEEE ICASSP, May 2022, pp. 7062-7066

arXiv:2202.05492 [pdf, other]

Entroformer: A Transformer-based Entropy Model for Learned Image Compression

Authors: Yichen Qian, Ming Lin, Xiuyu Sun, Zhiyu Tan, Rong **

Abstract: One critical component in lossy deep image compression is the entropy model, which predicts the probability distribution of the quantized latent representation in the encoding and decoding modules. Previous works build entropy models upon convolutional neural networks which are inefficient in capturing global dependencies. In this work, we propose a novel transformer-based entropy model, termed En… ▽ More One critical component in lossy deep image compression is the entropy model, which predicts the probability distribution of the quantized latent representation in the encoding and decoding modules. Previous works build entropy models upon convolutional neural networks which are inefficient in capturing global dependencies. In this work, we propose a novel transformer-based entropy model, termed Entroformer, to capture long-range dependencies in probability distribution estimation effectively and efficiently. Different from vision transformers in image classification, the Entroformer is highly optimized for image compression, including a top-k self-attention and a diamond relative position encoding. Meanwhile, we further expand this architecture with a parallel bidirectional context model to speed up the decoding process. The experiments show that the Entroformer achieves state-of-the-art performance on image compression while being time-efficient. △ Less

Submitted 14 March, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

Comments: Accepted at ICLR 2022 for poster. Camera ready version

Journal ref: International Conference on Learning Representations (2022)

arXiv:2202.03647 [pdf, other]

Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

Authors: Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, Siqi Zheng, Weilong Huang, Lei Xie, Zheng-Hua Tan, DeLiang Wang, Yanmin Qian, Kong Aik Lee, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu

Abstract: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies. The M2MeT challenge has particularly set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Ma… ▽ More The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies. The M2MeT challenge has particularly set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Mandarin meeting speech data with manual annotation, including far-field data collected by 8-channel microphone array as well as near-field data collected by each participants' headset microphone. We briefly describe the released dataset, track setups, baselines and summarize the challenge results and major techniques used in the submissions. △ Less

Submitted 25 February, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

Comments: Accepted by ICASSP 2022

arXiv:2201.06426 [pdf, ps, other]

On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification

Authors: Achintya kr. Sarkar, Zheng-Hua Tan

Abstract: Deep representation learning has gained significant momentum in advancing text-dependent speaker verification (TD-SV) systems. When designing deep neural networks (DNN) for extracting bottleneck features, key considerations include training targets, activation functions, and loss functions. In this paper, we systematically study the impact of these choices on the performance of TD-SV. For training… ▽ More Deep representation learning has gained significant momentum in advancing text-dependent speaker verification (TD-SV) systems. When designing deep neural networks (DNN) for extracting bottleneck features, key considerations include training targets, activation functions, and loss functions. In this paper, we systematically study the impact of these choices on the performance of TD-SV. For training targets, we consider speaker identity, time-contrastive learning (TCL) and auto-regressive prediction coding with the first being supervised and the last two being self-supervised. Furthermore, we study a range of loss functions when speaker identity is used as the training target. With regard to activation functions, we study the widely used sigmoid function, rectified linear unit (ReLU), and Gaussian error linear unit (GELU). We experimentally show that GELU is able to reduce the error rates of TD-SV significantly compared to sigmoid, irrespective of training target. Among the three training targets, TCL performs the best. Among the various loss functions, cross entropy, joint-softmax and focal loss functions outperform the others. Finally, score-level fusion of different systems is also able to reduce the error rates. Experiments are conducted on the RedDots 2016 challenge database for TD-SV using short utterances. For the speaker classifications, the well-known Gaussian mixture model-universal background model (GMM-UBM) and i-vector techniques are used. △ Less

Submitted 17 January, 2022; originally announced January 2022.

arXiv:2111.10592 [pdf, other]

Deep Spoken Keyword Spotting: An Overview

Authors: Iván López-Espejo, Zheng-Hua Tan, John Hansen, Jesper Jensen

Abstract: Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in te… ▽ More Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in terms of social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS. △ Less

Submitted 20 November, 2021; originally announced November 2021.

arXiv:2111.07759 [pdf, other]

doi 10.1109/ICASSP43922.2022.9746170

Joint Far- and Near-End Speech Intelligibility Enhancement based on the Approximated Speech Intelligibility Index

Authors: Andreas Jonas Fuglsig, Jan Østergaard, Jesper Jensen, Lars Søndergaard Bertelsen, Peter Mariager, Zheng-Hua Tan

Abstract: This paper considers speech enhancement of signals picked up in one noisy environment which must be presented to a listener in another noisy environment. Recently, it has been shown that an optimal solution to this problem requires the consideration of the noise sources in both environments jointly. However, the existing optimal mutual information based method requires a complicated system model t… ▽ More This paper considers speech enhancement of signals picked up in one noisy environment which must be presented to a listener in another noisy environment. Recently, it has been shown that an optimal solution to this problem requires the consideration of the noise sources in both environments jointly. However, the existing optimal mutual information based method requires a complicated system model that includes natural speech variations, and relies on approximations and assumptions of the underlying signal distributions. In this paper, we propose to use a simpler signal model and optimize speech intelligibility based on the Approximated Speech Intelligibility Index (ASII). We derive a closed-form solution to the joint far- and near-end speech enhancement problem that is independent of the marginal distribution of signal coefficients, and that achieves similar performance to existing work. In addition, we do not need to model or optimize for natural speech variations. △ Less

Submitted 15 November, 2021; originally announced November 2021.

arXiv:2111.06099 [pdf]

On Novel Peer Review System for Academic Journals: Experimental Study Based on Social Computing

Authors: Li Liu, Qian Wang, Zong-Yuan Tan, Ning Cai

Abstract: For improving the performance and effectiveness of peer review, a novel review system is proposed, based on analysis of peer review process for academic journals under a parallel model built via Monte Carlo method. The model can simulate the review, application and acceptance activities of the review systems, in a distributed manner. According to simulation experiments on two distinct review syste… ▽ More For improving the performance and effectiveness of peer review, a novel review system is proposed, based on analysis of peer review process for academic journals under a parallel model built via Monte Carlo method. The model can simulate the review, application and acceptance activities of the review systems, in a distributed manner. According to simulation experiments on two distinct review systems respectively, significant advantages manifest for the novel one. △ Less

Submitted 11 November, 2021; originally announced November 2021.

arXiv:2111.02783 [pdf, other]

Radio Sensing with Large Intelligent Surface for 6G

Authors: Cristian J. Vaca-Rubio, Pablo Ramirez-Espinosa, Kimmo Kansanen, Zheng-Hua Tan, Elisabeth de Carvalho

Abstract: This paper leverages the potential of Large Intelligent Surface (LIS) for radio sensing in 6G wireless networks. Major research has been undergone about its communication capabilities but it can be exploited as a formidable tool for radio sensing. By taking advantage of arbitrary communication signals occurring in the scenario, we apply direct processing to the output signal from the LIS to obtain… ▽ More This paper leverages the potential of Large Intelligent Surface (LIS) for radio sensing in 6G wireless networks. Major research has been undergone about its communication capabilities but it can be exploited as a formidable tool for radio sensing. By taking advantage of arbitrary communication signals occurring in the scenario, we apply direct processing to the output signal from the LIS to obtain a radio map that describes the physical presence of passive devices (scatterers, humans) which act as virtual sources due to the communication signal reflections. We then assess the usage of machine learning (k-means clustering), image processing and computer vision (template matching and component labeling) to extract meaningful information from these radio maps. As an exemplary use case, we evaluate this method for both active and passive user detection in an indoor setting. The results show that the presented method has high application potential as we are able to detect around 98% of humans passively and 100% active users by just using communication signals of commodity devices even in quite unfavorable Signal-to-Noise Ratio (SNR) conditions. △ Less

Submitted 3 February, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

arXiv:2109.09280 [pdf, other]

doi 10.1145/3474085.3475698

Interpolation variable rate image compression

Authors: Zhenhong Sun, Zhiyu Tan, Xiuyu Sun, Fangyi Zhang, Yichen Qian, Dongyang Li, Hao Li

Abstract: Compression standards have been used to reduce the cost of image storage and transmission for decades. In recent years, learned image compression methods have been proposed and achieved compelling performance to the traditional standards. However, in these methods, a set of different networks are used for various compression rates, resulting in a high cost in model storage and training. Although s… ▽ More Compression standards have been used to reduce the cost of image storage and transmission for decades. In recent years, learned image compression methods have been proposed and achieved compelling performance to the traditional standards. However, in these methods, a set of different networks are used for various compression rates, resulting in a high cost in model storage and training. Although some variable-rate approaches have been proposed to reduce the cost by using a single network, most of them brought some performance degradation when applying fine rate control. To enable variable-rate control without sacrificing the performance, we propose an efficient Interpolation Variable-Rate (IVR) network, by introducing a handy Interpolation Channel Attention (InterpCA) module in the compression network. With the use of two hyperparameters for rate control and linear interpolation, the InterpCA achieves a fine PSNR interval of 0.001 dB and a fine rate interval of 0.0001 Bits-Per-Pixel (BPP) with 9000 rates in the IVR network. Experimental results demonstrate that the IVR network is the first variable-rate learned method that outperforms VTM 9.0 (intra) in PSNR and Multiscale Structural Similarity (MS-SSIM). △ Less

Submitted 19 September, 2021; originally announced September 2021.

arXiv:2109.02576 [pdf, other]

doi 10.1109/ASRU51503.2021.9687975

Improving Speaker Identification for Shared Devices by Adapting Embeddings to Speaker Subsets

Authors: Zhenning Tan, Yuguang Yang, Eunjung Han, Andreas Stolcke

Abstract: Speaker identification typically involves three stages. First, a front-end speaker embedding model is trained to embed utterance and speaker profiles. Second, a scoring function is applied between a runtime utterance and each speaker profile. Finally, the speaker is identified using nearest neighbor according to the scoring metric. To better distinguish speakers sharing a device within the same ho… ▽ More Speaker identification typically involves three stages. First, a front-end speaker embedding model is trained to embed utterance and speaker profiles. Second, a scoring function is applied between a runtime utterance and each speaker profile. Finally, the speaker is identified using nearest neighbor according to the scoring metric. To better distinguish speakers sharing a device within the same household, we propose a household-adapted nonlinear map** to a low dimensional space to complement the global scoring metric. The combined scoring function is optimized on labeled or pseudo-labeled speaker utterances. With input dropout, the proposed scoring model reduces EER by 45-71% in simulated households with 2 to 7 hard-to-discriminate speakers per household. On real-world internal data, the EER reduction is 49.2%. From t-SNE visualization, we also show that clusters formed by household-adapted speaker embeddings are more compact and uniformly distributed, compared to clusters formed by global embeddings before adaptation. △ Less

Submitted 6 September, 2021; originally announced September 2021.

Comments: Submitted to ASRU 2021

Journal ref: Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2021, pp. 1124-1131

arXiv:2107.12794 [pdf, other]

Short-Term Electricity Price Forecasting based on Graph Convolution Network and Attention Mechanism

Authors: Yuyun Yang, Zhenfei Tan, Haitao Yang, Guangchun Ruan, Haiwang Zhong

Abstract: In electricity markets, locational marginal price (LMP) forecasting is particularly important for market participants in making reasonable bidding strategies, managing potential trading risks, and supporting efficient system planning and operation. Unlike existing methods that only consider LMPs' temporal features, this paper tailors a spectral graph convolutional network (GCN) to greatly improve… ▽ More In electricity markets, locational marginal price (LMP) forecasting is particularly important for market participants in making reasonable bidding strategies, managing potential trading risks, and supporting efficient system planning and operation. Unlike existing methods that only consider LMPs' temporal features, this paper tailors a spectral graph convolutional network (GCN) to greatly improve the accuracy of short-term LMP forecasting. A three-branch network structure is then designed to match the structure of LMPs' compositions. Such kind of network can extract the spatial-temporal features of LMPs, and provide fast and high-quality predictions for all nodes simultaneously. The attention mechanism is also implemented to assign varying importance weights between different nodes and time slots. Case studies based on the IEEE-118 test system and real-world data from the PJM validate that the proposed model outperforms existing forecasting models in accuracy, and maintains a robust performance by avoiding extreme errors. △ Less

Submitted 26 July, 2021; originally announced July 2021.

Comments: Submitted to IET RPG. 9 pages, 15 figures, 6 tables

arXiv:2105.14826 [pdf, other]

PF-Net: Personalized Filter for Speaker Recognition from Raw Waveform

Authors: Wencheng Li, Zhenhua Tan, **gyu Ning, Zhenche Xia, Danke Wu

Abstract: Speaker recognition using i-vector has been replaced by speaker recognition using deep learning. Speaker recognition based on Convolutional Neural Networks (CNNs) has been widely used in recent years, which learn low-level speech representations from raw waveforms. On this basis, a CNN architecture called SincNet proposes a kind of unique convolutional layer, which has achieved band-pass filters.… ▽ More Speaker recognition using i-vector has been replaced by speaker recognition using deep learning. Speaker recognition based on Convolutional Neural Networks (CNNs) has been widely used in recent years, which learn low-level speech representations from raw waveforms. On this basis, a CNN architecture called SincNet proposes a kind of unique convolutional layer, which has achieved band-pass filters. Compared with standard CNNs, SincNet learns the low and high cut-off frequencies of each filter. This paper proposes an improved CNNs architecture called PF-Net, which encourages the first convolutional layer to implement more personalized filters than SincNet. PF-Net parameterizes the frequency domain shape and can realize band-pass filters by learning some deformation points in frequency domain. Compared with standard CNN, PF-Net can learn the characteristics of each filter. Compared with SincNet, PF-Net can learn more characteristic parameters, instead of only low and high cut-off frequencies. This provides a personalized filter bank for different tasks. As a result, our experiments show that the PF-Net converges faster than standard CNN and performs better than SincNet. Our code is available at github.com/TAN-OpenLab/PF-NET. △ Less

Submitted 18 June, 2022; v1 submitted 31 May, 2021; originally announced May 2021.

arXiv:2104.06083 [pdf, other]

Spatiotemporal Entropy Model is All You Need for Learned Video Compression

Authors: Zhenhong Sun, Zhiyu Tan, Xiuyu Sun, Fangyi Zhang, Dongyang Li, Yichen Qian, Hao Li

Abstract: The framework of dominant learned video compression methods is usually composed of motion prediction modules as well as motion vector and residual image compression modules, suffering from its complex structure and error propagation problem. Approaches have been proposed to reduce the complexity by replacing motion prediction modules with implicit flow networks. Error propagation aware training st… ▽ More The framework of dominant learned video compression methods is usually composed of motion prediction modules as well as motion vector and residual image compression modules, suffering from its complex structure and error propagation problem. Approaches have been proposed to reduce the complexity by replacing motion prediction modules with implicit flow networks. Error propagation aware training strategy is also proposed to alleviate incremental reconstruction errors from previously decoded frames. Although these methods have brought some improvement, little attention has been paid to the framework itself. Inspired by the success of learned image compression through simplifying the framework with a single deep neural network, it is natural to expect a better performance in video compression via a simple yet appropriate framework. Therefore, we propose a framework to directly compress raw-pixel frames (rather than residual images), where no extra motion prediction module is required. Instead, an entropy model is used to estimate the spatiotemporal redundancy in a latent space rather than pixel level, which significantly reduces the complexity of the framework. Specifically, the whole framework is a compression module, consisting of a unified auto-encoder which produces identically distributed latents for all frames, and a spatiotemporal entropy estimation model to minimize the entropy of these latents. Experiments showed that the proposed method outperforms state-of-the-art (SOTA) performance under the metric of multiscale structural similarity (MS-SSIM) and achieves competitive results under the metric of PSNR. △ Less

Submitted 13 April, 2021; originally announced April 2021.

Showing 1–50 of 94 results for author: Tan, Z