A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection

Authors: Kyungbok Lee, You Zhang, Zhiyao Duan

Abstract: This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods. This calls for the generalization ability of the method. Additionally, to ensure the credibility of detection methods, it is beneficial for the model to interpret which cues from the video indicate it is fake. Motivated by these considerations, we then propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique. We study the generalization problem of audio-visual deepfake detection by creating a new benchmark by extending and re-splitting the existing FakeAVCeleb dataset. The benchmark contains four categories of fake video (Real Audio-Fake Visual, Fake Audio-Fake Visual, Fake Audio-Real Visual, and unsynchronized video). The experimental results show that our approach improves the model's detection of unseen attacks by an average of 7.31% across four test sets, compared to the baseline model. Additionally, our proposed framework offers interpretability, indicating which modality the model identifies as fake.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.10514 [pdf, other]

GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Authors: Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan

Abstract: Expressive speech synthesis aims to generate speech that captures a wide range of para-linguistic features, including emotion and articulation, though current research primarily emphasizes emotional aspects over the nuanced articulatory features mastered by professional voice actors. Inspired by this, we explore expressive speech synthesis through the lens of articulatory phonetics. Specifically, we define a framework with three dimensions: Glottalization, Tenseness, and Resonance (GTR), to guide the synthesis at the voice production level. With this framework, we record a high-quality speech dataset named GTR-Voice, featuring 20 Chinese sentences articulated by a professional voice actor across 125 distinct GTR combinations. We verify the framework and GTR annotations through automatic classification and listening tests, and demonstrate precise controllability along the GTR dimensions on two fine-tuned expressive TTS models. We open-source the dataset and TTS models.

Submitted 15 June, 2024; originally announced June 2024.

arXiv:2406.10361 [pdf, other]

On Efficient Neural Network Architectures for Image Compression

Authors: Yichi Zhang, Zhihao Duan, Fengqing Zhu

Abstract: Recent advances in learning-based image compression typically come at the cost of high complexity. Designing computationally efficient architectures remains an open challenge. In this paper, we empirically investigate the impact of different network designs in terms of rate-distortion performance and computational complexity. Our experiments involve testing various transforms, including convolutional neural networks and transformers, as well as various context models, including hierarchical, channel-wise, and space-channel context models. Based on the results, we present a series of efficient models, the final model of which has comparable performance to recent best-performing methods but with significantly lower complexity. Extensive experiments provide insights into the design of architectures for learned image compression and potential direction for future research. The code is available at https://gitlab.com/viper-purdue/efficient-compression.

Submitted 14 June, 2024; originally announced June 2024.

Comments: 2024 IEEE International Conference on Image Processing (ICIP2024)

arXiv:2406.02438 [pdf, other]

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

Abstract: Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesized using state-of-the-art methods from publicly accessible singing voice datasets. CtrSVDD includes 47.64 hours of bonafide and 260.34 hours of deepfake singing vocals, spanning 14 deepfake methods and involving 164 singer identities. We also present a baseline system with flexible front-end features, evaluated against a structured train/dev/eval split. The experiments show the importance of feature selection and highlight a need for generalization towards deepfake methods that deviate further from training distribution. The CtrSVDD dataset and baselines are publicly accessible.

Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2405.05244 [pdf, other]

SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

Abstract: The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specialized field requiring focused attention. To promote SVDD research, we recently proposed the "SVDD Challenge," the very first research challenge focusing on SVDD for lab-controlled and in-the-wild bonafide and deepfake singing voice recordings. The challenge will be held in conjunction with the 2024 IEEE Spoken Language Technology Workshop (SLT 2024).

Submitted 8 May, 2024; originally announced May 2024.

Comments: Evaluation plan of the SVDD Challenge @ SLT 2024

arXiv:2404.09466 [pdf, other]

Scoring Intervals using Non-Hierarchical Transformer For Automatic Piano Transcription

Authors: Yujia Yan, Zhiyao Duan

Abstract: The neural semi-Markov Conditional Random Field (semi-CRF) framework has demonstrated promise for event-based piano transcription. In this framework, all events (notes or pedals) are represented as closed intervals tied to specific event types. The neural semi-CRF approach requires an interval scoring matrix that assigns a score for every candidate interval. However, designing an efficient and expressive architecture for scoring intervals is not trivial. In this paper, we introduce a simple method for scoring intervals using scaled inner product operations that resemble how attention scoring is done in transformers. We show theoretically that, due to the special structure from encoding the non-overlapping intervals, under a mild condition, the inner product operations are expressive enough to represent an ideal scoring matrix that can yield the correct transcription result. We then demonstrate that an encoder-only non-hierarchical transformer backbone, operating only on a low-time-resolution feature map, is capable of transcribing piano notes and pedals with high accuracy and time precision. The experiment shows that our approach achieves the new state-of-the-art performance across all subtasks in terms of the F1 measure on the Maestro dataset.

Submitted 23 May, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Comments: Fixed Typos

arXiv:2404.07507 [pdf, other]

Learning to Classify New Foods Incrementally Via Compressed Exemplars

Authors: Justin Yang, Zhihao Duan, Jiangpeng He, Fengqing Zhu

Abstract: Food image classification systems play a crucial role in health monitoring and diet tracking through image-based dietary assessment techniques. However, existing food recognition systems rely on static datasets characterized by a pre-defined fixed number of food classes. This contrasts drastically with the reality of food consumption, which features constantly changing data. Therefore, food image classification systems should adapt to and manage data that continuously evolves. This is where continual learning plays an important role. A challenge in continual learning is catastrophic forgetting, where ML models tend to discard old knowledge upon learning new information. While memory-replay algorithms have shown promise in mitigating this problem by storing old data as exemplars, they are hampered by the limited capacity of memory buffers, leading to an imbalance between new and previously learned data. To address this, our work explores the use of neural image compression to extend buffer size and enhance data diversity. We introduce the concept of continuously learning a neural compression model to adaptively improve the quality of compressed data and optimize the bits per pixel (bpp) to store more exemplars. Our extensive experiments, with evaluations on the food-specific datasets Food-101 and VFN-74 as well as the general dataset ImageNet-100, demonstrate improvements in classification accuracy. This progress is pivotal in advancing more realistic food recognition systems that are capable of adapting to continually evolving data. Moreover, the principles and methodologies we've developed hold promise for broader applications, extending their benefits to other domains of continual machine learning systems.

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.00432 [pdf, other]

doi 10.1109/ICMEW59549.2023.00038

Flexible Variable-Rate Image Feature Compression for Edge-Cloud Systems

Authors: Md Adnan Faisal Hossain, Zhihao Duan, Yuning Huang, Fengqing Zhu

Abstract: Feature compression is a promising direction for coding for machines. Existing methods have made substantial progress, but they require designing and training separate neural network models to meet different specifications of compression rate, performance accuracy and computational complexity. In this paper, a flexible variable-rate feature compression method is presented that can operate on a range of rates by introducing a rate control parameter as an input to the neural network model. By compressing different intermediate features of a pre-trained vision task model, the proposed method can scale the encoding complexity without changing the overall size of the model. The proposed method is more flexible than existing baselines, at the same time outperforming them in terms of the three-way trade-off between feature compression rate, vision task accuracy, and encoding complexity. We have made the source code available at https://github.com/adnan-hossain/var_feat_comp.git.

Submitted 30 March, 2024; originally announced April 2024.

Comments: 6 pages, 7 figures, 1 table, International Conference on Multimedia and Expo Workshops 2023

arXiv:2403.18535 [pdf, other]

Theoretical Bound-Guided Hierarchical VAE for Neural Image Codecs

Authors: Yichi Zhang, Zhihao Duan, Yuning Huang, Fengqing Zhu

Abstract: Recent studies reveal a significant theoretical link between variational autoencoders (VAEs) and rate-distortion theory, notably in utilizing VAEs to estimate the theoretical upper bound of the information rate-distortion function of images. Such estimated theoretical bounds substantially exceed the performance of existing neural image codecs (NICs). To narrow this gap, we propose a theoretical bound-guided hierarchical VAE (BG-VAE) for NIC. The proposed BG-VAE leverages the theoretical bound to guide the NIC model towards enhanced performance. We implement the BG-VAE using Hierarchical VAEs and demonstrate its effectiveness through extensive experiments. Along with advanced neural network blocks, we provide a versatile, variable-rate NIC that outperforms existing methods when considering both rate-distortion performance and computational complexity. The code is available at BG-VAE.

Submitted 27 March, 2024; originally announced March 2024.

Comments: 2024 IEEE International Conference on Multimedia and Expo (ICME2024)

arXiv:2403.10493 [pdf, other]

MusicHiFi: Fast High-Fidelity Stereo Vocoding

Authors: Ge Zhu, Juan-Pablo Caceres, Zhiyao Duan, Nicholas J. Bryan

Abstract: Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsample to high-resolution audio via bandwidth expansion, and upmix to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find our approach yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at https://MusicHiFi.github.io/web/.

Submitted 20 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

arXiv:2402.18862 [pdf, other]

Towards Backward-Compatible Continual Learning of Image Compression

Authors: Zhihao Duan, Ming Lu, Justin Yang, Jiangpeng He, Zhan Ma, Fengqing Zhu

Abstract: This paper explores the possibility of extending the capability of pre-trained neural image compressors (e.g., adapting to new data or target bitrates) without breaking backward compatibility, the ability to decode bitstreams encoded by the original model. We refer to this problem as continual learning of image compression. Our initial findings show that baseline solutions, such as end-to-end fine-tuning, do not preserve the desired backward compatibility. To tackle this, we propose a knowledge replay training strategy that effectively addresses this issue. We also design a new model architecture that enables more effective continual learning than existing baselines. Experiments are conducted for two scenarios: data-incremental learning and rate-incremental learning. The main conclusion of this paper is that neural image compressors can be fine-tuned to achieve better performance (compared to their pre-trained version) on new data and rates without compromising backward compatibility. Our code is available at https://gitlab.com/viper-purdue/continual-compression

Submitted 29 February, 2024; originally announced February 2024.

Comments: Accepted to CVPR 2024

arXiv:2402.15569 [pdf, other]

Toward Fully Self-Supervised Multi-Pitch Estimation

Authors: Frank Cwitkowitz, Zhiyao Duan

Abstract: Multi-pitch estimation is a decades-long research problem involving the detection of pitch activity associated with concurrent musical events within multi-instrument mixtures. Supervised learning techniques have demonstrated solid performance on more narrow characterizations of the task, but suffer from limitations concerning the shortage of large-scale and diverse polyphonic music datasets with multi-pitch annotations. We present a suite of self-supervised learning objectives for multi-pitch estimation, which encourage the concentration of support around harmonics, invariance to timbral transformations, and equivariance to geometric transformations. These objectives are sufficient to train an entirely convolutional autoencoder to produce multi-pitch salience-grams directly, without any fine-tuning. Despite training exclusively on a collection of synthetic single-note audio samples, our fully self-supervised framework generalizes to polyphonic music mixtures, and achieves performance comparable to supervised models trained on conventional multi-pitch datasets.

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.06986 [pdf, other]

Cacophony: An Improved Contrastive Audio-Text Model

Authors: Ge Zhu, Jordan Darefsky, Zhiyao Duan

Abstract: Despite recent advancements in audio-text modeling, audio-text contrastive models still lag behind their image-text counterparts in scale and performance. We propose a method to improve both the scale and the training of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset containing 13,000 hours of text-labeled audio, using pretrained language models to process noisy text descriptions and automatic captioning to obtain text descriptions for unlabeled audio samples. We first train on audio-only data with a masked autoencoder (MAE) objective, which allows us to benefit from the scalability of unlabeled audio datasets. We then, initializing our audio encoder from the MAE model, train a contrastive model with an auxiliary captioning objective. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks, and exhibits competitive results on the HEAR benchmark and other downstream tasks such as zero-shot classification.

Submitted 29 April, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

Comments: Work in Progress

arXiv:2401.11615 [pdf, other]

Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding

Authors: Yichi Zhang, Zhihao Duan, Ming Lu, Dandan Ding, Fengqing Zhu, Zhan Ma

Abstract: While convolution and self-attention are extensively used in learned image compression (LIC) for transform coding, this paper proposes an alternative called Contextual Clustering based LIC (CLIC) which primarily relies on clustering operations and local attention for correlation characterization and compact representation of an image. As seen, CLIC expands the receptive field into the entire image for intra-cluster feature aggregation. Afterward, features are reordered to their original spatial positions to pass through the local attention units for inter-cluster embedding. Additionally, we introduce the Guided Post-Quantization Filtering (GuidedPQF) into CLIC, effectively mitigating the propagation and accumulation of quantization errors at the initial decoding stage. Extensive experiments demonstrate the superior performance of CLIC over state-of-the-art works: when optimized using MSE, it outperforms VVC by about 10% BD-Rate in three widely-used benchmark datasets; when optimized using MS-SSIM, it saves more than 50% BD-Rate over VVC. Our CLIC offers a new way to generate compact representations for image compression, which also provides a novel direction along the line of LIC development.

Submitted 21 January, 2024; originally announced January 2024.

Comments: The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)

arXiv:2401.03363 [pdf, other]

Data-driven Dynamic Event-triggered Control

Authors: Tao Xu, Zhiyong Sun, Guanghui Wen, Zhisheng Duan

Abstract: This paper revisits the event-triggered control problem from a data-driven perspective, where unknown continuous-time linear systems subject to disturbances are taken into account. Using data collected off-line instead of accurate system model information, a data-driven dynamic event-triggered control scheme is developed in this paper. The dynamic property is reflected in the fact that the designed event-triggering function embedded in the event-triggering mechanism (ETM) is dynamically updated as a whole. Thanks to this dynamic design, a strictly positive minimum inter-event time (MIET) is guaranteed without sacrificing control performance. Specifically, exponential input-to-state stability (ISS) of the closed-loop system with respect to disturbances is achieved in this paper, which is superior to some existing results that only guarantee a practical exponential ISS property. The dynamic ETM is easy to implement in practical operation since all designed parameters are determined only by a simple data-driven linear matrix inequality (LMI), without additional complicated conditions as required in relevant literature. As quantization is the most common signal constraint in practice, the developed control scheme is further extended to the case where state transmission is affected by a uniform or logarithmic quantization effect. Finally, adequate simulations are performed to show the validity and superiority of the proposed control schemes.

Submitted 6 January, 2024; originally announced January 2024.

arXiv:2312.15380 [pdf, other]

Battery-Care Resource Allocation and Task Offloading in Multi-Agent Post-Disaster MEC Environment

Authors: Yiwei Tang, Hualong Huang, Wenhan Zhan, Geyong Min, Zhekai Duan, Yuchuan Lei

Abstract: As an up-and-coming application scenario of mobile edge computing (MEC), post-disaster rescue involves numerous computing-intensive tasks but unstable network connectivity. In rescue environments, quality of service (QoS), such as task execution delay, energy consumption and battery state of health (SoH), is of great importance. This paper studies a multi-user post-disaster MEC environment with unstable 5G communication, where device-to-device (D2D) link communication and dynamic voltage and frequency scaling (DVFS) are adopted to balance each user's requirement for task delay and energy consumption. A battery degradation evaluation approach to prolong battery lifetime is also presented. The distributed optimization problem is formulated into a mixed cooperative-competitive (MCC) multi-agent Markov decision process (MAMDP) and is tackled with recurrent multi-agent Proximal Policy Optimization (rMAPPO). Extensive simulations and comprehensive comparisons with other representative algorithms clearly demonstrate the effectiveness of the proposed rMAPPO-based offloading scheme.

Submitted 23 December, 2023; originally announced December 2023.

Comments: Accepted by WCNC 2024

arXiv:2312.07126 [pdf, other]

Deep Hierarchical Video Compression

Authors: Ming Lu, Zhihao Duan, Fengqing Zhu, Zhan Ma

Abstract: Recently, probabilistic predictive coding that directly models the conditional distribution of latent features across successive frames for temporal redundancy removal has yielded promising results. Existing methods using a single-scale Variational AutoEncoder (VAE) must devise complex networks for conditional probability estimation in latent space, neglecting multiscale characteristics of video frames. Instead, this work proposes hierarchical probabilistic predictive coding, for which hierarchical VAEs are carefully designed to characterize multiscale latent features as a family of flexible priors and posteriors to predict the probabilities of future frames. Under such a hierarchical structure, lightweight networks are sufficient for prediction. The proposed method outperforms representative learned video compression models on common testing videos and demonstrates computational friendliness with much less memory footprint and faster encoding/decoding. Extensive experiments on adaptation to temporal patterns also indicate the better generalization of our hierarchical predictive mechanism. Furthermore, our solution is the first to enable progressive decoding that is favored in networked video applications with packet loss.

Submitted 12 December, 2023; originally announced December 2023.

arXiv:2311.14816 [pdf, other]

Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech

Authors: Enting Zhou, You Zhang, Zhiyao Duan

Abstract: Dimensional representations of speech emotions such as the arousal-valence (AV) representation provide a more continuous and fine-grained description and control than their categorical counterparts. They have wide applications in tasks such as dynamic emotion understanding and expressive text-to-speech synthesis. Existing methods that predict the dimensional emotion representation from speech cast it as a supervised regression task. These methods face data scarcity issues, as dimensional annotations are much harder to acquire than categorical labels. In this work, we propose to learn the AV representation from categorical emotion labels of speech. We start by learning a rich and emotion-relevant high-dimensional speech feature representation using self-supervised pre-training and emotion classification fine-tuning. This representation is then mapped to the 2D AV space according to psychological findings through anchored dimensionality reduction. Experiments show that our method achieves a Concordance Correlation Coefficient (CCC) performance comparable to state-of-the-art supervised regression methods on IEMOCAP without leveraging ground-truth AV annotations during training. This validates our proposed approach on AV prediction. Furthermore, visualization of AV predictions on MEAD and EmoDB datasets shows the interpretability of the learned AV representations.

Submitted 6 February, 2024; v1 submitted 24 November, 2023; originally announced November 2023.
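
The abstract above reports results in terms of the Concordance Correlation Coefficient (CCC). As a reading aid, here is a minimal sketch of the standard CCC definition in plain Python; it is not the authors' evaluation code, and the function name and toy inputs are illustrative assumptions.

```python
def concordance_correlation_coefficient(reference, prediction):
    """Standard CCC: 2*cov / (var_ref + var_pred + (mean_ref - mean_pred)^2).

    Uses population (divide-by-n) statistics. Hypothetical helper written
    for illustration, not taken from the paper.
    """
    if len(reference) != len(prediction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_pred = sum(prediction) / n
    var_ref = sum((r - mean_ref) ** 2 for r in reference) / n
    var_pred = sum((p - mean_pred) ** 2 for p in prediction) / n
    covariance = sum(
        (r - mean_ref) * (p - mean_pred) for r, p in zip(reference, prediction)
    ) / n
    denominator = var_ref + var_pred + (mean_ref - mean_pred) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2 * covariance / denominator


# Toy arousal values (made up): a constant offset lowers CCC even though
# the Pearson correlation would still be 1.
shifted_score = concordance_correlation_coefficient([1, 2, 3], [2, 3, 4])
```

Unlike the Pearson correlation, CCC penalizes shifts in mean and scale, which is why it is a common choice for dimensional emotion prediction.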

arXiv:2311.13371 [pdf, other]

A Novel Dynamic Event-triggered Mechanism for Dynamic Average Consensus

Authors: Tao Xu, Zhisheng Duan, Guanghui Wen, Zhiyong Sun

Abstract: This paper studies a challenging issue introduced in a recent survey, namely designing a distributed event-based scheme to solve the dynamic average consensus (DAC) problem. First, a robust adaptive distributed event-based DAC algorithm is designed without imposing specific initialization criteria to perform the estimation task under intermittent communication. Second, a novel adaptive distributed dynamic event-triggered mechanism is proposed to determine the triggering time when neighboring agents broadcast information to each other. Compared to the existing event-triggered mechanisms, the novelty of the proposed dynamic event-triggered mechanism lies in that it guarantees the existence of a positive and uniform minimum inter-event interval without sacrificing any accuracy of the estimation, which is much more practical than only ensuring the exclusion of the Zeno behavior or the boundedness of the estimation error. Third, a composite adaptive law is developed to update the adaptive gain employed in the distributed event-based DAC algorithm and dynamic event-triggered mechanism. Using the composite adaptive update law, the distributed event-based solution proposed in our work is implemented without requiring any global information. Finally, numerical simulations are provided to illustrate the effectiveness of the theoretical results.

Submitted 22 November, 2023; originally announced November 2023.

Comments: 9 pages, 8 figures

arXiv:2311.08667 [pdf, other]

EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Authors: Ge Zhu, Yutong Wen, Marc-André Carbonneau, Zhiyao Duan

Abstract: Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate on the latent domain with cascaded phase recovery modules to reconstruct the waveform. This poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, we achieved a Fréchet audio distance (FAD) score similar to that of the top-ranked baseline with only 10 steps and reached state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also revealed a potential concern regarding diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to the training data. Project page: https://agentcooper2002.github.io/EDMSound/

Submitted 18 November, 2023; v1 submitted 14 November, 2023; originally announced November 2023.

Comments: Accepted at NeurIPS Workshop: Machine Learning for Audio (Camera Ready)

arXiv:2309.09085 [pdf, other]

SynthTab: Leveraging Synthesized Data for Guitar Tablature Transcription

Authors: Yongyi Zang, Yi Zhong, Frank Cwitkowitz, Zhiyao Duan

Abstract: Guitar tablature is a form of music notation widely used among guitarists. It captures not only the musical content of a piece, but also its implementation and ornamentation on the instrument. Guitar Tablature Transcription (GTT) is an important task with broad applications in music education, composition, and entertainment. Existing GTT datasets are quite limited in size and scope, rendering models trained on them prone to overfitting and incapable of generalizing to out-of-domain data. In order to address this issue, we present a methodology for synthesizing large-scale GTT audio using commercial acoustic and electric guitar plugins. We procure SynthTab, a dataset derived from DadaGP, which is a vast and diverse collection of richly annotated symbolic tablature. The proposed synthesis pipeline produces audio which faithfully adheres to the original fingerings and a subset of techniques specified in the tablature, and covers multiple guitars and styles for each track. Experiments show that pre-training a baseline GTT model on SynthTab can improve transcription performance when fine-tuning and testing on an individual dataset. More importantly, cross-dataset experiments show that pre-training significantly mitigates issues with overfitting.

Submitted 24 January, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

Comments: Accepted to ICASSP 2024

arXiv:2309.07525 [pdf, other]

SingFake: Singing Voice Deepfake Detection

Authors: Yongyi Zang, You Zhang, Mojtaba Heydari, Zhiyao Duan

Abstract: The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances. These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/validation/test split where the test sets include various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances. We find these systems lag significantly behind their performance on speech test data. When trained on SingFake, either using separated vocal tracks or song mixtures, these systems show substantial improvement. However, our evaluations also identify challenges associated with unseen singers, communication codecs, languages, and musical contexts, calling for dedicated research into singing voice deepfake detection. The SingFake dataset and related resources are available at https://www.singfake.org/.

Submitted 21 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: Accepted at ICASSP 2024

arXiv:2309.02574 [pdf, other]

An Improved Upper Bound on the Rate-Distortion Function of Images

Authors: Zhihao Duan, Jack Ma, Jiangpeng He, Fengqing Zhu

Abstract: Recent work has shown that Variational Autoencoders (VAEs) can be used to upper-bound the information rate-distortion (R-D) function of images, i.e., the fundamental limit of lossy image compression. In this paper, we report an improved upper bound on the R-D function of images implemented by (1) introducing a new VAE model architecture, (2) applying variable-rate compression techniques, and (3) proposing a novel \ourfunction{} to stabilize training. We demonstrate that at least 30% BD-rate reduction w.r.t. the intra prediction mode in the VVC codec is achievable, suggesting that there is still great potential for improving lossy image compression. Code is made publicly available at https://github.com/duanzhiihao/lossy-vae.

Submitted 5 September, 2023; originally announced September 2023.

Comments: Conference paper at ICIP 2023. The first two authors share equal contributions

arXiv:2307.14547 [pdf, other]

Mitigating Cross-Database Differences for Learning Unified HRTF Representation

Authors: Yutong Wen, You Zhang, Zhiyao Duan

Abstract: Individualized head-related transfer functions (HRTFs) are crucial for accurate sound positioning in virtual auditory displays. As the acoustic measurement of HRTFs is resource-intensive, predicting individualized HRTFs using machine learning models is a promising approach at scale. Training such models requires a unified HRTF representation across multiple databases to utilize their respectively limited samples. However, in addition to differences in the spatial sampling locations, recent studies have shown that, even for the common location, HRTFs across databases manifest consistent differences that make it trivial to tell which databases they come from. This poses a significant challenge for learning a unified HRTF representation across databases. In this work, we first identify the possible causes of these cross-database differences, attributing them to variations in the measurement setup. Then, we propose a novel approach to normalize the frequency responses of HRTFs across databases. We show that HRTFs from different databases cannot be classified by their database after normalization. We further show that these normalized HRTFs can be used to learn a more unified HRTF representation across databases than the prior art. We believe that this normalization approach paves the road to many data-intensive tasks on HRTF modeling.

Submitted 26 July, 2023; originally announced July 2023.

Comments: 5 pages, 4 figures, accepted by IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2023

arXiv:2306.09215 [pdf, other]

On the Effects and Optimal Design of Redundant Sensors in Collaborative State Estimation

Authors: Yunxiao Ren, Zhisheng Duan, Peihu Duan, Ling Shi

Abstract: The existence of redundant sensors in collaborative state estimation is a common occurrence, yet their true significance remains elusive. This paper comprehensively investigates the effects and optimal design of redundant sensors in sensor networks that use Kalman filtering to estimate the state of a random process collaboratively. The paper presents two main results: a theoretical analysis of the effects of redundant sensors and an engineering-oriented optimal design of redundant sensors. In the theoretical analysis, the paper leverages Riccati equations and symplectic matrix theory to unveil the explicit role of redundant sensors in cooperative state estimation. The results unequivocally demonstrate that the addition of redundant sensors enhances the estimation performance of the sensor network, aligning with the principle of "more is better". Moreover, the paper establishes a precise necessary and sufficient condition to assess whether the inclusion of redundant sensors improves the overall estimation performance. Moving towards engineering-oriented design optimization, the paper proposes a novel algorithm to tackle the optimal design problem of redundant sensors, and the convergence of the proposed algorithm is guaranteed. Numerical simulations are provided to demonstrate the results.

Submitted 4 February, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

arXiv:2306.03389 [pdf, other]

doi 10.21437/Interspeech.2023-2039

Phase perturbation improves channel robustness for speech spoofing countermeasures

Authors: Yongyi Zang, You Zhang, Zhiyao Duan

Abstract: In this paper, we aim to address the problem of channel robustness in speech countermeasure (CM) systems, which are used to distinguish synthetic speech from human natural speech. On the basis of two hypotheses, we suggest an approach for perturbing phase information during the training of time-domain CM systems. Communication networks often employ lossy compression codecs that encode only magnitude information, therefore heavily altering phase information. Also, state-of-the-art CM systems rely on phase information to identify spoofed speech. Thus, we believe the information loss in the phase domain induced by lossy compression codecs degrades performance on unseen channels. We first establish the dependence of time-domain CM systems on phase information by perturbing phase in evaluation, showing strong degradation. Then, we demonstrate that perturbing phase during training leads to a significant performance improvement, whereas perturbing magnitude leads to further degradation.

Submitted 6 October, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

Comments: 5 pages; Proceedings of Interspeech 2023

arXiv:2306.02372 [pdf, other]

SingNet: A Real-time Singing Voice Beat and Downbeat Tracking System

Authors: Mojtaba Heydari, Ju-Chiang Wang, Zhiyao Duan

Abstract: Singing voice beat and downbeat tracking has several applications in automatic music production, analysis, and manipulation. Among them, some require real-time processing, such as live performance processing and auto-accompaniment for singing inputs. This task is challenging owing to the non-trivial rhythmic and harmonic patterns in singing signals. Real-time processing introduces further constraints, such as the inaccessibility of future data and the impossibility of correcting previous results that are inconsistent with later ones. In this paper, we introduce the first system that tracks the beats and downbeats of singing voices in real time. Specifically, we propose a novel dynamic particle filtering approach that incorporates offline historical data to correct the online inference by using a variable number of particles. We evaluate the performance on two datasets: GTZAN with the separated vocal tracks, and an in-house dataset with the original vocal stems. Experimental results demonstrate that our proposed approach outperforms the baseline by 3-5%.

Submitted 4 June, 2023; originally announced June 2023.

Comments: Accepted for 2023 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2023)

arXiv:2305.12755 [pdf, other]

GNCformer Enhanced Self-attention for Automatic Speech Recognition

Authors: J. Li, Z. Duan, S. Li, X. Yu, G. Yang

Abstract: In this paper, an Enhanced Self-Attention (ESA) mechanism is put forward for robust feature extraction. The proposed ESA integrates recursive gated convolution with the self-attention mechanism. In particular, the former is used to capture multi-order feature interaction and the latter is used for global feature extraction. In addition, the location that is suitable for inserting the ESA is also worth exploring. In this paper, the ESA is embedded into the encoder layer of the Transformer network for automatic speech recognition (ASR) tasks, and this newly proposed model is named GNCformer. The effectiveness of the GNCformer has been validated using two datasets, Aishell-1 and HKUST. Experimental results show that, compared with the Transformer network, CER improvements of 0.8% and 1.2%, respectively, can be achieved on these two datasets. It is worth mentioning that only 1.4M additional parameters are involved in our proposed GNCformer.

Submitted 22 May, 2023; originally announced May 2023.

Comments: 5 pages, 3 figures
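
The entry above reports improvements in character error rate (CER). For readers unfamiliar with the metric, the following is a minimal sketch of the usual definition, the Levenshtein edit distance between the reference and the hypothesis divided by the reference length. The function names and example strings are illustrative assumptions, not the paper's scoring script.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (substitutions, insertions, deletions cost 1)."""
    # previous_row[j] = distance between the reference prefix seen so far
    # and the first j symbols of the hypothesis.
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            current_row.append(
                min(
                    previous_row[j] + 1,      # deletion from the reference
                    current_row[j - 1] + 1,   # insertion into the reference
                    previous_row[j - 1] + (ref_symbol != hyp_symbol),  # substitution
                )
            )
        previous_row = current_row
    return previous_row[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters."""
    if len(reference) == 0:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of four gives a CER of 25%.
example_cer = character_error_rate("abcd", "abxd")
```

Applying the same functions to lists of words instead of strings gives the word error rate (WER).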

arXiv:2304.04991 [pdf, other]

Sim-T: Simplify the Transformer Network by Multiplexing Technique for Speech Recognition

Authors: Guangyong Wei, Zhikui Duan, Shiren Li, Guangguang Yang, Xinmei Yu, Junhua Li

Abstract: In recent years, a great deal of attention has been paid to the Transformer network for speech recognition tasks due to its excellent model performance. However, the Transformer network always involves heavy computation and a large number of parameters, causing serious deployment problems in devices with limited computation resources or storage memory. In this paper, a new lightweight model called Sim-T is proposed to expand the generality of the Transformer model. With the help of the newly developed multiplexing technique, Sim-T can efficiently compress the model with negligible sacrifice in performance. To be more precise, the proposed technique includes two parts: module weight multiplexing and attention score multiplexing. Moreover, a novel decoder structure is proposed to facilitate the attention score multiplexing. Extensive experiments have been conducted to validate the effectiveness of Sim-T. On the Aishell-1 dataset, when the proposed Sim-T has 48% fewer parameters than the baseline Transformer, a 0.4% CER improvement can be obtained. Alternatively, a 69% parameter reduction can be achieved if Sim-T gives the same performance as the baseline Transformer. On the HKUST and WSJ eval92 datasets, CER and WER are improved by 0.3% and 0.2%, respectively, when Sim-T has 40% fewer parameters than the baseline Transformer.

Submitted 11 April, 2023; originally announced April 2023.

arXiv:2303.08575 [pdf, other]

Observation of Periodic Systems: Bridge Centralized Kalman Filtering and Consensus-Based Distributed Filtering

Authors: Jiachen Qian, Zhisheng Duan, Peihu Duan, Zhongkui Li

Abstract: Compared with linear time-invariant systems, linear periodic systems can describe the periodic processes arising from nature and engineering more precisely. However, the time-varying system parameters increase the difficulty of research on periodic systems, such as stabilization and observation. This paper aims to consider the observation problem of periodic systems by bridging two fundamental filtering algorithms for periodic systems with a sensor network: consensus-on-measurement-based distributed filtering (CMDF) and centralized Kalman filtering (CKF). Firstly, one mild convergence condition based on uniformly collective observability is established for CMDF, under which the filtering performance of CMDF can be formulated as a symmetric periodic positive semidefinite (SPPS) solution to a discrete-time periodic Lyapunov equation. Then, the closed form of the performance gap between CMDF and CKF is presented in terms of the information fusion steps and the consensus weights of the network. Moreover, it is pointed out that the estimation error covariance of CMDF exponentially converges to the centralized one as the number of fusion steps tends to infinity. Altogether, these new results establish a concise and specific relationship between distributed and centralized filtering, and formulate the trade-off between the communication cost and distributed filtering performance on periodic systems. Finally, the theoretical results are verified with numerical experiments.

Submitted 15 March, 2023; originally announced March 2023.

Comments: arXiv admin note: text overlap with arXiv:2112.06395

arXiv:2303.06475 [pdf, other]

Transcription free filler word detection with Neural semi-CRFs

Authors: Ge Zhu, Yujia Yan, Juan-Pablo Caceres, Zhiyao Duan

Abstract: Non-linguistic filler words, such as "uh" or "um", are prevalent in spontaneous speech and serve as indicators for expressing hesitation or uncertainty. Previous works for detecting certain non-linguistic filler words are highly dependent on transcriptions from a well-established commercial automatic speech recognition (ASR) system. However, certain ASR systems are not universally accessible from many aspects, e.g., budget, target languages, and computational power. In this work, we investigate a filler word detection system that does not depend on ASR systems. We show that, by using the structured state space sequence model (S4) and neural semi-Markov conditional random fields (semi-CRFs), we achieve an absolute F1 improvement of 6.4% (segment level) and 3.1% (event level) on the PodcastFillers dataset. We also conduct a qualitative analysis on the detected results to analyze the limitations of our proposed system.

Submitted 11 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2302.08899 [pdf, other]

doi 10.1109/TPAMI.2023.3322904

QARV: Quantization-Aware ResNet VAE for Lossy Image Compression

Authors: Zhihao Duan, Ming Lu, Jack Ma, Yuning Huang, Zhan Ma, Fengqing Zhu

Abstract: This paper addresses the problem of lossy image compression, a fundamental problem in image processing and information theory that is involved in many real-world applications. We start by reviewing the framework of variational autoencoders (VAEs), a powerful class of generative probabilistic models that has a deep connection to lossy compression. Based on VAEs, we develop a novel scheme for lossy image compression, which we name quantization-aware ResNet VAE (QARV). Our method incorporates a hierarchical VAE architecture integrated with test-time quantization and quantization-aware training, without which efficient entropy coding would not be possible. In addition, we design the neural network architecture of QARV specifically for fast decoding and propose an adaptive normalization operation for variable-rate compression. Extensive experiments are conducted, and results show that QARV achieves variable-rate compression, high-speed decoding, and a better rate-distortion performance than existing baseline methods. The code of our method is publicly accessible at https://github.com/duanzhiihao/lossy-vae

Submitted 1 December, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

Comments: Full version (19 pages, includes appendix) of the paper accepted by IEEE TPAMI

arXiv:2211.11247 [pdf, ps, other]

Harmonic-Coupled Riccati Equations and its Applications in Distributed Filtering

Authors: Jiachen Qian, Peihu Duan, Zhisheng Duan, Ling Shi

Abstract: The coupled Riccati equations consist of multiple Riccati-like equations with solutions coupled with each other, which can be applied to depict the properties of more complex systems such as Markovian systems or multi-agent systems. This paper formulates and investigates a new kind of coupled Riccati equations, called harmonic-coupled Riccati equations (HCRE), from the matrix iterative law of the consensus on information-based distributed filtering (CIDF) algorithm proposed in [1], where the solutions of the equations are coupled with harmonic means. Firstly, mild conditions for the existence and uniqueness of the solution to HCRE are derived from collective observability and primitivity of the weighting matrix. Then, it is proved that the matrix iterative law of CIDF will converge to the unique solution of the corresponding HCRE, and hence can be used to obtain the solution to HCRE. Moreover, through applying the novel theory of HCRE, it is pointed out that the real estimation error covariance of CIDF will also reach a steady state, and the convergent value is simplified as the solution to a discrete-time Lyapunov equation (DLE). Altogether, these new results develop the theory of the coupled Riccati equations, and provide a novel perspective on the performance analysis of the CIDF algorithm, which sufficiently reduces the conservativeness of the evaluation techniques in the literature. Finally, the theoretical results are verified with numerical experiments.

Submitted 12 July, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: 14 pages, 4 figures

arXiv:2211.09897 [pdf, other]

Efficient Feature Compression for Edge-Cloud Systems

Authors: Zhihao Duan, Fengqing Zhu

Abstract: Optimizing computation in an edge-cloud system is an important yet challenging problem. In this paper, we consider a three-way trade-off between bit rate, classification accuracy, and encoding complexity in an edge-cloud image classification system. Our method includes a new training strategy and an efficient encoder architecture to improve the rate-accuracy performance. Our design can also be easily scaled according to different computation resources on the edge device, taking a step towards achieving a rate-accuracy-complexity (RAC) trade-off. Under various settings, our feature coding system consistently outperforms previous methods in terms of the RAC performance.

Submitted 17 November, 2022; originally announced November 2022.

Comments: Picture Coding Symposium (PCS) 2022

arXiv:2211.02718 [pdf, other]

SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing

Authors: Siwen Ding, You Zhang, Zhiyao Duan

Abstract: Voice anti-spoofing systems are crucial auxiliaries for automatic speaker verification (ASV) systems. A major challenge is caused by unseen attacks empowered by advanced speech synthesis technologies. Our previous research on one-class learning has improved the generalization ability to unseen attacks by compacting the bona fide speech in the embedding space. However, such compactness lacks consideration of the diversity of speakers. In this work, we propose speaker attractor multi-center one-class learning (SAMO), which clusters bona fide speech around a number of speaker attractors and pushes away spoofing attacks from all the attractors in a high-dimensional embedding space. For training, we propose an algorithm for the co-optimization of bona fide speech clustering and bona fide/spoof classification. For inference, we propose strategies to enable anti-spoofing for speakers without enrollment. Our proposed system outperforms existing state-of-the-art single systems with a relative improvement of 38% on equal error rate (EER) on the ASVspoof2019 LA evaluation set.

Submitted 4 November, 2022; originally announced November 2022.
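
The entry above reports its result as an equal error rate (EER). As a reading aid, the sketch below shows one simple way to estimate EER from detection scores by sweeping every observed score as a threshold and picking the point where the false acceptance and false rejection rates are closest. It assumes higher scores mean "more likely bona fide"; the function name and the toy scores are made up for illustration and this is not the ASVspoof scoring tool.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate EER by a threshold sweep over all observed scores.

    A trial is accepted as bona fide when its score is >= threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap = float("inf")
    eer = None
    for threshold in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy scores (made up): one bona fide trial and one spoofed trial overlap.
toy_eer = equal_error_rate([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.3, 0.7])
```

With few trials this sweep is only a coarse estimate; production scoring tools typically interpolate between operating points.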

arXiv:2210.17313 [pdf, ps, other]

Discrete Communication and Control Updating in Event-Triggered Consensus

Authors: Bin Cheng, Yuezu Lv, Zhongkui Li, Zhisheng Duan

Abstract: This paper studies the consensus control problem faced with three essential demands, namely, discrete control updating for each agent, discrete-time communications among neighboring agents, and the fully distributed fashion of the controller implementation without requiring any global information of the whole network topology. Noting that existing related results, which meet at most one or two of these demands, are essentially not applicable, in this paper we establish a novel framework to solve the problem of fully distributed consensus with discrete communication and control. The first key point in this framework is the design of controllers that are only updated at discrete event instants and do not depend on global information by introducing time-varying gains inspired by the adaptive control technique. Another key point is the invention of novel dynamic triggering functions that are independent of relative information among neighboring agents. Under the established framework, we propose fully distributed state-feedback event-triggered protocols for undirected graphs and also further study the more complex cases of output-feedback control and directed graphs. Finally, numerical examples are provided to verify the effectiveness of the proposed event-triggered protocols.

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2210.15196 [pdf, other]

HRTF Field: Unifying Measured HRTF Magnitude Representation with Neural Fields

Authors: You Zhang, Yuxiang Wang, Zhiyao Duan

Abstract: Head-related transfer functions (HRTFs) are a set of functions describing the spatial filtering effect of the outer ear (i.e., torso, head, and pinnae) onto sound sources at different azimuth and elevation angles. They are widely used in spatial audio rendering. While the azimuth and elevation angles are intrinsically continuous, measured HRTFs in existing datasets employ different spatial sampling schemes, making it difficult to model HRTFs across datasets. In this work, we propose to use neural fields, a differentiable representation of functions through neural networks, to model HRTFs with arbitrary spatial sampling schemes. Such representation is unified across datasets with different spatial sampling schemes. HRTFs for arbitrary azimuth and elevation angles can be derived from this representation. We further introduce a generative model named HRTF field to learn the latent space of the HRTF neural fields across subjects. We demonstrate promising performance on HRTF interpolation and generation tasks and point out potential future work.

Submitted 23 February, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: 5 pages, accepted by ICASSP 2023

arXiv:2210.06696 [pdf, other]

CPSAA: Accelerating Sparse Attention using Crossbar-based Processing-In-Memory Architecture

Authors: Huize Li, Hai Jin, Long Zheng, Yu Huang, Xiaofei Liao, Dan Chen, Zhuohui Duan, Cong Liu, Jiahong Xu, Chuanyi Gui

Abstract: The attention mechanism requires huge computational efforts to process unnecessary calculations, significantly limiting the system's performance. Researchers propose sparse attention to convert some DDMM operations to SDDMM and SpMM operations. However, current sparse attention solutions introduce massive off-chip random memory access. We propose CPSAA, a novel crossbar-based PIM-featured sparse attention accelerator. First, we present a novel attention calculation mode. Second, we design a novel PIM-based sparsity pruning architecture. Finally, we present novel crossbar-based methods. Experimental results show that CPSAA has an average of 89.6X, 32.2X, 17.8X, 3.39X, and 3.84X performance improvement and 755.6X, 55.3X, 21.3X, 5.7X, and 4.9X energy saving when compared with GPU, FPGA, SANGER, ReBERT, and ReTransformer, respectively.

Submitted 7 October, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: 14 pages, 19 figures

arXiv:2210.02700 [pdf, other]

Minimal-order Appointed-time Unknown Input Observers: Design and Applications

Authors: Yuezu Lv, Zhongkui Li, Zhisheng Duan

Abstract: This paper presents a framework for minimal-order appointed-time unknown input observers for linear systems based on the pairwise observer structure. A minimal-order appointed-time observer is first proposed for the linear system without the unknown input, which can estimate the state exactly at the preset time by seeking the unique solution of a system of linear equations. To further relieve the computational burden, another form of the appointed-time observer is designed. For the general linear system with the unknown input acting on both the system dynamics and the measured output, the model reconfiguration is made to decouple the effect of the unknown input, and the gap between the existing reduced-order appointed-time unknown input observer and the possible minimal-order appointed-time observer is revealed. Based on the reconstructed model, the minimal-order appointed-time unknown input observer is presented to realize state estimation of linear systems with the unknown input at an arbitrarily small preset time. The minimal-order appointed-time unknown input observer is then applied to the design of fully distributed adaptive output-feedback attack-free consensus protocols for linear multi-agent systems.

Submitted 6 October, 2022; originally announced October 2022.

arXiv:2209.11866 [pdf, other]

doi 10.21437/Interspeech.2023-1788

ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed

Authors: Meiying Chen, Zhiyao Duan

Abstract: Recent developments in neural speech synthesis and vocoding have sparked a renewed interest in voice conversion (VC). Beyond timbre transfer, achieving controllability on para-linguistic parameters such as pitch and speed is critical in deploying VC systems in many application scenarios. Existing studies, however, either only provide utterance-level global control or lack interpretability on the controls. In this paper, we propose ControlVC, the first neural voice conversion system that achieves time-varying controls on pitch and speed. ControlVC uses pre-trained encoders to compute pitch and linguistic embeddings from the source utterance and speaker embeddings from the target utterance. These embeddings are then concatenated and converted to speech using a vocoder. It achieves speed control through TD-PSOLA pre-processing on the source utterance, and achieves pitch control by manipulating the pitch contour before feeding it to the pitch encoder. Systematic subjective and objective evaluations are conducted to assess the speech quality and controllability. Results show that, on non-parallel and zero-shot conversion tasks, ControlVC significantly outperforms two other self-constructed baselines on speech quality, and it can successfully achieve time-varying pitch and speed control.

Submitted 11 January, 2024; v1 submitted 23 September, 2022; originally announced September 2022.

Comments: Audio samples: https://bit.ly/3PsrKLJ; Code: https://github.com/MelissaChen15/control-vc

arXiv:2208.14578 [pdf, other]

Singing Beat Tracking With Self-supervised Front-end and Linear Transformers

Authors: Mojtaba Heydari, Zhiyao Duan

Abstract: Tracking beats of singing voices without the presence of musical accompaniment can find many applications in music production, automatic song arrangement, and social media interaction. Its main challenge is the lack of strong rhythmic and harmonic patterns that are important for music rhythmic analysis in general. Even for human listeners, this can be a challenging task. As a result, existing music beat tracking systems fail to deliver satisfactory performance on singing voices. In this paper, we propose singing beat tracking as a novel task, and propose the first approach to solving this task. Our approach leverages semantic information of singing voices by employing pre-trained self-supervised WavLM and DistilHuBERT speech representations as the front-end and uses a self-attention encoder layer to predict beats. To train and test the system, we obtain separated singing voices and their beat annotations using source separation and beat tracking on complete songs, followed by manual corrections. Experiments on the 741 separated vocal tracks of the GTZAN dataset show that the proposed system outperforms several state-of-the-art music beat tracking methods by a large margin in terms of beat tracking accuracy. Ablation studies also confirm the advantages of pre-trained self-supervised speech representations over generic spectral features.

Submitted 30 August, 2022; originally announced August 2022.

Comments: 23rd International Society for Music Information Retrieval Conference (ISMIR 2022)

arXiv:2208.13056 [pdf, other]

doi 10.1109/WACV56688.2023.00028

Lossy Image Compression with Quantized Hierarchical VAEs

Authors: Zhihao Duan, Ming Lu, Zhan Ma, Fengqing Zhu

Abstract: Recent research has shown a strong theoretical connection between variational autoencoders (VAEs) and the rate-distortion theory. Motivated by this, we consider the problem of lossy image compression from the perspective of generative modeling. Starting with ResNet VAEs, which are originally designed for data (image) distribution modeling, we redesign their latent variable model using a quantization-aware posterior and prior, enabling easy quantization and entropy coding at test time. Along with improved neural network architecture, we present a powerful and efficient model that outperforms previous methods on natural image lossy compression. Our model compresses images in a coarse-to-fine fashion and supports parallel encoding and decoding, leading to fast execution on GPUs. Code is available at https://github.com/duanzhiihao/lossy-vae.

Submitted 25 March, 2023; v1 submitted 27 August, 2022; originally announced August 2022.

Comments: WACV 2023 Best Algorithms Paper Award, revised version

arXiv:2207.14352 [pdf, other]

Predicting Global Head-Related Transfer Functions From Scanned Head Geometry Using Deep Learning and Compact Representations

Authors: Yuxiang Wang, You Zhang, Zhiyao Duan, Mark Bocko

Abstract: In the growing field of virtual auditory display, personalized head-related transfer functions (HRTFs) play a vital role in establishing an accurate sound image. In this work, we propose an HRTF personalization method employing convolutional neural networks (CNN) to predict a subject's HRTFs for all directions from their scanned head geometry. To ease the training of the CNN models, we propose novel pre-processing methods for both the head scans and HRTF data to achieve compact representations. For the head scan, we use truncated spherical cap harmonic (SCH) coefficients to represent the pinna area, which is important in the acoustic scattering process. For the HRTF data, we use truncated spherical harmonic (SH) coefficients to represent the HRTF magnitudes and onsets. One CNN model is trained to predict the SH coefficients of the HRTF magnitudes from the SCH coefficients of the scanned ear geometry and other anthropometric measurements of the head. The other CNN model is trained to predict SH coefficients of the HRTF onsets from only the anthropometric measurements of the ear, head, and torso. Combining the magnitude and onset predictions, our method is able to predict the complete and global HRTF data. A leave-one-out validation with the log-spectral distortion (LSD) metric is used for objective evaluation. The results show a decent LSD level at both spatial and temporal dimensions compared to the ground-truth HRTFs and a lower LSD than the boundary element method (BEM) simulation of HRTFs that the database provides. The localization simulation results with an auditory model are also consistent with the objective evaluation metrics, showing the localization responses with our predicted HRTFs are significantly better than with the BEM calculated ones.

Submitted 28 July, 2022; originally announced July 2022.

Comments: 11 pages, 14 figures

arXiv:2206.10421 [pdf, other]

Rethinking Audio-visual Synchronization for Active Speaker Detection

Authors: Abudukelimu Wuerkaixi, You Zhang, Zhiyao Duan, Changshui Zhang

Abstract: Active speaker detection (ASD) systems are important modules for analyzing multi-talker conversations. They aim to detect which speakers, if any, are talking in a visual scene at any given time. Existing research on ASD does not agree on the definition of active speakers. We clarify the definition in this work and require synchronization between the audio and visual speaking activities. This clarification of definition is motivated by our extensive experiments, through which we discover that existing ASD methods fail in modeling the audio-visual synchronization and often classify unsynchronized videos as active speaking. To address this problem, we propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue. Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.

Submitted 10 July, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

Comments: Accepted by IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2022)

arXiv:2206.06784 [pdf, other]

Stochastic Event-triggered Variational Bayesian Filtering

Authors: Xiaoxu Lv, Peihu Duan, Zhisheng Duan, Guanrong Chen, Ling Shi

Abstract: This paper proposes an event-triggered variational Bayesian filter for remote state estimation with unknown and time-varying noise covariances. After presetting multiple nominal process noise covariances and an initial measurement noise covariance, a variational Bayesian method and a fixed-point iteration method are utilized to jointly estimate the posterior state vector and the unknown noise covariances under a stochastic event-triggered mechanism. The proposed algorithm ensures low communication loads and excellent estimation performances for a wide range of unknown noise covariances. Finally, the performance of the proposed algorithm is demonstrated by tracking simulations of a vehicle.

Submitted 14 June, 2022; originally announced June 2022.

arXiv:2205.01686 [pdf, other]

Smart City Intersections: Intelligence Nodes for Future Metropolises

Authors: Zoran Kostić, Alex Angus, Zhengye Yang, Zhuoxu Duan, Ivan Seskar, Gil Zussman, Dipankar Raychaudhuri

Abstract: Traffic intersections are the most suitable locations for the deployment of computing, communications, and intelligence services for smart cities of the future. The abundance of data to be collected and processed, in combination with privacy and security concerns, motivates the use of the edge-computing paradigm which aligns well with physical intersections in metropolises. This paper focuses on high-bandwidth, low-latency applications, and in that context it describes: (i) system design considerations for smart city intersection intelligence nodes; (ii) key technological components including sensors, networking, edge computing, low latency design, and AI-based intelligence; and (iii) applications such as privacy preservation, cloud-connected vehicles, a real-time "radar-screen", traffic management, and monitoring of pedestrian behavior during pandemics. The results of the experimental studies performed on the COSMOS testbed located in New York City are illustrated. Future challenges in designing human-centered smart city intersections are summarized.

Submitted 13 May, 2022; v1 submitted 3 May, 2022; originally announced May 2022.

arXiv:2204.09079 [pdf, other]

doi 10.1109/LSP.2022.3219355

Music Source Separation with Generative Flow

Authors: Ge Zhu, Jordan Darefsky, Fei Jiang, Anton Selitskiy, Zhiyao Duan

Abstract: Fully-supervised models for source separation are trained on parallel mixture-source data and are currently state-of-the-art. However, such parallel data is often difficult to obtain, and it is cumbersome to adapt trained models to mixtures with new sources. Source-only supervised models, in contrast, only require individual source data for training. In this paper, we first leverage flow-based generators to train individual music source priors and then use these models, along with likelihood-based objectives, to separate music mixtures. We show that in singing voice separation and music separation tasks, our proposed method is competitive with a fully-supervised approach. We also demonstrate that we can flexibly add new types of sources, whereas fully-supervised approaches would require retraining of the entire model.

Submitted 16 October, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

Comments: Accepted by Signal Processing Letters

arXiv:2204.08094 [pdf, other]

A Data-Driven Methodology for Considering Feasibility and Pairwise Likelihood in Deep Learning Based Guitar Tablature Transcription Systems

Authors: Frank Cwitkowitz, Jonathan Driedger, Zhiyao Duan

Abstract: Guitar tablature transcription is an important but understudied problem within the field of music information retrieval. Traditional signal processing approaches offer only limited performance on the task, and there is little acoustic data with transcription labels for training machine learning models. However, guitar transcription labels alone are more widely available in the form of tablature, which is commonly shared among guitarists online. In this work, a collection of symbolic tablature is leveraged to estimate the pairwise likelihood of notes on the guitar. The output layer of a baseline tablature transcription model is reformulated, such that an inhibition loss can be incorporated to discourage the co-activation of unlikely note pairs. This naturally enforces playability constraints for guitar, and yields tablature which is more consistent with the symbolic data used to estimate pairwise likelihoods. With this methodology, we show that symbolic tablature can be used to shape the distribution of a tablature transcription model's predictions, even when little acoustic data is available.

Submitted 17 April, 2022; originally announced April 2022.

Comments: Sound and Music Computing Conference (SMC) 2022

arXiv:2202.13209 [pdf, other]

Opening the Black Box of Learned Image Coders

Authors: Zhihao Duan, Ming Lu, Zhan Ma, Fengqing Zhu

Abstract: End-to-end learned lossy image coders (LICs), as opposed to hand-crafted image codecs, have shown increasing superiority in terms of the rate-distortion performance. However, they are mainly treated as black-box systems and their interpretability is not well studied. In this paper, we show that LICs learn a set of basis functions to transform the input image into a compact representation in the latent space, analogous to the orthogonal transforms used in image coding standards. Our analysis provides insights to help understand how learned image coders work and could benefit future design and development.

Submitted 14 October, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

arXiv:2202.05253 [pdf, other]

A Probabilistic Fusion Framework for Spoofing Aware Speaker Verification

Authors: You Zhang, Ge Zhu, Zhiyao Duan

Abstract: The performance of automatic speaker verification (ASV) systems could be degraded by voice spoofing attacks. Most existing works aimed to develop standalone spoofing countermeasure (CM) systems. Relatively little work has targeted developing an integrated spoofing aware speaker verification (SASV) system. In the recent SASV challenge, the organizers encourage the development of such integration by releasing official protocols and baselines. In this paper, we build a probabilistic framework for fusing the ASV and CM subsystem scores. We further propose fusion strategies for direct inference and fine-tuning to predict the SASV score based on the framework. Surprisingly, these strategies significantly improve the SASV equal error rate (EER) from 19.31% of the baseline to 1.53% on the official evaluation trials of the SASV challenge. We verify the effectiveness of our proposed components through ablation studies and provide insights with score distribution analysis.

Submitted 24 April, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

Comments: 8 pages, 5 figures, to appear in Odyssey 2022

Showing 1–50 of 81 results for author: Duan, Z