Showing 1–50 of 81 results for author: Duan, Z

Searching in archive eess.

  1. arXiv:2406.14176  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection

    Authors: Kyungbok Lee, You Zhang, Zhiyao Duan

    Abstract: This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods. This calls for the generalization ability of the method. Additionally, to ensure the credibility of detection methods, it is beneficial for t… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  2. arXiv:2406.10514  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

    Authors: Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan

    Abstract: Expressive speech synthesis aims to generate speech that captures a wide range of para-linguistic features, including emotion and articulation, though current research primarily emphasizes emotional aspects over the nuanced articulatory features mastered by professional voice actors. Inspired by this, we explore expressive speech synthesis through the lens of articulatory phonetics. Specifically,… ▽ More

    Submitted 15 June, 2024; originally announced June 2024.

  3. arXiv:2406.10361  [pdf, other]

    eess.IV

    On Efficient Neural Network Architectures for Image Compression

    Authors: Yichi Zhang, Zhihao Duan, Fengqing Zhu

    Abstract: Recent advances in learning-based image compression typically come at the cost of high complexity. Designing computationally efficient architectures remains an open challenge. In this paper, we empirically investigate the impact of different network designs in terms of rate-distortion performance and computational complexity. Our experiments involve testing various transforms, including convolutio… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: 2024 IEEE International Conference on Image Processing (ICIP2024)

  4. arXiv:2406.02438  [pdf, other]

    eess.AS cs.MM cs.SD

    CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

    Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

    Abstract: Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesi… ▽ More

    Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  5. arXiv:2405.05244  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

    Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

    Abstract: The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specializ… ▽ More

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: Evaluation plan of the SVDD Challenge @ SLT 2024

  6. arXiv:2404.09466  [pdf, other]

    cs.SD cs.LG eess.AS

    Scoring Intervals using Non-Hierarchical Transformer For Automatic Piano Transcription

    Authors: Yujia Yan, Zhiyao Duan

    Abstract: The neural semi-Markov Conditional Random Field (semi-CRF) framework has demonstrated promise for event-based piano transcription. In this framework, all events (notes or pedals) are represented as closed intervals tied to specific event types. The neural semi-CRF approach requires an interval scoring matrix that assigns a score for every candidate interval. However, designing an efficient and exp… ▽ More

    Submitted 23 May, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

    Comments: Fixed Typos

  7. arXiv:2404.07507  [pdf, other]

    eess.IV cs.CV

    Learning to Classify New Foods Incrementally Via Compressed Exemplars

    Authors: Justin Yang, Zhihao Duan, Jiangpeng He, Fengqing Zhu

    Abstract: Food image classification systems play a crucial role in health monitoring and diet tracking through image-based dietary assessment techniques. However, existing food recognition systems rely on static datasets characterized by a pre-defined fixed number of food classes. This contrasts drastically with the reality of food consumption, which features constantly changing data. Therefore, food image… ▽ More

    Submitted 11 April, 2024; originally announced April 2024.

  8. Flexible Variable-Rate Image Feature Compression for Edge-Cloud Systems

    Authors: Md Adnan Faisal Hossain, Zhihao Duan, Yuning Huang, Fengqing Zhu

    Abstract: Feature compression is a promising direction for coding for machines. Existing methods have made substantial progress, but they require designing and training separate neural network models to meet different specifications of compression rate, performance accuracy and computational complexity. In this paper, a flexible variable-rate feature compression method is presented that can operate on a ran… ▽ More

    Submitted 30 March, 2024; originally announced April 2024.

    Comments: 6 pages, 7 figures, 1 table, International Conference on Multimedia and Expo Workshops 2023

  9. arXiv:2403.18535  [pdf, other]

    eess.IV cs.LG

    Theoretical Bound-Guided Hierarchical VAE for Neural Image Codecs

    Authors: Yichi Zhang, Zhihao Duan, Yuning Huang, Fengqing Zhu

    Abstract: Recent studies reveal a significant theoretical link between variational autoencoders (VAEs) and rate-distortion theory, notably in utilizing VAEs to estimate the theoretical upper bound of the information rate-distortion function of images. Such estimated theoretical bounds substantially exceed the performance of existing neural image codecs (NICs). To narrow this gap, we propose a theoretical bo… ▽ More

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: 2024 IEEE International Conference on Multimedia and Expo (ICME2024)

  10. arXiv:2403.10493  [pdf, other]

    cs.SD eess.AS eess.SP

    MusicHiFi: Fast High-Fidelity Stereo Vocoding

    Authors: Ge Zhu, Juan-Pablo Caceres, Zhiyao Duan, Nicholas J. Bryan

    Abstract: Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fide… ▽ More

    Submitted 20 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

  11. arXiv:2402.18862  [pdf, other]

    eess.IV

    Towards Backward-Compatible Continual Learning of Image Compression

    Authors: Zhihao Duan, Ming Lu, Justin Yang, Jiangpeng He, Zhan Ma, Fengqing Zhu

    Abstract: This paper explores the possibility of extending the capability of pre-trained neural image compressors (e.g., adapting to new data or target bitrates) without breaking backward compatibility, the ability to decode bitstreams encoded by the original model. We refer to this problem as continual learning of image compression. Our initial findings show that baseline solutions, such as end-to-end fine… ▽ More

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: Accepted to CVPR 2024

  12. arXiv:2402.15569  [pdf, other]

    eess.AS cs.LG cs.SD

    Toward Fully Self-Supervised Multi-Pitch Estimation

    Authors: Frank Cwitkowitz, Zhiyao Duan

    Abstract: Multi-pitch estimation is a decades-long research problem involving the detection of pitch activity associated with concurrent musical events within multi-instrument mixtures. Supervised learning techniques have demonstrated solid performance on more narrow characterizations of the task, but suffer from limitations concerning the shortage of large-scale and diverse polyphonic music datasets with m… ▽ More

    Submitted 23 February, 2024; originally announced February 2024.

  13. arXiv:2402.06986  [pdf, other]

    cs.SD eess.AS

    Cacophony: An Improved Contrastive Audio-Text Model

    Authors: Ge Zhu, Jordan Darefsky, Zhiyao Duan

    Abstract: Despite recent advancements in audio-text modeling, audio-text contrastive models still lag behind their image-text counterparts in scale and performance. We propose a method to improve both the scale and the training of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset containing 13,000 hours of text-labeled audio, using pretrained language models to process n… ▽ More

    Submitted 29 April, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

    Comments: Work in Progress

  14. arXiv:2401.11615  [pdf, other]

    eess.IV

    Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding

    Authors: Yichi Zhang, Zhihao Duan, Ming Lu, Dandan Ding, Fengqing Zhu, Zhan Ma

    Abstract: While convolution and self-attention are extensively used in learned image compression (LIC) for transform coding, this paper proposes an alternative called Contextual Clustering based LIC (CLIC) which primarily relies on clustering operations and local attention for correlation characterization and compact representation of an image. As seen, CLIC expands the receptive field into the entire image… ▽ More

    Submitted 21 January, 2024; originally announced January 2024.

    Comments: The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)

  15. arXiv:2401.03363  [pdf, other]

    eess.SY

    Data-driven Dynamic Event-triggered Control

    Authors: Tao Xu, Zhiyong Sun, Guanghui Wen, Zhisheng Duan

    Abstract: This paper revisits the event-triggered control problem from a data-driven perspective, where unknown continuous-time linear systems subject to disturbances are taken into account. Using data information collected off-line instead of accurate system model information, a data-driven dynamic event-triggered control scheme is developed in this paper. The dynamic property is reflected by that the desi… ▽ More

    Submitted 6 January, 2024; originally announced January 2024.

  16. arXiv:2312.15380  [pdf, other]

    cs.NI eess.SP

    Battery-Care Resource Allocation and Task Offloading in Multi-Agent Post-Disaster MEC Environment

    Authors: Yiwei Tang, Hualong Huang, Wenhan Zhan, Geyong Min, Zhekai Duan, Yuchuan Lei

    Abstract: As an emerging application scenario of mobile edge computing (MEC), post-disaster rescue involves numerous computing-intensive tasks but unreliable network connectivity. In rescue environments, quality of service (QoS), including task execution delay, energy consumption and battery state of health (SoH), is of great importance. This paper studies a multi-user post-disaste… ▽ More

    Submitted 23 December, 2023; originally announced December 2023.

    Comments: Accepted by WCNC 2024

  17. arXiv:2312.07126  [pdf, other]

    eess.IV

    Deep Hierarchical Video Compression

    Authors: Ming Lu, Zhihao Duan, Fengqing Zhu, Zhan Ma

    Abstract: Recently, probabilistic predictive coding that directly models the conditional distribution of latent features across successive frames for temporal redundancy removal has yielded promising results. Existing methods using a single-scale Variational AutoEncoder (VAE) must devise complex networks for conditional probability estimation in latent space, neglecting multiscale characteristics of video f… ▽ More

    Submitted 12 December, 2023; originally announced December 2023.

  18. arXiv:2311.14816  [pdf, other]

    eess.AS

    Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech

    Authors: Enting Zhou, You Zhang, Zhiyao Duan

    Abstract: Dimensional representations of speech emotions such as the arousal-valence (AV) representation provide a more continuous and fine-grained description and control than their categorical counterparts. They have wide applications in tasks such as dynamic emotion understanding and expressive text-to-speech synthesis. Existing methods that predict the dimensional emotion representation from speech cast it a… ▽ More

    Submitted 6 February, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

  19. arXiv:2311.13371  [pdf, other]

    eess.SY

    A Novel Dynamic Event-triggered Mechanism for Dynamic Average Consensus

    Authors: Tao Xu, Zhisheng Duan, Guanghui Wen, Zhiyong Sun

    Abstract: This paper studies a challenging issue introduced in a recent survey, namely designing a distributed event-based scheme to solve the dynamic average consensus (DAC) problem. First, a robust adaptive distributed event-based DAC algorithm is designed without imposing specific initialization criteria to perform estimation task under intermittent communication. Second, a novel adaptive distributed dyn… ▽ More

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: 9 pages, 8 figures

  20. arXiv:2311.08667  [pdf, other]

    cs.SD eess.AS

    EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

    Authors: Ge Zhu, Yutong Wen, Marc-André Carbonneau, Zhiyao Duan

    Abstract: Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate on the latent domain with cascaded phase recovery modules to reconstruct waveform. This poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in spectrogram domain under the framework of elucidated diffusion models (EDM). Combining wit… ▽ More

    Submitted 18 November, 2023; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: Accepted at NeurIPS Workshop: Machine Learning for Audio (Camera Ready)

  21. arXiv:2309.09085  [pdf, other]

    cs.SD cs.IR cs.MM eess.AS eess.SP

    SynthTab: Leveraging Synthesized Data for Guitar Tablature Transcription

    Authors: Yongyi Zang, Yi Zhong, Frank Cwitkowitz, Zhiyao Duan

    Abstract: Guitar tablature is a form of music notation widely used among guitarists. It captures not only the musical content of a piece, but also its implementation and ornamentation on the instrument. Guitar Tablature Transcription (GTT) is an important task with broad applications in music education, composition, and entertainment. Existing GTT datasets are quite limited in size and scope, rendering mode… ▽ More

    Submitted 24 January, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  22. arXiv:2309.07525  [pdf, other]

    cs.SD cs.AI eess.AS

    SingFake: Singing Voice Deepfake Detection

    Authors: Yongyi Zang, You Zhang, Mojtaba Heydari, Zhiyao Duan

    Abstract: The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances.… ▽ More

    Submitted 21 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  23. arXiv:2309.02574  [pdf, other]

    eess.IV

    An Improved Upper Bound on the Rate-Distortion Function of Images

    Authors: Zhihao Duan, Jack Ma, Jiangpeng He, Fengqing Zhu

    Abstract: Recent work has shown that Variational Autoencoders (VAEs) can be used to upper-bound the information rate-distortion (R-D) function of images, i.e., the fundamental limit of lossy image compression. In this paper, we report an improved upper bound on the R-D function of images implemented by (1) introducing a new VAE model architecture, (2) applying variable-rate compression techniques, and (3) p… ▽ More

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: Conference paper at ICIP 2023. The first two authors share equal contributions

  24. arXiv:2307.14547  [pdf, other]

    eess.AS cs.SD

    Mitigating Cross-Database Differences for Learning Unified HRTF Representation

    Authors: Yutong Wen, You Zhang, Zhiyao Duan

    Abstract: Individualized head-related transfer functions (HRTFs) are crucial for accurate sound positioning in virtual auditory displays. As the acoustic measurement of HRTFs is resource-intensive, predicting individualized HRTFs using machine learning models is a promising approach at scale. Training such models requires a unified HRTF representation across multiple databases to utilize their respectively l… ▽ More

    Submitted 26 July, 2023; originally announced July 2023.

    Comments: 5 pages, 4 figures, accepted by IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2023

  25. arXiv:2306.09215  [pdf, other]

    eess.SY

    On the Effects and Optimal Design of Redundant Sensors in Collaborative State Estimation

    Authors: Yunxiao Ren, Zhisheng Duan, Peihu Duan, Ling Shi

    Abstract: The existence of redundant sensors in collaborative state estimation is a common occurrence, yet their true significance remains elusive. This paper comprehensively investigates the effects and optimal design of redundant sensors in sensor networks that use Kalman filtering to estimate the state of a random process collaboratively. The paper presents two main results: a theoretical analysis of the… ▽ More

    Submitted 4 February, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

  26. Phase perturbation improves channel robustness for speech spoofing countermeasures

    Authors: Yongyi Zang, You Zhang, Zhiyao Duan

    Abstract: In this paper, we aim to address the problem of channel robustness in speech countermeasure (CM) systems, which are used to distinguish synthetic speech from human natural speech. On the basis of two hypotheses, we suggest an approach for perturbing phase information during the training of time-domain CM systems. Communication networks often employ lossy compression codecs that encode only magnitu… ▽ More

    Submitted 6 October, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

    Comments: 5 pages; Proceedings of Interspeech 2023

  27. arXiv:2306.02372  [pdf, other]

    eess.AS

    SingNet: A Real-time Singing Voice Beat and Downbeat Tracking System

    Authors: Mojtaba Heydari, Ju-Chiang Wang, Zhiyao Duan

    Abstract: Singing voice beat and downbeat tracking possesses several applications in automatic music production, analysis and manipulation. Among them, some require real-time processing, such as live performance processing and auto-accompaniment for singing inputs. This task is challenging owing to the non-trivial rhythmic and harmonic patterns in singing signals. For real-time processing, it introduces furthe… ▽ More

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: Accepted for 2023 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2023)

  28. arXiv:2305.12755  [pdf, other]

    cs.SD cs.CL eess.AS

    GNCformer Enhanced Self-attention for Automatic Speech Recognition

    Authors: J. Li, Z. Duan, S. Li, X. Yu, G. Yang

    Abstract: In this paper, an Enhanced Self-Attention (ESA) mechanism is put forward for robust feature extraction. The proposed ESA is integrated with the recursive gated convolution and self-attention mechanism. In particular, the former is used to capture multi-order feature interaction and the latter is for global feature extraction. In addition, the location of interest that is suitable for inserting t… ▽ More

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: 5 pages, 3 figures

  29. arXiv:2304.04991  [pdf, other]

    cs.SD cs.CL eess.AS

    Sim-T: Simplify the Transformer Network by Multiplexing Technique for Speech Recognition

    Authors: Guangyong Wei, Zhikui Duan, Shiren Li, Guangguang Yang, Xinmei Yu, Junhua Li

    Abstract: In recent years, a great deal of attention has been paid to the Transformer network for speech recognition tasks due to its excellent model performance. However, the Transformer network always involves heavy computation and a large number of parameters, causing serious deployment problems in devices with limited computation resources or storage memory. In this paper, a new lightweight model called Sim… ▽ More

    Submitted 11 April, 2023; originally announced April 2023.

  30. arXiv:2303.08575  [pdf, other]

    eess.SP

    Observation of Periodic Systems: Bridge Centralized Kalman Filtering and Consensus-Based Distributed Filtering

    Authors: Jiachen Qian, Zhisheng Duan, Peihu Duan, Zhongkui Li

    Abstract: Compared with linear time-invariant systems, linear periodic systems can describe the periodic processes arising from nature and engineering more precisely. However, the time-varying system parameters increase the difficulty of research on periodic systems, such as stabilization and observation. This paper aims to consider the observation problem of periodic systems by bridging two fundamental f… ▽ More

    Submitted 15 March, 2023; originally announced March 2023.

    Comments: arXiv admin note: text overlap with arXiv:2112.06395

  31. arXiv:2303.06475  [pdf, other]

    eess.AS cs.CL

    Transcription free filler word detection with Neural semi-CRFs

    Authors: Ge Zhu, Yujia Yan, Juan-Pablo Caceres, Zhiyao Duan

    Abstract: Non-linguistic filler words, such as "uh" or "um", are prevalent in spontaneous speech and serve as indicators for expressing hesitation or uncertainty. Previous works for detecting certain non-linguistic filler words are highly dependent on transcriptions from a well-established commercial automatic speech recognition (ASR) system. However, certain ASR systems are not universally accessible from… ▽ More

    Submitted 11 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP 2023

  32. QARV: Quantization-Aware ResNet VAE for Lossy Image Compression

    Authors: Zhihao Duan, Ming Lu, Jack Ma, Yuning Huang, Zhan Ma, Fengqing Zhu

    Abstract: This paper addresses the problem of lossy image compression, a fundamental problem in image processing and information theory that is involved in many real-world applications. We start by reviewing the framework of variational autoencoders (VAEs), a powerful class of generative probabilistic models that has a deep connection to lossy compression. Based on VAEs, we develop a novel scheme for lossy… ▽ More

    Submitted 1 December, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

    Comments: Full version (19 pages, includes appendix) of the paper accepted by IEEE TPAMI

  33. arXiv:2211.11247  [pdf, ps, other]

    eess.SP

    Harmonic-Coupled Riccati Equations and its Applications in Distributed Filtering

    Authors: Jiachen Qian, Peihu Duan, Zhisheng Duan, Ling Shi

    Abstract: Coupled Riccati equations consist of multiple Riccati-like equations whose solutions are coupled with each other, and can be applied to depict the properties of more complex systems such as Markovian systems or multi-agent systems. This paper manages to formulate and investigate a new kind of coupled Riccati equations, called harmonic-coupled Riccati equations (HCRE), from the matrix iterati… ▽ More

    Submitted 12 July, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

    Comments: 14 pages, 4 figures

  34. arXiv:2211.09897  [pdf, other]

    eess.IV

    Efficient Feature Compression for Edge-Cloud Systems

    Authors: Zhihao Duan, Fengqing Zhu

    Abstract: Optimizing computation in an edge-cloud system is an important yet challenging problem. In this paper, we consider a three-way trade-off between bit rate, classification accuracy, and encoding complexity in an edge-cloud image classification system. Our method includes a new training strategy and an efficient encoder architecture to improve the rate-accuracy performance. Our design can also be eas… ▽ More

    Submitted 17 November, 2022; originally announced November 2022.

    Comments: Picture Coding Symposium (PCS) 2022

  35. arXiv:2211.02718  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing

    Authors: Siwen Ding, You Zhang, Zhiyao Duan

    Abstract: Voice anti-spoofing systems are crucial auxiliaries for automatic speaker verification (ASV) systems. A major challenge is caused by unseen attacks empowered by advanced speech synthesis technologies. Our previous research on one-class learning has improved the generalization ability to unseen attacks by compacting the bona fide speech in the embedding space. However, such compactness lacks consid… ▽ More

    Submitted 4 November, 2022; originally announced November 2022.

  36. arXiv:2210.17313  [pdf, ps, other]

    eess.SY cs.AI math.OC

    Discrete Communication and Control Updating in Event-Triggered Consensus

    Authors: Bin Cheng, Yuezu Lv, Zhongkui Li, Zhisheng Duan

    Abstract: This paper studies the consensus control problem faced with three essential demands, namely, discrete control updating for each agent, discrete-time communications among neighboring agents, and the fully distributed fashion of the controller implementation without requiring any global information of the whole network topology. Noting that the existing related results only meeting one or two demand… ▽ More

    Submitted 26 October, 2022; originally announced October 2022.

  37. arXiv:2210.15196  [pdf, other]

    eess.AS cs.GR cs.SD

    HRTF Field: Unifying Measured HRTF Magnitude Representation with Neural Fields

    Authors: You Zhang, Yuxiang Wang, Zhiyao Duan

    Abstract: Head-related transfer functions (HRTFs) are a set of functions describing the spatial filtering effect of the outer ear (i.e., torso, head, and pinnae) onto sound sources at different azimuth and elevation angles. They are widely used in spatial audio rendering. While the azimuth and elevation angles are intrinsically continuous, measured HRTFs in existing datasets employ different spatial samplin… ▽ More

    Submitted 23 February, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: 5 pages, accepted by ICASSP 2023

  38. arXiv:2210.06696  [pdf, other]

    cs.AR eess.SY

    CPSAA: Accelerating Sparse Attention using Crossbar-based Processing-In-Memory Architecture

    Authors: Huize Li, Hai Jin, Long Zheng, Yu Huang, Xiaofei Liao, Dan Chen, Zhuohui Duan, Cong Liu, Jiahong Xu, Chuanyi Gui

    Abstract: The attention mechanism requires huge computational efforts to process unnecessary calculations, significantly limiting the system's performance. Researchers propose sparse attention to convert some DDMM operations to SDDMM and SpMM operations. However, current sparse attention solutions introduce massive off-chip random memory access. We propose CPSAA, a novel crossbar-based PIM-featured sparse a… ▽ More

    Submitted 7 October, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: 14 pages, 19 figures

  39. arXiv:2210.02700  [pdf, other]

    eess.SY

    Minimal-order Appointed-time Unknown Input Observers: Design and Applications

    Authors: Yuezu Lv, Zhongkui Li, Zhisheng Duan

    Abstract: This paper presents a framework on minimal-order appointed-time unknown input observers for linear systems based on the pairwise observer structure. A minimal-order appointed-time observer is first proposed for the linear system without the unknown input, which can estimate the state exactly at the preset time by seeking for the unique solution of a system of linear equations. To further release t… ▽ More

    Submitted 6 October, 2022; originally announced October 2022.

  40. ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed

    Authors: Meiying Chen, Zhiyao Duan

    Abstract: Recent developments in neural speech synthesis and vocoding have sparked a renewed interest in voice conversion (VC). Beyond timbre transfer, achieving controllability on para-linguistic parameters such as pitch and speed is critical in deploying VC systems in many application scenarios. Existing studies, however, either only provide utterance-level global control or lack interpretability on the c… ▽ More

    Submitted 11 January, 2024; v1 submitted 23 September, 2022; originally announced September 2022.

    Comments: Audio samples: https://bit.ly/3PsrKLJ; Code: https://github.com/MelissaChen15/control-vc

  41. arXiv:2208.14578  [pdf, other]

    eess.AS

    Singing Beat Tracking With Self-supervised Front-end and Linear Transformers

    Authors: Mojtaba Heydari, Zhiyao Duan

    Abstract: Tracking beats of singing voices without the presence of musical accompaniment can find many applications in music production, automatic song arrangement, and social media interaction. Its main challenge is the lack of strong rhythmic and harmonic patterns that are important for music rhythmic analysis in general. Even for human listeners, this can be a challenging task. As a result, existing musi… ▽ More

    Submitted 30 August, 2022; originally announced August 2022.

    Comments: 23rd International Society for Music Information Retrieval Conference (ISMIR 2022)

  42. Lossy Image Compression with Quantized Hierarchical VAEs

    Authors: Zhihao Duan, Ming Lu, Zhan Ma, Fengqing Zhu

    Abstract: Recent research has shown a strong theoretical connection between variational autoencoders (VAEs) and the rate-distortion theory. Motivated by this, we consider the problem of lossy image compression from the perspective of generative modeling. Starting with ResNet VAEs, which are originally designed for data (image) distribution modeling, we redesign their latent variable model using a quantizati… ▽ More

    Submitted 25 March, 2023; v1 submitted 27 August, 2022; originally announced August 2022.

    Comments: WACV 2023 Best Algorithms Paper Award, revised version

  43. arXiv:2207.14352  [pdf, other]

    eess.AS cs.SD

    Predicting Global Head-Related Transfer Functions From Scanned Head Geometry Using Deep Learning and Compact Representations

    Authors: Yuxiang Wang, You Zhang, Zhiyao Duan, Mark Bocko

    Abstract: In the growing field of virtual auditory display, personalized head-related transfer functions (HRTFs) play a vital role in establishing an accurate sound image. In this work, we propose an HRTF personalization method employing convolutional neural networks (CNN) to predict a subject's HRTFs for all directions from their scanned head geometry. To ease the training of the CNN models, we propose nov… ▽ More

    Submitted 28 July, 2022; originally announced July 2022.

    Comments: 11 pages, 14 figures

  44. arXiv:2206.10421  [pdf, other]

    cs.SD cs.AI cs.CV cs.MM eess.AS

    Rethinking Audio-visual Synchronization for Active Speaker Detection

    Authors: Abudukelimu Wuerkaixi, You Zhang, Zhiyao Duan, Changshui Zhang

    Abstract: Active speaker detection (ASD) systems are important modules for analyzing multi-talker conversations. They aim to detect which speakers or none are talking in a visual scene at any given time. Existing research on ASD does not agree on the definition of active speakers. We clarify the definition in this work and require synchronization between the audio and visual speaking activities. This clarif…

    Submitted 10 July, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

    Comments: Accepted by IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2022)

  45. arXiv:2206.06784  [pdf, other]

    eess.SP eess.SY

    Stochastic Event-triggered Variational Bayesian Filtering

    Authors: Xiaoxu Lv, Peihu Duan, Zhisheng Duan, Guanrong Chen, Ling Shi

    Abstract: This paper proposes an event-triggered variational Bayesian filter for remote state estimation with unknown and time-varying noise covariances. After presetting multiple nominal process noise covariances and an initial measurement noise covariance, a variational Bayesian method and a fixed-point iteration method are utilized to jointly estimate the posterior state vector and the unknown noise cova…

    Submitted 14 June, 2022; originally announced June 2022.

  46. arXiv:2205.01686  [pdf, other]

    cs.CV eess.IV

    Smart City Intersections: Intelligence Nodes for Future Metropolises

    Authors: Zoran Kostić, Alex Angus, Zhengye Yang, Zhuoxu Duan, Ivan Seskar, Gil Zussman, Dipankar Raychaudhuri

    Abstract: Traffic intersections are the most suitable locations for the deployment of computing, communications, and intelligence services for smart cities of the future. The abundance of data to be collected and processed, in combination with privacy and security concerns, motivates the use of the edge-computing paradigm which aligns well with physical intersections in metropolises. This paper focuses on h…

    Submitted 13 May, 2022; v1 submitted 3 May, 2022; originally announced May 2022.

  47. arXiv:2204.09079  [pdf, other]

    eess.AS cs.SD eess.SP

    Music Source Separation with Generative Flow

    Authors: Ge Zhu, Jordan Darefsky, Fei Jiang, Anton Selitskiy, Zhiyao Duan

    Abstract: Fully-supervised models for source separation are trained on parallel mixture-source data and are currently state-of-the-art. However, such parallel data is often difficult to obtain, and it is cumbersome to adapt trained models to mixtures with new sources. Source-only supervised models, in contrast, only require individual source data for training. In this paper, we first leverage flow-based gen…

    Submitted 16 October, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

    Comments: Accepted by Signal Processing Letters

  48. arXiv:2204.08094  [pdf, other]

    eess.AS cs.LG cs.SD

    A Data-Driven Methodology for Considering Feasibility and Pairwise Likelihood in Deep Learning Based Guitar Tablature Transcription Systems

    Authors: Frank Cwitkowitz, Jonathan Driedger, Zhiyao Duan

    Abstract: Guitar tablature transcription is an important but understudied problem within the field of music information retrieval. Traditional signal processing approaches offer only limited performance on the task, and there is little acoustic data with transcription labels for training machine learning models. However, guitar transcription labels alone are more widely available in the form of tablature, w…

    Submitted 17 April, 2022; originally announced April 2022.

    Comments: Sound and Music Computing Conference (SMC) 2022

  49. arXiv:2202.13209  [pdf, other]

    eess.IV

    Opening the Black Box of Learned Image Coders

    Authors: Zhihao Duan, Ming Lu, Zhan Ma, Fengqing Zhu

    Abstract: End-to-end learned lossy image coders (LICs), as opposed to hand-crafted image codecs, have shown increasing superiority in terms of the rate-distortion performance. However, they are mainly treated as black-box systems and their interpretability is not well studied. In this paper, we show that LICs learn a set of basis functions to transform input image for its compact representation in the laten…

    Submitted 14 October, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

  50. arXiv:2202.05253  [pdf, other]

    eess.AS cs.SD

    A Probabilistic Fusion Framework for Spoofing Aware Speaker Verification

    Authors: You Zhang, Ge Zhu, Zhiyao Duan

    Abstract: The performance of automatic speaker verification (ASV) systems could be degraded by voice spoofing attacks. Most existing works aimed to develop standalone spoofing countermeasure (CM) systems. Relatively little work targeted at developing an integrated spoofing aware speaker verification (SASV) system. In the recent SASV challenge, the organizers encourage the development of such integration by…

    Submitted 24 April, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

    Comments: 8 pages, 5 figures, to appear in Odyssey 2022