
Showing 1–50 of 121 results for author: Huang, Q

Searching in archive eess.
  1. arXiv:2406.02430  [pdf, other]

    eess.AS cs.SD

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, et al. (21 additional authors not shown)

    Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub…

    Submitted 4 June, 2024; originally announced June 2024.

  2. arXiv:2405.00248  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    Who is Authentic Speaker

    Authors: Qiang Huang

    Abstract: Voice conversion (VC) using deep learning technologies can now generate high quality one-to-many voices and thus has been used in some practical application fields, such as entertainment and healthcare. However, voice conversion can pose potential social issues when manipulated voices are employed for deceptive purposes. Moreover, it is a big challenge to find who are real speakers from the conver…

    Submitted 30 April, 2024; originally announced May 2024.

  3. arXiv:2403.11974  [pdf, other]

    eess.IV cs.CV

    OUCopula: Bi-Channel Multi-Label Copula-Enhanced Adapter-Based CNN for Myopia Screening Based on OU-UWF Images

    Authors: Yang Li, Qiuyi Huang, Chong Zhong, Danjuan Yang, Meiyan Li, A. H. Welsh, Aiyi Liu, Bo Fu, Catherien C. Liu, Xingtao Zhou

    Abstract: Myopia screening using cutting-edge ultra-widefield (UWF) fundus imaging is potentially significant for ophthalmic outcomes. Current multidisciplinary research between ophthalmology and deep learning (DL) concentrates primarily on disease classification and diagnosis using single-eye images, largely ignoring joint modeling and prediction for Oculus Uterque (OU, both eyes). Inspired by the complex…

    Submitted 18 March, 2024; originally announced March 2024.

  4. Spatial features of CO2 for occupancy detection in a naturally ventilated school building

    Authors: Qirui Huang, Marc Syndicus, Jérôme Frisch, Christoph van Treeck

    Abstract: Accurate occupancy information helps to improve building energy efficiency and occupant comfort. Occupancy detection methods based on CO2 sensors have received attention due to their low cost and low intrusiveness. In naturally ventilated buildings, the accuracy of CO2-based occupancy detection is generally low in related studies due to the complex ventilation behavior and the difficulty in measur…

    Submitted 28 June, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: Indoor Environments, Volume 1, Issue 3, 2024, 100018, ISSN 2950-3620

    Journal ref: Indoor Environments, Volume 1, Issue 3, 2024, 100018, ISSN 2950-3620

  5. arXiv:2403.05834  [pdf, other]

    cs.MM cs.SD eess.AS

    Enhancing Expressiveness in Dance Generation via Integrating Frequency and Music Style Information

    Authors: Qiaochu Huang, Xu He, Boshi Tang, Haolin Zhuang, Liyang Chen, Shuochen Gao, Zhiyong Wu, Haozhi Huang, Helen Meng

    Abstract: Dance generation, as a branch of human motion generation, has attracted increasing attention. Recently, a few works attempt to enhance dance expressiveness, which includes genre matching, beat alignment, and dance dynamics, from certain aspects. However, the enhancement is quite limited as they lack comprehensive consideration of the aforementioned three factors. In this paper, we propose Expressi…

    Submitted 9 March, 2024; originally announced March 2024.

  6. arXiv:2403.01669  [pdf, other]

    cs.LG eess.SY

    Quantifying and Predicting Residential Building Flexibility Using Machine Learning Methods

    Authors: Patrick Salter, Qiuhua Huang, Paulo Cesar Tabares-Velasco

    Abstract: Residential buildings account for a significant portion (35%) of the total electricity consumption in the U.S. as of 2022. As more distributed energy resources are installed in buildings, their potential to provide flexibility to the grid increases. To tap into that flexibility provided by buildings, aggregators or system operators need to quantify and forecast flexibility. Previous works in this…

    Submitted 3 March, 2024; originally announced March 2024.

  7. arXiv:2402.01808  [pdf, other]

    cs.SD eess.AS

    KS-Net: Multi-band joint speech restoration and enhancement network for 2024 ICASSP SSI Challenge

    Authors: Guochen Yu, Runqiang Han, Chenglin Xu, Haoran Zhao, Nan Li, Chen Zhang, Xiguang Zheng, Chao Zhou, Qi Huang, Bing Yu

    Abstract: This paper presents the speech restoration and enhancement system created by the 1024K team for the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. Our system consists of a generative adversarial network (GAN) in complex-domain for speech restoration and a fine-grained multi-band fusion module for speech enhancement. In the blind test set of SSI, the proposed system achieves an overall mean…

    Submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024; Rank 1st in ICASSP 2024 Speech Signal Improvement (SSI) Challenge

  8. arXiv:2312.15481  [pdf, other]

    eess.SP

    A Novel Field-Free SOT Magnetic Tunnel Junction With Local VCMA-Induced Switching

    Authors: Rui Zhou, Haiyang Zhang, Hao Wang, ** He, Qijun Huang, Sheng Chang

    Abstract: By integrating the local voltage-controlled magnetic anisotropy (VCMA) effect, Dzyaloshinskii-Moriya interaction (DMI) effect, and spin-orbit torque (SOT) effect, we propose a novel device structure for field-free magnetic tunnel junction (MTJ). Micromagnetic simulation shows that the device utilizes the chiral symmetry breaking caused by the DMI effect to induce a non-collinear spin texture under…

    Submitted 24 December, 2023; originally announced December 2023.

  9. arXiv:2312.14468  [pdf, ps, other]

    eess.SP

    FDA-MIMO-based Integrated Sensing and Communication System with Frequency Offset Permutation Index Modulation

    Authors: Jiangwei Jian, Qimao Huang, Bang Huang, Wen-Qin Wang

    Abstract: Considering that frequency diverse array multiple-input multiple-output (FDA-MIMO) possesses extra range information to enhance sensing performance, this paper explores the FDA-MIMO-based integrated sensing and communication (ISAC) system. To reinforce the system communication capability, we propose the frequency offset permutation index modulation (FOPIM) scheme, which conveys extra information b…

    Submitted 22 December, 2023; originally announced December 2023.

  10. arXiv:2312.13722  [pdf, other]

    cs.SD eess.AS

    BAE-Net: A Low complexity and high fidelity Bandwidth-Adaptive neural network for speech super-resolution

    Authors: Guochen Yu, Xiguang Zheng, Nan Li, Runqiang Han, Chengshi Zheng, Chen Zhang, Chao Zhou, Qi Huang, Bing Yu

    Abstract: Speech bandwidth extension (BWE) has demonstrated promising performance in enhancing the perceptual speech quality in real communication systems. Most existing BWE researches primarily focus on fixed upsampling ratios, disregarding the fact that the effective bandwidth of captured audio may fluctuate frequently due to various capturing devices and transmission conditions. In this paper, we propose…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2024

  11. arXiv:2312.10381  [pdf, other]

    cs.SD eess.AS

    SECap: Speech Emotion Captioning with Large Language Model

    Authors: Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu, Shixiong Zhang, Guangzhi Li, Yi Luo, Rongzhi Gu

    Abstract: Speech emotions are crucial in human communication and are extensively used in fields like speech synthesis and natural language understanding. Most prior studies, such as speech emotion recognition, have categorized speech emotions into a fixed set of classes. Yet, emotions expressed in human speech are often complex, and categorizing them into predefined groups can be insufficient to adequately…

    Submitted 23 December, 2023; v1 submitted 16 December, 2023; originally announced December 2023.

    Comments: Accepted by AAAI 2024

  12. arXiv:2312.03490  [pdf, other]

    eess.IV cs.CV

    PneumoLLM: Harnessing the Power of Large Language Model for Pneumoconiosis Diagnosis

    Authors: Meiyue Song, Zhihua Yu, Jiaxin Wang, Jiarui Wang, Yuting Lu, Baicun Li, Xiaoxu Wang, Qinghua Huang, Zhijun Li, Nikolaos I. Kanellakis, Jiangfeng Liu, Jing Wang, Binglu Wang, Juntao Yang

    Abstract: The conventional pretraining-and-finetuning paradigm, while effective for common diseases with ample data, faces challenges in diagnosing data-scarce occupational diseases like pneumoconiosis. Recently, large language models (LLMs) have exhibited unprecedented ability when conducting multiple tasks in dialogue, bringing opportunities to diagnosis. A common strategy might involve using adapter layer…

    Submitted 28 June, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: Medical Image Analysis

  13. arXiv:2312.03324  [pdf]

    eess.AS

    Lightweight Speaker Verification Using Transformation Module with Feature Partition and Fusion

    Authors: Yanxiong Li, Zhongjie Jiang, Qisheng Huang, Wenchang Cao, Jialong Li

    Abstract: Although many efforts have been made on decreasing the model complexity for speaker verification, it is still challenging to deploy speaker verification systems with satisfactory result on low-resource terminals. We design a transformation module that performs feature partition and fusion to implement lightweight speaker verification. The transformation module consists of multiple simple but effec…

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: 12 pages, 5 figures, 6 tables; accepted for publication in IEEE-ACM TASLP

  14. arXiv:2311.12264  [pdf, other]

    eess.SY cs.AI cs.LG

    Resilient Control of Networked Microgrids using Vertical Federated Reinforcement Learning: Designs and Real-Time Test-Bed Validations

    Authors: Sayak Mukherjee, Ramij R. Hossain, Sheik M. Mohiuddin, Yuan Liu, Wei Du, Veronica Adetola, Rohit A. Jinsiwale, Qiuhua Huang, Tianzhixi Yin, Ankit Singhal

    Abstract: Improving system-level resiliency of networked microgrids is an important aspect with increased population of inverter-based resources (IBRs). This paper (1) presents resilient control design in presence of adversarial cyber-events, and proposes a novel federated reinforcement learning (Fed-RL) approach to tackle (a) model complexities, unknown dynamical behaviors of IBR devices, (b) privacy issue…

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: 10 pages, 7 figures

  15. arXiv:2310.14016  [pdf, other]

    eess.AS

    SwG-former: A Sliding-Window Graph Convolutional Network for Simultaneous Spatial-Temporal Information Extraction in Sound Event Localization and Detection

    Authors: Weiming Huang, Qinghua Huang, Liyan Ma, Chuan Wang

    Abstract: Sound event localization and detection (SELD) involves sound event detection (SED) and direction of arrival (DoA) estimation tasks. SED mainly relies on temporal dependencies to distinguish different sound classes, while DoA estimation depends on spatial correlations to estimate source directions. This paper addresses the need to simultaneously extract spatial-temporal information in audio signals…

    Submitted 20 March, 2024; v1 submitted 21 October, 2023; originally announced October 2023.

  16. arXiv:2310.12733  [pdf, other]

    eess.IV cs.CV

    Multiscale Motion-Aware and Spatial-Temporal-Channel Contextual Coding Network for Learned Video Compression

    Authors: Yiming Wang, Qian Huang, Bin Tang, Huashan Sun, Xing Li

    Abstract: Recently, learned video compression has achieved exciting performance. Following the traditional hybrid prediction coding framework, most learned methods generally adopt the motion estimation motion compensation (MEMC) method to remove inter-frame redundancy. However, inaccurate motion vectors (MVs) usually lead to the distortion of reconstructed frame. In addition, most approaches ignore the spatia…

    Submitted 19 October, 2023; originally announced October 2023.

    Comments: 12 pages, 12 figures

  17. arXiv:2310.08095  [pdf, other]

    cs.IT eess.SP

    Multi-Satellite Cooperative Networks: Joint Hybrid Beamforming and User Scheduling Design

    Authors: Xuan Zhang, Shu Sun, Meixia Tao, Qin Huang, Xiaohu Tang

    Abstract: In this paper, we consider a cooperative communication network where multiple low-Earth-orbit (LEO) satellites provide services to multiple ground users (GUs) cooperatively at the same time and on the same frequency. The multi-satellite cooperation has great potential in extending communication coverage and increasing spectral efficiency. Considering that the on-board radio-frequency circuit resou…

    Submitted 27 December, 2023; v1 submitted 12 October, 2023; originally announced October 2023.

    Comments: 14 pages, 13 figures. arXiv admin note: substantial text overlap with arXiv:2301.03888

  18. arXiv:2310.05021  [pdf, other]

    eess.SY

    Toward Intelligent Emergency Control for Large-scale Power Systems: Convergence of Learning, Physics, Computing and Control

    Authors: Qiuhua Huang, Renke Huang, Tianzhixi Yin, Sohom Datta, Xueqing Sun, Jason Hou, Jie Tan, Wenhao Yu, Yuan Liu, Xinya Li, Bruce Palmer, Ang Li, Xinda Ke, Marianna Vaiman, Song Wang, Yousu Chen

    Abstract: This paper has delved into the pressing need for intelligent emergency control in large-scale power systems, which are experiencing significant transformations and are operating closer to their limits with more uncertainties. Learning-based control methods are promising and have shown effectiveness for intelligent power system control. However, when they are applied to large-scale power systems, t…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: submitted to PSCC 2024

  19. arXiv:2309.08051  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    Retrieval-Augmented Text-to-Audio Generation

    Authors: Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

    Abstract: Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer…

    Submitted 5 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP 2024

  20. arXiv:2308.16593  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis

    Authors: Weiqin Li, Shun Lei, Qiaochu Huang, Yixuan Zhou, Zhiyong Wu, Shiyin Kang, Helen Meng

    Abstract: The spontaneous behavior that often occurs in conversations makes speech more human-like compared to reading-style. However, synthesizing spontaneous-style speech is challenging due to the lack of high-quality spontaneous datasets and the high cost of labeling spontaneous behavior. In this paper, we propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech an…

    Submitted 31 August, 2023; originally announced August 2023.

    Comments: Accepted by INTERSPEECH 2023

  21. arXiv:2308.03382  [pdf, ps, other]

    eess.IV cs.CV cs.LG

    Enhancing Nucleus Segmentation with HARU-Net: A Hybrid Attention Based Residual U-Blocks Network

    Authors: Junzhou Chen, Qian Huang, Yulin Chen, Linyi Qian, Chengyuan Yu

    Abstract: Nucleus image segmentation is a crucial step in the analysis, pathological diagnosis, and classification, which heavily relies on the quality of nucleus segmentation. However, the complexity of issues such as variations in nucleus size, blurred nucleus contours, uneven staining, cell clustering, and overlapping cells poses significant challenges. Current methods for nucleus segmentation primarily…

    Submitted 10 August, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

    Comments: Nucleus segmentation, Deep learning, Instance segmentation, Medical imaging, Dual-Branch network

  22. arXiv:2307.14335  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    WavJourney: Compositional Audio Creation with Large Language Models

    Authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

    Abstract: Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation…

    Submitted 26 November, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

    Comments: GitHub: https://github.com/Audio-AGI/WavJourney

  23. arXiv:2307.08946  [pdf, other]

    cs.CR eess.SP

    EsaNet: Environment Semantics Enabled Physical Layer Authentication

    Authors: Ning Gao, Qiying Huang, Cen Li, Shi Jin, Michail Matthaiou

    Abstract: Wireless networks are vulnerable to physical layer spoofing attacks due to the wireless broadcast nature, thus, integrating communications and security (ICAS) is urgently needed for 6G endogenous security. In this letter, we propose an environment semantics enabled physical layer authentication network based on deep learning, namely EsaNet, to authenticate the spoofing from the underlying wireless…

    Submitted 17 July, 2023; originally announced July 2023.

  24. arXiv:2306.08918  [pdf, other]

    eess.IV cs.CV

    PUGAN: Physical Model-Guided Underwater Image Enhancement Using GAN with Dual-Discriminators

    Authors: Runmin Cong, Wenyu Yang, Wei Zhang, Chongyi Li, Chun-Le Guo, Qingming Huang, Sam Kwong

    Abstract: Due to the light absorption and scattering induced by the water medium, underwater images usually suffer from some degradation problems, such as low contrast, color distortion, and blurring details, which aggravate the difficulty of downstream underwater understanding tasks. Therefore, how to obtain clear and visually pleasant images has become a common concern of people, and the task of underwate…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: 8 pages, 4 figures, Accepted by IEEE Transactions on Image Processing 2023

  25. arXiv:2306.03835  [pdf, other]

    eess.IV cs.CV cs.LG

    Atrial Septal Defect Detection in Children Based on Ultrasound Video Using Multiple Instances Learning

    Authors: Yiman Liu, Qiming Huang, Xiaoxiang Han, Tongtong Liang, Zhifang Zhang, Lijun Chen, **feng Wang, Angelos Stefanidis, Jionglong Su, Jiangang Chen, Qingli Li, Yuqi Zhang

    Abstract: Purpose: Congenital heart defect (CHD) is the most common birth defect. Thoracic echocardiography (TTE) can provide sufficient cardiac structure information, evaluate hemodynamics and cardiac function, and is an effective method for atrial septal defect (ASD) examination. This paper aims to study a deep learning method based on cardiac ultrasound video to assist in ASD diagnosis. Materials and met…

    Submitted 6 June, 2023; originally announced June 2023.

  26. arXiv:2306.02054  [pdf]

    eess.AS

    Low-Complexity Acoustic Scene Classification Using Data Augmentation and Lightweight ResNet

    Authors: Yanxiong Li, Wenchang Cao, Wei Xie, Qisheng Huang, Wenfeng Pang, Qianhua He

    Abstract: We present a work on low-complexity acoustic scene classification (ASC) with multiple devices, namely the subtask A of Task 1 of the DCASE2021 challenge. This subtask focuses on classifying audio samples of multiple devices with a low-complexity model, where two main difficulties need to be overcome. First, the audio samples are recorded by different devices, and there is mismatch of recording dev…

    Submitted 3 June, 2023; originally announced June 2023.

    Comments: 5 pages, 5 figures, 4 tables. Accepted for publication in the 16th IEEE International Conference on Signal Processing (IEEE ICSP)

  27. arXiv:2306.00426  [pdf]

    eess.AS cs.SD

    Speaker verification using attentive multi-scale convolutional recurrent network

    Authors: Yanxiong Li, Zhongjie Jiang, Wenchang Cao, Qisheng Huang

    Abstract: In this paper, we propose a speaker verification method by an Attentive Multi-scale Convolutional Recurrent Network (AMCRN). The proposed AMCRN can acquire both local spatial information and global sequential information from the input speech recordings. In the proposed method, logarithm Mel spectrum is extracted from each speech recording and then fed to the proposed AMCRN for learning speaker em…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: 21 pages, 6 figures, 8 tables. Accepted for publication in Applied Soft Computing

  28. Few-Shot Speaker Identification Using Lightweight Prototypical Network with Feature Grouping and Interaction

    Authors: Yanxiong Li, Hao Chen, Wenchang Cao, Qisheng Huang, Qianhua He

    Abstract: Existing methods for few-shot speaker identification (FSSI) obtain high accuracy, but their computational complexities and model sizes need to be reduced for lightweight applications. In this work, we propose a FSSI method using a lightweight prototypical network with the final goal to implement the FSSI on intelligent terminals with limited resources, such as smart watches and smart speakers. In…

    Submitted 31 May, 2023; originally announced May 2023.

    Comments: 12 pages, 4 figures, 12 tables. Accepted for publication in IEEE TMM

  29. arXiv:2305.06594  [pdf, other]

    cs.SD cs.CV cs.LG cs.MM eess.AS

    V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

    Authors: Kun Su, Judith Yue Li, Qingqing Huang, Dima Kuzmin, Joonseok Lee, Chris Donahue, Fei Sha, Aren Jansen, Yu Wang, Mauro Verzetti, Timo I. Denk

    Abstract: Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally alig…

    Submitted 22 February, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: accepted at AAAI 2024, music samples available at https://tinyurl.com/v2meow

  30. arXiv:2302.03917  [pdf, other]

    cs.SD cs.LG eess.AS

    Noise2Music: Text-conditioned Music Generation with Diffusion Models

    Authors: Qingqing Huang, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, Jesse Engel, Quoc V. Le, William Chan, Zhifeng Chen, Wei Han

    Abstract: We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and…

    Submitted 6 March, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

    Comments: 15 pages

  31. arXiv:2302.03839  [pdf, other]

    eess.IV cs.CV cs.LG

    Futuristic Variations and Analysis in Fundus Images Corresponding to Biological Traits

    Authors: Muhammad Hassan, Hao Zhang, Ahmed Fateh Ameen, Home Wu Zeng, Shuye Ma, Wen Liang, Dingqi Shang, Jiaming Ding, Ziheng Zhan, Tsz Kwan Lam, Ming Xu, Qiming Huang, Dongmei Wu, Can Yang Zhang, Zhou You, Awiwu Ain, Pei Wu Qin

    Abstract: Fundus image captures rear of an eye, and which has been studied for the diseases identification, classification, segmentation, generation, and biological traits association using handcrafted, conventional, and deep learning methods. In biological traits estimation, most of the studies have been carried out for the age prediction and gender classification with convincing results. However, the curr…

    Submitted 7 February, 2023; originally announced February 2023.

    Comments: 10 pages, 4 figures, 3 tables

  32. arXiv:2302.03376  [pdf, other]

    cs.NI eess.SP

    System-Level Metrics for Non-Terrestrial Networks Under Stochastic Geometry Framework

    Authors: Qi Huang, Baha Eddine Youcef Belmekki, Ahmed M. Eltawil, Mohamed-Slim Alouini

    Abstract: Non-terrestrial networks (NTNs) are considered one of the key enablers in sixth-generation (6G) wireless networks; and with their rapid growth, system-level metrics analysis adds crucial understanding into NTN system performance. Applying stochastic geometry (SG) as a system-level analysis tool in the context of NTN offers novel insights into the network tradeoffs. In this paper, we study and high…

    Submitted 10 February, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

    Comments: 7 pages

  33. arXiv:2301.11325  [pdf, other]

    cs.SD cs.LG eess.AS

    MusicLM: Generating Music From Text

    Authors: Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank

    Abstract: We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous s…

    Submitted 26 January, 2023; originally announced January 2023.

    Comments: Supplementary material at https://google-research.github.io/seanet/musiclm/examples and https://kaggle.com/datasets/googleai/musiccaps

  34. arXiv:2301.03888  [pdf, other]

    cs.IT eess.SP

    Joint Hybrid Beamforming and User Scheduling for Multi-Satellite Cooperative Networks

    Authors: Xuan Zhang, Shu Sun, Meixia Tao, Qin Huang, Xiaohu Tang

    Abstract: In this paper, we consider a cooperative communication network where multiple satellites provide services for ground users (GUs) (at the same time and on the same frequency). The communication and computational resources on satellites are usually restricted and the satellite-GU link determination affects the communication performance significantly when multiple satellites provide services for mult…

    Submitted 13 January, 2023; v1 submitted 10 January, 2023; originally announced January 2023.

    Comments: 7 pages, 7 figures, accepted by IEEE Wireless Communications and Networking Conference (WCNC) 2023

  35. arXiv:2301.03238  [pdf, ps, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    MAQA: A Multimodal QA Benchmark for Negation

    Authors: Judith Yue Li, Aren Jansen, Qingqing Huang, Joonseok Lee, Ravi Ganti, Dima Kuzmin

    Abstract: Multimodal learning can benefit from the representation power of pretrained Large Language Models (LLMs). However, state-of-the-art transformer based LLMs often ignore negations in natural language and there is no existing benchmark to quantitatively evaluate whether multimodal transformers inherit this weakness. In this study, we present a new multimodal question answering (QA) benchmark adapted…

    Submitted 9 January, 2023; originally announced January 2023.

    Comments: NeurIPS 2022 SyntheticData4ML Workshop

  36. arXiv:2212.08973  [pdf, other]

    cs.LG eess.SY

    Enhancing Cyber Resilience of Networked Microgrids using Vertical Federated Reinforcement Learning

    Authors: Sayak Mukherjee, Ramij R. Hossain, Yuan Liu, Wei Du, Veronica Adetola, Sheik M. Mohiuddin, Qiuhua Huang, Tianzhixi Yin, Ankit Singhal

    Abstract: This paper presents a novel federated reinforcement learning (Fed-RL) methodology to enhance the cyber resiliency of networked microgrids. We formulate a resilient reinforcement learning (RL) training setup which (a) generates episodic trajectories injecting adversarial actions at primary control reference signals of the grid forming (GFM) inverters and (b) trains the RL agents (or controllers) to…

    Submitted 17 December, 2022; originally announced December 2022.

    Comments: 13 pages, 5 figures

  37. arXiv:2212.04054  [pdf, other]

    cs.CL cs.SD eess.AS

    Learning to Dub Movies via Hierarchical Prosody Models

    Authors: Gaoxiang Cong, Liang Li, Yuankai Qi, Zhengjun Zha, Qi Wu, Wenyu Wang, Bin Jiang, Ming-Hsuan Yang, Qingming Huang

    Abstract: Given a piece of text, a video clip and a reference audio, the movie dubbing (also known as visual voice clone V2C) task aims to generate speeches that match the speaker's emotion presented in the video using the desired speaker voice as reference. V2C is more challenging than conventional text-to-speech tasks as it additionally requires the generated speech to exactly match the varying emotions a…

    Submitted 4 April, 2023; v1 submitted 7 December, 2022; originally announced December 2022.

    Comments: accepted to CVPR 2023

  38. arXiv:2212.02715  [pdf, other]

    eess.SY cs.AI cs.LG math.OC

    Efficient Learning of Voltage Control Strategies via Model-based Deep Reinforcement Learning

    Authors: Ramij R. Hossain, Tianzhixi Yin, Yan Du, Renke Huang, Jie Tan, Wenhao Yu, Yuan Liu, Qiuhua Huang

    Abstract: This article proposes a model-based deep reinforcement learning (DRL) method to design emergency control strategies for short-term voltage stability problems in power systems. Recent advances show promising results in model-free DRL-based methods for power systems, but model-free methods suffer from poor sample efficiency and training time, both critical for making state-of-the-art DRL algorithms…

    Submitted 5 December, 2022; originally announced December 2022.

  39. arXiv:2212.00337  [pdf, other]

    quant-ph cs.ET eess.SY

    Fault Models in Superconducting quantum circuits

    Authors: Qifan Huang, Boxi Li, Minbo Gao, Mingsheng Ying

    Abstract: Fault models are indispensable for many EDA tasks, so as for design and implementation of quantum hardware. In this article, we propose a fault model for superconducting quantum systems. Our fault model reflects the real fault behavior in control signals and structure of quantum systems. Based on it, we conduct fault simulation on controlled-Z gate and quantum circuits by QuTiP. We provide fidelit…

    Submitted 1 December, 2022; originally announced December 2022.

    Comments: 7 pages, 10 figures

  40. arXiv:2210.16428  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

    Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

    Abstract: Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sound…

    Submitted 28 May, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: INTERSPEECH 2023

  41. arXiv:2210.09849  [pdf, other]

    eess.SP

    Scalable Framework For Deep Learning based CSI Feedback

    Authors: Liqiang **, Qiu** Huang, Qiubin Gao, Yongqiang Fei, Shaohui Sun

    Abstract: Deep learning (DL) based channel state information (CSI) feedback in multiple-input multiple-output (MIMO) systems recently has attracted lots of attention from both academia and industrial. From a practical point of views, it is huge burden to train, transfer and deploy a DL model for each parameter configuration of the base station (BS). In this paper, we propose a scalable and flexible framewor…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: 6 pages, 3 figures

  42. arXiv:2209.11971  [pdf, other]

    cs.ET eess.SP

    A Homogeneous Processing Fabric for Matrix-Vector Multiplication and Associative Search Using Ferroelectric Time-Domain Compute-in-Memory

    Authors: Xunzhao Yin, Qingrong Huang, Franz Müller, Shan Deng, Alptekin Vardar, Sourav De, Zhouhang Jiang, Mohsen Imani, Cheng Zhuo, Thomas Kämpfe, Kai Ni

    Abstract: In this work, we propose a ferroelectric FET(FeFET) time-domain compute-in-memory (TD-CiM) array as a homogeneous processing fabric for binary multiplication-accumulation (MAC) and content addressable memory (CAM). We demonstrate that: i) the XOR(XNOR)/AND logic function can be realized using a single cell composed of 2FeFETs connected in series; ii) a two-phase computation in an inverter chain wi…

    Submitted 24 September, 2022; originally announced September 2022.

    Comments: 8 pages, 8 figures

  43. arXiv:2208.13430  [pdf, other]

    eess.SP

    An AFDM-Based Integrated Sensing and Communications

    Authors: Yuanhan Ni, Zulin Wang, Peng Yuan, Qin Huang

    Abstract: This paper considers an affine frequency division multiplexing (AFDM)-based integrated sensing and communications (ISAC) system, where the AFDM waveform is used to simultaneously carry communications information and sense targets. To realize AFDM-based sensing functionality, two parameter estimation methods are designed to process echoes in the time domain and the discrete affine Fourier transform…

    Submitted 29 August, 2022; originally announced August 2022.

  44. arXiv:2208.12415  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    MuLan: A Joint Embedding of Music Audio and Natural Language

    Authors: Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, Daniel P. W. Ellis

    Abstract: Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedd…

    Submitted 25 August, 2022; originally announced August 2022.

    Comments: To appear in ISMIR 2022

  45. arXiv:2208.10011  [pdf, ps, other]

    eess.SY

    A Two-phase On-line Joint Scheduling for Welfare Maximization of Charging Station

    Authors: Qilong Huang, Qing-Shan Jia, Xiang Wu, Shengyuan Xu, Xiaohong Guan

    Abstract: The large adoption of EVs brings practical interest to the operation optimization of the charging station. The joint scheduling of pricing and charging control will achieve a win-win situation both for the charging station and EV drivers, thus enhancing the operational capability of the station. We consider this important problem in this paper and make the following contributions. First, a joint s…

    Submitted 7 December, 2022; v1 submitted 21 August, 2022; originally announced August 2022.

  46. arXiv:2207.13868  [pdf, other]

    eess.IV cs.CV cs.LG

    Extraction of Vascular Wall in Carotid Ultrasound via a Novel Boundary-Delineation Network

    Authors: Qinghua Huang, Lizhi Jia, Guanqing Ren, Xiaoyi Wang, Chunying Liu

    Abstract: Ultrasound imaging plays an important role in the diagnosis of vascular lesions. Accurate segmentation of the vascular wall is important for the prevention, diagnosis and treatment of vascular diseases. However, existing methods have inaccurate localization of the vascular wall boundary. Segmentation errors occur in discontinuous vascular wall boundaries and dark boundaries. To overcome these prob…

    Submitted 27 July, 2022; originally announced July 2022.

  47. arXiv:2207.02250  [pdf, other]

    cs.CV eess.IV

    Array Camera Image Fusion using Physics-Aware Transformers

    Authors: Qian Huang, Minghao Hu, David Jones Brady

    Abstract: We demonstrate a physics-aware transformer for feature-based data fusion from cameras with diverse resolution, color spaces, focal planes, focal lengths, and exposure. We also demonstrate a scalable solution for synthetic training data generation for the transformer using open-source computer graphics software. We demonstrate image synthesis on arrays with diverse spectral responses, instantaneous…

    Submitted 5 July, 2022; originally announced July 2022.

  48. arXiv:2205.04227  [pdf]

    eess.IV cs.CV

    Mixed-UNet: Refined Class Activation Mapping for Weakly-Supervised Semantic Segmentation with Multi-scale Inference

    Authors: Yang Liu, Ersi Zhang, Lulu Xu, Chufan Xiao, Xiaoyun Zhong, Li** Lian, Fang Li, Bin Jiang, Yuhan Dong, Lan Ma, Qiming Huang, Ming Xu, Yongbing Zhang, Dongmei Yu, Chenggang Yan, Peiwu Qin

    Abstract: Deep learning techniques have shown great potential in medical image processing, particularly through accurate and reliable image segmentation on magnetic resonance imaging (MRI) scans or computed tomography (CT) scans, which allow the localization and diagnosis of lesions. However, training these segmentation models requires a large number of manually annotated pixel-level labels, which are time-…

    Submitted 6 May, 2022; originally announced May 2022.

    Comments: 12 pages, 7 figures

  49. arXiv:2204.05738  [pdf, other]

    eess.AS cs.SD

    Text-Driven Separation of Arbitrary Sounds

    Authors: Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi

    Abstract: We propose a method of separating a desired sound source from a single-channel mixture, based on either a textual description or a short audio sample of the target source. This is achieved by combining two distinct models. The first model, SoundWords, is trained to jointly embed both an audio clip and its textual description to the same embedding in a shared representation. The second model, Sound…

    Submitted 12 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  50. arXiv:2203.15147  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD eess.SP

    Separate What You Describe: Language-Queried Audio Source Separation

    Authors: Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

    Abstract: In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing"). A unique challenge in LASS is associated with the complexity of natural language description and its relation with the audio sources. To…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022, 5 pages, 3 figures