Showing 1–50 of 206 results for author: Huang, H

Searching in archive eess.
  1. arXiv:2406.18871  [pdf, other]

    eess.AS cs.CL

    DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

    Abstract: Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby fa…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  2. arXiv:2406.18018  [pdf, other]

    eess.IV

    A Cross Spatio-Temporal Pathology-based Lung Nodule Dataset

    Authors: Muwei Jian, Haoran Zhang, Mingju Shao, Hongyu Chen, Huihui Huang, Yanjie Zhong, Changlei Zhang, Bin Wang, Penghui Gao

    Abstract: Recently, intelligent analysis of lung nodules with the assistant of computer aided detection (CAD) techniques can improve the accuracy rate of lung cancer diagnosis. However, existing CAD systems and pulmonary datasets mainly focus on Computed Tomography (CT) images from one single period, while ignoring the cross spatio-temporal features associated with the progression of nodules contained in im…

    Submitted 25 June, 2024; originally announced June 2024.

  3. arXiv:2406.02166  [pdf, other]

    cs.SD cs.CL eess.AS

    Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision

    Authors: Saierdaer Yusuyin, Te Ma, Hao Huang, Wenbo Zhao, Zhijian Ou

    Abstract: There exist three approaches for multilingual and crosslingual automatic speech recognition (MCL-ASR) - supervised pre-training with phonetic or graphemic transcription, and self-supervised pre-training. We find that pre-training with phonetic supervision has been underappreciated so far for MCL-ASR, while conceptually it is more advantageous for information sharing between different languages. Th…

    Submitted 4 June, 2024; originally announced June 2024.

  4. arXiv:2406.01205  [pdf, other]

    eess.AS cs.LG cs.SD

    ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

    Authors: Shengpeng Ji, Jialong Zuo, Minghui Fang, Siqi Zheng, Qian Chen, Wen Wang, Ziyue Jiang, Hai Huang, Xize Cheng, Rongjie Huang, Zhou Zhao

    Abstract: In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style, merely based on a few seconds of audio prompt and a simple textual style description prompt. Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and…

    Submitted 3 June, 2024; originally announced June 2024.

  5. arXiv:2406.00683  [pdf, other]

    eess.IV cs.CV cs.MM

    Exploiting Frequency Correlation for Hyperspectral Image Reconstruction

    Authors: Muge Yan, Lizhi Wang, Lin Zhu, Hua Huang

    Abstract: Deep priors have emerged as potent methods in hyperspectral image (HSI) reconstruction. While most methods emphasize space-domain learning using image space priors like non-local similarity, frequency-domain learning using image frequency priors remains neglected, limiting the reconstruction capability of networks. In this paper, we first propose a Hyperspectral Frequency Correlation (HFC) prior r…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: 14 pages, 11 figures

  6. arXiv:2405.14300  [pdf, other]

    eess.IV cs.CV

    Automatic diagnosis of cardiac magnetic resonance images based on semi-supervised learning

    Authors: Hejun Huang, Zuguo Chen, Yi Huang, Guangqiang Luo, Chaoyang Chen, Youzhi Song

    Abstract: Cardiac magnetic resonance imaging (MRI) is a pivotal tool for assessing cardiac function. Precise segmentation of cardiac structures is imperative for accurate cardiac functional evaluation. This paper introduces a semi-supervised model for automatic segmentation of cardiac images and auxiliary diagnosis. By harnessing cardiac MRI images and necessitating only a small portion of annotated image d…

    Submitted 23 May, 2024; originally announced May 2024.

  7. arXiv:2404.09192  [pdf, other]

    cs.SD cs.AI eess.AS

    Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling

    Authors: Quanxiu Wang, Hui Huang, Mingjie Wang, Yong Dai, Jinzuomu Zhong, Benlai Tang

    Abstract: Over the past decade, a series of unflagging efforts have been dedicated to developing highly expressive and controllable text-to-speech (TTS) systems. In general, the holistic TTS comprises two interconnected components: the frontend module and the backend module. The frontend excels in capturing linguistic representations from the raw text input, while the backend module converts linguistic cues…

    Submitted 14 April, 2024; originally announced April 2024.

  8. arXiv:2404.07477  [pdf, ps, other]

    eess.SP

    Integrated Sensing and Communication Under DISCO Physical-Layer Jamming Attacks

    Authors: Huan Huang, Hongliang Zhang, Weidong Mei, Jun Li, Yi Cai, A. Lee Swindlehurst, Zhu Han

    Abstract: Integrated sensing and communication (ISAC) systems traditionally presuppose that sensing and communication (S&C) channels remain approximately constant during their coherence time. However, a "DISCO" reconfigurable intelligent surface (DRIS), i.e., an illegitimate RIS with random, time-varying reflection properties that acts like a "disco ball," introduces a paradigm shift that enables active cha…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: This paper has been submitted for possible publication. The code of the DISCO RIS is available on GitHub (https://github.com/huanhuan1799/Disco-Intelligent-Reflecting-Surfaces-Active-Channel-Aging-for-Fully-Passive-Jamming-Attacks)

  9. arXiv:2404.07092  [pdf, other]

    eess.SP physics.optics

    Net 835-Gb/s/λ Carrier- and LO-Free 100-km Transmission Using Channel-Aware Phase Retrieval Reception

    Authors: Hanzi Huang, Haoshuo Chen, Qian Hu, Di Che, Yetian Huang, Brian Stern, Nicolas K. Fontaine, Mikael Mazur, Lauren Dallachiesa, Roland Ryf, Zhengxuan Li, Yingxiong Song

    Abstract: We experimentally demonstrate the first carrier- and LO-free 800G/λ receiver enabling direct compatibility with standard coherent transmitters via phase retrieval, achieving net 835-Gb/s transmission over 100-km SMF and record 8.27-b/s/Hz net optical spectral efficiency.

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: 3 pages, 3 figures

  10. arXiv:2403.05834  [pdf, other]

    cs.MM cs.SD eess.AS

    Enhancing Expressiveness in Dance Generation via Integrating Frequency and Music Style Information

    Authors: Qiaochu Huang, Xu He, Boshi Tang, Haolin Zhuang, Liyang Chen, Shuochen Gao, Zhiyong Wu, Haozhi Huang, Helen Meng

    Abstract: Dance generation, as a branch of human motion generation, has attracted increasing attention. Recently, a few works attempt to enhance dance expressiveness, which includes genre matching, beat alignment, and dance dynamics, from certain aspects. However, the enhancement is quite limited as they lack comprehensive consideration of the aforementioned three factors. In this paper, we propose Expressi…

    Submitted 9 March, 2024; originally announced March 2024.

  11. arXiv:2403.02566  [pdf, other]

    eess.IV cs.CV

    Enhancing Weakly Supervised 3D Medical Image Segmentation through Probabilistic-aware Learning

    Authors: Zhaoxin Fan, Runmin Jiang, Junhao Wu, Xin Huang, Tianyang Wang, Heng Huang, Min Xu

    Abstract: 3D medical image segmentation is a challenging task with crucial implications for disease diagnosis and treatment planning. Recent advances in deep learning have significantly enhanced fully supervised medical image segmentation. However, this approach heavily relies on labor-intensive and time-consuming fully annotated ground-truth labels, particularly for 3D volumes. To overcome this limitation,…

    Submitted 4 March, 2024; originally announced March 2024.

  12. arXiv:2402.15738  [pdf, other]

    cs.CR eess.SY

    Privacy-Preserving State Estimation in the Presence of Eavesdroppers: A Survey

    Authors: Xinhao Yan, Guanzhong Zhou, Daniel E. Quevedo, Carlos Murguia, Bo Chen, Hailong Huang

    Abstract: Networked systems are increasingly the target of cyberattacks that exploit vulnerabilities within digital communications, embedded hardware, and software. Arguably, the simplest class of attacks -- and often the first type before launching destructive integrity attacks -- are eavesdropping attacks, which aim to infer information by collecting system data and exploiting it for malicious purposes. A…

    Submitted 24 February, 2024; originally announced February 2024.

    Comments: 16 pages, 5 figures, 4 tables

  13. arXiv:2402.15693  [pdf]

    eess.SY cs.AR

    Photolithography Control System : A Case Study For Cyber-Physical System

    Authors: Youbao Zhang, Huijie Huang

    Abstract: Photolithography control system (PCS) is an extremely complex distributed control system, which is composed of dozens of networked microprocessors, hundreds of actuators, hundreds of thousands of sensors, and millions of lines of code. Cyber-physical system (CPS), which deeply merges computation with physical processes together, copes with complex system from a higher level of abstraction. PCS is…

    Submitted 23 February, 2024; originally announced February 2024.

    Comments: 22 pages, 10 figures, 4 tables

  14. arXiv:2402.02411  [pdf, other]

    eess.IV cs.CV

    Physics-Inspired Degradation Models for Hyperspectral Image Fusion

    Authors: Jie Lian, Lizhi Wang, Lin Zhu, Renwei Dian, Zhiwei Xiong, Hua Huang

    Abstract: The fusion of a low-spatial-resolution hyperspectral image (LR-HSI) with a high-spatial-resolution multispectral image (HR-MSI) has garnered increasing research interest. However, most fusion methods solely focus on the fusion algorithm itself and overlook the degradation models, which results in unsatisfactory performance in practical scenarios. To fill this gap, we propose physics-inspired degra…

    Submitted 4 February, 2024; originally announced February 2024.

  15. arXiv:2402.02349  [pdf]

    eess.IV cs.CV

    Vision Transformer-based Multimodal Feature Fusion Network for Lymphoma Segmentation on PET/CT Images

    Authors: Huan Huang, Liheng Qiu, Shenmiao Yang, Longxi Li, Jiaofen Nan, Yanting Li, Chuang Han, Fubao Zhu, Chen Zhao, Weihua Zhou

    Abstract: Background: Diffuse large B-cell lymphoma (DLBCL) segmentation is a challenge in medical image analysis. Traditional segmentation methods for lymphoma struggle with the complex patterns and the presence of DLBCL lesions. Objective: We aim to develop an accurate method for lymphoma segmentation with 18F-Fluorodeoxyglucose positron emission tomography (PET) and computed tomography (CT) images. Metho…

    Submitted 4 February, 2024; originally announced February 2024.

    Comments: 14 pages, 6 figures; reference added

  16. arXiv:2401.16087  [pdf, other]

    cs.CV eess.IV

    High Resolution Image Quality Database

    Authors: Huang Huang, Qiang Wan, Jari Korhonen

    Abstract: With technology for digital photography and high resolution displays rapidly evolving and gaining popularity, there is a growing demand for blind image quality assessment (BIQA) models for high resolution images. Unfortunately, the publicly available large scale image quality databases used for training BIQA models contain mostly low or general resolution images. Since image resizing affects image…

    Submitted 29 January, 2024; originally announced January 2024.

  17. arXiv:2401.09036  [pdf, other]

    cs.IT eess.SP

    IRS-Enhanced Anti-Jamming Precoding Against DISCO Physical Layer Jamming Attacks

    Authors: Huan Huang, Hongliang Zhang, Yi Cai, Yunjing Zhang, A. Lee Swindlehurst, Zhu Han

    Abstract: Illegitimate intelligent reflective surfaces (IRSs) can pose significant physical layer security risks on multi-user multiple-input single-output (MU-MISO) systems. Recently, a DISCO approach has been proposed an illegitimate IRS with random and time-varying reflection coefficients, referred to as a "disco" IRS (DIRS). Such DIRS can attack MU-MISO systems without relying on either jamming power or…

    Submitted 17 January, 2024; originally announced January 2024.

    Comments: This paper has been accepted by IEEE ICC 2024

  18. arXiv:2401.07398  [pdf, other]

    cs.CV cs.LG eess.IV

    Cross Domain Early Crop Mapping using CropSTGAN

    Authors: Yiqun Wang, Hui Huang, Radu State

    Abstract: Driven by abundant satellite imagery, machine learning-based approaches have recently been promoted to generate high-resolution crop cultivation maps to support many agricultural applications. One of the major challenges faced by these approaches is the limited availability of ground truth labels. In the absence of ground truth, existing work usually adopts the "direct transfer strategy" that trai…

    Submitted 18 April, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

  19. arXiv:2312.15921  [pdf, other]

    cs.IT eess.SP

    Hybrid Precoder Design for Angle-of-Departure Estimation with Limited-Resolution Phase Shifters

    Authors: Huiping Huang, Musa Furkan Keskin, Henk Wymeersch, Xuesong Cai, Linlong Wu, Johan Thunberg, Fredrik Tufvesson

    Abstract: Hybrid analog-digital beamforming stands out as a key enabler for future communication systems with a massive number of antennas. In this paper, we investigate the hybrid precoder design problem for angle-of-departure (AoD) estimation, where we take into account the practical constraint on the limited resolution of phase shifters. Our goal is to design a radio-frequency (RF) precoder and a base-ba…

    Submitted 26 December, 2023; originally announced December 2023.

  20. arXiv:2312.15380  [pdf, other]

    cs.NI eess.SP

    Battery-Care Resource Allocation and Task Offloading in Multi-Agent Post-Disaster MEC Environment

    Authors: Yiwei Tang, Hualong Huang, Wenhan Zhan, Geyong Min, Zhekai Duan, Yuchuan Lei

    Abstract: Being an up-and-coming application scenario of mobile edge computing (MEC), the post-disaster rescue suffers multitudinous computing-intensive tasks but unstably guaranteed network connectivity. In rescue environments, quality of service (QoS), such as task execution delay, energy consumption and battery state of health (SoH), is of significant meaning. This paper studies a multi-user post-disaste…

    Submitted 23 December, 2023; originally announced December 2023.

    Comments: accepted by wcnc2024

  21. arXiv:2312.14776  [pdf, other]

    cs.CV eess.IV

    Compressing Image-to-Image Translation GANs Using Local Density Structures on Their Learned Manifold

    Authors: Alireza Ganjdanesh, Shangqian Gao, Hirad Alipanah, Heng Huang

    Abstract: Generative Adversarial Networks (GANs) have shown remarkable success in modeling complex data distributions for image-to-image translation. Still, their high computational demands prohibit their deployment in practical scenarios like edge devices. Existing GAN compression methods mainly rely on knowledge distillation or convolutional classifiers' pruning techniques. Thus, they neglect the critical…

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: The 38th Annual AAAI Conference on Artificial Intelligence, AAAI 2024

  22. arXiv:2312.13319  [pdf, other]

    eess.IV cs.CV

    In2SET: Intra-Inter Similarity Exploiting Transformer for Dual-Camera Compressive Hyperspectral Imaging

    Authors: Xin Wang, Lizhi Wang, Xiangtian Ma, Maoqing Zhang, Lin Zhu, Hua Huang

    Abstract: Dual-Camera Compressed Hyperspectral Imaging (DCCHI) offers the capability to reconstruct 3D Hyperspectral Image (HSI) by fusing compressive and Panchromatic (PAN) image, which has shown great potential for snapshot hyperspectral imaging in practice. In this paper, we introduce a novel DCCHI reconstruction network, the Intra-Inter Similarity Exploiting Transformer (In2SET). Our key insight is to m…

    Submitted 8 June, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  23. arXiv:2312.12211  [pdf, other]

    eess.SP

    Joint DOA estimation and distorted sensor detection under entangled low-rank and row-sparse constraints

    Authors: Huiping Huang, Tianjian Zhang, Feng Yin, Bin Liao, Henk Wymeersch

    Abstract: The problem of joint direction-of-arrival estimation and distorted sensor detection has received a lot of attention in recent decades. Most state-of-the-art work formulated such a problem via low-rank and row-sparse decomposition, where the low-rank and row-sparse components were treated in an isolated manner. Such a formulation results in a performance loss. Differently, in this paper, we entangl…

    Submitted 21 December, 2023; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024

  24. arXiv:2312.10687  [pdf, other]

    eess.AS cs.SD

    MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis

    Authors: Wenhao Guan, Yishuang Li, Tao Li, Hukai Huang, Feng Wang, Jiayan Lin, Lingyan Huang, Lin Li, Qingyang Hong

    Abstract: The style transfer task in Text-to-Speech refers to the process of transferring style information into text content to generate corresponding speech with a specific style. However, most existing style transfer approaches are either based on fixed emotional labels or reference speech clips, which cannot achieve flexible style transfer. Recently, some methods have adopted text descriptions to guide…

    Submitted 31 January, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

    Comments: Accepted at AAAI2024

  25. arXiv:2312.08089  [pdf, other]

    eess.AS

    Audio Deepfake Detection with Self-Supervised WavLM and Multi-Fusion Attentive Classifier

    Authors: Yinlin Guo, Haofan Huang, Xi Chen, He Zhao, Yuehai Wang

    Abstract: With the rapid development of speech synthesis and voice conversion technologies, Audio Deepfake has become a serious threat to the Automatic Speaker Verification (ASV) system. Numerous countermeasures are proposed to detect this type of attack. In this paper, we report our efforts to combine the self-supervised WavLM model and Multi-Fusion Attentive classifier for audio deepfake detection. Our me…

    Submitted 9 January, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2024. 5 pages, 1 figure

  26. arXiv:2311.17382  [pdf, other]

    eess.AS

    Adapting OpenAI's Whisper for Speech Recognition on Code-Switch Mandarin-English SEAME and ASRU2019 Datasets

    Authors: Yuhang Yang, Yizhou Peng, Xionghu Zhong, Hao Huang, Eng Siong Chng

    Abstract: This paper details the experimental results of adapting the OpenAI's Whisper model for Code-Switch Mandarin-English Speech Recognition (ASR) on the SEAME and ASRU2019 corpora. We conducted 2 experiments: a) using adaptation data from 1 to 100/200 hours to demonstrate effectiveness of adaptation, b) examining different language ID setup on Whisper prompt. The Mixed Error Rate results show that th…

    Submitted 29 November, 2023; originally announced November 2023.

    Comments: 6 pages, 3 figures, 4 tables

  27. arXiv:2311.10689  [pdf, other]

    eess.AS

    GhostVec: A New Threat to Speaker Privacy of End-to-End Speech Recognition System

    Authors: Xiaojiao Chen, Sheng Li, Jiyi Li, Hao Huang, Yang Cao, Liang He

    Abstract: Speaker adaptation systems face privacy concerns, for such systems are trained on private datasets and often overfitting. This paper demonstrates that an attacker can extract speaker information by querying speaker-adapted speech recognition (ASR) systems. We focus on the speaker information of a transformer-based ASR and propose GhostVec, a simple and efficient attack method to extract the speake…

    Submitted 17 November, 2023; originally announced November 2023.

    Comments: accepted in ACM Multimedia Asia 2023

  28. arXiv:2311.10664  [pdf, other]

    eess.AS

    Reprogramming Self-supervised Learning-based Speech Representations for Speaker Anonymization

    Authors: Xiaojiao Chen, Sheng Li, Jiyi Li, Hao Huang, Yang Cao, Liang He

    Abstract: Current speaker anonymization methods, especially with self-supervised learning (SSL) models, require massive computational resources when hiding speaker identity. This paper proposes an effective and parameter-efficient speaker anonymization method based on recent End-to-End model reprogramming technology. To improve the anonymization performance, we first extract speaker representation from larg…

    Submitted 17 November, 2023; originally announced November 2023.

    Comments: accepted in ACM Multimedia Asia2023

  29. arXiv:2311.10551  [pdf, other]

    eess.SP

    A Tutorial on 5G Positioning

    Authors: Lorenzo Italiano, Bernardo Camajori Tedeschini, Mattia Brambilla, Huiping Huang, Monica Nicoli, Henk Wymeersch

    Abstract: The widespread adoption of the fifth generation (5G) of cellular networks has brought new opportunities for the development of localization-based services. High-accuracy positioning use cases and functionalities defined by the standards are drawing the interest of vertical industries. In the transition towards the deployment, this paper aims to provide an in-depth tutorial on 5G positioning, summa…

    Submitted 27 March, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

    Comments: This work has been submitted to the IEEE Communications Surveys & Tutorials for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  30. arXiv:2310.18498  [pdf, ps, other]

    eess.IV cs.CV cs.LG

    GPT-4 Vision on Medical Image Classification -- A Case Study on COVID-19 Dataset

    Authors: Ruibo Chen, Tianyi Xiong, Yihan Wu, Guodong Liu, Zhengmian Hu, Lichang Chen, Yanshuo Chen, Chenxi Liu, Heng Huang

    Abstract: This technical report delves into the application of GPT-4 Vision (GPT-4V) in the nuanced realm of COVID-19 image classification, leveraging the transformative potential of in-context learning to enhance diagnostic processes.

    Submitted 27 October, 2023; originally announced October 2023.

  31. arXiv:2310.14355  [pdf]

    cs.LG eess.IV

    A global product of fine-scale urban building height based on spaceborne lidar

    Authors: Xiao Ma, Guang Zheng, Chi Xu, L. Monika Moskal, Peng Gong, Qinghua Guo, Huabing Huang, Xuecao Li, Yong Pang, Cheng Wang, Huan Xie, Bailang Yu, Bo Zhao, Yuyu Zhou

    Abstract: Characterizing urban environments with broad coverages and high precision is more important than ever for achieving the UN's Sustainable Development Goals (SDGs) as half of the world's populations are living in cities. Urban building height as a fundamental 3D urban structural feature has far-reaching applications. However, so far, producing readily available datasets of recent urban building heig…

    Submitted 22 October, 2023; originally announced October 2023.

  32. arXiv:2310.12378  [pdf, other]

    eess.AS cs.SD

    The CHiME-7 Challenge: System Description and Performance of NeMo Team's DASR System

    Authors: Tae Jin Park, He Huang, Ante Jukic, Kunal Dhawan, Krishna C. Puvvada, Nithin Koluguri, Nikolay Karpov, Aleksandr Laptev, Jagadeesh Balam, Boris Ginsburg

    Abstract: We present the NVIDIA NeMo team's multi-channel speech recognition system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) Task, focusing on the development of a multi-channel, multi-speaker speech recognition system tailored to transcribe speech from distributed microphones and microphone arrays. The system predominantly comprises of the following integral modules: the Spea…

    Submitted 18 October, 2023; originally announced October 2023.

    Journal ref: CHiME-7 Workshop 2023

  33. arXiv:2310.12371  [pdf, other]

    eess.AS cs.SD

    Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation

    Authors: Tae Jin Park, He Huang, Coleman Hooper, Nithin Koluguri, Kunal Dhawan, Ante Jukic, Jagadeesh Balam, Boris Ginsburg

    Abstract: We introduce a sophisticated multi-speaker speech data simulator, specifically engineered to generate multi-speaker speech recordings. A notable feature of this simulator is its capacity to modulate the distribution of silence and overlap via the adjustment of statistical parameters. This capability offers a tailored training environment for developing neural models suited for speaker diarization…

    Submitted 18 October, 2023; originally announced October 2023.

    Journal ref: CHiME-7 Workshop 2023

  34. arXiv:2310.09505  [pdf, other]

    cs.SD cs.LG eess.AS

    Advancing Test-Time Adaptation for Acoustic Foundation Models in Open-World Shifts

    Authors: Hongfu Liu, Hengguan Huang, Ye Wang

    Abstract: Test-Time Adaptation (TTA) is a critical paradigm for tackling distribution shifts during inference, especially in visual recognition tasks. However, while acoustic models face similar challenges due to distribution shifts in test-time speech, TTA techniques specifically designed for acoustic modeling in the context of open-world data shifts remain scarce. This gap is further exacerbated when cons…

    Submitted 14 October, 2023; originally announced October 2023.

  35. arXiv:2310.09424  [pdf, other]

    cs.CL cs.HC cs.SD eess.AS

    SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation

    Authors: Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, Boris Ginsburg

    Abstract: We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, a audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recogni…

    Submitted 13 October, 2023; originally announced October 2023.

    Comments: submit to ICASSP 2024

    MSC Class: 68T10; ACM Class: I.2.7

  36. arXiv:2310.09126  [pdf, other]

    eess.IV cs.CV cs.LG

    Physics-guided Noise Neural Proxy for Practical Low-light Raw Image Denoising

    Authors: Hansen Feng, Lizhi Wang, Yiqi Huang, Yuzhi Wang, Lin Zhu, Hua Huang

    Abstract: Recently, the mainstream practice for training low-light raw image denoising methods has shifted towards employing synthetic data. Noise modeling, which focuses on characterizing the noise distribution of real-world sensors, profoundly influences the effectiveness and practicality of synthetic data. Currently, physics-based noise modeling struggles to characterize the entire real noise distributio…

    Submitted 22 January, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: Under Review

  37. arXiv:2310.05314  [pdf, other]

    eess.SP physics.optics

    Distortion-Aware Phase Retrieval Receiver for High-Order QAM Transmission with Carrierless Intensity-Only Measurements

    Authors: Hanzi Huang, Haoshuo Chen, Qi Gao, Yetian Huang, Nicolas K. Fontaine, Mikael Mazur, Lauren Dallachiesa, Roland Ryf, Zhengxuan Li, Yingxiong Song

    Abstract: We experimentally investigate transmitting high-order quadrature amplitude modulation (QAM) signals with carrierless and intensity-only measurements with phase retrieval (PR) receiving techniques. The intensity errors during measurement, including noise and distortions, are found to be a limiting factor for the precise convergence of the PR algorithm. To improve the PR reconstruction accuracy, we…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: 12 pages, 12 figures

  38. arXiv:2310.02467  [pdf]

    physics.optics eess.SP physics.app-ph

    Dual-Polarization Phase Retrieval Receiver in Silicon Photonics

    Authors: Brian Stern, Hanzi Huang, Haoshuo Chen, Kwangwoong Kim, Mohamad Hossein Idjadi

    Abstract: We demonstrate a silicon photonic dual-polarization phase retrieval receiver. The receiver recovers phase from intensity-only measurements without a local oscillator or transmitted carrier. We design silicon waveguides providing long delays and microring resonators with large dispersion to enable symbol-to-symbol interference and dispersive projection in the phase retrieval algorithm. We retrieve…

    Submitted 3 October, 2023; originally announced October 2023.

    Comments: 11 pages, 7 figures

  39. arXiv:2310.00687  [pdf, ps, other]

    eess.SP

    DISCO Might Not Be Funky: Random Intelligent Reflective Surface Configurations That Attack

    Authors: Huan Huang, Lipeng Dai, Hongliang Zhang, Chongfu Zhang, Zhongxing Tian, Yi Cai, A. Lee Swindlehurst, Zhu Han

    Abstract: Emerging intelligent reflective surfaces (IRSs) significantly improve system performance, but also pose a significant risk for physical layer security (PLS). Unlike the extensive research on legitimate IRS-enhanced communications, in this article we present an adversarial IRS-based fully-passive jammer (FPJ). We describe typical application scenarios for Disco IRS (DIRS)-based FPJ, where an illegi…

    Submitted 10 June, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: This paper has been accepted by IEEE Wireless Communications. The code of the DISCO RIS is available on GitHub (https://github.com/huanhuan1799/Disco-Intelligent-Reflecting-Surfaces-Active-Channel-Aging-for-Fully-Passive-Jamming-Attacks)

  40. arXiv:2309.05423  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP

    Authors: Jinzuomu Zhong, Yang Li, Hui Huang, Korin Richmond, Jie Liu, Zhiba Su, Jing Guo, Benlai Tang, Fengjie Zhu

    Abstract: In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, a two-stage automatic annotation pipeline is novelly proposed in this paper. In the first stage, we use contrastive pretraining of Speech-Silenc…

    Submitted 11 June, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

  41. arXiv:2308.15716  [pdf, ps, other]

    eess.SP

    Anti-Jamming Precoding Against Disco Intelligent Reflecting Surfaces Based Fully-Passive Jamming Attacks

    Authors: Huan Huang, Lipeng Dai, Hongliang Zhang, Zhongxing Tian, Yi Cai, Chongfu Zhang, A. Lee Swindlehurst, Zhu Han

    Abstract: Emerging intelligent reflecting surfaces (IRSs) significantly improve system performance, but also pose a huge risk for physical layer security. Existing works have illustrated that a disco IRS (DIRS), i.e., an illegitimate IRS with random time-varying reflection properties (like a "disco ball"), can be employed by an attacker to actively age the channels of legitimate users (LUs). Such active cha…

    Submitted 24 January, 2024; v1 submitted 29 August, 2023; originally announced August 2023.

    Comments: This paper has been submitted for possible publication

  42. arXiv:2308.03018  [pdf, other]

    cs.CV eess.IV

    Recurrent Spike-based Image Restoration under General Illumination

    Authors: Lin Zhu, Yunlong Zheng, Mengyue Geng, Lizhi Wang, Hua Huang

    Abstract: Spike camera is a new type of bio-inspired vision sensor that records light intensity in the form of a spike array with high temporal resolution (20,000 Hz). This new paradigm of vision sensor offers significant advantages for many vision tasks such as high speed image reconstruction. However, existing spike-based approaches typically assume that the scenes are with sufficient light intensity, whi…

    Submitted 6 August, 2023; originally announced August 2023.

    Comments: Accepted by ACM MM 2023

  43. arXiv:2307.07807  [pdf, other]

    eess.IV cs.CV

    MUVF-YOLOX: A Multi-modal Ultrasound Video Fusion Network for Renal Tumor Diagnosis

    Authors: Junyu Li, Han Huang, Dong Ni, Wufeng Xue, Dongmei Zhu, Jun Cheng

    Abstract: Early diagnosis of renal cancer can greatly improve the survival rate of patients. Contrast-enhanced ultrasound (CEUS) is a cost-effective and non-invasive imaging technique and has become more and more frequently used for renal tumor diagnosis. However, the classification of benign and malignant renal tumors can still be very challenging due to the highly heterogeneous appearance of cancer and im…

    Submitted 15 July, 2023; originally announced July 2023.

    Comments: MICCAI 2023

  44. arXiv:2307.07057  [pdf, other]

    cs.CL cs.CV cs.SD eess.AS

    Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling

    Authors: He Huang, Jagadeesh Balam, Boris Ginsburg

    Abstract: We study speech intent classification and slot filling (SICSF) by proposing to use an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model, which achieves the new state-of-the-art results on the SLURP dataset, with 90.14% intent accuracy and 82.27% SLURP-F1. We compare our model with encoders pretrained on self-supervised learning (SSL), and…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: INTERSPEECH 2023

  45. arXiv:2307.03629  [pdf, ps, other]

    eess.SP

    An Anti-Jamming Strategy for Disco Intelligent Reflecting Surfaces Based Fully-Passive Jamming Attacks

    Authors: Huan Huang, Hongliang Zhang, Yi Cai, A. Lee Swindlehurst, Zhu Han

    Abstract: Emerging intelligent reflecting surfaces (IRSs) significantly improve system performance, while also pose a huge risk for physical layer security. A disco IRS (DIRS), i.e., an illegitimate IRS with random time-varying reflection properties, can be employed by an attacker to actively age the channels of legitimate users (LUs). Such active channel aging (ACA) generated by the DIRS-based fully-passiv…

    Submitted 7 July, 2023; originally announced July 2023.

  46. arXiv:2306.15212  [pdf, other]

    cs.SD cs.LG eess.AS

    TranssionADD: A multi-frame reinforcement based sequence tagging model for audio deepfake detection

    Authors: Jie Liu, Zhiba Su, Hui Huang, Caiyan Wan, Quanxiu Wang, Jiangli Hong, Benlai Tang, Fengjie Zhu

    Abstract: Thanks to recent advancements in end-to-end speech modeling technology, it has become increasingly feasible to imitate and clone a user's voice. This leads to a significant challenge in differentiating between authentic and fabricated audio segments. To address the issue of user voice abuse and misuse, the second Audio Deepfake Detection Challenge (ADD 2023) aims to detect and analyze deepfake spe…

    Submitted 27 June, 2023; originally announced June 2023.

  47. arXiv:2306.05196  [pdf, other]

    eess.IV cs.CV

    Channel prior convolutional attention for medical image segmentation

    Authors: Hejun Huang, Zuguo Chen, Ying Zou, Ming Lu, Chaoyang Chen

    Abstract: Characteristics such as low contrast and significant organ shape variations are often exhibited in medical images. The improvement of segmentation performance in medical imaging is limited by the generally insufficient adaptive capabilities of existing attention mechanisms. An efficient Channel Prior Convolutional Attention (CPCA) method is proposed in this paper, supporting the dynamic distributi…

    Submitted 8 June, 2023; originally announced June 2023.

  48. arXiv:2306.04301  [pdf, other]

    cs.SD eess.AS

    Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge

    Authors: Wenhao Guan, Tao Li, Yishuang Li, Hukai Huang, Qingyang Hong, Lin Li

    Abstract: With the demand for autonomous control and personalized speech generation, the style control and transfer in Text-to-Speech (TTS) is becoming more and more important. In this paper, we propose a new TTS system that can perform style transfer with interpretability and high fidelity. Firstly, we design a TTS system that combines variational autoencoder (VAE) and diffusion refiner to get refined mel-…

    Submitted 11 July, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

    Comments: Accepted at Interspeech 2023

  49. arXiv:2305.16753  [pdf, other]

    eess.AS cs.AI eess.SP

    ElectrodeNet -- A Deep Learning Based Sound Coding Strategy for Cochlear Implants

    Authors: Enoch Hsin-Ho Huang, Rong Chao, Yu Tsao, Chao-Min Wu

    Abstract: ElectrodeNet, a deep learning based sound coding strategy for the cochlear implant (CI), is proposed to emulate the advanced combination encoder (ACE) strategy by replacing the conventional envelope detection using various artificial neural networks. The extended ElectrodeNet-CS strategy further incorporates the channel selection (CS). Network models of deep neural network (DNN), convolutional neu…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: 12 pages and 7 figures. Preprint version; IEEE Transactions on Cognitive and Developmental Systems (accepted)

  50. arXiv:2305.16222  [pdf, ps, other]

    eess.IV cs.CV cs.LG q-bio.NC

    Incomplete Multimodal Learning for Complex Brain Disorders Prediction

    Authors: Reza Shirkavand, Liang Zhan, Heng Huang, Li Shen, Paul M. Thompson

    Abstract: Recent advancements in the acquisition of various brain data sources have created new opportunities for integrating multimodal brain data to assist in early detection of complex brain disorders. However, current data integration approaches typically need a complete set of biomedical data modalities, which may not always be feasible, as some modalities are only available in large-scale research coh…

    Submitted 25 May, 2023; originally announced May 2023.