Skip to main content

Showing 1–50 of 2,698 results for author: Huang

Searching in archive eess. Search in all archives.
.
  1. Generative Iris Prior Embedded Transformer for Iris Restoration

    Authors: Yubo Huang, Jia Wang, Peipei Li, Liuyu Xiang, Peigang Li, Zhaofeng He

    Abstract: Iris restoration from complexly degraded iris images, aiming to improve iris recognition performance, is a challenging problem. Due to the complex degradation, directly training a convolutional neural network (CNN) without prior cannot yield satisfactory results. In this work, we propose a generative iris prior embedded Transformer model (Gformer), in which we build a hierarchical encoder-decoder… ▽ More

    Submitted 28 June, 2024; originally announced July 2024.

    Comments: Our code is available at https://github.com/sawyercharlton/Gformer

    Journal ref: 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 2023, pp. 510-515

  2. arXiv:2406.19954  [pdf, other

    cs.CL cs.HC cs.SD eess.AS

    BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5

    Authors: Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, Boris Ginsburg

    Abstract: Incorporating speech understanding capabilities into pretrained large-language models has become a vital research direction (SpeechLLM). The previous architectures can be categorized as: i) GPT-style, prepend speech prompts to the text prompts as a sequence of LLM inputs like a decoder-only model; ii) T5-style, introduce speech cross-attention to each layer of the pretrained LLMs. We propose BESTO… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

    MSC Class: 68T10 ACM Class: I.2.7

  3. arXiv:2406.19856  [pdf

    eess.SP

    LUT-boosted CDR and Equalization for Burst-mode 50/100 Gbit/s Bandwidth-limited Flexible PON

    Authors: Yanlu Huang, Liyan Wu, Shangya Han, Kai **, Kun Xu, Yanni Ou

    Abstract: We proposed and experimentally demonstrated a look-up table boosted fast CDR and equalization scheme for the burst-mode 50/100 Gbps bandwidth-limited flexible PON, requiring no preamble for convergence and achieved the same bit error rate performance as in the case of long preambles.

    Submitted 28 June, 2024; originally announced June 2024.

  4. arXiv:2406.19749  [pdf, other

    eess.IV cs.CV

    SPIRONet: Spatial-Frequency Learning and Topological Channel Interaction Network for Vessel Segmentation

    Authors: De-Xing Huang, Xiao-Hu Zhou, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Mei-Jiang Gui, Hao Li, Tian-Yu Xiang, Bo-Xian Yao, Zeng-Guang Hou

    Abstract: Automatic vessel segmentation is paramount for develo** next-generation interventional navigation systems. However, current approaches suffer from suboptimal segmentation performances due to significant challenges in intraoperative images (i.e., low signal-to-noise ratio, small or slender vessels, and strong interference). In this paper, a novel spatial-frequency learning and topological channel… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

  5. arXiv:2406.19674  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

    Authors: Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

    Abstract: Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the art accuracy can be reached without relying on web-scale data. Canary - multilingual ASR and speech translation model, outperforms current state-of-the-art models - Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German languages, while b… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech-2024

  6. arXiv:2406.19072  [pdf, other

    eess.SP

    Scatterer Recognition from LiDAR Point Clouds for Environment-Embedded Vehicular Channel Modeling via Synesthesia of Machines

    Authors: Ziwei Huang, Lu Bai, Zengrui Han, Xiang Cheng

    Abstract: In this paper, a novel environment-embedded vehicular channel model is proposed by scatterer recognition from light detection and ranging (LiDAR) point clouds via Synesthesia of Machines (SoM). To provide a robust data foundation, a new intelligent sensing-communication integration dataset in vehicular urban scenarios is constructed. Based on the constructed dataset, the complex SoM mechanism, i.e… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

  7. arXiv:2406.18871  [pdf, other

    eess.AS cs.CL

    DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

    Abstract: Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby fa… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  8. arXiv:2406.18018  [pdf, other

    eess.IV

    A Cross Spatio-Temporal Pathology-based Lung Nodule Dataset

    Authors: Muwei Jian, Haoran Zhang, Mingju Shao, Hongyu Chen, Huihui Huang, Yanjie Zhong, Changlei Zhang, Bin Wang, Penghui Gao

    Abstract: Recently, intelligent analysis of lung nodules with the assistant of computer aided detection (CAD) techniques can improve the accuracy rate of lung cancer diagnosis. However, existing CAD systems and pulmonary datasets mainly focus on Computed Tomography (CT) images from one single period, while ignoring the cross spatio-temporal features associated with the progression of nodules contained in im… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

  9. arXiv:2406.17976  [pdf, other

    eess.SY

    The Role of Electric Grid Research in Addressing Climate Change

    Authors: Le Xie, Subir Majumder, Tong Huang, Qian Zhang, ** Chang, David J. Hill, Mohammad Shahidehpour

    Abstract: Addressing the urgency of climate change necessitates a coordinated and inclusive effort from all relevant stakeholders. Critical to this effort is the modeling, analysis, control, and integration of technological innovations within the electric energy system, which plays a crucial role in scaling up climate change solutions. This perspective article presents a set of research challenges and oppor… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 17 pages, 2 figures

  10. arXiv:2406.17618  [pdf, other

    eess.AS cs.CL cs.SD

    Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model

    Authors: Jiawen Huang, Emmanouil Benetos

    Abstract: Multilingual automatic lyrics transcription (ALT) is a challenging task due to the limited availability of labelled data and the challenges introduced by singing, compared to multilingual automatic speech recognition. Although some multilingual singing datasets have been released recently, English continues to dominate these collections. Multilingual ALT remains underexplored due to the scale of d… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Accepted at EUSIPCO 2024

  11. arXiv:2406.17338  [pdf, other

    eess.IV cs.CV cs.LG

    Robustly Optimized Deep Feature Decoupling Network for Fatty Liver Diseases Detection

    Authors: Peng Huang, Shu Hu, Bo Peng, Jiashu Zhang, Xi Wu, Xin Wang

    Abstract: Current medical image classification efforts mainly aim for higher average performance, often neglecting the balance between different classes. This can lead to significant differences in recognition accuracy between classes and obvious recognition weaknesses. Without the support of massive data, deep learning faces challenges in fine-grained classification of fatty liver. In this paper, we propos… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: MICCAI 2024

  12. arXiv:2406.17173  [pdf, other

    eess.IV cs.CV cs.LG

    Diff3Dformer: Leveraging Slice Sequence Diffusion for Enhanced 3D CT Classification with Transformer Networks

    Authors: Zihao **, Yingying Fang, Jiahao Huang, Caiwen Xu, Simon Walsh, Guang Yang

    Abstract: The manifestation of symptoms associated with lung diseases can vary in different depths for individual patients, highlighting the significance of 3D information in CT scans for medical image classification. While Vision Transformer has shown superior performance over convolutional neural networks in image classification tasks, their effectiveness is often demonstrated on sufficiently large 2D dat… ▽ More

    Submitted 26 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: conference

  13. arXiv:2406.16928  [pdf, other

    eess.SP cs.LG

    A Multi-Resolution Mutual Learning Network for Multi-Label ECG Classification

    Authors: Wei Huang, Ning Wang, Panpan Feng, Haiyan Wang, Zongmin Wang, Bing Zhou

    Abstract: Electrocardiograms (ECG), which record the electrophysiological activity of the heart, have become a crucial tool for diagnosing these diseases. In recent years, the application of deep learning techniques has significantly improved the performance of ECG signal classification. Multi-resolution feature analysis, which captures and processes information at different time scales, can extract subtle… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  14. arXiv:2406.16297  [pdf, other

    cs.CV eess.IV

    Priorformer: A UGC-VQA Method with content and distortion priors

    Authors: Ya**g Pei, Shiyu Huang, Yiting Lu, Xin Li, Zhibo Chen

    Abstract: User Generated Content (UGC) videos are susceptible to complicated and variant degradations and contents, which prevents the existing blind video quality assessment (BVQA) models from good performance since the lack of the adapability of distortions and contents. To mitigate this, we propose a novel prior-augmented perceptual vision transformer (PriorFormer) for the BVQA of UGC, which boots its ad… ▽ More

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: 7 pages

  15. arXiv:2406.16150  [pdf, other

    eess.IV cs.CV

    Intensity Confusion Matters: An Intensity-Distance Guided Loss for Bronchus Segmentation

    Authors: Haifan Gong, Wenhao Huang, Huan Zhang, Yu Wang, Xiang Wan, Hong Shen, Guanbin Li, Haofeng Li

    Abstract: Automatic segmentation of the bronchial tree from CT imaging is important, as it provides structural information for disease diagnosis. Despite the merits of previous automatic bronchus segmentation methods, they have paied less attention to the issue we term as \textit{Intensity Confusion}, wherein the intensity values of certain background voxels approach those of the foreground voxels within br… ▽ More

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: IEEE International Conference on Multimedia & Expo (ICME) 2024

  16. arXiv:2406.16026  [pdf

    physics.med-ph cs.LG eess.IV

    CEST-KAN: Kolmogorov-Arnold Networks for CEST MRI Data Analysis

    Authors: Jiawen Wang, Pei Cai, Ziyan Wang, Huabin Zhang, Jianpan Huang

    Abstract: Purpose: This study aims to propose and investigate the feasibility of using Kolmogorov-Arnold Network (KAN) for CEST MRI data analysis (CEST-KAN). Methods: CEST MRI data were acquired from twelve healthy volunteers at 3T. Data from ten subjects were used for training, while the remaining two were reserved for testing. The performance of multi-layer perceptron (MLP) and KAN models with the same ne… ▽ More

    Submitted 25 June, 2024; v1 submitted 23 June, 2024; originally announced June 2024.

  17. arXiv:2406.15222  [pdf

    eess.IV cs.AI cs.CV

    Rapid and Accurate Diagnosis of Acute Aortic Syndrome using Non-contrast CT: A Large-scale, Retrospective, Multi-center and AI-based Study

    Authors: Yujian Hu, Yilang Xiang, Yan-Jie Zhou, Yangyan He, Shifeng Yang, Xiaolong Du, Chunlan Den, Youyao Xu, Gaofeng Wang, Zhengyao Ding, **gyong Huang, Wenjun Zhao, Xuejun Wu, Donglin Li, Qianqian Zhu, Zhenjiang Li, Chenyang Qiu, Ziheng Wu, Yunjun He, Chen Tian, Yihui Qiu, Zuodong Lin, Xiaolong Zhang, Yuan He, Zhenpeng Yuan , et al. (15 additional authors not shown)

    Abstract: Chest pain symptoms are highly prevalent in emergency departments (EDs), where acute aortic syndrome (AAS) is a catastrophic cardiovascular emergency with a high fatality rate, especially when timely and accurate treatment is not administered. However, current triage practices in the ED can cause up to approximately half of patients with AAS to have an initially missed diagnosis or be misdiagnosed… ▽ More

    Submitted 24 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: under peer review

  18. arXiv:2406.14931  [pdf, other

    eess.SP

    Multi-beam Training for Near-field Communications in High-frequency Bands

    Authors: Cong Zhou, Changsheng You, Zixuan Huang, Shuo Shi, Yi Gong, Chan-Byoung Chae, Kaibin Huang

    Abstract: In this paper, we study efficient multi-beam training design for near-field communications to reduce the beam training overhead of conventional single-beam training methods. In particular, the array-division based multi-beam training method, which is widely used in far-field communications, cannot be directly applied to the near-field scenario, since different sub-arrays may observe different user… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: In this paper, a novel near-field multi-beam training scheme is proposed by sparsely activating a portion of antennas to form a sparse linear array

  19. arXiv:2406.14878  [pdf, other

    cs.CV cs.LG eess.IV

    MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection

    Authors: Zhuoxiao Chen, Junjie Meng, Mahsa Baktashmotlagh, Zi Huang, Yadan Luo

    Abstract: LiDAR-based 3D object detection is pivotal across many applications, yet the performance of such detection systems often degrades after deployment, especially when faced with unseen test point clouds originating from diverse locations or subjected to corruption. In this work, we introduce a new online adaptation framework for detectors named Model Synergy (MOS). Specifically, MOS dynamically assem… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

  20. arXiv:2406.13788  [pdf, other

    eess.SP

    Groupwise Deformable Registration of Diffusion Tensor Cardiovascular Magnetic Resonance: Disentangling Diffusion Contrast, Respiratory and Cardiac Motions

    Authors: Fanwen Wang, Yihao Luo, Ke Wen, Jiahao Huang, Pedro F. Ferreira, Yaqing Luo, Yinzhe Wu, Camila Munoz, Dudley J. Pennell, Andrew D. Scott, Sonia Nielles-Vallespin, Guang Yang

    Abstract: Diffusion tensor based cardiovascular magnetic resonance (DT-CMR) offers a non-invasive method to visualize the myocardial microstructure. With the assumption that the heart is stationary, frames are acquired with multiple repetitions for different diffusion encoding directions. However, motion from poor breath-holding and imprecise cardiac triggering complicates DT-CMR analysis, further challenge… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted by MICCAI 2024

  21. arXiv:2406.13708  [pdf

    eess.IV physics.med-ph

    Low-rank based motion correction followed by automatic frame selection in DT-CMR

    Authors: Fanwen Wang, Pedro F. Ferreira, Camila Munoz, Ke Wen, Yaqing Luo, Jiahao Huang, Yinzhe Wu, Dudley J. Pennell, Andrew D. Scott, Sonia Nielles-Vallespin, Guang Yang

    Abstract: Motivation: Post-processing of in-vivo diffusion tensor CMR (DT-CMR) is challenging due to the low SNR and variation in contrast between frames which makes image registration difficult, and the need to manually reject frames corrupted by motion. Goals: To develop a semi-automatic post-processing pipeline for robust DT-CMR registration and automatic frame selection. Approach: We used low intrinsic… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted as ISMRM 2024 Digital poster 2141

    Journal ref: ISMRM 2024 Digital poster 2141

  22. arXiv:2406.13209  [pdf, other

    eess.IV cs.CV physics.med-ph

    Diffusion Model-based FOD Restoration from High Distortion in dMRI

    Authors: Shuo Huang, Lujia Zhong, Yonggang Shi

    Abstract: Fiber orientation distributions (FODs) is a popular model to represent the diffusion MRI (dMRI) data. However, imaging artifacts such as susceptibility-induced distortion in dMRI can cause signal loss and lead to the corrupted reconstruction of FODs, which prohibits successful fiber tracking and connectivity analysis in affected brain regions such as the brain stem. Generative models, such as the… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: 11 pages, 7 figures

  23. arXiv:2406.13165  [pdf, other

    eess.IV cs.AI cs.CV cs.RO

    Cardiac Copilot: Automatic Probe Guidance for Echocardiography with World Model

    Authors: Haojun Jiang, Zhenguo Sun, Ning Jia, Meng Li, Yu Sun, Shaqi Luo, Shiji Song, Gao Huang

    Abstract: Echocardiography is the only technique capable of real-time imaging of the heart and is vital for diagnosing the majority of cardiac diseases. However, there is a severe shortage of experienced cardiac sonographers, due to the heart's complex structure and significant operational challenges. To mitigate this situation, we present a Cardiac Copilot system capable of providing real-time probe moveme… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Early Accepted by MICCAI 2024

  24. arXiv:2406.12946  [pdf

    eess.AS cs.AI cs.CL cs.LG

    Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

    Authors: Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg

    Abstract: In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both modalities, synthetic data generation emerges as a crucial strategy to enhance the performance of such systems and facilitate the modeling of cross-modal relationships be… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted for Interspeech 2024

  25. arXiv:2406.12456  [pdf, other

    eess.IV cs.CV

    Deep-learning-based groupwise registration for motion correction of cardiac $T_1$ map**

    Authors: Yi Zhang, Yidong Zhao, Lu Huang, Liming Xia, Qian Tao

    Abstract: Quantitative $T_1$ map** by MRI is an increasingly important tool for clinical assessment of cardiovascular diseases. The cardiac $T_1$ map is derived by fitting a known signal model to a series of baseline images, while the quality of this map can be deteriorated by involuntary respiratory and cardiac motion. To correct motion, a template image is often needed to register all baseline images, b… ▽ More

    Submitted 21 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: MICCAI 2024. Contents may slightly differ from the camera-ready version

  26. arXiv:2406.11296  [pdf

    eess.SY

    Energy efficiency analysis of ammonia-fueled power systems for vehicles considering residual heat recovery

    Authors: Zexin Nie, Yi Huang, Guangyu Tian

    Abstract: Ammonia, known as a good hydrogen carrier, shows great potential for use as a zero-carbon fuel for vehicles. However, both the internal combustion engine (ICE) and the proton exchange membrane fuel cell (PEMFC), the currently available engines used by the vehicle, require hydrogen decomposed from ammonia. On-board hydrogen production is an energy-intensive process that significantly reduces system… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  27. arXiv:2406.11064  [pdf, other

    eess.AS cs.SD

    Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech

    Authors: Guan-Ting Lin, Wei-** Huang, Hung-yi Lee

    Abstract: Deep learning-based end-to-end automatic speech recognition (ASR) has made significant strides but still struggles with performance on out-of-domain (OOD) samples due to domain shifts in real-world scenarios. Test-Time Adaptation (TTA) methods address this issue by adapting models using test samples at inference time. However, current ASR TTA methods have largely focused on non-continual TTA, whic… ▽ More

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: 13 pages

  28. arXiv:2406.11056  [pdf, other

    eess.SY

    Reachability Analysis for Linear Systems with Uncertain Parameters using Polynomial Zonotopes

    Authors: Yushen Huang, Ertai Luo, Stanley Bak, Yifan Sun

    Abstract: In real world applications, uncertain parameters are the rule rather than the exception. We present a reachability algorithm for linear systems with uncertain parameters and inputs using set propagation of polynomial zonotopes. In contrast to previous methods, our approach is able to tightly capture the non-convexity of the reachable set. Building up on our main result, we show how our reachabilit… ▽ More

    Submitted 16 June, 2024; originally announced June 2024.

  29. arXiv:2406.10365  [pdf, other

    eess.SY math.OC

    Multi-Objective Control Co-design Using Graph-Based Optimization for Offshore Wind Farm Grid Integration

    Authors: Himanshu Sharma, Wei Wang, Bowen Huang, Thiagarajan Ramachandran, Veronica Adetola

    Abstract: Offshore wind farms have emerged as a popular renewable energy source that can generate substantial electric power with a low environmental impact. However, integrating these farms into the grid poses significant complexities. To address these issues, optimal-sized energy storage can provide potential solutions and help improve the reliability, efficiency, and flexibility of the grid. Nevertheless… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

  30. arXiv:2406.10056  [pdf, other

    cs.SD eess.AS

    UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

    Authors: Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, Helen Meng

    Abstract: The Large Language models (LLMs) have demonstrated supreme capabilities in text understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, empowering the frozen LLMs to achieve multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel and LLMs-dr… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

  31. arXiv:2406.09656  [pdf, other

    cs.CV cs.AI cs.LG eess.IV

    RSEND: Retinex-based Squeeze and Excitation Network with Dark Region Detection for Efficient Low Light Image Enhancement

    Authors: **gcheng Li, Ye Qiao, Haocheng Xu, Sitao Huang

    Abstract: Images captured under low-light scenarios often suffer from low quality. Previous CNN-based deep learning methods often involve using Retinex theory. Nevertheless, most of them cannot perform well in more complicated datasets like LOL-v2 while consuming too much computational resources. Besides, some of these methods require sophisticated training at different stages, making the procedure even mor… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

  32. arXiv:2406.09317  [pdf, other

    eess.IV cs.CV

    Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

    Authors: Meng Wang, Tian Lin, Aidi Lin, Kai Yu, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen, Xue Yao, Meiqin Zhang, Binwei Huang, Chaoxin Zheng, Peixin Zhang, Wei Chen, Yilong Luo, Yifan Chen, Honghe Xia, Tingkun Shi, Qi Zhang, **ming Guo, Xiaolin Chen, **gcheng Wang, Yih Chung Tham , et al. (24 additional authors not shown)

    Abstract: Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. To RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources… ▽ More

    Submitted 30 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  33. arXiv:2406.08806  [pdf, ps, other

    eess.SY

    Adaptive Cooperative Streaming of Holographic Video Over Wireless Networks: A Proximal Policy Optimization Solution

    Authors: Wanli Wen, Ji** Yan, Yulu Zhang, Zhen Huang, Liang Liang, Yunjian Jia

    Abstract: Adapting holographic video streaming to fluctuating wireless channels is essential to maintain consistent and satisfactory Quality of Experience (QoE) for users, which, however, is a challenging task due to the dynamic and uncertain characteristics of wireless networks. To address this issue, we propose a holographic video cooperative streaming framework designed for a generic wireless network in… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: This paper has been accepted for publication in IEEE Wireless Communications Letters

  34. arXiv:2406.08710  [pdf, other

    eess.SP

    Real-time Digital RF Emulation -- I: The Direct Path Computational Model

    Authors: Coleman DeLude, Joe Driscoll, Mandovi Mukherjee, Nael Rahman, Uday Kamal, Xiangyu Mao, Sharjeel Khan, Hariharan Sivaraman, Eric Huang, Jeffrey McHarg, Madhavan Swaminathan, Santosh Pande, Saibal Mukhopadhyay, Justin Romberg

    Abstract: In this paper we consider the problem of develo** a computational model for emulating an RF channel. The motivation for this is that an accurate and scalable emulator has the potential to minimize the need for field testing, which is expensive, slow, and difficult to replicate. Traditionally, emulators are built using a tapped delay line model where long filters modeling the physical interaction… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  35. arXiv:2406.08402  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models

    Authors: Chun-Yi Kuan, Wei-** Huang, Hung-yi Lee

    Abstract: Large audio-language models (LALMs) enhance traditional large language models by integrating audio perception capabilities, allowing them to tackle audio-related tasks. Previous research has primarily focused on assessing the performance of LALMs across various tasks, yet overlooking their reliability, particularly concerning issues like object hallucination. In our study, we introduce methods to… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  36. arXiv:2406.08081  [pdf

    eess.SP

    CLDTA: Contrastive Learning based on Diagonal Transformer Autoencoder for Cross-Dataset EEG Emotion Recognition

    Authors: Yuan Liao, Yuhong Zhang, Shenghuan Wang, Xiruo Zhang, Yiling Zhang, Wei Chen, Yuzhe Gu, Liya Huang

    Abstract: Recent advances in non-invasive EEG technology have broadened its application in emotion recognition, yielding a multitude of related datasets. Yet, deep learning models struggle to generalize across these datasets due to variations in acquisition equipment and emotional stimulus materials. To address the pressing need for a universal model that fluidly accommodates diverse EEG dataset formats and… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  37. arXiv:2406.07890  [pdf, other

    eess.AS cs.CL cs.LG

    Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions

    Authors: Anfeng Xu, Kevin Huang, Tiantian Feng, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan

    Abstract: Speech foundation models, trained on vast datasets, have opened unique opportunities in addressing challenging low-resource speech understanding, such as child speech. In this work, we explore the capabilities of speech foundation models on child-adult speaker diarization. We show that exemplary foundation models can achieve 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024

  38. arXiv:2406.07498  [pdf, other

    cs.SD eess.AS

    RaD-Net 2: A causal two-stage repairing and denoising speech enhancement network with knowledge distillation and complex axial self-attention

    Authors: Mingshuai Liu, Zhuangqi Chen, Xiaopeng Yan, Yuanjun Lv, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

    Abstract: In real-time speech communication systems, speech signals are often degraded by multiple distortions. Recently, a two-stage Repair-and-Denoising network (RaD-Net) was proposed with superior speech quality improvement in the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. However, failure to use future information and constraint receptive field of convolution layers limit the system's perfor… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  39. arXiv:2406.07409  [pdf, other

    stat.ML cs.IT cs.LG eess.SP math.OC

    Accelerating Ill-conditioned Hankel Matrix Recovery via Structured Newton-like Descent

    Authors: HanQin Cai, Longxiu Huang, Xiliang Lu, Juntao You

    Abstract: This paper studies the robust Hankel recovery problem, which simultaneously removes the sparse outliers and fulfills missing entries from the partial observation. We propose a novel non-convex algorithm, coined Hankel Structured Newton-Like Descent (HSNLD), to tackle the robust Hankel recovery problem. HSNLD is highly efficient with linear convergence, and its convergence rate is independent of th… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    MSC Class: 15A29; 15A83; 47B35; 90C17; 90C26; 90C53

  40. arXiv:2406.06534  [pdf, other

    cs.CV eess.IV physics.optics

    Compressed Meta-Optical Encoder for Image Classification

    Authors: Anna Wirth-Singh, **lin Xiang, Minho Choi, Johannes E. Fröch, Luocheng Huang, Shane Colburn, Eli Shlizerman, Arka Majumdar

    Abstract: Optical and hybrid convolutional neural networks (CNNs) recently have become of increasing interest to achieve low-latency, low-power image classification and computer vision tasks. However, implementing optical nonlinearity is challenging, and omitting the nonlinear layers in a standard CNN comes at a significant reduction in accuracy. In this work, we use knowledge distillation to compress modif… ▽ More

    Submitted 14 June, 2024; v1 submitted 22 April, 2024; originally announced June 2024.

  41. arXiv:2406.06375  [pdf, other

    cs.SD cs.AI eess.AS

    MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing

    Authors: Yu-Fen Huang, Nikki Moran, Simon Coleman, Jon Kelly, Shun-Hwa Wei, Po-Yin Chen, Yun-Hsin Huang, Tsung-** Chen, Yu-Chia Kuo, Yu-Chi Wei, Chih-Hsuan Li, Da-Yu Huang, Hsuan-Kai Kao, Ting-Wei Lin, Li Su

    Abstract: In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music m… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024. 14 pages, 7 figures. Dataset is available on: https://github.com/yufenhuang/MOSA-Music-mOtion-and-Semantic-Annotation-dataset/tree/main and https://zenodo.org/records/11393449

  42. arXiv:2406.06208  [pdf, other

    cs.SD eess.AS

    Quantifying the effect of speech pathology on automatic and human speaker verification

    Authors: Bence Mark Halpern, Thomas Tienkamp, Wen-Chin Huang, Lester Phillip Violeta, Teja Rebernik, Sebastiaan de Visscher, Max Witjes, Martijn Wieling, Defne Abur, Tomoki Toda

    Abstract: This study investigates how surgical intervention for speech pathology (specifically, as a result of oral cancer surgery) impacts the performance of an automatic speaker verification (ASV) system. Using two recently collected Dutch datasets with parallel pre and post-surgery audio from the same speaker, NKI-OC-VC and SPOKE, we assess the extent to which speech pathology influences ASV performance,… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, 2 tables. Accepted to Interspeech 2024

    ACM Class: I.2.7

  43. arXiv:2406.05961  [pdf, other

    eess.AS

    BS-PLCNet 2: Two-stage Band-split Packet Loss Concealment Network with Intra-model Knowledge Distillation

    Authors: Zihan Zhang, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

    Abstract: Audio packet loss is an inevitable problem in real-time speech communication. A band-split packet loss concealment network (BS-PLCNet) targeting full-band signals was recently proposed. Although it performs superiorly in the ICASSP 2024 PLC Challenge, BS-PLCNet is a large model with high computational complexity of 8.95G FLOPS. This paper presents its updated version, BS-PLCNet 2, to reduce comput… ▽ More

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  44. arXiv:2406.05806  [pdf, other

    cs.CL cs.SD eess.AS

    Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper

    Authors: Chih-Kai Yang, Kuan-Po Huang, Hung-yi Lee

    Abstract: This research explores the interaction between Whisper, a high-performing speech recognition model, and information in prompts. Our results unexpectedly show that Whisper may not fully grasp textual prompts as anticipated. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. It is also noted that English prom… ▽ More

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: In progress

  45. arXiv:2406.05170  [pdf

    q-bio.OT cs.CV eess.IV

    Research on Tumors Segmentation based on Image Enhancement Method

    Authors: Danyi Huang, Ziang Liu, Yizhou Li

    Abstract: One of the most effective ways to treat liver cancer is to perform precise liver resection surgery, the key step of which includes precise digital image segmentation of the liver and its tumor. However, traditional liver parenchymal segmentation techniques often face several challenges in performing liver segmentation: lack of precision, slow processing speed, and computational burden. These short… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

  46. arXiv:2406.04858  [pdf, other

    cs.RO eess.SY

    Auto-Multilift: Distributed Learning and Control for Cooperative Load Transportation With Quadrotors

    Authors: Bingheng Wang, Kuankuan Sima, Rui Huang, Lin Zhao

    Abstract: Designing motion control and planning algorithms for multilift systems remains challenging due to the complexities of dynamics, collision avoidance, actuator limits, and scalability. Existing methods that use optimization and distributed techniques effectively address these constraints and scalability issues. However, they often require substantial manual tuning, leading to suboptimal performance.… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

  47. arXiv:2406.04680  [pdf, other

    eess.IV cs.CV

    MTS-Net: Dual-Enhanced Positional Multi-Head Self-Attention for 3D CT Diagnosis of May-Thurner Syndrome

    Authors: Yixin Huang, Yiqi **, Ke Tao, Kaijian Xia, Jianfeng Gu, Lei Yu, Lan Du, Cunjian Chen

    Abstract: May-Thurner Syndrome (MTS), also known as iliac vein compression syndrome or Cockett's syndrome, is a condition potentially impacting over 20 percent of the population, leading to an increased risk of iliofemoral deep venous thrombosis. In this paper, we present a 3D-based deep learning approach called MTS-Net for diagnosing May-Thurner Syndrome using CT scans. To effectively capture the spatial-t… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

  48. arXiv:2406.03872  [pdf, other

    cs.CL cs.SD eess.AS

    BLSP-Emo: Towards Empathetic Large Speech-Language Models

    Authors: Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang

    Abstract: The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, it likely involves significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we pr… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

  49. arXiv:2406.03663  [pdf

    eess.IV cs.LG q-bio.QM

    A Hybrid Deep Learning Classification of Perimetric Glaucoma Using Peripapillary Nerve Fiber Layer Reflectance and Other OCT Parameters from Three Anatomy Regions

    Authors: Ou Tan, David S. Greenfield, Brian A. Francis, Rohit Varma, Joel S. Schuman, David Huang, Dongseok Choi

    Abstract: Precis: A hybrid deep-learning model combines NFL reflectance and other OCT parameters to improve glaucoma diagnosis. Objective: To investigate if a deep learning model could be used to combine nerve fiber layer (NFL) reflectance and other OCT parameters for glaucoma diagnosis. Patients and Methods: This is a prospective observational study where of 106 normal subjects and 164 perimetric glaucoma… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: 12 pages

  50. arXiv:2406.03274  [pdf, other

    eess.AS cs.AI cs.SD

    Enhancing CTC-based speech recognition with diverse modeling units

    Authors: Shiyi Han, Zhihong Lei, Mingbin Xu, Xingyu Na, Zhen Huang

    Abstract: In recent years, the evolution of end-to-end (E2E) automatic speech recognition (ASR) models has been remarkable, largely due to advances in deep learning architectures like transformer. On top of E2E systems, researchers have achieved substantial accuracy improvement by rescoring E2E model's N-best hypotheses with a phoneme-based model. This raises an interesting question about where the improvem… ▽ More

    Submitted 11 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.