Showing 1–50 of 385 results for author: Yang, H

Searching in archive eess.
  1. arXiv:2407.00596  [pdf, other]

    eess.IV cs.CV

    HATs: Hierarchical Adaptive Taxonomy Segmentation for Panoramic Pathology Image Analysis

    Authors: Ruining Deng, Quan Liu, Can Cui, Tianyuan Yao, Juming Xiong, Shunxing Bao, Hao Li, Mengmeng Yin, Yu Wang, Shilin Zhao, Yucheng Tang, Haichun Yang, Yuankai Huo

    Abstract: Panoramic image segmentation in computational pathology presents a remarkable challenge due to the morphologically complex and variably scaled anatomy. For instance, the intricate organization in kidney pathology spans multiple layers, from regions like the cortex and medulla to functional units such as glomeruli, tubules, and vessels, down to various cell types. In this paper, we propose a novel…

    Submitted 30 June, 2024; originally announced July 2024.

  2. arXiv:2406.18549  [pdf]

    eess.IV cs.CV

    Advancements in Feature Extraction Recognition of Medical Imaging Systems Through Deep Learning Technique

    Authors: Qishi Zhan, Dan Sun, Erdi Gao, Yuhan Ma, Yaxin Liang, Haowei Yang

    Abstract: This study introduces a novel unsupervised medical image feature extraction method that employs spatial stratification techniques. An objective function based on weight is proposed to achieve the purpose of fast image recognition. The algorithm divides the pixels of the image into multiple subdomains and uses a quadtree to access the image. A technique for threshold optimization utilizing a simple…

    Submitted 23 May, 2024; originally announced June 2024.

    Comments: conference

  3. arXiv:2406.18548  [pdf]

    eess.IV cs.CV

    Exploration of Multi-Scale Image Fusion Systems in Intelligent Medical Image Analysis

    Authors: Yuxiang Hu, Haowei Yang, Ting Xu, Shuyao He, Jiajie Yuan, Haozhang Deng

    Abstract: The diagnosis of brain cancer relies heavily on medical imaging techniques, with MRI being the most commonly used. It is necessary to perform automatic segmentation of brain tumors on MRI images. This project intends to build an MRI algorithm based on U-Net. The residual network and the module used to enhance the context information are combined, and the void space convolution pooling pyramid is a…

    Submitted 23 May, 2024; originally announced June 2024.

  4. arXiv:2406.18009  [pdf, other]

    eess.AS cs.SD

    E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

    Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda

    Abstract: This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the…

    Submitted 25 June, 2024; originally announced June 2024.

  5. arXiv:2406.17430  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights

    Authors: Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari

    Abstract: Large Multimodal Models (LMMs) have achieved great success recently, demonstrating a strong capability to understand multimodal information and to interact with human users. Despite the progress made, the challenge of detecting high-risk interactions in multimodal settings, and in particular in speech modality, remains largely unexplored. Conventional research on risk for speech modality primarily…

    Submitted 25 June, 2024; originally announced June 2024.

  6. arXiv:2406.14973  [pdf, other]

    cs.CV eess.IV

    LU2Net: A Lightweight Network for Real-time Underwater Image Enhancement

    Authors: Haodong Yang, Jisheng Xu, Zhiliang Lin, Jian** He

    Abstract: Computer vision techniques have empowered underwater robots to effectively undertake a multitude of tasks, including object tracking and path planning. However, underwater optical factors like light refraction and absorption present challenges to underwater vision, which cause degradation of underwater images. A variety of underwater image enhancement methods have been proposed to improve the effe…

    Submitted 21 June, 2024; originally announced June 2024.

  7. arXiv:2406.14576  [pdf, other]

    eess.AS

    Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow Analysis

    Authors: Kubilay Can Demir, Belen Lojo Rodriguez, Tobias Weise, Andreas Maier, Seung Hee Yang

    Abstract: To develop intelligent speech assistants and integrate them seamlessly with intra-operative decision-support frameworks, accurate and efficient surgical phase recognition is a prerequisite. In this study, we propose a multimodal framework based on Gated Multimodal Units (GMU) and Multi-Stage Temporal Convolutional Networks (MS-TCN) to recognize surgical phases of port-catheter placement operations…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 5 Pages, Interspeech 2024

    MSC Class: 00b20

  8. arXiv:2406.11220  [pdf, other]

    eess.SP

    No Analog Combiner TTD-based Hybrid Precoding for Multi-User Sub-THz Communications

    Authors: Dang Qua Nguyen, Alexei Ashikhmin, Hong Yang, Taejoon Kim

    Abstract: We address the design and optimization of real-world-suitable hybrid precoders for multi-user wideband sub-terahertz (sub-THz) communications. We note that the conventional fully connected true-time delay (TTD)-based architecture is impractical because there is no room for the required large number of analog signal combiners in the circuit board. Additionally, analog signal combiners incur signifi…

    Submitted 17 June, 2024; originally announced June 2024.

  9. arXiv:2406.08920  [pdf, other]

    cs.SD cs.AI eess.AS

    AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis

    Authors: Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu

    Abstract: Novel view acoustic synthesis (NVAS) aims to render binaural audio at any target viewpoint, given a mono audio emitted by a sound source at a 3D scene. Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing binaural audio. However, in addition to low efficiency originating from heavy NeRF rendering, these methods all have a limited ability…

    Submitted 14 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  10. arXiv:2406.08837  [pdf]

    eess.IV cs.CV cs.LG

    Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network

    Authors: Houze Liu, Iris Li, Yaxin Liang, Dan Sun, Yining Yang, Haowei Yang

    Abstract: Neural networks with relatively shallow layers and simple structures may have limited ability in accurately identifying pneumonia. In addition, deep neural networks also have a large demand for computing resources, which may cause convolutional neural networks to be unable to be implemented on terminals. Therefore, this paper will carry out the optimal classification of convolutional neural networ…

    Submitted 13 June, 2024; originally announced June 2024.

  11. arXiv:2406.07801  [pdf, other]

    cs.CL cs.SD eess.AS

    PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models

    Authors: Runyan Yang, Huibao Yang, Xiqing Zhang, Tiantian Ye, Ying Liu, Yingying Gao, Shilei Zhang, Chao Deng, Junlan Feng

    Abstract: Recently, there have been attempts to integrate various speech processing tasks into a unified model. However, few previous works directly demonstrated that joint optimization of diverse tasks in multitask speech models has positive influence on the performance of individual tasks. In this paper we present a multitask speech model -- PolySpeech, which supports speech recognition, speech synthesis,…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures

  12. arXiv:2406.05982  [pdf]

    eess.IV cs.LG physics.med-ph

    Artificial Intelligence for Neuro MRI Acquisition: A Review

    Authors: Hongjia Yang, Guanhua Wang, Ziyu Li, Haoxiang Li, Jialan Zheng, Yuxin Hu, Xiaozhi Cao, Congyu Liao, Huihui Ye, Qiyuan Tian

    Abstract: Magnetic resonance imaging (MRI) has significantly benefited from the resurgence of artificial intelligence (AI). By leveraging AI's capabilities in large-scale optimization and pattern recognition, innovative methods are transforming the MRI acquisition workflow, including planning, sequence design, and correction of acquisition artifacts. These emerging algorithms demonstrate substantial potenti…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Submitted to MAGMA for review

  13. arXiv:2406.05699  [pdf, ps, other]

    eess.AS cs.AI eess.SP

    An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

    Authors: Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, **zhu Li, Sheng Zhao, **yu Li, Naoyuki Kanda

    Abstract: Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audi…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH2024

  14. arXiv:2406.04791  [pdf, other]

    cs.SD eess.AS

    Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR

    Authors: Shaojun Li, Daimeng Wei, Jiaxin Guo, ZongYao Li, Zhanglin Wu, Zhiqiang Rao, Yuanchang Luo, Xianghui He, Hao Yang

    Abstract: Despite recent improvements in End-to-End Automatic Speech Recognition (E2E ASR) systems, the performance can degrade due to vocal characteristic mismatches between training and testing data, particularly with limited target speaker adaptation data. We propose a novel speaker adaptation approach Speaker-Smoothed kNN that leverages k-Nearest Neighbors (kNN) retrieval techniques to improve model out…

    Submitted 11 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  15. arXiv:2406.04456  [pdf, other]

    eess.SP cs.AI cs.LG

    Learning Optimal Linear Precoding for Cell-Free Massive MIMO with GNN

    Authors: Benjamin Parlier, Lou Salaün, Hong Yang

    Abstract: We develop a graph neural network (GNN) to compute, within a time budget of 1 to 2 milliseconds required by practical systems, the optimal linear precoder (OLP) maximizing the minimal downlink user data rate for a Cell-Free Massive MIMO system - a key 6G wireless technology. The state-of-the-art method is a bisection search on second order cone programming feasibility test (B-SOCP) which is a magn…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted in the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) 2024

  16. arXiv:2406.04281  [pdf, other]

    eess.AS

    Total-Duration-Aware Duration Modeling for Text-to-Speech Systems

    Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, **yu Li, Sheng Zhao, Naoyuki Kanda

    Abstract: Accurate control of the total duration of generated speech by adjusting the speech rate is crucial for various text-to-speech (TTS) applications. However, the impact of adjusting the speech rate on speech quality, such as intelligibility and speaker characteristics, has been underexplored. In this work, we propose a novel total-duration-aware (TDA) duration model for TTS, where phoneme durations a…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  17. arXiv:2406.02126  [pdf, other]

    eess.SY cs.AI cs.LG cs.MA

    CityLight: A Universal Model Towards Real-world City-scale Traffic Signal Control Coordination

    Authors: **wei Zeng, Chao Yu, Xinyi Yang, Wenxuan Ao, Jian Yuan, Yong Li, Yu Wang, Huazhong Yang

    Abstract: Traffic signal control (TSC) is a promising low-cost measure to enhance transportation efficiency without affecting existing road infrastructure. While various reinforcement learning-based TSC methods have been proposed and experimentally outperform conventional rule-based methods, none of them has been deployed in the real world. An essential gap lies in the oversimplification of the scenarios in…

    Submitted 6 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

  18. arXiv:2406.01605  [pdf, other]

    eess.IV cs.CV

    An Enhanced Encoder-Decoder Network Architecture for Reducing Information Loss in Image Semantic Segmentation

    Authors: Zijun Gao, Qi Wang, Taiyuan Mei, Xiaohan Cheng, Yun Zi, Haowei Yang

    Abstract: The traditional SegNet architecture commonly encounters significant information loss during the sampling process, which detrimentally affects its accuracy in image semantic segmentation tasks. To counter this challenge, we introduce an innovative encoder-decoder network structure enhanced with residual connections. Our approach employs a multi-residual connection strategy designed to preserve the…

    Submitted 26 May, 2024; originally announced June 2024.

  19. arXiv:2405.18969  [pdf, ps, other]

    eess.SY

    Global and local observability of hypergraphs

    Authors: Chencheng Zhang, Hao Yang, Shaoxuan Cui, Bin Jiang, Ming Cao

    Abstract: This paper studies observability for non-uniform hypergraphs with inputs and outputs. To capture higher-order interactions, we define a canonical non-homogeneous dynamical system with nonlinear outputs on hypergraphs. We then construct algebraic necessary and sufficient conditions based on polynomial ideals and varieties for global observability at an initial state of hypergraphs. An example is gi…

    Submitted 29 May, 2024; originally announced May 2024.

  20. arXiv:2405.16965  [pdf, ps, other]

    cs.IT eess.SP

    Timeliness of Status Update System: The Effect of Parallel Transmission Using Heterogeneous Updating Devices

    Authors: Zhengchuan Chen, Kang Lang, Nikolaos Pappas, Howard H. Yang, Min Wang, Zhong Tian, Tony Q. S. Quek

    Abstract: Timely status updating is the premise of emerging interaction-based applications in the Internet of Things (IoT). Using redundant devices to update the status of interest is a promising method to improve the timeliness of information. However, parallel status updating leads to out-of-order arrivals at the monitor, significantly challenging timeliness analysis. This work studies the Age of Informat…

    Submitted 27 May, 2024; originally announced May 2024.

  21. arXiv:2405.16889  [pdf]

    eess.SP

    Extraction of In-Phase and Quadrature Components by Time-Encoding Sampling

    Authors: Y. H. Shao, S. Y. Chen, H. Z. Yang, F. Xi, H. Hong, Z. Liu

    Abstract: Time encoding machine (TEM) is a biologically-inspired scheme to perform signal sampling using timing. In this paper, we study its application to the sampling of bandpass signals. We propose an integrate-and-fire TEM scheme by which the in-phase (I) and quadrature (Q) components are extracted through reconstruction. We design the TEM according to the signal bandwidth and amplitude instead of upper…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: 30 pages, 8 figures

  22. arXiv:2405.14161  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang

    Abstract: We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architecture with auto-regressive decoding (e.g., Whisper, Canary). Specifica…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 23 pages, Preprint

  23. arXiv:2405.11541  [pdf, other]

    cs.IT eess.SP

    R-NeRF: Neural Radiance Fields for Modeling RIS-enabled Wireless Environments

    Authors: Huiying Yang, Zihan **, Chenhao Wu, Ru**g Xiong, Robert Caiming Qiu, Zenan Ling

    Abstract: Recently, ray tracing has gained renewed interest with the advent of Reflective Intelligent Surfaces (RIS) technology, a key enabler of 6G wireless communications due to its capability of intelligent manipulation of electromagnetic waves. However, accurately modeling RIS-enabled wireless environments poses significant challenges due to the complex variations caused by various environmental factors…

    Submitted 19 May, 2024; originally announced May 2024.

  24. arXiv:2405.10441  [pdf]

    cs.RO eess.SY

    Trajectory tracking control of a Remotely Operated Underwater Vehicle based on Fuzzy Disturbance Adaptation and Controller Parameter Optimization

    Authors: Hanzhi Yang

    Abstract: The exploration of under-ice environments presents unique challenges due to limited access for scientific research. This report investigates the potential of deploying a fully actuated Remotely Operated Vehicle (ROV) for shallow area exploration beneath ice sheets. Leveraging advancements in marine robotics technology, ROVs offer a promising solution for extending human presence into remote underw…

    Submitted 16 May, 2024; originally announced May 2024.

  25. arXiv:2405.10272  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS eess.IV

    Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

    Authors: Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung

    Abstract: The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: CVPR 2024

  26. arXiv:2405.09472  [pdf, other]

    eess.IV cs.CV

    Perception- and Fidelity-aware Reduced-Reference Super-Resolution Image Quality Assessment

    Authors: Xinying Lin, Xuyang Liu, Hong Yang, Xiaohai He, Honggang Chen

    Abstract: With the advent of image super-resolution (SR) algorithms, how to evaluate the quality of generated SR images has become an urgent task. Although full-reference methods perform well in SR image quality assessment (SR-IQA), their reliance on high-resolution (HR) images limits their practical applicability. Leveraging available reconstruction information as much as possible for SR-IQA, such as low-r…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  27. arXiv:2405.09073  [pdf, other]

    eess.SP

    Interpretable attributed scattering center extracted via deep unfolding

    Authors: Haodong Yang, Zhe Zhang, Zhongling Huang

    Abstract: Most existing sparse representation-based approaches for attributed scattering center (ASC) extraction adopt traditional iterative optimization algorithms, which suffer from lengthy computation times and limited precision. This paper presents a solution by introducing an interpretable network that can effectively and rapidly extract ASC via deep unfolding. Initially, we create a dictionary contain…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: This paper has been accepted by IGARSS2024

  28. arXiv:2405.08530  [pdf, other]

    eess.IV

    Parameter-Efficient Instance-Adaptive Neural Video Compression

    Authors: Hyunmo Yang, Seungjun Oh, Eunbyung Park

    Abstract: Learning-based Neural Video Codecs (NVCs) have emerged as a compelling alternative to the standard video codecs, demonstrating promising performance, and simple and easily maintainable pipelines. However, NVCs often fall short of compression performance and occasionally exhibit poor generalization capability due to inference-only compression scheme and their dependence on training data. The instan…

    Submitted 11 June, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

    Comments: 23 pages, 13 figures

  29. arXiv:2405.06573  [pdf, other]

    cs.SD cs.AI eess.AS

    An Investigation of Incorporating Mamba for Speech Enhancement

    Authors: Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao

    Abstract: This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. We exploit a Mamba-based regression model to characterize speech signals and build an SE system upon Mamba, termed SEMamba. We explore the properties of Mamba by integrating it as the core model in both basic and advanced SE systems, along with utilizing signal-level distances as well as metric…

    Submitted 10 May, 2024; originally announced May 2024.

  30. arXiv:2405.05579  [pdf]

    cs.HC eess.SY

    Intelligent EC Rearview Mirror: Enhancing Driver Safety with Dynamic Glare Mitigation via Cloud Edge Collaboration

    Authors: Junyi Yang, Zefei Xu, Huayi Lai, Hongjian Chen, Sifan Kong, Yutong Wu, Huan Yang

    Abstract: Sudden glare from trailing vehicles significantly increases driving safety risks. Existing anti-glare technologies such as electronic, manually-adjusted, and electrochromic rearview mirrors, are expensive and lack effective adaptability in different lighting conditions. To address these issues, our research introduces an intelligent rearview mirror system utilizing novel all-liquid electrochromic…

    Submitted 9 May, 2024; originally announced May 2024.

  31. arXiv:2405.01314  [pdf, other]

    eess.SY cs.LG

    Non-iterative Optimization of Trajectory and Radio Resource for Aerial Network

    Authors: Hyeonsu Lyu, Jonggyu Jang, Harim Lee, Hyun Jong Yang

    Abstract: We address a joint trajectory planning, user association, resource allocation, and power control problem to maximize proportional fairness in the aerial IoT network, considering practical end-to-end quality-of-service (QoS) and communication schedules. Though the problem is rather ancient, apart from the fact that the previous approaches have never considered user- and time-specific QoS, we point…

    Submitted 2 May, 2024; originally announced May 2024.

  32. arXiv:2404.14716  [pdf, other]

    cs.CL cs.AI cs.CV cs.SD eess.AS

    Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities

    Authors: Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang

    Abstract: Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in dialogue history without any model parameter update. Despite such convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayes…

    Submitted 16 June, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: 17 pages, 6 figures

  33. arXiv:2404.13714  [pdf, other]

    eess.SY

    Self-Adjusting Prescribed Performance Control for Nonlinear Systems with Input Saturation

    Authors: Zhuwu Shao, Yujuan Wang, Huanyu Yang, Yongduan Song

    Abstract: Among the existing works on enhancing system performance via prescribed performance functions (PPFs), the decay rates of PPFs need to be predetermined by the designer, directly affecting the convergence time of the closed-loop system. However, if only considering accelerating the system convergence by selecting a big decay rate of the performance function, it may lead to the severe consequence of…

    Submitted 21 April, 2024; originally announced April 2024.

  34. arXiv:2404.13289  [pdf, other]

    cs.CL cs.MM cs.SD eess.AS

    Double Mixture: Towards Continual Event Detection from Speech

    Authors: **gqi Kang, Tongtong Wu, **ming Zhao, Guitao Wang, Yinwei Wei, Hao Yang, Guilin Qi, Yuan-Fang Li, Gholamreza Haffari

    Abstract: Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events. Traditional ASR systems often overlook the interplay between these events, focusing solely on content, even though the interpretation of dialogue can vary with environmental context. This paper tackles two primary challenges in speech event detection: the continual integration of…

    Submitted 20 April, 2024; originally announced April 2024.

    Comments: The first two authors contributed equally to this work

  35. arXiv:2404.08064  [pdf]

    eess.AS cs.AI cs.CR cs.LG

    The Impact of Speech Anonymization on Pathology and Its Limits

    Authors: Soroosh Tayebi Arasteh, Tomas Arias-Vergara, Paula Andrea Perez-Toro, Tobias Weise, Kai Packhaeuser, Maria Schuster, Elmar Noeth, Andreas Maier, Seung Hee Yang

    Abstract: Integration of speech into healthcare has intensified privacy concerns due to its potential as a non-invasive biomarker containing individual biometric information. In response, speaker anonymization aims to conceal personally identifiable information while retaining crucial linguistic content. However, the application of anonymization techniques to pathological speech, a critical area where priva…

    Submitted 22 June, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

  36. arXiv:2404.04904  [pdf, other]

    cs.SD cs.AI eess.AS

    Cross-Domain Audio Deepfake Detection: Dataset and Analysis

    Authors: Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang

    Abstract: Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance. However, the existing ADD datasets are outdated, leading to suboptimal generalization of detection models. In this paper, we construct a new cross-…

    Submitted 7 April, 2024; originally announced April 2024.

  37. arXiv:2404.03154  [pdf, ps, other]

    eess.SP

    Age-of-Information-Aware Distributed Task Offloading and Resource Allocation in Mobile Edge Computing Networks

    Authors: Minwoo Kim, Jonggyu Jang, Youngchol Choi, Hyun Jong Yang

    Abstract: The growth in artificial intelligence (AI) technology has attracted substantial interests in age-of-information (AoI)-aware task offloading of mobile edge computing (MEC)-namely, minimizing service latency. Additionally, the use of MEC systems poses an additional problem arising from limited battery resources of MDs. This paper tackles the pressing challenge of AoI-aware distributed task offloadin…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: 17 pages, 8 figures

  38. arXiv:2404.02486  [pdf, other]

    eess.SY cs.IT

    Joint Optimization on Uplink OFDMA and MU-MIMO for IEEE 802.11ax: Deep Hierarchical Reinforcement Learning Approach

    Authors: Hyeonho Noh, Harim Lee, Hyun Jong Yang

    Abstract: This letter tackles a joint user scheduling, frequency resource allocation (USRA), multi-input-multi-output mode selection (MIMO MS) between single-user MIMO and multi-user (MU) MIMO, and MU-MIMO user selection problem, integrating uplink orthogonal frequency division multiple access (OFDMA) in IEEE 802.11ax. Specifically, we focus on unsaturated traffic conditions where users' data deman…

    Submitted 15 April, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

  39. arXiv:2404.02477  [pdf, ps, other]

    eess.SP cs.AI

    Enhancing Sum-Rate Performance in Constrained Multicell Networks: A Low-Information Exchange Approach

    Authors: You** Kim, Jonggyu Jang, Hyun Jong Yang

    Abstract: Despite the extensive research on massive MIMO systems for 5G telecommunications and beyond, the reality is that many deployed base stations are equipped with a limited number of antennas rather than supporting massive MIMO configurations. Furthermore, while the cell-less network concept, which eliminates cell boundaries, is under investigation, practical deployments often grapple with significant…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: 5 pages, 12 figures

  40. arXiv:2404.01672  [pdf, other]

    cs.IT eess.SP

    The Meta Distribution of the SIR in Joint Communication and Sensing Networks

    Authors: Kun Ma, Chenyuan Feng, Giovanni Geraci, Howard H. Yang

    Abstract: In this paper, we introduce a novel mathematical framework for assessing the performance of joint communication and sensing (JCAS) in wireless networks, employing stochastic geometry as an analytical tool. We focus on deriving the meta distribution of the signal-to-interference ratio (SIR) for JCAS networks. This approach enables a fine-grained quantification of individual user or radar performanc…

    Submitted 2 April, 2024; originally announced April 2024.

  41. Personalized Neural Speech Codec

    Authors: Inseon Jang, Haici Yang, Wootaek Lim, Seungkwon Beack, Minje Kim

    Abstract: In this paper, we propose a personalized neural speech codec, envisioning that personalization can reduce the model complexity or improve perceptual speech quality. Despite the common usage of speech codecs where only a single talker is involved on each side of the communication, personalizing a codec for the specific user has rarely been explored in the literature. First, we assume speakers can b…

    Submitted 31 March, 2024; originally announced April 2024.

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 991-995

  42. arXiv:2403.19062  [pdf, other]

    eess.SY cs.RO

    GENESIS-RL: GEnerating Natural Edge-cases with Systematic Integration of Safety considerations and Reinforcement Learning

    Authors: Hsin-Jung Yang, Joe Beck, Md Zahid Hasan, Ekin Beyazit, Subhadeep Chakraborty, Tichakorn Wongpiromsarn, Soumik Sarkar

    Abstract: In the rapidly evolving field of autonomous systems, the safety and reliability of the system components are fundamental requirements. These components are often vulnerable to complex and unforeseen environments, making natural edge-case generation essential for enhancing system resilience. This paper presents GENESIS-RL, a novel framework that leverages system-level safety considerations and rein…

    Submitted 27 March, 2024; originally announced March 2024.

  43. arXiv:2403.18878  [pdf, other]

    cs.CV cs.LG eess.IV

    AIC-UNet: Anatomy-informed Cascaded UNet for Robust Multi-Organ Segmentation

    Authors: Young Seok Jeon, Hongfei Yang, Huazhu Fu, Mengling Feng

    Abstract: Imposing key anatomical features, such as the number of organs, their shapes, sizes, and relative positions, is crucial for building a robust multi-organ segmentation model. Current attempts to incorporate anatomical features include broadening effective receptive fields (ERF) size with resource- and data-intensive modules such as self-attention or introducing organ-specific topology regularizers,…

    Submitted 27 March, 2024; originally announced March 2024.

  44. arXiv:2403.18843  [pdf, other]

    cs.CV cs.CL cs.LG cs.SD eess.AS

    JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition

    Authors: Chang Sun, Hong Yang, Bo Qin

    Abstract: Visual Speech Recognition (VSR) tasks are generally recognized to have a lower theoretical performance ceiling than Automatic Speech Recognition (ASR), owing to the inherent limitations of conveying semantic information visually. To mitigate this challenge, this paper introduces an advanced knowledge distillation approach using a Joint-Embedding Predictive Architecture (JEPA), named JEP-KD, design…

    Submitted 3 March, 2024; originally announced March 2024.

  45. arXiv:2403.16809  [pdf, other]

    eess.SY cs.AI cs.LG

    An LLM-Based Digital Twin for Optimizing Human-in-the Loop Systems

    Authors: Hanqing Yang, Marie Siew, Carlee Joe-Wong

    Abstract: The increasing prevalence of Cyber-Physical Systems and the Internet of Things (CPS-IoT) applications and Foundation Models is enabling new applications that leverage real-time control of the environment. For example, real-time control of Heating, Ventilation and Air-Conditioning (HVAC) systems can reduce their usage when not needed for the comfort of human occupants, hence reducing energy consumpt…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Accepted at International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things (FMSys) 2024, Co-located at CPS-IoT Week 2024

  46. A GNN Approach for Cell-Free Massive MIMO

    Authors: Lou Salaun, Hong Yang, Shashwat Mishra, Chung Shue Chen

    Abstract: The downlink of Cell-Free Massive MIMO (CFmMIMO), a beyond-5G wireless technology, relies on carefully designed precoders and power control to attain uniformly high rate coverage. Many such power control problems can be solved via second-order cone programming (SOCP). In practice, a numerical procedure several orders of magnitude faster is required because power control has to be rapidly updated to adapt to…

    Submitted 8 February, 2024; originally announced March 2024.

    Journal ref: GLOBECOM 2022 - 2022 IEEE Global Communications Conference, Dec 2022, Rio de Janeiro, Brazil, pp. 3053-3058

  47. arXiv:2403.09357  [pdf, other]

    cs.IT eess.SP

    Joint Port Selection and Beamforming Design for Fluid Antenna Assisted Integrated Data and Energy Transfer

    Authors: Long Zhang, Halvin Yang, Yizhe Zhao, Jie Hu

    Abstract: Integrated data and energy transfer (IDET) has been of fundamental importance for providing both wireless data transfer (WDT) and wireless energy transfer (WET) services towards low-power devices. Fluid antenna (FA) is capable of exploiting the huge spatial diversity of the wireless channel to enhance the receive signal strength, which is more suitable for the tiny-size low-power devices having th…

    Submitted 14 March, 2024; originally announced March 2024.

  48. arXiv:2403.06920  [pdf, other]

    eess.SP eess.SY

    Distributed Average Consensus via Noisy and Non-Coherent Over-the-Air Aggregation

    Authors: Huiwen Yang, Xiaomeng Chen, Lingying Huang, Subhrakanti Dey, Ling Shi

    Abstract: Over-the-air aggregation has attracted widespread attention for its potential advantages in task-oriented applications, such as distributed sensing, learning, and consensus. In this paper, we develop a communication-efficient distributed average consensus protocol by utilizing over-the-air aggregation, which exploits the superposition property of wireless channels rather than combating it. Noisy chan…

    Submitted 11 March, 2024; originally announced March 2024.

  49. arXiv:2403.06653  [pdf, other]

    eess.SP

    UAV-Enabled Asynchronous Federated Learning

    Authors: Zhiyuan Zhai, Xiaojun Yuan, Xin Wang, Huiyuan Yang

    Abstract: To exploit unprecedented data generation in mobile edge networks, federated learning (FL) has emerged as a promising alternative to conventional centralized machine learning (ML). However, there are some critical challenges for FL deployment. One major challenge, called the straggler issue, severely limits FL's coverage where the device with the weakest channel condition becomes the bottleneck o…

    Submitted 11 March, 2024; originally announced March 2024.

  50. arXiv:2403.05753  [pdf, other]

    eess.IV cs.CV

    UDCR: Unsupervised Aortic DSA/CTA Rigid Registration Using Deep Reinforcement Learning and Overlap Degree Calculation

    Authors: Wentao Liu, Bowen Liang, Wei** Xu, Tong Tian, Qingsheng Lu, Xipeng Pan, Haoyuan Li, Siyu Tian, Huihua Yang, Ruisheng Su

    Abstract: The rigid registration of aortic Digital Subtraction Angiography (DSA) and Computed Tomography Angiography (CTA) can provide 3D anatomical details of the vasculature for the interventional surgical treatment of conditions such as aortic dissection and aortic aneurysms, holding significant value for clinical research. However, the current methods for 2D/3D image registration are dependent on manual…

    Submitted 8 March, 2024; originally announced March 2024.