Showing 1–50 of 150 results for author: Han, J

  1. arXiv:2406.16148  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking

    Authors: Yuwei Zhang, Tong Xia, Jing Han, Yu Wu, Georgios Rizos, Yang Liu, Mohammed Mosuily, Jagmohan Chauhan, Cecilia Mascolo

    Abstract: Respiratory audio, such as coughing and breathing sounds, has predictive power for a wide range of healthcare applications, yet is currently under-explored. The main problem for those applications arises from the difficulty in collecting large labeled task-specific data for model development. Generalizable respiratory acoustic foundation models pretrained with unlabeled data would offer appealing…

    Submitted 23 June, 2024; originally announced June 2024.

  2. arXiv:2406.10434  [pdf, other]

    eess.SY

    Risk-Aware Value-Oriented Net Demand Forecasting for Virtual Power Plants

    Authors: Yufan Zhang, Jiajun Han, Yuanyuan Shi

    Abstract: This paper develops a risk-aware net demand forecasting product for virtual power plants, which helps reduce the risk of high operation costs. At the training phase, a bilevel program for parameter estimation is formulated, where the upper level optimizes over the forecast model parameter to minimize the conditional value-at-risk (a risk metric) of operation costs. The lower level solves the opera…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Submitted to The 56th North American Power Symposium (NAPS 2024)

  3. arXiv:2406.10083  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Evaluation of Speech Foundation Models for Spoken Language Understanding

    Authors: Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-yi Lee, Karen Livescu, Shinji Watanabe

    Abstract: The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for th…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted at ACL Findings 2024

  4. arXiv:2406.06337  [pdf, other]

    physics.optics eess.IV physics.bio-ph

    System- and Sample-agnostic Isotropic 3D Microscopy by Weakly Physics-informed, Domain-shift-resistant Axial Deblurring

    Authors: Jiashu Han, Kunzan Liu, Keith B. Isaacson, Kristina Monakhova, Linda G. Griffith, Sixian You

    Abstract: Three-dimensional (3D) subcellular imaging is essential for biomedical research, but the diffraction limit of optical microscopy compromises axial resolution, hindering accurate 3D structural analysis. This challenge is particularly pronounced in label-free imaging of thick, heterogeneous tissues, where assumptions about data distribution (e.g. sparsity, label-specific distribution, and lateral-ax…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 27 pages, 6 figures

  5. arXiv:2406.02438  [pdf, other]

    eess.AS cs.MM cs.SD

    CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

    Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

    Abstract: Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesi…

    Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  6. arXiv:2405.08317  [pdf, other]

    cs.CL cs.SD eess.AS

    SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

    Authors: Raghuveer Peri, Sai Muralidhar Jayanthi, Srikanth Ronanki, Anshu Bhatia, Karel Mundnich, Saket Dingliwal, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Srikanth Vishnubhotla, Daniel Garcia-Romero, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

    Abstract: Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remains largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: 9+6 pages, Submitted to ACL 2024

  7. arXiv:2405.08295  [pdf, other]

    cs.CL cs.SD eess.AS

    SpeechVerse: A Large-scale Generalizable Audio Language Model

    Authors: Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

    Abstract: Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore devel…

    Submitted 31 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

    Comments: Single Column, 13 page

  8. arXiv:2405.05244  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

    Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

    Abstract: The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specializ…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: Evaluation plan of the SVDD Challenge @ SLT 2024

  9. arXiv:2405.03953  [pdf, other]

    cs.SD eess.AS

    Intelligent Cardiac Auscultation for Murmur Detection via Parallel-Attentive Models with Uncertainty Estimation

    Authors: Zixing Zhang, Tao Pang, Jing Han, Björn W. Schuller

    Abstract: Heart murmurs are a common manifestation of cardiovascular diseases and can provide crucial clues to early cardiac abnormalities. While most current research methods primarily focus on the accuracy of models, they often overlook other important aspects such as the interpretability of machine learning algorithms and the uncertainty of predictions. This paper introduces a heart murmur detection meth…

    Submitted 6 May, 2024; originally announced May 2024.

    Journal ref: published at ICASSP 2024

  10. arXiv:2405.03952  [pdf, other]

    cs.SD cs.CL eess.AS

    HAFFormer: A Hierarchical Attention-Free Framework for Alzheimer's Disease Detection From Spontaneous Speech

    Authors: Zhongren Dong, Zixing Zhang, Weixiang Xu, Jing Han, Jianjun Ou, Björn W. Schuller

    Abstract: Automatically detecting Alzheimer's Disease (AD) from spontaneous speech plays an important role in its early diagnosis. Recent approaches highly rely on the Transformer architectures due to its efficiency in modelling long-range context dependencies. However, the quadratic increase in computational complexity associated with self-attention and the length of audio poses a challenge when deploying…

    Submitted 6 May, 2024; originally announced May 2024.

    Journal ref: published at ICASSP 2024

  11. arXiv:2404.17917  [pdf, other]

    cs.CV eess.IV

    EvaNet: Elevation-Guided Flood Extent Mapping on Earth Imagery

    Authors: Mirza Tanzim Sami, Da Yan, Saugat Adhikari, Lyuheng Yuan, Jiao Han, Zhe Jiang, Jalal Khalil, Yang Zhou

    Abstract: Accurate and timely mapping of flood extent from high-resolution satellite imagery plays a crucial role in disaster management such as damage assessment and relief activities. However, current state-of-the-art solutions are based on U-Net, which cannot segment the flood pixels accurately due to the ambiguous pixels (e.g., tree canopies, clouds) that prevent a direct judgement from only the spectr…

    Submitted 12 May, 2024; v1 submitted 27 April, 2024; originally announced April 2024.

    Comments: Accepted at the International Joint Conference on Artificial Intelligence (IJCAI, 2024)

  12. arXiv:2404.17683  [pdf, other]

    math.OC cs.GT cs.LG eess.SY

    Energy Storage Arbitrage in Two-settlement Markets: A Transformer-Based Approach

    Authors: Saud Alghumayjan, Jiajun Han, Ningkun Zheng, Ming Yi, Bolun Xu

    Abstract: This paper presents an integrated model for bidding energy storage in day-ahead and real-time markets to maximize profits. We show that in integrated two-stage bidding, the real-time bids are independent of day-ahead settlements, while the day-ahead bids should be based on predicted real-time prices. We utilize a transformer-based model for real-time price prediction, which captures complex dynami…

    Submitted 26 April, 2024; originally announced April 2024.

  13. arXiv:2404.07336  [pdf, other]

    cs.CV cs.MM eess.AS

    PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores

    Authors: Lucas Goncalves, Prashant Mathur, Chandrashekhar Lavania, Metehan Cekic, Marcello Federico, Kyu J. Han

    Abstract: Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks. However, the growth is not attributed solely to models and benchmarks. Universally accepted evaluation metrics also play an important role in advancing the field. While there are many metrics available to evaluate audio and visual content separately…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: 24 pages

  14. arXiv:2404.02999  [pdf, other]

    eess.IV cs.CV

    MeshBrush: Painting the Anatomical Mesh with Neural Stylization for Endoscopy

    Authors: John J. Han, Ayberk Acar, Nicholas Kavoussi, Jie Ying Wu

    Abstract: Style transfer is a promising approach to close the sim-to-real gap in medical endoscopy. Rendering realistic endoscopic videos by traversing pre-operative scans (such as MRI or CT) can generate realistic simulations as well as ground truth camera poses and depth maps. Although image-to-image (I2I) translation models such as CycleGAN perform well, they are unsuitable for video-to-video synthesis d…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: 10 pages, 5 figures

  15. arXiv:2404.00569  [pdf, other]

    cs.SD cs.CL eess.AS

    CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

    Authors: Xiang Li, Fan Bu, Ambuj Mehrish, Yingting Li, Jiale Han, Bo Cheng, Soujanya Poria

    Abstract: Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech synthesis. Yet, the efficiency of multi-step sampling in Diffusion Models presents challenges. Efforts have been made to integrate GANs with DMs, speeding up infere…

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: Accepted by Findings of NAACL 2024. Code is available at https://github.com/XiangLi2022/CM-TTS

  16. arXiv:2403.10520  [pdf, other]

    cs.CV cs.LG eess.IV

    Strong and Controllable Blind Image Decomposition

    Authors: Zeyu Zhang, Junlin Han, Chenhui Gou, Hongdong Li, Liang Zheng

    Abstract: Blind image decomposition aims to decompose all components present in an image, typically used to restore a multi-degraded input image. While fully recovering the clean image is appealing, in some scenarios, users might want to retain certain degradations, such as watermarks, for copyright protection. To address this need, we add controllability to the blind image decomposition process, allowing u…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Code: https://github.com/Zhangzeyu97/CBD.git

  17. arXiv:2402.12660  [pdf, other]

    cs.SD cs.HC eess.AS

    SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion

    Authors: Liumeng Xue, Chaoren Wang, Mingxuan Wang, Xueyao Zhang, Jun Han, Zhizheng Wu

    Abstract: In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a visual display of the generation process in diffusion models, showcasing the step-by-step denoising of the noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also fac…

    Submitted 19 February, 2024; originally announced February 2024.

  18. arXiv:2402.01115  [pdf, other]

    cs.CL eess.SP

    Interpretation of Intracardiac Electrograms Through Textual Representations

    Authors: William Jongwon Han, Diana Gomez, Avi Alok, Chaojing Duan, Michael A. Rosenberg, Douglas Weber, Emerson Liu, Ding Zhao

    Abstract: Understanding the irregular electrical activity of atrial fibrillation (AFib) has been a key challenge in electrocardiography. For serious cases of AFib, catheter ablations are performed to collect intracardiac electrograms (EGMs). EGMs offer intricately detailed and localized electrical activity of the heart and are an ideal modality for interpretable cardiac studies. Recent advancements in artif…

    Submitted 11 April, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

    Comments: 18 pages, 9 figures; Accepted to CHIL 2024

    ACM Class: I.2.7; J.3

  19. arXiv:2401.05850  [pdf, other]

    cs.SD eess.AS

    Contrastive Loss Based Frame-wise Feature disentanglement for Polyphonic Sound Event Detection

    Authors: Yadong Guan, Jiqing Han, Hongwei Song, Wenjie Song, Guibin Zheng, Tieran Zheng, Yongjun He

    Abstract: Overlapping sound events are ubiquitous in real-world environments, but existing end-to-end sound event detection (SED) methods still struggle to detect them effectively. A critical reason is that these methods represent overlapping events using shared and entangled frame-wise features, which degrades the feature discrimination. To solve the problem, we propose a disentangled feature learning fram…

    Submitted 11 January, 2024; originally announced January 2024.

    Comments: accepted by icassp2024

  20. arXiv:2312.15659  [pdf, other]

    eess.IV

    Perceptual Quality Assessment for Video Frame Interpolation

    Authors: Jinliang Han, Xiongkuo Min, Yixuan Gao, Jun Jia, Lei Sun, Zuowei Cao, Yonglin Luo, Guangtao Zhai

    Abstract: The quality of frames is significant for both research and application of video frame interpolation (VFI). In recent VFI studies, the methods of full-reference image quality assessment have generally been used to evaluate the quality of VFI frames. However, high frame rate reference videos, necessities for the full-reference methods, are difficult to obtain in most applications of VFI. To evaluate…

    Submitted 25 December, 2023; originally announced December 2023.

    Comments: 5 pages, 4 figures

    ACM Class: I.4.0

  21. arXiv:2312.10112  [pdf, other]

    cs.CV cs.LG eess.IV

    NM-FlowGAN: Modeling sRGB Noise with a Hybrid Approach based on Normalizing Flows and Generative Adversarial Networks

    Authors: Young Joo Han, Ha-Jin Yu

    Abstract: Modeling and synthesizing real sRGB noise is crucial for various low-level vision tasks, such as building datasets for training image denoising systems. The distribution of real sRGB noise is highly complex and affected by a multitude of factors, making its accurate modeling extremely challenging. Therefore, recent studies have proposed methods that employ data-driven generative models, such as ge…

    Submitted 14 March, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: 25 pages, 11 figures, 7 tables

    MSC Class: 68T45 ACM Class: I.4.4

  22. arXiv:2312.09911  [pdf, other]

    cs.SD eess.AS

    Amphion: An Open-Source Audio, Music and Speech Generation Toolkit

    Authors: Xueyao Zhang, Liumeng Xue, Yicheng Gu, Yuancheng Wang, Haorui He, Chaoren Wang, Xi Chen, Zihao Fang, Haopeng Chen, Junan Zhang, Tze Ying Tang, Lexiao Zou, Mingxuan Wang, Jun Han, Kai Chen, Haizhou Li, Zhizheng Wu

    Abstract: Amphion is an open-source toolkit for Audio, Music, and Speech Generation, targeting to ease the way for junior researchers and engineers into these fields. It presents a unified framework that is inclusive of diverse generation tasks and models, with the added bonus of being easily extendable for new incorporation. The toolkit is designed with beginner-friendly workflows and pre-trained models, a…

    Submitted 22 February, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: Amphion Website: https://github.com/open-mmlab/Amphion

  23. arXiv:2312.05465  [pdf, other]

    cs.LG eess.SY

    On Task-Relevant Loss Functions in Meta-Reinforcement Learning and Online LQR

    Authors: Jaeuk Shin, Giho Kim, Howon Lee, Joonho Han, Insoon Yang

    Abstract: Designing a competent meta-reinforcement learning (meta-RL) algorithm in terms of data usage remains a central challenge to be tackled for its successful real-world applications. In this paper, we propose a sample-efficient meta-RL algorithm that learns a model of the system or environment at hand in a task-directed manner. As opposed to the standard model-based approaches to meta-RL, our method e…

    Submitted 8 December, 2023; originally announced December 2023.

  24. arXiv:2311.06968  [pdf, other]

    cs.LG cs.AI eess.SP stat.ML

    Physics-Informed Data Denoising for Real-Life Sensing Systems

    Authors: Xiyuan Zhang, Xiaohan Fu, Diyan Teng, Chengyu Dong, Keerthivasan Vijayakumar, Jiayun Zhang, Ranak Roy Chowdhury, Junsheng Han, Dezhi Hong, Rashmi Kulkarni, Jingbo Shang, Rajesh Gupta

    Abstract: Sensors measuring real-life physical processes are ubiquitous in today's interconnected world. These sensors inherently bear noise that often adversely affects performance and reliability of the systems they support. Classic filtering-based approaches introduce strong assumptions on the time or frequency characteristics of sensory measurements, while learning-based denoising approaches typically r…

    Submitted 12 November, 2023; originally announced November 2023.

    Comments: SenSys 2023

  25. arXiv:2310.16102  [pdf, other]

    eess.IV cs.CV physics.optics

    Learned, Uncertainty-driven Adaptive Acquisition for Photon-Efficient Multiphoton Microscopy

    Authors: Cassandra Tong Ye, Jiashu Han, Kunzan Liu, Anastasios Angelopoulos, Linda Griffith, Kristina Monakhova, Sixian You

    Abstract: Multiphoton microscopy (MPM) is a powerful imaging tool that has been a critical enabler for live tissue imaging. However, since most multiphoton microscopy platforms rely on point scanning, there is an inherent trade-off between acquisition time, field of view (FOV), phototoxicity, and image quality, often resulting in noisy measurements when fast, large FOV, and/or gentle imaging is needed. Deep…

    Submitted 24 October, 2023; originally announced October 2023.

  26. arXiv:2310.05352  [pdf, other]

    cs.CL cs.SD eess.AS

    A Glance is Enough: Extract Target Sentence By Looking at A keyword

    Authors: Ying Shi, Dong Wang, Lantian Li, Jiqing Han

    Abstract: This paper investigates the possibility of extracting a target sentence from multi-talker speech using only a keyword as input. For example, in social security applications, the keyword might be "help", and the goal is to identify what the person who called for help is articulating while ignoring other speakers. To address this problem, we propose using the Transformer architecture to embed both t…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: submitted to ICASSP 2024

  27. arXiv:2309.15697  [pdf, other]

    cs.CV eess.IV

    Physics Inspired Hybrid Attention for SAR Target Recognition

    Authors: Zhongling Huang, Chong Wu, Xiwen Yao, Zhicheng Zhao, Xiankai Huang, Junwei Han

    Abstract: There has been a recent emphasis on integrating physical models and deep neural networks (DNNs) for SAR target recognition, to improve performance and achieve a higher level of physical interpretability. The attributed scattering center (ASC) parameters garnered the most interest, being considered as additional input data or features for fusion in most methods. However, the performance greatly dep…

    Submitted 27 September, 2023; originally announced September 2023.

  28. arXiv:2309.10065  [pdf, other]

    q-bio.NC cs.LG eess.IV

    Bayesian longitudinal tensor response regression for modeling neuroplasticity

    Authors: Suprateek Kundu, Alec Reinhardt, Serena Song, Joo Han, M. Lawson Meadows, Bruce Crosson, Venkatagiri Krishnamurthy

    Abstract: A major interest in longitudinal neuroimaging studies involves investigating voxel-level neuroplasticity due to treatment and other factors across visits. However, traditional voxel-wise methods are beset with several pitfalls, which can compromise the accuracy of these approaches. We propose a novel Bayesian tensor response regression approach for longitudinal imaging data, which pools informatio…

    Submitted 18 October, 2023; v1 submitted 12 September, 2023; originally announced September 2023.

    Comments: 28 pages, 8 figures, 6 tables

  29. arXiv:2309.08377  [pdf, other]

    eess.AS cs.CL cs.SD

    DiaCorrect: Error Correction Back-end For Speaker Diarization

    Authors: Jiangyu Han, Federico Landini, Johan Rohdin, Mireia Diez, Lukas Burget, Yuhang Cao, Heng Lu, Jan Cernocky

    Abstract: In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initia…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  30. arXiv:2309.03905  [pdf, other]

    cs.MM cs.CL cs.CV cs.LG cs.SD eess.AS

    ImageBind-LLM: Multi-modality Instruction Tuning

    Authors: Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao

    Abstract: We present ImageBind-LLM, a multi-modality instruction tuning method of large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning, different from which, our ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic by only image-text alignment training. During training…

    Submitted 11 September, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

    Comments: Code is available at https://github.com/OpenGVLab/LLaMA-Adapter

  31. arXiv:2309.02529  [pdf, other]

    eess.IV

    Fast and High-Performance Learned Image Compression With Improved Checkerboard Context Model, Deformable Residual Module, and Knowledge Distillation

    Authors: Haisheng Fu, Feng Liang, Jie Liang, Yongqiang Wang, Guohe Zhang, Jingning Han

    Abstract: Deep learning-based image compression has made great progresses recently. However, many leading schemes use serial context-adaptive entropy model to improve the rate-distortion (R-D) performance, which is very slow. In addition, the complexities of the encoding and decoding networks are quite high and not suitable for many practical applications. In this paper, we introduce four techniques to bala…

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: Submitted to Trans. Journal

  32. arXiv:2308.00187  [pdf, ps, other]

    cs.RO cs.CV eess.SP

    Detecting the Anomalies in LiDAR Pointcloud

    Authors: Chiyu Zhang, Ji Han, Yao Zou, Kexin Dong, Yujia Li, Junchun Ding, Xiaoling Han

    Abstract: LiDAR sensors play an important role in the perception stack of modern autonomous driving systems. Adverse weather conditions such as rain, fog and dust, as well as some (occasional) LiDAR hardware fault may cause the LiDAR to produce pointcloud with abnormal patterns such as scattered noise points and uncommon intensity values. In this paper, we propose a novel approach to detect whether a LiDAR…

    Submitted 31 July, 2023; originally announced August 2023.

  33. arXiv:2307.14491  [pdf, other]

    cs.MM cs.SD eess.AS

    A Unified Framework for Modality-Agnostic Deepfakes Detection

    Authors: Cai Yu, Peng Chen, Jiahe Tian, Jin Liu, Jiao Dai, Xi Wang, Yesheng Chai, Shan Jia, Siwei Lyu, Jizhong Han

    Abstract: As AI-generated content (AIGC) thrives, deepfakes have expanded from single-modality falsification to cross-modal fake content creation, where either audio or visual components can be manipulated. While using two unimodal detectors can detect audio-visual deepfakes, cross-modal forgery clues could be overlooked. Existing multimodal deepfake detection methods typically establish correspondence betw…

    Submitted 24 October, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  34. arXiv:2307.07518  [pdf]

    cs.AI cs.CL cs.CV eess.IV

    CephGPT-4: An Interactive Multimodal Cephalometric Measurement and Diagnostic System with Visual Large Language Model

    Authors: Lei Ma, Jincong Han, Zhaoxin Wang, Dian Zhang

    Abstract: Large-scale multimodal language models (LMMs) have achieved remarkable success in general domains. However, the exploration of diagnostic language models based on multimodal cephalometric medical data remains limited. In this paper, we propose a novel multimodal cephalometric analysis and diagnostic dialogue model. Firstly, a multimodal orthodontic medical dataset is constructed, comprising cephal…

    Submitted 1 July, 2023; originally announced July 2023.

  35. arXiv:2306.15561  [pdf, other]

    cs.CV cs.MM eess.IV

    You Can Mask More For Extremely Low-Bitrate Image Compression

    Authors: Anqi Li, Feng Li, Jiaxin Han, Huihui Bai, Runmin Cong, Chunjie Zhang, Meng Wang, Weisi Lin, Yao Zhao

    Abstract: Learned image compression (LIC) methods have experienced significant progress during recent years. However, these methods are primarily dedicated to optimizing the rate-distortion (R-D) performance at medium and high bitrates (> 0.1 bits per pixel (bpp)), while research on extremely low bitrates is limited. Besides, existing methods fail to explicitly explore the image structure and texture compon…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: Under review

  36. EEG Decoding for Datasets with Heterogenous Electrode Configurations using Transfer Learning Graph Neural Networks

    Authors: Jinpei Han, Xiaoxi Wei, A. Aldo Faisal

    Abstract: Brain-Machine Interfacing (BMI) has greatly benefited from adopting machine learning methods for feature learning that require extensive data for training, which are often unavailable from a single dataset. Yet, it is difficult to combine data across labs or even data within the same lab collected over the years due to the variation in recording equipment and electrode layouts resulting in shifts…

    Submitted 20 June, 2023; originally announced June 2023.

    Comments: 19 pages, 4 figures

    Journal ref: J. Neural Eng. 20 066027 (2023)

  37. arXiv:2305.17706  [pdf, other]

    cs.SD cs.AI eess.AS

    Spot keywords from very noisy and mixed speech

    Authors: Ying Shi, Dong Wang, Lantian Li, Jiqing Han, Shi Yin

    Abstract: Most existing keyword spotting research focuses on conditions with slight or moderate noise. In this paper, we try to tackle a more challenging task: detecting keywords buried under strong interfering speech (10 times higher than the keyword in amplitude), and even worse, mixed with other keywords. We propose a novel Mix Training (MT) strategy that encourages the model to discover low-energy keywo…

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  38. arXiv:2305.07783  [pdf, other]

    cs.CV eess.IV

    ROI-based Deep Image Compression with Swin Transformers

    Authors: Binglin Li, Jie Liang, Haisheng Fu, Jingning Han

    Abstract: Encoding the Region Of Interest (ROI) with better quality than the background has many applications including video conferencing systems, video surveillance and object-oriented vision tasks. In this paper, we propose a ROI-based image compression framework with Swin transformers as main building blocks for the autoencoder network. The binary ROI mask is integrated into different layers of the netw…

    Submitted 12 May, 2023; originally announced May 2023.

    Comments: This paper has been accepted by ICASSP 2023

  39. arXiv:2305.03328  [pdf, other]

    eess.AS cs.SD

    Time-weighted Frequency Domain Audio Representation with GMM Estimator for Anomalous Sound Detection

    Authors: Jian Guan, Youde Liu, Qiaoxi Zhu, Tieran Zheng, Jiqing Han, Wenwu Wang

    Abstract: Although deep learning is the mainstream method in unsupervised anomalous sound detection, Gaussian Mixture Model (GMM) with statistical audio frequency representation as input can achieve comparable results with much lower model complexity and fewer parameters. Existing statistical frequency representations, e.g, the log-Mel spectrogram's average or maximum over time, do not always work well for…

    Submitted 5 May, 2023; originally announced May 2023.

    Comments: To appear at ICASSP 2023

  40. arXiv:2305.00416  [pdf, other]

    eess.IV

    Quaternion Matrix Completion Using Untrained Quaternion Convolutional Neural Network for Color Image Inpainting

    Authors: Jifei Miao, Kit Ian Kou, Liqiao Yang, Juan Han

    Abstract: The use of quaternions as a novel tool for color image representation has yielded impressive results in color image processing. By considering the color image as a unified entity rather than separate color space components, quaternions can effectively exploit the strong correlation among the RGB channels, leading to enhanced performance. Especially, color image inpainting tasks are highly benefici…

    Submitted 30 April, 2023; originally announced May 2023.

  41. arXiv:2305.00092  [pdf, other]

    cs.LG cs.AI cs.RO eess.SY math.OC

    Improving Gradient Computation for Differentiable Physics Simulation with Contacts

    Authors: Yaofeng Desmond Zhong, Jiequn Han, Biswadip Dey, Georgia Olympia Brikis

    Abstract: Differentiable simulation enables gradients to be back-propagated through physics simulations. In this way, one can learn the dynamics and properties of a physics system by gradient-based optimization or embed the whole differentiable simulation as a layer in a deep learning model for downstream tasks, such as planning and control. However, differentiable simulation at its current stage is not per…

    Submitted 28 April, 2023; originally announced May 2023.

    Comments: 5th Annual Conference on Learning for Dynamics and Control

    Journal ref: Proceedings of Machine Learning Research vol 211, 2023

  42. arXiv:2303.08636  [pdf, other]

    eess.AS cs.SD

    HYBRIDFORMER: improving SqueezeFormer with hybrid attention and NSR mechanism

    Authors: Yuguang Yang, Yu Pan, Jingjing Yin, Jiangyu Han, Lei Ma, Heng Lu

    Abstract: SqueezeFormer has recently shown impressive performance in automatic speech recognition (ASR). However, its inference speed suffers from the quadratic complexity of softmax-attention (SA). In addition, limited by the large convolution kernel size, the local modeling ability of SqueezeFormer is insufficient. In this paper, we propose a novel method HybridFormer to improve SqueezeFormer in a fast an…

    Submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP2023

  43. arXiv:2303.07067  [pdf, other]

    cs.LG cs.DC cs.SD eess.AS

    Cross-device Federated Learning for Mobile Health Diagnostics: A First Study on COVID-19 Detection

    Authors: Tong Xia, Jing Han, Abhirup Ghosh, Cecilia Mascolo

    Abstract: Federated learning (FL) aided health diagnostic models can incorporate data from a large number of personal edge devices (e.g., mobile phones) while keeping the data local to the originating devices, largely ensuring privacy. However, such a cross-device FL approach for health diagnostics still imposes many challenges due to both local data imbalance (as extreme as local data consists of a single…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: This paper has been accepted by IEEE ICASSP 2023

  44. arXiv:2302.13661  [pdf, other]

    cs.CL cs.SD eess.AS

    Using Auxiliary Tasks In Multimodal Fusion Of Wav2vec 2.0 And BERT For Multimodal Emotion Recognition

    Authors: Dekai Sun, Yancheng He, Jiqing Han

    Abstract: The lack of data and the difficulty of multimodal fusion have always been challenges for multimodal emotion recognition (MER). In this paper, we propose to use pretrained models as upstream network, wav2vec 2.0 for audio modality and BERT for text modality, and finetune them in downstream task of MER to cope with the lack of data. For the difficulty of multimodal fusion, we use a K-layer multi-hea…

    Submitted 27 February, 2023; originally announced February 2023.

  45. arXiv:2301.01940  [pdf, other]

    eess.IV cs.CV

    Enabling Augmented Segmentation and Registration in Ultrasound-Guided Spinal Surgery via Realistic Ultrasound Synthesis from Diagnostic CT Volume

    Authors: Ang Li, Jiayi Han, Yongjian Zhao, Keyu Li, Li Liu

    Abstract: This paper aims to tackle the issues on unavailable or insufficient clinical US data and meaningful annotation to enable bone segmentation and registration for US-guided spinal surgery. While the US is not a standard paradigm for spinal surgery, the scarcity of intra-operative clinical US data is an insurmountable bottleneck in training a neural network. Moreover, due to the characteristics of US…

    Submitted 5 January, 2023; originally announced January 2023.

    Comments: Submitted to IEEE Transactions on Automation Science and Engineering. Copyright may be transferred without notice, after which this version may no longer be accessible. Note that the abstract is shorter than that in the pdf file due to character limitations

  46. arXiv:2212.03848  [pdf, other]

    cs.CV cs.GR eess.IV

    NeRFEditor: Differentiable Style Decomposition for Full 3D Scene Editing

    Authors: Chunyi Sun, Yanbin Liu, Junlin Han, Stephen Gould

    Abstract: We present NeRFEditor, an efficient learning framework for 3D scene editing, which takes a video captured over 360° as input and outputs a high-quality, identity-preserving stylized 3D scene. Our method supports diverse types of editing such as guided by reference images, text prompts, and user interactions. We achieve this by encouraging a pre-trained StyleGAN model and a NeRF model to learn from…

    Submitted 8 December, 2022; v1 submitted 7 December, 2022; originally announced December 2022.

    Comments: Project page: https://chuny1.github.io/NeRFEditor/nerfeditor.html

  47. arXiv:2211.14467  [pdf]

    cs.CV eess.IV

    Self-Supervised Surgical Instrument 3D Reconstruction from a Single Camera Image

    Authors: Ange Lou, Xing Yao, Ziteng Liu, Jintong Han, Jack Noble

    Abstract: Surgical instrument tracking is an active research area that can provide surgeons feedback about the location of their tools relative to anatomy. Recent tracking methods are mainly divided into two parts: segmentation and object detection. However, both can only predict 2D information, which is limiting for application to real-world surgery. An accurate 3D surgical instrument model is a prerequisi…

    Submitted 25 November, 2022; originally announced November 2022.

    Comments: Accepted by SPIE Medical Imaging 2023

  48. arXiv:2211.12793  [pdf, other]

    eess.IV

    Low Rank Quaternion Matrix Completion Based on Quaternion QR Decomposition and Sparse Regularizer

    Authors: Juan Han, Liqiao Yang, Kit Ian Kou, Jifei Miao, Lizhi Liu

    Abstract: Matrix completion is one of the most challenging problems in computer vision. Recently, quaternion representations of color images have achieved competitive performance in many fields. Because it treats the color image as a whole, the coupling information between the three channels of the color image is better utilized. Due to this, low-rank quaternion matrix completion (LRQMC) algorithms have gai…

    Submitted 23 November, 2022; originally announced November 2022.

  49. arXiv:2211.12097  [pdf, other]

    eess.AS

    Dynamic Acoustic Compensation and Adaptive Focal Training for Personalized Speech Enhancement

    Authors: Xiaofeng Ge, Jiangyu Han, Haixin Guan, Yanhua Long

    Abstract: Recently, more and more personalized speech enhancement systems (PSE) with excellent performance have been proposed. However, two critical issues still limit the performance and generalization ability of the model: 1) Acoustic environment mismatch between the test noisy speech and target speaker enrollment speech; 2) Hard sample mining and learning. In this paper, dynamic acoustic compensation (DA…

    Submitted 22 November, 2022; originally announced November 2022.

  50. arXiv:2211.10885  [pdf, other]

    cs.SD eess.AS

    Contrastive Regularization for Multimodal Emotion Recognition Using Audio and Text

    Authors: Fan Qian, Jiqing Han

    Abstract: Speech emotion recognition is a challenge and an important step towards more natural human-computer interaction (HCI). The popular approach is multimodal emotion recognition based on model-level fusion, which means that the multimodal signals can be encoded to acquire embeddings, and then the embeddings are concatenated together for the final classification. However, due to the influence of noise…

    Submitted 20 November, 2022; originally announced November 2022.

    Comments: Completed in October 2020 and submitted to ICASSP2021