Search | arXiv e-print repository

Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking

Authors: Yuwei Zhang, Tong Xia, **g Han, Yu Wu, Georgios Rizos, Yang Liu, Mohammed Mosuily, Jagmohan Chauhan, Cecilia Mascolo

Abstract: Respiratory audio, such as coughing and breathing sounds, has predictive power for a wide range of healthcare applications, yet is currently under-explored. The main problem for those applications arises from the difficulty in collecting large labeled task-specific data for model development. Generalizable respiratory acoustic foundation models pretrained with unlabeled data would offer appealing… ▽ More Respiratory audio, such as coughing and breathing sounds, has predictive power for a wide range of healthcare applications, yet is currently under-explored. The main problem for those applications arises from the difficulty in collecting large labeled task-specific data for model development. Generalizable respiratory acoustic foundation models pretrained with unlabeled data would offer appealing advantages and possibly unlock this impasse. However, given the safety-critical nature of healthcare applications, it is pivotal to also ensure openness and replicability for any proposed foundation model solution. To this end, we introduce OPERA, an OPEn Respiratory Acoustic foundation model pretraining and benchmarking system, as the first approach answering this need. We curate large-scale respiratory audio datasets (~136K samples, 440 hours), pretrain three pioneering foundation models, and build a benchmark consisting of 19 downstream respiratory health tasks for evaluation. Our pretrained models demonstrate superior performance (against existing acoustic models pretrained with general audio on 16 out of 19 tasks) and generalizability (to unseen datasets and new respiratory audio modalities). This highlights the great promise of respiratory acoustic foundation models and encourages more studies using OPERA as an open resource to accelerate research on respiratory audio for health. The system is accessible from https://github.com/evelyn0414/OPERA. △ Less

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.10434 [pdf, other]

Risk-Aware Value-Oriented Net Demand Forecasting for Virtual Power Plants

Authors: Yufan Zhang, Jiajun Han, Yuanyuan Shi

Abstract: This paper develops a risk-aware net demand forecasting product for virtual power plants, which helps reduce the risk of high operation costs. At the training phase, a bilevel program for parameter estimation is formulated, where the upper level optimizes over the forecast model parameter to minimize the conditional value-at-risk (a risk metric) of operation costs. The lower level solves the opera… ▽ More This paper develops a risk-aware net demand forecasting product for virtual power plants, which helps reduce the risk of high operation costs. At the training phase, a bilevel program for parameter estimation is formulated, where the upper level optimizes over the forecast model parameter to minimize the conditional value-at-risk (a risk metric) of operation costs. The lower level solves the operation problems given the forecast. Leveraging the specific structure of the operation problem, we show that the bilevel program is equivalent to a convex program when the forecast model is linear. Numerical results show that our approach effectively reduces the risk of high costs compared to the forecasting approach developed for risk-neutral decision makers. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: Submitted to The 56th North American Power Symposium (NAPS 2024)

arXiv:2406.10083 [pdf, other]

On the Evaluation of Speech Foundation Models for Spoken Language Understanding

Authors: Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-yi Lee, Karen Livescu, Shinji Watanabe

Abstract: The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for th… ▽ More The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. Inspired by this, we ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although the supervised SFMs are pre-trained on much more speech recognition data (with labels), they do not always outperform self-supervised SFMs; the latter tend to perform at least as well as, and sometimes better than, supervised SFMs, especially on the sequence generation tasks in SLUE. While there is no universally optimal way of incorporating SFMs, the complex prediction head gives the best performance for most tasks, although it increases the inference time. We also introduce an open-source toolkit and performance leaderboard, SLUE-PERB, for these tasks and modeling strategies. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted at ACL Findings 2024

arXiv:2406.06337 [pdf, other]

System- and Sample-agnostic Isotropic 3D Microscopy by Weakly Physics-informed, Domain-shift-resistant Axial Deblurring

Authors: Jiashu Han, Kunzan Liu, Keith B. Isaacson, Kristina Monakhova, Linda G. Griffith, Sixian You

Abstract: Three-dimensional (3D) subcellular imaging is essential for biomedical research, but the diffraction limit of optical microscopy compromises axial resolution, hindering accurate 3D structural analysis. This challenge is particularly pronounced in label-free imaging of thick, heterogeneous tissues, where assumptions about data distribution (e.g. sparsity, label-specific distribution, and lateral-ax… ▽ More Three-dimensional (3D) subcellular imaging is essential for biomedical research, but the diffraction limit of optical microscopy compromises axial resolution, hindering accurate 3D structural analysis. This challenge is particularly pronounced in label-free imaging of thick, heterogeneous tissues, where assumptions about data distribution (e.g. sparsity, label-specific distribution, and lateral-axial similarity) and system priors (e.g. independent and identically distributed (i.i.d.) noise and linear shift-invariant (LSI) point-spread functions (PSFs)) are often invalid. Here, we introduce SSAI-3D, a weakly physics-informed, domain-shift-resistant framework for robust isotropic 3D imaging. SSAI-3D enables robust axial deblurring by generating a PSF-flexible, noise-resilient, sample-informed training dataset and sparsely fine-tuning a large pre-trained blind deblurring network. SSAI-3D was applied to label-free nonlinear imaging of living organoids, freshly excised human endometrium tissue, and mouse whisker pads, and further validated in publicly available ground-truth-paired experimental datasets of 3D heterogeneous biological tissues with unknown blurring and noise across different microscopy systems. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: 27 pages, 6 figures

arXiv:2406.02438 [pdf, other]

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, **g Guo, Tomoki Toda, Zhiyao Duan

Abstract: Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesi… ▽ More Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesized using state-of-the-art methods from publicly accessible singing voice datasets. CtrSVDD includes 47.64 hours of bonafide and 260.34 hours of deepfake singing vocals, spanning 14 deepfake methods and involving 164 singer identities. We also present a baseline system with flexible front-end features, evaluated against a structured train/dev/eval split. The experiments show the importance of feature selection and highlight a need for generalization towards deepfake methods that deviate further from training distribution. The CtrSVDD dataset and baselines are publicly accessible. △ Less

Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2405.08317 [pdf, other]

SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

Authors: Raghuveer Peri, Sai Muralidhar Jayanthi, Srikanth Ronanki, Anshu Bhatia, Karel Mundnich, Saket Dingliwal, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Srikanth Vishnubhotla, Daniel Garcia-Romero, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

Abstract: Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remains largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we… ▽ More Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remains largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement. Additionally, we propose countermeasures to thwart such jailbreaking attacks. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on spoken question-answering task, scoring over 80% on both safety and helpfulness metrics. Despite safety guardrails, experiments on jailbreaking demonstrate the vulnerability of SLMs to adversarial perturbations and transfer attacks, with average attack success rates of 90% and 10% respectively when evaluated on a dataset of carefully designed harmful questions spanning 12 different toxic categories. However, we demonstrate that our proposed countermeasures reduce the attack success significantly. △ Less

Submitted 14 May, 2024; originally announced May 2024.

Comments: 9+6 pages, Submitted to ACL 2024

arXiv:2405.08295 [pdf, other]

SpeechVerse: A Large-scale Generalizable Audio Language Model

Authors: Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

Abstract: Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore devel… ▽ More Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while kee** the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks. △ Less

Submitted 31 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

Comments: Single Column, 13 page

arXiv:2405.05244 [pdf, other]

SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

Abstract: The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specializ… ▽ More The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specialized field requiring focused attention. To promote SVDD research, we recently proposed the "SVDD Challenge," the very first research challenge focusing on SVDD for lab-controlled and in-the-wild bonafide and deepfake singing voice recordings. The challenge will be held in conjunction with the 2024 IEEE Spoken Language Technology Workshop (SLT 2024). △ Less

Submitted 8 May, 2024; originally announced May 2024.

Comments: Evaluation plan of the SVDD Challenge @ SLT 2024

arXiv:2405.03953 [pdf, other]

Intelligent Cardiac Auscultation for Murmur Detection via Parallel-Attentive Models with Uncertainty Estimation

Authors: Zixing Zhang, Tao Pang, **g Han, Björn W. Schuller

Abstract: Heart murmurs are a common manifestation of cardiovascular diseases and can provide crucial clues to early cardiac abnormalities. While most current research methods primarily focus on the accuracy of models, they often overlook other important aspects such as the interpretability of machine learning algorithms and the uncertainty of predictions. This paper introduces a heart murmur detection meth… ▽ More Heart murmurs are a common manifestation of cardiovascular diseases and can provide crucial clues to early cardiac abnormalities. While most current research methods primarily focus on the accuracy of models, they often overlook other important aspects such as the interpretability of machine learning algorithms and the uncertainty of predictions. This paper introduces a heart murmur detection method based on a parallel-attentive model, which consists of two branches: One is based on a self-attention module and the other one is based on a convolutional network. Unlike traditional approaches, this structure is better equipped to handle long-term dependencies in sequential data, and thus effectively captures the local and global features of heart murmurs. Additionally, we acknowledge the significance of understanding the uncertainty of model predictions in the medical field for clinical decision-making. Therefore, we have incorporated an effective uncertainty estimation method based on Monte Carlo Dropout into our model. Furthermore, we have employed temperature scaling to calibrate the predictions of our probabilistic model, enhancing its reliability. In experiments conducted on the CirCor Digiscope dataset for heart murmur detection, our proposed method achieves a weighted accuracy of 79.8% and an F1 of 65.1%, representing state-of-the-art results. △ Less

Submitted 6 May, 2024; originally announced May 2024.

Journal ref: published at ICASSP 2024

arXiv:2405.03952 [pdf, other]

HAFFormer: A Hierarchical Attention-Free Framework for Alzheimer's Disease Detection From Spontaneous Speech

Authors: Zhongren Dong, Zixing Zhang, Weixiang Xu, **g Han, Jianjun Ou, Björn W. Schuller

Abstract: Automatically detecting Alzheimer's Disease (AD) from spontaneous speech plays an important role in its early diagnosis. Recent approaches highly rely on the Transformer architectures due to its efficiency in modelling long-range context dependencies. However, the quadratic increase in computational complexity associated with self-attention and the length of audio poses a challenge when deploying… ▽ More Automatically detecting Alzheimer's Disease (AD) from spontaneous speech plays an important role in its early diagnosis. Recent approaches highly rely on the Transformer architectures due to its efficiency in modelling long-range context dependencies. However, the quadratic increase in computational complexity associated with self-attention and the length of audio poses a challenge when deploying such models on edge devices. In this context, we construct a novel framework, namely Hierarchical Attention-Free Transformer (HAFFormer), to better deal with long speech for AD detection. Specifically, we employ an attention-free module of Multi-Scale Depthwise Convolution to replace the self-attention and thus avoid the expensive computation, and a GELU-based Gated Linear Unit to replace the feedforward layer, aiming to automatically filter out the redundant information. Moreover, we design a hierarchical structure to force it to learn a variety of information grains, from the frame level to the dialogue level. By conducting extensive experiments on the ADReSS-M dataset, the introduced HAFFormer can achieve competitive results (82.6% accuracy) with other recent work, but with significant computational complexity and model size reduction compared to the standard Transformer. This shows the efficiency of HAFFormer in dealing with long audio for AD detection. △ Less

Submitted 6 May, 2024; originally announced May 2024.

Journal ref: publised at ICASSP 2024

arXiv:2404.17917 [pdf, other]

EvaNet: Elevation-Guided Flood Extent Map** on Earth Imagery

Authors: Mirza Tanzim Sami, Da Yan, Saugat Adhikari, Lyuheng Yuan, Jiao Han, Zhe Jiang, Jalal Khalil, Yang Zhou

Abstract: Accurate and timely map** of flood extent from high-resolution satellite imagery plays a crucial role in disaster management such as damage assessment and relief activities. However, current state-of-the-art solutions are based on U-Net, which can-not segment the flood pixels accurately due to the ambiguous pixels (e.g., tree canopies, clouds) that prevent a direct judgement from only the spectr… ▽ More Accurate and timely map** of flood extent from high-resolution satellite imagery plays a crucial role in disaster management such as damage assessment and relief activities. However, current state-of-the-art solutions are based on U-Net, which can-not segment the flood pixels accurately due to the ambiguous pixels (e.g., tree canopies, clouds) that prevent a direct judgement from only the spectral features. Thanks to the digital elevation model (DEM) data readily available from sources such as United States Geological Survey (USGS), this work explores the use of an elevation map to improve flood extent map**. We propose, EvaNet, an elevation-guided segmentation model based on the encoder-decoder architecture with two novel techniques: (1) a loss function encoding the physical law of gravity that if a location is flooded (resp. dry), then its adjacent locations with a lower (resp. higher) elevation must also be flooded (resp. dry); (2) a new (de)convolution operation that integrates the elevation map by a location sensitive gating mechanism to regulate how much spectral features flow through adjacent layers. Extensive experiments show that EvaNet significantly outperforms the U-Net baselines, and works as a perfect drop-in replacement for U-Net in existing solutions to flood extent map**. △ Less

Submitted 12 May, 2024; v1 submitted 27 April, 2024; originally announced April 2024.

Comments: Accepted at the International Joint Conference on Artificial Intelligence (IJCAI, 2024)

arXiv:2404.17683 [pdf, other]

Energy Storage Arbitrage in Two-settlement Markets: A Transformer-Based Approach

Authors: Saud Alghumayjan, Jiajun Han, Ningkun Zheng, Ming Yi, Bolun Xu

Abstract: This paper presents an integrated model for bidding energy storage in day-ahead and real-time markets to maximize profits. We show that in integrated two-stage bidding, the real-time bids are independent of day-ahead settlements, while the day-ahead bids should be based on predicted real-time prices. We utilize a transformer-based model for real-time price prediction, which captures complex dynami… ▽ More This paper presents an integrated model for bidding energy storage in day-ahead and real-time markets to maximize profits. We show that in integrated two-stage bidding, the real-time bids are independent of day-ahead settlements, while the day-ahead bids should be based on predicted real-time prices. We utilize a transformer-based model for real-time price prediction, which captures complex dynamical patterns of real-time prices, and use the result for day-ahead bidding design. For real-time bidding, we utilize a long short-term memory-dynamic programming hybrid real-time bidding model. We train and test our model with historical data from New York State, and our results showed that the integrated system achieved promising results of almost a 20\% increase in profit compared to only bidding in real-time markets, and at the same time reducing the risk in terms of the number of days with negative profits. △ Less

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.07336 [pdf, other]

PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores

Authors: Lucas Goncalves, Prashant Mathur, Chandrashekhar Lavania, Metehan Cekic, Marcello Federico, Kyu J. Han

Abstract: Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks. However, the growth is not attributed solely to models and benchmarks. Universally accepted evaluation metrics also play an important role in advancing the field. While there are many metrics available to evaluate audio and visual content separately… ▽ More Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks. However, the growth is not attributed solely to models and benchmarks. Universally accepted evaluation metrics also play an important role in advancing the field. While there are many metrics available to evaluate audio and visual content separately, there is a lack of metrics that offer a quantitative and interpretable measure of audio-visual synchronization for videos "in the wild". To address this gap, we first created a large scale human annotated dataset (100+ hrs) representing nine types of synchronization errors in audio-visual content and how human perceive them. We then developed a PEAVS (Perceptual Evaluation of Audio-Visual Synchrony) score, a novel automatic metric with a 5-point scale that evaluates the quality of audio-visual synchronization. We validate PEAVS using a newly generated dataset, achieving a Pearson correlation of 0.79 at the set level and 0.54 at the clip level when compared to human labels. In our experiments, we observe a relative gain 50% over a natural extension of Fréchet based metrics for Audio-Visual synchrony, confirming PEAVS efficacy in objectively modeling subjective perceptions of audio-visual synchronization for videos "in the wild". △ Less

Submitted 10 April, 2024; originally announced April 2024.

Comments: 24 pages

arXiv:2404.02999 [pdf, other]

MeshBrush: Painting the Anatomical Mesh with Neural Stylization for Endoscopy

Authors: John J. Han, Ayberk Acar, Nicholas Kavoussi, Jie Ying Wu

Abstract: Style transfer is a promising approach to close the sim-to-real gap in medical endoscopy. Rendering realistic endoscopic videos by traversing pre-operative scans (such as MRI or CT) can generate realistic simulations as well as ground truth camera poses and depth maps. Although image-to-image (I2I) translation models such as CycleGAN perform well, they are unsuitable for video-to-video synthesis d… ▽ More Style transfer is a promising approach to close the sim-to-real gap in medical endoscopy. Rendering realistic endoscopic videos by traversing pre-operative scans (such as MRI or CT) can generate realistic simulations as well as ground truth camera poses and depth maps. Although image-to-image (I2I) translation models such as CycleGAN perform well, they are unsuitable for video-to-video synthesis due to the lack of temporal consistency, resulting in artifacts between frames. We propose MeshBrush, a neural mesh stylization method to synthesize temporally consistent videos with differentiable rendering. MeshBrush uses the underlying geometry of patient imaging data while leveraging existing I2I methods. With learned per-vertex textures, the stylized mesh guarantees consistency while producing high-fidelity outputs. We demonstrate that mesh stylization is a promising approach for creating realistic simulations for downstream tasks such as training and preoperative planning. Although our method is tested and designed for ureteroscopy, its components are transferable to general endoscopic and laparoscopic procedures. △ Less

Submitted 3 April, 2024; originally announced April 2024.

Comments: 10 pages, 5 figures

arXiv:2404.00569 [pdf, other]

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

Authors: Xiang Li, Fan Bu, Ambuj Mehrish, Yingting Li, Jiale Han, Bo Cheng, Soujanya Poria

Abstract: Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech synthesis. Yet, the efficiency of multi-step sampling in Diffusion Models presents challenges. Efforts have been made to integrate GANs with DMs, speeding up infere… ▽ More Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech synthesis. Yet, the efficiency of multi-step sampling in Diffusion Models presents challenges. Efforts have been made to integrate GANs with DMs, speeding up inference by approximating denoising distributions, but this introduces issues with model convergence due to adversarial training. To overcome this, we introduce CM-TTS, a novel architecture grounded in consistency models (CMs). Drawing inspiration from continuous-time diffusion models, CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or pre-trained model dependencies. We further design weighted samplers to incorporate different sampling positions into model training with dynamic probabilities, ensuring unbiased learning throughout the entire training process. We present a real-time mel-spectrogram generation consistency model, validated through comprehensive evaluations. Experimental results underscore CM-TTS's superiority over existing single-step speech synthesis systems, representing a significant advancement in the field. △ Less

Submitted 31 March, 2024; originally announced April 2024.

Comments: Accepted by Findings of NAACL 2024. Code is available at https://github.com/XiangLi2022/CM-TTS

arXiv:2403.10520 [pdf, other]

Strong and Controllable Blind Image Decomposition

Authors: Zeyu Zhang, Junlin Han, Chenhui Gou, Hongdong Li, Liang Zheng

Abstract: Blind image decomposition aims to decompose all components present in an image, typically used to restore a multi-degraded input image. While fully recovering the clean image is appealing, in some scenarios, users might want to retain certain degradations, such as watermarks, for copyright protection. To address this need, we add controllability to the blind image decomposition process, allowing u… ▽ More Blind image decomposition aims to decompose all components present in an image, typically used to restore a multi-degraded input image. While fully recovering the clean image is appealing, in some scenarios, users might want to retain certain degradations, such as watermarks, for copyright protection. To address this need, we add controllability to the blind image decomposition process, allowing users to enter which types of degradation to remove or retain. We design an architecture named controllable blind image decomposition network. Inserted in the middle of U-Net structure, our method first decomposes the input feature maps and then recombines them according to user instructions. Advantageously, this functionality is implemented at minimal computational cost: decomposition and recombination are all parameter-free. Experimentally, our system excels in blind image decomposition tasks and can outputs partially or fully restored images that well reflect user intentions. Furthermore, we evaluate and configure different options for the network structure and loss functions. This, combined with the proposed decomposition-and-recombination method, yields an efficient and competitive system for blind image decomposition, compared with current state-of-the-art methods. △ Less

Submitted 15 March, 2024; originally announced March 2024.

Comments: Code: https://github.com/Zhangzeyu97/CBD.git

arXiv:2402.12660 [pdf, other]

SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion

Authors: Liumeng Xue, Chaoren Wang, Mingxuan Wang, Xueyao Zhang, Jun Han, Zhizheng Wu

Abstract: In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a visual display of the generation process in diffusion models, showcasing the step-by-step denoising of the noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also fac… ▽ More In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a visual display of the generation process in diffusion models, showcasing the step-by-step denoising of the noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also facilitates side-by-side comparisons of different conditions, such as source content, melody, and target timbre, highlighting the impact of these conditions on the diffusion generation process and resulting conversions. Through comprehensive evaluations, SingVisio demonstrates its effectiveness in terms of system design, functionality, explainability, and user-friendliness. It offers users of various backgrounds valuable learning experiences and insights into the diffusion model for singing voice conversion. △ Less

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2402.01115 [pdf, other]

Interpretation of Intracardiac Electrograms Through Textual Representations

Authors: William Jongwon Han, Diana Gomez, Avi Alok, Chao**g Duan, Michael A. Rosenberg, Douglas Weber, Emerson Liu, Ding Zhao

Abstract: Understanding the irregular electrical activity of atrial fibrillation (AFib) has been a key challenge in electrocardiography. For serious cases of AFib, catheter ablations are performed to collect intracardiac electrograms (EGMs). EGMs offer intricately detailed and localized electrical activity of the heart and are an ideal modality for interpretable cardiac studies. Recent advancements in artif… ▽ More Understanding the irregular electrical activity of atrial fibrillation (AFib) has been a key challenge in electrocardiography. For serious cases of AFib, catheter ablations are performed to collect intracardiac electrograms (EGMs). EGMs offer intricately detailed and localized electrical activity of the heart and are an ideal modality for interpretable cardiac studies. Recent advancements in artificial intelligence (AI) has allowed some works to utilize deep learning frameworks to interpret EGMs during AFib. Additionally, language models (LMs) have shown exceptional performance in being able to generalize to unseen domains, especially in healthcare. In this study, we are the first to leverage pretrained LMs for finetuning of EGM interpolation and AFib classification via masked language modeling. We formulate the EGM as a textual sequence and present competitive performances on AFib classification compared against other representations. Lastly, we provide a comprehensive interpretability study to provide a multi-perspective intuition of the model's behavior, which could greatly benefit the clinical use. △ Less

Submitted 11 April, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: 18 pages, 9 figures; Accepted to CHIL 2024

ACM Class: I.2.7; J.3

arXiv:2401.05850 [pdf, other]

Contrastive Loss Based Frame-wise Feature disentanglement for Polyphonic Sound Event Detection

Authors: Yadong Guan, Jiqing Han, Hongwei Song, Wenjie Song, Guibin Zheng, Tieran Zheng, Yongjun He

Abstract: Overlap** sound events are ubiquitous in real-world environments, but existing end-to-end sound event detection (SED) methods still struggle to detect them effectively. A critical reason is that these methods represent overlap** events using shared and entangled frame-wise features, which degrades the feature discrimination. To solve the problem, we propose a disentangled feature learning fram… ▽ More Overlap** sound events are ubiquitous in real-world environments, but existing end-to-end sound event detection (SED) methods still struggle to detect them effectively. A critical reason is that these methods represent overlap** events using shared and entangled frame-wise features, which degrades the feature discrimination. To solve the problem, we propose a disentangled feature learning framework to learn a category-specific representation. Specifically, we employ different projectors to learn the frame-wise features for each category. To ensure that these feature does not contain information of other categories, we maximize the common information between frame-wise features within the same category and propose a frame-wise contrastive loss. In addition, considering that the labeled data used by the proposed method is limited, we propose a semi-supervised frame-wise contrastive loss that can leverage large amounts of unlabeled data to achieve feature disentanglement. The experimental results demonstrate the effectiveness of our method. △ Less

Submitted 11 January, 2024; originally announced January 2024.

Comments: accepted by icassp2024

arXiv:2312.15659 [pdf, other]

Perceptual Quality Assessment for Video Frame Interpolation

Authors: **liang Han, Xiongkuo Min, Yixuan Gao, Jun Jia, Lei Sun, Zuowei Cao, Yonglin Luo, Guangtao Zhai

Abstract: The quality of frames is significant for both research and application of video frame interpolation (VFI). In recent VFI studies, the methods of full-reference image quality assessment have generally been used to evaluate the quality of VFI frames. However, high frame rate reference videos, necessities for the full-reference methods, are difficult to obtain in most applications of VFI. To evaluate… ▽ More The quality of frames is significant for both research and application of video frame interpolation (VFI). In recent VFI studies, the methods of full-reference image quality assessment have generally been used to evaluate the quality of VFI frames. However, high frame rate reference videos, necessities for the full-reference methods, are difficult to obtain in most applications of VFI. To evaluate the quality of VFI frames without reference videos, a no-reference perceptual quality assessment method is proposed in this paper. This method is more compatible with VFI application and the evaluation scores from it are consistent with human subjective opinions. A new quality assessment dataset for VFI was constructed through subjective experiments firstly, to assess the opinion scores of interpolated frames. The dataset was created from triplets of frames extracted from high-quality videos using 9 state-of-the-art VFI algorithms. The proposed method evaluates the perceptual coherence of frames incorporating the original pair of VFI inputs. Specifically, the method applies a triplet network architecture, including three parallel feature pipelines, to extract the deep perceptual features of the interpolated frame as well as the original pair of frames. Coherence similarities of the two-way parallel features are jointly calculated and optimized as a perceptual metric. In the experiments, both full-reference and no-reference quality assessment methods were tested on the new quality dataset. The results show that the proposed method achieves the best performance among all compared quality assessment methods on the dataset. △ Less

Submitted 25 December, 2023; originally announced December 2023.

Comments: 5 pages, 4 figures

ACM Class: I.4.0

arXiv:2312.10112 [pdf, other]

NM-FlowGAN: Modeling sRGB Noise with a Hybrid Approach based on Normalizing Flows and Generative Adversarial Networks

Authors: Young Joo Han, Ha-** Yu

Abstract: Modeling and synthesizing real sRGB noise is crucial for various low-level vision tasks, such as building datasets for training image denoising systems. The distribution of real sRGB noise is highly complex and affected by a multitude of factors, making its accurate modeling extremely challenging. Therefore, recent studies have proposed methods that employ data-driven generative models, such as ge… ▽ More Modeling and synthesizing real sRGB noise is crucial for various low-level vision tasks, such as building datasets for training image denoising systems. The distribution of real sRGB noise is highly complex and affected by a multitude of factors, making its accurate modeling extremely challenging. Therefore, recent studies have proposed methods that employ data-driven generative models, such as generative adversarial networks (GAN) and Normalizing Flows. These studies achieve more accurate modeling of sRGB noise compared to traditional noise modeling methods. However, there are performance limitations due to the inherent characteristics of each generative model. To address this issue, we propose NM-FlowGAN, a hybrid approach that exploits the strengths of both GAN and Normalizing Flows. We simultaneously employ a pixel-wise noise modeling network based on Normalizing Flows, and spatial correlation modeling networks based on GAN. In our experiments, our NM-FlowGAN outperforms other baselines on the sRGB noise synthesis task. Moreover, the denoising neural network, trained with synthesized image pairs from our model, also shows superior performance compared to other baselines. Our code is available at: \url{https://github.com/YoungJooHan/NM-FlowGAN}. △ Less

Submitted 14 March, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: 25 pages, 11 figures, 7 tables

MSC Class: 68T45 ACM Class: I.4.4

arXiv:2312.09911 [pdf, other]

Amphion: An Open-Source Audio, Music and Speech Generation Toolkit

Authors: Xueyao Zhang, Liumeng Xue, Yicheng Gu, Yuancheng Wang, Haorui He, Chaoren Wang, Xi Chen, Zihao Fang, Haopeng Chen, Junan Zhang, Tze Ying Tang, Lexiao Zou, Mingxuan Wang, Jun Han, Kai Chen, Haizhou Li, Zhizheng Wu

Abstract: Amphion is an open-source toolkit for Audio, Music, and Speech Generation, targeting to ease the way for junior researchers and engineers into these fields. It presents a unified framework that is inclusive of diverse generation tasks and models, with the added bonus of being easily extendable for new incorporation. The toolkit is designed with beginner-friendly workflows and pre-trained models, a… ▽ More Amphion is an open-source toolkit for Audio, Music, and Speech Generation, targeting to ease the way for junior researchers and engineers into these fields. It presents a unified framework that is inclusive of diverse generation tasks and models, with the added bonus of being easily extendable for new incorporation. The toolkit is designed with beginner-friendly workflows and pre-trained models, allowing both beginners and seasoned researchers to kick-start their projects with relative ease. Additionally, it provides interactive visualizations and demonstrations of classic models for educational purposes. The initial release of Amphion v0.1 supports a range of tasks including Text to Speech (TTS), Text to Audio (TTA), and Singing Voice Conversion (SVC), supplemented by essential components like data preprocessing, state-of-the-art vocoders, and evaluation metrics. This paper presents a high-level overview of Amphion. △ Less

Submitted 22 February, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: Amphion Website: https://github.com/open-mmlab/Amphion

arXiv:2312.05465 [pdf, other]

On Task-Relevant Loss Functions in Meta-Reinforcement Learning and Online LQR

Authors: Jaeuk Shin, Giho Kim, Howon Lee, Joonho Han, Insoon Yang

Abstract: Designing a competent meta-reinforcement learning (meta-RL) algorithm in terms of data usage remains a central challenge to be tackled for its successful real-world applications. In this paper, we propose a sample-efficient meta-RL algorithm that learns a model of the system or environment at hand in a task-directed manner. As opposed to the standard model-based approaches to meta-RL, our method e… ▽ More Designing a competent meta-reinforcement learning (meta-RL) algorithm in terms of data usage remains a central challenge to be tackled for its successful real-world applications. In this paper, we propose a sample-efficient meta-RL algorithm that learns a model of the system or environment at hand in a task-directed manner. As opposed to the standard model-based approaches to meta-RL, our method exploits the value information in order to rapidly capture the decision-critical part of the environment. The key component of our method is the loss function for learning the task inference module and the system model that systematically couples the model discrepancy and the value estimate, thereby facilitating the learning of the policy and the task inference module with a significantly smaller amount of data compared to the existing meta-RL algorithms. The idea is also extended to a non-meta-RL setting, namely an online linear quadratic regulator (LQR) problem, where our method can be simplified to reveal the essence of the strategy. The proposed method is evaluated in high-dimensional robotic control and online LQR problems, empirically verifying its effectiveness in extracting information indispensable for solving the tasks from observations in a sample efficient manner. △ Less

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2311.06968 [pdf, other]

Physics-Informed Data Denoising for Real-Life Sensing Systems

Authors: Xiyuan Zhang, Xiaohan Fu, Diyan Teng, Chengyu Dong, Keerthivasan Vijayakumar, Jiayun Zhang, Ranak Roy Chowdhury, Junsheng Han, Dezhi Hong, Rashmi Kulkarni, **gbo Shang, Rajesh Gupta

Abstract: Sensors measuring real-life physical processes are ubiquitous in today's interconnected world. These sensors inherently bear noise that often adversely affects performance and reliability of the systems they support. Classic filtering-based approaches introduce strong assumptions on the time or frequency characteristics of sensory measurements, while learning-based denoising approaches typically r… ▽ More Sensors measuring real-life physical processes are ubiquitous in today's interconnected world. These sensors inherently bear noise that often adversely affects performance and reliability of the systems they support. Classic filtering-based approaches introduce strong assumptions on the time or frequency characteristics of sensory measurements, while learning-based denoising approaches typically rely on using ground truth clean data to train a denoising model, which is often challenging or prohibitive to obtain for many real-world applications. We observe that in many scenarios, the relationships between different sensor measurements (e.g., location and acceleration) are analytically described by laws of physics (e.g., second-order differential equation). By incorporating such physics constraints, we can guide the denoising process to improve even in the absence of ground truth data. In light of this, we design a physics-informed denoising model that leverages the inherent algebraic relationships between different measurements governed by the underlying physics. By obviating the need for ground truth clean data, our method offers a practical denoising solution for real-world applications. We conducted experiments in various domains, including inertial navigation, CO2 monitoring, and HVAC control, and achieved state-of-the-art performance compared with existing denoising methods. Our method can denoise data in real time (4ms for a sequence of 1s) for low-cost noisy sensors and produces results that closely align with those from high-precision, high-cost alternatives, leading to an efficient, cost-effective approach for more accurate sensor-based systems. △ Less

Submitted 12 November, 2023; originally announced November 2023.

Comments: SenSys 2023

arXiv:2310.16102 [pdf, other]

Learned, Uncertainty-driven Adaptive Acquisition for Photon-Efficient Multiphoton Microscopy

Authors: Cassandra Tong Ye, Jiashu Han, Kunzan Liu, Anastasios Angelopoulos, Linda Griffith, Kristina Monakhova, Sixian You

Abstract: Multiphoton microscopy (MPM) is a powerful imaging tool that has been a critical enabler for live tissue imaging. However, since most multiphoton microscopy platforms rely on point scanning, there is an inherent trade-off between acquisition time, field of view (FOV), phototoxicity, and image quality, often resulting in noisy measurements when fast, large FOV, and/or gentle imaging is needed. Deep… ▽ More Multiphoton microscopy (MPM) is a powerful imaging tool that has been a critical enabler for live tissue imaging. However, since most multiphoton microscopy platforms rely on point scanning, there is an inherent trade-off between acquisition time, field of view (FOV), phototoxicity, and image quality, often resulting in noisy measurements when fast, large FOV, and/or gentle imaging is needed. Deep learning could be used to denoise multiphoton microscopy measurements, but these algorithms can be prone to hallucination, which can be disastrous for medical and scientific applications. We propose a method to simultaneously denoise and predict pixel-wise uncertainty for multiphoton imaging measurements, improving algorithm trustworthiness and providing statistical guarantees for the deep learning predictions. Furthermore, we propose to leverage this learned, pixel-wise uncertainty to drive an adaptive acquisition technique that rescans only the most uncertain regions of a sample. We demonstrate our method on experimental noisy MPM measurements of human endometrium tissues, showing that we can maintain fine features and outperform other denoising methods while predicting uncertainty at each pixel. Finally, with our adaptive acquisition technique, we demonstrate a 120X reduction in acquisition time and total light dose while successfully recovering fine features in the sample. We are the first to demonstrate distribution-free uncertainty quantification for a denoising task with real experimental data and the first to propose adaptive acquisition based on reconstruction uncertainty △ Less

Submitted 24 October, 2023; originally announced October 2023.

arXiv:2310.05352 [pdf, other]

A Glance is Enough: Extract Target Sentence By Looking at A keyword

Authors: Ying Shi, Dong Wang, Lantian Li, Jiqing Han

Abstract: This paper investigates the possibility of extracting a target sentence from multi-talker speech using only a keyword as input. For example, in social security applications, the keyword might be "help", and the goal is to identify what the person who called for help is articulating while ignoring other speakers. To address this problem, we propose using the Transformer architecture to embed both t… ▽ More This paper investigates the possibility of extracting a target sentence from multi-talker speech using only a keyword as input. For example, in social security applications, the keyword might be "help", and the goal is to identify what the person who called for help is articulating while ignoring other speakers. To address this problem, we propose using the Transformer architecture to embed both the keyword and the speech utterance and then rely on the cross-attention mechanism to select the correct content from the concatenated or overlap** speech. Experimental results on Librispeech demonstrate that our proposed method can effectively extract target sentences from very noisy and mixed speech (SNR=-3dB), achieving a phone error rate (PER) of 26\%, compared to the baseline system's PER of 96%. △ Less

Submitted 8 October, 2023; originally announced October 2023.

Comments: submitted to ICASSP 2024

arXiv:2309.15697 [pdf, other]

Physics Inspired Hybrid Attention for SAR Target Recognition

Authors: Zhongling Huang, Chong Wu, Xiwen Yao, Zhicheng Zhao, Xiankai Huang, Junwei Han

Abstract: There has been a recent emphasis on integrating physical models and deep neural networks (DNNs) for SAR target recognition, to improve performance and achieve a higher level of physical interpretability. The attributed scattering center (ASC) parameters garnered the most interest, being considered as additional input data or features for fusion in most methods. However, the performance greatly dep… ▽ More There has been a recent emphasis on integrating physical models and deep neural networks (DNNs) for SAR target recognition, to improve performance and achieve a higher level of physical interpretability. The attributed scattering center (ASC) parameters garnered the most interest, being considered as additional input data or features for fusion in most methods. However, the performance greatly depends on the ASC optimization result, and the fusion strategy is not adaptable to different types of physical information. Meanwhile, the current evaluation scheme is inadequate to assess the model's robustness and generalizability. Thus, we propose a physics inspired hybrid attention (PIHA) mechanism and the once-for-all (OFA) evaluation protocol to address the above issues. PIHA leverages the high-level semantics of physical information to activate and guide the feature group aware of local semantics of target, so as to re-weight the feature importance based on knowledge prior. It is flexible and generally applicable to various physical models, and can be integrated into arbitrary DNNs without modifying the original architecture. The experiments involve a rigorous assessment using the proposed OFA, which entails training and validating a model on either sufficient or limited data and evaluating on multiple test sets with different data distributions. Our method outperforms other state-of-the-art approaches in 12 test scenarios with same ASC parameters. Moreover, we analyze the working mechanism of PIHA and evaluate various PIHA enabled DNNs. The experiments also show PIHA is effective for different physical information. The source code together with the adopted physical information is available at https://github.com/XAI4SAR. △ Less

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.10065 [pdf, other]

Bayesian longitudinal tensor response regression for modeling neuroplasticity

Authors: Suprateek Kundu, Alec Reinhardt, Serena Song, Joo Han, M. Lawson Meadows, Bruce Crosson, Venkatagiri Krishnamurthy

Abstract: A major interest in longitudinal neuroimaging studies involves investigating voxel-level neuroplasticity due to treatment and other factors across visits. However, traditional voxel-wise methods are beset with several pitfalls, which can compromise the accuracy of these approaches. We propose a novel Bayesian tensor response regression approach for longitudinal imaging data, which pools informatio… ▽ More A major interest in longitudinal neuroimaging studies involves investigating voxel-level neuroplasticity due to treatment and other factors across visits. However, traditional voxel-wise methods are beset with several pitfalls, which can compromise the accuracy of these approaches. We propose a novel Bayesian tensor response regression approach for longitudinal imaging data, which pools information across spatially-distributed voxels to infer significant changes while adjusting for covariates. The proposed method, which is implemented using Markov chain Monte Carlo (MCMC) sampling, utilizes low-rank decomposition to reduce dimensionality and preserve spatial configurations of voxels when estimating coefficients. It also enables feature selection via joint credible regions which respect the shape of the posterior distributions for more accurate inference. In addition to group level inferences, the method is able to infer individual-level neuroplasticity, allowing for examination of personalized disease or recovery trajectories. The advantages of the proposed approach in terms of prediction and feature selection over voxel-wise regression are highlighted via extensive simulation studies. Subsequently, we apply the approach to a longitudinal Aphasia dataset consisting of task functional MRI images from a group of subjects who were administered either a control intervention or intention treatment at baseline and were followed up over subsequent visits. Our analysis revealed that while the control therapy showed long-term increases in brain activity, the intention treatment produced predominantly short-term changes, both of which were concentrated in distinct localized regions. In contrast, the voxel-wise regression failed to detect any significant neuroplasticity after multiplicity adjustments, which is biologically implausible and implies lack of power. △ Less

Submitted 18 October, 2023; v1 submitted 12 September, 2023; originally announced September 2023.

Comments: 28 pages, 8 figures, 6 tables

arXiv:2309.08377 [pdf, other]

DiaCorrect: Error Correction Back-end For Speaker Diarization

Authors: Jiangyu Han, Federico Landini, Johan Rohdin, Mireia Diez, Lukas Burget, Yuhang Cao, Heng Lu, Jan Cernocky

Abstract: In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initia… ▽ More In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initial system's outputs, DiaCorrect can automatically correct the initial speaker activities to minimize the diarization errors. Experiments on 2-speaker telephony data show that the proposed DiaCorrect can effectively improve the initial model's results. Our source code is publicly available at https://github.com/BUTSpeechFIT/diacorrect. △ Less

Submitted 15 September, 2023; originally announced September 2023.

Comments: Submitted to ICASSP 2024

arXiv:2309.03905 [pdf, other]

ImageBind-LLM: Multi-modality Instruction Tuning

Authors: Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao

Abstract: We present ImageBind-LLM, a multi-modality instruction tuning method of large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning, different from which, our ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic by only image-text alignment training. During training… ▽ More We present ImageBind-LLM, a multi-modality instruction tuning method of large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning, different from which, our ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic by only image-text alignment training. During training, we adopt a learnable bind network to align the embedding space between LLaMA and ImageBind's image encoder. Then, the image features transformed by the bind network are added to word tokens of all layers in LLaMA, which progressively injects visual instructions via an attention-free and zero-initialized gating mechanism. Aided by the joint embedding of ImageBind, the simple image-text training enables our model to exhibit superior multi-modality instruction-following capabilities. During inference, the multi-modality inputs are fed into the corresponding ImageBind encoders, and processed by a proposed visual cache model for further cross-modal embedding enhancement. The training-free cache model retrieves from three million image features extracted by ImageBind, which effectively mitigates the training-inference modality discrepancy. Notably, with our approach, ImageBind-LLM can respond to instructions of diverse modalities and demonstrate significant language generation quality. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter. △ Less

Submitted 11 September, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

Comments: Code is available at https://github.com/OpenGVLab/LLaMA-Adapter

arXiv:2309.02529 [pdf, other]

Fast and High-Performance Learned Image Compression With Improved Checkerboard Context Model, Deformable Residual Module, and Knowledge Distillation

Authors: Haisheng Fu, Feng Liang, Jie Liang, Yongqiang Wang, Guohe Zhang, **gning Han

Abstract: Deep learning-based image compression has made great progresses recently. However, many leading schemes use serial context-adaptive entropy model to improve the rate-distortion (R-D) performance, which is very slow. In addition, the complexities of the encoding and decoding networks are quite high and not suitable for many practical applications. In this paper, we introduce four techniques to bala… ▽ More Deep learning-based image compression has made great progresses recently. However, many leading schemes use serial context-adaptive entropy model to improve the rate-distortion (R-D) performance, which is very slow. In addition, the complexities of the encoding and decoding networks are quite high and not suitable for many practical applications. In this paper, we introduce four techniques to balance the trade-off between the complexity and performance. We are the first to introduce deformable convolutional module in compression framework, which can remove more redundancies in the input image, thereby enhancing compression performance. Second, we design a checkerboard context model with two separate distribution parameter estimation networks and different probability models, which enables parallel decoding without sacrificing the performance compared to the sequential context-adaptive model. Third, we develop an improved three-step knowledge distillation and training scheme to achieve different trade-offs between the complexity and the performance of the decoder network, which transfers both the final and intermediate results of the teacher network to the student network to help its training. Fourth, we introduce $L_{1}$ regularization to make the numerical values of the latent representation more sparse. Then we only encode non-zero channels in the encoding and decoding process, which can greatly reduce the encoding and decoding time. Experiments show that compared to the state-of-the-art learned image coding scheme, our method can be about 20 times faster in encoding and 70-90 times faster in decoding, and our R-D performance is also $2.3 \%$ higher. Our method outperforms the traditional approach in H.266/VVC-intra (4:4:4) and some leading learned schemes in terms of PSNR and MS-SSIM metrics when testing on Kodak and Tecnick-40 datasets. △ Less

Submitted 5 September, 2023; originally announced September 2023.

Comments: Submitted to Trans. Journal

arXiv:2308.00187 [pdf, ps, other]

Detecting the Anomalies in LiDAR Pointcloud

Authors: Chiyu Zhang, Ji Han, Yao Zou, Kexin Dong, Yujia Li, Junchun Ding, Xiaoling Han

Abstract: LiDAR sensors play an important role in the perception stack of modern autonomous driving systems. Adverse weather conditions such as rain, fog and dust, as well as some (occasional) LiDAR hardware fault may cause the LiDAR to produce pointcloud with abnormal patterns such as scattered noise points and uncommon intensity values. In this paper, we propose a novel approach to detect whether a LiDAR… ▽ More LiDAR sensors play an important role in the perception stack of modern autonomous driving systems. Adverse weather conditions such as rain, fog and dust, as well as some (occasional) LiDAR hardware fault may cause the LiDAR to produce pointcloud with abnormal patterns such as scattered noise points and uncommon intensity values. In this paper, we propose a novel approach to detect whether a LiDAR is generating anomalous pointcloud by analyzing the pointcloud characteristics. Specifically, we develop a pointcloud quality metric based on the LiDAR points' spatial and intensity distribution to characterize the noise level of the pointcloud, which relies on pure mathematical analysis and does not require any labeling or training as learning-based methods do. Therefore, the method is scalable and can be quickly deployed either online to improve the autonomy safety by monitoring anomalies in the LiDAR data or offline to perform in-depth study of the LiDAR behavior over large amount of data. The proposed approach is studied with extensive real public road data collected by LiDARs with different scanning mechanisms and laser spectrums, and is proven to be able to effectively handle various known and unknown sources of pointcloud anomaly. △ Less

Submitted 31 July, 2023; originally announced August 2023.

arXiv:2307.14491 [pdf, other]

A Unified Framework for Modality-Agnostic Deepfakes Detection

Authors: Cai Yu, Peng Chen, Jiahe Tian, ** Liu, Jiao Dai, Xi Wang, Yesheng Chai, Shan Jia, Siwei Lyu, Jizhong Han

Abstract: As AI-generated content (AIGC) thrives, deepfakes have expanded from single-modality falsification to cross-modal fake content creation, where either audio or visual components can be manipulated. While using two unimodal detectors can detect audio-visual deepfakes, cross-modal forgery clues could be overlooked. Existing multimodal deepfake detection methods typically establish correspondence betw… ▽ More As AI-generated content (AIGC) thrives, deepfakes have expanded from single-modality falsification to cross-modal fake content creation, where either audio or visual components can be manipulated. While using two unimodal detectors can detect audio-visual deepfakes, cross-modal forgery clues could be overlooked. Existing multimodal deepfake detection methods typically establish correspondence between the audio and visual modalities for binary real/fake classification, and require the co-occurrence of both modalities. However, in real-world multi-modal applications, missing modality scenarios may occur where either modality is unavailable. In such cases, audio-visual detection methods are less practical than two independent unimodal methods. Consequently, the detector can not always obtain the number or type of manipulated modalities beforehand, necessitating a fake-modality-agnostic audio-visual detector. In this work, we introduce a comprehensive framework that is agnostic to fake modalities, which facilitates the identification of multimodal deepfakes and handles situations with missing modalities, regardless of the manipulations embedded in audio, video, or even cross-modal forms. To enhance the modeling of cross-modal forgery clues, we employ audio-visual speech recognition (AVSR) as a preliminary task. This efficiently extracts speech correlations across modalities, a feature challenging for deepfakes to replicate. Additionally, we propose a dual-label detection approach that follows the structure of AVSR to support the independent detection of each modality. Extensive experiments on three audio-visual datasets show that our scheme outperforms state-of-the-art detection methods with promising performance on modality-agnostic audio/video deepfakes. △ Less

Submitted 24 October, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2307.07518 [pdf]

CephGPT-4: An Interactive Multimodal Cephalometric Measurement and Diagnostic System with Visual Large Language Model

Authors: Lei Ma, **cong Han, Zhaoxin Wang, Dian Zhang

Abstract: Large-scale multimodal language models (LMMs) have achieved remarkable success in general domains. However, the exploration of diagnostic language models based on multimodal cephalometric medical data remains limited. In this paper, we propose a novel multimodal cephalometric analysis and diagnostic dialogue model. Firstly, a multimodal orthodontic medical dataset is constructed, comprising cephal… ▽ More Large-scale multimodal language models (LMMs) have achieved remarkable success in general domains. However, the exploration of diagnostic language models based on multimodal cephalometric medical data remains limited. In this paper, we propose a novel multimodal cephalometric analysis and diagnostic dialogue model. Firstly, a multimodal orthodontic medical dataset is constructed, comprising cephalometric images and doctor-patient dialogue data, with automatic analysis of cephalometric landmarks using U-net and generation of diagnostic reports. Then, the cephalometric dataset and generated diagnostic reports are separately fine-tuned on Minigpt-4 and VisualGLM. Results demonstrate that the CephGPT-4 model exhibits excellent performance and has the potential to revolutionize orthodontic measurement and diagnostic applications. These innovations hold revolutionary application potential in the field of orthodontics. △ Less

Submitted 1 July, 2023; originally announced July 2023.

arXiv:2306.15561 [pdf, other]

You Can Mask More For Extremely Low-Bitrate Image Compression

Authors: Anqi Li, Feng Li, Jiaxin Han, Huihui Bai, Runmin Cong, Chunjie Zhang, Meng Wang, Weisi Lin, Yao Zhao

Abstract: Learned image compression (LIC) methods have experienced significant progress during recent years. However, these methods are primarily dedicated to optimizing the rate-distortion (R-D) performance at medium and high bitrates (> 0.1 bits per pixel (bpp)), while research on extremely low bitrates is limited. Besides, existing methods fail to explicitly explore the image structure and texture compon… ▽ More Learned image compression (LIC) methods have experienced significant progress during recent years. However, these methods are primarily dedicated to optimizing the rate-distortion (R-D) performance at medium and high bitrates (> 0.1 bits per pixel (bpp)), while research on extremely low bitrates is limited. Besides, existing methods fail to explicitly explore the image structure and texture components crucial for image compression, treating them equally alongside uninformative components in networks. This can cause severe perceptual quality degradation, especially under low-bitrate scenarios. In this work, inspired by the success of pre-trained masked autoencoders (MAE) in many downstream tasks, we propose to rethink its mask sampling strategy from structure and texture perspectives for high redundancy reduction and discriminative feature representation, further unleashing the potential of LIC methods. Therefore, we present a dual-adaptive masking approach (DA-Mask) that samples visible patches based on the structure and texture distributions of original images. We combine DA-Mask and pre-trained MAE in masked image modeling (MIM) as an initial compressor that abstracts informative semantic context and texture representations. Such a pipeline can well cooperate with LIC networks to achieve further secondary compression while preserving promising reconstruction quality. Consequently, we propose a simple yet effective masked compression model (MCM), the first framework that unifies MIM and LIC end-to-end for extremely low-bitrate image compression. Extensive experiments have demonstrated that our approach outperforms recent state-of-the-art methods in R-D performance, visual quality, and downstream applications, at very low bitrates. Our code is available at https://github.com/lianqi1008/MCM.git. △ Less

Submitted 27 June, 2023; originally announced June 2023.

Comments: Under review

arXiv:2306.13109 [pdf, other]

doi 10.1088/1741-2552/ad09ff

EEG Decoding for Datasets with Heterogenous Electrode Configurations using Transfer Learning Graph Neural Networks

Authors: **pei Han, Xiaoxi Wei, A. Aldo Faisal

Abstract: Brain-Machine Interfacing (BMI) has greatly benefited from adopting machine learning methods for feature learning that require extensive data for training, which are often unavailable from a single dataset. Yet, it is difficult to combine data across labs or even data within the same lab collected over the years due to the variation in recording equipment and electrode layouts resulting in shifts… ▽ More Brain-Machine Interfacing (BMI) has greatly benefited from adopting machine learning methods for feature learning that require extensive data for training, which are often unavailable from a single dataset. Yet, it is difficult to combine data across labs or even data within the same lab collected over the years due to the variation in recording equipment and electrode layouts resulting in shifts in data distribution, changes in data dimensionality, and altered identity of data dimensions. Our objective is to overcome this limitation and learn from many different and diverse datasets across labs with different experimental protocols. To tackle the domain adaptation problem, we developed a novel machine learning framework combining graph neural networks (GNNs) and transfer learning methodologies for non-invasive Motor Imagery (MI) EEG decoding, as an example of BMI. Empirically, we focus on the challenges of learning from EEG data with different electrode layouts and varying numbers of electrodes. We utilise three MI EEG databases collected using very different numbers of EEG sensors (from 22 channels to 64) and layouts (from custom layouts to 10-20). Our model achieved the highest accuracy with lower standard deviations on the testing datasets. This indicates that the GNN-based transfer learning framework can effectively aggregate knowledge from multiple datasets with different electrode layouts, leading to improved generalization in subject-independent MI EEG classification. The findings of this study have important implications for Brain-Computer-Interface (BCI) research, as they highlight a promising method for overcoming the limitations posed by non-unified experimental setups. By enabling the integration of diverse datasets with varying electrode layouts, our proposed approach can help advance the development and application of BMI technologies. △ Less

Submitted 20 June, 2023; originally announced June 2023.

Comments: 19 pages, 4 figures

Journal ref: J. Neural Eng. 20 066027 (2023)

arXiv:2305.17706 [pdf, other]

Spot keywords from very noisy and mixed speech

Authors: Ying Shi, Dong Wang, Lantian Li, Jiqing Han, Shi Yin

Abstract: Most existing keyword spotting research focuses on conditions with slight or moderate noise. In this paper, we try to tackle a more challenging task: detecting keywords buried under strong interfering speech (10 times higher than the keyword in amplitude), and even worse, mixed with other keywords. We propose a novel Mix Training (MT) strategy that encourages the model to discover low-energy keywo… ▽ More Most existing keyword spotting research focuses on conditions with slight or moderate noise. In this paper, we try to tackle a more challenging task: detecting keywords buried under strong interfering speech (10 times higher than the keyword in amplitude), and even worse, mixed with other keywords. We propose a novel Mix Training (MT) strategy that encourages the model to discover low-energy keywords from noisy and mixed speech. Experiments were conducted with a vanilla CNN and two EfficientNet (B0/B2) architectures. The results evaluated with the Google Speech Command dataset demonstrated that the proposed mix training approach is highly effective and outperforms standard data augmentation and mixup training. △ Less

Submitted 28 May, 2023; originally announced May 2023.

Comments: Interspeech 2023

arXiv:2305.07783 [pdf, other]

ROI-based Deep Image Compression with Swin Transformers

Authors: Binglin Li, Jie Liang, Haisheng Fu, **gning Han

Abstract: Encoding the Region Of Interest (ROI) with better quality than the background has many applications including video conferencing systems, video surveillance and object-oriented vision tasks. In this paper, we propose a ROI-based image compression framework with Swin transformers as main building blocks for the autoencoder network. The binary ROI mask is integrated into different layers of the netw… ▽ More Encoding the Region Of Interest (ROI) with better quality than the background has many applications including video conferencing systems, video surveillance and object-oriented vision tasks. In this paper, we propose a ROI-based image compression framework with Swin transformers as main building blocks for the autoencoder network. The binary ROI mask is integrated into different layers of the network to provide spatial information guidance. Based on the ROI mask, we can control the relative importance of the ROI and non-ROI by modifying the corresponding Lagrange multiplier $ λ$ for different regions. Experimental results show our model achieves higher ROI PSNR than other methods and modest average PSNR for human evaluation. When tested on models pre-trained with original images, it has superior object detection and instance segmentation performance on the COCO validation dataset. △ Less

Submitted 12 May, 2023; originally announced May 2023.

Comments: This paper has been accepted by ICASSP 2023

arXiv:2305.03328 [pdf, other]

Time-weighted Frequency Domain Audio Representation with GMM Estimator for Anomalous Sound Detection

Authors: Jian Guan, Youde Liu, Qiaoxi Zhu, Tieran Zheng, Jiqing Han, Wenwu Wang

Abstract: Although deep learning is the mainstream method in unsupervised anomalous sound detection, Gaussian Mixture Model (GMM) with statistical audio frequency representation as input can achieve comparable results with much lower model complexity and fewer parameters. Existing statistical frequency representations, e.g, the log-Mel spectrogram's average or maximum over time, do not always work well for… ▽ More Although deep learning is the mainstream method in unsupervised anomalous sound detection, Gaussian Mixture Model (GMM) with statistical audio frequency representation as input can achieve comparable results with much lower model complexity and fewer parameters. Existing statistical frequency representations, e.g, the log-Mel spectrogram's average or maximum over time, do not always work well for different machines. This paper presents Time-Weighted Frequency Domain Representation (TWFR) with the GMM method (TWFR-GMM) for anomalous sound detection. The TWFR is a generalized statistical frequency domain representation that can adapt to different machine types, using the global weighted ranking pooling over time-domain. This allows GMM estimator to recognize anomalies, even under domain-shift conditions, as visualized with a Mahalanobis distance-based metric. Experiments on DCASE 2022 Challenge Task2 dataset show that our method has better detection performance than recent deep learning methods. TWFR-GMM is the core of our submission that achieved the 3rd place in DCASE 2022 Challenge Task2. △ Less

Submitted 5 May, 2023; originally announced May 2023.

Comments: To appear at ICASSP 2023

arXiv:2305.00416 [pdf, other]

Quaternion Matrix Completion Using Untrained Quaternion Convolutional Neural Network for Color Image Inpainting

Authors: Jifei Miao, Kit Ian Kou, Liqiao Yang, Juan Han

Abstract: The use of quaternions as a novel tool for color image representation has yielded impressive results in color image processing. By considering the color image as a unified entity rather than separate color space components, quaternions can effectively exploit the strong correlation among the RGB channels, leading to enhanced performance. Especially, color image inpainting tasks are highly benefici… ▽ More The use of quaternions as a novel tool for color image representation has yielded impressive results in color image processing. By considering the color image as a unified entity rather than separate color space components, quaternions can effectively exploit the strong correlation among the RGB channels, leading to enhanced performance. Especially, color image inpainting tasks are highly beneficial from the application of quaternion matrix completion techniques, in recent years. However, existing quaternion matrix completion methods suffer from two major drawbacks. First, it can be difficult to choose a regularizer that captures the common characteristics of natural images, and sometimes the regularizer that is chosen based on empirical evidence may not be the optimal or efficient option. Second, the optimization process of quaternion matrix completion models is quite challenging because of the non-commutativity of quaternion multiplication. To address the two drawbacks of the existing quaternion matrix completion approaches mentioned above, this paper tends to use an untrained quaternion convolutional neural network (QCNN) to directly generate the completed quaternion matrix. This approach replaces the explicit regularization term in the quaternion matrix completion model with an implicit prior that is learned by the QCNN. Extensive quantitative and qualitative evaluations demonstrate the superiority of the proposed method for color image inpainting compared with some existing quaternion-based and tensor-based methods. △ Less

Submitted 30 April, 2023; originally announced May 2023.

arXiv:2305.00092 [pdf, other]

Improving Gradient Computation for Differentiable Physics Simulation with Contacts

Authors: Yaofeng Desmond Zhong, Jiequn Han, Biswadip Dey, Georgia Olympia Brikis

Abstract: Differentiable simulation enables gradients to be back-propagated through physics simulations. In this way, one can learn the dynamics and properties of a physics system by gradient-based optimization or embed the whole differentiable simulation as a layer in a deep learning model for downstream tasks, such as planning and control. However, differentiable simulation at its current stage is not per… ▽ More Differentiable simulation enables gradients to be back-propagated through physics simulations. In this way, one can learn the dynamics and properties of a physics system by gradient-based optimization or embed the whole differentiable simulation as a layer in a deep learning model for downstream tasks, such as planning and control. However, differentiable simulation at its current stage is not perfect and might provide wrong gradients that deteriorate its performance in learning tasks. In this paper, we study differentiable rigid-body simulation with contacts. We find that existing differentiable simulation methods provide inaccurate gradients when the contact normal direction is not fixed - a general situation when the contacts are between two moving objects. We propose to improve gradient computation by continuous collision detection and leverage the time-of-impact (TOI) to calculate the post-collision velocities. We demonstrate our proposed method, referred to as TOI-Velocity, on two optimal control problems. We show that with TOI-Velocity, we are able to learn an optimal control sequence that matches the analytical solution, while without TOI-Velocity, existing differentiable simulation methods fail to do so. △ Less

Submitted 28 April, 2023; originally announced May 2023.

Comments: 5th Annual Conference on Learning for Dynamics and Control

Journal ref: Proceedings of Machine Learning Research vol 211, 2023

arXiv:2303.08636 [pdf, other]

HYBRIDFORMER: improving SqueezeFormer with hybrid attention and NSR mechanism

Authors: Yuguang Yang, Yu Pan, **g**g Yin, Jiangyu Han, Lei Ma, Heng Lu

Abstract: SqueezeFormer has recently shown impressive performance in automatic speech recognition (ASR). However, its inference speed suffers from the quadratic complexity of softmax-attention (SA). In addition, limited by the large convolution kernel size, the local modeling ability of SqueezeFormer is insufficient. In this paper, we propose a novel method HybridFormer to improve SqueezeFormer in a fast an… ▽ More SqueezeFormer has recently shown impressive performance in automatic speech recognition (ASR). However, its inference speed suffers from the quadratic complexity of softmax-attention (SA). In addition, limited by the large convolution kernel size, the local modeling ability of SqueezeFormer is insufficient. In this paper, we propose a novel method HybridFormer to improve SqueezeFormer in a fast and efficient way. Specifically, we first incorporate linear attention (LA) and propose a hybrid LASA paradigm to increase the model's inference speed. Second, a hybrid neural architecture search (NAS) guided structural re-parameterization (SRep) mechanism, termed NSR, is proposed to enhance the ability of the model to extract local interactions. Extensive experiments conducted on the LibriSpeech dataset demonstrate that our proposed HybridFormer can achieve a 9.1% relative word error rate (WER) reduction over SqueezeFormer on the test-other dataset. Furthermore, when input speech is 30s, the HybridFormer can improve the model's inference speed up to 18%. Our source code is available online. △ Less

Submitted 15 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP2023

arXiv:2303.07067 [pdf, other]

Cross-device Federated Learning for Mobile Health Diagnostics: A First Study on COVID-19 Detection

Authors: Tong Xia, **g Han, Abhirup Ghosh, Cecilia Mascolo

Abstract: Federated learning (FL) aided health diagnostic models can incorporate data from a large number of personal edge devices (e.g., mobile phones) while kee** the data local to the originating devices, largely ensuring privacy. However, such a cross-device FL approach for health diagnostics still imposes many challenges due to both local data imbalance (as extreme as local data consists of a single… ▽ More Federated learning (FL) aided health diagnostic models can incorporate data from a large number of personal edge devices (e.g., mobile phones) while kee** the data local to the originating devices, largely ensuring privacy. However, such a cross-device FL approach for health diagnostics still imposes many challenges due to both local data imbalance (as extreme as local data consists of a single disease class) and global data imbalance (the disease prevalence is generally low in a population). Since the federated server has no access to data distribution information, it is not trivial to solve the imbalance issue towards an unbiased model. In this paper, we propose FedLoss, a novel cross-device FL framework for health diagnostics. Here the federated server averages the models trained on edge devices according to the predictive loss on the local data, rather than using only the number of samples as weights. As the predictive loss better quantifies the data distribution at a device, FedLoss alleviates the impact of data imbalance. Through a real-world dataset on respiratory sound and symptom-based COVID-$19$ detection task, we validate the superiority of FedLoss. It achieves competitive COVID-$19$ detection performance compared to a centralised model with an AUC-ROC of $79\%$. It also outperforms the state-of-the-art FL baselines in sensitivity and convergence speed. Our work not only demonstrates the promise of federated COVID-$19$ detection but also paves the way to a plethora of mobile health model development in a privacy-preserving fashion. △ Less

Submitted 13 March, 2023; originally announced March 2023.

Comments: This paper has been accepted by IEEE ICASSP 2023

arXiv:2302.13661 [pdf, other]

Using Auxiliary Tasks In Multimodal Fusion Of Wav2vec 2.0 And BERT For Multimodal Emotion Recognition

Authors: Dekai Sun, Yancheng He, Jiqing Han

Abstract: The lack of data and the difficulty of multimodal fusion have always been challenges for multimodal emotion recognition (MER). In this paper, we propose to use pretrained models as upstream network, wav2vec 2.0 for audio modality and BERT for text modality, and finetune them in downstream task of MER to cope with the lack of data. For the difficulty of multimodal fusion, we use a K-layer multi-hea… ▽ More The lack of data and the difficulty of multimodal fusion have always been challenges for multimodal emotion recognition (MER). In this paper, we propose to use pretrained models as upstream network, wav2vec 2.0 for audio modality and BERT for text modality, and finetune them in downstream task of MER to cope with the lack of data. For the difficulty of multimodal fusion, we use a K-layer multi-head attention mechanism as a downstream fusion module. Starting from the MER task itself, we design two auxiliary tasks to alleviate the insufficient fusion between modalities and guide the network to capture and align emotion-related features. Compared to the previous state-of-the-art models, we achieve a better performance by 78.42% Weighted Accuracy (WA) and 79.71% Unweighted Accuracy (UA) on the IEMOCAP dataset. △ Less

Submitted 27 February, 2023; originally announced February 2023.

arXiv:2301.01940 [pdf, other]

Enabling Augmented Segmentation and Registration in Ultrasound-Guided Spinal Surgery via Realistic Ultrasound Synthesis from Diagnostic CT Volume

Authors: Ang Li, Jiayi Han, Yongjian Zhao, Keyu Li, Li Liu

Abstract: This paper aims to tackle the issues on unavailable or insufficient clinical US data and meaningful annotation to enable bone segmentation and registration for US-guided spinal surgery. While the US is not a standard paradigm for spinal surgery, the scarcity of intra-operative clinical US data is an insurmountable bottleneck in training a neural network. Moreover, due to the characteristics of US… ▽ More This paper aims to tackle the issues on unavailable or insufficient clinical US data and meaningful annotation to enable bone segmentation and registration for US-guided spinal surgery. While the US is not a standard paradigm for spinal surgery, the scarcity of intra-operative clinical US data is an insurmountable bottleneck in training a neural network. Moreover, due to the characteristics of US imaging, it is difficult to clearly annotate bone surfaces which causes the trained neural network missing its attention to the details. Hence, we propose an In silico bone US simulation framework that synthesizes realistic US images from diagnostic CT volume. Afterward, using these simulated bone US we train a lightweight vision transformer model that can achieve accurate and on-the-fly bone segmentation for spinal sonography. In the validation experiments, the realistic US simulation was conducted by deriving from diagnostic spinal CT volume to facilitate a radiation-free US-guided pedicle screw placement procedure. When it is employed for training bone segmentation task, the Chamfer distance achieves 0.599mm; when it is applied for CT-US registration, the associated bone segmentation accuracy achieves 0.93 in Dice, and the registration accuracy based on the segmented point cloud is 0.13~3.37mm in a complication-free manner. While bone US images exhibit strong echoes at the medium interface, it may enable the model indistinguishable between thin interfaces and bone surfaces by simply relying on small neighborhood information. To overcome these shortcomings, we propose to utilize a Long-range Contrast Learning Module to fully explore the Long-range Contrast between the candidates and their surrounding pixels. △ Less

Submitted 5 January, 2023; originally announced January 2023.

Comments: Submitted to IEEE Transactions on Automation Science and Engineering. Copyright may be transferred without notice, after which this version may no longer be accessible. Note that the abstract is shorter than that in the pdf file due to character limitations

arXiv:2212.03848 [pdf, other]

NeRFEditor: Differentiable Style Decomposition for Full 3D Scene Editing

Authors: Chunyi Sun, Yanbin Liu, Junlin Han, Stephen Gould

Abstract: We present NeRFEditor, an efficient learning framework for 3D scene editing, which takes a video captured over 360° as input and outputs a high-quality, identity-preserving stylized 3D scene. Our method supports diverse types of editing such as guided by reference images, text prompts, and user interactions. We achieve this by encouraging a pre-trained StyleGAN model and a NeRF model to learn from… ▽ More We present NeRFEditor, an efficient learning framework for 3D scene editing, which takes a video captured over 360° as input and outputs a high-quality, identity-preserving stylized 3D scene. Our method supports diverse types of editing such as guided by reference images, text prompts, and user interactions. We achieve this by encouraging a pre-trained StyleGAN model and a NeRF model to learn from each other mutually. Specifically, we use a NeRF model to generate numerous image-angle pairs to train an adjustor, which can adjust the StyleGAN latent code to generate high-fidelity stylized images for any given angle. To extrapolate editing to GAN out-of-domain views, we devise another module that is trained in a self-supervised learning manner. This module maps novel-view images to the hidden space of StyleGAN that allows StyleGAN to generate stylized images on novel views. These two modules together produce guided images in 360°views to finetune a NeRF to make stylization effects, where a stable fine-tuning strategy is proposed to achieve this. Experiments show that NeRFEditor outperforms prior work on benchmark and real-world scenes with better editability, fidelity, and identity preservation. △ Less

Submitted 8 December, 2022; v1 submitted 7 December, 2022; originally announced December 2022.

Comments: Project page: https://chuny1.github.io/NeRFEditor/nerfeditor.html

arXiv:2211.14467 [pdf]

Self-Supervised Surgical Instrument 3D Reconstruction from a Single Camera Image

Authors: Ange Lou, Xing Yao, Ziteng Liu, **tong Han, Jack Noble

Abstract: Surgical instrument tracking is an active research area that can provide surgeons feedback about the location of their tools relative to anatomy. Recent tracking methods are mainly divided into two parts: segmentation and object detection. However, both can only predict 2D information, which is limiting for application to real-world surgery. An accurate 3D surgical instrument model is a prerequisi… ▽ More Surgical instrument tracking is an active research area that can provide surgeons feedback about the location of their tools relative to anatomy. Recent tracking methods are mainly divided into two parts: segmentation and object detection. However, both can only predict 2D information, which is limiting for application to real-world surgery. An accurate 3D surgical instrument model is a prerequisite for precise predictions of the pose and depth of the instrument. Recent single-view 3D reconstruction methods are only used in natural object reconstruction and do not achieve satisfying reconstruction accuracy without 3D attribute-level supervision. Further, those methods are not suitable for the surgical instruments because of their elongated shapes. In this paper, we firstly propose an end-to-end surgical instrument reconstruction system -- Self-supervised Surgical Instrument Reconstruction (SSIR). With SSIR, we propose a multi-cycle-consistency strategy to help capture the texture information from a slim instrument while only requiring a binary instrument label map. Experiments demonstrate that our approach improves the reconstruction quality of surgical instruments compared to other self-supervised methods and achieves promising results. △ Less

Submitted 25 November, 2022; originally announced November 2022.

Comments: Accepted by SPIE Medical Imaging 2023

arXiv:2211.12793 [pdf, other]

Low Rank Quaternion Matrix Completion Based on Quaternion QR Decomposition and Sparse Regularizer

Authors: Juan Han, Liqiao Yang, Kit Ian Kou, Jifei Miao, Lizhi Liu

Abstract: Matrix completion is one of the most challenging problems in computer vision. Recently, quaternion representations of color images have achieved competitive performance in many fields. Because it treats the color image as a whole, the coupling information between the three channels of the color image is better utilized. Due to this, low-rank quaternion matrix completion (LRQMC) algorithms have gai… ▽ More Matrix completion is one of the most challenging problems in computer vision. Recently, quaternion representations of color images have achieved competitive performance in many fields. Because it treats the color image as a whole, the coupling information between the three channels of the color image is better utilized. Due to this, low-rank quaternion matrix completion (LRQMC) algorithms have gained considerable attention from researchers. In contrast to the traditional quaternion matrix completion algorithms based on quaternion singular value decomposition (QSVD), we propose a novel method based on quaternion Qatar Riyal decomposition (QQR). In the first part of the paper, a novel method for calculating an approximate QSVD based on iterative QQR is proposed (CQSVD-QQR), whose computational complexity is lower than that of QSVD. The largest $r \ (r>0)$ singular values of a given quaternion matrix can be computed by using CQSVD-QQR. Then, we propose a new quaternion matrix completion method based on CQSVD-QQR which combines low-rank and sparse priors of color images. Experimental results on color images and color medical images demonstrate that our model outperforms those state-of-the-art methods. △ Less

Submitted 23 November, 2022; originally announced November 2022.

arXiv:2211.12097 [pdf, other]

Dynamic Acoustic Compensation and Adaptive Focal Training for Personalized Speech Enhancement

Authors: Xiaofeng Ge, Jiangyu Han, Haixin Guan, Yanhua Long

Abstract: Recently, more and more personalized speech enhancement systems (PSE) with excellent performance have been proposed. However, two critical issues still limit the performance and generalization ability of the model: 1) Acoustic environment mismatch between the test noisy speech and target speaker enrollment speech; 2) Hard sample mining and learning. In this paper, dynamic acoustic compensation (DA… ▽ More Recently, more and more personalized speech enhancement systems (PSE) with excellent performance have been proposed. However, two critical issues still limit the performance and generalization ability of the model: 1) Acoustic environment mismatch between the test noisy speech and target speaker enrollment speech; 2) Hard sample mining and learning. In this paper, dynamic acoustic compensation (DAC) is proposed to alleviate the environment mismatch, by intercepting the noise or environmental acoustic segments from noisy speech and mixing it with the clean enrollment speech. To well exploit the hard samples in training data, we propose an adaptive focal training (AFT) strategy by assigning adaptive loss weights to hard and non-hard samples during training. A time-frequency multi-loss training is further introduced to improve and generalize our previous work sDPCCN for PSE. The effectiveness of proposed methods are examined on the DNS4 Challenge dataset. Results show that, the DAC brings large improvements in terms of multiple evaluation metrics, and AFT reduces the hard sample rate significantly and produces obvious MOS score improvement. △ Less

Submitted 22 November, 2022; originally announced November 2022.

arXiv:2211.10885 [pdf, other]

Contrastive Regularization for Multimodal Emotion Recognition Using Audio and Text

Authors: Fan Qian, Jiqing Han

Abstract: Speech emotion recognition is a challenge and an important step towards more natural human-computer interaction (HCI). The popular approach is multimodal emotion recognition based on model-level fusion, which means that the multimodal signals can be encoded to acquire embeddings, and then the embeddings are concatenated together for the final classification. However, due to the influence of noise… ▽ More Speech emotion recognition is a challenge and an important step towards more natural human-computer interaction (HCI). The popular approach is multimodal emotion recognition based on model-level fusion, which means that the multimodal signals can be encoded to acquire embeddings, and then the embeddings are concatenated together for the final classification. However, due to the influence of noise or other factors, each modality does not always tend to the same emotional category, which affects the generalization of a model. In this paper, we propose a novel regularization method via contrastive learning for multimodal emotion recognition using audio and text. By introducing a discriminator to distinguish the difference between the same and different emotional pairs, we explicitly restrict the latent code of each modality to contain the same emotional information, so as to reduce the noise interference and get more discriminative representation. Experiments are performed on the standard IEMOCAP dataset for 4-class emotion recognition. The results show a significant improvement of 1.44\% and 1.53\% in terms of weighted accuracy (WA) and unweighted accuracy (UA) compared to the baseline system. △ Less

Submitted 20 November, 2022; originally announced November 2022.

Comments: Completed in October 2020 and submitted to ICASSP2021

Showing 1–50 of 150 results for author: Han, J