-
Interference Cancellation Based Neural Receiver for Superimposed Pilot in Multi-Layer Transmission
Authors:
Han Xiao,
Wenqiang Tian,
Shi **,
Wendong Liu,
Jia Shen,
Zhihua Shi,
Zhi Zhang
Abstract:
In this paper, an interference cancellation based neural receiver for superimposed pilot (SIP) in multi-layer transmission is proposed, where the data and pilot are non-orthogonally superimposed in the same time-frequency resource. Specifically, to deal with the intra-layer and inter-layer interference of SIP under multi-layer transmission, the interference cancellation with superimposed symbol aided channel estimation is leveraged in the neural receiver, accompanied by the pre-design of pilot code-division orthogonal mechanism at transmitter. In addition, to address the complexity issue for inter-vendor collaboration and the generalization problem in practical deployments, respectively, this paper also provides a fixed SIP (F-SIP) design based on constant pilot power ratio and scalable mechanisms for different modulation and coding schemes (MCSs) and transmission layers. Simulation results demonstrate the superiority of the proposed schemes on the performance of block error rate and throughput compared with existing counterparts.
Submitted 27 June, 2024;
originally announced June 2024.
-
A Lung Nodule Dataset with Histopathology-based Cancer Type Annotation
Authors:
Muwei Jian,
Hongyu Chen,
Zaiyong Zhang,
Nan Yang,
Haorang Zhang,
Lifu Ma,
Wen**g Xu,
Huixiang Zhi
Abstract:
Recently, Computer-Aided Diagnosis (CAD) systems have emerged as indispensable tools in clinical diagnostic workflows, significantly alleviating the burden on radiologists. Nevertheless, despite their integration into clinical settings, CAD systems encounter limitations. Specifically, while CAD systems can achieve high performance in the detection of lung nodules, they face challenges in accurately predicting multiple cancer types. This limitation can be attributed to the scarcity of publicly available datasets annotated with expert-level cancer type information. This research aims to bridge this gap by providing publicly accessible datasets and reliable tools for medical diagnosis, facilitating a finer categorization of different types of lung diseases so as to offer precise treatment recommendations. To achieve this objective, we curated a diverse dataset of lung Computed Tomography (CT) images, comprising 330 annotated nodules (nodules are labeled as bounding boxes) from 95 distinct patients. The quality of the dataset was evaluated using a variety of classical classification and detection models, and these promising results demonstrate that the dataset has a feasible application and further facilitate intelligent auxiliary diagnosis.
Submitted 26 June, 2024;
originally announced June 2024.
-
Testing Topological Data Analysis for Condition Monitoring of Wind Turbines
Authors:
Simone Casolo,
Alexander Stasik,
Zhenyou Zhang,
Signe Riemer-Sorensen
Abstract:
We present an investigation of how topological data analysis (TDA) can be applied to condition-based monitoring (CBM) of wind turbines for energy generation. TDA is a branch of data analysis focusing on extracting meaningful information from complex datasets by analyzing their structure in state space and computing their underlying topological features. By representing data in a high-dimensional state space, TDA enables the identification of patterns, anomalies, and trends in the data that may not be apparent through traditional signal processing methods. For this study, wind turbine data was acquired from a wind park in Norway via standard vibration sensors at different locations of the turbine's gearbox. Both the vibration acceleration data and its frequency spectra were recorded at infrequent intervals for a few seconds at high frequency and failure events were labelled as either gear-tooth or ball-bearing failures. The data processing and analysis are based on a pipeline where the time series data is first split into intervals and then transformed into multi-dimensional point clouds via a time-delay embedding. The shape of the point cloud is analyzed with topological methods such as persistent homology to generate topology-based key health indicators based on Betti numbers, information entropy and signal persistence. Such indicators are tested for CBM and diagnosis (fault detection) to identify faults in wind turbines and classify them accordingly. Topological indicators are shown to be an interesting alternative for failure identification and diagnosis of operational failures in wind turbines.
Submitted 24 June, 2024;
originally announced June 2024.
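The first two pipeline steps named in the abstract above (splitting the time series into intervals, then turning each interval into a point cloud via a time-delay embedding) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the non-overlapping windows, and the `dimension` and `delay` parameters are assumptions chosen for the example.

```python
def split_into_windows(signal, window_size):
    """Split a 1-D sequence into consecutive, non-overlapping windows.

    Trailing samples that do not fill a whole window are dropped
    (an assumption made for this sketch).
    """
    return [signal[i:i + window_size]
            for i in range(0, len(signal) - window_size + 1, window_size)]


def time_delay_embedding(window, dimension, delay):
    """Map a 1-D window to a point cloud in `dimension`-dimensional space.

    Point i is (x[i], x[i + delay], ..., x[i + (dimension - 1) * delay]).
    """
    n_points = len(window) - (dimension - 1) * delay
    if n_points <= 0:
        raise ValueError("window too short for this dimension and delay")
    return [tuple(window[i + k * delay] for k in range(dimension))
            for i in range(n_points)]


# Usage: a toy "vibration" signal split into two windows of five samples.
signal = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for window in split_into_windows(signal, 5):
    cloud = time_delay_embedding(window, dimension=3, delay=1)
    print(cloud)
```

Each resulting point cloud would then be passed to a persistent-homology routine to derive the topological indicators; that step is not shown here.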
-
Multi-scale Restoration of Missing Data in Optical Time-series Images with Masked Spatial-Temporal Attention Network
Authors:
Zaiyan Zhang,
**ing Yan,
Yuanqi Liang,
Jiaxin Feng,
Haixu He,
Wei Han
Abstract:
Due to factors such as thick cloud cover and sensor limitations, remote sensing images often suffer from significant missing data, resulting in incomplete time-series information. Existing methods for imputing missing values in remote sensing images do not fully exploit spatio-temporal auxiliary information, leading to limited accuracy in restoration. Therefore, this paper proposes a novel deep learning-based approach called MS2TAN (Multi-scale Masked Spatial-Temporal Attention Network), for reconstructing time-series remote sensing images. Firstly, we introduce an efficient spatio-temporal feature extractor based on Masked Spatial-Temporal Attention (MSTA), to obtain high-quality representations of the spatio-temporal neighborhood features in the missing regions. Secondly, a Multi-scale Restoration Network consisting of the MSTA-based Feature Extractors, is employed to progressively refine the missing values by exploring spatio-temporal neighborhood features at different scales. Thirdly, we propose a ``Pixel-Structure-Perception'' Multi-Objective Joint Optimization method to enhance the visual effects of the reconstruction results from multiple perspectives and preserve more texture structures. Furthermore, the proposed method reconstructs missing values in all input temporal phases in parallel (i.e., Multi-In Multi-Out), achieving higher processing efficiency. Finally, experimental evaluations on two typical missing data restoration tasks across multiple research areas demonstrate that the proposed method outperforms state-of-the-art methods with an improvement of 0.40dB/1.17dB in mean peak signal-to-noise ratio (mPSNR) and 3.77/9.41 thousandths in mean structural similarity (mSSIM), while exhibiting stronger texture and structural consistency.
Submitted 19 June, 2024;
originally announced June 2024.
-
Unlocking the Potential of Early Epochs: Uncertainty-aware CT Metal Artifact Reduction
Authors:
Xinquan Yang,
Guanqun Zhou,
Wei Sun,
Youjian Zhang,
Zhongya Wang,
Jiahui He,
Zhicheng Zhang
Abstract:
In computed tomography (CT), the presence of metallic implants in patients often leads to disruptive artifacts in the reconstructed images, hindering accurate diagnosis. Recently, a large number of supervised deep learning-based approaches have been proposed for metal artifact reduction (MAR). However, these methods neglect the influence of initial training weights. In this paper, we find that the uncertainty image computed from the restoration result of initial training weights can effectively highlight high-frequency regions, including metal artifacts. This observation can be leveraged to assist the MAR network in removing metal artifacts. Therefore, we propose an uncertainty constraint (UC) loss that utilizes the uncertainty image as an adaptive weight to guide the MAR network to focus on the metal artifact region, leading to improved restoration. The proposed UC loss is designed to be a plug-and-play method, compatible with any MAR framework, and easily adoptable. To validate the effectiveness of the UC loss, we conduct extensive experiments on the publicly available DeepLesion and CLINIC-metal datasets. Experimental results demonstrate that the UC loss further optimizes the network training process and significantly improves the removal of metal artifacts.
Submitted 20 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
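The idea of using an uncertainty image as an adaptive per-pixel weight, as described in the abstract above, can be illustrated with a generic uncertainty-weighted L1 loss. The abstract does not give the exact form of the UC loss, so everything below is an assumption made for illustration: the function name, the `1 + u / max(u)` weighting, and the flat-list image representation are hypothetical and should not be read as the authors' formulation.

```python
def uncertainty_weighted_l1(pred, target, uncertainty, eps=1e-8):
    """Generic uncertainty-weighted L1 loss over flattened images.

    Pixels with higher uncertainty receive a larger weight, so errors in
    those regions (for example, metal artifact regions) contribute more
    to the loss. The weighting scheme is an illustrative assumption.
    """
    if not (len(pred) == len(target) == len(uncertainty)):
        raise ValueError("all inputs must have the same length")
    peak = max(uncertainty)
    # Weights lie in [1, 2): 1 for zero uncertainty, close to 2 at the peak.
    weights = [1.0 + u / (peak + eps) for u in uncertainty]
    total = sum(w * abs(p - t) for w, p, t in zip(weights, pred, target))
    return total / len(pred)


# Usage: the second pixel is uncertain, so its error counts roughly twice.
plain = uncertainty_weighted_l1([1.0, 2.0], [0.0, 0.0], [0.0, 0.0])
weighted = uncertainty_weighted_l1([1.0, 2.0], [0.0, 0.0], [0.0, 1.0])
print(plain, weighted)
```

In a real training loop the same weighting would be applied to tensors inside a deep learning framework; plain lists are used here only to keep the sketch self-contained.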
-
Mix-Domain Contrastive Learning for Unpaired H&E-to-IHC Stain Translation
Authors:
Song Wang,
Zhong Zhang,
Huan Yan,
Ming Xu,
Guanghui Wang
Abstract:
H&E-to-IHC stain translation techniques offer a promising solution for precise cancer diagnosis, especially in low-resource regions where there is a shortage of health professionals and limited access to expensive equipment. Considering the pixel-level misalignment of H&E-IHC image pairs, current research explores the pathological consistency between patches from the same positions of the image pair. However, most of them overemphasize the correspondence between domains or patches, overlooking the side information provided by the non-corresponding objects. In this paper, we propose a Mix-Domain Contrastive Learning (MDCL) method to leverage the supervision information in unpaired H&E-to-IHC stain translation. Specifically, the proposed MDCL method aggregates the inter-domain and intra-domain pathology information by estimating the correlation between the anchor patch and all the patches from the matching images, encouraging the network to learn additional contrastive knowledge from mixed domains. With the mix-domain pathology information aggregation, MDCL enhances the pathological consistency between the corresponding patches and the component discrepancy of the patches from the different positions of the generated IHC image. Extensive experiments on two H&E-to-IHC stain translation datasets, namely MIST and BCI, demonstrate that the proposed method achieves state-of-the-art performance across multiple metrics.
Submitted 17 June, 2024;
originally announced June 2024.
-
Multiple Intelligent Reflecting Surfaces Collaborative Wireless Localization System
Authors:
Ziheng Zhang,
Wen Chen,
Qingqing Wu,
Zhendong Li,
Xusheng Zhu,
**gfeng Chen,
Nan Cheng
Abstract:
This paper studies a multiple intelligent reflecting surfaces (IRSs) collaborative localization system where multiple semi-passive IRSs are deployed in the network to locate one or more targets based on time-of-arrival. It is assumed that each semi-passive IRS is equipped with reflective elements and sensors, which are used to establish the line-of-sight links from the base station (BS) to multiple targets and process echo signals, respectively. Based on the above model, we derive the Fisher information matrix of the echo signal with respect to the time delay. By employing the chain rule and exploiting the geometric relationship between time delay and position, the Cramer-Rao bound (CRB) for estimating the target's Cartesian coordinate position is derived. Then, we propose a two-stage algorithmic framework to minimize the CRB in single- and multi-target localization systems by jointly optimizing active beamforming at the BS, passive beamforming at multiple IRSs, and IRS selection. For the single-target case, we derive the optimal closed-form solution for the coefficient design of multiple IRSs and propose a low-complexity algorithm based on the alternating direction method of multipliers to obtain the optimal solution for the active beamforming design. For the multi-target case, alternating optimization is used to transform the original problem into two subproblems where semi-definite relaxation and successive convex approximation are applied to tackle the quadraticity and indefiniteness in the CRB expression, respectively. Finally, numerical simulation results validate the effectiveness of the proposed algorithm for the multiple-IRS collaborative localization system and show significant performance gains over benchmark schemes.
Submitted 17 June, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
-
Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior
Authors:
Baiang Li,
Sizhuo Ma,
Yanhong Zeng,
Xiaogang Xu,
Youqing Fang,
Zhao Zhang,
Jian Wang,
Kai Chen
Abstract:
Capturing High Dynamic Range (HDR) scenery using 8-bit cameras often suffers from over-/underexposure, loss of fine details due to low bit-depth compression, skewed color distributions, and strong noise in dark areas. Traditional LDR image enhancement methods primarily focus on color mapping, which enhances the visual representation by expanding the image's color range and adjusting the brightness. However, these approaches fail to effectively restore content in dynamic range extremes, which are regions with pixel values close to 0 or 255. To address the full scope of challenges in HDR imaging and surpass the limitations of current models, we propose a novel two-stage approach. The first stage maps the color and brightness to an appropriate range while keeping the existing details, and the second stage utilizes a diffusion prior to generate content in dynamic range extremes lost during capture. This generative refinement module can also be used as a plug-and-play module to enhance and complement existing LDR enhancement models. The proposed method markedly improves the quality and details of LDR images, demonstrating superior performance through rigorous experimental validation. The project page is at https://sagiri0208.github.io
Submitted 13 June, 2024;
originally announced June 2024.
-
CMC-Bench: Towards a New Paradigm of Visual Signal Compression
Authors:
Chunyi Li,
Xiele Wu,
Haoning Wu,
Donghui Feng,
Zicheng Zhang,
Guo Lu,
Xiongkuo Min,
Xiaohong Liu,
Guangtao Zhai,
Weisi Lin
Abstract:
Ultra-low bitrate image compression is a challenging and demanding topic. With the development of Large Multimodal Models (LMMs), a Cross Modality Compression (CMC) paradigm of Image-Text-Image has emerged. Compared with traditional codecs, this semantic-level compression can reduce image data size to 0.1\% or even lower, which has strong potential applications. However, CMC has certain defects in consistency with the original image and perceptual quality. To address this problem, we introduce CMC-Bench, a benchmark of the cooperative performance of Image-to-Text (I2T) and Text-to-Image (T2I) models for image compression. This benchmark covers 18,000 and 40,000 images respectively to verify 6 mainstream I2T and 12 T2I models, including 160,000 subjective preference scores annotated by human experts. At ultra-low bitrates, this paper proves that the combination of some I2T and T2I models has surpassed the most advanced visual signal codecs; meanwhile, it highlights where LMMs can be further optimized toward the compression task. We encourage LMM developers to participate in this test to promote the evolution of visual signal codec protocols.
Submitted 13 June, 2024;
originally announced June 2024.
-
MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection
Authors:
Da Mu,
Zhicheng Zhang,
Haobo Yue
Abstract:
Sound Event Localization and Detection (SELD) involves detecting and localizing sound events using multichannel sound recordings. The previously proposed Event-Independent Network V2 (EINV2) has achieved outstanding performance on SELD. However, it still faces challenges in effectively extracting features across spectral, spatial, and temporal domains. This paper proposes a three-stage network structure, named the Multi-scale Feature Fusion (MFF) module, to fully extract multi-scale features across spectral, spatial, and temporal domains. The MFF module uses a parallel subnetwork architecture to generate multi-scale spectral and spatial features. The TF-Convolution Module is employed to provide multi-scale temporal features. We incorporate MFF into EINV2 and term the proposed method MFF-EINV2. Experimental results on the 2022 and 2023 DCASE Challenge Task 3 datasets show the effectiveness of our MFF-EINV2, which achieves state-of-the-art (SOTA) performance compared to published methods.
Submitted 15 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Micro-expression recognition based on depth map to point cloud
Authors:
Ren Zhang,
Jianqin Yin,
Chao Qi,
Zehao Wang,
Zhicheng Zhang,
Yonghao Dang
Abstract:
Micro-expressions (MEs) are nonverbal facial expressions that reveal the covert emotions of individuals, which has drawn widespread attention to the micro-expression recognition task. However, the task is challenging due to the subtle facial motion and brevity in duration. Many 2D image-based methods have been developed in recent years to recognize MEs effectively, but these approaches are restricted by facial texture information and are susceptible to environmental factors, such as lighting. Conversely, depth information can effectively represent motion information related to facial structure changes and is not affected by lighting. Motion information derived from facial structures can describe motion features that pixel textures cannot delineate. We propose a network for micro-expression recognition based on facial depth information, and our experiments demonstrate the crucial role of depth maps in the micro-expression recognition task. Initially, we transform the depth map into a point cloud and obtain the motion information for each point by aligning the initiating frame with the apex frame and performing a differential operation. Subsequently, we adjust the input dimensions of all point cloud motion features and use them as inputs for multiple point cloud networks to assess the efficacy of this representation. PointNet++ was chosen as the final network for micro-expression recognition due to its superior performance. Our experiments show that our proposed method significantly outperforms the existing deep learning methods, including the baseline, on the $CAS(ME)^3$ dataset, which includes depth information.
Submitted 12 June, 2024;
originally announced June 2024.
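The depth-map-to-point-cloud step and the per-point differential operation mentioned in the abstract above can be sketched with a standard pinhole back-projection. This is only an illustration under stated assumptions: the camera intrinsics `fx`, `fy`, `cx`, `cy`, the function names, and the premise that the two frames are already aligned pixel by pixel with valid depth everywhere are all hypothetical and are not taken from the paper.

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows) to 3-D points.

    Uses the pinhole camera model: pixel (u, v) with depth z maps to
    ((u - cx) * z / fx, (v - cy) * z / fy, z). Points are returned in
    row-major pixel order so two frames can be compared point by point.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def point_motion(onset_points, apex_points):
    """Per-point displacement between two aligned frames (apex minus onset)."""
    return [tuple(a - o for o, a in zip(onset, apex))
            for onset, apex in zip(onset_points, apex_points)]


# Usage: a 1x2 depth map where only the first pixel moves toward the apex.
onset = depth_to_point_cloud([[1.0, 2.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
apex = depth_to_point_cloud([[1.5, 2.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(point_motion(onset, apex))
```

The displacement vectors could then be attached to each point as motion features before being fed to a point cloud network such as PointNet++.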
-
Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications
Authors:
Zhou Zhou,
Guohang He,
Zheng Zhang,
Luziwei Leng,
Qinghai Guo,
Jianxing Liao,
Xuan Song,
Ran Cheng
Abstract:
Traditional invasive Brain-Computer Interfaces (iBCIs) typically depend on neural decoding processes conducted on workstations within laboratory settings, which prevents their everyday usage. Implementing these decoding processes on edge devices, such as the wearables, introduces considerable challenges related to computational demands, processing speed, and maintaining accuracy. This study seeks to identify an optimal neural decoding backbone that boasts robust performance and swift inference capabilities suitable for edge deployment. We executed a series of neural decoding experiments involving nonhuman primates engaged in random reaching tasks, evaluating four prospective models, Gated Recurrent Unit (GRU), Transformer, Receptance Weighted Key Value (RWKV), and Selective State Space model (Mamba), across several metrics: single-session decoding, multi-session decoding, new session fine-tuning, inference speed, calibration speed, and scalability. The findings indicate that although the GRU model delivers sufficient accuracy, the RWKV and Mamba models are preferable due to their superior inference and calibration speeds. Additionally, RWKV and Mamba comply with the scaling law, demonstrating improved performance with larger data sets and increased model sizes, whereas GRU shows less pronounced scalability, and the Transformer model requires computational resources that scale prohibitively. This paper presents a thorough comparative analysis of the four models in various scenarios. The results are pivotal in pinpointing an optimal backbone that can handle increasing data volumes and is viable for edge implementation. This analysis provides essential insights for ongoing research and practical applications in the field.
Submitted 7 June, 2024;
originally announced June 2024.
-
BS-PLCNet 2: Two-stage Band-split Packet Loss Concealment Network with Intra-model Knowledge Distillation
Authors:
Zihan Zhang,
Xianjun Xia,
Chuanzeng Huang,
Yijian Xiao,
Lei Xie
Abstract:
Audio packet loss is an inevitable problem in real-time speech communication. A band-split packet loss concealment network (BS-PLCNet) targeting full-band signals was recently proposed. Although it achieved superior performance in the ICASSP 2024 PLC Challenge, BS-PLCNet is a large model with high computational complexity of 8.95G FLOPS. This paper presents its updated version, BS-PLCNet 2, to reduce computational complexity and improve performance further. Specifically, to compensate for the missing future information, in the wide-band module, we design a dual-path encoder structure (with non-causal and causal paths) and leverage an intra-model knowledge distillation strategy to distill the future information from the non-causal teacher to the causal student. Moreover, we introduce a lightweight post-processing module after packet loss restoration to repair speech distortions and remove residual noise in the audio signal. With only 40% of the parameters of the original BS-PLCNet, BS-PLCNet 2 brings a 0.18 PLCMOS improvement on the ICASSP 2024 PLC Challenge blind set, achieving state-of-the-art performance on this dataset.
Submitted 9 June, 2024;
originally announced June 2024.
-
Near-Field Channel Estimation for Extremely Large-Scale Terahertz Communications
Authors:
Songjie Yang,
Yizhou Peng,
Wanting Lyu,
Ya Li,
Hongjun He,
Zhongpei Zhang,
Chau Yuen
Abstract:
Future Terahertz communications exhibit significant potential in accommodating ultra-high-rate services. Employing extremely large-scale array antennas is a key approach to realize this potential, as they can harness substantial beamforming gains to overcome the severe path loss and leverage the electromagnetic advantages in the near field. This paper proposes novel estimation methods designed to enhance efficiency in Terahertz widely-spaced multi-subarray (WSMS) systems. Initially, we introduce three sparse channel representation methods: polar-domain representation (PD-R), multi-angular-domain representation (MAD-R), and two-dimensional polar-angular-domain representation (2D-PAD-R). Each method is meticulously developed for near-field WSMS channels, capitalizing on their sparsity characteristics. Building on this, we propose four estimation frameworks using the sparse recovery theory: polar-domain estimation (PD-E), multi-angular-domain estimation (MAD-E), two-stage polar-angular-domain estimation (TS-PAD-E), and two-dimensional polar-angular-domain estimation (2D-PAD-E). Particularly, 2D-PAD-E, integrating a 2D dictionary process, and TS-PAD-E, with its sequential approach to angle and distance estimation, stand out as particularly effective for near-field angle-distance estimation, enabling decoupled calculation of these parameters. Overall, these frameworks provide versatile and efficient solutions for WSMS channel estimation, balancing low complexity with high-performance outcomes. Additionally, they represent a fresh perspective on near-field signal processing.
Submitted 8 June, 2024;
originally announced June 2024.
-
SF-V: Single Forward Video Generation Model
Authors:
Zhixing Zhang,
Yanyu Li,
Yushu Wu,
Yanwu Xu,
Anil Kag,
Ivan Skorokhodov,
Willi Menapace,
Aliaksandr Siarohin,
Junli Cao,
Dimitris Metaxas,
Sergey Tulyakov,
Jian Ren
Abstract:
Diffusion-based video generation models have demonstrated remarkable success in obtaining high-fidelity videos through the iterative denoising process. However, these models require multiple denoising steps during sampling, resulting in high computational costs. In this work, we propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained video diffusion models. We show that, through the adversarial training, the multi-steps video diffusion model, i.e., Stable Video Diffusion (SVD), can be trained to perform single forward pass to synthesize high-quality videos, capturing both temporal and spatial dependencies in the video data. Extensive experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead for the denoising process (i.e., around $23\times$ speedup compared with SVD and $6\times$ speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing. More visualization results are made publicly available at https://snap-research.github.io/SF-V.
Submitted 6 June, 2024;
originally announced June 2024.
-
Energy-storing analysis and fishtail stiffness optimization for a wire-driven elastic robotic fish
Authors:
Xiaocun Liao,
Chao Zhou,
Junfeng Fan,
Zhuoliang Zhang,
Zhaoran Yin,
Liangwei Deng
Abstract:
The robotic fish with high propulsion efficiency and good maneuverability achieves underwater fishlike propulsion by commonly adopting the motor to drive the fishtail, causing the significant fluctuations of the motor power due to the uneven swing speed of the fishtail in one swing cycle. Hence, we propose a wire-driven robotic fish with a spring-steel-based active-segment elastic spine. This bionic spine can produce elastic deformation to store energy under the action of the wire driving and motor for responding to the fluctuations of the motor power. Further, we analyze the effects of the energy-storing of the active-segment elastic spine on the smoothness of motor power. Based on the developed Lagrangian dynamic model and cantilever beam model, the power-variance-based nonlinear optimization model for the stiffness of the active-segment elastic spine is established to respond to the sharp fluctuations of motor power during each fishtail swing cycle. Results validate that the energy-storing of the active-segment elastic spine plays a vital role in improving the power fluctuations and maximum frequency of the motor by adjusting its stiffness reasonably, which is beneficial to achieving high propulsion and high speed for robotic fish. Compared with the active-segment rigid spine that is incapable of storing energy, the energy-storing of the active-segment elastic spine is beneficial to increase the maximum frequency of the motor and the average thrust of the fishtail by 0.41 Hz, and 0.06 N, respectively.
Submitted 6 June, 2024;
originally announced June 2024.
-
Development of an underwater inductive coupling communication system with power carrier technology
Authors:
Zhongxing Zhang
Abstract:
Inductive coupling communication is one of the main methods of underwater communication systems due to its excellent comprehensive performance. However, the data transmission distance and operational power consumption need to be further improved. In this paper, an underwater inductive coupling communication scheme based on power carrier technology is proposed to improve the transmission speed and reduce the bit error rate. An ultra-low-power STM32L series microcontroller was employed as the core of the system. Through the construction and simulation of the communication channel, the optimal parameters were determined. According to the circuit model of the power carrier communication, the effects of different modulation and demodulation methods on the signal transmission quality were discussed, demonstrating the superiority of Differential Phase Shift Keying (DPSK). With the system-level low-power design and onboard communication quality optimization, the device was developed. The test results in the laboratory environment show that the system can achieve efficient data communication at a rate of 115200 bps with static power consumption as low as 660 μA in the 700 m channel. This study provides a practical design approach for the high-speed communication and low-power operation of underwater communication systems.
Submitted 4 June, 2024;
originally announced June 2024.
-
Achieving Tractable Minimax Optimal Regret in Average Reward MDPs
Authors:
Victor Boone,
Zihan Zhang
Abstract:
In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$, where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S \times A$ is the size of the state-action space and $T$ the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$. Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.
Submitted 3 June, 2024;
originally announced June 2024.
-
Exploring Channel Estimation and Signal Detection for ODDM-based ISAC Systems
Authors:
Dezhi Wang,
Chongwen Huang,
Lei Liu,
Xiaoming Chen,
Wei Wang,
Zhaoyang Zhang,
Chau Yuen,
Mérouane Debbah
Abstract:
Inspired by providing reliable communications for high-mobility scenarios, in this letter, we investigate the channel estimation and signal detection in integrated sensing and communication (ISAC) systems based on the orthogonal delay-Doppler multiplexing (ODDM) modulation, which consists of a pulse-train that can achieve the orthogonality with respect to the resolution of the delay-Doppler (DD) plane. To enhance the communication performance in the ODDM-based ISAC systems, we first propose a low-complexity approximation algorithm for channel estimation, which addresses the challenge of the high complexity from high resolution in the ODDM modulation, and achieves performance close to that of the maximum likelihood estimator scheme. Then, we employ the orthogonal approximate message-passing scheme to detect the symbols in the communication process based on the estimated channel information. Finally, simulation results show that the detection performance of ODDM is better than other multi-carrier modulation schemes. Specifically, the ODDM outperforms the orthogonal time frequency space scheme by 2.3 dB when the bit error ratio is $10^{-6}$.
Submitted 1 June, 2024;
originally announced June 2024.
-
Learning to Stabilize Unknown LTI Systems on a Single Trajectory under Stochastic Noise
Authors:
Ziyi Zhang,
Yorie Nakahira,
Guannan Qu
Abstract:
We study the problem of learning to stabilize unknown noisy Linear Time-Invariant (LTI) systems on a single trajectory. It is well known in the literature that the learn-to-stabilize problem suffers from exponential blow-up in which the state norm blows up in the order of $\Theta(2^n)$ where $n$ is the state space dimension. This blow-up is due to the open-loop instability when exploring the $n$-dimensional state space. To address this issue, we develop a novel algorithm that decouples the unstable subspace of the LTI system from the stable subspace, based on which the algorithm only explores and stabilizes the unstable subspace, the dimension of which can be much smaller than $n$. With a new singular-value-decomposition (SVD)-based analytical framework, we prove that the system is stabilized before the state norm reaches $2^{O(k \log n)}$, where $k$ is the dimension of the unstable subspace. Critically, this bound avoids exponential blow-up in state dimension in the order of $\Theta(2^n)$ as in the previous works, and to the best of our knowledge, this is the first paper to avoid exponential blow-up in dimension for stabilizing LTI systems with noise.
Submitted 31 May, 2024;
originally announced June 2024.
-
An Efficient Network with Novel Quantization Designed for Massive MIMO CSI Feedback
Authors:
Xinran Sun,
Zhengming Zhang,
Luxi Yang
Abstract:
The efficacy of massive multiple-input multiple-output (MIMO) techniques heavily relies on the accuracy of channel state information (CSI) in frequency division duplexing (FDD) systems. Many works focus on CSI compression and quantization methods to enhance CSI reconstruction accuracy with lower feedback overhead. In this letter, we propose CsiConformer, a novel CSI feedback network that combines convolutional operations and self-attention mechanisms to improve CSI feedback accuracy. Additionally, a new quantization module is developed to improve encoding efficiency. Experiment results show that CsiConformer outperforms previous state-of-the-art networks, achieving an average accuracy improvement of 17.67% with lower computational overhead.
Submitted 30 May, 2024;
originally announced May 2024.
-
Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare
Authors:
Hanwei Zhu,
Haoning Wu,
Yixuan Li,
Zicheng Zhang,
Baoliang Chen,
Lingyu Zhu,
Yuming Fang,
Guangtao Zhai,
Weisi Lin,
Shiqi Wang
Abstract:
While recent advancements in large multimodal models (LMMs) have significantly improved their abilities in image quality assessment (IQA) relying on absolute quality rating, how to transfer reliable relative quality comparison outputs to continuous perceptual quality scores remains largely unexplored. To address this gap, we introduce Compare2Score, an all-around LMM-based no-reference IQA (NR-IQA) model that is capable of producing qualitatively comparative responses and effectively translating these discrete comparative levels into a continuous quality score. Specifically, during training, we propose generating scaled-up comparative instructions by comparing images from the same IQA dataset, allowing for more flexible integration of diverse IQA datasets. Utilizing the established large-scale training corpus, we develop a human-like visual quality comparator. During inference, moving beyond binary choices, we propose a soft comparison method that calculates the likelihood of the test image being preferred over multiple predefined anchor images. The quality score is further optimized by maximum a posteriori estimation with the resulting probability matrix. Extensive experiments on nine IQA datasets validate that Compare2Score effectively bridges text-defined comparative levels during training with the converted single-image quality score for inference, surpassing state-of-the-art IQA models across diverse scenarios. Moreover, we verify that the probability-matrix-based inference conversion improves the rating accuracy not only of Compare2Score but also of zero-shot general-purpose LMMs, suggesting its intrinsic effectiveness.
Submitted 29 May, 2024;
originally announced May 2024.
-
Multiscale Spatio-Temporal Enhanced Short-term Load Forecasting of Electric Vehicle Charging Stations
Authors:
Zongbao Zhang,
Jiao Hao,
Wenmeng Zhao,
Yan Liu,
Yaohui Huang,
Xinhang Luo
Abstract:
The rapid expansion of electric vehicles (EVs) has rendered the load forecasting of electric vehicle charging stations (EVCS) increasingly critical. The primary challenge in achieving precise load forecasting for EVCS lies in accounting for the nonlinearity of charging behaviors, the spatial interactions among different stations, and the intricate temporal variations in usage patterns. To address these challenges, we propose a Multiscale Spatio-Temporal Enhanced Model (MSTEM) for effective load forecasting at EVCS. MSTEM incorporates a multiscale graph neural network to discern hierarchical nonlinear temporal dependencies across various time scales. It also integrates a recurrent learning component and a residual fusion mechanism, enhancing its capability to accurately capture spatial and temporal variations in charging patterns. The effectiveness of the proposed MSTEM has been validated through comparative analysis with six baseline models using three evaluation metrics. The case studies utilize real-world datasets for both fast and slow charging loads at EVCS in Perth, UK. The experimental results demonstrate the superiority of MSTEM in short-term continuous load forecasting for EVCS.
Submitted 29 May, 2024;
originally announced May 2024.
-
VBIM-Net: Variational Born Iterative Network for Inverse Scattering Problems
Authors:
Ziqing Xing,
Zhaoyang Zhang,
Zirui Chen,
Yusong Wang,
Haoran Ma,
Zhun Wei,
Gang Bao
Abstract:
Recently, studies have shown the potential of integrating field-type iterative methods with deep learning (DL) techniques in solving inverse scattering problems (ISPs). In this article, we propose a novel Variational Born Iterative Network, namely, VBIM-Net, to solve the full-wave ISPs with significantly improved flexibility and inversion quality. The proposed VBIM-Net emulates the alternating updates of the total electric field and the contrast in the variational Born iterative method (VBIM) by multiple layers of subnetworks. We embed the calculation of the contrast variation into each of the subnetworks, converting the scattered field residual into an approximate contrast variation and then enhancing it by a U-Net, thus avoiding the requirement of matched measurement dimension and grid resolution as in existing approaches. The total field and contrast of each layer's output are supervised in the loss function of VBIM-Net, which guarantees the physical interpretability of variables of the subnetworks. In addition, we design a training scheme with extra noise to enhance the model's stability. Extensive numerical results on synthetic and experimental data both verify the inversion quality, generalization ability, and robustness of the proposed VBIM-Net. This work may provide some new inspiration for the design of efficient field-type DL schemes.
Submitted 28 May, 2024;
originally announced May 2024.
-
UniCompress: Enhancing Multi-Data Medical Image Compression with Knowledge Distillation
Authors:
Runzhao Yang,
Yinda Chen,
Zhihong Zhang,
Xiaoyu Liu,
Zongren Li,
Kunlun He,
Zhiwei Xiong,
Jinli Suo,
Qionghai Dai
Abstract:
In the field of medical image compression, Implicit Neural Representation (INR) networks have shown remarkable versatility due to their flexible compression ratios, yet they are constrained by a one-to-one fitting approach that results in lengthy encoding times. Our novel method, UniCompress, innovatively extends the compression capabilities of INR by being the first to compress multiple medical data blocks using a single INR network. By employing wavelet transforms and quantization, we introduce a codebook containing frequency domain information as a prior input to the INR network. This enhances the representational power of INR and provides distinctive conditioning for different image blocks. Furthermore, our research introduces a new technique for the knowledge distillation of implicit representations, simplifying complex model knowledge into more manageable formats to improve compression ratios. Extensive testing on CT and electron microscopy (EM) datasets has demonstrated that UniCompress outperforms traditional INR methods and commercial compression solutions like HEVC, especially in complex and high compression scenarios. Notably, compared to existing INR techniques, UniCompress achieves a 4 to 5 times increase in compression speed, marking a significant advancement in the field of medical image compression. Codes will be publicly available.
Submitted 27 May, 2024;
originally announced May 2024.
-
Quality-aware Masked Diffusion Transformer for Enhanced Music Generation
Authors:
Chang Li,
Ruoyu Wang,
Lijuan Liu,
Jun Du,
Yixuan Sun,
Zilu Guo,
Zhenrong Zhang,
Yuan Jiang
Abstract:
In recent years, diffusion-based text-to-music (TTM) generation has gained prominence, offering a novel approach to synthesizing musical content from textual descriptions. Achieving high accuracy and diversity in this generation process requires extensive, high-quality data, which often constitutes only a fraction of available datasets. Within open-source datasets, the prevalence of issues like mislabeling, weak labeling, unlabeled data, and low-quality music waveforms significantly hampers the development of music generation models. To overcome these challenges, we introduce a novel quality-aware masked diffusion transformer (QA-MDT) approach that enables generative models to discern the quality of the input music waveform during training. Building on the unique properties of musical signals, we have adapted and implemented an MDT model for the TTM task, while further unveiling its distinct capacity for quality control. Moreover, we address the issue of low-quality captions with a caption refinement data processing approach. Our demo page is available at https://qa-mdt.github.io/. Code is available at https://github.com/ivcylc/qa-mdt
Submitted 24 May, 2024;
originally announced May 2024.
-
HiddenSpeaker: Generate Imperceptible Unlearnable Audios for Speaker Verification System
Authors:
Zhisheng Zhang,
Pengyang Huang
Abstract:
In recent years, the remarkable advancements in deep neural networks have brought tremendous convenience. However, the training process of a highly effective model necessitates a substantial quantity of samples, which brings huge potential threats, like unauthorized exploitation with privacy leakage. In response, we propose a framework named HiddenSpeaker, embedding imperceptible perturbations within the training speech samples and rendering them unlearnable for deep-learning-based speaker verification systems that employ large-scale speakers for efficient training. HiddenSpeaker utilizes a simplified error-minimizing method named Single-Level Error-Minimizing (SLEM) to generate specific and effective perturbations. Additionally, a hybrid objective function is employed for human perceptual optimization, ensuring the perturbation is indistinguishable to human listeners. We conduct extensive experiments on multiple state-of-the-art (SOTA) models in the speaker verification domain to evaluate HiddenSpeaker. Our results demonstrate that HiddenSpeaker not only deceives the model with unlearnable samples but also enhances the imperceptibility of the perturbations, showcasing strong transferability across different models.
Submitted 26 May, 2024; v1 submitted 24 May, 2024;
originally announced May 2024.
-
Spatial-aware Attention Generative Adversarial Network for Semi-supervised Anomaly Detection in Medical Image
Authors:
Zerui Zhang,
Zhichao Sun,
Zelong Liu,
Bo Du,
Rui Yu,
Zhou Zhao,
Yongchao Xu
Abstract:
Medical anomaly detection is a critical research area aimed at recognizing abnormal images to aid in diagnosis. Most existing methods adopt synthetic anomalies and image restoration on normal samples to detect anomalies. The unlabeled data consisting of both normal and abnormal data is not well explored. We introduce a novel Spatial-aware Attention Generative Adversarial Network (SAGAN) for one-class semi-supervised generation of healthy images. Our core insight is the utilization of position encoding and attention to accurately focus on restoring abnormal regions and preserving normal regions. To fully utilize the unlabeled data, SAGAN relaxes the cyclic consistency requirement of the existing unpaired image-to-image conversion methods, and generates high-quality healthy images corresponding to unlabeled data, guided by the reconstruction of normal images and restoration of pseudo-anomaly images. Subsequently, the discrepancy between the generated healthy image and the original image is utilized as an anomaly score. Extensive experiments on three medical datasets demonstrate that the proposed SAGAN outperforms the state-of-the-art methods.
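The final scoring step described in this abstract can be illustrated with a minimal sketch. The abstract does not say which discrepancy measure is used, so the mean absolute pixel difference below is an assumption chosen for illustration, and the function name `anomaly_score` is hypothetical rather than taken from the paper.

```python
def anomaly_score(original, generated):
    """Mean absolute pixel difference between an image and its generated
    healthy counterpart. The choice of metric is an assumption; the abstract
    only states that a discrepancy between the two images is used."""
    if len(original) != len(generated) or any(
        len(row_o) != len(row_g) for row_o, row_g in zip(original, generated)
    ):
        raise ValueError("images must have the same shape")
    total = sum(
        abs(a - b)
        for row_o, row_g in zip(original, generated)
        for a, b in zip(row_o, row_g)
    )
    count = sum(len(row) for row in original)
    return total / count


# A generated image identical to the input yields a score of zero;
# a larger difference yields a larger score.
same = anomaly_score([[0.2, 0.4], [0.6, 0.8]], [[0.2, 0.4], [0.6, 0.8]])
different = anomaly_score([[1.0, 0.0]], [[0.0, 0.0]])
```

Under this assumed metric, an image whose generated healthy version differs more from the input receives a higher score.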
Submitted 21 May, 2024;
originally announced May 2024.
-
Large-Scale Multi-Center CT and MRI Segmentation of Pancreas with Deep Learning
Authors:
Zheyuan Zhang,
Elif Keles,
Gorkem Durak,
Yavuz Taktak,
Onkar Susladkar,
Vandan Gorade,
Debesh Jha,
Asli C. Ormeci,
Alpay Medetalibeyoglu,
Lanhong Yao,
Bin Wang,
Ilkin Sevgi Isler,
Linkai Peng,
Hongyi Pan,
Camila Lopes Vendrami,
Amir Bourhani,
Yury Velichko,
Boqing Gong,
Concetto Spampinato,
Ayis Pyrros,
Pallavi Tiwari,
Derk C. F. Klatte,
Megan Engels,
Sanne Hoogenboom,
Candice W. Bolan
, et al. (13 additional authors not shown)
Abstract:
Automated volumetric segmentation of the pancreas on cross-sectional imaging is needed for diagnosis and follow-up of pancreatic diseases. While CT-based pancreatic segmentation is more established, MRI-based segmentation methods are understudied, largely due to a lack of publicly available datasets, benchmarking research efforts, and domain-specific deep learning methods. In this retrospective study, we collected a large dataset (767 scans from 499 participants) of T1-weighted (T1W) and T2-weighted (T2W) abdominal MRI series from five centers between March 2004 and November 2022. We also collected CT scans of 1,350 patients from publicly available sources for benchmarking purposes. We developed a new pancreas segmentation method, called PanSegNet, combining the strengths of nnUNet and a Transformer network with a new linear attention module enabling volumetric computation. We tested PanSegNet's accuracy in cross-modality (a total of 2,117 scans) and cross-center settings with Dice and Hausdorff distance (HD95) evaluation metrics. We used Cohen's kappa statistics for intra and inter-rater agreement evaluation and paired t-tests for volume and Dice comparisons, respectively. For segmentation accuracy, we achieved Dice coefficients of 88.3% (std: 7.2%, at case level) with CT, 85.0% (std: 7.9%) with T1W MRI, and 86.3% (std: 6.4%) with T2W MRI. There was a high correlation for pancreas volume prediction with R^2 of 0.91, 0.84, and 0.85 for CT, T1W, and T2W, respectively. We found moderate inter-observer (0.624 and 0.638 for T1W and T2W MRI, respectively) and high intra-observer agreement scores. All MRI data is made available at https://osf.io/kysnj/. Our source code is available at https://github.com/NUBagciLab/PaNSegNet.
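For readers unfamiliar with the Dice coefficient reported above, here is a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name `dice_coefficient` is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here; neither comes from the paper's code.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    pred_set = {i for i, v in enumerate(pred) if v}
    target_set = {i for i, v in enumerate(target) if v}
    if not pred_set and not target_set:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    overlap = len(pred_set & target_set)
    return 2 * overlap / (len(pred_set) + len(target_set))


# One shared foreground voxel out of 2 predicted and 1 true: 2*1 / (2+1).
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

A value of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground voxels.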
Submitted 25 May, 2024; v1 submitted 20 May, 2024;
originally announced May 2024.
-
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Authors:
Guankun Wang,
Long Bai,
Wan Jun Nah,
Jie Wang,
Zhaoxi Zhang,
Zhen Chen,
Jinlin Wu,
Mobarakol Islam,
Hongbin Liu,
Hongliang Ren
Abstract:
Recent advancements in Surgical Visual Question Answering (Surgical-VQA) and related region grounding have shown great promise for robotic and medical applications, addressing the critical need for automated methods in personalized surgical mentorship. However, existing models primarily provide simple structured answers and struggle with complex scenarios due to their limited capability in recognizing long-range dependencies and aligning multimodal information. In this paper, we introduce Surgical-LVLM, a novel personalized large vision-language model tailored for complex surgical scenarios. Leveraging the pre-trained large vision-language model and specialized Visual Perception LoRA (VP-LoRA) blocks, our model excels in understanding complex visual-language tasks within surgical contexts. In addressing the visual grounding task, we propose the Token-Interaction (TIT) module, which strengthens the interaction between the grounding module and the language responses of the Large Visual Language Model (LVLM) after projecting them into the latent space. We demonstrate the effectiveness of Surgical-LVLM on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly introduced EndoVis Conversations dataset, which sets new performance standards. Our work contributes to advancing the field of automated surgical mentorship by providing a context-aware solution.
Submitted 22 March, 2024;
originally announced May 2024.
-
Dual-Robust Integrated Sensing and Communication: Beamforming under CSI Imperfection and Location Uncertainty
Authors:
Wanting Lyu,
Songjie Yang,
Yue Xiu,
Xinyi Chen,
Zhongpei Zhang,
Chadi Assi,
Chau Yuan
Abstract:
A dual-robust design of beamforming is investigated in an integrated sensing and communication (ISAC) system. Existing research on robust ISAC waveform design, while proposing solutions to imperfect channel state information (CSI), generally depends on prior knowledge of the target's approximate location to design waveforms. This approach, however, limits the precision in sensing the target's exact location. In this paper, considering both CSI imperfection and target location uncertainty, a novel framework of joint robust optimization is proposed by maximizing the weighted sum of worst-case data rate and beampattern gain. To address this challenging problem, we propose an efficient two-layer iteration algorithm based on S-Procedure and convex hull. Finally, numerical results verify the effectiveness and performance improvement of our dual-robust algorithm, as well as the trade-off between communication and sensing performance.
Submitted 17 May, 2024;
originally announced May 2024.
-
Flexible Beamforming for Movable Antenna-Enabled Integrated Sensing and Communication
Authors:
Wanting Lyu,
Songjie Yang,
Yue Xiu,
Zhongpei Zhang,
Chadi Assi,
Chau Yuen
Abstract:
This paper investigates flexible beamforming design in an integrated sensing and communication (ISAC) network with movable antennas (MAs). A bistatic radar system is integrated into a multi-user multiple-input-single-output (MU-MISO) system, with the base station (BS) equipped with MAs. This enables array response reconfiguration by adjusting the positions of antennas. Thus, a joint beamforming and antenna position optimization problem, namely flexible beamforming, is proposed to maximize communication rate and sensing mutual information (MI). The fractional programming (FP) method is adopted to transform the non-convex objective function, and we alternately update the beamforming matrix and antenna positions. Karush-Kuhn-Tucker (KKT) conditions are employed to derive the closed-form solution of the beamforming matrix, while we propose an efficient search-based projected gradient ascent (SPGA) method to update the antenna positions. Simulation results demonstrate that MAs significantly enhance the ISAC performance when employing our proposed algorithm, achieving a 59.8% performance gain compared to fixed uniform arrays.
Submitted 16 May, 2024;
originally announced May 2024.
-
Electromagnetic Information Theory for Holographic MIMO Communications
Authors:
Li Wei,
Tierui Gong,
Chongwen Huang,
Zhaoyang Zhang,
Wei E. I. Sha,
Zhi Ning Chen,
Linglong Dai,
Merouane Debbah,
Chau Yuen
Abstract:
Holographic multiple-input multiple-output (HMIMO) utilizes a compact antenna array to form a nearly continuous aperture, thereby enabling higher capacity and more flexible configurations compared with conventional MIMO systems, making it attractive in current scientific research. Key questions naturally arise regarding the potential of HMIMO to surpass Shannon's theoretical limits and how far its capabilities can be extended. However, the traditional Shannon information theory falls short in addressing these inquiries because it only focuses on the information itself while neglecting the underlying carrier, electromagnetic (EM) waves, and environmental interactions. To bridge the gap between the theoretical analysis and the practical application for HMIMO systems, we introduce electromagnetic information theory (EIT) in this paper. This paper begins by laying the foundation for HMIMO-oriented EIT, encompassing EM wave equations and communication regions. In the context of HMIMO systems, the resultant physical limitations are presented, involving Chu's limit, Harrington's limit, Hannan's limit, and the evaluation of coupling effects. Field sampling and HMIMO-assisted oversampling are also discussed to guide the optimal HMIMO design within the EIT framework. To comprehensively depict the EM-compliant propagation process, we present the approximate and exact channel modeling approaches in near-/far-field zones. Furthermore, we discuss both traditional Shannon's information theory, employing the probabilistic method, and Kolmogorov information theory, utilizing the functional analysis, for HMIMO-oriented EIT systems.
△ Less
Submitted 25 May, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Introducing Learning Rate Adaptation CMA-ES into Rigid 2D/3D Registration for Robotic Navigation in Spine Surgery
Authors:
Zhirun Zhang,
Minheng Chen
Abstract:
The covariance matrix adaptation evolution strategy (CMA-ES) has been widely used in the field of 2D/3D registration in recent years. This optimization method exhibits exceptional robustness and usability for complex surgical scenarios. However, due to the inherently ill-posed nature of the 2D/3D registration task and the presence of numerous local minima in the landscape of similarity measures, evolution strategies often require a larger population size in each generation to ensure the stability of registration and the globality and effectiveness of the search, which makes the entire process computationally expensive. In this paper, we build a 2D/3D registration framework based on learning rate adaptation CMA-ES. The framework employs a fixed and small population size, leading to minimized runtime and optimal utilization of computing resources. We conduct experimental comparisons between the proposed framework and other intensity-based baselines using a substantial volume of synthetic data. The results suggest that our method is superior in both registration accuracy and running time. Code is available at github.com/m1nhengChen/CMAES-reg.
Submitted 16 May, 2024;
originally announced May 2024.
-
Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis
Authors:
Zeyi Zhang,
Tenglong Ao,
Yuyao Zhang,
Qingzhe Gao,
Chuan Lin,
Baoquan Chen,
Libin Liu
Abstract:
In this work, we present Semantic Gesticulator, a novel framework designed to synthesize realistic gestures accompanying speech with strong semantic correspondence. Semantically meaningful gestures are crucial for effective non-verbal communication, but such gestures often fall within the long tail of the distribution of natural human motion. The sparsity of these movements makes it challenging for deep learning-based systems, trained on moderately sized datasets, to capture the relationship between the movements and the corresponding speech semantics. To address this challenge, we develop a generative retrieval framework based on a large language model. This framework efficiently retrieves suitable semantic gesture candidates from a motion library in response to the input speech. To construct this motion library, we summarize a comprehensive list of commonly used semantic gestures based on findings in linguistics, and we collect a high-quality motion dataset encompassing both body and hand movements. We also design a novel GPT-based model with strong generalization capabilities to audio, capable of generating high-quality gestures that match the rhythm of speech. Furthermore, we propose a semantic alignment mechanism to efficiently align the retrieved semantic gestures with the GPT's output, ensuring the naturalness of the final animation. Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit, as evidenced by a comprehensive collection of examples. User studies confirm the quality and human-likeness of our results, and show that our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
Submitted 16 May, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Interpretable attributed scattering center extracted via deep unfolding
Authors:
Haodong Yang,
Zhe Zhang,
Zhongling Huang
Abstract:
Most existing sparse representation-based approaches for attributed scattering center (ASC) extraction adopt traditional iterative optimization algorithms, which suffer from lengthy computation times and limited precision. This paper presents a solution by introducing an interpretable network that can effectively and rapidly extract ASC via deep unfolding. Initially, we create a dictionary containing reliable prior knowledge and apply it to the iterative shrinkage-thresholding algorithm (ISTA). Then, we unfold ISTA into a neural network, employing it to autonomously and precisely optimize the hyperparameters. The interpretability of physics is retained by applying a dictionary with physical meaning. The experiments are conducted on multiple test sets with diverse data distributions and demonstrate the superior performance and generalizability of our method.
Submitted 14 May, 2024;
originally announced May 2024.
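As a rough illustration of the ISTA iteration that the abstract above unfolds into a network, the following minimal sketch runs plain ISTA on a synthetic sparse recovery problem. The random real-valued dictionary `D`, the regularization weight `lam`, and the iteration count are assumptions made for the example; the paper's physically meaningful ASC dictionary and its learned hyperparameters are not reproduced here.

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(D, y, lam=0.05, n_iter=500):
    """Minimise 0.5 * ||y - D x||^2 + lam * ||x||_1 with plain ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)           # gradient of the data-fit term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Hypothetical toy problem: a 3-sparse vector observed through a random dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((40, 80))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
x_true = np.zeros(80)
x_true[[5, 30, 62]] = [1.5, -2.0, 1.0]
y = D @ x_true
x_hat = ista(D, y)
```

In a deep-unfolded variant, each loop iteration would become one network layer, with the step size `1 / L` and the threshold `lam / L` replaced by trainable per-layer parameters.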
-
Enhancing Blind Video Quality Assessment with Rich Quality-aware Features
Authors:
Wei Sun,
Haoning Wu,
Zicheng Zhang,
Jun Jia,
Zhichao Zhang,
Linhan Cao,
Qiubo Chen,
Xiongkuo Min,
Weisi Lin,
Guangtao Zhai
Abstract:
In this paper, we present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos. Motivated by previous research that leverages pre-trained features extracted from various computer vision models as the feature representation for BVQA, we further explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features to help the BVQA model handle the complex distortions and diverse content of social media videos. Specifically, we use SimpleVQA, a BVQA model that consists of a trainable Swin Transformer-B and a fixed SlowFast, as our base model. The Swin Transformer-B and SlowFast components are responsible for extracting spatial and motion features, respectively. Then, we extract three kinds of features from Q-Align, LIQE, and FAST-VQA to capture frame-level quality-aware features, frame-level quality-aware along with scene-specific features, and spatiotemporal quality-aware features, respectively. After concatenating these features, we employ a multi-layer perceptron (MLP) network to regress them into quality scores. Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets. Moreover, the proposed model won first place in the CVPR NTIRE 2024 Short-form UGC Video Quality Assessment Challenge. The code is available at https://github.com/sunwei925/RQ-VQA.git.
Submitted 14 May, 2024;
originally announced May 2024.
-
Robust Beamforming with Gradient-based Liquid Neural Network
Authors:
Xinquan Wang,
Fenghao Zhu,
Chongwen Huang,
Ahmed Alhammadi,
Faouzi Bader,
Zhaoyang Zhang,
Chau Yuen,
Merouane Debbah
Abstract:
Millimeter-wave (mmWave) multiple-input multiple-output (MIMO) communication with advanced beamforming technologies is a key enabler to meet the growing demands of future mobile communication. However, the dynamic nature of cellular channels in large-scale urban mmWave MIMO communication scenarios brings substantial challenges, particularly in terms of complexity and robustness. To address these issues, we propose a robust gradient-based liquid neural network (GLNN) framework that utilizes ordinary differential equation-based liquid neurons to solve the beamforming problem. Specifically, our proposed GLNN framework takes gradients of the optimization objective function as inputs to extract the high-order channel feature information, and then introduces a residual connection to mitigate the training burden. Furthermore, we use the manifold learning technique to compress the search space of the beamforming problem. These designs enable the GLNN to effectively maintain low complexity while ensuring strong robustness to noisy and highly dynamic channels. Extensive simulation results demonstrate that the GLNN can achieve 4.15% higher spectral efficiency than typical iterative algorithms, and reduce the time consumption to only 1.61% of that of conventional methods.
Submitted 17 May, 2024; v1 submitted 12 May, 2024;
originally announced May 2024.
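The abstract above refers to ordinary differential equation-based liquid neurons without giving their equations. The sketch below assumes the commonly used liquid time-constant form, dx/dt = -(1/tau + f) * x + f * A with a sigmoid gate f, and advances it with a semi-implicit (fused) Euler step. All weights, sizes, and the step size `dt` are hypothetical, and the gradient inputs, residual connection, and manifold compression of GLNN are not modeled.

```python
import numpy as np

def ltc_fused_step(x, u, W_rec, W_in, b, A, tau, dt):
    """One semi-implicit Euler step of dx/dt = -(1/tau + f) * x + f * A,
    where f = sigmoid(W_rec x + W_in u + b) depends on state and input."""
    f = 1.0 / (1.0 + np.exp(-(W_rec @ x + W_in @ u + b)))
    return (x + dt * f * A) / (1.0 + dt * (1.0 / tau + f))

# Hypothetical sizes and randomly drawn parameters, for illustration only.
rng = np.random.default_rng(0)
n_hidden, n_in = 8, 4
W_rec = 0.5 * rng.standard_normal((n_hidden, n_hidden))
W_in = 0.5 * rng.standard_normal((n_hidden, n_in))
b = np.zeros(n_hidden)
A = rng.uniform(-1.0, 1.0, n_hidden)       # per-neuron bias (reversal) term
tau = np.ones(n_hidden)                    # base time constants
x = np.zeros(n_hidden)
for u in rng.standard_normal((20, n_in)):  # a short random input sequence
    x = ltc_fused_step(x, u, W_rec, W_in, b, A, tau, dt=0.1)
```

Because the gate `f` multiplies the state in the decay term, the effective time constant changes with the input, and the fused update keeps the state bounded by the largest magnitude in `A`.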
-
MDNet: Multi-Decoder Network for Abdominal CT Organs Segmentation
Authors:
Debesh Jha,
Nikhil Kumar Tomar,
Koushik Biswas,
Gorkem Durak,
Matthew Antalek,
Zheyuan Zhang,
Bin Wang,
Md Mostafijur Rahman,
Hongyi Pan,
Alpay Medetalibeyoglu,
Yury Velichko,
Daniela Ladner,
Amir Borhani,
Ulas Bagci
Abstract:
Accurate segmentation of organs from abdominal CT scans is essential for clinical applications such as diagnosis, treatment planning, and patient monitoring. To handle challenges of heterogeneity in organ shapes, sizes, and complex anatomical relationships, we propose the Multi-Decoder Network (MDNet), an encoder-decoder network that uses the pre-trained MiT-B2 as the encoder and multiple different decoder networks. Each decoder network is connected to a different part of the encoder via a multi-scale feature enhancement dilated block. With each decoder, we increase the depth of the network iteratively and refine segmentation masks, enriching feature maps by integrating previous decoders' feature maps. To refine the feature map further, we also pass the predicted masks from the previous decoder to the current decoder to provide spatial attention across foreground and background regions. MDNet effectively refines the segmentation mask with a high dice similarity coefficient (DSC) of 0.9013 and 0.9169 on the Liver Tumor segmentation (LiTS) and MSD Spleen datasets, respectively. Additionally, it reduces Hausdorff distance (HD) to 3.79 for the LiTS dataset and 2.26 for the spleen segmentation dataset, underscoring the precision of MDNet in capturing the complex contours. Moreover, MDNet is more interpretable and robust compared to the other baseline models.
Submitted 9 May, 2024;
originally announced May 2024.
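For readers unfamiliar with the two metrics reported in the abstract above, here is a minimal sketch of the dice similarity coefficient and a brute-force symmetric Hausdorff distance on binary 2D masks. The toy masks are hypothetical, distances are in pixels, and the paper's exact evaluation protocol (for example 3D volumes, physical units, or a percentile variant of HD) may differ.

```python
import numpy as np

def dice_coefficient(pred, target):
    """DSC = 2 * |P intersect T| / (|P| + |T|) for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0                          # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, target).sum() / total

def hausdorff_distance(pred, target):
    """Symmetric Hausdorff distance between foreground pixel sets (both non-empty)."""
    p, t = np.argwhere(pred), np.argwhere(target)
    d = np.linalg.norm(p[:, None, :] - t[None, :, :], axis=-1)  # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Hypothetical masks: two 4x4 squares offset by one row.
pred = np.zeros((8, 8), dtype=int)
target = np.zeros((8, 8), dtype=int)
pred[2:6, 2:6] = 1
target[3:7, 2:6] = 1
dsc = dice_coefficient(pred, target)        # 12 shared pixels out of 16 + 16
hd = hausdorff_distance(pred, target)
```

The brute-force distance matrix is quadratic in the number of foreground pixels, so it is only suitable for small masks like these.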
-
Cross-IQA: Unsupervised Learning for Image Quality Assessment
Authors:
Zhen Zhang
Abstract:
Automatic perception of image quality is a challenging problem that impacts billions of Internet and social media users daily. To advance research in this field, we propose a no-reference image quality assessment (NR-IQA) method termed Cross-IQA, based on the vision transformer (ViT) model. The proposed Cross-IQA method can learn image quality features from unlabeled image data. We construct a pretext task of synthesized image reconstruction to extract image quality information in an unsupervised manner using ViT blocks. The pretrained encoder of Cross-IQA is used to fine-tune a linear regression model for score prediction. Experimental results show that Cross-IQA can achieve state-of-the-art performance in assessing the low-frequency degradation information (e.g., color change, blurring, etc.) of images compared with the classical full-reference IQA and NR-IQA methods on the same datasets.
Submitted 7 May, 2024;
originally announced May 2024.
-
Adaptive Speech Emotion Representation Learning Based On Dynamic Graph
Authors:
Yingxue Gao,
Huan Zhao,
Zixing Zhang
Abstract:
Graph representation learning has become a hot research topic due to its powerful nonlinear fitting capability in extracting representative node embeddings. However, for sequential data such as speech signals, most traditional methods merely focus on the static graph created within a sequence, and largely overlook the intrinsic evolving patterns of these data. This may reduce the efficiency of graph representation learning for sequential data. For this reason, we propose an adaptive graph representation learning method based on dynamically evolved graphs, which are consecutively constructed on a series of subsequences segmented by a sliding window. In doing so, the method better captures local and global context information within a long sequence. Moreover, we introduce a weighted approach to update the node representation rather than the conventional average one, where the weights are calculated by a novel matrix computation based on the degree of neighboring nodes. Finally, we construct a learnable graph convolutional layer that combines the graph structure loss and classification loss to optimize the graph structure. To verify the effectiveness of the proposed method, we conducted experiments for speech emotion recognition on the IEMOCAP and RAVDESS datasets. Experimental results show that the proposed method outperforms the latest (non-)graph-based models.
Submitted 6 May, 2024;
originally announced May 2024.
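The sliding-window segmentation and degree-based weighting described in the abstract above can be sketched roughly as follows. The window and hop sizes, the chain-shaped adjacency, and the GCN-style normalization D^{-1/2} (A + I) D^{-1/2} are all stand-in assumptions; the abstract does not specify the paper's own weight computation or graph construction.

```python
import numpy as np

def sliding_windows(frames, win, hop):
    """Split (T, F) frame features into overlapping (win, F) subsequences."""
    starts = range(0, frames.shape[0] - win + 1, hop)
    return np.stack([frames[s:s + win] for s in starts])

def degree_normalised_update(X, A):
    """Aggregate neighbours with weights D^{-1/2} (A + I) D^{-1/2} (stand-in weighting)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    W = A_hat / np.sqrt(np.outer(d, d))
    return W @ X

# Hypothetical input: 10 frames with 2 features each.
frames = np.arange(20, dtype=float).reshape(10, 2)
windows = sliding_windows(frames, win=4, hop=2)          # one graph per window
chain = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1) # link consecutive frames
updated = degree_normalised_update(windows[0], chain)
```

Each window yields its own small graph, so a long utterance becomes a sequence of graphs whose node features can be updated window by window.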
-
Intelligent Cardiac Auscultation for Murmur Detection via Parallel-Attentive Models with Uncertainty Estimation
Authors:
Zixing Zhang,
Tao Pang,
**g Han,
Björn W. Schuller
Abstract:
Heart murmurs are a common manifestation of cardiovascular diseases and can provide crucial clues to early cardiac abnormalities. While most current research methods primarily focus on the accuracy of models, they often overlook other important aspects such as the interpretability of machine learning algorithms and the uncertainty of predictions. This paper introduces a heart murmur detection method based on a parallel-attentive model, which consists of two branches: One is based on a self-attention module and the other one is based on a convolutional network. Unlike traditional approaches, this structure is better equipped to handle long-term dependencies in sequential data, and thus effectively captures the local and global features of heart murmurs. Additionally, we acknowledge the significance of understanding the uncertainty of model predictions in the medical field for clinical decision-making. Therefore, we have incorporated an effective uncertainty estimation method based on Monte Carlo Dropout into our model. Furthermore, we have employed temperature scaling to calibrate the predictions of our probabilistic model, enhancing its reliability. In experiments conducted on the CirCor Digiscope dataset for heart murmur detection, our proposed method achieves a weighted accuracy of 79.8% and an F1 of 65.1%, representing state-of-the-art results.
Submitted 6 May, 2024;
originally announced May 2024.
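To make the two uncertainty tools named in the abstract above concrete, the sketch below applies Monte Carlo Dropout and temperature scaling to a toy two-layer classifier. The network, its random weights, the dropout rate, the number of samples, and the temperature value are hypothetical; the paper's parallel-attentive model is not reproduced, and in practice the temperature would be fitted on a validation set.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax over the last axis after dividing the logits by a temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, W1, W2, p_drop, n_samples, temperature, rng):
    """Keep dropout active at inference and average the sampled predictions."""
    probs = []
    for _ in range(n_samples):
        h = np.maximum(x @ W1, 0.0)                  # ReLU hidden layer
        keep = rng.random(h.shape) >= p_drop         # fresh dropout mask per pass
        h = h * keep / (1.0 - p_drop)                # inverted dropout scaling
        probs.append(softmax(h @ W2, temperature))
    mean_probs = np.mean(probs, axis=0)
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=-1)
    return mean_probs, entropy

# Hypothetical toy model: 16 input features, 32 hidden units, 3 classes.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
W1 = 0.3 * rng.standard_normal((16, 32))
W2 = 0.3 * rng.standard_normal((32, 3))
mean_probs, entropy = mc_dropout_predict(x, W1, W2, 0.5, 50, 1.5, rng)
```

The predictive entropy of the averaged probabilities serves as a simple per-sample uncertainty score, while a temperature above 1 softens overconfident predictions.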
-
HAFFormer: A Hierarchical Attention-Free Framework for Alzheimer's Disease Detection From Spontaneous Speech
Authors:
Zhongren Dong,
Zixing Zhang,
Weixiang Xu,
**g Han,
Jianjun Ou,
Björn W. Schuller
Abstract:
Automatically detecting Alzheimer's Disease (AD) from spontaneous speech plays an important role in its early diagnosis. Recent approaches rely heavily on Transformer architectures due to their efficiency in modelling long-range context dependencies. However, the quadratic increase in the computational complexity of self-attention with the length of the audio poses a challenge when deploying such models on edge devices. In this context, we construct a novel framework, namely Hierarchical Attention-Free Transformer (HAFFormer), to better deal with long speech for AD detection. Specifically, we employ an attention-free module of Multi-Scale Depthwise Convolution to replace the self-attention and thus avoid the expensive computation, and a GELU-based Gated Linear Unit to replace the feedforward layer, aiming to automatically filter out the redundant information. Moreover, we design a hierarchical structure to force it to learn a variety of information grains, from the frame level to the dialogue level. In extensive experiments on the ADReSS-M dataset, the introduced HAFFormer achieves results competitive with other recent work (82.6% accuracy), while significantly reducing computational complexity and model size compared to the standard Transformer. This shows the efficiency of HAFFormer in dealing with long audio for AD detection.
Submitted 6 May, 2024;
originally announced May 2024.
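The two building blocks named in the abstract above can be sketched as follows: a multi-scale depthwise convolution over time as the token mixer, and a GELU-gated linear unit in place of the feedforward layer. The kernel sizes (3, 5, 7), the summation of the scales, the tanh approximation of GELU, and all dimensions and weights are assumptions for illustration; the paper's actual layer layout is not specified in the abstract.

```python
import numpy as np

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def depthwise_conv1d(x, kernel):
    """x: (T, C), kernel: (K, C). Each channel is filtered independently, 'same' padding."""
    k = kernel.shape[0]
    left = k // 2
    xp = np.pad(x, ((left, k - 1 - left), (0, 0)))
    return np.stack([(xp[t:t + k] * kernel).sum(axis=0) for t in range(x.shape[0])])

def multi_scale_mixer(x, kernels):
    """Sum depthwise convolutions with different kernel sizes (assumed combination)."""
    return sum(depthwise_conv1d(x, k) for k in kernels)

def geglu(x, W_gate, W_value, W_out):
    """Gated linear unit with a GELU gate: (GELU(x W_gate) * (x W_value)) W_out."""
    return (gelu(x @ W_gate) * (x @ W_value)) @ W_out

# Hypothetical sizes: 12 frames, 6 channels, hidden width 10.
rng = np.random.default_rng(0)
T, C, H = 12, 6, 10
x = rng.standard_normal((T, C))
kernels = [0.1 * rng.standard_normal((k, C)) for k in (3, 5, 7)]
mixed = multi_scale_mixer(x, kernels)
out = geglu(mixed, rng.standard_normal((C, H)),
            rng.standard_normal((C, H)), rng.standard_normal((H, C)))
```

The mixer's cost grows linearly with the sequence length `T`, in contrast to the quadratic cost of self-attention that the abstract cites as the motivation.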
-
Active RIS-Aided Massive MIMO With Imperfect CSI and Phase Noise
Authors:
Zhangjie Peng,
Jianchen Zhu,
Cunhua Pan,
Zaichen Zhang,
Daniel Benevides da Costa,
Maged Elkashlan,
George K. Karagiannidis
Abstract:
Active reconfigurable intelligent surface (RIS) has attracted significant attention as a recently proposed RIS architecture. Owing to its capability to amplify the incident signals, active RIS can mitigate the multiplicative fading effect inherent in the passive RIS-aided system. In this paper, we consider an active RIS-aided uplink multi-user massive multiple-input multiple-output (MIMO) system in the presence of phase noise at the active RIS. Specifically, we employ a two-timescale scheme, where the beamforming at the base station (BS) is adjusted based on the instantaneous aggregated channel state information (CSI) and the statistical CSI serves as the basis for designing the phase shifts at the active RIS, so that the feedback overhead and computational complexity can be significantly reduced. The aggregated channel composed of the cascaded and direct channels is estimated by utilizing the linear minimum mean square error (LMMSE) technique. Based on the estimated channel, we derive the analytical closed-form expression of a lower bound of the achievable rate. The power scaling laws in the active RIS-aided system are investigated based on the theoretical expressions. When the transmit power of each user is scaled down by the number of BS antennas M or reflecting elements N, we find that the thermal noise will cause the lower bound of the achievable rate to approach zero as M or N increases to infinity. Moreover, an optimization approach based on genetic algorithms (GA) is introduced to tackle the phase shift optimization problem. Numerical results reveal that the active RIS can greatly enhance the performance of the considered system under various settings.
Submitted 6 May, 2024;
originally announced May 2024.
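As a reminder of what the LMMSE technique mentioned in the abstract above computes, the sketch below estimates a channel vector h from a noisy observation y = h + n using the standard estimator R (R + sigma^2 I)^{-1} y. The exponential correlation matrix `R`, the noise variance, and the direct-observation model are simplifying assumptions; the paper's pilot structure, cascaded RIS channel, and phase-noise terms are not modeled.

```python
import numpy as np

def lmmse_estimate(y, R, noise_var):
    """LMMSE estimate of h from y = h + n, with E[h h^H] = R and E[n n^H] = noise_var * I."""
    n_dim = R.shape[0]
    return R @ np.linalg.solve(R + noise_var * np.eye(n_dim), y)

# Hypothetical 8-antenna channel with exponential spatial correlation.
N, rho, noise_var = 8, 0.9, 0.5
idx = np.arange(N)
R = rho ** np.abs(idx[:, None] - idx[None, :])
rng = np.random.default_rng(1)
L_chol = np.linalg.cholesky(R)
h = L_chol @ (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
n = np.sqrt(noise_var / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
y = h + n
h_hat = lmmse_estimate(y, R, noise_var)

# Theoretical error of the LMMSE estimate versus using y directly.
mse_lmmse = np.trace(R - R @ np.linalg.solve(R + noise_var * np.eye(N), R))
mse_raw = N * noise_var
```

Using `np.linalg.solve` avoids forming the matrix inverse explicitly, and the trace expression shows how much of the noise the channel correlation lets the estimator remove.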
-
Interleave Frequency Division Multiplexing
Authors:
Yuhao Chi,
Lei Liu,
Yao Ge,
Xuehui Chen,
Ying Li,
Zhaoyang Zhang
Abstract:
In this letter, we study interleave frequency division multiplexing (IFDM) for multicarrier modulation in static multipath and mobile time-varying channels, which outperforms orthogonal frequency division multiplexing (OFDM), orthogonal time frequency space (OTFS), and affine frequency division multiplexing (AFDM) by considering practical advanced detectors. The fundamental principle underlying existing modulation techniques is to establish sparse equivalent channel matrices in order to facilitate the design of low-complexity detection algorithms for signal recovery, making a trade-off between performance and implementation complexity. In contrast, the proposed IFDM establishes an equivalent fully dense and right-unitarily invariant channel matrix with the goal of achieving channel capacity, ensuring that the signals undergo sufficient statistical channel fading. Meanwhile, a low-complexity and replica maximum a posteriori (MAP)-optimal cross-domain memory approximate message passing (CD-MAMP) detector is proposed for IFDM by exploiting the sparsity of the time-domain channel and the unitary invariance in interleave-frequency-domain channel. Numerical results show that IFDM with extremely low-complexity CD-MAMP outperforms OFDM, OTFS, and AFDM with state-of-the-art orthogonal approximate message passing detectors, particularly at low velocities.
Submitted 4 May, 2024;
originally announced May 2024.
-
PAM-UNet: Shifting Attention on Region of Interest in Medical Images
Authors:
Abhijit Das,
Debesh Jha,
Vandan Gorade,
Koushik Biswas,
Hongyi Pan,
Zheyuan Zhang,
Daniela P. Ladner,
Yury Velichko,
Amir Borhani,
Ulas Bagci
Abstract:
Computer-aided segmentation methods can assist medical personnel in improving diagnostic outcomes. While recent advancements like UNet and its variants have shown promise, they face a critical challenge: balancing accuracy with computational efficiency. Shallow encoder architectures in UNets often struggle to capture crucial spatial features, leading to inaccurate and sparse segmentation. To address this limitation, we propose a novel Progressive Attention based Mobile UNet (PAM-UNet) architecture. The inverted residual (IR) blocks in PAM-UNet help maintain a lightweight framework, while layerwise Progressive Luong Attention (PLA) promotes precise segmentation by directing attention toward regions of interest during synthesis. Our approach prioritizes both accuracy and speed, achieving a commendable balance with a mean IoU of 74.65 and a dice score of 82.87, while requiring only 1.32 floating-point operations per second (FLOPS) on the Liver Tumor Segmentation Benchmark (LiTS) 2017 dataset. These results highlight the importance of developing efficient segmentation models to accelerate the adoption of AI in clinical practice.
Submitted 2 May, 2024;
originally announced May 2024.
-
Joint Signal Detection and Automatic Modulation Classification via Deep Learning
Authors:
Huijun Xing,
Xuhui Zhang,
Shuo Chang,
**ke Ren,
Zixun Zhang,
Jie Xu,
Shuguang Cui
Abstract:
Signal detection and modulation classification are two crucial tasks in various wireless communication systems. Different from prior works that investigate them independently, this paper studies the joint signal detection and automatic modulation classification (AMC) by considering a realistic and complex scenario, in which multiple signals with different modulation schemes coexist at different carrier frequencies. We first generate a coexisting RADIOML dataset (CRML23) to facilitate the joint design. Different from the publicly available AMC dataset ignoring the signal detection step and containing only one signal, our synthetic dataset covers the more realistic multiple-signal coexisting scenario. Then, we present a joint framework for detection and classification (JDM) for such a multiple-signal coexisting environment, which consists of two modules for signal detection and AMC, respectively. In particular, these two modules are interconnected using a designated data structure called "proposal". Finally, we conduct extensive simulations over the newly developed dataset, which demonstrate the effectiveness of our designs. Our code and dataset are now available as open-source (https://github.com/Singingkettle/ChangShuoRadioData).
Submitted 29 April, 2024;
originally announced May 2024.
-
EEG-MACS: Manifold Attention and Confidence Stratification for EEG-based Cross-Center Brain Disease Diagnosis under Unreliable Annotations
Authors:
Zhenxi Song,
Ruihan Qin,
Huixia Ren,
Zhen Liang,
Yi Guo,
Min Zhang,
Zhiguo Zhang
Abstract:
Cross-center data heterogeneity and annotation unreliability significantly challenge the intelligent diagnosis of diseases using brain signals. A notable example is the EEG-based diagnosis of neurodegenerative diseases, which features subtler abnormal neural dynamics typically observed in small-group settings. To advance this area, in this work, we introduce a transferable framework employing Manifold Attention and Confidence Stratification (MACS) to diagnose neurodegenerative disorders based on EEG signals sourced from four centers with unreliable annotations. The MACS framework's effectiveness stems from these features: 1) The Augmentor generates various EEG-represented brain variants to enrich the data space; 2) The Switcher enhances the feature space for trusted samples and reduces overfitting on incorrectly labeled samples; 3) The Encoder uses the Riemannian manifold and Euclidean metrics to capture spatiotemporal variations and dynamic synchronization in EEG; 4) The Projector, equipped with dual heads, monitors consistency across multiple brain variants and ensures diagnostic accuracy; 5) The Stratifier adaptively stratifies learned samples by confidence levels throughout the training process; 6) Forward and backpropagation in MACS are constrained by confidence stratification to stabilize the learning system amid unreliable annotations. Our subject-independent experiments, conducted on both neurocognitive and movement disorders using cross-center corpora, have demonstrated superior performance compared to existing related algorithms. This work not only improves EEG-based diagnostics for cross-center and small-setting brain diseases but also offers insights into extending MACS techniques to other data analyses, tackling data heterogeneity and annotation unreliability in multimedia and multimodal content understanding.
Submitted 29 April, 2024;
originally announced May 2024.
-
Beamforming Inferring by Conditional WGAN-GP for Holographic Antenna Arrays
Authors:
Fenghao Zhu,
Xinquan Wang,
Chongwen Huang,
Ahmed Alhammadi,
Hui Chen,
Zhaoyang Zhang,
Chau Yuen,
Mérouane Debbah
Abstract:
The beamforming technology with large holographic antenna arrays is one of the key enablers for the next generation of wireless systems, which can significantly improve the spectral efficiency. However, the deployment of large antenna arrays implies high algorithm complexity and resource overhead at both receiver and transmitter ends. To address this issue, advanced technologies such as artificial intelligence have been developed to reduce beamforming overhead. Intuitively, if we can implement near-optimal beamforming using only a tiny subset of the channel information, the overhead for channel estimation and beamforming would be reduced significantly compared with traditional beamforming methods, which usually need full channel information and the inversion of large-dimensional matrices. In light of this idea, we propose a novel scheme that utilizes a Wasserstein generative adversarial network with gradient penalty to infer the full beamforming matrices based on very little channel information. Simulation results confirm that it can achieve performance comparable to the weighted minimum mean-square error algorithm, while reducing the overhead by over 50%.
Submitted 15 May, 2024; v1 submitted 1 May, 2024;
originally announced May 2024.
-
Robust Continuous-Time Beam Tracking with Liquid Neural Network
Authors:
Fenghao Zhu,
Xinquan Wang,
Chongwen Huang,
Richeng **,
Qianqian Yang,
Ahmed Alhammadi,
Zhaoyang Zhang,
Chau Yuen,
Mérouane Debbah
Abstract:
Millimeter-wave (mmWave) technology is increasingly recognized as a pivotal technology of the sixth-generation communication networks due to the large amounts of available spectrum at high frequencies. However, the huge overhead associated with beam training imposes a significant challenge in mmWave communications, particularly in urban environments with high background noise. To reduce this high overhead, we propose a novel solution for robust continuous-time beam tracking with a liquid neural network, which dynamically adjusts the narrow mmWave beams to ensure real-time beam alignment with mobile users. Through extensive simulations, we validate the effectiveness of our proposed method and demonstrate its superiority over existing state-of-the-art deep-learning-based approaches. Specifically, our scheme achieves at most 46.9% higher normalized spectral efficiency than the baselines when the user is moving at 5 m/s, demonstrating the potential of liquid neural networks to enhance mmWave mobile communication performance.
Submitted 1 May, 2024;
originally announced May 2024.