-
Coordinated RSMA for Integrated Sensing and Communication in Emergency UAV Systems
Authors:
Binghan Yao,
Ruoguang Li,
Yingyang Chen,
Li Wang
Abstract:
Recently, unmanned aerial vehicle (UAV)-enabled integrated sensing and communication (ISAC) is emerging as a promising technique for achieving robust and rapid emergency response capabilities. Such a novel framework offers high-quality and cost-efficient C\&S services due to the intrinsic flexibility and mobility of UAVs. In parallel, rate-splitting multiple access (RSMA) is able to achieve a tail…
▽ More
Recently, unmanned aerial vehicle (UAV)-enabled integrated sensing and communication (ISAC) is emerging as a promising technique for achieving robust and rapid emergency response capabilities. Such a novel framework offers high-quality and cost-efficient C\&S services due to the intrinsic flexibility and mobility of UAVs. In parallel, rate-splitting multiple access (RSMA) is able to achieve a tailor-made communication by splitting the messages into private and common parts with adjustable rates, making it suitable for on-demand data transmission in disaster scenarios. In this paper, we propose a coordinated RSMA for integrated sensing and communication (CoRSMA-ISAC) scheme in emergency UAV system to facilitate search and rescue operations, where a number of ISAC UAVs simultaneously communicate with multiple communication survivors (CSs) and detect a potentially trapped survivor (TS) in a coordinated manner. Towards this end, an optimization problem is formulated to maximize the weighted sum rate (WSR) of the system, subject to the sensing signal-to-noise ratio (SNR) requirement. In order to solve the formulated non-convex problem, we first decompose it into three subproblems, i.e., UAV-CS association, UAV deployment, as well as beamforming optimization and rate allocation. Subsequently, we introduce an iterative optimization approach leveraging K-Means, successive convex approximation (SCA), and semi-definite relaxation (SDR) algorithms to reframe the subproblems into a more tractable form and efficiently solve them. Simulation results demonstrate that the proposed CoRSMA-ISAC scheme is superior to conventional space division multiple access (SDMA), non-orthogonal multiple access (NOMA), and orthogonal multiple access (OMA) in terms of both communication and sensing performance.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Exploiting Structured Sparsity in Near Field: From the Perspective of Decomposition
Authors:
Xufeng Guo,
Yuanbin Chen,
Ying Wang,
Chau Yuen
Abstract:
The structured sparsity can be leveraged in traditional far-field channels, greatly facilitating efficient sparse channel recovery by compressing the complexity of overheads to the level of the scatterer number. However, when experiencing a fundamental shift from planar-wave-based far-field modeling to spherical-wave-based near-field modeling, whether these benefits persist in the near-field regim…
▽ More
The structured sparsity can be leveraged in traditional far-field channels, greatly facilitating efficient sparse channel recovery by compressing the complexity of overheads to the level of the scatterer number. However, when experiencing a fundamental shift from planar-wave-based far-field modeling to spherical-wave-based near-field modeling, whether these benefits persist in the near-field regime remains an open issue. To answer this question, this article delves into structured sparsity in the near-field realm, examining its peculiarities and challenges. In particular, we present the key features of near-field structured sparsity in contrast to the far-field counterpart, drawing from both physical and mathematical perspectives. Upon unmasking the theoretical bottlenecks, we resort to bypassing them by decoupling the geometric parameters of the scatterers, termed the triple parametric decomposition (TPD) framework. It is demonstrated that our novel TPD framework can achieve robust recovery of near-field sparse channels by applying the potential structured sparsity and avoiding the curse of complexity and overhead.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Waveform Learning under Phase Noise Impairment for Sub-THz Communications
Authors:
Dileepa Marasinghe,
Le Hang Nguyen,
Jafar Mohammadi,
Yejian Chen,
Thorsten Wild,
Nandana Rajatheva
Abstract:
The large untapped spectrum in sub-THz allows for ultra-high throughput communication to realize many seemingly impossible applications in 6G. Phase noise (PN) is one key hardware impairment, which is accentuated as we increase the frequency and bandwidth. Furthermore, the modest output power of the power amplifier demands limits on peak to average power ratio (PAPR) signal design. In this work, w…
▽ More
The large untapped spectrum in sub-THz allows for ultra-high throughput communication to realize many seemingly impossible applications in 6G. Phase noise (PN) is one key hardware impairment, which is accentuated as we increase the frequency and bandwidth. Furthermore, the modest output power of the power amplifier demands limits on peak to average power ratio (PAPR) signal design. In this work, we design a PN-robust, low PAPR single-carrier (SC) waveform by geometrically sha** the constellation and adapting the pulse sha** filter pair under practical PN modeling and adjacent channel leakage ratio (ACLR) constraints for a given excess bandwidth. We optimize the waveforms under conventional and state-of-the-art PN-aware demappers. Moreover, we introduce a neural-network (NN) demapper enhancing transceiver adaptability. We formulate the waveform optimization problem in its augmented Lagrangian form and use a back-propagation-inspired technique to obtain a design that is numerically robust to PN, while adhering to PAPR and ACLR constraints. The results substantiate the efficacy of the method, yielding up to 2.5 dB in the required Eb/N0 under stronger PN along with a PAPR reduction of 0.5 dB. Moreover, PAPR reductions up to 1.2 dB are possible with competitive BLER and SE performance in both low and high PN conditions.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
EarDA: Towards Accurate and Data-Efficient Earable Activity Sensing
Authors:
Shengzhe Lyu,
Yongliang Chen,
Di Duan,
Renqi Jia,
Weitao Xu
Abstract:
In the realm of smart sensing with the Internet of Things, earable devices are empowered with the capability of multi-modality sensing and intelligence of context-aware computing, leading to its wide usage in Human Activity Recognition (HAR). Nonetheless, unlike the movements captured by Inertial Measurement Unit (IMU) sensors placed on the upper or lower body, those motion signals obtained from e…
▽ More
In the realm of smart sensing with the Internet of Things, earable devices are empowered with the capability of multi-modality sensing and intelligence of context-aware computing, leading to its wide usage in Human Activity Recognition (HAR). Nonetheless, unlike the movements captured by Inertial Measurement Unit (IMU) sensors placed on the upper or lower body, those motion signals obtained from earable devices show significant changes in amplitudes and patterns, especially in the presence of dynamic and unpredictable head movements, posing a significant challenge for activity classification. In this work, we present EarDA, an adversarial-based domain adaptation system to extract the domain-independent features across different sensor locations. Moreover, while most deep learning methods commonly rely on training with substantial amounts of labeled data to offer good accuracy, the proposed scheme can release the potential usage of publicly available smartphone-based IMU datasets. Furthermore, we explore the feasibility of applying a filter-based data processing method to mitigate the impact of head movement. EarDA, the proposed system, enables more data-efficient and accurate activity sensing. It achieves an accuracy of 88.8% under HAR task, demonstrating a significant 43% improvement over methods without domain adaptation. This clearly showcases its effectiveness in mitigating domain gaps.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Coded Beam Training for RIS Assisted Wireless Communications
Authors:
Yuhao Chen,
Linglong Dai
Abstract:
Reconfigurable intelligent surface (RIS) is considered as one of the key technologies for future 6G communications. To fully unleash the performance of RIS, accurate channel state information (CSI) is crucial. Beam training is widely utilized to acquire the CSI. However, before aligning the beam correctly to establish stable connections, the signal-to-noise ratio (SNR) at UE is inevitably low, whi…
▽ More
Reconfigurable intelligent surface (RIS) is considered as one of the key technologies for future 6G communications. To fully unleash the performance of RIS, accurate channel state information (CSI) is crucial. Beam training is widely utilized to acquire the CSI. However, before aligning the beam correctly to establish stable connections, the signal-to-noise ratio (SNR) at UE is inevitably low, which reduces the beam training accuracy. To deal with this problem, we exploit the coded beam training framework for RIS systems, which leverages the error correction capability of channel coding to improve the beam training accuracy under low SNR. Specifically, we first extend the coded beam training framework to RIS systems by decoupling the base station-RIS channel and the RIS-user channel. For this framework, codewords that accurately steer to multiple angles is essential for fully unleashing the error correction capability. In order to realize effective codeword design in RIS systems, we then propose a new codeword design criterion, based on which we propose a relaxed Gerchberg-Saxton (GS) based codeword design scheme by considering the constant modulus constraints of RIS elements. In addition, considering the two dimensional structure of RIS, we further propose a dimension reduced encoder design scheme, which can not only guarentee a better beam shape, but also enable a stronger error correction capability. Simulation results reveal that the proposed scheme can realize effective and accurate beam training in low SNR scenarios.
△ Less
Submitted 22 June, 2024;
originally announced June 2024.
-
Improving Unsupervised Clean-to-Rendered Guitar Tone Transformation Using GANs and Integrated Unaligned Clean Data
Authors:
Yu-Hua Chen,
Woosung Choi,
Wei-Hsiang Liao,
Marco Martínez-Ramírez,
Kin Wai Cheuk,
Yuki Mitsufuji,
Jyh-Shing Roger Jang,
Yi-Hsuan Yang
Abstract:
Recent years have seen increasing interest in applying deep learning methods to the modeling of guitar amplifiers or effect pedals. Existing methods are mainly based on the supervised approach, requiring temporally-aligned data pairs of unprocessed and rendered audio. However, this approach does not scale well, due to the complicated process involved in creating the data pairs. A very recent work…
▽ More
Recent years have seen increasing interest in applying deep learning methods to the modeling of guitar amplifiers or effect pedals. Existing methods are mainly based on the supervised approach, requiring temporally-aligned data pairs of unprocessed and rendered audio. However, this approach does not scale well, due to the complicated process involved in creating the data pairs. A very recent work done by Wright et al. has explored the potential of leveraging unpaired data for training, using a generative adversarial network (GAN)-based framework. This paper extends their work by using more advanced discriminators in the GAN, and using more unpaired data for training. Specifically, drawing inspiration from recent advancements in neural vocoders, we employ in our GAN-based model for guitar amplifier modeling two sets of discriminators, one based on multi-scale discriminator (MSD) and the other multi-period discriminator (MPD). Moreover, we experiment with adding unprocessed audio signals that do not have the corresponding rendered audio of a target tone to the training data, to see how much the GAN model benefits from the unpaired data. Our experiments show that the proposed two extensions contribute to the modeling of both low-gain and high-gain guitar amplifiers.
△ Less
Submitted 22 June, 2024;
originally announced June 2024.
-
Modeling Unknown Stochastic Dynamical System Subject to External Excitation
Authors:
Yuan Chen,
Dongbin Xiu
Abstract:
We present a numerical method for learning unknown nonautonomous stochastic dynamical system, i.e., stochastic system subject to time dependent excitation or control signals. Our basic assumption is that the governing equations for the stochastic system are unavailable. However, short bursts of input/output (I/O) data consisting of certain known excitation signals and their corresponding system re…
▽ More
We present a numerical method for learning unknown nonautonomous stochastic dynamical system, i.e., stochastic system subject to time dependent excitation or control signals. Our basic assumption is that the governing equations for the stochastic system are unavailable. However, short bursts of input/output (I/O) data consisting of certain known excitation signals and their corresponding system responses are available. When a sufficient amount of such I/O data are available, our method is capable of learning the unknown dynamics and producing an accurate predictive model for the stochastic responses of the system subject to arbitrary excitation signals not in the training data. Our method has two key components: (1) a local approximation of the training I/O data to transfer the learning into a parameterized form; and (2) a generative model to approximate the underlying unknown stochastic flow map in distribution. After presenting the method in detail, we present a comprehensive set of numerical examples to demonstrate the performance of the proposed method, especially for long-term system predictions.
△ Less
Submitted 22 June, 2024;
originally announced June 2024.
-
A microwave photonic prototype for concurrent radar detection and spectrum sensing over an 8 to 40 GHz bandwidth
Authors:
Taixia Shi,
Dingding Liang,
Lu Wang,
Lin Li,
Shaogang Guo,
Jiawei Gao,
Xiaowei Li,
Chulun Lin,
Lei Shi,
Baogang Ding,
Shiyang Liu,
Fangyi Yang,
Chi Jiang,
Yang Chen
Abstract:
In this work, a microwave photonic prototype for concurrent radar detection and spectrum sensing is proposed, designed, built, and investigated. A direct digital synthesizer and an analog electronic circuit are integrated to generate an intermediate frequency (IF) linearly frequency-modulated (LFM) signal with a tunable center frequency from 2.5 to 9.5 GHz and an instantaneous bandwidth of 1 GHz.…
▽ More
In this work, a microwave photonic prototype for concurrent radar detection and spectrum sensing is proposed, designed, built, and investigated. A direct digital synthesizer and an analog electronic circuit are integrated to generate an intermediate frequency (IF) linearly frequency-modulated (LFM) signal with a tunable center frequency from 2.5 to 9.5 GHz and an instantaneous bandwidth of 1 GHz. The IF LFM signal is converted to the optical domain via an intensity modulator and then filtered by a fiber Bragg grating (FBG) to generate only two 2nd-order optical LFM sidebands. In radar detection, the two optical LFM sidebands beat with each other to generate a frequency-and-bandwidth-quadrupled LFM signal, which is used for ranging, radial velocity measurement, and imaging. By changing the center frequency of the IF LFM signal, the radar function can be operated within 8 to 40 GHz. In spectrum sensing, one 2nd-order optical LFM sideband is selected by another FBG, which then works in conjunction with the stimulated Brillouin scattering gain spectrum to map the frequency of the signal under test to time with an instantaneous measurement bandwidth of 2 GHz. By using a frequency shift module to adjust the pump frequency, the frequency measurement range can be adjusted from 0 to 40 GHz. The prototype is comprehensively studied and tested, which is capable of achieving a range resolution of 3.75 cm, a range error of less than $\pm$ 2 cm, a radial velocity error within $\pm$ 1 cm/s, delivering clear imaging of multiple small targets, and maintaining a frequency measurement error of less than $\pm$ 7 MHz and a frequency resolution of better than 20 MHz.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition
Authors:
Kuan-Chen Wang,
You-** Li,
Wei-Lun Chen,
Yu-Wen Chen,
Yi-Ching Wang,
**-Cheng Yeh,
Chao Zhang,
Yu Tsao
Abstract:
Noise robustness is critical when applying automatic speech recognition (ASR) in real-world scenarios. One solution involves the used of speech enhancement (SE) models as the front end of ASR. However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study intro…
▽ More
Noise robustness is critical when applying automatic speech recognition (ASR) in real-world scenarios. One solution involves the used of speech enhancement (SE) models as the front end of ASR. However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study introduces a simple yet effective SE post-processing technique to address the gap between various pre-trained SE and ASR models. A bridge module, which is a lightweight NN, is proposed to evaluate the signal-level information of the speech signal. Subsequently, using the signal-level information, the observation addition technique is applied to effectively reduce the shortcomings of SE. The experimental results demonstrate the success of our method in integrating diverse pre-trained SE and ASR models, considerably boosting the ASR robustness. Crucially, no prior knowledge of the ASR or speech contents is required during the training or inference stages. Moreover, the effectiveness of this approach extends to different datasets without necessitating the fine-tuning of the bridge module, ensuring efficiency and improved generalization.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Reconfigurable Intelligent Surface Equipped UAV in Emergency Wireless Communications: A New Fading-Shadowing Model and Performance Analysis
Authors:
Yinong Chen,
Wenchi Cheng,
Wei Zhang
Abstract:
Communication infrastructure is often severely disrupted in post-disaster areas, which interrupts communications and impedes rescue. Recently, the technology of reconfigurable intelligent surface (RIS)-equipped-UAV has been investigated as a feasible approach to assist communication under such conditions. However, the channel characteristics in the post-disaster area rapidly change due to the topo…
▽ More
Communication infrastructure is often severely disrupted in post-disaster areas, which interrupts communications and impedes rescue. Recently, the technology of reconfigurable intelligent surface (RIS)-equipped-UAV has been investigated as a feasible approach to assist communication under such conditions. However, the channel characteristics in the post-disaster area rapidly change due to the topographical changes caused by secondary disasters and the high mobility of UAVs. In this paper we develop a new fading-shadowing model to fit the path loss caused by the debris. Following this, we derive the exact distribution of the new channel statistics for a small number of RIS elements and the approximate distribution for a large number of RIS elements, respectively. Then, we derive the closed-form expressions for performance analysis, including average capacity (AC), energy efficiency (EE), and outage probability (OP). Based on the above analytical derivations, we maximize the energy efficiency by optimizing the number of RIS elements and the coverage area by optimizing the altitude of the RIS-equipped UAV, respectively. Finally, simulation results validate the accuracy of derived expressions and show insights related to the optimal number of RIS elements and the optimal UAV altitude for emergency wireless communication (EWC).
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision
Authors:
Yafeng Chen,
Siqi Zheng,
Hui Wang,
Luyao Cheng,
Qian Chen,
Shiliang Zhang,
Wen Wang
Abstract:
Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an…
▽ More
Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. Originally, due to the lack of negative pairs in the SDPN training process, the network tends to align positive pairs very closely in the embedding space, a phenomenon known as model collapse. To alleviate this problem, we introduce a diversity regularization term to embeddings in SDPN. Comprehensive experiments on the VoxCeleb datasets demonstrate the superiority of SDPN in self-supervised speaker verification. SDPN sets a new state-of-the-art on the VoxCeleb1 speaker verification evaluation benchmark, achieving Equal Error Rate 1.80%, 1.99%, and 3.62% for trial VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H respectively, without using any speaker labels in training.
△ Less
Submitted 25 June, 2024; v1 submitted 16 June, 2024;
originally announced June 2024.
-
Fast Fractional Programming for Multi-Cell Integrated Sensing and Communications
Authors:
Yannan Chen,
Yi Feng,
Xiaoyang Li,
Licheng Zhao,
Kaiming Shen
Abstract:
This paper concerns the coordinate multi-cell beamforming design for integrated sensing and communications (ISAC). In particular, we assume that each base station (BS) has massive antennas. The optimization objective is to maximize a weighted sum of the data rates (for communications) and the Fisher information (for sensing). We first show that the conventional beamforming method for the multiple-…
▽ More
This paper concerns the coordinate multi-cell beamforming design for integrated sensing and communications (ISAC). In particular, we assume that each base station (BS) has massive antennas. The optimization objective is to maximize a weighted sum of the data rates (for communications) and the Fisher information (for sensing). We first show that the conventional beamforming method for the multiple-input multiple-output (MIMO) transmission, i.e., the weighted minimum mean square error (WMMSE) algorithm, has a natural extension to the ISAC problem scenario from a fractional programming (FP) perspective. However, the extended WMMSE algorithm requires computing the $N\times N$ matrix inverse extensively, where $N$ is proportional to the antenna array size, so the algorithm becomes quite costly when antennas are massively deployed. To address this issue, we develop a nonhomogeneous bound and use it in conjunction with the FP technique to solve the ISAC beamforming problem without the need to invert any large matrices. It is further shown that the resulting new FP algorithm has an intimate connection with gradient projection, based on which we can accelerate the convergence via Nesterov's gradient extrapolation.
△ Less
Submitted 16 June, 2024;
originally announced June 2024.
-
Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition
Authors:
Guinan Li,
Jiajun Deng,
Youjun Chen,
Mengzhe Geng,
Shujie Hu,
Zhe Li,
Zengrui **,
Tianzi Wang,
Xurong Xie,
Helen Meng,
Xunying Liu
Abstract:
This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and tightly integrated with the complete system training. Experiments conducted on LRS3-TED data simulated multichannel overlapped speech suggest that joint…
▽ More
This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and tightly integrated with the complete system training. Experiments conducted on LRS3-TED data simulated multichannel overlapped speech suggest that joint speaker feature learning consistently improves speech separation and recognition performance over the baselines without joint speaker feature estimation. Further analyses reveal performance improvements are strongly correlated with increased inter-speaker discrimination measured using cosine similarity. The best-performing joint speaker feature learning adapted system outperformed the baseline fine-tuned WavLM model by statistically significant WER reductions of 21.6% and 25.3% absolute (67.5% and 83.5% relative) on Dev and Test sets after incorporating WavLM features and video modality.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
SCKansformer: Fine-Grained Classification of Bone Marrow Cells via Kansformer Backbone and Hierarchical Attention Mechanisms
Authors:
Yifei Chen,
Zhu Zhu,
Shenghao Zhu,
Linwei Qiu,
Binfeng Zou,
Fan Jia,
Yunpeng Zhu,
Chenyan Zhang,
Zhaojie Fang,
Feiwei Qin,
** Fan,
Changmiao Wang,
Yu Gao,
Gang Yu
Abstract:
The incidence and mortality rates of malignant tumors, such as acute leukemia, have risen significantly. Clinically, hospitals rely on cytological examination of peripheral blood and bone marrow smears to diagnose malignant tumors, with accurate blood cell counting being crucial. Existing automated methods face challenges such as low feature expression capability, poor interpretability, and redund…
▽ More
The incidence and mortality rates of malignant tumors, such as acute leukemia, have risen significantly. Clinically, hospitals rely on cytological examination of peripheral blood and bone marrow smears to diagnose malignant tumors, with accurate blood cell counting being crucial. Existing automated methods face challenges such as low feature expression capability, poor interpretability, and redundant feature extraction when processing high-dimensional microimage data. We propose a novel fine-grained classification model, SCKansformer, for bone marrow blood cells, which addresses these challenges and enhances classification accuracy and efficiency. The model integrates the Kansformer Encoder, SCConv Encoder, and Global-Local Attention Encoder. The Kansformer Encoder replaces the traditional MLP layer with the KAN, improving nonlinear feature representation and interpretability. The SCConv Encoder, with its Spatial and Channel Reconstruction Units, enhances feature representation and reduces redundancy. The Global-Local Attention Encoder combines Multi-head Self-Attention with a Local Part module to capture both global and local features. We validated our model using the Bone Marrow Blood Cell Fine-Grained Classification Dataset (BMCD-FGCD), comprising over 10,000 samples and nearly 40 classifications, developed with a partner hospital. Comparative experiments on our private dataset, as well as the publicly available PBC and ALL-IDB datasets, demonstrate that SCKansformer outperforms both typical and advanced microcell classification methods across all datasets. Our source code and private BMCD-FGCD dataset are available at https://github.com/JustlfC03/SCKansformer.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
Frequency-mix Knowledge Distillation for Fake Speech Detection
Authors:
Cunhang Fan,
Shunbo Dong,
Jun Xue,
Yujie Chen,
Jiangyan Yi,
Zhao Lv
Abstract:
In the telephony scenarios, the fake speech detection (FSD) task to combat speech spoofing attacks is challenging. Data augmentation (DA) methods are considered effective means to address the FSD task in telephony scenarios, typically divided into time domain and frequency domain stages. While each has its advantages, both can result in information loss. To tackle this issue, we propose a novel DA…
▽ More
In the telephony scenarios, the fake speech detection (FSD) task to combat speech spoofing attacks is challenging. Data augmentation (DA) methods are considered effective means to address the FSD task in telephony scenarios, typically divided into time domain and frequency domain stages. While each has its advantages, both can result in information loss. To tackle this issue, we propose a novel DA method, Frequency-mix (Freqmix), and introduce the Freqmix knowledge distillation (FKD) to enhance model information extraction and generalization abilities. Specifically, we use Freqmix-enhanced data as input for the teacher model, while the student model's input undergoes time-domain DA method. We use a multi-level feature distillation approach to restore information and improve the model's generalization capabilities. Our approach achieves state-of-the-art results on ASVspoof 2021 LA dataset, showing a 31\% improvement over baseline and performs competitively on ASVspoof 2021 DF dataset.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases
Authors:
Meng Wang,
Tian Lin,
Kai Yu,
Aidi Lin,
Yuanyuan Peng,
Lianyu Wang,
Cheng Chen,
Ke Zou,
Huiyu Liang,
Man Chen,
Xue Yao,
Meiqin Zhang,
Binwei Huang,
Chaoxin Zheng,
Wei Chen,
Yilong Luo,
Yifan Chen,
**gcheng Wang,
Yih Chung Tham,
Dianbo Liu,
Wendy Wong,
Sahil Thakur,
Beau Fenner,
Yanda Meng,
Yukun Zhou
, et al. (11 additional authors not shown)
Abstract:
The current retinal artificial intelligence models were trained using data with a limited category of diseases and limited knowledge. In this paper, we present a retinal vision-language foundation model (RetiZero) with knowledge of over 400 fundus diseases. Specifically, we collected 341,896 fundus images paired with text descriptions from 29 publicly available datasets, 180 ophthalmic books, and…
▽ More
The current retinal artificial intelligence models were trained using data with a limited category of diseases and limited knowledge. In this paper, we present a retinal vision-language foundation model (RetiZero) with knowledge of over 400 fundus diseases. Specifically, we collected 341,896 fundus images paired with text descriptions from 29 publicly available datasets, 180 ophthalmic books, and online resources, encompassing over 400 fundus diseases across multiple countries and ethnicities. RetiZero achieved outstanding performance across various downstream tasks, including zero-shot retinal disease recognition, image-to-image retrieval, internal domain and cross-domain retinal disease classification, and few-shot fine-tuning. Specially, in the zero-shot scenario, RetiZero achieved a Top5 score of 0.8430 and 0.7561 on 15 and 52 fundus diseases respectively. In the image-retrieval task, RetiZero achieved a Top5 score of 0.9500 and 0.8860 on 15 and 52 retinal diseases respectively. Furthermore, clinical evaluations by ophthalmology experts from different countries demonstrate that RetiZero can achieve performance comparable to experienced ophthalmologists using zero-shot and image retrieval methods without requiring model retraining. These capabilities of retinal disease identification strengthen our RetiZero foundation model in clinical implementation.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Traffic Signal Cycle Control with Centralized Critic and Decentralized Actors under Varying Intervention Frequencies
Authors:
Maonan Wang,
Yirong Chen,
Yuheng Kan,
Chengcheng Xu,
Michael Lepech,
Man-On Pun,
Xi Xiong
Abstract:
Traffic congestion in urban areas is a significant problem, leading to prolonged travel times, reduced efficiency, and increased environmental concerns. Effective traffic signal control (TSC) is a key strategy for reducing congestion. Unlike most TSC systems that rely on high-frequency control, this study introduces an innovative joint phase traffic signal cycle control method that operates effect…
▽ More
Traffic congestion in urban areas is a significant problem, leading to prolonged travel times, reduced efficiency, and increased environmental concerns. Effective traffic signal control (TSC) is a key strategy for reducing congestion. Unlike most TSC systems that rely on high-frequency control, this study introduces an innovative joint phase traffic signal cycle control method that operates effectively with varying control intervals. Our method features an adjust all phases action design, enabling simultaneous phase changes within the signal cycle, which fosters both immediate stability and sustained TSC effectiveness, especially at lower frequencies. The approach also integrates decentralized actors to handle the complexity of the action space, with a centralized critic to ensure coordinated phase adjusting. Extensive testing on both synthetic and real-world data across different intersection types and signal setups shows that our method significantly outperforms other popular techniques, particularly at high control intervals. Case studies of policies derived from traffic data further illustrate the robustness and reliability of our proposed method.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation
Authors:
Hanzhao Li,
Liumeng Xue,
Haohan Guo,
Xinfa Zhu,
Yuanjun Lv,
Lei Xie,
Yunlin Chen,
Hao Yin,
Zhifei Li
Abstract:
The multi-codebook speech codec enables the application of large language models (LLM) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically-rich discrete sequence. Furthermor…
▽ More
The multi-codebook speech codec enables the application of large language models (LLM) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically-rich discrete sequence. Furthermore, the encoder is enhanced with 1) contextual modeling with a BLSTM module to exploit the temporal information, 2) a hybrid sampling module to alleviate distortion from upsampling and downsampling, and 3) a resampling module to encourage discrete units to carry more phonetic information. Compared with multi-codebook codecs, e.g., EnCodec and TiCodec, Single-Codec demonstrates higher reconstruction quality with a lower bandwidth of only 304bps. The effectiveness of Single-Code is further validated by LLM-TTS experiments, showing improved naturalness and intelligibility.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection
Authors:
Yujie Chen,
Jiangyan Yi,
Jun Xue,
Chenglong Wang,
Xiaohui Zhang,
Shunbo Dong,
Siding Zeng,
Jianhua Tao,
Lv Zhao,
Cunhang Fan
Abstract:
Fake artefacts for discriminating between bonafide and fake audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepf…
▽ More
Fake artefacts for discriminating between bonafide and fake audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepfake detection. Specifically, we use sinc Layer and multiple convolutional layers to capture short-range features, and then design a bidirectional Mamba to address Mamba's unidirectional modelling problem and further capture long-range feature information. Moreover, we develop a bidirectional fusion module to integrate embeddings, enhancing audio context representation and combining short- and long-range information. The results show that our proposed RawBMamba achieves a 34.1\% improvement over Rawformer on ASVspoof2021 LA dataset, and demonstrates competitive performance on other datasets.
△ Less
Submitted 18 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Speech-based Clinical Depression Screening: An Empirical Study
Authors:
Yangbin Chen,
Chenyang Xu,
Chunfeng Liang,
Yanbao Tao,
Chuan Shi
Abstract:
This study investigates the utility of speech signals for AI-based depression screening across varied interaction scenarios, including psychiatric interviews, chatbot conversations, and text readings. Participants include depressed patients recruited from the outpatient clinics of Peking University Sixth Hospital and control group members from the community, all diagnosed by psychiatrists followin…
▽ More
This study investigates the utility of speech signals for AI-based depression screening across varied interaction scenarios, including psychiatric interviews, chatbot conversations, and text readings. Participants include depressed patients recruited from the outpatient clinics of Peking University Sixth Hospital and control group members from the community, all diagnosed by psychiatrists following standardized diagnostic protocols. We extracted acoustic and deep speech features from each participant's segmented recordings. Classifications were made using neural networks or SVMs, with aggregated clip outcomes determining final assessments. Our analysis across interaction scenarios, speech processing techniques, and feature types confirms speech as a crucial marker for depression screening. Specifically, human-computer interaction matches clinical interview efficacy, surpassing reading tasks. Segment duration and quantity significantly affect model performance, with deep speech features substantially outperforming traditional acoustic features.
△ Less
Submitted 12 June, 2024; v1 submitted 5 June, 2024;
originally announced June 2024.
-
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Authors:
Philip Anastassiou,
Jiawei Chen,
Jitong Chen,
Yuanzhe Chen,
Zhuo Chen,
Ziyi Chen,
Jian Cong,
Lelai Deng,
Chuang Ding,
Lu Gao,
Mingqing Gong,
Peisong Huang,
Qingqing Huang,
Zhiying Huang,
Yuanyuan Huo,
Dongya Jia,
Chumin Li,
Feiya Li,
Hui Li,
Jiaxin Li,
Xiaoyang Li,
Xingxing Li,
Lin Liu,
Shouda Liu,
Sichao Liu
, et al. (21 additional authors not shown)
Abstract:
We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub…
▽ More
We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named $\text{Seed-TTS}_\text{DiT}$, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, $\text{Seed-TTS}_\text{DiT}$ does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at \url{https://bytedancespeech.github.io/seedtts_tech_report}.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
A Study of the Latest Updates of the Readout System for the Hybird-Pixel Detector at HEPS
Authors:
Hangxu Li,
Jie Zhang,
Wei Wei,
Zhenjie Li,
Xiaolu Ji,
Yan Zhang,
Xuanzheng Yang,
Shuihan Zhang,
Xueke Ma,
Peng Liu,
Zheng Wang,
Yuanbai Chen
Abstract:
The High Energy Photon Source (HEPS) represents a fourth-generation light source. This facility has made unprecedented advancements in accelerator technology, necessitating the development of new detectors to satisfy physical requirements such as single-photon resolution, large dynamic range, and high frame rates. Since 2016, the Institute of High Energy Physics has introduced the first user-exper…
▽ More
The High Energy Photon Source (HEPS) represents a fourth-generation light source. This facility has made unprecedented advancements in accelerator technology, necessitating the development of new detectors to satisfy physical requirements such as single-photon resolution, large dynamic range, and high frame rates. Since 2016, the Institute of High Energy Physics has introduced the first user-experimental hybrid pixel detector, progressing to the fourth-generation million-pixel detector designed for challenging conditions, with the dual-threshold single-photon detector HEPS-Bei**g PIXel (HEPS-BPIX) set as the next-generation target. HEPS-BPIX will employ the entirely new Application-Specific Integrated Circuit (ASIC) BP40 for pixel information readout. Data flow will be managed and controlled through readout electronics based on a two-tier Field-Programmable Gate Array (FPGA) system: the Front-End Electronics (FEE) and the Input-Output Board (IOB) handle the fan-out for 12 ASICs, and the u4FCP is tasked with processing serial data on high-speed links, transferring pixel-level data to the back-end RTM and uTCA chassis, or independently outputting through a network port, enabling remote control of the entire detector. The new HEPS-BPIX firmware has undergone a comprehensive redesign and update to meet the electronic characteristics of the new chip and to improve the overall performance of the detector. We provide an overview of the core subunits of HEPS-BPIX, emphasizing the readout system, evaluating the new hardware and firmware, and highlighting some of its innovative features and characteristics.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Optimizing Air-borne Network-in-a-box Deployment for Efficient Remote Coverage
Authors:
Sidrah Javed,
Yunfei Chen,
Mohamed-Slim Alouini,
Cheng-Xiang Wang
Abstract:
Among many envisaged drivers for the sixth generation, one is from the United Nations Sustainability Development Goals 2030 to eliminate digital inequality. Remote coverage in sparsely populated areas, difficult terrains, or emergency scenarios requires on-demand access and flexible deployment with minimal capex and opex. In this context, network-in-a-box (NIB) is an exciting solution that packs t…
▽ More
Among many envisaged drivers for the sixth generation, one is from the United Nations Sustainability Development Goals 2030 to eliminate digital inequality. Remote coverage in sparsely populated areas, difficult terrains, or emergency scenarios requires on-demand access and flexible deployment with minimal capex and opex. In this context, network-in-a-box (NIB) is an exciting solution that packs the whole wireless network into a single portable and re-configurable box to support multiple access technologies such as WiFi/2G/3G/4G/5G etc. In this paper, we propose low-altitude platform stations (LAPS) based NIBs with stratospheric high-altitude platform station (HAPS) as backhaul. Specifically, backhaul employs non-orthogonal multiple access (NOMA) with superposition coding at the transmitting HAPS and successive interference cancellation (SIC) at the receiving NIBs, whereas the access link (AL) employs superposition coding along with the regularized zero-forcing (RZF) precoding at the NIB in order to elevate the computational overhead from the ground users. The required number of airborne NIBs to serve a desired coverage area, their optimal placement, user association, beam optimization, and resource allocation are optimized by maximizing the sum rate of the AL while maintaining the quality of service. Our findings reveal the significance of thorough system planning and communication parameters optimization for enhanced system performance and best coverage under limited resources.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency
Authors:
Yafeng Chen,
Siqi Zheng,
Hui Wang,
Luyao Cheng,
Qian Chen,
Shiliang Zhang,
Junjie Li
Abstract:
Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings. To address this challenge, a multi-scale feature fusion approach has been proposed to effectively capture speaker characteristics from short utterances. Constrained by the model's size, a robust backbone Enhanced Res2Net (ERes2Net) combining global and local feature fusion…
▽ More
Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings. To address this challenge, a multi-scale feature fusion approach has been proposed to effectively capture speaker characteristics from short utterances. Constrained by the model's size, a robust backbone Enhanced Res2Net (ERes2Net) combining global and local feature fusion demonstrates sub-optimal performance in short-duration speaker verification. To further improve the short-duration feature extraction capability of ERes2Net, we expand the channel dimension within each stage. However, this modification also increases the number of model parameters and computational complexity. To alleviate this problem, we propose an improved ERes2NetV2 by pruning redundant structures, ultimately reducing both the model parameters and its computational cost. A range of experiments conducted on the VoxCeleb datasets exhibits the superiority of ERes2NetV2, which achieves EER of 0.61% for the full-duration trial, 0.98% for the 3s-duration trial, and 1.48% for the 2s-duration trial on VoxCeleb1-O, respectively.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Sparse Recovery for Holographic MIMO Channels: Leveraging the Clustered Sparsity
Authors:
Yuqing Guo,
Xufeng Guo,
Yuanbin Chen,
Ying Wang
Abstract:
Envisioned as the next-generation transceiver technology, the holographic multiple-input-multiple-output (HMIMO) garners attention for its superior capabilities of fabricating electromagnetic (EM) waves. However, the densely packed antenna elements significantly increase the dimension of the HMIMO channel matrix, rendering traditional channel estimation methods inefficient. While the dimension cur…
▽ More
Envisioned as the next-generation transceiver technology, the holographic multiple-input-multiple-output (HMIMO) garners attention for its superior capabilities of fabricating electromagnetic (EM) waves. However, the densely packed antenna elements significantly increase the dimension of the HMIMO channel matrix, rendering traditional channel estimation methods inefficient. While the dimension curse can be relieved to avoid the proportional increase with the antenna density using the state-of-the-art wavenumber-domain sparse representation, the sparse recovery complexity remains tied to the order of non-zero elements in the sparse channel, which still considerably exceeds the number of scatterers. By modeling the inherent clustered sparsity using a Gaussian mixed model (GMM)-based von Mises-Fisher (vMF) distribution, the to-be-estimated channel characteristics can be compressed to the scatterer level. Upon the sparsity extraction, a novel wavenumber-domain expectation-maximization (WD-EM) algorithm is proposed to implement the cluster-by-cluster variational inference, thus significantly reducing the computational complexity. Simulation results verify the robustness of the proposed scheme across overheads and signal-to-noise ratio (SNR).
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Adaptive Relaxation based Non-Conservative Chance Constrained Stochastic MPC for Battery Scheduling Under Forecast Uncertainties
Authors:
Avik Ghosh,
Cristian Cortes-Aguirre,
Yi-An Chen,
Adil Khurram,
Jan Kleissl
Abstract:
Chance constrained stochastic model predictive controllers (CC-SMPC) trade off full constraint satisfaction for economical plant performance under uncertainty. Previous CC-SMPC works are over-conservative in constraint violations leading to worse economic performance. Other past works require a-priori information about the uncertainty set, limiting their application to real-world systems. This pap…
▽ More
Chance constrained stochastic model predictive controllers (CC-SMPC) trade off full constraint satisfaction for economical plant performance under uncertainty. Previous CC-SMPC works are over-conservative in constraint violations leading to worse economic performance. Other past works require a-priori information about the uncertainty set, limiting their application to real-world systems. This paper considers a discrete linear time invariant system with hard constraints on inputs and chance constraints on states, with unknown uncertainty distribution, statistics, or samples. This work proposes a novel adaptive online update rule to relax the state constraints based on the time-average of past constraint violations, for the SMPC to achieve reduced conservativeness in closed-loop. Under an ideal control policy assumption, it is proven that the time-average of constraint violations converges to the maximum allowed violation probability. The time-average of constraint violations is also proven to asymptotically converge even without the simplifying assumptions. The proposed method is applied to the optimal battery energy storage system (BESS) dispatch in a grid connected microgrid with PV generation and load demand with chance constraints on BESS state-of-charge (SOC). Realistic simulations show the superior electricity cost saving potential of the proposed method as compared to the traditional MPC (with hard constraints on BESS SOC), by satisfying the chance constraints non-conservatively in closed-loop, thereby effectively trading off increased cost savings with minimal adverse effects on BESS lifetime.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Unified Control of Voltage, Frequency and Angle in Electrical Power Systems: A Passivity and Negative-Imaginary based Approach
Authors:
Yijun Chen,
Kanghong Shi,
Ian R. Petersen,
Elizabeth L. Ratnam
Abstract:
This paper proposes a unified methodology for voltage regulation, frequency synchronization, and rotor angle control in power transmission systems considering a one-axis generator model with time-varying voltages. First, we formulate an output consensus problem with a passivity and negative-imaginary (NI) based control framework. We establish output consensus results for both networked passive sys…
▽ More
This paper proposes a unified methodology for voltage regulation, frequency synchronization, and rotor angle control in power transmission systems considering a one-axis generator model with time-varying voltages. First, we formulate an output consensus problem with a passivity and negative-imaginary (NI) based control framework. We establish output consensus results for both networked passive systems and networked NI systems. Next, we apply the output consensus problem by controlling large-scale batteries co-located with synchronous generators -- using real-time voltage phasor measurements. By controlling the battery storage systems so as to dispatch real and reactive power, we enable simultaneous control of voltage, frequency, and power angle differences across a transmission network. Validation through numerical simulations on a four-area transmission network confirms the robustness of our unified control framework.
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
On the Stability of Networked Nonlinear Imaginary Systems with Applications to Electrical Power Systems
Authors:
Yijun Chen,
Kanghong Shi,
Ian R. Petersen,
Elizabeth L. Ratnam
Abstract:
In the transition to achieving net zero emissions, it has been suggested that a substantial expansion of electric power grids will be necessary to support emerging renewable energy zones. In this paper, we propose employing battery-based feedback control and nonlinear negative imaginary (NI) systems theory to reduce the need for such expansion. By formulating a novel Luré-Postnikov-like Lyapunov f…
▽ More
In the transition to achieving net zero emissions, it has been suggested that a substantial expansion of electric power grids will be necessary to support emerging renewable energy zones. In this paper, we propose employing battery-based feedback control and nonlinear negative imaginary (NI) systems theory to reduce the need for such expansion. By formulating a novel Luré-Postnikov-like Lyapunov function, stability results are presented for the feedback interconnection of two single nonlinear NI systems, while output feedback consensus results are established for the feedback interconnection of two networked nonlinear NI systems based on a network topology. This theoretical framework underpins our design of battery-based control in power transmission systems. We demonstrate that the power grid can be gradually transitioned into the proposed NI systems, one transmission line at a time.
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
Patterned Beam Training: A Novel Low-Complexity and Low-Overhead Scheme for ELAA
Authors:
Hongkang Yu,
Yuan Si,
Shujuan Zhang,
Yijian Chen
Abstract:
Extremely large antenna arrays (ELAAs) can provide higher spectral efficiency. However, the use of narrower beams for data transmission significantly increases the overhead associated with beam training. In this letter, we propose a novel patterned beam training (PBT) scheme characterized by its low overhead and complexity. This scheme requires only a single linear operation by both the base stati…
▽ More
Extremely large antenna arrays (ELAAs) can provide higher spectral efficiency. However, the use of narrower beams for data transmission significantly increases the overhead associated with beam training. In this letter, we propose a novel patterned beam training (PBT) scheme characterized by its low overhead and complexity. This scheme requires only a single linear operation by both the base station and the user equipment to determine the optimal beam, reducing the training overhead to half or even less compared to traditional exhaustive search methods. Furthermore, We discuss the pattern design principles in detail and provide specific forms. Simulation results demonstrate that the proposed scheme outperforms the compared methods in terms of beam alignment accuracy and achieves a balance between signal-to-noise ratio (SNR) conditions and training overhead, making it a promising alternative.
△ Less
Submitted 1 June, 2024;
originally announced June 2024.
-
RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text
Authors:
Jiaben Chen,
Xin Yan,
Yihang Chen,
Siyuan Cen,
Qinwei Ma,
Haoyu Zhen,
Kaizhi Qian,
Lie Lu,
Chuang Gan
Abstract:
In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rap** vocals, lyrics, and high-quality 3D holistic bo…
▽ More
In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rap** vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information, and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation. The project page is available for research purposes at https://vis-www.cs.umass.edu/RapVerse.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Principled Probabilistic Imaging using Diffusion Models as Plug-and-Play Priors
Authors:
Zihui Wu,
Yu Sun,
Yifan Chen,
Bingliang Zhang,
Yisong Yue,
Katherine L. Bouman
Abstract:
Diffusion models (DMs) have recently shown outstanding capability in modeling complex image distributions, making them expressive image priors for solving Bayesian inverse problems. However, most existing DM-based methods rely on approximations in the generative process to be generic to different inverse problems, leading to inaccurate sample distributions that deviate from the target posterior de…
▽ More
Diffusion models (DMs) have recently shown outstanding capability in modeling complex image distributions, making them expressive image priors for solving Bayesian inverse problems. However, most existing DM-based methods rely on approximations in the generative process to be generic to different inverse problems, leading to inaccurate sample distributions that deviate from the target posterior defined within the Bayesian framework. To harness the generative power of DMs while avoiding such approximations, we propose a Markov chain Monte Carlo algorithm that performs posterior sampling for general inverse problems by reducing it to sampling the posterior of a Gaussian denoising problem. Crucially, we leverage a general DM formulation as a unified interface that allows for rigorously solving the denoising problem with a range of state-of-the-art DMs. We demonstrate the effectiveness of the proposed method on six inverse problems (three linear and three nonlinear), including a real-world black hole imaging problem. Experimental results indicate that our proposed method offers more accurate reconstructions and posterior estimation compared to existing DM-based imaging inverse methods.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Holographic MIMO Systems, Their Channel Estimation and Performance
Authors:
Yuanbin Chen,
Ying Wang,
Zhaocheng Wang,
** Zhang
Abstract:
Holographic multiple-input multiple-output (MIMO) systems constitute a promising technology in support of next-generation wireless communications, thus paving the way for a smart programmable radio environment. However, despite its significant potential, further fundamental issues remain to be addressed, such as the acquisition of accurate channel information. Indeed, the conventional angular-doma…
▽ More
Holographic multiple-input multiple-output (MIMO) systems constitute a promising technology in support of next-generation wireless communications, thus paving the way for a smart programmable radio environment. However, despite its significant potential, further fundamental issues remain to be addressed, such as the acquisition of accurate channel information. Indeed, the conventional angular-domain channel representation is no longer adequate for characterizing the sparsity inherent in holographic MIMO channels. To fill this knowledge gap, in this article, we conceive a decomposition and reconstruction (DeRe)-based framework for facilitating the estimation of sparse channels in holographic MIMOs. In particular, the channel parameters involved in the steering vector, namely the azimuth and elevation angles plus the distance (AED), are decomposed for independently constructing their own covariance matrices. Then, the acquisition of each parameter can be formulated as a compressive sensing (CS) problem by harnessing the covariance matrix associated with each individual parameter. We demonstrate that our solution exhibits an improved performance and imposes a reduced pilot overhead, despite its reduced complexity. Finally, promising open research topics are highlighted to bridge the gap between the theory and the practical employment of holographic MIMO schemes.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Extraction of In-Phase and Quadrature Components by Time-Encoding Sampling
Authors:
Y. H. Shao,
S. Y. Chen,
H. Z. Yang,
F. Xi,
H. Hong,
Z. Liu
Abstract:
Time encoding machine (TEM) is a biologically-inspired scheme to perform signal sampling using timing. In this paper, we study its application to the sampling of bandpass signals. We propose an integrate-and-fire TEM scheme by which the in-phase (I) and quadrature (Q) components are extracted through reconstruction. We design the TEM according to the signal bandwidth and amplitude instead of upper…
▽ More
Time encoding machine (TEM) is a biologically-inspired scheme to perform signal sampling using timing. In this paper, we study its application to the sampling of bandpass signals. We propose an integrate-and-fire TEM scheme by which the in-phase (I) and quadrature (Q) components are extracted through reconstruction. We design the TEM according to the signal bandwidth and amplitude instead of upper-edge frequency and amplitude as in the case of bandlimited/lowpass signals. We show that the I and Q components can be perfectly reconstructed from the TEM measurements if the minimum firing rate is equal to the Landau's rate of the signal. For the reconstruction of I and Q components, we develop an alternating projection onto convex sets (POCS) algorithm in which two POCS algorithms are alternately iterated. For the algorithm analysis, we define a solution space of vector-valued signals and prove that the proposed reconstruction algorithm converges to the correct unique solution in the noiseless case. The proposed TEM can operate regardless of the center frequencies of the bandpass signals. This is quite different from traditional bandpass sampling, where the center frequency should be carefully allocated for Landau's rate and its variations have the negative effect on the sampling performance. In addition, the proposed TEM achieves certain reconstructed signal-to-noise-plus-distortion ratios for small firing rates in thermal noise, which is unavoidably present and will be aliased to the Nyquist band in the traditional sampling such that high sampling rates are required. We demonstrate the reconstruction performance and substantiate our claims via simulation experiments.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
UniCompress: Enhancing Multi-Data Medical Image Compression with Knowledge Distillation
Authors:
Runzhao Yang,
Yinda Chen,
Zhihong Zhang,
Xiaoyu Liu,
Zongren Li,
Kunlun He,
Zhiwei Xiong,
**li Suo,
Qionghai Dai
Abstract:
In the field of medical image compression, Implicit Neural Representation (INR) networks have shown remarkable versatility due to their flexible compression ratios, yet they are constrained by a one-to-one fitting approach that results in lengthy encoding times. Our novel method, ``\textbf{UniCompress}'', innovatively extends the compression capabilities of INR by being the first to compress multi…
▽ More
In the field of medical image compression, Implicit Neural Representation (INR) networks have shown remarkable versatility due to their flexible compression ratios, yet they are constrained by a one-to-one fitting approach that results in lengthy encoding times. Our novel method, ``\textbf{UniCompress}'', innovatively extends the compression capabilities of INR by being the first to compress multiple medical data blocks using a single INR network. By employing wavelet transforms and quantization, we introduce a codebook containing frequency domain information as a prior input to the INR network. This enhances the representational power of INR and provides distinctive conditioning for different image blocks. Furthermore, our research introduces a new technique for the knowledge distillation of implicit representations, simplifying complex model knowledge into more manageable formats to improve compression ratios. Extensive testing on CT and electron microscopy (EM) datasets has demonstrated that UniCompress outperforms traditional INR methods and commercial compression solutions like HEVC, especially in complex and high compression scenarios. Notably, compared to existing INR techniques, UniCompress achieves a 4$\sim$5 times increase in compression speed, marking a significant advancement in the field of medical image compression. Codes will be publicly available.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Graphon Particle Systems, Part I: Spatio-Temporal Approximation and Law of Large Numbers
Authors:
Yan Chen,
Tao Li
Abstract:
We study a class of graphon particle systems with time-varying random coefficients. In a graphon particle system, the interactions among particles are characterized by the coupled mean field terms through an underlying graphon and the randomness of the coefficients comes from the stochastic processes associated with the particle labels. By constructing two-level approximated sequences converging i…
▽ More
We study a class of graphon particle systems with time-varying random coefficients. In a graphon particle system, the interactions among particles are characterized by the coupled mean field terms through an underlying graphon and the randomness of the coefficients comes from the stochastic processes associated with the particle labels. By constructing two-level approximated sequences converging in 2-Wasserstein distance, we prove the existence and uniqueness of the solution to the system. Besides, by constructing two-level approximated functions converging to the graphon mean field terms, we establish the law of large numbers, which reveals that if the number of particles tends to infinity and the discretization step tends to zero, then the discrete-time interacting particle system over the large-scale network converges to the graphon particle system. As a byproduct, we discover that the graphon particle system can describe the dynamic evolution of the distributed stochastic gradient descent algorithm over the large-scale network and prove that if the gradients of the local cost functions are Lipschitz continuous, then the graphon particle system can be regarded as the spatio-temporal approximation of the discrete-time distributed stochastic gradient descent algorithm as the number of network nodes tends to infinity and the algorithm step size tends to zero.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
Combining Radiomics and Machine Learning Approaches for Objective ASD Diagnosis: Verifying White Matter Associations with ASD
Authors:
Junlin Song,
Yuzhuo Chen,
Yuan Yao,
Zetong Chen,
Renhao Guo,
Lida Yang,
Xinyi Sui,
Qihang Wang,
Xijiao Li,
Aihua Cao,
Wei Li
Abstract:
Autism Spectrum Disorder is a condition characterized by a typical brain development leading to impairments in social skills, communication abilities, repetitive behaviors, and sensory processing. There have been many studies combining brain MRI images with machine learning algorithms to achieve objective diagnosis of autism, but the correlation between white matter and autism has not been fully u…
▽ More
Autism Spectrum Disorder is a condition characterized by a typical brain development leading to impairments in social skills, communication abilities, repetitive behaviors, and sensory processing. There have been many studies combining brain MRI images with machine learning algorithms to achieve objective diagnosis of autism, but the correlation between white matter and autism has not been fully utilized. To address this gap, we develop a computer-aided diagnostic model focusing on white matter regions in brain MRI by employing radiomics and machine learning methods. This study introduced a MultiUNet model for segmenting white matter, leveraging the UNet architecture and utilizing manually segmented MRI images as the training data. Subsequently, we extracted white matter features using the Pyradiomics toolkit and applied different machine learning models such as Support Vector Machine, Random Forest, Logistic Regression, and K-Nearest Neighbors to predict autism. The prediction sets all exceeded 80% accuracy. Additionally, we employed Convolutional Neural Network to analyze segmented white matter images, achieving a prediction accuracy of 86.84%. Notably, Support Vector Machine demonstrated the highest prediction accuracy at 89.47%. These findings not only underscore the efficacy of the models but also establish a link between white matter abnormalities and autism. Our study contributes to a comprehensive evaluation of various diagnostic models for autism and introduces a computer-aided diagnostic algorithm for early and objective autism diagnosis based on MRI white matter regions.
△ Less
Submitted 25 May, 2024;
originally announced May 2024.
-
A better approach to diagnose retinal diseases: Combining our Segmentation-based Vascular Enhancement with deep learning features
Authors:
Yuzhuo Chen,
Zetong Chen,
Yuanyuan Liu
Abstract:
Abnormalities in retinal fundus images may indicate certain pathologies such as diabetic retinopathy, hypertension, stroke, glaucoma, retinal macular edema, venous occlusion, and atherosclerosis, making the study and analysis of retinal images of great significance. In conventional medicine, the diagnosis of retina-related diseases relies on a physician's subjective assessment of the retinal fundu…
▽ More
Abnormalities in retinal fundus images may indicate certain pathologies such as diabetic retinopathy, hypertension, stroke, glaucoma, retinal macular edema, venous occlusion, and atherosclerosis, making the study and analysis of retinal images of great significance. In conventional medicine, the diagnosis of retina-related diseases relies on a physician's subjective assessment of the retinal fundus images, which is a time-consuming process and the accuracy is highly dependent on the physician's subjective experience. To this end, this paper proposes a fast, objective, and accurate method for the diagnosis of diseases related to retinal fundus images. This method is a multiclassification study of normal samples and 13 categories of disease samples on the STARE database, with a test set accuracy of 99.96%. Compared with other studies, our method achieved the highest accuracy. This study innovatively propose Segmentation-based Vascular Enhancement(SVE). After comparing the classification performances of the deep learning models of SVE images, original images and Smooth Grad-CAM ++ images, we extracted the deep learning features and traditional features of the SVE images and input them into nine meta learners for classification. The results shows that our proposed UNet-SVE-VGG-MLP model has the optimal performance for classifying diseases related to retinal fundus images on the STARE database, with a overall accuracy of 99.96% and a weighted AUC of 99.98% for the 14 categories on test dataset. This method can be used to realize rapid, objective, and accurate classification and diagnosis of retinal fundus image related diseases.
△ Less
Submitted 25 May, 2024;
originally announced May 2024.
-
Comparing remote sensing-based forest biomass map** approaches using new forest inventory plots in contrasting forests in northeastern and southwestern China
Authors:
Wenquan Dong,
Edward T. A. Mitchard,
Yuwei Chen,
Man Chen,
Congfeng Cao,
Peilun Hu,
Cong Xu,
Steven Hancock
Abstract:
Large-scale high spatial resolution aboveground biomass (AGB) maps play a crucial role in determining forest carbon stocks and how they are changing, which is instrumental in understanding the global carbon cycle, and implementing policy to mitigate climate change. The advent of the new space-borne LiDAR sensor, NASA's GEDI instrument, provides unparalleled possibilities for the accurate and unbia…
▽ More
Large-scale high spatial resolution aboveground biomass (AGB) maps play a crucial role in determining forest carbon stocks and how they are changing, which is instrumental in understanding the global carbon cycle, and implementing policy to mitigate climate change. The advent of the new space-borne LiDAR sensor, NASA's GEDI instrument, provides unparalleled possibilities for the accurate and unbiased estimation of forest AGB at high resolution, particularly in dense and tall forests, where Synthetic Aperture Radar (SAR) and passive optical data exhibit saturation. However, GEDI is a sampling instrument, collecting dispersed footprints, and its data must be combined with that from other continuous cover satellites to create high-resolution maps, using local machine learning methods. In this study, we developed local models to estimate forest AGB from GEDI L2A data, as the models used to create GEDI L4 AGB data incorporated minimal field data from China. We then applied LightGBM and random forest regression to generate wall-to-wall AGB maps at 25 m resolution, using extensive GEDI footprints as well as Sentinel-1 data, ALOS-2 PALSAR-2 and Sentinel-2 optical data. Through a 5-fold cross-validation, LightGBM demonstrated a slightly better performance than Random Forest across two contrasting regions. However, in both regions, the computation speed of LightGBM is substantially faster than that of the random forest model, requiring roughly one-third of the time to compute on the same hardware. Through the validation against field data, the 25 m resolution AGB maps generated using the local models developed in this study exhibited higher accuracy compared to the GEDI L4B AGB data. We found in both regions an increase in error as slope increased. The trained models were tested on nearby but different regions and exhibited good performance.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
Seamless Integration and Implementation of Distributed Contact and Contactless Vital Sign Monitoring
Authors:
Dingding Liang,
Yang Chen,
Jiawei Gao,
Taixia Shi,
Jian** Yao
Abstract:
Real-time vital sign monitoring is gaining immense significance not only in the medical field but also in personal health management. Facing the needs of different application scenarios of the smart and healthy city in the future, the low-cost, large-scale, scalable, and distributed vital sign monitoring system is of great significance. In this work, a seamlessly integrated contact and contactless…
▽ More
Real-time vital sign monitoring is gaining immense significance not only in the medical field but also in personal health management. Facing the needs of different application scenarios of the smart and healthy city in the future, the low-cost, large-scale, scalable, and distributed vital sign monitoring system is of great significance. In this work, a seamlessly integrated contact and contactless vital sign monitoring system, which can simultaneously implement respiration and heartbeat monitoring, is proposed. In contact vital sign monitoring, the chest wall movement due to respiration and heartbeat is translated into changes in the optical output intensity of a fiber Bragg grating (FBG). The FBG is also an important part of radar signal generation for contactless vital sign monitoring, in which the chest wall movement is translated into phase changes of the radar de-chirped signal. By analyzing the intensity of the FBG output and phase of the radar de-chirped signal, real-time respiration and heartbeat monitoring are realized. In addition, due to the distributed structure of the system and its good integration with the wavelength-division multiplexing optical network, it can be massively scaled by employing more wavelengths. A proof-of-concept experiment is carried out. Contact and contactless respiration and heartbeat monitoring of three people are simultaneously realized. During a monitoring time of 60 s, the maximum absolute measurement errors of respiration and heartbeat rates are 1.6 respirations per minute and 2.3 beats per minute, respectively. The measurement error does not have an obvious change even when the monitoring time is decreased to 5 s.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
Edge Information Hub-Empowered 6G NTN: Latency-Oriented Resource Orchestration and Configuration
Authors:
Yueshan Lin,
Wei Feng,
Yunfei Chen,
Ning Ge,
Zhiyong Feng,
Yue Gao
Abstract:
Quick response to disasters is crucial for saving lives and reducing loss. This requires low-latency uploading of situation information to the remote command center. Since terrestrial infrastructures are often damaged in disaster areas, non-terrestrial networks (NTNs) are preferable to provide network coverage, and mobile edge computing (MEC) could be integrated to improve the latency performance.…
▽ More
Quick response to disasters is crucial for saving lives and reducing loss. This requires low-latency uploading of situation information to the remote command center. Since terrestrial infrastructures are often damaged in disaster areas, non-terrestrial networks (NTNs) are preferable to provide network coverage, and mobile edge computing (MEC) could be integrated to improve the latency performance. Nevertheless, the communications and computing in MEC-enabled NTNs are strongly coupled, which complicates the system design. In this paper, an edge information hub (EIH) that incorporates communication, computing and storage capabilities is proposed to synergize communication and computing and enable systematic design. We first address the joint data scheduling and resource orchestration problem to minimize the latency for uploading sensing data. The problem is solved using an optimal resource orchestration algorithm. On that basis, we propose the principles for resource configuration of the EIH considering payload constraints on size, weight and energy supply. Simulation results demonstrate the superiority of our proposed scheme in reducing the overall upload latency, thus enabling quick emergency rescue.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
Simultaneous Deep Learning of Myocardium Segmentation and T2 Quantification for Acute Myocardial Infarction MRI
Authors:
Yirong Zhou,
Chengyan Wang,
Mengtian Lu,
Kunyuan Guo,
Zi Wang,
Dan Ruan,
Rui Guo,
Peijun Zhao,
Jianhua Wang,
Naiming Wu,
Jianzhong Lin,
Yinyin Chen,
Hang **,
Lianxin Xie,
Lilan Wu,
Liuhong Zhu,
Jianjun Zhou,
Congbo Cai,
He Wang,
Xiaobo Qu
Abstract:
In cardiac Magnetic Resonance Imaging (MRI) analysis, simultaneous myocardial segmentation and T2 quantification are crucial for assessing myocardial pathologies. Existing methods often address these tasks separately, limiting their synergistic potential. To address this, we propose SQNet, a dual-task network integrating Transformer and Convolutional Neural Network (CNN) components. SQNet features…
▽ More
In cardiac Magnetic Resonance Imaging (MRI) analysis, simultaneous myocardial segmentation and T2 quantification are crucial for assessing myocardial pathologies. Existing methods often address these tasks separately, limiting their synergistic potential. To address this, we propose SQNet, a dual-task network integrating Transformer and Convolutional Neural Network (CNN) components. SQNet features a T2-refine fusion decoder for quantitative analysis, leveraging global features from the Transformer, and a segmentation decoder with multiple local region supervision for enhanced accuracy. A tight coupling module aligns and fuses CNN and Transformer branch features, enabling SQNet to focus on myocardium regions. Evaluation on healthy controls (HC) and acute myocardial infarction patients (AMI) demonstrates superior segmentation dice scores (89.3/89.2) compared to state-of-the-art methods (87.7/87.9). T2 quantification yields strong linear correlations (Pearson coefficients: 0.84/0.93) with label values for HC/AMI, indicating accurate map**. Radiologist evaluations confirm SQNet's superior image quality scores (4.60/4.58 for segmentation, 4.32/4.42 for T2 quantification) over state-of-the-art methods (4.50/4.44 for segmentation, 3.59/4.37 for T2 quantification). SQNet thus offers accurate simultaneous segmentation and quantification, enhancing cardiac disease diagnosis, such as AMI.
△ Less
Submitted 29 May, 2024; v1 submitted 17 May, 2024;
originally announced May 2024.
-
Co-learning-aided Multi-modal-deep-learning Framework of Passive DOA Estimators for a Heterogeneous Hybrid Massive MIMO Receiver
Authors:
Jiatong Bai,
Feng Shu,
Qinghe Zheng,
Bo Xu,
Baihua Shi,
Yiwen Chen,
Weibin Zhang,
Xianpeng Wang
Abstract:
Due to its excellent performance in rate and resolution, fully-digital (FD) massive multiple-input multiple-output (MIMO) antenna arrays has been widely applied in data transmission and direction of arrival (DOA) measurements, etc. But it confronts with two main challenges: high computational complexity and circuit cost. The two problems may be addressed well by hybrid analog-digital (HAD) structu…
▽ More
Due to its excellent performance in rate and resolution, fully-digital (FD) massive multiple-input multiple-output (MIMO) antenna arrays has been widely applied in data transmission and direction of arrival (DOA) measurements, etc. But it confronts with two main challenges: high computational complexity and circuit cost. The two problems may be addressed well by hybrid analog-digital (HAD) structure. But there exists the problem of phase ambiguity for HAD, which leads to its low-efficiency or high-latency. Does exist there such a MIMO structure of owning low-cost, low-complexity and high time efficiency at the same time. To satisfy the three properties, a novel heterogeneous hybrid MIMO receiver structure of integrating FD and heterogeneous HAD ($\rm{H}^2$AD-FD) is proposed and corresponding multi-modal (MD)-learning framework is developed. The framework includes three major stages: 1) generate the candidate sets via root multiple signal classification (Root-MUSIC) or deep learning (DL); 2) infer the class of true solutions from candidate sets using machine learning (ML) methods; 3) fuse the two-part true solutions to achieve a better DOA estimation. The above process form two methods named MD-Root-MUSIC and MDDL. To improve DOA estimation accuracy and reduce the clustering complexity, a co-learning-aided MD framework is proposed to form two enhanced methods named CoMDDL and CoMD-RootMUSIC. Moreover, the Cramer-Rao lower bound (CRLB) for the proposed $\rm{H}^2$AD-FD structure is also derived. Experimental results demonstrate that our proposed four methods could approach the CRLB for signal-to-noise ratio (SNR) > 0 dB and the proposed CoMDDL and MDDL perform better than CoMD-RootMUSIC and MD-RootMUSIC, particularly in the extremely low SNR region.
△ Less
Submitted 12 June, 2024; v1 submitted 27 April, 2024;
originally announced May 2024.
-
EEG-Features for Generalized Deepfake Detection
Authors:
Arian Beckmann,
Tilman Stephani,
Felix Klotzsche,
Yonghao Chen,
Simon M. Hofmann,
Arno Villringer,
Michael Gaebler,
Vadim Nikulin,
Sebastian Bosse,
Peter Eisert,
Anna Hilsmann
Abstract:
Since the advent of Deepfakes in digital media, the development of robust and reliable detection mechanism is urgently called for. In this study, we explore a novel approach to Deepfake detection by utilizing electroencephalography (EEG) measured from the neural processing of a human participant who viewed and categorized Deepfake stimuli from the FaceForensics++ datset. These measurements serve a…
▽ More
Since the advent of Deepfakes in digital media, the development of robust and reliable detection mechanism is urgently called for. In this study, we explore a novel approach to Deepfake detection by utilizing electroencephalography (EEG) measured from the neural processing of a human participant who viewed and categorized Deepfake stimuli from the FaceForensics++ datset. These measurements serve as input features to a binary support vector classifier, trained to discriminate between real and manipulated facial images. We examine whether EEG data can inform Deepfake detection and also if it can provide a generalized representation capable of identifying Deepfakes beyond the training domain. Our preliminary results indicate that human neural processing signals can be successfully integrated into Deepfake detection frameworks and hint at the potential for a generalized neural representation of artifacts in computer generated faces. Moreover, our study provides next steps towards the understanding of how digital realism is embedded in the human cognitive system, possibly enabling the development of more realistic digital avatars in the future.
△ Less
Submitted 14 May, 2024;
originally announced May 2024.
-
NAFRSSR: a Lightweight Recursive Network for Efficient Stereo Image Super-Resolution
Authors:
Yihong Chen,
Zhen Fan,
Shuai Dong,
Zhiwei Chen,
Wenjie Li,
Minghui Qin,
Min Zeng,
Xubing Lu,
Guofu Zhou,
Xingsen Gao,
Jun-Ming Liu
Abstract:
Stereo image super-resolution (SR) refers to the reconstruction of a high-resolution (HR) image from a pair of low-resolution (LR) images as typically captured by a dual-camera device. To enhance the quality of SR images, most previous studies focused on increasing the number and size of feature maps and introducing complex and computationally intensive structures, resulting in models with high co…
▽ More
Stereo image super-resolution (SR) refers to the reconstruction of a high-resolution (HR) image from a pair of low-resolution (LR) images as typically captured by a dual-camera device. To enhance the quality of SR images, most previous studies focused on increasing the number and size of feature maps and introducing complex and computationally intensive structures, resulting in models with high computational complexity. Here, we propose a simple yet efficient stereo image SR model called NAFRSSR, which is modified from the previous state-of-the-art model NAFSSR by introducing recursive connections and lightweighting the constituent modules. Our NAFRSSR model is composed of nonlinear activation free and group convolution-based blocks (NAFGCBlocks) and depth-separated stereo cross attention modules (DSSCAMs). The NAFGCBlock improves feature extraction and reduces number of parameters by removing the simple channel attention mechanism from NAFBlock and using group convolution. The DSSCAM enhances feature fusion and reduces number of parameters by replacing 1x1 pointwise convolution in SCAM with weight-shared 3x3 depthwise convolution. Besides, we propose to incorporate trainable edge detection operator into NAFRSSR to further improve the model performance. Four variants of NAFRSSR with different sizes, namely, NAFRSSR-Mobile (NAFRSSR-M), NAFRSSR-Tiny (NAFRSSR-T), NAFRSSR-Super (NAFRSSR-S) and NAFRSSR-Base (NAFRSSR-B) are designed, and they all exhibit fewer parameters, higher PSNR/SSIM, and faster speed than the previous state-of-the-art models. In particular, to the best of our knowledge, NAFRSSR-M is the lightest (0.28M parameters) and fastest (50 ms inference time) model achieving an average PSNR/SSIM as high as 24.657 dB/0.7622 on the benchmark datasets. Codes and models will be released at https://github.com/JNUChenYiHong/NAFRSSR.
△ Less
Submitted 14 May, 2024;
originally announced May 2024.
-
Minimum-Variance Recursive State Estimation for 2-D Systems: When Asynchronous Multi-Channel Delays meet Energy Harvesting Constraints
Authors:
Yu Chen,
Wei Wang
Abstract:
This paper is concerned with the state estimation problem for two-dimensional systems with asynchronous multichannel delays and energy harvesting constraints. In the system, each smart sensor has a certain probability of harvesting energy from the external environment, the authorized transmission between the sensor and the remote filter is contingent upon the current energy level of the sensor, wh…
▽ More
This paper is concerned with the state estimation problem for two-dimensional systems with asynchronous multichannel delays and energy harvesting constraints. In the system, each smart sensor has a certain probability of harvesting energy from the external environment, the authorized transmission between the sensor and the remote filter is contingent upon the current energy level of the sensor, which results in intermittent transmission of observation information. Addressing the issue of incomplete observation information due to asynchronous multi-channel delays, a novel approach for observation partition reconstruction is proposed to convert the delayed activated observation sequences into equivalent delay-free activated observation sequences. Through generating spatial equivalency validation, it is found that the reconstructed delay-free activated observation sequences contain the same information as the original delayed activated observation sequences. Based on the reconstructed activated observation sequence and activated probability, a novel unbiased h+1-step recursive estimator is constructed. Then, the evolution of the probability distribution of the energy level is discussed. The estimation gains are obtained by minimizing the filtering error covariance. Subsequently, through parameter assumptions, a uniform lower bound and a recursive upper bound for the filtering error covariance are presented. And the monotonicity analysis of activated probability on estimation performance is given. Finally, the effectiveness of the proposed estimation scheme is verified through a numerical simulation example.
△ Less
Submitted 13 May, 2024; v1 submitted 12 May, 2024;
originally announced May 2024.
-
Joint Uplink and Downlink Rate Splitting for Fog Computing-Enabled Internet of Medical Things
Authors:
Jiasi Zhou,
Yan Chen,
Cong Zhou,
Yan**g Sun
Abstract:
The Internet of Medical Things (IoMT) facilitates in-home electronic healthcare, transforming traditional hospital-based medical examination approaches. This paper proposes a novel transmit scheme for fog computing-enabled IoMT that leverages uplink and downlink rate splitting (RS). Fog computing allows offloading partial computation tasks to the edge server and processing the remainder of the tas…
▽ More
The Internet of Medical Things (IoMT) facilitates in-home electronic healthcare, transforming traditional hospital-based medical examination approaches. This paper proposes a novel transmit scheme for fog computing-enabled IoMT that leverages uplink and downlink rate splitting (RS). Fog computing allows offloading partial computation tasks to the edge server and processing the remainder of the tasks locally. The uplink RS and downlink RS utilize their flexible interference management capabilities to suppress offloading and feedback delay. Our overarching goal is to minimize the total time cost for task offloading, data processing, and result feedback. The resulting problem requires the joint design of task offloading, computing resource allocation, uplink beamforming, downlink beamforming, and common rate allocation. To solve the formulated non-convex problem, we introduce several auxiliary variables and then construct accurate surrogates to smooth the achievable rate. Moreover, we derive the optimal computation resource allocation per user with closed-form expressions. On this basis, we recast the computing resource allocation and energy consumption at the base station to a convex constraint set. We finally develop an alternating optimization algorithm to update the auxiliary variable and inherent variable alternately. Simulation results show that our transmit scheme and algorithm exhibit considerable performance enhancements over several benchmarks.
△ Less
Submitted 10 May, 2024;
originally announced May 2024.
-
Channel Estimation for Holographic MIMO: Wavenumber-Domain Sparsity Inspired Approaches
Authors:
Yuqing Guo,
Yuanbin Chen,
Ying Wang
Abstract:
This paper investigates the sparse channel estimation for holographic multiple-input multiple-output (HMIMO) systems. Given that the wavenumber-domain representation is based on a series of Fourier harmonics that are in essence a series of orthogonal basis functions, a novel wavenumber-domain sparsifying basis is designed to expose the sparsity inherent in HMIMO channels. Furthermore, by harnessin…
▽ More
This paper investigates the sparse channel estimation for holographic multiple-input multiple-output (HMIMO) systems. Given that the wavenumber-domain representation is based on a series of Fourier harmonics that are in essence a series of orthogonal basis functions, a novel wavenumber-domain sparsifying basis is designed to expose the sparsity inherent in HMIMO channels. Furthermore, by harnessing the beneficial sparsity in the wavenumber domain, the sparse estimation of HMIMO channels is structured as a compressed sensing problem, which can be efficiently solved by our proposed wavenumber-domain orthogonal matching pursuit (WD-OMP) algorithm. Finally, numerical results demonstrate that the proposed wavenumber-domain sparsifying basis maintains its detection accuracy regardless of the number of antenna elements and antenna spacing. Additionally, in the case of antenna spacing being much less than half a wavelength, the wavenumber-domain approach remains highly accurate in identifying the significant angular power of HMIMO channels.
△ Less
Submitted 12 June, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
Research on the Tender Leaf Identification and Mechanically Perceptible Plucking Finger for High-quality Green Tea
Authors:
Wei Zhang,
Yong Chen,
Qianqian Wang,
Jun Chen
Abstract:
BACKGROUND: Intelligent identification and precise plucking are the keys to intelligent tea harvesting robots, which are of increasing significance nowadays. Aiming at plucking tender leaves for high-quality green tea producing, in this paper, a tender leaf identification algorithm and a mechanically perceptible plucking finger have been proposed. RESULTS: Based on segmentation algorithm and color…
▽ More
BACKGROUND: Intelligent identification and precise plucking are the keys to intelligent tea harvesting robots, which are of increasing significance nowadays. Aiming at plucking tender leaves for high-quality green tea producing, in this paper, a tender leaf identification algorithm and a mechanically perceptible plucking finger have been proposed. RESULTS: Based on segmentation algorithm and color features, the tender leaf identification algorithm shows an average identification accuracy of over 92.8%. The mechanically perceptible plucking finger plucks tender leaves in a way that a human hand does so as to remain high quality of tea products. Though finite element analysis, we determine the ideal size of grippers and the location of strain gauge attachment on a gripper to enable the employment of feedback control of desired grip** force. Revealed from our experiments, the success rate of tender leaf plucking reaches 92.5%, demonstrating the effectiveness of our design. CONCLUSION: The results show that the tender leaf identification algorithm and the mechanically perceptible plucking finger are effective for tender leaves identification and plucking, providing a foundation for the development of an intelligent tender leaf plucking robot.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.
-
HAGAN: Hybrid Augmented Generative Adversarial Network for Medical Image Synthesis
Authors:
Zhihan Ju,
Wanting Zhou,
Longteng Kong,
Yu Chen,
Yi Li,
Zhenan Sun,
Caifeng Shan
Abstract:
Medical Image Synthesis (MIS) plays an important role in the intelligent medical field, which greatly saves the economic and time costs of medical diagnosis. However, due to the complexity of medical images and similar characteristics of different tissue cells, existing methods face great challenges in meeting their biological consistency. To this end, we propose the Hybrid Augmented Generative Ad…
▽ More
Medical Image Synthesis (MIS) plays an important role in the intelligent medical field, which greatly saves the economic and time costs of medical diagnosis. However, due to the complexity of medical images and similar characteristics of different tissue cells, existing methods face great challenges in meeting their biological consistency. To this end, we propose the Hybrid Augmented Generative Adversarial Network (HAGAN) to maintain the authenticity of structural texture and tissue cells. HAGAN contains Attention Mixed (AttnMix) Generator, Hierarchical Discriminator and Reverse Skip Connection between Discriminator and Generator. The AttnMix consistency differentiable regularization encourages the perception in structural and textural variations between real and fake images, which improves the pathological integrity of synthetic images and the accuracy of features in local areas. The Hierarchical Discriminator introduces pixel-by-pixel discriminant feedback to generator for enhancing the saliency and discriminance of global and local details simultaneously. The Reverse Skip Connection further improves the accuracy for fine details by fusing real and synthetic distribution features. Our experimental evaluations on three datasets of different scales, i.e., COVID-CT, ACDC and BraTS2018, demonstrate that HAGAN outperforms the existing methods and achieves state-of-the-art performance in both high-resolution and low-resolution.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.
-
MIPI 2024 Challenge on Demosaic for HybridEVS Camera: Methods and Results
Authors:
Yaqi Wu,
Zhihao Fan,
Xiaofeng Chu,
Jimmy S. Ren,
Xiaoming Li,
Zongsheng Yue,
Chongyi Li,
Shangcheng Zhou,
Ruicheng Feng,
Yuekun Dai,
Peiqing Yang,
Chen Change Loy,
Senyan Xu,
Zhi**g Sun,
Jiaying Zhu,
Yurui Zhu,
Xueyang Fu,
Zheng-Jun Zha,
Jun Cao,
Cheng Li,
Shu Chen,
Liang Ma,
Shiyang Zhou,
Hai** Zeng,
Kai Feng
, et al. (24 additional authors not shown)
Abstract:
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra…
▽ More
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Nighttime Flare Removal track on MIPI 2024. In total, 170 participants were successfully registered, and 14 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Nighttime Flare Removal. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2024/.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.