Search | arXiv e-print repository

LLM4CP: Adapting Large Language Models for Channel Prediction

Authors: Boxun Liu, Xuanyu Liu, Shijian Gao, Xiang Cheng, Liuqing Yang

Abstract: Channel prediction is an effective approach for reducing the feedback or estimation overhead in massive multi-input multi-output (m-MIMO) systems. However, existing channel prediction methods lack precision due to model mismatch errors or network generalization issues. Large language models (LLMs) have demonstrated powerful modeling and generalization abilities, and have been successfully applied… ▽ More Channel prediction is an effective approach for reducing the feedback or estimation overhead in massive multi-input multi-output (m-MIMO) systems. However, existing channel prediction methods lack precision due to model mismatch errors or network generalization issues. Large language models (LLMs) have demonstrated powerful modeling and generalization abilities, and have been successfully applied to cross-modal tasks, including the time series analysis. Leveraging the expressive power of LLMs, we propose a pre-trained LLM-empowered channel prediction method (LLM4CP) to predict the future downlink channel state information (CSI) sequence based on the historical uplink CSI sequence. We fine-tune the network while freezing most of the parameters of the pre-trained LLM for better cross-modality knowledge transfer. To bridge the gap between the channel data and the feature space of the LLM, preprocessor, embedding, and output modules are specifically tailored by taking into account unique channel characteristics. Simulations validate that the proposed method achieves SOTA prediction performance on full-sample, few-shot, and generalization tests with low training and inference costs. △ Less

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.13150 [pdf]

MCAD: Multi-modal Conditioned Adversarial Diffusion Model for High-Quality PET Image Reconstruction

Authors: Jiaqi Cui, Xinyi Zeng, Pinxian Zeng, Bo Liu, Xi Wu, Jiliu Zhou, Yan Wang

Abstract: Radiation hazards associated with standard-dose positron emission tomography (SPET) images remain a concern, whereas the quality of low-dose PET (LPET) images fails to meet clinical requirements. Therefore, there is great interest in reconstructing SPET images from LPET images. However, prior studies focus solely on image data, neglecting vital complementary information from other modalities, e.g.… ▽ More Radiation hazards associated with standard-dose positron emission tomography (SPET) images remain a concern, whereas the quality of low-dose PET (LPET) images fails to meet clinical requirements. Therefore, there is great interest in reconstructing SPET images from LPET images. However, prior studies focus solely on image data, neglecting vital complementary information from other modalities, e.g., patients' clinical tabular, resulting in compromised reconstruction with limited diagnostic utility. Moreover, they often overlook the semantic consistency between real SPET and reconstructed images, leading to distorted semantic contexts. To tackle these problems, we propose a novel Multi-modal Conditioned Adversarial Diffusion model (MCAD) to reconstruct SPET images from multi-modal inputs, including LPET images and clinical tabular. Specifically, our MCAD incorporates a Multi-modal conditional Encoder (Mc-Encoder) to extract multi-modal features, followed by a conditional diffusion process to blend noise with multi-modal features and gradually map blended features to the target SPET images. To balance multi-modal inputs, the Mc-Encoder embeds Optimal Multi-modal Transport co-Attention (OMTA) to narrow the heterogeneity gap between image and tabular while capturing their interactions, providing sufficient guidance for reconstruction. In addition, to mitigate semantic distortions, we introduce the Multi-Modal Masked Text Reconstruction (M3TRec), which leverages semantic knowledge extracted from denoised PET images to restore the masked clinical tabular, thereby compelling the network to maintain accurate semantics during reconstruction. To expedite the diffusion process, we further introduce an adversarial diffusive network with a reduced number of diffusion steps. Experiments show that our method achieves the state-of-the-art performance both qualitatively and quantitatively. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: Early accepted by MICCAI2024

arXiv:2406.12707 [pdf, other]

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction

Authors: Haoqiu Yan, Yongxin Zhu, Kai Zheng, Bing Liu, Haoyu Cao, Deqiang Jiang, Linli Xu

Abstract: Large Language Model (LLM)-enhanced agents become increasingly prevalent in Human-AI communication, offering vast potential from entertainment to professional domains. However, current multi-modal dialogue systems overlook the acoustic information present in speech, which is crucial for understanding human communication nuances. This oversight can lead to misinterpretations of speakers' intentions… ▽ More Large Language Model (LLM)-enhanced agents become increasingly prevalent in Human-AI communication, offering vast potential from entertainment to professional domains. However, current multi-modal dialogue systems overlook the acoustic information present in speech, which is crucial for understanding human communication nuances. This oversight can lead to misinterpretations of speakers' intentions, resulting in inconsistent or even contradictory responses within dialogues. To bridge this gap, in this paper, we propose PerceptiveAgent, an empathetic multi-modal dialogue system designed to discern deeper or more subtle meanings beyond the literal interpretations of words through the integration of speech modality perception. Employing LLMs as a cognitive core, PerceptiveAgent perceives acoustic information from input speech and generates empathetic responses based on speaking styles described in natural language. Experimental results indicate that PerceptiveAgent excels in contextual understanding by accurately discerning the speakers' true intentions in scenarios where the linguistic meaning is either contrary to or inconsistent with the speaker's true feelings, producing more nuanced and expressive spoken dialogues. Code is publicly available at: \url{https://github.com/Haoqiu-Yan/PerceptiveAgent}. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: 9 pages, 3 figures, ACL24 accepted

arXiv:2406.05359 [pdf, other]

Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization

Authors: Bei Liu, Haoyu Wang, Yanmin Qian

Abstract: Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. Firstly, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids custom… ▽ More Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. Firstly, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids customized for each network layer based on k-means clustering. By applying it to the pre-trained SV systems, we obtain a series of quantized variants with different bit widths. To enhance the performance of low-bit quantized models, a mixed precision quantization algorithm along with a multi-stage fine-tuning (MSFT) strategy is further introduced. Unlike uniform precision quantization, mixed precision approach allows for the assignment of varying bit widths to different network layers. When bit combination is determined, MSFT is employed to progressively quantize and fine-tune network in a specific order. Finally, we design two distinct binary quantization schemes to mitigate performance degradation of 1-bit quantized models: the static and adaptive quantizers. Experiments on VoxCeleb demonstrate that lossless 4-bit uniform precision quantization is achieved on both ResNets and DF-ResNets, yielding a promising compression ratio of around 8. Moreover, compared to uniform precision approach, mixed precision quantization not only obtains additional performance improvements with a similar model size but also offers the flexibility to generate bit combination for any desirable model size. In addition, our suggested 1-bit quantization schemes remarkably boost the performance of binarized models. Finally, a thorough comparison with existing lightweight SV systems reveals that our proposed models outperform all previous methods by a large margin across various model size ranges. △ Less

Submitted 18 June, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

Comments: submitted to IEEE/ACM Transactions on Audio Speech and Language Processing (Under Review)

arXiv:2405.12487 [pdf, other]

3DSS-Mamba: 3D-Spectral-Spatial Mamba for Hyperspectral Image Classification

Authors: Yan He, Bing Tu, Bo Liu, Jun Li, Antonio Plaza

Abstract: Hyperspectral image (HSI) classification constitutes the fundamental research in remote sensing fields. Convolutional Neural Networks (CNNs) and Transformers have demonstrated impressive capability in capturing spectral-spatial contextual dependencies. However, these architectures suffer from limited receptive fields and quadratic computational complexity, respectively. Fortunately, recent Mamba a… ▽ More Hyperspectral image (HSI) classification constitutes the fundamental research in remote sensing fields. Convolutional Neural Networks (CNNs) and Transformers have demonstrated impressive capability in capturing spectral-spatial contextual dependencies. However, these architectures suffer from limited receptive fields and quadratic computational complexity, respectively. Fortunately, recent Mamba architectures built upon the State Space Model integrate the advantages of long-range sequence modeling and linear computational efficiency, exhibiting substantial potential in low-dimensional scenarios. Motivated by this, we propose a novel 3D-Spectral-Spatial Mamba (3DSS-Mamba) framework for HSI classification, allowing for global spectral-spatial relationship modeling with greater computational efficiency. Technically, a spectral-spatial token generation (SSTG) module is designed to convert the HSI cube into a set of 3D spectral-spatial tokens. To overcome the limitations of traditional Mamba, which is confined to modeling causal sequences and inadaptable to high-dimensional scenarios, a 3D-Spectral-Spatial Selective Scanning (3DSS) mechanism is introduced, which performs pixel-wise selective scanning on 3D hyperspectral tokens along the spectral and spatial dimensions. Five scanning routes are constructed to investigate the impact of dimension prioritization. The 3DSS scanning mechanism combined with conventional map** operations forms the 3D-spectral-spatial mamba block (3DMB), enabling the extraction of global spectral-spatial semantic representations. Experimental results and analysis demonstrate that the proposed method outperforms the state-of-the-art methods on HSI classification benchmarks. △ Less

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.09778 [pdf, other]

Beam Pattern Modulation Embedded Hybrid Transceiver Optimization for Integrated Sensing and Communication

Authors: Boxun Liu, Shijian Gao, Zonghui Yang, Xiang Cheng, Liuqing Yang

Abstract: Integrated sensing and communication (ISAC) emerges as a promising technology for B5G/6G, particularly in the millimeter-wave (mmWave) band. However, the widely utilized hybrid architecture in mmWave systems compromises multiplexing gain due to the constraints of limited radio frequency chains. Moreover, additional sensing functionalities exacerbate the impairment of spectrum efficiency (SE). In t… ▽ More Integrated sensing and communication (ISAC) emerges as a promising technology for B5G/6G, particularly in the millimeter-wave (mmWave) band. However, the widely utilized hybrid architecture in mmWave systems compromises multiplexing gain due to the constraints of limited radio frequency chains. Moreover, additional sensing functionalities exacerbate the impairment of spectrum efficiency (SE). In this paper, we present an optimized beam pattern modulation-embedded ISAC (BPM-ISAC) transceiver design, which spares one RF chain for sensing and the others for communication. To compensate for the reduced SE, index modulation across communication beams is applied. We formulate an optimization problem aimed at minimizing the mean squared error (MSE) of the sensing beampattern, subject to a symbol MSE constraint. This problem is then solved by sequentially optimizing the analog and digital parts. Both the multi-aperture structure (MAS) and the multi-beam structure (MBS) are considered for the design of the analog part. We conduct theoretical analysis on the asymptotic pairwise error probability (APEP) and the Cramér-Rao bound (CRB) of direction of arrival (DoA) estimation. Numerical simulations validate the overall enhanced ISAC performance over existing alternatives. △ Less

Submitted 15 May, 2024; originally announced May 2024.

arXiv:2405.09298 [pdf]

Deep Blur Multi-Model (DeepBlurMM) -- a strategy to mitigate the impact of image blur on deep learning model performance in histopathology image analysis

Authors: Yujie Xiang, Bo**g Liu, Mattias Rantalainen

Abstract: AI-based analysis of histopathology whole slide images (WSIs) is central in computational pathology. However, image quality, including unsharp areas of WSIs, impacts model performance. We investigate the impact of blur and propose a multi-model approach to mitigate negative impact of unsharp image areas. In this study, we use a simulation approach, evaluating model performance under varying levels… ▽ More AI-based analysis of histopathology whole slide images (WSIs) is central in computational pathology. However, image quality, including unsharp areas of WSIs, impacts model performance. We investigate the impact of blur and propose a multi-model approach to mitigate negative impact of unsharp image areas. In this study, we use a simulation approach, evaluating model performance under varying levels of added Gaussian blur to image tiles from >900 H&E-stained breast cancer WSIs. To reduce impact of blur, we propose a novel multi-model approach (DeepBlurMM) where multiple models trained on data with variable amounts of Gaussian blur are used to predict tiles based on their blur levels. Using histological grade as a principal example, we found that models trained with mildly blurred tiles improved performance over the base model when moderate-high blur was present. DeepBlurMM outperformed the base model in presence of moderate blur across all tiles (AUC:0.764 vs. 0.710), and in presence of a mix of low, moderate, and high blur across tiles (AUC:0.821 vs. 0.789). Unsharp image tiles in WSIs impact prediction performance. DeepBlurMM improved prediction performance under some conditions and has the potential to increase quality in both research and clinical applications. △ Less

Submitted 23 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

ACM Class: I.4; J.3

arXiv:2404.17736 [pdf, other]

Diffusion-Aided Joint Source Channel Coding For High Realism Wireless Image Transmission

Authors: Mingyu Yang, Bowen Liu, Boyang Wang, Hun-Seok Kim

Abstract: Deep learning-based joint source-channel coding (deep JSCC) has been demonstrated as an effective approach for wireless image transmission. Nevertheless, current research has concentrated on minimizing a standard distortion metric such as Mean Squared Error (MSE), which does not necessarily improve the perceptual quality. To address this issue, we propose DiffJSCC, a novel framework that leverages… ▽ More Deep learning-based joint source-channel coding (deep JSCC) has been demonstrated as an effective approach for wireless image transmission. Nevertheless, current research has concentrated on minimizing a standard distortion metric such as Mean Squared Error (MSE), which does not necessarily improve the perceptual quality. To address this issue, we propose DiffJSCC, a novel framework that leverages pre-trained text-to-image diffusion models to enhance the realism of images transmitted over the channel. The proposed DiffJSCC utilizes prior deep JSCC frameworks to deliver an initial reconstructed image at the receiver. Then, the spatial and textual features are extracted from the initial reconstruction, which, together with the channel state information (e.g., signal-to-noise ratio, SNR), are passed to a control module to fine-tune the pre-trained Stable Diffusion model. Extensive experiments on the Kodak dataset reveal that our method significantly surpasses both conventional methods and prior deep JSCC approaches on perceptual metrics such as LPIPS and FID scores, especially with poor channel conditions and limited bandwidth. Notably, DiffJSCC can achieve highly realistic reconstructions for 768x512 pixel Kodak images with only 3072 symbols (<0.008 symbols per pixel) under 1dB SNR. Our code will be released in https://github.com/mingyuyng/DiffJSCC. △ Less

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.07556 [pdf, other]

Attention-Aware Laparoscopic Image Desmoking Network with Lightness Embedding and Hybrid Guided Embedding

Authors: Ziteng Liu, Jiahua Zhu, Bainan Liu, Hao Liu, Wenpeng Gao, Yili Fu

Abstract: This paper presents a novel method of smoke removal from the laparoscopic images. Due to the heterogeneous nature of surgical smoke, a two-stage network is proposed to estimate the smoke distribution and reconstruct a clear, smoke-free surgical scene. The utilization of the lightness channel plays a pivotal role in providing vital information pertaining to smoke density. The reconstruction of smok… ▽ More This paper presents a novel method of smoke removal from the laparoscopic images. Due to the heterogeneous nature of surgical smoke, a two-stage network is proposed to estimate the smoke distribution and reconstruct a clear, smoke-free surgical scene. The utilization of the lightness channel plays a pivotal role in providing vital information pertaining to smoke density. The reconstruction of smoke-free image is guided by a hybrid embedding, which combines the estimated smoke mask with the initial image. Experimental results demonstrate that the proposed method boasts a Peak Signal to Noise Ratio that is $2.79\%$ higher than the state-of-the-art methods, while also exhibits a remarkable $38.2\%$ reduction in run-time. Overall, the proposed method offers comparable or even superior performance in terms of both smoke removal quality and computational efficiency when compared to existing state-of-the-art methods. This work will be publicly available on http://homepage.hit.edu.cn/wpgao △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: ISBI2024

arXiv:2403.15418 [pdf, other]

Stochastic Analysis of Touch-Tone Frequency Recognition in Two-Way Radio Systems for Dialed Telephone Number Identification

Authors: Liqiang Yu, Chen Li, Bo Liu, Chang Che

Abstract: This paper focuses on recognizing dialed numbers in a touch-tone telephone system based on the Dual Tone MultiFrequency (DTMF) signaling technique with analysis of stochastic aspects during the noise and random duration of characters. Each dialed digit's acoustic profile is derived from a composite of two carrier frequencies, distinctly assigned to represent that digit. The identification of each… ▽ More This paper focuses on recognizing dialed numbers in a touch-tone telephone system based on the Dual Tone MultiFrequency (DTMF) signaling technique with analysis of stochastic aspects during the noise and random duration of characters. Each dialed digit's acoustic profile is derived from a composite of two carrier frequencies, distinctly assigned to represent that digit. The identification of each digit is achieved by pinpointing the frequency pair with the highest energy or amplitude in its spectral output, utilizing the Discrete-Time Fourier Transform (DTFT). This analysis includes simulations that illustrate the effects of introducing stochastic variations during the "Mark" and "Space" intervals of the decoding process, offering insights into the technique's efficacy and the impact of random temporal fluctuations. This will reduce the accuracy of decoder decoding and lower the SNR of the system. △ Less

Submitted 8 March, 2024; originally announced March 2024.

Comments: It is accepted by The 7th International Conference on Advanced Algorithms and Control Engineering (ICAACE 2024)

arXiv:2403.15238 [pdf]

WEEP: A method for spatial interpretation of weakly supervised CNN models in computational pathology

Authors: Abhinav Sharma, Bo**g Liu, Mattias Rantalainen

Abstract: Deep learning enables the modelling of high-resolution histopathology whole-slide images (WSI). Weakly supervised learning of tile-level data is typically applied for tasks where labels only exist on the patient or WSI level (e.g. patient outcomes or histological grading). In this context, there is a need for improved spatial interpretability of predictions from such models. We propose a novel met… ▽ More Deep learning enables the modelling of high-resolution histopathology whole-slide images (WSI). Weakly supervised learning of tile-level data is typically applied for tasks where labels only exist on the patient or WSI level (e.g. patient outcomes or histological grading). In this context, there is a need for improved spatial interpretability of predictions from such models. We propose a novel method, Wsi rEgion sElection aPproach (WEEP), for model interpretation. It provides a principled yet straightforward way to establish the spatial area of WSI required for assigning a particular prediction label. We demonstrate WEEP on a binary classification task in the area of breast cancer computational pathology. WEEP is easy to implement, is directly connected to the model-based decision process, and offers information relevant to both research and diagnostic applications. △ Less

Submitted 8 April, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

arXiv:2403.13290 [pdf, other]

A Log-domain Interior Point Method for Convex Quadratic Games

Authors: Bingqi Liu, Dominic Liao-McPherson

Abstract: In this paper, we propose an equilibrium-seeking algorithm for finding generalized Nash equilibria of non-cooperative monotone convex quadratic games. Specifically, we recast the Nash equilibrium-seeking problem as variational inequality problem that we solve using a log-domain interior point method and provide a general purpose solver based on this algorithm. This approach is suitable for non-pot… ▽ More In this paper, we propose an equilibrium-seeking algorithm for finding generalized Nash equilibria of non-cooperative monotone convex quadratic games. Specifically, we recast the Nash equilibrium-seeking problem as variational inequality problem that we solve using a log-domain interior point method and provide a general purpose solver based on this algorithm. This approach is suitable for non-potential, general sum games and does not require extensive structural assumptions. We demonstrate the efficiency and versatility of our method using three benchmark games and demonstrate our algorithm is especially effective on small to medium scale problems. △ Less

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.10417 [pdf, other]

Beam Pattern Modulation Embedded mmWave Hybrid Transceiver Design Towards ISAC

Authors: Boxun Liu, Shijian Gao, Zonghui Yang, Xiang Cheng

Abstract: Integrated Sensing and Communication (ISAC) emerges as a promising technology for B5G/6G, particularly in the millimeter-wave (mmWave) band. However, the widespread adoption of hybrid architecture in mmWave systems compromises multiplexing gain due to limited radio-frequency chains, resulting in mediocre performance when embedding sensing functionality. To avoid sacrificing the spectrum efficiency… ▽ More Integrated Sensing and Communication (ISAC) emerges as a promising technology for B5G/6G, particularly in the millimeter-wave (mmWave) band. However, the widespread adoption of hybrid architecture in mmWave systems compromises multiplexing gain due to limited radio-frequency chains, resulting in mediocre performance when embedding sensing functionality. To avoid sacrificing the spectrum efficiency in hybrid structures while addressing performance bottlenecks in its extension to ISAC, we present an optimized beam pattern modulation-embedded ISAC (BPM-ISAC). BPM-ISAC applies index modulation over beamspace by selectively activating communication beams, aiming to minimize sensing beampattern mean squared error (MSE) under communication MSE constraints through dedicated hybrid transceiver design. Optimization involves the analog part through a min-MSE-based beam selection algorithm, followed by the digital part using an alternating optimization algorithm. Convergence and asymptotic pairwise error probability (APEP) analyses accompany numerical simulations, validating its overall enhanced ISAC performance over existing alternatives. △ Less

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2402.09442 [pdf]

Progress in artificial intelligence applications based on the combination of self-driven sensors and deep learning

Authors: Weixiang Wan, Wenjian Sun, Qiang Zeng, Linying Pan, **gyu Xu, Bo Liu

Abstract: In the era of Internet of Things, how to develop a smart sensor system with sustainable power supply, easy deployment and flexible use has become a difficult problem to be solved. The traditional power supply has problems such as frequent replacement or charging when in use, which limits the development of wearable devices. The contact-to-separate friction nanogenerator (TENG) was prepared by usin… ▽ More In the era of Internet of Things, how to develop a smart sensor system with sustainable power supply, easy deployment and flexible use has become a difficult problem to be solved. The traditional power supply has problems such as frequent replacement or charging when in use, which limits the development of wearable devices. The contact-to-separate friction nanogenerator (TENG) was prepared by using polychotomy thy lene (PTFE) and aluminum (AI) foils. Human motion energy was collected by human body arrangement, and human motion posture was monitored according to the changes of output electrical signals. In 2012, Academician Wang Zhong lin and his team invented the triboelectric nanogenerator (TENG), which uses Maxwell displacement current as a driving force to directly convert mechanical stimuli into electrical signals, so it can be used as a self-driven sensor. Teng-based sensors have the advantages of simple structure and high instantaneous power density, which provides an important means for building intelligent sensor systems. At the same time, machine learning, as a technology with low cost, short development cycle, strong data processing ability and prediction ability, has a significant effect on the processing of a large number of electrical signals generated by TENG, and the combination with TENG sensors will promote the rapid development of intelligent sensor networks in the future. Therefore, this paper is based on the intelligent sound monitoring and recognition system of TENG, which has good sound recognition capability, and aims to evaluate the feasibility of the sound perception module architecture in ubiquitous sensor networks. △ Less

Submitted 12 March, 2024; v1 submitted 30 January, 2024; originally announced February 2024.

Comments: This aticle was accepted by ieee conference

arXiv:2402.04267 [pdf]

doi 10.54097/LAwfJzEA

Application analysis of ai technology combined with spiral CT scanning in early lung cancer screening

Authors: Shulin Li, Liqiang Yu, Bo Liu, Qunwei Lin, Jiaxin Huang

Abstract: At present, the incidence and fatality rate of lung cancer in China rank first among all malignant tumors. Despite the continuous development and improvement of China's medical level, the overall 5-year survival rate of lung cancer patients is still lower than 20% and is staged. A number of studies have confirmed that early diagnosis and treatment of early stage lung cancer is of great significanc… ▽ More At present, the incidence and fatality rate of lung cancer in China rank first among all malignant tumors. Despite the continuous development and improvement of China's medical level, the overall 5-year survival rate of lung cancer patients is still lower than 20% and is staged. A number of studies have confirmed that early diagnosis and treatment of early stage lung cancer is of great significance to improve the prognosis of patients. In recent years, artificial intelligence technology has gradually begun to be applied in oncology. ai is used in cancer screening, clinical diagnosis, radiation therapy (image acquisition, at-risk organ segmentation, image calibration and delivery) and other aspects of rapid development. However, whether medical ai can be socialized depends on the public's attitude and acceptance to a certain extent. However, at present, there are few studies on the diagnosis of early lung cancer by AI technology combined with SCT scanning. In view of this, this study applied the combined method in early lung cancer screening, aiming to find a safe and efficient screening mode and provide a reference for clinical diagnosis and treatment. △ Less

Submitted 26 January, 2024; originally announced February 2024.

Comments: This article was accepted by Frontiers in Computing and Intelligent Systems https://drpress.org/ojs/index.php/fcis/article/view/15781. arXiv admin note: text overlap with arXiv:nlin/0508031 by other authors

arXiv:2402.02327 [pdf, other]

Bootstrap** Audio-Visual Segmentation by Strengthening Audio Cues

Authors: Tianxiang Chen, Zhentao Tan, Tao Gong, Qi Chu, Yue Wu, Bin Liu, Le Lu, Jie** Ye, Nenghai Yu

Abstract: How to effectively interact audio with vision has garnered considerable interest within the multi-modality research field. Recently, a novel audio-visual segmentation (AVS) task has been proposed, aiming to segment the sounding objects in video frames under the guidance of audio cues. However, most existing AVS methods are hindered by a modality imbalance where the visual features tend to dominate… ▽ More How to effectively interact audio with vision has garnered considerable interest within the multi-modality research field. Recently, a novel audio-visual segmentation (AVS) task has been proposed, aiming to segment the sounding objects in video frames under the guidance of audio cues. However, most existing AVS methods are hindered by a modality imbalance where the visual features tend to dominate those of the audio modality, due to a unidirectional and insufficient integration of audio cues. This imbalance skews the feature representation towards the visual aspect, impeding the learning of joint audio-visual representations and potentially causing segmentation inaccuracies. To address this issue, we propose AVSAC. Our approach features a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges, enhancing audio cues and fostering continuous interplay between audio and visual modalities. This bidirectional interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations. Additionally, we present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD. This strategy enhances the share of auditory components in visual features, contributing to a more balanced audio-visual representation learning. Extensive experiments show that our method attains new benchmarks in AVS performance. △ Less

Submitted 6 February, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

arXiv:2401.06272 [pdf, other]

Segmentation of Mediastinal Lymph Nodes in CT with Anatomical Priors

Authors: Tejas Sudharshan Mathai, Bohan Liu, Ronald M. Summers

Abstract: Purpose: Lymph nodes (LNs) in the chest have a tendency to enlarge due to various pathologies, such as lung cancer or pneumonia. Clinicians routinely measure nodal size to monitor disease progression, confirm metastatic cancer, and assess treatment response. However, variations in their shapes and appearances make it cumbersome to identify LNs, which reside outside of most organs. Methods: We prop… ▽ More Purpose: Lymph nodes (LNs) in the chest have a tendency to enlarge due to various pathologies, such as lung cancer or pneumonia. Clinicians routinely measure nodal size to monitor disease progression, confirm metastatic cancer, and assess treatment response. However, variations in their shapes and appearances make it cumbersome to identify LNs, which reside outside of most organs. Methods: We propose to segment LNs in the mediastinum by leveraging the anatomical priors of 28 different structures (e.g., lung, trachea etc.) generated by the public TotalSegmentator tool. The CT volumes from 89 patients available in the public NIH CT Lymph Node dataset were used to train three 3D nnUNet models to segment LNs. The public St. Olavs dataset containing 15 patients (out-of-training-distribution) was used to evaluate the segmentation performance. Results: For the 15 test patients, the 3D cascade nnUNet model obtained the highest Dice score of 72.2 +- 22.3 for mediastinal LNs with short axis diameter $\geq$ 8mm and 54.8 +- 23.8 for all LNs respectively. These results represent an improvement of 10 points over a current approach that was evaluated on the same test dataset. Conclusion: To our knowledge, we are the first to harness 28 distinct anatomical priors to segment mediastinal LNs, and our work can be extended to other nodal zones in the body. The proposed method has immense potential for improved patient outcomes through the identification of enlarged nodes in initial staging CT scans. △ Less

Submitted 11 January, 2024; originally announced January 2024.

Comments: Submitted to CARS 2024

arXiv:2401.05698 [pdf, other]

doi 10.1016/j.inffus.2024.102382

HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

Authors: Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao

Abstract: Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-ware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is meeting its bottleneck due to the longstanding data scarcity issue in AVER. Motivated by recent advances in… ▽ More Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-ware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is meeting its bottleneck due to the longstanding data scarcity issue in AVER. Motivated by recent advances in self-supervised learning, we propose Hierarchical Contrastive Masked Autoencoder (HiCMAE), a novel self-supervised framework that leverages large-scale self-supervised pre-training on vast unlabeled audio-visual data to promote the advancement of AVER. Following prior arts in self-supervised audio-visual representation learning, HiCMAE adopts two primary forms of self-supervision for pre-training, namely masked data modeling and contrastive learning. Unlike them which focus exclusively on top-layer representations while neglecting explicit guidance of intermediate layers, HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual feature learning and improve the overall quality of learned representations. To verify the effectiveness of HiCMAE, we conduct extensive experiments on 9 datasets covering both categorical and dimensional AVER tasks. Experimental results show that our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods, which indicates that HiCMAE is a powerful audio-visual emotion representation learner. Codes and models will be publicly available at https://github.com/sunlicai/HiCMAE. △ Less

Submitted 1 April, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: Accepted by Information Fusion. The code is available at https://github.com/sunlicai/HiCMAE

Journal ref: Information Fusion, 2024

arXiv:2312.15628 [pdf, other]

Balanced SNR-Aware Distillation for Guided Text-to-Audio Generation

Authors: Bingzhi Liu, Yin Cao, Haohe Liu, Yi Zhou

Abstract: Diffusion models have demonstrated promising results in text-to-audio generation tasks. However, their practical usability is hindered by slow sampling speeds, limiting their applicability in high-throughput scenarios. To address this challenge, progressive distillation methods have been effective in producing more compact and efficient models. Nevertheless, these methods encounter issues with unb… ▽ More Diffusion models have demonstrated promising results in text-to-audio generation tasks. However, their practical usability is hindered by slow sampling speeds, limiting their applicability in high-throughput scenarios. To address this challenge, progressive distillation methods have been effective in producing more compact and efficient models. Nevertheless, these methods encounter issues with unbalanced weights at both high and low noise levels, potentially impacting the quality of generated samples. In this paper, we propose the adaptation of the progressive distillation method to text-to-audio generation tasks and introduce the Balanced SNR-Aware~(BSA) method, an enhanced loss-weighting mechanism for diffusion distillation. The BSA method employs a balanced approach to weight the loss for both high and low noise levels. We evaluate our proposed method on the AudioCaps dataset and report experimental results showing superior performance during the reverse diffusion process compared to previous distillation methods with the same number of sampling steps. Furthermore, the BSA method allows for a significant reduction in sampling steps from 200 to 25, with minimal performance degradation when compared to the original teacher models. △ Less

Submitted 25 December, 2023; originally announced December 2023.

Comments: 5 pages

arXiv:2312.02199 [pdf, other]

USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery

Authors: Jeremy Irvin, Lucas Tao, Joanne Zhou, Yuntao Ma, Langston Nashold, Benjamin Liu, Andrew Y. Ng

Abstract: Large, self-supervised vision models have led to substantial advancements for automatically interpreting natural images. Recent works have begun tailoring these methods to remote sensing data which has rich structure with multi-sensor, multi-spectral, and temporal information providing massive amounts of self-labeled data that can be used for self-supervised pre-training. In this work, we develop… ▽ More Large, self-supervised vision models have led to substantial advancements for automatically interpreting natural images. Recent works have begun tailoring these methods to remote sensing data which has rich structure with multi-sensor, multi-spectral, and temporal information providing massive amounts of self-labeled data that can be used for self-supervised pre-training. In this work, we develop a new encoder architecture called USat that can input multi-spectral data from multiple sensors for self-supervised pre-training. USat is a vision transformer with modified patch projection layers and positional encodings to model spectral bands with varying spatial scales from multiple sensors. We integrate USat into a Masked Autoencoder (MAE) self-supervised pre-training procedure and find that a pre-trained USat outperforms state-of-the-art self-supervised MAE models trained on remote sensing data on multiple remote sensing benchmark datasets (up to 8%) and leads to improvements in low data regimes (up to 7%). Code and pre-trained weights are available at https://github.com/stanfordmlgroup/USat . △ Less

Submitted 2 December, 2023; originally announced December 2023.

arXiv:2311.00996 [pdf, other]

VCISR: Blind Single Image Super-Resolution with Video Compression Synthetic Data

Authors: Boyang Wang, Bowen Liu, Shiyu Liu, Fengyu Yang

Abstract: In the blind single image super-resolution (SISR) task, existing works have been successful in restoring image-level unknown degradations. However, when a single video frame becomes the input, these works usually fail to address degradations caused by video compression, such as mosquito noise, ringing, blockiness, and staircase noise. In this work, we for the first time, present a video compressio… ▽ More In the blind single image super-resolution (SISR) task, existing works have been successful in restoring image-level unknown degradations. However, when a single video frame becomes the input, these works usually fail to address degradations caused by video compression, such as mosquito noise, ringing, blockiness, and staircase noise. In this work, we for the first time, present a video compression-based degradation model to synthesize low-resolution image data in the blind SISR task. Our proposed image synthesizing method is widely applicable to existing image datasets, so that a single degraded image can contain distortions caused by the lossy video compression algorithms. This overcomes the leak of feature diversity in video data and thus retains the training efficiency. By introducing video coding artifacts to SISR degradation models, neural networks can super-resolve images with the ability to restore video compression degradations, and achieve better results on restoring generic distortions caused by image compression as well. Our proposed approach achieves superior performance in SOTA no-reference Image Quality Assessment, and shows better visual quality on various datasets. In addition, we evaluate the SISR neural network trained with our degradation model on video super-resolution (VSR) datasets. Compared to architectures specifically designed for the VSR purpose, our method exhibits similar or better performance, evidencing that the presented strategy on infusing video-based degradation is generalizable to address more complicated compression artifacts even without temporal cues. △ Less

Submitted 22 November, 2023; v1 submitted 2 November, 2023; originally announced November 2023.

arXiv:2310.08080 [pdf]

RT-SRTS: Angle-Agnostic Real-Time Simultaneous 3D Reconstruction and Tumor Segmentation from Single X-Ray Projection

Authors: Miao Zhu, Qiming Fu, Bo Liu, Mengxi Zhang, Bojian Li, Xiaoyan Luo, Fugen Zhou

Abstract: Radiotherapy is one of the primary treatment methods for tumors, but the organ movement caused by respiration limits its accuracy. Recently, 3D imaging from a single X-ray projection has received extensive attention as a promising approach to address this issue. However, current methods can only reconstruct 3D images without directly locating the tumor and are only validated for fixed-angle imagin… ▽ More Radiotherapy is one of the primary treatment methods for tumors, but the organ movement caused by respiration limits its accuracy. Recently, 3D imaging from a single X-ray projection has received extensive attention as a promising approach to address this issue. However, current methods can only reconstruct 3D images without directly locating the tumor and are only validated for fixed-angle imaging, which fails to fully meet the requirements of motion control in radiotherapy. In this study, a novel imaging method RT-SRTS is proposed which integrates 3D imaging and tumor segmentation into one network based on multi-task learning (MTL) and achieves real-time simultaneous 3D reconstruction and tumor segmentation from a single X-ray projection at any angle. Furthermore, the attention enhanced calibrator (AEC) and uncertain-region elaboration (URE) modules have been proposed to aid feature extraction and improve segmentation accuracy. The proposed method was evaluated on fifteen patient cases and compared with three state-of-the-art methods. It not only delivers superior 3D reconstruction but also demonstrates commendable tumor segmentation results. Simultaneous reconstruction and segmentation can be completed in approximately 70 ms, significantly faster than the required time threshold for real-time tumor tracking. The efficacies of both AEC and URE have also been validated in ablation studies. The code of work is available at https://github.com/ZywooSimple/RT-SRTS. △ Less

Submitted 28 March, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

arXiv:2309.07198 [pdf, other]

doi 10.1364/OL.515429

Temporal compressive edge imaging enabled by a lensless diffuser camera

Authors: Ze Zheng, Baolei Liu, Jiaqi Song, Lei Ding, Xiaolan Zhong, David Mcgloin, Fan Wang

Abstract: Lensless imagers based on diffusers or encoding masks enable high-dimensional imaging from a single shot measurement and have been applied in various applications. However, to further extract image information such as edge detection, conventional post-processing filtering operations are needed after the reconstruction of the original object images in the diffuser imaging systems. Here, we present… ▽ More Lensless imagers based on diffusers or encoding masks enable high-dimensional imaging from a single shot measurement and have been applied in various applications. However, to further extract image information such as edge detection, conventional post-processing filtering operations are needed after the reconstruction of the original object images in the diffuser imaging systems. Here, we present the concept of a temporal compressive edge detection method based on a lensless diffuser camera, which can directly recover a time sequence of edge images of a moving object from a single-shot measurement, without further post-processing steps. Our approach provides higher image quality during edge detection, compared with the conventional post-processing method. We demonstrate the effectiveness of this approach by both numerical simulation and experiments. The proof-of-concept approach can be further developed with other image post-process operations or versatile computer vision assignments toward task-oriented intelligent lensless imaging systems. △ Less

Submitted 13 September, 2023; originally announced September 2023.

Comments: 5 pages, 4 figures

Journal ref: Optics Letters, 49(11), 3058-3061 (2024)

arXiv:2308.13234 [pdf, other]

Decoding Natural Images from EEG for Object Recognition

Authors: Yonghao Song, Bingchuan Liu, Xiang Li, Nanlin Shi, Yijun Wang, Xiaorong Gao

Abstract: Electroencephalography (EEG) signals, known for convenient non-invasive acquisition but low signal-to-noise ratio, have recently gained substantial attention due to the potential to decode natural images. This paper presents a self-supervised framework to demonstrate the feasibility of learning image representations from EEG signals, particularly for object recognition. The framework utilizes imag… ▽ More Electroencephalography (EEG) signals, known for convenient non-invasive acquisition but low signal-to-noise ratio, have recently gained substantial attention due to the potential to decode natural images. This paper presents a self-supervised framework to demonstrate the feasibility of learning image representations from EEG signals, particularly for object recognition. The framework utilizes image and EEG encoders to extract features from paired image stimuli and EEG responses. Contrastive learning aligns these two modalities by constraining their similarity. With the framework, we attain significantly above-chance results on a comprehensive EEG-image dataset, achieving a top-1 accuracy of 15.6% and a top-5 accuracy of 42.8% in challenging 200-way zero-shot tasks. Moreover, we perform extensive experiments to explore the biological plausibility by resolving the temporal, spatial, spectral, and semantic aspects of EEG signals. Besides, we introduce attention modules to capture spatial correlations, providing implicit evidence of the brain activity perceived from EEG data. These findings yield valuable insights for neural decoding and brain-computer interfaces in real-world scenarios. The code will be released on https://github.com/eeyhsong/NICE-EEG. △ Less

Submitted 4 April, 2024; v1 submitted 25 August, 2023; originally announced August 2023.

Comments: ICLR, 2024

arXiv:2308.09223 [pdf, other]

DMCVR: Morphology-Guided Diffusion Model for 3D Cardiac Volume Reconstruction

Authors: Xiaoxiao He, Chaowei Tan, Ligong Han, Bo Liu, Leon Axel, Kang Li, Dimitris N. Metaxas

Abstract: Accurate 3D cardiac reconstruction from cine magnetic resonance imaging (cMRI) is crucial for improved cardiovascular disease diagnosis and understanding of the heart's motion. However, current cardiac MRI-based reconstruction technology used in clinical settings is 2D with limited through-plane resolution, resulting in low-quality reconstructed cardiac volumes. To better reconstruct 3D cardiac vo… ▽ More Accurate 3D cardiac reconstruction from cine magnetic resonance imaging (cMRI) is crucial for improved cardiovascular disease diagnosis and understanding of the heart's motion. However, current cardiac MRI-based reconstruction technology used in clinical settings is 2D with limited through-plane resolution, resulting in low-quality reconstructed cardiac volumes. To better reconstruct 3D cardiac volumes from sparse 2D image stacks, we propose a morphology-guided diffusion model for 3D cardiac volume reconstruction, DMCVR, that synthesizes high-resolution 2D images and corresponding 3D reconstructed volumes. Our method outperforms previous approaches by conditioning the cardiac morphology on the generative model, eliminating the time-consuming iterative optimization process of the latent code, and improving generation quality. The learned latent spaces provide global semantics, local cardiac morphology and details of each 2D cMRI slice with highly interpretable value to reconstruct 3D cardiac shape. Our experiments show that DMCVR is highly effective in several aspects, such as 2D generation and 3D reconstruction performance. With DMCVR, we can produce high-resolution 3D cardiac MRI reconstructions, surpassing current techniques. Our proposed framework has great potential for improving the accuracy of cardiac disease diagnosis and treatment planning. Code can be accessed at https://github.com/hexiaoxiao-cs/DMCVR. △ Less

Submitted 17 August, 2023; originally announced August 2023.

Comments: Accepted in MICCAI 2023

arXiv:2306.14125 [pdf, other]

M$^3$SC: A Generic Dataset for Mixed Multi-Modal (MMM) Sensing and Communication Integration

Authors: Xiang Cheng, Ziwei Huang, Lu Bai, Haotian Zhang, Mingran Sun, Boxun Liu, Sijiang Li, Jianan Zhang, Minson Lee

Abstract: The sixth generation (6G) of mobile communication system is witnessing a new paradigm shift, i.e., integrated sensing-communication system. A comprehensive dataset is a prerequisite for 6G integrated sensing-communication research. This paper develops a novel simulation dataset, named M3SC, for mixed multi-modal (MMM) sensing-communication integration, and the generation framework of the M3SC data… ▽ More The sixth generation (6G) of mobile communication system is witnessing a new paradigm shift, i.e., integrated sensing-communication system. A comprehensive dataset is a prerequisite for 6G integrated sensing-communication research. This paper develops a novel simulation dataset, named M3SC, for mixed multi-modal (MMM) sensing-communication integration, and the generation framework of the M3SC dataset is further given. To obtain multi-modal sensory data in physical space and communication data in electromagnetic space, we utilize AirSim and WaveFarer to collect multi-modal sensory data and exploit Wireless InSite to collect communication data. Furthermore, the in-depth integration and precise alignment of AirSim, WaveFarer, and Wireless InSite are achieved. The M3SC dataset covers various weather conditions, various frequency bands, and different times of the day. Currently, the M3SC dataset contains 1500 snapshots, including 80 RGB images, 160 depth maps, 80 LiDAR point clouds, 256 sets of mmWave waveforms with 8 radar point clouds, and 72 channel impulse response (CIR) matrices per snapshot, thus totaling 120,000 RGB images, 240,000 depth maps, 120,000 LiDAR point clouds, 384,000 sets of mmWave waveforms with 12,000 radar point clouds, and 108,000 CIR matrices. The data processing result presents the multi-modal sensory information and communication channel statistical properties. Finally, the MMM sensing-communication application, which can be supported by the M3SC dataset, is discussed. △ Less

Submitted 25 June, 2023; originally announced June 2023.

Comments: 12 pages, 12 figures

arXiv:2306.07505 [pdf]

Deep learning radiomics for assessment of gastroesophageal varices in people with compensated advanced chronic liver disease

Authors: Lan Wang, Ruiling He, Lili Zhao, Jia Wang, Zhengzi Geng, Tao Ren, Guo Zhang, Peng Zhang, Kaiqiang Tang, Chaofei Gao, Fei Chen, Liting Zhang, Yonghe Zhou, Xin Li, Fanbin He, Hui Huan, Wenjuan Wang, Yunxiao Liang, Juan Tang, Fang Ai, Tingyu Wang, Liyun Zheng, Zhongwei Zhao, Jiansong Ji, Wei Liu , et al. (22 additional authors not shown)

Abstract: Objective: Bleeding from gastroesophageal varices (GEV) is a medical emergency associated with high mortality. We aim to construct an artificial intelligence-based model of two-dimensional shear wave elastography (2D-SWE) of the liver and spleen to precisely assess the risk of GEV and high-risk gastroesophageal varices (HRV). Design: A prospective multicenter study was conducted in patients with… ▽ More Objective: Bleeding from gastroesophageal varices (GEV) is a medical emergency associated with high mortality. We aim to construct an artificial intelligence-based model of two-dimensional shear wave elastography (2D-SWE) of the liver and spleen to precisely assess the risk of GEV and high-risk gastroesophageal varices (HRV). Design: A prospective multicenter study was conducted in patients with compensated advanced chronic liver disease. 305 patients were enrolled from 12 hospitals, and finally 265 patients were included, with 1136 liver stiffness measurement (LSM) images and 1042 spleen stiffness measurement (SSM) images generated by 2D-SWE. We leveraged deep learning methods to uncover associations between image features and patient risk, and thus conducted models to predict GEV and HRV. Results: A multi-modality Deep Learning Risk Prediction model (DLRP) was constructed to assess GEV and HRV, based on LSM and SSM images, and clinical information. Validation analysis revealed that the AUCs of DLRP were 0.91 for GEV (95% CI 0.90 to 0.93, p < 0.05) and 0.88 for HRV (95% CI 0.86 to 0.89, p < 0.01), which were significantly and robustly better than canonical risk indicators, including the value of LSM and SSM. Moreover, DLPR was better than the model using individual parameters, including LSM and SSM images. In HRV prediction, the 2D-SWE images of SSM outperform LSM (p < 0.01). Conclusion: DLRP shows excellent performance in predicting GEV and HRV over canonical risk indicators LSM and SSM. Additionally, the 2D-SWE images of SSM provided more information for better accuracy in predicting HRV than the LSM. △ Less

Submitted 12 June, 2023; originally announced June 2023.

arXiv:2305.18205 [pdf, other]

Pulse shape discrimination based on the Tempotron: a powerful classifier on GPU

Authors: Haoran Liu, Peng Li, Ming-Zhe Liu, Kai-Ming Wang, Zhuo Zuo, Bing-Qi Liu

Abstract: This study introduces the Tempotron, a powerful classifier based on a third-generation neural network model, for pulse shape discrimination. By eliminating the need for manual feature extraction, the Tempotron model can process pulse signals directly, generating discrimination results based on learned prior knowledge. The study performed experiments using GPU acceleration, resulting in over a 500… ▽ More This study introduces the Tempotron, a powerful classifier based on a third-generation neural network model, for pulse shape discrimination. By eliminating the need for manual feature extraction, the Tempotron model can process pulse signals directly, generating discrimination results based on learned prior knowledge. The study performed experiments using GPU acceleration, resulting in over a 500 times speedup compared to the CPU-based model, and investigated the impact of noise augmentation on the Tempotron's performance. Experimental results showed that the Tempotron is a potent classifier capable of achieving high discrimination accuracy. Furthermore, analyzing the neural activity of Tempotron during training shed light on its learning characteristics and aided in selecting the Tempotron's hyperparameters. The dataset used in this study and the source code of the GPU-based Tempotron are publicly available on GitHub at https://github.com/HaoranLiu507/TempotronGPU. △ Less

Submitted 26 May, 2023; originally announced May 2023.

Comments: 14 pages,7 figures

arXiv:2305.10788 [pdf, other]

Whisper-KDQ: A Lightweight Whisper via Guided Knowledge Distillation and Quantization for Efficient ASR

Authors: Hang Shao, Wei Wang, Bei Liu, Xun Gong, Haoyu Wang, Yanmin Qian

Abstract: Due to the rapid development of computing hardware resources and the dramatic growth of data, pre-trained models in speech recognition, such as Whisper, have significantly improved the performance of speech recognition tasks. However, these models usually have a high computational overhead, making it difficult to execute effectively on resource-constrained devices. To speed up inference and reduce… ▽ More Due to the rapid development of computing hardware resources and the dramatic growth of data, pre-trained models in speech recognition, such as Whisper, have significantly improved the performance of speech recognition tasks. However, these models usually have a high computational overhead, making it difficult to execute effectively on resource-constrained devices. To speed up inference and reduce model size while maintaining performance, we propose a novel guided knowledge distillation and quantization for large pre-trained model Whisper. The student model selects distillation and quantization layers based on quantization loss and distillation loss, respectively. We compressed $\text{Whisper}_\text{small}$ to $\text{Whisper}_\text{base}$ and $\text{Whisper}_\text{tiny}$ levels, making $\text{Whisper}_\text{small}$ 5.18x/10.48x smaller, respectively. Moreover, compared to the original $\text{Whisper}_\text{base}$ and $\text{Whisper}_\text{tiny}$, there is also a relative character error rate~(CER) reduction of 11.3% and 14.0% for the new compressed model respectively. △ Less

Submitted 18 May, 2023; originally announced May 2023.

arXiv:2304.12804 [pdf, other]

Channel Estimation and Signal Detection for NLOS Ultraviolet Scattering Communication with Space Division Multiple Access

Authors: Yubo Zhang, Yuchen Pan, Chen Gong, Beiyuan Liu, Zhengyuan Xu

Abstract: We design a receiver assembling several photomultipliers (PMTs) as an array to increase the field of view (FOV) of the receiver and adapt to multiuser situation over None-line-of-sight (NLOS) ultraviolet (UV) channels. Channel estimation and signal detection have been investigated according to the space division characteristics of the structure. Firstly, we adopt the balanced structure on the pilo… ▽ More We design a receiver assembling several photomultipliers (PMTs) as an array to increase the field of view (FOV) of the receiver and adapt to multiuser situation over None-line-of-sight (NLOS) ultraviolet (UV) channels. Channel estimation and signal detection have been investigated according to the space division characteristics of the structure. Firstly, we adopt the balanced structure on the pilot matrix, analyze the channel estimation mean square error (MSE), and optimize the structure parameters. Then, with the estimated parameters, an analytical threshold detection rule is proposed as a preliminary work of multiuser detection. The detection rule can be optimized by analyzing the separability of two users based on the Gaussian approximation of Poisson weighted sum. To assess the effect of imperfect estimation, the sensitivity analysis of channel estimation error on two-user signal detection is performed. Moreover, we propose a successive elimination method for on-off keying (OOK) modulated multiuser symbol detection based on the previous threshold detection rule. A closed-form upper bound on the detection error rate is calculated, which turns out to be a good approximation of that of multiuser maximum-likelihood (ML) detection. The proposed successive elimination method is twenty times faster than the ML detection with negligible detection error rate degradation. △ Less

Submitted 25 April, 2023; originally announced April 2023.

arXiv:2304.09607 [pdf, other]

CB-Conformer: Contextual biasing Conformer for biased word recognition

Authors: Yaoxun Xu, Baiji Liu, Qiaochu Huang and, Xingchen Song, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract: Due to the mismatch between the source and target domains, how to better utilize the biased word information to improve the performance of the automatic speech recognition model in the target domain becomes a hot research topic. Previous approaches either decode with a fixed external language model or introduce a sizeable biasing module, which leads to poor adaptability and slow inference. In this… ▽ More Due to the mismatch between the source and target domains, how to better utilize the biased word information to improve the performance of the automatic speech recognition model in the target domain becomes a hot research topic. Previous approaches either decode with a fixed external language model or introduce a sizeable biasing module, which leads to poor adaptability and slow inference. In this work, we propose CB-Conformer to improve biased word recognition by introducing the Contextual Biasing Module and the Self-Adaptive Language Model to vanilla Conformer. The Contextual Biasing Module combines audio fragments and contextual information, with only 0.2% model parameters of the original Conformer. The Self-Adaptive Language Model modifies the internal weights of biased words based on their recall and precision, resulting in a greater focus on biased words and more successful integration with the automatic speech recognition model than the standard fixed language model. In addition, we construct and release an open-source Mandarin biased-word dataset based on WenetSpeech. Experiments indicate that our proposed method brings a 15.34% character error rate reduction, a 14.13% biased word recall increase, and a 6.80% biased word F1-score increase compared with the base Conformer. △ Less

Submitted 25 April, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

arXiv:2304.02273 [pdf, other]

MMVC: Learned Multi-Mode Video Compression with Block-based Prediction Mode Selection and Density-Adaptive Entropy Coding

Authors: Bowen Liu, Yu Chen, Rakesh Chowdary Machineni, Shiyu Liu, Hun-Seok Kim

Abstract: Learning-based video compression has been extensively studied over the past years, but it still has limitations in adapting to various motion patterns and entropy models. In this paper, we propose multi-mode video compression (MMVC), a block wise mode ensemble deep video compression framework that selects the optimal mode for feature domain prediction adapting to different motion patterns. Propose… ▽ More Learning-based video compression has been extensively studied over the past years, but it still has limitations in adapting to various motion patterns and entropy models. In this paper, we propose multi-mode video compression (MMVC), a block wise mode ensemble deep video compression framework that selects the optimal mode for feature domain prediction adapting to different motion patterns. Proposed multi-modes include ConvLSTM-based feature domain prediction, optical flow conditioned feature domain prediction, and feature propagation to address a wide range of cases from static scenes without apparent motions to dynamic scenes with a moving camera. We partition the feature space into blocks for temporal prediction in spatial block-based representations. For entropy coding, we consider both dense and sparse post-quantization residual blocks, and apply optional run-length coding to sparse residuals to improve the compression rate. In this sense, our method uses a dual-mode entropy coding scheme guided by a binary density map, which offers significant rate reduction surpassing the extra cost of transmitting the binary selection map. We validate our scheme with some of the most popular benchmarking datasets. Compared with state-of-the-art video compression schemes and standard codecs, our method yields better or competitive results measured with PSNR and MS-SSIM. △ Less

Submitted 5 April, 2023; originally announced April 2023.

arXiv:2303.14357 [pdf, other]

Dealing With Heterogeneous 3D MR Knee Images: A Federated Few-Shot Learning Method With Dual Knowledge Distillation

Authors: Xiaoxiao He, Chaowei Tan, Bo Liu, Li** Si, Weiwu Yao, Liang Zhao, Di Liu, Qilong Zhangli, Qi Chang, Kang Li, Dimitris N. Metaxas

Abstract: Federated Learning has gained popularity among medical institutions since it enables collaborative training between clients (e.g., hospitals) without aggregating data. However, due to the high cost associated with creating annotations, especially for large 3D image datasets, clinical institutions do not have enough supervised data for training locally. Thus, the performance of the collaborative mo… ▽ More Federated Learning has gained popularity among medical institutions since it enables collaborative training between clients (e.g., hospitals) without aggregating data. However, due to the high cost associated with creating annotations, especially for large 3D image datasets, clinical institutions do not have enough supervised data for training locally. Thus, the performance of the collaborative model is subpar under limited supervision. On the other hand, large institutions have the resources to compile data repositories with high-resolution images and labels. Therefore, individual clients can utilize the knowledge acquired in the public data repositories to mitigate the shortage of private annotated images. In this paper, we propose a federated few-shot learning method with dual knowledge distillation. This method allows joint training with limited annotations across clients without jeopardizing privacy. The supervised learning of the proposed method extracts features from limited labeled data in each client, while the unsupervised data is used to distill both feature and response-based knowledge from a national data repository to further improve the accuracy of the collaborative model and reduce the communication cost. Extensive evaluations are conducted on 3D magnetic resonance knee images from a private clinical dataset. Our proposed method shows superior performance and less training time than other semi-supervised federated learning methods. Codes and additional visualization results are available at https://github.com/hexiaoxiao-cs/fedml-knee. △ Less

Submitted 17 April, 2023; v1 submitted 25 March, 2023; originally announced March 2023.

arXiv:2302.13161 [pdf, other]

Cybersecurity Challenges of Power Transformers

Authors: Hossein Rahimpour, Joe Tusek, Alsharif Abuadbba, Aruna Seneviratne, Toan Phung, Ahmed Musleh, Boyu Liu

Abstract: The rise of cyber threats on critical infrastructure and its potential for devastating consequences, has significantly increased. The dependency of new power grid technology on information, data analytic and communication systems make the entire electricity network vulnerable to cyber threats. Power transformers play a critical role within the power grid and are now commonly enhanced through facto… ▽ More The rise of cyber threats on critical infrastructure and its potential for devastating consequences, has significantly increased. The dependency of new power grid technology on information, data analytic and communication systems make the entire electricity network vulnerable to cyber threats. Power transformers play a critical role within the power grid and are now commonly enhanced through factory add-ons or intelligent monitoring systems added later to improve the condition monitoring of critical and long lead time assets such as transformers. However, the increased connectivity of those power transformers opens the door to more cyber attacks. Therefore, the need to detect and prevent cyber threats is becoming critical. The first step towards that would be a deeper understanding of the potential cyber-attacks landscape against power transformers. Much of the existing literature pays attention to smart equipment within electricity distribution networks, and most methods proposed are based on model-based detection algorithms. Moreover, only a few of these works address the security vulnerabilities of power elements, especially transformers within the transmission network. To the best of our knowledge, there is no study in the literature that systematically investigate the cybersecurity challenges against the newly emerged smart transformers. This paper addresses this shortcoming by exploring the vulnerabilities and the attack vectors of power transformers within electricity networks, the possible attack scenarios and the risks associated with these attacks. △ Less

Submitted 25 March, 2023; v1 submitted 25 February, 2023; originally announced February 2023.

Comments: 11 pages

arXiv:2302.11728 [pdf, other]

doi 10.1109/ICIP49359.2023.10222276

A Convolutional-Transformer Network for Crack Segmentation with Boundary Awareness

Authors: Huaqi Tao, Bingxi Liu, **qiang Cui, Hong Zhang

Abstract: Cracks play a crucial role in assessing the safety and durability of manufactured buildings. However, the long and sharp topological features and complex background of cracks make the task of crack segmentation extremely challenging. In this paper, we propose a novel convolutional-transformer network based on encoder-decoder architecture to solve this challenge. Particularly, we designed a Dilated… ▽ More Cracks play a crucial role in assessing the safety and durability of manufactured buildings. However, the long and sharp topological features and complex background of cracks make the task of crack segmentation extremely challenging. In this paper, we propose a novel convolutional-transformer network based on encoder-decoder architecture to solve this challenge. Particularly, we designed a Dilated Residual Block (DRB) and a Boundary Awareness Module (BAM). The DRB pays attention to the local detail of cracks and adjusts the feature dimension for other blocks as needed. And the BAM learns the boundary features from the dilated crack label. Furthermore, the DRB is combined with a lightweight transformer that captures global information to serve as an effective encoder. Experimental results show that the proposed network performs better than state-of-the-art algorithms on two typical datasets. Datasets, code, and trained models are available for research at https://github.com/HqiTao/CT-crackseg. △ Less

Submitted 11 November, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

Comments: Accepted to ICIP 2023

arXiv:2302.07269 [pdf, other]

doi 10.1364/OE.486290

Dual-mode adaptive-SVD ghost imaging

Authors: Da**g Wang, Baolei Liu, Jiaqi Song, Yao Wang, Xuchen Shan, Fan Wang

Abstract: In this paper, we present a dual-mode adaptive singular value decomposition ghost imaging (A-SVD GI), which can be easily switched between the modes of imaging and edge detection. It can adaptively localize the foreground pixels via a threshold selection method. Then only the foreground region is illuminated by the singular value decomposition (SVD) - based patterns, consequently retrieving high-q… ▽ More In this paper, we present a dual-mode adaptive singular value decomposition ghost imaging (A-SVD GI), which can be easily switched between the modes of imaging and edge detection. It can adaptively localize the foreground pixels via a threshold selection method. Then only the foreground region is illuminated by the singular value decomposition (SVD) - based patterns, consequently retrieving high-quality images with fewer sampling ratios. By changing the selecting range of foreground pixels, the A-SVD GI can be switched to the mode of edge detection to directly reveal the edge of objects, without needing the original image. We investigate the performance of these two modes through both numerical simulations and experiments. We also develop a single-round scheme to halve measurement numbers in experiments, instead of separately illuminating positive and negative patterns in traditional methods. The binarized SVD patterns, generated by the spatial dithering method, are modulated by a digital micromirror device (DMD) to speed up the data acquisition. This dual-mode A-SVD GI can be applied in various applications, such as remote sensing or target recognition, and could be further extended for multi-modality functional imaging/detection. △ Less

Submitted 14 February, 2023; originally announced February 2023.

arXiv:2302.05726 [pdf, other]

Enhance Local Consistency in Federated Learning: A Multi-Step Inertial Momentum Approach

Authors: Yixing Liu, Yan Sun, Zhengtao Ding, Li Shen, Bo Liu, Dacheng Tao

Abstract: Federated learning (FL), as a collaborative distributed training paradigm with several edge computing devices under the coordination of a centralized server, is plagued by inconsistent local stationary points due to the heterogeneity of the local partial participation clients, which precipitates the local client-drifts problems and sparks off the unstable and slow convergence, especially on the ag… ▽ More Federated learning (FL), as a collaborative distributed training paradigm with several edge computing devices under the coordination of a centralized server, is plagued by inconsistent local stationary points due to the heterogeneity of the local partial participation clients, which precipitates the local client-drifts problems and sparks off the unstable and slow convergence, especially on the aggravated heterogeneous dataset. To address these issues, we propose a novel federated learning algorithm, named FedMIM, which adopts the multi-step inertial momentum on the edge devices and enhances the local consistency for free during the training to improve the robustness of the heterogeneity. Specifically, we incorporate the weighted global gradient estimations as the inertial correction terms to guide both the local iterates and stochastic gradient estimation, which can reckon the global objective optimization on the edges' heterogeneous dataset naturally and maintain the demanding consistent iteration locally. Theoretically, we show that FedMIM achieves the $\mathcal{O}(\frac{1}{\sqrt{SKT}})$ convergence rate with a linear speedup property with respect to the number of selected clients $S$ and proper local interval $K$ in communication round $T$ without convex assumption. Empirically, we conduct comprehensive experiments on various real-world datasets and demonstrate the efficacy of the proposed FedMIM against several state-of-the-art baselines. △ Less

Submitted 11 February, 2023; originally announced February 2023.

arXiv:2301.00657 [pdf, other]

MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset

Authors: Kailin Liang, Bin Liu, Yifan Hu, Rui Liu, Feilong Bao, Guanglai Gao

Abstract: Text-to-Speech (TTS) synthesis for low-resource languages is an attractive research issue in academia and industry nowadays. Mongolian is the official language of the Inner Mongolia Autonomous Region and a representative low-resource language spoken by over 10 million people worldwide. However, there is a relative lack of open-source datasets for Mongolian TTS. Therefore, we make public an open-so… ▽ More Text-to-Speech (TTS) synthesis for low-resource languages is an attractive research issue in academia and industry nowadays. Mongolian is the official language of the Inner Mongolia Autonomous Region and a representative low-resource language spoken by over 10 million people worldwide. However, there is a relative lack of open-source datasets for Mongolian TTS. Therefore, we make public an open-source multi-speaker Mongolian TTS dataset, named MnTTS2, for the benefit of related researchers. In this work, we prepare the transcription from various topics and invite three professional Mongolian announcers to form a three-speaker TTS dataset, in which each announcer records 10 hours of speeches in Mongolian, resulting 30 hours in total. Furthermore, we build the baseline system based on the state-of-the-art FastSpeech2 model and HiFi-GAN vocoder. The experimental results suggest that the constructed MnTTS2 dataset is sufficient to build robust multi-speaker TTS models for real-world applications. The MnTTS2 dataset, training recipe, and pretrained models are released at: \url{https://github.com/ssmlkl/MnTTS2} △ Less

Submitted 11 December, 2022; originally announced January 2023.

Comments: Accepted by NCMMSC'2022 (https://ncmmsc2022.ustc.edu.cn/main.htm)

arXiv:2301.00308 [pdf, other]

High-Accuracy Absolute-Position-Aided Code Phase Tracking Based on RTK/INS Deep Integration in Challenging Static Scenarios

Authors: Yiran Luo, Li-Ta Hsu, Yang Jiang, Baoyu Liu, Zhetao Zhang, Yan Xiang, Naser El-Sheimy

Abstract: Many multi-sensor navigation systems urgently demand accurate positioning initialization from global navigation satellite systems (GNSSs) in challenging static scenarios. However, ground blockages against line-of-sight (LOS) signal reception make it difficult for GNSS users. Steering local codes in GNSS basebands is a desiring way to correct instantaneous signal phase misalignment, efficiently gat… ▽ More Many multi-sensor navigation systems urgently demand accurate positioning initialization from global navigation satellite systems (GNSSs) in challenging static scenarios. However, ground blockages against line-of-sight (LOS) signal reception make it difficult for GNSS users. Steering local codes in GNSS basebands is a desiring way to correct instantaneous signal phase misalignment, efficiently gathering useful signal power and increasing positioning accuracy. Besides, inertial navigation systems (INSs) have been used as a well-complementary dead reckoning (DR) sensor for GNSS receivers in kinematic scenarios resisting various interferences since early. But little work focuses on the case of whether the INS can improve GNSS receivers in static scenarios. Thus, this paper proposes an enhanced navigation system deeply integrated with low-cost INS solutions and GNSS high-accuracy carrier-based positioning. First, an absolute code phase is predicted from base station information, and integrated solution of the INS DR and real-time kinematic (RTK) results through an extended Kalman filter (EKF). Then, a numerically controlled oscillator (NCO) leverages the predicted code phase to improve the alignment between instantaneous local code phases and received ones. The proposed algorithm is realized in a vector-tracking GNSS software-defined radio (SDR). Real-world experiments demonstrate the proposed SDR regarding estimating time-of-arrival (TOA) and positioning accuracy. △ Less

Submitted 31 December, 2022; originally announced January 2023.

Comments: 27 pages, 18 figures

arXiv:2212.05439 [pdf, other]

doi 10.1016/j.buildenv.2022.109950

Personalized local heating neutralizing individual, spatial and temporal thermo-physiological variances in extreme cold environments

Authors: Yi Ju, Xinyuan Ju, Hui Zhang, Bin Cao, Bin Liu, Yingxin Zhu

Abstract: In this paper, we investigate the feasibility, robustness and optimization of introducing personal comfort systems (PCS), apparatuses that promises in energy saving and comfort improvement, into a broader range of environments. We report a series of laboratory experiments systematically examining the effect of personalized heating in neutralizing individual, spatial and temporal variations of ther… ▽ More In this paper, we investigate the feasibility, robustness and optimization of introducing personal comfort systems (PCS), apparatuses that promises in energy saving and comfort improvement, into a broader range of environments. We report a series of laboratory experiments systematically examining the effect of personalized heating in neutralizing individual, spatial and temporal variations of thermal demands. The experiments were conducted in an artificial climate chamber at -15 degC in order to simulate extreme cold environments. We developed a heating garment with 20 pieces of 20 * 20 cm2 heating cloth (grouped into 9 regions) comprehensively covering human body. Surface temperatures of the garment can be controlled independently, quickly (within 20 seconds), precisely (within 1 degC) and easily (through a tablet) up to 45 degC. Participants were instructed to adjust surface temperatures of each segment to their preferences, with their physiological, psychological and adjustment data collected. We found that active heating could significantly and stably improve thermal satisfaction. The overall TSV and TCV were improved 1.50 and 1.53 during the self-adjustment phase. Preferred heating surface temperatures for different segments varied widely. Further, even for the same segment, individual differences among participants were considerable. Such variances were observed through local heating powers, while unnoticeable among thermal perception votes. In other words, all these various differences could be neutralized given the flexibility in personalized adjustments. Our research reaffirms the paradigm of "adaptive thermal comfort" and will promote innovations on human-centric design for more efficient PCSs. △ Less

Submitted 27 December, 2022; v1 submitted 11 December, 2022; originally announced December 2022.

Journal ref: Building and Environment, 109950 (2022)

arXiv:2211.10721 [pdf]

Multi-timescale Event Detection in Nonintrusive Load Monitoring based on MDL Principle

Authors: Bo Liu, Jianfeng Zhang, Wenpeng Luan, Zishuai Liu, Yixin Yu

Abstract: Load event detection is the fundamental step for the event-based non-intrusive load monitoring (NILM). However, existing event detection methods with fixed parameters may fail in co** with the inherent multi-timescale characteristics of events and their event detection accuracy is easily affected by the load fluctuation. In this regard, this paper extends our previously designed two-stage event… ▽ More Load event detection is the fundamental step for the event-based non-intrusive load monitoring (NILM). However, existing event detection methods with fixed parameters may fail in co** with the inherent multi-timescale characteristics of events and their event detection accuracy is easily affected by the load fluctuation. In this regard, this paper extends our previously designed two-stage event detection framework, and proposes a novel multi-timescale event detection method based on the principle of minimum description length (MDL). Following the completion of step-like event detection in the first stage, a long-transient event detection scheme with variable-length sliding window is designed for the second stage, which is intended to provide the observation and characterization of the same event at different time scales. In that, the context information in the aggregated load data is mined by motif discovery, and then based on the MDL principle, the proper observation scales are selected for different events and the corresponding detection results are determined. In the post-processing step, a load fluctuation location method based on voice activity detection (VAD) is proposed to identify and remove the unreasonable events caused by fluctuations. Based on newly proposed evaluation metrics, the comparison tests on public and private datasets demonstrate that our method achieves higher detection accuracy and integrity for events of various appliances across different scenarios. △ Less

Submitted 19 November, 2022; originally announced November 2022.

Comments: 11 pages,16 figures

arXiv:2211.06161 [pdf, other]

Spatial Temporal Graph Convolution with Graph Structure Self-learning for Early MCI Detection

Authors: Yunpeng Zhao, Fugen Zhou, Bin Guo, Bo Liu

Abstract: Graph neural networks (GNNs) have been successfully applied to early mild cognitive impairment (EMCI) detection, with the usage of elaborately designed features constructed from blood oxygen level-dependent (BOLD) time series. However, few works explored the feasibility of using BOLD signals directly as features. Meanwhile, existing GNN-based methods primarily rely on hand-crafted explicit brain t… ▽ More Graph neural networks (GNNs) have been successfully applied to early mild cognitive impairment (EMCI) detection, with the usage of elaborately designed features constructed from blood oxygen level-dependent (BOLD) time series. However, few works explored the feasibility of using BOLD signals directly as features. Meanwhile, existing GNN-based methods primarily rely on hand-crafted explicit brain topology as the adjacency matrix, which is not optimal and ignores the implicit topological organization of the brain. In this paper, we propose a spatial temporal graph convolutional network with a novel graph structure self-learning mechanism for EMCI detection. The proposed spatial temporal graph convolution block directly exploits BOLD time series as input features, which provides an interesting view for rsfMRI-based preclinical AD diagnosis. Moreover, our model can adaptively learn the optimal topological structure and refine edge weights with the graph structure self-learning mechanism. Results on the Alzheimer's Disease Neuroimaging Initiative (ADNI) database show that our method outperforms state-of-the-art approaches. Biomarkers consistent with previous studies can be extracted from the model, proving the reliable interpretability of our method. △ Less

Submitted 11 November, 2022; originally announced November 2022.

Comments: 5 pages, 3 figures

arXiv:2211.04238 [pdf, other]

HDRfeat: A Feature-Rich Network for High Dynamic Range Image Reconstruction

Authors: Lingkai Zhu, Fei Zhou, Bozhi Liu, Orcun Göksel

Abstract: A major challenge for high dynamic range (HDR) image reconstruction from multi-exposed low dynamic range (LDR) images, especially with dynamic scenes, is the extraction and merging of relevant contextual features in order to suppress any ghosting and blurring artifacts from moving objects. To tackle this, in this work we propose a novel network for HDR reconstruction with deep and rich feature ext… ▽ More A major challenge for high dynamic range (HDR) image reconstruction from multi-exposed low dynamic range (LDR) images, especially with dynamic scenes, is the extraction and merging of relevant contextual features in order to suppress any ghosting and blurring artifacts from moving objects. To tackle this, in this work we propose a novel network for HDR reconstruction with deep and rich feature extraction layers, including residual attention blocks with sequential channel and spatial attention. For the compression of the rich-features to the HDR domain, a residual feature distillation block (RFDB) based architecture is adopted. In contrast to earlier deep-learning methods for HDR, the above contributions shift focus from merging/compression to feature extraction, the added value of which we demonstrate with ablation experiments. We present qualitative and quantitative comparisons on a public benchmark dataset, showing that our proposed method outperforms the state-of-the-art. △ Less

Submitted 8 November, 2022; originally announced November 2022.

Comments: 4 pages, 5 figures

arXiv:2211.00815 [pdf, other]

Build a SRE Challenge System: Lessons from VoxSRC 2022 and CNSRC 2022

Authors: Zhengyang Chen, Bing Han, Xu Xiang, Houjun Huang, Bei Liu, Yanmin Qian

Abstract: Many speaker recognition challenges have been held to assess the speaker verification system in the wild and probe the performance limit. Voxceleb Speaker Recognition Challenge (VoxSRC), based on the voxceleb, is the most popular. Besides, another challenge called CN-Celeb Speaker Recognition Challenge (CNSRC) is also held this year, which is based on the Chinese celebrity multi-genre dataset CN-C… ▽ More Many speaker recognition challenges have been held to assess the speaker verification system in the wild and probe the performance limit. Voxceleb Speaker Recognition Challenge (VoxSRC), based on the voxceleb, is the most popular. Besides, another challenge called CN-Celeb Speaker Recognition Challenge (CNSRC) is also held this year, which is based on the Chinese celebrity multi-genre dataset CN-Celeb. This year, our team participated in both speaker verification closed tracks in CNSRC 2022 and VoxSRC 2022, and achieved the 1st place and 3rd place respectively. In most system reports, the authors usually only provide a description of their systems but lack an effective analysis of their methods. In this paper, we will outline how to build a strong speaker verification challenge system and give a detailed analysis of each method compared with some other popular technical means. △ Less

Submitted 1 June, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: Accepted by InterSpeech 2023

arXiv:2210.16791 [pdf, other]

Adaptive Speech Quality Aware Complex Neural Network for Acoustic Echo Cancellation with Supervised Contrastive Learning

Authors: Bozhong Liu, Xiaoxi Yu, Hantao Huang

Abstract: Acoustic echo cancellation (AEC) is designed to remove echoes, reverberation, and unwanted added sounds from the microphone signal while maintaining the quality of the near-end speaker's speech. This paper proposes adaptive speech quality complex neural networks to focus on specific tasks for real-time acoustic echo cancellation. In specific, we propose a complex modularize neural network with dif… ▽ More Acoustic echo cancellation (AEC) is designed to remove echoes, reverberation, and unwanted added sounds from the microphone signal while maintaining the quality of the near-end speaker's speech. This paper proposes adaptive speech quality complex neural networks to focus on specific tasks for real-time acoustic echo cancellation. In specific, we propose a complex modularize neural network with different stages to focus on feature extraction, acoustic separation, and mask optimization receptively. Furthermore, we adopt the contrastive learning framework and novel speech quality aware loss functions to further improve the performance. The model is trained with 72 hours for pre-training and then 72 hours for fine-tuning. The proposed model outperforms the state-of-the-art performance. △ Less

Submitted 9 November, 2022; v1 submitted 30 October, 2022; originally announced October 2022.

Comments: Submitted to International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2023. Under review

arXiv:2209.09076 [pdf, other]

SJTU-AISPEECH System for VoxCeleb Speaker Recognition Challenge 2022

Authors: Zhengyang Chen, Bing Han, Xu Xiang, Houjun Huang, Bei Liu, Yanmin Qian

Abstract: This report describes the SJTU-AISPEECH system for the Voxceleb Speaker Recognition Challenge 2022. For track1, we implemented two kinds of systems, the online system and the offline system. Different ResNet-based backbones and loss functions are explored. Our final fusion system achieved 3rd place in track1. For track3, we implemented statistic adaptation and jointly training based domain adaptat… ▽ More This report describes the SJTU-AISPEECH system for the Voxceleb Speaker Recognition Challenge 2022. For track1, we implemented two kinds of systems, the online system and the offline system. Different ResNet-based backbones and loss functions are explored. Our final fusion system achieved 3rd place in track1. For track3, we implemented statistic adaptation and jointly training based domain adaptation. In the jointly training based domain adaptation, we jointly trained the source and target domain dataset with different training objectives to do the domain adaptation. We explored two different training objectives for target domain data, self-supervised learning based angular proto-typical loss and semi-supervised learning based classification loss with estimated pseudo labels. Besides, we used the dynamic loss-gate and label correction (DLG-LC) strategy to improve the quality of pseudo labels when the target domain objective is a classification loss. Our final fusion system achieved 4th place (very close to 3rd place, relatively less than 1%) in track3. △ Less

Submitted 20 September, 2022; v1 submitted 19 September, 2022; originally announced September 2022.

Comments: System description of VoxSRC 2022

arXiv:2209.07618 [pdf, other]

Differentiable Bilevel Programming for Stackelberg Congestion Games

Authors: Jiayang Li, **g Yu, Qianni Wang, Boyi Liu, Zhaoran Wang, Yu Marco Nie

Abstract: In a Stackelberg congestion game (SCG), a leader aims to maximize their own gain by anticipating and manipulating the equilibrium state at which the followers settle by playing a congestion game. Often formulated as bilevel programs, large-scale SCGs are well known for their intractability and complexity. Here, we attempt to tackle this computational challenge by marrying traditional methodologies… ▽ More In a Stackelberg congestion game (SCG), a leader aims to maximize their own gain by anticipating and manipulating the equilibrium state at which the followers settle by playing a congestion game. Often formulated as bilevel programs, large-scale SCGs are well known for their intractability and complexity. Here, we attempt to tackle this computational challenge by marrying traditional methodologies with the latest differentiable programming techniques in machine learning. The core idea centers on replacing the lower-level equilibrium problem with a smooth evolution trajectory defined by the imitative logit dynamic (ILD), which we prove converges to the equilibrium of the congestion game under mild conditions. Building upon this theoretical foundation, we propose two new local search algorithms for SCGs. The first is a gradient descent algorithm that obtains the derivatives by unrolling ILD via differentiable programming. Thanks to the smoothness of ILD, the algorithm promises both efficiency and scalability. The second algorithm adds a heuristic twist by cutting short the followers' evolution trajectory. Behaviorally, this means that, instead of anticipating the followers' best response at equilibrium, the leader seeks to approximate that response by only looking ahead a limited number of steps. Our numerical experiments are carried out over various instances of classic SCG applications, ranging from toy benchmarks to large-scale real-world examples. The results show the proposed algorithms are reliable and scalable local solvers that deliver high-quality solutions with greater regularity and significantly less computational effort compared to the many incumbents included in our study. △ Less

Submitted 13 May, 2024; v1 submitted 15 September, 2022; originally announced September 2022.

arXiv:2208.02792 [pdf]

A Cooperative Perception Environment for Traffic Operations and Control

Authors: Hanlin Chen, Brian Liu, Xumiao Zhang, Feng Qian, Z. Morley Mao, Yiheng Feng

Abstract: Existing data collection methods for traffic operations and control usually rely on infrastructure-based loop detectors or probe vehicle trajectories. Connected and automated vehicles (CAVs) not only can report data about themselves but also can provide the status of all detected surrounding vehicles. Integration of perception data from multiple CAVs as well as infrastructure sensors (e.g., LiDAR)… ▽ More Existing data collection methods for traffic operations and control usually rely on infrastructure-based loop detectors or probe vehicle trajectories. Connected and automated vehicles (CAVs) not only can report data about themselves but also can provide the status of all detected surrounding vehicles. Integration of perception data from multiple CAVs as well as infrastructure sensors (e.g., LiDAR) can provide richer information even under a very low penetration rate. This paper aims to develop a cooperative data collection system, which integrates Lidar point cloud data from both infrastructure and CAVs to create a cooperative perception environment for various transportation applications. The state-of-the-art 3D detection models are applied to detect vehicles in the merged point cloud. We test the proposed cooperative perception environment with the max pressure adaptive signal control model in a co-simulation platform with CARLA and SUMO. Results show that very low penetration rates of CAV plus an infrastructure sensor are sufficient to achieve comparable performance with 30% or higher penetration rates of connected vehicles (CV). We also show the equivalent CV penetration rate (E-CVPR) under different CAV penetration rates to demonstrate the data collection efficiency of the cooperative perception environment. △ Less

Submitted 4 August, 2022; originally announced August 2022.

arXiv:2207.14174 [pdf, other]

Bayesian Optimization-Based Beam Alignment for MmWave MIMO Communication Systems

Authors: Songjie Yang, Baojuan Liu, Zhiqin Hong, Zhongpei Zhang

Abstract: Due to the very narrow beam used in millimeter wave communication (mmWave), beam alignment (BA) is a critical issue. In this work, we investigate the issue of mmWave BA and present a novel beam alignment scheme on the basis of a machine learning strategy, Bayesian optimization (BO). In this context, we consider the beam alignment issue to be a black box function and then use BO to find the possibl… ▽ More Due to the very narrow beam used in millimeter wave communication (mmWave), beam alignment (BA) is a critical issue. In this work, we investigate the issue of mmWave BA and present a novel beam alignment scheme on the basis of a machine learning strategy, Bayesian optimization (BO). In this context, we consider the beam alignment issue to be a black box function and then use BO to find the possible optimal beam pair. During the BA procedure, this strategy exploits information from the measured beam pairs to predict the best beam pair. In addition, we suggest a novel BO algorithm based on the gradient boosting regression tree model. The simulation results demonstrate the spectral efficiency performance of our proposed schemes for BA using three different surrogate models. They also demonstrate that the proposed schemes can achieve spectral efficiency with a small overhead when compared to the orthogonal match pursuit (OMP) algorithm and the Thompson sampling-based multi-armed bandit (TS-MAB) method. △ Less

Submitted 28 July, 2022; originally announced July 2022.

arXiv:2207.14103 [pdf, other]

doi 10.1109/LAWP.2022.3192308

Low-complexity Sparse Array Synthesis Based on Off-grid Compressive Sensing

Authors: Songjie Yang, Baojuan Liu, Zhiqin Hong, Zhongpei Zhang

Abstract: A novel sparse array synthesis method for non-uniform planar arrays is proposed, which belongs to compressive sensing (CS)-based systhesis. Particularly, we propose an off-grid refinement technique to simultaneously optimize the antenna element positions and excitations with a low complexity, in response to the antenna position optimization problem that is difficult for standard CS. More important… ▽ More A novel sparse array synthesis method for non-uniform planar arrays is proposed, which belongs to compressive sensing (CS)-based systhesis. Particularly, we propose an off-grid refinement technique to simultaneously optimize the antenna element positions and excitations with a low complexity, in response to the antenna position optimization problem that is difficult for standard CS. More importantly, we take into account the minimum inter-element spacing constraint for ensuring the physically realizable solution. Specifically, the off-grid Orthogonal Match Pursuit (OMP) algorithm is first proposed with low complexity and then off-grid Look Ahead Orthogonal Match Pursuit (LAOMP) is designed with better synthesis performance but higher complexity. In addition, simulation results have shown the proposed schemes have more advantages in computational complexity and synthesis performances compared with the related method. △ Less

Submitted 28 July, 2022; originally announced July 2022.

Showing 1–50 of 148 results for author: Liu, B