
BS-PLCNet: Band-split Packet Loss Concealment Network with Multi-task Learning Framework and Multi-discriminators

Authors: Zihan Zhang, Jiayao Sun, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

Abstract: Packet loss is a common and unavoidable problem in voice over Internet Protocol (VoIP) systems. To deal with the problem, we propose a band-split packet loss concealment network (BS-PLCNet). Specifically, we split the full-band signal into wide-band (0-8kHz) and high-band (8-24kHz). The wide-band signals are processed by a gated convolutional recurrent network (GCRN), while the high-band counterpart is processed by a simple GRU network. To ensure high speech quality and automatic speech recognition (ASR) compatibility, a multi-task learning (MTL) framework including fundamental frequency (f0) prediction and linguistic awareness, together with multi-discriminators, is used. The proposed approach tied for 1st place in the ICASSP 2024 PLC Challenge.

Submitted 8 January, 2024; originally announced January 2024.

Comments: submitted to ICASSP 2024
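
The band split described in the abstract above can be pictured on a single one-sided spectrum frame. The following is a minimal sketch only, not the authors' implementation: the function name `band_split`, the 48 kHz sample rate (implied by the 24 kHz upper band edge), the 960-point FFT, and the choice to place the 8 kHz bin in the high band are all assumptions made for illustration.

```python
def band_split(frame, sample_rate=48000, cutoff_hz=8000.0):
    """Split one one-sided spectrum frame into (wide_band, high_band) bins.

    `frame` holds the n_fft // 2 + 1 bins of a real-input FFT.
    Assumption: the bin sitting exactly at `cutoff_hz` goes to the high band.
    """
    n_fft = 2 * (len(frame) - 1)           # FFT size implied by the bin count
    bin_hz = sample_rate / n_fft           # frequency spacing between bins
    cutoff_bin = int(round(cutoff_hz / bin_hz))
    return frame[:cutoff_bin], frame[cutoff_bin:]


# Hypothetical frame: 481 bins, i.e. a 960-point FFT at 48 kHz (50 Hz per bin).
frame = [complex(k, 0) for k in range(481)]
wide, high = band_split(frame)
print(len(wide), len(high))  # 160 321
```

In a system of this kind, each band would then be passed to its own network; the sketch stops at the split itself.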

arXiv:2401.03473 [pdf, ps, other]

ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge

Authors: He Wang, Pengcheng Guo, Yue Li, Ao Zhang, Jiayao Sun, Lei Xie, Wei Chen, Pan Zhou, Hui Bu, Xin Xu, Binbin Zhang, Zhuo Chen, Jian Wu, Longbiao Wang, Eng Siong Chng, Sun Li

Abstract: To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle and 40 hours of noise for data augmentation. Two tracks, including automatic speech recognition (ASR) and automatic speech diarization and recognition (ASDR), are set up, using character error rate (CER) and concatenated minimum permutation character error rate (cpCER) as evaluation metrics, respectively. Overall, the ICMC-ASR Challenge attracts 98 participating teams and receives 53 valid results in both tracks. In the end, first-place team USTCiflytek achieves a CER of 13.16% in the ASR track and a cpCER of 21.48% in the ASDR track, showing an absolute improvement of 13.08% and 51.4% compared to our challenge baseline, respectively.

Submitted 20 February, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

Comments: Accepted at ICASSP 2024
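
The two metrics named in the abstract above can be illustrated with a short sketch. It assumes the usual definitions: CER is the character-level edit distance divided by the reference length, and cpCER concatenates each speaker's text and takes the minimum error over all speaker permutations. The function names, the equal-speaker-count restriction, and the sample strings are assumptions for illustration; the challenge's official scoring tool may differ in normalization details.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of one hypothesis against one reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def cp_cer(refs, hyps):
    """Sketch of cpCER: refs/hyps hold one concatenated string per speaker.

    Assumption: both lists contain the same number of speakers.
    """
    total = sum(len(r) for r in refs)
    best = min(sum(edit_distance(r, h) for r, h in zip(refs, perm))
               for perm in permutations(hyps))
    return best / total


print(cer("打开车窗", "打开窗"))            # 0.25
print(cp_cer(["ab", "cd"], ["cd", "ab"]))  # 0.0
```

The permutation search grows factorially with the number of speakers, which is acceptable for the handful of speakers in a car but would need an assignment solver for larger meetings.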

arXiv:2401.03424 [pdf, other]

doi 10.1109/ICASSP48485.2024.10446769

MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

Authors: He Wang, Pengcheng Guo, Pan Zhou, Lei Xie

Abstract: While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. Experimental results on the MISP2022-AVSR Challenge dataset show the efficacy of our proposed system, achieving a concatenated minimum permutation character error rate (cpCER) of 30.57% on the Eval set and yielding up to 3.17% relative improvement compared with our previous system, which ranked second in the challenge. Following the fusion of multiple systems, our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.

Submitted 8 April, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

Comments: 5 pages, 3 figures Accepted at ICASSP 2024

arXiv:2401.03105 [pdf, other]

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

Authors: Xin He, Longhui Wei, Lingxi Xie, Qi Tian

Abstract: Multimodal Large Language Models (MLLMs) are experiencing rapid growth, yielding a plethora of noteworthy contributions in recent months. The prevailing trend involves adopting data-driven methodologies, wherein diverse instruction-following datasets are collected. However, a prevailing challenge persists in these approaches, specifically in relation to the limited visual perception ability, as CLIP-like encoders are employed for extracting visual information from inputs. Though these encoders are pre-trained on billions of image-text pairs, they still grapple with the information loss dilemma, given that textual captions only partially capture the contents depicted in images. To address this limitation, this paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. Specifically, we introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline, aiming to provide a more comprehensive and accurate summarization of visual inputs. Extensive experiments have evaluated its effectiveness in advancing MLLMs, showcasing improved visual perception achieved through the integration of visual experts.

Submitted 13 January, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

arXiv:2401.02617 [pdf, ps, other]

How are the abnormally hot chromosphere and corona heated by the solar magnetic fields?

Authors: K. J. Li, J. C. Xu, W. Feng, J. L. Xie, X. J. Shi, L. H. Deng

Abstract: The corona is a structure possessed by stars, including the sun. The abnormal heating of the solar corona and chromosphere is one of the greatest mysteries in modern astronomy. While state-of-the-art observations have identified some candidates of magnetic activity events that could be responsible for this abnormal heating, and theoretical studies have proposed various heating modes, a complete physical picture of how they are heated as a whole remains elusive. In this study, the characteristics of the heated corona and chromosphere are investigated, and for the first time, the question of how they are abnormally heated is explicitly answered by analyzing the long-term observations of the global chromosphere in the Ca II K line and the global corona in the coronal green line. The findings reveal that both the quiet chromosphere and corona are in anti-phase with the solar cycle, whereas the active chromosphere and corona are in phase with it. Different parts of the solar corona and chromosphere exhibit significantly different variation characteristics, and are found to be heated by different magnetic categories and probably in different modes. This study posits that unraveling the heating mystery is best approached through the lens of magnetic categories, rather than magnetic activity events.

Submitted 4 January, 2024; originally announced January 2024.

Comments: accepted for publication in ApJ

arXiv:2401.01685 [pdf]

Modality Exchange Network for Retinogeniculate Visual Pathway Segmentation

Authors: Hua Han, Cheng Li, Lei Xie, Yuanjing Feng, Alou Diakite, Shanshan Wang

Abstract: Accurate segmentation of the retinogeniculate visual pathway (RGVP) aids in the diagnosis and treatment of visual disorders by identifying disruptions or abnormalities within the pathway. However, the complex anatomical structure and connectivity of RGVP make it challenging to achieve accurate segmentation. In this study, we propose a novel Modality Exchange Network (ME-Net) that effectively utilizes multi-modal magnetic resonance (MR) imaging information to enhance RGVP segmentation. Our ME-Net has two main contributions. Firstly, we introduce an effective multi-modal soft-exchange technique. Specifically, we design a channel and spatially mixed attention module to exchange modality information between T1-weighted and fractional anisotropy MR images. Secondly, we propose a cross-fusion module that further enhances the fusion of information between the two modalities. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches in terms of RGVP segmentation performance.

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2401.01654 [pdf, other]

LESEN: Label-Efficient deep learning for Multi-parametric MRI-based Visual Pathway Segmentation

Authors: Alou Diakite, Cheng Li, Lei Xie, Yuanjing Feng, Hua Han, Shanshan Wang

Abstract: Recent research has shown the potential of deep learning in multi-parametric MRI-based visual pathway (VP) segmentation. However, obtaining labeled data for training is laborious and time-consuming. Therefore, it is crucial to develop effective algorithms in situations with limited labeled samples. In this work, we propose a label-efficient deep learning method with self-ensembling (LESEN). LESEN incorporates supervised and unsupervised losses, enabling the student and teacher models to mutually learn from each other, forming a self-ensembling mean teacher framework. Additionally, we introduce a reliable unlabeled sample selection (RUSS) mechanism to further enhance LESEN's effectiveness. Our experiments on the human connectome project (HCP) dataset demonstrate the superior performance of our method when compared to state-of-the-art techniques, advancing multimodal VP segmentation for comprehensive analysis in clinical and research settings. The implementation code will be available at: https://github.com/aldiak/Semi-Supervised-Multimodal-Visual-Pathway-Delineation.

Submitted 3 January, 2024; originally announced January 2024.
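
The abstract above refers to a self-ensembling mean teacher framework. In the standard mean teacher scheme, the teacher's weights are an exponential moving average (EMA) of the student's weights rather than being trained by gradient descent. The sketch below shows only that generic update; whether LESEN uses exactly this rule is not stated in the abstract, and the function name `ema_update`, the dictionary-of-floats parameter format, and the decay value are assumptions.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher parameters as an EMA of the student parameters.

    `teacher` and `student` map parameter names to floats (a stand-in for
    tensors). A `decay` close to 1 makes the teacher change slowly.
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


# Hypothetical single-parameter models.
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, decay=0.9)
print(teacher["w"])  # 0.9
```

During training, an update of this form would typically run once per optimization step, after the student has been updated with the supervised and unsupervised losses.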

arXiv:2401.00475 [pdf, other]

E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models

Authors: Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Mengzhe Chen, Qian Chen, Lei Xie

Abstract: This study focuses on emotion-sensitive spoken dialogue in human-machine speech interaction. With the advancement of Large Language Models (LLMs), dialogue systems can handle multimodal data, including audio. Recent models have enhanced the understanding of complex audio signals through the integration of various audio events. However, they are unable to generate appropriate responses based on emotional speech. To address this, we introduce the Emotional chat Model (E-chat), a novel spoken dialogue system capable of comprehending and responding to emotions conveyed from speech. This model leverages an emotion embedding extracted by a speech encoder, combined with LLMs, enabling it to respond according to different emotional contexts. Additionally, we introduce the E-chat200 dataset, designed explicitly for emotion-sensitive spoken dialogue. In various evaluation metrics, E-chat consistently outperforms baseline LLMs, demonstrating its potential in emotional comprehension and human-machine interaction.

Submitted 6 January, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

Comments: 6 pages, 3 figures

arXiv:2312.17495 [pdf]

Integrating Chemical Language and Molecular Graph in Multimodal Fused Deep Learning for Drug Property Prediction

Authors: Xiaohua Lu, Liangxu Xie, Lei Xu, Rongzhi Mao, Shan Chang, Xiaojun Xu

Abstract: Accurately predicting molecular properties is a challenging but essential task in drug discovery. Recently, many mono-modal deep learning methods have been successfully applied to molecular property prediction. However, the inherent limitation of mono-modal learning arises from relying solely on one modality of molecular representation, which restricts a comprehensive understanding of drug molecules and hampers their resilience against data noise. To overcome the limitations, we construct multimodal deep learning models to cover different molecular representations. We convert drug molecules into three molecular representations: SMILES-encoded vectors, ECFP fingerprints, and molecular graphs. To process the modal information, Transformer-Encoder, bi-directional gated recurrent units (BiGRU), and graph convolutional network (GCN) are utilized for feature learning, respectively, which can enhance the model's capability to acquire complementary and naturally occurring bioinformatics information. We evaluated our triple-modal model on six molecule datasets. Different from bi-modal learning models, we adopt five fusion methods to capture the specific features and better leverage the contribution of each modality. Compared with mono-modal models, our multimodal fused deep learning (MMFDL) models outperform single models in accuracy, reliability, and resistance capability against noise. Moreover, we demonstrate its generalization ability in the prediction of binding constants for protein-ligand complex molecules in the refined set of PDBbind. The advantage of the multimodal model lies in its ability to process diverse sources of data using proper models and suitable fusion methods, which would enhance the noise resistance of the model while obtaining data diversity.

Submitted 29 December, 2023; originally announced December 2023.

arXiv:2312.16850 [pdf, other]

Accent-VITS: accent transfer for end-to-end TTS

Authors: Linhan Ma, Yongmao Zhang, Xinfa Zhu, Yi Lei, Ziqian Ning, Pengcheng Zhu, Lei Xie

Abstract: Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker's voice. The main challenge is how to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based end-to-end accent transfer model named Accent-VITS. Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrums as constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness as compared with a strong baseline.

Submitted 29 December, 2023; v1 submitted 28 December, 2023; originally announced December 2023.

Comments: Accepted by NCMMSC2023

arXiv:2312.15340 [pdf, other]

Meta-Learning-Based Adaptive Stability Certificates for Dynamical Systems

Authors: Amit Jena, Dileep Kalathil, Le Xie

Abstract: This paper addresses the problem of Neural Network (NN) based adaptive stability certification in a dynamical system. The state-of-the-art methods, such as Neural Lyapunov Functions (NLFs), use NN-based formulations to assess the stability of a non-linear dynamical system and compute a Region of Attraction (ROA) in the state space. However, under parametric uncertainty, if the values of system parameters vary over time, the NLF methods fail to adapt to such changes and may lead to conservative stability assessment performance. We circumvent this issue by integrating Model Agnostic Meta-learning (MAML) with NLFs and propose meta-NLFs. In this process, we train a meta-function that adapts to any parametric shifts and updates into an NLF for the system with new test-time parameter values. We demonstrate the stability assessment performance of meta-NLFs on some standard benchmark autonomous dynamical systems.

Submitted 23 December, 2023; originally announced December 2023.

Comments: This article has been accepted for AAAI-24 (The 38th Annual AAAI Conference on Artificial Intelligence)

arXiv:2312.15067 [pdf, other]

Electromagnetic Transient Model of Cryptocurrency Mining Loads for Low-Voltage Ride Through Assessment in Transmission Grids

Authors: Anindita Samanta, Subir Majumder, Hasan Ibrahim, Prasad Enjeti, Le Xie

Abstract: In this paper, we developed an Electromagnetic Transient (EMT) model tailored for large cryptocurrency mining loads to understand the cross-interaction of these loads with the electric grid. The load model has been built using Electromagnetic Transients Program (EMTP) software. We have cross-validated the performance of the EMT model of the load with commercial application-specific integrated circuit miners, typically used by large-scale mining facilities, by comparing their low-voltage ride-through (LVRT) capabilities. Subsequently, LVRT capabilities of the large-scale miners have been tested against various fault scenarios both within the miner's remote facility as well as at one of the distant buses of the interconnected grid. The significance of this model lies in its scalability to accommodate larger blocks of mining loads and its seamless integration into a larger electric grid.

Submitted 22 December, 2023; originally announced December 2023.

Comments: 5 pages, 10 figures, conference

arXiv:2312.13076 [pdf, other]

How to Integrate Digital Twin and Virtual Reality in Robotics Systems? Design and Implementation for Providing Robotics Maintenance Services in Data Centers

Authors: Lin Xie, Hanyi Li

Abstract: In the context of Industry 4.0, the physical and digital worlds are closely connected, and robots are widely used to achieve system automation. Digital twin solutions have contributed significantly to the growth of Industry 4.0. Combining various technologies is a trend that aims to improve system performance. For example, digital twinning can be combined with virtual reality in automated systems. This paper proposes a new concept to articulate this combination, which has mainly been implemented in engineering research projects. However, there are currently no guidelines, plans, or concepts to articulate this combination. The concept will be implemented in data centers, which are crucial for enabling virtual tasks in our daily lives. Due to the COVID-19 pandemic, there has been a surge in demand for services such as e-commerce and videoconferencing. Regular maintenance is necessary to ensure uninterrupted and reliable services. Manual maintenance strategies may not be sufficient to meet the current high demand, and innovative approaches are needed to address the problem. This paper presents a novel approach to data center maintenance: real-time monitoring by an autonomous robot. The robot is integrated with digital twins of assets and a virtual reality interface that allows human personnel to control it and respond to alarms. This methodology enables faster, more cost-effective, and higher quality data center maintenance. It has been validated in a real data center and can be used for intelligent monitoring and management through joint data sources. The method has potential applications in other automated systems.

Submitted 20 December, 2023; originally announced December 2023.

arXiv:2312.12380 [pdf, other]

doi 10.1093/mnras/stad3751

The stellar mass function of quiescent galaxies in 2 < z < 2.5 protoclusters

Authors: Adit H. Edward, Michael L. Balogh, Yannick M. Bahe, Michael C. Cooper, Nina A. Hatch, Justin Marchioni, Adam Muzzin, Allison Noble, Gregory H. Rudnick, Benedetta Vulcani, Gillian Wilson, Gabriella De Lucia, Ricardo Demarco, Ben Forrest, Michaela Hirschmann, Gianluca Castignani, Pierluigi Cerulo, Rose A. Finn, Guillaume Hewitt, Pascale Jablonka, Tadayuki Kodama, Sophie Maurogordato, Julie Nantais, Lizhi Xie

Abstract: We present an analysis of the galaxy stellar mass function (SMF) of 14 known protoclusters between $2.0 < z < 2.5$ in the COSMOS field, down to a mass limit of $10^{9.5}$ M$_{\odot}$. We use existing photometric redshifts with a statistical background subtraction, and consider star-forming and quiescent galaxies identified from $(NUV - r)$ and $(r - J)$ colours separately. Our fiducial sample includes galaxies within 1 Mpc of the cluster centres. The shape of the protocluster SMF of star-forming galaxies is indistinguishable from that of the general field at this redshift. Quiescent galaxies, however, show a flatter SMF than in the field, with an upturn at low mass, though this is only significant at $\sim 2\sigma$. There is no strong evidence for a dominant population of quiescent galaxies at any mass, with a fraction of $< 15\%$ at $1\sigma$ confidence for galaxies with $\log M_{\ast}/M_{\odot} < 10.5$. We compare our results with a sample of galaxy groups at $1 < z < 1.5$, and demonstrate that a significant amount of environmental quenching must take place between these epochs, increasing the relative abundance of high-mass ($\rm M > 10^{10.5} M_{\odot}$) quiescent galaxies by a factor of $\gtrsim$ 2. However, we find that at lower masses ($\rm M < 10^{10.5} M_{\odot}$), no additional environmental quenching is required.

Submitted 19 December, 2023; originally announced December 2023.

Comments: 23 pages, 22 figures. Accepted for publication in MNRAS

arXiv:2312.11888 [pdf, other]

doi 10.1109/TAC.2020.3012630

Angle-Displacement Rigidity Theory with Application to Distributed Network Localization

Authors: Xu Fang, Xiaolei Li, Lihua Xie

Abstract: This paper investigates the localization problem of a network in 2-D and 3-D spaces given the positions of anchor nodes in a global frame and inter-node relative measurements in local coordinate frames. It is assumed that the local frames of different nodes have different unknown orientations. First, an angle-displacement rigidity theory is developed, which can be used to localize all the free nodes by the known positions of the anchor nodes and local relative measurements (local relative position, distance, local relative bearing, angle, or ratio-of-distance measurements). Then, necessary and sufficient conditions for network localizability are given. Finally, a distributed network localization protocol is proposed, which can globally estimate the locations of all the free nodes of a network if the network is infinitesimally angle-displacement rigid. The proposed method unifies local-relative-position-based, distance-based, local-relative-bearing-based, angle-based, and ratio-of-distance-based distributed network localization approaches. The novelty of this work is that the proposed method can be applied in both generic and non-generic configurations with an unknown global coordinate frame in both 2-D and 3-D spaces.

Submitted 19 December, 2023; originally announced December 2023.

arXiv:2312.11851 [pdf, other]

doi 10.1109/TSMC.2023.3250199

Distributed Semi-global Output Feedback Formation Maneuver Control of High-order Multi-agent Systems

Authors: Xu Fang, Lihua Xie

Abstract: This paper addresses the formation maneuver control problem of leader-follower multi-agent systems with high-order integrator dynamics. A distributed output feedback formation maneuver controller is proposed to achieve desired maneuvers so that the scale, orientation, translation, and shape of formation can be manipulated continuously, where the followers do not need to know or estimate the time-varying maneuver parameters only known to the leaders. Compared with existing relative-measurement-based formation maneuver control, the advantages of the proposed method are that it is output (relative output) feedback based and shows how to realize different types of formation shape. In addition, it can be applied to non-generic and non-convex nominal configurations and the leaders are allowed to be maneuvered. It is worth noting that the proposed method can also be extended to general linear multi-agent systems under some additional conditions. The theoretical results are demonstrated by a simulation example.

Submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.10995 [pdf, other]

doi 10.1109/TSP.2020.3029399

3-D Distributed Localization with Mixed Local Relative Measurements

Authors: Xu Fang, Xiaolei Li, Lihua Xie

Abstract: This paper studies 3-D distributed network localization using mixed types of local relative measurements. Each node holds a local coordinate frame without a common orientation and can only measure one type of information (relative position, distance, relative bearing, angle, or ratio-of-distance measurements) about its neighboring nodes in its local coordinate frame. A novel rigidity-theory-based distributed localization is developed to overcome the challenge due to the absence of a global coordinate frame. The main idea is to construct displacement constraints for the positions of the nodes by using mixed local relative measurements. Then, a linear distributed localization algorithm is proposed for each free node to estimate its position by solving the displacement constraints. The algebraic condition and graph condition are obtained to guarantee the global convergence of the proposed distributed localization algorithm.

Submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.10989 [pdf, other]

doi 10.1016/j.automatica.2023.110915

Distributed Localization in Dynamic Networks via Complex Laplacian

Authors: Xu Fang, Lihua Xie, Xiaolei Li

Abstract: Different from most existing distributed localization approaches in static networks where the agents in a network are static, this paper addresses the distributed localization problem in dynamic networks where the positions of the agents are time-varying. Firstly, complex constraints for the positions of the agents are constructed based on local relative position (distance and local bearing) measurements. Secondly, both algebraic condition and graph condition of network localizability in dynamic networks are given. Thirdly, a distributed localization protocol is proposed such that all the agents can cooperatively find their positions by solving the complex constraints in dynamic networks. Fourthly, the proposed method is extended to address the problem of integrated distributed localization and formation control. It is worth mentioning that the proposed algorithm can also be applied in the case that only distance and sign of direction measurements are available, where the sign of direction measurement is a kind of one bit local relative measurement and has less information than local bearing.

Submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.10037 [pdf, ps, other]

A system of dual quaternion matrix equations with its applications

Authors: Lv-Ming Xie, Qing-Wen Wang

Abstract: We employ the M-P inverses and ranks of quaternion matrices to establish the necessary and sufficient conditions for solving a system of the dual quaternion matrix equations $(AX, XC) = (B, D)$, along with providing an expression for its general solution. Serving as an application, we investigate the solutions to the dual quaternion matrix equations $AX = B$ and $XC=D$, including $η$-Hermitian solutions. Lastly, we design a numerical example to validate the main research findings of this paper.

Submitted 13 November, 2023; originally announced December 2023.

arXiv:2312.09760 [pdf, other]

U2-KWS: Unified Two-pass Open-vocabulary Keyword Spotting with Keyword Bias

Authors: Ao Zhang, Pan Zhou, Kaixun Huang, Yong Zou, Ming Liu, Lei Xie

Abstract: Open-vocabulary keyword spotting (KWS), which allows users to customize keywords, has attracted increasingly more interest. However, existing methods based on acoustic models and post-processing train the acoustic model with ASR training criteria to model all phonemes, making the acoustic model under-optimized for the KWS task. To solve this problem, we propose a novel unified two-pass open-vocabulary KWS (U2-KWS) framework inspired by the two-pass ASR model U2. Specifically, we employ the CTC branch as the first stage model to detect potential keyword candidates and the decoder branch as the second stage model to validate candidates. In order to enhance any customized keywords, we redesign the U2 training procedure for U2-KWS and add keyword information by audio and text cross-attention into both branches. We perform experiments on our internal dataset and Aishell-1. The results show that U2-KWS can achieve a significant relative wake-up rate improvement of 41% compared to the traditional customized KWS systems when the false alarm rate is fixed to 0.5 times per hour.

Submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted by ASRU2023

arXiv:2312.09747 [pdf, other]

SELM: Speech Enhancement Using Discrete Tokens and Language Models

Authors: Ziqian Wang, Xinfa Zhu, Zihan Zhang, YuanJun Lv, Ning Jiang, Guoqing Zhao, Lei Xie

Abstract: Language models (LMs) have shown superior performance in various speech generation tasks recently, demonstrating their powerful ability for semantic context modeling. Given the intrinsic similarity between speech generation and speech enhancement, harnessing semantic information holds potential advantages for speech enhancement tasks. In light of this, we propose SELM, a novel paradigm for speech enhancement, which integrates discrete tokens and leverages language models. SELM comprises three stages: encoding, modeling, and decoding. We transform continuous waveform signals into discrete tokens using pre-trained self-supervised learning (SSL) models and a k-means tokenizer. Language models then capture comprehensive contextual information within these tokens. Finally, a detokenizer and HiFi-GAN restore them into enhanced speech. Experimental results demonstrate that SELM achieves comparable performance in objective metrics alongside superior results in subjective perception. Our demos are available at https://honee-w.github.io/SELM/.

Submitted 7 January, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP 2024

arXiv:2312.09746 [pdf, other]

Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies

Authors: Bingshen Mu, Pengcheng Guo, Dake Guo, Pan Zhou, Wei Chen, Lei Xie

Abstract: Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant scenarios across various array topologies each with multiple recording devices. The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that demonstrates exceptional performance across various array topologies. First of all, we propose two attention-based automatic channel selection modules to select the most advantageous subset of multi-channel signals from multiple recording devices for each utterance. Furthermore, we introduce inter-channel spatial features to augment the effectiveness of multi-frame cross-channel attention, aiding it in improving the capability of spatial information awareness. Finally, we propose a multi-layer convolution fusion module drawing inspiration from the U-Net architecture to integrate the multi-channel output into a single-channel output. Experimental results on the CHiME-7 corpus with oracle segmentation demonstrate that the improvements introduced in our proposed ASR system lead to a relative reduction of 40.1% in the Macro Diarization Attributed Word Error Rates (DA-WER) when compared to the baseline ASR system on the Eval sets.

Submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP 2024

arXiv:2312.06739 [pdf, other]

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

Authors: Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan

Abstract: Current instruction-based editing methods, such as InstructPix2Pix, often fail to produce satisfactory results in complex scenarios due to their dependence on the simple CLIP text encoder in diffusion models. To rectify this, this paper introduces SmartEdit, a novel approach to instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance their understanding and reasoning capabilities. However, direct integration of these elements still faces challenges in situations requiring complex reasoning. To mitigate this, we propose a Bidirectional Interaction Module that enables comprehensive bidirectional information interactions between the input image and the MLLM output. During training, we initially incorporate perception data to boost the perception and understanding capabilities of diffusion models. Subsequently, we demonstrate that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions. We further construct a new evaluation dataset, Reason-Edit, specifically tailored for complex instruction-based image editing. Both quantitative and qualitative results on this evaluation dataset indicate that our SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing.

Submitted 11 December, 2023; originally announced December 2023.

Comments: Project page: https://yuzhou914.github.io/SmartEdit/

arXiv:2312.06607 [pdf, other]

DiAD: A Diffusion-based Framework for Multi-class Anomaly Detection

Authors: Haoyang He, Jiangning Zhang, Hongxu Chen, Xuhai Chen, Zhishan Li, Xu Chen, Yabiao Wang, Chengjie Wang, Lei Xie

Abstract: Reconstruction-based approaches have achieved remarkable outcomes in anomaly detection. The exceptional image reconstruction capabilities of recently popular diffusion models have sparked research efforts to utilize them for enhanced reconstruction of anomalous images. Nonetheless, these methods might face challenges related to the preservation of image categories and pixel-wise structural integrity in the more practical multi-class setting. To solve the above problems, we propose a Diffusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection, which consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network with a connection to the stable diffusion's denoising network, and a feature-space pre-trained feature extractor. Firstly, the SG network is proposed for reconstructing anomalous regions while preserving the original image's semantic information. Secondly, we introduce a Spatial-aware Feature Fusion (SFF) block to maximize reconstruction accuracy when dealing with extensively reconstructed areas. Thirdly, the input and reconstructed images are processed by a pre-trained feature extractor to generate anomaly maps based on features extracted at different scales. Experiments on the MVTec-AD and VisA datasets demonstrate the effectiveness of our approach, which surpasses the state-of-the-art methods, e.g., achieving 96.8/52.6 and 97.2/99.0 (AUROC/AP) for localization and detection respectively on the multi-class MVTec-AD dataset. Code will be available at https://lewandofskee.github.io/projects/diad.

Submitted 11 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI 2024

arXiv:2312.06154 [pdf, other]

Predictive Reliability Assessment of Distribution Grids with Residential Distributed Energy Resources

Authors: Arun Kumar Karngala, Chanan Singh, Le Xie

Abstract: Distribution system end users are transforming from passive to active participants, marked by the push towards widespread adoption of edge-level Distributed Energy Resources (DERs). This paper addresses the challenges in distribution system planning arising from these dynamic changes. We introduce a bottom-up probabilistic approach that integrates these edge-level DERs into the reliability evaluation process. Our methodology leverages joint probability distributions to characterize and model the penetration of rooftop photovoltaic (PV) systems and energy storage across a distribution network at the individual residential level. Employing a scenario-based approach, we showcase the application of our probabilistic method using a Monte Carlo Simulation process to assess average system reliability indices and their variations at the user level. To validate our approach, we applied this methodology to the RBTS test system across various adoption scenarios, effectively showcasing the capability of our proposed method in quantifying the variation in end-user reliability indices for each scenario within the distribution system.

Submitted 11 December, 2023; originally announced December 2023.

Comments: 10 Pages, 6 figures, Journal

arXiv:2312.04424 [pdf, other]

Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views

Authors: Yabo Chen, Jiemin Fang, Yuyang Huang, Taoran Yi, Xiaopeng Zhang, Lingxi Xie, Xinggang Wang, Wenrui Dai, Hongkai Xiong, Qi Tian

Abstract: Synthesizing multi-view 3D from one single image is a significant and challenging task. For this goal, Zero-1-to-3 methods aim to extend a 2D latent diffusion model to the 3D scope. These approaches generate the target-view image with a single-view source image and the camera pose as condition information. However, the one-to-one manner adopted in Zero-1-to-3 incurs challenges for building geometric and visual consistency across views, especially for complex objects. We propose a cascade generation framework constructed with two Zero-1-to-3 models, named Cascade-Zero123, to tackle this issue, which progressively extracts 3D information from the source image. Specifically, a self-prompting mechanism is designed to generate several nearby views at first. These views are then fed into the second-stage model along with the source image as generation conditions. With self-prompted multiple views as the supplementary information, our Cascade-Zero123 generates more highly consistent novel-view images than Zero-1-to-3. The promotion is significant for various complex and challenging scenes, involving insects, humans, transparent objects, and stacked multiple objects etc. The project page is at https://cascadezero123.github.io/.

Submitted 7 December, 2023; originally announced December 2023.

Comments: Project page: https://cascadezero123.github.io/

arXiv:2312.04198 [pdf, other]

doi 10.1109/TAC.2023.3327932

Distributed Formation Maneuver Control Using Complex Laplacian

Authors: Xu Fang, Lihua Xie

Abstract: This paper studies the problem of distributed formation maneuver control of multi-agent systems via complex Laplacian. We will show how to change the translation, scaling, rotation, and also the shape of formation continuously by only tuning the positions of the leaders in both 2-D and 3-D spaces, where the rotation of formation in 3-D space is realized by changing the yaw angle, pitch angle, and roll angle of formation sequentially. Compared with real-Laplacian-based methods, the first advantage of the proposed complex-Laplacian-based approach is that each follower requires fewer neighbors and less communication. The second advantage is that non-convex and non-generic nominal configurations are allowed and the uniqueness of the complex-constraint-based target formation can be guaranteed by the non-collocated nominal agents. The third advantage is that more formation shapes can be realized by only tuning the positions of the leaders. Two simulation examples are given to illustrate the theoretical results.

Submitted 7 December, 2023; originally announced December 2023.

Comments: 8 pages

arXiv:2312.04131 [pdf, other]

Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

Authors: Huan Zhao, Li Zhang, Yue Li, Yannan Wang, Hongji Wang, Wei Rao, Qing Wang, Lei Xie

Abstract: The scarcity of labeled audio-visual datasets is a constraint for training superior audio-visual speaker diarization systems. To improve the performance of audio-visual speaker diarization, we leverage pre-trained supervised and self-supervised speech models for audio-visual speaker diarization. Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised pre-trained models (WavLM and HuBERT) as the speaker and audio embedding extractors in an end-to-end audio-visual speaker diarization (AVSD) system. Then we explore the effectiveness of different frameworks, including Transformer, Conformer, and cross-attention mechanism, in the audio-visual decoder. To mitigate the degradation of performance caused by separate training, we jointly train the audio encoder, speaker encoder, and audio-visual decoder in the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance and obtained third place in the MISP Challenge 2022.

Submitted 7 December, 2023; originally announced December 2023.

arXiv:2312.03016 [pdf, other]

Protein Language Model-Powered 3D Ligand Binding Site Prediction from Protein Sequence

Authors: Shuo Zhang, Lei Xie

Abstract: Prediction of ligand binding sites of proteins is a fundamental and important task for understanding the function of proteins and screening potential drugs. Most existing methods require experimentally determined protein holo-structures as input. However, such structures can be unavailable on novel or less-studied proteins. To tackle this limitation, we propose LaMPSite, which only takes protein sequences and ligand molecular graphs as input for ligand binding site predictions. The protein sequences are used to retrieve residue-level embeddings and contact maps from the pre-trained ESM-2 protein language model. The ligand molecular graphs are fed into a graph neural network to compute atom-level embeddings. Then we compute and update the protein-ligand interaction embedding based on the protein residue-level embeddings and ligand atom-level embeddings, and the geometric constraints in the inferred protein contact map and ligand distance map. A final pooling on protein-ligand interaction embedding would indicate which residues belong to the binding sites. Without any 3D coordinate information of proteins, our proposed model achieves competitive performance compared to baseline methods that require 3D protein structures when predicting binding sites. Given that less than 50% of proteins have reliable structure information in the current stage, LaMPSite will provide new opportunities for drug discovery.

Submitted 4 December, 2023; originally announced December 2023.

Comments: Accepted by the AI for Science (AI4Science) Workshop and the New Frontiers of AI for Drug Discovery and Development (AI4D3) Workshop at NeurIPS 2023

arXiv:2312.00860 [pdf, other]

Segment Any 3D Gaussians

Authors: Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian

Abstract: This paper presents SAGA (Segment Any 3D GAussians), a highly efficient 3D promptable segmentation method based on 3D Gaussian Splatting (3D-GS). Given 2D visual prompts as input, SAGA can segment the corresponding 3D target represented by 3D Gaussians within 4 ms. This is achieved by attaching a scale-gated affinity feature to each 3D Gaussian to endow it with a new property towards multi-granularity segmentation. Specifically, a scale-aware contrastive training strategy is proposed for the scale-gated affinity feature learning. It 1) distills the segmentation capability of the Segment Anything Model (SAM) from 2D masks into the affinity features and 2) employs a soft scale gate mechanism to deal with multi-granularity ambiguity in 3D segmentation through adjusting the magnitude of each feature channel according to a specified 3D physical scale. Evaluations demonstrate that SAGA achieves real-time multi-granularity segmentation with quality comparable to state-of-the-art methods. As one of the first methods addressing promptable segmentation in 3D-GS, the simplicity and effectiveness of SAGA pave the way for future advancements in this field. Our code will be released.

Submitted 27 May, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

Comments: Work in progress. Project page: https://jumpat.github.io/SAGA

arXiv:2311.17112 [pdf, other]

Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model

Authors: Zelin Peng, Zhengqin Xu, Zhilin Zeng, Lingxi Xie, Qi Tian, Wei Shen

Abstract: Parameter-efficient fine-tuning (PEFT) is an effective methodology to unleash the potential of large foundation models in novel scenarios with limited training data. In the computer vision community, PEFT has shown effectiveness in image classification, but little research has studied its ability for image segmentation. Fine-tuning segmentation models usually requires a heavier adjustment of parameters to align the proper projection directions in the parameter space for new scenarios. This raises a challenge to existing PEFT algorithms, as they often inject a limited number of individual parameters into each block, which prevents substantial adjustment of the projection direction of the parameter space due to the limitation of Hidden Markov Chain along blocks. In this paper, we equip PEFT with a cross-block orchestration mechanism to enable the adaptation of the Segment Anything Model (SAM) to various downstream scenarios. We introduce a novel inter-block communication module, which integrates a learnable relation matrix to facilitate communication among different coefficient sets of each PEFT block's parameter space. Moreover, we propose an intra-block enhancement module, which introduces a linear projection head whose weights are generated from a hyper-complex layer, further enhancing the impact of the adjustment of projection directions on the entire parameter space. Extensive experiments on diverse benchmarks demonstrate that our proposed approach consistently improves the segmentation performance significantly on novel scenarios with only around 1K additional parameters.

Submitted 28 March, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

Comments: Accepted by CVPR2024

arXiv:2311.16037 [pdf, other]

GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions

Authors: Jiemin Fang, Junjie Wang, Xiaopeng Zhang, Lingxi Xie, Qi Tian

Abstract: Recently, impressive results have been achieved in 3D scene editing with text instructions based on a 2D diffusion model. However, current diffusion models primarily generate images by predicting noise in the latent space, and the editing is usually applied to the whole image, which makes it challenging to perform delicate, especially localized, editing for 3D scenes. Inspired by recent 3D Gaussian splatting, we propose a systematic framework, named GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text instructions. Benefiting from the explicit property of 3D Gaussians, we design a series of techniques to achieve delicate editing. Specifically, we first extract the region of interest (RoI) corresponding to the text instruction, aligning it to 3D Gaussians. The Gaussian RoI is further used to control the editing process. Our framework can achieve more delicate and precise editing of 3D scenes than previous methods while enjoying much faster training speed, i.e. within 20 minutes on a single V100 GPU, more than twice as fast as Instruct-NeRF2NeRF (45 minutes -- 2 hours).

Submitted 27 November, 2023; originally announced November 2023.

Comments: Project page: https://GaussianEditor.github.io

arXiv:2311.15225 [pdf, other]

doi 10.1145/3633779

One-bit Supervision for Image Classification: Problem, Solution, and Beyond

Authors: Hengtong Hu, Lingxi Xie, Xinyue Hue, Richang Hong, Qi Tian

Abstract: This paper presents one-bit supervision, a novel setting of learning with fewer labels, for image classification. Instead of training a model using the accurate label of each sample, our setting requires the model to interact with the system by predicting the class label of each sample and learn from the answer whether the guess is correct, which provides one bit (yes or no) of information. An intriguing property of the setting is that the burden of annotation is largely alleviated in comparison to offering the accurate label. There are two keys to one-bit supervision, which are (i) improving the guess accuracy and (ii) making good use of the incorrect guesses. To achieve these goals, we propose a multi-stage training paradigm and incorporate negative label suppression into an off-the-shelf semi-supervised learning algorithm. Theoretical analysis shows that one-bit annotation is more efficient than full-bit annotation in most cases and gives the conditions of combining our approach with active learning. Inspired by this, we further integrate the one-bit supervision framework into the self-supervised learning algorithm which yields an even more efficient training schedule. Different from training from scratch, when self-supervised learning is used for initialization, both hard example mining and class balance are verified effective in boosting the learning performance. However, these two frameworks still need full-bit labels in the initial stage. To cast off this burden, we utilize unsupervised domain adaptation to train the initial model and conduct pure one-bit annotations on the target dataset. In multiple benchmarks, the learning efficiency of the proposed approach surpasses that using full-bit, semi-supervised supervision.

Submitted 26 November, 2023; originally announced November 2023.

Comments: ACM TOMM. arXiv admin note: text overlap with arXiv:2009.06168

arXiv:2311.14861 [pdf, other]

Voltage Constrained Heavy Duty Vehicle Electrification: Formulation and Case Study

Authors: Apurv Shukla, Rayan El Helou, Le Xie

Abstract: The electrification of heavy-duty vehicles (HDEVs) is a rapidly emerging avenue for decarbonization of energy and transportation sectors. Compared to light duty vehicles, HDEVs exhibit unique travel and charging patterns over long distances. In this paper, we formulate an analytically tractable model that considers the routing decisions for the HDEVs and their charging implications on the power grid. Our model captures the impacts of increased vehicle electrification on the transmission grid, with particular focus on HDEVs. We jointly model transportation and power networks, coupling them through the demand generated for charging requirements of HDEVs. In particular, the voltage constraint violation is explicitly accounted for in the proposed model given the significant amount of charging power imposed by HDEVs. We obtain optimal routing schedules and generator dispatch satisfying mobility constraints of HDEVs while minimizing voltage violations in the electric transmission network. A case study based on an IEEE 24-bus system is presented using realistic transit data of HDEVs. The numerical results suggest that the proposed model and algorithm effectively mitigate the voltage violation when a significant number of HDEVs are integrated into the power transmission network. Such mitigation includes reduction in the voltage magnitude, geographical dispersion of voltage violations and worst-case voltage violations at critical nodes.

Submitted 24 November, 2023; originally announced November 2023.

Comments: Accepted at CDC 2023

arXiv:2311.12932 [pdf, ps, other]

doi 10.1051/0004-6361/202348688

Variation of the stellar initial mass function in semi-analytical models III: testing the cosmic ray regulated integrated galaxy-wide initial mass function

Authors: Fabio Fontanot, Francesco La Barbera, Gabriella De Lucia, Rachele Cecchi, Lizhi Xie, Michaela Hirschmann, Gustavo Bruzual, Stéphane Charlot, Alexandre Vazdekis

Abstract: In our previous work, we derive the CR-IGIMF: a new scenario for a variable stellar initial mass function (IMF), which combines numerical results on the role played by cosmic rays in setting the thermal state of star forming gas, with the analytical approach of the integrated galaxy-wide IMF. In this work, we study the implications of this scenario for the properties of local Early-Type galaxies (ETG), as inferred from dynamical, photometric and spectroscopic studies. We implement a library of CR-IGIMF shapes in the framework of the Galaxy Evolution and Assembly (GAEA) model. Our realization includes a derivation of synthetic spectral energy distribution for each model galaxy, allowing a direct derivation of the mass fraction in the mean IMF of low-mass stars (i.e. the dwarf-to-giant ratio - $\rm f_{dg}$) and a comparison with IMF sensitive spectral features. The predictions of the GAEA model implementing the CR-IGIMF confirm our previous findings: it correctly reproduces both the observed excess of z$\sim$0 dynamical mass (mass-to-light ratios) with respect to spectroscopic (photometric) estimates assuming a universal, MW-like, IMF, and the observed increase of [$α$/Fe] ratios with stellar mass in spheroidal galaxies. Moreover, this realization reproduces the increasing trends of $\rm f_{dg}$, and IMF-sensitive line-strengths with velocity dispersion, although the predicted relations are significantly shallower than the observed ones.
Our results show that the CR-IGIMF is a promising scenario that reproduces at the same time dynamical, photometric and spectroscopic indications of a varying IMF in local ETGs. The shallow relations found for spectral indices suggest that either a stronger variability as a function of galaxy properties or additional dependences (e.g. as a function of star forming gas metallicity) might be required to match the strength of the observed trends.

Submitted 18 April, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

Comments: 13 pages, 8 figures, 2 tables, A&A accepted

Journal ref: A&A 686, A302 (2024)

arXiv:2311.11228 [pdf, other]

doi 10.1038/s41598-023-46382-8

A Universal Framework for Accurate and Efficient Geometric Deep Learning of Molecular Systems

Authors: Shuo Zhang, Yang Liu, Lei Xie

Abstract: Molecular sciences address a wide range of problems involving molecules of different types and sizes and their complexes. Recently, geometric deep learning, especially Graph Neural Networks, has shown promising performance in molecular science applications. However, most existing works often impose targeted inductive biases to a specific molecular system, and are inefficient when applied to macromolecules or large-scale tasks, thereby limiting their applications to many real-world problems. To address these challenges, we present PAMNet, a universal framework for accurately and efficiently learning the representations of three-dimensional (3D) molecules of varying sizes and types in any molecular system. Inspired by molecular mechanics, PAMNet induces a physics-informed bias to explicitly model local and non-local interactions and their combined effects. As a result, PAMNet can reduce expensive operations, making it time and memory efficient. In extensive benchmark studies, PAMNet outperforms state-of-the-art baselines regarding both accuracy and efficiency in three diverse learning tasks: small molecule properties, RNA 3D structures, and protein-ligand binding affinities. Our results highlight the potential for PAMNet in a broad range of molecular science applications.

Submitted 18 November, 2023; originally announced November 2023.

Comments: Published in Scientific Reports (DOI: 10.1038/s41598-023-46382-8)

Journal ref: Scientific Reports 13, 19171 (2023)

arXiv:2311.10806 [pdf, other]

SEA++: Multi-Graph-based High-Order Sensor Alignment for Multivariate Time-Series Unsupervised Domain Adaptation

Authors: Yucheng Wang, Yuecong Xu, Jianfei Yang, Min Wu, Xiaoli Li, Lihua Xie, Zhenghua Chen

Abstract: Unsupervised Domain Adaptation (UDA) methods have been successful in reducing label dependency by minimizing the domain discrepancy between a labeled source domain and an unlabeled target domain. However, these methods face challenges when dealing with Multivariate Time-Series (MTS) data. MTS data typically consist of multiple sensors, each with its own unique distribution. This characteristic makes it hard to adapt existing UDA methods, which mainly focus on aligning global features while overlooking the distribution discrepancies at the sensor level, to reduce domain discrepancies for MTS data. To address this issue, a practical domain adaptation scenario is formulated as Multivariate Time-Series Unsupervised Domain Adaptation (MTS-UDA). In this paper, we propose SEnsor Alignment (SEA) for MTS-UDA, aiming to reduce domain discrepancy at both the local and global sensor levels. At the local sensor level, we design endo-feature alignment, which aligns sensor features and their correlations across domains. To reduce domain discrepancy at the global sensor level, we design exo-feature alignment that enforces restrictions on global sensor features. We further extend SEA to SEA++ by enhancing the endo-feature alignment. Particularly, we incorporate multi-graph-based high-order alignment for both sensor features and their correlations. Extensive empirical results have demonstrated the state-of-the-art performance of our SEA and SEA++ on public MTS datasets for MTS-UDA.

Submitted 17 November, 2023; originally announced November 2023.

arXiv:2311.10219 [pdf, other]

Measuring Moral Dimensions in Social Media with Mformer

Authors: Tuan Dung Nguyen, Ziyu Chen, Nicholas George Carroll, Alasdair Tran, Colin Klein, Lexing Xie

Abstract: The ever-growing textual records of contemporary social issues, often discussed online with moral rhetoric, present both an opportunity and a challenge for studying how moral concerns are debated in real life. Moral foundations theory is a taxonomy of intuitions widely used in data-driven analyses of online content, but current computational tools to detect moral foundations suffer from the incompleteness and fragility of their lexicons and from poor generalization across data domains. In this paper, we fine-tune a large language model to measure moral foundations in text based on datasets covering news media and long- and short-form online discussions. The resulting model, called Mformer, outperforms existing approaches on the same domains by 4–12% in AUC and further generalizes well to four commonly used moral text datasets, improving by up to 17% in AUC. We present case studies using Mformer to analyze everyday moral dilemmas on Reddit and controversies on Twitter, showing that moral foundations can meaningfully describe people's stance on social issues and such variations are topic-dependent. Pre-trained model and datasets are released publicly. We posit that Mformer will help the research community quantify moral dimensions for a range of tasks and data domains, and eventually contribute to the understanding of moral situations faced by humans and machines.

Submitted 19 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

Comments: To be published in ICWSM 2024

arXiv:2311.08814 [pdf, ps, other]

The quotient spaces of topological groups with a $q$-point

Authors: Li-Hong Xie, Hai-Hua Lin, Piyu Li

Abstract: In this paper, we study the uniformities on the double coset spaces in topological groups. As an implication, the quotient spaces of topological groups with a $q$-point are studied. It mainly shows that: (1) Suppose that $G$ is a topological group with a $q$-point and $H$ is a closed subgroup of $G$; then the quotient space $G/H$ is an open and quasi-perfect preimage of a metrizable space; in particular, $G/H$ is an $M$-space. (2) Suppose that $G$ is a topological group with a strict $q$-point and $H$ is a closed subgroup of $G$; then the quotient space $G/H$ is an open and sequentially perfect preimage of a metrizable space. (3) Suppose that $G$ is a topological group with a strong $q$-point and $H$ is a closed subgroup of $G$; then the quotient space $G/H$ is an open and strongly sequentially perfect preimage of a metrizable space.

Submitted 15 November, 2023; originally announced November 2023.

Comments: 17

MSC Class: 54A20; 54H11; 54B15; 54C10; 54E15

arXiv:2311.08245 [pdf, other]

TENT: Connect Language Models with IoT Sensors for Zero-Shot Activity Recognition

Authors: Yunjiao Zhou, Jianfei Yang, Han Zou, Lihua Xie

Abstract: Recent achievements in language models have showcased their extraordinary capabilities in bridging visual information with semantic language understanding. This leads us to a novel question: can language models connect textual semantics with IoT sensory signals to perform recognition tasks, e.g., Human Activity Recognition (HAR)? If so, an intelligent HAR system with human-like cognition can be built, capable of adapting to new environments and unseen categories. This paper explores its feasibility with an innovative approach, IoT-sEnsors-language alignmEnt pre-Training (TENT), which jointly aligns textual embeddings with IoT sensor signals, including camera video, LiDAR, and mmWave. Through the IoT-language contrastive learning, we derive a unified semantic feature space that aligns multi-modal features with language embeddings, so that the IoT data corresponds to specific words that describe the IoT data. To enhance the connection between textual categories and their IoT data, we propose supplementary descriptions and learnable prompts that bring more semantic information into the joint feature space. TENT can not only recognize actions that have been seen but also "guess" the unseen action by the closest textual words from the feature space. We demonstrate TENT achieves state-of-the-art performance on zero-shot HAR tasks using different modalities, improving the best vision-language models by over 12%.

Submitted 14 November, 2023; originally announced November 2023.

Comments: Preprint manuscript in submission

arXiv:2311.07179 [pdf, other]

SponTTS: modeling and transferring spontaneous style for TTS

Authors: Hanzhao Li, Xinfa Zhu, Liumeng Xue, Yang Song, Yunlin Chen, Lei Xie

Abstract: Spontaneous speaking style exhibits notable differences from other speaking styles due to various spontaneous phenomena (e.g., filled pauses, prolongation) and substantial prosody variation (e.g., diverse pitch and duration variation, occasional non-verbal speech like a smile), posing challenges to modeling and prediction of spontaneous style. Moreover, the limitation of high-quality spontaneous data constrains spontaneous speech generation for speakers without spontaneous data. To address these problems, we propose SponTTS, a two-stage approach based on neural bottleneck (BN) features to model and transfer spontaneous style for TTS. In the first stage, we adopt a Conditional Variational Autoencoder (CVAE) to capture spontaneous prosody from a BN feature and involve the spontaneous phenomena by the constraint of spontaneous phenomena embedding prediction loss. Besides, we introduce a flow-based predictor to predict a latent spontaneous style representation from the text, which enriches the prosody and context-specific spontaneous phenomena during inference. In the second stage, we adopt a VITS-like module to transfer the spontaneous style learned in the first stage to the target speakers. Experiments demonstrate that SponTTS is effective in modeling spontaneous style and transferring the style to the target speakers, generating spontaneous speech with high naturalness, expressiveness, and speaker similarity.
The zero-shot spontaneous style TTS test further verifies the generalization and robustness of SponTTS in generating spontaneous speech for unseen speakers.

Submitted 8 January, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: 5 pages, 3 figures, Accepted by ICASSP2024

arXiv:2311.07081 [pdf, other]

Sensing Mutual Information with Random Signals in Gaussian Channels

Authors: Lei Xie, Fan Liu, Zhanyuan Xie, Zheng Jiang, Shenghui Song

Abstract: Sensing performance is typically evaluated by classical metrics, such as Cramer-Rao bound and signal-to-clutter-plus-noise ratio. The recent development of the integrated sensing and communication (ISAC) framework motivated the efforts to unify the metric for sensing and communication, where researchers have proposed to utilize mutual information (MI) to measure the sensing performance with deterministic signals. However, the need to communicate in ISAC systems necessitates the use of random signals for sensing applications and the closed-form evaluation for the sensing mutual information (SMI) with random signals is not yet available in the literature. This paper investigates the achievable performance and precoder design for sensing applications with random signals. For that purpose, we first derive the closed-form expression for the SMI with random signals by utilizing random matrix theory. The result reveals some interesting physical insights regarding the relation between the SMI with deterministic and random signals. The derived SMI is then utilized to optimize the precoder by leveraging a manifold-based optimization approach. The effectiveness of the proposed methods is validated by simulation results.

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.07062 [pdf, other]

doi 10.1109/TASLP.2023.3332542

Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition

Authors: Qijie Shao, Pengcheng Guo, Jinghao Yan, Pengfei Hu, Lei Xie

Abstract: Accents, as variations from standard pronunciation, pose significant challenges for speech recognition systems. Although joint automatic speech recognition (ASR) and accent recognition (AR) training has been proven effective in handling multi-accent scenarios, current multi-task ASR-AR approaches overlook the granularity differences between tasks. Fine-grained units capture pronunciation-related accent characteristics, while coarse-grained units are better for learning linguistic information. Moreover, an explicit interaction of two tasks can also provide complementary information and improve the performance of each other, but it is rarely used by existing approaches. In this paper, we propose a novel Decoupling and Interacting Multi-task Network (DIMNet) for joint speech and accent recognition, which is comprised of a connectionist temporal classification (CTC) branch, an AR branch, an ASR branch, and a bottom feature encoder. Specifically, AR and ASR are first decoupled by separated branches and two-granular modeling units to learn task-specific representations. The AR branch is from our previously proposed linguistic-acoustic bimodal AR model and the ASR branch is an encoder-decoder based Conformer model. Then, for the task interaction, the CTC branch provides aligned text for the AR task, while accent embeddings extracted from our AR model are incorporated into the ASR branch's encoder and decoder.
Finally, during ASR inference, a cross-granular rescoring method is introduced to fuse the complementary information from the CTC and attention decoder after the decoupling. Our experiments on English and Chinese datasets demonstrate the effectiveness of the proposed model, which achieves 21.45%/28.53% AR accuracy relative improvement and 32.33%/14.55% ASR error rate relative reduction over a published standard baseline, respectively.

Submitted 17 November, 2023; v1 submitted 12 November, 2023; originally announced November 2023.

Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing (TASLP)

arXiv:2311.02817 [pdf, other]

Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments

Authors: Lu Yue, Dongliang Zhou, Liang Xie, Feitian Zhang, Ye Yan, Erwei Yin

Abstract: The task of vision-and-language navigation in continuous environments (VLN-CE) aims at training an autonomous agent to perform low-level actions to navigate through 3D continuous surroundings using visual observations and language instructions. The significant potential of VLN-CE for mobile robots has been demonstrated across a large number of studies. However, most existing works in VLN-CE focus primarily on transferring the standard discrete vision-and-language navigation (VLN) methods to continuous environments, overlooking the problem of collisions. Such oversight often results in the agent deviating from the planned path or, in severe instances, the agent being trapped in obstacle areas and failing the navigational task. To address the above-mentioned issues, this paper investigates various collision scenarios within VLN-CE and proposes a classification method to predict the underlying causes of collisions. Furthermore, a new VLN-CE algorithm, named Safe-VLN, is proposed to bolster collision avoidance capabilities, comprising two key components, i.e., a waypoint predictor and a navigator. In particular, the waypoint predictor leverages a simulated 2D LiDAR occupancy mask to prevent the predicted waypoints from being situated in obstacle-ridden areas. The navigator, on the other hand, employs the strategy of "re-selection after collision" to prevent the robot agent from becoming ensnared in a cycle of perpetual collisions.
The proposed Safe-VLN is evaluated on the R2R-CE, the results of which demonstrate an enhanced navigational performance and a statistically significant reduction in collision incidences.

Submitted 11 April, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

arXiv:2311.02612 [pdf, other]

GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection

Authors: Jiangning Zhang, Haoyang He, Xuhai Chen, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lei Xie, Yong Liu

Abstract: Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding capabilities, making it possible to handle certain tasks through the Visual Question Answering (VQA) paradigm. This paper explores the potential of VQA-oriented GPT-4V in the recently popular visual Anomaly Detection (AD) and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets. Considering that this task requires both image- and pixel-level evaluations, the proposed GPT-4V-AD framework contains three components: 1) Granular Region Division, 2) Prompt Designing, and 3) Text2Segmentation for easy quantitative evaluation, and we make several different attempts for comparative analysis. The results show that GPT-4V can achieve certain results in the zero-shot AD task through a VQA paradigm, such as achieving image-level 77.1/88.0 and pixel-level 68.0/76.6 AU-ROCs on MVTec AD and VisA datasets, respectively. However, its performance still has a certain gap compared to state-of-the-art zero-shot methods, e.g., WinCLIP and CLIP-AD, and further research is needed. This study provides a baseline reference for the research of VQA-oriented LMM in the zero-shot AD task, and we also propose several possible directions for future work. Code is available at https://github.com/zhangzjn/GPT-4V-AD.

Submitted 16 April, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

arXiv:2311.02250 [pdf, other]

Efficient Scenario Generation for Chance-constrained Economic Dispatch Considering Ambient Wind Conditions

Authors: Qian Zhang, Apurv Shukla, Le Xie

Abstract: Scenario generation is an effective data-driven method for solving chance-constrained optimization while ensuring desired risk guarantees with a finite number of samples. Crucial challenges in deploying this technique in the real world arise due to the absence of appropriate risk-tuning models tailored for the desired application. In this paper, we focus on designing efficient scenario generation schemes for economic dispatch in power systems. We propose a novel scenario generation method based on filtering scenarios using ambient wind conditions. These filtered scenarios are deployed incrementally in order to meet desired risk levels while using minimum resources. In order to study the performance of the proposed scheme, we illustrate the procedure on case studies performed for both 24-bus and 118-bus systems with real-world wind power forecasting data. Numerical results suggest that the proposed filter-and-increment scenario generation model leads to a precise and efficient solution for the chance-constrained economic dispatch problem.

Submitted 2 January, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

Comments: 12 pages

arXiv:2311.00345 [pdf, ps, other]

Some characterizations of $ω$-balanced topological groups with a $q$-point

Authors: Deng-Bin Chen, Hai-Hua Lin, Li-Hong Xie

Abstract: In this paper, we study some characterizations of $q$-spaces, strict $q$-spaces and strong $q$-spaces under $ω$-balanced topological groups as follows: (1) A topological group $G$ is $ω$-balanced and a $q$-space if and only if for each open neighborhood $O$ of the identity in $G$, there is a countably compact invariant subgroup $H$ which is of countable character in $G$, such that $H \subseteq O$ and the canonical quotient mapping $p:G\rightarrow G/H$ is quasi-perfect and the quotient group $G/H$ is metrizable. (2) A topological group $G$ is $ω$-balanced and a strict $q$-space if and only if for each open neighborhood $O$ of the identity in $G$, there is a closed sequentially compact invariant subgroup $H$ which is of countable character in $G$, such that $H \subseteq O$ and the canonical quotient mapping $p:G\rightarrow G/H$ is sequential-perfect and the quotient group $G/H$ is metrizable. (3) A topological group $G$ is $ω$-balanced and a strong $q$-space if and only if for each open neighborhood $O$ of the identity in $G$, there is a closed sequentially compact invariant subgroup $H$ of countable character $\{V_{n}:n\in ω\}$, such that $H \subseteq O$ and $\{V_{n}:n\in ω\}$ is a strong $q$-sequence at each $y\in H$, in $G$ such that the canonical quotient mapping $p:G\rightarrow G/H$ is strongly sequential-perfect and the quotient group $G/H$ is metrizable.

Submitted 1 November, 2023; originally announced November 2023.

Comments: 11

arXiv:2311.00263 [pdf, other]

The bottleneck and ceiling effects in quantized tracking control of heterogeneous multi-agent systems under DoS attacks

Authors: Shuai Feng, Maopeng Ran, Baoyong Zhang, Lihua Xie, Shengyuan Xu

Abstract: In this paper, we investigate tracking control of heterogeneous multi-agent systems under Denial-of-Service (DoS) attacks and state quantization. Dynamic quantized mechanisms are designed for inter-follower communication and leader-follower communication. Zooming-in and out factors, and data rates of both mechanisms for preventing quantizer saturation are provided. Our results show that by tuning the inter-follower quantized controller, one cannot improve the resilience beyond a level determined by the data rate of leader-follower quantized communication, i.e., the ceiling effect. Otherwise, overflow of followers' state quantizer can occur. On the other hand, if one selects a "large" data rate for leader-follower quantized communication, then the inter-follower quantized communication determines the resilience, and further increasing the data rate for leader-follower quantized communication cannot improve the resilience, i.e., the bottleneck effect. Simulation examples are provided to justify the results of our paper.

Submitted 31 October, 2023; originally announced November 2023.

arXiv:2310.19787 [pdf]

$e^{\text{RPCA}}$: Robust Principal Component Analysis for Exponential Family Distributions

Authors: Xiaojun Zheng, Simon Mak, Liyan Xie, Yao Xie

Abstract: Robust Principal Component Analysis (RPCA) is a widely used method for recovering low-rank structure from data matrices corrupted by significant and sparse outliers. These corruptions may arise from occlusions, malicious tampering, or other causes for anomalies, and the joint identification of such corruptions with low-rank background is critical for process monitoring and diagnosis. However, existing RPCA methods and their extensions largely do not account for the underlying probabilistic distribution for the data matrices, which in many applications are known and can be highly non-Gaussian. We thus propose a new method called Robust Principal Component Analysis for Exponential Family distributions ($e^{\text{RPCA}}$), which can perform the desired decomposition into low-rank and sparse matrices when such a distribution falls within the exponential family. We present a novel alternating direction method of multiplier optimization algorithm for efficient $e^{\text{RPCA}}$ decomposition. The effectiveness of $e^{\text{RPCA}}$ is then demonstrated in two applications: the first for steel sheet defect detection, and the second for crime activity monitoring in the Atlanta metropolitan area.

Submitted 30 October, 2023; originally announced October 2023.

arXiv:2310.18801 [pdf, other]

doi 10.1109/TAC.2023.3330801

Integrated Relative-Measurement-Based Network Localization and Formation Maneuver Control (Extended Version)

Authors: Xu Fang, Lihua Xie, Xiaolei Li

Abstract: This paper studies the problem of integrated distributed network localization and formation maneuver control. We develop an integrated relative-measurement-based scheme, which only uses relative positions, distances, bearings, angles, ratio-of-distances, or their combination to achieve distributed network localization and formation maneuver control in $\mathbb{R}^d (d \ge 2)$. By exploring the localizability and invariance of the target formation, the scale, rotation, and translation of the formation can be controlled simultaneously by only tuning the leaders' positions, i.e., the followers do not need to know parameters of the scale, rotation, and translation of the target formation. The proposed method can globally drive the formation errors to zero in finite time over multi-layer $d\!+\!1$-rooted graphs. A simulation example is given to illustrate the theoretical results.

Submitted 13 November, 2023; v1 submitted 28 October, 2023; originally announced October 2023.

Comments: 12 pages; 7 figures, title corrected, DOI added

Showing 101–150 of 1,143 results for author: Xie, L