Search | arXiv e-print repository

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Authors: Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang

Abstract: Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduc… ▽ More Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at https://github.com/We-Math/We-Math. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: Work in progress

arXiv:2406.17841 [pdf, other]

Probing many-body Bell correlation depth with superconducting qubits

Authors: Ke Wang, Weikang Li, Shibo Xu, Mengyao Hu, Jiachen Chen, Yaozu Wu, Chuanyu Zhang, Feitong **, Xuhao Zhu, Yu Gao, Ziqi Tan, Aosai Zhang, Ning Wang, Yiren Zou, Tingting Li, Fanhao Shen, Jiarun Zhong, Zehang Bao, Zitian Zhu, Zixuan Song, **feng Deng, Hang Dong, Xu Zhang, Pengfei Zhang, Wenjie Jiang , et al. (10 additional authors not shown)

Abstract: Quantum nonlocality describes a stronger form of quantum correlation than that of entanglement. It refutes Einstein's belief of local realism and is among the most distinctive and enigmatic features of quantum mechanics. It is a crucial resource for achieving quantum advantages in a variety of practical applications, ranging from cryptography and certified random number generation via self-testing… ▽ More Quantum nonlocality describes a stronger form of quantum correlation than that of entanglement. It refutes Einstein's belief of local realism and is among the most distinctive and enigmatic features of quantum mechanics. It is a crucial resource for achieving quantum advantages in a variety of practical applications, ranging from cryptography and certified random number generation via self-testing to machine learning. Nevertheless, the detection of nonlocality, especially in quantum many-body systems, is notoriously challenging. Here, we report an experimental certification of genuine multipartite Bell correlations, which signal nonlocality in quantum many-body systems, up to 24 qubits with a fully programmable superconducting quantum processor. In particular, we employ energy as a Bell correlation witness and variationally decrease the energy of a many-body system across a hierarchy of thresholds, below which an increasing Bell correlation depth can be certified from experimental data. As an illustrating example, we variationally prepare the low-energy state of a two-dimensional honeycomb model with 73 qubits and certify its Bell correlations by measuring an energy that surpasses the corresponding classical bound with up to 48 standard deviations. In addition, we variationally prepare a sequence of low-energy states and certify their genuine multipartite Bell correlations up to 24 qubits via energies measured efficiently by parity oscillation and multiple quantum coherence techniques. Our results establish a viable approach for preparing and certifying multipartite Bell correlations, which provide not only a finer benchmark beyond entanglement for quantum devices, but also a valuable guide towards exploiting multipartite Bell correlation in a wide spectrum of practical applications. △ Less

Submitted 25 June, 2024; originally announced June 2024.

Comments: 11 pages,6 figures + 14 pages, 6 figures

arXiv:2406.16601 [pdf, other]

Do As I Do: Pose Guided Human Motion Copy

Authors: Sifan Wu, Zhenguang Liu, Beibei Zhang, Roger Zimmermann, Zhongjie Ba, Xiaosong Zhang, Kui Ren

Abstract: Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision, which strives to generate a fake video of a target person performing the motion of a source person. The problem is inherently challenging due to the subtle human-body texture details to be generated and the temporal consistency to be considered. Existing approaches typically adopt a conventional… ▽ More Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision, which strives to generate a fake video of a target person performing the motion of a source person. The problem is inherently challenging due to the subtle human-body texture details to be generated and the temporal consistency to be considered. Existing approaches typically adopt a conventional GAN with an L1 or L2 loss to produce the target fake video, which intrinsically necessitates a large number of training samples that are challenging to acquire. Meanwhile, current methods still have difficulties in attaining realistic image details and temporal consistency, which unfortunately can be easily perceived by human observers. Motivated by this, we try to tackle the issues from three aspects: (1) We constrain pose-to-appearance generation with a perceptual loss and a theoretically motivated Gromov-Wasserstein loss to bridge the gap between pose and appearance. (2) We present an episodic memory module in the pose-to-appearance generation to propel continuous learning that helps the model learn from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We advocate generating the foreground in a sequence-to-sequence manner rather than a single-frame manner, explicitly enforcing temporal inconsistency. Empirical results on five datasets, iPER, ComplexMotion, SoloDance, Fish, and Mouse datasets, demonstrate that our method is capable of generating realistic target videos while precisely copying motion from a source video. Our method significantly outperforms state-of-the-art approaches and gains 7.2% and 12.4% improvements in PSNR and FID respectively. △ Less

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.03865 [pdf, other]

Semantic Similarity Score for Measuring Visual Similarity at Semantic Level

Authors: Senran Fan, Zhicheng Bao, Chen Dong, Haotai Liang, Xiaodong Xu, ** Zhang

Abstract: Semantic communication, as a revolutionary communication architecture, is considered a promising novel communication paradigm. Unlike traditional symbol-based error-free communication systems, semantic-based visual communication systems extract, compress, transmit, and reconstruct images at the semantic level. However, widely used image similarity evaluation metrics, whether pixel-based MSE or PSN… ▽ More Semantic communication, as a revolutionary communication architecture, is considered a promising novel communication paradigm. Unlike traditional symbol-based error-free communication systems, semantic-based visual communication systems extract, compress, transmit, and reconstruct images at the semantic level. However, widely used image similarity evaluation metrics, whether pixel-based MSE or PSNR or structure-based MS-SSIM, struggle to accurately measure the loss of semantic-level information of the source during system transmission. This presents challenges in evaluating the performance of visual semantic communication systems, especially when comparing them with traditional communication systems. To address this, we propose a semantic evaluation metric -- SeSS (Semantic Similarity Score), based on Scene Graph Generation and graph matching, which shifts the similarity scores between images into semantic-level graph matching scores. Meanwhile, semantic similarity scores for tens of thousands of image pairs are manually annotated to fine-tune the hyperparameters in the graph matching algorithm, aligning the metric more closely with human semantic perception. The performance of the SeSS is tested on different datasets, including (1)images transmitted by traditional and semantic communication systems at different compression rates, (2)images transmitted by traditional and semantic communication systems at different signal-to-noise ratios, (3)images generated by large-scale model with different noise levels introduced, and (4)cases of images subjected to certain special transformations. The experiments demonstrate the effectiveness of SeSS, indicating that the metric can measure the semantic-level differences in semantic-level information of images and can be used for evaluation in visual semantic communication systems. △ Less

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2405.04929 [pdf, ps, other]

Enabling Roll-up and Drill-down Operations in News Exploration with Knowledge Graphs for Due Diligence and Risk Management

Authors: Sha Wang, Yuchen Li, Hanhua Xiao, Zhifeng Bao, Lambert Deng, Yanfei Dong

Abstract: Efficient news exploration is crucial in real-world applications, particularly within the financial sector, where numerous control and risk assessment tasks rely on the analysis of public news reports. The current processes in this domain predominantly rely on manual efforts, often involving keywordbased searches and the compilation of extensive keyword lists. In this paper, we introduce NCEXPLORE… ▽ More Efficient news exploration is crucial in real-world applications, particularly within the financial sector, where numerous control and risk assessment tasks rely on the analysis of public news reports. The current processes in this domain predominantly rely on manual efforts, often involving keywordbased searches and the compilation of extensive keyword lists. In this paper, we introduce NCEXPLORER, a framework designed with OLAP-like operations to enhance the news exploration experience. NCEXPLORER empowers users to use roll-up operations for a broader content overview and drill-down operations for detailed insights. These operations are achieved through integration with external knowledge graphs (KGs), encompassing both fact-based and ontology-based structures. This integration significantly augments exploration capabilities, offering a more comprehensive and efficient approach to unveiling the underlying structures and nuances embedded in news content. Extensive empirical studies through master-qualified evaluators on Amazon Mechanical Turk demonstrate NCEXPLORER's superiority over existing state-of-the-art news search methodologies across an array of topic domains, using real-world news datasets. △ Less

Submitted 8 May, 2024; originally announced May 2024.

Comments: The paper was accepted by ICDE 2024

arXiv:2405.03708 [pdf]

Delta Tensor: Efficient Vector and Tensor Storage in Delta Lake

Authors: Zhiwei Bao, Liu Liao-Liao, Zhiyu Wu, Yifan Zhou, Dan Fan, Michal Aibin, Yvonne Coady, Andrew Brownsword

Abstract: The exponential growth of artificial intelligence (AI) and machine learning (ML) applications has necessitated the development of efficient storage solutions for vector and tensor data. This paper presents a novel approach for tensor storage in a Lakehouse architecture using Delta Lake. By adopting the multidimensional array storage strategy from array databases and sparse encoding methods to Delt… ▽ More The exponential growth of artificial intelligence (AI) and machine learning (ML) applications has necessitated the development of efficient storage solutions for vector and tensor data. This paper presents a novel approach for tensor storage in a Lakehouse architecture using Delta Lake. By adopting the multidimensional array storage strategy from array databases and sparse encoding methods to Delta Lake tables, experiments show that this approach has demonstrated notable improvements in both space and time efficiencies when compared to traditional serialization of tensors. These results provide valuable insights for the development and implementation of optimized vector and tensor storage solutions in data-intensive applications, contributing to the evolution of efficient data management practices in AI and ML domains in cloud-native environments △ Less

Submitted 13 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

arXiv:2405.02335 [pdf, other]

sDAC -- Semantic Digital Analog Converter for Semantic Communications

Authors: Zhicheng Bao, Chen Dong, Xiaodong Xu

Abstract: In this paper, we propose a novel semantic digital analog converter (sDAC) for the compatibility of semantic communications and digital communications. Most of the current semantic communication systems are based on the analog modulations, ignoring their incorporation with digital communication systems, which are more common in practice. In fact, quantization methods in traditional communication s… ▽ More In this paper, we propose a novel semantic digital analog converter (sDAC) for the compatibility of semantic communications and digital communications. Most of the current semantic communication systems are based on the analog modulations, ignoring their incorporation with digital communication systems, which are more common in practice. In fact, quantization methods in traditional communication systems are not appropriate for use in the era of semantic communication as these methods do not consider the semantic information inside symbols. In this case, any bit flip caused by channel noise can lead to a great performance drop. To address this challenge, sDAC is proposed. It is a simple yet efficient and generative module used to realize digital and analog bi-directional conversion. On the transmitter side, continuous values from the encoder are converted to binary bits and then can be modulated by any existing methods. After transmitting through the noisy channel, these bits get demodulated by paired methods and converted back to continuous values for further semantic decoding. The whole progress does not depend on any specific semantic model, modulation methods, or channel conditions. In the experiment section, the performance of sDAC is tested across different semantic models, semantic tasks, modulation methods, channel conditions and quantization orders. Test results show that the proposed sDAC has great generative properties and channel robustness. △ Less

Submitted 26 April, 2024; originally announced May 2024.

arXiv:2404.15878 [pdf, other]

Simulating unsteady fluid flows on a superconducting quantum processor

Authors: Zhaoyuan Meng, Jiarun Zhong, Shibo Xu, Ke Wang, Jiachen Chen, Feitong **, Xuhao Zhu, Yu Gao, Yaozu Wu, Chuanyu Zhang, Ning Wang, Yiren Zou, Aosai Zhang, Zhengyi Cui, Fanhao Shen, Zehang Bao, Zitian Zhu, Ziqi Tan, Tingting Li, Pengfei Zhang, Shiying Xiong, Hekang Li, Qiujiang Guo, Zhen Wang, Chao Song , et al. (2 additional authors not shown)

Abstract: Recent advancements of intermediate-scale quantum processors have triggered tremendous interest in the exploration of practical quantum advantage. The simulation of fluid dynamics, a highly challenging problem in classical physics but vital for practical applications, emerges as a good candidate for showing quantum utility. Here, we report an experiment on the digital simulation of unsteady flows,… ▽ More Recent advancements of intermediate-scale quantum processors have triggered tremendous interest in the exploration of practical quantum advantage. The simulation of fluid dynamics, a highly challenging problem in classical physics but vital for practical applications, emerges as a good candidate for showing quantum utility. Here, we report an experiment on the digital simulation of unsteady flows, which consists of quantum encoding, evolution, and detection of flow states, with a superconducting quantum processor. The quantum algorithm is based on the Hamiltonian simulation using the hydrodynamic formulation of the Schrödinger equation. With the median fidelities of 99.97% and 99.67% for parallel single- and two-qubit gates respectively, we simulate the dynamics of a two-dimensional (2D) compressible diverging flow and a 2D decaying vortex with ten qubits. The experimental results well capture the temporal evolution of averaged density and momentum profiles, and qualitatively reproduce spatial flow fields with moderate noises. This work demonstrates the potential of quantum computing in simulating more complex flows, such as turbulence, for practical applications. △ Less

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.14249 [pdf, other]

CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding

Authors: Guibiao Liao, Jiankun Li, Zhenyu Bao, Xiaoqing Ye, **gdong Wang, Qing Li, Kanglin Liu

Abstract: The recent 3D Gaussian Splatting (GS) exhibits high-quality and real-time synthesis of novel views in 3D scenes. Currently, it primarily focuses on geometry and appearance modeling, while lacking the semantic understanding of scenes. To bridge this gap, we present CLIP-GS, which integrates semantics from Contrastive Language-Image Pre-Training (CLIP) into Gaussian Splatting to efficiently comprehe… ▽ More The recent 3D Gaussian Splatting (GS) exhibits high-quality and real-time synthesis of novel views in 3D scenes. Currently, it primarily focuses on geometry and appearance modeling, while lacking the semantic understanding of scenes. To bridge this gap, we present CLIP-GS, which integrates semantics from Contrastive Language-Image Pre-Training (CLIP) into Gaussian Splatting to efficiently comprehend 3D environments without annotated semantic data. In specific, rather than straightforwardly learning and rendering high-dimensional semantic features of 3D Gaussians, which significantly diminishes the efficiency, we propose a Semantic Attribute Compactness (SAC) approach. SAC exploits the inherent unified semantics within objects to learn compact yet effective semantic representations of 3D Gaussians, enabling highly efficient rendering (>100 FPS). Additionally, to address the semantic ambiguity, caused by utilizing view-inconsistent 2D CLIP semantics to supervise Gaussians, we introduce a 3D Coherent Self-training (3DCS) strategy, resorting to the multi-view consistency originated from the 3D model. 3DCS imposes cross-view semantic consistency constraints by leveraging refined, self-predicted pseudo-labels derived from the trained 3D Gaussian model, thereby enhancing precise and view-consistent segmentation results. Extensive experiments demonstrate that our method remarkably outperforms existing state-of-the-art approaches, achieving improvements of 17.29% and 20.81% in mIoU metric on Replica and ScanNet datasets, respectively, while maintaining real-time rendering speed. Furthermore, our approach exhibits superior performance even with sparse input data, verifying the robustness of our method. △ Less

Submitted 22 April, 2024; originally announced April 2024.

Comments: https://github.com/gbliao/CLIP-GS

arXiv:2404.00091 [pdf, other]

doi 10.1038/s41567-024-02529-6

Non-Abelian braiding of Fibonacci anyons with a superconducting processor

Authors: Shibo Xu, Zheng-Zhi Sun, Ke Wang, Hekang Li, Zitian Zhu, Hang Dong, **feng Deng, Xu Zhang, Jiachen Chen, Yaozu Wu, Chuanyu Zhang, Feitong **, Xuhao Zhu, Yu Gao, Aosai Zhang, Ning Wang, Yiren Zou, Ziqi Tan, Fanhao Shen, Jiarun Zhong, Zehang Bao, Weikang Li, Wenjie Jiang, Li-Wei Yu, Zixuan Song , et al. (7 additional authors not shown)

Abstract: Non-Abelian topological orders offer an intriguing path towards fault-tolerant quantum computation, where information can be encoded and manipulated in a topologically protected manner immune to arbitrary local noises and perturbations. However, realizing non-Abelian topologically ordered states is notoriously challenging in both condensed matter and programmable quantum systems, and it was not un… ▽ More Non-Abelian topological orders offer an intriguing path towards fault-tolerant quantum computation, where information can be encoded and manipulated in a topologically protected manner immune to arbitrary local noises and perturbations. However, realizing non-Abelian topologically ordered states is notoriously challenging in both condensed matter and programmable quantum systems, and it was not until recently that signatures of non-Abelian statistics were observed through digital quantum simulation approaches. Despite these exciting progresses, none of them has demonstrated the appropriate type of topological orders and associated non-Abelian anyons whose braidings alone support universal quantum computation. Here, we report the realization of non-Abelian topologically ordered states of the Fibonacci string-net model and demonstrate braidings of Fibonacci anyons featuring universal computational power, with a superconducting quantum processor. We exploit efficient quantum circuits to prepare the desired states and verify their nontrivial topological nature by measuring the topological entanglement entropy. In addition, we create two pairs of Fibonacci anyons and demonstrate their fusion rule and non-Abelian braiding statistics by applying unitary gates on the underlying physical qubits. Our results establish a versatile digital approach to exploring exotic non-Abelian topological states and their associated braiding statistics with current noisy intermediate-scale quantum processors. △ Less

Submitted 29 March, 2024; originally announced April 2024.

arXiv:2403.16935 [pdf, other]

Measuring Spectral Form Factor in Many-Body Chaotic and Localized Phases of Quantum Processors

Authors: Hang Dong, Pengfei Zhang, Ceren B. Dag, Yu Gao, Ning Wang, **feng Deng, Xu Zhang, Jiachen Chen, Shibo Xu, Ke Wang, Yaozu Wu, Chuanyu Zhang, Feitong **, Xuhao Zhu, Aosai Zhang, Yiren Zou, Ziqi Tan, Zhengyi Cui, Zitian Zhu, Fanhao Shen, Tingting Li, Jiarun Zhong, Zehang Bao, Hekang Li, Zhen Wang , et al. (6 additional authors not shown)

Abstract: The spectral form factor (SFF) captures universal spectral fluctuations as signatures of quantum chaos, and has been instrumental in advancing multiple frontiers of physics including the studies of black holes and quantum many-body systems. However, the measurement of SFF in many-body systems is challenging due to the difficulty in resolving level spacings that become exponentially small with incr… ▽ More The spectral form factor (SFF) captures universal spectral fluctuations as signatures of quantum chaos, and has been instrumental in advancing multiple frontiers of physics including the studies of black holes and quantum many-body systems. However, the measurement of SFF in many-body systems is challenging due to the difficulty in resolving level spacings that become exponentially small with increasing system size. Here we experimentally measure the SFF to probe the presence or absence of chaos in quantum many-body systems using a superconducting quantum processor with a randomized measurement protocol. For a Floquet chaotic system, we observe signatures of spectral rigidity of random matrix theory in SFF given by the ramp-plateau behavior. For a Hamiltonian system, we utilize SFF to distinguish the quantum many-body chaotic phase and the prethermal many-body localization. We observe the dip-ramp-plateau behavior of random matrix theory in the chaotic phase, and contrast the scaling of the plateau time in system size between the many-body chaotic and localized phases. Furthermore, we probe the eigenstate statistics by measuring a generalization of the SFF, known as the partial SFF, and observe distinct behaviors in the purities of the reduced density matrix in the two phases. This work unveils a new way of extracting the universal signatures of many-body quantum chaos in quantum devices by probing the correlations in eigenenergies and eigenstates. △ Less

Submitted 25 March, 2024; originally announced March 2024.

Comments: 12 pages, 9 figures

arXiv:2403.12542 [pdf, ps, other]

Attitude Tracking of Uncertain Flexible Spacecraft Systems Subject to Unknown External Disturbances

Authors: Zean Bao, Maobin Lu, Fang Deng, Jie Chen

Abstract: In this paper, we investigate the attitude tracking problem of uncertain flexible spacecraft systems subject to external disturbances. In sharp contrast to existing results, the dynamics of flexible spacecraft systems and external disturbances are allowed to be unknown. To deal with the challenges by these unknown factors, we develop a class of nonlinear internal models which converts the attitude… ▽ More In this paper, we investigate the attitude tracking problem of uncertain flexible spacecraft systems subject to external disturbances. In sharp contrast to existing results, the dynamics of flexible spacecraft systems and external disturbances are allowed to be unknown. To deal with the challenges by these unknown factors, we develop a class of nonlinear internal models which converts the attitude tracking problem of uncertain flexible spacecraft systems into a regulation problem of an augmented system. Furthermore, to overcome the difficulties caused by the unmeasurable modal variable, the uncertainty introduced by the internal model, and the cross-coupling of the uncertainties with the system state, we design an auxiliary dynamic system for auxiliary stabilization, a dynamic compensator for dynamic compensation, and a linearly parameterized transformation for adaptive regulation in sequence. By introducing a series of coordinate and input transformations, we propose an adaptive dynamic control law to achieve regulation of the augmented system and thus leading to the solution to the attitude tracking problem. In addition, we analyze the convergence issue of the estimated parameter to its true value by the persistently exciting condition. Finally, the effec tiveness of the developed approach is verified by its application to the attitude manoeuvre of a flexible spacecraft system in the presence of external disturbances. △ Less

Submitted 19 March, 2024; originally announced March 2024.

Comments: 8 pages, 2 figures, submitted to TAC on 6 Dec. 2023

arXiv:2403.11495 [pdf, other]

Semantic-Enhanced Representation Learning for Road Networks with Temporal Dynamics

Authors: Yile Chen, Xiucheng Li, Gao Cong, Zhifeng Bao, Cheng Long

Abstract: In this study, we introduce a novel framework called Toast for learning general-purpose representations of road networks, along with its advanced counterpart DyToast, designed to enhance the integration of temporal dynamics to boost the performance of various time-sensitive downstream tasks. Specifically, we propose to encode two pivotal semantic characteristics intrinsic to road networks: traffic… ▽ More In this study, we introduce a novel framework called Toast for learning general-purpose representations of road networks, along with its advanced counterpart DyToast, designed to enhance the integration of temporal dynamics to boost the performance of various time-sensitive downstream tasks. Specifically, we propose to encode two pivotal semantic characteristics intrinsic to road networks: traffic patterns and traveling semantics. To achieve this, we refine the skip-gram module by incorporating auxiliary objectives aimed at predicting the traffic context associated with a target road segment. Moreover, we leverage trajectory data and design pre-training strategies based on Transformer to distill traveling semantics on road networks. DyToast further augments this framework by employing unified trigonometric functions characterized by their beneficial properties, enabling the capture of temporal evolution and dynamic nature of road networks more effectively. With these proposed techniques, we can obtain representations that encode multi-faceted aspects of knowledge within road networks, applicable across both road segment-based applications and trajectory-based applications. Extensive experiments on two real-world datasets across three tasks demonstrate that our proposed framework consistently outperforms the state-of-the-art baselines by a significant margin. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.01786 [pdf, other]

Exposing the Deception: Uncovering More Forgery Clues for Deepfake Detection

Authors: Zhongjie Ba, Qingyu Liu, Zhenguang Liu, Shuang Wu, Feng Lin, Li Lu, Kui Ren

Abstract: Deepfake technology has given rise to a spectrum of novel and compelling applications. Unfortunately, the widespread proliferation of high-fidelity fake videos has led to pervasive confusion and deception, shattering our faith that seeing is believing. One aspect that has been overlooked so far is that current deepfake detection approaches may easily fall into the trap of overfitting, focusing onl… ▽ More Deepfake technology has given rise to a spectrum of novel and compelling applications. Unfortunately, the widespread proliferation of high-fidelity fake videos has led to pervasive confusion and deception, shattering our faith that seeing is believing. One aspect that has been overlooked so far is that current deepfake detection approaches may easily fall into the trap of overfitting, focusing only on forgery clues within one or a few local regions. Moreover, existing works heavily rely on neural networks to extract forgery features, lacking theoretical constraints guaranteeing that sufficient forgery clues are extracted and superfluous features are eliminated. These deficiencies culminate in unsatisfactory accuracy and limited generalizability in real-life scenarios. In this paper, we try to tackle these challenges through three designs: (1) We present a novel framework to capture broader forgery clues by extracting multiple non-overlap** local representations and fusing them into a global semantic-rich feature. (2) Based on the information bottleneck theory, we derive Local Information Loss to guarantee the orthogonality of local representations while preserving comprehensive task-relevant information. (3) Further, to fuse the local representations and remove task-irrelevant information, we arrive at a Global Information Loss through the theoretical analysis of mutual information. Empirically, our method achieves state-of-the-art performance on five benchmark datasets.Our code is available at \url{https://github.com/QingyuLiu/Exposing-the-Deception}, ho** to inspire researchers. △ Less

Submitted 4 March, 2024; originally announced March 2024.

Comments: AAAI2024

arXiv:2402.05027 [pdf, other]

Towards Generalizability of Multi-Agent Reinforcement Learning in Graphs with Recurrent Message Passing

Authors: Jannis Weil, Zhenghua Bao, Osama Abboud, Tobias Meuser

Abstract: Graph-based environments pose unique challenges to multi-agent reinforcement learning. In decentralized approaches, agents operate within a given graph and make decisions based on partial or outdated observations. The size of the observed neighborhood limits the generalizability to different graphs and affects the reactivity of agents, the quality of the selected actions, and the communication ove… ▽ More Graph-based environments pose unique challenges to multi-agent reinforcement learning. In decentralized approaches, agents operate within a given graph and make decisions based on partial or outdated observations. The size of the observed neighborhood limits the generalizability to different graphs and affects the reactivity of agents, the quality of the selected actions, and the communication overhead. This work focuses on generalizability and resolves the trade-off in observed neighborhood size with a continuous information flow in the whole graph. We propose a recurrent message-passing model that iterates with the environment's steps and allows nodes to create a global representation of the graph by exchanging messages with their neighbors. Agents receive the resulting learned graph observations based on their location in the graph. Our approach can be used in a decentralized manner at runtime and in combination with a reinforcement learning algorithm of choice. We evaluate our method across 1000 diverse graphs in the context of routing in communication networks and find that it enables agents to generalize and adapt to changes in the graph. △ Less

Submitted 4 June, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

Comments: Accepted at AAMAS 2024, version with appendix; corrected typo in equation (1)

arXiv:2402.04648 [pdf, other]

OV-NeRF: Open-vocabulary Neural Radiance Fields with Vision and Language Foundation Models for 3D Semantic Understanding

Authors: Guibiao Liao, Kaichen Zhou, Zhenyu Bao, Kanglin Liu, Qing Li

Abstract: The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field… ▽ More The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field learning encounter difficulties due to noisy and view-inconsistent semantics provided by CLIP. To tackle these limitations, we propose OV-NeRF, which exploits the potential of pre-trained vision and language foundation models to enhance semantic field learning through proposed single-view and cross-view strategies. First, from the single-view perspective, we introduce Region Semantic Ranking (RSR) regularization by leveraging 2D mask proposals derived from SAM to rectify the noisy semantics of each training view, facilitating accurate semantic field learning. Second, from the cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy to address the challenge raised by view-inconsistent semantics. Rather than invariably utilizing the 2D inconsistent semantics from CLIP, CSE leverages the 3D consistent semantics generated from the well-trained semantic field itself for semantic field training, aiming to reduce ambiguity and enhance overall semantic consistency across different views. Extensive experiments validate our OV-NeRF outperforms current state-of-the-art methods, achieving a significant improvement of 20.31% and 18.42% in mIoU metric on Replica and Scannet, respectively. Furthermore, our approach exhibits consistent superior results across various CLIP configurations, further verifying its robustness. △ Less

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.00936 [pdf, other]

doi 10.1038/s41467-024-48791-3

Enhanced quantum state transfer: Circumventing quantum chaotic behavior

Authors: Liang Xiang, Jiachen Chen, Zitian Zhu, Zixuan Song, Zehang Bao, Xuhao Zhu, Feitong **, Ke Wang, Shibo Xu, Yiren Zou, Hekang Li, Zhen Wang, Chao Song, Alexander Yue, Justine Partridge, Qiujiang Guo, Rubem Mondaini, H. Wang, Richard T. Scalettar

Abstract: The ability to realize high-fidelity quantum communication is one of the many facets required to build generic quantum computing devices. In addition to quantum processing, sensing, and storage, transferring the resulting quantum states demands a careful design that finds no parallel in classical communication. Existing experimental demonstrations of quantum information transfer in solid-state qua… ▽ More The ability to realize high-fidelity quantum communication is one of the many facets required to build generic quantum computing devices. In addition to quantum processing, sensing, and storage, transferring the resulting quantum states demands a careful design that finds no parallel in classical communication. Existing experimental demonstrations of quantum information transfer in solid-state quantum systems are largely confined to small chains with few qubits, often relying upon non-generic schemes. Here, by using a large-scale superconducting quantum circuit featuring thirty-six tunable qubits, accompanied by general optimization procedures deeply rooted in overcoming quantum chaotic behavior, we demonstrate a scalable protocol for transferring few-particle quantum states in a two-dimensional quantum network. These include single-qubit excitation and also two-qubit entangled states, and two excitations for which many-body effects are present. Our approach, combined with the quantum circuit's versatility, paves the way to short-distance quantum communication for connecting distributed quantum processors or registers, even if hampered by inherent imperfections in actual quantum devices. △ Less

Submitted 1 February, 2024; originally announced February 2024.

Comments: 10 pages, 4 figures (main text); 14 pages, 20 figures (supplementary materials)

arXiv:2401.15704 [pdf, other]

Phoneme-Based Proactive Anti-Eavesdrop** with Controlled Recording Privilege

Authors: Peng Huang, Yao Wei, Peng Cheng, Zhongjie Ba, Li Lu, Feng Lin, Yang Wang, Kui Ren

Abstract: The widespread smart devices raise people's concerns of being eavesdropped on. To enhance voice privacy, recent studies exploit the nonlinearity in microphone to jam audio recorders with inaudible ultrasound. However, existing solutions solely rely on energetic masking. Their simple-form noise leads to several problems, such as high energy requirements and being easily removed by speech enhancemen… ▽ More The widespread smart devices raise people's concerns of being eavesdropped on. To enhance voice privacy, recent studies exploit the nonlinearity in microphone to jam audio recorders with inaudible ultrasound. However, existing solutions solely rely on energetic masking. Their simple-form noise leads to several problems, such as high energy requirements and being easily removed by speech enhancement techniques. Besides, most of these solutions do not support authorized recording, which restricts their usage scenarios. In this paper, we design an efficient yet robust system that can jam microphones while preserving authorized recording. Specifically, we propose a novel phoneme-based noise with the idea of informational masking, which can distract both machines and humans and is resistant to denoising techniques. Besides, we optimize the noise transmission strategy for broader coverage and implement a hardware prototype of our system. Experimental results show that our system can reduce the recognition accuracy of recordings to below 50\% under all tested speech recognition systems, which is much better than existing solutions. △ Less

Submitted 28 January, 2024; originally announced January 2024.

Comments: 14 pages, 28 figures; submitted to IEEE TDSC

arXiv:2401.14726 [pdf, other]

3D Reconstruction and New View Synthesis of Indoor Environments based on a Dual Neural Radiance Field

Authors: Zhenyu Bao, Guibiao Liao, Zhongyuan Zhao, Kanglin Liu, Qing Li, Guo** Qiu

Abstract: Simultaneously achieving 3D reconstruction and new view synthesis for indoor environments has widespread applications but is technically very challenging. State-of-the-art methods based on implicit neural functions can achieve excellent 3D reconstruction results, but their performances on new view synthesis can be unsatisfactory. The exciting development of neural radiance field (NeRF) has revolut… ▽ More Simultaneously achieving 3D reconstruction and new view synthesis for indoor environments has widespread applications but is technically very challenging. State-of-the-art methods based on implicit neural functions can achieve excellent 3D reconstruction results, but their performances on new view synthesis can be unsatisfactory. The exciting development of neural radiance field (NeRF) has revolutionized new view synthesis, however, NeRF-based models can fail to reconstruct clean geometric surfaces. We have developed a dual neural radiance field (Du-NeRF) to simultaneously achieve high-quality geometry reconstruction and view rendering. Du-NeRF contains two geometric fields, one derived from the SDF field to facilitate geometric reconstruction and the other derived from the density field to boost new view synthesis. One of the innovative features of Du-NeRF is that it decouples a view-independent component from the density field and uses it as a label to supervise the learning process of the SDF field. This reduces shape-radiance ambiguity and enables geometry and color to benefit from each other during the learning process. Extensive experiments demonstrate that Du-NeRF can significantly improve the performance of novel view synthesis and 3D reconstruction for indoor environments and it is particularly effective in constructing areas containing fine geometries that do not obey multi-view color consistency. △ Less

Submitted 26 January, 2024; originally announced January 2024.

Comments: 20 pages, 8 figures

arXiv:2401.12900 [pdf, other]

PSAvatar: A Point-based Shape Model for Real-Time Head Avatar Animation with 3D Gaussian Splatting

Authors: Zhongyuan Zhao, Zhenyu Bao, Qing Li, Guo** Qiu, Kanglin Liu

Abstract: Despite much progress, achieving real-time high-fidelity head avatar animation is still difficult and existing methods have to trade-off between speed and quality. 3DMM based methods often fail to model non-facial structures such as eyeglasses and hairstyles, while neural implicit models suffer from deformation inflexibility and rendering inefficiency. Although 3D Gaussian has been demonstrated to… ▽ More Despite much progress, achieving real-time high-fidelity head avatar animation is still difficult and existing methods have to trade-off between speed and quality. 3DMM based methods often fail to model non-facial structures such as eyeglasses and hairstyles, while neural implicit models suffer from deformation inflexibility and rendering inefficiency. Although 3D Gaussian has been demonstrated to possess promising capability for geometry representation and radiance field reconstruction, applying 3D Gaussian in head avatar creation remains a major challenge since it is difficult for 3D Gaussian to model the head shape variations caused by changing poses and expressions. In this paper, we introduce PSAvatar, a novel framework for animatable head avatar creation that utilizes discrete geometric primitive to create a parametric morphable shape model and employs 3D Gaussian for fine detail representation and high fidelity rendering. The parametric morphable shape model is a Point-based Morphable Shape Model (PMSM) which uses points instead of meshes for 3D representation to achieve enhanced representation flexibility. The PMSM first converts the FLAME mesh to points by sampling on the surfaces as well as off the meshes to enable the reconstruction of not only surface-like structures but also complex geometries such as eyeglasses and hairstyles. By aligning these points with the head shape in an analysis-by-synthesis manner, the PMSM makes it possible to utilize 3D Gaussian for fine detail representation and appearance modeling, thus enabling the creation of high-fidelity avatars. We show that PSAvatar can reconstruct high-fidelity head avatars of a variety of subjects and the avatars can be animated in real-time ($\ge$ 25 fps at a resolution of 512 $\times$ 512 ). △ Less

Submitted 23 June, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

Comments: 13 pages, 10 figures

arXiv:2401.11087 [pdf, other]

Network evolution controlling strain-induced damage and self-healing of elastomers with dynamic bonds

Authors: Yikai Yin, Shaswat Mohanty, Christopher B. Cooper, Zhenan Bao, Wei Cai

Abstract: Highly stretchable and self-healable supramolecular elastomers are promising materials for future soft electronics, biomimetic systems, and smart textiles, due to their dynamic cross-linking bonds. The dynamic or reversible nature of the cross-links gives rise to interesting macroscopic responses in these materials such as self-healing and rapid stress-relaxation. However, the relationship between… ▽ More Highly stretchable and self-healable supramolecular elastomers are promising materials for future soft electronics, biomimetic systems, and smart textiles, due to their dynamic cross-linking bonds. The dynamic or reversible nature of the cross-links gives rise to interesting macroscopic responses in these materials such as self-healing and rapid stress-relaxation. However, the relationship between bond activity and macroscopic mechanical response, and the self-healing properties of these dynamic polymer networks (DPNs) remains poorly understood. Using coarse-grained molecular dynamics (CGMD) simulations, we reveal a fundamental connection between the macroscopic behaviors of DPNs and the shortest paths between distant nodes in the polymer network. Notably, the trajectories of the material on the shortest path-strain map provide key insights into understanding the stress-strain hysteresis, anisotropy, stress relaxation, and self-healing of DPNs. Based on CGMD simulations under various loading histories, we formulate a set of empirical rules that dictate how the shortest path interacts with stress and strain. This lays the foundation for the development of a physics-based theory centered around the non-local microstructural feature of shortest paths to predict the mechanical behavior of DPNs. △ Less

Submitted 19 January, 2024; originally announced January 2024.

Comments: 17 pages, 7 figures

arXiv:2401.08284 [pdf, other]

Schrödinger cats growing up to 60 qubits and dancing in a cat scar enforced discrete time crystal

Authors: Zehang Bao, Shibo Xu, Zixuan Song, Ke Wang, Liang Xiang, Zitian Zhu, Jiachen Chen, Feitong **, Xuhao Zhu, Yu Gao, Yaozu Wu, Chuanyu Zhang, Ning Wang, Yiren Zou, Ziqi Tan, Aosai Zhang, Zhengyi Cui, Fanhao Shen, Jiarun Zhong, Tingting Li, **feng Deng, Xu Zhang, Hang Dong, Pengfei Zhang, Yang-Ren Liu , et al. (8 additional authors not shown)

Abstract: Greenberger-Horne-Zeilinger (GHZ) states, as maximally entangled Schrödinger cat states, play vital roles in the foundations of quantum physics and technology, but creating and preserving these fragile states pose tremendous challenges. Discrete time crystals (DTCs), originally aimed at exploring exotic nonequilibrium quantum matters, have raised significant scientific interest, but whether this b… ▽ More Greenberger-Horne-Zeilinger (GHZ) states, as maximally entangled Schrödinger cat states, play vital roles in the foundations of quantum physics and technology, but creating and preserving these fragile states pose tremendous challenges. Discrete time crystals (DTCs), originally aimed at exploring exotic nonequilibrium quantum matters, have raised significant scientific interest, but whether this brilliant concept can lead to true applications remains unclear. Here we propose an efficient protocol suitable for two-dimensional quantum processors, and by employing sequences of high-fidelity quantum gates, we achieve genuine GHZ entanglement at the intermediate scale with up to 60 superconducting qubits. More importantly, we take a new perspective on DTC by deterministically engineering pairwise cat eigenstates as quantum many-body scars, which shield the GHZ states from generic perturbations and enable dynamical switching during the state evolution. Our results not only nail down a direct application of DTC, but also establish engineerable many-body systems far from equilibrium as a versatile platform to protect and steer fragile but intriguing quantum entanglement. △ Less

Submitted 16 January, 2024; originally announced January 2024.

Comments: 7 pages, 4 figures + supplementary information

arXiv:2401.04333 [pdf, other]

Long-lived topological time-crystalline order on a quantum processor

Authors: Liang Xiang, Wenjie Jiang, Zehang Bao, Zixuan Song, Shibo Xu, Ke Wang, Jiachen Chen, Feitong **, Xuhao Zhu, Zitian Zhu, Fanhao Shen, Ning Wang, Chuanyu Zhang, Yaozu Wu, Yiren Zou, Jiarun Zhong, Zhengyi Cui, Aosai Zhang, Ziqi Tan, Tingting Li, Yu Gao, **feng Deng, Xu Zhang, Hang Dong, Pengfei Zhang , et al. (16 additional authors not shown)

Abstract: Topologically ordered phases of matter elude Landau's symmetry-breaking theory, featuring a variety of intriguing properties such as long-range entanglement and intrinsic robustness against local perturbations. Their extension to periodically driven systems gives rise to exotic new phenomena that are forbidden in thermal equilibrium. Here, we report the observation of signatures of such a phenomen… ▽ More Topologically ordered phases of matter elude Landau's symmetry-breaking theory, featuring a variety of intriguing properties such as long-range entanglement and intrinsic robustness against local perturbations. Their extension to periodically driven systems gives rise to exotic new phenomena that are forbidden in thermal equilibrium. Here, we report the observation of signatures of such a phenomenon -- a prethermal topologically ordered time crystal -- with programmable superconducting qubits arranged on a square lattice. By periodically driving the superconducting qubits with a surface-code Hamiltonian, we observe discrete time-translation symmetry breaking dynamics that is only manifested in the subharmonic temporal response of nonlocal logical operators. We further connect the observed dynamics to the underlying topological order by measuring a nonzero topological entanglement entropy and studying its subsequent dynamics. Our results demonstrate the potential to explore exotic topologically ordered nonequilibrium phases of matter with noisy intermediate-scale quantum processors. △ Less

Submitted 8 January, 2024; originally announced January 2024.

Comments: 8 pages (main text), 16 pages (supplementary information)

arXiv:2401.00659 [pdf, ps, other]

Advanced Dataset Discovery: When Multi-Query-Dataset Cardinality Estimation Matters

Authors: Tingting Wang, Shixun Huang, Zhifeng Bao, J. Shane Culpepper, Reza Arablouei, Volkan Dedeoglu

Abstract: As available data increases, so too does the demand to dataset discovery. Existing studies often yield coarse-grained results where significant information overlaps and non-relevant data occur. They also implicitly assume that a user can purchase all datasets found, which is rarely true in practice. Therefore, achieving dataset discovery results with less redundancy using fine-grained information… ▽ More As available data increases, so too does the demand to dataset discovery. Existing studies often yield coarse-grained results where significant information overlaps and non-relevant data occur. They also implicitly assume that a user can purchase all datasets found, which is rarely true in practice. Therefore, achieving dataset discovery results with less redundancy using fine-grained information needs and a budget is desirable. To achieve this, we study the problem of finding a set of datasets that maximize distinctiveness based on a user's fine-grained information needs and a base dataset while kee** the total price of the datasets within a budget. The user's fine-grained information needs are expressed as a query set and the distinctiveness for a set of datasets, which is the number of distinct tuples produced by the query set on the datasets which do not overlap with the base dataset. First, we prove the NP-hardness of this problem. Then, we develop a greedy algorithm that achieves an approximation of (1-e^{-1})/2. But this algorithm is neither efficient nor scalable as it frequently computes the exact distinctiveness during dataset selection, which requires every tuple for the query result overlap in multiple datasets to be tested. To this end, we propose an efficient and effective machine-learning-based (ML-based) algorithm to estimate the distinctiveness for a set of datasets, without the need for testing every tuple. The proposed algorithm is the first to support cardinality estimation (CE) for a query set on multiple datasets, as previous studies only support CE for a single query on a single dataset, and cannot effectively identify query result overlaps in multiple datasets. Extensive experiments using five real-world data pools demonstrate that our greedy algorithm using ML-based distinctiveness estimation outperforms all other baselines in both effectiveness and efficiency. △ Less

Submitted 31 December, 2023; originally announced January 2024.

arXiv:2312.16057 [pdf, other]

Semantic Importance-Aware Based for Multi-User Communication Over MIMO Fading Channels

Authors: Haotai Liang, Zhicheng Bao, Wannian An, Chen Dong, Xiaodong Xu

Abstract: Semantic communication, as a novel communication paradigm, has attracted the interest of many scholars, with multi-user, multi-input multi-output (MIMO) scenarios being one of the critical contexts. This paper presents a semantic importance-aware based communication system (SIA-SC) over MIMO Rayleigh fading channels. Combining the semantic symbols' inequality and the equivalent subchannels of MIMO… ▽ More Semantic communication, as a novel communication paradigm, has attracted the interest of many scholars, with multi-user, multi-input multi-output (MIMO) scenarios being one of the critical contexts. This paper presents a semantic importance-aware based communication system (SIA-SC) over MIMO Rayleigh fading channels. Combining the semantic symbols' inequality and the equivalent subchannels of MIMO channels based on Singular Value Decomposition (SVD) maximizes the end-to-end semantic performance through the new layer map** method. For multi-user scenarios, a method of semantic interference cancellation is proposed. Furthermore, a new metric, namely semantic information distortion (SID), is established to unify the expressions of semantic performance, which is affected by channel bandwidth ratio (CBR) and signal-to-noise ratio (SNR). With the help of the proposed metric, we derived performance expressions and Semantic Outage Probability (SOP) of SIA-SC for Single-User Single-Input Single-Output (SU-SISO), Single-User MIMO (SU-MIMO), Multi-Users SISO (MU-MIMO) and Multi-Users MIMO (MU-MIMO) scenarios. Numerical experiments show that SIA-SC can significantly improve semantic performance across various scenarios. △ Less

Submitted 26 December, 2023; originally announced December 2023.

arXiv:2312.14901 [pdf, other]

Ancilla-Assisted Process Tomography with Bipartiete Mixed Separable States

Authors: Zhuoran Bao, Daniel F. V. James

Abstract: It has been shown that the entanglement between the system state and the ancillary state is not a strict requirement for performing ancilla-assisted process tomography(AAPT). Instead, it only requires that the system-ancilla state be faithful, which, in practice, is the invertibility of a certain matrix representing the state. Our paper takes on the operational definition of faithfulness, i.e., a… ▽ More It has been shown that the entanglement between the system state and the ancillary state is not a strict requirement for performing ancilla-assisted process tomography(AAPT). Instead, it only requires that the system-ancilla state be faithful, which, in practice, is the invertibility of a certain matrix representing the state. Our paper takes on the operational definition of faithfulness, i.e., a state is faithful if one can extract complete information about the quantum process, and we restrict the process to single-qubit operations on a two-qubit system-ancilla state. We present a theoretical analysis to connect the invertibility problem to the concept of Sinisterness, which quantifies the correlation of two qubits. Using Sinisterness, we derive a way of constructing two-qubit states that are guaranteed to be faithful in an operational sense and estimate the bound on the average error of the process. Our analysis agrees that the maximally entangled states provided the smallest error amplification. Nevertheless, it maps out a numerical region where the advantage of the entanglement starts. △ Less

Submitted 20 May, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

Comments: 6 pages

arXiv:2312.09708 [pdf, other]

GraphRARE: Reinforcement Learning Enhanced Graph Neural Network with Relative Entropy

Authors: Tianhao Peng, Wenjun Wu, Haitao Yuan, Zhifeng Bao, Zhao Pengrui, Xin Yu, Xuetao Lin, Yu Liang, Yanjun Pu

Abstract: Graph neural networks (GNNs) have shown advantages in graph-based analysis tasks. However, most existing methods have the homogeneity assumption and show poor performance on heterophilic graphs, where the linked nodes have dissimilar features and different class labels, and the semantically related nodes might be multi-hop away. To address this limitation, this paper presents GraphRARE, a general… ▽ More Graph neural networks (GNNs) have shown advantages in graph-based analysis tasks. However, most existing methods have the homogeneity assumption and show poor performance on heterophilic graphs, where the linked nodes have dissimilar features and different class labels, and the semantically related nodes might be multi-hop away. To address this limitation, this paper presents GraphRARE, a general framework built upon node relative entropy and deep reinforcement learning, to strengthen the expressive capability of GNNs. An innovative node relative entropy, which considers node features and structural similarity, is used to measure mutual information between node pairs. In addition, to avoid the sub-optimal solutions caused by mixing useful information and noises of remote nodes, a deep reinforcement learning-based algorithm is developed to optimize the graph topology. This algorithm selects informative nodes and discards noisy nodes based on the defined node relative entropy. Extensive experiments are conducted on seven real-world datasets. The experimental results demonstrate the superiority of GraphRARE in node classification and its capability to optimize the original graph topology. △ Less

Submitted 13 April, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: 14 pages, 7 figures

arXiv:2312.06712 [pdf, other]

Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models

Authors: Zhipeng Bao, Yijun Li, Krishna Kumar Singh, Yu-Xiong Wang, Martial Hebert

Abstract: Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for the multi-object generation. This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps. Whi… ▽ More Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for the multi-object generation. This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps. While previous research efforts have individually tackled these issues, we assert that a holistic approach is paramount. Thus, we propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively. Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability. Comprehensive evaluations demonstrate the superior performance of our model in terms of image realism, text-image alignment, and adaptability, notably outperforming prominent baselines. Ultimately, this research paves the way for T2I diffusion models with enhanced compositional capacities and broader applicability. △ Less

Submitted 31 January, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

arXiv:2312.05911 [pdf, ps, other]

A leave-one-out approach to approximate message passing

Authors: Zhigang Bao, Qiyang Han, Xiaocong Xu

Abstract: Approximate message passing (AMP) has emerged both as a popular class of iterative algorithms and as a powerful analytic tool in a wide range of statistical estimation problems and statistical physics models. A well established line of AMP theory proves Gaussian approximations for the empirical distributions of the AMP iterate in the high dimensional limit, under the GOE random matrix model and it… ▽ More Approximate message passing (AMP) has emerged both as a popular class of iterative algorithms and as a powerful analytic tool in a wide range of statistical estimation problems and statistical physics models. A well established line of AMP theory proves Gaussian approximations for the empirical distributions of the AMP iterate in the high dimensional limit, under the GOE random matrix model and its variants. This paper provides a non-asymptotic, leave-one-out representation for the AMP iterate that holds under a broad class of Gaussian random matrix models with general variance profiles. In contrast to the typical AMP theory that describes the empirical distributions of the AMP iterate via a low dimensional state evolution, our leave-one-out representation yields an intrinsically high dimensional state evolution formula which provides non-asymptotic characterizations for the possibly heterogeneous, entrywise behavior of the AMP iterate under the prescribed random matrix models. To exemplify some distinct features of our AMP theory in applications, we analyze, in the context of regularized linear estimation, the precise stochastic behavior of the Ridge estimator for independent and non-identically distributed observations whose covariates exhibit general variance profiles. We find that its finite-sample distribution is characterized via a weighted Ridge estimator in a heterogeneous Gaussian sequence model. Notably, in contrast to the i.i.d. sampling scenario, the effective noise and regularization are now full dimensional vectors determined via a high dimensional system of equations. Our leave-one-out method of proof differs significantly from the widely adopted conditioning approach for rotational invariant ensembles, and relies instead on an inductive method that utilizes almost solely integration-by-parts and concentration techniques. △ Less

Submitted 25 December, 2023; v1 submitted 10 December, 2023; originally announced December 2023.

arXiv:2311.14321 [pdf, other]

Hypercontractivity for Quantum Erasure Channels via Multipartite Log-Sobolev Inequality

Authors: Zongbo Bao, Yang**g Dong, Fengning Ou, Penghui Yao

Abstract: We prove an almost optimal hypercontractive inequality for quantum erasure channels, generalizing the hypercontractivity for classical binary erasure channels [NW16]. To our knowledge, this is the first hypercontractivity bound for non-unital quantum channels. The traditional inductive arguments for classical hypercontractivity cannot be generalized to the quantum setting due to the nature of non-… ▽ More We prove an almost optimal hypercontractive inequality for quantum erasure channels, generalizing the hypercontractivity for classical binary erasure channels [NW16]. To our knowledge, this is the first hypercontractivity bound for non-unital quantum channels. The traditional inductive arguments for classical hypercontractivity cannot be generalized to the quantum setting due to the nature of non-commutativity of matrices. To overcome the difficulty, we establish a multipartite quantum log-Sobolev inequality, which includes the classical log-Sobolev inequality [DSC96] and the quantum log-Sobolev inequality [KT13] as one-partite cases. We establish a connection between our multipartite quantum log-Sobolev inequality and the hypercontractivity bound for quantum erasure channels via a refined quantum Gross' lemma [Gro75a], extending the analogous connection [Kin14] between the quantum log-Sobolev inequality and the hypercontractivity for qubit unital channels. As an application, we prove an almost tight bound (up to a constant factor) on the classical communication complexity of two-party common randomness generation assisted with erased-noisy EPR states, generalizing the tight bound on the same task assisted with erased-noisy random strings due to Guruswami and Radhakrishnan [GR16]. △ Less

Submitted 24 November, 2023; originally announced November 2023.

Comments: 39 pages, 2 figures

arXiv:2311.10492 [pdf, other]

A Relay System for Semantic Image Transmission based on Shared Feature Extraction and Hyperprior Entropy Compression

Authors: Wannian An, Zhicheng Bao, Haotai Liang, Chen Dong, Xiaodong

Abstract: Nowadays, the need for high-quality image reconstruction and restoration is more and more urgent. However, most image transmission systems may suffer from image quality degradation or transmission interruption in the face of interference such as channel noise and link fading. To solve this problem, a relay communication network for semantic image transmission based on shared feature extraction and… ▽ More Nowadays, the need for high-quality image reconstruction and restoration is more and more urgent. However, most image transmission systems may suffer from image quality degradation or transmission interruption in the face of interference such as channel noise and link fading. To solve this problem, a relay communication network for semantic image transmission based on shared feature extraction and hyperprior entropy compression (HEC) is proposed, where the shared feature extraction technology based on Pearson correlation is proposed to eliminate partial shared feature of extracted semantic latent feature. In addition, the HEC technology is used to resist the effect of channel noise and link fading and carried out respectively at the source node and the relay node. Experimental results demonstrate that compared with other recent research methods, the proposed system has lower transmission overhead and higher semantic image transmission performance. Particularly, under the same conditions, the multi-scale structural similarity (MS-SSIM) of this system is superior to the comparison method by approximately 0.2. △ Less

Submitted 17 November, 2023; originally announced November 2023.

arXiv:2310.15696 [pdf]

doi 10.1002/adfm.202311895

Strain-driven switching between antiferromagnetic states in frustrated antiferromagnet UO2 probed by exchange bias effect

Authors: E. A. Tereshina-Chitrova, L. V. Pourovskii, S. Khmelevskyi, L. Horak, Z. Bao, A. Mackova, P. Malinsky, T. Gouder, R. Caciuffo

Abstract: Frustrated antiferromagnets offer a captivating platform to study the intricate relationship of magnetic interactions, geometric constraints, and emergent phenomena. By controlling spin orientations, these materials can be tailored for applications in spintronics and quantum information processing. The research focuses on the interplay of magnetic and exchange anisotropy effects in artificial hete… ▽ More Frustrated antiferromagnets offer a captivating platform to study the intricate relationship of magnetic interactions, geometric constraints, and emergent phenomena. By controlling spin orientations, these materials can be tailored for applications in spintronics and quantum information processing. The research focuses on the interplay of magnetic and exchange anisotropy effects in artificial heterostructures based on a canonical frustrated antiferromagnet, UO2. The potential to manipulate the spin directions in this material and switch between distinct antiferromagnetic states is investigated using substrate-induced strain. The phenomenon is probed using exchange bias (EB) effects in stoichiometric UO2/Fe3O4 bilayers. By employing many-body first-principles calculations magnetic configurations in the UO2 layers are identified. Even a minor tetragonal distortion triggers a transition between antiferromagnetic states of different symmetries, driven by a robust alteration of single-ion anisotropy due to the distortion. Consequently, this change influences the arrangement of magnetic moments at the UO2/Fe3O4 interface, affecting the magnitude of exchange bias. The findings showcase how epitaxial strain can effectively manipulate the antiferromagnetic states in frustrated antiferromagnets by controlling single-site anisotropy. △ Less

Submitted 24 October, 2023; originally announced October 2023.

arXiv:2310.13424 [pdf, other]

FLTracer: Accurate Poisoning Attack Provenance in Federated Learning

Authors: Xinyu Zhang, Qingyu Liu, Zhongjie Ba, Yuan Hong, Tianhang Zheng, Feng Lin, Li Lu, Kui Ren

Abstract: Federated Learning (FL) is a promising distributed learning approach that enables multiple clients to collaboratively train a shared global model. However, recent studies show that FL is vulnerable to various poisoning attacks, which can degrade the performance of global models or introduce backdoors into them. In this paper, we first conduct a comprehensive study on prior FL attacks and detection… ▽ More Federated Learning (FL) is a promising distributed learning approach that enables multiple clients to collaboratively train a shared global model. However, recent studies show that FL is vulnerable to various poisoning attacks, which can degrade the performance of global models or introduce backdoors into them. In this paper, we first conduct a comprehensive study on prior FL attacks and detection methods. The results show that all existing detection methods are only effective against limited and specific attacks. Most detection methods suffer from high false positives, which lead to significant performance degradation, especially in not independent and identically distributed (non-IID) settings. To address these issues, we propose FLTracer, the first FL attack provenance framework to accurately detect various attacks and trace the attack time, objective, type, and poisoned location of updates. Different from existing methodologies that rely solely on cross-client anomaly detection, we propose a Kalman filter-based cross-round detection to identify adversaries by seeking the behavior changes before and after the attack. Thus, this makes it resilient to data heterogeneity and is effective even in non-IID settings. To further improve the accuracy of our detection method, we employ four novel features and capture their anomalies with the joint decisions. Extensive evaluations show that FLTracer achieves an average true positive rate of over $96.88\%$ at an average false positive rate of less than $2.67\%$, significantly outperforming SOTA detection methods. \footnote{Code is available at \url{https://github.com/Eyr3/FLTracer}.} △ Less

Submitted 20 October, 2023; originally announced October 2023.

Comments: 18 pages, 27 figures

arXiv:2309.17450 [pdf, other]

Multi-task View Synthesis with Neural Radiance Fields

Authors: Shuhong Zheng, Zhipeng Bao, Martial Hebert, Yu-Xiong Wang

Abstract: Multi-task visual learning is a critical aspect of computer vision. Current research, however, predominantly concentrates on the multi-task dense prediction setting, which overlooks the intrinsic 3D world and its multi-view consistent structures, and lacks the capability for versatile imagination. In response to these limitations, we present a novel problem setting -- multi-task view synthesis (MT… ▽ More Multi-task visual learning is a critical aspect of computer vision. Current research, however, predominantly concentrates on the multi-task dense prediction setting, which overlooks the intrinsic 3D world and its multi-view consistent structures, and lacks the capability for versatile imagination. In response to these limitations, we present a novel problem setting -- multi-task view synthesis (MTVS), which reinterprets multi-task prediction as a set of novel-view synthesis tasks for multiple scene properties, including RGB. To tackle the MTVS problem, we propose MuvieNeRF, a framework that incorporates both multi-task and cross-view knowledge to simultaneously synthesize multiple scene properties. MuvieNeRF integrates two key modules, the Cross-Task Attention (CTA) and Cross-View Attention (CVA) modules, enabling the efficient use of information across multiple views and tasks. Extensive evaluation on both synthetic and realistic benchmarks demonstrates that MuvieNeRF is capable of simultaneously synthesizing different scene properties with promising visual quality, even outperforming conventional discriminative models in various settings. Notably, we show that MuvieNeRF exhibits universal applicability across a range of NeRF backbones. Our code is available at https://github.com/zsh2000/MuvieNeRF. △ Less

Submitted 29 September, 2023; originally announced September 2023.

Comments: ICCV 2023, Website: https://zsh2000.github.io/mtvs.github.io/

arXiv:2309.14122 [pdf, other]

SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution

Authors: Zhongjie Ba, Jieming Zhong, Jiachen Lei, Peng Cheng, Qinglong Wang, Zhan Qin, Zhibo Wang, Kui Ren

Abstract: Advanced text-to-image models such as DALL-E 2 and Midjourney possess the capacity to generate highly realistic images, raising significant concerns regarding the potential proliferation of unsafe content. This includes adult, violent, or deceptive imagery of political figures. Despite claims of rigorous safety mechanisms implemented in these models to restrict the generation of not-safe-for-work… ▽ More Advanced text-to-image models such as DALL-E 2 and Midjourney possess the capacity to generate highly realistic images, raising significant concerns regarding the potential proliferation of unsafe content. This includes adult, violent, or deceptive imagery of political figures. Despite claims of rigorous safety mechanisms implemented in these models to restrict the generation of not-safe-for-work (NSFW) content, we successfully devise and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant photorealistic NSFW images. We reveal the fundamental principles of such prompt attacks and suggest strategically substituting high-risk sections within a suspect prompt to evade closed-source safety measures. Our novel framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules to automate attack prompt creation at scale. Evaluation results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts, leading to the generation of counterfeit images depicting political figures in violent scenarios. Both subjective and objective assessments validate that the images generated from our attack prompts present considerable safety hazards. △ Less

Submitted 25 September, 2023; originally announced September 2023.

Comments: 14 pages, 11 figures

arXiv:2309.11131 [pdf, other]

Locate and Verify: A Two-Stream Network for Improved Deepfake Detection

Authors: Chao Shuai, Jieming Zhong, Shuang Wu, Feng Lin, Zhibo Wang, Zhongjie Ba, Zhenguang Liu, Lorenzo Cavallaro, Kui Ren

Abstract: Deepfake has taken the world by storm, triggering a trust crisis. Current deepfake detection methods are typically inadequate in generalizability, with a tendency to overfit to image contents such as the background, which are frequently occurring but relatively unimportant in the training dataset. Furthermore, current methods heavily rely on a few dominant forgery regions and may ignore other equa… ▽ More Deepfake has taken the world by storm, triggering a trust crisis. Current deepfake detection methods are typically inadequate in generalizability, with a tendency to overfit to image contents such as the background, which are frequently occurring but relatively unimportant in the training dataset. Furthermore, current methods heavily rely on a few dominant forgery regions and may ignore other equally important regions, leading to inadequate uncovering of forgery cues. In this paper, we strive to address these shortcomings from three aspects: (1) We propose an innovative two-stream network that effectively enlarges the potential regions from which the model extracts forgery evidence. (2) We devise three functional modules to handle the multi-stream and multi-scale features in a collaborative learning scheme. (3) Confronted with the challenge of obtaining forgery annotations, we propose a Semi-supervised Patch Similarity Learning strategy to estimate patch-level forged location annotations. Empirically, our method demonstrates significantly improved robustness and generalizability, outperforming previous methods on six benchmarks, and improving the frame-level AUC on Deepfake Detection Challenge preview dataset from 0.797 to 0.835 and video-level AUC on CelebDF$\_$v1 dataset from 0.811 to 0.847. Our implementation is available at https://github.com/sccsok/Locate-and-Verify. △ Less

Submitted 20 September, 2023; originally announced September 2023.

Comments: 10 pages, 8 figures, 60 references. This paper has been accepted for ACM MM 2023

arXiv:2309.10979 [pdf, other]

Towards Data-centric Graph Machine Learning: Review and Outlook

Authors: Xin Zheng, Yixin Liu, Zhifeng Bao, Meng Fang, Xia Hu, Alan Wee-Chung Liew, Shirui Pan

Abstract: Data-centric AI, with its primary focus on the collection, management, and utilization of data to drive AI models and applications, has attracted increasing attention in recent years. In this article, we conduct an in-depth and comprehensive review, offering a forward-looking outlook on the current efforts in data-centric AI pertaining to graph data-the fundamental data structure for representing… ▽ More Data-centric AI, with its primary focus on the collection, management, and utilization of data to drive AI models and applications, has attracted increasing attention in recent years. In this article, we conduct an in-depth and comprehensive review, offering a forward-looking outlook on the current efforts in data-centric AI pertaining to graph data-the fundamental data structure for representing and capturing intricate dependencies among massive and diverse real-life entities. We introduce a systematic framework, Data-centric Graph Machine Learning (DC-GML), that encompasses all stages of the graph data lifecycle, including graph data collection, exploration, improvement, exploitation, and maintenance. A thorough taxonomy of each stage is presented to answer three critical graph-centric questions: (1) how to enhance graph data availability and quality; (2) how to learn from graph data with limited-availability and low-quality; (3) how to build graph MLOps systems from the graph data-centric view. Lastly, we pinpoint the future prospects of the DC-GML domain, providing insights to navigate its advancements and applications. △ Less

Submitted 19 September, 2023; originally announced September 2023.

Comments: 42 pages, 9 figures

arXiv:2309.09526 [pdf, other]

DFIL: Deepfake Incremental Learning by Exploiting Domain-invariant Forgery Clues

Authors: Kun Pan, Yin Yifang, Yao Wei, Feng Lin, Zhongjie Ba, Zhenguang Liu, ZhiBo Wang, Lorenzo Cavallaro, Kui Ren

Abstract: The malicious use and widespread dissemination of deepfake pose a significant crisis of trust. Current deepfake detection models can generally recognize forgery images by training on a large dataset. However, the accuracy of detection models degrades significantly on images generated by new deepfake methods due to the difference in data distribution. To tackle this issue, we present a novel increm… ▽ More The malicious use and widespread dissemination of deepfake pose a significant crisis of trust. Current deepfake detection models can generally recognize forgery images by training on a large dataset. However, the accuracy of detection models degrades significantly on images generated by new deepfake methods due to the difference in data distribution. To tackle this issue, we present a novel incremental learning framework that improves the generalization of deepfake detection models by continual learning from a small number of new samples. To cope with different data distributions, we propose to learn a domain-invariant representation based on supervised contrastive learning, preventing overfit to the insufficient new data. To mitigate catastrophic forgetting, we regularize our model in both feature-level and label-level based on a multi-perspective knowledge distillation approach. Finally, we propose to select both central and hard representative samples to update the replay set, which is beneficial for both domain-invariant representation learning and rehearsal-based knowledge preserving. We conduct extensive experiments on four benchmark datasets, obtaining the new state-of-the-art average forgetting rate of 7.01 and average accuracy of 85.49 on FF++, DFDC-P, DFD, and CDF2. Our code is released at https://github.com/DeepFakeIL/DFIL. △ Less

Submitted 18 September, 2023; originally announced September 2023.

Comments: Accepted by ACMMM2023

arXiv:2308.14346 [pdf, other]

DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation

Authors: Zhijie Bao, Wei Chen, Shengze Xiao, Kuang Ren, Jiaao Wu, Cheng Zhong, Jiajie Peng, Xuan**g Huang, Zhongyu Wei

Abstract: We propose DISC-MedLLM, a comprehensive solution that leverages Large Language Models (LLMs) to provide accurate and truthful medical response in end-to-end conversational healthcare services. To construct high-quality Supervised Fine-Tuning (SFT) datasets, we employ three strategies: utilizing medical knowledge-graphs, reconstructing real-world dialogues, and incorporating human-guided preference… ▽ More We propose DISC-MedLLM, a comprehensive solution that leverages Large Language Models (LLMs) to provide accurate and truthful medical response in end-to-end conversational healthcare services. To construct high-quality Supervised Fine-Tuning (SFT) datasets, we employ three strategies: utilizing medical knowledge-graphs, reconstructing real-world dialogues, and incorporating human-guided preference rephrasing. These datasets are instrumental in training DISC-MedLLM, surpassing existing medical LLMs in both single-turn and multi-turn consultation scenarios. Extensive experimental results demonstrate the effectiveness of the proposed model in bridging the gap between general language models and real-world medical consultation. Additionally, we release the constructed dataset and model weights to further contribute to research and development. Further details and resources can be found at https://github.com/FudanDISC/DISC-MedLLM △ Less

Submitted 28 August, 2023; originally announced August 2023.

Comments: Work in progress

arXiv:2308.09581 [pdf, ps, other]

Phase transition for the smallest eigenvalue of covariance matrices

Authors: Zhigang Bao, Jaehun Lee, Xiaocong Xu

Abstract: In this paper, we study the smallest non-zero eigenvalue of the sample covariance matrices $\mathcal{S}(Y)=YY^*$, where $Y=(y_{ij})$ is an $M\times N$ matrix with iid mean $0$ variance $N^{-1}$ entries. We prove a phase transition for its distribution, induced by the fatness of the tail of $y_{ij}$'s. More specifically, we assume that $y_{ij}$ is symmetrically distributed with tail probability… ▽ More In this paper, we study the smallest non-zero eigenvalue of the sample covariance matrices $\mathcal{S}(Y)=YY^*$, where $Y=(y_{ij})$ is an $M\times N$ matrix with iid mean $0$ variance $N^{-1}$ entries. We prove a phase transition for its distribution, induced by the fatness of the tail of $y_{ij}$'s. More specifically, we assume that $y_{ij}$ is symmetrically distributed with tail probability $\mathbb{P}(|\sqrt{N}y_{ij}|\geq x)\sim x^{-α}$ when $x\to \infty$, for some $α\in (2,4)$. We show the following conclusions: (i). When $α>\frac83$, the smallest eigenvalue follows the Tracy-Widom law on scale $N^{-\frac23}$; (ii). When $2<α<\frac83$, the smallest eigenvalue follows the Gaussian law on scale $N^{-\fracα{4}}$; (iii). When $α=\frac83$, the distribution is given by an interpolation between Tracy-Widom and Gaussian; (iv). In case $α\leq \frac{10}{3}$, in addition to the left edge of the MP law, a deterministic shift of order $N^{1-\fracα{2}}$ shall be subtracted from the smallest eigenvalue, in both the Tracy-Widom law and the Gaussian law. Overall speaking, our proof strategy is inspired by \cite{ALY} which is originally done for the bulk regime of the Lévy Wigner matrices. In addition to various technical complications arising from the bulk-to-edge extension, two ingredients are needed for our derivation: an intermediate left edge local law based on a simple but effective matrix minor argument, and a mesoscopic CLT for the linear spectral statistic with asymptotic expansion for its expectation. △ Less

Submitted 8 November, 2023; v1 submitted 18 August, 2023; originally announced August 2023.

Comments: Typos in equations (1.13) and (2.3) have been corrected

arXiv:2308.05338 [pdf, other]

MDVSC -- Wireless Model Division Video Semantic Communication

Authors: Zhicheng Bao, Haotai Liang, Chen Dong, Cong Li, Xiaodong Xu, ** Zhang

Abstract: This paper introduces a novel method for transmitting video data over noisy wireless channels with high efficiency and controllability. The method derivates from model division multiple access (MDMA) to extract common semantic features from video frames. It also uses deep joint source-channel coding (JSCC) as the main framework to establish communication links and deal with channel noise. An entro… ▽ More This paper introduces a novel method for transmitting video data over noisy wireless channels with high efficiency and controllability. The method derivates from model division multiple access (MDMA) to extract common semantic features from video frames. It also uses deep joint source-channel coding (JSCC) as the main framework to establish communication links and deal with channel noise. An entropy-based variable length coding scheme is developed to adjust the data amount accurately and explicitly. We name our method as model division video semantic communication (MDVSC). The main steps of our approach are as follows: first, video frames are transformed into a latent space to reduce computational complexity and redistribute data. Then, common features and individual features are extracted, and variable length coding is applied to further eliminate redundant semantic information under the communication bandwidth constraint. We evaluate our method on standard video test sequences and compare it with traditional wireless video coding methods. The results show that MDVSC generally surpasses the conventional methods in terms of quality metrics and has the capability to control code length precisely. Moreover, additional experiments and ablation studies are conducted to demonstrate its potential for various tasks. △ Less

Submitted 10 August, 2023; originally announced August 2023.

Comments: arXiv admin note: text overlap with arXiv:2305.15799

arXiv:2307.16630 [pdf, other]

doi 10.1109/SP54263.2024.00053

Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks

Authors: Xinyu Zhang, Hanbin Hong, Yuan Hong, Peng Huang, Binghui Wang, Zhongjie Ba, Kui Ren

Abstract: The language models, especially the basic text classification models, have been shown to be susceptible to textual adversarial attacks such as synonym substitution and word insertion attacks. To defend against such attacks, a growing body of research has been devoted to improving the model robustness. However, providing provable robustness guarantees instead of empirical robustness is still widely… ▽ More The language models, especially the basic text classification models, have been shown to be susceptible to textual adversarial attacks such as synonym substitution and word insertion attacks. To defend against such attacks, a growing body of research has been devoted to improving the model robustness. However, providing provable robustness guarantees instead of empirical robustness is still widely unexplored. In this paper, we propose Text-CRS, a generalized certified robustness framework for natural language processing (NLP) based on randomized smoothing. To our best knowledge, existing certified schemes for NLP can only certify the robustness against $\ell_0$ perturbations in synonym substitution attacks. Representing each word-level adversarial operation (i.e., synonym substitution, word reordering, insertion, and deletion) as a combination of permutation and embedding transformation, we propose novel smoothing theorems to derive robustness bounds in both permutation and embedding space against such adversarial operations. To further improve certified accuracy and radius, we consider the numerical relationships between discrete words and select proper noise distributions for the randomized smoothing. Finally, we conduct substantial experiments on multiple language models and datasets. Text-CRS can address all four different word-level adversarial operations and achieve a significant accuracy improvement. We also provide the first benchmark on certified accuracy and radius of four word-level operations, besides outperforming the state-of-the-art certification against synonym substitution attacks. △ Less

Submitted 11 June, 2024; v1 submitted 31 July, 2023; originally announced July 2023.

Comments: Published in the 2024 IEEE Symposium on Security and Privacy (SP)

arXiv:2307.10129 [pdf, other]

General vs. Long-Tailed Age Estimation: An Approach to Kill Two Birds with One Stone

Authors: Zenghao Bao, Zichang Tan, Jun Li, Jun Wan, Xibo Ma, Zhen Lei

Abstract: Facial age estimation has received a lot of attention for its diverse application scenarios. Most existing studies treat each sample equally and aim to reduce the average estimation error for the entire dataset, which can be summarized as General Age Estimation. However, due to the long-tailed distribution prevalent in the dataset, treating all samples equally will inevitably bias the model toward… ▽ More Facial age estimation has received a lot of attention for its diverse application scenarios. Most existing studies treat each sample equally and aim to reduce the average estimation error for the entire dataset, which can be summarized as General Age Estimation. However, due to the long-tailed distribution prevalent in the dataset, treating all samples equally will inevitably bias the model toward the head classes (usually the adult with a majority of samples). Driven by this, some works suggest that each class should be treated equally to improve performance in tail classes (with a minority of samples), which can be summarized as Long-tailed Age Estimation. However, Long-tailed Age Estimation usually faces a performance trade-off, i.e., achieving improvement in tail classes by sacrificing the head classes. In this paper, our goal is to design a unified framework to perform well on both tasks, killing two birds with one stone. To this end, we propose a simple, effective, and flexible training paradigm named GLAE, which is two-fold. Our GLAE provides a surprising improvement on Morph II, reaching the lowest MAE and CMAE of 1.14 and 1.27 years, respectively. Compared to the previous best method, MAE dropped by up to 34%, which is an unprecedented improvement, and for the first time, MAE is close to 1 year old. Extensive experiments on other age benchmark datasets, including CACD, MIVIA, and Chalearn LAP 2015, also indicate that GLAE outperforms the state-of-the-art approaches significantly. △ Less

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2307.07909 [pdf, other]

Is Imitation All You Need? Generalized Decision-Making with Dual-Phase Training

Authors: Yao Wei, Yanchao Sun, Ruijie Zheng, Sai Vemprala, Rogerio Bonatti, Shuhang Chen, Ratnesh Madaan, Zhongjie Ba, Ashish Kapoor, Shuang Ma

Abstract: We introduce DualMind, a generalist agent designed to tackle various decision-making tasks that addresses challenges posed by current methods, such as overfitting behaviors and dependence on task-specific fine-tuning. DualMind uses a novel "Dual-phase" training strategy that emulates how humans learn to act in the world. The model first learns fundamental common knowledge through a self-supervised… ▽ More We introduce DualMind, a generalist agent designed to tackle various decision-making tasks that addresses challenges posed by current methods, such as overfitting behaviors and dependence on task-specific fine-tuning. DualMind uses a novel "Dual-phase" training strategy that emulates how humans learn to act in the world. The model first learns fundamental common knowledge through a self-supervised objective tailored for control tasks and then learns how to make decisions based on different contexts through imitating behaviors conditioned on given prompts. DualMind can handle tasks across domains, scenes, and embodiments using just a single set of model weights and can execute zero-shot prompting without requiring task-specific fine-tuning. We evaluate DualMind on MetaWorld and Habitat through extensive experiments and demonstrate its superior generalizability compared to previous techniques, outperforming other generalist agents by over 50$\%$ and 70$\%$ on Habitat and MetaWorld, respectively. On the 45 tasks in MetaWorld, DualMind achieves over 30 tasks at a 90$\%$ success rate. △ Less

Submitted 9 October, 2023; v1 submitted 15 July, 2023; originally announced July 2023.

arXiv:2307.06027 [pdf, other]

Semantic Communications System with Model Division Multiple Access and Controllable Coding Rate for Point Cloud

Authors: Xiaoyi Liu, Haotai Liang, Zhicheng Bao, Chen Dong, Xiaodong Xu

Abstract: Point cloud, as a 3D representation, is widely used in autonomous driving, virtual reality (VR), and augmented reality (AR). However, traditional communication systems think that the point cloud's semantic information is irrelevant to communication, which hinders the efficient transmission of point clouds in the era of artificial intelligence (AI). This paper proposes a point cloud based semantic… ▽ More Point cloud, as a 3D representation, is widely used in autonomous driving, virtual reality (VR), and augmented reality (AR). However, traditional communication systems think that the point cloud's semantic information is irrelevant to communication, which hinders the efficient transmission of point clouds in the era of artificial intelligence (AI). This paper proposes a point cloud based semantic communication system (PCSC), which uses AI-based encoding techniques to extract the semantic information of the point cloud and joint source-channel coding (JSCC) technology to overcome the distortion caused by noise channels and solve the "cliff effect" in traditional communication. In addition, the system realizes the controllable coding rate without fine-tuning the network. The method analyzes the coded semantic vector's importance and discards semantically-unimportant information, thereby improving the transmission efficiency. Besides, PCSC and the recently proposed non-orthogonal model division multiple access (MDMA) technology are combined to design a point cloud MDMA transmission system (M-PCSC) for multi-user transmission. Relevant experimental results show that the proposed method outperforms the traditional method 10dB in the same channel bandwidth ratio under the PSNR D1 and PSNR D2 metrics. In terms of transmission, the proposed method can effectively solve the "cliff effect" in the traditional methods. △ Less

Submitted 12 July, 2023; originally announced July 2023.

arXiv:2306.11363 [pdf, other]

Masked Diffusion Models Are Fast Distribution Learners

Authors: Jiachen Lei, Qinglong Wang, Peng Cheng, Zhongjie Ba, Zhan Qin, Zhibo Wang, Zhenguang Liu, Kui Ren

Abstract: Diffusion model has emerged as the \emph{de-facto} model for image generation, yet the heavy training overhead hinders its broader adoption in the research community. We observe that diffusion models are commonly trained to learn all fine-grained visual information from scratch. This paradigm may cause unnecessary training costs hence requiring in-depth investigation. In this work, we show that it… ▽ More Diffusion model has emerged as the \emph{de-facto} model for image generation, yet the heavy training overhead hinders its broader adoption in the research community. We observe that diffusion models are commonly trained to learn all fine-grained visual information from scratch. This paradigm may cause unnecessary training costs hence requiring in-depth investigation. In this work, we show that it suffices to train a strong diffusion model by first pre-training the model to learn some primer distribution that loosely characterizes the unknown real image distribution. Then the pre-trained model can be fine-tuned for various generation tasks efficiently. In the pre-training stage, we propose to mask a high proportion (e.g., up to 90\%) of input images to approximately represent the primer distribution and introduce a masked denoising score matching objective to train a model to denoise visible areas. In subsequent fine-tuning stage, we efficiently train diffusion model without masking. Utilizing the two-stage training framework, we achieves significant training acceleration and a new FID score record of 6.27 on CelebA-HQ $256 \times 256$ for ViT-based diffusion models. The generalizability of a pre-trained model further helps building models that perform better than ones trained from scratch on different downstream datasets. For instance, a diffusion model pre-trained on VGGFace2 attains a 46\% quality improvement when fine-tuned on a different dataset that contains only 3000 images. Our code is available at \url{https://github.com/jiachenlei/maskdm}. △ Less

Submitted 27 November, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

arXiv:2306.05239 [pdf, other]

Point-Voxel Absorbing Graph Representation Learning for Event Stream based Recognition

Authors: Bo Jiang, Chengguo Yuan, Xiao Wang, Zhimin Bao, Lin Zhu, Yonghong Tian, ** Tang

Abstract: Sampled point and voxel methods are usually employed to downsample the dense events into sparse ones. After that, one popular way is to leverage a graph model which treats the sparse points/voxels as nodes and adopts graph neural networks (GNNs) to learn the representation of event data. Although good performance can be obtained, however, their results are still limited mainly due to two issues. (… ▽ More Sampled point and voxel methods are usually employed to downsample the dense events into sparse ones. After that, one popular way is to leverage a graph model which treats the sparse points/voxels as nodes and adopts graph neural networks (GNNs) to learn the representation of event data. Although good performance can be obtained, however, their results are still limited mainly due to two issues. (1) Existing event GNNs generally adopt the additional max (or mean) pooling layer to summarize all node embeddings into a single graph-level representation for the whole event data representation. However, this approach fails to capture the importance of graph nodes and also fails to be fully aware of the node representations. (2) Existing methods generally employ either a sparse point or voxel graph representation model which thus lacks consideration of the complementary between these two types of representation models. To address these issues, we propose a novel dual point-voxel absorbing graph representation learning for event stream data representation. To be specific, given the input event stream, we first transform it into the sparse event cloud and voxel grids and build dual absorbing graph models for them respectively. Then, we design a novel absorbing graph convolutional network (AGCN) for our dual absorbing graph representation and learning. The key aspect of the proposed AGCN is its ability to effectively capture the importance of nodes and thus be fully aware of node representations in summarizing all node representations through the introduced absorbing nodes. Extensive experiments on multiple event-based classification benchmark datasets fully validated the effectiveness of our framework. △ Less

Submitted 29 July, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: In Peer Review

arXiv:2306.02604 [pdf]

A Simple Yet High-Performing On-disk Learned Index: Can We Have Our Cake and Eat it Too?

Authors: Hai Lan, Zhifeng Bao, J. Shane Culpepper, Renata Borovica-Gajic, Yu Dong

Abstract: While in-memory learned indexes have shown promising performance as compared to B+-tree, most widely used databases in real applications still rely on disk-based operations. Based on our experiments, we observe that directly applying the existing learned indexes on disk suffers from several drawbacks and cannot outperform a standard B+-tree in most cases. Therefore, in this work we make the first… ▽ More While in-memory learned indexes have shown promising performance as compared to B+-tree, most widely used databases in real applications still rely on disk-based operations. Based on our experiments, we observe that directly applying the existing learned indexes on disk suffers from several drawbacks and cannot outperform a standard B+-tree in most cases. Therefore, in this work we make the first attempt to show how the idea of learned index can benefit the on-disk index by proposing AULID, a fully on-disk updatable learned index that can achieve state-of-the-art performance across multiple workload types. The AULID approach combines the benefits from both traditional indexing techniques and the learned indexes to reduce the I/O cost, the main overhead under disk setting. Specifically, three aspects are taken into consideration in reducing I/O costs: (1) reduce the overhead in updating the index structure; (2) induce shorter paths from root to leaf node; (3) achieve better locality to minimize the number of block reads required to complete a scan. Five principles are proposed to guide the design of AULID which shows remarkable performance gains and meanwhile is easy to implement. Our evaluation shows that AULID has comparable storage costs to a B+-tree and is much smaller than other learned indexes, and AULID is up to 2.11x, 8.63x, 1.72x, 5.51x, and 8.02x more efficient than FITing-tree, PGM, B+-tree, ALEX, and LIPP. △ Less

Submitted 5 June, 2023; originally announced June 2023.

Comments: 14 pages

arXiv:2305.19580 [pdf]

Monofluorinated Ether Electrolyte with Acetal Backbone for High-Performance Lithium Metal Batteries

Authors: Elizabeth Zhang, Yuelang Chen, Zhiao Yu, Yi Cui, Zhenan Bao

Abstract: High degree of fluorination for ether electrolytes has resulted in improved cycling stability of lithium metal batteries (LMBs) due to stable SEI formation and good oxidative stability. However, the sluggish ion transport and environmental concerns of high fluorination degree drives the need to develop less fluorinated structures. Here, we introduce bis(2-fluoroethoxy)methane (F2DEM) which feature… ▽ More High degree of fluorination for ether electrolytes has resulted in improved cycling stability of lithium metal batteries (LMBs) due to stable SEI formation and good oxidative stability. However, the sluggish ion transport and environmental concerns of high fluorination degree drives the need to develop less fluorinated structures. Here, we introduce bis(2-fluoroethoxy)methane (F2DEM) which features monofluorination of the acetal backbone. High coulombic efficiency (CE) and stable long-term cycling in Li||Cu half cells can be achieved with F2DEM even under fast Li metal plating conditions. The performance of F2DEM is further compared with diethoxymethane (DEM) and 2-[2-(2,2-Difluoroethoxy)ethoxy]-1,1,1-Trifluoroethane (F5DEE). The structural similarity of DEM allows us to better probe the effects of monofluorination, while F5DEE is chosen as the one of the best performing LMB electrolytes for reference. The monofluorine substitution provides improved oxidation stability compared to non-fluorinated DEM, as demonstrated in the linear sweep voltammetry (LSV) and voltage holding experiments in Li||Pt and Li||Al cells. Higher ionic conductivity compared to F5DEE is also observed due to the decreased degree of fluorination. Furthermore, 1.75 M lithium bis(fluorosulfonyl)imide (LiFSI) / F2DEM displays significantly lower overpotential compared with the two reference electrolytes, which improves energy efficiency and enables its application in high-rate conditions. Comparative studies of F2DEM with DEM and F5DEE in anode-free (LiFePO4) LFP pouch cells and high-loading LFP coin cells with 20 μm excess Li further show improved capacity retention of F2DEM electrolyte. △ Less

Submitted 31 May, 2023; originally announced May 2023.

Comments: 19 pages, 6 figures

arXiv:2305.15799 [pdf, other]

MDVSC -- Wireless Model Division Video Semantic Communication

Authors: Zhicheng Bao, Haotai Liang, Chen Dong, Xiaodong Xu, Geng Liu

Abstract: In this paper, we propose a new wireless video communication scheme to achieve high-efficiency video transmission over noisy channels. It exploits the idea of model division multiple access (MDMA) and extracts common semantic features across video frames. Besides, deep joint source-channel coding (JSCC) is applied to overcome the distortion caused by noisy channels. The proposed framework is colle… ▽ More In this paper, we propose a new wireless video communication scheme to achieve high-efficiency video transmission over noisy channels. It exploits the idea of model division multiple access (MDMA) and extracts common semantic features across video frames. Besides, deep joint source-channel coding (JSCC) is applied to overcome the distortion caused by noisy channels. The proposed framework is collected under the name model division video semantic communication (MDVSC). In particular, temporal relative video frames are first transformed into a latent space for computing complexity reduction and data redistribution. Accordingly, a novel entropy-based variable length coding is developed further to compress semantic information under the communication bandwidth cost limitation. The whole MDVSC is an end-to-end learnable system. It can be formulated as an optimization problem whose goal is to minimize end-to-end transmission distortion under restricted communication resources. Across standard video source test sequences, test results show that the MDVSC outperforms traditional wireless video coding schemes generally under perceptual quality metrics and has the ability to control code length precisely. △ Less

Submitted 25 May, 2023; originally announced May 2023.

Showing 1–50 of 175 results for author: Ba, Z