-
Contribution Evaluation of Heterogeneous Participants in Federated Learning via Prototypical Representations
Authors:
Qi Guo,
Minghao Yao,
Zhen Tian,
Saiyu Qi,
Yong Qi,
Yun Lin,
** Song Dong
Abstract:
Contribution evaluation in federated learning (FL) has become a pivotal research area due to its applicability across various domains, such as detecting low-quality datasets, enhancing model robustness, and designing incentive mechanisms. Existing contribution evaluation methods, which primarily rely on data volume, model similarity, and auxiliary test datasets, have shown success in diverse scena…
▽ More
Contribution evaluation in federated learning (FL) has become a pivotal research area due to its applicability across various domains, such as detecting low-quality datasets, enhancing model robustness, and designing incentive mechanisms. Existing contribution evaluation methods, which primarily rely on data volume, model similarity, and auxiliary test datasets, have shown success in diverse scenarios. However, their effectiveness often diminishes due to the heterogeneity of data distributions, presenting a significant challenge to their applicability. In response, this paper explores contribution evaluation in FL from an entirely new perspective of representation. In this work, we propose a new method for the contribution evaluation of heterogeneous participants in federated learning (FLCE), which introduces a novel indicator \emph{class contribution momentum} to conduct refined contribution evaluation. Our core idea is the construction and application of the class contribution momentum indicator from individual, relative, and holistic perspectives, thereby achieving an effective and efficient contribution evaluation of heterogeneous participants without relying on an auxiliary test dataset. Extensive experimental results demonstrate the superiority of our method in terms of fidelity, effectiveness, efficiency, and heterogeneity across various scenarios.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
Learning Visual Conditioning Tokens to Correct Domain Shift for Fully Test-time Adaptation
Authors:
Yushun Tang,
Shuoshuo Chen,
Zhehan Kan,
Yi Zhang,
Qinghai Guo,
Zhihai He
Abstract:
Fully test-time adaptation aims to adapt the network model based on sequential analysis of input samples during the inference stage to address the cross-domain performance degradation problem of deep neural networks. This work is based on the following interesting finding: in transformer-based image classification, the class token at the first transformer encoder layer can be learned to capture th…
▽ More
Fully test-time adaptation aims to adapt the network model based on sequential analysis of input samples during the inference stage to address the cross-domain performance degradation problem of deep neural networks. This work is based on the following interesting finding: in transformer-based image classification, the class token at the first transformer encoder layer can be learned to capture the domain-specific characteristics of target samples during test-time adaptation. This learned token, when combined with input image patch embeddings, is able to gradually remove the domain-specific information from the feature representations of input samples during the transformer encoding process, thereby significantly improving the test-time adaptation performance of the source model across different domains. We refer to this class token as visual conditioning token (VCT). To successfully learn the VCT, we propose a bi-level learning approach to capture the long-term variations of domain-specific characteristics while accommodating local variations of instance-specific characteristics. Experimental results on the benchmark datasets demonstrate that our proposed bi-level visual conditioning token learning method is able to achieve significantly improved test-time adaptation performance by up to 1.9%.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
A Universal Railway Obstacle Detection System based on Semi-supervised Segmentation And Optical Flow
Authors:
Qiushi Guo
Abstract:
Detecting obstacles in railway scenarios is both crucial and challenging due to the wide range of obstacle categories and varying ambient conditions such as weather and light. Given the impossibility of encompassing all obstacle categories during the training stage, we address this out-of-distribution (OOD) issue with a semi-supervised segmentation approach guided by optical flow clues. We reformu…
▽ More
Detecting obstacles in railway scenarios is both crucial and challenging due to the wide range of obstacle categories and varying ambient conditions such as weather and light. Given the impossibility of encompassing all obstacle categories during the training stage, we address this out-of-distribution (OOD) issue with a semi-supervised segmentation approach guided by optical flow clues. We reformulate the task as a binary segmentation problem instead of the traditional object detection approach. To mitigate data shortages, we generate highly realistic synthetic images using Segment Anything (SAM) and YOLO, eliminating the need for manual annotation to produce abundant pixel-level annotations. Additionally, we leverage optical flow as prior knowledge to train the model effectively. Several experiments are conducted, demonstrating the feasibility and effectiveness of our approach.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
A PST Algorithm for FPs Suppression in Two-stage CNN Detection Methods
Authors:
Qiang Guo
Abstract:
Pedestrian detection has been a hot spot in computer vision over the past decades due to the wide spectrum of promising applications, the major challenge of which is False Positives (FPs) that occur during pedestrian detection. The emergence various Convolutional Neural Network-based detection strategies substantially enhance the pedestrian detection accuracy but still not well solve this problem.…
▽ More
Pedestrian detection has been a hot spot in computer vision over the past decades due to the wide spectrum of promising applications, the major challenge of which is False Positives (FPs) that occur during pedestrian detection. The emergence various Convolutional Neural Network-based detection strategies substantially enhance the pedestrian detection accuracy but still not well solve this problem. This paper deeply analysis the detection framework of the two-stage CNN detection methods and find out false positives in detection results is due to its training strategy miss classify some false proposals, thus weakens the classification capability of following subnetwork and hardly to suppress false ones. To solve this problem, This paper proposes a pedestrian-sensitive training algorithm to effectively help two-stage CNN detection methods learn to distinguish the pedestrian and non-pedestrian samples and suppress the false positives in final detection results. The core of the proposed training algorithm is to redesign the training proposal generating pipeline of the two-stage CNN detection methods, which can avoid a certain number of false ones that mislead its training process. With the help of the proposed algorithm, the detection accuracy of the MetroNext, an smaller and accurate metro passenger detector, is further improved, which further decreases false ones in its metro passengers detection results. Based on various challenging benchmark datasets, experiment results have demonstrated that feasibility of the proposed algorithm to improve pedestrian detection accuracy by removing the false positives. Compared with the competitors, MetroNext-PST demonstrates better overall prediction performance in accuracy, total number of parameters, and inference time, thus it can become a practical solution for hunting pedestrian tailored for mobile and edge devices.
△ Less
Submitted 24 May, 2024;
originally announced June 2024.
-
Probing many-body Bell correlation depth with superconducting qubits
Authors:
Ke Wang,
Weikang Li,
Shibo Xu,
Mengyao Hu,
Jiachen Chen,
Yaozu Wu,
Chuanyu Zhang,
Feitong **,
Xuhao Zhu,
Yu Gao,
Ziqi Tan,
Aosai Zhang,
Ning Wang,
Yiren Zou,
Tingting Li,
Fanhao Shen,
Jiarun Zhong,
Zehang Bao,
Zitian Zhu,
Zixuan Song,
**feng Deng,
Hang Dong,
Xu Zhang,
Pengfei Zhang,
Wenjie Jiang
, et al. (10 additional authors not shown)
Abstract:
Quantum nonlocality describes a stronger form of quantum correlation than that of entanglement. It refutes Einstein's belief of local realism and is among the most distinctive and enigmatic features of quantum mechanics. It is a crucial resource for achieving quantum advantages in a variety of practical applications, ranging from cryptography and certified random number generation via self-testing…
▽ More
Quantum nonlocality describes a stronger form of quantum correlation than that of entanglement. It refutes Einstein's belief of local realism and is among the most distinctive and enigmatic features of quantum mechanics. It is a crucial resource for achieving quantum advantages in a variety of practical applications, ranging from cryptography and certified random number generation via self-testing to machine learning. Nevertheless, the detection of nonlocality, especially in quantum many-body systems, is notoriously challenging. Here, we report an experimental certification of genuine multipartite Bell correlations, which signal nonlocality in quantum many-body systems, up to 24 qubits with a fully programmable superconducting quantum processor. In particular, we employ energy as a Bell correlation witness and variationally decrease the energy of a many-body system across a hierarchy of thresholds, below which an increasing Bell correlation depth can be certified from experimental data. As an illustrating example, we variationally prepare the low-energy state of a two-dimensional honeycomb model with 73 qubits and certify its Bell correlations by measuring an energy that surpasses the corresponding classical bound with up to 48 standard deviations. In addition, we variationally prepare a sequence of low-energy states and certify their genuine multipartite Bell correlations up to 24 qubits via energies measured efficiently by parity oscillation and multiple quantum coherence techniques. Our results establish a viable approach for preparing and certifying multipartite Bell correlations, which provide not only a finer benchmark beyond entanglement for quantum devices, but also a valuable guide towards exploiting multipartite Bell correlation in a wide spectrum of practical applications.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization
Authors:
Zhengyue Zhao,
Xiaoyun Zhang,
Kaidi Xu,
Xing Hu,
Rui Zhang,
Zidong Du,
Qi Guo,
Yunji Chen
Abstract:
With the widespread application of Large Language Models (LLMs), it has become a significant concern to ensure their safety and prevent harmful responses. While current safe-alignment methods based on instruction fine-tuning and Reinforcement Learning from Human Feedback (RLHF) can effectively reduce harmful responses from LLMs, they often require high-quality datasets and heavy computational over…
▽ More
With the widespread application of Large Language Models (LLMs), it has become a significant concern to ensure their safety and prevent harmful responses. While current safe-alignment methods based on instruction fine-tuning and Reinforcement Learning from Human Feedback (RLHF) can effectively reduce harmful responses from LLMs, they often require high-quality datasets and heavy computational overhead during model training. Another way to align language models is to modify the logit of tokens in model outputs without heavy training. Recent studies have shown that contrastive decoding can enhance the performance of language models by reducing the likelihood of confused tokens. However, these methods require the manual selection of contrastive models or instruction templates. To this end, we propose Adversarial Contrastive Decoding (ACD), an optimization-based framework to generate two opposite system prompts for prompt-based contrastive decoding. ACD only needs to apply a lightweight prompt tuning on a rather small anchor dataset (< 3 min for each model) without training the target model. Experiments conducted on extensive models and benchmarks demonstrate that the proposed method achieves much better safety performance than previous model training-free decoding methods without sacrificing its original generation ability.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
EmoAttack: Emotion-to-Image Diffusion Models for Emotional Backdoor Generation
Authors:
Tianyu Wei,
Shanmin Pang,
Qi Guo,
Yizhuo Ma,
Qing Guo
Abstract:
Text-to-image diffusion models can create realistic images based on input texts. Users can describe an object to convey their opinions visually. In this work, we unveil a previously unrecognized and latent risk of using diffusion models to generate images; we utilize emotion in the input texts to introduce negative contents, potentially eliciting unfavorable emotions in users. Emotions play a cruc…
▽ More
Text-to-image diffusion models can create realistic images based on input texts. Users can describe an object to convey their opinions visually. In this work, we unveil a previously unrecognized and latent risk of using diffusion models to generate images; we utilize emotion in the input texts to introduce negative contents, potentially eliciting unfavorable emotions in users. Emotions play a crucial role in expressing personal opinions in our daily interactions, and the inclusion of maliciously negative content can lead users astray, exacerbating negative emotions. Specifically, we identify the emotion-aware backdoor attack (EmoAttack) that can incorporate malicious negative content triggered by emotional texts during image generation. We formulate such an attack as a diffusion personalization problem to avoid extensive model retraining and propose the EmoBooth. Unlike existing personalization methods, our approach fine-tunes a pre-trained diffusion model by establishing a map** between a cluster of emotional words and a given reference image containing malicious negative content. To validate the effectiveness of our method, we built a dataset and conducted extensive analysis and discussion about its effectiveness. Given consumers' widespread use of diffusion models, uncovering this threat is critical for society.
△ Less
Submitted 22 June, 2024;
originally announced June 2024.
-
Evolutionary Spiking Neural Networks: A Survey
Authors:
Shuaijie Shen,
Rui Zhang,
Chao Wang,
Renzhuo Huang,
Aiersi Tuerhong,
Qinghai Guo,
Zhichao Lu,
Jianguo Zhang,
Luziwei Leng
Abstract:
Spiking neural networks (SNNs) are gaining increasing attention as potential computationally efficient alternatives to traditional artificial neural networks(ANNs). However, the unique information propagation mechanisms and the complexity of SNN neuron models pose challenges for adopting traditional methods developed for ANNs to SNNs. These challenges include both weight learning and architecture…
▽ More
Spiking neural networks (SNNs) are gaining increasing attention as potential computationally efficient alternatives to traditional artificial neural networks(ANNs). However, the unique information propagation mechanisms and the complexity of SNN neuron models pose challenges for adopting traditional methods developed for ANNs to SNNs. These challenges include both weight learning and architecture design. While surrogate gradient learning has shown some success in addressing the former challenge, the latter remains relatively unexplored. Recently, a novel paradigm utilizing evolutionary computation methods has emerged to tackle these challenges. This approach has resulted in the development of a variety of energy-efficient and high-performance SNNs across a wide range of machine learning benchmarks. In this paper, we present a survey of these works and initiate discussions on potential challenges ahead.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
AutoSurvey: Large Language Models Can Automatically Write Surveys
Authors:
Yidong Wang,
Qi Guo,
Wen** Yao,
Hongbo Zhang,
Xin Zhang,
Zhen Wu,
Meishan Zhang,
Xinyu Dai,
Min Zhang,
Qingsong Wen,
Wei Ye,
Shikun Zhang,
Yue Zhang
Abstract:
This paper introduces AutoSurvey, a speedy and well-organized methodology for automating the creation of comprehensive literature surveys in rapidly evolving fields like artificial intelligence. Traditional survey paper creation faces challenges due to the vast volume and complexity of information, prompting the need for efficient survey methods. While large language models (LLMs) offer promise in…
▽ More
This paper introduces AutoSurvey, a speedy and well-organized methodology for automating the creation of comprehensive literature surveys in rapidly evolving fields like artificial intelligence. Traditional survey paper creation faces challenges due to the vast volume and complexity of information, prompting the need for efficient survey methods. While large language models (LLMs) offer promise in automating this process, challenges such as context window limitations, parametric knowledge constraints, and the lack of evaluation benchmarks remain. AutoSurvey addresses these challenges through a systematic approach that involves initial retrieval and outline generation, subsection drafting by specialized LLMs, integration and refinement, and rigorous evaluation and iteration. Our contributions include a comprehensive solution to the survey problem, a reliable evaluation method, and experimental validation demonstrating AutoSurvey's effectiveness.We open our resources at \url{https://github.com/AutoSurveys/AutoSurvey}.
△ Less
Submitted 17 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Integrated Near Field Sensing and Communications Using Unitary Approximate Message Passing Based Matrix Factorization
Authors:
Zhengdao Yuan,
Qinghua Guo,
Yonina C. Eldar,
Yonghui Li
Abstract:
Due to the utilization of large antenna arrays at base stations (BSs) and the operations of wireless communications in high frequency bands, mobile terminals often find themselves in the near-field of the array aperture. In this work, we address the signal processing challenges of integrated near-field localization and communication in uplink transmission of an integrated sensing and communication…
▽ More
Due to the utilization of large antenna arrays at base stations (BSs) and the operations of wireless communications in high frequency bands, mobile terminals often find themselves in the near-field of the array aperture. In this work, we address the signal processing challenges of integrated near-field localization and communication in uplink transmission of an integrated sensing and communication (ISAC) system, where the BS performs joint near-field localization and signal detection (JNFLSD). We show that JNFLSD can be formulated as a matrix factorization (MF) problem with proper structures imposed on the factor matrices. Then, leveraging the variational inference (VI) and unitary approximate message passing (UAMP), we develop a low complexity Bayesian approach to MF, called UAMP-MF, to handle a generic MF problem. We then apply the UAMP-MF algorithm to solve the JNFLSD problem, where the factor matrix structures are fully exploited. Extensive simulation results are provided to demonstrate the superior performance of the proposed method.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications
Authors:
Zhou Zhou,
Guohang He,
Zheng Zhang,
Luziwei Leng,
Qinghai Guo,
Jianxing Liao,
Xuan Song,
Ran Cheng
Abstract:
Traditional invasive Brain-Computer Interfaces (iBCIs) typically depend on neural decoding processes conducted on workstations within laboratory settings, which prevents their everyday usage. Implementing these decoding processes on edge devices, such as the wearables, introduces considerable challenges related to computational demands, processing speed, and maintaining accuracy. This study seeks…
▽ More
Traditional invasive Brain-Computer Interfaces (iBCIs) typically depend on neural decoding processes conducted on workstations within laboratory settings, which prevents their everyday usage. Implementing these decoding processes on edge devices, such as the wearables, introduces considerable challenges related to computational demands, processing speed, and maintaining accuracy. This study seeks to identify an optimal neural decoding backbone that boasts robust performance and swift inference capabilities suitable for edge deployment. We executed a series of neural decoding experiments involving nonhuman primates engaged in random reaching tasks, evaluating four prospective models, Gated Recurrent Unit (GRU), Transformer, Receptance Weighted Key Value (RWKV), and Selective State Space model (Mamba), across several metrics: single-session decoding, multi-session decoding, new session fine-tuning, inference speed, calibration speed, and scalability. The findings indicate that although the GRU model delivers sufficient accuracy, the RWKV and Mamba models are preferable due to their superior inference and calibration speeds. Additionally, RWKV and Mamba comply with the scaling law, demonstrating improved performance with larger data sets and increased model sizes, whereas GRU shows less pronounced scalability, and the Transformer model requires computational resources that scale prohibitively. This paper presents a thorough comparative analysis of the four models in various scenarios. The results are pivotal in pinpointing an optimal backbone that can handle increasing data volumes and is viable for edge implementation. This analysis provides essential insights for ongoing research and practical applications in the field.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
Graph Neural Network Enhanced Retrieval for Question Answering of LLMs
Authors:
Zijian Li,
Qingyan Guo,
Jiawei Shao,
Lei Song,
Jiang Bian,
Jun Zhang,
Rui Wang
Abstract:
Retrieval augmented generation has revolutionized large language model (LLM) outputs by providing factual supports. Nevertheless, it struggles to capture all the necessary knowledge for complex reasoning questions. Existing retrieval methods typically divide reference documents into passages, treating them in isolation. These passages, however, are often interrelated, such as passages that are con…
▽ More
Retrieval augmented generation has revolutionized large language model (LLM) outputs by providing factual supports. Nevertheless, it struggles to capture all the necessary knowledge for complex reasoning questions. Existing retrieval methods typically divide reference documents into passages, treating them in isolation. These passages, however, are often interrelated, such as passages that are contiguous or share the same keywords. Therefore, recognizing the relatedness is crucial for enhancing the retrieval process. In this paper, we propose a novel retrieval method, called GNN-Ret, which leverages graph neural networks (GNNs) to enhance retrieval by considering the relatedness between passages. Specifically, we first construct a graph of passages by connecting passages that are structure-related and keyword-related. A graph neural network (GNN) is then leveraged to exploit the relationships between passages and improve the retrieval of supporting passages. Furthermore, we extend our method to handle multi-hop reasoning questions using a recurrent graph neural network (RGNN), named RGNN-Ret. At each step, RGNN-Ret integrates the graphs of passages from previous steps, thereby enhancing the retrieval of supporting passages. Extensive experiments on benchmark datasets demonstrate that GNN-Ret achieves higher accuracy for question answering with a single query of LLMs than strong baselines that require multiple queries, and RGNN-Ret further improves accuracy and achieves state-of-the-art performance, with up to 10.4% accuracy improvement on the 2WikiMQA dataset.
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
Texture Re-scalable Universal Adversarial Perturbation
Authors:
Yihao Huang,
Qing Guo,
Felix Juefei-Xu,
Ming Hu,
Xiaojun Jia,
Xiaochun Cao,
Geguang Pu,
Yang Liu
Abstract:
Universal adversarial perturbation (UAP), also known as image-agnostic perturbation, is a fixed perturbation map that can fool the classifier with high probabilities on arbitrary images, making it more practical for attacking deep models in the real world. Previous UAP methods generate a scale-fixed and texture-fixed perturbation map for all images, which ignores the multi-scale objects in images…
▽ More
Universal adversarial perturbation (UAP), also known as image-agnostic perturbation, is a fixed perturbation map that can fool the classifier with high probabilities on arbitrary images, making it more practical for attacking deep models in the real world. Previous UAP methods generate a scale-fixed and texture-fixed perturbation map for all images, which ignores the multi-scale objects in images and usually results in a low fooling ratio. Since the widely used convolution neural networks tend to classify objects according to semantic information stored in local textures, it seems a reasonable and intuitive way to improve the UAP from the perspective of utilizing local contents effectively. In this work, we find that the fooling ratios significantly increase when we add a constraint to encourage a small-scale UAP map and repeat it vertically and horizontally to fill the whole image domain. To this end, we propose texture scale-constrained UAP (TSC-UAP), a simple yet effective UAP enhancement method that automatically generates UAPs with category-specific local textures that can fool deep models more easily. Through a low-cost operation that restricts the texture scale, TSC-UAP achieves a considerable improvement in the fooling ratio and attack transferability for both data-dependent and data-free UAP methods. Experiments conducted on two state-of-the-art UAP methods, eight popular CNN models and four classical datasets show the remarkable performance of TSC-UAP.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Semantic Segmentation on VSPW Dataset through Masked Video Consistency
Authors:
Chen Liang,
Qiang Guo,
Chongkai Yu,
Cheng**g Wu,
Ting Liu,
Luoqi Liu
Abstract:
Pixel-level Video Understanding requires effectively integrating three-dimensional data in both spatial and temporal dimensions to learn accurate and stable semantic information from continuous frames. However, existing advanced models on the VSPW dataset have not fully modeled spatiotemporal relationships. In this paper, we present our solution for the PVUW competition, where we introduce masked…
▽ More
Pixel-level Video Understanding requires effectively integrating three-dimensional data in both spatial and temporal dimensions to learn accurate and stable semantic information from continuous frames. However, existing advanced models on the VSPW dataset have not fully modeled spatiotemporal relationships. In this paper, we present our solution for the PVUW competition, where we introduce masked video consistency (MVC) based on existing models. MVC enforces the consistency between predictions of masked frames where random patches are withheld. The model needs to learn the segmentation results of the masked parts through the context of images and the relationship between preceding and succeeding frames of the video. Additionally, we employed test-time augmentation, model aggeregation and a multimodal model-based post-processing method. Our approach achieves 67.27% mIoU performance on the VSPW dataset, ranking 2nd place in the PVUW2024 challenge VSS track.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
In-depth Analysis of Densest Subgraph Discovery in a Unified Framework
Authors:
Yingli Zhou,
Qingshuo Guo,
Yi Yang,
Yixiang Fang,
Chenhao Ma,
Laks Lakshmanan
Abstract:
As a fundamental topic in graph mining, Densest Subgraph Discovery (DSD) has found a wide spectrum of real applications. Several DSD algorithms, including exact and approximation algorithms, have been proposed in the literature. However, these algorithms have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first propose a unified framewo…
▽ More
As a fundamental topic in graph mining, Densest Subgraph Discovery (DSD) has found a wide spectrum of real applications. Several DSD algorithms, including exact and approximation algorithms, have been proposed in the literature. However, these algorithms have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first propose a unified framework to incorporate all DSD algorithms from a high-level perspective. We then extensively compare representative DSD algorithms over a range of graphs -- from small to billion-scale -- and examine the effectiveness of all methods. Moreover, we suggest new variants of the DSD algorithms by combining the existing techniques, which are up to 10 X faster than the state-of-the-art algorithm with the same accuracy guarantee. Finally, based on the findings, we offer promising research opportunities. We believe that a deeper understanding of the behavior of existing algorithms can provide new valuable insights for future research. The codes are released at https://anonymous.4open.science/r/DensestSubgraph-245A
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
SpikeZIP-TF: Conversion is All You Need for Transformer-based SNN
Authors:
Kang You,
Zekai Xu,
Chen Nie,
Zhijie Deng,
Qinghai Guo,
Xiang Wang,
Zhezhi He
Abstract:
Spiking neural network (SNN) has attracted great attention due to its characteristic of high efficiency and accuracy. Currently, the ANN-to-SNN conversion methods can obtain ANN on-par accuracy SNN with ultra-low latency (8 time-steps) in CNN structure on computer vision (CV) tasks. However, as Transformer-based networks have achieved prevailing precision on both CV and natural language processing…
▽ More
Spiking neural network (SNN) has attracted great attention due to its characteristic of high efficiency and accuracy. Currently, the ANN-to-SNN conversion methods can obtain ANN on-par accuracy SNN with ultra-low latency (8 time-steps) in CNN structure on computer vision (CV) tasks. However, as Transformer-based networks have achieved prevailing precision on both CV and natural language processing (NLP), the Transformer-based SNNs are still encounting the lower accuracy w.r.t the ANN counterparts. In this work, we introduce a novel ANN-to-SNN conversion method called SpikeZIP-TF, where ANN and SNN are exactly equivalent, thus incurring no accuracy degradation. SpikeZIP-TF achieves 83.82% accuracy on CV dataset (ImageNet) and 93.79% accuracy on NLP dataset (SST-2), which are higher than SOTA Transformer-based SNNs. The code is available in GitHub: https://github.com/Intelligent-Computing-Research-Group/SpikeZIP_transformer
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Prompt-based Visual Alignment for Zero-shot Policy Transfer
Authors:
Haihan Gao,
Rui Zhang,
Qi Yi,
Hantao Yao,
Haochen Li,
Jiaming Guo,
Shaohui Peng,
Yunkai Gao,
QiCheng Wang,
Xing Hu,
Yuanbo Wen,
Zihao Zhang,
Zidong Du,
Ling Li,
Qi Guo,
Yunji Chen
Abstract:
Overfitting in RL has become one of the main obstacles to applications in reinforcement learning(RL). Existing methods do not provide explicit semantic constrain for the feature extractor, hindering the agent from learning a unified cross-domain representation and resulting in performance degradation on unseen domains. Besides, abundant data from multiple domains are needed. To address these issue…
▽ More
Overfitting in RL has become one of the main obstacles to applications in reinforcement learning(RL). Existing methods do not provide explicit semantic constrain for the feature extractor, hindering the agent from learning a unified cross-domain representation and resulting in performance degradation on unseen domains. Besides, abundant data from multiple domains are needed. To address these issues, in this work, we propose prompt-based visual alignment (PVA), a robust framework to mitigate the detrimental domain bias in the image for zero-shot policy transfer. Inspired that Visual-Language Model (VLM) can serve as a bridge to connect both text space and image space, we leverage the semantic information contained in a text sequence as an explicit constraint to train a visual aligner. Thus, the visual aligner can map images from multiple domains to a unified domain and achieve good generalization performance. To better depict semantic information, prompt tuning is applied to learn a sequence of learnable tokens. With explicit constraints of semantic information, PVA can learn unified cross-domain representation under limited access to cross-domain data and achieves great zero-shot generalization ability in unseen domains. We verify PVA on a vision-based autonomous driving task with CARLA simulator. Experiments show that the agent generalizes well on unseen domains under limited access to multi-domain data.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
On Affine Homotopy between Language Encoders
Authors:
Robin SM Chan,
Reda Boumasmoud,
Anej Svete,
Yuxin Ren,
Qipeng Guo,
Zhi**g **,
Shauli Ravfogel,
Mrinmaya Sachan,
Bernhard Schölkopf,
Mennatallah El-Assady,
Ryan Cotterell
Abstract:
Pre-trained language encoders -- functions that represent text as vectors -- are an integral component of many NLP tasks. We tackle a natural question in language encoder analysis: What does it mean for two encoders to be similar? We contend that a faithful measure of similarity needs to be \emph{intrinsic}, that is, task-independent, yet still be informative of \emph{extrinsic} similarity -- the…
▽ More
Pre-trained language encoders -- functions that represent text as vectors -- are an integral component of many NLP tasks. We tackle a natural question in language encoder analysis: What does it mean for two encoders to be similar? We contend that a faithful measure of similarity needs to be \emph{intrinsic}, that is, task-independent, yet still be informative of \emph{extrinsic} similarity -- the performance on downstream tasks. It is common to consider two encoders similar if they are \emph{homotopic}, i.e., if they can be aligned through some transformation. In this spirit, we study the properties of \emph{affine} alignment of language encoders and its implications on extrinsic similarity. We find that while affine alignment is fundamentally an asymmetric notion of similarity, it is still informative of extrinsic similarity. We confirm this on datasets of natural language representations. Beyond providing useful bounds on extrinsic similarity, affine intrinsic similarity also allows us to begin uncovering the structure of the space of pre-trained encoders by defining an order over them.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Synergetic Event Understanding: A Collaborative Approach to Cross-Document Event Coreference Resolution with Large Language Models
Authors:
Qingkai Min,
Qipeng Guo,
Xiangkun Hu,
Songfang Huang,
Zheng Zhang,
Yue Zhang
Abstract:
Cross-document event coreference resolution (CDECR) involves clustering event mentions across multiple documents that refer to the same real-world events. Existing approaches utilize fine-tuning of small language models (SLMs) like BERT to address the compatibility among the contexts of event mentions. However, due to the complexity and diversity of contexts, these models are prone to learning sim…
▽ More
Cross-document event coreference resolution (CDECR) involves clustering event mentions across multiple documents that refer to the same real-world events. Existing approaches utilize fine-tuning of small language models (SLMs) like BERT to address the compatibility among the contexts of event mentions. However, due to the complexity and diversity of contexts, these models are prone to learning simple co-occurrences. Recently, large language models (LLMs) like ChatGPT have demonstrated impressive contextual understanding, yet they encounter challenges in adapting to specific information extraction (IE) tasks. In this paper, we propose a collaborative approach for CDECR, leveraging the capabilities of both a universally capable LLM and a task-specific SLM. The collaborative strategy begins with the LLM accurately and comprehensively summarizing events through prompting. Then, the SLM refines its learning of event representations based on these insights during fine-tuning. Experimental results demonstrate that our approach surpasses the performance of both the large and small language models individually, forming a complementary advantage. Across various datasets, our approach achieves state-of-the-art performance, underscoring its effectiveness in diverse scenarios.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model
Authors:
An-Chieh Cheng,
Hongxu Yin,
Yang Fu,
Qiushan Guo,
Ruihan Yang,
Jan Kautz,
Xiaolong Wang,
Sifei Liu
Abstract:
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curati…
▽ More
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (2) a flexible plugin module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, we propose SpatialRGBT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments, for evaluating 3D spatial cognition in VLMs. Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks. Code, dataset, and benchmark will be released at https://www.anjiecheng.me/SpatialRGPT
△ Less
Submitted 18 June, 2024; v1 submitted 3 June, 2024;
originally announced June 2024.
-
Scaffold Splits Overestimate Virtual Screening Performance
Authors:
Qianrong Guo,
Saiveth Hernandez-Hernandez,
Pedro J Ballester
Abstract:
Virtual Screening (VS) of vast compound libraries guided by Artificial Intelligence (AI) models is a highly productive approach to early drug discovery. Data splitting is crucial for better benchmarking of such AI models. Traditional random data splits produce similar molecules between training and test sets, conflicting with the reality of VS libraries which mostly contain structurally distinct c…
▽ More
Virtual Screening (VS) of vast compound libraries guided by Artificial Intelligence (AI) models is a highly productive approach to early drug discovery. Data splitting is crucial for better benchmarking of such AI models. Traditional random data splits produce similar molecules between training and test sets, conflicting with the reality of VS libraries which mostly contain structurally distinct compounds. Scaffold split, grou** molecules by shared core structure, is widely considered to reflect this real-world scenario. However, here we show that the scaffold split also overestimates VS performance. The reason is that molecules with different chemical scaffolds are often similar, which hence introduces unrealistically high similarities between training molecules and test molecules following a scaffold split. Our study examined three representative AI models on 60 NCI-60 datasets, each with approximately 30,000 to 50,000 molecules tested on a different cancer cell line. Each dataset was split with three methods: scaffold, Butina clustering and the more accurate Uniform Manifold Approximation and Projection (UMAP) clustering. Regardless of the model, model performance is much worse with UMAP splits from the results of the 2100 models trained and evaluated for each algorithm and split. These robust results demonstrate the need for more realistic data splits to tune, compare, and select models for VS. For the same reason, avoiding the scaffold split is also recommended for other molecular property prediction problems. The code to reproduce these results is available at https://github.com/ScaffoldSplitsOverestimateVS
△ Less
Submitted 30 June, 2024; v1 submitted 2 June, 2024;
originally announced June 2024.
-
Quantum Computing in Intelligent Transportation Systems: A Survey
Authors:
Yifan Zhuang,
Talha Azfar,
Yinhai Wang,
Wei Sun,
Xiaokun Cara Wang,
Qianwen Vivian Guo,
Ruimin Ke
Abstract:
Quantum computing, a field utilizing the principles of quantum mechanics, promises great advancements across various industries. This survey paper is focused on the burgeoning intersection of quantum computing and intelligent transportation systems, exploring its potential to transform areas such as traffic optimization, logistics, routing, and autonomous vehicles. By examining current research ef…
▽ More
Quantum computing, a field utilizing the principles of quantum mechanics, promises great advancements across various industries. This survey paper is focused on the burgeoning intersection of quantum computing and intelligent transportation systems, exploring its potential to transform areas such as traffic optimization, logistics, routing, and autonomous vehicles. By examining current research efforts, challenges, and future directions, this survey aims to provide a comprehensive overview of how quantum computing could affect the future of transportation.
△ Less
Submitted 3 June, 2024; v1 submitted 2 June, 2024;
originally announced June 2024.
-
AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data
Authors:
Zifan Song,
Yudong Wang,
Wenwei Zhang,
Kuikun Liu,
Chengqi Lyu,
Demin Song,
Qipeng Guo,
Hang Yan,
Dahua Lin,
Kai Chen,
Cairong Zhao
Abstract:
Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enh…
▽ More
Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data. To achieve this, we pioneer to unveil inherent conflicts among the various styles and qualities in multi-source code corpora and introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Text Modality Oriented Image Feature Extraction for Detecting Diffusion-based DeepFake
Authors:
Di Yang,
Yihao Huang,
Qing Guo,
Felix Juefei-Xu,
Xiaojun Jia,
Run Wang,
Geguang Pu,
Yang Liu
Abstract:
The widespread use of diffusion methods enables the creation of highly realistic images on demand, thereby posing significant risks to the integrity and safety of online information and highlighting the necessity of DeepFake detection. Our analysis of features extracted by traditional image encoders reveals that both low-level and high-level features offer distinct advantages in identifying DeepFa…
▽ More
The widespread use of diffusion methods enables the creation of highly realistic images on demand, thereby posing significant risks to the integrity and safety of online information and highlighting the necessity of DeepFake detection. Our analysis of features extracted by traditional image encoders reveals that both low-level and high-level features offer distinct advantages in identifying DeepFake images produced by various diffusion methods. Inspired by this finding, we aim to develop an effective representation that captures both low-level and high-level features to detect diffusion-based DeepFakes. To address the problem, we propose a text modality-oriented feature extraction method, termed TOFE. Specifically, for a given target image, the representation we discovered is a corresponding text embedding that can guide the generation of the target image with a specific text-to-image model. Experiments conducted across ten diffusion types demonstrate the efficacy of our proposed method.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling
Authors:
Bowen Zhang,
Xiaofei Xie,
Haotian Lu,
Na Ma,
Tianlin Li,
Qing Guo
Abstract:
Diffusion-based video generation has achieved significant progress, yet generating multiple actions that occur sequentially remains a formidable task. Directly generating a video with sequential actions can be extremely challenging due to the scarcity of fine-grained action annotations and the difficulty in establishing temporal semantic correspondences and maintaining long-term consistency. To ta…
▽ More
Diffusion-based video generation has achieved significant progress, yet generating multiple actions that occur sequentially remains a formidable task. Directly generating a video with sequential actions can be extremely challenging due to the scarcity of fine-grained action annotations and the difficulty in establishing temporal semantic correspondences and maintaining long-term consistency. To tackle this, we propose an intuitive and straightforward solution: splicing multiple single-action video segments sequentially. The core challenge lies in generating smooth and natural transitions between these segments given the inherent complexity and variability of action transitions. We introduce MAVIN (Multi-Action Video INfilling model), designed to generate transition videos that seamlessly connect two given videos, forming a cohesive integrated sequence. MAVIN incorporates several innovative techniques to address challenges in the transition video infilling task. Firstly, a consecutive noising strategy coupled with variable-length sampling is employed to handle large infilling gaps and varied generation lengths. Secondly, boundary frame guidance (BFG) is proposed to address the lack of semantic guidance during transition generation. Lastly, a Gaussian filter mixer (GFM) dynamically manages noise initialization during inference, mitigating train-test discrepancy while preserving generation flexibility. Additionally, we introduce a new metric, CLIP-RS (CLIP Relative Smoothness), to evaluate temporal coherence and smoothness, complementing traditional quality-based metrics. Experimental results on horse and tiger scenarios demonstrate MAVIN's superior performance in generating smooth and coherent video transitions compared to existing methods.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Transmission Interface Power Flow Adjustment: A Deep Reinforcement Learning Approach based on Multi-task Attribution Map
Authors:
Shunyu Liu,
Wei Luo,
Yanzhen Zhou,
Kaixuan Chen,
Quan Zhang,
Huating Xu,
Qinglai Guo,
Mingli Song
Abstract:
Transmission interface power flow adjustment is a critical measure to ensure the security and economy operation of power systems. However, conventional model-based adjustment schemes are limited by the increasing variations and uncertainties occur in power systems, where the adjustment problems of different transmission interfaces are often treated as several independent tasks, ignoring their coup…
▽ More
Transmission interface power flow adjustment is a critical measure to ensure the security and economy operation of power systems. However, conventional model-based adjustment schemes are limited by the increasing variations and uncertainties occur in power systems, where the adjustment problems of different transmission interfaces are often treated as several independent tasks, ignoring their coupling relationship and even leading to conflict decisions. In this paper, we introduce a novel data-driven deep reinforcement learning (DRL) approach, to handle multiple power flow adjustment tasks jointly instead of learning each task from scratch. At the heart of the proposed method is a multi-task attribution map (MAM), which enables the DRL agent to explicitly attribute each transmission interface task to different power system nodes with task-adaptive attention weights. Based on this MAM, the agent can further provide effective strategies to solve the multi-task adjustment problem with a near-optimal operation cost. Simulation results on the IEEE 118-bus system, a realistic 300-bus system in China, and a very large European system with 9241 buses demonstrate that the proposed method significantly improves the performance compared with several baseline methods, and exhibits high interpretability with the learnable MAM.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
Luban: Building Open-Ended Creative Agents via Autonomous Embodied Verification
Authors:
Yuxuan Guo,
Shaohui Peng,
Jiaming Guo,
Di Huang,
Xishan Zhang,
Rui Zhang,
Yifan Hao,
Ling Li,
Zikang Tian,
Mingju Gao,
Yutai Li,
Yiming Gan,
Shuai Liang,
Zihao Zhang,
Zidong Du,
Qi Guo,
Xing Hu,
Yunji Chen
Abstract:
Building open agents has always been the ultimate goal in AI research, and creative agents are the more enticing. Existing LLM agents excel at long-horizon tasks with well-defined goals (e.g., `mine diamonds' in Minecraft). However, they encounter difficulties on creative tasks with open goals and abstract criteria due to the inability to bridge the gap between them, thus lacking feedback for self…
▽ More
Building open agents has always been the ultimate goal in AI research, and creative agents are the more enticing. Existing LLM agents excel at long-horizon tasks with well-defined goals (e.g., `mine diamonds' in Minecraft). However, they encounter difficulties on creative tasks with open goals and abstract criteria due to the inability to bridge the gap between them, thus lacking feedback for self-improvement in solving the task. In this work, we introduce autonomous embodied verification techniques for agents to fill the gap, laying the groundwork for creative tasks. Specifically, we propose the Luban agent target creative building tasks in Minecraft, which equips with two-level autonomous embodied verification inspired by human design practices: (1) visual verification of 3D structural speculates, which comes from agent synthesized CAD modeling programs; (2) pragmatic verification of the creation by generating and verifying environment-relevant functionality programs based on the abstract criteria. Extensive multi-dimensional human studies and Elo ratings show that the Luban completes diverse creative building tasks in our proposed benchmark and outperforms other baselines ($33\%$ to $100\%$) in both visualization and pragmatism. Additional demos on the real-world robotic arm show the creation potential of the Luban in the physical world.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services
Authors:
Zheming Yang,
Yuanhao Yang,
Chang Zhao,
Qi Guo,
Wenkai He,
Wen Ji
Abstract:
With the rapid growth in the number of large language model (LLM) users, it is difficult for bandwidth-constrained cloud servers to simultaneously process massive LLM services in real-time. Recently, edge-cloud infrastructures have been used to improve the processing efficiency of large-scale LLM services. However, the diversity of task requirements and the dynamics of resources pose great challen…
▽ More
With the rapid growth in the number of large language model (LLM) users, it is difficult for bandwidth-constrained cloud servers to simultaneously process massive LLM services in real-time. Recently, edge-cloud infrastructures have been used to improve the processing efficiency of large-scale LLM services. However, the diversity of task requirements and the dynamics of resources pose great challenges to inference scheduling, leading to the wastage of many resources. In this paper, we present PerLLM, a personalized inference scheduling framework with edge-cloud collaboration designed for diverse LLM services. For the complexity of multiple constraints and the decision-making process of edge-cloud collaboration, we integrate the upper confidence bound algorithm based on the constraint satisfaction mechanism in PerLLM. For diverse LLM services, PerLLM can optimize service scheduling and resource allocation solutions within the edge-cloud infrastructure to meet processing time requirements while minimizing energy costs. Experimental results from different model deployments show that PerLLM can effectively meet the processing time requirements of personalized services. Compared to other methods, PerLLM achieves 2.2x, 2.1x, and 1.6x throughput and reduces the energy cost by more than 50%.
△ Less
Submitted 23 May, 2024;
originally announced May 2024.
-
RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models
Authors:
Xiangkun Hu,
Dongyu Ru,
Lin Qiu,
Qipeng Guo,
Tianhang Zhang,
Yang Xu,
Yun Luo,
Pengfei Liu,
Yue Zhang,
Zheng Zhang
Abstract:
Large Language Models (LLMs) have shown impressive capabilities but also a concerning tendency to hallucinate. This paper presents RefChecker, a framework that introduces claim-triplets to represent claims in LLM responses, aiming to detect fine-grained hallucinations. In RefChecker, an extractor generates claim-triplets from a response, which are then evaluated by a checker against a reference. W…
▽ More
Large Language Models (LLMs) have shown impressive capabilities but also a concerning tendency to hallucinate. This paper presents RefChecker, a framework that introduces claim-triplets to represent claims in LLM responses, aiming to detect fine-grained hallucinations. In RefChecker, an extractor generates claim-triplets from a response, which are then evaluated by a checker against a reference. We delineate three task settings: Zero, Noisy and Accurate Context, to reflect various real-world use cases. We curated a benchmark spanning various NLP tasks and annotated 11k claim-triplets from 2.1k responses by seven LLMs. RefChecker supports both proprietary and open-source models as the extractor and checker. Experiments demonstrate that claim-triplets enable superior hallucination detection, compared to other granularities such as response, sentence and sub-sentence level claims. RefChecker outperforms prior methods by 6.8 to 26.1 points on our benchmark and the checking results of RefChecker are strongly aligned with human judgments. This work is open sourced at https://github.com/amazon-science/RefChecker
△ Less
Submitted 23 May, 2024;
originally announced May 2024.
-
Semantic-guided Prompt Organization for Universal Goal Hijacking against LLMs
Authors:
Yihao Huang,
Chong Wang,
Xiaojun Jia,
Qing Guo,
Felix Juefei-Xu,
Jian Zhang,
Geguang Pu,
Yang Liu
Abstract:
With the rising popularity of Large Language Models (LLMs), assessing their trustworthiness through security tasks has gained critical importance. Regarding the new task of universal goal hijacking, previous efforts have concentrated solely on optimization algorithms, overlooking the crucial role of the prompt. To fill this gap, we propose a universal goal hijacking method called POUGH that incorp…
▽ More
With the rising popularity of Large Language Models (LLMs), assessing their trustworthiness through security tasks has gained critical importance. Regarding the new task of universal goal hijacking, previous efforts have concentrated solely on optimization algorithms, overlooking the crucial role of the prompt. To fill this gap, we propose a universal goal hijacking method called POUGH that incorporates semantic-guided prompt processing strategies. Specifically, the method starts with a sampling strategy to select representative prompts from a candidate pool, followed by a ranking strategy that prioritizes the prompts. Once the prompts are organized sequentially, the method employs an iterative optimization algorithm to generate the universal fixed suffix for the prompts. Experiments conducted on four popular LLMs and ten types of target responses verified the effectiveness of our method.
△ Less
Submitted 23 May, 2024;
originally announced May 2024.
-
Towards Transferable Attacks Against Vision-LLMs in Autonomous Driving with Typography
Authors:
Nhat Chung,
Sensen Gao,
Tuan-Anh Vu,
Jie Zhang,
Aishan Liu,
Yun Lin,
** Song Dong,
Qing Guo
Abstract:
Vision-Large-Language-Models (Vision-LLMs) are increasingly being integrated into autonomous driving (AD) systems due to their advanced visual-language reasoning capabilities, targeting the perception, prediction, planning, and control mechanisms. However, Vision-LLMs have demonstrated susceptibilities against various types of adversarial attacks, which would compromise their reliability and safet…
▽ More
Vision-Large-Language-Models (Vision-LLMs) are increasingly being integrated into autonomous driving (AD) systems due to their advanced visual-language reasoning capabilities, targeting the perception, prediction, planning, and control mechanisms. However, Vision-LLMs have demonstrated susceptibilities against various types of adversarial attacks, which would compromise their reliability and safety. To further explore the risk in AD systems and the transferability of practical threats, we propose to leverage typographic attacks against AD systems relying on the decision-making capabilities of Vision-LLMs. Different from the few existing works develo** general datasets of typographic attacks, this paper focuses on realistic traffic scenarios where these attacks can be deployed, on their potential effects on the decision-making autonomy, and on the practical ways in which these attacks can be physically presented. To achieve the above goals, we first propose a dataset-agnostic framework for automatically generating false answers that can mislead Vision-LLMs' reasoning. Then, we present a linguistic augmentation scheme that facilitates attacks at image-level and region-level reasoning, and we extend it with attack patterns against multiple reasoning tasks simultaneously. Based on these, we conduct a study on how these attacks can be realized in physical traffic scenarios. Through our empirical study, we evaluate the effectiveness, transferability, and realizability of typographic attacks in traffic scenes. Our findings demonstrate particular harmfulness of the typographic attacks against existing Vision-LLMs (e.g., LLaVA, Qwen-VL, VILA, and Imp), thereby raising community awareness of vulnerabilities when incorporating such models into AD systems. We will release our source code upon acceptance.
△ Less
Submitted 23 May, 2024;
originally announced May 2024.
-
Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models
Authors:
Zhangyue Yin,
Qiushi Sun,
Qipeng Guo,
Zhiyuan Zeng,
Xiaonan Li,
Tianxiang Sun,
Cheng Chang,
Qinyuan Cheng,
Ding Wang,
Xiaofeng Mou,
Xipeng Qiu,
Xuan**g Huang
Abstract:
Recent advancements in Chain-of-Thought prompting have facilitated significant breakthroughs for Large Language Models (LLMs) in complex reasoning tasks. Current research enhances the reasoning performance of LLMs by sampling multiple reasoning chains and ensembling based on the answer frequency. However, this approach fails in scenarios where the correct answers are in the minority. We identify t…
▽ More
Recent advancements in Chain-of-Thought prompting have facilitated significant breakthroughs for Large Language Models (LLMs) in complex reasoning tasks. Current research enhances the reasoning performance of LLMs by sampling multiple reasoning chains and ensembling based on the answer frequency. However, this approach fails in scenarios where the correct answers are in the minority. We identify this as a primary factor constraining the reasoning capabilities of LLMs, a limitation that cannot be resolved solely based on the predicted answers. To address this shortcoming, we introduce a hierarchical reasoning aggregation framework AoR (Aggregation of Reasoning), which selects answers based on the evaluation of reasoning chains. Additionally, AoR incorporates dynamic sampling, adjusting the number of reasoning chains in accordance with the complexity of the task. Experimental results on a series of complex reasoning tasks show that AoR outperforms prominent ensemble methods. Further analysis reveals that AoR not only adapts various LLMs but also achieves a superior performance ceiling when compared to current methods.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
Hummer: Towards Limited Competitive Preference Dataset
Authors:
Li Jiang,
Yusen Wu,
Junwu Xiong,
**gqing Ruan,
Yichuan Ding,
Qingpei Guo,
Zujie Wen,
Jun Zhou,
Xiaotie Deng
Abstract:
Preference datasets are essential for incorporating human preferences into pre-trained language models, playing a key role in the success of Reinforcement Learning from Human Feedback. However, these datasets often demonstrate conflicting alignment objectives, leading to increased vulnerability to jailbreak attacks and challenges in adapting downstream tasks to prioritize specific alignment object…
▽ More
Preference datasets are essential for incorporating human preferences into pre-trained language models, playing a key role in the success of Reinforcement Learning from Human Feedback. However, these datasets often demonstrate conflicting alignment objectives, leading to increased vulnerability to jailbreak attacks and challenges in adapting downstream tasks to prioritize specific alignment objectives without negatively impacting others. In this work, we introduce a novel statistical metric, Alignment Dimension Conflict, to quantify the degree of conflict within preference datasets. We then present \texttt{Hummer} and its fine-grained variant, \texttt{Hummer-F}, as innovative pairwise preference datasets with reduced-conflict alignment objectives. \texttt{Hummer} is built based on UltraFeedback and is enhanced by AI feedback from GPT-4, marking as the first preference dataset aimed at reducing the competition between alignment objectives. Furthermore, we develop reward models, HummerRM and HummerRM-F, which employ a hybrid sampling approach to balance diverse alignment objectives effectively. This sampling method positions HummerRM as an ideal model for domain-specific further fine-tuning and reducing vulnerabilities to attacks.
△ Less
Submitted 20 May, 2024; v1 submitted 19 May, 2024;
originally announced May 2024.
-
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
Authors:
Chengyue Wu,
Yixiao Ge,
Qiushan Guo,
Jiahao Wang,
Zhixuan Liang,
Zeyu Lu,
Ying Shan,
** Luo
Abstract:
The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figure to executable code, have not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLM…
▽ More
The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figure to executable code, have not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLMs. We carefully collect 132 manually selected high-quality matplotlib plots across six plot types from publicly available matplotlib galleries. For each plot, we carefully offer its source code, and an descriptive instruction summarized by GPT-4. This approach enables Plot2Code to extensively evaluate MLLMs' code capabilities across various input modalities. Furthermore, we propose three automatic evaluation metrics, including code pass rate, text-match ratio, and GPT-4V overall rating, for a fine-grained assessment of the output code and rendered images. Instead of simply judging pass or fail, we employ GPT-4V to make an overall judgement between the generated and reference images, which has been shown to be consistent with human evaluation. The evaluation results, which include analyses of 14 MLLMs such as the proprietary GPT-4V, Gemini-Pro, and the open-sourced Mini-Gemini, highlight the substantial challenges presented by Plot2Code. With Plot2Code, we reveal that most existing MLLMs struggle with visual coding for text-dense plots, heavily relying on textual instruction. We hope that the evaluation results from Plot2Code on visual coding will guide the future development of MLLMs. All data involved with Plot2Code are available at https://huggingface.co/datasets/TencentARC/Plot2Code.
△ Less
Submitted 13 May, 2024;
originally announced May 2024.
-
Preventive Audits for Data Applications Before Data Sharing in the Power IoT
Authors:
Bohong Wang,
Qinglai Guo,
Yanxi Lin,
Yang Yu
Abstract:
With the increase in data volume, more types of data are being used and shared, especially in the power Internet of Things (IoT). However, the processes of data sharing may lead to unexpected information leakage because of the ubiquitous relevance among the different data, thus it is necessary for data owners to conduct preventive audits for data applications before data sharing to avoid the risk…
▽ More
With the increase in data volume, more types of data are being used and shared, especially in the power Internet of Things (IoT). However, the processes of data sharing may lead to unexpected information leakage because of the ubiquitous relevance among the different data, thus it is necessary for data owners to conduct preventive audits for data applications before data sharing to avoid the risk of key information leakage. Considering that the same data may play completely different roles in different application scenarios, data owners should know the expected data applications of the data buyers in advance and provide modified data that are less relevant to the private information of the data owners and more relevant to the nonprivate information that the data buyers need. In this paper, data sharing in the power IoT is regarded as the background, and the mutual information of the data and their implicit information is selected as the data feature parameter to indicate the relevance between the data and their implicit information or the ability to infer the implicit information from the data. Therefore, preventive audits should be conducted based on changes in the data feature parameters before and after data sharing. The probability exchange adjustment method is proposed as the theoretical basis of preventive audits under simplified consumption, and the corresponding optimization models are constructed and extended to more practical scenarios with multivariate characteristics. Finally, case studies are used to validate the effectiveness of the proposed preventive audits.
△ Less
Submitted 5 May, 2024;
originally announced May 2024.
-
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
Authors:
Xuzheng Yu,
Chen Jiang,
Xingning Dong,
Tian Gan,
Ming Yang,
Qingpei Guo
Abstract:
The user base of short video apps has experienced unprecedented growth in recent years, resulting in a significant demand for video content analysis. In particular, text-video retrieval, which aims to find the top matching videos given text descriptions from a vast video corpus, is an essential function, the primary challenge of which is to bridge the modality gap. Nevertheless, most existing appr…
▽ More
The user base of short video apps has experienced unprecedented growth in recent years, resulting in a significant demand for video content analysis. In particular, text-video retrieval, which aims to find the top matching videos given text descriptions from a vast video corpus, is an essential function, the primary challenge of which is to bridge the modality gap. Nevertheless, most existing approaches treat texts merely as discrete tokens and neglect their syntax structures. Moreover, the abundant spatial and temporal clues in videos are often underutilized due to the lack of interaction with text. To address these issues, we argue that using texts as guidance to focus on relevant temporal frames and spatial regions within videos is beneficial. In this paper, we propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net) that exploits the inherent semantic and syntax hierarchy of texts to bridge the modality gap from two perspectives. First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions, to guide the visual representations. Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation. We evaluated our method on four public text-video retrieval datasets of MSR-VTT, MSVD, DiDeMo, and ActivityNet. The experimental results and ablation studies confirm the advantages of our proposed method.
△ Less
Submitted 6 May, 2024; v1 submitted 22 April, 2024;
originally announced April 2024.
-
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Authors:
Yuchi Wang,
Shuhuai Ren,
Rundong Gao,
Linli Yao,
Qingyan Guo,
Kaikai An,
Jianhong Bai,
Xu Sun
Abstract:
Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With…
▽ More
Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we identify the prior underperformance of diffusion models stemming from the absence of an effective latent space for image-text alignment, and the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
Efficiently Adversarial Examples Generation for Visual-Language Models under Targeted Transfer Scenarios using Diffusion Models
Authors:
Qi Guo,
Shanmin Pang,
Xiaojun Jia,
Qing Guo
Abstract:
Targeted transfer-based attacks involving adversarial examples pose a significant threat to large visual-language models (VLMs). However, the state-of-the-art (SOTA) transfer-based attacks incur high costs due to excessive iteration counts. Furthermore, the generated adversarial examples exhibit pronounced adversarial noise and demonstrate limited efficacy in evading defense methods such as DiffPu…
▽ More
Targeted transfer-based attacks involving adversarial examples pose a significant threat to large visual-language models (VLMs). However, the state-of-the-art (SOTA) transfer-based attacks incur high costs due to excessive iteration counts. Furthermore, the generated adversarial examples exhibit pronounced adversarial noise and demonstrate limited efficacy in evading defense methods such as DiffPure. To address these issues, inspired by score matching, we introduce AdvDiffVLM, which utilizes diffusion models to generate natural, unrestricted adversarial examples. Specifically, AdvDiffVLM employs Adaptive Ensemble Gradient Estimation to modify the score during the diffusion model's reverse generation process, ensuring the adversarial examples produced contain natural adversarial semantics and thus possess enhanced transferability. Simultaneously, to enhance the quality of adversarial examples further, we employ the GradCAM-guided Mask method to disperse adversarial semantics throughout the image, rather than concentrating them in a specific area. Experimental results demonstrate that our method achieves a speedup ranging from 10X to 30X compared to existing transfer-based attack methods, while maintaining superior quality of adversarial examples. Additionally, the generated adversarial examples possess strong transferability and exhibit increased robustness against adversarial defense methods. Notably, AdvDiffVLM can successfully attack commercial VLMs, including GPT-4V, in a black-box manner.
△ Less
Submitted 18 April, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Joint Near Field Uplink Communication and Localization Using Message Passing-Based Sparse Bayesian Learning
Authors:
Fei Liu,
Zhengdao Yuan,
Qinghua Guo,
Yuanyuan Zhang,
Zhongyong Wang,
J. Andrew Zhang
Abstract:
This work deals with the problem of uplink communication and localization in an integrated sensing and communication system, where users are in the near field (NF) of antenna aperture due to the use of high carrier frequency and large antenna arrays at base stations. We formulate joint NF signal detection and localization as a problem of recovering signals with a sparse pattern. To solve the probl…
▽ More
This work deals with the problem of uplink communication and localization in an integrated sensing and communication system, where users are in the near field (NF) of antenna aperture due to the use of high carrier frequency and large antenna arrays at base stations. We formulate joint NF signal detection and localization as a problem of recovering signals with a sparse pattern. To solve the problem, we develop a message passing based sparse Bayesian learning (SBL) algorithm, where multiple unitary approximate message passing (UAMP)-based sparse signal estimators work jointly to recover the sparse signals with low complexity. Simulation results demonstrate the effectiveness of the proposed method.
△ Less
Submitted 14 April, 2024;
originally announced April 2024.
-
LRR: Language-Driven Resamplable Continuous Representation against Adversarial Tracking Attacks
Authors:
Jianlang Chen,
Xuhong Ren,
Qing Guo,
Felix Juefei-Xu,
Di Lin,
Wei Feng,
Lei Ma,
Jianjun Zhao
Abstract:
Visual object tracking plays a critical role in visual-based autonomous systems, as it aims to estimate the position and size of the object of interest within a live video. Despite significant progress made in this field, state-of-the-art (SOTA) trackers often fail when faced with adversarial perturbations in the incoming frames. This can lead to significant robustness and security issues when the…
▽ More
Visual object tracking plays a critical role in visual-based autonomous systems, as it aims to estimate the position and size of the object of interest within a live video. Despite significant progress made in this field, state-of-the-art (SOTA) trackers often fail when faced with adversarial perturbations in the incoming frames. This can lead to significant robustness and security issues when these trackers are deployed in the real world. To achieve high accuracy on both clean and adversarial data, we propose building a spatial-temporal continuous representation using the semantic text guidance of the object of interest. This novel continuous representation enables us to reconstruct incoming frames to maintain semantic and appearance consistency with the object of interest and its clean counterparts. As a result, our proposed method successfully defends against different SOTA adversarial tracking attacks while maintaining high accuracy on clean data. In particular, our method significantly increases tracking accuracy under adversarial attacks with around 90% relative improvement on UAV123, which is even higher than the accuracy on clean data.
△ Less
Submitted 9 April, 2024;
originally announced April 2024.
-
Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving
Authors:
**long Li,
Baolu Li,
Zhengzhong Tu,
Xinyu Liu,
Qing Guo,
Felix Juefei-Xu,
Runsheng Xu,
Hongkai Yu
Abstract:
Vision-centric perception systems for autonomous driving have gained considerable attention recently due to their cost-effectiveness and scalability, especially compared to LiDAR-based systems. However, these systems often struggle in low-light conditions, potentially compromising their performance and safety. To address this, our paper introduces LightDiff, a domain-tailored framework designed to…
▽ More
Vision-centric perception systems for autonomous driving have gained considerable attention recently due to their cost-effectiveness and scalability, especially compared to LiDAR-based systems. However, these systems often struggle in low-light conditions, potentially compromising their performance and safety. To address this, our paper introduces LightDiff, a domain-tailored framework designed to enhance the low-light image quality for autonomous driving applications. Specifically, we employ a multi-condition controlled diffusion model. LightDiff works without any human-collected paired data, leveraging a dynamic data degradation process instead. It incorporates a novel multi-condition adapter that adaptively controls the input weights from different modalities, including depth maps, RGB images, and text captions, to effectively illuminate dark scenes while maintaining context consistency. Furthermore, to align the enhanced images with the detection model's knowledge, LightDiff employs perception-specific scores as rewards to guide the diffusion training process through reinforcement learning. Extensive experiments on the nuScenes datasets demonstrate that LightDiff can significantly improve the performance of several state-of-the-art 3D detectors in night-time conditions while achieving high visual quality scores, highlighting its potential to safeguard autonomous driving.
△ Less
Submitted 7 April, 2024;
originally announced April 2024.
-
CSST Strong Lensing Preparation: a Framework for Detecting Strong Lenses in the Multi-color Imaging Survey by the China Survey Space Telescope (CSST)
Authors:
Xu Li,
Ruiqi Sun,
Jiameng Lv,
Peng Jia,
Nan Li,
Chengliang Wei,
Zou Hu,
Xinzhong Er,
Yun Chen,
Zhang Ban,
Yuedong Fang,
Qi Guo,
Dezi Liu,
Guoliang Li,
Lin Lin,
Ming Li,
Ran Li,
Xiaobo Li,
Yu Luo,
Xianmin Meng,
Jundan Nie,
Zhaoxiang Qi,
Yisheng Qiu,
Li Shao,
Hao Tian
, et al. (7 additional authors not shown)
Abstract:
Strong gravitational lensing is a powerful tool for investigating dark matter and dark energy properties. With the advent of large-scale sky surveys, we can discover strong lensing systems on an unprecedented scale, which requires efficient tools to extract them from billions of astronomical objects. The existing mainstream lens-finding tools are based on machine learning algorithms and applied to…
▽ More
Strong gravitational lensing is a powerful tool for investigating dark matter and dark energy properties. With the advent of large-scale sky surveys, we can discover strong lensing systems on an unprecedented scale, which requires efficient tools to extract them from billions of astronomical objects. The existing mainstream lens-finding tools are based on machine learning algorithms and applied to cut-out-centered galaxies. However, according to the design and survey strategy of optical surveys by CSST, preparing cutouts with multiple bands requires considerable efforts. To overcome these challenges, we have developed a framework based on a hierarchical visual Transformer with a sliding window technique to search for strong lensing systems within entire images. Moreover, given that multi-color images of strong lensing systems can provide insights into their physical characteristics, our framework is specifically crafted to identify strong lensing systems in images with any number of channels. As evaluated using CSST mock data based on an Semi-Analytic Model named CosmoDC2, our framework achieves precision and recall rates of 0.98 and 0.90, respectively. To evaluate the effectiveness of our method in real observations, we have applied it to a subset of images from the DESI Legacy Imaging Surveys and media images from Euclid Early Release Observations. 61 new strong lensing system candidates are discovered by our method. However, we also identified false positives arising primarily from the simplified galaxy morphology assumptions within the simulation. This underscores the practical limitations of our approach while simultaneously highlighting potential avenues for future improvements.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
FT2Ra: A Fine-Tuning-Inspired Approach to Retrieval-Augmented Code Completion
Authors:
Qi Guo,
Xiaohong Li,
Xiaofei Xie,
Shangqing Liu,
Ze Tang,
Ruitao Feng,
Junjie Wang,
Jidong Ge,
Lei Bu
Abstract:
The rise of code pre-trained models has significantly enhanced various coding tasks, such as code completion, and tools like GitHub Copilot. However, the substantial size of these models, especially large models, poses a significant challenge when it comes to fine-tuning them for specific downstream tasks. As an alternative approach, retrieval-based methods have emerged as a promising solution, au…
▽ More
The rise of code pre-trained models has significantly enhanced various coding tasks, such as code completion, and tools like GitHub Copilot. However, the substantial size of these models, especially large models, poses a significant challenge when it comes to fine-tuning them for specific downstream tasks. As an alternative approach, retrieval-based methods have emerged as a promising solution, augmenting model predictions without the need for fine-tuning. Despite their potential, a significant challenge is that the designs of these methods often rely on heuristics, leaving critical questions about what information should be stored or retrieved and how to interpolate such information for augmenting predictions.
To tackle this challenge, we first perform a theoretical analysis of the fine-tuning process, highlighting the importance of delta logits as a catalyst for improving model predictions. Building on this insight, we develop a novel retrieval-based method, FT2Ra, which aims to mimic genuine fine-tuning. While FT2Ra adopts a retrieval-based mechanism, it uniquely adopts a paradigm with a learning rate and multi-epoch retrievals, which is similar to fine-tuning.In token-level completion, which represents a relatively easier task, FT2Ra achieves a 4.29% improvement in accuracy compared to the best baseline method on UniXcoder. In the more challenging line-level completion task, we observe a substantial more than twice increase in Exact Match (EM) performance, indicating the significant advantages of our theoretical analysis. Notably, even when operating without actual fine-tuning, FT2Ra exhibits competitive performance compared to the models with real fine-tuning.
△ Less
Submitted 1 April, 2024;
originally announced April 2024.
-
Force-EvT: A Closer Look at Robotic Gripper Force Measurement with Event-based Vision Transformer
Authors:
Qianyu Guo,
Ziqing Yu,
Jiaming Fu,
Yawen Lu,
Yahya Zweiri,
Dongming Gan
Abstract:
Robotic grippers are receiving increasing attention in various industries as essential components of robots for interacting and manipulating objects. While significant progress has been made in the past, conventional rigid grippers still have limitations in handling irregular objects and can damage fragile objects. We have shown that soft grippers offer deformability to adapt to a variety of objec…
▽ More
Robotic grippers are receiving increasing attention in various industries as essential components of robots for interacting and manipulating objects. While significant progress has been made in the past, conventional rigid grippers still have limitations in handling irregular objects and can damage fragile objects. We have shown that soft grippers offer deformability to adapt to a variety of object shapes and maximize object protection. At the same time, dynamic vision sensors (e.g., event-based cameras) are capable of capturing small changes in brightness and streaming them asynchronously as events, unlike RGB cameras, which do not perform well in low-light and fast-moving environments. In this paper, a dynamic-vision-based algorithm is proposed to measure the force applied to the gripper. In particular, we first set up a DVXplorer Lite series event camera to capture twenty-five sets of event data. Second, motivated by the impressive performance of the Vision Transformer (ViT) algorithm in dense image prediction tasks, we propose a new approach that demonstrates the potential for real-time force estimation and meets the requirements of real-world scenarios. We extensively evaluate the proposed algorithm on a wide range of scenarios and settings, and show that it consistently outperforms recent approaches.
△ Less
Submitted 1 April, 2024;
originally announced April 2024.
-
MCNet: A crowd denstity estimation network based on integrating multiscale attention module
Authors:
Qiang Guo,
Rubo Zhang,
Di Zhao
Abstract:
Aiming at the metro video surveillance system has not been able to effectively solve the metro crowd density estimation problem, a Metro Crowd density estimation Network (called MCNet) is proposed to automatically classify crowd density level of passengers. Firstly, an Integrating Multi-scale Attention (IMA) module is proposed to enhance the ability of the plain classifiers to extract semantic cro…
▽ More
Aiming at the metro video surveillance system has not been able to effectively solve the metro crowd density estimation problem, a Metro Crowd density estimation Network (called MCNet) is proposed to automatically classify crowd density level of passengers. Firstly, an Integrating Multi-scale Attention (IMA) module is proposed to enhance the ability of the plain classifiers to extract semantic crowd texture features to accommodate to the characteristics of the crowd texture feature. The innovation of the IMA module is to fuse the dilation convolution, multiscale feature extraction and attention mechanism to obtain multi-scale crowd feature activation from a larger receptive field with lower computational cost, and to strengthen the crowds activation state of convolutional features in top layers. Secondly, a novel lightweight crowd texture feature extraction network is proposed, which can directly process video frames and automatically extract texture features for crowd density estimation, while its faster image processing speed and fewer network parameters make it flexible to be deployed on embedded platforms with limited hardware resources. Finally, this paper integrates IMA module and the lightweight crowd texture feature extraction network to construct the MCNet, and validate the feasibility of this network on image classification dataset: Cifar10 and four crowd density datasets: PETS2009, Mall, QUT and SH_METRO to validate the MCNet whether can be a suitable solution for crowd density estimation in metro video surveillance where there are image processing challenges such as high density, high occlusion, perspective distortion and limited hardware resources.
△ Less
Submitted 29 March, 2024;
originally announced March 2024.
-
Generative Quanta Color Imaging
Authors:
Vishal Purohit,
Junjie Luo,
Yiheng Chi,
Qi Guo,
Stanley H. Chan,
Qiang Qiu
Abstract:
The astonishing development of single-photon cameras has created an unprecedented opportunity for scientific and industrial imaging. However, the high data throughput generated by these 1-bit sensors creates a significant bottleneck for low-power applications. In this paper, we explore the possibility of generating a color image from a single binary frame of a single-photon camera. We evidently fi…
▽ More
The astonishing development of single-photon cameras has created an unprecedented opportunity for scientific and industrial imaging. However, the high data throughput generated by these 1-bit sensors creates a significant bottleneck for low-power applications. In this paper, we explore the possibility of generating a color image from a single binary frame of a single-photon camera. We evidently find this problem being particularly difficult to standard colorization approaches due to the substantial degree of exposure variation. The core innovation of our paper is an exposure synthesis model framed under a neural ordinary differential equation (Neural ODE) that allows us to generate a continuum of exposures from a single observation. This innovation ensures consistent exposure in binary images that colorizers take on, resulting in notably enhanced colorization. We demonstrate applications of the method in single-image and burst colorization and show superior generative performance over baselines. Project website can be found at https://vishal-s-p.github.io/projects/2023/generative_quanta_color.html.
△ Less
Submitted 27 March, 2024;
originally announced March 2024.
-
CosalPure: Learning Concept from Group Images for Robust Co-Saliency Detection
Authors:
Jiayi Zhu,
Qing Guo,
Felix Juefei-Xu,
Yihao Huang,
Yang Liu,
Geguang Pu
Abstract:
Co-salient object detection (CoSOD) aims to identify the common and salient (usually in the foreground) regions across a given group of images. Although achieving significant progress, state-of-the-art CoSODs could be easily affected by some adversarial perturbations, leading to substantial accuracy reduction. The adversarial perturbations can mislead CoSODs but do not change the high-level semant…
▽ More
Co-salient object detection (CoSOD) aims to identify the common and salient (usually in the foreground) regions across a given group of images. Although achieving significant progress, state-of-the-art CoSODs could be easily affected by some adversarial perturbations, leading to substantial accuracy reduction. The adversarial perturbations can mislead CoSODs but do not change the high-level semantic information (e.g., concept) of the co-salient objects. In this paper, we propose a novel robustness enhancement framework by first learning the concept of the co-salient objects based on the input group images and then leveraging this concept to purify adversarial perturbations, which are subsequently fed to CoSODs for robustness enhancement. Specifically, we propose CosalPure containing two modules, i.e., group-image concept learning and concept-guided diffusion purification. For the first module, we adopt a pre-trained text-to-image diffusion model to learn the concept of co-salient objects within group images where the learned concept is robust to adversarial examples. For the second module, we map the adversarial image to the latent space and then perform diffusion generation by embedding the learned concept into the noise prediction function as an extra condition. Our method can effectively alleviate the influence of the SOTA adversarial attack containing different adversarial patterns, including exposure and noise. The extensive results demonstrate that our method could enhance the robustness of CoSODs significantly.
△ Less
Submitted 11 April, 2024; v1 submitted 27 March, 2024;
originally announced March 2024.
-
An Empirical Study of Training ID-Agnostic Multi-modal Sequential Recommenders
Authors:
Youhua Li,
Hanwen Du,
Yongxin Ni,
Yuanqi He,
Junchen Fu,
Xiangyan Liu,
Qi Guo
Abstract:
Sequential Recommendation (SR) aims to predict future user-item interactions based on historical interactions. While many SR approaches concentrate on user IDs and item IDs, the human perception of the world through multi-modal signals, like text and images, has inspired researchers to delve into constructing SR from multi-modal information without using IDs. However, the complexity of multi-modal…
▽ More
Sequential Recommendation (SR) aims to predict future user-item interactions based on historical interactions. While many SR approaches concentrate on user IDs and item IDs, the human perception of the world through multi-modal signals, like text and images, has inspired researchers to delve into constructing SR from multi-modal information without using IDs. However, the complexity of multi-modal learning manifests in diverse feature extractors, fusion methods, and pre-trained models. Consequently, designing a simple and universal \textbf{M}ulti-\textbf{M}odal \textbf{S}equential \textbf{R}ecommendation (\textbf{MMSR}) framework remains a formidable challenge. We systematically summarize the existing multi-modal related SR methods and distill the essence into four core components: visual encoder, text encoder, multimodal fusion module, and sequential architecture. Along these dimensions, we dissect the model designs, and answer the following sub-questions: First, we explore how to construct MMSR from scratch, ensuring its performance either on par with or exceeds existing SR methods without complex techniques. Second, we examine if MMSR can benefit from existing multi-modal pre-training paradigms. Third, we assess MMSR's capability in tackling common challenges like cold start and domain transferring. Our experiment results across four real-world recommendation scenarios demonstrate the great potential ID-agnostic multi-modal sequential recommendation. Our framework can be found at: https://github.com/MMSR23/MMSR.
△ Less
Submitted 30 March, 2024; v1 submitted 26 March, 2024;
originally announced March 2024.
-
InternLM2 Technical Report
Authors:
Zheng Cai,
Maosong Cao,
Haojiong Chen,
Kai Chen,
Keyu Chen,
Xin Chen,
Xun Chen,
Zehui Chen,
Zhi Chen,
Pei Chu,
Xiaoyi Dong,
Haodong Duan,
Qi Fan,
Zhaoye Fei,
Yang Gao,
Jiaye Ge,
Chenya Gu,
Yuzhe Gu,
Tao Gui,
Aijia Guo,
Qipeng Guo,
Conghui He,
Yingfan Hu,
Ting Huang,
Tao Jiang
, et al. (75 additional authors not shown)
Abstract:
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context m…
▽ More
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k ``Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.
△ Less
Submitted 25 March, 2024;
originally announced March 2024.
-
CT-Bound: Robust Boundary Detection From Noisy Images Via Hybrid Convolution and Transformer Neural Networks
Authors:
Wei Xu,
Junjie Luo,
Qi Guo
Abstract:
We present CT-Bound, a robust and fast boundary detection method for very noisy images using a hybrid Convolution and Transformer neural network. The proposed architecture decomposes boundary estimation into two tasks: local detection and global regularization. During the local detection, the model uses a convolutional architecture to predict the boundary structure of each image patch in the form…
▽ More
We present CT-Bound, a robust and fast boundary detection method for very noisy images using a hybrid Convolution and Transformer neural network. The proposed architecture decomposes boundary estimation into two tasks: local detection and global regularization. During the local detection, the model uses a convolutional architecture to predict the boundary structure of each image patch in the form of a pre-defined local boundary representation, the field-of-junctions (FoJ). Then, it uses a feed-forward transformer architecture to globally refine the boundary structures of each patch to generate an edge map and a smoothed color map simultaneously. Our quantitative analysis shows that CT-Bound outperforms the previous best algorithms in edge detection on very noisy images. It also increases the edge detection accuracy of FoJ-based methods while having a 3-time speed improvement. Finally, we demonstrate that CT-Bound can produce boundary and color maps on real captured images without extra fine-tuning and real-time boundary map and color map videos at ten frames per second.
△ Less
Submitted 25 June, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.