Search | arXiv e-print repository

arXiv:2406.19477 [pdf, other]

doi 10.3233/FAIA230444

Multi-agent Cooperative Games Using Belief Map Assisted Training

Authors: Qinwei Huang, Chen Luo, Alex B. Wu, Simon Khan, Hai Li, Qinru Qiu

Abstract: In a multi-agent system, agents share their local observations to gain global situational awareness for decision making and collaboration using a message passing system. When to send a message, how to encode a message, and how to leverage the received messages directly affect the effectiveness of the collaboration among agents. When training a multi-agent cooperative game using reinforcement learn… ▽ More In a multi-agent system, agents share their local observations to gain global situational awareness for decision making and collaboration using a message passing system. When to send a message, how to encode a message, and how to leverage the received messages directly affect the effectiveness of the collaboration among agents. When training a multi-agent cooperative game using reinforcement learning (RL), the message passing system needs to be optimized together with the agent policies. This consequently increases the model's complexity and poses significant challenges to the convergence and performance of learning. To address this issue, we propose the Belief-map Assisted Multi-agent System (BAMS), which leverages a neuro-symbolic belief map to enhance training. The belief map decodes the agent's hidden state to provide a symbolic representation of the agent's understanding of the environment and other agent's status. The simplicity of symbolic representation allows the gathering and comparison of the ground truth information with the belief, which provides an additional channel of feedback for the learning. Compared to the sporadic and delayed feedback coming from the reward in RL, the feedback from the belief map is more consistent and reliable. Agents using BAMS can learn a more effective message passing network to better understand each other, resulting in better performance in a cooperative predator and prey game with varying levels of map complexity and compare it to previous multi-agent message passing models. The simulation results showed that BAMS reduced training epochs by 66\%, and agents who apply the BAMS model completed the game with 34.62\% fewer steps on average. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Journal ref: ECAI 2023. IOS Press, 2023: 1617-1624

arXiv:2406.18569 [pdf, other]

FLOW: Fusing and Shuffling Global and Local Views for Cross-User Human Activity Recognition with IMUs

Authors: Qi Qiu, Tao Zhu, Furong Duan, Kevin I-Kai Wang, Liming Chen, Mingxing Nie, Mingxing Nie

Abstract: Inertial Measurement Unit (IMU) sensors are widely employed for Human Activity Recognition (HAR) due to their portability, energy efficiency, and growing research interest. However, a significant challenge for IMU-HAR models is achieving robust generalization performance across diverse users. This limitation stems from substantial variations in data distribution among individual users. One primary… ▽ More Inertial Measurement Unit (IMU) sensors are widely employed for Human Activity Recognition (HAR) due to their portability, energy efficiency, and growing research interest. However, a significant challenge for IMU-HAR models is achieving robust generalization performance across diverse users. This limitation stems from substantial variations in data distribution among individual users. One primary reason for this distribution disparity lies in the representation of IMU sensor data in the local coordinate system, which is susceptible to subtle user variations during IMU wearing. To address this issue, we propose a novel approach that extracts a global view representation based on the characteristics of IMU data, effectively alleviating the data distribution discrepancies induced by wearing styles. To validate the efficacy of the global view representation, we fed both global and local view data into model for experiments. The results demonstrate that global view data significantly outperforms local view data in cross-user experiments. Furthermore, we propose a Multi-view Supervised Network (MVFNet) based on Shuffling to effectively fuse local view and global view data. It supervises the feature extraction of each view through view division and view shuffling, so as to avoid the model ignoring important features as much as possible. Extensive experiments conducted on OPPORTUNITY and PAMAP2 datasets demonstrate that the proposed algorithm outperforms the current state-of-the-art methods in cross-user HAR. △ Less

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.13897 [pdf, other]

CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets

Authors: Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, **gyi Yu

Abstract: In the realm of digital creativity, our potential to craft intricate 3D worlds from imagination is often hampered by the limitations of existing digital tools, which demand extensive expertise and efforts. To narrow this disparity, we introduce CLAY, a 3D geometry and material generator designed to effortlessly transform human imagination into intricate 3D digital structures. CLAY supports classic… ▽ More In the realm of digital creativity, our potential to craft intricate 3D worlds from imagination is often hampered by the limitations of existing digital tools, which demand extensive expertise and efforts. To narrow this disparity, we introduce CLAY, a 3D geometry and material generator designed to effortlessly transform human imagination into intricate 3D digital structures. CLAY supports classic text or image inputs as well as 3D-aware controls from diverse primitives (multi-view images, voxels, bounding boxes, point clouds, implicit representations, etc). At its core is a large-scale generative model composed of a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT), to extract rich 3D priors directly from a diverse range of 3D geometries. Specifically, it adopts neural fields to represent continuous and complete surfaces and uses a geometry generative module with pure transformer blocks in latent space. We present a progressive training scheme to train CLAY on an ultra large 3D model dataset obtained through a carefully designed processing pipeline, resulting in a 3D native geometry generator with 1.5 billion parameters. For appearance generation, CLAY sets out to produce physically-based rendering (PBR) textures by employing a multi-view material diffusion model that can generate 2K resolution textures with diffuse, roughness, and metallic modalities. We demonstrate using CLAY for a range of controllable 3D asset creations, from sketchy conceptual designs to production ready assets with intricate details. Even first time users can easily use CLAY to bring their vivid 3D imaginations to life, unleashing unlimited creativity. △ Less

Submitted 30 May, 2024; originally announced June 2024.

Comments: Project page: https://sites.google.com/view/clay-3dlm Video: https://youtu.be/YcKFp4U2Voo

arXiv:2406.13568 [pdf, other]

Trapezoidal Gradient Descent for Effective Reinforcement Learning in Spiking Networks

Authors: Yuhao Pan, Xiucheng Wang, Nan Cheng, Qi Qiu

Abstract: With the rapid development of artificial intelligence technology, the field of reinforcement learning has continuously achieved breakthroughs in both theory and practice. However, traditional reinforcement learning algorithms often entail high energy consumption during interactions with the environment. Spiking Neural Network (SNN), with their low energy consumption characteristics and performance… ▽ More With the rapid development of artificial intelligence technology, the field of reinforcement learning has continuously achieved breakthroughs in both theory and practice. However, traditional reinforcement learning algorithms often entail high energy consumption during interactions with the environment. Spiking Neural Network (SNN), with their low energy consumption characteristics and performance comparable to deep neural networks, have garnered widespread attention. To reduce the energy consumption of practical applications of reinforcement learning, researchers have successively proposed the Pop-SAN and MDC-SAN algorithms. Nonetheless, these algorithms use rectangular functions to approximate the spike network during the training process, resulting in low sensitivity, thus indicating room for improvement in the training effectiveness of SNN. Based on this, we propose a trapezoidal approximation gradient method to replace the spike network, which not only preserves the original stable learning state but also enhances the model's adaptability and response sensitivity under various signal dynamics. Simulation results show that the improved algorithm, using the trapezoidal approximation gradient to replace the spike network, achieves better convergence speed and performance compared to the original algorithm and demonstrates good training stability. △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.02075 [pdf, other]

ReLU-KAN: New Kolmogorov-Arnold Networks that Only Need Matrix Addition, Dot Multiplication, and ReLU

Authors: Qi Qiu, Tao Zhu, Helin Gong, Liming Chen, Huansheng Ning

Abstract: Limited by the complexity of basis function (B-spline) calculations, Kolmogorov-Arnold Networks (KAN) suffer from restricted parallel computing capability on GPUs. This paper proposes a novel ReLU-KAN implementation that inherits the core idea of KAN. By adopting ReLU (Rectified Linear Unit) and point-wise multiplication, we simplify the design of KAN's basis function and optimize the computation… ▽ More Limited by the complexity of basis function (B-spline) calculations, Kolmogorov-Arnold Networks (KAN) suffer from restricted parallel computing capability on GPUs. This paper proposes a novel ReLU-KAN implementation that inherits the core idea of KAN. By adopting ReLU (Rectified Linear Unit) and point-wise multiplication, we simplify the design of KAN's basis function and optimize the computation process for efficient CUDA computing. The proposed ReLU-KAN architecture can be readily implemented on existing deep learning frameworks (e.g., PyTorch) for both inference and training. Experimental results demonstrate that ReLU-KAN achieves a 20x speedup compared to traditional KAN with 4-layer networks. Furthermore, ReLU-KAN exhibits a more stable training process with superior fitting ability while preserving the "catastrophic forgetting avoidance" property of KAN. You can get the code in https://github.com/quiqi/relu_kan △ Less

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2404.00879 [pdf, other]

Model-Agnostic Human Preference Inversion in Diffusion Models

Authors: Jeeyung Kim, Ze Wang, Qiang Qiu

Abstract: Efficient text-to-image generation remains a challenging task due to the high computational costs associated with the multi-step sampling in diffusion models. Although distillation of pre-trained diffusion models has been successful in reducing sampling steps, low-step image generation often falls short in terms of quality. In this study, we propose a novel sampling design to achieve high-quality… ▽ More Efficient text-to-image generation remains a challenging task due to the high computational costs associated with the multi-step sampling in diffusion models. Although distillation of pre-trained diffusion models has been successful in reducing sampling steps, low-step image generation often falls short in terms of quality. In this study, we propose a novel sampling design to achieve high-quality one-step image generation aligning with human preferences, particularly focusing on exploring the impact of the prior noise distribution. Our approach, Prompt Adaptive Human Preference Inversion (PAHI), optimizes the noise distributions for each prompt based on human preferences without the need for fine-tuning diffusion models. Our experiments showcase that the tailored noise distributions significantly improve image quality with only a marginal increase in computational cost. Our findings underscore the importance of noise optimization and pave the way for efficient and high-quality text-to-image synthesis. △ Less

Submitted 31 March, 2024; originally announced April 2024.

arXiv:2403.19066 [pdf, other]

Generative Quanta Color Imaging

Authors: Vishal Purohit, Junjie Luo, Yiheng Chi, Qi Guo, Stanley H. Chan, Qiang Qiu

Abstract: The astonishing development of single-photon cameras has created an unprecedented opportunity for scientific and industrial imaging. However, the high data throughput generated by these 1-bit sensors creates a significant bottleneck for low-power applications. In this paper, we explore the possibility of generating a color image from a single binary frame of a single-photon camera. We evidently fi… ▽ More The astonishing development of single-photon cameras has created an unprecedented opportunity for scientific and industrial imaging. However, the high data throughput generated by these 1-bit sensors creates a significant bottleneck for low-power applications. In this paper, we explore the possibility of generating a color image from a single binary frame of a single-photon camera. We evidently find this problem being particularly difficult to standard colorization approaches due to the substantial degree of exposure variation. The core innovation of our paper is an exposure synthesis model framed under a neural ordinary differential equation (Neural ODE) that allows us to generate a continuum of exposures from a single observation. This innovation ensures consistent exposure in binary images that colorizers take on, resulting in notably enhanced colorization. We demonstrate applications of the method in single-image and burst colorization and show superior generative performance over baselines. Project website can be found at https://vishal-s-p.github.io/projects/2023/generative_quanta_color.html. △ Less

Submitted 27 March, 2024; originally announced March 2024.

Comments: Accepted at IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

arXiv:2403.00269 [pdf, other]

Large Convolutional Model Tuning via Filter Subspace

Authors: Wei Chen, Zichen Miao, Qiang Qiu

Abstract: Efficient fine-tuning methods are critical to address the high computational and parameter complexity while adapting large pre-trained models to downstream tasks. Our study is inspired by prior research that represents each convolution filter as a linear combination of a small set of filter subspace elements, referred to as filter atoms. In this paper, we propose to fine-tune pre-trained models by… ▽ More Efficient fine-tuning methods are critical to address the high computational and parameter complexity while adapting large pre-trained models to downstream tasks. Our study is inspired by prior research that represents each convolution filter as a linear combination of a small set of filter subspace elements, referred to as filter atoms. In this paper, we propose to fine-tune pre-trained models by adjusting only filter atoms, which are responsible for spatial-only convolution, while preserving spatially-invariant channel combination knowledge in atom coefficients. In this way, we bring a new filter subspace view for model tuning. Furthermore, each filter atom can be recursively decomposed as a combination of another set of atoms, which naturally expands the number of tunable parameters in the filter subspace. By only adapting filter atoms constructed by a small number of parameters, while maintaining the rest of model parameters constant, the proposed approach is highly parameter-efficient. It effectively preserves the capabilities of pre-trained models and prevents overfitting to downstream tasks. Extensive experiments show that such a simple scheme surpasses previous tuning baselines for both discriminate and generative tasks. △ Less

Submitted 11 March, 2024; v1 submitted 29 February, 2024; originally announced March 2024.

arXiv:2402.11025 [pdf, other]

Training Bayesian Neural Networks with Sparse Subspace Variational Inference

Authors: Junbo Li, Zichen Miao, Qiang Qiu, Ruqi Zhang

Abstract: Bayesian neural networks (BNNs) offer uncertainty quantification but come with the downside of substantially increased training and inference costs. Sparse BNNs have been investigated for efficient inference, typically by either slowly introducing sparsity throughout the training or by post-training compression of dense BNNs. The dilemma of how to cut down massive training costs remains, particula… ▽ More Bayesian neural networks (BNNs) offer uncertainty quantification but come with the downside of substantially increased training and inference costs. Sparse BNNs have been investigated for efficient inference, typically by either slowly introducing sparsity throughout the training or by post-training compression of dense BNNs. The dilemma of how to cut down massive training costs remains, particularly given the requirement to learn about the uncertainty. To solve this challenge, we introduce Sparse Subspace Variational Inference (SSVI), the first fully sparse BNN framework that maintains a consistently highly sparse Bayesian model throughout the training and inference phases. Starting from a randomly initialized low-dimensional sparse subspace, our approach alternately optimizes the sparse subspace basis selection and its associated parameters. While basis selection is characterized as a non-differentiable problem, we approximate the optimal solution with a removal-and-addition strategy, guided by novel criteria based on weight distribution statistics. Our extensive experiments show that SSVI sets new benchmarks in crafting sparse BNNs, achieving, for instance, a 10-20x compression in model size with under 3\% performance drop, and up to 20x FLOPs reduction during training compared with dense VI training. Remarkably, SSVI also demonstrates enhanced robustness to hyperparameters, reducing the need for intricate tuning in VI and occasionally even surpassing VI-trained dense BNNs on both accuracy and uncertainty metrics. △ Less

Submitted 16 February, 2024; originally announced February 2024.

Journal ref: Published at International Conference on Learning Representations (ICLR) 2024

arXiv:2402.08936 [pdf, other]

Predictive Temporal Attention on Event-based Video Stream for Energy-efficient Situation Awareness

Authors: Yiming Bu, Jiayang Liu, Qinru Qiu

Abstract: The Dynamic Vision Sensor (DVS) is an innovative technology that efficiently captures and encodes visual information in an event-driven manner. By combining it with event-driven neuromorphic processing, the sparsity in DVS camera output can result in high energy efficiency. However, similar to many embedded systems, the off-chip communication between the camera and processor presents a bottleneck… ▽ More The Dynamic Vision Sensor (DVS) is an innovative technology that efficiently captures and encodes visual information in an event-driven manner. By combining it with event-driven neuromorphic processing, the sparsity in DVS camera output can result in high energy efficiency. However, similar to many embedded systems, the off-chip communication between the camera and processor presents a bottleneck in terms of power consumption. Inspired by the predictive coding model and expectation suppression phenomenon found in human brain, we propose a temporal attention mechanism to throttle the camera output and pay attention to it only when the visual events cannot be well predicted. The predictive attention not only reduces power consumption in the sensor-processor interface but also effectively decreases the computational workload by filtering out noisy events. We demonstrate that the predictive attention can reduce 46.7% of data communication between the camera and the processor and reduce 43.8% computation activities in the processor. △ Less

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2401.13076 [pdf, other]

SemanticSLAM: Learning based Semantic Map Construction and Robust Camera Localization

Authors: Mingyang Li, Yue Ma, Qinru Qiu

Abstract: Current techniques in Visual Simultaneous Localization and Map** (VSLAM) estimate camera displacement by comparing image features of consecutive scenes. These algorithms depend on scene continuity, hence requires frequent camera inputs. However, processing images frequently can lead to significant memory usage and computation overhead. In this study, we introduce SemanticSLAM, an end-to-end visu… ▽ More Current techniques in Visual Simultaneous Localization and Map** (VSLAM) estimate camera displacement by comparing image features of consecutive scenes. These algorithms depend on scene continuity, hence requires frequent camera inputs. However, processing images frequently can lead to significant memory usage and computation overhead. In this study, we introduce SemanticSLAM, an end-to-end visual-inertial odometry system that utilizes semantic features extracted from an RGB-D sensor. This approach enables the creation of a semantic map of the environment and ensures reliable camera localization. SemanticSLAM is scene-agnostic, which means it doesn't require retraining for different environments. It operates effectively in indoor settings, even with infrequent camera input, without prior knowledge. The strength of SemanticSLAM lies in its ability to gradually refine the semantic map and improve pose estimation. This is achieved by a convolutional long-short-term-memory (ConvLSTM) network, trained to correct errors during map construction. Compared to existing VSLAM algorithms, SemanticSLAM improves pose estimation by 17%. The resulting semantic map provides interpretable information about the environment and can be easily applied to various downstream tasks, such as path planning, obstacle avoidance, and robot navigation. The code will be publicly available at https://github.com/Leomingyangli/SemanticSLAM △ Less

Submitted 23 January, 2024; originally announced January 2024.

Comments: 2023 IEEE Symposium Series on Computational Intelligence (SSCI) 6 pages

Journal ref: 2023 IEEE Symposium Series on Computational Intelligence (SSCI)

arXiv:2401.09258 [pdf, other]

An Efficient Generalizable Framework for Visuomotor Policies via Control-aware Augmentation and Privilege-guided Distillation

Authors: Yinuo Zhao, Kun Wu, Tianjiao Yi, Zhiyuan Xu, Xiaozhu Ju, Zheng** Che, Qinru Qiu, Chi Harold Liu, Jian Tang

Abstract: Visuomotor policies, which learn control mechanisms directly from high-dimensional visual observations, confront challenges in adapting to new environments with intricate visual variations. Data augmentation emerges as a promising method for bridging these generalization gaps by enriching data variety. However, straightforwardly augmenting the entire observation shall impose excessive burdens on p… ▽ More Visuomotor policies, which learn control mechanisms directly from high-dimensional visual observations, confront challenges in adapting to new environments with intricate visual variations. Data augmentation emerges as a promising method for bridging these generalization gaps by enriching data variety. However, straightforwardly augmenting the entire observation shall impose excessive burdens on policy learning and may even result in performance degradation. In this paper, we propose to improve the generalization ability of visuomotor policies as well as preserve training stability from two aspects: 1) We learn a control-aware mask through a self-supervised reconstruction task with three auxiliary losses and then apply strong augmentation only to those control-irrelevant regions based on the mask to reduce the generalization gaps. 2) To address training instability issues prevalent in visual reinforcement learning (RL), we distill the knowledge from a pretrained RL expert processing low-level environment states, to the student visuomotor policy. The policy is subsequently deployed to unseen environments without any further finetuning. We conducted comparison and ablation studies across various benchmarks: the DMControl Generalization Benchmark (DMC-GB), the enhanced Robot Manipulation Distraction Benchmark (RMDB), and a specialized long-horizontal drawer-opening robotic task. The extensive experimental results well demonstrate the effectiveness of our method, e.g., showing a 17\% improvement over previous methods in the video-hard setting of DMC-GB. △ Less

Submitted 17 January, 2024; originally announced January 2024.

arXiv:2401.08957 [pdf, other]

SWBT: Similarity Weighted Behavior Transformer with the Imperfect Demonstration for Robotic Manipulation

Authors: Kun Wu, Ning Liu, Zhen Zhao, Di Qiu, **ming Li, Zheng** Che, Zhiyuan Xu, Qinru Qiu, Jian Tang

Abstract: Imitation learning (IL), aiming to learn optimal control policies from expert demonstrations, has been an effective method for robot manipulation tasks. However, previous IL methods either only use expensive expert demonstrations and omit imperfect demonstrations or rely on interacting with the environment and learning from online experiences. In the context of robotic manipulation, we aim to conq… ▽ More Imitation learning (IL), aiming to learn optimal control policies from expert demonstrations, has been an effective method for robot manipulation tasks. However, previous IL methods either only use expensive expert demonstrations and omit imperfect demonstrations or rely on interacting with the environment and learning from online experiences. In the context of robotic manipulation, we aim to conquer the above two challenges and propose a novel framework named Similarity Weighted Behavior Transformer (SWBT). SWBT effectively learn from both expert and imperfect demonstrations without interaction with environments. We reveal that the easy-to-get imperfect demonstrations, such as forward and inverse dynamics, significantly enhance the network by learning fruitful information. To the best of our knowledge, we are the first to attempt to integrate imperfect demonstrations into the offline imitation learning setting for robot manipulation tasks. Extensive experiments on the ManiSkill2 benchmark built on the high-fidelity Sapien simulator and real-world robotic manipulation tasks demonstrated that the proposed method can extract better features and improve the success rates for all tasks. Our code will be released upon acceptance of the paper. △ Less

Submitted 16 January, 2024; originally announced January 2024.

Comments: 8 pages, 5 figures

ACM Class: I.2.9

arXiv:2312.12721 [pdf, other]

Cross-Modal Reasoning with Event Correlation for Video Question Answering

Authors: Chengxiang Yin, Zheng** Che, Kun Wu, Zhiyuan Xu, Qinru Qiu, Jian Tang

Abstract: Video Question Answering (VideoQA) is a very attractive and challenging research direction aiming to understand complex semantics of heterogeneous data from two domains, i.e., the spatio-temporal video content and the word sequence in question. Although various attention mechanisms have been utilized to manage contextualized representations by modeling intra- and inter-modal relationships of the t… ▽ More Video Question Answering (VideoQA) is a very attractive and challenging research direction aiming to understand complex semantics of heterogeneous data from two domains, i.e., the spatio-temporal video content and the word sequence in question. Although various attention mechanisms have been utilized to manage contextualized representations by modeling intra- and inter-modal relationships of the two modalities, one limitation of the predominant VideoQA methods is the lack of reasoning with event correlation, that is, sensing and analyzing relationships among abundant and informative events contained in the video. In this paper, we introduce the dense caption modality as a new auxiliary and distill event-correlated information from it to infer the correct answer. To this end, we propose a novel end-to-end trainable model, Event-Correlated Graph Neural Networks (EC-GNNs), to perform cross-modal reasoning over information from the three modalities (i.e., caption, video, and question). Besides the exploitation of a brand new modality, we employ cross-modal reasoning modules for explicitly modeling inter-modal relationships and aggregating relevant information across different modalities, and we propose a question-guided self-adaptive multi-modal fusion module to collect the question-oriented and event-correlated evidence through multi-step reasoning. We evaluate our model on two widely-used benchmark datasets and conduct an ablation study to justify the effectiveness of each proposed component. △ Less

Submitted 19 December, 2023; originally announced December 2023.

arXiv:2312.11933 [pdf, other]

Dynamic Frequency Domain Graph Convolutional Network for Traffic Forecasting

Authors: Yujie Li, Zezhi Shao, Yongjun Xu, Qiang Qiu, Zhaogang Cao, Fei Wang

Abstract: Complex spatial dependencies in transportation networks make traffic prediction extremely challenging. Much existing work is devoted to learning dynamic graph structures among sensors, and the strategy of mining spatial dependencies from traffic data, known as data-driven, tends to be an intuitive and effective approach. However, Time-Shift of traffic patterns and noise induced by random factors h… ▽ More Complex spatial dependencies in transportation networks make traffic prediction extremely challenging. Much existing work is devoted to learning dynamic graph structures among sensors, and the strategy of mining spatial dependencies from traffic data, known as data-driven, tends to be an intuitive and effective approach. However, Time-Shift of traffic patterns and noise induced by random factors hinder data-driven spatial dependence modeling. In this paper, we propose a novel dynamic frequency domain graph convolution network (DFDGCN) to capture spatial dependencies. Specifically, we mitigate the effects of time-shift by Fourier transform, and introduce the identity embedding of sensors and time embedding when capturing data for graph learning since traffic data with noise is not entirely reliable. The graph is combined with static predefined and self-adaptive graphs during graph convolution to predict future traffic data through classical causal convolutions. Extensive experiments on four real-world datasets demonstrate that our model is effective and outperforms the baselines. △ Less

Submitted 19 December, 2023; originally announced December 2023.

arXiv:2311.00240 [pdf, other]

Intell-dragonfly: A Cybersecurity Attack Surface Generation Engine Based On Artificial Intelligence-generated Content Technology

Authors: Xingchen Wu, Qin Qiu, Jiaqi Li, Yang Zhao

Abstract: With the rapid development of the Internet, cyber security issues have become increasingly prominent. Traditional cyber security defense methods are limited in the face of ever-changing threats, so it is critical to seek innovative attack surface generation methods. This study proposes Intell-dragonfly, a cyber security attack surface generation engine based on artificial intelligence generation t… ▽ More With the rapid development of the Internet, cyber security issues have become increasingly prominent. Traditional cyber security defense methods are limited in the face of ever-changing threats, so it is critical to seek innovative attack surface generation methods. This study proposes Intell-dragonfly, a cyber security attack surface generation engine based on artificial intelligence generation technology, to meet the challenges of cyber security. Based on ChatGPT technology, this paper designs an automated attack surface generation process, which can generate diversified and personalized attack scenarios, targets, elements and schemes. Through experiments in a real network environment, the effect of the engine is verified and compared with traditional methods, which improves the authenticity and applicability of the attack surface. The experimental results show that the ChatGPT-based method has significant advantages in the accuracy, diversity and operability of attack surface generation. Furthermore, we explore the strengths and limitations of the engine and discuss its potential applications in the field of cyber security. This research provides a novel approach to the field of cyber security that is expected to have a positive impact on defense and prevention of cyberthreats. △ Less

Submitted 31 October, 2023; originally announced November 2023.

Comments: 17 pages, 7 figures, 2 tables

arXiv:2310.13295 [pdf, other]

PathRL: An End-to-End Path Generation Method for Collision Avoidance via Deep Reinforcement Learning

Authors: Wenhao Yu, Jie Peng, Quecheng Qiu, Hanyu Wang, Lu Zhang, Jianmin Ji

Abstract: Robot navigation using deep reinforcement learning (DRL) has shown great potential in improving the performance of mobile robots. Nevertheless, most existing DRL-based navigation methods primarily focus on training a policy that directly commands the robot with low-level controls, like linear and angular velocities, which leads to unstable speeds and unsmooth trajectories of the robot during the l… ▽ More Robot navigation using deep reinforcement learning (DRL) has shown great potential in improving the performance of mobile robots. Nevertheless, most existing DRL-based navigation methods primarily focus on training a policy that directly commands the robot with low-level controls, like linear and angular velocities, which leads to unstable speeds and unsmooth trajectories of the robot during the long-term execution. An alternative method is to train a DRL policy that outputs the navigation path directly. However, two roadblocks arise for training a DRL policy that outputs paths: (1) The action space for potential paths often involves higher dimensions comparing to low-level commands, which increases the difficulties of training; (2) It takes multiple time steps to track a path instead of a single time step, which requires the path to predicate the interactions of the robot w.r.t. the dynamic environment in multiple time steps. This, in turn, amplifies the challenges associated with training. In response to these challenges, we propose PathRL, a novel DRL method that trains the policy to generate the navigation path for the robot. Specifically, we employ specific action space discretization techniques and tailored state space representation methods to address the associated challenges. In our experiments, PathRL achieves better success rates and reduces angular rotation variability compared to other DRL navigation methods, facilitating stable and smooth robot movement. We demonstrate the competitive edge of PathRL in both real-world scenarios and multiple challenging simulation environments. △ Less

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.08370 [pdf, other]

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving

Authors: Honghui Yang, Sha Zhang, Di Huang, Xiaoyang Wu, Haoyi Zhu, Tong He, Shixiang Tang, Hengshuang Zhao, Qibo Qiu, Binbin Lin, Xiaofei He, Wanli Ouyang

Abstract: In the context of autonomous driving, the significance of effective feature learning is widely acknowledged. While conventional 3D self-supervised pre-training methods have shown widespread success, most methods follow the ideas originally designed for 2D images. In this paper, we present UniPAD, a novel self-supervised learning paradigm applying 3D volumetric differentiable rendering. UniPAD impl… ▽ More In the context of autonomous driving, the significance of effective feature learning is widely acknowledged. While conventional 3D self-supervised pre-training methods have shown widespread success, most methods follow the ideas originally designed for 2D images. In this paper, we present UniPAD, a novel self-supervised learning paradigm applying 3D volumetric differentiable rendering. UniPAD implicitly encodes 3D space, facilitating the reconstruction of continuous 3D shape structures and the intricate appearance characteristics of their 2D projections. The flexibility of our method enables seamless integration into both 2D and 3D frameworks, enabling a more holistic comprehension of the scenes. We manifest the feasibility and effectiveness of UniPAD by conducting extensive experiments on various downstream 3D tasks. Our method significantly improves lidar-, camera-, and lidar-camera-based baseline by 9.1, 7.7, and 6.9 NDS, respectively. Notably, our pre-training pipeline achieves 73.2 NDS for 3D object detection and 79.4 mIoU for 3D semantic segmentation on the nuScenes validation set, achieving state-of-the-art results in comparison with previous methods. The code will be available at https://github.com/Nightmare-n/UniPAD. △ Less

Submitted 7 April, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

Comments: CVPR2024

arXiv:2309.13235 [pdf, other]

M$^3$CS: Multi-Target Masked Point Modeling with Learnable Codebook and Siamese Decoders

Authors: Qibo Qiu, Honghui Yang, Wenxiao Wang, Shun Zhang, Haiming Gao, Haochao Ying, Wei Hua, Xiaofei He

Abstract: Masked point modeling has become a promising scheme of self-supervised pre-training for point clouds. Existing methods reconstruct either the original points or related features as the objective of pre-training. However, considering the diversity of downstream tasks, it is necessary for the model to have both low- and high-level representation modeling capabilities to capture geometric details and… ▽ More Masked point modeling has become a promising scheme of self-supervised pre-training for point clouds. Existing methods reconstruct either the original points or related features as the objective of pre-training. However, considering the diversity of downstream tasks, it is necessary for the model to have both low- and high-level representation modeling capabilities to capture geometric details and semantic contexts during pre-training. To this end, M$^3$CS is proposed to enable the model with the above abilities. Specifically, with masked point cloud as input, M$^3$CS introduces two decoders to predict masked representations and the original points simultaneously. While an extra decoder doubles parameters for the decoding process and may lead to overfitting, we propose siamese decoders to keep the amount of learnable parameters unchanged. Further, we propose an online codebook projecting continuous tokens into discrete ones before reconstructing masked points. In such way, we can enforce the decoder to take effect through the combinations of tokens rather than remembering each token. Comprehensive experiments show that M$^3$CS achieves superior performance at both classification and segmentation tasks, outperforming existing methods. △ Less

Submitted 22 September, 2023; originally announced September 2023.

arXiv:2308.04118 [pdf, other]

Multimodal Color Recommendation in Vector Graphic Documents

Authors: Qianru Qiu, Xueting Wang, Mayu Otani

Abstract: Color selection plays a critical role in graphic document design and requires sufficient consideration of various contexts. However, recommending appropriate colors which harmonize with the other colors and textual contexts in documents is a challenging task, even for experienced designers. In this study, we propose a multimodal masked color model that integrates both color and textual contexts to… ▽ More Color selection plays a critical role in graphic document design and requires sufficient consideration of various contexts. However, recommending appropriate colors which harmonize with the other colors and textual contexts in documents is a challenging task, even for experienced designers. In this study, we propose a multimodal masked color model that integrates both color and textual contexts to provide text-aware color recommendation for graphic documents. Our proposed model comprises self-attention networks to capture the relationships between colors in multiple palettes, and cross-attention networks that incorporate both color and CLIP-based text representations. Our proposed method primarily focuses on color palette completion, which recommends colors based on the given colors and text. Additionally, it is applicable for another color recommendation task, full palette generation, which generates a complete color palette corresponding to the given text. Experimental results demonstrate that our proposed approach surpasses previous color palette completion methods on accuracy, color distribution, and user experience, as well as full palette generation methods concerning color diversity and similarity to the ground truth palettes. △ Less

Submitted 8 August, 2023; originally announced August 2023.

Comments: Accepted to ACM MM 2023

arXiv:2308.00287 [pdf, other]

A Study of Unsupervised Evaluation Metrics for Practical and Automatic Domain Adaptation

Authors: Minghao Chen, Zepeng Gao, Shuai Zhao, Qibo Qiu, Wenxiao Wang, Binbin Lin, Xiaofei He

Abstract: Unsupervised domain adaptation (UDA) methods facilitate the transfer of models to target domains without labels. However, these methods necessitate a labeled target validation set for hyper-parameter tuning and model selection. In this paper, we aim to find an evaluation metric capable of assessing the quality of a transferred model without access to target validation labels. We begin with the met… ▽ More Unsupervised domain adaptation (UDA) methods facilitate the transfer of models to target domains without labels. However, these methods necessitate a labeled target validation set for hyper-parameter tuning and model selection. In this paper, we aim to find an evaluation metric capable of assessing the quality of a transferred model without access to target validation labels. We begin with the metric based on mutual information of the model prediction. Through empirical analysis, we identify three prevalent issues with this metric: 1) It does not account for the source structure. 2) It can be easily attacked. 3) It fails to detect negative transfer caused by the over-alignment of source and target features. To address the first two issues, we incorporate source accuracy into the metric and employ a new MLP classifier that is held out during training, significantly improving the result. To tackle the final issue, we integrate this enhanced metric with data augmentation, resulting in a novel unsupervised UDA metric called the Augmentation Consistency Metric (ACM). Additionally, we empirically demonstrate the shortcomings of previous experiment settings and conduct large-scale experiments to validate the effectiveness of our proposed metric. Furthermore, we employ our metric to automatically search for the optimal hyper-parameter set, achieving superior performance compared to manually tuned sets across four common benchmarks. Codes will be available soon. △ Less

Submitted 18 September, 2023; v1 submitted 1 August, 2023; originally announced August 2023.

arXiv:2307.11314 [pdf]

Neuromorphic Online Learning for Spatiotemporal Patterns with a Forward-only Timeline

Authors: Zhenhang Zhang, **gang **, Haowen Fang, Qinru Qiu

Abstract: Spiking neural networks (SNNs) are bio-plausible computing models with high energy efficiency. The temporal dynamics of neurons and synapses enable them to detect temporal patterns and generate sequences. While Backpropagation Through Time (BPTT) is traditionally used to train SNNs, it is not suitable for online learning of embedded applications due to its high computation and memory cost as well… ▽ More Spiking neural networks (SNNs) are bio-plausible computing models with high energy efficiency. The temporal dynamics of neurons and synapses enable them to detect temporal patterns and generate sequences. While Backpropagation Through Time (BPTT) is traditionally used to train SNNs, it is not suitable for online learning of embedded applications due to its high computation and memory cost as well as extended latency. Previous works have proposed online learning algorithms, but they often utilize highly simplified spiking neuron models without synaptic dynamics and reset feedback, resulting in subpar performance. In this work, we present Spatiotemporal Online Learning for Synaptic Adaptation (SOLSA), specifically designed for online learning of SNNs composed of Leaky Integrate and Fire (LIF) neurons with exponentially decayed synapses and soft reset. The algorithm not only learns the synaptic weight but also adapts the temporal filters associated to the synapses. Compared to the BPTT algorithm, SOLSA has much lower memory requirement and achieves a more balanced temporal workload distribution. Moreover, SOLSA incorporates enhancement techniques such as scheduled weight update, early stop training and adaptive synapse filter, which speed up the convergence and enhance the learning performance. When compared to other non-BPTT based SNN learning, SOLSA demonstrates an average learning accuracy improvement of 14.2%. Furthermore, compared to BPTT, SOLSA achieves a 5% higher average learning accuracy with a 72% reduction in memory cost. △ Less

Submitted 20 July, 2023; originally announced July 2023.

Comments: 9 pages,8 figures

arXiv:2306.01205 [pdf, other]

SelFLoc: Selective Feature Fusion for Large-scale Point Cloud-based Place Recognition

Authors: Qibo Qiu, Haiming Gao, Wenxiao Wang, Zhiyi Su, Tian Xie, Wei Hua, Xiaofei He

Abstract: Point cloud-based place recognition is crucial for mobile robots and autonomous vehicles, especially when the global positioning sensor is not accessible. LiDAR points are scattered on the surface of objects and buildings, which have strong shape priors along different axes. To enhance message passing along particular axes, Stacked Asymmetric Convolution Block (SACB) is designed, which is one of t… ▽ More Point cloud-based place recognition is crucial for mobile robots and autonomous vehicles, especially when the global positioning sensor is not accessible. LiDAR points are scattered on the surface of objects and buildings, which have strong shape priors along different axes. To enhance message passing along particular axes, Stacked Asymmetric Convolution Block (SACB) is designed, which is one of the main contributions in this paper. Comprehensive experiments demonstrate that asymmetric convolution and its corresponding strategies employed by SACB can contribute to the more effective representation of point cloud feature. On this basis, Selective Feature Fusion Block (SFFB), which is formed by stacking point- and channel-wise gating layers in a predefined sequence, is proposed to selectively boost salient local features in certain key regions, as well as to align the features before fusion phase. SACBs and SFFBs are combined to construct a robust and accurate architecture for point cloud-based place recognition, which is termed SelFLoc. Comparative experimental results show that SelFLoc achieves the state-of-the-art (SOTA) performance on the Oxford and other three in-house benchmarks with an improvement of 1.6 absolute percentages on mean average recall@1. △ Less

Submitted 5 June, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

arXiv:2304.04820 [pdf, other]

Binary Latent Diffusion

Authors: Ze Wang, Jiang Wang, Zicheng Liu, Qiang Qiu

Abstract: In this paper, we show that a binary latent space can be explored for compact yet expressive image representations. We model the bi-directional map**s between an image and the corresponding latent binary representation by training an auto-encoder with a Bernoulli encoding distribution. On the one hand, the binary latent space provides a compact discrete image representation of which the distribu… ▽ More In this paper, we show that a binary latent space can be explored for compact yet expressive image representations. We model the bi-directional map**s between an image and the corresponding latent binary representation by training an auto-encoder with a Bernoulli encoding distribution. On the one hand, the binary latent space provides a compact discrete image representation of which the distribution can be modeled more efficiently than pixels or continuous latent representations. On the other hand, we now represent each image patch as a binary vector instead of an index of a learned cookbook as in discrete image representations with vector quantization. In this way, we obtain binary latent representations that allow for better image quality and high-resolution image representations without any multi-stage hierarchy in the latent space. In this binary latent space, images can now be generated effectively using a binary latent diffusion model tailored specifically for modeling the prior over the binary image representations. We present both conditional and unconditional image generation experiments with multiple datasets, and show that the proposed method performs comparably to state-of-the-art methods while dramatically improving the sampling efficiency to as few as 16 steps without using any test-time acceleration. The proposed framework can also be seamlessly scaled to $1024 \times 1024$ high-resolution image generation without resorting to latent hierarchy or multi-stage refinements. △ Less

Submitted 10 April, 2023; originally announced April 2023.

arXiv:2304.03117 [pdf, other]

DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance

Authors: Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang, Lan Xu, **gyi Yu

Abstract: Emerging Metaverse applications demand accessible, accurate, and easy-to-use tools for 3D digital human creations in order to depict different cultures and societies as if in the physical world. Recent large-scale vision-language advances pave the way to for novices to conveniently customize 3D content. However, the generated CG-friendly assets still cannot represent the desired facial traits for… ▽ More Emerging Metaverse applications demand accessible, accurate, and easy-to-use tools for 3D digital human creations in order to depict different cultures and societies as if in the physical world. Recent large-scale vision-language advances pave the way to for novices to conveniently customize 3D content. However, the generated CG-friendly assets still cannot represent the desired facial traits for human characteristics. In this paper, we present DreamFace, a progressive scheme to generate personalized 3D faces under text guidance. It enables layman users to naturally customize 3D facial assets that are compatible with CG pipelines, with desired shapes, textures, and fine-grained animation capabilities. From a text input to describe the facial traits, we first introduce a coarse-to-fine scheme to generate the neutral facial geometry with a unified topology. We employ a selection strategy in the CLIP embedding space, and subsequently optimize both the details displacements and normals using Score Distillation Sampling from generic Latent Diffusion Model. Then, for neutral appearance generation, we introduce a dual-path mechanism, which combines the generic LDM with a novel texture LDM to ensure both the diversity and textural specification in the UV space. We also employ a two-stage optimization to perform SDS in both the latent and image spaces to significantly provides compact priors for fine-grained synthesis. Our generated neutral assets naturally support blendshapes-based facial animations. We further improve the animation ability with personalized deformation characteristics by learning the universal expression prior using the cross-identity hypernetwork. Notably, DreamFace can generate of realistic 3D facial assets with physically-based rendering quality and rich animation ability from video footage, even for fashion icons or exotic characters in cartoons and fiction movies. △ Less

Submitted 1 April, 2023; originally announced April 2023.

Comments: Go to DreamFace project page https://sites.google.com/view/dreamface watch our video at https://youtu.be/yCuvzgGMvPM and experience DreamFace online at https://hyperhuman.top

arXiv:2303.12354 [pdf, other]

Deep Reinforcement Learning for Localizability-Enhanced Navigation in Dynamic Human Environments

Authors: Yuan Chen, Quecheng Qiu, Xiangyu Liu, Guangda Chen, Shunyi Yao, Jie Peng, Jianmin Ji, Yanyong Zhang

Abstract: Reliable localization is crucial for autonomous robots to navigate efficiently and safely. Some navigation methods can plan paths with high localizability (which describes the capability of acquiring reliable localization). By following these paths, the robot can access the sensor streams that facilitate more accurate location estimation results by the localization algorithms. However, most of the… ▽ More Reliable localization is crucial for autonomous robots to navigate efficiently and safely. Some navigation methods can plan paths with high localizability (which describes the capability of acquiring reliable localization). By following these paths, the robot can access the sensor streams that facilitate more accurate location estimation results by the localization algorithms. However, most of these methods require prior knowledge and struggle to adapt to unseen scenarios or dynamic changes. To overcome these limitations, we propose a novel approach for localizability-enhanced navigation via deep reinforcement learning in dynamic human environments. Our proposed planner automatically extracts geometric features from 2D laser data that are helpful for localization. The planner learns to assign different importance to the geometric features and encourages the robot to navigate through areas that are helpful for laser localization. To facilitate the learning of the planner, we suggest two techniques: (1) an augmented state representation that considers the dynamic changes and the confidence of the localization results, which provides more information and allows the robot to make better decisions, (2) a reward metric that is capable to offer both sparse and dense feedback on behaviors that affect localization accuracy. Our method exhibits significant improvements in lost rate and arrival rate when tested in previously unseen environments. △ Less

Submitted 22 March, 2023; originally announced March 2023.

arXiv:2303.06908 [pdf, other]

CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention

Authors: Wenxiao Wang, Wei Chen, Qibo Qiu, Long Chen, Boxi Wu, Binbin Lin, Xiaofei He, Wei Liu

Abstract: While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different… ▽ More While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the tokens. Moreover, through experiments on CrossFormer, we observe another two issues that affect vision transformers' performance, i.e., the enlarging self-attention maps and amplitude explosion. Thus, we further propose a progressive group size (PGS) paradigm and an amplitude cooling layer (ACL) to alleviate the two issues, respectively. The CrossFormer incorporating with PGS and ACL is called CrossFormer++. Extensive experiments show that CrossFormer++ outperforms the other vision transformers on image classification, object detection, instance segmentation, and semantic segmentation tasks. The code will be available at: https://github.com/cheerss/CrossFormer. △ Less

Submitted 30 November, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

Comments: 16 pages, 7 figures

arXiv:2302.14290 [pdf, other]

Learning to Retain while Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation

Authors: Gaurav Patel, Konda Reddy Mopuri, Qiang Qiu

Abstract: Data-free Knowledge Distillation (DFKD) has gained popularity recently, with the fundamental idea of carrying out knowledge transfer from a Teacher neural network to a Student neural network in the absence of training data. However, in the Adversarial DFKD framework, the student network's accuracy, suffers due to the non-stationary distribution of the pseudo-samples under multiple generator update… ▽ More Data-free Knowledge Distillation (DFKD) has gained popularity recently, with the fundamental idea of carrying out knowledge transfer from a Teacher neural network to a Student neural network in the absence of training data. However, in the Adversarial DFKD framework, the student network's accuracy, suffers due to the non-stationary distribution of the pseudo-samples under multiple generator updates. To this end, at every generator update, we aim to maintain the student's performance on previously encountered examples while acquiring knowledge from samples of the current distribution. Thus, we propose a meta-learning inspired framework by treating the task of Knowledge-Acquisition (learning from newly generated samples) and Knowledge-Retention (retaining knowledge on previously met samples) as meta-train and meta-test, respectively. Hence, we dub our method as Learning to Retain while Acquiring. Moreover, we identify an implicit aligning factor between the Knowledge-Retention and Knowledge-Acquisition tasks indicating that the proposed student update strategy enforces a common gradient direction for both tasks, alleviating interference between the two objectives. Finally, we support our hypothesis by exhibiting extensive evaluation and comparison of our method with prior arts on multiple datasets. △ Less

Submitted 27 February, 2023; originally announced February 2023.

Comments: Accepted at CVPR 2023

arXiv:2302.01384 [pdf, other]

Energy-Inspired Self-Supervised Pretraining for Vision Models

Authors: Ze Wang, Jiang Wang, Zicheng Liu, Qiang Qiu

Abstract: Motivated by the fact that forward and backward passes of a deep network naturally form symmetric map**s between input and output representations, we introduce a simple yet effective self-supervised vision model pretraining framework inspired by energy-based models (EBMs). In the proposed framework, we model energy estimation and data restoration as the forward and backward passes of a single ne… ▽ More Motivated by the fact that forward and backward passes of a deep network naturally form symmetric map**s between input and output representations, we introduce a simple yet effective self-supervised vision model pretraining framework inspired by energy-based models (EBMs). In the proposed framework, we model energy estimation and data restoration as the forward and backward passes of a single network without any auxiliary components, e.g., an extra decoder. For the forward pass, we fit a network to an energy function that assigns low energy scores to samples that belong to an unlabeled dataset, and high energy otherwise. For the backward pass, we restore data from corrupted versions iteratively using gradient-based optimization along the direction of energy minimization. In this way, we naturally fold the encoder-decoder architecture widely used in masked image modeling into the forward and backward passes of a single vision model. Thus, our framework now accepts a wide range of pretext tasks with different data corruption methods, and permits models to be pretrained from masked image modeling, patch sorting, and image restoration, including super-resolution, denoising, and colorization. We support our findings with extensive experiments, and show the proposed method delivers comparable and even better performance with remarkably fewer epochs of training compared to the state-of-the-art self-supervised vision model pretraining methods. Our findings shed light on further exploring self-supervised vision model pretraining and pretext tasks beyond masked image modeling. △ Less

Submitted 2 February, 2023; originally announced February 2023.

Journal ref: ICLR 2023

arXiv:2209.10820 [pdf, other]

Color Recommendation for Vector Graphic Documents based on Multi-Palette Representation

Authors: Qianru Qiu, Xueting Wang, Mayu Otani, Yuki Iwazaki

Abstract: Vector graphic documents present multiple visual elements, such as images, shapes, and texts. Choosing appropriate colors for multiple visual elements is a difficult but crucial task for both amateurs and professional designers. Instead of creating a single color palette for all elements, we extract multiple color palettes from each visual element in a graphic document, and then combine them into… ▽ More Vector graphic documents present multiple visual elements, such as images, shapes, and texts. Choosing appropriate colors for multiple visual elements is a difficult but crucial task for both amateurs and professional designers. Instead of creating a single color palette for all elements, we extract multiple color palettes from each visual element in a graphic document, and then combine them into a color sequence. We propose a masked color model for color sequence completion and recommend the specified colors based on color context in multi-palette with high probability. We train the model and build a color recommendation system on a large-scale dataset of vector graphic documents. The proposed color recommendation method outperformed other state-of-the-art methods by both quantitative and qualitative evaluations on color prediction and our color recommendation system received positive feedback from professional designers in an interview study. △ Less

Submitted 22 September, 2022; originally announced September 2022.

Comments: Accepted to WACV 2023

arXiv:2209.06994 [pdf, ps, other]

PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer

Authors: Qibo Qiu, Haiming Gao, Wei Hua, Gang Huang, Xiaofei He

Abstract: Lane detection is one of the fundamental modules in self-driving. In this paper we employ a transformer-only method for lane detection, thus it could benefit from the blooming development of fully vision transformer and achieve the state-of-the-art (SOTA) performance on both CULane and TuSimple benchmarks, by fine-tuning the weight fully pre-trained on large datasets. More importantly, this paper… ▽ More Lane detection is one of the fundamental modules in self-driving. In this paper we employ a transformer-only method for lane detection, thus it could benefit from the blooming development of fully vision transformer and achieve the state-of-the-art (SOTA) performance on both CULane and TuSimple benchmarks, by fine-tuning the weight fully pre-trained on large datasets. More importantly, this paper proposes a novel and general framework called PriorLane, which is used to enhance the segmentation performance of the fully vision transformer by introducing the low-cost local prior knowledge. Specifically, PriorLane utilizes an encoder-only transformer to fuse the feature extracted by a pre-trained segmentation model with prior knowledge embeddings. Note that a Knowledge Embedding Alignment (KEA) module is adapted to enhance the fusion performance by aligning the knowledge embedding. Extensive experiments on our Zjlab dataset show that PriorLane outperforms SOTA lane detection methods by a 2.82% mIoU when prior knowledge is employed, and the code will be released at: https://github.com/vincentqqb/PriorLane. △ Less

Submitted 7 February, 2023; v1 submitted 14 September, 2022; originally announced September 2022.

Comments: Accepted by ICRA 2023

arXiv:2209.00641 [pdf, other]

Seq-UPS: Sequential Uncertainty-aware Pseudo-label Selection for Semi-Supervised Text Recognition

Authors: Gaurav Patel, Jan Allebach, Qiang Qiu

Abstract: This paper looks at semi-supervised learning (SSL) for image-based text recognition. One of the most popular SSL approaches is pseudo-labeling (PL). PL approaches assign labels to unlabeled data before re-training the model with a combination of labeled and pseudo-labeled data. However, PL methods are severely degraded by noise and are prone to over-fitting to noisy labels, due to the inclusion of… ▽ More This paper looks at semi-supervised learning (SSL) for image-based text recognition. One of the most popular SSL approaches is pseudo-labeling (PL). PL approaches assign labels to unlabeled data before re-training the model with a combination of labeled and pseudo-labeled data. However, PL methods are severely degraded by noise and are prone to over-fitting to noisy labels, due to the inclusion of erroneous high confidence pseudo-labels generated from poorly calibrated models, thus, rendering threshold-based selection ineffective. Moreover, the combinatorial complexity of the hypothesis space and the error accumulation due to multiple incorrect autoregressive steps posit pseudo-labeling challenging for sequence models. To this end, we propose a pseudo-label generation and an uncertainty-based data selection framework for semi-supervised text recognition. We first use Beam-Search inference to yield highly probable hypotheses to assign pseudo-labels to the unlabelled examples. Then we adopt an ensemble of models, sampled by applying dropout, to obtain a robust estimate of the uncertainty associated with the prediction, considering both the character-level and word-level predictive distribution to select good quality pseudo-labels. Extensive experiments on several benchmark handwriting and scene-text datasets show that our method outperforms the baseline approaches and the previous state-of-the-art semi-supervised text-recognition methods. △ Less

Submitted 6 October, 2022; v1 submitted 30 August, 2022; originally announced September 2022.

Comments: Accepted at WACV 2023

arXiv:2206.06635 [pdf, ps, other]

CVR-LSE: Compact Vectorization Representation of Local Static Environments for Unmanned Ground Vehicles

Authors: Haiming Gao, Qibo Qiu, Wei Hua, Xuebo Zhang, Zhengyong Han, Shun Zhang

Abstract: According to the requirement of general static obstacle detection, this paper proposes a compact vectorization representation approach of local static environments for unmanned ground vehicles. At first, by fusing the data of LiDAR and IMU, high-frequency pose information is obtained. Then, through the two-dimensional (2D) obstacle points generation, the process of grid map maintenance with a fixe… ▽ More According to the requirement of general static obstacle detection, this paper proposes a compact vectorization representation approach of local static environments for unmanned ground vehicles. At first, by fusing the data of LiDAR and IMU, high-frequency pose information is obtained. Then, through the two-dimensional (2D) obstacle points generation, the process of grid map maintenance with a fixed size is proposed. Finally, the local static environment is described via multiple convex polygons, which is realized throungh the double threshold-based boundary simplification and the convex polygon segmentation. Our proposed approach has been applied in a practical driverless project in the park, and the qualitative experimental results on typical scenes verify the effectiveness and robustness. In addition, the quantitative evaluation shows the superior performance on making use of fewer number of points information (decreased by about 60%) to represent the local static environment compared with the traditional grid map-based methods. Furthermore, the performance of running time (15ms) shows that the proposed approach can be used for real-time local static environment perception. The corresponding code can be accessed at https://github.com/ghm0819/cvr_lse. △ Less

Submitted 14 June, 2022; originally announced June 2022.

arXiv:2205.11139 [pdf, other]

doi 10.1145/3477495.3531848

GraphAD: A Graph Neural Network for Entity-Wise Multivariate Time-Series Anomaly Detection

Authors: Xu Chen, Qiu Qiu, Changshan Li, Kunqing Xie

Abstract: In recent years, the emergence and development of third-party platforms have greatly facilitated the growth of the Online to Offline (O2O) business. However, the large amount of transaction data raises new challenges for retailers, especially anomaly detection in operating conditions. Thus, platforms begin to develop intelligent business assistants with embedded anomaly detection methods to reduce… ▽ More In recent years, the emergence and development of third-party platforms have greatly facilitated the growth of the Online to Offline (O2O) business. However, the large amount of transaction data raises new challenges for retailers, especially anomaly detection in operating conditions. Thus, platforms begin to develop intelligent business assistants with embedded anomaly detection methods to reduce the management burden on retailers. Traditional time-series anomaly detection methods capture underlying patterns from the perspectives of time and attributes, ignoring the difference between retailers in this scenario. Besides, similar transaction patterns extracted by the platforms can also provide guidance to individual retailers and enrich their available information without privacy issues. In this paper, we pose an entity-wise multivariate time-series anomaly detection problem that considers the time-series of each unique entity. To address this challenge, we propose GraphAD, a novel multivariate time-series anomaly detection model based on the graph neural network. GraphAD decomposes the Key Performance Indicator (KPI) into stable and volatility components and extracts their patterns in terms of attributes, entities and temporal perspectives via graph neural networks. We also construct a real-world entity-wise multivariate time-series dataset from the business data of Ele.me. The experimental results on this dataset show that GraphAD significantly outperforms existing anomaly detection methods. △ Less

Submitted 23 May, 2022; originally announced May 2022.

Comments: SIGIR'22 Short Paper

arXiv:2203.16154 [pdf, other]

Learning to Socially Navigate in Pedestrian-rich Environments with Interaction Capacity

Authors: Quecheng Qiu, Shunyi Yao, **g Wang, Jun Ma, Guangda Chen, Jianmin Ji

Abstract: Existing navigation policies for autonomous robots tend to focus on collision avoidance while ignoring human-robot interactions in social life. For instance, robots can pass along the corridor safer and easier if pedestrians notice them. Sounds have been considered as an efficient way to attract the attention of pedestrians, which can alleviate the freezing robot problem. In this work, we present… ▽ More Existing navigation policies for autonomous robots tend to focus on collision avoidance while ignoring human-robot interactions in social life. For instance, robots can pass along the corridor safer and easier if pedestrians notice them. Sounds have been considered as an efficient way to attract the attention of pedestrians, which can alleviate the freezing robot problem. In this work, we present a new deep reinforcement learning (DRL) based social navigation approach for autonomous robots to move in pedestrian-rich environments with interaction capacity. Most existing DRL based methods intend to train a general policy that outputs both navigation actions, i.e., expected robot's linear and angular velocities, and interaction actions, i.e., the beep action, in the context of reinforcement learning. Different from these methods, we intend to train the policy via both supervised learning and reinforcement learning. In specific, we first train an interaction policy in the context of supervised learning, which provides a better understanding of the social situation, then we use this interaction policy to train the navigation policy via multiple reinforcement learning algorithms. We evaluate our approach in various simulation environments and compare it to other methods. The experimental results show that our approach outperforms others in terms of the success rate. We also deploy the trained policy on a real-world robot, which shows a nice performance in crowded environments. △ Less

Submitted 30 March, 2022; originally announced March 2022.

arXiv:2202.05628 [pdf, other]

doi 10.1145/3528223.3530086

Artemis: Articulated Neural Pets with Appearance and Motion synthesis

Authors: Haimin Luo, Teng Xu, Yuheng Jiang, Chenglin Zhou, Qiwei Qiu, Yingliang Zhang, Wei Yang, Lan Xu, **gyi Yu

Abstract: We, humans, are entering into a virtual era and indeed want to bring animals to the virtual world as well for companion. Yet, computer-generated (CGI) furry animals are limited by tedious off-line rendering, let alone interactive motion control. In this paper, we present ARTEMIS, a novel neural modeling and rendering pipeline for generating ARTiculated neural pets with appEarance and Motion synthe… ▽ More We, humans, are entering into a virtual era and indeed want to bring animals to the virtual world as well for companion. Yet, computer-generated (CGI) furry animals are limited by tedious off-line rendering, let alone interactive motion control. In this paper, we present ARTEMIS, a novel neural modeling and rendering pipeline for generating ARTiculated neural pets with appEarance and Motion synthesIS. Our ARTEMIS enables interactive motion control, real-time animation, and photo-realistic rendering of furry animals. The core of our ARTEMIS is a neural-generated (NGI) animal engine, which adopts an efficient octree-based representation for animal animation and fur rendering. The animation then becomes equivalent to voxel-level deformation based on explicit skeletal war**. We further use a fast octree indexing and efficient volumetric rendering scheme to generate appearance and density features maps. Finally, we propose a novel shading network to generate high-fidelity details of appearance and opacity under novel poses from appearance and density feature maps. For the motion control module in ARTEMIS, we combine state-of-the-art animal motion capture approach with recent neural character control scheme. We introduce an effective optimization scheme to reconstruct the skeletal motion of real animals captured by a multi-view RGB and Vicon camera array. We feed all the captured motion into a neural character control scheme to generate abstract control signals with motion styles. We further integrate ARTEMIS into existing engines that support VR headsets, providing an unprecedented immersive experience where a user can intimately interact with a variety of virtual animals with vivid movements and photo-realistic appearance. We make available our ARTEMIS model and dynamic furry animal dataset at https://haiminluo.github.io/publication/artemis/. △ Less

Submitted 17 June, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

Comments: Accepted to ACM SIGGRAPH 2022 (Journal track)

arXiv:2202.00280 [pdf, other]

Recycling Model Updates in Federated Learning: Are Gradient Subspaces Low-Rank?

Authors: Sheikh Shams Azam, Seyyedali Hosseinalipour, Qiang Qiu, Christopher Brinton

Abstract: In this paper, we question the rationale behind propagating large numbers of parameters through a distributed system during federated learning. We start by examining the rank characteristics of the subspace spanned by gradients across epochs (i.e., the gradient-space) in centralized model training, and observe that this gradient-space often consists of a few leading principal components accounting… ▽ More In this paper, we question the rationale behind propagating large numbers of parameters through a distributed system during federated learning. We start by examining the rank characteristics of the subspace spanned by gradients across epochs (i.e., the gradient-space) in centralized model training, and observe that this gradient-space often consists of a few leading principal components accounting for an overwhelming majority (95-99%) of the explained variance. Motivated by this, we propose the "Look-back Gradient Multiplier" (LBGM) algorithm, which exploits this low-rank property to enable gradient recycling between model update rounds of federated learning, reducing transmissions of large parameters to single scalars for aggregation. We analytically characterize the convergence behavior of LBGM, revealing the nature of the trade-off between communication savings and model performance. Our subsequent experimental results demonstrate the improvement LBGM obtains in communication overhead compared to conventional federated learning on several datasets and deep learning models. Additionally, we show that LBGM is a general plug-and-play algorithm that can be used standalone or stacked on top of existing sparsification techniques for distributed model training. △ Less

Submitted 1 February, 2022; originally announced February 2022.

Comments: In Proceedings of the 10th International Conference on Learning Representations (ICLR) 2022

arXiv:2111.03628 [pdf, other]

Exploiting a Zoo of Checkpoints for Unseen Tasks

Authors: Jiaji Huang, Qiang Qiu, Kenneth Church

Abstract: There are so many models in the literature that it is difficult for practitioners to decide which combinations are likely to be effective for a new task. This paper attempts to address this question by capturing relationships among checkpoints published on the web. We model the space of tasks as a Gaussian process. The covariance can be estimated from checkpoints and unlabeled probing data. With t… ▽ More There are so many models in the literature that it is difficult for practitioners to decide which combinations are likely to be effective for a new task. This paper attempts to address this question by capturing relationships among checkpoints published on the web. We model the space of tasks as a Gaussian process. The covariance can be estimated from checkpoints and unlabeled probing data. With the Gaussian process, we can identify representative checkpoints by a maximum mutual information criterion. This objective is submodular. A greedy method identifies representatives that are likely to "cover" the task space. These representatives generalize to new tasks with superior performance. Empirical evidence is provided for applications from both computational linguistics as well as computer vision. △ Less

Submitted 5 November, 2021; originally announced November 2021.

Comments: Accepted in Neurips 2021

arXiv:2110.10108 [pdf, other]

TESSERACT: Gradient Flip Score to Secure Federated Learning Against Model Poisoning Attacks

Authors: Atul Sharma, Wei Chen, Joshua Zhao, Qiang Qiu, Somali Chaterji, Saurabh Bagchi

Abstract: Federated learning---multi-party, distributed learning in a decentralized environment---is vulnerable to model poisoning attacks, even more so than centralized learning approaches. This is because malicious clients can collude and send in carefully tailored model updates to make the global model inaccurate. This motivated the development of Byzantine-resilient federated learning algorithms, such a… ▽ More Federated learning---multi-party, distributed learning in a decentralized environment---is vulnerable to model poisoning attacks, even more so than centralized learning approaches. This is because malicious clients can collude and send in carefully tailored model updates to make the global model inaccurate. This motivated the development of Byzantine-resilient federated learning algorithms, such as Krum, Bulyan, FABA, and FoolsGold. However, a recently developed untargeted model poisoning attack showed that all prior defenses can be bypassed. The attack uses the intuition that simply by changing the sign of the gradient updates that the optimizer is computing, for a set of malicious clients, a model can be diverted from the optima to increase the test error rate. In this work, we develop TESSERACT---a defense against this directed deviation attack, a state-of-the-art model poisoning attack. TESSERACT is based on a simple intuition that in a federated learning setting, certain patterns of gradient flips are indicative of an attack. This intuition is remarkably stable across different learning algorithms, models, and datasets. TESSERACT assigns reputation scores to the participating clients based on their behavior during the training phase and then takes a weighted contribution of the clients. We show that TESSERACT provides robustness against even a white-box version of the attack. △ Less

Submitted 19 October, 2021; originally announced October 2021.

Comments: 12 pages

arXiv:2109.02541 [pdf, other]

Crowd-Aware Robot Navigation for Pedestrians with Multiple Collision Avoidance Strategies via Map-based Deep Reinforcement Learning

Authors: Shunyi Yao1, Guangda Chen, Quecheng Qiu, Jun Ma, ** Chen, Jianmin Ji

Abstract: It is challenging for a mobile robot to navigate through human crowds. Existing approaches usually assume that pedestrians follow a predefined collision avoidance strategy, like social force model (SFM) or optimal reciprocal collision avoidance (ORCA). However, their performances commonly need to be further improved for practical applications, where pedestrians follow multiple different collision… ▽ More It is challenging for a mobile robot to navigate through human crowds. Existing approaches usually assume that pedestrians follow a predefined collision avoidance strategy, like social force model (SFM) or optimal reciprocal collision avoidance (ORCA). However, their performances commonly need to be further improved for practical applications, where pedestrians follow multiple different collision avoidance strategies. In this paper, we propose a map-based deep reinforcement learning approach for crowd-aware robot navigation with various pedestrians. We use the sensor map to represent the environmental information around the robot, including its shape and observable appearances of obstacles. We also introduce the pedestrian map that specifies the movements of pedestrians around the robot. By applying both maps as inputs of the neural network, we show that a navigation policy can be trained to better interact with pedestrians following different collision avoidance strategies. We evaluate our approach under multiple scenarios both in the simulator and on an actual robot. The results show that our approach allows the robot to successfully interact with various pedestrians and outperforms compared methods in terms of the success rate. △ Less

Submitted 6 September, 2021; originally announced September 2021.

Comments: 7 page

arXiv:2108.07895 [pdf, other]

Adaptive Convolutions with Per-pixel Dynamic Filter Atom

Authors: Ze Wang, Zichen Miao, Jun Hu, Qiang Qiu

Abstract: Applying feature dependent network weights have been proved to be effective in many fields. However, in practice, restricted by the enormous size of model parameters and memory footprints, scalable and versatile dynamic convolutions with per-pixel adapted filters are yet to be fully explored. In this paper, we address this challenge by decomposing filters, adapted to each spatial position, over dy… ▽ More Applying feature dependent network weights have been proved to be effective in many fields. However, in practice, restricted by the enormous size of model parameters and memory footprints, scalable and versatile dynamic convolutions with per-pixel adapted filters are yet to be fully explored. In this paper, we address this challenge by decomposing filters, adapted to each spatial position, over dynamic filter atoms generated by a light-weight network from local features. Adaptive receptive fields can be supported by further representing each filter atom over sets of pre-fixed multi-scale bases. As plug-and-play replacements to convolutional layers, the introduced adaptive convolutions with per-pixel dynamic atoms enable explicit modeling of intra-image variance, while avoiding heavy computation, parameters, and memory cost. Our method preserves the appealing properties of conventional convolutions as being translation-equivariant and parametrically efficient. We present experiments to show that, the proposed method delivers comparable or even better performance across tasks, and are particularly effective on handling tasks with significant intra-image variance. △ Less

Submitted 17 August, 2021; originally announced August 2021.

arXiv:2105.03649 [pdf, other]

In-Hardware Learning of Multilayer Spiking Neural Networks on a Neuromorphic Processor

Authors: Amar Shrestha, Haowen Fang, Daniel Patrick Rider, Zaidao Mei, Qinru Qiu

Abstract: Although widely used in machine learning, backpropagation cannot directly be applied to SNN training and is not feasible on a neuromorphic processor that emulates biological neuron and synapses. This work presents a spike-based backpropagation algorithm with biological plausible local update rules and adapts it to fit the constraint in a neuromorphic hardware. The algorithm is implemented on Intel… ▽ More Although widely used in machine learning, backpropagation cannot directly be applied to SNN training and is not feasible on a neuromorphic processor that emulates biological neuron and synapses. This work presents a spike-based backpropagation algorithm with biological plausible local update rules and adapts it to fit the constraint in a neuromorphic hardware. The algorithm is implemented on Intel Loihi chip enabling low power in-hardware supervised online learning of multilayered SNNs for mobile applications. We test this implementation on MNIST, Fashion-MNIST, CIFAR-10 and MSTAR datasets with promising performance and energy-efficiency, and demonstrate a possibility of incremental online learning with the implementation. △ Less

Submitted 8 May, 2021; originally announced May 2021.

Comments: 6 pages, 5 figures, accepted for Design Automation Conference (DAC) 2021

arXiv:2104.10712 [pdf, other]

Neuromorphic Algorithm-hardware Codesign for Temporal Pattern Learning

Authors: Haowen Fang, Brady Taylor, Ziru Li, Zaidao Mei, Hai Li, Qinru Qiu

Abstract: Neuromorphic computing and spiking neural networks (SNN) mimic the behavior of biological systems and have drawn interest for their potential to perform cognitive tasks with high energy efficiency. However, some factors such as temporal dynamics and spike timings prove critical for information processing but are often ignored by existing works, limiting the performance and applications of neuromor… ▽ More Neuromorphic computing and spiking neural networks (SNN) mimic the behavior of biological systems and have drawn interest for their potential to perform cognitive tasks with high energy efficiency. However, some factors such as temporal dynamics and spike timings prove critical for information processing but are often ignored by existing works, limiting the performance and applications of neuromorphic computing. On one hand, due to the lack of effective SNN training algorithms, it is difficult to utilize the temporal neural dynamics. Many existing algorithms still treat neuron activation statistically. On the other hand, utilizing temporal neural dynamics also poses challenges to hardware design. Synapses exhibit temporal dynamics, serving as memory units that hold historical information, but are often simplified as a connection with weight. Most current models integrate synaptic activations in some storage medium to represent membrane potential and institute a hard reset of membrane potential after the neuron emits a spike. This is done for its simplicity in hardware, requiring only a "clear" signal to wipe the storage medium, but destroys temporal information stored in the neuron. In this work, we derive an efficient training algorithm for Leaky Integrate and Fire neurons, which is capable of training a SNN to learn complex spatial temporal patterns. We achieved competitive accuracy on two complex datasets. We also demonstrate the advantage of our model by a novel temporal pattern association task. Codesigned with this algorithm, we have developed a CMOS circuit implementation for a memristor-based network of neuron and synapses which retains critical neural dynamics with reduced complexity. This circuit implementation of the neuron model is simulated to demonstrate its ability to react to temporal spiking patterns with an adaptive threshold. △ Less

Submitted 6 May, 2021; v1 submitted 21 April, 2021; originally announced April 2021.

arXiv:2012.02938 [pdf, other]

Cirrus: A Long-range Bi-pattern LiDAR Dataset

Authors: Ze Wang, Sihao Ding, Ying Li, Jonas Fenn, Sohini Roychowdhury, Andreas Wallin, Lane Martin, Scott Ryvola, Guillermo Sapiro, Qiang Qiu

Abstract: In this paper, we introduce Cirrus, a new long-range bi-pattern LiDAR public dataset for autonomous driving tasks such as 3D object detection, critical to highway driving and timely decision making. Our platform is equipped with a high-resolution video camera and a pair of LiDAR sensors with a 250-meter effective range, which is significantly longer than existing public datasets. We record paired… ▽ More In this paper, we introduce Cirrus, a new long-range bi-pattern LiDAR public dataset for autonomous driving tasks such as 3D object detection, critical to highway driving and timely decision making. Our platform is equipped with a high-resolution video camera and a pair of LiDAR sensors with a 250-meter effective range, which is significantly longer than existing public datasets. We record paired point clouds simultaneously using both Gaussian and uniform scanning patterns. Point density varies significantly across such a long range, and different scanning patterns further diversify object representation in LiDAR. In Cirrus, eight categories of objects are exhaustively annotated in the LiDAR point clouds for the entire effective range. To illustrate the kind of studies supported by this new dataset, we introduce LiDAR model adaptation across different ranges, scanning patterns, and sensor devices. Promising results show the great potential of this new dataset to the robotics and computer vision communities. △ Less

Submitted 4 December, 2020; originally announced December 2020.

arXiv:2011.09928 [pdf, other]

Using Text to Teach Image Retrieval

Authors: Haoyu Dong, Ze Wang, Qiang Qiu, Guillermo Sapiro

Abstract: Image retrieval relies heavily on the quality of the data modeling and the distance measurement in the feature space. Building on the concept of image manifold, we first propose to represent the feature space of images, learned via neural networks, as a graph. Neighborhoods in the feature space are now defined by the geodesic distance between images, represented as graph vertices or manifold sampl… ▽ More Image retrieval relies heavily on the quality of the data modeling and the distance measurement in the feature space. Building on the concept of image manifold, we first propose to represent the feature space of images, learned via neural networks, as a graph. Neighborhoods in the feature space are now defined by the geodesic distance between images, represented as graph vertices or manifold samples. When limited images are available, this manifold is sparsely sampled, making the geodesic computation and the corresponding retrieval harder. To address this, we augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images. In addition to extensive results on standard datasets illustrating the power of text to help in image retrieval, a new public dataset based on CLEVR is introduced to quantify the semantic similarity between visual data and text data. The experimental results show that the joint embedding manifold is a robust representation, allowing it to be a better basis to perform image retrieval given only an image and a textual instruction on the desired modifications over the image △ Less

Submitted 19 November, 2020; originally announced November 2020.

arXiv:2011.01408 [pdf, other]

Hybrid Visual Servoing Tracking Control of Uncalibrated Robotic Systems for Dynamic Dwarf Culture Orchards Harvest

Authors: Tao Li, Quan Qiu, Chunjiang Zhao

Abstract: The paper is concerned with the dynamic tracking problem of SNAP orchards harvesting robots in the presence of multiple uncalibrated model parameters in the application of dwarf culture orchards harvest. A new hybrid visual servoing adaptive tracking controller and three adaptive laws are proposed to guarantee harvesting robots to finish the dynamic harvesting task and the adaption to unknown para… ▽ More The paper is concerned with the dynamic tracking problem of SNAP orchards harvesting robots in the presence of multiple uncalibrated model parameters in the application of dwarf culture orchards harvest. A new hybrid visual servoing adaptive tracking controller and three adaptive laws are proposed to guarantee harvesting robots to finish the dynamic harvesting task and the adaption to unknown parameters including camera intrinsic and extrinsic model and robot dynamics. By the Lyapunov theory, asymptotic convergence of the closed-loop system with the proposed control scheme is rigorously proven. Experimental and simulation results have been conducted to verify the performance of the proposed control scheme. The results demonstrate its effectiveness and superiority. △ Less

Submitted 10 June, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

Comments: 6 pages,15 figures, accepted by IEEE ICDL2021

MSC Class: 70B15; 70E60; 70Q05; 93C85

arXiv:2011.01022

Depth Ranging Performance Evaluation and Improvement for RGB-D Cameras on Field-Based High-Throughput Phenoty** Robots

Authors: Zhengqiang Fan, Na Sun, Quan Qiu, Chunjiang Zhao

Abstract: RGB-D cameras have been successfully used for indoor High-ThroughpuT Phenoty** (HTTP). However, their capability and feasibility for in-field HTTP still need to be evaluated, due to the noise and disturbances generated by unstable illumination, specular reflection, and diffuse reflection, etc. To solve these problems, we evaluated the depth-ranging performances of two consumer-level RGB-D camera… ▽ More RGB-D cameras have been successfully used for indoor High-ThroughpuT Phenoty** (HTTP). However, their capability and feasibility for in-field HTTP still need to be evaluated, due to the noise and disturbances generated by unstable illumination, specular reflection, and diffuse reflection, etc. To solve these problems, we evaluated the depth-ranging performances of two consumer-level RGB-D cameras (RealSense D435i and Kinect V2) under in-field HTTP scenarios, and proposed a strategy to compensate the depth measurement error. For performance evaluation, we focused on determining their optimal ranging areas for different crop organs. Based on the evaluation results, we proposed a brightness-and-distance-based Support Vector Regression Strategy, to compensate the ranging error. Furthermore, we analyzed the depth filling rate of two RGB-D cameras under different lighting intensities. Experimental results showed that: 1) For RealSense D435i, its effective ranging area is [0.160, 1.400] m, and in-field filling rate is approximately 90%. 2) For Kinect V2, it has a high ranging accuracy in the [0.497, 1.200] m, but its in-field filling rate is less than 24.9%. 3) Our error compensation model can effectively reduce the influences of lighting intensity and target distance. The maximum MSE and minimum R2 of this model are 0.029 and 0.867, respectively. To sum up, RealSense D435i has better ranging performances than Kinect V2 on in-field HTTP. △ Less

Submitted 27 April, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

Comments: We want to improve the work of this paper before publishing it publicly

arXiv:2010.10341 [pdf, other]

Learning to Learn Variational Semantic Memory

Authors: Xiantong Zhen, Yingjun Du, Huan Xiong, Qiang Qiu, Cees G. M. Snoek, Ling Shao

Abstract: In this paper, we introduce variational semantic memory into meta-learning to acquire long-term knowledge for few-shot learning. The variational semantic memory accrues and stores semantic information for the probabilistic inference of class prototypes in a hierarchical Bayesian framework. The semantic memory is grown from scratch and gradually consolidated by absorbing information from tasks it e… ▽ More In this paper, we introduce variational semantic memory into meta-learning to acquire long-term knowledge for few-shot learning. The variational semantic memory accrues and stores semantic information for the probabilistic inference of class prototypes in a hierarchical Bayesian framework. The semantic memory is grown from scratch and gradually consolidated by absorbing information from tasks it experiences. By doing so, it is able to accumulate long-term, general knowledge that enables it to learn new concepts of objects. We formulate memory recall as the variational inference of a latent memory variable from addressed contents, which offers a principled way to adapt the knowledge to individual tasks. Our variational semantic memory, as a new long-term memory module, confers principled recall and update mechanisms that enable semantic information to be efficiently accrued and adapted for few-shot learning. Experiments demonstrate that the probabilistic modelling of prototypes achieves a more informative representation of object classes compared to deterministic vectors. The consistent new state-of-the-art performance on four benchmarks shows the benefit of variational semantic memory in boosting few-shot recognition. △ Less

Submitted 14 July, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

Comments: accepted to NeurIPS 2020; code is available in https://github.com/YDU-uva/VSM

arXiv:2009.02386 [pdf, other]

ACDC: Weight Sharing in Atom-Coefficient Decomposed Convolution

Authors: Ze Wang, Xiuyuan Cheng, Guillermo Sapiro, Qiang Qiu

Abstract: Convolutional Neural Networks (CNNs) are known to be significantly over-parametrized, and difficult to interpret, train and adapt. In this paper, we introduce a structural regularization across convolutional kernels in a CNN. In our approach, each convolution kernel is first decomposed as 2D dictionary atoms linearly combined by coefficients. The widely observed correlation and redundancy in a CNN… ▽ More Convolutional Neural Networks (CNNs) are known to be significantly over-parametrized, and difficult to interpret, train and adapt. In this paper, we introduce a structural regularization across convolutional kernels in a CNN. In our approach, each convolution kernel is first decomposed as 2D dictionary atoms linearly combined by coefficients. The widely observed correlation and redundancy in a CNN hint a common low-rank structure among the decomposed coefficients, which is here further supported by our empirical observations. We then explicitly regularize CNN kernels by enforcing decomposed coefficients to be shared across sub-structures, while leaving each sub-structure only its own dictionary atoms, a few hundreds of parameters typically, which leads to dramatic model reductions. We explore models with sharing across different sub-structures to cover a wide range of trade-offs between parameter reduction and expressiveness. Our proposed regularized network structures open the door to better interpreting, training and adapting deep models. We validate the flexibility and compatibility of our method by image classification experiments on multiple datasets and underlying network structures, and show that CNNs now maintain performance with dramatic reduction in parameters and computations, e.g., only 5\% parameters are used in a ResNet-18 to achieve comparable performance. Further experiments on few-shot classification show that faster and more robust task adaptation is obtained in comparison with models with standard convolutions. △ Less

Submitted 4 September, 2020; originally announced September 2020.

arXiv:2008.01818 [pdf, other]

Graph Convolution with Low-rank Learnable Local Filters

Authors: Xiuyuan Cheng, Zichen Miao, Qiang Qiu

Abstract: Geometric variations like rotation, scaling, and viewpoint changes pose a significant challenge to visual understanding. One common solution is to directly model certain intrinsic structures, e.g., using landmarks. However, it then becomes non-trivial to build effective deep models, especially when the underlying non-Euclidean grid is irregular and coarse. Recent deep models using graph convolutio… ▽ More Geometric variations like rotation, scaling, and viewpoint changes pose a significant challenge to visual understanding. One common solution is to directly model certain intrinsic structures, e.g., using landmarks. However, it then becomes non-trivial to build effective deep models, especially when the underlying non-Euclidean grid is irregular and coarse. Recent deep models using graph convolutions provide an appropriate framework to handle such non-Euclidean data, but many of them, particularly those based on global graph Laplacians, lack expressiveness to capture local features required for representation of signals lying on the non-Euclidean grid. The current paper introduces a new type of graph convolution with learnable low-rank local filters, which is provably more expressive than previous spectral graph convolution methods. The model also provides a unified framework for both spectral and spatial graph convolutions. To improve model robustness, regularization by local graph Laplacians is introduced. The representation stability against input graph data perturbation is theoretically proved, making use of the graph filter locality and the local graph regularization. Experiments on spherical mesh data, real-world facial expression recognition/skeleton-based action recognition data, and data with simulated graph noise show the empirical advantage of the proposed model. △ Less

Submitted 11 October, 2020; v1 submitted 4 August, 2020; originally announced August 2020.

Showing 1–50 of 110 results for author: Qiu, Q