-
StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences
Authors:
Shangkun Sun,
Jiaming Liu,
Thomas H. Li,
Huaxia Li,
Guoqing Liu,
Wei Gao
Abstract:
Occlusions between consecutive frames have long posed a significant challenge in optical flow estimation. The inherent ambiguity introduced by occlusions directly violates the brightness constancy constraint and considerably hinders pixel-to-pixel matching. To address this issue, multi-frame optical flow methods leverage adjacent frames to mitigate the local ambiguity. Nevertheless, prior multi-fr…
▽ More
Occlusions between consecutive frames have long posed a significant challenge in optical flow estimation. The inherent ambiguity introduced by occlusions directly violates the brightness constancy constraint and considerably hinders pixel-to-pixel matching. To address this issue, multi-frame optical flow methods leverage adjacent frames to mitigate the local ambiguity. Nevertheless, prior multi-frame methods predominantly adopt recursive flow estimation, resulting in a considerable computational overlap. In contrast, we propose a streamlined in-batch framework that eliminates the need for extensive redundant recursive computations while concurrently develo** effective spatio-temporal modeling approaches under in-batch estimation constraints. Specifically, we present a Streamlined In-batch Multi-frame (SIM) pipeline tailored to video input, attaining a similar level of time efficiency to two-frame networks. Furthermore, we introduce an efficient Integrative Spatio-temporal Coherence (ISC) modeling method for effective spatio-temporal modeling during the encoding phase, which introduces no additional parameter overhead. Additionally, we devise a Global Temporal Regressor (GTR) that effectively explores temporal relations during decoding. Benefiting from the efficient SIM pipeline and effective modules, StreamFlow not only excels in terms of performance on the challenging KITTI and Sintel datasets, with particular improvement in occluded areas but also attains a remarkable $63.82\%$ enhancement in speed compared with previous multi-frame methods. The code will be available soon at https://github.com/littlespray/StreamFlow.
△ Less
Submitted 28 November, 2023;
originally announced November 2023.
-
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Authors:
Ruyang Liu,
**gjia Huang,
Wei Gao,
Thomas H. Li,
Ge Li
Abstract:
Large-scale image-language pretrained models, e.g., CLIP, have demonstrated remarkable proficiency in acquiring general multi-modal knowledge through web-scale image-text data. Despite the impressive performance of image-language models on various image tasks, how to effectively expand them on general video understanding remains an area of ongoing exploration. In this paper, we investigate the ima…
▽ More
Large-scale image-language pretrained models, e.g., CLIP, have demonstrated remarkable proficiency in acquiring general multi-modal knowledge through web-scale image-text data. Despite the impressive performance of image-language models on various image tasks, how to effectively expand them on general video understanding remains an area of ongoing exploration. In this paper, we investigate the image-to-video transferring from the perspective of the model and the data, unveiling two key obstacles impeding the adaptation of image-language models: non-generalizable temporal modeling and partially misaligned video-text data. To address these challenges, we propose Spatial-Temporal Auxiliary Network with Mutual-guided alignment module (Mug-STAN), a simple yet effective framework extending image-text model to diverse video tasks and video-text data.Specifically, STAN adopts a branch structure with decomposed spatial-temporal modules to enable generalizable temporal modeling, while Mug suppresses misalignment by introducing token-wise feature aggregation of either modality from the other. Extensive experimental results verify Mug-STAN significantly improves adaptation of language-image pretrained models such as CLIP and CoCa at both video-text post-pretraining and finetuning stages. With our solution, state-of-the-art zero-shot and finetuning results on various downstream datasets, including MSR-VTT, DiDeMo, LSMDC, Kinetics-400, Something-Something-2, HMDB-51, UCF- 101, and AVA, are achieved. Moreover, by integrating pretrained Mug-STAN with the emerging multimodal dialogue model, we can realize zero-shot video chatting. Codes are available at https://github.com/farewellthree/STAN
△ Less
Submitted 25 November, 2023;
originally announced November 2023.
-
Efficient Test-Time Adaptation for Super-Resolution with Second-Order Degradation and Reconstruction
Authors:
Zeshuai Deng,
Zhuokun Chen,
Shuaicheng Niu,
Thomas H. Li,
Bohan Zhuang,
Mingkui Tan
Abstract:
Image super-resolution (SR) aims to learn a map** from low-resolution (LR) to high-resolution (HR) using paired HR-LR training images. Conventional SR methods typically gather the paired training data by synthesizing LR images from HR images using a predetermined degradation model, e.g., Bicubic down-sampling. However, the realistic degradation type of test images may mismatch with the training-…
▽ More
Image super-resolution (SR) aims to learn a map** from low-resolution (LR) to high-resolution (HR) using paired HR-LR training images. Conventional SR methods typically gather the paired training data by synthesizing LR images from HR images using a predetermined degradation model, e.g., Bicubic down-sampling. However, the realistic degradation type of test images may mismatch with the training-time degradation type due to the dynamic changes of the real-world scenarios, resulting in inferior-quality SR images. To address this, existing methods attempt to estimate the degradation model and train an image-specific model, which, however, is quite time-consuming and impracticable to handle rapidly changing domain shifts. Moreover, these methods largely concentrate on the estimation of one degradation type (e.g., blur degradation), overlooking other degradation types like noise and JPEG in real-world test-time scenarios, thus limiting their practicality. To tackle these problems, we present an efficient test-time adaptation framework for SR, named SRTTA, which is able to quickly adapt SR models to test domains with different/unknown degradation types. Specifically, we design a second-order degradation scheme to construct paired data based on the degradation type of the test image, which is predicted by a pre-trained degradation classifier. Then, we adapt the SR model by implementing feature-level reconstruction learning from the initial test image to its second-order degraded counterparts, which helps the SR model generate plausible HR images. Extensive experiments are conducted on newly synthesized corrupted DIV2K datasets with 8 different degradations and several real-world datasets, demonstrating that our SRTTA framework achieves an impressive improvement over existing methods with satisfying speed. The source code is available at https://github.com/DengZeshuai/SRTTA.
△ Less
Submitted 29 October, 2023;
originally announced October 2023.
-
FGPrompt: Fine-grained Goal Prompting for Image-goal Navigation
Authors:
Xinyu Sun,
Peihao Chen,
Jugang Fan,
Thomas H. Li,
Jian Chen,
Mingkui Tan
Abstract:
Learning to navigate to an image-specified goal is an important but challenging task for autonomous systems. The agent is required to reason the goal location from where a picture is shot. Existing methods try to solve this problem by learning a navigation policy, which captures semantic features of the goal image and observation image independently and lastly fuses them for predicting a sequence…
▽ More
Learning to navigate to an image-specified goal is an important but challenging task for autonomous systems. The agent is required to reason the goal location from where a picture is shot. Existing methods try to solve this problem by learning a navigation policy, which captures semantic features of the goal image and observation image independently and lastly fuses them for predicting a sequence of navigation actions. However, these methods suffer from two major limitations. 1) They may miss detailed information in the goal image, and thus fail to reason the goal location. 2) More critically, it is hard to focus on the goal-relevant regions in the observation image, because they attempt to understand observation without goal conditioning. In this paper, we aim to overcome these limitations by designing a Fine-grained Goal Prompting (FGPrompt) method for image-goal navigation. In particular, we leverage fine-grained and high-resolution feature maps in the goal image as prompts to perform conditioned embedding, which preserves detailed information in the goal image and guides the observation encoder to pay attention to goal-relevant regions. Compared with existing methods on the image-goal navigation benchmark, our method brings significant performance improvement on 3 benchmark datasets (i.e., Gibson, MP3D, and HM3D). Especially on Gibson, we surpass the state-of-the-art success rate by 8% with only 1/50 model size. Project page: https://xinyusun.github.io/fgprompt-pages
△ Less
Submitted 11 October, 2023;
originally announced October 2023.
-
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
Authors:
Ruyang Liu,
Chen Li,
Yixiao Ge,
Ying Shan,
Thomas H. Li,
Ge Li
Abstract:
The recent progress in Large Language Models (LLM) has spurred various advancements in image-language conversation agents, while how to build a proficient video-based dialogue system is still under exploration. Considering the extensive scale of LLM and visual backbone, minimal GPU memory is left for facilitating effective temporal modeling, which is crucial for comprehending and providing feedbac…
▽ More
The recent progress in Large Language Models (LLM) has spurred various advancements in image-language conversation agents, while how to build a proficient video-based dialogue system is still under exploration. Considering the extensive scale of LLM and visual backbone, minimal GPU memory is left for facilitating effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder, which is tuned while kee** the backbone frozen. Just pretrained once, BT-Adapter can be seamlessly integrated into all image conversation models using this version of CLIP, enabling video conversations without the need for video instructions. Besides, we develop a unique asymmetric token masking strategy inside the branch with tailor-made training tasks for BT-Adapter, facilitating faster convergence and better results. Thanks to BT-Adapter, we are able to empower existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands of fewer GPU hours. (2) better performance than current video chatbots without any video instruction tuning. (3) state-of-the-art results of video chatting using video instruction tuning, outperforming previous SOTAs by a large margin.
△ Less
Submitted 27 June, 2024; v1 submitted 27 September, 2023;
originally announced September 2023.
-
$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models
Authors:
Peihao Chen,
Xinyu Sun,
Hongyan Zhi,
Runhao Zeng,
Thomas H. Li,
Gaowen Liu,
Mingkui Tan,
Chuang Gan
Abstract:
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data. Normally, the instructions have complex grammatical structures and often contain various action descriptions (e.g., "proceed beyond", "depart from"…
▽ More
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data. Normally, the instructions have complex grammatical structures and often contain various action descriptions (e.g., "proceed beyond", "depart from"). How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging. Note that a well-educated human being can easily understand path instructions without the need for any special training. In this paper, we propose an action-aware zero-shot VLN method ($A^2$Nav) by exploiting the vision-and-language ability of foundation models. Specifically, the proposed method consists of an instruction parser and an action-aware navigation policy. The instruction parser utilizes the advanced reasoning ability of large language models (e.g., GPT-3) to decompose complex navigation instructions into a sequence of action-specific object navigation sub-tasks. Each sub-task requires the agent to localize the object and navigate to a specific goal position according to the associated action demand. To accomplish these sub-tasks, an action-aware navigation policy is learned from freely collected action-specific datasets that reveal distinct characteristics of each action demand. We use the learned navigation policy for executing sub-tasks sequentially to follow the navigation instruction. Extensive experiments show $A^2$Nav achieves promising ZS-VLN performance and even surpasses the supervised learning methods on R2R-Habitat and RxR-Habitat datasets.
△ Less
Submitted 15 August, 2023;
originally announced August 2023.
-
Learning Vision-and-Language Navigation from YouTube Videos
Authors:
Kunyang Lin,
Peihao Chen,
Diwei Huang,
Thomas H. Li,
Mingkui Tan,
Chuang Gan
Abstract:
Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions. Existing VLN methods suffer from training on small-scale environments or unreasonable path-instruction datasets, limiting the generalization to unseen environments. There are massive house tour videos on YouTube, providing abundant real navigation experience…
▽ More
Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions. Existing VLN methods suffer from training on small-scale environments or unreasonable path-instruction datasets, limiting the generalization to unseen environments. There are massive house tour videos on YouTube, providing abundant real navigation experiences and layout information. However, these videos have not been explored for VLN before. In this paper, we propose to learn an agent from these videos by creating a large-scale dataset which comprises reasonable path-instruction pairs from house tour videos and pre-training the agent on it. To achieve this, we have to tackle the challenges of automatically constructing path-instruction pairs and exploiting real layout knowledge from raw and unlabeled videos. To address these, we first leverage an entropy-based method to construct the nodes of a path trajectory. Then, we propose an action-aware generator for generating instructions from unlabeled trajectories. Last, we devise a trajectory judgment pretext task to encourage the agent to mine the layout knowledge. Experimental results show that our method achieves state-of-the-art performance on two popular benchmarks (R2R and REVERIE). Code is available at https://github.com/JeremyLinky/YouTube-VLN
△ Less
Submitted 22 July, 2023;
originally announced July 2023.
-
Hard Sample Matters a Lot in Zero-Shot Quantization
Authors:
Huantong Li,
Xiangmiao Wu,
Fanbing Lv,
Daihai Liao,
Thomas H. Li,
Yonggang Zhang,
Bo Han,
Mingkui Tan
Abstract:
Zero-shot quantization (ZSQ) is promising for compressing and accelerating deep neural networks when the data for training full-precision models are inaccessible. In ZSQ, network quantization is performed using synthetic samples, thus, the performance of quantized models depends heavily on the quality of synthetic samples. Nonetheless, we find that the synthetic samples constructed in existing ZSQ…
▽ More
Zero-shot quantization (ZSQ) is promising for compressing and accelerating deep neural networks when the data for training full-precision models are inaccessible. In ZSQ, network quantization is performed using synthetic samples, thus, the performance of quantized models depends heavily on the quality of synthetic samples. Nonetheless, we find that the synthetic samples constructed in existing ZSQ methods can be easily fitted by models. Accordingly, quantized models obtained by these methods suffer from significant performance degradation on hard samples. To address this issue, we propose HArd sample Synthesizing and Training (HAST). Specifically, HAST pays more attention to hard samples when synthesizing samples and makes synthetic samples hard to fit when training quantized models. HAST aligns features extracted by full-precision and quantized models to ensure the similarity between features extracted by these two models. Extensive experiments show that HAST significantly outperforms existing ZSQ methods, achieving performance comparable to models that are quantized with real data.
△ Less
Submitted 24 March, 2023;
originally announced March 2023.
-
Detecting the open-world objects with the help of the Brain
Authors:
Shuailei Ma,
Yuefeng Wang,
Ying Wei,
Peihao Chen,
Zhixiang Ye,
Jiaqi Fan,
Enming Zhang,
Thomas H. Li
Abstract:
Open World Object Detection (OWOD) is a novel computer vision task with a considerable challenge, bridging the gap between classic object detection (OD) benchmarks and real-world object detection. In addition to detecting and classifying seen/known objects, OWOD algorithms are expected to detect unseen/unknown objects and incrementally learn them. The natural instinct of humans to identify unknown…
▽ More
Open World Object Detection (OWOD) is a novel computer vision task with a considerable challenge, bridging the gap between classic object detection (OD) benchmarks and real-world object detection. In addition to detecting and classifying seen/known objects, OWOD algorithms are expected to detect unseen/unknown objects and incrementally learn them. The natural instinct of humans to identify unknown objects in their environments mainly depends on their brains' knowledge base. It is difficult for a model to do this only by learning from the annotation of several tiny datasets. The large pre-trained grounded language-image models - VL (\ie GLIP) have rich knowledge about the open world but are limited to the text prompt. We propose leveraging the VL as the ``Brain'' of the open-world detector by simply generating unknown labels. Leveraging it is non-trivial because the unknown labels impair the model's learning of known objects. In this paper, we alleviate these problems by proposing the down-weight loss function and decoupled detection structure. Moreover, our detector leverages the ``Brain'' to learn novel objects beyond VL through our pseudo-labeling scheme.
△ Less
Submitted 21 March, 2023;
originally announced March 2023.
-
LIO-PPF: Fast LiDAR-Inertial Odometry via Incremental Plane Pre-Fitting and Skeleton Tracking
Authors:
Xingyu Chen,
Peixi Wu,
Ge Li,
Thomas H. Li
Abstract:
As a crucial infrastructure of intelligent mobile robots, LiDAR-Inertial odometry (LIO) provides the basic capability of state estimation by tracking LiDAR scans. The high-accuracy tracking generally involves the kNN search, which is used with minimizing the point-to-plane distance. The cost for this, however, is maintaining a large local map and performing kNN plane fit for each point. In this wo…
▽ More
As a crucial infrastructure of intelligent mobile robots, LiDAR-Inertial odometry (LIO) provides the basic capability of state estimation by tracking LiDAR scans. The high-accuracy tracking generally involves the kNN search, which is used with minimizing the point-to-plane distance. The cost for this, however, is maintaining a large local map and performing kNN plane fit for each point. In this work, we reduce both time and space complexity of LIO by saving these unnecessary costs. Technically, we design a plane pre-fitting (PPF) pipeline to track the basic skeleton of the 3D scene. In PPF, planes are not fitted individually for each scan, let alone for each point, but are updated incrementally as the scene 'flows'. Unlike kNN, the PPF is more robust to noisy and non-strict planes with our iterative Principal Component Analyse (iPCA) refinement. Moreover, a simple yet effective sandwich layer is introduced to eliminate false point-to-plane matches. Our method was extensively tested on a total number of 22 sequences across 5 open datasets, and evaluated in 3 existing state-of-the-art LIO systems. By contrast, LIO-PPF can consume only 36% of the original local map size to achieve up to 4x faster residual computing and 1.92x overall FPS, while maintaining the same level of accuracy. We fully open source our implementation at https://github.com/xingyuuchen/LIO-PPF.
△ Less
Submitted 3 August, 2023; v1 submitted 28 February, 2023;
originally announced February 2023.
-
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring
Authors:
Ruyang Liu,
**gjia Huang,
Ge Li,
Jiashi Feng,
Xinglong Wu,
Thomas H. Li
Abstract:
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, thus attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the…
▽ More
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, thus attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level semantic-dominant tasks (e.g., retrieval) or low-level visual pattern-dominant tasks (e.g., recognition), and fail to work on the two cases simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both high-level and low-level knowledge in CLIP model. To tackle this problem, we present Spatial-Temporal Auxiliary Network (STAN) -- a simple and effective temporal modeling mechanism extending CLIP model to diverse video tasks. Specifically, to realize both low-level and high-level knowledge transferring, STAN adopts a branch structure with decomposed spatial-temporal modules that enable multi-level CLIP features to be spatial-temporally contextualized. We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments demonstrate the superiority of our model over the state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2. Codes will be available at https://github.com/farewellthree/STAN
△ Less
Submitted 26 January, 2023;
originally announced January 2023.
-
CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection
Authors:
Shuailei Ma,
Yuefeng Wang,
Jiaqi Fan,
Ying Wei,
Thomas H. Li,
Hongli Liu,
Fanbing Lv
Abstract:
Open-world object detection (OWOD), as a more general and challenging goal, requires the model trained from data on known objects to detect both known and unknown objects and incrementally learn to identify these unknown objects. The existing works which employ standard detection framework and fixed pseudo-labelling mechanism (PLM) have the following problems: (i) The inclusion of detecting unknow…
▽ More
Open-world object detection (OWOD), as a more general and challenging goal, requires the model trained from data on known objects to detect both known and unknown objects and incrementally learn to identify these unknown objects. The existing works which employ standard detection framework and fixed pseudo-labelling mechanism (PLM) have the following problems: (i) The inclusion of detecting unknown objects substantially reduces the model's ability to detect known ones. (ii) The PLM does not adequately utilize the priori knowledge of inputs. (iii) The fixed selection manner of PLM cannot guarantee that the model is trained in the right direction. We observe that humans subconsciously prefer to focus on all foreground objects and then identify each one in detail, rather than localize and identify a single object simultaneously, for alleviating the confusion. This motivates us to propose a novel solution called CAT: LoCalization and IdentificAtion Cascade Detection Transformer which decouples the detection process via the shared decoder in the cascade decoding way. In the meanwhile, we propose the self-adaptive pseudo-labelling mechanism which combines the model-driven with input-driven PLM and self-adaptively generates robust pseudo-labels for unknown objects, significantly improving the ability of CAT to retrieve unknown objects. Comprehensive experiments on two benchmark datasets, i.e., MS-COCO and PASCAL VOC, show that our model outperforms the state-of-the-art in terms of all metrics in the task of OWOD, incremental object detection (IOD) and open-set detection.
△ Less
Submitted 27 March, 2023; v1 submitted 5 January, 2023;
originally announced January 2023.
-
Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation
Authors:
Peihao Chen,
Dongyu Ji,
Kunyang Lin,
Runhao Zeng,
Thomas H. Li,
Mingkui Tan,
Chuang Gan
Abstract:
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions. The instructions often contain descriptions of objects in the environment. To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the envi…
▽ More
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions. The instructions often contain descriptions of objects in the environment. To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects. However, enabling a robot to build a map that well represents the environment is extremely challenging as the environment often involves diverse objects with various attributes. In this paper, we propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively. Moreover, we propose a weakly-supervised auxiliary task, which requires the agent to localize instruction-relevant objects on the map. Through this task, the agent not only learns to localize the instruction-relevant objects for navigation but also is encouraged to learn a better map representation that reveals object information. We then feed the learned map and instruction to a waypoint predictor to determine the next navigation goal. Experimental results show our method outperforms the state-of-the-art by 4.0% and 4.6% w.r.t. success rate both in seen and unseen environments, respectively on VLN-CE dataset. Code is available at https://github.com/PeihaoChen/WS-MGMap.
△ Less
Submitted 14 October, 2022;
originally announced October 2022.
-
Learning Active Camera for Multi-Object Navigation
Authors:
Peihao Chen,
Dongyu Ji,
Kunyang Lin,
Weiwen Hu,
Wenbing Huang,
Thomas H. Li,
Mingkui Tan,
Chuang Gan
Abstract:
Getting robots to navigate to multiple objects autonomously is essential yet difficult in robot applications. One of the key challenges is how to explore environments efficiently with camera sensors only. Existing navigation methods mainly focus on fixed cameras and few attempts have been made to navigate with active cameras. As a result, the agent may take a very long time to perceive the environ…
▽ More
Getting robots to navigate to multiple objects autonomously is essential yet difficult in robot applications. One of the key challenges is how to explore environments efficiently with camera sensors only. Existing navigation methods mainly focus on fixed cameras and few attempts have been made to navigate with active cameras. As a result, the agent may take a very long time to perceive the environment due to limited camera scope. In contrast, humans typically gain a larger field of view by looking around for a better perception of the environment. How to make robots perceive the environment as efficiently as humans is a fundamental problem in robotics. In this paper, we consider navigating to multiple objects more efficiently with active cameras. Specifically, we cast moving camera to a Markov Decision Process and reformulate the active camera problem as a reinforcement learning problem. However, we have to address two new challenges: 1) how to learn a good camera policy in complex environments and 2) how to coordinate it with the navigation policy. To address these, we carefully design a reward function to encourage the agent to explore more areas by moving camera actively. Moreover, we exploit human experience to infer a rule-based camera action to guide the learning process. Last, to better coordinate two kinds of policies, the camera policy takes navigation actions into account when making camera moving decisions. Experimental results show our camera policy consistently improves the performance of multi-object navigation over four baselines on two datasets.
△ Less
Submitted 14 October, 2022;
originally announced October 2022.
-
Masked Motion Encoding for Self-Supervised Video Representation Learning
Authors:
Xinyu Sun,
Peihao Chen,
Liangwei Chen,
Changhao Li,
Thomas H. Li,
Mingkui Tan,
Chuang Gan
Abstract:
How to learn discriminative video representation from unlabeled videos is challenging but crucial for video analysis. The latest attempts seek to learn a representation model by predicting the appearance contents in the masked regions. However, simply masking and recovering appearance contents may not be sufficient to model temporal clues as the appearance contents can be easily reconstructed from…
▽ More
How to learn discriminative video representation from unlabeled videos is challenging but crucial for video analysis. The latest attempts seek to learn a representation model by predicting the appearance contents in the masked regions. However, simply masking and recovering appearance contents may not be sufficient to model temporal clues as the appearance contents can be easily reconstructed from a single frame. To overcome this limitation, we present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues. In MME, we focus on addressing two critical challenges to improve the representation performance: 1) how to well represent the possible long-term motion across multiple frames; and 2) how to obtain fine-grained temporal clues from sparsely sampled videos. Motivated by the fact that human is able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions. Besides, given the sparse video input, we enforce the model to reconstruct dense motion trajectories in both spatial and temporal dimensions. Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details. Code is available at https://github.com/XinyuSun/MME.
△ Less
Submitted 23 March, 2023; v1 submitted 12 October, 2022;
originally announced October 2022.
-
Frequency-Aware Self-Supervised Monocular Depth Estimation
Authors:
Xingyu Chen,
Thomas H. Li,
Ruonan Zhang,
Ge Li
Abstract:
We present two versatile methods to generally enhance self-supervised monocular depth estimation (MDE) models. The high generalizability of our methods is achieved by solving the fundamental and ubiquitous problems in photometric loss function. In particular, from the perspective of spatial frequency, we first propose Ambiguity-Masking to suppress the incorrect supervision under photometric loss a…
▽ More
We present two versatile methods to generally enhance self-supervised monocular depth estimation (MDE) models. The high generalizability of our methods is achieved by solving the fundamental and ubiquitous problems in photometric loss function. In particular, from the perspective of spatial frequency, we first propose Ambiguity-Masking to suppress the incorrect supervision under photometric loss at specific object boundaries, the cause of which could be traced to pixel-level ambiguity. Second, we present a novel frequency-adaptive Gaussian low-pass filter, designed to robustify the photometric loss in high-frequency regions. We are the first to propose blurring images to improve depth estimators with an interpretable analysis. Both modules are lightweight, adding no parameters and no need to manually change the network structures. Experiments show that our methods provide performance boosts to a large number of existing models, including those who claimed state-of-the-art, while introducing no extra inference computation at all.
△ Less
Submitted 14 October, 2022; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Self-Supervised Monocular Depth Estimation: Solving the Edge-Fattening Problem
Authors:
Xingyu Chen,
Ruonan Zhang,
Ji Jiang,
Yan Wang,
Ge Li,
Thomas H. Li
Abstract:
Self-supervised monocular depth estimation (MDE) models universally suffer from the notorious edge-fattening issue. Triplet loss, as a widespread metric learning strategy, has largely succeeded in many computer vision applications. In this paper, we redesign the patch-based triplet loss in MDE to alleviate the ubiquitous edge-fattening issue. We show two drawbacks of the raw triplet loss in MDE an…
▽ More
Self-supervised monocular depth estimation (MDE) models universally suffer from the notorious edge-fattening issue. Triplet loss, as a widespread metric learning strategy, has largely succeeded in many computer vision applications. In this paper, we redesign the patch-based triplet loss in MDE to alleviate the ubiquitous edge-fattening issue. We show two drawbacks of the raw triplet loss in MDE and demonstrate our problem-driven redesigns. First, we present a min. operator based strategy applied to all negative samples, to prevent well-performing negatives sheltering the error of edge-fattening negatives. Second, we split the anchor-positive distance and anchor-negative distance from within the original triplet, which directly optimizes the positives without any mutual effect with the negatives. Extensive experiments show the combination of these two small redesigns can achieve unprecedented results: Our powerful and versatile triplet loss not only makes our model outperform all previous SoTA by a large margin, but also provides substantial performance boosts to a large number of existing models, while introducing no extra inference computation at all.
△ Less
Submitted 3 January, 2023; v1 submitted 1 October, 2022;
originally announced October 2022.
-
Deep Geometry Post-Processing for Decompressed Point Clouds
Authors:
Xiaoqing Fan,
Ge Li,
Dingquan Li,
Yurui Ren,
Wei Gao,
Thomas H. Li
Abstract:
Point cloud compression plays a crucial role in reducing the huge cost of data storage and transmission. However, distortions can be introduced into the decompressed point clouds due to quantization. In this paper, we propose a novel learning-based post-processing method to enhance the decompressed point clouds. Specifically, a voxelized point cloud is first divided into small cubes. Then, a 3D co…
▽ More
Point cloud compression plays a crucial role in reducing the huge cost of data storage and transmission. However, distortions can be introduced into the decompressed point clouds due to quantization. In this paper, we propose a novel learning-based post-processing method to enhance the decompressed point clouds. Specifically, a voxelized point cloud is first divided into small cubes. Then, a 3D convolutional network is proposed to predict the occupancy probability for each location of a cube. We leverage both local and global contexts by generating multi-scale probabilities. These probabilities are progressively summed to predict the results in a coarse-to-fine manner. Finally, we obtain the geometry-refined point clouds based on the predicted probabilities. Different from previous methods, we deal with decompressed point clouds with huge variety of distortions using a single model. Experimental results show that the proposed method can significantly improve the quality of the decompressed point clouds, achieving 9.30dB BDPSNR gain on three representative datasets on average.
△ Less
Submitted 29 April, 2022;
originally announced April 2022.
-
Neural Texture Extraction and Distribution for Controllable Person Image Synthesis
Authors:
Yurui Ren,
Xiaoqing Fan,
Ge Li,
Shan Liu,
Thomas H. Li
Abstract:
We deal with the controllable person image synthesis task which aims to re-render a human from a reference image with explicit control over body pose and appearance. Observing that person images are highly structured, we propose to generate desired images by extracting and distributing semantic entities of reference images. To achieve this goal, a neural texture extraction and distribution operati…
▽ More
We deal with the controllable person image synthesis task which aims to re-render a human from a reference image with explicit control over body pose and appearance. Observing that person images are highly structured, we propose to generate desired images by extracting and distributing semantic entities of reference images. To achieve this goal, a neural texture extraction and distribution operation based on double attention is described. This operation first extracts semantic neural textures from reference feature maps. Then, it distributes the extracted neural textures according to the spatial distributions learned from target poses. Our model is trained to predict human images in arbitrary poses, which encourages it to extract disentangled and expressive neural textures representing the appearance of different semantic entities. The disentangled representation further enables explicit appearance control. Neural textures of different reference images can be fused to control the appearance of the interested areas. Experimental comparisons show the superiority of the proposed model. Code is available at https://github.com/RenYurui/Neural-Texture-Extraction-Distribution.
△ Less
Submitted 12 April, 2022;
originally announced April 2022.
-
PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering
Authors:
Yurui Ren,
Ge Li,
Yuanqi Chen,
Thomas H. Li,
Shan Liu
Abstract:
Generating portrait images by controlling the motions of existing faces is an important task of great consequence to social media industries. For easy use and intuitive control, semantically meaningful and fully disentangled parameters should be used as modifications. However, many existing techniques do not provide such fine-grained controls or use indirect editing methods i.e. mimic motions of o…
▽ More
Generating portrait images by controlling the motions of existing faces is an important task of great consequence to social media industries. For easy use and intuitive control, semantically meaningful and fully disentangled parameters should be used as modifications. However, many existing techniques do not provide such fine-grained controls or use indirect editing methods i.e. mimic motions of other individuals. In this paper, a Portrait Image Neural Renderer (PIRenderer) is proposed to control the face motions with the parameters of three-dimensional morphable face models (3DMMs). The proposed model can generate photo-realistic portrait images with accurate movements according to intuitive modifications. Experiments on both direct and indirect editing tasks demonstrate the superiority of this model. Meanwhile, we further extend this model to tackle the audio-driven facial reenactment task by extracting sequential motions from audio inputs. We show that our model can generate coherent videos with convincing movements from only a single reference image and a driving audio stream. Our source code is available at https://github.com/RenYurui/PIRender.
△ Less
Submitted 17 September, 2021;
originally announced September 2021.
-
Combining Attention with Flow for Person Image Synthesis
Authors:
Yurui Ren,
Yubo Wu,
Thomas H. Li,
Shan Liu,
Ge Li
Abstract:
Pose-guided person image synthesis aims to synthesize person images by transforming reference images into target poses. In this paper, we observe that the commonly used spatial transformation blocks have complementary advantages. We propose a novel model by combining the attention operation with the flow-based operation. Our model not only takes the advantage of the attention operation to generate…
▽ More
Pose-guided person image synthesis aims to synthesize person images by transforming reference images into target poses. In this paper, we observe that the commonly used spatial transformation blocks have complementary advantages. We propose a novel model by combining the attention operation with the flow-based operation. Our model not only takes the advantage of the attention operation to generate accurate target structures but also uses the flow-based operation to sample realistic source textures. Both objective and subjective experiments demonstrate the superiority of our model. Meanwhile, comprehensive ablation studies verify our hypotheses and show the efficacy of the proposed modules. Besides, additional experiments on the portrait image editing task demonstrate the versatility of the proposed combination.
△ Less
Submitted 3 August, 2021;
originally announced August 2021.
-
Deep Spatial Transformation for Pose-Guided Person Image Generation and Animation
Authors:
Yurui Ren,
Ge Li,
Shan Liu,
Thomas H. Li
Abstract:
Pose-guided person image generation and animation aim to transform a source person image to target poses. These tasks require spatial manipulation of source data. However, Convolutional Neural Networks are limited by the lack of ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level.…
▽ More
Pose-guided person image generation and animation aim to transform a source person image to target poses. These tasks require spatial manipulation of source data. However, Convolutional Neural Networks are limited by the lack of ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. This framework first estimates global flow fields between sources and targets. Then, corresponding local source feature patches are sampled with content-aware local attention coefficients. We show that our framework can spatially transform the inputs in an efficient manner. Meanwhile, we further model the temporal consistency for the person image animation task to generate coherent videos. The experiment results of both image generation and animation tasks demonstrate the superiority of our model. Besides, additional results of novel view synthesis and face image animation show that our model is applicable to other tasks requiring spatial transformation. The source code of our project is available at https://github.com/RenYurui/Global-Flow-Local-Attention.
△ Less
Submitted 27 August, 2020;
originally announced August 2020.
-
Deep Image Spatial Transformation for Person Image Generation
Authors:
Yurui Ren,
Xiaoming Yu,
Junming Chen,
Thomas H. Li,
Ge Li
Abstract:
Pose-guided person image generation is to transform a source person image to a target pose. This task requires spatial manipulations of source data. However, Convolutional Neural Networks are limited by the lack of ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. Specifically,…
▽ More
Pose-guided person image generation is to transform a source person image to a target pose. This task requires spatial manipulations of source data. However, Convolutional Neural Networks are limited by the lack of ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. Specifically, our model first calculates the global correlations between sources and targets to predict flow fields. Then, the flowed local patch pairs are extracted from the feature maps to calculate the local attention coefficients. Finally, we warp the source features using a content-aware sampling method with the obtained local attention coefficients. The results of both subjective and objective experiments demonstrate the superiority of our model. Besides, additional results in video animation and view synthesis show that our model is applicable to other tasks requiring spatial transformation. Our source code is available at https://github.com/RenYurui/Global-Flow-Local-Attention.
△ Less
Submitted 18 March, 2020; v1 submitted 2 March, 2020;
originally announced March 2020.
-
StructureFlow: Image Inpainting via Structure-aware Appearance Flow
Authors:
Yurui Ren,
Xiaoming Yu,
Ruonan Zhang,
Thomas H. Li,
Shan Liu,
Ge Li
Abstract:
Image inpainting techniques have shown significant improvements by using deep neural networks recently. However, most of them may either fail to reconstruct reasonable structures or restore fine-grained textures. In order to solve this problem, in this paper, we propose a two-stage model which splits the inpainting task into two parts: structure reconstruction and texture generation. In the first…
▽ More
Image inpainting techniques have shown significant improvements by using deep neural networks recently. However, most of them may either fail to reconstruct reasonable structures or restore fine-grained textures. In order to solve this problem, in this paper, we propose a two-stage model which splits the inpainting task into two parts: structure reconstruction and texture generation. In the first stage, edge-preserved smooth images are employed to train a structure reconstructor which completes the missing structures of the inputs. In the second stage, based on the reconstructed structures, a texture generator using appearance flow is designed to yield image details. Experiments on multiple publicly available datasets show the superior performance of the proposed network.
△ Less
Submitted 11 August, 2019;
originally announced August 2019.
-
Deep AutoEncoder-based Lossy Geometry Compression for Point Clouds
Authors:
Wei Yan,
Yiting shao,
Shan Liu,
Thomas H Li,
Zhu Li,
Ge Li
Abstract:
Point cloud is a fundamental 3D representation which is widely used in real world applications such as autonomous driving. As a newly-developed media format which is characterized by complexity and irregularity, point cloud creates a need for compression algorithms which are more flexible than existing codecs. Recently, autoencoders(AEs) have shown their effectiveness in many visual analysis tasks…
▽ More
Point cloud is a fundamental 3D representation which is widely used in real world applications such as autonomous driving. As a newly-developed media format which is characterized by complexity and irregularity, point cloud creates a need for compression algorithms which are more flexible than existing codecs. Recently, autoencoders(AEs) have shown their effectiveness in many visual analysis tasks as well as image compression, which inspires us to employ it in point cloud compression. In this paper, we propose a general autoencoder-based architecture for lossy geometry point cloud compression. To the best of our knowledge, it is the first autoencoder-based geometry compression codec that directly takes point clouds as input rather than voxel grids or collections of images. Compared with handcrafted codecs, this approach adapts much more quickly to previously unseen media contents and media formats, meanwhile achieving competitive performance. Our architecture consists of a pointnet-based encoder, a uniform quantizer, an entropy estimation block and a nonlinear synthesis transformation module. In lossy geometry compression of point cloud, results show that the proposed method outperforms the test model for categories 1 and 3 (TMC13) published by MPEG-3DG group on the 125th meeting, and on average a 73.15\% BD-rate gain is achieved.
△ Less
Submitted 17 April, 2019;
originally announced May 2019.
-
Graph Convolutional Label Noise Cleaner: Train a Plug-and-play Action Classifier for Anomaly Detection
Authors:
Jia-Xing Zhong,
Nannan Li,
Weijie Kong,
Shan Liu,
Thomas H. Li,
Ge Li
Abstract:
Video anomaly detection under weak labels is formulated as a typical multiple-instance learning problem in previous works. In this paper, we provide a new perspective, i.e., a supervised learning task under noisy labels. In such a viewpoint, as long as cleaning away label noise, we can directly apply fully supervised action classifiers to weakly supervised anomaly detection, and take maximum advan…
▽ More
Video anomaly detection under weak labels is formulated as a typical multiple-instance learning problem in previous works. In this paper, we provide a new perspective, i.e., a supervised learning task under noisy labels. In such a viewpoint, as long as cleaning away label noise, we can directly apply fully supervised action classifiers to weakly supervised anomaly detection, and take maximum advantage of these well-developed classifiers. For this purpose, we devise a graph convolutional network to correct noisy labels. Based upon feature similarity and temporal consistency, our network propagates supervisory signals from high-confidence snippets to low-confidence ones. In this manner, the network is capable of providing cleaned supervision for action classifiers. During the test phase, we only need to obtain snippet-wise predictions from the action classifier without any extra post-processing. Extensive experiments on 3 datasets at different scales with 2 types of action classifiers demonstrate the efficacy of our method. Remarkably, we obtain the frame-level AUC score of 82.12% on UCF-Crime.
△ Less
Submitted 18 March, 2019;
originally announced March 2019.
-
Step-by-step Erasion, One-by-one Collection: A Weakly Supervised Temporal Action Detector
Authors:
Jia-Xing Zhong,
Nannan Li,
Weijie Kong,
Tao Zhang,
Thomas H. Li,
Ge Li
Abstract:
Weakly supervised temporal action detection is a Herculean task in understanding untrimmed videos, since no supervisory signal except the video-level category label is available on training data. Under the supervision of category labels, weakly supervised detectors are usually built upon classifiers. However, there is an inherent contradiction between classifier and detector; i.e., a classifier in…
▽ More
Weakly supervised temporal action detection is a Herculean task in understanding untrimmed videos, since no supervisory signal except the video-level category label is available on training data. Under the supervision of category labels, weakly supervised detectors are usually built upon classifiers. However, there is an inherent contradiction between classifier and detector; i.e., a classifier in pursuit of high classification performance prefers top-level discriminative video clips that are extremely fragmentary, whereas a detector is obliged to discover the whole action instance without missing any relevant snippet. To reconcile this contradiction, we train a detector by driving a series of classifiers to find new actionness clips progressively, via step-by-step erasion from a complete video. During the test phase, all we need to do is to collect detection results from the one-by-one trained classifiers at various erasing steps. To assist in the collection process, a fully connected conditional random field is established to refine the temporal localization outputs. We evaluate our approach on two prevailing datasets, THUMOS'14 and ActivityNet. The experiments show that our detector advances state-of-the-art weakly supervised temporal action detection results, and even compares with quite a few strongly supervised methods.
△ Less
Submitted 18 July, 2018; v1 submitted 8 July, 2018;
originally announced July 2018.
-
Exploiting the Value of the Center-dark Channel Prior for Salient Object Detection
Authors:
Chunbiao Zhu,
Wenhao Zhang,
Thomas H. Li,
Ge Li
Abstract:
Saliency detection aims to detect the most attractive objects in images and is widely used as a foundation for various applications. In this paper, we propose a novel salient object detection algorithm for RGB-D images using center-dark channel priors. First, we generate an initial saliency map based on a color saliency map and a depth saliency map of a given RGB-D image. Then, we generate a cente…
▽ More
Saliency detection aims to detect the most attractive objects in images and is widely used as a foundation for various applications. In this paper, we propose a novel salient object detection algorithm for RGB-D images using center-dark channel priors. First, we generate an initial saliency map based on a color saliency map and a depth saliency map of a given RGB-D image. Then, we generate a center-dark channel map based on center saliency and dark channel priors. Finally, we fuse the initial saliency map with the center dark channel map to generate the final saliency map. Extensive evaluations over four benchmark datasets demonstrate that our proposed method performs favorably against most of the state-of-the-art approaches. Besides, we further discuss the application of the proposed algorithm in small target detection and demonstrate the universal value of center-dark channel priors in the field of object detection.
△ Less
Submitted 14 May, 2018;
originally announced May 2018.
-
PDNet: Prior-model Guided Depth-enhanced Network for Salient Object Detection
Authors:
Chunbiao Zhu,
Xing Cai,
Kan Huang,
Thomas H Li,
Ge Li
Abstract:
Fully convolutional neural networks (FCNs) have shown outstanding performance in many computer vision tasks including salient object detection. However, there still remains two issues needed to be addressed in deep learning based saliency detection. One is the lack of tremendous amount of annotated data to train a network. The other is the lack of robustness for extracting salient objects in image…
▽ More
Fully convolutional neural networks (FCNs) have shown outstanding performance in many computer vision tasks including salient object detection. However, there still remains two issues needed to be addressed in deep learning based saliency detection. One is the lack of tremendous amount of annotated data to train a network. The other is the lack of robustness for extracting salient objects in images containing complex scenes. In this paper, we present a new architecture$ - $PDNet, a robust prior-model guided depth-enhanced network for RGB-D salient object detection. In contrast to existing works, in which RGB-D values of image pixels are fed directly to a network, the proposed architecture is composed of a master network for processing RGB values, and a sub-network making full use of depth cues and incorporate depth-based features into the master network. To overcome the limited size of the labeled RGB-D dataset for training, we employ a large conventional RGB dataset to pre-train the master network, which proves to contribute largely to the final accuracy. Extensive evaluations over five benchmark datasets demonstrate that our proposed method performs favorably against the state-of-the-art approaches.
△ Less
Submitted 13 October, 2018; v1 submitted 22 March, 2018;
originally announced March 2018.