Showing 1–29 of 29 results for author: Li, T H

  1. arXiv:2311.17099  [pdf, other]

    cs.CV cs.AI

    StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences

    Authors: Shangkun Sun, Jiaming Liu, Thomas H. Li, Huaxia Li, Guoqing Liu, Wei Gao

    Abstract: Occlusions between consecutive frames have long posed a significant challenge in optical flow estimation. The inherent ambiguity introduced by occlusions directly violates the brightness constancy constraint and considerably hinders pixel-to-pixel matching. To address this issue, multi-frame optical flow methods leverage adjacent frames to mitigate the local ambiguity. Nevertheless, prior multi-fr…

    Submitted 28 November, 2023; originally announced November 2023.

  2. arXiv:2311.15075  [pdf, other]

    cs.CV

    Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding

    Authors: Ruyang Liu, Jingjia Huang, Wei Gao, Thomas H. Li, Ge Li

    Abstract: Large-scale image-language pretrained models, e.g., CLIP, have demonstrated remarkable proficiency in acquiring general multi-modal knowledge through web-scale image-text data. Despite the impressive performance of image-language models on various image tasks, how to effectively expand them on general video understanding remains an area of ongoing exploration. In this paper, we investigate the ima…

    Submitted 25 November, 2023; originally announced November 2023.

  3. arXiv:2310.19011  [pdf, other]

    cs.CV

    Efficient Test-Time Adaptation for Super-Resolution with Second-Order Degradation and Reconstruction

    Authors: Zeshuai Deng, Zhuokun Chen, Shuaicheng Niu, Thomas H. Li, Bohan Zhuang, Mingkui Tan

    Abstract: Image super-resolution (SR) aims to learn a mapping from low-resolution (LR) to high-resolution (HR) using paired HR-LR training images. Conventional SR methods typically gather the paired training data by synthesizing LR images from HR images using a predetermined degradation model, e.g., Bicubic down-sampling. However, the realistic degradation type of test images may mismatch with the training-…

    Submitted 29 October, 2023; originally announced October 2023.

    Comments: Accepted by 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  4. arXiv:2310.07473  [pdf, other]

    cs.CV cs.RO

    FGPrompt: Fine-grained Goal Prompting for Image-goal Navigation

    Authors: Xinyu Sun, Peihao Chen, Jugang Fan, Thomas H. Li, Jian Chen, Mingkui Tan

    Abstract: Learning to navigate to an image-specified goal is an important but challenging task for autonomous systems. The agent is required to reason the goal location from where a picture is shot. Existing methods try to solve this problem by learning a navigation policy, which captures semantic features of the goal image and observation image independently and lastly fuses them for predicting a sequence…

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: Accepted by NeurIPS 2023

  5. arXiv:2309.15785  [pdf, other]

    cs.CV

    BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

    Authors: Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, Ge Li

    Abstract: The recent progress in Large Language Models (LLM) has spurred various advancements in image-language conversation agents, while how to build a proficient video-based dialogue system is still under exploration. Considering the extensive scale of LLM and visual backbone, minimal GPU memory is left for facilitating effective temporal modeling, which is crucial for comprehending and providing feedbac…

    Submitted 27 June, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

  6. arXiv:2308.07997  [pdf, other]

    cs.CV cs.RO

    $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

    Authors: Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H. Li, Gaowen Liu, Mingkui Tan, Chuang Gan

    Abstract: We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data. Normally, the instructions have complex grammatical structures and often contain various action descriptions (e.g., "proceed beyond", "depart from"…

    Submitted 15 August, 2023; originally announced August 2023.

  7. arXiv:2307.11984  [pdf, other]

    cs.CV cs.CL

    Learning Vision-and-Language Navigation from YouTube Videos

    Authors: Kunyang Lin, Peihao Chen, Diwei Huang, Thomas H. Li, Mingkui Tan, Chuang Gan

    Abstract: Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions. Existing VLN methods suffer from training on small-scale environments or unreasonable path-instruction datasets, limiting the generalization to unseen environments. There are massive house tour videos on YouTube, providing abundant real navigation experience…

    Submitted 22 July, 2023; originally announced July 2023.

    Comments: Accepted by ICCV 2023

  8. arXiv:2303.13826  [pdf, other]

    cs.CV

    Hard Sample Matters a Lot in Zero-Shot Quantization

    Authors: Huantong Li, Xiangmiao Wu, Fanbing Lv, Daihai Liao, Thomas H. Li, Yonggang Zhang, Bo Han, Mingkui Tan

    Abstract: Zero-shot quantization (ZSQ) is promising for compressing and accelerating deep neural networks when the data for training full-precision models are inaccessible. In ZSQ, network quantization is performed using synthetic samples, thus, the performance of quantized models depends heavily on the quality of synthetic samples. Nonetheless, we find that the synthetic samples constructed in existing ZSQ…

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: 12 pages, CVPR 2023

  9. arXiv:2303.11623  [pdf, other]

    cs.CV

    Detecting the open-world objects with the help of the Brain

    Authors: Shuailei Ma, Yuefeng Wang, Ying Wei, Peihao Chen, Zhixiang Ye, Jiaqi Fan, Enming Zhang, Thomas H. Li

    Abstract: Open World Object Detection (OWOD) is a novel computer vision task with a considerable challenge, bridging the gap between classic object detection (OD) benchmarks and real-world object detection. In addition to detecting and classifying seen/known objects, OWOD algorithms are expected to detect unseen/unknown objects and incrementally learn them. The natural instinct of humans to identify unknown…

    Submitted 21 March, 2023; originally announced March 2023.

    Comments: arXiv admin note: text overlap with arXiv:2301.01970

  10. arXiv:2302.14674  [pdf, other]

    cs.RO

    LIO-PPF: Fast LiDAR-Inertial Odometry via Incremental Plane Pre-Fitting and Skeleton Tracking

    Authors: Xingyu Chen, Peixi Wu, Ge Li, Thomas H. Li

    Abstract: As a crucial infrastructure of intelligent mobile robots, LiDAR-Inertial odometry (LIO) provides the basic capability of state estimation by tracking LiDAR scans. The high-accuracy tracking generally involves the kNN search, which is used with minimizing the point-to-plane distance. The cost for this, however, is maintaining a large local map and performing kNN plane fit for each point. In this wo…

    Submitted 3 August, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

    Comments: IROS 2023

  11. arXiv:2301.11116  [pdf, other]

    cs.CV cs.AI

    Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

    Authors: Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, Thomas H. Li

    Abstract: Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, thus attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the…

    Submitted 26 January, 2023; originally announced January 2023.

  12. arXiv:2301.01970  [pdf, other]

    cs.CV

    CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection

    Authors: Shuailei Ma, Yuefeng Wang, Jiaqi Fan, Ying Wei, Thomas H. Li, Hongli Liu, Fanbing Lv

    Abstract: Open-world object detection (OWOD), as a more general and challenging goal, requires the model trained from data on known objects to detect both known and unknown objects and incrementally learn to identify these unknown objects. The existing works which employ standard detection framework and fixed pseudo-labelling mechanism (PLM) have the following problems: (i) The inclusion of detecting unknow…

    Submitted 27 March, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: CVPR 2023 camera-ready version

  13. arXiv:2210.07506  [pdf, other]

    cs.CV

    Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

    Authors: Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H. Li, Mingkui Tan, Chuang Gan

    Abstract: We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions. The instructions often contain descriptions of objects in the environment. To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the envi…

    Submitted 14 October, 2022; originally announced October 2022.

    Comments: Accepted by NeurIPS 2022

  14. arXiv:2210.07505  [pdf, other]

    cs.CV cs.RO

    Learning Active Camera for Multi-Object Navigation

    Authors: Peihao Chen, Dongyu Ji, Kunyang Lin, Weiwen Hu, Wenbing Huang, Thomas H. Li, Mingkui Tan, Chuang Gan

    Abstract: Getting robots to navigate to multiple objects autonomously is essential yet difficult in robot applications. One of the key challenges is how to explore environments efficiently with camera sensors only. Existing navigation methods mainly focus on fixed cameras and few attempts have been made to navigate with active cameras. As a result, the agent may take a very long time to perceive the environ…

    Submitted 14 October, 2022; originally announced October 2022.

    Comments: Accepted by NeurIPS 2022

  15. arXiv:2210.06096  [pdf, other]

    cs.CV

    Masked Motion Encoding for Self-Supervised Video Representation Learning

    Authors: Xinyu Sun, Peihao Chen, Liangwei Chen, Changhao Li, Thomas H. Li, Mingkui Tan, Chuang Gan

    Abstract: How to learn discriminative video representation from unlabeled videos is challenging but crucial for video analysis. The latest attempts seek to learn a representation model by predicting the appearance contents in the masked regions. However, simply masking and recovering appearance contents may not be sufficient to model temporal clues as the appearance contents can be easily reconstructed from…

    Submitted 23 March, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: CVPR 2023 camera-ready version

  16. arXiv:2210.05479  [pdf, other]

    cs.CV cs.AI cs.LG

    Frequency-Aware Self-Supervised Monocular Depth Estimation

    Authors: Xingyu Chen, Thomas H. Li, Ruonan Zhang, Ge Li

    Abstract: We present two versatile methods to generally enhance self-supervised monocular depth estimation (MDE) models. The high generalizability of our methods is achieved by solving the fundamental and ubiquitous problems in photometric loss function. In particular, from the perspective of spatial frequency, we first propose Ambiguity-Masking to suppress the incorrect supervision under photometric loss a…

    Submitted 14 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: 8 pages, 5 figures, published to WACV2023

  17. arXiv:2210.00411  [pdf, other]

    cs.CV cs.AI cs.LG

    Self-Supervised Monocular Depth Estimation: Solving the Edge-Fattening Problem

    Authors: Xingyu Chen, Ruonan Zhang, Ji Jiang, Yan Wang, Ge Li, Thomas H. Li

    Abstract: Self-supervised monocular depth estimation (MDE) models universally suffer from the notorious edge-fattening issue. Triplet loss, as a widespread metric learning strategy, has largely succeeded in many computer vision applications. In this paper, we redesign the patch-based triplet loss in MDE to alleviate the ubiquitous edge-fattening issue. We show two drawbacks of the raw triplet loss in MDE an…

    Submitted 3 January, 2023; v1 submitted 1 October, 2022; originally announced October 2022.

    Comments: 8 pages, 7 figures, published to WACV2023

  18. arXiv:2204.13952  [pdf, other]

    cs.CV eess.IV

    Deep Geometry Post-Processing for Decompressed Point Clouds

    Authors: Xiaoqing Fan, Ge Li, Dingquan Li, Yurui Ren, Wei Gao, Thomas H. Li

    Abstract: Point cloud compression plays a crucial role in reducing the huge cost of data storage and transmission. However, distortions can be introduced into the decompressed point clouds due to quantization. In this paper, we propose a novel learning-based post-processing method to enhance the decompressed point clouds. Specifically, a voxelized point cloud is first divided into small cubes. Then, a 3D co…

    Submitted 29 April, 2022; originally announced April 2022.

  19. arXiv:2204.06160  [pdf, other]

    cs.CV cs.AI

    Neural Texture Extraction and Distribution for Controllable Person Image Synthesis

    Authors: Yurui Ren, Xiaoqing Fan, Ge Li, Shan Liu, Thomas H. Li

    Abstract: We deal with the controllable person image synthesis task which aims to re-render a human from a reference image with explicit control over body pose and appearance. Observing that person images are highly structured, we propose to generate desired images by extracting and distributing semantic entities of reference images. To achieve this goal, a neural texture extraction and distribution operati…

    Submitted 12 April, 2022; originally announced April 2022.

  20. arXiv:2109.08379  [pdf, other]

    cs.CV cs.AI

    PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering

    Authors: Yurui Ren, Ge Li, Yuanqi Chen, Thomas H. Li, Shan Liu

    Abstract: Generating portrait images by controlling the motions of existing faces is an important task of great consequence to social media industries. For easy use and intuitive control, semantically meaningful and fully disentangled parameters should be used as modifications. However, many existing techniques do not provide such fine-grained controls or use indirect editing methods i.e. mimic motions of o…

    Submitted 17 September, 2021; originally announced September 2021.

  21. arXiv:2108.01823  [pdf, other]

    cs.CV

    Combining Attention with Flow for Person Image Synthesis

    Authors: Yurui Ren, Yubo Wu, Thomas H. Li, Shan Liu, Ge Li

    Abstract: Pose-guided person image synthesis aims to synthesize person images by transforming reference images into target poses. In this paper, we observe that the commonly used spatial transformation blocks have complementary advantages. We propose a novel model by combining the attention operation with the flow-based operation. Our model not only takes the advantage of the attention operation to generate…

    Submitted 3 August, 2021; originally announced August 2021.

  22. Deep Spatial Transformation for Pose-Guided Person Image Generation and Animation

    Authors: Yurui Ren, Ge Li, Shan Liu, Thomas H. Li

    Abstract: Pose-guided person image generation and animation aim to transform a source person image to target poses. These tasks require spatial manipulation of source data. However, Convolutional Neural Networks are limited by the lack of ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level.…

    Submitted 27 August, 2020; originally announced August 2020.

    Comments: arXiv admin note: text overlap with arXiv:2003.00696

  23. arXiv:2003.00696  [pdf, other]

    cs.CV cs.AI

    Deep Image Spatial Transformation for Person Image Generation

    Authors: Yurui Ren, Xiaoming Yu, Junming Chen, Thomas H. Li, Ge Li

    Abstract: Pose-guided person image generation is to transform a source person image to a target pose. This task requires spatial manipulations of source data. However, Convolutional Neural Networks are limited by the lack of ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. Specifically,…

    Submitted 18 March, 2020; v1 submitted 2 March, 2020; originally announced March 2020.

  24. arXiv:1908.03852  [pdf, other]

    cs.CV

    StructureFlow: Image Inpainting via Structure-aware Appearance Flow

    Authors: Yurui Ren, Xiaoming Yu, Ruonan Zhang, Thomas H. Li, Shan Liu, Ge Li

    Abstract: Image inpainting techniques have shown significant improvements by using deep neural networks recently. However, most of them may either fail to reconstruct reasonable structures or restore fine-grained textures. In order to solve this problem, in this paper, we propose a two-stage model which splits the inpainting task into two parts: structure reconstruction and texture generation. In the first…

    Submitted 11 August, 2019; originally announced August 2019.

  25. arXiv:1905.03691  [pdf, other]

    cs.CV cs.MM eess.IV

    Deep AutoEncoder-based Lossy Geometry Compression for Point Clouds

    Authors: Wei Yan, Yiting Shao, Shan Liu, Thomas H. Li, Zhu Li, Ge Li

    Abstract: Point cloud is a fundamental 3D representation which is widely used in real world applications such as autonomous driving. As a newly-developed media format which is characterized by complexity and irregularity, point cloud creates a need for compression algorithms which are more flexible than existing codecs. Recently, autoencoders (AEs) have shown their effectiveness in many visual analysis tasks…

    Submitted 17 April, 2019; originally announced May 2019.

  26. arXiv:1903.07256  [pdf, other]

    cs.CV

    Graph Convolutional Label Noise Cleaner: Train a Plug-and-play Action Classifier for Anomaly Detection

    Authors: Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H. Li, Ge Li

    Abstract: Video anomaly detection under weak labels is formulated as a typical multiple-instance learning problem in previous works. In this paper, we provide a new perspective, i.e., a supervised learning task under noisy labels. In such a viewpoint, as long as cleaning away label noise, we can directly apply fully supervised action classifiers to weakly supervised anomaly detection, and take maximum advan…

    Submitted 18 March, 2019; originally announced March 2019.

    Comments: To appear in CVPR 2019

  27. arXiv:1807.02929  [pdf, other]

    cs.CV

    Step-by-step Erasion, One-by-one Collection: A Weakly Supervised Temporal Action Detector

    Authors: Jia-Xing Zhong, Nannan Li, Weijie Kong, Tao Zhang, Thomas H. Li, Ge Li

    Abstract: Weakly supervised temporal action detection is a Herculean task in understanding untrimmed videos, since no supervisory signal except the video-level category label is available on training data. Under the supervision of category labels, weakly supervised detectors are usually built upon classifiers. However, there is an inherent contradiction between classifier and detector; i.e., a classifier in…

    Submitted 18 July, 2018; v1 submitted 8 July, 2018; originally announced July 2018.

    Comments: To Appear in ACM Multimedia 2018

  28. arXiv:1805.05132  [pdf, other]

    cs.CV

    Exploiting the Value of the Center-dark Channel Prior for Salient Object Detection

    Authors: Chunbiao Zhu, Wenhao Zhang, Thomas H. Li, Ge Li

    Abstract: Saliency detection aims to detect the most attractive objects in images and is widely used as a foundation for various applications. In this paper, we propose a novel salient object detection algorithm for RGB-D images using center-dark channel priors. First, we generate an initial saliency map based on a color saliency map and a depth saliency map of a given RGB-D image. Then, we generate a cente…

    Submitted 14 May, 2018; originally announced May 2018.

    Comments: Project website: https://chunbiaozhu.github.io/ACVR2017/

  29. arXiv:1803.08636  [pdf, other]

    cs.CV cs.AI cs.MM

    PDNet: Prior-model Guided Depth-enhanced Network for Salient Object Detection

    Authors: Chunbiao Zhu, Xing Cai, Kan Huang, Thomas H. Li, Ge Li

    Abstract: Fully convolutional neural networks (FCNs) have shown outstanding performance in many computer vision tasks including salient object detection. However, there still remain two issues that need to be addressed in deep learning based saliency detection. One is the lack of a tremendous amount of annotated data to train a network. The other is the lack of robustness for extracting salient objects in image…

    Submitted 13 October, 2018; v1 submitted 22 March, 2018; originally announced March 2018.

    Comments: This paper is under review. Project website: https://github.com/ChunbiaoZhu/PDNet/