-
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Authors:
Yuxuan Zhang,
Tianheng Cheng,
Rui Hu,
Lei Liu,
Heng Liu,
Longjin Ran,
Xiaoxin Chen,
Wenyu Liu,
Xinggang Wang
Abstract:
Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts while lacking further exploration of text prompts. In this paper, we empirically investigate what text prompt encoders (e.g., CLIP or LLM) are good for adapting SAM for referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method which exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that: (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM for accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 can obtain state-of-the-art performance on RefCOCO/+/g for referring expression segmentation and demonstrate the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing nearly 82% of parameters compared to previous SAM methods based on large multimodal models.
Submitted 28 June, 2024;
originally announced June 2024.
-
An experimental study of the response time in an edge-cloud continuum with ClusterLink
Authors:
Marc Michalke,
Fin Gentzen,
Admela Jukan,
Kfir Toledo,
Etai Lev Ran
Abstract:
In this paper, we conduct an experimental study to provide a general sense of the application response time implications of inter-cluster communication at the edge, using the example of a specific IoT-edge-cloud continuum solution from the EU Project ICOS called ClusterLink. We create an environment to emulate different networking topologies that include multiple cloud or edge site scenarios, and conduct a set of tests to compare the application response times via ClusterLink to direct communications in relation to node distances and request/response payload size. Our results show that, in an edge context, ClusterLink does not introduce a significant processing overhead to the communication for small payloads as compared to cloud. For higher payloads and on comparably more aged consumer hardware, ClusterLink version 0.2 introduces communication overhead relative to the delay experienced on the link.
Submitted 27 May, 2024;
originally announced May 2024.
-
DDF: A Novel Dual-Domain Image Fusion Strategy for Remote Sensing Image Semantic Segmentation with Unsupervised Domain Adaptation
Authors:
Lingyan Ran,
Lushuang Wang,
Tao Zhuo,
Yinghui Xing
Abstract:
Semantic segmentation of remote sensing images is a challenging and hot issue due to the large amount of unlabeled data. Unsupervised domain adaptation (UDA) has proven to be advantageous in incorporating unclassified information from the target domain. However, independently fine-tuning UDA models on the source and target domains has a limited effect on the outcome. This paper proposes a hybrid training strategy as well as a novel dual-domain image fusion strategy that effectively utilizes the original image, transformation image, and intermediate domain information. Moreover, to enhance the precision of pseudo-labels, we present a pseudo-label region-specific weight strategy. The efficacy of our approach is substantiated by extensive benchmark experiments and ablation studies conducted on the ISPRS Vaihingen and Potsdam datasets.
Submitted 5 March, 2024;
originally announced March 2024.
-
Semi-Supervised Semantic Segmentation Based on Pseudo-Labels: A Survey
Authors:
Lingyan Ran,
Yali Li,
Guoqiang Liang,
Yanning Zhang
Abstract:
Semantic segmentation is an important and popular research area in computer vision that focuses on classifying pixels in an image based on their semantics. However, supervised deep learning requires large amounts of data to train models and the process of labeling images pixel by pixel is time-consuming and laborious. This review aims to provide a first comprehensive and organized overview of the state-of-the-art research results on pseudo-label methods in the field of semi-supervised semantic segmentation, which we categorize from different perspectives and present specific methods for specific application areas. In addition, we explore the application of pseudo-label technology in medical and remote-sensing image segmentation. Finally, we also propose some feasible future research directions to address the existing challenges.
Submitted 4 March, 2024;
originally announced March 2024.
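As a rough illustration of the pseudo-label idea surveyed in the entry above, the sketch below shows one common baseline, confidence thresholding: a model's per-pixel class probabilities on unlabeled images become training targets only where the top probability is high enough. This is a minimal sketch and not a method taken from the survey; the function name `make_pseudo_labels`, the `IGNORE_INDEX` value, and the default threshold are all assumptions made for illustration.

```python
IGNORE_INDEX = 255  # hypothetical marker for pixels left without a pseudo-label


def make_pseudo_labels(probabilities, threshold=0.9):
    """Turn per-pixel class probabilities into pseudo-labels.

    probabilities: list of pixels, each a list of class probabilities.
    Pixels whose top probability is below `threshold` get IGNORE_INDEX,
    so a training loss can skip them.
    """
    labels = []
    for pixel_probs in probabilities:
        # Index of the most probable class for this pixel.
        best_class = max(range(len(pixel_probs)), key=pixel_probs.__getitem__)
        if pixel_probs[best_class] >= threshold:
            labels.append(best_class)
        else:
            labels.append(IGNORE_INDEX)
    return labels


# Three pixels, two classes: confident, uncertain, confident.
print(make_pseudo_labels([[0.95, 0.05], [0.6, 0.4], [0.1, 0.9]]))  # [0, 255, 1]
```

A higher threshold keeps fewer but cleaner pseudo-labels; a lower one keeps more pixels at the cost of more label noise.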
-
Distributed Generalized Nash Equilibria Seeking Algorithms Involving Synchronous and Asynchronous Schemes
Authors:
Huaqing Li,
Liang Ran,
Lifeng Zheng,
Zhe Li,
Jinhui Hu,
Jun Li,
Tingwen Huang
Abstract:
This paper considers a class of noncooperative games in which the feasible decision sets of all players are coupled together by a coupled inequality constraint. Adopting the variational inequality formulation of the game, we first introduce a new local edge-based equilibrium condition and develop a distributed primal-dual proximal algorithm with full information. Considering challenges when communication delays occur, we devise an asynchronous distributed algorithm to seek a generalized Nash equilibrium. This asynchronous scheme arbitrarily activates one player to start new computations independently at different iteration instants, which means that the picked player can use the involved out-dated information from itself and its neighbors to perform new updates. A distinctive attribute is that the proposed algorithms enable the derivation of new distributed forward-backward-like extensions. On the theoretical side, we provide explicit conditions on algorithm parameters, for instance the step-sizes, to establish a sublinear convergence rate for the proposed synchronous algorithm. Moreover, the asynchronous algorithm guarantees almost sure convergence in expectation under the same step-size conditions and some standard assumptions. An interesting observation is that our analysis approach improves the convergence rate of prior synchronous distributed forward-backward-based algorithms. Finally, the viability and performance of the proposed algorithms are demonstrated by numerical studies on the networked Cournot competition.
Submitted 11 February, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
Authors:
Lingmin Ran,
Xiaodong Cun,
Jia-Wei Liu,
Rui Zhao,
Song Zijie,
Xintao Wang,
Jussi Keppo,
Mike Zheng Shou
Abstract:
We introduce X-Adapter, a universal upgrader to enable the pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this goal by training an additional network to control the frozen upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. Additionally, X-Adapter adds trainable mapping layers that bridge the decoders from models of different versions for feature remapping. The remapped features will be used as guidance for the upgraded model. To enhance the guidance ability of X-Adapter, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to our strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionalities of the diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments and the results show that X-Adapter may facilitate wider application in the upgraded foundational diffusion model.
Submitted 23 April, 2024; v1 submitted 4 December, 2023;
originally announced December 2023.
-
Zero-Shot Object Goal Visual Navigation With Class-Independent Relationship Network
Authors:
Xinting Li,
Shiguang Zhang,
Yue LU,
Kerry Dang,
Lingyan Ran
Abstract:
This paper investigates the zero-shot object goal visual navigation problem. In the object goal visual navigation task, the agent needs to locate navigation targets from its egocentric visual input. "Zero-shot" means that the target the agent needs to find is not encountered during the training phase. To address the issue of coupling navigation ability with target features during training, we propose the Class-Independent Relationship Network (CIRN). This method combines target detection information with the relative semantic similarity between the target and the navigation target, and constructs a brand new state representation based on similarity ranking. This state representation does not include target features or environment features, effectively decoupling the agent's navigation ability from target features. In addition, a Graph Convolutional Network (GCN) is employed to learn the relationships between different objects based on their similarities. During testing, our approach demonstrates strong generalization capabilities, including zero-shot navigation tasks with different targets and environments. Through extensive experiments in the AI2-THOR virtual environment, our method outperforms the current state-of-the-art approaches in the zero-shot object goal visual navigation task. Furthermore, we conducted experiments in more challenging cross-target and cross-scene settings, which further validate the robustness and generalization ability of our method. Our code is available at: https://github.com/SmartAndCleverRobot/ICRA-CIRN.
Submitted 14 March, 2024; v1 submitted 15 October, 2023;
originally announced October 2023.
-
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
Authors:
David Junhao Zhang,
Jay Zhangjie Wu,
Jia-Wei Liu,
Rui Zhao,
Lingmin Ran,
Yuchao Gu,
Difei Gao,
Mike Zheng Shou
Abstract:
Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video of strong text-video correlation. After that, we propose a novel expert translation method that employs the latent-based VDMs to further upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 can produce high-quality videos of precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15G vs 72G). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
Submitted 17 October, 2023; v1 submitted 27 September, 2023;
originally announced September 2023.
-
Pre-train, Adapt and Detect: Multi-Task Adapter Tuning for Camouflaged Object Detection
Authors:
Yinghui Xing,
Dexuan Kong,
Shizhou Zhang,
Geng Chen,
Lingyan Ran,
Peng Wang,
Yanning Zhang
Abstract:
Camouflaged object detection (COD), aiming to segment camouflaged objects which exhibit similar patterns to the background, is a challenging task. Most existing works are dedicated to establishing specialized modules to identify camouflaged objects with complete and fine details, while the boundary cannot be well located for the lack of object-related semantics. In this paper, we propose a novel "pre-train, adapt and detect" paradigm to detect camouflaged objects. By introducing a large pre-trained model, abundant knowledge learned from massive multi-modal data can be directly transferred to COD. A lightweight parallel adapter is inserted to adjust the features suitable for the downstream COD task. Extensive experiments on four challenging benchmark datasets demonstrate that our method outperforms existing state-of-the-art COD models by large margins. Moreover, we design a multi-task learning scheme for tuning the adapter to exploit the shareable knowledge across different semantic classes. Comprehensive experimental results show that the generalization ability of our model can be substantially improved with multi-task adapter initialization on source tasks and multi-task adaptation on target tasks.
Submitted 22 August, 2023; v1 submitted 20 July, 2023;
originally announced July 2023.
-
A Novel Rapid-flooding Approach with Real-time Delay Compensation for Wireless Sensor Network Time Synchronization
Authors:
Fanrong Shi,
Simon X. Yang,
Xianguo Tuo,
Lili Ran,
Yuqing Huang
Abstract:
One-way-broadcast based flooding time synchronization algorithms are commonly used in wireless sensor networks (WSNs). However, the packet delay and clock drift pose challenges to accuracy, as they entail serious by-hop error accumulation problems in the WSNs. To overcome this, a rapid flooding multi-broadcast time synchronization with real-time delay compensation (RDC-RMTS) is proposed in this paper. By using a rapid-flooding protocol, flooding latency of the referenced time information is significantly reduced in the RDC-RMTS. In addition, a new joint clock skew-offset maximum likelihood estimation is developed to obtain accurate clock parameter estimates and the real-time packet delay estimate. Moreover, an innovative implementation of the RDC-RMTS is designed with an adaptive clock offset estimation. The experimental results indicate that the RDC-RMTS can easily reduce the variable delay and significantly slow the growth of by-hop error accumulation. Thus, the proposed RDC-RMTS can achieve accurate time synchronization in large-scale complex WSNs.
Submitted 22 July, 2022;
originally announced July 2022.
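To make the notion of estimating clock skew and offset in the entry above concrete, the sketch below fits a linear clock model, local = skew * reference + offset, to paired timestamps by ordinary least squares. This is not the RDC-RMTS estimator: it ignores packet-delay compensation entirely, and least squares coincides with maximum likelihood only under the assumption of independent Gaussian timestamp noise. The function names `fit_clock` and `to_reference_time` are hypothetical.

```python
def fit_clock(reference_times, local_times):
    """Least-squares fit of local = skew * reference + offset.

    Returns (skew, offset). Assumes paired timestamps with independent
    Gaussian noise on the local readings (an illustrative assumption).
    """
    n = len(reference_times)
    if n < 2 or n != len(local_times):
        raise ValueError("need at least two paired timestamps")
    mean_ref = sum(reference_times) / n
    mean_loc = sum(local_times) / n
    var_ref = sum((t - mean_ref) ** 2 for t in reference_times)
    if var_ref == 0:
        raise ValueError("reference timestamps must not all be equal")
    cov = sum(
        (t - mean_ref) * (l - mean_loc)
        for t, l in zip(reference_times, local_times)
    )
    skew = cov / var_ref
    offset = mean_loc - skew * mean_ref
    return skew, offset


def to_reference_time(local_time, skew, offset):
    """Map a local clock reading back onto the reference timeline."""
    return (local_time - offset) / skew


# Synthetic node clock that runs 100 ppm fast and starts 5 s ahead.
reference = [0.0, 10.0, 20.0, 30.0]
local = [5.0 + 1.0001 * t for t in reference]
skew, offset = fit_clock(reference, local)
print(skew, offset)
```

In a multi-hop network, each hop's estimation error feeds into the next hop's reference, which is the by-hop error accumulation the abstract refers to.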
-
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant
Authors:
Stan Weixian Lei,
Difei Gao,
Yuxuan Wang,
Dongxing Mao,
Zihan Liang,
Lingmin Ran,
Mike Zheng Shou
Abstract:
It is still a pipe dream that personal AI assistants on the phone and AR glasses can assist our daily life in addressing our questions like "how to adjust the date for this watch?" and "how to set its heating duration? (while pointing at an oven)". The queries used in conventional tasks (i.e. Video Question Answering, Video Retrieval, Moment Localization) are often factoid and based on pure text. In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR). Each of our questions is an image-box-text query that focuses on affordance of items in our daily life and expects relevant answer segments to be retrieved from a corpus of instructional video-transcript segments. To support the study of this TQVSR task, we construct a new dataset called AssistSR. We design novel guidelines to create high-quality samples. This dataset contains 3.2k multimodal questions on 1.6k video segments from instructional videos on diverse daily-used items. To address TQVSR, we develop a simple yet effective model called Dual Multimodal Encoders (DME) that significantly outperforms several baseline methods while still having large room for improvement in the future. Moreover, we present detailed ablation analyses. Code and data are available at https://github.com/StanLei52/TQVSR.
Submitted 10 October, 2022; v1 submitted 29 November, 2021;
originally announced November 2021.
-
Weakly-supervised Instance Segmentation via Class-agnostic Learning with Salient Images
Authors:
Xinggang Wang,
Jiapei Feng,
Bin Hu,
Qi Ding,
Longjin Ran,
Xiaoxin Chen,
Wenyu Liu
Abstract:
Humans have a strong class-agnostic object segmentation ability and can outline boundaries of unknown objects precisely, which motivates us to propose a box-supervised class-agnostic object segmentation (BoxCaseg) based solution for weakly-supervised instance segmentation. The BoxCaseg model is jointly trained using box-supervised images and salient images in a multi-task learning manner. The fine-annotated salient images provide class-agnostic and precise object localization guidance for box-supervised images. The object masks predicted by a pretrained BoxCaseg model are refined via a novel merged and dropped strategy as proxy ground truth to train a Mask R-CNN for weakly-supervised instance segmentation. Only using 7991 salient images, the weakly-supervised Mask R-CNN is on par with fully-supervised Mask R-CNN on PASCAL VOC and significantly outperforms previous state-of-the-art box-supervised instance segmentation methods on COCO. The source code, pretrained models and datasets are available at https://github.com/hustvl/BoxCaseg.
Submitted 3 April, 2021;
originally announced April 2021.