Search | arXiv e-print repository

Annotation Free Semantic Segmentation with Vision Foundation Models

Authors: Soroush Seifi, Daniel Olmeda Reino, Fabien Despinoy, Rahaf Aljundi

Abstract: Semantic Segmentation is one of the most challenging vision tasks, usually requiring large amounts of training data with expensive pixel level annotations. With the success of foundation models and especially vision-language models, recent works attempt to achieve zeroshot semantic segmentation while requiring either large-scale training or additional image/pixel level annotations. In this work, w… ▽ More Semantic Segmentation is one of the most challenging vision tasks, usually requiring large amounts of training data with expensive pixel level annotations. With the success of foundation models and especially vision-language models, recent works attempt to achieve zeroshot semantic segmentation while requiring either large-scale training or additional image/pixel level annotations. In this work, we generate free annotations for any semantic segmentation dataset using existing foundation models. We use CLIP to detect objects and SAM to generate high quality object masks. Next, we build a lightweight module on top of a self-supervised vision encoder, DinoV2, to align the patch features with a pretrained text encoder for zeroshot semantic segmentation. Our approach can bring language-based semantics to any pretrained vision encoder with minimal training. Our module is lightweight, uses foundation models as the sole source of supervision and shows impressive generalization capability from little training data with no annotation. △ Less

Submitted 24 May, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

arXiv:2310.01942 [pdf, other]

OOD Aware Supervised Contrastive Learning

Authors: Soroush Seifi, Daniel Olmeda Reino, Nikolay Chumerin, Rahaf Aljundi

Abstract: Out-of-Distribution (OOD) detection is a crucial problem for the safe deployment of machine learning models identifying samples that fall outside of the training distribution, i.e. in-distribution data (ID). Most OOD works focus on the classification models trained with Cross Entropy (CE) and attempt to fix its inherent issues. In this work we leverage powerful representation learned with Supervis… ▽ More Out-of-Distribution (OOD) detection is a crucial problem for the safe deployment of machine learning models identifying samples that fall outside of the training distribution, i.e. in-distribution data (ID). Most OOD works focus on the classification models trained with Cross Entropy (CE) and attempt to fix its inherent issues. In this work we leverage powerful representation learned with Supervised Contrastive (SupCon) training and propose a holistic approach to learn a classifier robust to OOD data. We extend SupCon loss with two additional contrast terms. The first term pushes auxiliary OOD representations away from ID representations without imposing any constraints on similarities among auxiliary data. The second term pushes OOD features far from the existing class prototypes, while pushing ID representations closer to their corresponding class prototype. When auxiliary OOD data is not available, we propose feature mixing techniques to efficiently generate pseudo-OOD features. Our solution is simple and efficient and acts as a natural extension of the closed-set supervised contrastive representation learning. We compare against different OOD detection methods on the common benchmarks and show state-of-the-art results. △ Less

Submitted 3 October, 2023; originally announced October 2023.

arXiv:2306.10850 [pdf]

Detection of Sensor-To-Sensor Variations using Explainable AI

Authors: Sarah Seifi, Sebastian A. Schober, Cecilia Carbonelli, Lorenzo Servadei, Robert Wille

Abstract: With the growing concern for air quality and its impact on human health, interest in environmental gas monitoring has increased. However, chemi-resistive gas sensing devices are plagued by issues of sensor reproducibility during manufacturing. This study proposes a novel approach for detecting sensor-to-sensor variations in sensing devices using the explainable AI (XAI) method of SHapley Additive… ▽ More With the growing concern for air quality and its impact on human health, interest in environmental gas monitoring has increased. However, chemi-resistive gas sensing devices are plagued by issues of sensor reproducibility during manufacturing. This study proposes a novel approach for detecting sensor-to-sensor variations in sensing devices using the explainable AI (XAI) method of SHapley Additive exPlanations (SHAP). This is achieved by identifying sensors that contribute the most to environmental gas concentration estimation via machine learning, and measuring the similarity of feature rankings between sensors to flag deviations or outliers. The methodology is tested using artificial and realistic Ozone concentration profiles to train a Gated Recurrent Unit (GRU) model. Two applications were explored in the study: the detection of wrong behaviors of sensors in the train dataset, and the detection of deviations in the test dataset. By training the GRU with the pruned train dataset, we could reduce computational costs while improving the model performance. Overall, the results show that our approach improves the understanding of sensor behavior, successfully detects sensor deviations down to 5-10% from the normal behavior, and leads to more efficient model preparation and calibration. Our method provides a novel solution for identifying deviating sensors, linking inconsistencies in hardware to sensor-to-sensor variations in the manufacturing process on an AI model-level. △ Less

Submitted 19 June, 2023; originally announced June 2023.

Comments: 6 pages, 6 figures, accepted at Smart Systems Integration Conference and Exhibition 2023

arXiv:2108.11717 [pdf, other]

Glimpse-Attend-and-Explore: Self-Attention for Active Visual Exploration

Authors: Soroush Seifi, Abhishek Jha, Tinne Tuytelaars

Abstract: Active visual exploration aims to assist an agent with a limited field of view to understand its environment based on partial observations made by choosing the best viewing directions in the scene. Recent methods have tried to address this problem either by using reinforcement learning, which is difficult to train, or by uncertainty maps, which are task-specific and can only be implemented for den… ▽ More Active visual exploration aims to assist an agent with a limited field of view to understand its environment based on partial observations made by choosing the best viewing directions in the scene. Recent methods have tried to address this problem either by using reinforcement learning, which is difficult to train, or by uncertainty maps, which are task-specific and can only be implemented for dense prediction tasks. In this paper, we propose the Glimpse-Attend-and-Explore model which: (a) employs self-attention to guide the visual exploration instead of task-specific uncertainty maps; (b) can be used for both dense and sparse prediction tasks; and (c) uses a contrastive stream to further improve the representations learned. Unlike previous works, we show the application of our model on multiple tasks like reconstruction, segmentation and classification. Our model provides encouraging results while being less dependent on dataset bias in driving the exploration. We further perform an ablation study to investigate the features and attention learned by our model. Finally, we show that our self-attention module learns to attend different regions of the scene by minimizing the loss on the downstream task. Code: https://github.com/soroushseifi/glimpse-attend-explore. △ Less

Submitted 26 August, 2021; originally announced August 2021.

arXiv:2007.11548 [pdf, other]

Attend and Segment: Attention Guided Active Semantic Segmentation

Authors: Soroush Seifi, Tinne Tuytelaars

Abstract: In a dynamic environment, an agent with a limited field of view/resource cannot fully observe the scene before attempting to parse it. The deployment of common semantic segmentation architectures is not feasible in such settings. In this paper we propose a method to gradually segment a scene given a sequence of partial observations. The main idea is to refine an agent's understanding of the enviro… ▽ More In a dynamic environment, an agent with a limited field of view/resource cannot fully observe the scene before attempting to parse it. The deployment of common semantic segmentation architectures is not feasible in such settings. In this paper we propose a method to gradually segment a scene given a sequence of partial observations. The main idea is to refine an agent's understanding of the environment by attending the areas it is most uncertain about. Our method includes a self-supervised attention mechanism and a specialized architecture to maintain and exploit spatial memory maps for filling-in the unseen areas in the environment. The agent can select and attend an area while relying on the cues coming from the visited areas to hallucinate the other parts. We reach a mean pixel-wise accuracy of 78.1%, 80.9% and 76.5% on CityScapes, CamVid, and Kitti datasets by processing only 18% of the image pixels (10 retina-like glimpses). We perform an ablation study on the number of glimpses, input image size and effectiveness of retina-like glimpses. We compare our method to several baselines and show that the optimal results are achieved by having access to a very low resolution view of the scene at the first timestep. △ Less

Submitted 22 July, 2020; originally announced July 2020.

arXiv:1909.10312 [pdf, other]

How to improve CNN-based 6-DoF camera pose estimation

Authors: Soroush Seifi, Tinne Tuytelaars

Abstract: Convolutional neural networks (CNNs) and transfer learning have recently been used for 6 degrees of freedom (6-DoF) camera pose estimation. While they do not reach the same accuracy as visual SLAM-based approaches and are restricted to a specific environment, they excel in robustness and can be applied even to a single image. In this paper, we study PoseNet [1] and investigate modifications based… ▽ More Convolutional neural networks (CNNs) and transfer learning have recently been used for 6 degrees of freedom (6-DoF) camera pose estimation. While they do not reach the same accuracy as visual SLAM-based approaches and are restricted to a specific environment, they excel in robustness and can be applied even to a single image. In this paper, we study PoseNet [1] and investigate modifications based on datasets' characteristics to improve the accuracy of the pose estimates. In particular, we emphasize the importance of field-of-view over image resolution; we present a data augmentation scheme to reduce overfitting; we study the effect of Long-Short-Term-Memory (LSTM) cells. Lastly, we combine these modifications and improve PoseNet's performance for monocular CNN based camera pose regression. △ Less

Submitted 28 November, 2019; v1 submitted 23 September, 2019; originally announced September 2019.

Comments: Accepted at Deep Learning for Visual SLAM workshop at ICCV 2019

arXiv:1909.10304 [pdf, other]

Where to Look Next: Unsupervised Active Visual Exploration on 360° Input

Authors: Soroush Seifi, Tinne Tuytelaars

Abstract: We address the problem of active visual exploration of large 360° inputs. In our setting an active agent with a limited camera bandwidth explores its 360° environment by changing its viewing direction at limited discrete time steps. As such, it observes the world as a sequence of narrow field-of-view 'glimpses', deciding for itself where to look next. Our proposed method exceeds previous works' pe… ▽ More We address the problem of active visual exploration of large 360° inputs. In our setting an active agent with a limited camera bandwidth explores its 360° environment by changing its viewing direction at limited discrete time steps. As such, it observes the world as a sequence of narrow field-of-view 'glimpses', deciding for itself where to look next. Our proposed method exceeds previous works' performance by a significant margin without the need for deep reinforcement learning or training separate networks as sidekicks. A key component of our system are the spatial memory maps that make the system aware of the glimpses' orientations (locations in the 360° image). Further, we stress the advantages of retina-like glimpses when the agent's sensor bandwidth and time-steps are limited. Finally, we use our trained model to do classification of the whole scene using only the information observed in the glimpses. △ Less

Submitted 28 November, 2019; v1 submitted 23 September, 2019; originally announced September 2019.

Comments: Oral Presentation and best Paper Award at 360 Perception and Interaction Workshop at ICCV 2019

arXiv:1909.09690 [pdf]

A Deep Learning-Based Approach for Measuring the Domain Similarity of Persian Texts

Authors: Hossein Keshavarz, Shohreh Tabatabayi Seifi, Mohammad Izadi

Abstract: In this paper, we propose a novel approach for measuring the degree of similarity between categories of two pieces of Persian text, which were published as descriptions of two separate advertisements. We built an appropriate dataset for this work using a dataset which consists of advertisements posted on an e-commerce website. We generated a significant number of paired texts from this dataset and… ▽ More In this paper, we propose a novel approach for measuring the degree of similarity between categories of two pieces of Persian text, which were published as descriptions of two separate advertisements. We built an appropriate dataset for this work using a dataset which consists of advertisements posted on an e-commerce website. We generated a significant number of paired texts from this dataset and assigned each pair a score from 0 to 3, which demonstrates the degree of similarity between the domains of the pair. In this work, we represent words with word embedding vectors derived from word2vec. Then deep neural network models are used to represent texts. Eventually, we employ concatenation of absolute difference and bit-wise multiplication and a fully-connected neural network to produce a probability distribution vector for the score of the pairs. Through a supervised learning approach, we trained our model on a GPU, and our best model achieved an F1 score of 0.9865. △ Less

Submitted 24 September, 2019; v1 submitted 12 September, 2019; originally announced September 2019.

Showing 1–8 of 8 results for author: Seifi, S