Search | arXiv e-print repository

arXiv:2405.20917 [pdf, other]

Learning to Estimate System Specifications in Linear Temporal Logic using Transformers and Mamba

Authors: İlker Işık, Ebru Aydin Gol, Ramazan Gokberk Cinbis

Abstract: Temporal logic is a framework for representing and reasoning about propositions that evolve over time. It is commonly used for specifying requirements in various domains, including hardware and software systems, as well as robotics. Specification mining or formula generation involves extracting temporal logic formulae from system traces and has numerous applications, such as detecting bugs and improving interpretability. Although there has been a surge of deep learning-based methods for temporal logic satisfiability checking in recent years, the specification mining literature has been lagging behind in adopting deep learning methods despite their many advantages, such as scalability. In this paper, we introduce autoregressive models that can generate linear temporal logic formulae from traces, towards addressing the specification mining problem. We propose multiple architectures for this task: transformer encoder-decoder, decoder-only transformer, and Mamba, which is an emerging alternative to transformer models. Additionally, we devise a metric for quantifying the distinctiveness of the generated formulae and a straightforward algorithm to enforce the syntax constraints. Our experiments show that the proposed architectures yield promising results, generating correct and distinct formulae at a fraction of the compute cost needed for the combinatorial baseline.

Submitted 31 May, 2024; originally announced May 2024.

Comments: 20 pages, 15 figures

arXiv:2307.11823 [pdf, other]

HybridAugment++: Unified Frequency Spectra Perturbations for Model Robustness

Authors: Mehmet Kerim Yucel, Ramazan Gokberk Cinbis, Pinar Duygulu

Abstract: Convolutional Neural Networks (CNN) are known to exhibit poor generalization performance under distribution shifts. Their generalization has been studied extensively, and one line of work approaches the problem from a frequency-centric perspective. These studies highlight the fact that humans and CNNs might focus on different frequency components of an image. First, inspired by these observations, we propose a simple yet effective data augmentation method HybridAugment that reduces the reliance of CNNs on high-frequency components, and thus improves their robustness while keeping their clean accuracy high. Second, we propose HybridAugment++, which is a hierarchical augmentation method that attempts to unify various frequency-spectrum augmentations. HybridAugment++ builds on HybridAugment, and also reduces the reliance of CNNs on the amplitude component of images, and promotes phase information instead. This unification yields results that are competitive with or better than the state of the art on clean accuracy (CIFAR-10/100 and ImageNet), corruption benchmarks (ImageNet-C, CIFAR-10-C and CIFAR-100-C), adversarial robustness on CIFAR-10 and out-of-distribution detection on various datasets. HybridAugment and HybridAugment++ are implemented in a few lines of code and do not require extra data, ensemble models or additional networks.

Submitted 21 July, 2023; originally announced July 2023.

Comments: Accepted to ICCV 2023

arXiv:2306.07890 [pdf, other]

VISION Datasets: A Benchmark for Vision-based InduStrial InspectiON

Authors: Haoping Bai, Shancong Mou, Tatiana Likhomanenko, Ramazan Gokberk Cinbis, Oncel Tuzel, Ping Huang, Jiulong Shan, Jianjun Shi, Meng Cao

Abstract: Despite progress in vision-based inspection algorithms, real-world industrial challenges -- specifically in data availability, quality, and complex production requirements -- often remain under-addressed. We introduce the VISION Datasets, a diverse collection of 14 industrial inspection datasets, uniquely poised to meet these challenges. Unlike previous datasets, VISION brings versatility to defect detection, offering annotation masks across all splits and catering to various detection methodologies. Our datasets also feature instance-segmentation annotation, enabling precise defect identification. With a total of 18k images encompassing 44 defect types, VISION strives to mirror a wide range of real-world production scenarios. By supporting two ongoing challenge competitions on the VISION Datasets, we hope to foster further advancements in vision-based industrial inspection.

Submitted 17 June, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

arXiv:2304.12161 [pdf, other]

Meta-tuning Loss Functions and Data Augmentation for Few-shot Object Detection

Authors: Berkan Demirel, Orhun Buğra Baran, Ramazan Gokberk Cinbis

Abstract: Few-shot object detection, the problem of modelling novel object detection categories with few training instances, is an emerging topic in the area of few-shot learning and object detection. Contemporary techniques can be divided into two groups: fine-tuning based and meta-learning based approaches. While meta-learning approaches aim to learn dedicated meta-models for mapping samples to novel class models, fine-tuning approaches tackle few-shot detection in a simpler manner, by adapting the detection model to novel classes through gradient based optimization. Despite their simplicity, fine-tuning based approaches typically yield competitive detection results. Based on this observation, we focus on the role of loss functions and augmentations as the force driving the fine-tuning process, and propose to tune their dynamics through meta-learning principles. The proposed training scheme, therefore, allows learning inductive biases that can boost few-shot detection, while keeping the advantages of fine-tuning based approaches. In addition, the proposed approach yields interpretable loss functions, as opposed to highly parametric and complex few-shot meta-models. The experimental results highlight the merits of the proposed scheme, with significant improvements over the strong fine-tuning based few-shot detection baselines on benchmark Pascal VOC and MS-COCO datasets, in terms of both standard and generalized few-shot performance metrics.

Submitted 24 April, 2023; originally announced April 2023.

Comments: To appear at IEEE/CVF CVPR 2023

arXiv:2204.13492 [pdf, other]

Representation Recycling for Streaming Video Analysis

Authors: Can Ufuk Ertenli, Ramazan Gokberk Cinbis, Emre Akbas

Abstract: We present StreamDEQ, a method that aims to infer frame-wise representations on videos with minimal per-frame computation. Conventional deep networks do feature extraction from scratch at each frame in the absence of ad-hoc solutions. We instead aim to build streaming recognition models that can natively exploit temporal smoothness between consecutive video frames. We observe that the recently emerging implicit layer models provide a convenient foundation to construct such models, as they define representations as the fixed-points of shallow networks, which need to be estimated using iterative methods. Our main insight is to distribute the inference iterations over the temporal axis by using the most recent representation as a starting point at each frame. This scheme effectively recycles the recent inference computations and greatly reduces the needed processing time. Through extensive experimental analysis, we show that StreamDEQ is able to recover near-optimal representations in a few frames' time and maintain an up-to-date representation throughout the video duration. Our experiments on video semantic segmentation, video object detection, and human pose estimation in videos show that StreamDEQ achieves on-par accuracy with the baseline while being more than 2-4x faster.

Submitted 6 January, 2024; v1 submitted 28 April, 2022; originally announced April 2022.

Comments: v3: ECCV2022 paper. This version: extended version under review at TPAMI
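
The warm-starting idea described in the StreamDEQ abstract above can be illustrated with a toy sketch. This is not the authors' model: the scalar map `layer`, the synthetic `frames` sequence, and all function names are hypothetical stand-ins chosen only to show how reusing the previous frame's fixed-point estimate as the starting point reduces the iterations needed per frame.

```python
import math

def layer(z, x):
    """Toy implicit layer (assumption): a contraction in z, so z = layer(z, x) has a unique fixed point."""
    return 0.5 * math.tanh(z) + x

def solve(x, z0, n_iters):
    """Run n_iters plain fixed-point iterations starting from z0."""
    z = z0
    for _ in range(n_iters):
        z = layer(z, x)
    return z

def stream(frames, iters_per_frame, warm_start=True):
    """Return one representation per frame; warm_start reuses the previous frame's result."""
    z, outputs = 0.0, []
    for x in frames:
        z0 = z if warm_start else 0.0
        z = solve(x, z0, iters_per_frame)
        outputs.append(z)
    return outputs

def mean_error(frames, outputs):
    """Mean absolute gap to a reference computed with many iterations per frame."""
    refs = [solve(x, 0.0, 200) for x in frames]
    return sum(abs(o - r) for o, r in zip(outputs, refs)) / len(frames)

# Slowly varying inputs stand in for temporally smooth video frames (assumption).
frames = [1.0 + 0.01 * t for t in range(20)]
warm_err = mean_error(frames, stream(frames, 2, warm_start=True))
cold_err = mean_error(frames, stream(frames, 2, warm_start=False))
print(f"warm-start error: {warm_err:.5f}, cold-start error: {cold_err:.5f}")
```

With the same budget of two iterations per frame, the warm-started stream ends up closer to the reference fixed points than the cold-started one, because consecutive inputs differ only slightly.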

arXiv:2201.10972 [pdf, other]

doi 10.1016/j.imavis.2022.104392

How Robust are Discriminatively Trained Zero-Shot Learning Models?

Authors: Mehmet Kerim Yucel, Ramazan Gokberk Cinbis, Pinar Duygulu

Abstract: Data shift robustness has been primarily investigated from a fully supervised perspective, and the robustness of zero-shot learning (ZSL) models has been largely neglected. In this paper, we present novel analyses on the robustness of discriminative ZSL to image corruptions. We subject several ZSL models to a large set of common corruptions and defenses. In order to realize the corruption analysis, we curate and release the first ZSL corruption robustness datasets SUN-C, CUB-C and AWA2-C. We analyse our results by taking into account the dataset characteristics, class imbalance, class transitions between seen and unseen classes and the discrepancies between ZSL and GZSL performances. Our results show that discriminative ZSL suffers from corruptions and this trend is further exacerbated by the severe class imbalance and model weakness inherent in ZSL methods. We then combine our findings with those based on adversarial attacks in ZSL, and highlight the different effects of corruptions and adversarial examples, such as the pseudo-robustness effect present under adversarial attacks. We also obtain new strong baselines for both models with the defense methods. Finally, our experiments show that although existing methods to improve robustness somewhat work for ZSL models, they do not produce a tangible effect.

Submitted 27 January, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

arXiv:2201.05914 [pdf, other]

doi 10.1109/TPAMI.2022.3143074

Towards Zero-shot Sign Language Recognition

Authors: Yunus Can Bilge, Ramazan Gokberk Cinbis, Nazli Ikizler-Cinbis

Abstract: This paper tackles the problem of zero-shot sign language recognition (ZSSLR), where the goal is to leverage models learned over the seen sign classes to recognize the instances of unseen sign classes. In this context, readily available textual sign descriptions and attributes collected from sign language dictionaries are utilized as semantic class representations for knowledge transfer. For this novel problem setup, we introduce three benchmark datasets with their accompanying textual and attribute descriptions to analyze the problem in detail. Our proposed approach builds spatiotemporal models of body and hand regions. By leveraging the descriptive text and attribute embeddings along with these visual representations within a zero-shot learning framework, we show that textual and attribute based class definitions can provide effective knowledge for the recognition of previously unseen sign classes. We additionally introduce techniques to analyze the influence of binary attributes in correct and incorrect zero-shot predictions. We anticipate that the introduced approaches and the accompanying datasets will provide a basis for further exploration of zero-shot learning in sign language recognition.

Submitted 15 January, 2022; originally announced January 2022.

arXiv:2201.03043 [pdf, other]

doi 10.1016/j.neucom.2022.09.121

Semantics-driven Attentive Few-shot Learning over Clean and Noisy Samples

Authors: Orhun Buğra Baran, Ramazan Gökberk Cinbiş

Abstract: Over the last couple of years, few-shot learning (FSL) has attracted great attention towards minimizing the dependency on labeled training examples. An inherent difficulty in FSL is the handling of ambiguities resulting from having too few training samples per class. To tackle this fundamental challenge in FSL, we aim to train meta-learner models that can leverage prior semantic knowledge about novel classes to guide the classifier synthesis process. In particular, we propose semantically-conditioned feature attention and sample attention mechanisms that estimate the importance of representation dimensions and training instances. We also study the problem of sample noise in FSL, towards the utilization of meta-learners in more realistic and imperfect settings. Our experimental results demonstrate the effectiveness of the proposed semantic FSL model with and without sample noise.

Submitted 3 February, 2023; v1 submitted 9 January, 2022; originally announced January 2022.

Comments: 25 pages, 4 figures

arXiv:2110.12207 [pdf, other]

MaskSplit: Self-supervised Meta-learning for Few-shot Semantic Segmentation

Authors: Mustafa Sercan Amac, Ahmet Sencan, Orhun Bugra Baran, Nazli Ikizler-Cinbis, Ramazan Gokberk Cinbis

Abstract: Just like other few-shot learning problems, few-shot segmentation aims to minimize the need for manual annotation, which is particularly costly in segmentation tasks. Even though the few-shot setting reduces this cost for novel test classes, there is still a need to annotate the training data. To alleviate this need, we propose a self-supervised training approach for learning few-shot segmentation models. We first use unsupervised saliency estimation to obtain pseudo-masks on images. We then train a simple prototype based model over different splits of pseudo masks and augmentations of images. Our extensive experiments show that the proposed approach achieves promising results, highlighting the potential of self-supervised training. To the best of our knowledge, this is the first work that addresses the unsupervised few-shot segmentation problem on natural images.

Submitted 3 November, 2021; v1 submitted 23 October, 2021; originally announced October 2021.

Comments: To appear at WACV 2022, 11 pages, 5 figures

arXiv:2108.06165 [pdf, other]

doi 10.1016/j.imavis.2022.104515

Caption Generation on Scenes with Seen and Unseen Object Categories

Authors: Berkan Demirel, Ramazan Gokberk Cinbis

Abstract: Image caption generation is one of the most challenging problems at the intersection of vision and language domains. In this work, we propose a realistic captioning task where the input scenes may incorporate visual objects with no corresponding visual or textual training examples. For this problem, we propose a detection-driven approach that consists of a single-stage generalized zero-shot detection model to recognize and localize instances of both seen and unseen classes, and a template-based captioning model that transforms detections into sentences. To improve the generalized zero-shot detection model, which provides essential information for captioning, we define effective class representations in terms of class-to-class semantic similarities, and leverage their special structure to construct an effective unseen/seen class confidence score calibration mechanism. We also propose a novel evaluation metric that provides additional insights for the captioning outputs by separately measuring the visual and non-visual contents of generated sentences. Our experiments highlight the importance of studying captioning in the proposed zero-shot setting, and verify the effectiveness of the proposed detection-driven zero-shot captioning approach.

Submitted 1 July, 2022; v1 submitted 13 August, 2021; originally announced August 2021.

Comments: Accepted for Publication at Image and Vision Computing (IMAVIS)

arXiv:2105.10983 [pdf, other]

doi 10.1016/j.isprsjprs.2021.03.021

Weakly Supervised Instance Attention for Multisource Fine-Grained Object Recognition with an Application to Tree Species Classification

Authors: Bulut Aygunes, Ramazan Gokberk Cinbis, Selim Aksoy

Abstract: Multisource image analysis that leverages complementary spectral, spatial, and structural information benefits fine-grained object recognition that aims to classify an object into one of many similar subcategories. However, for multisource tasks that involve relatively small objects, even the smallest registration errors can introduce high uncertainty in the classification process. We approach this problem from a weakly supervised learning perspective in which the input images correspond to larger neighborhoods around the expected object locations where an object with a given class label is present in the neighborhood without any knowledge of its exact location. The proposed method uses a single-source deep instance attention model with parallel branches for joint localization and classification of objects, and extends this model into a multisource setting where a reference source that is assumed to have no location uncertainty is used to aid the fusion of multiple sources in four different levels: probability level, logit level, feature level, and pixel level. We show that all levels of fusion provide higher accuracies compared to the state-of-the-art, with the best performing method of feature-level fusion resulting in 53% accuracy for the recognition of 40 different types of trees, corresponding to an improvement of 5.7% over the best performing baseline when RGB, multispectral, and LiDAR data are used. We also provide an in-depth comparison by evaluating each model at various parameter complexity settings, where the increased model capacity results in a further improvement of 6.3% over the default capacity setting.

Submitted 25 May, 2021; v1 submitted 23 May, 2021; originally announced May 2021.

Comments: Accepted for publication in ISPRS Journal of Photogrammetry and Remote Sensing

arXiv:2009.07576 [pdf, other]

Red Carpet to Fight Club: Partially-supervised Domain Transfer for Face Recognition in Violent Videos

Authors: Yunus Can Bilge, Mehmet Kerim Yucel, Ramazan Gokberk Cinbis, Nazli Ikizler-Cinbis, Pinar Duygulu

Abstract: In many real-world problems, there is typically a large discrepancy between the characteristics of data used in training versus deployment. A prime example is the analysis of aggression videos: in a criminal incident, typically suspects need to be identified based on their clean portrait-like photos, instead of their prior video recordings. This results in three major challenges: large domain discrepancy between violence videos and ID-photos, the lack of video examples for most individuals and limited training data availability. To mimic such scenarios, we formulate a realistic domain-transfer problem, where the goal is to transfer the recognition model trained on clean posed images to the target domain of violent videos, where training videos are available only for a subset of subjects. To this end, we introduce the WildestFaces dataset, tailored to study cross-domain recognition under a variety of adverse conditions. We divide the task of transferring a recognition model from the domain of clean images to the violent videos into two sub-problems and tackle them using (i) stacked affine-transforms for classifier-transfer, (ii) attention-driven pooling for temporal-adaptation. We additionally formulate a self-attention based model for domain-transfer. We establish a rigorous evaluation protocol for this clean-to-violent recognition task, and present a detailed analysis of the proposed dataset and the methods. Our experiments highlight the unique challenges introduced by the WildestFaces dataset and the advantages of the proposed approach.

Submitted 16 September, 2020; originally announced September 2020.

Comments: To appear in WACV 2021

arXiv:2008.07651 [pdf, other]

A Deep Dive into Adversarial Robustness in Zero-Shot Learning

Authors: Mehmet Kerim Yucel, Ramazan Gokberk Cinbis, Pinar Duygulu

Abstract: Machine learning (ML) systems have introduced significant advances in various fields, due to the introduction of highly complex models. Despite their success, it has been shown multiple times that machine learning models are prone to imperceptible perturbations that can severely degrade their accuracy. So far, existing studies have primarily focused on models where supervision across all classes was available. In contrast, Zero-shot Learning (ZSL) and Generalized Zero-shot Learning (GZSL) tasks inherently lack supervision across all classes. In this paper, we present a study aimed at evaluating the adversarial robustness of ZSL and GZSL models. We leverage the well-established label embedding model and subject it to a set of established adversarial attacks and defenses across multiple datasets. In addition to creating possibly the first benchmark on adversarial robustness of ZSL models, we also present analyses on important points that require attention for better interpretation of ZSL robustness results. We hope these points, along with the benchmark, will help researchers establish a better understanding of what challenges lie ahead and help guide their work.

Submitted 17 August, 2020; originally announced August 2020.

Comments: To appear in ECCV 2020, Workshop on Adversarial Robustness in the Real World

arXiv:1908.10172 [pdf, other]

doi 10.1016/j.patcog.2020.107327

Key Protected Classification for Collaborative Learning

Authors: Mert Bülent Sarıyıldız, Ramazan Gökberk Cinbiş, Erman Ayday

Abstract: Large-scale datasets play a fundamental role in training deep learning models. However, dataset collection is difficult in domains that involve sensitive information. Collaborative learning techniques provide a privacy-preserving solution, by enabling training over a number of private datasets that are not shared by their owners. However, recently, it has been shown that the existing collaborative learning frameworks are vulnerable to an active adversary that runs a generative adversarial network (GAN) attack. In this work, we propose a novel classification model that is resilient against such attacks by design. More specifically, we introduce a key-based classification model and a principled training scheme that protects class scores by using class-specific private keys, which effectively hide the information necessary for a GAN attack. We additionally show how to utilize high dimensional keys to improve the robustness against attacks without increasing the model complexity. Our detailed experiments demonstrate the effectiveness of the proposed technique. Source code is available at https://github.com/mbsariyildiz/key-protected-classification.

Submitted 22 April, 2020; v1 submitted 27 August, 2019; originally announced August 2019.

Comments: Accepted to Pattern Recognition

arXiv:1908.00047 [pdf, other]

Image Captioning with Unseen Objects

Authors: Berkan Demirel, Ramazan Gokberk Cinbis, Nazli Ikizler-Cinbis

Abstract: Image caption generation is a long-standing and challenging problem at the intersection of computer vision and natural language processing. A number of recently proposed approaches utilize a fully supervised object recognition model within the captioning approach. Such models, however, tend to generate sentences which only consist of objects predicted by the recognition models, excluding instances of the classes without labelled training examples. In this paper, we propose a new challenging scenario that targets the image captioning problem in a fully zero-shot learning setting, where the goal is to be able to generate captions of test images containing objects that are not seen during training. The proposed approach jointly uses a novel zero-shot object detection model and a template-based sentence generator. Our experiments show promising results on the COCO dataset.

Submitted 31 July, 2019; originally announced August 2019.

Comments: To appear in British Machine Vision Conference (BMVC) 2019

arXiv:1907.10292 [pdf, other]

Zero-Shot Sign Language Recognition: Can Textual Data Uncover Sign Languages?

Authors: Yunus Can Bilge, Nazli Ikizler-Cinbis, Ramazan Gokberk Cinbis

Abstract: We introduce the problem of zero-shot sign language recognition (ZSSLR), where the goal is to leverage models learned over the seen sign class examples to recognize the instances of unseen signs. To this end, we propose to utilize the readily available descriptions in sign language dictionaries as an intermediate-level semantic representation for knowledge transfer. We introduce a new benchmark dataset called ASL-Text that consists of 250 sign language classes and their accompanying textual descriptions. Compared to the ZSL datasets in other domains (such as object recognition), our dataset consists of a limited number of training examples for a large number of classes, which imposes a significant challenge. We propose a framework that operates over the body and hand regions by means of 3D-CNNs, and models longer temporal relationships via bidirectional LSTMs. By leveraging the descriptive text embeddings along with these spatio-temporal representations within a zero-shot learning framework, we show that textual data can indeed be useful in uncovering sign languages. We anticipate that the introduced approach and the accompanying dataset will provide a basis for further exploration of this new zero-shot learning problem.

Submitted 24 July, 2019; originally announced July 2019.

Comments: To appear in British Machine Vision Conference (BMVC) 2019

arXiv:1905.06764 [pdf, other]

Learning Visually Consistent Label Embeddings for Zero-Shot Learning

Authors: Berkan Demirel, Ramazan Gokberk Cinbis, Nazli Ikizler-Cinbis

Abstract: In this work, we propose a zero-shot learning method to effectively model knowledge transfer between classes via jointly learning visually consistent word vectors and a label embedding model in an end-to-end manner. The main idea is to project the vector space word vectors of attributes and classes into the visual space such that word representations of semantically related classes become closer, and use the projected vectors in the proposed embedding model to identify unseen classes. We evaluate the proposed approach on two benchmark datasets and the experimental results show that our method yields significant improvements in recognition accuracy.

Submitted 16 May, 2019; originally announced May 2019.

Comments: To appear at IEEE Int. Conference on Image Processing (ICIP) 2019

arXiv:1903.08225 [pdf, other]

Cross-task weakly supervised learning from instructional videos

Authors: Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic

Abstract: In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations. At the heart of our approach is the observation that weakly supervised learning may be easier if a model shares components while learning different steps: `pour egg' should be trained jointly with other tasks involving `pour' and `egg'. We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps. Past data does not permit systematic studying of sharing and so we also gather a new dataset, CrossTask, aimed at assessing cross-task sharing. Our experiments demonstrate that sharing across tasks improves performance, especially when done at the component level and that our component model can parse previously unseen tasks by virtue of its compositionality.

Submitted 29 April, 2019; v1 submitted 19 March, 2019; originally announced March 2019.

Comments: 18 pages, 17 figures, to be published in proceedings of the CVPR, 2019

arXiv:1901.06403 [pdf, other]

doi 10.1109/TGRS.2019.2894425

Multisource Region Attention Network for Fine-Grained Object Recognition in Remote Sensing Imagery

Authors: Gencer Sumbul, Ramazan Gokberk Cinbis, Selim Aksoy

Abstract: Fine-grained object recognition concerns the identification of the type of an object among a large number of closely related sub-categories. Multisource data analysis, which aims to leverage the complementary spectral, spatial, and structural information embedded in different sources, is a promising direction towards solving the fine-grained recognition problem that involves low between-class variance, small training set sizes for rare classes, and class imbalance. However, the common assumption of co-registered sources may not hold at the pixel level for small objects of interest. We present a novel methodology that aims to simultaneously learn the alignment of multisource data and the classification model in a unified framework. The proposed method involves a multisource region attention network that computes per-source feature representations, assigns attention scores to candidate regions sampled around the expected object locations by using these representations, and classifies the objects by using an attention-driven multisource representation that combines the feature representations and the attention scores from all sources. All components of the model are realized using deep neural networks and are learned in an end-to-end fashion. Experiments using RGB, multispectral, and LiDAR elevation data for classification of street trees showed that our approach achieved 64.2% and 47.3% accuracies for the 18-class and 40-class settings, respectively, which correspond to 13% and 14.3% improvement relative to the commonly used feature concatenation approach from multiple sources.

Submitted 18 January, 2019; originally announced January 2019.

Comments: G. Sumbul, R. G. Cinbis, S. Aksoy, "Multisource Region Attention Network for Fine-Grained Object Recognition in Remote Sensing Imagery", IEEE Transactions on Geoscience and Remote Sensing (TGRS), in press, 2019

arXiv:1805.07566 [pdf, other]

Wildest Faces: Face Detection and Recognition in Violent Settings

Authors: Mehmet Kerim Yucel, Yunus Can Bilge, Oguzhan Oguz, Nazli Ikizler-Cinbis, Pinar Duygulu, Ramazan Gokberk Cinbis

Abstract: With the introduction of large-scale datasets and deep learning models capable of learning complex representations, impressive advances have emerged in face detection and recognition tasks. Despite such advances, existing datasets do not capture the difficulty of face recognition in the wildest scenarios, such as hostile disputes or fights. Furthermore, existing datasets do not represent completely unconstrained cases of low resolution, high blur and large pose/occlusion variances. To this end, we introduce the Wildest Faces dataset, which focuses on such adverse effects through violent scenes. The dataset consists of an extensive set of violent scenes of celebrities from movies. Our experimental results demonstrate that state-of-the-art techniques are not well-suited for violent scenes, and therefore, Wildest Faces is likely to stir further interest in face detection and recognition research.

Submitted 19 May, 2018; originally announced May 2018.

Comments: Submitted to BMVC 2018

arXiv:1805.06157 [pdf, other]

Zero-Shot Object Detection by Hybrid Region Embedding

Authors: Berkan Demirel, Ramazan Gokberk Cinbis, Nazli Ikizler-Cinbis

Abstract: Object detection is considered one of the most challenging problems in computer vision, since it requires correct prediction of both classes and locations of objects in images. In this study, we define a more difficult scenario, namely zero-shot object detection (ZSD), where no visual training data is available for some of the target object classes. We present a novel approach to tackle this ZSD problem, where a convex combination of embeddings is used in conjunction with a detection framework. For evaluation of ZSD methods, we propose a simple dataset constructed from Fashion-MNIST images and also a custom zero-shot split for the Pascal VOC detection challenge. The experimental results suggest that our method yields promising results for ZSD.

Submitted 17 May, 2018; v1 submitted 16 May, 2018; originally announced May 2018.

Journal ref: Published in British Machine Vision Conference 2018

arXiv:1712.03323 [pdf, other]

doi 10.1109/TGRS.2017.2754648

Fine-Grained Object Recognition and Zero-Shot Learning in Remote Sensing Imagery

Authors: Gencer Sumbul, Ramazan Gokberk Cinbis, Selim Aksoy

Abstract: Fine-grained object recognition that aims to identify the type of an object among a large number of subcategories is an emerging application with the increasing resolution that exposes new details in image data. Traditional fully supervised algorithms fail to handle this problem where there is low between-class variance and high within-class variance for the classes of interest with small sample sizes. We study an even more extreme scenario named zero-shot learning (ZSL) in which no training example exists for some of the classes. ZSL aims to build a recognition model for new unseen categories by relating them to seen classes that were previously learned. We establish this relation by learning a compatibility function between image features extracted via a convolutional neural network and auxiliary information that describes the semantics of the classes of interest by using training samples from the seen classes. Then, we show how knowledge transfer can be performed for the unseen classes by maximizing this function during inference. We introduce a new data set that contains 40 different types of street trees in 1-ft spatial resolution aerial data, and evaluate the performance of this model with manually annotated attributes, a natural language model, and a scientific taxonomy as auxiliary information. The experiments show that the proposed model achieves 14.3% recognition accuracy for the classes with no training examples, which is significantly better than a random guess accuracy of 6.3% for 16 test classes, and three other ZSL algorithms.

Submitted 8 December, 2017; originally announced December 2017.

Comments: G. Sumbul, R. G. Cinbis, S. Aksoy, "Fine-Grained Object Recognition and Zero-Shot Learning in Remote Sensing Imagery", IEEE Transactions on Geoscience and Remote Sensing (TGRS), in press, 2017

arXiv:1705.01734 [pdf, other]

Attributes2Classname: A discriminative model for attribute-based unsupervised zero-shot learning

Authors: Berkan Demirel, Ramazan Gokberk Cinbis, Nazli Ikizler-Cinbis

Abstract: We propose a novel approach for unsupervised zero-shot learning (ZSL) of classes based on their names. Most existing unsupervised ZSL methods aim to learn a model for directly comparing image features and class names. However, this proves to be a difficult task due to dominance of non-visual semantics in underlying vector-space embeddings of class names. To address this issue, we discriminatively learn a word representation such that the similarities between class and combination of attribute names fall in line with the visual similarity. Contrary to the traditional zero-shot learning approaches that are built upon attribute presence, our approach bypasses the laborious attribute-class relation annotations for unseen classes. In addition, our proposed approach renders text-only training possible, hence, the training can be augmented without the need to collect additional image data. The experimental results show that our method yields state-of-the-art results for unsupervised ZSL in three benchmark datasets.

Submitted 5 August, 2017; v1 submitted 4 May, 2017; originally announced May 2017.

Comments: To appear at IEEE Int. Conference on Computer Vision (ICCV) 2017

arXiv:1510.00857 [pdf, other]

doi 10.1109/TPAMI.2015.2484342

Approximate Fisher Kernels of non-iid Image Models for Image Categorization

Authors: Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

Abstract: The bag-of-words (BoW) model treats images as sets of local descriptors and represents them by visual word histograms. The Fisher vector (FV) representation extends BoW by considering the first and second order statistics of local descriptors. In both representations local descriptors are assumed to be identically and independently distributed (iid), which is a poor assumption from a modeling perspective. It has been experimentally observed that the performance of BoW and FV representations can be improved by employing discounting transformations such as power normalization. In this paper, we introduce non-iid models by treating the model parameters as latent variables which are integrated out, rendering all local regions dependent. Using the Fisher kernel principle we encode an image by the gradient of the data log-likelihood w.r.t. the model hyper-parameters. Our models naturally generate discounting effects in the representations, suggesting that such transformations have proven successful because they closely correspond to the representations obtained for non-iid models. To enable tractable computation, we rely on variational free-energy bounds to learn the hyper-parameters and to compute approximate Fisher kernels. Our experimental evaluation results validate that our models lead to performance improvements comparable to using power normalization, as employed in state-of-the-art feature aggregation methods.

Submitted 3 October, 2015; originally announced October 2015.

Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence, in press, 2015

Journal ref: IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 38, no. 6, pp. 1084-1098, June 1 2016

arXiv:1503.00949 [pdf, other]

doi 10.1109/TPAMI.2016.2535231

Weakly Supervised Object Localization with Multi-fold Multiple Instance Learning

Authors: Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

Abstract: Object category localization is a challenging problem in computer vision. Standard supervised training requires bounding box annotations of object instances. This time-consuming annotation process is sidestepped in weakly supervised learning. In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations. We follow a multiple-instance learning approach that iteratively trains the detector and infers the object locations in the positive training images. Our main contribution is a multi-fold multiple instance learning procedure, which prevents training from prematurely locking onto erroneous object locations. This procedure is particularly important when using high-dimensional representations, such as Fisher vectors and convolutional neural network features. We also propose a window refinement method, which improves the localization accuracy by incorporating an objectness prior. We present a detailed experimental evaluation using the PASCAL VOC 2007 dataset, which verifies the effectiveness of our approach.

Submitted 22 February, 2016; v1 submitted 3 March, 2015; originally announced March 2015.

Comments: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

Showing 1–25 of 25 results for author: Cinbis, R G