Search | arXiv e-print repository

Deep Continuous Networks

Authors: Nergis Tomen, Silvia L. Pintea, Jan C. van Gemert

Abstract: CNNs and computational models of biological vision share some fundamental principles, which opened new avenues of research. However, fruitful cross-field research is hampered by conventional CNN architectures being based on spatially and depthwise discrete representations, which cannot accommodate certain aspects of biological complexity such as continuously varying receptive field sizes and dynam… ▽ More CNNs and computational models of biological vision share some fundamental principles, which opened new avenues of research. However, fruitful cross-field research is hampered by conventional CNN architectures being based on spatially and depthwise discrete representations, which cannot accommodate certain aspects of biological complexity such as continuously varying receptive field sizes and dynamics of neuronal responses. Here we propose deep continuous networks (DCNs), which combine spatially continuous filters, with the continuous depth framework of neural ODEs. This allows us to learn the spatial support of the filters during training, as well as model the continuous evolution of feature maps, linking DCNs closely to biological models. We show that DCNs are versatile and highly applicable to standard image classification and reconstruction problems, where they improve parameter and data efficiency, and allow for meta-parametrization. We illustrate the biological plausibility of the scale distributions learned by DCNs and explore their performance in a neuroscientifically inspired pattern completion task. Finally, we investigate an efficient implementation of DCNs by changing input contrast. △ Less

Submitted 2 February, 2024; originally announced February 2024.

Comments: Presented at ICML 2021

Journal ref: In International Conference on Machine Learning 2021 Jul 1 (pp. 10324-10335). PMLR

arXiv:2308.10603 [pdf, other]

A step towards understanding why classification helps regression

Authors: Silvia L. Pintea, Yancong Lin, Jouke Dijkstra, Jan C. van Gemert

Abstract: A number of computer vision deep regression approaches report improved results when adding a classification loss to the regression loss. Here, we explore why this is useful in practice and when it is beneficial. To do so, we start from precisely controlled dataset variations and data samplings and find that the effect of adding a classification loss is the most pronounced for regression with imbal… ▽ More A number of computer vision deep regression approaches report improved results when adding a classification loss to the regression loss. Here, we explore why this is useful in practice and when it is beneficial. To do so, we start from precisely controlled dataset variations and data samplings and find that the effect of adding a classification loss is the most pronounced for regression with imbalanced data. We explain these empirical findings by formalizing the relation between the balanced and imbalanced regression losses. Finally, we show that our findings hold on two real imbalanced image datasets for depth estimation (NYUD2-DIR), and age estimation (IMDB-WIKI-DIR), and on the problem of imbalanced video progress prediction (Breakfast). Our main takeaway is: for a regression task, if the data sampling is imbalanced, then add a classification loss. △ Less

Submitted 21 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV-2023

arXiv:2308.05533 [pdf, other]

Is there progress in activity progress prediction?

Authors: Frans de Boer, Jan C. van Gemert, Jouke Dijkstra, Silvia L. Pintea

Abstract: Activity progress prediction aims to estimate what percentage of an activity has been completed. Currently this is done with machine learning approaches, trained and evaluated on complicated and realistic video datasets. The videos in these datasets vary drastically in length and appearance. And some of the activities have unanticipated developments, making activity progression difficult to estima… ▽ More Activity progress prediction aims to estimate what percentage of an activity has been completed. Currently this is done with machine learning approaches, trained and evaluated on complicated and realistic video datasets. The videos in these datasets vary drastically in length and appearance. And some of the activities have unanticipated developments, making activity progression difficult to estimate. In this work, we examine the results obtained by existing progress prediction methods on these datasets. We find that current progress prediction methods seem not to extract useful visual information for the progress prediction task. Therefore, these methods fail to exceed simple frame-counting baselines. We design a precisely controlled dataset for activity progress prediction and on this synthetic dataset we show that the considered methods can make use of the visual information, when this directly relates to the progress prediction. We conclude that the progress prediction task is ill-posed on the currently used real-world datasets. Moreover, to fairly measure activity progression we advise to consider a, simple but effective, frame-counting baseline. △ Less

Submitted 10 August, 2023; originally announced August 2023.

Comments: Accepted at ICCVw-2023 (AI for Creative Video Editing and Understanding, ICCV workshop 2023)

arXiv:2308.04770 [pdf, other]

Objects do not disappear: Video object detection by single-frame object location anticipation

Authors: Xin Liu, Fatemeh Karimi Nejadasl, Jan C. van Gemert, Olaf Booij, Silvia L. Pintea

Abstract: Objects in videos are typically characterized by continuous smooth motion. We exploit continuous smooth motion in three ways. 1) Improved accuracy by using object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyframe. 2) Improved efficiency by only doing the expensive feature computations on a small subset of all frames. Because neig… ▽ More Objects in videos are typically characterized by continuous smooth motion. We exploit continuous smooth motion in three ways. 1) Improved accuracy by using object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyframe. 2) Improved efficiency by only doing the expensive feature computations on a small subset of all frames. Because neighboring video frames are often redundant, we only compute features for a single static keyframe and predict object locations in subsequent frames. 3) Reduced annotation cost, where we only annotate the keyframe and use smooth pseudo-motion between keyframes. We demonstrate computational efficiency, annotation efficiency, and improved mean average precision compared to the state-of-the-art on four datasets: ImageNet VID, EPIC KITCHENS-55, YouTube-BoundingBoxes, and Waymo Open dataset. Our source code is available at https://github.com/L-KID/Videoobject-detection-by-location-anticipation. △ Less

Submitted 9 August, 2023; originally announced August 2023.

Comments: Accepted by ICCV 2023

arXiv:2307.08483 [pdf, other]

Differentiable Transportation Pruning

Authors: Yunqiang Li, Jan C. van Gemert, Torsten Hoefler, Bert Moons, Evangelos Eleftheriou, Bram-Ernst Verhoef

Abstract: Deep learning algorithms are increasingly employed at the edge. However, edge devices are resource constrained and thus require efficient deployment of deep neural networks. Pruning methods are a key tool for edge deployment as they can improve storage, compute, memory bandwidth, and energy usage. In this paper we propose a novel accurate pruning technique that allows precise control over the outp… ▽ More Deep learning algorithms are increasingly employed at the edge. However, edge devices are resource constrained and thus require efficient deployment of deep neural networks. Pruning methods are a key tool for edge deployment as they can improve storage, compute, memory bandwidth, and energy usage. In this paper we propose a novel accurate pruning technique that allows precise control over the output network size. Our method uses an efficient optimal transportation scheme which we make end-to-end differentiable and which automatically tunes the exploration-exploitation behavior of the algorithm to find accurate sparse sub-networks. We show that our method achieves state-of-the-art performance compared to previous pruning methods on 3 different datasets, using 5 different models, across a wide range of pruning ratios, and with two types of sparsity budgets and pruning granularities. △ Less

Submitted 31 July, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: ICCV 2023

arXiv:2205.02887 [pdf, other]

Evaluating Context for Deep Object Detectors

Authors: Osman Semih Kayhan, Jan C. van Gemert

Abstract: Which object detector is suitable for your context sensitive task? Deep object detectors exploit scene context for recognition differently. In this paper, we group object detectors into 3 categories in terms of context use: no context by crop** the input (RCNN), partial context by crop** the featuremap (two-stage methods) and full context without any crop** (single-stage methods). We systema… ▽ More Which object detector is suitable for your context sensitive task? Deep object detectors exploit scene context for recognition differently. In this paper, we group object detectors into 3 categories in terms of context use: no context by crop** the input (RCNN), partial context by crop** the featuremap (two-stage methods) and full context without any crop** (single-stage methods). We systematically evaluate the effect of context for each deep detector category. We create a fully controlled dataset for varying context and investigate the context for deep detectors. We also evaluate gradually removing the background context and the foreground object on MS COCO. We demonstrate that single-stage and two-stage object detectors can and will use the context by virtue of their large receptive field. Thus, choosing the best object detector may depend on the application context. △ Less

Submitted 5 May, 2022; originally announced May 2022.

Comments: 4 pages, 5 figures

arXiv:2203.08586 [pdf, other]

Deep vanishing point detection: Geometric priors make dataset variations vanish

Authors: Yancong Lin, Ruben Wiersma, Silvia L. Pintea, Klaus Hildebrandt, Elmar Eisemann, Jan C. van Gemert

Abstract: Deep learning has improved vanishing point detection in images. Yet, deep networks require expensive annotated datasets trained on costly hardware and do not generalize to even slightly different domains, and minor problem variants. Here, we address these issues by injecting deep vanishing point detection networks with prior knowledge. This prior knowledge no longer needs to be learned from data,… ▽ More Deep learning has improved vanishing point detection in images. Yet, deep networks require expensive annotated datasets trained on costly hardware and do not generalize to even slightly different domains, and minor problem variants. Here, we address these issues by injecting deep vanishing point detection networks with prior knowledge. This prior knowledge no longer needs to be learned from data, saving valuable annotation efforts and compute, unlocking realistic few-sample scenarios, and reducing the impact of domain changes. Moreover, the interpretability of the priors allows to adapt deep networks to minor problem variations such as switching between Manhattan and non-Manhattan worlds. We seamlessly incorporate two geometric priors: (i) Hough Transform -- map** image pixels to straight lines, and (ii) Gaussian sphere -- map** lines to great circles whose intersections denote vanishing points. Experimentally, we ablate our choices and show comparable accuracy to existing models in the large-data setting. We validate our model's improved data efficiency, robustness to domain changes, adaptability to non-Manhattan settings. △ Less

Submitted 16 March, 2022; originally announced March 2022.

Comments: CVPR2022, code available at https://github.com/yanconglin/VanishingPoint_HoughTransform_GaussianSphere

arXiv:2112.03406 [pdf, other]

Equal Bits: Enforcing Equally Distributed Binary Network Weights

Authors: Yunqiang Li, Silvia L. Pintea, Jan C. van Gemert

Abstract: Binary networks are extremely efficient as they use only two symbols to define the network: $\{+1,-1\}$. One can make the prior distribution of these symbols a design choice. The recent IR-Net of Qin et al. argues that imposing a Bernoulli distribution with equal priors (equal bit ratios) over the binary weights leads to maximum entropy and thus minimizes information loss. However, prior work cann… ▽ More Binary networks are extremely efficient as they use only two symbols to define the network: $\{+1,-1\}$. One can make the prior distribution of these symbols a design choice. The recent IR-Net of Qin et al. argues that imposing a Bernoulli distribution with equal priors (equal bit ratios) over the binary weights leads to maximum entropy and thus minimizes information loss. However, prior work cannot precisely control the binary weight distribution during training, and therefore cannot guarantee maximum entropy. Here, we show that quantizing using optimal transport can guarantee any bit ratio, including equal ratios. We investigate experimentally that equal bit ratios are indeed preferable and show that our method leads to optimization benefits. We show that our quantization method is effective when compared to state-of-the-art binarization methods, even when using binary weight pruning. △ Less

Submitted 6 March, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

arXiv:2111.06660 [pdf, other]

Frequency learning for structured CNN filters with Gaussian fractional derivatives

Authors: Nikhil Saldanha, Silvia L. Pintea, Jan C. van Gemert, Nergis Tomen

Abstract: Frequency information lies at the base of discriminating between textures, and therefore between different objects. Classical CNN architectures limit the frequency learning through fixed filter sizes, and lack a way of explicitly controlling it. Here, we build on the structured receptive field filters with Gaussian derivative basis. Yet, rather than using predetermined derivative orders, which typ… ▽ More Frequency information lies at the base of discriminating between textures, and therefore between different objects. Classical CNN architectures limit the frequency learning through fixed filter sizes, and lack a way of explicitly controlling it. Here, we build on the structured receptive field filters with Gaussian derivative basis. Yet, rather than using predetermined derivative orders, which typically result in fixed frequency responses for the basis functions, we learn these. We show that by learning the order of the basis we can accurately learn the frequency of the filters, and hence adapt to the optimal frequencies for the underlying learning task. We investigate the well-founded mathematical formulation of fractional derivatives to adapt the filter frequencies during training. Our formulation leads to parameter savings and data efficiency when compared to the standard CNNs and the Gaussian derivative CNN filter networks that we build upon. △ Less

Submitted 12 November, 2021; originally announced November 2021.

Comments: Accepted at BMVC 2021

arXiv:2110.08059 [pdf, other]

FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes

Authors: David W. Romero, Robert-Jan Bruintjes, Jakub M. Tomczak, Erik J. Bekkers, Mark Hoogendoorn, Jan C. van Gemert

Abstract: When designing Convolutional Neural Networks (CNNs), one must select the size\break of the convolutional kernels before training. Recent works show CNNs benefit from different kernel sizes at different layers, but exploring all possible combinations is unfeasible in practice. A more efficient approach is to learn the kernel size during training. However, existing works that learn the kernel size h… ▽ More When designing Convolutional Neural Networks (CNNs), one must select the size\break of the convolutional kernels before training. Recent works show CNNs benefit from different kernel sizes at different layers, but exploring all possible combinations is unfeasible in practice. A more efficient approach is to learn the kernel size during training. However, existing works that learn the kernel size have a limited bandwidth. These approaches scale kernels by dilation, and thus the detail they can describe is limited. In this work, we propose FlexConv, a novel convolutional operation with which high bandwidth convolutional kernels of learnable kernel size can be learned at a fixed parameter cost. FlexNets model long-term dependencies without the use of pooling, achieve state-of-the-art performance on several sequential datasets, outperform recent works with learned kernel sizes, and are competitive with much deeper ResNets on image benchmark datasets. Additionally, FlexNets can be deployed at higher resolutions than those seen during training. To avoid aliasing, we propose a novel kernel parameterization with which the frequency of the kernels can be analytically controlled. Our novel kernel parameterization shows higher descriptive power and faster convergence speed than existing parameterizations. This leads to important improvements in classification accuracy. △ Less

Submitted 17 March, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

Comments: First two authors contributed equally to this work

arXiv:2108.07533 [pdf, other]

Investigating transformers in the decomposition of polygonal shapes as point collections

Authors: Andrea Alfieri, Yancong Lin, Jan C. van Gemert

Abstract: Transformers can generate predictions in two approaches: 1. auto-regressively by conditioning each sequence element on the previous ones, or 2. directly produce an output sequences in parallel. While research has mostly explored upon this difference on sequential tasks in NLP, we study the difference between auto-regressive and parallel prediction on visual set prediction tasks, and in particular… ▽ More Transformers can generate predictions in two approaches: 1. auto-regressively by conditioning each sequence element on the previous ones, or 2. directly produce an output sequences in parallel. While research has mostly explored upon this difference on sequential tasks in NLP, we study the difference between auto-regressive and parallel prediction on visual set prediction tasks, and in particular on polygonal shapes in images because polygons are representative of numerous types of objects, such as buildings or obstacles for aerial vehicles. This is challenging for deep learning architectures as a polygon can consist of a varying carnality of points. We provide evidence on the importance of natural orders for Transformers, and show the benefit of decomposing complex polygons into collections of points in an auto-regressive manner. △ Less

Submitted 17 August, 2021; originally announced August 2021.

Comments: DLGC@ICCVW 2021

arXiv:2108.05137 [pdf, other]

Zero-Shot Day-Night Domain Adaptation with a Physics Prior

Authors: Attila Lengyel, Sourav Garg, Michael Milford, Jan C. van Gemert

Abstract: We explore the zero-shot setting for day-night domain adaptation. The traditional domain adaptation setting is to train on one domain and adapt to the target domain by exploiting unlabeled data samples from the test set. As gathering relevant test data is expensive and sometimes even impossible, we remove any reliance on test data imagery and instead exploit a visual inductive prior derived from p… ▽ More We explore the zero-shot setting for day-night domain adaptation. The traditional domain adaptation setting is to train on one domain and adapt to the target domain by exploiting unlabeled data samples from the test set. As gathering relevant test data is expensive and sometimes even impossible, we remove any reliance on test data imagery and instead exploit a visual inductive prior derived from physics-based reflection models for domain adaptation. We cast a number of color invariant edge detectors as trainable layers in a convolutional neural network and evaluate their robustness to illumination changes. We show that the color invariant layer reduces the day-night distribution shift in feature map activations throughout the network. We demonstrate improved performance for zero-shot day to night domain adaptation on both synthetic as well as natural datasets in various tasks, including classification, segmentation and place recognition. △ Less

Submitted 11 October, 2021; v1 submitted 11 August, 2021; originally announced August 2021.

Comments: ICCV 2021 Oral presentation. Code, datasets and supplementary material: https://github.com/Attila94/CIConv

Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 4399-4409

arXiv:2106.11280 [pdf, other]

The Arm-Swing Is Discriminative in Video Gait Recognition for Athlete Re-Identification

Authors: Yapkan Choi, Yeshwanth Napolean, Jan C. van Gemert

Abstract: In this paper we evaluate running gait as an attribute for video person re-identification in a long-distance running event. We show that running gait recognition achieves competitive performance compared to appearance-based approaches in the cross-camera retrieval task and that gait and appearance features are complementary to each other. For gait, the arm swing during running is less distinguisha… ▽ More In this paper we evaluate running gait as an attribute for video person re-identification in a long-distance running event. We show that running gait recognition achieves competitive performance compared to appearance-based approaches in the cross-camera retrieval task and that gait and appearance features are complementary to each other. For gait, the arm swing during running is less distinguishable when using binary gait silhouettes, due to ambiguity in the torso region. We propose to use human semantic parsing to create partial gait silhouettes where the torso is left out. Leaving out the torso improves recognition results by allowing the arm swing to be more visible in the frontal and oblique viewing angles, which offers hints that arm swings are somewhat personal. Experiments show an increase of 3.2% mAP on the CampusRun and increased accuracy with 4.8% in the frontal and rear view on CASIA-B, compared to using the full body silhouettes. △ Less

Submitted 21 June, 2021; originally announced June 2021.

Comments: ICIP 2021

arXiv:2106.04914 [pdf, other]

Exploiting Learned Symmetries in Group Equivariant Convolutions

Authors: Attila Lengyel, Jan C. van Gemert

Abstract: Group Equivariant Convolutions (GConvs) enable convolutional neural networks to be equivariant to various transformation groups, but at an additional parameter and compute cost. We investigate the filter parameters learned by GConvs and find certain conditions under which they become highly redundant. We show that GConvs can be efficiently decomposed into depthwise separable convolutions while pre… ▽ More Group Equivariant Convolutions (GConvs) enable convolutional neural networks to be equivariant to various transformation groups, but at an additional parameter and compute cost. We investigate the filter parameters learned by GConvs and find certain conditions under which they become highly redundant. We show that GConvs can be efficiently decomposed into depthwise separable convolutions while preserving equivariance properties and demonstrate improved performance and data efficiency on two datasets. All code is publicly available at github.com/Attila94/SepGrouPy. △ Less

Submitted 9 June, 2021; originally announced June 2021.

arXiv:2106.03412 [pdf, other]

doi 10.1109/TIP.2021.3115001

Resolution learning in deep convolutional networks using scale-space theory

Authors: Silvia L. Pintea, Nergis Tomen, Stanley F. Goes, Marco Loog, Jan C. van Gemert

Abstract: Resolution in deep convolutional neural networks (CNNs) is typically bounded by the receptive field size through filter sizes, and subsampling layers or strided convolutions on feature maps. The optimal resolution may vary significantly depending on the dataset. Modern CNNs hard-code their resolution hyper-parameters in the network architecture which makes tuning such hyper-parameters cumbersome.… ▽ More Resolution in deep convolutional neural networks (CNNs) is typically bounded by the receptive field size through filter sizes, and subsampling layers or strided convolutions on feature maps. The optimal resolution may vary significantly depending on the dataset. Modern CNNs hard-code their resolution hyper-parameters in the network architecture which makes tuning such hyper-parameters cumbersome. We propose to do away with hard-coded resolution hyper-parameters and aim to learn the appropriate resolution from data. We use scale-space theory to obtain a self-similar parametrization of filters and make use of the N-Jet: a truncated Taylor series to approximate a filter by a learned combination of Gaussian derivative filters. The parameter sigma of the Gaussian basis controls both the amount of detail the filter encodes and the spatial extent of the filter. Since sigma is a continuous parameter, we can optimize it with respect to the loss. The proposed N-Jet layer achieves comparable performance when used in state-of-the art architectures, while learning the correct resolution in each layer automatically. We evaluate our N-Jet layer on both classification and segmentation, and we show that learning sigma is especially beneficial for inputs at multiple sizes. △ Less

Submitted 24 October, 2023; v1 submitted 7 June, 2021; originally announced June 2021.

Comments: Preprint accepted by IEEE Transactions on Image Processing, 2021 (TIP). Link to final published article: https://ieeexplore.ieee.org/abstract/document/9552550

Journal ref: IEEE Transactions on Image Processing, vol. 30, pp. 8342-8353, 2021

arXiv:2106.02523 [pdf, other]

Hallucination In Object Detection -- A Study In Visual Part Verification

Authors: Osman Semih Kayhan, Bart Vredebregt, Jan C. van Gemert

Abstract: We show that object detectors can hallucinate and detect missing objects; potentially even accurately localized at their expected, but non-existing, position. This is particularly problematic for applications that rely on visual part verification: detecting if an object part is present or absent. We show how popular object detectors hallucinate objects in a visual part verification task and introd… ▽ More We show that object detectors can hallucinate and detect missing objects; potentially even accurately localized at their expected, but non-existing, position. This is particularly problematic for applications that rely on visual part verification: detecting if an object part is present or absent. We show how popular object detectors hallucinate objects in a visual part verification task and introduce the first visual part verification dataset: DelftBikes, which has 10,000 bike photographs, with 22 densely annotated parts per image, where some parts may be missing. We explicitly annotated an extra object state label for each part to reflect if a part is missing or intact. We propose to evaluate visual part verification by relying on recall and compare popular object detectors on DelftBikes. △ Less

Submitted 4 June, 2021; originally announced June 2021.

Comments: ICIP 2021

arXiv:2103.15395 [pdf, other]

No frame left behind: Full Video Action Recognition

Authors: Xin Liu, Silvia L. Pintea, Fatemeh Karimi Nejadasl, Olaf Booij, Jan C. van Gemert

Abstract: Not all video frames are equally informative for recognizing an action. It is computationally infeasible to train deep networks on all video frames when actions develop over hundreds of frames. A common heuristic is uniformly sampling a small number of video frames and using these to recognize the action. Instead, here we propose full video action recognition and consider all video frames. To make… ▽ More Not all video frames are equally informative for recognizing an action. It is computationally infeasible to train deep networks on all video frames when actions develop over hundreds of frames. A common heuristic is uniformly sampling a small number of video frames and using these to recognize the action. Instead, here we propose full video action recognition and consider all video frames. To make this computational tractable, we first cluster all frame activations along the temporal dimension based on their similarity with respect to the classification task, and then temporally aggregate the frames in the clusters into a smaller number of representations. Our method is end-to-end trainable and computationally efficient as it relies on temporally localized clustering in combination with fast Hamming distances in feature space. We evaluate on UCF101, HMDB51, Breakfast, and Something-Something V1 and V2, where we compare favorably to existing heuristic frame sampling methods. △ Less

Submitted 29 March, 2021; originally announced March 2021.

Comments: Accepted to CVPR 2021

arXiv:2011.13202 [pdf, other]

t-EVA: Time-Efficient t-SNE Video Annotation

Authors: Soroosh Poorgholi, Osman Semih Kayhan, Jan C. van Gemert

Abstract: Video understanding has received more attention in the past few years due to the availability of several large-scale video datasets. However, annotating large-scale video datasets are cost-intensive. In this work, we propose a time-efficient video annotation method using spatio-temporal feature similarity and t-SNE dimensionality reduction to speed up the annotation process massively. Placing the… ▽ More Video understanding has received more attention in the past few years due to the availability of several large-scale video datasets. However, annotating large-scale video datasets are cost-intensive. In this work, we propose a time-efficient video annotation method using spatio-temporal feature similarity and t-SNE dimensionality reduction to speed up the annotation process massively. Placing the same actions from different videos near each other in the two-dimensional space based on feature similarity helps the annotator to group-label video clips. We evaluate our method on two subsets of the ActivityNet (v1.3) and a subset of the Sports-1M dataset. We show that t-EVA can outperform other video annotation tools while maintaining test accuracy on video classification. △ Less

Submitted 26 November, 2020; originally announced November 2020.

Comments: ICPR 2020 (HCAU)

arXiv:2010.10451 [pdf, other]

Tilting at windmills: Data augmentation for deep pose estimation does not help with occlusions

Authors: Rafal Pytel, Osman Semih Kayhan, Jan C. van Gemert

Abstract: Occlusion degrades the performance of human pose estimation. In this paper, we introduce targeted keypoint and body part occlusion attacks. The effects of the attacks are systematically analyzed on the best performing methods. In addition, we propose occlusion specific data augmentation techniques against keypoint and part attacks. Our extensive experiments show that human pose estimation methods… ▽ More Occlusion degrades the performance of human pose estimation. In this paper, we introduce targeted keypoint and body part occlusion attacks. The effects of the attacks are systematically analyzed on the best performing methods. In addition, we propose occlusion specific data augmentation techniques against keypoint and part attacks. Our extensive experiments show that human pose estimation methods are not robust to occlusion and data augmentation does not solve the occlusion problems. △ Less

Submitted 20 October, 2020; originally announced October 2020.

Comments: ICPR 2020

arXiv:2007.09493 [pdf, other]

Deep Hough-Transform Line Priors

Authors: Yancong Lin, Silvia L. Pintea, Jan C. van Gemert

Abstract: Classical work on line segment detection is knowledge-based; it uses carefully designed geometric priors using either image gradients, pixel grou**s, or Hough transform variants. Instead, current deep learning methods do away with all prior knowledge and replace priors by training deep networks on large manually annotated datasets. Here, we reduce the dependency on labeled data by building on th… ▽ More Classical work on line segment detection is knowledge-based; it uses carefully designed geometric priors using either image gradients, pixel grou**s, or Hough transform variants. Instead, current deep learning methods do away with all prior knowledge and replace priors by training deep networks on large manually annotated datasets. Here, we reduce the dependency on labeled data by building on the classic knowledge-based priors while using deep networks to learn features. We add line priors through a trainable Hough transform block into a deep network. Hough transform provides the prior knowledge about global line parameterizations, while the convolutional layers can learn the local gradient-like line features. On the Wireframe (ShanghaiTech) and York Urban datasets we show that adding prior knowledge improves data efficiency as line priors no longer need to be learned from data. Keywords: Hough transform; global line prior, line segment detection. △ Less

Submitted 18 July, 2020; originally announced July 2020.

Comments: ECCV 2020, code online: https://github.com/yanconglin/Deep-Hough-Transform-Line-Priors

arXiv:2004.07629 [pdf, other]

Top-Down Networks: A coarse-to-fine reimagination of CNNs

Authors: Ioannis Lelekas, Nergis Tomen, Silvia L. Pintea, Jan C. van Gemert

Abstract: Biological vision adopts a coarse-to-fine information processing pathway, from initial visual detection and binding of salient features of a visual scene, to the enhanced and preferential processing given relevant stimuli. On the contrary, CNNs employ a fine-to-coarse processing, moving from local, edge-detecting filters to more global ones extracting abstract representations of the input. In this… ▽ More Biological vision adopts a coarse-to-fine information processing pathway, from initial visual detection and binding of salient features of a visual scene, to the enhanced and preferential processing given relevant stimuli. On the contrary, CNNs employ a fine-to-coarse processing, moving from local, edge-detecting filters to more global ones extracting abstract representations of the input. In this paper we reverse the feature extraction part of standard bottom-up architectures and turn them upside-down: We propose top-down networks. Our proposed coarse-to-fine pathway, by blurring higher frequency information and restoring it only at later stages, offers a line of defence against adversarial attacks that introduce high frequency noise. Moreover, since we increase image resolution with depth, the high resolution of the feature map in the final convolutional layer contributes to the explainability of the network's decision making process. This favors object-driven decisions over context driven ones, and thus provides better localized class activation maps. This paper offers empirical evidence for the applicability of the top-down resolution processing to various existing architectures on multiple visual tasks. △ Less

Submitted 16 April, 2020; originally announced April 2020.

Comments: CVPR Workshop Deep Vision 2020

arXiv:2003.07064 [pdf, other]

On Translation Invariance in CNNs: Convolutional Layers can Exploit Absolute Spatial Location

Authors: Osman Semih Kayhan, Jan C. van Gemert

Abstract: In this paper we challenge the common assumption that convolutional layers in modern CNNs are translation invariant. We show that CNNs can and will exploit the absolute spatial location by learning filters that respond exclusively to particular absolute locations by exploiting image boundary effects. Because modern CNNs filters have a huge receptive field, these boundary effects operate even far f… ▽ More In this paper we challenge the common assumption that convolutional layers in modern CNNs are translation invariant. We show that CNNs can and will exploit the absolute spatial location by learning filters that respond exclusively to particular absolute locations by exploiting image boundary effects. Because modern CNNs filters have a huge receptive field, these boundary effects operate even far from the image boundary, allowing the network to exploit absolute spatial location all over the image. We give a simple solution to remove spatial location encoding which improves translation invariance and thus gives a stronger visual inductive bias which particularly benefits small data sets. We broadly demonstrate these benefits on several architectures and various applications such as image classification, patch matching, and two video classification datasets. △ Less

Submitted 30 May, 2020; v1 submitted 16 March, 2020; originally announced March 2020.

Comments: CVPR 2020. (Minor revision on Figure 4, arguments unchanged.)

arXiv:1909.03552 [pdf, other]

Cross Domain Image Matching in Presence of Outliers

Authors: Xin Liu, Seyran Khademi, Jan C. van Gemert

Abstract: Cross domain image matching between image collections from different source and target domains is challenging in times of deep learning due to i) limited variation of image conditions in a training set, ii) lack of paired-image labels during training, iii) the existing of outliers that makes image matching domains not fully overlap. To this end, we propose an end-to-end architecture that can match… ▽ More Cross domain image matching between image collections from different source and target domains is challenging in times of deep learning due to i) limited variation of image conditions in a training set, ii) lack of paired-image labels during training, iii) the existing of outliers that makes image matching domains not fully overlap. To this end, we propose an end-to-end architecture that can match cross domain images without labels in the target domain and handle non-overlap** domains by outlier detection. We leverage domain adaptation and triplet constraints for training a network capable of learning domain invariant and identity distinguishable representations, and iteratively detecting the outliers with an entropy loss and our proposed weighted MK-MMD. Extensive experimental evidence on Office [17] dataset and our proposed datasets Shape, Pitts-CycleGAN shows that the proposed approach yields state-of-the-art cross domain image matching and outlier detection performance on different benchmarks. The code will be made publicly available. △ Less

Submitted 8 September, 2019; originally announced September 2019.

Comments: ICCV Workshop on Transferring and Adaptive Source Knowledge in Computer Vision (TASK-CV) 2019

arXiv:1809.03258 [pdf, other]

Using phase instead of optical flow for action recognition

Authors: Omar Hommos, Silvia L. Pintea, Pascal S. M. Mettes, Jan C. van Gemert

Abstract: Currently, the most common motion representation for action recognition is optical flow. Optical flow is based on particle tracking which adheres to a Lagrangian perspective on dynamics. In contrast to the Lagrangian perspective, the Eulerian model of dynamics does not track, but describes local changes. For video, an Eulerian phase-based motion representation, using complex steerable filters, has… ▽ More Currently, the most common motion representation for action recognition is optical flow. Optical flow is based on particle tracking which adheres to a Lagrangian perspective on dynamics. In contrast to the Lagrangian perspective, the Eulerian model of dynamics does not track, but describes local changes. For video, an Eulerian phase-based motion representation, using complex steerable filters, has been successfully employed recently for motion magnification and video frame interpolation. Inspired by these previous works, here, we proposes learning Eulerian motion representations in a deep architecture for action recognition. We learn filters in the complex domain in an end-to-end manner. We design these complex filters to resemble complex Gabor filters, typically employed for phase-information extraction. We propose a phase-information extraction module, based on these complex filters, that can be used in any network architecture for extracting Eulerian representations. We experimentally analyze the added value of Eulerian motion representations, as extracted by our proposed phase extraction module, and compare with existing motion representations based on optical flow, on the UCF101 dataset. △ Less

Submitted 14 September, 2018; v1 submitted 10 September, 2018; originally announced September 2018.

Comments: ECCV-2018 Workshop on "What is Optical Flow for?"

arXiv:1809.03218 [pdf, other]

Hand-tremor frequency estimation in videos

Authors: Silvia L. Pintea, Jian Zheng, Xilin Li, Paulina J. M. Bank, Jacobus J. van Hilten, Jan C. van Gemert

Abstract: We focus on the problem of estimating human hand-tremor frequency from input RGB video data. Estimating tremors from video is important for non-invasive monitoring, analyzing and diagnosing patients suffering from motor-disorders such as Parkinson's disease. We consider two approaches for hand-tremor frequency estimation: (a) a Lagrangian approach where we detect the hand at every frame in the vid… ▽ More We focus on the problem of estimating human hand-tremor frequency from input RGB video data. Estimating tremors from video is important for non-invasive monitoring, analyzing and diagnosing patients suffering from motor-disorders such as Parkinson's disease. We consider two approaches for hand-tremor frequency estimation: (a) a Lagrangian approach where we detect the hand at every frame in the video, and estimate the tremor frequency along the trajectory; and (b) an Eulerian approach where we first localize the hand, we subsequently remove the large motion along the movement trajectory of the hand, and we use the video information over time encoded as intensity values or phase information to estimate the tremor frequency. We estimate hand tremors on a new human tremor dataset, TIM-Tremor, containing static tasks as well as a multitude of more dynamic tasks, involving larger motion of the hands. The dataset has 55 tremor patient recordings together with: associated ground truth accelerometer data from the most affected hand, RGB video data, and aligned depth data. △ Less

Submitted 10 September, 2018; originally announced September 2018.

Comments: Best paper at ECCV-2018 Workshop on Observing and Understanding Hands in Action

arXiv:1805.07170 [pdf, other]

Recurrent knowledge distillation

Authors: Silvia L. Pintea, Yue Liu, Jan C. van Gemert

Abstract: Knowledge distillation compacts deep networks by letting a small student network learn from a large teacher network. The accuracy of knowledge distillation recently benefited from adding residual layers. We propose to reduce the size of the student network even further by recasting multiple residual layers in the teacher network into a single recurrent student layer. We propose three variants of a… ▽ More Knowledge distillation compacts deep networks by letting a small student network learn from a large teacher network. The accuracy of knowledge distillation recently benefited from adding residual layers. We propose to reduce the size of the student network even further by recasting multiple residual layers in the teacher network into a single recurrent student layer. We propose three variants of adding recurrent connections into the student network, and show experimentally on CIFAR-10, Scenes and MiniPlaces, that we can reduce the number of parameters at little loss in accuracy. △ Less

Submitted 18 May, 2018; originally announced May 2018.

Comments: International Conference on Image Processing (ICIP), 2018

arXiv:1803.06962 [pdf, other]

Featureless: Bypassing feature extraction in action categorization

Authors: Silvia L. Pintea, Pascal S. Mettes, Jan C. van Gemert, Arnold W. M. Smeulders

Abstract: This method introduces an efficient manner of learning action categories without the need of feature estimation. The approach starts from low-level values, in a similar style to the successful CNN methods. However, rather than extracting general image features, we learn to predict specific video representations from raw video data. The benefit of such an approach is that at the same computational… ▽ More This method introduces an efficient manner of learning action categories without the need of feature estimation. The approach starts from low-level values, in a similar style to the successful CNN methods. However, rather than extracting general image features, we learn to predict specific video representations from raw video data. The benefit of such an approach is that at the same computational expense it can predict 2 D video representations as well as 3 D ones, based on motion. The proposed model relies on discriminative Waldboost, which we enhance to a multiclass formulation for the purpose of learning video representations. The suitability of the proposed approach as well as its time efficiency are tested on the UCF11 action recognition dataset. △ Less

Submitted 19 March, 2018; originally announced March 2018.

Comments: Published in the proceedings of the International Conference on Image Processing (ICIP), 2016

arXiv:1803.06952 [pdf, other]

Asymmetric kernel in Gaussian Processes for learning target variance

Authors: Silvia L. Pintea, Jan C. van Gemert, Arnold W. M. Smeulders

Abstract: This work incorporates the multi-modality of the data distribution into a Gaussian Process regression model. We approach the problem from a discriminative perspective by learning, jointly over the training data, the target space variance in the neighborhood of a certain sample through metric learning. We start by using data centers rather than all training samples. Subsequently, each center select… ▽ More This work incorporates the multi-modality of the data distribution into a Gaussian Process regression model. We approach the problem from a discriminative perspective by learning, jointly over the training data, the target space variance in the neighborhood of a certain sample through metric learning. We start by using data centers rather than all training samples. Subsequently, each center selects an individualized kernel metric. This enables each center to adjust the kernel space in its vicinity in correspondence with the topology of the targets --- a multi-modal approach. We additionally add descriptiveness by allowing each center to learn a precision matrix. We demonstrate empirically the reliability of the model. △ Less

Submitted 19 March, 2018; originally announced March 2018.

Comments: Accepted in Pattern Recognition Letters, 2018

arXiv:1803.06951 [pdf, other]

Deja Vu: Motion Prediction in Static Images

Authors: Silvia L. Pintea, Jan C. van Gemert, Arnold W. M. Smeulders

Abstract: This paper proposes motion prediction in single still images by learning it from a set of videos. The building assumption is that similar motion is characterized by similar appearance. The proposed method learns local motion patterns given a specific appearance and adds the predicted motion in a number of applications. This work (i) introduces a novel method to predict motion from appearance in a… ▽ More This paper proposes motion prediction in single still images by learning it from a set of videos. The building assumption is that similar motion is characterized by similar appearance. The proposed method learns local motion patterns given a specific appearance and adds the predicted motion in a number of applications. This work (i) introduces a novel method to predict motion from appearance in a single static image, (ii) to that end, extends of the Structured Random Forest with regression derived from first principles, and (iii) shows the value of adding motion predictions in different tasks such as: weak frame-proposals containing unexpected events, action recognition, motion saliency. Illustrative results indicate that motion prediction is not only feasible, but also provides valuable information for a number of applications. △ Less

Submitted 21 March, 2018; v1 submitted 19 March, 2018; originally announced March 2018.

Comments: Published in the proceedings of the European Conference on Computer Vision (ECCV), 2014

arXiv:1706.09556 [pdf, other]

Vision-based Detection of Acoustic Timed Events: a Case Study on Clarinet Note Onsets

Authors: A. Bazzica, J. C. van Gemert, C. C. S. Liem, A. Hanjalic

Abstract: Acoustic events often have a visual counterpart. Knowledge of visual information can aid the understanding of complex auditory scenes, even when only a stereo mixdown is available in the audio domain, \eg identifying which musicians are playing in large musical ensembles. In this paper, we consider a vision-based approach to note onset detection. As a case study we focus on challenging, real-world… ▽ More Acoustic events often have a visual counterpart. Knowledge of visual information can aid the understanding of complex auditory scenes, even when only a stereo mixdown is available in the audio domain, \eg identifying which musicians are playing in large musical ensembles. In this paper, we consider a vision-based approach to note onset detection. As a case study we focus on challenging, real-world clarinetist videos and carry out preliminary experiments on a 3D convolutional neural network based on multiple streams and purposely avoiding temporal pooling. We release an audiovisual dataset with 4.5 hours of clarinetist videos together with cleaned annotations which include about 36,000 onsets and the coordinates for a number of salient points and regions of interest. By performing several training trials on our dataset, we learned that the problem is challenging. We found that the CNN model is highly sensitive to the optimization algorithm and hyper-parameters, and that treating the problem as binary classification may prevent the joint optimization of precision and recall. To encourage further research, we publicly share our dataset, annotations and all models and detail which issues we came across during our preliminary experiments. △ Less

Submitted 28 June, 2017; originally announced June 2017.

Comments: Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May, 2017 (arXiv:1706.08675v1 [cs.NE])

Report number: DLM/2017/8 MSC Class: 68Txx ACM Class: C.1.3; H.5.1

Journal ref: Proc of the First Int Workshop on Deep Learning and Music. Anchorage, US. 1(1). pp 31-36 (2017)

arXiv:1704.04186 [pdf, other]

Video Acceleration Magnification

Authors: Yichao Zhang, Silvia L. Pintea, Jan C. van Gemert

Abstract: The ability to amplify or reduce subtle image changes over time is useful in contexts such as video editing, medical video analysis, product quality control and sports. In these contexts there is often large motion present which severely distorts current video amplification methods that magnify change linearly. In this work we propose a method to cope with large motions while still magnifying smal… ▽ More The ability to amplify or reduce subtle image changes over time is useful in contexts such as video editing, medical video analysis, product quality control and sports. In these contexts there is often large motion present which severely distorts current video amplification methods that magnify change linearly. In this work we propose a method to cope with large motions while still magnifying small changes. We make the following two observations: i) large motions are linear on the temporal scale of the small changes; ii) small changes deviate from this linearity. We ignore linear motion and propose to magnify acceleration. Our method is pure Eulerian and does not require any optical flow, temporal alignment or region annotations. We link temporal second-order derivative filtering to spatial acceleration magnification. We apply our method to moving objects where we show motion magnification and color magnification. We provide quantitative as well as qualitative evidence for our method while comparing to the state-of-the-art. △ Less

Submitted 22 April, 2017; v1 submitted 13 April, 2017; originally announced April 2017.

Comments: Accepted paper at CVPR 2017. Project webpage: http://acceleration-magnification.github.io/

arXiv:1703.06971 [pdf, other]

Active Decision Boundary Annotation with Deep Generative Models

Authors: Miriam W. Huijser, Jan C. van Gemert

Abstract: This paper is on active learning where the goal is to reduce the data annotation burden by interacting with a (human) oracle during training. Standard active learning methods ask the oracle to annotate data samples. Instead, we take a profoundly different approach: we ask for annotations of the decision boundary. We achieve this using a deep generative model to create novel instances along a 1d li… ▽ More This paper is on active learning where the goal is to reduce the data annotation burden by interacting with a (human) oracle during training. Standard active learning methods ask the oracle to annotate data samples. Instead, we take a profoundly different approach: we ask for annotations of the decision boundary. We achieve this using a deep generative model to create novel instances along a 1d line. A point on the decision boundary is revealed where the instances change class. Experimentally we show on three data sets that our method can be plugged-in to other active learning schemes, that human oracles can effectively annotate points on the decision boundary, that our method is robust to annotation noise, and that decision boundary annotations improve over annotating data samples. △ Less

Submitted 2 August, 2017; v1 submitted 20 March, 2017; originally announced March 2017.

Comments: ICCV 2017

arXiv:1610.01801 [pdf, other]

Searching Scenes by Abstracting Things

Authors: Svetlana Kordumova, Jan C. van Gemert, Cees G. M. Snoek, Arnold W. M. Smeulders

Abstract: In this paper we propose to represent a scene as an abstraction of 'things'. We start from 'things' as generated by modern object proposals, and we investigate their immediately observable properties: position, size, aspect ratio and color, and those only. Where the recent successes and excitement of the field lie in object identification, we represent the scene composition independent of object i… ▽ More In this paper we propose to represent a scene as an abstraction of 'things'. We start from 'things' as generated by modern object proposals, and we investigate their immediately observable properties: position, size, aspect ratio and color, and those only. Where the recent successes and excitement of the field lie in object identification, we represent the scene composition independent of object identities. We make three contributions in this work. First, we study simple observable properties of 'things', and call it things syntax. Second, we propose translating the things syntax in linguistic abstract statements and study their descriptive effect to retrieve scenes. Thirdly, we propose querying of scenes with abstract block illustrations and study their effectiveness to discriminate among different types of scenes. The benefit of abstract statements and block illustrations is that we generate them directly from the images, without any learning beforehand as in the standard attribute learning. Surprisingly, we show that even though we use the simplest of features from 'things' layout and no learning at all, we can still retrieve scenes reasonably well. △ Less

Submitted 6 October, 2016; originally announced October 2016.

arXiv:1609.01693 [pdf, other]

Making a Case for Learning Motion Representations with Phase

Authors: S. L. Pintea, J. C. van Gemert

Abstract: This work advocates Eulerian motion representation learning over the current standard Lagrangian optical flow model. Eulerian motion is well captured by using phase, as obtained by decomposing the image through a complex-steerable pyramid. We discuss the gain of Eulerian motion in a set of practical use cases: (i) action recognition, (ii) motion prediction in static images, (iii) motion transfer i… ▽ More This work advocates Eulerian motion representation learning over the current standard Lagrangian optical flow model. Eulerian motion is well captured by using phase, as obtained by decomposing the image through a complex-steerable pyramid. We discuss the gain of Eulerian motion in a set of practical use cases: (i) action recognition, (ii) motion prediction in static images, (iii) motion transfer in static images and, (iv) motion transfer in video. For each task we motivate the phase-based direction and provide a possible approach. △ Less

Submitted 8 September, 2016; v1 submitted 6 September, 2016; originally announced September 2016.

Comments: ECCV 2016 Workshop on Brave new ideas for motion representations in videos

arXiv:1604.07602 [pdf, other]

Spot On: Action Localization from Pointly-Supervised Proposals

Authors: Pascal Mettes, Jan C. van Gemert, Cees G. M. Snoek

Abstract: We strive for spatio-temporal localization of actions in videos. The state-of-the-art relies on action proposals at test time and selects the best one with a classifier trained on carefully annotated box annotations. Annotating action boxes in video is cumbersome, tedious, and error prone. Rather than annotating boxes, we propose to annotate actions in video with points on a sparse subset of frame… ▽ More We strive for spatio-temporal localization of actions in videos. The state-of-the-art relies on action proposals at test time and selects the best one with a classifier trained on carefully annotated box annotations. Annotating action boxes in video is cumbersome, tedious, and error prone. Rather than annotating boxes, we propose to annotate actions in video with points on a sparse subset of frames only. We introduce an overlap measure between action proposals and points and incorporate them all into the objective of a non-convex Multiple Instance Learning optimization. Experimental evaluation on the UCF Sports and UCF 101 datasets shows that (i) spatio-temporal proposals can be used to train classifiers while retaining the localization performance, (ii) point annotations yield results comparable to box annotations while being significantly faster to annotate, (iii) with a minimum amount of supervision our approach is competitive to the state-of-the-art. Finally, we introduce spatio-temporal action annotations on the train and test videos of Hollywood2, resulting in Hollywood2Tubes, available at http://tinyurl.com/hollywood2tubes. △ Less

Submitted 25 July, 2016; v1 submitted 26 April, 2016; originally announced April 2016.

Report number: ECCV/2016/10

arXiv:1510.06939 [pdf, other]

Objects2action: Classifying and localizing actions without any video example

Authors: Mihir Jain, Jan C. van Gemert, Thomas Mensink, Cees G. M. Snoek

Abstract: The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches we do not demand the design and specification of attribute classifiers and class-to-attribute map**s to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model… ▽ More The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches we do not demand the design and specification of attribute classifiers and class-to-attribute map**s to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate for the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. And finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach. △ Less

Submitted 23 October, 2015; originally announced October 2015.

arXiv:1510.04908 [pdf, other]

No Spare Parts: Sharing Part Detectors for Image Categorization

Authors: Pascal Mettes, Jan C. van Gemert, Cees G. M. Snoek

Abstract: This work aims for image categorization using a representation of distinctive parts. Different from existing part-based work, we argue that parts are naturally shared between image categories and should be modeled as such. We motivate our approach with a quantitative and qualitative analysis by backtracking where selected parts come from. Our analysis shows that in addition to the category parts d… ▽ More This work aims for image categorization using a representation of distinctive parts. Different from existing part-based work, we argue that parts are naturally shared between image categories and should be modeled as such. We motivate our approach with a quantitative and qualitative analysis by backtracking where selected parts come from. Our analysis shows that in addition to the category parts defining the class, the parts coming from the background context and parts from other image categories improve categorization performance. Part selection should not be done separately for each category, but instead be shared and optimized over all categories. To incorporate part sharing between categories, we present an algorithm based on AdaBoost to jointly optimize part sharing and selection, as well as fusion with the global image representation. We achieve results competitive to the state-of-the-art on object, scene, and action categories, further improving over deep convolutional neural networks. △ Less

Submitted 12 July, 2016; v1 submitted 16 October, 2015; originally announced October 2015.

Showing 1–37 of 37 results for author: van Gemert, J C