Search | arXiv e-print repository

The Trifecta: Three simple techniques for training deeper Forward-Forward networks

Authors: Thomas Dooms, Ing Jyh Tsang, Jose Oramas

Abstract: Modern machine learning models are able to outperform humans on a variety of non-trivial tasks. However, as the complexity of the models increases, they consume significant amounts of power and still struggle to generalize effectively to unseen data. Local learning, which focuses on updating subsets of a model's parameters at a time, has emerged as a promising technique to address these issues. Re… ▽ More Modern machine learning models are able to outperform humans on a variety of non-trivial tasks. However, as the complexity of the models increases, they consume significant amounts of power and still struggle to generalize effectively to unseen data. Local learning, which focuses on updating subsets of a model's parameters at a time, has emerged as a promising technique to address these issues. Recently, a novel local learning algorithm, called Forward-Forward, has received widespread attention due to its innovative approach to learning. Unfortunately, its application has been limited to smaller datasets due to scalability issues. To this end, we propose The Trifecta, a collection of three simple techniques that synergize exceptionally well and drastically improve the Forward-Forward algorithm on deeper networks. Our experiments demonstrate that our models are on par with similarly structured, backpropagation-based models in both training speed and test accuracy on simple datasets. This is achieved by the ability to learn representations that are informative locally, on a layer-by-layer basis, and retain their informativeness when propagated to deeper layers in the architecture. This leads to around 84% accuracy on CIFAR-10, a notable improvement (25%) over the original FF algorithm. These results highlight the potential of Forward-Forward as a genuine competitor to backpropagation and as a promising research avenue. △ Less

Submitted 12 December, 2023; v1 submitted 29 November, 2023; originally announced November 2023.

MSC Class: 68T07

arXiv:2305.10121 [pdf, other]

FICNN: A Framework for the Interpretation of Deep Convolutional Neural Networks

Authors: Hamed Behzadi-Khormouji, José Oramas

Abstract: With the continue development of Convolutional Neural Networks (CNNs), there is a growing concern regarding representations that they encode internally. Analyzing these internal representations is referred to as model interpretation. While the task of model explanation, justifying the predictions of such models, has been studied extensively; the task of model interpretation has received less atten… ▽ More With the continue development of Convolutional Neural Networks (CNNs), there is a growing concern regarding representations that they encode internally. Analyzing these internal representations is referred to as model interpretation. While the task of model explanation, justifying the predictions of such models, has been studied extensively; the task of model interpretation has received less attention. The aim of this paper is to propose a framework for the study of interpretation methods designed for CNN models trained from visual data. More specifically, we first specify the difference between the interpretation and explanation tasks which are often considered the same in the literature. Then, we define a set of six specific factors that can be used to characterize interpretation methods. Third, based on the previous factors, we propose a framework for the positioning of interpretation methods. Our framework highlights that just a very small amount of the suggested factors, and combinations thereof, have been actually studied. Consequently, leaving significant areas unexplored. Following the proposed framework, we discuss existing interpretation methods and give some attention to the evaluation protocols followed to validate them. Finally, the paper highlights capabilities of the methods in producing feedback for enabling interpretation and proposes possible research problems arising from the framework. △ Less

Submitted 17 May, 2023; originally announced May 2023.

arXiv:2305.05349 [pdf, other]

Towards the Characterization of Representations Learned via Capsule-based Network Architectures

Authors: Saja AL-Tawalbeh, José Oramas

Abstract: Capsule Networks (CapsNets) have been re-introduced as a more compact and interpretable alternative to standard deep neural networks. While recent efforts have proved their compression capabilities, to date, their interpretability properties have not been fully assessed. Here, we conduct a systematic and principled study towards assessing the interpretability of these types of networks. Moreover,… ▽ More Capsule Networks (CapsNets) have been re-introduced as a more compact and interpretable alternative to standard deep neural networks. While recent efforts have proved their compression capabilities, to date, their interpretability properties have not been fully assessed. Here, we conduct a systematic and principled study towards assessing the interpretability of these types of networks. Moreover, we pay special attention towards analyzing the level to which part-whole relationships are indeed encoded within the learned representation. Our analysis in the MNIST, SVHN, PASCAL-part and CelebA datasets suggest that the representations encoded in CapsNets might not be as disentangled nor strictly related to parts-whole relationships as is commonly stated in the literature. △ Less

Submitted 9 May, 2023; originally announced May 2023.

Comments: This paper consist of 7 pages including 8 figures. This paper concern about interpretation of capsule network

MSC Class: ACM-class

arXiv:2302.11244 [pdf, other]

Considering Layerwise Importance in the Lottery Ticket Hypothesis

Authors: Benjamin Vandersmissen, Jose Oramas

Abstract: The Lottery Ticket Hypothesis (LTH) showed that by iteratively training a model, removing connections with the lowest global weight magnitude and rewinding the remaining connections, sparse networks can be extracted. This global comparison removes context information between connections within a layer. Here we study means for recovering some of this layer distributional context and generalise th… ▽ More The Lottery Ticket Hypothesis (LTH) showed that by iteratively training a model, removing connections with the lowest global weight magnitude and rewinding the remaining connections, sparse networks can be extracted. This global comparison removes context information between connections within a layer. Here we study means for recovering some of this layer distributional context and generalise the LTH to consider weight importance values rather than global weight magnitudes. We find that given a repeatable training procedure, applying different importance metrics leads to distinct performant lottery tickets with little overlap** connections. This strongly suggests that lottery tickets are not unique △ Less

Submitted 22 February, 2023; originally announced February 2023.

arXiv:2302.10764 [pdf, other]

doi 10.1016/j.cviu.2024.103934

On The Coherence of Quantitative Evaluation of Visual Explanations

Authors: Benjamin Vandersmissen, Jose Oramas

Abstract: Recent years have shown an increased development of methods for justifying the predictions of neural networks through visual explanations. These explanations usually take the form of heatmaps which assign a saliency (or relevance) value to each pixel of the input image that expresses how relevant the pixel is for the prediction of a label. Complementing this development, evaluation methods have… ▽ More Recent years have shown an increased development of methods for justifying the predictions of neural networks through visual explanations. These explanations usually take the form of heatmaps which assign a saliency (or relevance) value to each pixel of the input image that expresses how relevant the pixel is for the prediction of a label. Complementing this development, evaluation methods have been proposed to assess the "goodness" of such explanations. On the one hand, some of these methods rely on synthetic datasets. However, this introduces the weakness of having limited guarantees regarding their applicability on more realistic settings. On the other hand, some methods rely on metrics for objective evaluation. However the level to which some of these evaluation methods perform with respect to each other is uncertain. Taking this into account, we conduct a comprehensive study on a subset of the ImageNet-1k validation set where we evaluate a number of different commonly-used explanation methods following a set of evaluation methods. We complement our study with sanity checks on the studied evaluation methods as a means to investigate their reliability and the impact of characteristics of the explanations on the evaluation methods. Results of our study suggest that there is a lack of coherency on the grading provided by some of the considered evaluation methods. Moreover, we have identified some characteristics of the explanations, e.g. sparsity, which can have a significant effect on the performance. △ Less

Submitted 19 February, 2024; v1 submitted 14 February, 2023; originally announced February 2023.

Comments: Accepted at CVIU

arXiv:2301.06874 [pdf, ps, other]

Training Methods of Multi-label Prediction Classifiers for Hyperspectral Remote Sensing Images

Authors: Salma Haidar, José Oramas

Abstract: With their combined spectral depth and geometric resolution, hyperspectral remote sensing images embed a wealth of complex, non-linear information that challenges traditional computer vision techniques. Yet, deep learning methods known for their representation learning capabilities prove more suitable for handling such complexities. Unlike applications that focus on single-label, pixel-level class… ▽ More With their combined spectral depth and geometric resolution, hyperspectral remote sensing images embed a wealth of complex, non-linear information that challenges traditional computer vision techniques. Yet, deep learning methods known for their representation learning capabilities prove more suitable for handling such complexities. Unlike applications that focus on single-label, pixel-level classification methods for hyperspectral remote sensing images, we propose a multi-label, patch-level classification method based on a two-component deep-learning network. We use patches of reduced spatial dimension and a complete spectral depth extracted from the remote sensing images. Additionally, we investigate three training schemes for our network: Iterative, Joint, and Cascade. Experiments suggest that the Joint scheme is the best-performing scheme; however, its application requires an expensive search for the best weight combination of the loss constituents. The Iterative scheme enables the sharing of features between the two parts of the network at the early stages of training. It performs better on complex data with multi-labels. Further experiments showed that methods designed with different architectures performed well when trained on patches extracted and labeled according to our sampling method. △ Less

Submitted 26 October, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

Comments: 1- Added references. 2- updated methodology figure and added new figures to visualise the different training schemes and 3- Correcting typos 4- Revised introduction, no change in results or discussion

arXiv:2212.11030 [pdf, other]

doi 10.5220/0010838400003124

Deep set conditioned latent representations for action recognition

Authors: Akash Singh, Tom De Schepper, Kevin Mets, Peter Hellinckx, Jose Oramas, Steven Latre

Abstract: In recent years multi-label, multi-class video action recognition has gained significant popularity. While reasoning over temporally connected atomic actions is mundane for intelligent species, standard artificial neural networks (ANN) still struggle to classify them. In the real world, atomic actions often temporally connect to form more complex composite actions. The challenge lies in recognisin… ▽ More In recent years multi-label, multi-class video action recognition has gained significant popularity. While reasoning over temporally connected atomic actions is mundane for intelligent species, standard artificial neural networks (ANN) still struggle to classify them. In the real world, atomic actions often temporally connect to form more complex composite actions. The challenge lies in recognising composite action of varying durations while other distinct composite or atomic actions occur in the background. Drawing upon the success of relational networks, we propose methods that learn to reason over the semantic concept of objects and actions. We empirically show how ANNs benefit from pretraining, relational inductive biases and unordered set-based latent representations. In this paper we propose deep set conditioned I3D (SCI3D), a two stream relational network that employs latent representation of state and visual representation for reasoning over events and actions. They learn to reason about temporally connected actions in order to identify all of them in the video. The proposed method achieves an improvement of around 1.49% mAP in atomic action recognition and 17.57% mAP in composite action recognition, over a I3D-NL baseline, on the CATER dataset. △ Less

Submitted 21 December, 2022; originally announced December 2022.

Comments: Conference VISAPP 2022, 11 pages,5 figures, 2 Tables, 6 plots

Journal ref: In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, ISBN 978-989-758-555-5; ISSN 2184-4321, year 2022, pages 456-466

arXiv:2104.14375 [pdf, other]

MinMaxCAM: Improving object coverage for CAM-basedWeakly Supervised Object Localization

Authors: Kaili Wang, Jose Oramas, Tinne Tuytelaars

Abstract: One of the most common problems of weakly supervised object localization is that of inaccurate object coverage. In the context of state-of-the-art methods based on Class Activation Map**, this is caused either by localization maps which focus, exclusively, on the most discriminative region of the objects of interest or by activations occurring in background regions. To address these two problems… ▽ More One of the most common problems of weakly supervised object localization is that of inaccurate object coverage. In the context of state-of-the-art methods based on Class Activation Map**, this is caused either by localization maps which focus, exclusively, on the most discriminative region of the objects of interest or by activations occurring in background regions. To address these two problems, we propose two representation regularization mechanisms: Full Region Regularizationwhich tries to maximize the coverage of the localization map inside the object region, and Common Region Regularization which minimizes the activations occurring in background regions. We evaluate the two regularizations on the ImageNet, CUB-200-2011 and OpenImages-segmentation datasets, and show that the proposed regularizations tackle both problems, outperforming the state-of-the-art by a significant margin. △ Less

Submitted 29 April, 2021; originally announced April 2021.

arXiv:2104.07954 [pdf, other]

Towards Human-Understandable Visual Explanations:Imperceptible High-frequency Cues Can Better Be Removed

Authors: Kaili Wang, Jose Oramas, Tinne Tuytelaars

Abstract: Explainable AI (XAI) methods focus on explaining what a neural network has learned - in other words, identifying the features that are the most influential to the prediction. In this paper, we call them "distinguishing features". However, whether a human can make sense of the generated explanation also depends on the perceptibility of these features to humans. To make sure an explanation is human-… ▽ More Explainable AI (XAI) methods focus on explaining what a neural network has learned - in other words, identifying the features that are the most influential to the prediction. In this paper, we call them "distinguishing features". However, whether a human can make sense of the generated explanation also depends on the perceptibility of these features to humans. To make sure an explanation is human-understandable, we argue that the capabilities of humans, constrained by the Human Visual System (HVS) and psychophysics, need to be taken into account. We propose the {\em human perceptibility principle for XAI}, stating that, to generate human-understandable explanations, neural networks should be steered towards focusing on human-understandable cues during training. We conduct a case study regarding the classification of real vs. fake face images, where many of the distinguishing features picked up by standard neural networks turn out not to be perceptible to humans. By applying the proposed principle, a neural network with human-understandable explanations is trained which, in a user study, is shown to better align with human intuition. This is likely to make the AI more trustworthy and opens the door to humans learning from machines. In the case study, we specifically investigate and analyze the behaviour of the human-imperceptible high spatial frequency features in neural networks and XAI methods. △ Less

Submitted 16 April, 2021; originally announced April 2021.

arXiv:2010.15974 [pdf, other]

Can the state of relevant neurons in a deep neural networks serve as indicators for detecting adversarial attacks?

Authors: Roger Granda, Tinne Tuytelaars, Jose Oramas

Abstract: We present a method for adversarial attack detection based on the inspection of a sparse set of neurons. We follow the hypothesis that adversarial attacks introduce imperceptible perturbations in the input and that these perturbations change the state of neurons relevant for the concepts modelled by the attacked model. Therefore, monitoring the status of these neurons would enable the detection of… ▽ More We present a method for adversarial attack detection based on the inspection of a sparse set of neurons. We follow the hypothesis that adversarial attacks introduce imperceptible perturbations in the input and that these perturbations change the state of neurons relevant for the concepts modelled by the attacked model. Therefore, monitoring the status of these neurons would enable the detection of adversarial attacks. Focusing on the image classification task, our method identifies neurons that are relevant for the classes predicted by the model. A deeper qualitative inspection of these sparse set of neurons indicates that their state changes in the presence of adversarial samples. Moreover, quantitative results from our empirical evaluation indicate that our method is capable of recognizing adversarial samples, produced by state-of-the-art attack methods, with comparable accuracy to that of state-of-the-art detectors. △ Less

Submitted 29 October, 2020; originally announced October 2020.

arXiv:2009.07827 [pdf, other]

Multiple Exemplars-based Hallucinationfor Face Super-resolution and Editing

Authors: Kaili Wang, Jose Oramas, Tinne Tuytelaars

Abstract: Given a really low-resolution input image of a face (say 16x16 or 8x8 pixels), the goal of this paper is to reconstruct a high-resolution version thereof. This, by itself, is an ill-posed problem, as the high-frequency information is missing in the low-resolution input and needs to be hallucinated, based on prior knowledge about the image content. Rather than relying on a generic face prior, in th… ▽ More Given a really low-resolution input image of a face (say 16x16 or 8x8 pixels), the goal of this paper is to reconstruct a high-resolution version thereof. This, by itself, is an ill-posed problem, as the high-frequency information is missing in the low-resolution input and needs to be hallucinated, based on prior knowledge about the image content. Rather than relying on a generic face prior, in this paper, we explore the use of a set of exemplars, i.e. other high-resolution images of the same person. These guide the neural network as we condition the output on them. Multiple exemplars work better than a single one. To combine the information from multiple exemplars effectively, we introduce a pixel-wise weight generation module. Besides standard face super-resolution, our method allows to perform subtle face editing simply by replacing the exemplars with another set with different facial features. A user study is conducted and shows the super-resolved images can hardly be distinguished from real images on the CelebA dataset. A qualitative comparison indicates our model outperforms methods proposed in the literature on the CelebA and WebFace dataset. △ Less

Submitted 14 January, 2021; v1 submitted 16 September, 2020; originally announced September 2020.

Comments: accepted in ACCV 2020

arXiv:2001.08559

Information Compensation for Deep Conditional Generative Networks

Authors: Zehao Wang, Kaili Wang, Tinne Tuytelaars, Jose Oramas

Abstract: In recent years, unsupervised/weakly-supervised conditional generative adversarial networks (GANs) have achieved many successes on the task of modeling and generating data. However, one of their weaknesses lies in their poor ability to separate, or disentangle, the different factors that characterize the representation encoded in their latent space. To address this issue, we propose a novel struct… ▽ More In recent years, unsupervised/weakly-supervised conditional generative adversarial networks (GANs) have achieved many successes on the task of modeling and generating data. However, one of their weaknesses lies in their poor ability to separate, or disentangle, the different factors that characterize the representation encoded in their latent space. To address this issue, we propose a novel structure for unsupervised conditional GANs powered by a novel Information Compensation Connection (IC-Connection). The proposed IC-Connection enables GANs to compensate for information loss incurred during deconvolution operations. In addition, to quantify the degree of disentanglement on both discrete and continuous latent variables, we design a novel evaluation procedure. Our empirical results suggest that our method achieves better disentanglement compared to the state-of-the-art GANs in a conditional generation setting. △ Less

Submitted 6 March, 2022; v1 submitted 23 January, 2020; originally announced January 2020.

Comments: I think my previous work during master study is too naive

arXiv:1909.05690 [pdf, other]

In Defense of LSTMs for Addressing Multiple Instance Learning Problems

Authors: Kaili Wang, Jose Oramas, Tinne Tuytelaars

Abstract: LSTMs have a proven track record in analyzing sequential data. But what about unordered instance bags, as found under a Multiple Instance Learning (MIL) setting? While not often used for this, we show LSTMs excell under this setting too. In addition, we show thatLSTMs are capable of indirectly capturing instance-level information us-ing only bag-level annotations. Thus, they can be used to learn i… ▽ More LSTMs have a proven track record in analyzing sequential data. But what about unordered instance bags, as found under a Multiple Instance Learning (MIL) setting? While not often used for this, we show LSTMs excell under this setting too. In addition, we show thatLSTMs are capable of indirectly capturing instance-level information us-ing only bag-level annotations. Thus, they can be used to learn instance-level models in a weakly supervised manner. Our empirical evaluation on both simplified (MNIST) and realistic (Lookbook and Histopathology) datasets shows that LSTMs are competitive with or even surpass state-of-the-art methods specially designed for handling specific MIL problems. Moreover, we show that their performance on instance-level prediction is close to that of fully-supervised methods. △ Less

Submitted 14 January, 2021; v1 submitted 11 September, 2019; originally announced September 2019.

Comments: accepted in ACCV 2020 (oral)

arXiv:1812.02134 [pdf, other]

An Unpaired Shape Transforming Method for Image Translation and Cross-Domain Retrieval

Authors: Kaili Wang, Liqian Ma, Jose Oramas, Luc Van Gool, Tinne Tuytelaars

Abstract: We address the problem of unpaired geometric image-to-image translation. Rather than transferring the style of an image as a whole, our goal is to translate the geometry of an object as depicted in different domains while preserving its appearance characteristics. Our model is trained in an unpaired fashion, i.e. without the need of paired images during training. It performs all steps of the shape… ▽ More We address the problem of unpaired geometric image-to-image translation. Rather than transferring the style of an image as a whole, our goal is to translate the geometry of an object as depicted in different domains while preserving its appearance characteristics. Our model is trained in an unpaired fashion, i.e. without the need of paired images during training. It performs all steps of the shape transfer within a single model and without additional post-processing stages. Extensive experiments on the VITON, CMU-Multi-PIE and our own FashionStyle datasets show the effectiveness of the method. In addition, we show that despite their low-dimensionality, the features learned by our model are useful to the item retrieval task. △ Less

Submitted 18 August, 2021; v1 submitted 5 December, 2018; originally announced December 2018.

Comments: This manuscript is a pre-print currently under review at the Elsevier Journal Computer Vision and Image Under-standing (CVIU)

arXiv:1712.06302 [pdf, other]

Visual Explanation by Interpretation: Improving Visual Feedback Capabilities of Deep Neural Networks

Authors: Jose Oramas, Kaili Wang, Tinne Tuytelaars

Abstract: Interpretation and explanation of deep models is critical towards wide adoption of systems that rely on them. In this paper, we propose a novel scheme for both interpretation as well as explanation in which, given a pretrained model, we automatically identify internal features relevant for the set of classes considered by the model, without relying on additional annotations. We interpret the model… ▽ More Interpretation and explanation of deep models is critical towards wide adoption of systems that rely on them. In this paper, we propose a novel scheme for both interpretation as well as explanation in which, given a pretrained model, we automatically identify internal features relevant for the set of classes considered by the model, without relying on additional annotations. We interpret the model through average visualizations of this reduced set of features. Then, at test time, we explain the network prediction by accompanying the predicted class label with supporting visualizations derived from the identified features. In addition, we propose a method to address the artifacts introduced by stridded operations in deconvNet-based visualizations. Moreover, we introduce an8Flower, a dataset specifically designed for objective quantitative evaluation of methods for visual explanation.Experiments on the MNIST,ILSVRC12,Fashion144k and an8Flower datasets show that our method produces detailed explanations with good coverage of relevant features of the classes of interest △ Less

Submitted 8 March, 2019; v1 submitted 18 December, 2017; originally announced December 2017.

Comments: Accepted at International Conference on Learning Representations (ICLR) 2019. Project website: http://homes.esat.kuleuven.be/~joramas/projects/visualExplanationByInterpretation

arXiv:1707.02905 [pdf, other]

An Analysis of Human-centered Geolocation

Authors: Kaili Wang, Yu-Hui Huang, Jose Oramas, Luc Van Gool, Tinne Tuytelaars

Abstract: Online social networks contain a constantly increasing amount of images - most of them focusing on people. Due to cultural and climate factors, fashion trends and physical appearance of individuals differ from city to city. In this paper we investigate to what extent such cues can be exploited in order to infer the geographic location, i.e. the city, where a picture was taken. We conduct a user st… ▽ More Online social networks contain a constantly increasing amount of images - most of them focusing on people. Due to cultural and climate factors, fashion trends and physical appearance of individuals differ from city to city. In this paper we investigate to what extent such cues can be exploited in order to infer the geographic location, i.e. the city, where a picture was taken. We conduct a user study, as well as an evaluation of automatic methods based on convolutional neural networks. Experiments on the Fashion 144k and a Pinterest-based dataset show that the automatic methods succeed at this task to a reasonable extent. As a matter of fact, our empirical results suggest that automatic methods can surpass human performance by a large margin. Further inspection of the trained models shows that human-centered characteristics, like clothing style, physical features, and accessories, are informative for the task at hand. Moreover, it reveals that also contextual features, e.g. wall type, natural environment, etc., are taken into account by the automatic methods. △ Less

Submitted 31 January, 2018; v1 submitted 10 July, 2017; originally announced July 2017.

Comments: WACV'18

arXiv:1704.06610 [pdf, other]

doi 10.1016/j.cviu.2017.04.005

Context-based Object Viewpoint Estimation: A 2D Relational Approach

Authors: Jose Oramas, Luc De Raedt, Tinne Tuytelaars

Abstract: The task of object viewpoint estimation has been a challenge since the early days of computer vision. To estimate the viewpoint (or pose) of an object, people have mostly looked at object intrinsic features, such as shape or appearance. Surprisingly, informative features provided by other, extrinsic elements in the scene, have so far mostly been ignored. At the same time, contextual cues have been… ▽ More The task of object viewpoint estimation has been a challenge since the early days of computer vision. To estimate the viewpoint (or pose) of an object, people have mostly looked at object intrinsic features, such as shape or appearance. Surprisingly, informative features provided by other, extrinsic elements in the scene, have so far mostly been ignored. At the same time, contextual cues have been proven to be of great benefit for related tasks such as object detection or action recognition. In this paper, we explore how information from other objects in the scene can be exploited for viewpoint estimation. In particular, we look at object configurations by following a relational neighbor-based approach for reasoning about object relations. We show that, starting from noisy object detections and viewpoint estimates, exploiting the estimated viewpoint and location of other objects in the scene can lead to improved object viewpoint predictions. Experiments on the KITTI dataset demonstrate that object configurations can indeed be used as a complementary cue to appearance-based viewpoint estimation. Our analysis reveals that the proposed context-based method can improve object viewpoint estimation by reducing specific types of viewpoint estimation errors commonly made by methods that only consider local information. Moreover, considering contextual information produces superior performance in scenes where a high number of object instances occur. Finally, our results suggest that, following a cautious relational neighbor formulation brings improvements over its aggressive counterpart for the task of object viewpoint estimation. △ Less

Submitted 21 April, 2017; originally announced April 2017.

Comments: Computer Vision and Image Understanding (CVIU)

arXiv:1607.06356 [pdf, other]

Reasoning about Body-Parts Relations for Sign Language Recognition

Authors: Marc Martínez-Camarena, Jose Oramas, Mario Montagud-Climent, Tinne Tuytelaars

Abstract: Over the years, hand gesture recognition has been mostly addressed considering hand trajectories in isolation. However, in most sign languages, hand gestures are defined on a particular context (body region). We propose a pipeline to perform sign language recognition which models hand movements in the context of other parts of the body captured in the 3D space using the MS Kinect sensor. In additi… ▽ More Over the years, hand gesture recognition has been mostly addressed considering hand trajectories in isolation. However, in most sign languages, hand gestures are defined on a particular context (body region). We propose a pipeline to perform sign language recognition which models hand movements in the context of other parts of the body captured in the 3D space using the MS Kinect sensor. In addition, we perform sign recognition based on the different hand postures that occur during a sign. Our experiments show that considering different body parts brings improved performance when compared to other methods which only consider global hand trajectories. Finally, we demonstrate that the combination of hand postures features with hand gestures features helps to improve the prediction of a given sign. △ Less

Submitted 21 July, 2016; originally announced July 2016.

Comments: Under Review ( 15 Pages: 13 Figures, 6 Tables )

arXiv:1604.00036 [pdf, other]

Modeling Visual Compatibility through Hierarchical Mid-level Elements

Authors: Jose Oramas, Tinne Tuytelaars

Abstract: In this paper we present a hierarchical method to discover mid-level elements with the objective of modeling visual compatibility between objects. At the base-level, our method identifies patterns of CNN activations with the aim of modeling different variations/styles in which objects of the classes of interest may occur. At the top-level, the proposed method discovers patterns of co-occurring act… ▽ More In this paper we present a hierarchical method to discover mid-level elements with the objective of modeling visual compatibility between objects. At the base-level, our method identifies patterns of CNN activations with the aim of modeling different variations/styles in which objects of the classes of interest may occur. At the top-level, the proposed method discovers patterns of co-occurring activations of base-level elements that define visual compatibility between pairs of object classes. Experiments on the massive Amazon dataset show the strength of our method at describing object classes and the characteristics that drive the compatibility between them. △ Less

Submitted 31 March, 2016; originally announced April 2016.

Comments: 29 pages, 19 Figures

arXiv:1512.01848 [pdf, other]

doi 10.1109/TPAMI.2016.2558148

Rank Pooling for Action Recognition

Authors: Basura Fernando, Efstratios Gavves, Jose Oramas, Amir Ghodrati, Tinne Tuytelaars

Abstract: We propose a function-based temporal pooling method that captures the latent structure of the video sequence data - e.g. how frame-level features evolve over time in a video. We show how the parameters of a function that has been fit to the video data can serve as a robust new video representation. As a specific example, we learn a pooling function via ranking machines. By learning to rank the fra… ▽ More We propose a function-based temporal pooling method that captures the latent structure of the video sequence data - e.g. how frame-level features evolve over time in a video. We show how the parameters of a function that has been fit to the video data can serve as a robust new video representation. As a specific example, we learn a pooling function via ranking machines. By learning to rank the frame-level features of a video in chronological order, we obtain a new representation that captures the video-wide temporal dynamics of a video, suitable for action recognition. Other than ranking functions, we explore different parametric models that could also explain the temporal changes in videos. The proposed functional pooling methods, and rank pooling in particular, is easy to interpret and implement, fast to compute and effective in recognizing a wide variety of actions. We evaluate our method on various benchmarks for generic action, fine-grained action and gesture recognition. Results show that rank pooling brings an absolute improvement of 7-10 average pooling baseline. At the same time, rank pooling is compatible with and complementary to several appearance and local motion based methods and features, such as improved trajectories and deep learning features. △ Less

Submitted 15 May, 2016; v1 submitted 6 December, 2015; originally announced December 2015.

Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence

Showing 1–20 of 20 results for author: Oramas, J