Showing 1–28 of 28 results for author: Lanz, O

  1. arXiv:2405.06994  [pdf, other]

    cs.CV cs.LG

    GRASP-GCN: Graph-Shape Prioritization for Neural Architecture Search under Distribution Shifts

    Authors: Sofia Casarin, Oswald Lanz, Sergio Escalera

    Abstract: Neural Architecture Search (NAS) methods have been shown to output networks that largely outperform human-designed networks. However, conventional NAS methods have mostly tackled the single dataset scenario, incurring a large computational cost as the procedure has to be run from scratch for every new dataset. In this work, we focus on predictor-based algorithms and propose a simple and efficient way…

    Submitted 11 May, 2024; originally announced May 2024.

  2. arXiv:2405.06980  [pdf, other]

    cs.CV

    Fractals as Pre-training Datasets for Anomaly Detection and Localization

    Authors: C. I. Ugwu, S. Casarin, O. Lanz

    Abstract: Anomaly detection is crucial in large-scale industrial manufacturing as it helps detect and localise defective parts. Pre-training feature extractors on large-scale datasets is a popular approach for this task. Stringent data security and privacy regulations and high costs and acquisition time hinder the availability and creation of such large datasets. While recent work in anomaly detection prima…

    Submitted 11 May, 2024; originally announced May 2024.

  3. arXiv:2403.15194  [pdf, other]

    cs.CV cs.LG

    Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion

    Authors: Sofia Casarin, Cynthia I. Ugwu, Sergio Escalera, Oswald Lanz

    Abstract: The landscape of deep learning research is moving towards innovative strategies to harness the true potential of data. Traditionally, emphasis has been on scaling model architectures, resulting in large and complex neural networks, which can be difficult to train with limited computational resources. However, independently of the model size, data quality (i.e. amount and variability) is still a ma…

    Submitted 22 March, 2024; originally announced March 2024.

  4. arXiv:2307.12853  [pdf, other]

    eess.IV cs.CV

    Spatiotemporal Modeling Encounters 3D Medical Image Analysis: Slice-Shift UNet with Multi-View Fusion

    Authors: C. I. Ugwu, S. Casarin, O. Lanz

    Abstract: As a fundamental part of computational healthcare, Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) provide volumetric data, making the development of algorithms for 3D image analysis a necessity. Despite being computationally cheap, 2D Convolutional Neural Networks can only extract spatial information. In contrast, 3D CNNs can extract three-dimensional features, but they have higher…

    Submitted 25 July, 2023; v1 submitted 24 July, 2023; originally announced July 2023.

  5. arXiv:2212.08830  [pdf, other]

    cs.CV

    Inductive Attention for Video Action Anticipation

    Authors: Tsung-Ming Tai, Giuseppe Fiameni, Cheng-Kuang Lee, Simon See, Oswald Lanz

    Abstract: Anticipating future actions based on spatiotemporal observations is essential in video understanding and predictive computer vision. Moreover, a model capable of anticipating the future has important applications: it can help precautionary systems react before an event occurs. However, unlike in the action recognition task, future information is inaccessible at observation time -- a model ca…

    Submitted 18 March, 2023; v1 submitted 17 December, 2022; originally announced December 2022.

  6. A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval

    Authors: Alex Falcon, Giuseppe Serra, Oswald Lanz

    Abstract: Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. To find relevant videos by means of a natural language query, text-video retrieval methods have received increased attention over the past few years. Data augmentation techniques were introduced to increase the performance on unseen test examples by creating new training samples with the ap…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: Accepted for presentation at 30th ACM International Conference on Multimedia (ACM MM)

  7. arXiv:2206.10903  [pdf, ps, other]

    cs.CV

    UniUD-FBK-UB-UniBZ Submission to the EPIC-Kitchens-100 Multi-Instance Retrieval Challenge 2022

    Authors: Alex Falcon, Giuseppe Serra, Sergio Escalera, Oswald Lanz

    Abstract: This report presents the technical details of our submission to the EPIC-Kitchens-100 Multi-Instance Retrieval Challenge 2022. To participate in the challenge, we designed an ensemble consisting of different models trained with two recently developed relevance-augmented versions of the widely used triplet loss. Our submission, visible on the public leaderboard, obtains an average score of 61.02% n…

    Submitted 22 June, 2022; originally announced June 2022.

    Comments: Ranked joint 1st place in the Multi-Instance Action Retrieval Challenge organized at EPIC@CVPR2022

  8. arXiv:2206.10869  [pdf, other]

    cs.CV

    NVIDIA-UNIBZ Submission for EPIC-KITCHENS-100 Action Anticipation Challenge 2022

    Authors: Tsung-Ming Tai, Oswald Lanz, Giuseppe Fiameni, Yi-Kwan Wong, Sze-Sen Poon, Cheng-Kuang Lee, Ka-Chun Cheung, Simon See

    Abstract: In this report, we describe the technical details of our submission for the EPIC-Kitchens-100 action anticipation challenge. Our models, the higher-order recurrent space-time transformer and the message-passing neural network with edge learning, are both recurrent-based architectures which observe only 2.5 seconds of inference context to form the action anticipation prediction. By averaging the pre…

    Submitted 22 June, 2022; originally announced June 2022.

  9. arXiv:2206.01009  [pdf, other]

    cs.CV

    Unified Recurrence Modeling for Video Action Anticipation

    Authors: Tsung-Ming Tai, Giuseppe Fiameni, Cheng-Kuang Lee, Simon See, Oswald Lanz

    Abstract: Forecasting future events based on evidence of current conditions is an innate skill of human beings, and key for predicting the outcome of any decision making. In artificial vision, for example, we would like to predict the next human action before it happens, without observing the future video frames associated with it. Computer vision models for action anticipation are expected to collect the subt…

    Submitted 2 June, 2022; originally announced June 2022.

  10. Relevance-based Margin for Contrastively-trained Video Retrieval Models

    Authors: Alex Falcon, Swathikiran Sudhakaran, Giuseppe Serra, Sergio Escalera, Oswald Lanz

    Abstract: Video retrieval using natural language queries has attracted increasing interest due to its relevance in real-world applications, from intelligent access in private media galleries to web-scale video search. Learning the cross-similarity of video and text in a joint embedding space is the dominant approach. To do so, a contrastive loss is usually employed because it organizes the embedding space b…

    Submitted 27 April, 2022; originally announced April 2022.

    Comments: Accepted for presentation at International Conference on Multimedia Retrieval (ICMR '22)

  11. arXiv:2203.08897  [pdf, other]

    cs.CV

    Gate-Shift-Fuse for Video Action Recognition

    Authors: Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

    Abstract: Convolutional Neural Networks are the de facto models for image recognition. However, 3D CNNs, the straightforward extension of 2D CNNs for video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance of 3D CNNs is the increased computational complexity requiring large scale annotated datasets to train them in…

    Submitted 15 April, 2023; v1 submitted 16 March, 2022; originally announced March 2022.

    Comments: Accepted to TPAMI. arXiv admin note: text overlap with arXiv:1912.00381

  12. arXiv:2203.08688  [pdf, other]

    cs.CV

    Learning video retrieval models with relevance-aware online mining

    Authors: Alex Falcon, Giuseppe Serra, Oswald Lanz

    Abstract: Due to the number of videos and related captions uploaded every hour, deep learning-based solutions for cross-modal video retrieval are attracting more and more attention. A typical approach consists in learning a joint text-video embedding space, where the similarity of a video and its associated caption is maximized, whereas a lower similarity is enforced with all the other captions, called nega…

    Submitted 16 March, 2022; originally announced March 2022.

    Comments: Accepted at 21st International Conference on Image Analysis and Processing (ICIAP 2021)

  13. arXiv:2110.02902  [pdf, ps, other]

    cs.CV

    SAIC_Cambridge-HuPBA-FBK Submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021

    Authors: Swathikiran Sudhakaran, Adrian Bulat, Juan-Manuel Perez-Rua, Alex Falcon, Sergio Escalera, Oswald Lanz, Brais Martinez, Georgios Tzimiropoulos

    Abstract: This report presents the technical details of our submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021. To participate in the challenge we deployed spatio-temporal feature extraction and aggregation models we have developed recently: GSF and XViT. GSF is an efficient spatio-temporal feature extracting module that can be plugged into 2D CNNs for video action recognition. XViT is a…

    Submitted 6 October, 2021; originally announced October 2021.

    Comments: Ranked third in the EPIC-Kitchens-100 Action Recognition Challenge @ CVPR 2021

  14. arXiv:2104.08665  [pdf, other]

    cs.CV

    Higher Order Recurrent Space-Time Transformer for Video Action Prediction

    Authors: Tsung-Ming Tai, Giuseppe Fiameni, Cheng-Kuang Lee, Oswald Lanz

    Abstract: Endowing visual agents with predictive capability is a key step towards video intelligence at scale. The predominant modeling paradigm for this is sequence learning, mostly implemented through LSTMs. Feed-forward Transformer architectures have replaced recurrent model designs in ML applications of language processing and also partly in computer vision. In this paper we investigate on the competiti…

    Submitted 21 September, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

  15. Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries

    Authors: Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

    Abstract: We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component of EgoACO is class activation pooling (CAP), a differentiable pooling operation that combines ideas from bilinear pooling for fine-grained re…

    Submitted 16 February, 2021; originally announced February 2021.

    Comments: Accepted to TPAMI

  16. arXiv:2008.09849  [pdf, other]

    cs.CV

    Data augmentation techniques for the Video Question Answering task

    Authors: Alex Falcon, Oswald Lanz, Giuseppe Serra

    Abstract: Video Question Answering (VideoQA) is a task that requires a model to analyze and understand both the visual content given by the input video and the textual part given by the question, and the interaction between them in order to produce a meaningful answer. In our work we focus on the Egocentric VideoQA task, which exploits first-person videos, because of the importance of such a task which can ha…

    Submitted 22 August, 2020; originally announced August 2020.

    Comments: 16 pages, 5 figures; to be published in Egocentric Perception, Interaction and Computing (EPIC) Workshop Proceedings, at ECCV 2020

  17. arXiv:2007.02808  [pdf, other]

    cs.CV

    Novel-View Human Action Synthesis

    Authors: Mohamed Ilyes Lakhal, Davide Boscaini, Fabio Poiesi, Oswald Lanz, Andrea Cavallaro

    Abstract: Novel-View Human Action Synthesis aims to synthesize the movement of a body from a virtual viewpoint, given a video from a real viewpoint. We present a novel 3D reasoning to synthesize the target viewpoint. We first estimate the 3D mesh of the target body and transfer the rough textures from the 2D images to the mesh. As this transfer may generate sparse textures on the mesh due to frame resolutio…

    Submitted 8 October, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

    Comments: Asian Conference on Computer Vision (ACCV) 2020

  18. arXiv:2006.13725  [pdf, other]

    cs.CV

    FBK-HUPBA Submission to the EPIC-Kitchens Action Recognition 2020 Challenge

    Authors: Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

    Abstract: In this report we describe the technical details of our submission to the EPIC-Kitchens Action Recognition 2020 Challenge. To participate in the challenge we deployed spatio-temporal feature extraction and aggregation models we have developed recently: Gate-Shift Module (GSM) [1] and EgoACO, an extension of Long Short-Term Attention (LSTA) [2]. We design an ensemble of GSM and EgoACO model familie…

    Submitted 24 June, 2020; originally announced June 2020.

    Comments: Ranked 3rd in the EPIC-Kitchens action recognition challenge @ CVPR 2020

  19. arXiv:1912.00381  [pdf, other]

    cs.CV

    Gate-Shift Networks for Video Action Recognition

    Authors: Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

    Abstract: Deep 3D CNNs for video action recognition are designed to learn powerful representations in the joint spatio-temporal feature space. In practice, however, because of the large number of parameters and computations involved, they may underperform in the absence of sufficiently large datasets for training them at scale. In this paper we introduce spatial gating in spatial-temporal decomposition of 3D k…

    Submitted 21 March, 2020; v1 submitted 1 December, 2019; originally announced December 2019.

    Comments: CVPR20 camera ready version. Code and models available at https://github.com/swathikirans/GSM

  20. arXiv:1907.01273  [pdf, other]

    cs.CV

    An Analysis of Deep Neural Networks with Attention for Action Recognition from a Neurophysiological Perspective

    Authors: Swathikiran Sudhakaran, Oswald Lanz

    Abstract: We review three recent deep learning based methods for action recognition and present a brief comparative analysis of the methods from a neurophysiological point of view. We posit that there is some analogy between the three presented deep learning based methods and some of the existing hypotheses regarding the functioning of the human brain.

    Submitted 2 July, 2019; originally announced July 2019.

    Comments: Presented as an extended abstract in the Mutual benefits of cognitive and computer vision (MBCCV) workshop, CVPR 2019

  21. arXiv:1906.08960  [pdf, other]

    cs.CV

    FBK-HUPBA Submission to the EPIC-Kitchens 2019 Action Recognition Challenge

    Authors: Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

    Abstract: In this report we describe the technical details of our submission to the EPIC-Kitchens 2019 action recognition challenge. To participate in the challenge we have developed a number of CNN-LSTA [3] and HF-TSN [2] variants, and submitted predictions from an ensemble compiled out of these two model families. Our submission, visible on the public leaderboard with team name FBK-HUPBA, achieved a top-1…

    Submitted 21 June, 2019; originally announced June 2019.

    Comments: Ranked 3rd in the EPIC-Kitchens 2019 action recognition challenge, held as part of CVPR 2019

  22. arXiv:1905.12462  [pdf, other]

    cs.CV

    Hierarchical Feature Aggregation Networks for Video Action Recognition

    Authors: Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

    Abstract: Most action recognition methods are based on a) a late aggregation of frame level CNN features using average pooling, max pooling, or RNN, among others, or b) spatio-temporal aggregation via 3D convolutions. The first assumes independence among frame features up to a certain level of abstraction and then performs higher-level aggregation, while the second extracts spatio-temporal features from grouped fr…

    Submitted 29 May, 2019; originally announced May 2019.

  23. arXiv:1811.10698  [pdf, other]

    cs.CV

    LSTA: Long Short-Term Attention for Egocentric Action Recognition

    Authors: Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

    Abstract: Egocentric activity recognition is one of the most challenging tasks in video analysis. It requires a fine-grained discrimination of small objects and their manipulation. While some methods are based on strong supervision and attention mechanisms, they are either annotation consuming or do not take spatio-temporal patterns into account. In this paper we propose LSTA as a mechanism to focus on features…

    Submitted 12 April, 2019; v1 submitted 26 November, 2018; originally announced November 2018.

    Comments: Accepted to CVPR 2019

  24. arXiv:1808.09892  [pdf, other]

    cs.CV

    Top-down Attention Recurrent VLAD Encoding for Action Recognition in Videos

    Authors: Swathikiran Sudhakaran, Oswald Lanz

    Abstract: Most recent approaches for action recognition from video leverage deep architectures to encode the video clip into a fixed length representation vector that is then used for classification. For this to be successful, the network must be capable of suppressing irrelevant scene background and extracting the representation from the most discriminative part of the video. Our contribution builds on the ob…

    Submitted 29 August, 2018; originally announced August 2018.

    Comments: Accepted to the 17th International Conference of the Italian Association for Artificial Intelligence

  25. arXiv:1807.11794  [pdf, ps, other]

    cs.CV

    Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition

    Authors: Swathikiran Sudhakaran, Oswald Lanz

    Abstract: In this paper we propose an end-to-end trainable deep neural network model for egocentric activity recognition. Our model is built on the observation that egocentric activities are highly characterized by the objects and their locations in the video. Based on this, we develop a spatial attention mechanism that enables the network to attend to regions containing objects that are correlated with the…

    Submitted 31 July, 2018; originally announced July 2018.

    Comments: Accepted to BMVC 2018

  26. arXiv:1709.06531  [pdf, ps, other]

    cs.CV

    Learning to Detect Violent Videos using Convolutional Long Short-Term Memory

    Authors: Swathikiran Sudhakaran, Oswald Lanz

    Abstract: Developing a technique for the automatic analysis of surveillance videos in order to identify the presence of violence is of broad interest. In this work, we propose a deep neural network for the purpose of recognizing violent videos. A convolutional neural network is used to extract frame level features from a video. The frame level features are then aggregated using a variant of the long short t…

    Submitted 19 September, 2017; originally announced September 2017.

    Comments: Accepted in International Conference on Advanced Video and Signal based Surveillance (AVSS 2017)

  27. arXiv:1709.06495  [pdf, ps, other]

    cs.CV

    Convolutional Long Short-Term Memory Networks for Recognizing First Person Interactions

    Authors: Swathikiran Sudhakaran, Oswald Lanz

    Abstract: In this paper, we present a novel deep learning based approach for addressing the problem of interaction recognition from a first person perspective. The proposed approach uses a pair of convolutional neural networks, whose parameters are shared, for extracting frame level features from successive frames of the video. The frame level features are then aggregated using a convolutional long short-te…

    Submitted 19 September, 2017; originally announced September 2017.

    Comments: Accepted at the second International Workshop on Egocentric Perception, Interaction and Computing (EPIC) at International Conference on Computer Vision (ICCV-17)

  28. arXiv:1506.06882  [pdf, other]

    cs.CV

    SALSA: A Novel Dataset for Multimodal Group Behavior Analysis

    Authors: Xavier Alameda-Pineda, Jacopo Staiano, Ramanathan Subramanian, Ligia Batrinca, Elisa Ricci, Bruno Lepri, Oswald Lanz, Nicu Sebe

    Abstract: Studying free-standing conversational groups (FCGs) in unstructured social settings (e.g., cocktail party) is gratifying due to the wealth of information available at the group (mining social networks) and individual (recognizing native behavioral and personality traits) levels. However, analyzing social scenes involving FCGs is also highly challenging due to the difficulty in extracting behavior…

    Submitted 23 June, 2015; originally announced June 2015.

    Comments: 14 pages, 11 figures