BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

Authors: Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martín-Martín, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, Jiajun Wu

Abstract: The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and rendering quality, limited diversity, and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI benchmark, BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes such as "filled" and "folded"), and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. Project website: https://behavior-vision-suite.github.io/

Submitted 15 May, 2024; originally announced May 2024.

Comments: CVPR 2024 (Highlight). Project website: https://behavior-vision-suite.github.io/

arXiv:2403.15336

Dialogue Understandability: Why are we streaming movies with subtitles?

Authors: Helard Becerra Martinez, Alessandro Ragano, Diptasree Debnath, Asad Ullah, Crisron Rudolf Lucas, Martin Walsh, Andrew Hines

Abstract: Watching movies and TV shows with subtitles enabled is not simply down to audibility or speech intelligibility. A variety of evolving factors related to technological advances, cinema production and social behaviour challenge our perception and understanding. This study seeks to formalise and give context to these influential factors under a wider and novel term referred to as Dialogue Understandability. We propose a working definition for Dialogue Understandability being a listener's capacity to follow the story without undue cognitive effort or concentration being required that impacts their Quality of Experience (QoE). The paper identifies, describes and categorises the factors that influence Dialogue Understandability, mapping them over the QoE framework, a media streaming lifecycle, and the stakeholders involved. We then explore available measurement tools in the literature and link them to the factors they could potentially be used for. The maturity and suitability of these tools is evaluated over a set of pilot experiments. Finally, we reflect on the gaps that still need to be filled, what we can measure and what not, future subjective experiments, and new research trends that could help us to fully characterise Dialogue Understandability.

Submitted 22 March, 2024; originally announced March 2024.

arXiv:2403.09227

BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

Authors: Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Martinez, Hang Yin, Michael Lingelbach, Minjune Hwang, Ayano Hiranaka, Sujay Garlanka, Arman Aydin, Sharon Lee, Jiankai Sun, Mona Anvari, Manasi Sharma, Dhruva Bansal, Samuel Hunter, Kyu-Young Kim, Alan Lou, Caleb R Matthews , et al. (10 additional authors not shown)

Abstract: We present BEHAVIOR-1K, a comprehensive simulation benchmark for human-centered robotics. BEHAVIOR-1K includes two components, guided and motivated by the results of an extensive survey on "what do you want robots to do for you?". The first is the definition of 1,000 everyday activities, grounded in 50 scenes (houses, gardens, restaurants, offices, etc.) with more than 9,000 objects annotated with rich physical and semantic properties. The second is OMNIGIBSON, a novel simulation environment that supports these activities via realistic physics simulation and rendering of rigid bodies, deformable bodies, and liquids. Our experiments indicate that the activities in BEHAVIOR-1K are long-horizon and dependent on complex manipulation skills, both of which remain a challenge for even state-of-the-art robot learning solutions. To calibrate the simulation-to-reality gap of BEHAVIOR-1K, we provide an initial study on transferring solutions learned with a mobile manipulator in a simulated apartment to its real-world counterpart. We hope that BEHAVIOR-1K's human-grounded nature, diversity, and realism make it valuable for embodied AI and robot learning research. Project website: https://behavior.stanford.edu.

Submitted 14 March, 2024; originally announced March 2024.

Comments: A preliminary version was published at 6th Conference on Robot Learning (CoRL 2022)

arXiv:2401.17258

You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation

Authors: Mehdi Noroozi, Isma Hadji, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos

Abstract: In this paper, we introduce YONOS-SR, a novel stable diffusion-based approach for image super-resolution that yields state-of-the-art results using only a single DDIM step. We propose a novel scale distillation approach to train our SR model. Instead of directly training our SR model on the scale factor of interest, we start by training a teacher model on a smaller magnification scale, thereby making the SR problem simpler for the teacher. We then train a student model for a higher magnification scale, using the predictions of the teacher as a target during the training. This process is repeated iteratively until we reach the target scale factor of the final model. The rationale behind our scale distillation is that the teacher aids the student diffusion model training by i) providing a target adapted to the current noise level rather than using the same target coming from ground truth data for all noise levels and ii) providing an accurate target as the teacher has a simpler task to solve. We empirically show that the distilled model significantly outperforms the model trained for high scales directly, specifically with few steps during inference. Having a strong diffusion model that requires only one step allows us to freeze the U-Net and fine-tune the decoder on top of it. We show that the combination of spatially distilled U-Net and fine-tuned decoder outperforms state-of-the-art methods requiring 200 steps with only one single step.

Submitted 30 January, 2024; originally announced January 2024.

arXiv:2401.13594

Graph Guided Question Answer Generation for Procedural Question-Answering

Authors: Hai X. Pham, Isma Hadji, Xinnuo Xu, Ziedune Degutyte, Jay Rainey, Evangelos Kazakos, Afsaneh Fazly, Georgios Tzimiropoulos, Brais Martinez

Abstract: In this paper, we focus on task-specific question answering (QA). To this end, we introduce a method for generating exhaustive and high-quality training data, which allows us to train compact (e.g., run on a mobile device), task-specific QA models that are competitive against GPT variants. The key technological enabler is a novel mechanism for automatic question-answer generation from procedural text which can ingest large amounts of textual instructions and produce exhaustive in-domain QA training data. While current QA data generation methods can produce well-formed and varied data, their non-exhaustive nature is sub-optimal for training a QA model. In contrast, we leverage the highly structured aspect of procedural text and represent each step and the overall flow of the procedure as graphs. We then condition on graph nodes to automatically generate QA pairs in an exhaustive and controllable manner. Comprehensive evaluations of our method show that: 1) small models trained with our data achieve excellent performance on the target QA task, even exceeding that of GPT3 and ChatGPT despite being several orders of magnitude smaller. 2) semantic coverage is the key indicator for downstream QA performance. Crucially, while large language models excel at syntactic diversity, this does not necessarily result in improvements on the end QA model. In contrast, the higher semantic coverage provided by our method is critical for QA performance.

Submitted 24 January, 2024; originally announced January 2024.

Comments: Accepted to EACL 2024 as long paper. 25 pages including appendix

MSC Class: I.2.7

arXiv:2307.15697

SimDETR: Simplifying self-supervised pretraining for DETR

Authors: Ioannis Maniadis Metaxas, Adrian Bulat, Ioannis Patras, Brais Martinez, Georgios Tzimiropoulos

Abstract: DETR-based object detectors have achieved remarkable performance but are sample-inefficient and exhibit slow convergence. Unsupervised pretraining has been found to be helpful to alleviate these impediments, allowing training with large amounts of unlabeled data to improve the detector's performance. However, existing methods have their own limitations, like keeping the detector's backbone frozen in order to avoid performance degradation and utilizing pretraining objectives misaligned with the downstream task. To overcome these limitations, we propose a simple pretraining framework for DETR-based detectors that consists of three simple yet key ingredients: (i) richer, semantics-based initial proposals derived from high-level feature maps, (ii) discriminative training using object pseudo-labels produced via clustering, (iii) self-training to take advantage of the improved object proposals learned by the detector. We report two main findings: (1) Our pretraining outperforms prior DETR pretraining works on both the full and low data regimes by significant margins. (2) We show we can pretrain DETR from scratch (including the backbone) directly on complex image datasets like COCO, paving the path for unsupervised representation learning directly using DETR.

Submitted 28 July, 2023; originally announced July 2023.

arXiv:2304.01752

Black Box Few-Shot Adaptation for Vision-Language models

Authors: Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

Abstract: Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. Soft prompt learning is the method of choice for few-shot downstream adaptation aiming to bridge the modality gap caused by the distribution shift induced by the new domain. While parameter-efficient, prompt learning still requires access to the model weights and can be computationally infeasible for large models with billions of parameters. To address these shortcomings, in this work, we describe a black-box method for V-L few-shot adaptation that (a) operates on pre-computed image and text features and hence works without access to the model's weights, (b) it is orders of magnitude faster at training time, (c) it is amenable to both supervised and unsupervised training, and (d) it can be even used to align image and text features computed from uni-modal models. To achieve this, we propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain. LFA is initialized from a closed-form solution to a least-squares problem and then it is iteratively updated by minimizing a re-ranking loss. Despite its simplicity, our approach can even surpass soft-prompt learning methods as shown by extensive experiments on 11 image and 2 video datasets.
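
Note: the abstract does not state which least-squares problem LFA solves. As a generic illustration only, an unconstrained linear alignment between pre-computed features has the closed form below. The symbols X (image features, one row per sample), Y (the matching text features), and W (the linear map) are assumptions introduced here, and the paper's actual formulation may add constraints or differ in other ways.

```latex
% Generic unconstrained least-squares alignment (illustrative assumption,
% not taken from the abstract). X: n x d image features, Y: n x d text
% features, W: d x d linear map. Assumes X has full column rank.
W^{\star} = \arg\min_{W} \lVert X W - Y \rVert_F^2
          = \left( X^{\top} X \right)^{-1} X^{\top} Y
```

A solution of this kind needs no gradient steps, which is consistent with the abstract describing it as an initialization that is then refined by minimizing a re-ranking loss.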

Submitted 17 August, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

Comments: Published at ICCV 2023

arXiv:2303.06455

doi 10.1016/j.neunet.2024.106180

Graph Neural Network contextual embedding for Deep Learning on Tabular Data

Authors: Mario Villaizán-Vallelado, Matteo Salvatori, Belén Carro Martinez, Antonio Javier Sanchez Esguevillas

Abstract: All industries are trying to leverage Artificial Intelligence (AI) based on their existing big data, which is available in so-called tabular form, where each record is composed of a number of heterogeneous continuous and categorical columns, also known as features. Deep Learning (DL) has constituted a major breakthrough for AI in fields related to human skills like natural language processing, but its applicability to tabular data has been more challenging. More classical Machine Learning (ML) models like tree-based ensemble ones usually perform better. This paper presents a novel DL model using a Graph Neural Network (GNN), more specifically an Interaction Network (IN), for contextual embedding and modelling interactions among tabular features. Its results outperform those of a recently published survey with DL benchmark based on five public datasets, also achieving competitive results when compared to boosted-tree solutions.

Submitted 4 July, 2023; v1 submitted 11 March, 2023; originally announced March 2023.

arXiv:2210.04996

Graph2Vid: Flow graph to Video Grounding for Weakly-supervised Multi-Step Localization

Authors: Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez, Afsaneh Fazly, Allan D. Jepson

Abstract: In this work, we consider the problem of weakly-supervised multi-step localization in instructional videos. An established approach to this problem is to rely on a given list of steps. However, in reality, there is often more than one way to execute a procedure successfully, by following the set of steps in slightly varying orders. Thus, for successful localization in a given video, recent works require the actual order of procedure steps in the video, to be provided by human annotators at both training and test times. Instead, here, we only rely on generic procedural text that is not tied to a specific video. We represent the various ways to complete the procedure by transforming the list of instructions into a procedure flow graph which captures the partial order of steps. Using the flow graphs reduces both training and test time annotation requirements. To this end, we introduce the new problem of flow graph to video grounding. In this setup, we seek the optimal step ordering consistent with the procedure flow graph and a given video. To solve this problem, we propose a new algorithm - Graph2Vid - that infers the actual ordering of steps in the video and simultaneously localizes them. To show the advantage of our proposed formulation, we extend the CrossTask dataset with procedure flow graph information. Our experiments show that Graph2Vid is both more efficient than the baselines and yields strong step localization results, without the need for step order annotation.
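
Note: to make "a flow graph which captures the partial order of steps" concrete, the sketch below enumerates every step ordering consistent with a small dependency graph. It is not the Graph2Vid algorithm, which additionally grounds steps in a video. The function name `step_orderings`, the `prerequisites` mapping, and the recipe steps are all hypothetical and chosen only for illustration.

```python
def step_orderings(steps, prerequisites):
    """Yield every ordering of `steps` that respects the partial order.

    `prerequisites` maps a step to the steps that must come before it
    (hypothetical representation of a procedure flow graph).
    """
    def extend(prefix, remaining):
        if not remaining:
            yield list(prefix)
            return
        # Try each remaining step whose prerequisites are already done.
        for step in sorted(remaining):
            if all(p in prefix for p in prerequisites.get(step, ())):
                yield from extend(prefix + [step], remaining - {step})

    yield from extend([], frozenset(steps))


# Hypothetical procedure: the first two steps may happen in either order.
steps = ["crack eggs", "add flour", "mix", "bake"]
prerequisites = {
    "mix": ["crack eggs", "add flour"],
    "bake": ["mix"],
}

orderings = list(step_orderings(steps, prerequisites))
for ordering in orderings:
    print(" -> ".join(ordering))
```

Brute-force enumeration grows quickly with the number of unordered steps, so it only serves to show what "orderings consistent with the flow graph" means on a toy example.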

Submitted 31 October, 2022; v1 submitted 10 October, 2022; originally announced October 2022.

Comments: ECCV'22, oral

Journal ref: ECCV 2022

arXiv:2210.04845

FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training

Authors: Adrian Bulat, Ricardo Guerrero, Brais Martinez, Georgios Tzimiropoulos

Abstract: This paper is on Few-Shot Object Detection (FSOD), where given a few templates (examples) depicting a novel class (not seen during training), the goal is to detect all of its occurrences within a set of images. From a practical perspective, an FSOD system must fulfil the following desiderata: (a) it must be used as is, without requiring any fine-tuning at test time, (b) it must be able to process an arbitrary number of novel objects concurrently while supporting an arbitrary number of examples from each class and (c) it must achieve accuracy comparable to a closed system. Towards satisfying (a)-(c), in this work, we make the following contributions: We introduce, for the first time, a simple, yet powerful, few-shot detection transformer (FS-DETR) based on visual prompting that can address both desiderata (a) and (b). Our system builds upon the DETR framework, extending it based on two key ideas: (1) feed the provided visual templates of the novel classes as visual prompts during test time, and (2) "stamp" these prompts with pseudo-class embeddings (akin to soft prompting), which are then predicted at the output of the decoder. Importantly, we show that our system is not only more flexible than existing methods, but also, it makes a step towards satisfying desideratum (c). Specifically, it is significantly more accurate than all methods that do not require fine-tuning and even matches and outperforms the current state-of-the-art fine-tuning based methods on the most well-established benchmarks (PASCAL VOC & MSCOCO).

Submitted 20 August, 2023; v1 submitted 10 October, 2022; originally announced October 2022.

Comments: Accepted at ICCV 2023

arXiv:2210.02808

Effective Self-supervised Pre-training on Low-compute Networks without Distillation

Authors: Fuwen Tan, Fatemeh Saleh, Brais Martinez

Abstract: Despite the impressive progress of self-supervised learning (SSL), its applicability to low-compute networks has received limited attention. Reported performance has trailed behind standard supervised pre-training by a large margin, barring self-supervised learning from making an impact on models that are deployed on device. Most prior works attribute this poor performance to the capacity bottleneck of the low-compute networks and opt to bypass the problem through the use of knowledge distillation (KD). In this work, we revisit SSL for efficient neural networks, taking a closer look at what are the detrimental factors causing the practical limitations, and whether they are intrinsic to the self-supervised low-compute setting. We find that, contrary to accepted knowledge, there is no intrinsic architectural bottleneck; we diagnose that the performance bottleneck is related to the model complexity vs regularization strength trade-off. In particular, we start by empirically observing that the use of local views can have a dramatic impact on the effectiveness of the SSL methods. This hints at view sampling being one of the performance bottlenecks for SSL on low-capacity networks. We hypothesize that the view sampling strategy for large neural networks, which requires matching views in very diverse spatial scales and contexts, is too demanding for low-capacity architectures. We systematize the design of the view sampling mechanism, leading to a new training methodology that consistently improves the performance across different SSL methods (e.g. MoCo-v2, SwAV, DINO), different low-size networks (e.g. MobileNetV2, ResNet18, ResNet34, ViT-Ti), and different tasks (linear probe, object detection, instance segmentation and semi-supervised learning). Our best models establish a new state-of-the-art for SSL methods on low-compute networks despite not using a KD loss term.

Submitted 2 October, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

Comments: ICLR 2023 Camera Ready. Code is publicly available at https://github.com/saic-fi/SSLight

arXiv:2210.02390

Bayesian Prompt Learning for Image-Language Model Generalization

Authors: Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor Guilherme Turrisi da Costa, Cees G. M. Snoek, Georgios Tzimiropoulos, Brais Martinez

Abstract: Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains. Code available at: https://github.com/saic-fi/Bayesian-Prompt-Learning

Submitted 20 August, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

Comments: Accepted at ICCV 2023

arXiv:2209.15000

REST: REtrieve & Self-Train for generative action recognition

Authors: Adrian Bulat, Enrique Sanchez, Brais Martinez, Georgios Tzimiropoulos

Abstract: This work is on training a generative action/video recognition model whose output is a free-form action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages like producing more fine-grained and human-readable output, and being naturally open-world. To this end, we propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition. While recently there have been a few attempts to adapt V&L models trained with contrastive learning (e.g. CLIP) for video/action, to the best of our knowledge, we propose the very first method that sets out to accomplish this goal for a generative model. We firstly show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting. To alleviate this, we introduce REST, a training framework consisting of two key components: (a) an unsupervised method for adapting the generative model to action/video by means of pseudo-caption generation and Self-training, i.e. without using any action-specific labels; (b) a Retrieval approach based on CLIP for discovering a diverse set of pseudo-captions for each video to train the model. Importantly, we show that both components are necessary to obtain high accuracy. We evaluate REST on the problem of zero-shot action recognition where we show that our approach is very competitive when compared to contrastive learning-based methods. Code will be made available.

Submitted 29 September, 2022; originally announced September 2022.

arXiv:2208.11108

Efficient Attention-free Video Shift Transformers

Authors: Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

Abstract: This paper tackles the problem of efficient video recognition. In this area, video transformers have recently dominated the efficiency (top-1 accuracy vs FLOPs) spectrum. At the same time, there have been some attempts in the image domain which challenge the necessity of the self-attention operation within the transformer architecture, advocating the use of simpler approaches for token mixing. However, there are no results yet for the case of video recognition, where the self-attention operator has a significantly higher impact (compared to the case of images) on efficiency. To address this gap, in this paper, we make the following contributions: (a) we construct a highly efficient & accurate attention-free block based on the shift operator, coined Affine-Shift block, specifically designed to approximate as closely as possible the operations in the MHSA block of a Transformer layer. Based on our Affine-Shift block, we construct our Affine-Shift Transformer and show that it already outperforms all existing shift/MLP-based architectures for ImageNet classification. (b) We extend our formulation in the video domain to construct Video Affine-Shift Transformer (VAST), the very first purely attention-free shift-based video transformer. (c) We show that VAST significantly outperforms recent state-of-the-art transformers on the most popular action recognition benchmarks for the case of models with low computational and memory footprint. Code will be made available.

Submitted 23 August, 2022; originally announced August 2022.

arXiv:2208.04247

Challenges and Opportunities for Simultaneous Multi-functional Networks in the UHF Bands

Authors: Xavier Vilajosana, Guillem Boquet, Joan Melià, Pere Tuset-Peiró, Borja Martinez, Ferran Adelantado

Abstract: Multi-functional wireless networks are rapidly evolving and aspire to become a promising attribute of the upcoming 6G networks. Enabling multiple simultaneous networking functions with a single radio fosters the development of more integrated and simpler equipment, overcoming design and technology barriers inherited from radio systems of the past. We are seeing numerous trends exploiting these features in newly designed radios, such as those operating on the mmWave band. In this article, however, we carefully analyze the challenges and opportunities for multi-functional wireless networks in UHF bands, advocating the reuse of existing infrastructures and technologies, and exploring the possibilities of expanding their functionality without requiring architectural changes. We believe that both modern and legacy technologies can be turned into multi-functional systems if the right scientific and technological challenges are properly addressed. This transformation can foster the development of new applications and extend the useful life of these systems, contributing to a more sustainable digitization by delaying equipment obsolescence.

Submitted 8 August, 2022; originally announced August 2022.

arXiv:2206.08339 [pdf, other]

iBoot: Image-bootstrapped Self-Supervised Video Representation Learning

Authors: Fatemeh Saleh, Fuwen Tan, Adrian Bulat, Georgios Tzimiropoulos, Brais Martinez

Abstract: Learning visual representations through self-supervision is an extremely challenging task as the network needs to sieve relevant patterns from spurious distractors without the active guidance provided by supervision. This is achieved through heavy data augmentation, large-scale datasets and prohibitive amounts of compute. Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets, compute is an order of magnitude larger, and the amount of spurious patterns the optimizer has to sieve through is multiplied several fold. Thus, directly learning self-supervised representations from video data might result in sub-optimal performance. To address this, we propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework, enabling the model to learn strong spatial and temporal information without relying on the video labeled data. To this end, we modify the typical video-based SSL design and objective to encourage the video encoder to subsume the semantic content of an image-based model trained on a general domain. The proposed algorithm is shown to learn much more efficiently (i.e. in less epochs and with a smaller batch) and results in a new state-of-the-art performance on standard downstream tasks among single-modality SSL methods.

Submitted 16 June, 2022; originally announced June 2022.

arXiv:2205.06701 [pdf, other]

Knowledge Distillation Meets Open-Set Semi-Supervised Learning

Authors: Jing Yang, Xiatian Zhu, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

Abstract: Existing knowledge distillation methods mostly focus on distillation of teacher's prediction and intermediate activation. However, the structured representation, which arguably is one of the most critical ingredients of deep models, is largely overlooked. In this work, we propose a novel {\em \modelname{}} ({\bf\em \shortname{})} method dedicated for distilling representational knowledge semantically from a pretrained teacher to a target student. The key idea is that we leverage the teacher's classifier as a semantic critic for evaluating the representations of both teacher and student and distilling the semantic knowledge with high-order structured information over all feature dimensions. This is accomplished by introducing a notion of cross-network logit computed through passing student's representation into teacher's classifier. Further, considering the set of seen classes as a basis for the semantic space in a combinatorial perspective, we scale \shortname{} to unseen classes for enabling effective exploitation of largely available, arbitrary unlabeled training data. At the problem level, this establishes an interesting connection between knowledge distillation with open-set semi-supervised learning (SSL). Extensive experiments show that our \shortname{} outperforms significantly previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks, as well as less studied yet practically crucial binary network distillation.
Under more realistic open-set SSL settings we introduce, we reveal that knowledge distillation is generally more effective than existing Out-Of-Distribution (OOD) sample detection, and our proposed \shortname{} is superior over both previous distillation and SSL competitors. The source code is available at \url{https://github.com/**gyang2017/SRD\_ossl}.
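The cross-network logit mentioned in this abstract can be illustrated with a minimal sketch. It assumes a linear teacher classifier and a student feature that already has the same dimension as the teacher's; the function names and the toy numbers are hypothetical and are not taken from the paper or its code, and the loss that compares these logits is not shown because the abstract does not specify it.

```python
def linear_classifier(weights, bias, feature):
    """Apply a linear classifier: one logit per class (row of `weights`)."""
    return [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weights, bias)
    ]


def cross_network_logit(teacher_weights, teacher_bias, student_feature):
    """Feed the student's representation into the teacher's classifier.

    Assumption: student and teacher features share one dimension; a real
    system may need a projection layer, which is omitted here.
    """
    return linear_classifier(teacher_weights, teacher_bias, student_feature)


# Toy teacher classifier with 3 classes over 2-dimensional features.
teacher_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_b = [0.0, 0.0, -1.0]

teacher_logits = linear_classifier(teacher_w, teacher_b, [2.0, 2.0])
student_cross_logits = cross_network_logit(teacher_w, teacher_b, [2.0, 3.0])
```

Both sets of logits live in the teacher's class space, which is what lets the teacher's classifier act as a common "critic" for the two representations.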

Submitted 13 May, 2022; originally announced May 2022.

Comments: 13 pages

arXiv:2205.03436 [pdf, other]

EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers

Authors: Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Lukasz Dudziak, Hongsheng Li, Georgios Tzimiropoulos, Brais Martinez

Abstract: Self-attention based models such as vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision. Despite increasingly stronger variants with ever-higher recognition accuracies, due to the quadratic complexity of self-attention, existing ViTs are typically demanding in computation and model size. Although several successful design choices (e.g., the convolutions and hierarchical multi-stage structure) of prior CNNs have been reintroduced into recent ViTs, they are still not sufficient to meet the limited resource requirements of mobile devices. This motivates a very recent attempt to develop light ViTs based on the state-of-the-art MobileNet-v2, but still leaves a performance gap behind. In this work, pushing further along this under-studied direction we introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs in the tradeoff between accuracy and on-device efficiency. This is realized by introducing a highly cost-effective local-global-local (LGL) information exchange bottleneck based on optimal integration of self-attention and convolutions. For device-dedicated evaluation, rather than relying on inaccurate proxies like the number of FLOPs or parameters, we adopt a practical approach of focusing directly on on-device latency and, for the first time, energy efficiency.
Specifically, we show that our models are Pareto-optimal when both accuracy-latency and accuracy-energy trade-offs are considered, achieving strict dominance over other ViTs in almost all cases and competing with the most efficient CNNs. Code is available at https://github.com/saic-fi/edgevit.

Submitted 21 July, 2022; v1 submitted 6 May, 2022; originally announced May 2022.

Comments: Accepted in ECCV 2022

arXiv:2204.04796 [pdf, other]

SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition

Authors: Victor Escorcia, Ricardo Guerrero, Xiatian Zhu, Brais Martinez

Abstract: Learning an egocentric action recognition model from video data is challenging due to distractors (e.g., irrelevant objects) in the background. Further integrating object information into an action model is hence beneficial. Existing methods often leverage a generic object detector to identify and represent the objects in the scene. However, several important issues remain. Object class annotations of good quality for the target domain (dataset) are still required for learning good object representation. Besides, previous methods deeply couple the existing action models and need to retrain them jointly with object representation, leading to costly and inflexible integration. To overcome both limitations, we introduce Self-Supervised Learning Over Sets (SOS), an approach to pre-train a generic Objects In Contact (OIC) representation model from video object regions detected by an off-the-shelf hand-object contact detector. Instead of augmenting object regions individually as in conventional self-supervised learning, we view the action process as a means of natural data transformations with unique spatio-temporal continuity and exploit the inherent relationships among per-video object sets. Extensive experiments on two datasets, EPIC-KITCHENS-100 and EGTEA, show that our OIC significantly boosts the performance of multiple state-of-the-art video classification models.

Submitted 2 May, 2022; v1 submitted 10 April, 2022; originally announced April 2022.

arXiv:2201.00954 [pdf, other]

Dynamics of polynomial maps over finite fields

Authors: José Alves Oliveira, Fabio Enrique Brochero Martínez

Abstract: Let $\mathbb{F}_q$ be a finite field with $q$ elements and let $n$ be a positive integer. In this paper, we study the digraph associated to the map $x\mapsto x^n h(x^{\frac{q-1}{m}})$, where $h(x)\in\mathbb{F}_q[x].$ We completely determine the associated functional graph of maps that satisfy a certain condition of regularity. In particular, we provide the functional graphs associated to monomial maps. As a consequence of our results, the number of connected components, length of the cycles and number of fixed points of this class of maps are provided.
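For readers unfamiliar with functional graphs, the quantities named in this abstract can be computed by brute force in the simplest case. The sketch below is an illustration only: it assumes a prime field Z/pZ and a plain monomial map x -> x^n (that is, h = 1), and the function name is hypothetical. It does not reproduce the paper's structural results, which cover general F_q and a wider class of maps.

```python
def functional_graph_stats(p, n):
    """Summarize the functional graph of x -> x**n on the prime field Z/pZ."""
    f = [pow(x, n, p) for x in range(p)]

    # After p iterations every starting point has reached its cycle, so the
    # set of end points is exactly the set of periodic (cyclic) nodes.
    periodic = set()
    for x in range(p):
        y = x
        for _ in range(p):
            y = f[y]
        periodic.add(y)

    # Walk each cycle once to record its length.
    cycle_lengths = []
    seen = set()
    for x in sorted(periodic):
        if x in seen:
            continue
        length, y = 0, x
        while y not in seen:
            seen.add(y)
            y = f[y]
            length += 1
        cycle_lengths.append(length)

    return {
        # Each weakly connected component contains exactly one cycle.
        "components": len(cycle_lengths),
        "cycle_lengths": sorted(cycle_lengths),
        "fixed_points": sum(1 for x in range(p) if f[x] == x),
    }


stats_7_2 = functional_graph_stats(7, 2)  # squaring map on Z/7Z
stats_5_3 = functional_graph_stats(5, 3)  # cubing map on Z/5Z (a permutation)
```

For the squaring map on Z/7Z the fixed points are 0 and 1, and 2 and 4 form a cycle of length 2, giving three components in total.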

Submitted 3 January, 2022; originally announced January 2022.

Comments: Comments are welcome!

arXiv:2111.11132 [pdf, other]

On the functional graph of $f(X)=c(X^{q+1}+aX^2)$ over quadratic extensions of finite fields

Authors: F. E. Brochero Martínez, H. R. Teixeira

Abstract: Let $\mathbb{F}_q$ be the finite field with $q$ elements and $char(\mathbb{F}_q)$ odd. In this article we will describe completely the dynamics of the map $f(X)=c(X^{q+1}+aX^2)$, for $a=\{\pm1\}$ and $c\in\mathbb{F}_q^*$, over the finite field $\mathbb{F}_{q^2}$, and give some partial results for $a\in\mathbb{F}_q^*\setminus\{\pm1\}$.

Submitted 22 November, 2021; originally announced November 2021.

MSC Class: 12E20; 05C20; 37P25

arXiv:2110.05812 [pdf, other]

Satellite Image Semantic Segmentation

Authors: Eric Guérin, Killian Oechslin, Christian Wolf, Benoît Martinez

Abstract: In this paper, we propose a method for the automatic semantic segmentation of satellite images into six classes (sparse forest, dense forest, moor, herbaceous formation, building, and road). We rely on Swin Transformer architecture and build the dataset from IGN open data. We report quantitative and qualitative segmentation results on this dataset and discuss strengths and limitations. The dataset and the trained model are made publicly available.

Submitted 12 October, 2021; originally announced October 2021.

Comments: 8 pages, 3 figures

ACM Class: I.4.6

arXiv:2110.02902 [pdf, ps, other]

SAIC_Cambridge-HuPBA-FBK Submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021

Authors: Swathikiran Sudhakaran, Adrian Bulat, Juan-Manuel Perez-Rua, Alex Falcon, Sergio Escalera, Oswald Lanz, Brais Martinez, Georgios Tzimiropoulos

Abstract: This report presents the technical details of our submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021. To participate in the challenge we deployed spatio-temporal feature extraction and aggregation models we have developed recently: GSF and XViT. GSF is an efficient spatio-temporal feature extracting module that can be plugged into 2D CNNs for video action recognition. XViT is a convolution free video feature extractor based on transformer architecture. We design an ensemble of GSF and XViT model families with different backbones and pretraining to generate the prediction scores. Our submission, visible on the public leaderboard, achieved a top-1 action recognition accuracy of 44.82%, using only RGB.

Submitted 6 October, 2021; originally announced October 2021.

Comments: Ranked third in the EPIC-Kitchens-100 Action Recognition Challenge @ CVPR 2021

arXiv:2106.06505 [pdf, other]

Efficient Deep Learning Architectures for Fast Identification of Bacterial Strains in Resource-Constrained Devices

Authors: R. Gallardo García, S. Jarquín Rodríguez, B. Beltrán Martínez, C. Hernández Gracidas, R. Martínez Torres

Abstract: This work presents twelve fine-tuned deep learning architectures to solve the bacterial classification problem over the Digital Image of Bacterial Species Dataset. The base architectures were mainly published as mobile or efficient solutions to the ImageNet challenge, and all experiments presented in this work consisted of making several modifications to the original designs, in order to make them able to solve the bacterial classification problem by using fine-tuning and transfer learning techniques. This work also proposes a novel data augmentation technique for this dataset, which is based on the idea of artificial zooming, strongly increasing the performance of every tested architecture, even doubling it in some cases. In order to get robust and complete evaluations, all experiments were performed with 10-fold cross-validation and evaluated with five different metrics: top-1 and top-5 accuracy, precision, recall, and F1 score. This paper presents a complete comparison of the twelve different architectures, cross-validated with the original and the augmented version of the dataset, the results are also compared with several literature methods. Overall, eight of the eleven architectures surpassed the 0.95 scores in top-1 accuracy with our data augmentation method, being 0.9738 the highest top-1 accuracy. The impact of the data augmentation technique is reported with relative improvement scores.

Submitted 11 June, 2021; originally announced June 2021.

Comments: 22 pages, 2 figures, 5 tables. Submitted to Multimedia Tools and Applications, issue 1218 - Engineering Tools and Applications in Medical Imaging (currently in reviewing process)

MSC Class: 68T07 (Primary); 68U10 (Secondary) ACM Class: I.4; J.3

arXiv:2106.05968 [pdf, other]

Space-time Mixing Attention for Video Transformer

Authors: Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, Georgios Tzimiropoulos

Abstract: This paper is on video recognition using Transformers. Very recent attempts in this area have demonstrated promising results in terms of recognition accuracy, yet they have been also shown to induce, in many cases, significant computational overheads due to the additional modelling of the temporal information. In this work, we propose a Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. To achieve this, our model makes two approximations to the full space-time attention used in Video Transformers: (a) It restricts time attention to a local temporal window and capitalizes on the Transformer's depth to obtain full temporal coverage of the video sequence. (b) It uses efficient space-time mixing to attend jointly spatial and temporal locations without inducing any additional cost on top of a spatial-only attention model. We also show how to integrate 2 very lightweight mechanisms for global temporal-only attention which provide additional accuracy improvements at minimal computational cost. We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets while at the same time being significantly more efficient than other Video Transformer models. Code will be made available.
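The scaling claim in this abstract can be made concrete by counting query-key pairs. This is a back-of-the-envelope sketch, not the paper's cost model: the function names are hypothetical, the frame and token counts are arbitrary example values, and only the dominant attention term is counted.

```python
def joint_space_time_pairs(frames, tokens_per_frame):
    """Full space-time attention: every token attends to every token."""
    total_tokens = frames * tokens_per_frame
    return total_tokens * total_tokens


def spatial_only_pairs(frames, tokens_per_frame):
    """Per-frame spatial attention: tokens attend only within their frame."""
    return frames * tokens_per_frame ** 2


frames, tokens = 8, 196  # example values, e.g. a 14x14 patch grid per frame
joint = joint_space_time_pairs(frames, tokens)
spatial = spatial_only_pairs(frames, tokens)
ratio = joint // spatial
```

Under this count, joint attention grows quadratically with the number of frames while per-frame attention grows linearly, so the gap between the two equals the number of frames.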

Submitted 11 June, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: Updated results on SSv2

arXiv:2103.15233 [pdf, other]

Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization

Authors: Mengmeng Xu, Juan-Manuel Perez-Rua, Xiatian Zhu, Bernard Ghanem, Brais Martinez

Abstract: Temporal action localization (TAL) is a fundamental yet challenging task in video understanding. Existing TAL methods rely on pre-training a video encoder through action classification supervision. This results in a task discrepancy problem for the video encoder -- trained for action classification, but used for TAL. Intuitively, end-to-end model optimization is a good solution. However, this is not operable for TAL subject to the GPU memory constraints, due to the prohibitive computational cost in processing long untrimmed videos. In this paper, we resolve this challenge by introducing a novel low-fidelity end-to-end (LoFi) video encoder pre-training method. Instead of always using the full training configurations for TAL learning, we propose to reduce the mini-batch composition in terms of temporal, spatial or spatio-temporal resolution so that end-to-end optimization for the video encoder becomes operable under the memory conditions of a mid-range hardware budget. Crucially, this enables the gradient to flow backward through the video encoder from a TAL loss supervision, favourably solving the task discrepancy problem and providing more effective feature representations. Extensive experiments show that the proposed LoFi pre-training approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18 based video encoder in a single RGB stream, our method surpasses two-stream ResNet50 based alternatives with expensive optical flow, often by a good margin.

Submitted 29 October, 2021; v1 submitted 28 March, 2021; originally announced March 2021.

Comments: To appear at NeurIPS 2021. 15 pages, 1 figure

arXiv:2101.08085 [pdf, other]

Few-shot Action Recognition with Prototype-centered Attentive Learning

Authors: Xiatian Zhu, Antoine Toisoul, Juan-Manuel Perez-Rua, Li Zhang, Brais Martinez, Tao Xiang

Abstract: Few-shot action recognition aims to recognize action classes with few training samples. Most existing methods adopt a meta-learning approach with episodic training. In each episode, the few samples in a meta-training task are split into support and query sets. The former is used to build a classifier, which is then evaluated on the latter using a query-centered loss for model updating. There are however two major limitations: lack of data efficiency due to the query-centered only loss design and inability to deal with the support set outlying samples and inter-class distribution overlapping problems. In this paper, we overcome both limitations by proposing a new Prototype-centered Attentive Learning (PAL) model composed of two novel components. First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective, in order to make full use of the limited training samples in each episode. Second, PAL further integrates a hybrid attentive learning mechanism that can minimize the negative impacts of outliers and promote class separation. Extensive experiments on four standard few-shot action benchmarks show that our method clearly outperforms previous state-of-the-art methods, with the improvement particularly significant (10+%) on the most challenging fine-grained action recognition benchmark.

Submitted 28 March, 2021; v1 submitted 20 January, 2021; originally announced January 2021.

Comments: 10 pages, 4 figures

Journal ref: BMVC 2021

arXiv:2012.03854 [pdf, other]

doi 10.1016/j.ijforecast.2021.11.001

Forecasting: theory and practice

Authors: Fotios Petropoulos, Daniele Apiletti, Vassilios Assimakopoulos, Mohamed Zied Babai, Devon K. Barrow, Souhaib Ben Taieb, Christoph Bergmeir, Ricardo J. Bessa, Jakub Bijak, John E. Boylan, Jethro Browell, Claudio Carnevale, Jennifer L. Castle, Pasquale Cirillo, Michael P. Clements, Clara Cordeiro, Fernando Luiz Cyrino Oliveira, Shari De Baets, Alexander Dokumentov, Joanne Ellison, Piotr Fiszeder, Philip Hans Franses, David T. Frazier, Michael Gilliland, M. Sinan Gönül , et al. (55 additional authors not shown)

Abstract: Forecasting has always been at the forefront of decision making and planning. The uncertainty that surrounds the future is both exciting and challenging, with individuals and organisations seeking to minimise risks and maximise utilities. The large number of forecasting applications calls for a diverse set of forecasting methods to tackle real-life challenges. This article provides a non-systematic review of the theory and the practice of forecasting. We provide an overview of a wide range of theoretical, state-of-the-art models, methods, principles, and approaches to prepare, produce, organise, and evaluate forecasts. We then demonstrate how such theoretical concepts are applied in a variety of real-life contexts. We do not claim that this review is an exhaustive list of methods and applications. However, we wish that our encyclopedic presentation will offer a point of reference for the rich work that has been undertaken over the last decades, with some key insights for the future of forecasting theory and practice. Given its encyclopedic nature, the intended mode of reading is non-linear. We offer cross-references to allow the readers to navigate through the various topics. We complement the theoretical concepts and applications covered by large lists of free or open-source software implementations and publicly-available databases.

Submitted 5 January, 2022; v1 submitted 4 December, 2020; originally announced December 2020.

arXiv:2012.01534 [pdf, ps, other]

Artin-Schreier curves given by $\mathbb F_q$-linearized polynomials

Authors: Daniela Oliveira, F. E. Brochero Martínez

Abstract: Let $\mathbb F_q$ be a finite field with $q$ elements, where $q$ is a power of an odd prime $p$. In this paper we associate circulant matrices and quadratic forms with the Artin-Schreier curve $y^q - y= x \cdot F(x) - λ,$ where $F(x)$ is a $\mathbb F_q$-linearized polynomial and $λ\in \mathbb F_q$. Our results provide a characterization of the number of affine rational points of this curve in the extension $\mathbb F_{q^r}$ of $\mathbb F_q$, for $\gcd(q,r)=1$. In the case $F(x) = x^{q^i}-x$ we give a complete description of the number of affine rational points in terms of Legendre symbols and quadratic characters.

Submitted 8 September, 2022; v1 submitted 2 December, 2020; originally announced December 2020.

MSC Class: 12E20; 11T06

arXiv:2011.10830 [pdf, other]

Boundary-sensitive Pre-training for Temporal Localization in Videos

Authors: Mengmeng Xu, Juan-Manuel Perez-Rua, Victor Escorcia, Brais Martinez, Xiatian Zhu, Li Zhang, Bernard Ghanem, Tao Xiang

Abstract: Many video analysis tasks require temporal localization thus detection of content changes. However, most existing models developed for these tasks are pre-trained on general video action classification tasks. This is because large scale annotation of temporal boundaries in untrimmed videos is expensive. Therefore no suitable datasets exist for temporal boundary-sensitive pre-training. In this paper for the first time, we investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task. Instead of relying on costly manual annotations of temporal boundaries, we propose to synthesize temporal boundaries in existing video action classification datasets. With the synthesized boundaries, BSP can be simply conducted via classifying the boundary types. This enables the learning of video representations that are much more transferable to downstream temporal localization tasks. Extensive experiments show that the proposed BSP is superior and complementary to the existing action classification based pre-training counterpart, and achieves new state-of-the-art performance on several temporal localization tasks.

Submitted 26 March, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

Comments: 11 pages, 4 figures

arXiv:2010.03558 [pdf, other]

High-Capacity Expert Binary Networks

Authors: Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

Abstract: Network binarization is a promising hardware-aware direction for creating efficient deep models. Despite its memory and computational advantages, reducing the accuracy gap between binary models and their real-valued counterparts remains an unsolved challenging research problem. To this end, we make the following 3 contributions: (a) To increase model capacity, we propose Expert Binary Convolution, which, for the first time, tailors conditional computing to binary networks by learning to select one data-specific expert binary filter at a time conditioned on input features. (b) To increase representation capacity, we propose to address the inherent information bottleneck in binary networks by introducing an efficient width expansion mechanism which keeps the binary operations within the same budget. (c) To improve network design, we propose a principled binary network growth mechanism that unveils a set of network topologies of favorable properties. Overall, our method improves upon prior work, with no increase in computational cost, by $\sim6 \%$, reaching a groundbreaking $\sim 71\%$ on ImageNet classification. Code will be made available at https://www.adrianbulat.com/binary-networks.

Submitted 30 March, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: Accepted at ICLR 2021

arXiv:2008.01427 [pdf, other]

Debunking Wireless Sensor Networks Myths

Authors: Borja Martinez, Cristina Cano, Xavier Vilajosana

Abstract: In this article we revisit Wireless Sensor Networks from a contemporary perspective, after the surge of the Internet of Things. First, we analyze the evolution of distributed monitoring applications, which we consider inherited from the early idea of collaborative sensor networks. Second, we evaluate, within the current context of networked objects, the level of adoption of low-power multi-hop wireless, a technology pivotal to the Wireless Sensor Network paradigm. This article assesses the transformation of this technology in its integration into the Internet of Things, identifying outdated requirements and providing a critical view on future research directions.

Submitted 4 August, 2020; originally announced August 2020.

arXiv:2007.06504 [pdf, other]

Towards Practical Lipreading with Distilled and Efficient Models

Authors: Pingchuan Ma, Brais Martinez, Stavros Petridis, Maja Pantic

Abstract: Lipreading has witnessed a lot of progress due to the resurgence of neural networks. Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization. However, there is still a significant gap between the current methodologies and the requirements for an effective deployment of lipreading in practical scenarios. In this work, we propose a series of innovations that significantly bridge that gap: first, we raise the state-of-the-art performance by a wide margin on LRW and LRW-1000 to 88.5% and 46.6%, respectively using self-distillation. Secondly, we propose a series of architectural changes, including a novel Depthwise Separable Temporal Convolutional Network (DS-TCN) head, that slashes the computational cost to a fraction of the (already quite efficient) original model. Thirdly, we show that knowledge distillation is a very effective tool for recovering performance of the lightweight models. This results in a range of models with different accuracy-efficiency trade-offs. However, our most promising lightweight models are on par with the current state-of-the-art while showing a reduction of 8.2x and 3.9x in terms of computational cost and number of parameters, respectively, which we hope will enable the deployment of lipreading models in practical applications.

Submitted 2 June, 2021; v1 submitted 13 July, 2020; originally announced July 2020.

Comments: Accepted to ICASSP 2021

arXiv:2007.01883 [pdf, other]

Egocentric Action Recognition by Video Attention and Temporal Context

Authors: Juan-Manuel Perez-Rua, Antoine Toisoul, Brais Martinez, Victor Escorcia, Li Zhang, Xiatian Zhu, Tao Xiang

Abstract: We present the submission of Samsung AI Centre Cambridge to the CVPR2020 EPIC-Kitchens Action Recognition Challenge. In this challenge, action recognition is posed as the problem of simultaneously predicting a single `verb' and `noun' class label given an input trimmed video clip. That is, a `verb' and a `noun' together define a compositional `action' class. The challenging aspects of this real-life action recognition task include small fast moving objects, complex hand-object interactions, and occlusions. At the core of our submission is a recently-proposed spatial-temporal video attention model, called `W3' (`What-Where-When') attention~\cite{perez2020knowing}. We further introduce a simple yet effective contextual learning mechanism to model `action' class scores directly from long-term temporal behaviour based on the `verb' and `noun' prediction scores. Our solution achieves strong performance on the challenge metrics without using object-specific reasoning nor extra training data. In particular, our best solution with multimodal ensemble achieves the 2$^{nd}$ best position for `verb', and 3$^{rd}$ best for `noun' and `action' on the Seen Kitchens test set.

Submitted 3 July, 2020; originally announced July 2020.

Comments: EPIC-Kitchens challenges@CVPR 2020

arXiv:2006.05703 [pdf, other]

doi 10.1109/TSUSC.2021.3058588

Exploiting the Solar Energy Surplus for Edge Computing

Authors: Borja Martinez, Xavier Vilajosana

Abstract: In the context of the global energy ecosystem transformation, we introduce a new approach to reduce the carbon emissions of the cloud-computing sector and, at the same time, foster the deployment of small-scale private photovoltaic plants. We consider the opportunity cost of moving some cloud services to private, distributed, solar-powered computing facilities. To this end, we compare the potential revenue of leasing computing resources to a cloud pool with the revenue obtained by selling the surplus energy to the grid. We first estimate the consumption of virtualized cloud computing instances, establishing a metric of computational efficiency per nominal photovoltaic power installed. Based on this metric and characterizing the site's annual solar production, we estimate the total return and payback. The results show that the model is economically viable and technically feasible. Finally, we outline the many questions that remain open, such as security, and the fundamental barriers to address, mainly related to a cloud model ruled by a few big players.

Submitted 10 June, 2020; originally announced June 2020.

arXiv:2004.01278 [pdf, other]

Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention

Authors: Juan-Manuel Perez-Rua, Brais Martinez, Xiatian Zhu, Antoine Toisoul, Victor Escorcia, Tao Xiang

Abstract: Attentive video modeling is essential for action recognition in unconstrained videos due to their rich yet redundant information over space and time. However, introducing attention in a deep neural network for action recognition is challenging for two reasons. First, an effective attention module needs to learn what (objects and their local motion patterns), where (spatially), and when (temporally) to focus on. Second, a video attention module must be efficient because existing action recognition models already suffer from high computational cost. To address both challenges, a novel What-Where-When (W3) video attention module is proposed. Departing from existing alternatives, our W3 module models all three facets of video attention jointly. Crucially, it is extremely efficient by factorizing the high-dimensional video feature data into low-dimensional meaningful spaces (1D channel vector for `what' and 2D spatial tensors for `where'), followed by lightweight temporal attention reasoning. Extensive experiments show that our attention model brings significant improvements to existing action recognition models, achieving new state-of-the-art performance on a number of benchmarks.

Submitted 2 April, 2020; originally announced April 2020.

arXiv:2003.11535 [pdf, other]

Training Binary Neural Networks with Real-to-Binary Convolutions

Authors: Brais Martinez, Jing Yang, Adrian Bulat, Georgios Tzimiropoulos

Abstract: This paper shows how to train binary networks to within a few percentage points ($\sim 3-5 \%$) of the full precision counterpart. We first show how to build a strong baseline, which already achieves state-of-the-art accuracy, by combining recently proposed advances and carefully adjusting the optimization procedure. Secondly, we show that by attempting to minimize the discrepancy between the output of the binary and the corresponding real-valued convolution, additional significant accuracy gains can be obtained. We materialize this idea in two complementary ways: (1) with a loss function, during training, by matching the spatial attention maps computed at the output of the binary and real-valued convolutions, and (2) in a data-driven manner, by using the real-valued activations, available during inference prior to the binarization process, for re-scaling the activations right after the binary convolution. Finally, we show that, when putting all of our improvements together, the proposed model beats the current state of the art by more than 5% top-1 accuracy on ImageNet and reduces the gap to its real-valued counterpart to less than 3% and 5% top-1 accuracy on CIFAR-100 and ImageNet respectively when using a ResNet-18 architecture. Code available at https://github.com/brais-martinez/real2binary.

Submitted 25 March, 2020; originally announced March 2020.

Comments: ICLR 2020

arXiv:2003.04289 [pdf, other]

Knowledge distillation via adaptive instance normalization

Authors: Jing Yang, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos

Abstract: This paper addresses the problem of model compression via knowledge distillation. To this end, we propose a new knowledge distillation method based on transferring feature statistics, specifically the channel-wise mean and variance, from the teacher to the student. Our method goes beyond the standard way of enforcing the mean and variance of the student to be similar to those of the teacher through an $L_2$ loss, which we found to be of limited effectiveness. Specifically, we propose a new loss based on adaptive instance normalization to effectively transfer the feature statistics. The main idea is to transfer the learned statistics back to the teacher via adaptive instance normalization (conditioned on the student) and let the teacher network "evaluate" via a loss whether the statistics learned by the student are reliably transferred. We show that our distillation method outperforms other state-of-the-art distillation methods over a large set of experimental settings including different (a) network architectures, (b) teacher-student capacities, (c) datasets, and (d) domains.

Submitted 9 March, 2020; originally announced March 2020.

arXiv:2003.01711 [pdf, other]

BATS: Binary ArchitecTure Search

Authors: Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

Abstract: This paper proposes Binary ArchitecTure Search (BATS), a framework that drastically reduces the accuracy gap between binary neural networks and their real-valued counterparts by means of Neural Architecture Search (NAS). We show that directly applying NAS to the binary domain provides very poor results. To alleviate this, we describe, to our knowledge, for the first time, the 3 key ingredients for successfully applying NAS to the binary domain. Specifically, we (1) introduce and design a novel binary-oriented search space, (2) propose a new mechanism for controlling and stabilising the resulting searched topologies, (3) propose and validate a series of new search strategies for binary networks that lead to faster convergence and lower search times. Experimental results demonstrate the effectiveness of the proposed approach and the necessity of searching in the binary space directly. Moreover, (4) we set a new state-of-the-art for binary neural networks on CIFAR10, CIFAR100 and ImageNet datasets. Code will be made available at https://github.com/1adrianb/binary-nas

Submitted 23 July, 2020; v1 submitted 3 March, 2020; originally announced March 2020.

Comments: accepted to ECCV 2020

arXiv:2001.08702 [pdf, other]

Lipreading using Temporal Convolutional Networks

Authors: Brais Martinez, Pingchuan Ma, Stavros Petridis, Maja Pantic

Abstract: Lip-reading has attracted a lot of research attention lately thanks to advances in deep learning. The current state-of-the-art model for recognition of isolated words in-the-wild consists of a residual network and Bidirectional Gated Recurrent Unit (BGRU) layers. In this work, we address the limitations of this model and we propose changes which further improve its performance. Firstly, the BGRU layers are replaced with Temporal Convolutional Networks (TCN). Secondly, we greatly simplify the training procedure, which allows us to train the model in one single stage. Thirdly, we show that the current state-of-the-art methodology produces models that do not generalize well to variations on the sequence length, and we address this issue by proposing a variable-length augmentation. We present results on the largest publicly-available datasets for isolated word recognition in English and Mandarin, LRW and LRW1000, respectively. Our proposed model results in an absolute improvement of 1.2% and 3.2%, respectively, in these datasets, which is the new state-of-the-art performance.

Submitted 23 January, 2020; originally announced January 2020.

arXiv:1908.07625 [pdf, other]

Action recognition with spatial-temporal discriminative filter banks

Authors: Brais Martinez, Davide Modolo, Yuanjun Xiong, Joseph Tighe

Abstract: Action recognition has seen a dramatic performance improvement in the last few years. Most of the current state-of-the-art literature either aims at improving performance through changes to the backbone CNN network, or they explore different trade-offs between computational efficiency and performance, again through altering the backbone network. However, almost all of these works maintain the same last layers of the network, which simply consist of a global average pooling followed by a fully connected layer. In this work we focus on how to improve the representation capacity of the network, but rather than altering the backbone, we focus on improving the last layers of the network, where changes have low impact in terms of computational cost. In particular, we show that current architectures have poor sensitivity to finer details and we exploit recent advances in the fine-grained recognition literature to improve our model in this aspect. With the proposed approach, we obtain state-of-the-art performance on Kinetics-400 and Something-Something-V1, the two major large-scale action recognition benchmarks.

Submitted 20 August, 2019; originally announced August 2019.

Comments: ICCV 2019 Accepted Paper

arXiv:1810.00847 [pdf, other]

doi 10.1109/JIOT.2019.2904799

Exploring the Performance Boundaries of NB-IoT

Authors: Borja Martinez, Ferran Adelantado, Andrea Bartoli, Xavier Vilajosana

Abstract: NarrowBand-IoT has just joined the LPWAN community. Unlike most of its competitors, NB-IoT did not emerge from a blank slate. Indeed, it is closely linked to LTE, from which it inherits many of the features that undoubtedly determine its behavior. In this paper, we empirically explore the boundaries of this technology, analyzing from a user's point of view critical characteristics such as energy consumption, reliability and delays. The results show that its performance in terms of energy is comparable to, and in some cases even better than, that of an LPWAN reference technology like LoRa, with the added benefit of guaranteeing delivery. However, the high variability observed in both energy expenditure and network delays calls into question its suitability for some applications, especially those subject to service-level agreements.

Submitted 18 February, 2019; v1 submitted 1 October, 2018; originally announced October 2018.

arXiv:1808.03065 [pdf, other]

doi 10.1109/MCOM.2019.1800570

A Square Peg in a Round Hole: The Complex Path for Wireless in the Manufacturing Industry

Authors: Borja Martinez, Cristina Cano, Xavier Vilajosana

Abstract: The manufacturing industry is at the edge of the 4th industrial revolution, a paradigm of integrated architectures in which the entire production chain (composed of machines, workers and products) is intrinsically connected. Wireless technologies can add further value in this manufacturing revolution. However, we identify some signs that indicate that wireless could be left out from the next generation of smart-factory equipment. This is particularly relevant considering that the heavy machinery characteristic of this sector can last for decades. We argue that at the core of this issue there is a mismatch between industrial needs and the interests of academic and partly-academic (such as standardization bodies) sectors. We base our claims on surveys from renowned advisory firms and interviews with industrial actors, which we contrast with results from content analysis of scientific articles. Finally we propose some convergence paths that, while still retaining the degree of novelty required for academic purposes, are more aligned with industrial concerns.

Submitted 1 February, 2019; v1 submitted 9 August, 2018; originally announced August 2018.

Comments: 6 pages, 3 figures

arXiv:1801.03648 [pdf]

The Wireless Technology Landscape in the Manufacturing Industry: A Reality Check

Authors: Xavier Vilajosana, Cristina Cano, Borja Martinez, Pere Tuset, Joan Melià, Ferran Adelantado

Abstract: An upcoming industrial IoT revolution, supposedly led by the introduction of embedded sensing and computing, seamless communication and massive data analytics within industrial processes [1], seems unquestionable today. Multiple technologies are being developed, and huge marketing efforts are being made to position solutions in this industrial landscape. However, we have observed that industrial wireless technologies are hardly being adopted by the manufacturing industry. In this article, we try to understand the reasons behind this current lack of wireless technologies adoption by means of conducting visits to the manufacturing industry and interviews with the maintenance and engineering teams in these industries. The manufacturing industry is very diverse and specialized, so we have tried to cover some of the most representative cases: the automotive sector, the pharmaceutical sector (blistering), machine-tool industries (both consumer and aerospace sectors) and robotics. We have analyzed the technology of their machinery, their application requirements and restrictions, and identified a list of obstacles for wireless technology adoption. The most immediate obstacles we have found are the need to strictly follow standards and certifications processes, as well as their prudence. But the less obvious and perhaps even more limiting obstacles are their apparent lack of concern regarding low energy consumption or cost which, in contrast, are believed to be of utmost importance by wireless researchers and practitioners.
In this reality-check article, we analyze the causes of this different perception, we identify these obstacles and devise complementary paths to make wireless adoption by the industrial manufacturing sector a reality in the coming years.

Submitted 11 January, 2018; originally announced January 2018.

Comments: 5 pages

Report number: 01-A

Journal ref: MMTC Communications - Frontiers, SPECIAL ISSUE ON Multiple Wireless Technologies and IoT in Industry: Applications and Challenges, Vol. 12, No. 6, November 2017

arXiv:1701.04540 [pdf, other]

Fusing Deep Learned and Hand-Crafted Features of Appearance, Shape, and Dynamics for Automatic Pain Estimation

Authors: Joy Egede, Michel Valstar, Brais Martinez

Abstract: Automatic continuous time, continuous value assessment of a patient's pain from face video is highly sought after by the medical profession. Despite the recent advances in deep learning that attain impressive results in many domains, pain estimation risks not being able to benefit from this due to the difficulty in obtaining data sets of considerable size. In this work we propose a combination of hand-crafted and deep-learned features that makes the most of deep learning techniques in small sample settings. Encoding shape, appearance, and dynamics, our method significantly outperforms the current state of the art, attaining a RMSE error of less than 1 point on a 16-level pain scale, whilst simultaneously scoring a 67.3% Pearson correlation coefficient between our predicted pain level time series and the ground truth.

Submitted 17 January, 2017; originally announced January 2017.

Comments: 8 pages, 5 figures

arXiv:1612.02203 [pdf, other]

doi 10.1109/TPAMI.2017.2745568

A Functional Regression approach to Facial Landmark Tracking

Authors: Enrique Sánchez-Lozano, Georgios Tzimiropoulos, Brais Martinez, Fernando De la Torre, Michel Valstar

Abstract: Linear regression is a fundamental building block in many face detection and tracking algorithms, typically used to predict shape displacements from image features through a linear mapping. This paper presents a Functional Regression solution to the least squares problem, which we coin Continuous Regression, resulting in the first real-time incremental face tracker. Contrary to prior work in Functional Regression, in which B-splines or Fourier series were used, we propose to approximate the input space by its first-order Taylor expansion, yielding a closed-form solution for the continuous domain of displacements. We then extend the continuous least squares problem to correlated variables, and demonstrate the generalisation of our approach. We incorporate Continuous Regression into the cascaded regression framework, and show its computational benefits for both training and testing. We then present a fast approach for incremental learning within Cascaded Continuous Regression, coined iCCR, and show that its complexity allows real-time face tracking, being 20 times faster than the state of the art. To the best of our knowledge, this is the first incremental face tracker that is shown to operate in real-time. We show that iCCR achieves state-of-the-art performance on the 300-VW dataset, the most recent, large-scale benchmark for face tracking.

Submitted 20 September, 2017; v1 submitted 7 December, 2016; originally announced December 2016.

Comments: Accepted at IEEE TPAMI. This is authors' version. 0162-8828 ©2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017

arXiv:1608.01137 [pdf, other]

Cascaded Continuous Regression for Real-time Incremental Face Tracking

Authors: Enrique Sánchez-Lozano, Brais Martinez, Georgios Tzimiropoulos, Michel Valstar

Abstract: This paper introduces a novel real-time algorithm for facial landmark tracking. Compared to detection, tracking has both additional challenges and opportunities. Arguably the most important aspect in this domain is updating a tracker's models as tracking progresses, also known as incremental (face) tracking. While this should result in more accurate localisation, how to do this online and in real time without causing a tracker to drift is still an important open research question. We address this question in the cascaded regression framework, the state-of-the-art approach for facial landmark localisation. Because incremental learning for cascaded regression is costly, we propose a much more efficient yet equally accurate alternative using continuous regression. More specifically, we first propose cascaded continuous regression (CCR) and show its accuracy is equivalent to the Supervised Descent Method. We then derive the incremental learning updates for CCR (iCCR) and show that it is an order of magnitude faster than standard incremental learning for cascaded regression, bringing the time required for the update from seconds down to a fraction of a second, thus enabling real-time tracking. Finally, we evaluate iCCR and show the importance of incremental learning in achieving state-of-the-art performance. Code for our iCCR is available from http://www.cs.nott.ac.uk/~psxes1

Submitted 6 August, 2016; v1 submitted 3 August, 2016; originally announced August 2016.

Comments: ECCV 2016 accepted paper, with supplementary material included as appendices. References to Equations fixed

arXiv:1607.08011 [pdf, other]

doi 10.1109/MCOM.2017.1600613

Understanding the limits of LoRaWAN

Authors: Ferran Adelantado, Xavier Vilajosana, Pere Tuset-Peiro, Borja Martinez, Joan Melia, Thomas Watteyne

Abstract: The quick proliferation of LPWAN networks, with LoRaWAN being one of the most widely adopted, has raised the interest of industry and network operators and facilitated the development of novel services based on large-scale and simple network structures. LoRaWAN brings the desired ubiquitous connectivity to enable most outdoor IoT applications, and its growth and quick adoption are real proof of that. Yet the technology has some limitations that need to be understood in order to avoid over-use of the technology. In this article we aim to provide an impartial overview of the limitations of this technology and to present use case examples that show where the limits are.

Submitted 13 February, 2017; v1 submitted 27 July, 2016; originally announced July 2016.

arXiv:1406.4212 [pdf, ps, other]

Number of minimal cyclic codes with given length and dimension

Authors: F. E. Brochero Martínez

Abstract: In this article, we count the number of minimal cyclic codes of length $n$ and dimension $k$ over a finite field $\mathbb F_q$, in the case when the prime factors of $n$ satisfy a special condition. This problem is equivalent to counting the irreducible factors of $x^n-1\in \mathbb F_q[x]$ of degree $k$.

Submitted 16 June, 2014; originally announced June 2014.

MSC Class: 20C05 (primary) and 16S34(secondary)
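
The equivalence stated in the abstract above can be illustrated with a small sketch. It is not the paper's method: it relies on the standard fact that, when $\gcd(n, q) = 1$, the irreducible factors of $x^n-1$ over $\mathbb F_q$ correspond one-to-one to the $q$-cyclotomic cosets modulo $n$, with the degree of each factor equal to the size of its coset. The function name `cyclotomic_coset_sizes` is hypothetical, and the code assumes that `q` is a prime power (it does not check this).

```python
from collections import Counter
from math import gcd


def cyclotomic_coset_sizes(n: int, q: int) -> Counter:
    """Count the q-cyclotomic cosets modulo n, grouped by coset size.

    Assumes q is a prime power and gcd(n, q) == 1, so x^n - 1 has no
    repeated factors over F_q. Each coset {s, s*q, s*q^2, ...} mod n
    corresponds to one irreducible factor of x^n - 1, and the coset
    size equals the degree of that factor.
    """
    if gcd(n, q) != 1:
        raise ValueError("this sketch requires gcd(n, q) == 1")
    seen = set()
    sizes = Counter()
    for s in range(n):
        if s in seen:
            continue
        # Walk the orbit of s under multiplication by q modulo n.
        size = 0
        t = s
        while t not in seen:
            seen.add(t)
            size += 1
            t = (t * q) % n
        sizes[size] += 1
    return sizes


# sizes[k] is the number of irreducible factors of degree k, which by the
# equivalence in the abstract is the number of minimal cyclic codes of
# length n and dimension k.
sizes = cyclotomic_coset_sizes(7, 2)
print(sizes[3])
```

For $n = 7$ and $q = 2$ the cosets are $\{0\}$, $\{1, 2, 4\}$ and $\{3, 5, 6\}$, matching the factorization $x^7 - 1 = (x+1)(x^3+x+1)(x^3+x^2+1)$ over $\mathbb F_2$. This brute-force enumeration works for any $n$ coprime to $q$; the article instead derives counts under a special condition on the prime factors of $n$.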

arXiv:1404.6851 [pdf, ps, other]

Weight enumerator of some irreducible cyclic codes

Authors: F. E. Brochero Martínez, C. R. Giraldo Vergara

Abstract: In this article, we show explicitly all possible weight enumerators for every irreducible cyclic code of length $n$ over a finite field $\mathbb F_q$, in the case where each prime divisor of $n$ is also a divisor of $q-1$.

Submitted 8 May, 2014; v1 submitted 27 April, 2014; originally announced April 2014.

Comments: Submitted to Designs, Codes and Cryptography, 8 pages

MSC Class: 12E05(primary) and 94B05(secondary)

Showing 1–50 of 52 results for author: Martinez, B