
Refining activation downsampling with SoftPool

Authors: Alexandros Stergiou, Ronald Poppe, Grigorios Kalliatakis

Abstract: Convolutional Neural Networks (CNNs) use pooling to decrease the size of activation maps. This process is crucial to increase the receptive fields and to reduce computational requirements of subsequent convolutions. An important feature of the pooling operation is the minimization of information loss, with respect to the initial activation maps, without a significant impact on the computation and memory overhead. To meet these requirements, we propose SoftPool: a fast and efficient method for exponentially weighted activation downsampling. Through experiments across a range of architectures and pooling methods, we demonstrate that SoftPool can retain more information in the reduced activation maps. This refined downsampling leads to improvements in a CNN's classification accuracy. Experiments with pooling layer substitutions on ImageNet1K show an increase in accuracy over both original architectures and other pooling methods. We also test SoftPool on video datasets for action recognition. Again, through the direct replacement of pooling layers, we observe consistent performance improvements while computational loads and memory requirements remain limited.

Submitted 18 March, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

arXiv:1909.08611 [pdf, other]

doi 10.1109/ICCVW.2019.00524

Class Feature Pyramids for Video Explanation

Authors: Alexandros Stergiou, Georgios Kapidis, Grigorios Kalliatakis, Christos Chrysoulas, Ronald Poppe, Remco Veltkamp

Abstract: Deep convolutional networks are widely used in video action recognition. 3D convolutions are one prominent approach to deal with the additional time dimension. While 3D convolutions typically lead to higher accuracies, the inner workings of the trained models are more difficult to interpret. We focus on creating human-understandable visual explanations that represent the hierarchical parts of spatio-temporal networks. We introduce Class Feature Pyramids, a method that traverses the entire network structure and incrementally discovers kernels at different network depths that are informative for a specific class. Our method does not depend on the network's architecture or the type of 3D convolutions, supporting grouped and depth-wise convolutions, convolutions in fibers, and convolutions in branches. We demonstrate the method on six state-of-the-art 3D convolution neural networks (CNNs) on three action recognition (Kinetics-400, UCF-101, and HMDB-51) and two egocentric action recognition datasets (EPIC-Kitchens and EGTEA Gaze+).

Submitted 18 September, 2019; originally announced September 2019.

arXiv:1905.02025 [pdf, other]

DisplaceNet: Recognising Displaced People from Images by Exploiting Dominance Level

Authors: Grigorios Kalliatakis, Shoaib Ehsan, Maria Fasli, Klaus McDonald-Maier

Abstract: Every year millions of men, women and children are forced to leave their homes and seek refuge from wars, human rights violations, persecution, and natural disasters. The number of forcibly displaced people grew at a record rate of 44,400 every day throughout 2017, raising the cumulative total to 68.5 million at the year's end, overtaking the total population of the United Kingdom. Up to 85% of the forcibly displaced find refuge in low- and middle-income countries, calling for increased humanitarian assistance worldwide. To reduce the amount of manual labour required for human-rights-related image analysis, we introduce DisplaceNet, a novel model which infers potential displaced people from images by integrating the control level of the situation and a conventional convolutional neural network (CNN) classifier into one framework for image classification. Experimental results show that DisplaceNet achieves up to a 4% gain in coverage (the proportion of a data set for which a classifier is able to produce a prediction) over the sole use of a CNN classifier. Our dataset, codes and trained models will be available online at https://github.com/GKalliatakis/DisplaceNet.

Submitted 3 May, 2019; originally announced May 2019.

Comments: To be published in CVPR Workshop on Computer Vision for Global Challenges (CV4GC). arXiv admin note: substantial text overlap with arXiv:1902.03817

arXiv:1902.03817 [pdf, other]

GET-AID: Visual Recognition of Human Rights Abuses via Global Emotional Traits

Authors: Grigorios Kalliatakis, Shoaib Ehsan, Maria Fasli, Klaus D. McDonald-Maier

Abstract: In the era of social media and big data, the use of visual evidence to document conflict and human rights abuse has become an important element for human rights organizations and advocates. In this paper, we address the task of detecting two types of human rights abuses in challenging, everyday photos: (1) child labour, and (2) displaced populations. We propose a novel model that is driven by a human-centric approach. Our hypothesis is that the emotional state of a person -- how positive or pleasant an emotion is, and the control level of the situation by the person -- are powerful cues for perceiving potential human rights violations. To exploit these cues, our model learns to predict global emotional traits over a given image based on the joint analysis of every detected person and the whole scene. By integrating these predictions with a data-driven convolutional neural network (CNN) classifier, our system efficiently infers potential human rights abuses in a clean, end-to-end system we call GET-AID (from Global Emotional Traits for Abuse IDentification). Extensive experiments are performed to verify our method on the recently introduced subset of Human Rights Archive (HRA) dataset (2 violation categories with the same number of positive and negative samples), where we show quantitatively compelling results. Compared with previous works and the sole use of a CNN classifier, this paper improves the coverage up to 23.73% for child labour and 57.21% for displaced populations. Our dataset, codes and trained models are available online at https://github.com/GKalliatakis/GET-AID.

Submitted 11 February, 2019; originally announced February 2019.

Comments: 10 pages, 6 figures

arXiv:1902.01078 [pdf, other]

doi 10.1109/ICIP.2019.8803153

Saliency Tubes: Visual Explanations for Spatio-Temporal Convolutions

Authors: Alexandros Stergiou, Georgios Kapidis, Grigorios Kalliatakis, Christos Chrysoulas, Remco Veltkamp, Ronald Poppe

Abstract: Deep learning approaches have been established as the main methodology for video classification and recognition. Recently, 3-dimensional convolutions have been used to achieve state-of-the-art performance in many challenging video datasets. Because of the high level of complexity of these methods, as the convolution operations are also extended to an additional dimension in order to extract features from it as well, providing a visualization for the signals that the network interprets as informative is a challenging task. An effective way of understanding the network's inner workings would be to isolate the spatio-temporal regions of the video that the network finds most informative. We propose a method called Saliency Tubes, which demonstrates the foremost points and regions, both at frame level and over time, that are found to be the main focus points of the network. We demonstrate our findings on widely used datasets for third-person and egocentric action classification and enhance the set of methods and visualizations that improve the intelligibility of 3D Convolutional Neural Networks (CNNs).

Submitted 12 May, 2019; v1 submitted 4 February, 2019; originally announced February 2019.

Journal ref: IEEE International Conference on Image Processing (ICIP 2019)

arXiv:1807.02098 [pdf, other]

MAT-CNN-SOPC: Motionless Analysis of Traffic Using Convolutional Neural Networks on System-On-a-Programmable-Chip

Authors: Somdip Dey, Grigorios Kalliatakis, Sangeet Saha, Amit Kumar Singh, Shoaib Ehsan, Klaus McDonald-Maier

Abstract: Intelligent Transportation Systems (ITS) have become an important pillar in the modern "smart city" framework, which demands intelligent involvement of machines. Traffic load recognition can be categorized as an important and challenging issue for such systems. Recently, Convolutional Neural Network (CNN) models have drawn a considerable amount of interest in many areas, such as weather classification and human rights violation detection through images, due to their accurate prediction capabilities. This work tackles the real-life traffic load recognition problem on a System-On-a-Programmable-Chip (SOPC) platform and coins the approach MAT-CNN-SOPC, which uses an intelligent re-training mechanism of the CNN with known environments. The proposed methodology is capable of enhancing the efficacy of the approach by 2.44x in comparison to the state of the art, as proven through experimental analysis. We have also introduced a mathematical equation, which is capable of quantifying the suitability of using one CNN model over another for a particular application-based implementation.

Submitted 14 August, 2018; v1 submitted 5 July, 2018; originally announced July 2018.

Comments: 6 pages, 3 figures, 2 tables

ACM Class: I.4; I.2.1; C.1.4

Journal ref: 2018 NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2018)

arXiv:1805.04714 [pdf, other]

Exploring object-centric and scene-centric CNN features and their complementarity for human rights violations recognition in images

Authors: Grigorios Kalliatakis, Shoaib Ehsan, Ales Leonardis, Klaus McDonald-Maier

Abstract: Identifying potential abuses of human rights through imagery is a novel and challenging task in the field of computer vision that will make it possible to expose human rights violations over large-scale data, which may otherwise be impossible. While standard databases for object and scene categorisation contain hundreds of different classes, the largest available dataset of human rights violations contains only 4 classes. Here, we introduce the 'Human Rights Archive Database' (HRA), a verified-by-experts repository of 3050 human rights violations photographs, labelled with human rights semantic categories, comprising a list of the types of human rights abuses encountered at present. With the HRA dataset and a two-phase transfer learning scheme, we fine-tuned the state-of-the-art deep convolutional neural networks (CNNs) to provide human rights violations classification CNNs (HRA-CNNs). We also present extensive experiments refined to evaluate how well object-centric and scene-centric CNN features can be combined for the task of recognising human rights abuses. With this, we show that the HRA database poses a challenge at a higher level for the well-studied representation learning methods, and provide a benchmark in the task of human rights violations recognition in visual context. We expect this dataset can help to open up new horizons on creating systems capable of recognising rich information about human rights violations. Our dataset, codes and trained models are available online at https://github.com/GKalliatakis/Human-Rights-Archive-CNNs.

Submitted 12 May, 2018; originally announced May 2018.

Comments: 19 pages, 13 figures; Submitted to PLOS ONE

arXiv:1711.03874 [pdf]

Material Classification in the Wild: Do Synthesized Training Data Generalise Better than Real-World Training Data?

Authors: Grigorios Kalliatakis, Anca Sticlaru, George Stamatiadis, Shoaib Ehsan, Ales Leonardis, Juergen Gall, Klaus D. McDonald-Maier

Abstract: We question the dominant role of real-world training images in the field of material classification by investigating whether synthesized data can generalise more effectively than real-world data. Experimental results on three challenging real-world material databases show that the best performing pre-trained convolutional neural network (CNN) architectures can achieve up to 91.03% mean average precision when classifying materials in cross-dataset scenarios. We demonstrate that synthesized data achieve an improvement on mean average precision when used as training data and in conjunction with pre-trained CNN architectures, which spans from ~ 5% to ~ 19% across three widely used material databases of real-world images.

Submitted 9 November, 2017; originally announced November 2017.

Comments: accepted for publication in VISAPP 2018. arXiv admin note: text overlap with arXiv:1703.04101

arXiv:1703.10501 [pdf, other]

A Paradigm Shift: Detecting Human Rights Violations Through Web Images

Authors: Grigorios Kalliatakis, Shoaib Ehsan, Klaus D. McDonald-Maier

Abstract: The growing presence of devices carrying digital cameras, such as mobile phones and tablets, combined with ever improving internet networks have enabled ordinary citizens, victims of human rights abuse, and participants in armed conflicts, protests, and disaster situations to capture and share via social media networks images and videos of specific events. This paper discusses the potential of images in human rights context including the opportunities and challenges they present. This study demonstrates that real-world images have the capacity to contribute complementary data to operational human rights monitoring efforts when combined with novel computer vision approaches. The analysis is concluded by arguing that if images are to be used effectively to detect and identify human rights violations by rights advocates, greater attention to gathering task-specific visual concepts from large-scale web images is required.

Submitted 30 March, 2017; originally announced March 2017.

Comments: Position paper, 8 pages, 3 figures

arXiv:1703.04103 [pdf, other]

Detection of Human Rights Violations in Images: Can Convolutional Neural Networks help?

Authors: Grigorios Kalliatakis, Shoaib Ehsan, Maria Fasli, Ales Leonardis, Juergen Gall, Klaus D. McDonald-Maier

Abstract: After setting the performance benchmarks for image, video, speech and audio processing, deep convolutional networks have been core to the greatest advances in image recognition tasks in recent times. This raises the question of whether there is any benefit in targeting these remarkable deep architectures with the unattempted task of recognising human rights violations through digital images. Under this perspective, we introduce a new, well-sampled human rights-centric dataset called Human Rights Understanding (HRUN). We conduct a rigorous evaluation on a common ground by combining this dataset with different state-of-the-art deep convolutional architectures in order to achieve recognition of human rights violations. Experimental results on the HRUN dataset have shown that the best performing CNN architectures can achieve up to 88.10% mean average precision. Additionally, our experiments demonstrate that increasing the size of the training samples is crucial for achieving an improvement on mean average precision principally when utilising very deep networks.

Submitted 16 March, 2017; v1 submitted 12 March, 2017; originally announced March 2017.

Comments: In Proceedings of the 12th International Conference on Computer Vision Theory and Applications (VISAPP 2017), 8 pages

arXiv:1703.04101 [pdf, other]

Evaluating Deep Convolutional Neural Networks for Material Classification

Authors: Grigorios Kalliatakis, Georgios Stamatiadis, Shoaib Ehsan, Ales Leonardis, Juergen Gall, Anca Sticlaru, Klaus D. McDonald-Maier

Abstract: Determining the material category of a surface from an image is a demanding task in perception that is drawing increasing attention. Following the recent remarkable results achieved for image classification and object detection utilising Convolutional Neural Networks (CNNs), we empirically study material classification of everyday objects employing these techniques. More specifically, we conduct a rigorous evaluation of how state-of-the-art CNN architectures compare on a common ground over widely used material databases. Experimental results on three challenging material databases show that the best performing CNN architectures can achieve up to 94.99% mean average precision when classifying materials.

Submitted 16 March, 2017; v1 submitted 12 March, 2017; originally announced March 2017.

Comments: In Proceedings of the 12th International Conference on Computer Vision Theory and Applications (VISAPP 2017), 7 pages

arXiv:1703.03949 [pdf, other]

doi 10.1109/CEEC.2016.7835887

Web-based visualisation of head pose and facial expressions changes: monitoring human activity using depth data

Authors: Grigorios Kalliatakis, Nikolaos Vidakis, Georgios Triantafyllidis

Abstract: Despite significant recent advances in the field of head pose estimation and facial expression recognition, raising the cognitive level when analysing human activity presents serious challenges to current concepts. Motivated by the need to generate comprehensible visual representations from different sets of data, we introduce a system capable of monitoring human activity through head pose and facial expression changes, utilising an affordable 3D sensing technology (Microsoft Kinect sensor). An approach built on discriminative random regression forests was selected in order to rapidly and accurately estimate head pose changes in unconstrained environments. In order to complete the secondary process of recognising four universal dominant facial expressions (happiness, anger, sadness and surprise), emotion recognition via facial expressions (ERFE) was adopted. After that, a lightweight data exchange format (JavaScript Object Notation, JSON) is employed in order to manipulate the data extracted from the two aforementioned settings. Such a mechanism can yield a platform for objective and effortless assessment of human activity within the context of serious gaming and human-computer interaction.

Submitted 16 March, 2017; v1 submitted 11 March, 2017; originally announced March 2017.

Comments: 8th Computer Science and Electronic Engineering, (CEEC 2016), University of Essex, UK, 6 pages

Showing 1–12 of 12 results for author: Kalliatakis, G