doi 10.1109/RBME.2024.3408456

Automated Radiology Report Generation: A Review of Recent Advances

Authors: Phillip Sloan, Philip Clatworthy, Edwin Simpson, Majid Mirmehdi

Abstract: Increasing demands on medical imaging departments are taking a toll on the radiologist's ability to deliver timely and accurate reports. Recent technological advances in artificial intelligence have demonstrated great potential for automatic radiology report generation (ARRG), sparking an explosion of research. This survey paper conducts a methodological review of contemporary ARRG approaches by way of (i) assessing datasets based on characteristics, such as availability, size, and adoption rate, (ii) examining deep learning training methods, such as contrastive learning and reinforcement learning, (iii) exploring state-of-the-art model architectures, including variations of CNN and transformer models, (iv) outlining techniques integrating clinical knowledge through multimodal inputs and knowledge graphs, and (v) scrutinising current model evaluation techniques, including commonly applied NLP metrics and qualitative clinical reviews. Furthermore, the quantitative results of the reviewed models are analysed, where the top performing models are examined to seek further insights. Finally, potential new directions are highlighted, with the adoption of additional datasets from other radiological modalities and improved evaluation methods predicted as important areas of future development.

Submitted 29 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

Comments: 24 pages, 8 figures, 6 tables. Accepted by IEEE Reviews in Biomedical Engineering

MSC Class: 68T99 ACM Class: I.2; I.4; J.3

arXiv:2404.08937 [pdf, other]

ChimpVLM: Ethogram-Enhanced Chimpanzee Behaviour Recognition

Authors: Otto Brookes, Majid Mirmehdi, Hjalmar Kuhl, Tilo Burghardt

Abstract: We show that chimpanzee behaviour understanding from camera traps can be enhanced by providing visual architectures with access to an embedding of text descriptions that detail species behaviours. In particular, we present a vision-language model which employs multi-modal decoding of visual features extracted directly from camera trap videos to process query tokens representing behaviours and output class predictions. Query tokens are initialised using a standardised ethogram of chimpanzee behaviour, rather than using random or name-based initialisations. In addition, the effect of initialising query tokens using a masked language model fine-tuned on a text corpus of known behavioural patterns is explored. We evaluate our system on the PanAf500 and PanAf20K datasets and demonstrate the performance benefits of our multi-modal decoding approach and query initialisation strategy on multi-class and multi-label recognition tasks, respectively. Results and ablations corroborate performance improvements. We achieve state-of-the-art performance over vision and vision-language models in top-1 accuracy (+6.34%) on PanAf500 and overall (+1.1%) and tail-class (+2.26%) mean average precision on PanAf20K. We share complete source code and network weights for full reproducibility of results and easy utilisation.

Submitted 13 April, 2024; originally announced April 2024.

arXiv:2401.13554 [pdf, other]

PanAf20K: A Large Video Dataset for Wild Ape Detection and Behaviour Recognition

Authors: Otto Brookes, Majid Mirmehdi, Colleen Stephens, Samuel Angedakin, Katherine Corogenes, Dervla Dowd, Paula Dieguez, Thurston C. Hicks, Sorrel Jones, Kevin Lee, Vera Leinert, Juan Lapuente, Maureen S. McCarthy, Amelia Meier, Mizuki Murai, Emmanuelle Normand, Virginie Vergnes, Erin G. Wessling, Roman M. Wittig, Kevin Langergraber, Nuria Maldonado, Xinyu Yang, Klaus Zuberbuhler, Christophe Boesch, Mimi Arandjelovic, et al. (2 additional authors not shown)

Abstract: We present the PanAf20K dataset, the largest and most diverse open-access annotated video dataset of great apes in their natural environment. It comprises more than 7 million frames across ~20,000 camera trap videos of chimpanzees and gorillas collected at 14 field sites in tropical Africa as part of the Pan African Programme: The Cultured Chimpanzee. The footage is accompanied by a rich set of annotations and benchmarks making it suitable for training and testing a variety of challenging and ecologically important computer vision tasks including ape detection and behaviour recognition. Furthering AI analysis of camera trap information is critical given the International Union for Conservation of Nature now lists all species in the great ape family as either Endangered or Critically Endangered. We hope the dataset can form a solid basis for engagement of the AI community to improve performance, efficiency, and result interpretation in order to support assessments of great ape presence, abundance, distribution, and behaviour and thereby aid conservation efforts.

Submitted 31 January, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

Comments: Accepted at IJCV

arXiv:2312.00856 [pdf, other]

QAFE-Net: Quality Assessment of Facial Expressions with Landmark Heatmaps

Authors: Shuchao Duan, Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi

Abstract: Facial expression recognition (FER) methods have made great inroads in categorising moods and feelings in humans. Beyond FER, pain estimation methods assess levels of intensity in pain expressions, however assessing the quality of all facial expressions is of critical value in health-related applications. In this work, we address the quality of five different facial expressions in patients affected by Parkinson's disease. We propose a novel landmark-guided approach, QAFE-Net, that combines temporal landmark heatmaps with RGB data to capture small facial muscle movements that are encoded and mapped to severity scores. The proposed approach is evaluated on a new Parkinson's Disease Facial Expression dataset (PFED5), as well as on the pain estimation benchmark, the UNBC-McMaster Shoulder Pain Expression Archive Database. Our comparative experiments demonstrate that the proposed method outperforms SOTA action quality assessment works on PFED5 and achieves lower mean absolute error than the SOTA pain estimation methods on UNBC-McMaster. Our code and the new PFED5 dataset are available at https://github.com/shuchaoduan/QAFE-Net.

Submitted 12 December, 2023; v1 submitted 1 December, 2023; originally announced December 2023.

Comments: Accepted to ELFA workshop at WACV 2024

arXiv:2311.16446 [pdf, other]

Centre Stage: Centricity-based Audio-Visual Temporal Action Detection

Authors: Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett

Abstract: Previous one-stage action detection approaches have modelled temporal dependencies using only the visual modality. In this paper, we explore different strategies to incorporate the audio modality, using multi-scale cross-attention to fuse the two modalities. We also demonstrate the correlation between the distance from the timestep to the action centre and the accuracy of the predicted boundaries. Thus, we propose a novel network head to estimate the closeness of timesteps to the action centre, which we call the centricity score. This leads to increased confidence for proposals that exhibit more precise boundaries. Our method can be integrated with other one-stage anchor-free architectures and we demonstrate this on three recent baselines on the EPIC-Kitchens-100 action detection benchmark where we achieve state-of-the-art performance. Detailed ablation studies showcase the benefits of fusing audio and our proposed centricity scores. Code and models for our proposed method are publicly available at https://github.com/hanielwang/Audio-Visual-TAD.git

Submitted 27 November, 2023; originally announced November 2023.

Comments: Accepted to VUA workshop at BMVC 2023

arXiv:2311.07603 [pdf, other]

PECoP: Parameter Efficient Continual Pretraining for Action Quality Assessment

Authors: Amirhossein Dadashzadeh, Shuchao Duan, Alan Whone, Majid Mirmehdi

Abstract: The limited availability of labelled data in Action Quality Assessment (AQA) has forced previous works to fine-tune their models pretrained on large-scale domain-general datasets. This common approach results in weak generalisation, particularly when there is a significant domain shift. We propose a novel, parameter efficient, continual pretraining framework, PECoP, to reduce such domain shift via an additional pretraining stage. In PECoP, we introduce 3D-Adapters, inserted into the pretrained model, to learn spatiotemporal, in-domain information via self-supervised learning where only the adapter modules' parameters are updated. We demonstrate PECoP's ability to enhance the performance of recent state-of-the-art methods (MUSDL, CoRe, and TSA) applied to AQA, leading to considerable improvements on benchmark datasets, JIGSAWS ($\uparrow6.0\%$), MTL-AQA ($\uparrow0.99\%$), and FineDiving ($\uparrow2.54\%$). We also present a new Parkinson's Disease dataset, PD4T, of real patients performing four various actions, where we surpass ($\uparrow3.56\%$) the state-of-the-art in comparison. Our code, pretrained models, and the PD4T dataset are available at https://github.com/Plrbear/PECoP.

Submitted 10 November, 2023; originally announced November 2023.

Comments: Accepted to WACV 2024 (preprint)

arXiv:2304.01143 [pdf, other]

Use Your Head: Improving Long-Tail Video Recognition

Authors: Toby Perrett, Saptarshi Sinha, Tilo Burghardt, Majid Mirmehdi, Dima Damen

Abstract: This paper presents an investigation into long-tail video recognition. We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties. Most critically, they lack few-shot classes in their tails. In response, we propose new video benchmarks that better assess long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT. We then propose a method, Long-Tail Mixed Reconstruction, which reduces overfitting to instances from few-shot classes by reconstructing them as weighted combinations of samples from head classes. LMR then employs label mixing to learn robust decision boundaries. It achieves state-of-the-art average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and VideoLT-LT. Benchmarks and code at: tobyperrett.github.io/lmr

Submitted 3 April, 2023; originally announced April 2023.

Comments: CVPR 2023

arXiv:2302.11325 [pdf, other]

doi 10.1109/ICIP49359.2023

Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation

Authors: Chengxi Zeng, Xinyu Yang, David Smithard, Majid Mirmehdi, Alberto M Gambaruto, Tilo Burghardt

Abstract: This paper presents a deep learning framework for medical video segmentation. Convolution neural network (CNN) and transformer-based methods have achieved great milestones in medical image segmentation tasks due to their incredible semantic feature encoding and global information comprehension abilities. However, most existing approaches ignore a salient aspect of medical video data - the temporal dimension. Our proposed framework explicitly extracts features from neighbouring frames across the temporal dimension and incorporates them with a temporal feature blender, which then tokenises the high-level spatio-temporal feature to form a strong global feature encoded via a Swin Transformer. The final segmentation results are produced via a UNet-like encoder-decoder architecture. Our model outperforms other approaches by a significant margin and improves the segmentation benchmarks on the VFSS2022 dataset, achieving a dice coefficient of 0.8986 and 0.8186 for the two datasets tested. Our studies also show the efficacy of the temporal feature blending scheme and cross-dataset transferability of learned capabilities. Code and models are fully available at https://github.com/SimonZeng7108/Video-SwinUNet.

Submitted 4 July, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

arXiv:2301.10829 [pdf, other]

TranSOP: Transformer-based Multimodal Classification for Stroke Treatment Outcome Prediction

Authors: Zeynel A. Samak, Philip Clatworthy, Majid Mirmehdi

Abstract: Acute ischaemic stroke, caused by an interruption in blood flow to brain tissue, is a leading cause of disability and mortality worldwide. The selection of patients for the most optimal ischaemic stroke treatment is a crucial step for a successful outcome, as the effect of treatment highly depends on the time to treatment. We propose a transformer-based multimodal network (TranSOP) for a classification approach that employs clinical metadata and imaging information, acquired on hospital admission, to predict the functional outcome of stroke treatment based on the modified Rankin Scale (mRS). This includes a fusion module to efficiently combine 3D non-contrast computed tomography (NCCT) features and clinical information. In comparative experiments using unimodal and multimodal data on the MRCLEAN dataset, we achieve a state-of-the-art AUC score of 0.85.

Submitted 25 January, 2023; originally announced January 2023.

Comments: Accepted at IEEE ISBI 2023, 5 pages

arXiv:2301.02642 [pdf, other]

Triple-stream Deep Metric Learning of Great Ape Behavioural Actions

Authors: Otto Brookes, Majid Mirmehdi, Hjalmar Kühl, Tilo Burghardt

Abstract: We propose the first metric learning system for the recognition of great ape behavioural actions. Our proposed triple stream embedding architecture works on camera trap videos taken directly in the wild and demonstrates that the utilisation of an explicit DensePose-C chimpanzee body part segmentation stream effectively complements traditional RGB appearance and optical flow streams. We evaluate system variants with different feature fusion techniques and long-tail recognition approaches. Results and ablations show performance improvements of ~12% in top-1 accuracy over previous results achieved on the PanAf-500 dataset containing 180,000 manually annotated frames across nine behavioural actions. Furthermore, we provide a qualitative analysis of our findings and augment the metric learning system with long-tail recognition techniques showing that average per class accuracy -- critical in the domain -- can be improved by ~23% compared to the literature on that dataset. Finally, since our embedding spaces are constructed as metric, we provide first data-driven visualisations of the great ape behavioural action spaces revealing emerging geometry and topology. We hope that the work sparks further interest in this vital application area of computer vision for the benefit of endangered great apes.

Submitted 6 January, 2023; originally announced January 2023.

arXiv:2210.14284 [pdf, other]

Refining Action Boundaries for One-stage Detection

Authors: Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett

Abstract: Current one-stage action detection methods, which simultaneously predict action boundaries and the corresponding class, do not estimate or use a measure of confidence in their boundary predictions, which can lead to inaccurate boundaries. We incorporate the estimation of boundary confidence into one-stage anchor-free detection, through an additional prediction head that predicts the refined boundaries with higher confidence. We obtain state-of-the-art performance on the challenging EPIC-KITCHENS-100 action detection as well as the standard THUMOS14 action detection benchmarks, and achieve improvement on the ActivityNet-1.3 benchmark.

Submitted 25 October, 2022; originally announced October 2022.

Comments: Accepted to AVSS 2022. Our code is available at https://github.com/hanielwang/Refining_Boundary_Head.git

arXiv:2208.08315 [pdf, other]

Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation

Authors: Chengxi Zeng, Xinyu Yang, Majid Mirmehdi, Alberto M Gambaruto, Tilo Burghardt

Abstract: We propose Video-TransUNet, a deep architecture for instance segmentation in medical CT videos constructed by integrating temporal feature blending into the TransUNet deep learning framework. In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module (TCM), non-local attention via a Vision Transformer, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads. We show that this new network design can significantly outperform other state-of-the-art systems when tested on the segmentation of bolus and pharynx/larynx in Videofluoroscopic Swallowing Study (VFSS) CT sequences. On our VFSS2022 dataset it achieves a dice coefficient of 0.8796 and an average surface distance of 1.0379 pixels. Note that tracking the pharyngeal bolus accurately is a particularly important application in clinical practice since it constitutes the primary method for diagnostics of swallowing impairment. Our findings suggest that the proposed model can indeed enhance the TransUNet architecture via exploiting temporal information and improving segmentation performance by a significant margin. We publish key source code, network weights, and ground truth annotations for simplified performance reproduction.

Submitted 22 August, 2022; v1 submitted 17 August, 2022; originally announced August 2022.

Comments: Accepted by International Conference on Machine Vision 2022

arXiv:2207.08064 [pdf, other]

Detecting Humans in RGB-D Data with CNNs

Authors: Kaiyang Zhou, Adeline Paiement, Majid Mirmehdi

Abstract: We address the problem of people detection in RGB-D data where we leverage depth information to develop a region-of-interest (ROI) selection method that provides proposals to two color and depth CNNs. To combine the detections produced by the two CNNs, we propose a novel fusion approach based on the characteristics of depth images. We also present a new depth-encoding scheme, which not only encodes depth images into three channels but also enhances the information for classification. We conduct experiments on a publicly available RGB-D people dataset and show that our approach outperforms the baseline models that only use RGB data.

Submitted 16 July, 2022; originally announced July 2022.

Comments: An (outdated) MSc project (2016), which studied how to use CNNs to detect humans in RGBD data

arXiv:2207.06789 [pdf, other]

Inertial Hallucinations -- When Wearable Inertial Devices Start Seeing Things

Authors: Alessandro Masullo, Toby Perrett, Tilo Burghardt, Ian Craddock, Dima Damen, Majid Mirmehdi

Abstract: We propose a novel approach to multimodal sensor fusion for Ambient Assisted Living (AAL) which takes advantage of learning using privileged information (LUPI). We address two major shortcomings of standard multimodal approaches, limited area coverage and reduced reliability. Our new framework fuses the concept of modality hallucination with triplet learning to train a model with different modalities to handle missing sensors at inference time. We evaluate the proposed model on inertial data from a wearable accelerometer device, using RGB videos and skeletons as privileged modalities, and show an improvement of accuracy of an average 6.6% on the UTD-MHAD dataset and an average 5.5% on the Berkeley MHAD dataset, reaching a new state-of-the-art for inertial-only classification accuracy on these datasets. We validate our framework through several ablation studies.

Submitted 14 July, 2022; originally announced July 2022.

arXiv:2205.00275 [pdf, other]

Dynamic Curriculum Learning for Great Ape Detection in the Wild

Authors: Xinyu Yang, Tilo Burghardt, Majid Mirmehdi

Abstract: We propose a novel end-to-end curriculum learning approach for sparsely labelled animal datasets leveraging large volumes of unlabelled data to improve supervised species detectors. We exemplify the method in detail on the task of finding great apes in camera trap footage taken in challenging real-world jungle environments. In contrast to previous semi-supervised methods, our approach adjusts learning parameters dynamically over time and gradually improves detection quality by steering training towards virtuous self-reinforcement. To achieve this, we propose integrating pseudo-labelling with curriculum learning policies and show how learning collapse can be avoided. We discuss theoretical arguments, ablations, and significant performance improvements against various state-of-the-art systems when evaluating on the Extended PanAfrican Dataset holding approx. 1.8M frames. We also demonstrate our method can outperform supervised baselines with significant margins on sparse label versions of other animal datasets such as Bees and Snapshot Serengeti. We note that performance advantages are strongest for smaller labelled ratios common in ecological applications. Finally, we show that our approach achieves competitive benchmarks for generic object detection in MS-COCO and PASCAL-VOC indicating wider applicability of the dynamic learning concepts introduced. We publish all relevant source code, network weights, and data access details for full reproducibility. The code is available at https://github.com/youshyee/DCL-Detection.

Submitted 2 January, 2023; v1 submitted 30 April, 2022; originally announced May 2022.

Comments: Accepted at IJCV

arXiv:2201.00434 [pdf, other]

TVNet: Temporal Voting Network for Action Localization

Authors: Hanyuan Wang, Dima Damen, Majid Mirmehdi, Toby Perrett

Abstract: We propose a Temporal Voting Network (TVNet) for action localization in untrimmed videos. This incorporates a novel Voting Evidence Module to locate temporal boundaries, more accurately, where temporal contextual evidence is accumulated to predict frame-level probabilities of start and end action boundaries. Our action-independent evidence module is incorporated within a pipeline to calculate confidence scores and action classes. We achieve an average mAP of 34.6% on ActivityNet-1.3, particularly outperforming previous methods with the highest IoU of 0.95. TVNet also achieves mAP of 56.0% when combined with PGCN and 59.1% with MUSES at 0.5 IoU on THUMOS14 and outperforms prior work at all thresholds. Our code is available at https://github.com/hanielwang/TVNet.

Submitted 2 January, 2022; originally announced January 2022.

Comments: 9 pages, 7 figures, 11 tables

arXiv:2112.04011 [pdf, other]

Auxiliary Learning for Self-Supervised Video Representation via Similarity-based Knowledge Distillation

Authors: Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi

Abstract: Despite the outstanding success of self-supervised pretraining methods for video representation learning, they generalise poorly when the unlabeled dataset for pretraining is small or the domain difference between unlabelled data in source task (pretraining) and labeled data in target task (finetuning) is significant. To mitigate these issues, we propose a novel approach to complement self-supervised pretraining via an auxiliary pretraining phase, based on knowledge similarity distillation, auxSKD, for better generalisation with a significantly smaller amount of video data, e.g. Kinetics-100 rather than Kinetics-400. Our method deploys a teacher network that iteratively distills its knowledge to the student model by capturing the similarity information between segments of unlabelled video data. The student model meanwhile solves a pretext task by exploiting this prior knowledge. We also introduce a novel pretext task, Video Segment Pace Prediction or VSPP, which requires our model to predict the playback speed of a randomly selected segment of the input video to provide more reliable self-supervised representations. Our experimental results show superior results to the state of the art on both UCF101 and HMDB51 datasets when pretraining on K100 in apple-to-apple comparisons. Additionally, we show that our auxiliary pretraining, auxSKD, when added as an extra pretraining phase to recent state of the art self-supervised methods (i.e. VCOP, VideoPace, and RSPNet), improves their results on UCF101 and HMDB51. Our code is available at https://github.com/Plrbear/auxSKD.

Submitted 25 April, 2022; v1 submitted 7 December, 2021; originally announced December 2021.

arXiv:2111.06830 [pdf, other]

Small or Far Away? Exploiting Deep Super-Resolution and Altitude Data for Aerial Animal Surveillance

Authors: Mowen Xue, Theo Greenslade, Majid Mirmehdi, Tilo Burghardt

Abstract: Visuals captured by high-flying aerial drones are increasingly used to assess biodiversity and animal population dynamics around the globe. Yet, challenging acquisition scenarios and tiny animal depictions in airborne imagery, despite ultra-high resolution cameras, have so far been limiting factors for applying computer vision detectors successfully with high confidence. In this paper, we address the problem for the first time by combining deep object detectors with super-resolution techniques and altitude data. In particular, we show that the integration of a holistic attention network based super-resolution approach and a custom-built altitude data exploitation network into standard recognition pipelines can considerably increase the detection efficacy in real-world settings. We evaluate the system on two public, large aerial-capture animal datasets, SAVMAP and AED. We find that the proposed approach can consistently improve over ablated baselines and the state-of-the-art performance for both datasets. In addition, we provide a systematic analysis of the relationship between animal resolution and detection performance. We conclude that super-resolution and altitude knowledge exploitation techniques can significantly increase benchmarks across settings and, thus, should be used routinely when detecting minutely resolved animals in aerial imagery.

Submitted 12 November, 2021; originally announced November 2021.

Comments: 11 pages, 7 figures, 2 tables

MSC Class: 65D19

arXiv:2109.08730 [pdf, ps, other]

Unsupervised View-Invariant Human Posture Representation

Authors: Faegheh Sardari, Björn Ommer, Majid Mirmehdi

Abstract: Most recent view-invariant action recognition and performance assessment approaches rely on a large amount of annotated 3D skeleton data to extract view-invariant features. However, acquiring 3D skeleton data can be cumbersome, if not impractical, in in-the-wild scenarios. To overcome this problem, we present a novel unsupervised approach that learns to extract view-invariant 3D human pose representation from a 2D image without using 3D joint data. Our model is trained by exploiting the intrinsic view-invariant properties of human pose between simultaneous frames from different viewpoints and their equivariant properties between augmented frames from the same viewpoint. We evaluate the learned view-invariant pose representations for two downstream tasks. We perform comparative experiments that show improvements on the state-of-the-art unsupervised cross-view action classification accuracy on NTU RGB+D by a significant margin, on both RGB and depth images. We also show the efficiency of transferring the learned representations from NTU RGB+D to obtain the first ever unsupervised cross-view and cross-subject rank correlation results on the multi-view human movement quality dataset, QMAR, and marginally improve on the state-of-the-art supervised results for this dataset. We also carry out ablation studies to examine the contributions of the different components of our proposed network.

Submitted 17 September, 2021; originally announced September 2021.

arXiv:2101.06184 [pdf, other]

Temporal-Relational CrossTransformers for Few-Shot Action Recognition

Authors: Toby Perrett, Alessandro Masullo, Tilo Burghardt, Majid Mirmehdi, Dima Damen

Abstract: We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared. Our proposed Temporal-Relational CrossTransformers (TRX) achieve state-of-the-art results on few-shot splits of Kinetics, Something-Something V2 (SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on SSv2 by a wide margin (12%) due to its ability to model temporal relations. A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order relational CrossTransformers.

Submitted 28 March, 2021; v1 submitted 15 January, 2021; originally announced January 2021.

Comments: Accepted in CVPR 2021

arXiv:2012.09890 [pdf, other]

Exploring Motion Boundaries in an End-to-End Network for Vision-based Parkinson's Severity Assessment

Authors: Amirhossein Dadashzadeh, Alan Whone, Michal Rolinski, Majid Mirmehdi

Abstract: Evaluating neurological disorders such as Parkinson's disease (PD) is a challenging task that requires the assessment of several motor and non-motor functions. In this paper, we present an end-to-end deep learning framework to measure PD severity in two important components, hand movement and gait, of the Unified Parkinson's Disease Rating Scale (UPDRS). Our method leverages an Inflated 3D CNN trained by a temporal segment framework to learn spatial and long temporal structure in video data. We also deploy a temporal attention mechanism to boost the performance of our model. Further, motion boundaries are explored as an extra input modality to assist in obfuscating the effects of camera motion for better movement assessment. We ablate the effects of different data modalities on the accuracy of the proposed network and compare with other popular architectures. We evaluate our proposed method on a dataset of 25 PD patients, obtaining 72.3% and 77.1% top-1 accuracy on hand movement and gait tasks respectively.

Submitted 24 December, 2020; v1 submitted 17 December, 2020; originally announced December 2020.

arXiv:2010.07217 [pdf, other]

Back to the Future: Cycle Encoding Prediction for Self-supervised Contrastive Video Representation Learning

Authors: Xinyu Yang, Majid Mirmehdi, Tilo Burghardt

Abstract: In this paper we show that learning video feature spaces in which temporal cycles are maximally predictable benefits action classification. In particular, we propose a novel learning approach termed Cycle Encoding Prediction (CEP) that is able to effectively represent high-level spatio-temporal structure of unlabelled video content. CEP builds a latent space wherein the concept of closed forward-backward as well as backward-forward temporal loops is approximately preserved. As a self-supervision signal, CEP leverages the bi-directional temporal coherence of the video stream and applies loss functions that encourage both temporal cycle closure as well as contrastive feature separation. Architecturally, the underpinning network structure utilises a single feature encoder for all video snippets, adding two predictive modules that learn temporal forward and backward transitions. We apply our framework for pretext training of networks for action recognition tasks. We report significantly improved results for the standard datasets UCF101 and HMDB51. Detailed ablation studies support the effectiveness of the proposed components. We publish source code for the CEP components in full with this paper.

Submitted 24 October, 2021; v1 submitted 14 October, 2020; originally announced October 2020.

Comments: accepted at BMVC

arXiv:2008.04999 [pdf, ps, other]

VI-Net: View-Invariant Quality of Human Movement Assessment

Authors: Faegheh Sardari, Adeline Paiement, Sion Hannuna, Majid Mirmehdi

Abstract: We propose a view-invariant method towards the assessment of the quality of human movements which does not rely on skeleton data. Our end-to-end convolutional neural network consists of two stages, where at first a view-invariant trajectory descriptor for each body joint is generated from RGB images, and then the collection of trajectories for all joints is processed by an adapted, pre-trained 2D CNN (e.g. VGG-19 or ResNeXt-50) to learn the relationship amongst the different body parts and deliver a score for the movement quality. We release the only publicly-available, multi-view, non-skeleton, non-mocap, rehabilitation movement dataset (QMAR), and provide results for both cross-subject and cross-view scenarios on this dataset. We show that VI-Net achieves average rank correlation of 0.66 on cross-subject and 0.65 on unseen views when trained on only two views. We also evaluate the proposed method on the single-view rehabilitation dataset KIMORE and obtain 0.66 rank correlation against a baseline of 0.62.

Submitted 11 August, 2020; originally announced August 2020.

Comments: 13 pages, 6 figures, 7 tables

arXiv:2007.14658 [pdf, other]

Meta-Learning with Context-Agnostic Initialisations

Authors: Toby Perrett, Alessandro Masullo, Tilo Burghardt, Majid Mirmehdi, Dima Damen

Abstract: Meta-learning approaches have addressed few-shot problems by finding initialisations suited for fine-tuning to target tasks. Often there are additional properties within training data (which we refer to as context), not relevant to the target task, which act as a distractor to meta-learning, particularly when the target task contains examples from a novel context not seen during training. We address this oversight by incorporating a context-adversarial component into the meta-learning process. This produces an initialisation for fine-tuning to target which is both context-agnostic and task-generalised. We evaluate our approach on three commonly used meta-learning algorithms and two problems. We demonstrate our context-agnostic meta-learning improves results in each case. First, we report on Omniglot few-shot character classification, using alphabets as context. An average improvement of 4.3% is observed across methods and tasks when classifying characters from an unseen alphabet. Second, we evaluate on a dataset for personalised energy expenditure predictions from video, using participant knowledge as context. We demonstrate that context-agnostic meta-learning decreases the average mean square error by 30%.

Submitted 22 October, 2020; v1 submitted 29 July, 2020; originally announced July 2020.

Comments: Accepted at ACCV 2020

arXiv:2005.13061 [pdf, other]

Prediction of Thrombectomy Functional Outcomes using Multimodal Data

Authors: Zeynel A. Samak, Philip Clatworthy, Majid Mirmehdi

Abstract: Recent randomised clinical trials have shown that patients with ischaemic stroke due to occlusion of a large intracranial blood vessel benefit from endovascular thrombectomy. However, predicting outcome of treatment in an individual patient remains a challenge. We propose a novel deep learning approach to directly exploit multimodal data (clinical metadata information, imaging data, and imaging biomarkers extracted from images) to estimate the success of endovascular treatment. We incorporate an attention mechanism in our architecture to model global feature inter-dependencies, both channel-wise and spatially. We perform comparative experiments using unimodal and multimodal data, to predict functional outcome (modified Rankin Scale score, mRS) and achieve 0.75 AUC for dichotomised mRS scores and 0.35 classification accuracy for individual mRS scores.

Submitted 28 May, 2020; v1 submitted 26 May, 2020; originally announced May 2020.

Comments: Accepted at Medical Image Understanding and Analysis (MIUA) 2020

arXiv:1910.09920 [pdf, other]

Weakly-Supervised Completion Moment Detection using Temporal Attention

Authors: Farnoosh Heidarivincheh, Majid Mirmehdi, Dima Damen

Abstract: Monitoring the progression of an action towards completion offers fine grained insight into the actor's behaviour. In this work, we target detecting the completion moment of actions, that is the moment when the action's goal has been successfully accomplished. This has potential applications from surveillance to assistive living and human-robot interactions. Previous effort required human annotations of the completion moment for training (i.e. full supervision). In this work, we present an approach for moment detection from weak video-level labels. Given both complete and incomplete sequences, of the same action, we learn temporal attention, along with accumulated completion prediction from all frames in the sequence. We also demonstrate how the approach can be used when completion moment supervision is available. We evaluate and compare our approach on actions from three datasets, namely HMDB, UCF101 and RGBD-AC, and show that temporal attention improves detection in both weakly-supervised and fully-supervised settings.

Submitted 22 October, 2019; originally announced October 2019.

arXiv:1910.01370 [pdf, other]

doi 10.1007/978-3-030-27272-2_15

Sit-to-Stand Analysis in the Wild using Silhouettes for Longitudinal Health Monitoring

Authors: Alessandro Masullo, Tilo Burghardt, Toby Perrett, Dima Damen, Majid Mirmehdi

Abstract: We present the first fully automated Sit-to-Stand or Stand-to-Sit (StS) analysis framework for long-term monitoring of patients in free-living environments using video silhouettes. Our method adopts a coarse-to-fine time localisation approach, where a deep learning classifier identifies possible StS sequences from silhouettes, and a smart peak detection stage provides fine localisation based on 3D bounding boxes. We tested our method on data from real homes of participants and monitored patients undergoing total hip or knee replacement. Our results show 94.4% overall accuracy in the coarse localisation and an error of 0.026 m/s in the speed of ascent measurement, highlighting important trends in the recuperation of patients who underwent surgery.

Submitted 3 October, 2019; originally announced October 2019.

arXiv:1908.11240 [pdf, other]

Great Ape Detection in Challenging Jungle Camera Trap Footage via Attention-Based Spatial and Temporal Feature Blending

Authors: Xinyu Yang, Majid Mirmehdi, Tilo Burghardt

Abstract: We propose the first multi-frame video object detection framework trained to detect great apes. It is applicable to challenging camera trap footage in complex jungle environments and extends a traditional feature pyramid architecture by adding self-attention driven feature blending in both the spatial as well as the temporal domain. We demonstrate that this extension can detect distinctive species appearance and motion signatures despite significant partial occlusion. We evaluate the framework using 500 camera trap videos of great apes from the Pan African Programme containing 180K frames, which we manually annotated with accurate per-frame animal bounding boxes. These clips contain significant partial occlusions, challenging lighting, dynamic backgrounds, and natural camouflage effects. We show that our approach performs highly robustly and significantly outperforms frame-based detectors. We also perform detailed ablation studies and validation on the full ILSVRC 2015 VID data corpus to demonstrate wider applicability at adequate performance levels. We conclude that the framework is ready to assist human camera trap inspection efforts. We publish code, weights, and ground truth annotations with this paper.

Submitted 29 August, 2019; originally announced August 2019.

Comments: Accepted by ICCV workshop 2019

arXiv:1806.08152 [pdf, other]

CaloriNet: From silhouettes to calorie estimation in private environments

Authors: Alessandro Masullo, Tilo Burghardt, Dima Damen, Sion Hannuna, Victor Ponce-López, Majid Mirmehdi

Abstract: We propose a novel deep fusion architecture, CaloriNet, for the online estimation of energy expenditure for free living monitoring in private environments, where RGB data is discarded and replaced by silhouettes. Our fused convolutional neural network architecture is trainable end-to-end, to estimate calorie expenditure, using temporal foreground silhouettes alongside accelerometer data. The network is trained and cross-validated on a publicly available dataset, SPHERE_RGBD + Inertial_calorie. Results show state-of-the-art minimum error on the estimation of energy expenditure (calories per minute), outperforming alternative, standard and single-modal techniques.

Submitted 21 June, 2018; originally announced June 2018.

Comments: 11 pages, 7 figures

arXiv:1806.05653 [pdf, other]

HGR-Net: A Fusion Network for Hand Gesture Segmentation and Recognition

Authors: Amirhossein Dadashzadeh, Alireza Tavakoli Targhi, Maryam Tahmasbi, Majid Mirmehdi

Abstract: We propose a two-stage convolutional neural network (CNN) architecture for robust recognition of hand gestures, called HGR-Net, where the first stage performs accurate semantic segmentation to determine hand regions, and the second stage identifies the gesture. The segmentation stage architecture is based on the combination of fully convolutional residual network and atrous spatial pyramid pooling. Although the segmentation sub-network is trained without depth information, it is particularly robust against challenges such as illumination variations and complex backgrounds. The recognition stage deploys a two-stream CNN, which fuses the information from the red-green-blue and segmented images by combining their deep representations in a fully connected layer before classification. Extensive experiments on public datasets show that our architecture achieves performance almost as good as the state of the art in segmentation and recognition of static hand gestures, at a fraction of training time, run time, and model size. Our method can operate at an average of 23 ms per frame.

Submitted 28 December, 2019; v1 submitted 14 June, 2018; originally announced June 2018.

arXiv:1806.04074 [pdf, other]

Semantically Selective Augmentation for Deep Compact Person Re-Identification

Authors: Víctor Ponce-López, Tilo Burghardt, Sion Hannunna, Dima Damen, Alessandro Masullo, Majid Mirmehdi

Abstract: We present a deep person re-identification approach that combines semantically selective, deep data augmentation with clustering-based network compression to generate high performance, light and fast inference networks. In particular, we propose to augment limited training data via sampling from a deep convolutional generative adversarial network (DCGAN), whose discriminator is constrained by a semantic classifier to explicitly control the domain specificity of the generation process. Thereby, we encode information in the classifier network which can be utilized to steer adversarial synthesis, and which fuels our CondenseNet ID-network training. We provide a quantitative and qualitative analysis of the approach and its variants on a number of datasets, obtaining results that outperform the state-of-the-art on the LIMA dataset for long-term monitoring in indoor living spaces.

Submitted 18 June, 2018; v1 submitted 11 June, 2018; originally announced June 2018.

arXiv:1805.11907 [pdf, other]

A Guide to the SPHERE 100 Homes Study Dataset

Authors: Atis Elsts, Tilo Burghardt, Dallan Byrne, Massimo Camplani, Dima Damen, Xenofon Fafoutis, Sion Hannuna, William Harwin, Michael Holmes, Balazs Janko, Victor Ponce Lopez, Alessandro Masullo, Majid Mirmehdi, George Oikonomou, Robert Piechocki, R. Simon Sherratt, Emma Tonkin, Niall Twomey, Antonis Vafeas, Przemyslaw Woznowski, Ian Craddock

Abstract: The SPHERE project has developed a multi-modal sensor platform for health and behavior monitoring in residential environments. So far, the SPHERE platform has been deployed for data collection in approximately 50 homes for durations of up to one year. This technical document describes the format and the expected content of the SPHERE dataset(s) under preparation. It includes a list of some data quality problems (both known to exist in the dataset(s) and potential ones), their workarounds, and other information important to people working with the SPHERE data, software, and hardware. This document does not aim to be an exhaustive descriptor of the SPHERE dataset(s); it also does not aim to discuss or validate the potential scientific uses of the SPHERE data.

Submitted 30 October, 2018; v1 submitted 30 May, 2018; originally announced May 2018.

arXiv:1805.06749 [pdf, ps, other]

Action Completion: A Temporal Model for Moment Detection

Authors: Farnoosh Heidarivincheh, Majid Mirmehdi, Dima Damen

Abstract: We introduce completion moment detection for actions - the problem of locating the moment of completion, when the action's goal is confidently considered achieved. The paper proposes a joint classification-regression recurrent model that predicts completion from a given frame, and then integrates frame-level contributions to detect sequence-level completion moment. We introduce a recurrent voting node that predicts the frame's relative position of the completion moment by either classification or regression. The method is also capable of detecting incompletion. For example, the method is capable of detecting a missed ball-catch, as well as the moment at which the ball is safely caught. We test the method on 16 actions from three public datasets, covering sports as well as daily actions. Results show that when combining contributions from frames prior to the completion moment as well as frames post completion, the completion moment is detected within one second in 89% of all tested sequences.

Submitted 23 July, 2018; v1 submitted 17 May, 2018; originally announced May 2018.

arXiv:1710.02310 [pdf, ps, other]

Detecting the Moment of Completion: Temporal Models for Localising Action Completion

Authors: Farnoosh Heidarivincheh, Majid Mirmehdi, Dima Damen

Abstract: Action completion detection is the problem of modelling the action's progression towards localising the moment of completion - when the action's goal is confidently considered achieved. In this work, we assess the ability of two temporal models, namely Hidden Markov Models (HMM) and Long-Short Term Memory (LSTM), to localise completion for six object interactions: switch, plug, open, pull, pick and drink. We use a supervised approach, where annotations of pre-completion and post-completion frames are available per action, and fine-tuned CNN features are used to train temporal models. Tested on the Action-Completion-2016 dataset, we detect completion within 10 frames of annotations for ~75% of completed action sequences using both temporal models. Results show that fine-tuned CNN features outperform hand-crafted features for localisation, and that observing incomplete instances is necessary when incomplete sequences are also present in the test set.

Submitted 6 October, 2017; originally announced October 2017.

arXiv:1607.08196 [pdf, other]

Calorie Counter: RGB-Depth Visual Estimation of Energy Expenditure at Home

Authors: Lili Tao, Tilo Burghardt, Majid Mirmehdi, Dima Damen, Ashley Cooper, Sion Hannuna, Massimo Camplani, Adeline Paiement, Ian Craddock

Abstract: We present a new framework for vision-based estimation of calorific expenditure from RGB-D data - the first that is validated on physical gas exchange measurements and applied to daily living scenarios. Deriving a person's energy expenditure from sensors is an important tool in tracking physical activity levels for health and lifestyle monitoring. Most existing methods use metabolic lookup tables (METs) for a manual estimate or systems with inertial sensors which ultimately require users to wear devices. In contrast, the proposed pose-invariant and individual-independent vision framework allows for a remote estimation of calorific expenditure. We introduce, and evaluate our approach on, a new dataset called SPHERE-calorie, for which visual estimates can be compared against simultaneously obtained, indirect calorimetry measures based on gas exchange. We conclude from our experiments that the proposed vision pipeline is suitable for home monitoring in a controlled environment, with calorific expenditure estimates above accuracy levels of commonly used manual estimations via METs. With the dataset released, our work establishes a baseline for future research for this little-explored area of computer vision.

Submitted 27 July, 2016; originally announced July 2016.

arXiv:1606.04450 [pdf, other]

Multiple Human Tracking in RGB-D Data: A Survey

Authors: Massimo Camplani, Adeline Paiement, Majid Mirmehdi, Dima Damen, Sion Hannuna, Tilo Burghardt, Lili Tao

Abstract: Multiple human tracking (MHT) is a fundamental task in many computer vision applications. Appearance-based approaches, primarily formulated on RGB data, are constrained and affected by problems arising from occlusions and/or illumination variations. In recent years, the arrival of cheap RGB-Depth (RGB-D) devices has led to many new approaches to MHT, and many of these integrate color and depth cues to improve each and every stage of the process. In this survey, we present the common processing pipeline of these methods and review their methodology based (a) on how they implement this pipeline and (b) on what role depth plays within each stage of it. We identify and introduce existing, publicly available, benchmark datasets and software resources that fuse color and depth data for MHT. Finally, we present a brief comparative evaluation of the performance of those works that have applied their methods to these datasets.

Submitted 14 June, 2016; originally announced June 2016.

arXiv:1512.07080 [pdf, other]

Cost-based Feature Transfer for Vehicle Occupant Classification

Authors: Toby Perrett, Majid Mirmehdi, Eduardo Dias

Abstract: Knowledge of human presence and interaction in a vehicle is of growing interest to vehicle manufacturers for design and safety purposes. We present a framework to perform the tasks of occupant detection and occupant classification for automatic child locks and airbag suppression. It operates for all passenger seats, using a single overhead camera. A transfer learning technique is introduced to make full use of training data from all seats whilst still maintaining some control over the bias, necessary for a system designed to penalize certain misclassifications more than others. An evaluation is performed on a challenging dataset with both weighted and unweighted classifiers, demonstrating the effectiveness of the transfer process.

Submitted 22 December, 2015; originally announced December 2015.

Comments: 9 pages, 4 figures, 5 tables

ACM Class: I.4.9

Showing 1–37 of 37 results for author: Mirmehdi, M