-
A Review of Multi-Modal Large Language and Vision Models
Authors:
Kilian Carolan,
Laura Fennelly,
Alan F. Smeaton
Abstract:
Large Language Models (LLMs) have recently emerged as a focal point of research and application, driven by their unprecedented ability to understand and generate text with human-like quality. Even more recently, LLMs have been extended into multi-modal large language models (MM-LLMs), which extend their capabilities to deal with image, video and audio information, in addition to text. This opens up applications like text-to-video generation, image captioning, text-to-speech, and more, and is achieved either by retro-fitting an LLM with multi-modal capabilities or by building an MM-LLM from scratch. This paper provides an extensive review of the current state of those LLMs with multi-modal capabilities as well as the very recent MM-LLMs. It covers the historical development of LLMs, especially the advances enabled by transformer-based architectures like OpenAI's GPT series and Google's BERT, as well as the role of attention mechanisms in enhancing model performance. The paper describes the most important LLMs and MM-LLMs and also covers techniques for model tuning, including fine-tuning and prompt engineering, which tailor pre-trained models to specific tasks or domains. Ethical considerations and challenges, such as data bias and model misuse, are also analysed to underscore the importance of responsible AI development and deployment. Finally, we discuss the implications of open-source versus proprietary models in AI research. Through this review, we provide insights into the transformative potential of MM-LLMs in various applications.
Submitted 28 March, 2024;
originally announced April 2024.
-
A Systematic Review of Available Datasets in Additive Manufacturing
Authors:
Xiao Liu,
Alessandra Mileo,
Alan F. Smeaton
Abstract:
In-situ monitoring incorporating data from visual and other sensor technologies allows the collection of extensive datasets during the Additive Manufacturing (AM) process. These datasets have potential for determining the quality of the manufactured output and for detecting defects through the use of Machine Learning during the manufacturing process. Open and annotated datasets derived from AM processes are necessary for the machine learning community to address this opportunity, and their scarcity creates difficulties in the application of computer vision-related machine learning in AM. This systematic review investigates the availability of open image-based datasets originating from AM processes that align with a number of pre-defined selection criteria. The review identifies existing gaps among the current image-based datasets in the domain of AM, and points to the need for greater availability of open datasets in order to allow quality assessment and defect detection during additive manufacturing to develop.
Submitted 27 January, 2024;
originally announced January 2024.
-
Lifelogging As An Extreme Form of Personal Information Management -- What Lessons To Learn
Authors:
Ly-Duyen Tran,
Cathal Gurrin,
Alan F. Smeaton
Abstract:
Personal data includes the digital footprints that we leave behind as part of our everyday activities, both online and offline in the real world. It includes data we collect ourselves, such as from wearables, as well as the data collected by others about our online behaviour and activities. Sometimes we are able to use the personal data we collect ourselves to examine some parts of our lives, but for the most part our personal data is leveraged by third parties, including internet companies, for services like targeted advertising and recommendations. Lifelogging is a form of extreme personal data gathering, and in this article we present an overview of the tools used to manage access to lifelogs as demonstrated at the most recent of the annual Lifelog Search Challenge benchmarking workshops. Here, experimental systems are showcased in live, real-time information seeking tasks by real users. This overview of these systems' capabilities shows the range of possibilities for accessing our own personal data which may, in time, become more easily available as consumer-level services.
Submitted 11 January, 2024;
originally announced January 2024.
-
Video Anomaly Detection via Spatio-Temporal Pseudo-Anomaly Generation: A Unified Approach
Authors:
Ayush K. Rai,
Tarun Krishna,
Feiyan Hu,
Alexandru Drimbarean,
Kevin McGuinness,
Alan F. Smeaton,
Noel E. O'Connor
Abstract:
Video Anomaly Detection (VAD) is an open-set recognition task, which is usually formulated as a one-class classification (OCC) problem, where training data consists of videos with normal instances while test data contains both normal and anomalous instances. Recent works have investigated the creation of pseudo-anomalies (PAs) using only the normal data, making strong assumptions about real-world anomalies with regard to abnormality of objects and speed of motion in order to inject prior information about anomalies into an autoencoder (AE) based reconstruction model during training. This work proposes a novel method for generating generic spatio-temporal PAs by inpainting a masked-out region of an image using a pre-trained Latent Diffusion Model and further perturbing the optical flow using mixup to emulate spatio-temporal distortions in the data. In addition, we present a simple unified framework to detect real-world anomalies under the OCC setting by learning three types of anomaly indicators, namely reconstruction quality, temporal irregularity and semantic inconsistency. Extensive experiments on four VAD benchmark datasets, namely Ped2, Avenue, ShanghaiTech and UBnormal, demonstrate that our method performs on par with other existing state-of-the-art PA generation and reconstruction based methods under the OCC setting. Our analysis also examines the transferability and generalisation of PAs across these datasets, offering valuable insights by identifying real-world anomalies through PAs.
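The mixup perturbation of optical flow mentioned above can be sketched as a pixel-wise convex combination of two flow fields. This is a generic illustration of mixup, not the authors' implementation: how flow fields are paired, the Beta(α, α) distribution for the mixing weight λ, and the default α = 0.4 are all assumptions.

```python
import random
from typing import List, Optional, Tuple

# Assumed representation: an H x W grid of per-pixel (dx, dy) optical flow vectors.
Flow = List[List[Tuple[float, float]]]


def mixup_flow(flow_a: Flow, flow_b: Flow, lam: float) -> Flow:
    """Pixel-wise convex combination lam * flow_a + (1 - lam) * flow_b."""
    if len(flow_a) != len(flow_b) or any(len(ra) != len(rb) for ra, rb in zip(flow_a, flow_b)):
        raise ValueError("flow fields must have the same shape")
    return [
        [(lam * ax + (1 - lam) * bx, lam * ay + (1 - lam) * by)
         for (ax, ay), (bx, by) in zip(row_a, row_b)]
        for row_a, row_b in zip(flow_a, flow_b)
    ]


def sample_lambda(alpha: float = 0.4, rng: Optional[random.Random] = None) -> float:
    """Draw the mixing weight from Beta(alpha, alpha), as in standard mixup (alpha is an assumption)."""
    rng = rng or random.Random()
    return rng.betavariate(alpha, alpha)
```

For example, mixing a one-pixel flow of (1.0, 0.0) with (0.0, 2.0) at λ = 0.25 yields (0.25, 1.5). Smaller α pushes λ towards 0 or 1, so most mixed fields stay close to one of the two inputs.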
Submitted 7 April, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
-
A Comparison of Lexicon-Based and ML-Based Sentiment Analysis: Are There Outlier Words?
Authors:
Siddhant Jaydeep Mahajani,
Shashank Srivastava,
Alan F. Smeaton
Abstract:
Lexicon-based approaches to sentiment analysis of text are based on each word or lexical entry having a pre-defined weight indicating its sentiment polarity. These weights are usually assigned manually, but their accuracy when compared against machine learning based approaches to computing sentiment is not known. It may be that there are lexical entries whose sentiment values cause a lexicon-based approach to give results which are very different to a machine learning approach. In this paper we compute sentiment for more than 150,000 English language texts drawn from 4 domains using the Hedonometer, a lexicon-based technique, and Azure, a contemporary, easy-to-use machine learning based approach which is part of the Azure Cognitive Services family of APIs. We model differences in sentiment scores between approaches for documents in each domain using regression and analyse the independent variables (Hedonometer lexical entries) as indicators of each word's importance and contribution to the score differences. Our findings are that the importance of a word depends on the domain and there are no standout lexical entries which systematically cause differences in sentiment scores.
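As a rough illustration of how a lexicon-based document score is formed, the sketch below takes a frequency-weighted mean of per-word scores. The toy lexicon, its 1 to 9 scale and the optional neutral band around the midpoint 5 are assumptions for illustration; the real Hedonometer word list and settings are not reproduced here.

```python
import re
from collections import Counter
from typing import Dict, Optional

# Hypothetical toy lexicon on a 1 (negative) .. 9 (positive) scale; values are illustrative only.
TOY_LEXICON: Dict[str, float] = {"happy": 8.0, "love": 8.5, "rain": 5.0, "sad": 2.0, "war": 1.5}


def lexicon_score(text: str, lexicon: Dict[str, float], neutral_band: float = 0.0) -> Optional[float]:
    """Frequency-weighted mean of lexicon scores for the words of `text` found in `lexicon`.

    Words whose score lies strictly within `neutral_band` of 5.0 are skipped.
    Returns None when no scored words remain.
    """
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    total, weight = 0.0, 0
    for word, n in counts.items():
        score = lexicon.get(word)
        if score is None or abs(score - 5.0) < neutral_band:
            continue
        total += score * n
        weight += n
    return total / weight if weight else None
```

With this toy lexicon, "Happy happy sad rain" scores (8 + 8 + 2 + 5) / 4 = 5.75, or 6.0 once the neutral word "rain" is excluded with `neutral_band=1.0`. A per-document difference between such a score and an ML-based score is the kind of quantity the paper models with regression.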
Submitted 10 November, 2023;
originally announced November 2023.
-
Memories in the Making: Predicting Video Memorability with Encoding Phase EEG
Authors:
Lorin Sweeney,
Graham Healy,
Alan F. Smeaton
Abstract:
In a world of ephemeral moments, our brain diligently sieves through a cascade of experiences, like a skilled gold prospector searching for precious nuggets amidst the river's relentless flow. This study delves into the elusive "moment of memorability" -- a fleeting, yet vital instant where experiences are prioritised for consolidation in our memory. By transforming subjects' encoding phase electroencephalography (EEG) signals into the visual domain using scaleograms and leveraging deep learning techniques, we investigate the neural signatures that underpin this moment, with the aim of predicting subject-specific recognition of video. Our findings not only support the involvement of theta band (4-8Hz) oscillations over the right temporal lobe in the encoding of declarative memory, but also support the existence of a distinct moment of memorability, akin to the gold nuggets that define our personal river of experiences.
Submitted 16 August, 2023;
originally announced September 2023.
-
Heart Rate Detection Using an Event Camera
Authors:
Aniket Jagtap,
RamaKrishna Venkatesh Saripalli,
Joe Lemley,
Waseem Shariff,
Alan F. Smeaton
Abstract:
Event cameras, also known as neuromorphic cameras, are an emerging technology that offers advantages over traditional shutter and frame-based cameras, including high temporal resolution, low power consumption, and selective data acquisition. In this study, we propose to harness the capabilities of event-based cameras to capture subtle changes in the surface of the skin caused by the pulsatile flow of blood in the wrist region. We investigate whether an event camera could be used for continuous noninvasive monitoring of heart rate (HR). Event camera video data from 25 participants, comprising varying age groups and skin colours, was collected and analysed. Ground-truth HR measurements obtained using conventional methods were used to evaluate the accuracy of automatic detection of HR from event camera data. Our experimental results and comparison to the performance of other non-contact HR measurement methods demonstrate the feasibility of using event cameras for pulse detection. We also acknowledge the challenges and limitations of our method, such as light-induced flickering and the subconscious but naturally-occurring tremors of an individual during data capture.
Submitted 21 September, 2023;
originally announced September 2023.
-
Using Saliency and Cropping to Improve Video Memorability
Authors:
Vaibhav Mudgal,
Qingyang Wang,
Lorin Sweeney,
Alan F. Smeaton
Abstract:
Video memorability is a measure of how likely a particular video is to be remembered by a viewer when that viewer has no emotional connection with the video content. It is an important characteristic as videos that are more memorable are more likely to be shared, viewed, and discussed. This paper presents results of a series of experiments where we improved the memorability of a video by selectively cropping frames based on image saliency. We present results of a basic fixed cropping as well as the results from dynamic cropping, where both the size of the crop and its position within the frame move as the video is played and saliency is tracked. Our results indicate that the memorability score can be improved, especially for videos of low initial memorability.
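A minimal sketch of saliency-guided cropping, assuming a per-frame saliency map is already available as a 2D grid of non-negative values: the crop window is centred on the saliency-weighted centroid and clamped to the frame. The centroid rule and the function name `saliency_crop_box` are illustrative assumptions, not the method used in the paper.

```python
from typing import List, Tuple


def saliency_crop_box(saliency: List[List[float]], crop_w: int, crop_h: int) -> Tuple[int, int, int, int]:
    """Return (x0, y0, x1, y1), exclusive at x1/y1, of a crop centred on the saliency centroid."""
    h, w = len(saliency), len(saliency[0])
    if not (0 < crop_w <= w and 0 < crop_h <= h):
        raise ValueError("crop must fit inside the frame")
    total = sum(sum(row) for row in saliency)
    if total <= 0:
        # No salient pixels: fall back to a centre crop.
        cx, cy = (w - 1) / 2, (h - 1) / 2
    else:
        # Saliency-weighted centroid in pixel coordinates.
        cx = sum(x * v for row in saliency for x, v in enumerate(row)) / total
        cy = sum(y * v for y, row in enumerate(saliency) for v in row) / total
    # Place the window so its centre pixel is as close as possible to the centroid,
    # then clamp it so it stays inside the frame.
    x0 = min(max(round(cx - (crop_w - 1) / 2), 0), w - crop_w)
    y0 = min(max(round(cy - (crop_h - 1) / 2), 0), h - crop_h)
    return x0, y0, x0 + crop_w, y0 + crop_h
```

Applying this per frame gives a moving crop; in practice the centroid would probably need smoothing across frames (for example a moving average) to avoid jitter, and varying `crop_w` and `crop_h` per frame would approximate the dynamic crop size described above.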
Submitted 21 September, 2023;
originally announced September 2023.
-
Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset
Authors:
Iya Chivileva,
Philip Lynch,
Tomas E. Ward,
Alan F. Smeaton
Abstract:
Evaluating the quality of videos generated from text-to-video (T2V) models is important if they are to produce plausible outputs that convince a viewer of their authenticity. We examine some of the metrics used in this area and highlight their limitations. The paper presents a dataset of more than 1,000 generated videos from 5 very recent T2V models on which some of those commonly used quality metrics are applied. We also include extensive human quality evaluations on those videos, allowing the relative strengths and weaknesses of metrics, including human assessment, to be compared. The contribution is an assessment of commonly used quality metrics, and a comparison of their performances and the performance of human evaluations on an open dataset of T2V videos. Our conclusion is that naturalness and semantic matching with the text prompt used to generate the T2V output are important but there is no single measure to capture these subtleties in assessing T2V model output.
Submitted 14 September, 2023;
originally announced September 2023.
-
Handwriting Analysis on the Diaries of Rosamond Jacob
Authors:
Sharmistha S. Sawant,
Saloni D. Thakare,
Derek Greene,
Gerardine Meaney,
Alan F. Smeaton
Abstract:
Handwriting is an art form that most people learn at an early age. Each person's writing style is unique with small changes as we grow older and as our mood changes. Here we analyse handwritten text in a culturally significant personal diary. We compare changes in handwriting and relate this to the sentiment of the written material and to the topic of diary entries. We identify handwritten text from digitised images and generate a canonical form for words using shape matching to compare how the same handwritten word appears over a period of time. For determining the sentiment of diary entries, we use the Hedonometer, a dictionary-based approach to scoring sentiment. We apply these techniques to the historical diary entries of Rosamond Jacob (1888-1960), an Irish writer and political activist whose daily diary entries report on the major events in Ireland during the first half of the last century.
Submitted 16 August, 2023;
originally announced August 2023.
-
Domain Generalisation with Bidirectional Encoder Representations from Vision Transformers
Authors:
Hamza Riaz,
Alan F. Smeaton
Abstract:
Domain generalisation involves pooling knowledge from source domain(s) into a single model that can generalise to unseen target domain(s). Recent research in domain generalisation has faced challenges when using deep learning models as they interact with data distributions which differ from those they are trained on. Here we perform domain generalisation on out-of-distribution (OOD) vision benchmarks using vision transformers. Initially we examine four vision transformer architectures, namely ViT, LeViT, DeiT, and BEIT, on out-of-distribution data. As the bidirectional encoder representation from image transformers (BEIT) architecture performs best, we use it in further experiments on three benchmarks: PACS, Office-Home and DomainNet. Our results show significant improvements in validation and test accuracy, and our implementation significantly narrows the gap between within-distribution and OOD performance.
Submitted 16 July, 2023;
originally announced July 2023.
-
Defect Classification in Additive Manufacturing Using CNN-Based Vision Processing
Authors:
Xiao Liu,
Alessandra Mileo,
Alan F. Smeaton
Abstract:
The development of computer vision and in-situ monitoring using visual sensors allows the collection of large datasets from the additive manufacturing (AM) process. Such datasets could be used with machine learning techniques to improve the quality of AM. This paper examines two scenarios: first, using convolutional neural networks (CNNs) to accurately classify defects in an image dataset from AM, and second, applying active learning techniques to the developed classification model. This allows the construction of a human-in-the-loop mechanism that reduces the amount of data required to train the model and to generate training data.
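The abstract does not say which active learning strategy is used. Least-confidence sampling is one common choice and is shown here only as an assumed example: unlabelled images whose highest predicted class probability is lowest are sent to the human annotator first. The function name and input format are hypothetical.

```python
from typing import Dict, List, Sequence


def least_confident(probs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Pick the k unlabelled samples whose highest class probability is lowest.

    `probs` maps a sample id to the classifier's softmax output for that sample.
    In a human-in-the-loop cycle the selected ids are labelled by an annotator,
    added to the training set, and the model is retrained before the next query.
    """
    ranked = sorted(probs, key=lambda sample_id: max(probs[sample_id]))
    return ranked[:k]
```

For example, given predictions of [0.9, 0.05, 0.05], [0.4, 0.35, 0.25] and [0.6, 0.3, 0.1] for three images, the second and third images (in that order) would be queried first. Entropy or margin-based scores are drop-in alternatives for the sort key.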
Submitted 14 July, 2023;
originally announced July 2023.
-
Automatically detecting activities of daily living from in-home sensors as indicators of routine behaviour in an older population
Authors:
Claire M. Timon,
Pamela Hussey,
Hyowon Lee,
Catriona Murphy,
Harsh Vardan Rai,
Alan F. Smeaton
Abstract:
Objective: The NEX project has developed an integrated Internet of Things (IoT) system coupled with data analytics to offer unobtrusive health and wellness monitoring supporting older adults living independently at home. Monitoring currently involves visualising a set of automatically detected activities of daily living (ADLs) for each participant. ADLs are detected in a way that allows additional participants to be incorporated without re-training the system.
Methods: Following an extensive User Needs and Requirements study involving 426 participants, a pilot trial and a friendly trial of the deployment, an Action Research Cycle (ARC) trial was completed. This involved 23 participants over a 10-week period each with c.20 IoT sensors in their homes. During the ARC trial, participants each took part in two data-informed briefings which presented visualisations of their own in-home activities. The briefings also gathered training data on the accuracy of detected activities. Association rule mining was then used on the combination of data from sensors and participant feedback to improve the automatic detection of ADLs.
Results: Association rule mining was used to detect a range of ADLs for each participant independently of others and was then used to detect ADLs across participants using a single set of rules for each ADL. This allows additional participants to be added without the necessity of them providing training data.
Conclusions: Additional participants can be added to the NEX system without the necessity to re-train the system for automatic detection of the set of their activities of daily living.
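Association rule mining over sensor data combined with participant feedback can be sketched as follows. Each time window is treated as a set of items (sensors that fired plus any participant-confirmed ADL label), and rules of the form {sensor events} → ADL are kept when their support and confidence pass thresholds. The item names, window definition and thresholds are hypothetical; the NEX system's actual rule format is not described in the abstract. Pooling windows from all participants before mining would give one rule set per ADL, in the spirit of the cross-participant detection reported above.

```python
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple


def support(itemset: FrozenSet[str], windows: List[Set[str]]) -> float:
    """Fraction of time windows that contain every item in `itemset`."""
    return sum(itemset <= w for w in windows) / len(windows)


def mine_rules(windows: List[Set[str]], target: str, min_support: float,
               min_confidence: float, max_len: int = 2) -> List[Tuple[FrozenSet[str], float, float]]:
    """Brute-force search for rules {sensor events} -> target.

    Returns (antecedent, support, confidence) triples meeting both thresholds.
    Exhaustive enumeration only suits small item sets; Apriori-style pruning
    would be used at scale.
    """
    items = sorted({i for w in windows for i in w} - {target})
    rules = []
    for size in range(1, max_len + 1):
        for combo in combinations(items, size):
            antecedent = frozenset(combo)
            sup_a = support(antecedent, windows)
            if sup_a == 0:
                continue
            sup_rule = support(antecedent | {target}, windows)
            confidence = sup_rule / sup_a
            if sup_rule >= min_support and confidence >= min_confidence:
                rules.append((antecedent, sup_rule, confidence))
    return rules
```

On four hypothetical windows where `kettle_on` always co-occurs with a confirmed `MAKE_TEA` label, the rule {kettle_on} → MAKE_TEA is found with support 0.5 and confidence 1.0, while {kitchen_motion} → MAKE_TEA is rejected at a 0.8 confidence threshold because kitchen motion also occurs without tea-making.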
Submitted 10 July, 2023;
originally announced July 2023.
-
Calculating the matrix profile from noisy data
Authors:
Colin Hehir,
Alan F. Smeaton
Abstract:
The matrix profile (MP) is a data structure computed from a time series which encodes the data required to locate motifs and discords, corresponding to recurring patterns and outliers respectively. When the time series contains noisy data, the conventional approach is to pre-filter it in order to remove noise, but this cannot be applied in unsupervised settings where patterns and outliers are not annotated. The resilience of the algorithm used to generate the MP when faced with noisy data remains unknown. We measure the similarities between the MP from the original time series data and MPs generated from the same data with noisy data added, under a range of parameter settings including adding duplicates and adding irrelevant data. We use three real world data sets drawn from diverse domains for these experiments. Based on dissimilarities between the MPs, our results suggest that MP generation is resilient to a small amount of noise being introduced into the data, but as the amount of noise increases this resilience disappears.
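To make the data structure concrete, here is a brute-force matrix profile: for each length-m subsequence, the z-normalised Euclidean distance to its nearest non-overlapping neighbour. The minimum of the profile marks a motif and the maximum a discord. This O(n²m) version is for illustration only and is not the algorithm evaluated in the paper; the exclusion zone of m // 2 is an assumed choice (libraries use various defaults).

```python
import math
from typing import List, Sequence


def znorm(x: Sequence[float]) -> List[float]:
    """Z-normalise a subsequence; a flat subsequence maps to all zeros."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [0.0] * len(x) if sd == 0 else [(v - mu) / sd for v in x]


def matrix_profile(ts: Sequence[float], m: int) -> List[float]:
    """Nearest-neighbour z-normalised distance for every length-m subsequence of ts."""
    n = len(ts) - m + 1
    if m < 2 or n < 2:
        raise ValueError("need at least two subsequences of length >= 2")
    subs = [znorm(ts[i:i + m]) for i in range(n)]
    excl = max(1, m // 2)  # ignore trivial matches that overlap the query subsequence
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) < excl:
                continue
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            best = min(best, d)
        profile.append(best)
    return profile
```

On a purely periodic series every subsequence has an exact repeat, so the profile is all zeros; injecting a single spike raises the profile only for the subsequences that contain it, which is where the discord appears. Comparing profiles computed before and after adding noise (for example by their mean absolute difference) is one simple way to quantify the kind of resilience studied in the paper.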
Submitted 16 June, 2023;
originally announced June 2023.
-
Enhancing Gappy Speech Audio Signals with Generative Adversarial Networks
Authors:
Deniss Strods,
Alan F. Smeaton
Abstract:
Gaps, dropouts and short clips of corrupted audio are a common problem and particularly annoying when they occur in speech. This paper uses machine learning to regenerate gaps of up to 320ms in an audio speech signal. Audio regeneration is translated into image regeneration by transforming audio into a Mel-spectrogram and using image in-painting to regenerate the gaps. The full Mel-spectrogram is then converted back to audio using the Parallel-WaveGAN vocoder and integrated into the audio stream. Using a sample of 1300 spoken audio clips of between 1 and 10 seconds taken from the publicly-available LJSpeech dataset, our results show regeneration of audio gaps in close to real time using GANs with a GPU-equipped system. As expected, the smaller the gap in the audio, the better the quality of the filled gaps. On a gap of 240ms the average mean opinion score (MOS) for the best performing models was 3.737 on a scale of 1 (worst) to 5 (best), which is sufficient for a human to perceive as close to uninterrupted human speech.
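Turning an audio gap into an image in-painting problem requires mapping the gap to a range of spectrogram frames (columns) to mask. The sketch below does that arithmetic under stated assumptions: a 22,050 Hz sample rate, a hop length of 256 samples, and frame k treated as starting at sample k × hop (a simplification that ignores centre padding). The function names and these parameter values are illustrative assumptions, not values taken from the paper.

```python
import math
from typing import List, Tuple


def gap_frames(start_s: float, gap_ms: float, sr: int = 22050, hop: int = 256) -> Tuple[int, int]:
    """Return the [first, last) spectrogram frame range overlapping an audio gap."""
    start_sample = round(start_s * sr)
    end_sample = start_sample + round(gap_ms / 1000 * sr)  # exclusive end of the gap
    first = start_sample // hop           # frame containing the first missing sample
    last = math.ceil(end_sample / hop)    # one past the frame holding the last missing sample
    return first, last


def mask_frames(n_frames: int, first: int, last: int) -> List[bool]:
    """True for the frames an in-painting model would have to regenerate."""
    return [first <= k < last for k in range(n_frames)]
```

Under these assumptions a 320 ms gap starting at 1.0 s covers frames 86 to 113, i.e. 28 spectrogram columns to in-paint before the vocoder converts the repaired spectrogram back to audio.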
Submitted 9 May, 2023;
originally announced May 2023.
-
Automatic Detection of Signalling Behaviour from Assistance Dogs as they Forecast the Onset of Epileptic Seizures in Humans
Authors:
Hitesh Raju,
Ankit Sharma,
Aoife Smeaton,
Alan F. Smeaton
Abstract:
Epilepsy, or the occurrence of epileptic seizures, is one of the world's most well-known neurological disorders, affecting millions of people. Seizures mostly occur due to non-coordinated electrical discharges in the human brain and may cause damage, including collapse and loss of consciousness. If the onset of a seizure can be forecast, then the subject can be placed into a safe environment or position so that self-injury as a result of a collapse can be minimised. However, there are no definitive methods to predict seizures in an everyday, uncontrolled environment. Previous studies have shown that pet dogs have the ability to detect the onset of an epileptic seizure by scenting the characteristic volatile organic compounds exuded through the skin by a subject prior to a seizure occurring, and there are cases where assistance dogs, trained to scent the onset of a seizure, can signal this to their owner/trainer. In this work we identify how we can automatically detect the signalling behaviours of trained assistance dogs and use this to alert their owner. Using data from an accelerometer worn on the collar of a dog, we describe how we gathered movement data from 11 trained dogs for a total of 107 days as they exhibited signalling behaviour on command. We present the machine learning techniques used to accurately detect signalling from routine dog behaviour. This work is a step towards automatic alerting of the likely onset of an epileptic seizure from the signalling behaviour of a trained assistance dog.
Submitted 11 March, 2023;
originally announced March 2023.
-
Periodicity Intensity Reveals Insights into Time Series Data: Three Use Cases
Authors:
Alan F. Smeaton,
Feiyan Hu
Abstract:
Periodic phenomena are oscillating signals found in many naturally-occurring time series. A periodogram can be used to measure the intensities of oscillations at different frequencies over an entire time series, but sometimes we are interested in measuring how periodicity intensity at a specific frequency varies throughout the time series. This can be done by calculating periodicity intensity within a window, then sliding the window and recalculating the intensity, giving an indication of how periodicity intensity at a specific frequency changes throughout the series. We illustrate three applications of this, the first of which is movements of a herd of new-born calves, where we show how intensity of the 24h periodicity increases and decreases synchronously across the herd. We also show how changes in 24h periodicity intensity of activities detected from in-home sensors can be indicative of overall wellness. We illustrate this on several weeks of sensor data gathered from each of the homes of 23 older adults. Our third application is the intensity of 7-day periodicity of hundreds of University students accessing online resources from a virtual learning environment (VLE) and how the regularity of their weekly learning behaviours changes throughout a teaching semester. The paper demonstrates how periodicity intensity reveals insights into time series data not visible using other forms of analysis.
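The sliding-window idea can be sketched by evaluating a single periodogram ordinate, |DFT(x − mean)|² / N at frequency 1/period, inside each window. This is one plausible reading of "periodicity intensity", assumed for illustration; the paper's exact normalisation and window settings may differ, and the function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def periodogram_power(window: Sequence[float], period: float) -> float:
    """Periodogram ordinate |sum((x_t - mean) * exp(-2*pi*i*t/period))|^2 / N, period in samples."""
    n = len(window)
    mu = sum(window) / n
    f = 1.0 / period
    s = sum((x - mu) * cmath.exp(-2j * math.pi * f * t) for t, x in enumerate(window))
    return abs(s) ** 2 / n


def sliding_intensity(series: Sequence[float], period: float, window: int, step: int) -> List[float]:
    """Periodicity intensity at one fixed period for each window position."""
    return [periodogram_power(series[i:i + window], period)
            for i in range(0, len(series) - window + 1, step)]
```

For hourly data, `period=24` targets the daily cycle. On a unit-amplitude 24-sample sine for four days followed by four flat days, with 48-sample windows and a 24-sample step, the intensity is about 12 while the window is fully periodic, about 3 when it straddles the change, and 0 once the signal is flat.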
Submitted 15 February, 2023;
originally announced February 2023.
-
Unifying Synergies between Self-supervised Learning and Dynamic Computation
Authors:
Tarun Krishna,
Ayush K. Rai,
Alexandru Drimbarean,
Eric Arazo,
Paul Albert,
Alan F. Smeaton,
Kevin McGuinness,
Noel E. O'Connor
Abstract:
Computationally expensive training strategies make self-supervised learning (SSL) impractical for resource-constrained industrial settings. Techniques like knowledge distillation (KD), dynamic computation (DC), and pruning are often used to obtain a lightweight model, which usually involves multiple epochs of fine-tuning (or distilling steps) of a large pre-trained model, making it more computationally challenging. In this work we present a novel perspective on the interplay between SSL and DC paradigms. In particular, we show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in an SSL setting without any additional fine-tuning or pruning steps. The co-evolution during pre-training of both the dense and gated encoders offers a good accuracy-efficiency trade-off and therefore yields a generic and multi-purpose architecture for application-specific industrial settings. Extensive experiments on several image classification benchmarks, including CIFAR-10/100, STL-10 and ImageNet-100, demonstrate that the proposed training strategy provides a dense and corresponding gated sub-network that achieves on-par performance compared with the vanilla self-supervised setting, but with a significant reduction in computation in terms of FLOPs, under a range of target budgets (t_d).
Submitted 9 September, 2023; v1 submitted 22 January, 2023;
originally announced January 2023.
-
Vision Based Machine Learning Algorithms for Out-of-Distribution Generalisation
Authors:
Hamza Riaz,
Alan F. Smeaton
Abstract:
There are many computer vision applications, including object segmentation, classification, object detection and reconstruction, for which machine learning (ML) shows state-of-the-art performance. Nowadays, we can build ML tools for such applications with real-world accuracy. However, each tool works well only within the domain in which it has been trained and developed. Often, when we train a model on a dataset from one specific domain and test it on another, unseen domain, known as an out-of-distribution (OOD) dataset, models or ML tools show a decrease in performance. For instance, when we train a simple classifier on real-world images and apply that model to the same classes but in a different domain, such as cartoons, paintings or sketches, the performance of ML tools is disappointing. This presents serious challenges of domain generalisation (DG), domain adaptation (DA) and domain shift. To enhance the power of ML tools, we can rebuild and retrain models from scratch or we can perform transfer learning. In this paper, we present a comparison study between vision-based technologies for domain-specific and domain-generalised methods. In this research we highlight that simple convolutional neural network (CNN) based deep learning methods perform poorly when they have to tackle domain shift. Experiments are conducted on two popular vision-based benchmarks, PACS and Office-Home. We introduce an implementation pipeline for domain generalisation methods and conventional deep learning models. The outcome confirms that CNN-based deep learning models show poor generalisation compared to other, more extensive methods.
Submitted 17 January, 2023;
originally announced January 2023.
-
Managing Large Dataset Gaps in Urban Air Quality Prediction: DCU-Insight-AQ at MediaEval 2022
Authors:
Dinh Viet Cuong,
Phuc H. Le-Khac,
Adam Stapleton,
Elke Eichlemann,
Mark Roantree,
Alan F. Smeaton
Abstract:
Calculating an Air Quality Index (AQI) typically uses data streams from air quality sensors deployed at fixed locations, and the calculation is a real-time process. If one or more sensors are broken or offline, then the real-time AQI value cannot be computed. Estimating AQI values for some point in the future is a predictive process and uses historical AQI values to train and build models. In this work we focus on gap filling in air quality data, where the task is to predict the AQI 1, 5 and 7 days into the future. The scenario is one where one or more air, weather and traffic sensors are offline, and we explore prediction accuracy in such situations. The work is part of the MediaEval 2022 Urban Air: Urban Life and Air Pollution task, submitted by the DCU-Insight-AQ team, and uses multimodal and crossmodal data consisting of AQI, weather and CCTV traffic images for air pollution prediction.
Submitted 19 December, 2022;
originally announced December 2022.
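The abstract frames future AQI estimation as learning from historical values. As a minimal sketch, assuming a daily AQI series in which readings lost to offline sensors are stored as `None`, the hypothetical helper `make_supervised_pairs` below turns such a series into (history, target) training pairs for a 1-, 5- or 7-day horizon. Skipping every window that touches a gap is an illustrative assumption, not the team's method, which also draws on weather and CCTV traffic data.

```python
from typing import List, Optional, Sequence, Tuple


def make_supervised_pairs(
    series: Sequence[Optional[float]], n_lags: int, horizon: int
) -> List[Tuple[List[float], float]]:
    """Hypothetical helper: build (past n_lags values, value `horizon` days ahead)
    pairs from a daily AQI series, skipping any window that contains a missing
    reading (None) in either the inputs or the target."""
    if n_lags < 1 or horizon < 1:
        raise ValueError("n_lags and horizon must be >= 1")
    pairs = []
    # t is the index of the last observed day in the input window.
    for t in range(n_lags - 1, len(series) - horizon):
        window = series[t - n_lags + 1 : t + 1]
        target = series[t + horizon]
        if target is None or any(v is None for v in window):
            continue  # gap in the data: this example cannot be formed
        pairs.append((list(window), target))
    return pairs
```

For example, `make_supervised_pairs([10, 12, None, 15, 16, 18, 20], n_lags=2, horizon=1)` keeps only the two windows that avoid the missing third reading. A model fitted on these pairs could then be reused whenever the latest readings are complete.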
-
Diffusing Surrogate Dreams of Video Scenes to Predict Video Memorability
Authors:
Lorin Sweeney,
Graham Healy,
Alan F. Smeaton
Abstract:
As part of the MediaEval 2022 Predicting Video Memorability task, we explore the relationship between visual memorability, the visual representation that characterises it, and the underlying concept portrayed by that visual representation. We achieve state-of-the-art memorability prediction performance with a model trained and tested exclusively on surrogate dream images, elevating concepts to the status of a cornerstone memorability feature, and find strong evidence to suggest that the intrinsic memorability of visual content can be distilled to its underlying concept or meaning, irrespective of its specific visual representation.
Submitted 19 December, 2022;
originally announced December 2022.
-
Overview of The MediaEval 2022 Predicting Video Memorability Task
Authors:
Lorin Sweeney,
Mihai Gabriel Constantin,
Claire-Hélène Demarty,
Camilo Fosco,
Alba G. Seco de Herrera,
Sebastian Halder,
Graham Healy,
Bogdan Ionescu,
Ana Matran-Fernandez,
Alan F. Smeaton,
Mushfika Sultana
Abstract:
This paper describes the 5th edition of the Predicting Video Memorability Task as part of MediaEval 2022. This year we have reorganised and simplified the task in order to encourage a greater depth of inquiry. As last year, two datasets are provided in order to facilitate generalisation; however, this year we have replaced the TRECVid2019 Video-to-Text dataset with the VideoMem dataset in order to remedy underlying data quality issues, and to prioritise short-term memorability prediction by elevating the Memento10k dataset to the primary dataset. Additionally, a fully fledged electroencephalography (EEG)-based prediction sub-task is introduced. In this paper, we outline the core facets of the task and its constituent sub-tasks, describing the datasets, evaluation metrics, and requirements for participant submissions.
Submitted 13 December, 2022;
originally announced December 2022.
-
An adaptive human-in-the-loop approach to emission detection of Additive Manufacturing processes and active learning with computer vision
Authors:
Xiao Liu,
Alan F. Smeaton,
Alessandra Mileo
Abstract:
Recent developments in in-situ monitoring and process control in Additive Manufacturing (AM), also known as 3D-printing, allow the collection of large amounts of emission data during the build process of the parts being manufactured. This data can be used as input into 3D and 2D representations of the 3D-printed parts. However, the analysis, use and characterization of this data still remain a manual process. The aim of this paper is to propose an adaptive human-in-the-loop approach using Machine Learning techniques that automatically inspect and annotate the emission data generated during the AM process. More specifically, this paper looks at two scenarios: firstly, using convolutional neural networks (CNNs) to automatically inspect and classify emission data collected by in-situ monitoring, and secondly, applying Active Learning techniques to the developed classification model to construct a human-in-the-loop mechanism that accelerates the labeling of the emission data. The CNN-based approach relies on transfer learning and fine-tuning, which makes the approach applicable to other industrial image patterns. The adaptive nature of the approach is enabled by an uncertainty sampling strategy that automatically selects the samples to be presented to human experts for annotation.
Submitted 12 December, 2022;
originally announced December 2022.
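The uncertainty sampling step in the abstract above can be illustrated with a minimal sketch of the common least-confidence criterion. The abstract does not say which uncertainty measure was used, so least confidence is an assumption here, as are the hypothetical function name `least_confidence_query` and the input format (one list of class probabilities per unlabelled emission image, e.g. the CNN's softmax outputs).

```python
from typing import List, Sequence


def least_confidence_query(probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Hypothetical helper: return the indices of the k unlabelled samples whose
    top predicted class probability is lowest, i.e. those the classifier is least
    sure about and which would be sent to a human expert for labeling."""
    # Least-confidence score: 1 - probability of the most likely class.
    scores = [1.0 - max(p) for p in probs]
    # Most uncertain first; Python's sort is stable, so ties keep pool order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy pool of four images with three (made-up) emission classes.
pool = [
    [0.95, 0.03, 0.02],  # confident prediction
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
    [0.34, 0.33, 0.33],  # almost uniform, so most uncertain
]
to_annotate = least_confidence_query(pool, k=2)  # indices 3 and 1
```

In a human-in-the-loop cycle, the selected images would be labelled by an expert, added to the training set, and the CNN fine-tuned again before the next query.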
-
Experiences from the MediaEval Predicting Media Memorability Task
Authors:
Alba García Seco de Herrera,
Mihai Gabriel Constantin,
Claire-Hélène Demarty,
Camilo Fosco,
Sebastian Halder,
Graham Healy,
Bogdan Ionescu,
Ana Matran-Fernandez,
Alan F. Smeaton,
Mushfika Sultana,
Lorin Sweeney
Abstract:
The Predicting Media Memorability task in the MediaEval evaluation campaign has been running annually since 2018 and several different tasks and data sets have been used in this time. This has allowed us to compare the performance of many memorability prediction techniques on the same data and in a reproducible way and to refine and improve on those techniques. The resources created to compute media memorability are now being used by researchers well beyond the actual evaluation campaign. In this paper we present a summary of the task, including the collective lessons we have learned for the research community.
Submitted 7 December, 2022;
originally announced December 2022.
-
Motion Aware Self-Supervision for Generic Event Boundary Detection
Authors:
Ayush K. Rai,
Tarun Krishna,
Julia Dietlmeier,
Kevin McGuinness,
Alan F. Smeaton,
Noel E. O'Connor
Abstract:
The task of Generic Event Boundary Detection (GEBD) aims to detect moments in videos that are naturally perceived by humans as generic and taxonomy-free event boundaries. Modeling the dynamically evolving temporal and spatial changes in a video makes GEBD a difficult problem to solve. Existing approaches involve very complex and sophisticated pipelines in terms of architectural design choices, hence creating a need for more straightforward and simplified approaches. In this work, we address this issue by revisiting a simple and effective self-supervised method and augmenting it with a differentiable motion feature learning module to tackle the spatial and temporal diversities in the GEBD task. We perform extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets to demonstrate the efficacy of the proposed approach compared to other state-of-the-art self-supervised methods. We also show that this simple self-supervised approach learns motion features without any explicit motion-specific pretext task.
Submitted 12 October, 2022; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Analysing the Memorability of a Procedural Crime-Drama TV Series, CSI
Authors:
Sean Cummins,
Lorin Sweeney,
Alan F. Smeaton
Abstract:
We investigate the memorability of a 5-season span of a popular crime-drama TV series, CSI, through the application of a vision transformer fine-tuned on the task of predicting video memorability. By investigating the popular genre of crime-drama TV through the use of a detailed annotated corpus combined with video memorability scores, we show how to extrapolate meaning from the memorability scores generated on video shots. We perform a quantitative analysis to relate video shot memorability to a variety of aspects of the show. The insights we present in this paper illustrate the importance of video memorability in applications which use multimedia in areas like education, marketing and indexing, as well as, in the case here, TV and film production.
Submitted 6 August, 2022;
originally announced August 2022.
-
Playback-centric visualisations of video usage using weighted interactions to guide where to watch in an educational context
Authors:
Hyowon Lee,
Mingming Liu,
Michael Scriney,
Alan F. Smeaton
Abstract:
The increase in use of online educational tools has led to a large amount of educational video material being made available for students. Finding the right video content is usually supported by the overarching learning management system and its interface, which organises video items by course, category and week, and makes them searchable. However, once a video is found, students are left without further guidance as to which parts of that video they should focus on. In this article, an additional timeline visualisation to augment the conventional playback timeline is introduced, which employs a novel playback weighting strategy in which the history of different video interactions generates scores based on the context of each playback. The resultant scores are presented on the additional timeline, making it in effect a playback-centric usage graph nuanced by how each playback was executed. Students can selectively watch those portions which the contour of the usage visualisation suggests. The visualisation was implemented and deployed in an undergraduate course at a university for two full semesters. A total of 270 students used the system throughout both semesters, watching 52 videos, guided by visualisations on what to watch. Analysis of playback logs revealed that students selectively watched the portions corresponding to the most important parts of the videos, as assessed by the instructor who created the videos. The characteristics of this approach as a way of guiding students on where to watch, and as a complementary tool for playback analysis, are discussed. Further insights into the potential value of this visualisation and its underlying playback weighting strategy are also discussed.
Submitted 26 July, 2022;
originally announced July 2022.
-
Dynamic Channel Selection in Self-Supervised Learning
Authors:
Tarun Krishna,
Ayush K. Rai,
Yasser A. D. Djilali,
Alan F. Smeaton,
Kevin McGuinness,
Noel E. O'Connor
Abstract:
Whilst computer vision models built using self-supervised approaches are now commonplace, some important questions remain. Do self-supervised models learn highly redundant channel features? What if a self-supervised network could dynamically select the important channels and get rid of the unnecessary ones? Currently, convnets pre-trained with self-supervision obtain performance on downstream tasks comparable to that of their supervised counterparts in computer vision. However, self-supervised models have drawbacks, including their large numbers of parameters, computationally expensive training strategies and a clear need for faster inference on downstream tasks. In this work, our goal is to address the latter by studying how a standard channel selection method developed for supervised learning can be applied to networks trained with self-supervision. We validate our findings on a range of target budgets $t_{d}$ for channel computation on image classification tasks across different datasets, specifically CIFAR-10, CIFAR-100, and ImageNet-100, obtaining performance comparable to that of the original network when selecting all channels, but at a significant reduction in computation reported in terms of FLOPs.
Submitted 16 December, 2022; v1 submitted 25 July, 2022;
originally announced July 2022.
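To make the idea of a target budget $t_{d}$ concrete, here is a minimal sketch that keeps only the highest-scoring fraction $t_{d}$ of a layer's channels for one input. It is a simplified, hypothetical hard top-k gate over made-up gate scores, not the channel selection method studied in the paper, which the abstract does not specify; in a trained dynamic network the scores would come from a learned gating module. The names `budget_channel_mask` and `active_fraction` are illustrative.

```python
import math
from typing import List, Sequence


def budget_channel_mask(gate_scores: Sequence[float], t_d: float) -> List[int]:
    """Hypothetical hard top-k gate: return a 0/1 mask that keeps the
    highest-scoring channels so that at most a fraction t_d of them are active
    (at least one channel is always kept)."""
    if not 0.0 < t_d <= 1.0:
        raise ValueError("t_d must lie in (0, 1]")
    n_keep = max(1, math.floor(t_d * len(gate_scores)))
    # Channel indices ordered from most to least important according to the gate.
    ranked = sorted(range(len(gate_scores)), key=lambda c: gate_scores[c], reverse=True)
    mask = [0] * len(gate_scores)
    for c in ranked[:n_keep]:
        mask[c] = 1
    return mask


def active_fraction(mask: Sequence[int]) -> float:
    """Share of active channels. For a convolution whose input channels are
    gated, compute scales roughly linearly with this share, so it serves as a
    rough proxy for the fraction of that layer's FLOPs retained."""
    return sum(mask) / len(mask)


scores = [0.9, 0.1, 0.7, 0.3, 0.5, 0.2, 0.8, 0.4]  # made-up per-channel gate scores
half_budget = budget_channel_mask(scores, t_d=0.5)  # keeps channels 0, 2, 4 and 6
```

A hard top-k rule like this is not differentiable; approaches that learn the gate during training typically rely on relaxations such as straight-through or Gumbel-softmax estimators together with a budget penalty.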
-
Analysis of Individual Conversational Volatility in Tandem Telecollaboration for Second Language Learning
Authors:
Alan F. Smeaton,
Aparajita Dey-Plissonneau,
Hyowon Lee,
Mingming Liu,
Michael Scriney
Abstract:
Second language learning can be enabled by tandem collaboration, where students are grouped into video conference calls while learning the native language of the other student(s) on the calls. This places students in an online environment where the more outgoing can actively contribute and engage in dialogue, while those who are shyer and unsure of their second language skills can sit back and coast through the calls. We have built and deployed the L2L system, which records the timings of conversational utterances from all participants in a call. We generate visualisations, including participation rates and timelines for each student in each call, and present these on a dashboard. We have recently developed a measure called personal conversational volatility, which captures how dynamic each student's contribution to the dialogue has been in each call. We present an analysis of conversational volatility measures for a sample of 19 individual English-speaking students from our University who are learning French, in each of 86 tandem telecollaboration calls over one teaching semester. Our analysis shows there is a need to look into the nature of the interactions and to see whether the discussion topics assigned to the students were too difficult for some of them, which may have influenced their engagement.
Submitted 28 June, 2022;
originally announced June 2022.
-
An Analysis of Conversational Volatility During Telecollaboration Sessions for Second Language Learning
Authors:
Aparajita Dey-Plissonneau,
Hyowon Lee,
Mingming Liu,
Vyoma Patel,
Michael Scriney,
Alan F. Smeaton
Abstract:
Tandem telecollaboration is a pedagogy used in second language learning where mixed groups of students meet online in videoconferencing sessions to practise their conversational skills in their target language. We have built and deployed a system called L2 Learning to support post-session review and self-reflection on students' participation in such meetings. We automatically compute a metric called Conversational Volatility, which quantifies the amount of interaction among participants, indicating how dynamic or flat the conversations were. Our analysis of more than 100 hours of video recordings involving 28 of our students indicates that conversations do not get more dynamic as meetings progress, that there is a wide variety of levels of interaction across students and student groups, and that conversations in French appear to be more animated than those in English, though the reasons for this are not clear.
Submitted 21 April, 2022;
originally announced April 2022.
-
Overview of the EEG Pilot Subtask at MediaEval 2021: Predicting Media Memorability
Authors:
Lorin Sweeney,
Ana Matran-Fernandez,
Sebastian Halder,
Alba G. Seco de Herrera,
Alan Smeaton,
Graham Healy
Abstract:
The aim of the Memorability-EEG pilot subtask at MediaEval 2021 is to promote interest in the use of neural signals, either alone or in combination with other data sources, in the context of predicting video memorability by highlighting the utility of EEG data. The dataset created consists of pre-extracted features from EEG recordings of subjects watching a subset of videos from Predicting Media Memorability subtask 1. This demonstration pilot gives interested researchers a sense of how neural signals can be used without any prior domain knowledge, and enables them to do so in a future memorability task. The dataset can be used to support the exploration of novel machine learning and processing strategies for predicting video memorability, while potentially increasing interdisciplinary interest in the subject of memorability and opening the door to new combined EEG-computer vision approaches.
Submitted 15 December, 2021;
originally announced January 2022.
-
Predicting Media Memorability: Comparing Visual, Textual and Auditory Features
Authors:
Lorin Sweeney,
Graham Healy,
Alan F. Smeaton
Abstract:
This paper describes our approach to the Predicting Media Memorability task in MediaEval 2021, which aims to address the question of media memorability by setting the task of automatically predicting video memorability. This year we tackle the task from a comparative standpoint, looking to gain deeper insights into each of three explored modalities, and using our results from last year's submission (2020) as a point of reference. Our best-performing short-term memorability model (0.132) tested on the TRECVid2019 dataset, just like last year, was a frame-based CNN that was not trained on any TRECVid data, and our best short-term memorability model (0.524) tested on the Memento10k dataset was a Bayesian Ridge Regressor fit with DenseNet121 visual features.
Submitted 15 December, 2021;
originally announced December 2021.
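The scores 0.132 and 0.524 quoted in this abstract are correlation values. Assuming, as in the MediaEval memorability task's official evaluation, that they are Spearman rank correlations between predicted and annotated memorability scores, the minimal pure-Python sketch below shows how that metric is computed: rank both score lists (tied values share their average rank) and take the Pearson correlation of the ranks. The function names are illustrative; in practice `scipy.stats.spearmanr` does the same job.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values all receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j map to ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the two rank vectors.
    Undefined (raises ZeroDivisionError here) if either input is constant."""
    if len(pred) != len(truth) or len(pred) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rp, rt = average_ranks(pred), average_ranks(truth)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    return cov / (var_p * var_t) ** 0.5
```

Because only ranks matter, a model can score well on this metric even if its raw predictions are on a different scale from the ground-truth memorability scores, as long as it orders the videos correctly.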
-
Overview of The MediaEval 2021 Predicting Media Memorability Task
Authors:
Rukiye Savran Kiziltepe,
Mihai Gabriel Constantin,
Claire-Helene Demarty,
Graham Healy,
Camilo Fosco,
Alba Garcia Seco de Herrera,
Sebastian Halder,
Bogdan Ionescu,
Ana Matran-Fernandez,
Alan F. Smeaton,
Lorin Sweeney
Abstract:
This paper describes the MediaEval 2021 Predicting Media Memorability task, which is in its 4th edition this year, as the prediction of short-term and long-term video memorability remains a challenging task. In 2021, two datasets of videos are used: first, a subset of the TRECVid 2019 Video-to-Text dataset; second, the Memento10K dataset in order to provide opportunities to explore cross-dataset generalisation. In addition, an Electroencephalography (EEG)-based prediction pilot subtask is introduced. In this paper, we outline the main aspects of the task and describe the datasets, evaluation metrics, and requirements for participants' submissions.
Submitted 11 December, 2021;
originally announced December 2021.
-
An Annotated Video Dataset for Computing Video Memorability
Authors:
Rukiye Savran Kiziltepe,
Lorin Sweeney,
Mihai Gabriel Constantin,
Faiyaz Doctor,
Alba Garcia Seco de Herrera,
Claire-Helene Demarty,
Graham Healy,
Bogdan Ionescu,
Alan F. Smeaton
Abstract:
Using a collection of publicly available links to short-form video clips averaging 6 seconds in duration, 1,275 users manually annotated each video multiple times to indicate both the long-term and short-term memorability of the videos. The annotations were gathered as part of an online memory game and measured a participant's ability to recall having seen a video previously when shown a collection of videos. The recognition tasks were performed on videos seen within the previous few minutes for short-term memorability and within the previous 24 to 72 hours for long-term memorability. The data include the reaction times for each recognition of each video. Associated with each video are text descriptions (captions) as well as a collection of image-level features computed on 3 frames extracted from each video (start, middle and end). Video-level features are also provided. The dataset was used in the Video Memorability task as part of the MediaEval benchmark in 2020.
Submitted 4 December, 2021;
originally announced December 2021.
-
Using a GAN to Generate Adversarial Examples to Facial Image Recognition
Authors:
Andrew Merrigan,
Alan F. Smeaton
Abstract:
Images posted online present a privacy concern in that they may be used as reference examples for a facial recognition system. Such abuse of images is in violation of privacy rights but is difficult to counter. It is well established that adversarial example images can be created for recognition systems which are based on deep neural networks. These adversarial examples can be used to disrupt the utility of the images as reference examples or training data. In this work we use a Generative Adversarial Network (GAN) to create adversarial examples to deceive facial recognition, and we achieve an acceptable success rate in fooling the face recognition. We reduce the training time for the GAN by removing the discriminator component. Furthermore, our results show that knowledge distillation can be employed to drastically reduce the size of the resulting model without impacting performance, indicating that our contribution could run comfortably on a smartphone.
Submitted 30 November, 2021;
originally announced November 2021.
-
Image Segmentation to Identify Safe Landing Zones for Unmanned Aerial Vehicles
Authors:
Joe Kinahan,
Alan F. Smeaton
Abstract:
There is a marked increase in delivery services in urban areas, and with Jeff Bezos claiming that 86% of the orders that Amazon ships weigh less than 5 lbs, the time is ripe for investigation into economical methods of automating the final stage of the delivery process. With the advent of semi-autonomous drone delivery services, such as Irish startup 'Manna' and Malta's 'Skymax', the final step of the delivery journey remains the most difficult to automate. This paper investigates the use of simple images captured by a single RGB camera on a UAV to distinguish between safe and unsafe landing zones. We investigate semantic image segmentation frameworks as a way to identify safe landing zones and demonstrate the accuracy of lightweight models that minimise the number of sensors needed. By working with images rather than video we reduce the amount of energy needed to identify safe landing zones for a drone, without the need for human intervention.
Submitted 29 November, 2021;
originally announced November 2021.
-
An Investigation into Keystroke Dynamics and Heart Rate Variability as Indicators of Stress
Authors:
Srijith Unni,
Sushma Suryanarayana Gowda,
Alan F. Smeaton
Abstract:
Lifelogging has become a prominent research topic in recent years. Wearable sensors like Fitbits and smart watches are now increasingly popular for recording one's activities. Some researchers are also exploring keystroke dynamics for lifelogging. Keystroke dynamics refers to the process of measuring and assessing a person's typing rhythm on digital devices. A digital footprint is created when a user interacts with devices like keyboards, mobile phones or touch screen panels, and the timing of the keystrokes is unique to each individual, though likely to be affected by factors such as fatigue, distraction or emotional stress. In this work we explore the relationship between keystroke dynamics, as measured by the timing of the top-10 most frequently occurring bigrams in English, and the emotional state and stress of an individual, as measured by heart rate variability (HRV). We collected keystroke data using the Loggerman application while HRV was simultaneously gathered. With this data we performed an analysis to determine the relationship between variations in keystroke dynamics and variations in HRV. Our conclusion is that we need to use a more detailed representation of keystroke timing than the top-10 bigrams, probably personalised to each user.
Submitted 17 November, 2021;
originally announced November 2021.
-
Facilitating reflection in teletandem through automatically generated conversation metrics and playback video
Authors:
Aparajita Dey-Plissonneau,
Hyowon Lee,
Michael Scriney,
Alan F. Smeaton,
Vincent Pradier,
Hamza Riaz
Abstract:
This pilot study focuses on a tool called L2L that allows second language (L2) learners to visualise and analyse their Zoom interactions with native speakers. L2L uses the Zoom transcript to automatically generate conversation metrics, and its playback feature with timestamps allows students to replay any chosen portion of the conversation for post-session reflection and self-review. This exploratory study investigates a seven-week teletandem project in which undergraduate students from an Irish university learning French (B2) interacted via Zoom with their peers from a French university learning English (B2+). Data collected from a survey (N=43) and semi-structured interviews (N=35) show that the quantitative conversation metrics and the qualitative review of the synchronous content helped raise students' confidence levels when engaging with native speakers. Furthermore, these features allowed students to set tangible goals to improve their participation and to be more aware of what, why and how they are learning.
Submitted 18 November, 2021; v1 submitted 16 November, 2021;
originally announced November 2021.
-
Computer Vision for Supporting Image Search
Authors:
Alan F. Smeaton
Abstract:
Computer vision and multimedia information processing have made tremendous progress within the last decade, and many tasks can now be done with a level of accuracy comparable to, or better than, that of humans. This is because we can leverage huge amounts of data available for training, we have vast computer processing power available, and we have seen the evolution of machine learning as a suite of techniques to process data and deliver accurate vision-based systems. What kinds of applications do we use this processing for? We use it in autonomous vehicle navigation, in security applications such as searching CCTV footage, and in medical image analysis for healthcare diagnostics. One application which is not widespread is image or video search carried out directly by users. In this paper we present the need for such image finding or re-finding by examining human memory and how it fails, thus motivating a different approach to image search, which is outlined along with the requirements it places on computer vision.
Submitted 16 November, 2021;
originally announced November 2021.
-
Visual Selective Attention System to Intervene User Attention in Sharing COVID-19 Misinformation
Authors:
Zaid Amin,
Nazlena Mohamad Ali,
Alan F. Smeaton
Abstract:
Information sharing on social media must be accompanied by attentive behavior so that, in a distorted digital environment, users are not rushed and distracted when deciding to share information. The spread of misinformation, especially misinformation related to COVID-19, can divide society and create the negative effects of falsehood. It can also cause individuals to feel fear, health anxiety, and confusion about the treatment of COVID-19. Although much research has focused on understanding human judgment from a psychological standpoint, few studies have addressed the essential issue, at the screening phase, of how technology can intervene in users' attention when they share information. This research aims to intervene in users' attention with a visual selective attention approach. It uses a quantitative method across two studies with pre- and post-intervention experiments. In Study 1, we intervened in user decisions and attention by presenting ten items of information and misinformation as stimuli using the Visual Selective Attention System (VSAS) tool. In Study 2, we identified associations in user tendencies when evaluating information using the Implicit Association Test (IAT). Significant results showed that users' attention and decision behavior improved after using the VSAS. The IAT results show a change in the association of user exposure: after the intervention using VSAS, users tend not to share misinformation about COVID-19. The results are expected to serve as a basis for developing social media applications that combat the negative impact of the COVID-19 misinformation infodemic.
Submitted 9 November, 2021; v1 submitted 26 October, 2021;
originally announced October 2021.
-
The L2L System for Second Language Learning Using Visualised Zoom Calls Among Students
Authors:
Aparajita Dey-Plissonneau,
Hyowon Lee,
Vincent Pradier,
Michael Scriney,
Alan F. Smeaton
Abstract:
An important part of second language learning is conversation, which is best practised with speakers whose native language is the language being learned. We facilitate this by pairing students from different countries who are learning each other's native language. Mixed groups of students have Zoom calls, half in one language and half in the other, in order to practise and improve their conversation skills. We use Zoom video recordings with audio transcripts enabled, which provide recognised speech from which we extract timestamped utterances and then calculate and visualise conversation metrics on a dashboard. A timeline highlights each utterance, colour coded per student, with links to the video in a playback window. L2L was deployed for a semester and recorded almost 250 hours of Zoom meetings. The conversation metrics visualised on the dashboard are a beneficial asset for both students and lecturers.
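A minimal sketch of how conversation metrics could be derived from timestamped utterances, assuming each utterance is a (speaker, start, end) tuple in seconds. Talk time, turn count and share of talk time are illustrative choices, and the speaker names are made up; the abstract does not list the metrics L2L actually displays.

```python
from collections import defaultdict


def conversation_metrics(utterances):
    """Per-speaker talk time (s), number of turns and share of total talk time.

    utterances: list of (speaker, start_s, end_s) tuples (hypothetical format).
    """
    talk = defaultdict(float)
    turns = defaultdict(int)
    previous = None
    for speaker, start, end in sorted(utterances, key=lambda u: u[1]):
        talk[speaker] += end - start
        if speaker != previous:  # a new turn begins whenever the speaker changes
            turns[speaker] += 1
            previous = speaker
    total = sum(talk.values()) or 1.0  # avoid dividing by zero for an empty call
    return {
        s: {"talk_s": talk[s], "turns": turns[s], "share": talk[s] / total}
        for s in talk
    }


utterances = [("Aoife", 0, 10), ("Luc", 10, 25), ("Luc", 26, 30), ("Aoife", 31, 36)]
for speaker, stats in conversation_metrics(utterances).items():
    print(speaker, stats)
```

Per-speaker values like these map naturally onto a dashboard bar chart, while the sorted utterance list itself can drive the colour-coded timeline.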
Submitted 25 June, 2021;
originally announced June 2021.
-
Usage-based Summaries of Learning Videos
Authors:
Hyowon Lee,
Mingming Liu,
Michael Scriney,
Alan F. Smeaton
Abstract:
Much of the delivery of university education is now by synchronous or asynchronous video. For students, one of the challenges is managing the sheer volume of such video material, as video presentations of taught material are difficult to abbreviate and summarise because they do not have highlights which stand out. Apart from video bookmarks, there are no tools available to determine which parts of video content should be replayed at revision time or just before examinations. We have developed and deployed a digital library for managing video learning material which has many dozens of hours of short-form video content from a range of taught courses for hundreds of students at undergraduate level. Through a web browser we allow students to access and play these videos, and we log their anonymised playback usage. From these logs we assign a score to each segment of each video based on the amount of playback it receives from across all students, whether the segment has been rewound and replayed in the same student session, whether the on-screen window is the window in focus on the student's desktop/laptop, and the speed of playback. We also incorporate negative scoring if a video segment is skipped or fast-forwarded and, overarching all this, we include a decay function based on recency of playback, so the most recent days of playback contribute more to the video segment scores. For each video in the library we present a usage-based graph which allows students to see which parts of each video attract the most playback from their peers, which helps them select material at revision time. Usage of the system is fully anonymised and GDPR-compliant.
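The scoring scheme described above can be sketched as follows. Every weight, the half-life of the exponential recency decay, and the treatment of playback speed (faster playback counts for less) are hypothetical choices for illustration; the abstract does not give the actual formula.

```python
from dataclasses import dataclass


@dataclass
class PlaybackEvent:
    segment: int             # index of the video segment that was played
    days_ago: float          # age of the event, used for the recency decay
    replayed: bool = False   # rewound and replayed within the same session
    in_focus: bool = True    # player window was in focus on the student's screen
    speed: float = 1.0       # playback rate (2.0 = double speed)
    skipped: bool = False    # segment skipped or fast-forwarded


def segment_scores(events, n_segments, half_life_days=7.0,
                   replay_bonus=0.5, unfocused_factor=0.5, skip_penalty=1.0):
    """Aggregate anonymised playback events into one score per segment."""
    scores = [0.0] * n_segments
    for e in events:
        decay = 0.5 ** (e.days_ago / half_life_days)  # recent playback counts more
        if e.skipped:
            value = -skip_penalty  # negative evidence for this segment
        else:
            value = (1.0 + (replay_bonus if e.replayed else 0.0)) / e.speed
            if not e.in_focus:
                value *= unfocused_factor
        scores[e.segment] += decay * value
    return scores


events = [
    PlaybackEvent(segment=0, days_ago=0),
    PlaybackEvent(segment=0, days_ago=7, replayed=True),
    PlaybackEvent(segment=1, days_ago=0, speed=2.0),
    PlaybackEvent(segment=2, days_ago=0, skipped=True),
]
print(segment_scores(events, n_segments=3))  # [1.75, 0.5, -1.0]
```

An exponential half-life is one simple way to realise "the most recent days contribute more"; the resulting per-segment list is what a usage-based graph would plot along the video timeline.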
Submitted 25 June, 2021;
originally announced June 2021.
-
Discerning Generic Event Boundaries in Long-Form Wild Videos
Authors:
Ayush K Rai,
Tarun Krishna,
Julia Dietlmeier,
Kevin McGuinness,
Alan F Smeaton,
Noel E O'Connor
Abstract:
Detecting generic, taxonomy-free event boundaries in videos represents a major stride towards holistic video understanding. In this paper we present a technique for generic event boundary detection based on a two-stream inflated 3D convolutions architecture, which can learn spatio-temporal features from videos. Our work is inspired by the Generic Event Boundary Detection Challenge (part of the CVPR 2021 Long Form Video Understanding (LOVEU) Workshop). Throughout the paper we provide an in-depth analysis of the experiments performed along with an interpretation of the results obtained.
Submitted 18 June, 2021;
originally announced June 2021.
-
Improved CNN-based Learning of Interpolation Filters for Low-Complexity Inter Prediction in Video Coding
Authors:
Luka Murn,
Saverio Blasi,
Alan F. Smeaton,
Marta Mrak
Abstract:
The versatility of recent machine learning approaches makes them well suited to improving next-generation video compression solutions. Unfortunately, these approaches typically bring significant increases in computational complexity and are difficult to interpret as explainable models, affecting their potential for implementation within practical video coding applications. This paper introduces a novel explainable neural network-based inter-prediction scheme to improve the interpolation of reference samples needed for fractional-precision motion compensation. The approach requires only a single neural network to be trained, from which a full quarter-pixel interpolation filter set is derived, as the network is easily interpretable due to its linear structure. A novel training framework enables each network branch to resemble a specific fractional shift. This makes the solution efficient to use alongside conventional video coding schemes. When implemented in the context of the state-of-the-art Versatile Video Coding (VVC) test model, average BD-rate savings of 0.77%, 1.27% and 2.25% can be achieved for lower-resolution sequences under the random access, low-delay B and low-delay P configurations, respectively, while the complexity of the learned interpolation schemes is significantly reduced compared to interpolation with full CNNs.
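Why a linear network yields a fixed interpolation filter can be checked with a small 1D example: two stacked convolution layers with no nonlinearity between them are equivalent to a single convolution with the composed kernel. The kernel values below are arbitrary assumptions, and the actual scheme operates on 2D reference blocks with one branch per fractional shift.

```python
def conv1d(signal, kernel):
    """Full 1D discrete convolution: output length len(signal) + len(kernel) - 1."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


# Two linear layers with no activation in between (arbitrary, hypothetical weights).
layer1 = [0.25, 0.5, 0.25]
layer2 = [0.5, 0.5]

# Because convolution is associative, the stack collapses to one fixed filter.
collapsed = conv1d(layer1, layer2)

samples = [1.0, 2.0, 4.0, 8.0]
stacked_out = conv1d(conv1d(samples, layer1), layer2)
single_out = conv1d(samples, collapsed)

print(collapsed)                  # [0.125, 0.375, 0.375, 0.125]
print(stacked_out == single_out)  # True
```

Once collapsed, the filter taps can be read off directly and applied like a conventional interpolation filter, which is what keeps the learned scheme cheap at coding time.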
Submitted 16 June, 2021;
originally announced June 2021.
-
Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods
Authors:
Lifeng Han,
Gareth J. F. Jones,
Alan F. Smeaton
Abstract:
To facilitate effective translation modeling and translation studies, one of the crucial questions to address is how to assess translation quality. From the perspectives of accuracy, reliability, repeatability and cost, translation quality assessment (TQA) itself is a rich and challenging task. In this work, we present a high-level and concise survey of TQA methods, including both manual judgement criteria and automated evaluation metrics, which we classify into further detailed sub-categories. We hope that this work will be an asset for both translation model researchers and quality assessment researchers. In addition, we hope that it will enable practitioners to quickly develop a better understanding of the conventional TQA field, and to find closely relevant evaluation solutions for their own needs. This work may also serve to inspire further development of quality assessment and evaluation methodologies for other natural language processing (NLP) tasks in addition to machine translation (MT), such as automatic text summarization (ATS), natural language understanding (NLU) and natural language generation (NLG).
Submitted 5 May, 2021;
originally announced May 2021.
-
Attention-based Stylisation for Exemplar Image Colourisation
Authors:
Marc Gorriz Blanch,
Issa Khalifeh,
Alan Smeaton,
Noel O'Connor,
Marta Mrak
Abstract:
Exemplar-based colourisation aims to add plausible colours to a grayscale image using the guidance of a colour reference image. Most existing methods tackle the task as a style transfer problem, using a convolutional neural network (CNN) to obtain deep representations of the content of both inputs. Stylised outputs are then obtained by computing similarities between both feature representations in order to transfer the style of the reference to the content of the target input. However, in order to gain robustness towards dissimilar references, the stylised outputs need to be refined with a second colourisation network, which significantly increases the overall system complexity. This work reformulates the existing methodology by introducing a novel end-to-end colourisation network that unifies feature matching with the colourisation process. The proposed architecture integrates attention modules at different resolutions that learn how to perform the style transfer task in an unsupervised way towards decoding realistic colour predictions. Moreover, axial attention is proposed to simplify the attention operations and to obtain a fast, robust and cost-effective architecture. Experimental validation demonstrates the efficiency of the proposed methodology, which generates high-quality and visually appealing colourisations. Furthermore, the complexity of the proposed methodology is reduced compared to state-of-the-art methods.
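To see why axial attention simplifies the attention operations, compare the number of pairwise interactions for an $H \times W$ feature map under the standard formulation of axial attention (attention along rows, then along columns); the exact variant used in the paper may differ.

```latex
\begin{align*}
\text{full self-attention:} \quad & (HW)^2 \\
\text{axial attention (rows, then columns):} \quad & HW \cdot W + HW \cdot H = HW(H + W) \\
\text{reduction factor:} \quad & \frac{(HW)^2}{HW(H + W)} = \frac{HW}{H + W}
\end{align*}
```

For a $64 \times 64$ map this gives $4096^2 = 16{,}777{,}216$ interactions for full attention versus $4096 \times 128 = 524{,}288$ for axial attention, a 32-fold reduction that grows with resolution.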
Submitted 4 May, 2021;
originally announced May 2021.
-
TRECVID 2020: A comprehensive campaign for evaluating video retrieval tasks across multiple application domains
Authors:
George Awad,
Asad A. Butt,
Keith Curtis,
Jonathan Fiscus,
Afzal Godil,
Yooyoung Lee,
Andrew Delgado,
Jesse Zhang,
Eliot Godard,
Baptiste Chocot,
Lukas Diduch,
Jeffrey Liu,
Alan F. Smeaton,
Yvette Graham,
Gareth J. F. Jones,
Wessel Kraaij,
Georges Quenot
Abstract:
The TREC Video Retrieval Evaluation (TRECVID) is a TREC-style video analysis and retrieval evaluation with the goal of promoting progress in research and development of content-based exploitation and retrieval of information from digital video via open, metrics-based evaluation. Over the last twenty years this effort has yielded a better understanding of how systems can effectively accomplish such processing and how one can reliably benchmark their performance. TRECVID has been funded by NIST (National Institute of Standards and Technology) and other US government agencies. In addition, many organizations and individuals worldwide contribute significant time and effort. TRECVID 2020 represented a continuation of four tasks and the addition of two new tasks. In total, 29 teams from various research organizations worldwide completed one or more of the following six tasks: 1. Ad-hoc Video Search (AVS), 2. Instance Search (INS), 3. Disaster Scene Description and Indexing (DSDI), 4. Video to Text Description (VTT), 5. Activities in Extended Video (ActEV), 6. Video Summarization (VSUM). This paper is an introduction to the evaluation framework, tasks, data, and measures used in the evaluation campaign.
Submitted 27 April, 2021;
originally announced April 2021.
-
The Influence of Audio on Video Memorability with an Audio Gestalt Regulated Video Memorability System
Authors:
Lorin Sweeney,
Graham Healy,
Alan F. Smeaton
Abstract:
Memories are the tethering threads that tie us to the world, and memorability is the measure of their tensile strength. The threads of memory are spun from fibres of many modalities, obscuring the contribution of a single fibre to a thread's overall tensile strength. Unfurling these fibres is the key to understanding the nature of their interaction, and how we can ultimately create more meaningful media content. In this paper, we examine the influence of audio on video recognition memorability, finding evidence to suggest that audio rich in high-level (gestalt) features can facilitate overall video recognition memorability. We introduce a novel multimodal deep learning-based late-fusion system that uses audio gestalt to estimate the influence of a given video's audio on its overall short-term recognition memorability, and selectively leverages audio features to make a prediction accordingly. We benchmark our audio gestalt-based system on the Memento10k short-term video memorability dataset, achieving top-2 state-of-the-art results.
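A minimal sketch of gated late fusion in the spirit described above: an audio-gestalt score decides whether the audio branch contributes to the final memorability prediction. The threshold, the equal weighting and the function name are hypothetical, and the abstract does not specify the system's actual fusion rule.

```python
def fused_memorability(visual_score, audio_score, gestalt,
                       threshold=0.5, audio_weight=0.5):
    """Late fusion that consults the audio branch only when audio gestalt is high.

    All scores are assumed to lie in [0, 1]; threshold and weight are hypothetical.
    """
    if not 0.0 <= gestalt <= 1.0:
        raise ValueError("gestalt must lie in [0, 1]")
    if gestalt < threshold:
        return visual_score  # audio judged uninformative: ignore it
    return (1.0 - audio_weight) * visual_score + audio_weight * audio_score


print(fused_memorability(0.8, 0.6, gestalt=0.2))  # 0.8 (visual branch only)
print(fused_memorability(0.8, 0.6, gestalt=0.9))  # blends both branches
```

A hard gate keeps the design easy to interpret; a soft alternative would scale `audio_weight` by the gestalt score itself.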
Submitted 23 April, 2021;
originally announced April 2021.
-
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Authors:
Lifeng Han,
Gareth J. F. Jones,
Alan F. Smeaton,
Paolo Bolzoni
Abstract:
Chinese character decomposition has been used as a feature to enhance Machine Translation (MT) models, combining radicals into character- and word-level models. Recent work has investigated ideograph- or stroke-level embedding. However, questions remain about which decomposition levels of Chinese character representations, radicals or strokes, are best suited for MT. To investigate the impact of Chinese decomposition embedding in detail, i.e., radical, stroke, and intermediate levels, and how well these decompositions represent the meaning of the original character sequences, we carry out analysis with both automated and human evaluation of MT. Furthermore, we investigate whether the combination of decomposed Multiword Expressions (MWEs) can enhance model learning. MWE integration into MT has seen more than a decade of exploration. However, decomposed MWEs have not previously been explored.
Submitted 9 April, 2021;
originally announced April 2021.
-
Attention-Based Neural Networks for Chroma Intra Prediction in Video Coding
Authors:
Marc Górriz,
Saverio Blasi,
Alan F. Smeaton,
Noel E. O'Connor,
Marta Mrak
Abstract:
Neural networks can be successfully used to improve several modules of advanced video coding schemes. In particular, compression of colour components has been shown to benefit greatly from the use of machine learning models, thanks to the design of appropriate attention-based architectures that allow the prediction to exploit specific samples in the reference region. However, such architectures tend to be complex and computationally intensive, and may be difficult to deploy in a practical video coding pipeline. This work focuses on reducing the complexity of such methodologies, designing a set of simplified and cost-effective attention-based architectures for chroma intra-prediction. A novel size-agnostic multi-model approach is proposed to reduce the complexity of the inference process. The resulting simplified architecture is still capable of outperforming state-of-the-art methods. Moreover, a collection of simplifications is presented in this paper to further reduce the complexity overhead of the proposed prediction architecture. Thanks to these simplifications, a reduction in the number of parameters of around 90% is achieved with respect to the original attention-based methodologies. Simplifications include a framework for reducing the overhead of the convolutional operations, a simplified cross-component processing model integrated into the original architecture, and a methodology to perform integer-precision approximations with the aim of obtaining fast and hardware-aware implementations. The proposed schemes are integrated into the Versatile Video Coding (VVC) prediction pipeline, retaining the compression efficiency of state-of-the-art chroma intra-prediction methods based on neural networks, while offering different directions for significantly reducing coding complexity.
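A generic fixed-point sketch of what an integer-precision approximation can look like: float weights are scaled by $2^{f}$ and rounded, the dot product runs in integers, and a rounding right shift rescales the result. The number of fractional bits, the weights and the rounding rule are assumptions; the paper's own approximation scheme is not detailed in the abstract.

```python
def quantise(weights, frac_bits):
    """Map float weights to integers with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return [round(w * scale) for w in weights]  # nearest integer (ties to even)


def int_dot(q_weights, samples, frac_bits):
    """Integer-only dot product, rescaled with a rounding right shift (frac_bits >= 1)."""
    acc = sum(w * s for w, s in zip(q_weights, samples))
    return (acc + (1 << (frac_bits - 1))) >> frac_bits  # round half up


weights = [0.5, -0.25, 0.75]   # hypothetical trained weights
samples = [100, 200, 40]       # integer reference samples
q = quantise(weights, frac_bits=6)
print(q)                                              # [32, -16, 48]
print(int_dot(q, samples, frac_bits=6))               # 30
print(sum(w * s for w, s in zip(weights, samples)))   # 30.0 (floating-point reference)
```

Here the weights are exactly representable with 6 fractional bits, so the integer result matches the float one; in general, more fractional bits trade accumulator width for lower approximation error.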
Submitted 9 February, 2021;
originally announced February 2021.