Search | arXiv e-print repository

An Effective-Efficient Approach for Dense Multi-Label Action Detection

Authors: Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton

Abstract: Unlike the sparse label action detection task, where a single action occurs in each timestamp of a video, in a dense multi-label scenario, actions can overlap. To address this challenging task, it is necessary to simultaneously learn (i) temporal dependencies and (ii) co-occurrence action relationships. Recent approaches model temporal information by extracting multi-scale features through hierarc… ▽ More Unlike the sparse label action detection task, where a single action occurs in each timestamp of a video, in a dense multi-label scenario, actions can overlap. To address this challenging task, it is necessary to simultaneously learn (i) temporal dependencies and (ii) co-occurrence action relationships. Recent approaches model temporal information by extracting multi-scale features through hierarchical transformer-based networks. However, the self-attention mechanism in transformers inherently loses temporal positional information. We argue that combining this with multiple sub-sampling processes in hierarchical designs can lead to further loss of positional information. Preserving this information is essential for accurate action detection. In this paper, we address this issue by proposing a novel transformer-based network that (a) employs a non-hierarchical structure when modelling different ranges of temporal dependencies and (b) embeds relative positional encoding in its transformer layers. Furthermore, to model co-occurrence action relationships, current methods explicitly embed class relations into the transformer network. However, these approaches are not computationally efficient, as the network needs to compute all possible pair action class relations. We also overcome this challenge by introducing a novel learning paradigm that allows the network to benefit from explicitly modelling temporal co-occurrence action dependencies without imposing their additional computational costs during inference. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets and show that our method improves the current state-of-the-art results. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: 14 pages. arXiv admin note: substantial text overlap with arXiv:2308.05051

arXiv:2406.00495 [pdf, other]

Audio-Visual Talker Localization in Video for Spatial Sound Reproduction

Authors: Davide Berghi, Philip J. B. Jackson

Abstract: Object-based audio production requires the positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. In many media production use cases, both cameras and microphones are employed to make recordings, and the human voice is often a key element. In this research, we detect and locate the active speaker in the video, facilitating t… ▽ More Object-based audio production requires the positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. In many media production use cases, both cameras and microphones are employed to make recordings, and the human voice is often a key element. In this research, we detect and locate the active speaker in the video, facilitating the automatic extraction of the positional metadata of the talker relative to the camera's reference frame. With the integration of the visual modality, this study expands upon our previous investigation focused solely on audio-based active speaker detection and localization. Our experiments compare conventional audio-visual approaches for active speaker detection that leverage monaural audio, our previous audio-only method that leverages multichannel recordings from a microphone array, and a novel audio-visual approach integrating vision and multichannel audio. We found the role of the two modalities to complement each other. Multichannel audio, overcoming the problem of visual occlusions, provides a double-digit reduction in detection error compared to audio-visual methods with single-channel audio. The combination of multichannel audio and vision further enhances spatial accuracy, leading to a four-percentage point increase in F1 score on the Tragic Talkers dataset. Future investigations will assess the robustness of the model in noisy and highly reverberant environments, as well as tackle the problem of off-screen speakers. △ Less

Submitted 1 June, 2024; originally announced June 2024.

arXiv:2405.10690 [pdf, other]

CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

Authors: Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton

Abstract: Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario, it negatively impacts… ▽ More Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario, it negatively impacts unaligned audible or visible events by introducing irrelevant modality information. In this paper, we propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space such that the network explicitly learns to combine cross-modal information for audible-visible events while filtering them out for unaligned events. Additionally, as videos often involve complex class relationships, modelling them improves performance. However, this introduces extra computational costs into the network. Our framework is designed to leverage cross-class relationships during training without incurring additional computations at inference. Furthermore, we propose new metrics to better evaluate a method's capabilities in performing AVVP. Our extensive experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, respectively. △ Less

Submitted 20 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

arXiv:2401.00128 [pdf]

Quantifying intra-tumoral genetic heterogeneity of glioblastoma toward precision medicine using MRI and a data-inclusive machine learning algorithm

Authors: Lujia Wang, Hairong Wang, Fulvio D'Angelo, Lee Curtin, Christopher P. Sereduk, Gustavo De Leon, Kyle W. Singleton, Javier Urcuyo, Andrea Hawkins-Daarud, Pamela R. Jackson, Chandan Krishna, Richard S. Zimmerman, Devi P. Patra, Bernard R. Bendok, Kris A. Smith, Peter Nakaji, Kliment Donev, Leslie C. Baxter, Maciej M. Mrugała, Michele Ceccarelli, Antonio Iavarone, Kristin R. Swanson, Nhan L. Tran, Leland S. Hu, **g Li

Abstract: Glioblastoma (GBM) is one of the most aggressive and lethal human cancers. Intra-tumoral genetic heterogeneity poses a significant challenge for treatment. Biopsy is invasive, which motivates the development of non-invasive, MRI-based machine learning (ML) models to quantify intra-tumoral genetic heterogeneity for each patient. This capability holds great promise for enabling better therapeutic se… ▽ More Glioblastoma (GBM) is one of the most aggressive and lethal human cancers. Intra-tumoral genetic heterogeneity poses a significant challenge for treatment. Biopsy is invasive, which motivates the development of non-invasive, MRI-based machine learning (ML) models to quantify intra-tumoral genetic heterogeneity for each patient. This capability holds great promise for enabling better therapeutic selection to improve patient outcomes. We proposed a novel Weakly Supervised Ordinal Support Vector Machine (WSO-SVM) to predict regional genetic alteration status within each GBM tumor using MRI. WSO-SVM was applied to a unique dataset of 318 image-localized biopsies with spatially matched multiparametric MRI from 74 GBM patients. The model was trained to predict the regional genetic alteration of three GBM driver genes (EGFR, PDGFRA, and PTEN) based on features extracted from the corresponding region of five MRI contrast images. For comparison, a variety of existing ML algorithms were also applied. The classification accuracy of each gene was compared between the different algorithms. The SHapley Additive exPlanations (SHAP) method was further applied to compute contribution scores of different contrast images. Finally, the trained WSO-SVM was used to generate prediction maps within the tumoral area of each patient to help visualize the intra-tumoral genetic heterogeneity. This study demonstrated the feasibility of using MRI and WSO-SVM to enable non-invasive prediction of intra-tumoral regional genetic alteration for each GBM patient, which can inform future adaptive therapies for individualized oncology. △ Less

Submitted 29 December, 2023; originally announced January 2024.

Comments: 36 pages, 8 figures, 3 tables

arXiv:2312.14021 [pdf, other]

doi 10.1109/TASLP.2023.3346643

Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization

Authors: Davide Berghi, Philip J. B. Jackson

Abstract: Conventional audio-visual approaches for active speaker detection (ASD) typically rely on visually pre-extracted face tracks and the corresponding single-channel audio to find the speaker in a video. Therefore, they tend to fail every time the face of the speaker is not visible. We demonstrate that a simple audio convolutional recurrent neural network (CRNN) trained with spatial input features ext… ▽ More Conventional audio-visual approaches for active speaker detection (ASD) typically rely on visually pre-extracted face tracks and the corresponding single-channel audio to find the speaker in a video. Therefore, they tend to fail every time the face of the speaker is not visible. We demonstrate that a simple audio convolutional recurrent neural network (CRNN) trained with spatial input features extracted from multichannel audio can perform simultaneous horizontal active speaker detection and localization (ASDL), independently of the visual modality. To address the time and cost of generating ground truth labels to train such a system, we propose a new self-supervised training pipeline that embraces a ``student-teacher'' learning approach. A conventional pre-trained active speaker detector is adopted as a ``teacher'' network to provide the position of the speakers as pseudo-labels. The multichannel audio ``student'' network is trained to generate the same results. At inference, the student network can generalize and locate also the occluded speakers that the teacher network is not able to detect visually, yielding considerable improvements in recall rate. Experiments on the TragicTalkers dataset show that an audio network trained with the proposed self-supervised learning approach can exceed the performance of the typical audio-visual methods and produce results competitive with the costly conventional supervised training. We demonstrate that improvements can be achieved when minimal manual supervision is introduced in the learning pipeline. Further gains may be sought with larger training sets and integrating vision with the multichannel audio system. △ Less

Submitted 21 December, 2023; originally announced December 2023.

arXiv:2312.09034 [pdf, other]

Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection

Authors: Davide Berghi, Peipei Wu, **zheng Zhao, Wenwu Wang, Philip J. B. Jackson

Abstract: Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has been recently included. Few audio-visual (AV)-SELD works have been published and most employ vision via face/object bounding boxes, or human pose keypoints. In contrast, we explore th… ▽ More Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has been recently included. Few audio-visual (AV)-SELD works have been published and most employ vision via face/object bounding boxes, or human pose keypoints. In contrast, we explore the integration of audio and visual feature embeddings extracted with pre-trained deep networks. For the visual modality, we tested ResNet50 and Inflated 3D ConvNet (I3D). Our comparison of AV fusion methods includes the AV-Conformer and Cross-Modal Attentive Fusion (CMAF) model. Our best models outperform the DCASE 2023 Task3 audio-only and AV baselines by a wide margin on the development set of the STARSS23 dataset, making them competitive amongst state-of-the-art results of the AV challenge, without model ensembling, heavy data augmentation, or prediction post-processing. Such techniques and further pre-training could be applied as next steps to improve performance. △ Less

Submitted 14 December, 2023; originally announced December 2023.

Comments: ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:2310.14778 [pdf, other]

Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

Authors: **zheng Zhao, Yong Xu, Xinyuan Qian, Davide Berghi, Peipei Wu, Meng Cui, Jianyuan Sun, Philip J. B. Jackson, Wenwu Wang

Abstract: Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic values and wide application. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual information, the Bayesian-based filter can solve the problem of data association, audio-visual fusion and track management. In this paper, we condu… ▽ More Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic values and wide application. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual information, the Bayesian-based filter can solve the problem of data association, audio-visual fusion and track management. In this paper, we conduct a comprehensive overview of audio-visual speaker tracking. To our knowledge, this is the first extensive survey over the past five years. We introduce the family of Bayesian filters and summarize the methods for obtaining audio-visual measurements. In addition, the existing trackers and their performance on AV16.3 dataset are summarized. In the past few years, deep learning techniques have thrived, which also boosts the development of audio visual speaker tracking. The influence of deep learning techniques in terms of measurement extraction and state estimation is also discussed. At last, we discuss the connections between audio-visual speaker tracking and other areas such as speech separation and distributed speaker tracking. △ Less

Submitted 17 December, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

arXiv:2308.05051 [pdf, other]

PAT: Position-Aware Transformer for Dense Multi-Label Action Detection

Authors: Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton

Abstract: We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-att… ▽ More We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non hierarchical network, in contrast to the recent transformer-based approaches that use a hierarchical structure. We argue that joining the self-attention mechanism with multiple sub-sampling processes in the hierarchical approaches results in increased loss of positional information. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets, and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, respectively, thereby achieving the new state-of-the-art mAP at 26.5% and 44.6%, respectively. We also perform extensive ablation studies to examine the impact of the different components of our proposed network. △ Less

Submitted 9 August, 2023; originally announced August 2023.

arXiv:2212.01892 [pdf, other]

doi 10.1145/3565516.3565522

Tragic Talkers: A Shakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research

Authors: Davide Berghi, Marco Volino, Philip J. B. Jackson

Abstract: 3D audio-visual production aims to deliver immersive and interactive experiences to the consumer. Yet, faithfully reproducing real-world 3D scenes remains a challenging task. This is partly due to the lack of available datasets enabling audio-visual research in this direction. In most of the existing multi-view datasets, the accompanying audio is neglected. Similarly, datasets for spatial audio re… ▽ More 3D audio-visual production aims to deliver immersive and interactive experiences to the consumer. Yet, faithfully reproducing real-world 3D scenes remains a challenging task. This is partly due to the lack of available datasets enabling audio-visual research in this direction. In most of the existing multi-view datasets, the accompanying audio is neglected. Similarly, datasets for spatial audio research primarily offer unimodal content, and when visual data is included, the quality is far from meeting the standard production needs. We present "Tragic Talkers", an audio-visual dataset consisting of excerpts from the "Romeo and Juliet" drama captured with microphone arrays and multiple co-located cameras for light-field video. Tragic Talkers provides ideal content for object-based media (OBM) production. It is designed to cover various conventional talking scenarios, such as monologues, two-people conversations, and interactions with considerable movement and occlusion, yielding 30 sequences captured from a total of 22 different points of view and two 16-element microphone arrays. Additionally, we provide voice activity labels, 2D face bounding boxes for each camera view, 2D pose detection keypoints, 3D tracking data of the mouth of the actors, and dialogue transcriptions. We believe the community will benefit from this dataset as it can assist multidisciplinary research. Possible uses of the dataset are discussed. △ Less

Submitted 4 December, 2022; originally announced December 2022.

arXiv:2203.03291 [pdf, other]

Visually Supervised Speaker Detection and Localization via Microphone Array

Authors: Davide Berghi, Adrian Hilton, Philip J. B. Jackson

Abstract: Active speaker detection (ASD) is a multi-modal task that aims to identify who, if anyone, is speaking from a set of candidates. Current audio-visual approaches for ASD typically rely on visually pre-extracted face tracks (sequences of consecutive face crops) and the respective monaural audio. However, their recall rate is often low as only the visible faces are included in the set of candidates.… ▽ More Active speaker detection (ASD) is a multi-modal task that aims to identify who, if anyone, is speaking from a set of candidates. Current audio-visual approaches for ASD typically rely on visually pre-extracted face tracks (sequences of consecutive face crops) and the respective monaural audio. However, their recall rate is often low as only the visible faces are included in the set of candidates. Monaural audio may successfully detect the presence of speech activity but fails in localizing the speaker due to the lack of spatial cues. Our solution extends the audio front-end using a microphone array. We train an audio convolutional neural network (CNN) in combination with beamforming techniques to regress the speaker's horizontal position directly in the video frames. We propose to generate weak labels using a pre-trained active speaker detector on pre-extracted face tracks. Our pipeline embraces the "student-teacher" paradigm, where a trained "teacher" network is used to produce pseudo-labels visually. The "student" network is an audio network trained to generate the same results. At inference, the student network can independently localize the speaker in the visual frames directly from the audio input. Experimental results on newly collected data prove that our approach significantly outperforms a variety of other baselines as well as the teacher network itself. It results in an excellent speech activity detector too. △ Less

Submitted 7 March, 2022; originally announced March 2022.

Comments: Erratum: Due to a bug in the evaluation script, the correct average distance (aD) metric is here reported in yellow. The analysis remains unchanged from the original paper as the trend between the old and new measures are perfectly monotonic. The bug was caused by an incorrect normalization factor

Journal ref: IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP), 2021

arXiv:2105.00641 [pdf, other]

Naturalistic audio-visual volumetric sequences dataset of sounding actions for six degree-of-freedom interaction

Authors: Hanne Stenzel, Davide Berghi, Marco Volino, Philip J. B. Jackson

Abstract: As audio-visual systems increasingly bring immersive and interactive capabilities into our work and leisure activities, so the need for naturalistic test material grows. New volumetric datasets have captured high-quality 3D video, but accompanying audio is often neglected, making it hard to test an integrated bimodal experience. Designed to cover diverse sound types and features, the presented vol… ▽ More As audio-visual systems increasingly bring immersive and interactive capabilities into our work and leisure activities, so the need for naturalistic test material grows. New volumetric datasets have captured high-quality 3D video, but accompanying audio is often neglected, making it hard to test an integrated bimodal experience. Designed to cover diverse sound types and features, the presented volumetric dataset was constructed from audio and video studio recordings of scenes to yield forty short action sequences. Potential uses in technical and scientific tests are discussed. △ Less

Submitted 3 May, 2021; originally announced May 2021.

Comments: for dataset visit cvssp.org/data/navvs; accepted as poster in IEEE VR 2021

arXiv:2010.12635 [pdf, other]

Not Half Bad: Exploring Half-Precision in Graph Convolutional Neural Networks

Authors: John Brennan, Stephen Bonner, Amir Atapour-Abarghouei, Philip T Jackson, Boguslaw Obara, Andrew Stephen McGough

Abstract: With the growing significance of graphs as an effective representation of data in numerous applications, efficient graph analysis using modern machine learning is receiving a growing level of attention. Deep learning approaches often operate over the entire adjacency matrix -- as the input and intermediate network layers are all designed in proportion to the size of the adjacency matrix -- leading… ▽ More With the growing significance of graphs as an effective representation of data in numerous applications, efficient graph analysis using modern machine learning is receiving a growing level of attention. Deep learning approaches often operate over the entire adjacency matrix -- as the input and intermediate network layers are all designed in proportion to the size of the adjacency matrix -- leading to intensive computation and large memory requirements as the graph size increases. It is therefore desirable to identify efficient measures to reduce both run-time and memory requirements allowing for the analysis of the largest graphs possible. The use of reduced precision operations within the forward and backward passes of a deep neural network along with novel specialised hardware in modern GPUs can offer promising avenues towards efficiency. In this paper, we provide an in-depth exploration of the use of reduced-precision operations, easily integrable into the highly popular PyTorch framework, and an analysis of the effects of Tensor Cores on graph convolutional neural networks. We perform an extensive experimental evaluation of three GPU architectures and two widely-used graph analysis tasks (vertex classification and link prediction) using well-known benchmark and synthetically generated datasets. Thus allowing us to make important observations on the effects of reduced-precision operations and Tensor Cores on computational and memory usage of graph convolutional neural networks -- often neglected in the literature. △ Less

Submitted 23 October, 2020; originally announced October 2020.

arXiv:2007.08574 [pdf, other]

Camera Bias in a Fine Grained Classification Task

Authors: Philip T. Jackson, Stephen Bonner, Ning Jia, Christopher Holder, Jon Stonehouse, Boguslaw Obara

Abstract: We show that correlations between the camera used to acquire an image and the class label of that image can be exploited by convolutional neural networks (CNN), resulting in a model that "cheats" at an image classification task by recognizing which camera took the image and inferring the class label from the camera. We show that models trained on a dataset with camera / label correlations do not g… ▽ More We show that correlations between the camera used to acquire an image and the class label of that image can be exploited by convolutional neural networks (CNN), resulting in a model that "cheats" at an image classification task by recognizing which camera took the image and inferring the class label from the camera. We show that models trained on a dataset with camera / label correlations do not generalize well to images in which those correlations are absent, nor to images from unencountered cameras. Furthermore, we investigate which visual features they are exploiting for camera recognition. Our experiments present evidence against the importance of global color statistics, lens deformation and chromatic aberration, and in favor of high frequency features, which may be introduced by image processing algorithms built into the cameras. △ Less

Submitted 16 July, 2020; originally announced July 2020.

arXiv:2003.06656 [pdf, other]

Audio-Visual Spatial Aligment Requirements of Central and Peripheral Object Events

Authors: Davide Berghi, Hanne Stenzel, Marco Volino, Adrian Hilton, Philip J. B. Jackson

Abstract: Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audiovisual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were… ▽ More Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audiovisual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results in a simulated virtual environment from 18 participants indicate a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Centered stimulus results support previous findings for simpler scenes. △ Less

Submitted 14 March, 2020; originally announced March 2020.

Comments: Two-pages poster abstract

Journal ref: IEEE VR 2020

arXiv:1910.02127 [pdf, other]

doi 10.1109/TASLP.2019.2946043

Modeling the Comb Filter Effect and Interaural Coherence for Binaural Source Separation

Authors: Luca Remaggi, Philip J. B. Jackson, Wenwu Wang

Abstract: Typical methods for binaural source separation consider only the direct sound as the target signal in a mixture. However, in most scenarios, this assumption limits the source separation performance. It is well known that the early reflections interact with the direct sound, producing acoustic effects at the listening position, e.g. the so-called comb filter effect. In this article, we propose a no… ▽ More Typical methods for binaural source separation consider only the direct sound as the target signal in a mixture. However, in most scenarios, this assumption limits the source separation performance. It is well known that the early reflections interact with the direct sound, producing acoustic effects at the listening position, e.g. the so-called comb filter effect. In this article, we propose a novel source separation model, that utilizes both the direct sound and the first early reflection information to model the comb filter effect. This is done by observing the interaural phase difference obtained from the time-frequency representation of binaural mixtures. Furthermore, a method is proposed to model the interaural coherence of the signals. Including information related to the sound multipath propagation, the performance of the proposed separation method is improved with respect to the baselines that did not use such information, as illustrated by using binaural recordings made in four rooms, having different sizes and reverberation times. △ Less

Submitted 4 October, 2019; originally announced October 2019.

Comments: IEEE Copyright. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019

arXiv:1908.08402 [pdf, other]

Temporal Neighbourhood Aggregation: Predicting Future Links in Temporal Graphs via Recurrent Variational Graph Convolutions

Authors: Stephen Bonner, Amir Atapour-Abarghouei, Philip T Jackson, John Brennan, Ibad Kureshi, Georgios Theodoropoulos, Andrew Stephen McGough, Boguslaw Obara

Abstract: Graphs have become a crucial way to represent large, complex and often temporal datasets across a wide range of scientific disciplines. However, when graphs are used as input to machine learning models, this rich temporal information is frequently disregarded during the learning process, resulting in suboptimal performance on certain temporal infernce tasks. To combat this, we introduce Temporal N… ▽ More Graphs have become a crucial way to represent large, complex and often temporal datasets across a wide range of scientific disciplines. However, when graphs are used as input to machine learning models, this rich temporal information is frequently disregarded during the learning process, resulting in suboptimal performance on certain temporal infernce tasks. To combat this, we introduce Temporal Neighbourhood Aggregation (TNA), a novel vertex representation model architecture designed to capture both topological and temporal information to directly predict future graph states. Our model exploits hierarchical recurrence at different depths within the graph to enable exploration of changes in temporal neighbourhoods, whilst requiring no additional features or labels to be present. The final vertex representations are created using variational sampling and are optimised to directly predict the next graph in the sequence. Our claims are reinforced by extensive experimental evaluation on both real and synthetic benchmark datasets, where our approach demonstrates superior performance compared to competing methods, out-performing them at predicting new temporal edges by as much as 23% on real-world datasets, whilst also requiring fewer overall model parameters. △ Less

Submitted 21 November, 2019; v1 submitted 21 August, 2019; originally announced August 2019.

Comments: IEEE International Conference on Big Data 2019

arXiv:1906.08826 [pdf, other]

doi 10.1016/j.icarus.2020.113749

Automated crater shape retrieval using weakly-supervised deep learning

Authors: Mohamad Ali-Dib, Kristen Menou, Alan P. Jackson, Chenchong Zhu, Noah Hammond

Abstract: Crater ellipticity determination is a complex and time consuming task that so far has evaded successful automation. We train a state of the art computer vision algorithm to identify craters in Lunar digital elevation maps and retrieve their sizes and 2D shapes. The computational backbone of the model is MaskRCNN, an "instance segmentation" general framework that detects craters in an image while s… ▽ More Crater ellipticity determination is a complex and time consuming task that so far has evaded successful automation. We train a state of the art computer vision algorithm to identify craters in Lunar digital elevation maps and retrieve their sizes and 2D shapes. The computational backbone of the model is MaskRCNN, an "instance segmentation" general framework that detects craters in an image while simultaneously producing a mask for each crater that traces its outer rim. Our post-processing pipeline then finds the closest fitting ellipse to these masks, allowing us to retrieve the crater ellipticities. Our model is able to correctly identify 87\% of known craters in the longitude range we hid from the network during training and validation (test set), while predicting thousands of additional craters not present in our training data. Manual validation of a subset of these "new" craters indicates that a majority of them are real, which we take as an indicator of the strength of our model in learning to identify craters, despite incomplete training data. The crater size, ellipticity, and depth distributions predicted by our model are consistent with human-generated results. The model allows us to perform a large scale search for differences in crater diameter and shape distributions between the lunar highlands and maria, and we exclude any such differences with a high statistical significance. The predicted test set catalogue and trained model are available here: https://github.com/malidib/Craters_MaskRCNN/. △ Less

Submitted 11 March, 2020; v1 submitted 20 June, 2019; originally announced June 2019.

Comments: 59 pages, 13 figures, Accepted for publication in Icarus

arXiv:1906.07552 [pdf, other]

doi 10.24963/ijcai.2019/381

Single-Channel Signal Separation and Deconvolution with Generative Adversarial Networks

Authors: Qiuqiang Kong, Yong Xu, Wenwu Wang, Philip J. B. Jackson, Mark D. Plumbley

Abstract: Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture and is a challenging problem in which no prior knowledge of the mixing filters is available. Both individual sources and mixing filters need to be estimated. In addition, a mixture may contain non-stationary noise which is unseen in the training set. We propose a synt… ▽ More Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture and is a challenging problem in which no prior knowledge of the mixing filters is available. Both individual sources and mixing filters need to be estimated. In addition, a mixture may contain non-stationary noise which is unseen in the training set. We propose a synthesizing-decomposition (S-D) approach to solve the single-channel separation and deconvolution problem. In synthesizing, a generative model for sources is built using a generative adversarial network (GAN). In decomposition, both mixing filters and sources are optimized to minimize the reconstruction error of the mixture. The proposed S-D approach achieves a peak-to-noise-ratio (PSNR) of 18.9 dB and 15.4 dB in image inpainting and completion, outperforming a baseline convolutional neural network PSNR of 15.3 dB and 12.2 dB, respectively and achieves a PSNR of 13.2 dB in source separation together with deconvolution, outperforming a convolutive non-negative matrix factorization (NMF) baseline of 10.1 dB. △ Less

Submitted 14 June, 2019; originally announced June 2019.

Comments: 7 pages. Accepted by IJCAI 2019

Journal ref: International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 2747-2753

arXiv:1903.06516 [pdf, other]

Phenotypic Profiling of High Throughput Imaging Screens with Generic Deep Convolutional Features

Authors: Philip T. Jackson, Yinhai Wang, Sinead Knight, Hongming Chen, Thierry Dorval, Martin Brown, Claus Bendtsen, Boguslaw Obara

Abstract: While deep learning has seen many recent applications to drug discovery, most have focused on predicting activity or toxicity directly from chemical structure. Phenotypic changes exhibited in cellular images are also indications of the mechanism of action (MoA) of chemical compounds. In this paper, we show how pre-trained convolutional image features can be used to assist scientists in discovering… ▽ More While deep learning has seen many recent applications to drug discovery, most have focused on predicting activity or toxicity directly from chemical structure. Phenotypic changes exhibited in cellular images are also indications of the mechanism of action (MoA) of chemical compounds. In this paper, we show how pre-trained convolutional image features can be used to assist scientists in discovering interesting chemical clusters for further investigation. Our method reduces the dimensionality of raw fluorescent stained images from a high throughput imaging (HTI) screen, producing an embedding space that groups together images with similar cellular phenotypes. Running standard unsupervised clustering on this embedding space yields a set of distinct phenotypic clusters. This allows scientists to further select and focus on interesting clusters for downstream analyses. We validate the consistency of our embedding space qualitatively with t-sne visualizations, and quantitatively by measuring embedding variance among images that are known to be similar. Results suggested the usefulness of our proposed workflow using deep learning and clustering and it can lead to robust HTI screening and compound triage. △ Less

Submitted 15 March, 2019; originally announced March 2019.

arXiv:1809.05375 [pdf, other]

Style Augmentation: Data Augmentation via Style Randomization

Authors: Philip T. Jackson, Amir Atapour-Abarghouei, Stephen Bonner, Toby Breckon, Boguslaw Obara

Abstract: We introduce style augmentation, a new form of data augmentation based on random style transfer, for improving the robustness of convolutional neural networks (CNN) over both classification and regression based tasks. During training, our style augmentation randomizes texture, contrast and color, while preserving shape and semantic content. This is accomplished by adapting an arbitrary style trans… ▽ More We introduce style augmentation, a new form of data augmentation based on random style transfer, for improving the robustness of convolutional neural networks (CNN) over both classification and regression based tasks. During training, our style augmentation randomizes texture, contrast and color, while preserving shape and semantic content. This is accomplished by adapting an arbitrary style transfer network to perform style randomization, by sampling input style embeddings from a multivariate normal distribution instead of inferring them from a style image. In addition to standard classification experiments, we investigate the effect of style augmentation (and data augmentation generally) on domain transfer tasks. We find that data augmentation significantly improves robustness to domain shift, and can be used as a simple, domain agnostic alternative to domain adaptation. Comparing style augmentation against a mix of seven traditional augmentation techniques, we find that it can be readily combined with them to improve network performance. We validate the efficacy of our technique with domain transfer experiments in classification and monocular depth estimation, illustrating consistent improvements in generalization. △ Less

Submitted 12 April, 2019; v1 submitted 14 September, 2018; originally announced September 2018.

arXiv:1708.07218 [pdf]

Object-Based Audio Rendering

Authors: Philip Jackson, Filippo Fazi, Frank Melchior, Trevor Cox, Adrian Hilton, Chris Pike, Jon Francombe, Andreas Franck, Philip Coleman, Dylan Menzies-Gow, James Woodcock, Yan Tang, Qingju Liu, Rick Hughes, Marcos Simon Galvez, Teo de Campos, Hansung Kim, Hanne Stenzel

Abstract: Apparatus and methods are disclosed for performing object-based audio rendering on a plurality of audio objects which define a sound scene, each audio object comprising at least one audio signal and associated metadata. The apparatus comprises: a plurality of renderers each capable of rendering one or more of the audio objects to output rendered audio data; and object adapting means for adapting o… ▽ More Apparatus and methods are disclosed for performing object-based audio rendering on a plurality of audio objects which define a sound scene, each audio object comprising at least one audio signal and associated metadata. The apparatus comprises: a plurality of renderers each capable of rendering one or more of the audio objects to output rendered audio data; and object adapting means for adapting one or more of the plurality of audio objects for a current reproduction scenario, the object adapting means being configured to send the adapted one or more audio objects to one or more of the plurality of renderers. △ Less

Submitted 23 August, 2017; originally announced August 2017.

Comments: This is a transcript of GB Patent Application No: GB1609316.3, filed in the UK by the University of Surrey on 23 May 2016. It describes an intelligent system for customising, personalising and perceptually monitoring the rendering of an object-based audio stream for an arbitrary connected system of loudspeakers to optimize the listening experience as the producer intended. 30 pages, 5 figures

arXiv:1705.08262 [pdf, other]

Verification of a lazy cache coherence protocol against a weak memory model

Authors: Christopher J. Banks, Marco Elver, Ruth Hoffmann, Susmit Sarkar, Paul Jackson, Vijay Nagarajan

Abstract: In this paper we verify a modern lazy cache coherence protocol, TSO-CC, against the memory consistency model it was designed for, TSO. We achieve this by first showing a weak simulation relation between TSO-CC (with a fixed number of processors) and a novel finite-state operational model which exhibits the laziness of TSO-CC and satisfies TSO. We then extend this by an existing parameterisation te… ▽ More In this paper we verify a modern lazy cache coherence protocol, TSO-CC, against the memory consistency model it was designed for, TSO. We achieve this by first showing a weak simulation relation between TSO-CC (with a fixed number of processors) and a novel finite-state operational model which exhibits the laziness of TSO-CC and satisfies TSO. We then extend this by an existing parameterisation technique, allowing verification for an unlimited number of processors. The approach is executed entirely within a model checker, no external tool is required and very little in-depth knowledge of formal verification methods is required of the verifier. △ Less

Submitted 18 May, 2017; originally announced May 2017.

Comments: 10 pages

ACM Class: B.1.4; B.3.2; B.3.3

arXiv:1610.05653 [pdf, other]

doi 10.1109/TASLP.2016.2633802

Acoustic Reflector Localization: Novel Image Source Reversion and Direct Localization Methods

Authors: Luca Remaggi, Philip J. B. Jackson, Philip Coleman, Wenwu Wang

Abstract: Acoustic reflector localization is an important issue in audio signal processing, with direct applications in spatial audio, scene reconstruction, and source separation. Several methods have recently been proposed to estimate the 3D positions of acoustic reflectors given room impulse responses (RIRs). In this article, we categorize these methods as "image-source reversion", which localizes the ima… ▽ More Acoustic reflector localization is an important issue in audio signal processing, with direct applications in spatial audio, scene reconstruction, and source separation. Several methods have recently been proposed to estimate the 3D positions of acoustic reflectors given room impulse responses (RIRs). In this article, we categorize these methods as "image-source reversion", which localizes the image source before finding the reflector position, and "direct localization", which localizes the reflector without intermediate steps. We present five new contributions. First, an onset detector, called the clustered dynamic programming projected phase-slope algorithm, is proposed to automatically extract the time of arrival for early reflections within the RIRs of a compact microphone array. Second, we propose an image-source reversion method that uses the RIRs from a single loudspeaker. It is constructed by combining an image source locator (the image source direction and range (ISDAR) algorithm), and a reflector locator (using the loudspeaker-image bisection (LIB) algorithm). Third, two variants of it, exploiting multiple loudspeakers, are proposed. Fourth, we present a direct localization method, the ellipsoid tangent sample consensus (ETSAC), exploiting ellipsoid properties to localize the reflector. Finally, systematic experiments on simulated and measured RIRs are presented, comparing the proposed methods with the state-of-the-art. ETSAC generates errors lower than the alternative methods compared through our datasets. Nevertheless, the ISDAR-LIB combination performs well and has a run time 200 times faster than ETSAC. △ Less

Submitted 5 January, 2017; v1 submitted 18 October, 2016; originally announced October 2016.

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 2, pp. 296-309, February 2017

arXiv:1607.03681 [pdf, other]

doi 10.1109/TASLP.2017.2690563

Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging

Authors: Yong Xu, Qiang Huang, Wenwu Wang, Peter Foster, Siddharth Sigtia, Philip J. B. Jackson, Mark D. Plumbley

Abstract: Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the interested acoustic scene. In this paper we make contributions to audio tagging in two parts, respectively, acoustic modeling and feature learning. We propose to use a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classific… ▽ More Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the interested acoustic scene. In this paper we make contributions to audio tagging in two parts, respectively, acoustic modeling and feature learning. We propose to use a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classification task. For the acoustic modeling, a large set of contextual frames of the chunk are fed into the DNN to perform a multi-label classification for the expected tags, considering that only chunk (or utterance) level rather than frame-level labels are available. Dropout and background noise aware training are also adopted to improve the generalization capability of the DNNs. For the unsupervised feature learning, we propose to use a symmetric or asymmetric deep de-noising auto-encoder (sDAE or aDAE) to generate new data-driven features from the Mel-Filter Banks (MFBs) features. The new features, which are smoothed against background noise and more compact with contextual information, can further improve the performance of the DNN baseline. Compared with the standard Gaussian Mixture Model (GMM) baseline of the DCASE 2016 audio tagging challenge, our proposed method obtains a significant equal error rate (EER) reduction from 0.21 to 0.13 on the development set. The proposed aDAE system can get a relative 6.7% EER reduction compared with the strong DNN baseline on the development set. Finally, the results also show that our approach obtains the state-of-the-art performance with 0.15 EER on the evaluation set of the DCASE 2016 audio tagging task while EER of the first prize of this challenge is 0.17. △ Less

Submitted 29 November, 2016; v1 submitted 13 July, 2016; originally announced July 2016.

Comments: 10 pages, dcase 2016 challenge

Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing 25(6):1230-1241, Jun 2017

arXiv:1606.07695 [pdf, other]

Fully DNN-based Multi-label regression for audio tagging

Authors: Yong Xu, Qiang Huang, Wenwu Wang, Philip J. B. Jackson, Mark D. Plumbley

Abstract: Acoustic event detection for content analysis in most cases relies on lots of labeled data. However, manually annotating data is a time-consuming task, which thus makes few annotated resources available so far. Unlike audio event detection, automatic audio tagging, a multi-label acoustic event classification task, only relies on weakly labeled data. This is highly desirable to some practical appli… ▽ More Acoustic event detection for content analysis in most cases relies on lots of labeled data. However, manually annotating data is a time-consuming task, which thus makes few annotated resources available so far. Unlike audio event detection, automatic audio tagging, a multi-label acoustic event classification task, only relies on weakly labeled data. This is highly desirable to some practical applications using audio analysis. In this paper we propose to use a fully deep neural network (DNN) framework to handle the multi-label classification task in a regression way. Considering that only chunk-level rather than frame-level labels are available, the whole or almost whole frames of the chunk were fed into the DNN to perform a multi-label regression for the expected tags. The fully DNN, which is regarded as an encoding function, can well map the audio features sequence to a multi-tag vector. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improved methods were adopted, such as the Dropout and background noise aware training, to enhance its generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian Mixture Model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method could well utilize the long-term temporal information with the whole chunk as the input. The results show that our approach obtained a 15% relative improvement compared with the official GMM-based method of DCASE 2016 challenge. △ Less

Submitted 13 August, 2016; v1 submitted 24 June, 2016; originally announced June 2016.

Comments: Submitted to DCASE2016 Workshop which is as a satellite event to the 2016 European Signal Processing Conference (EUSIPCO)

arXiv:1407.0325 [pdf]

A Computational Model of Crowds for Collective Intelligence

Authors: John Prpic, Piper Jackson, Thai Nguyen

Abstract: In this work, we present a high-level computational model of IT-mediated crowds for collective intelligence. We introduce the Crowd Capital perspective as an organizational-level model of collective intelligence generation from IT-mediated crowds, and specify a computational system including agents, forms of IT, and organizational knowledge. In this work, we present a high-level computational model of IT-mediated crowds for collective intelligence. We introduce the Crowd Capital perspective as an organizational-level model of collective intelligence generation from IT-mediated crowds, and specify a computational system including agents, forms of IT, and organizational knowledge. △ Less

Submitted 30 June, 2014; originally announced July 2014.

Report number: ci-2014/94

Showing 1–26 of 26 results for author: Jackson, P