-
Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data
Authors:
Moira Shooter,
Charles Malleson,
Adrian Hilton
Abstract:
We introduce a new benchmark analysis focusing on 3D canine pose estimation from monocular in-the-wild images. A multi-modal dataset, 3DDogs-Lab, was captured indoors, featuring various dog breeds trotting on a walkway. It includes data from optical marker-based mocap systems, RGBD cameras, IMUs, and a pressure mat. Although the dataset provides high-quality motion data, the presence of optical markers and limited background diversity make the captured video less representative of real-world conditions. To address this, we created 3DDogs-Wild, a naturalised version of the dataset where the optical markers are in-painted and the subjects are placed in diverse environments, enhancing its utility for training RGB image-based pose detectors. We show that training the models on 3DDogs-Wild leads to improved performance when evaluating on in-the-wild data. Additionally, we provide a thorough analysis using various pose estimation models, revealing their respective strengths and weaknesses. We believe that our findings, coupled with the datasets provided, offer valuable insights for advancing 3D animal pose estimation.
Submitted 20 June, 2024;
originally announced June 2024.
-
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
Authors:
Asmar Nadeem,
Faegheh Sardari,
Robert Dawes,
Syed Sameed Husain,
Adrian Hilton,
Armin Mustafa
Abstract:
Existing video captioning benchmarks and models lack coherent representations of causal-temporal narrative, that is, sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. To address this gap, we propose NarrativeBridge, an approach comprising: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting, explicitly encoding cause-effect temporal relationships in video descriptions, evaluated automatically to ensure caption quality and relevance; and (2) a dedicated Cause-Effect Network (CEN) architecture with separate encoders for capturing cause and effect dynamics independently, enabling effective learning and generation of captions with causal-temporal narrative. Extensive experiments demonstrate that CEN is more accurate in articulating the causal and temporal aspects of video content than the second-best model (GIT): 17.88 and 17.44 CIDEr on the MSVD and MSR-VTT datasets, respectively. The proposed framework understands and generates nuanced text descriptions with intricate causal-temporal narrative structures present in videos, addressing a critical limitation in video captioning. For project details, visit https://narrativebridge.github.io/.
Submitted 10 June, 2024;
originally announced June 2024.
-
An Effective-Efficient Approach for Dense Multi-Label Action Detection
Authors:
Faegheh Sardari,
Armin Mustafa,
Philip J. B. Jackson,
Adrian Hilton
Abstract:
Unlike the sparse-label action detection task, where a single action occurs at each timestamp of a video, in a dense multi-label scenario, actions can overlap. To address this challenging task, it is necessary to simultaneously learn (i) temporal dependencies and (ii) co-occurrence action relationships. Recent approaches model temporal information by extracting multi-scale features through hierarchical transformer-based networks. However, the self-attention mechanism in transformers inherently loses temporal positional information. We argue that combining this with multiple sub-sampling processes in hierarchical designs can lead to further loss of positional information. Preserving this information is essential for accurate action detection. In this paper, we address this issue by proposing a novel transformer-based network that (a) employs a non-hierarchical structure when modelling different ranges of temporal dependencies and (b) embeds relative positional encoding in its transformer layers. Furthermore, to model co-occurrence action relationships, current methods explicitly embed class relations into the transformer network. However, these approaches are not computationally efficient, as the network needs to compute all possible pairwise action class relations. We also overcome this challenge by introducing a novel learning paradigm that allows the network to benefit from explicitly modelling temporal co-occurrence action dependencies without incurring their additional computational cost during inference. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets and show that our method improves the current state-of-the-art results.
Submitted 10 June, 2024;
originally announced June 2024.
-
Gaussian Splatting with Localized Points Management
Authors:
Haosen Yang,
Chenhao Zhang,
Wenqing Wang,
Marco Volino,
Adrian Hilton,
Li Zhang,
Xiatian Zhu
Abstract:
Point management is a critical component in optimizing 3D Gaussian Splatting (3DGS) models, as the point initiation (e.g., via structure from motion) is distributionally inappropriate. Typically, the Adaptive Density Control (ADC) algorithm is applied, leveraging view-averaged gradient magnitude thresholding for point densification, opacity thresholding for pruning, and regular all-points opacity reset. However, we reveal that this strategy is limited in tackling intricate/special image regions (e.g., transparent) as it is unable to identify all the 3D zones that require point densification, and lacks an appropriate mechanism to handle the ill-conditioned points with negative impacts (occlusion due to false high opacity). To address these limitations, we propose a Localized Point Management (LPM) strategy, capable of identifying those error-contributing zones in the highest demand for both point addition and geometry calibration. Zone identification is achieved by leveraging the underlying multiview geometry constraints, with the guidance of image rendering errors. We apply point densification in the identified zone, whilst resetting the opacity of those points residing in front of these regions so that a new opportunity is created to correct ill-conditioned points. Serving as a versatile plugin, LPM can be seamlessly integrated into existing 3D Gaussian Splatting models. Experimental evaluation across both static 3D and dynamic 4D scenes validates the efficacy of our LPM strategy in boosting a variety of existing 3DGS models both quantitatively and qualitatively. Notably, LPM improves both vanilla 3DGS and SpaceTimeGS to achieve state-of-the-art rendering quality while retaining real-time speeds, outperforming on challenging datasets such as Tanks & Temples and the Neural 3D Video Dataset.
Submitted 13 June, 2024; v1 submitted 6 June, 2024;
originally announced June 2024.
-
Demonstration of a Mobile Optical Clock Ensemble at Sea
Authors:
E. Ahern,
J. W. Allison,
C. Billington,
N. Bourbeau Hébert,
A. P. Hilton,
E. Klantsataya,
C. Locke,
A. N. Luiten,
M. Nelligan,
R. F. Offer,
C. Perrella,
S. K. Scholten,
B. White,
B. M. Sparkes,
R. Beard,
J. D. Elgin,
K. W. Martin
Abstract:
Atomic clocks have been at the leading edge of accuracy and precision since their inception in the 1950s. However, the most capable of these clocks have typically been confined to laboratories, despite compelling reasons to apply them in the field and/or while in motion. These applications include synchronization of distributed critical infrastructure (e.g. data servers, communications, electricity grids), scientific applications (e.g. radio-astronomy) and mitigation of the effects of interruption to Global Navigation Satellite Systems.
Over the last 20 years, there has been a breakthrough in the performance of atomic clocks by transitioning from an atomic reference based on a microwave transition to an optical frequency transition. The $10^5$-fold increase in reference frequency confers the potential for significantly higher performance. However, this performance increase has come at the cost of size, complexity and fragility, which has kept these clocks confined to the laboratory.
Here we report on a recent international collaboration where three emerging optical clocks, each operating on different principles, were trialed at sea. These clocks incorporate optical frequency combs so that their stable frequency outputs can be used directly in electronic apparatus and were also automated so that they do not require expert supervision. We present the frequency stability and reliability of these three clocks over multiple weeks of unsupervised naval trials, both in harbour and on the ocean. The performance of all three devices was orders of magnitude superior to existing best-in-class commercial solutions over short and medium timescales. This demonstrates that optical clocks are ready to deliver advantages for real-world applications.
Submitted 21 June, 2024; v1 submitted 5 June, 2024;
originally announced June 2024.
-
CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing
Authors:
Faegheh Sardari,
Armin Mustafa,
Philip J. B. Jackson,
Adrian Hilton
Abstract:
Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario, it negatively impacts unaligned audible or visible events by introducing irrelevant modality information. In this paper, we propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space such that the network explicitly learns to combine cross-modal information for audible-visible events while filtering it out for unaligned events. Additionally, as videos often involve complex class relationships, modelling them improves performance. However, this introduces extra computational costs into the network. Our framework is designed to leverage cross-class relationships during training without incurring additional computations at inference. Furthermore, we propose new metrics to better evaluate a method's capabilities in performing AVVP. Our extensive experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, respectively.
Submitted 7 July, 2024; v1 submitted 17 May, 2024;
originally announced May 2024.
-
ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D image
Authors:
Marco Pesavento,
Yuanlu Xu,
Nikolaos Sarafianos,
Robert Maier,
Ziyan Wang,
Chun-Han Yao,
Marco Volino,
Edmond Boyer,
Adrian Hilton,
Tony Tung
Abstract:
Recent progress in human shape learning shows that neural implicit models are effective in generating 3D human surfaces from a limited number of views, and even from a single RGB image. However, existing monocular approaches still struggle to recover fine geometric details such as face, hands or cloth wrinkles. They are also prone to depth ambiguities that result in distorted geometries along the camera optical axis. In this paper, we explore the benefits of incorporating depth observations in the reconstruction process by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes from single-view RGB-D images with an unprecedented level of accuracy. Our model learns geometric details from both multi-resolution pixel-aligned and voxel-aligned features to leverage depth information and enable spatial relationships, mitigating depth ambiguities. We further enhance the quality of the reconstructed shape by introducing a depth-supervision strategy, which improves the accuracy of the signed distance field estimation of points that lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms state-of-the-art works that use RGB, surface normals, point cloud or RGB-D data as input. In addition, we introduce ANIM-Real, a new multi-modal dataset comprising high-quality scans paired with consumer-grade RGB-D camera data, and our protocol to fine-tune ANIM, enabling high-quality reconstruction from real-world human capture.
Submitted 18 March, 2024; v1 submitted 15 March, 2024;
originally announced March 2024.
-
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
Authors:
Asmar Nadeem,
Adrian Hilton,
Robert Dawes,
Graham Thomas,
Armin Mustafa
Abstract:
In the context of Audio Visual Question Answering (AVQA) tasks, the audio-visual modalities can be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings: the audio-visual (AV) information passing through the network isn't aligned on the Spatial and Temporal levels, and inter-modal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on the Temporal level in a self-supervised setting; and iii) introducing a cross-attention mechanism to balance audio and visual information on the Semantic level. The proposed CAD network improves the overall performance over the state-of-the-art methods on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to the existing methods to improve their performance without additional complexity requirements.
Submitted 27 October, 2023; v1 submitted 25 October, 2023;
originally announced October 2023.
-
PAT: Position-Aware Transformer for Dense Multi-Label Action Detection
Authors:
Faegheh Sardari,
Armin Mustafa,
Philip J. B. Jackson,
Adrian Hilton
Abstract:
We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to the recent transformer-based approaches that use a hierarchical structure. We argue that joining the self-attention mechanism with multiple sub-sampling processes in the hierarchical approaches results in increased loss of positional information. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets, and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, respectively, thereby achieving the new state-of-the-art mAP at 26.5% and 44.6%, respectively. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.
Submitted 9 August, 2023;
originally announced August 2023.
-
End-to-End Latency Optimization of Multi-view 3D Reconstruction for Disaster Response
Authors:
Xiaojie Zhang,
Mingjun Li,
Andrew Hilton,
Amitangshu Pal,
Soumyabrata Dey,
Saptarshi Debroy
Abstract:
In order to plan rapid response during disasters, first responder agencies often adopt a 'bring your own device' (BYOD) model with inexpensive mobile edge devices (e.g., drones, robots, tablets) for complex video analytics applications, e.g., 3D reconstruction of a disaster scene. Unlike simpler video applications, widely used Multi-view Stereo (MVS) based 3D reconstruction applications (e.g., openMVG/openMVS) are exceedingly time-consuming, especially when run on such computationally constrained mobile edge devices. Additionally, reducing the reconstruction latency of such inherently sequential algorithms is challenging, as unintelligent, application-agnostic strategies can drastically degrade the reconstruction (i.e., application outcome) quality, making them useless. In this paper, we aim to design a latency-optimized MVS algorithm pipeline, with the objective of best balancing the end-to-end latency and reconstruction quality by running the pipeline on a collaborative mobile edge environment. The overall optimization approach is two-pronged, where: (a) application optimizations introduce data-level parallelism by splitting the pipeline into high-frequency and low-frequency reconstruction components and (b) system optimizations incorporate task-level parallelism into the pipelines by running them opportunistically on available resources with online quality control in order to balance both latency and quality. Our evaluation on a hardware testbed using publicly available datasets shows up to ~54% reduction in latency with negligible loss (~4-7%) in reconstruction quality.
Submitted 3 April, 2023;
originally announced April 2023.
-
SEM-POS: Grammatically and Semantically Correct Video Captioning
Authors:
Asmar Nadeem,
Adrian Hilton,
Robert Dawes,
Graham Thomas,
Armin Mustafa
Abstract:
Generating grammatically and semantically correct captions in video captioning is a challenging task. Captions generated by existing methods are either produced word by word, without aligning with grammatical structure, or miss key information from the input videos. To address these issues, we introduce a novel global-local fusion network, with a Global-Local Fusion Block (GLFB) that encodes and fuses features from different parts of speech (POS) components with visual-spatial features. We use novel combinations of different POS components - 'determinant + subject', 'auxiliary verb', 'verb', and 'determinant + object' - for supervision of the POS blocks - Det + Subject, Aux Verb, Verb, and Det + Object, respectively. The novel global-local fusion network together with POS blocks helps align the visual features with language description to generate grammatically and semantically correct captions. Extensive qualitative and quantitative experiments on benchmark MSVD and MSRVTT datasets demonstrate that the proposed approach generates more grammatically and semantically correct captions compared to the existing methods, achieving the new state-of-the-art. Ablations on the POS blocks and the GLFB demonstrate the impact of the contributions on the proposed method.
Submitted 4 April, 2023; v1 submitted 26 March, 2023;
originally announced March 2023.
-
Wavefront Curvature in Optical Atomic Beam Clocks
Authors:
A. Strathearn,
R. F. Offer,
A. P. Hilton,
E. Klantsataya,
A. N. Luiten,
R. P. Anderson,
B. M. Sparkes,
T. M. Stace
Abstract:
Atomic clocks provide a reproducible basis for our understanding of time and frequency. Recent demonstrations of compact optical clocks, employing thermal atomic beams, have achieved short-term fractional frequency instabilities in the $10^{-16}$ range, competitive with the best international frequency standards available. However, a serious challenge inherent in compact clocks is the necessarily smaller optical beams, which results in rapid variation in interrogating wavefronts. This can cause inhomogeneous excitation of the thermal beam, leading to long-term drifts in the output frequency. Here we develop a model for Ramsey-Bordé interferometry using optical fields with curved wavefronts and simulate the $^{40}$Ca beam clock experiment described in [Olson et al., Phys. Rev. Lett. 123, 073202 (2019)]. Olson et al.'s results had shown surprising and unexplained behaviour in the response of the atoms in the interrogation. Our model predicts signals consistent with experimental data and can account for the significant sensitivity to laser geometry that was reported. We find the signal-to-noise ratio is maximised when the laser is uncollimated at the interrogation zones to minimise inhomogeneity, and also identify an optimal waist size determined by both laser inhomogeneity and the velocity distribution of the atomic beam. We investigate the shifts and stability of the clock frequency, showing that the Gouy phase is the primary source of frequency variations arising from laser geometry.
Submitted 24 January, 2023; v1 submitted 1 December, 2022;
originally announced December 2022.
-
Super-resolution 3D Human Shape from a Single Low-Resolution Image
Authors:
Marco Pesavento,
Marco Volino,
Adrian Hilton
Abstract:
We propose a novel framework to reconstruct super-resolution human shape from a single low-resolution input image. The approach overcomes limitations of existing approaches that reconstruct 3D human shape from a single image, which require high-resolution images together with auxiliary data such as surface normal or a parametric model to reconstruct high-detail shape. The proposed framework represents the reconstructed shape with a high-detail implicit function. Analogous to the objective of 2D image super-resolution, the approach learns the mapping from a low-resolution shape to its high-resolution counterpart and it is applied to reconstruct 3D shape detail from low-resolution images. The approach is trained end-to-end employing a novel loss function which estimates the information lost between a low and high-resolution representation of the same 3D surface shape. Evaluation for single image reconstruction of clothed people demonstrates that our method achieves high-detail surface reconstruction from low-resolution images without auxiliary data. Extensive experiments show that the proposed approach can estimate super-resolution human geometries with a significantly higher level of detail than that obtained with previous approaches when applied to low-resolution images.
Submitted 23 August, 2022;
originally announced August 2022.
-
Visually Supervised Speaker Detection and Localization via Microphone Array
Authors:
Davide Berghi,
Adrian Hilton,
Philip J. B. Jackson
Abstract:
Active speaker detection (ASD) is a multi-modal task that aims to identify who, if anyone, is speaking from a set of candidates. Current audio-visual approaches for ASD typically rely on visually pre-extracted face tracks (sequences of consecutive face crops) and the respective monaural audio. However, their recall rate is often low as only the visible faces are included in the set of candidates. Monaural audio may successfully detect the presence of speech activity but fails in localizing the speaker due to the lack of spatial cues. Our solution extends the audio front-end using a microphone array. We train an audio convolutional neural network (CNN) in combination with beamforming techniques to regress the speaker's horizontal position directly in the video frames. We propose to generate weak labels using a pre-trained active speaker detector on pre-extracted face tracks. Our pipeline embraces the "student-teacher" paradigm, where a trained "teacher" network is used to produce pseudo-labels visually. The "student" network is an audio network trained to generate the same results. At inference, the student network can independently localize the speaker in the visual frames directly from the audio input. Experimental results on newly collected data prove that our approach significantly outperforms a variety of other baselines as well as the teacher network itself. It results in an excellent speech activity detector too.
Submitted 7 March, 2022;
originally announced March 2022.
-
Super-Resolution Appearance Transfer for 4D Human Performances
Authors:
Marco Pesavento,
Marco Volino,
Adrian Hilton
Abstract:
A common problem in the 4D reconstruction of people from multi-view video is the quality of the captured dynamic texture appearance, which depends on both the camera resolution and capture volume. Typically the requirement to frame cameras to capture the volume of a dynamic performance ($>50m^3$) results in the person occupying only a small proportion ($<10\%$) of the field of view. Even with ultra high-definition 4k video acquisition, this results in sampling the person at less than standard-definition (0.5k) video resolution, resulting in low-quality rendering. In this paper we propose a solution to this problem through super-resolution appearance transfer from a static high-resolution appearance capture rig using digital stills cameras ($> 8k$) to capture the person in a small volume ($<8m^3$). A pipeline is proposed for super-resolution appearance transfer from high-resolution static capture to dynamic video performance capture to produce super-resolution dynamic textures. This addresses two key problems: colour mapping between different camera systems; and dynamic texture map super-resolution using a learnt model. Comparative evaluation demonstrates a significant qualitative and quantitative improvement in rendering the 4D performance capture with super-resolution dynamic texture appearance. The proposed approach reproduces the high-resolution detail of the static capture whilst maintaining the appearance dynamics of the captured video.
Submitted 31 August, 2021;
originally announced August 2021.
-
Attention-based Multi-Reference Learning for Image Super-Resolution
Authors:
Marco Pesavento,
Marco Volino,
Adrian Hilton
Abstract:
This paper proposes a novel Attention-based Multi-Reference Super-resolution network (AMRSR) that, given a low-resolution image, learns to adaptively transfer the most similar texture from multiple reference images to the super-resolution output whilst maintaining spatial coherence. The use of multiple reference images together with attention-based sampling is demonstrated to achieve significantly improved performance over state-of-the-art reference super-resolution approaches on multiple benchmark datasets. Reference super-resolution approaches have recently been proposed to overcome the ill-posed problem of image super-resolution by providing additional information from a high-resolution reference image. Multi-reference super-resolution extends this approach by providing a more diverse pool of image features to overcome the inherent information deficit whilst maintaining memory efficiency. A novel hierarchical attention-based sampling approach is introduced to learn the similarity between low-resolution image features and multiple reference images based on a perceptual loss. Ablation demonstrates the contribution of both multi-reference and hierarchical attention-based sampling to overall performance. Perceptual and quantitative ground-truth evaluation demonstrates significant improvement in performance even when the reference images deviate significantly from the target image. The project website can be found at https://marcopesavento.github.io/AMRSR/
Submitted 31 August, 2021;
originally announced August 2021.
-
SyDog: A Synthetic Dog Dataset for Improved 2D Pose Estimation
Authors:
Moira Shooter,
Charles Malleson,
Adrian Hilton
Abstract:
Estimating the pose of animals can facilitate the understanding of animal motion which is fundamental in disciplines such as biomechanics, neuroscience, ethology, robotics and the entertainment industry. Human pose estimation models have achieved high performance due to the huge amount of training data available. Achieving the same results for animal pose estimation is challenging due to the lack of animal pose datasets. To address this problem we introduce SyDog: a synthetic dataset of dogs containing ground truth pose and bounding box coordinates which was generated using the game engine, Unity. We demonstrate that pose estimation models trained on SyDog achieve better performance than models trained purely on real data and significantly reduce the need for the labour intensive labelling of images. We release the SyDog dataset as a training and evaluation benchmark for research in animal motion.
Submitted 31 July, 2021;
originally announced August 2021.
-
Spalling-induced liftoff and transfer of electronic films using a van der Waals release layer
Authors:
Eric W. Blanton,
Michael J. Motala,
Timothy A. Prusnick,
Albert Hilton,
Jeff L. Brown,
Arkka Bhattacharyya,
Sriram Krishnamoorthy,
Kevin Leedy,
Nicholas R. Glavin,
Michael Snure
Abstract:
Heterogeneous integration strategies are increasingly being employed to achieve more compact and capable electronics systems for multiple applications including space, electric vehicles, and wearable and medical devices. To enable new integration strategies, the growth and transfer of thin electronic films and devices, including III-nitrides, metal oxides, and two-dimensional (2D) materials, using 2D boron nitride (BN)-on-sapphire templates is demonstrated. The van der Waals BN layer, in this case, acts as a preferred mechanical release layer for precise separation at the substrate-film interface and leaves a smooth surface suitable for van der Waals bonding. A tensilely-stressed Ni layer sputtered on top of the film induces controlled spalling fracture which propagates at the BN/sapphire interface. By incorporating controlled spalling, the process yield and sensitivity are greatly improved, owing to the greater fracture energy provided by the stressed metal layer relative to a soft tape or rubber stamp. With stress playing a critical role in this process, the influence of residual stress on detrimental cracking and bowing is investigated. Additionally, a selected area lift-off technique is developed which allows for isolation and transfer of individual devices while maximizing wafer area use and minimizing extra alignment steps in the integration process.
Submitted 14 May, 2021;
originally announced May 2021.
-
Multi-person Implicit Reconstruction from a Single Image
Authors:
Armin Mustafa,
Akin Caliskan,
Lourdes Agapito,
Adrian Hilton
Abstract:
We present a new end-to-end learning framework to obtain detailed and spatially coherent reconstructions of multiple people from a single image. Existing multi-person methods suffer from two main drawbacks: they are often model-based and therefore cannot capture accurate 3D models of people with loose clothing and hair; or they require manual intervention to resolve occlusions or interactions. Our method addresses both limitations by introducing the first end-to-end learning approach to perform model-free implicit reconstruction for realistic 3D capture of multiple clothed people in arbitrary poses (with occlusions) from a single image. Our network simultaneously estimates the 3D geometry of each person and their 6DOF spatial locations, to obtain a coherent multi-human reconstruction. In addition, we introduce a new synthetic dataset that depicts images with a varying number of inter-occluded humans and a variety of clothing and hair styles. We demonstrate robust, high-resolution reconstructions on images of multiple humans with complex occlusions, loose clothing and a large variety of poses and scenes. Our quantitative evaluation on both synthetic and real-world datasets demonstrates state-of-the-art performance with significant improvements in the accuracy and completeness of the reconstructions over competing approaches.
Submitted 19 April, 2021;
originally announced April 2021.
-
Temporal Consistency Loss for High Resolution Textured and Clothed 3DHuman Reconstruction from Monocular Video
Authors:
Akin Caliskan,
Armin Mustafa,
Adrian Hilton
Abstract:
We present a novel method to learn temporally consistent 3D reconstruction of clothed people from a monocular video. Recent methods for 3D human reconstruction from monocular video using volumetric, implicit or parametric human shape models, produce per frame reconstructions giving temporally inconsistent output and limited performance when applied to video. In this paper, we introduce an approach to learn temporally consistent features for textured reconstruction of clothed 3D human sequences from monocular video by proposing two advances: a novel temporal consistency loss function; and hybrid representation learning for implicit 3D reconstruction from 2D images and coarse 3D geometry. The proposed advances improve the temporal consistency and accuracy of both the 3D reconstruction and texture prediction from a monocular video. Comprehensive comparative performance evaluation on images of people demonstrates that the proposed method significantly outperforms the state-of-the-art learning-based single image 3D human shape estimation approaches achieving significant improvement of reconstruction accuracy, completeness, quality and temporal consistency.
Submitted 19 April, 2021;
originally announced April 2021.
-
Multi-View Consistency Loss for Improved Single-Image 3D Reconstruction of Clothed People
Authors:
Akin Caliskan,
Armin Mustafa,
Evren Imre,
Adrian Hilton
Abstract:
We present a novel method to improve the accuracy of the 3D reconstruction of clothed human shape from a single image. Recent work has introduced volumetric, implicit and model-based shape learning frameworks for reconstruction of objects and people from one or more images. However, the accuracy and completeness for reconstruction of clothed people is limited due to the large variation in shape resulting from clothing, hair, body size, pose and camera viewpoint. This paper introduces two advances to overcome this limitation: firstly a new synthetic dataset of realistic clothed people, 3DVH; and secondly, a novel multiple-view loss function for training of monocular volumetric shape estimation, which is demonstrated to significantly improve generalisation and reconstruction accuracy. The 3DVH dataset of realistic clothed 3D human models rendered with diverse natural backgrounds is demonstrated to allow transfer to reconstruction from real images of people. Comprehensive comparative performance evaluation on both synthetic and real images of people demonstrates that the proposed method significantly outperforms the previous state-of-the-art learning-based single image 3D human shape estimation approaches achieving significant improvement of reconstruction accuracy, completeness, and quality. An ablation study shows that this is due to both the proposed multiple-view training and the new 3DVH dataset. The code and the dataset can be found at the project website: https://akincaliskan3d.github.io/MV3DH/.
Submitted 29 September, 2020;
originally announced September 2020.
-
Spectral Analysis Network for Deep Representation Learning and Image Clustering
Authors:
Jinghua Wang,
Adrian Hilton,
Jianmin Jiang
Abstract:
Deep representation learning is a crucial procedure in multimedia analysis and attracts increasing attention. Most of the popular techniques rely on convolutional neural networks and require a large amount of labeled data in the training procedure. However, it is time consuming or even impossible to obtain the label information in some tasks due to cost limitations. Thus, it is necessary to develop unsupervised deep representation learning techniques. This paper proposes a new network structure for unsupervised deep representation learning based on spectral analysis, which is a popular technique with solid theoretical foundations. Compared with the existing spectral analysis methods, the proposed network structure has at least three advantages. Firstly, it can identify the local similarities among images at patch level and is thus more robust against occlusion. Secondly, through multiple consecutive spectral analysis procedures, the proposed network can learn more clustering-friendly representations and is capable of revealing the deep correlations among data samples. Thirdly, it can elegantly integrate different spectral analysis procedures, so that each spectral analysis procedure can have its individual strengths in dealing with different data sample distributions. Extensive experimental results show the effectiveness of the proposed methods on various image clustering tasks.
Submitted 11 September, 2020;
originally announced September 2020.
-
Bounds Related to The Edge-List Chromatic and Total Chromatic Numbers of a Simple Graph
Authors:
M. Henderson,
A. J. W. Hilton,
R. Mary Jeya Jothi
Abstract:
We show that for a simple graph $G$, $c'(G)\leqΔ(G)+2$ where $c'(G)$ is the choice index (or edge-list chromatic number) of $G$, and $Δ(G)$ is the maximum degree of $G$.
As a simple corollary of this result, we show that the total chromatic number $χ_T(G)$ of a simple graph satisfies the inequality $χ_T(G)\leq\ Δ(G)+4$ and the total choice number $c_T(G)$ also satisfies this inequality.
We also relate these bounds to the Hall index and the Hall condition index of a simple graph, and to the total Hall number and the total Hall condition number of a simple graph.
Submitted 8 March, 2022; v1 submitted 4 April, 2020;
originally announced April 2020.
-
Audio-Visual Spatial Aligment Requirements of Central and Peripheral Object Events
Authors:
Davide Berghi,
Hanne Stenzel,
Marco Volino,
Adrian Hilton,
Philip J. B. Jackson
Abstract:
Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audiovisual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results in a simulated virtual environment from 18 participants indicate a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Centered stimulus results support previous findings for simpler scenes.
Submitted 14 March, 2020;
originally announced March 2020.
-
Transferrable AlGaN/GaN HEMTs to Arbitrary Substrates via a Two-dimensional Boron Nitride Release Layer
Authors:
Michael J. Motala,
Eric Blanton,
Al Hilton,
Eric Heller,
Chris Muratore,
Katherine Burzynski,
Jeff Brown,
Kelson Chabak,
Michael Durstock,
Michael Snure,
Nicholas Glavin
Abstract:
Mechanical transfer of high performing thin film devices onto arbitrary substrates represents an exciting opportunity to improve device performance, explore non-traditional manufacturing approaches, and paves the way for soft, conformal, and flexible electronics. Using a two-dimensional (2D) boron nitride (BN) release layer, we demonstrate the transfer of AlGaN/GaN high-electron mobility transistors (HEMTs) to arbitrary substrates through both direct van der Waals (vdW) bonding and with a polymer adhesive interlayer. No device degradation was observed due to the transfer process, and a significant reduction in device temperature (327 °C to 132 °C at 600 mW) was observed when directly bonded to a silicon carbide (SiC) wafer relative to the starting wafer. With the use of a benzocyclobutene (BCB) adhesion interlayer, devices were easily transferred and characterized on Kapton and ceramic films, representing an exciting opportunity for integration onto arbitrary substrates. Upon reduction of this polymer adhesive layer thickness, the AlGaN/GaN HEMTs transferred onto a BCB/SiC substrate resulted in comparable peak temperatures during operation at powers as high as 600 mW to the as-grown wafer, revealing that by optimizing interlayer characteristics such as thickness and thermal conductivity, transferrable devices on polymer layers can still improve performance outputs.
Submitted 7 February, 2020;
originally announced February 2020.
-
Gemini: A Functional Programming Language for Hardware Description
Authors:
Aditya Srinivasan,
Andrew D. Hilton
Abstract:
This paper presents Gemini, a functional programming language for hardware description that provides features such as parametric polymorphism, recursive datatypes, higher-order functions, and type inference for higher expressivity compared to modern hardware description languages. Gemini demonstrates the theory and implementation of novel type-theoretical concepts through its unique type system consisting of multiple atomic kinds and dependent types, which allows the language to model both software and hardware constructs safely and perform type inference through multi-staged compilation. The primary technical results of this paper include formalizations of the Gemini grammar, typing rules, and evaluation rules, a proof of safety of Gemini's type system, and a prototype implementation of the compiler's semantic analysis phase.
Submitted 10 November, 2019;
originally announced November 2019.
-
Light-shift spectroscopy of optically trapped atomic ensembles
Authors:
Ashby P. Hilton,
Andre N. Luiten,
Philip S. Light
Abstract:
We develop a method for extracting the physical parameters of interest for a dipole trapped cold atomic ensemble. This technique uses the spatially dependent ac-Stark shift of the trap itself to project the atomic distribution onto a light-shift broadened transmission spectrum. We develop a model that connects the atomic distribution with the expected transmission spectrum. We then demonstrate the utility of the technique by deriving the temperature, trap depth, lifetime, and trapped atom number from data that was taken in a single shot experimental measurement.
Submitted 6 November, 2019;
originally announced November 2019.
-
Learning Dense Wide Baseline Stereo Matching for People
Authors:
Akin Caliskan,
Armin Mustafa,
Evren Imre,
Adrian Hilton
Abstract:
Existing methods for stereo work on narrow baseline image pairs, giving limited performance between wide baseline views. This paper proposes a framework to learn and estimate dense stereo for people from wide baseline image pairs. A synthetic people stereo patch dataset (S2P2) is introduced to learn wide baseline dense stereo matching for people. The proposed framework not only learns human-specific features from synthetic data but also exploits a pooling layer and data augmentation to adapt to real data. The network learns from the human-specific stereo patches from the proposed dataset for wide-baseline stereo estimation. In addition to patch match learning, a stereo constraint is introduced in the framework to solve wide baseline stereo reconstruction of humans. Quantitative and qualitative performance evaluation of the proposed method against state-of-the-art methods demonstrates improved wide baseline stereo reconstruction on challenging datasets. We show that it is possible to learn stereo matching from a synthetic people dataset and improve performance on real datasets for stereo reconstruction of people from narrow and wide baseline stereo data.
Submitted 2 October, 2019;
originally announced October 2019.
-
Heterodyne fiber interferometer for frequency-noise reduction and rapid wide-band tunability of a conventional laser source
Authors:
Ashby P. Hilton,
Philip S. Light,
Lauris J. B. Talbot,
Andre N. Luiten
Abstract:
Self-heterodyne fiber interferometers have been shown to be capable of stabilizing lasers to ultra-narrow linewidths and present an excellent alternative to high finesse cavities for frequency stabilization. In addition to suppressing frequency noise, these devices are highly tunable, and can be manipulated to produce high speed frequency sweeps over the entire range of the laser. We present an analytic approach for choosing a delay-line length for both optimal noise suppression and highest in-loop frequency sweep rate. Using this model we have implemented a fiber-based active Michelson interferometer as a frequency discriminator for a conventional diode laser and demonstrated a linewidth of 700 Hz over millisecond timescales. We also demonstrate a frequency scan rate of 1 THz/s and independently measure the maximum deviation in frequency from the linear sweep to be 100 kHz, predominantly limited by acoustic resonances in the apparatus.
Submitted 5 September, 2019;
originally announced September 2019.
-
Semantic Estimation of 3D Body Shape and Pose using Minimal Cameras
Authors:
Andrew Gilbert,
Matthew Trumble,
Adrian Hilton,
John Collomosse
Abstract:
We aim to simultaneously estimate the 3D articulated pose and high fidelity volumetric occupancy of human performance, from multiple viewpoint video (MVV) with as few as two views. We use a multi-channel symmetric 3D convolutional encoder-decoder with a dual loss to enforce the learning of a latent embedding that enables inference of skeletal joint positions and a volumetric reconstruction of the performance. The inference is regularised via a prior learned over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions, and we show this to generalise well across unseen subjects and actions. We demonstrate improved reconstruction accuracy and lower pose estimation error relative to prior work on two MVV performance capture datasets: Human 3.6M and TotalCapture.
Submitted 7 September, 2020; v1 submitted 8 August, 2019;
originally announced August 2019.
-
EdgeNet: Semantic Scene Completion from a Single RGB-D Image
Authors:
Aloisio Dourado,
Teofilo Emidio de Campos,
Hansung Kim,
Adrian Hilton
Abstract:
Semantic scene completion is the task of predicting a complete 3D representation of volumetric occupancy with corresponding semantic labels for a scene from a single point of view. Previous works on Semantic Scene Completion from RGB-D data used either only depth or depth with colour by projecting the 2D image into the 3D volume resulting in a sparse data representation. In this work, we present a new strategy to encode colour information in 3D space using edge detection and flipped truncated signed distance. We also present EdgeNet, a new end-to-end neural network architecture capable of handling features generated from the fusion of depth and edge information. Experimental results show improvement of 6.9% over the state-of-the-art result on real data, for end-to-end approaches.
Submitted 6 September, 2020; v1 submitted 7 August, 2019;
originally announced August 2019.
-
U4D: Unsupervised 4D Dynamic Scene Understanding
Authors:
Armin Mustafa,
Chris Russell,
Adrian Hilton
Abstract:
We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (approx 40%) improvement in semantic segmentation, reconstruction and scene flow accuracy.
Submitted 23 July, 2019;
originally announced July 2019.
-
Temporally Coherent General Dynamic Scene Reconstruction
Authors:
Armin Mustafa,
Marco Volino,
Hansung Kim,
Jean-Yves Guillemaut,
Adrian Hilton
Abstract:
Existing techniques for dynamic scene reconstruction from multiple wide-baseline cameras primarily focus on reconstruction in controlled environments, with fixed calibrated cameras and strong prior constraints. This paper introduces a general approach to obtain a 4D representation of complex dynamic scenes from multi-view wide-baseline static or moving cameras without prior knowledge of the scene structure, appearance, or illumination. Contributions of the work are: An automatic method for initial coarse reconstruction to initialize joint estimation; Sparse-to-dense temporal correspondence integrated with joint multi-view segmentation and reconstruction to introduce temporal coherence; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes by introducing shape constraint. Comparison with state-of-the-art approaches on a variety of complex indoor and outdoor scenes, demonstrates improved accuracy in both multi-view segmentation and dense reconstruction. This paper demonstrates unsupervised reconstruction of complete temporally coherent 4D scene models with improved non-rigid object segmentation and shape reconstruction and its application to free-viewpoint rendering and virtual reality.
Submitted 3 August, 2020; v1 submitted 18 July, 2019;
originally announced July 2019.
-
The simple graph threshold number $σ(r,s,a,t)$
Authors:
A. J. W. Hilton,
A. Rajkumar
Abstract:
For $d \ge 1$, $s \ge 0$ a $(d, d+s)$-{\em graph} is a graph whose degrees all lie in the interval $\{d, d+1, \ldots, d + s\}$. For $r \ge 1$, $a \ge 0$, an $(r, r+a)$-{\em factor} of a graph $G$ is a spanning $(r, r+a)$-subgraph of $G$. An $(r, r+a)$-{\em factorization} of a graph $G$ is a decomposition of $G$ into edge-disjoint $(r, r+a)$-factors. A graph is $(r, r+a)$-{\em factorable} if it has an $(r, r+a)$-factorization.
Let $σ(r, s, a, t)$ be the least integer such that, if $d \ge σ(r, s, a, t)$, then every $(d, d+s)$-simple graph $G$ is $(r,r+a)$-factorable with $x$ factors for at least $t$ different values of $x$.
In this paper we evaluate $σ(r,s,a,t)$ for all values of $r, s, a$ and $t$. We also show that if $a \ge 2$ and $r \ge 1$, then, when $r$ is even and $a$ is odd, every $(d, d+s)$-simple graph $G$ has an $(r, r+a)$-factorization with $x$ factors if and only if $$ \frac{d+s}{r+a}\, < x \le \frac{d}{r}\,,$$ and we prove similar statements for other parities of $r$ and $a$.
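As a rough illustration of the closing inequality (a sketch written for this listing, not code from the paper), the admissible numbers of factors $x$ for the case stated in full in the abstract ($r$ even, $a$ odd, $a \ge 2$) can be enumerated directly from $\frac{d+s}{r+a} < x \le \frac{d}{r}$. The function name `feasible_factor_counts` is hypothetical, and the other parity cases, which the abstract only mentions, are deliberately not covered.

```python
def feasible_factor_counts(d: int, s: int, r: int, a: int) -> list[int]:
    """List the integers x with (d + s)/(r + a) < x <= d/r.

    Only the case stated in full in the abstract is handled:
    r even, a odd, a >= 2. Other parities are rejected rather than guessed.
    """
    if d < 1 or s < 0 or r < 1 or a < 2:
        raise ValueError("expected d >= 1, s >= 0, r >= 1, a >= 2")
    if r % 2 != 0 or a % 2 != 1:
        raise ValueError("only the case r even, a odd is covered by this sketch")
    # Integer arithmetic avoids floating-point rounding:
    #   x <= d/r            is equivalent to  x * r <= d
    #   (d+s)/(r+a) < x     is equivalent to  x * (r + a) > d + s
    return [x for x in range(1, d // r + 1) if x * (r + a) > d + s]


if __name__ == "__main__":
    # d = 12, s = 1, r = 2, a = 3: 13/5 < x <= 6
    print(feasible_factor_counts(12, 1, 2, 3))
```

For $d = 12$, $s = 1$, $r = 2$, $a = 3$ the bounds are $2.6 < x \le 6$, so the admissible values are $3, 4, 5, 6$; when the lower bound exceeds the upper bound the list is empty.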
Submitted 14 February, 2019;
originally announced February 2019.
-
Dual-colour magic-wavelength trap for suppression of light shifts in atoms
Authors:
Ashby P. Hilton,
Christopher Perrella,
Andre N. Luiten,
Philip S. Light
Abstract:
We present an optical approach to compensating for spatially varying ac-Stark shifts that appear on atomic ensembles subject to strong optical control or trapping fields. The introduction of an additional weak light field produces an intentional perturbation between atomic states that is tuned to suppress the influence of the strong field. The compensation field suppresses sensitivity in one of the transition frequencies of the trapped atoms to both the atomic distribution and motion. We demonstrate this technique in a cold rubidium ensemble and show a reduction in inhomogeneous broadening in the trap. This two-colour approach emulates the magic trapping approach that is used in modern atomic lattice clocks but provides greater flexibility in choice of atomic species, probe transition, and trap wavelength.
Submitted 11 November, 2018;
originally announced November 2018.
-
Volumetric performance capture from minimal camera viewpoints
Authors:
Andrew Gilbert,
Marco Volino,
John Collomosse,
Adrian Hilton
Abstract:
We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count.
Submitted 10 July, 2018; v1 submitted 5 July, 2018;
originally announced July 2018.
-
Deep Autoencoder for Combined Human Pose Estimation and body Model Upscaling
Authors:
Matthew Trumble,
Andrew Gilbert,
Adrian Hilton,
John Collomosse
Abstract:
We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views. We train a symmetric convolutional autoencoder with a dual loss that enforces learning of a latent representation that encodes skeletal joint positions, and at the same time learns a deep representation of volumetric body shape. We harness the latter to up-scale input volumetric data by a factor of $4 \times$, whilst recovering a 3D estimate of joint positions with equal or greater accuracy than the state of the art. Inference runs in real-time (25 fps) and has the potential for passive human behaviour monitoring where there is a requirement for high fidelity estimation of human body shape and pose.
Submitted 4 July, 2018;
originally announced July 2018.
-
4D Temporally Coherent Light-field Video
Authors:
Armin Mustafa,
Marco Volino,
Jean-yves Guillemaut,
Adrian Hilton
Abstract:
Light-field video has recently been used in virtual and augmented reality applications to increase realism and immersion. However, existing light-field methods are generally limited to static scenes due to the requirement to acquire a dense scene representation. The large amount of data and the absence of methods to infer temporal coherence pose major challenges in storage, compression and editing compared to conventional video. In this paper, we propose the first method to extract a spatio-temporally coherent light-field video representation. A novel method to obtain Epipolar Plane Images (EPIs) from a sparse light-field camera array is proposed. EPIs are used to constrain scene flow estimation to obtain 4D temporally coherent representations of dynamic light-fields. Temporal coherence is achieved on a variety of light-field datasets. Evaluation of the proposed light-field scene flow against existing multi-view dense correspondence approaches demonstrates a significant improvement in accuracy of temporal coherence.
Submitted 30 April, 2018;
originally announced April 2018.
-
High-efficiency cold-atom transport into a waveguide trap
Authors:
Ashby P. Hilton,
Christopher Perrella,
Fetah Benabid,
Ben M. Sparkes,
Andre N. Luiten,
Philip S. Light
Abstract:
We have developed and characterized an atom-guiding technique that loads $3\times10^6$ cold rubidium atoms into hollow-core optical fibre, an order-of-magnitude larger than previously reported results. This result was possible because it was guided by a physically realistic simulation that could provide the specifications for loading efficiencies of 3% and a peak optical depth of 600. The simulation further showed that the demonstrated loading efficiency is limited solely by the geometric overlap of the atom cloud and the optical guide beam, and is thus open to further improvement with experimental modification. The experimental arrangement allows observation of the real-time effects of light-assisted cold atom collisions and background gas collisions by tracking the dynamics of the cold atom cloud as it falls into the fibre. The combination of these observations, and physical understanding from the simulation, allows estimation of the limits to loading cold atoms into hollow-core fibres.
Submitted 30 October, 2018; v1 submitted 14 February, 2018;
originally announced February 2018.
-
Semantic Scene Completion Combining Colour and Depth: preliminary experiments
Authors:
Andre Bernardes Soares Guedes,
Teofilo Emidio de Campos,
Adrian Hilton
Abstract:
Semantic scene completion is the task of producing a complete 3D voxel representation of volumetric occupancy with semantic labels for a scene from a single-view observation. We built upon the recent work of Song et al. (CVPR 2017), who proposed SSCnet, a method that performs scene completion and semantic labelling in a single end-to-end 3D convolutional network. SSCnet uses only depth maps as input, even though depth maps are usually obtained from devices that also capture colour information, such as RGBD sensors and stereo cameras. In this work, we investigate the potential of the RGB colour channels to improve SSCnet.
Submitted 13 February, 2018;
originally announced February 2018.
-
Object-Based Audio Rendering
Authors:
Philip Jackson,
Filippo Fazi,
Frank Melchior,
Trevor Cox,
Adrian Hilton,
Chris Pike,
Jon Francombe,
Andreas Franck,
Philip Coleman,
Dylan Menzies-Gow,
James Woodcock,
Yan Tang,
Qingju Liu,
Rick Hughes,
Marcos Simon Galvez,
Teo de Campos,
Hansung Kim,
Hanne Stenzel
Abstract:
Apparatus and methods are disclosed for performing object-based audio rendering on a plurality of audio objects which define a sound scene, each audio object comprising at least one audio signal and associated metadata. The apparatus comprises: a plurality of renderers each capable of rendering one or more of the audio objects to output rendered audio data; and object adapting means for adapting one or more of the plurality of audio objects for a current reproduction scenario, the object adapting means being configured to send the adapted one or more audio objects to one or more of the plurality of renderers.
Submitted 23 August, 2017;
originally announced August 2017.
-
Drift-compensated Low-noise Frequency Synthesis Based on a cryoCSO for the KRISS-F1
Authors:
Myoung-Sun Heo,
Sang Eon Park,
Won-Kyu Lee,
Sang-Bum Lee,
Hyun-Gue Hong,
Taeg Yong Kwon,
Chang Yong Park,
Dai-Hyuk Yu,
G. Santarelli,
Ashby Hilton,
Andre N. Luiten,
John G. Hartnett
Abstract:
In this paper we report on the implementation and stability analysis of a drift-compensated frequency synthesizer from a cryogenic sapphire oscillator (CSO) designed for a Cs/Rb atomic fountain clock. The synthesizer has two microwave outputs of 7 GHz and 9 GHz for Rb and Cs atom interrogation, respectively. The short-term stability of these microwave signals, measured using an optical frequency comb locked to an ultra-stable laser, is better than $5\times10^{-15}$ at an averaging time of 1 s. We demonstrate that the short-term stability of the synthesizer is lower than the quantum projection noise limit of the Cs fountain clock, KRISS-F1(Cs), by measuring the short-term stability of the fountain with varying trapped atom number. The stability of the fountain at 1-s averaging time reaches $2.5\times10^{-14}$ at the highest atom number in the experiment when the synthesizer is used as an interrogation oscillator of the fountain. To compensate for the frequency drift of the CSO, the output frequency of a waveform generator in the synthesis chain is ramped linearly. By doing this, the stability of the synthesizer at an averaging time of one hour reaches a level of $10^{-16}$, as measured with the fountain clock.
Submitted 6 October, 2016;
originally announced October 2016.
-
TREES: A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization
Authors:
Blake A. Hechtman,
Andrew D. Hilton,
Daniel J. Sorin
Abstract:
We have developed a task-parallel runtime system, called TREES, that is designed for high performance on CPU/GPU platforms. On platforms with multiple CPUs, Cilk's "work-first" principle underlies how task-parallel applications can achieve performance, but work-first is a poor fit for GPUs. We build upon work-first to create the "work-together" principle that addresses the specific strengths and weaknesses of GPUs. The work-together principle extends work-first by stating that (a) the overhead on the critical path should be paid by the entire system at once and (b) work overheads should be paid co-operatively. We have implemented the TREES runtime in OpenCL, and we experimentally evaluate TREES applications on a CPU/GPU platform.
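As a rough illustration of the idea that scheduling overhead can be paid once per epoch for all ready tasks, the toy sketch below runs tasks in bulk-synchronous epochs. It is not the TREES runtime or its API: the names `run_epochs` and `count_leaves` are hypothetical, and the sequential loop only stands in for what would be a single data-parallel launch on a GPU.

```python
def run_epochs(initial_tasks):
    """Run tasks in bulk epochs; each task returns (value, spawned_tasks).

    Toy model only: all tasks ready in an epoch are processed together,
    and tasks they spawn are deferred to the next epoch.
    """
    total = 0
    epochs = 0
    current = list(initial_tasks)
    while current:
        next_epoch = []
        for task in current:  # stands in for one bulk launch over all ready tasks
            value, spawned = task()
            total += value
            next_epoch.extend(spawned)
        current = next_epoch  # one synchronization point per epoch
        epochs += 1
    return total, epochs


def count_leaves(depth):
    """Build a task for a binary-tree node: leaves yield 1, inner nodes fork two children."""
    def task():
        if depth == 0:
            return 1, []
        return 0, [count_leaves(depth - 1), count_leaves(depth - 1)]
    return task


total, epochs = run_epochs([count_leaves(3)])
print(total, epochs)
```

In this sketch the number of synchronization points grows with the depth of the task tree rather than with the number of tasks, which is the property the bulk-epoch structure is meant to show.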
Submitted 1 August, 2016;
originally announced August 2016.
-
Temporally coherent 4D reconstruction of complex dynamic scenes
Authors:
Armin Mustafa,
Hansung Kim,
Jean-Yves Guillemaut,
Adrian Hilton
Abstract:
This paper presents an approach for reconstruction of 4D temporally coherent models of complex dynamic scenes. No prior knowledge is required of scene structure or camera calibration, allowing reconstruction from multiple moving cameras. Sparse-to-dense temporal correspondence is integrated with joint multi-view segmentation and reconstruction to obtain a complete 4D representation of static and dynamic objects. Temporal coherence is exploited to overcome visual ambiguities, resulting in improved reconstruction of complex scenes. Robust joint segmentation and reconstruction of dynamic objects is achieved by introducing a geodesic star convexity constraint. Comparative evaluation is performed on a variety of unstructured indoor and outdoor dynamic scenes with hand-held cameras and multiple people. This demonstrates reconstruction of complete temporally coherent 4D scene models with improved non-rigid object segmentation and shape reconstruction.
Submitted 28 March, 2016; v1 submitted 10 March, 2016;
originally announced March 2016.
-
General Dynamic Scene Reconstruction from Multiple View Video
Authors:
Armin Mustafa,
Hansung Kim,
Jean-Yves Guillemaut,
Adrian Hilton
Abstract:
This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge or limiting constraints on the scene structure, appearance, or illumination. Existing techniques for dynamic scene reconstruction from multiple wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and background is known. These approaches are not robust for general dynamic scenes captured with sparse moving cameras. Previous approaches for outdoor dynamic scene reconstruction assume prior knowledge of the static background appearance and structure. The primary contributions of this paper are twofold: an automatic method for initial coarse dynamic scene segmentation and reconstruction without prior knowledge of background appearance or structure; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes from multiple wide-baseline static or moving cameras. Evaluation is performed on a variety of indoor and outdoor scenes with cluttered backgrounds and multiple dynamic non-rigid objects such as people. Comparison with state-of-the-art approaches demonstrates improved accuracy in both multiple view segmentation and dense reconstruction. The proposed approach also eliminates the requirement for prior knowledge of scene structure and appearance.
Submitted 30 September, 2015;
originally announced September 2015.
-
Hall's Condition for Partial Latin Squares
Authors:
A. J. W. Hilton,
E. R. Vaughan
Abstract:
Hall's Condition is a necessary condition for a partial latin square to be completable. Hilton and Johnson showed that for a partial latin square whose filled cells form a rectangle, Hall's Condition is equivalent to Ryser's Condition, which is a necessary and sufficient condition for completability.
We give what could be regarded as an extension of Ryser's Theorem, by showing that for a partial latin square whose filled cells form a rectangle, where there is at most one empty cell in each column of the rectangle, Hall's Condition is a necessary and sufficient condition for completability.
It is well-known that the problem of deciding whether a partial latin square is completable is NP-complete. We show that the problem of deciding whether a partial latin square that is promised to satisfy Hall's Condition is completable is NP-hard.
Submitted 13 July, 2011;
originally announced July 2011.
-
An analogue of Ryser's Theorem for partial Sudoku squares
Authors:
P. J. Cameron,
A. J. W. Hilton,
E. R. Vaughan
Abstract:
In 1956 Ryser gave a necessary and sufficient condition for a partial latin rectangle to be completable to a latin square. In 1990 Hilton and Johnson showed that Ryser's condition could be reformulated in terms of Hall's Condition for partial latin squares. Thus Ryser's Theorem can be interpreted as saying that any partial latin rectangle $R$ can be completed if and only if $R$ satisfies Hall's Condition for partial latin squares.
We define Hall's Condition for partial Sudoku squares and show that Hall's Condition for partial Sudoku squares gives a criterion for the completion of partial Sudoku rectangles that is both necessary and sufficient. In the particular case where $n=pq$, $p|r$, $q|s$, the result is especially simple, as we show that any $r \times s$ partial $(p,q)$-Sudoku rectangle can be completed (no further condition being necessary).
Submitted 14 July, 2011; v1 submitted 13 July, 2011;
originally announced July 2011.