-
LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model
Authors:
Yixuan Yang,
Junru Lu,
Zixiang Zhao,
Zhen Luo,
James J. Q. Yu,
Victor Sanchez,
Feng Zheng
Abstract:
Designing 3D indoor layouts is a crucial task with significant applications in virtual reality, interior design, and automated space planning. Existing methods for 3D layout design either rely on diffusion models, which utilize spatial relationship priors, or heavily leverage the inferential capabilities of proprietary Large Language Models (LLMs), which require extensive prompt engineering and in-context exemplars via black-box trials. These methods often face limitations in generalization and dynamic scene editing. In this paper, we introduce LLplace, a novel 3D indoor scene layout designer based on a lightweight, fine-tuned open-source LLM, Llama3. LLplace circumvents the need for spatial relationship priors and in-context exemplars, enabling efficient and credible room layout generation based solely on user inputs specifying the room type and desired objects. We curated a new dialogue dataset based on the 3D-Front dataset, expanding the original data volume and incorporating dialogue data for adding and removing objects. This dataset can enhance the LLM's spatial understanding. Furthermore, through dialogue, LLplace activates the LLM's capability to understand 3D layouts and perform dynamic scene editing, enabling the addition and removal of objects. Our approach demonstrates that LLplace can effectively generate and edit 3D indoor layouts interactively and outperform existing methods in delivering high-quality 3D design solutions. Code and dataset will be released.
Submitted 6 June, 2024;
originally announced June 2024.
-
Beyond Isolated Frames: Enhancing Sensor-Based Human Activity Recognition through Intra- and Inter-Frame Attention
Authors:
Shuai Shao,
Yu Guan,
Victor Sanchez
Abstract:
Human Activity Recognition (HAR) has become increasingly popular with ubiquitous computing, driven by the popularity of wearable sensors in fields like healthcare and sports. While Convolutional Neural Networks (ConvNets) have significantly contributed to HAR, they often adopt a frame-by-frame analysis, concentrating on individual frames and potentially overlooking the broader temporal dynamics inherent in human activities. To address this, we propose the intra- and inter-frame attention model. This model captures both the nuances within individual frames and the broader contextual relationships across multiple frames, offering a comprehensive perspective on sequential data. We further enrich the temporal understanding by proposing a novel time-sequential batch learning strategy. This learning strategy preserves the chronological sequence of time-series data within each batch, ensuring the continuity and integrity of temporal patterns in sensor-based HAR.
Submitted 21 May, 2024;
originally announced May 2024.
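The abstract above does not spell out how the time-sequential batch learning strategy is implemented. As a minimal sketch, assuming it amounts to forming batches from contiguous, unshuffled runs of frames, the idea could look like the following. The function name `time_sequential_batches` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def time_sequential_batches(frames: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive, non-shuffled batches so each batch keeps chronological order.

    Assumption: `frames` is already sorted by time.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(frames), batch_size):
        # Slicing preserves the original (chronological) order inside the batch.
        yield list(frames[start:start + batch_size])


# Toy example: ten frames identified by their time index.
frames = list(range(10))
batches = list(time_sequential_batches(frames, 4))
print(batches)
```

The contrast is with the common practice of shuffling samples before batching, which would break the temporal continuity the abstract says the strategy preserves.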
-
WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting
Authors:
Olly Styles,
Sam Miller,
Patricio Cerda-Mardini,
Tanaya Guha,
Victor Sanchez,
Bertie Vidgen
Abstract:
We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple actions. If a task has been successfully executed, one (or more) of the database values may change. The correct outcome for each task is unique and unambiguous, which allows for robust, automated evaluation. We call this key contribution outcome-centric evaluation. We evaluate five existing ReAct agents on WorkBench, finding they successfully complete as few as 3% of tasks (Llama2-70B), and just 43% for the best-performing (GPT-4). We further find that agents' errors can result in the wrong action being taken, such as an email being sent to the wrong person. WorkBench reveals weaknesses in agents' ability to undertake common business activities, raising questions about their use in high-stakes workplace settings. WorkBench is publicly available as a free resource at https://github.com/olly-styles/WorkBench.
Submitted 1 May, 2024;
originally announced May 2024.
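Outcome-centric evaluation, as described in the abstract above, judges an agent by the database state it leaves behind rather than by the actions it took. The sketch below illustrates that idea only; it is not the WorkBench code. The `Database` type, the `outcome_matches` function, and the sample email records are all hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: table name -> list of row dictionaries.
Database = Dict[str, List[dict]]


def _row_key(row: dict) -> str:
    # Order-independent, type-safe sort key for a row.
    return repr(sorted(row.items()))


def outcome_matches(actual: Database, expected: Database) -> bool:
    """Return True if both databases hold the same rows in every table.

    Row order within a table is ignored; only the final state matters.
    """
    if actual.keys() != expected.keys():
        return False
    for table in expected:
        if sorted(actual[table], key=_row_key) != sorted(expected[table], key=_row_key):
            return False
    return True


# Expected outcome of a task: one email sent to Alice.
expected = {"emails": [{"to": "alice@example.com", "subject": "Report"}]}
# An agent that emailed the wrong person produces a different final state.
wrong = {"emails": [{"to": "bob@example.com", "subject": "Report"}]}

print(outcome_matches(expected, expected))  # True
print(outcome_matches(wrong, expected))     # False
```

Because the correct outcome is unique, a check of this kind can be automated without inspecting the agent's intermediate reasoning or tool calls.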
-
CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification
Authors:
Yiming Ma,
Victor Sanchez,
Tanaya Guha
Abstract:
The CLIP (Contrastive Language-Image Pretraining) model has exhibited outstanding performance in recognition problems, such as zero-shot image classification and object detection. However, its ability to count remains understudied due to the inherent challenges of transforming counting--a regression task--into a recognition task. In this paper, we investigate CLIP's potential in counting, focusing specifically on estimating crowd sizes. Existing classification-based crowd-counting methods have encountered issues, including inappropriate discretization strategies, which impede the application of CLIP and result in suboptimal performance. To address these challenges, we propose the Enhanced Blockwise Classification (EBC) framework. In contrast to previous methods, EBC relies on integer-valued bins that facilitate the learning of robust decision boundaries. Within our model-agnostic EBC framework, we introduce CLIP-EBC, the first fully CLIP-based crowd-counting model capable of generating density maps. Comprehensive evaluations across diverse crowd-counting datasets demonstrate the state-of-the-art performance of our methods. Particularly, EBC can improve existing models by up to 76.9%. Moreover, our CLIP-EBC model surpasses current crowd-counting methods, achieving mean absolute errors of 55.0 and 6.3 on ShanghaiTech part A and part B datasets, respectively. The code will be made publicly available.
Submitted 14 March, 2024;
originally announced March 2024.
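To make the notion of integer-valued bins concrete, the sketch below maps a per-block count to a class index using bins whose boundaries are integers, which turns count regression into classification. The bin boundaries in `BINS` and the function name `count_to_bin` are illustrative assumptions; the abstract does not give the discretization actually used by EBC.

```python
import math
from typing import List, Tuple

# Hypothetical integer-valued bins: each bin is an inclusive [low, high] range of counts.
BINS: List[Tuple[int, float]] = [(0, 0), (1, 1), (2, 2), (3, 4), (5, math.inf)]


def count_to_bin(count: int, bins: List[Tuple[int, float]] = BINS) -> int:
    """Return the index of the bin that contains `count`."""
    for index, (low, high) in enumerate(bins):
        if low <= count <= high:
            return index
    raise ValueError(f"count {count} is not covered by any bin")


# Per-block counts become class labels for a classifier.
labels = [count_to_bin(c) for c in [0, 1, 4, 9]]
print(labels)
```

With integer boundaries, every possible count falls unambiguously inside one bin, so no count sits on a boundary between two classes.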
-
Detecting Face Synthesis Using a Concealed Fusion Model
Authors:
Roberto Leyva,
Victor Sanchez,
Gregory Epiphaniou,
Carsten Maple
Abstract:
Face image synthesis is gaining more attention in computer security due to concerns about its potential negative impacts, including those related to fake biometrics. Hence, building models that can detect the synthesized face images is an important challenge to tackle. In this paper, we propose a fusion-based strategy to detect face image synthesis while providing resiliency to several attacks. The proposed strategy uses a late fusion of the outputs computed by several undisclosed models by relying on random polynomial coefficients and exponents to conceal a new feature space. Unlike existing concealing solutions, our strategy requires no quantization, which helps to preserve the feature space. Our experiments reveal that our strategy achieves state-of-the-art performance while providing protection against poisoning, perturbation, backdoor, and reverse model attacks.
Submitted 8 January, 2024;
originally announced January 2024.
-
Data-Agnostic Face Image Synthesis Detection Using Bayesian CNNs
Authors:
Roberto Leyva,
Victor Sanchez,
Gregory Epiphaniou,
Carsten Maple
Abstract:
Face image synthesis detection is gaining considerable attention because of the potential negative impact on society that this type of synthetic data brings. In this paper, we propose a data-agnostic solution to detect the face image synthesis process. Specifically, our solution is based on an anomaly detection framework that requires only real data to learn the inference process. It is therefore data-agnostic in the sense that it requires no synthetic face images. The solution uses the posterior probability with respect to the reference data to determine if new samples are synthetic or not. Our evaluation results using different synthesizers show that our solution is very competitive against the state-of-the-art, which requires synthetic data for training.
Submitted 8 January, 2024;
originally announced January 2024.
-
Cross-Age Contrastive Learning for Age-Invariant Face Recognition
Authors:
Haoyi Wang,
Victor Sanchez,
Chang-Tsun Li
Abstract:
Cross-age facial images are typically challenging and expensive to collect, making noise-free age-oriented datasets relatively small compared to widely-used large-scale facial datasets. Additionally, in real scenarios, images of the same subject at different ages are usually hard or even impossible to obtain. Both of these factors lead to a lack of supervised data, which limits the versatility of supervised methods for age-invariant face recognition, a critical task in applications such as security and biometrics. To address this issue, we propose a novel semi-supervised learning approach named Cross-Age Contrastive Learning (CACon). Thanks to the identity-preserving power of recent face synthesis models, CACon introduces a new contrastive learning method that leverages an additional synthesized sample from the input image. We also propose a new loss function in association with CACon to perform contrastive learning on a triplet of samples. We demonstrate that our method not only achieves state-of-the-art performance in homogeneous-dataset experiments on several age-invariant face recognition benchmarks but also outperforms other methods by a large margin in cross-dataset experiments.
Submitted 2 January, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
ViKi-HyCo: A Hybrid-Control approach for complex car-like maneuvers
Authors:
Edison P. Velasco Sánchez,
Miguel Ángel Muñoz-Bañón,
Francisco A. Candelas,
Santiago T. Puente,
Fernando Torres
Abstract:
While Visual Servoing is deeply studied to perform simple maneuvers, the literature does not commonly address complex cases where the target is far out of the camera's field of view (FOV) during the maneuver. For this reason, in this paper, we present ViKi-HyCo (Visual Servoing and Kinematic Hybrid-Controller). This approach generates the necessary maneuvers for the complex positioning of a non-holonomic mobile robot in outdoor environments. In this method, we use LiDAR-camera fusion to estimate object bounding boxes using image and metric modalities. With the multi-modality nature of our representation, we can automatically obtain a target for a visual servoing controller. At the same time, we also have a metric target, which allows us to hybridize with a kinematic controller. Given this hybridization, we can perform complex maneuvers even when the target is far away from the camera's FOV. The proposed approach does not require an object-tracking algorithm and can be applied to any robotic positioning task where its kinematic model is known. ViKi-HyCo has an error of 0.0428 ± 0.0467 m in the X-axis and 0.0515 ± 0.0323 m in the Y-axis at the end of a complete positioning task.
Submitted 16 May, 2024; v1 submitted 13 November, 2023;
originally announced November 2023.
-
Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention
Authors:
Yiming Ma,
Victor Sanchez,
Soodeh Nikan,
Devesh Upadhyay,
Bhushan Atote,
Tanaya Guha
Abstract:
Driver Monitoring Systems (DMSs) are crucial for safe hand-over actions in Level-2+ self-driving vehicles. State-of-the-art DMSs leverage multiple sensors mounted at different locations to monitor the driver and the vehicle's interior scene and employ decision-level fusion to integrate these heterogeneous data. However, this fusion method may not fully utilize the complementarity of different data sources and may overlook their relative importance. To address these limitations, we propose a novel multiview multimodal driver monitoring system based on feature-level fusion through multi-head self-attention (MHSA). We demonstrate its effectiveness by comparing it against four alternative fusion strategies (Sum, Conv, SE, and AFF). We also present a novel GPU-friendly supervised contrastive learning framework, SuMoCo, to learn better representations. Furthermore, we annotate the test split of the DAD dataset at a finer granularity to enable the multi-class recognition of drivers' activities. Experiments on this enhanced database demonstrate that 1) the proposed MHSA-based fusion method (AUC-ROC: 97.0%) outperforms all baselines and previous approaches, and 2) training MHSA with patch masking can improve its robustness against modality/view collapses. The code and annotations are publicly available.
Submitted 13 April, 2023;
originally announced April 2023.
-
The Multi-cluster Fluctuating Two-Ray Fading Model
Authors:
José David Vega Sánchez,
F. Javier López-Martínez,
José F. Paris,
Juan M. Romero-Jerez
Abstract:
We introduce a new class of fading channels, built as the superposition of two fluctuating specular components with random phases, plus a clustering of scattered waves: the Multi-cluster Fluctuating Two-Ray (MFTR) fading channel. The MFTR model emerges as a natural generalization of both the fluctuating two-ray (FTR) and the $κ$-$μ$ shadowed fading models through a more general yet equally mathematically tractable model. This generalization enables the presence of additional multipath clusters in the purely ray-based FTR model, and the convenience of the new underlying fading channel model is discussed in depth. Then, we derive all the chief probability functions of the MFTR model (e.g., probability density function (PDF), cumulative distribution function (CDF), and moment generating function) in closed form, with a mathematical complexity similar to that of other state-of-the-art fading models. We also provide two additional analytical formulations for the PDF and the CDF: (i) in terms of a continuous mixture of $κ$-$μ$ shadowed distributions, and (ii) as an infinite discrete mixture of Gamma distributions. Such expressions make it possible to conduct performance analysis under MFTR fading by directly leveraging readily available results for the $κ$-$μ$ shadowed or Nakagami-$m$ cases, respectively. The performance of wireless communications systems undergoing MFTR fading is exemplified in terms of a classical benchmarking metric like the outage probability, both in exact and asymptotic forms, and the amount of fading.
Submitted 15 September, 2023; v1 submitted 5 December, 2022;
originally announced December 2022.
-
CONVOLVE: Smart and seamless design of smart edge processors
Authors:
M. Gomony,
F. Putter,
A. Gebregiorgis,
G. Paulin,
L. Mei,
V. Jain,
S. Hamdioui,
V. Sanchez,
T. Grosser,
M. Geilen,
M. Verhelst,
F. Zenke,
F. Gurkaynak,
B. Bruin,
S. Stuijk,
S. Davidson,
S. De,
M. Ghogho,
A. Jimborean,
S. Eissa,
L. Benini,
D. Soudris,
R. Bishnoi,
S. Ainsworth,
F. Corradi
, et al. (3 additional authors not shown)
Abstract:
With the rise of Deep Learning (DL), our world braces for AI in every edge device, creating an urgent need for edge-AI SoCs. This SoC hardware needs to support high throughput, reliable and secure AI processing at Ultra Low Power (ULP), with a very short time to market. With its strong legacy in edge solutions and open processing platforms, the EU is well-positioned to become a leader in this SoC market. However, this requires AI edge processing to become at least 100 times more energy-efficient, while offering sufficient flexibility and scalability to deal with AI as a fast-moving target. Since the design space of these complex SoCs is huge, advanced tooling is needed to make their design tractable. The CONVOLVE project (currently in its initial stage) addresses these roadblocks. It takes a holistic approach with innovations at all levels of the design hierarchy. Starting with an overview of SOTA DL processing support and our project methodology, this paper presents 8 important design choices largely impacting the energy efficiency and flexibility of DL hardware. Finding good solutions is key to making smart-edge computing a reality.
Submitted 2 May, 2023; v1 submitted 1 December, 2022;
originally announced December 2022.
-
Real-Time Driver Monitoring Systems through Modality and View Analysis
Authors:
Yiming Ma,
Victor Sanchez,
Soodeh Nikan,
Devesh Upadhyay,
Bhushan Atote,
Tanaya Guha
Abstract:
Driver distractions are known to be the dominant cause of road accidents. While monitoring systems can detect non-driving-related activities and help reduce the risks, they must be accurate and efficient to be applicable. Unfortunately, state-of-the-art methods prioritize accuracy while ignoring latency because they leverage cross-view and multimodal videos in which consecutive frames are highly similar. Thus, in this paper, we pursue time-effective detection models by neglecting the temporal relation between video frames and investigate the importance of each sensing modality in detecting drivers' activities. Experiments demonstrate that 1) our proposed algorithms are real-time and can achieve similar performance (97.5% AUC-PR) with significantly reduced computation compared with video-based models; 2) the top view with the infrared channel is more informative than any other single modality. Furthermore, we enhance the DAD dataset by manually annotating its test set to enable multiclassification. We also thoroughly analyze the influence of visual sensor types and their placements on the prediction of each class. The code and the new labels will be released.
Submitted 17 October, 2022;
originally announced October 2022.
-
Look at Adjacent Frames: Video Anomaly Detection without Offline Training
Authors:
Yuqi Ouyang,
Guodong Shen,
Victor Sanchez
Abstract:
We propose a solution to detect anomalous events in videos without the need to train a model offline. Specifically, our solution is based on a randomly-initialized multilayer perceptron that is optimized online to reconstruct video frames, pixel-by-pixel, from their frequency information. Based on the information shifts between adjacent frames, an incremental learner is used to update the parameters of the multilayer perceptron after observing each frame, thus allowing anomalous events to be detected along the video stream. Traditional solutions that require no offline training are limited to operating on videos with only a few abnormal frames. Our solution breaks this limit and achieves strong performance on benchmark datasets.
Submitted 22 January, 2023; v1 submitted 27 July, 2022;
originally announced July 2022.
-
Visually-aware Acoustic Event Detection using Heterogeneous Graphs
Authors:
Amir Shirian,
Krishna Somandepalli,
Victor Sanchez,
Tanaya Guha
Abstract:
Perception of auditory events is inherently multimodal, relying on both audio and visual cues. A large number of existing multimodal approaches process each modality using modality-specific models and then fuse the embeddings to encode the joint information. In contrast, we employ heterogeneous graphs to explicitly capture the spatial and temporal relationships between the modalities and represent detailed information about the underlying signal. We use heterogeneous graphs, which serve as a compact, efficient and scalable way to represent data, to address the task of visually-aware acoustic event classification. Through heterogeneous graphs, we show efficient modelling of intra- and inter-modality relationships at both spatial and temporal scales. Our model can easily be adapted to different scales of events through relevant hyperparameters. Experiments on AudioSet, a large benchmark, show that our model achieves state-of-the-art performance.
Submitted 16 July, 2022;
originally announced July 2022.
-
Video Anomaly Detection via Prediction Network with Enhanced Spatio-Temporal Memory Exchange
Authors:
Guodong Shen,
Yuqi Ouyang,
Victor Sanchez
Abstract:
Video anomaly detection is a challenging task because most anomalies are scarce and non-deterministic. Many approaches investigate the reconstruction difference between normal and abnormal patterns, but neglect that anomalies do not necessarily correspond to large reconstruction errors. To address this issue, we design a Convolutional LSTM Auto-Encoder prediction framework with enhanced spatio-temporal memory exchange using bi-directionality and a higher-order mechanism. The bi-directional structure promotes learning the temporal regularity through forward and backward predictions. The unique higher-order mechanism further strengthens spatial information interaction between the encoder and the decoder. Considering the limited receptive fields in Convolutional LSTMs, we also introduce an attention module to highlight informative features for prediction. Anomalies are eventually identified by comparing the frames with their corresponding predictions. Evaluations on three popular benchmarks show that our framework outperforms most existing prediction-based anomaly detection methods.
Submitted 26 June, 2022;
originally announced June 2022.
-
Frequency selective extrapolation with residual filtering for image error concealment
Authors:
Ján Koloda,
Jürgen Seiler,
André Kaup,
Victoria Sánchez,
Antonio M. Peinado
Abstract:
The purpose of signal extrapolation is to estimate unknown signal parts from known samples. This task is especially important for error concealment in image and video communication. To obtain a high-quality reconstruction, assumptions have to be made about the underlying signal in order to solve this underdetermined problem. Among existing reconstruction algorithms, frequency selective extrapolation (FSE) achieves high performance by assuming that image signals can be sparsely represented in the frequency domain. However, FSE does not take into account the low-pass behaviour of natural images. In this paper, we propose a modified FSE that takes this prior knowledge into account for the modelling, yielding significant PSNR gains.
Submitted 16 May, 2022;
originally announced May 2022.
-
Physical Layer Security of RIS-Assisted Communications under Electromagnetic Interference
Authors:
José David Vega Sánchez,
Georges Kaddoum,
F. Javier López-Martínez
Abstract:
This work investigates the impact of the ever-present electromagnetic interference (EMI) on the achievable secrecy performance of reconfigurable intelligent surface (RIS)-aided communication systems. We characterize the end-to-end RIS channel by considering key practical aspects such as spatial correlation, transmit beamforming vector, phase-shift noise, the coexistence of direct and indirect channels, and the presence of strong/mild EMI on the receiver sides. We show that the effect of EMI on secrecy performance strongly depends on the ability of the eavesdropper to cancel such interference; this puts forth the potential of EMI-based attacks to degrade physical layer security in RIS-aided communications.
Submitted 15 March, 2022;
originally announced March 2022.
-
FusionCount: Efficient Crowd Counting via Multiscale Feature Fusion
Authors:
Yiming Ma,
Victor Sanchez,
Tanaya Guha
Abstract:
State-of-the-art crowd counting models follow an encoder-decoder approach. Images are first processed by the encoder to extract features. Then, to account for perspective distortion, the highest-level feature map is fed to extra components to extract multiscale features, which are the input to the decoder to generate crowd densities. However, in these methods, features extracted at earlier stages during encoding are underutilised, and the multiscale modules can only capture a limited range of receptive fields, albeit with considerable computational cost. This paper proposes a novel crowd counting architecture (FusionCount), which exploits the adaptive fusion of a large majority of encoded features instead of relying on additional extraction components to obtain multiscale features. Thus, it can cover a more extensive scope of receptive field sizes and lower the computational cost. We also introduce a new channel reduction block, which can extract saliency information during decoding and further enhance the model's performance. Experiments on two benchmark databases demonstrate that our model achieves state-of-the-art results with reduced computational complexity.
Submitted 28 February, 2022;
originally announced February 2022.
-
Spectral-PQ: A Novel Spectral Sensitivity-Orientated Perceptual Compression Technique for RGB 4:4:4 Video Data
Authors:
Lee Prangnell,
Victor Sanchez
Abstract:
There exists an intrinsic relationship between the spectral sensitivity of the Human Visual System (HVS) and colour perception; these intertwined phenomena are often overlooked in perceptual compression research. In general, most previously proposed visually lossless compression techniques exploit luminance (luma) masking including luma spatiotemporal masking, luma contrast masking and luma texture/edge masking. The perceptual relevance of color in a picture is often overlooked, which constitutes a gap in the literature. With regard to the spectral sensitivity phenomenon of the HVS, the color channels of raw RGB 4:4:4 data contain significant color-based psychovisual redundancies. These perceptual redundancies can be quantized via color channel-level perceptual quantization. In this paper, we propose a novel spatiotemporal visually lossless coding method named Spectral Perceptual Quantization (Spectral-PQ). With application for RGB 4:4:4 video data, Spectral-PQ exploits HVS spectral sensitivity-related color masking in addition to spatial masking and temporal masking; the proposed method operates at the Coding Block (CB) level and the Prediction Unit (PU) level in the HEVC standard. Spectral-PQ perceptually adjusts the Quantization Step Size (QStep) at the CB level if high variance spatial data in G, B and R CBs is detected and also if high motion vector magnitudes in PUs are detected. Compared with anchor 1 (HEVC HM 16.17 RExt), Spectral-PQ considerably reduces bitrates with a maximum reduction of approximately 81%. The Mean Opinion Score (MOS) in the subjective evaluations shows that Spectral-PQ successfully achieves perceptually lossless quality.
Submitted 24 January, 2022;
originally announced January 2022.
-
Large Scale Analysis of Open MOOC Reviews to Support Learners' Course Selection
Authors:
Manuel J. Gomez,
Mario Calderón,
Victor Sánchez,
Félix J. García Clemente,
José A. Ruipérez-Valiente
Abstract:
The recent pandemic has changed the way we see education. It is not surprising that children and college students are not the only ones using online education. Millions of adults have signed up for online classes and courses in recent years, and MOOC providers, such as Coursera or edX, are reporting millions of new users signing up on their platforms. However, students do face some challenges when choosing courses. Though online review systems are standard among many verticals, no standardized or fully decentralized review systems exist in the MOOC ecosystem. In this vein, we believe that there is an opportunity to leverage available open MOOC reviews in order to build simpler and more transparent reviewing systems, allowing users to really identify the best courses out there. Specifically, in our research we analyze 2.4 million reviews (which is the largest MOOC reviews dataset used until now) from five different platforms in order to determine the following: (1) if the numeric ratings provide discriminant information to learners, (2) if NLP-driven sentiment analysis on textual reviews could provide valuable information to learners, (3) if we can leverage NLP-driven topic finding techniques to infer themes that could be important for learners, and (4) if we can use these models to effectively characterize MOOCs based on the open reviews. Results show that numeric ratings are clearly biased (63% of them are 5-star ratings), and the topic modeling reveals some interesting topics related to course advertisements, the real applicability, or the difficulty of the different courses. We expect our study to shed some light on the area and promote a more transparent approach in online education reviews, which are becoming more and more popular as we enter the post-pandemic era.
Submitted 11 January, 2022;
originally announced January 2022.
-
Improving Face-Based Age Estimation with Attention-Based Dynamic Patch Fusion
Authors:
Haoyi Wang,
Victor Sanchez,
Chang-Tsun Li
Abstract:
With the increasing popularity of convolutional neural networks (CNNs), recent works on face-based age estimation employ these networks as the backbone. However, state-of-the-art CNN-based methods treat each facial region equally, thus entirely ignoring the importance of some facial patches that may contain rich age-specific information. In this paper, we propose a face-based age estimation framework, called Attention-based Dynamic Patch Fusion (ADPF). In ADPF, two separate CNNs are implemented, namely the AttentionNet and the FusionNet. The AttentionNet dynamically locates and ranks age-specific patches by employing a novel Ranking-guided Multi-Head Hybrid Attention (RMHHA) mechanism. The FusionNet uses the discovered patches along with the facial image to predict the age of the subject. Since the proposed RMHHA mechanism ranks the discovered patches based on their importance, the length of the learning path of each patch in the FusionNet is proportional to the amount of information it carries (the longer, the more important). ADPF also introduces a novel diversity loss to guide the training of the AttentionNet and reduce the overlap among patches so that the diverse and important patches are discovered. Through extensive experiments, we show that our proposed framework outperforms state-of-the-art methods on several age estimation benchmark datasets.
Submitted 19 December, 2021;
originally announced December 2021.
-
Joint Learning Architecture for Multiple Object Tracking and Trajectory Forecasting
Authors:
Oluwafunmilola Kesa,
Olly Styles,
Victor Sanchez
Abstract:
This paper introduces a joint learning architecture (JLA) for multiple object tracking (MOT) and trajectory forecasting in which the goal is to predict objects' current and future trajectories simultaneously. Motion prediction is widely used in several state-of-the-art MOT methods to refine predictions in the form of bounding boxes. Typically, a Kalman Filter provides short-term estimations to help trackers correctly predict objects' locations in the current frame. However, the Kalman Filter-based approaches cannot predict non-linear trajectories. We propose to jointly train a tracking and trajectory forecasting model and use the predicted trajectory forecasts for short-term motion estimates in lieu of linear motion prediction methods such as the Kalman filter. We evaluate our JLA on the MOTChallenge benchmark. Evaluation results show that JLA performs better for short-term motion prediction and reduces ID switches by 33%, 31%, and 47% in the MOT16, MOT17, and MOT20 datasets, respectively, in comparison to FairMOT.
Submitted 24 August, 2021;
originally announced August 2021.
-
Multi-Camera Trajectory Forecasting with Trajectory Tensors
Authors:
Olly Styles,
Tanaya Guha,
Victor Sanchez
Abstract:
We introduce the problem of multi-camera trajectory forecasting (MCTF), which involves predicting the trajectory of a moving object across a network of cameras. While multi-camera setups are widespread for applications such as surveillance and traffic monitoring, existing trajectory forecasting methods typically focus on single-camera trajectory forecasting (SCTF), limiting their use for such applications. Furthermore, using a single camera limits the field-of-view available, making long-term trajectory forecasting impossible. We address these shortcomings of SCTF by developing an MCTF framework that simultaneously uses all estimated relative object locations from several viewpoints and predicts the object's future location in all possible viewpoints. Our framework follows a Which-When-Where approach that predicts in which camera(s) the objects appear and when and where within the camera views they appear. To this end, we propose the concept of trajectory tensors: a new technique to encode trajectories across multiple camera views and the associated uncertainties. We develop several encoder-decoder MCTF models for trajectory tensors and present extensive experiments on our own database (comprising 600 hours of video data from 15 camera views) created particularly for the MCTF task. Results show that our trajectory tensor models outperform coordinate trajectory-based MCTF models and existing SCTF methods adapted for MCTF. Code is available from: https://github.com/olly-styles/Trajectory-Tensors
Submitted 24 August, 2021; v1 submitted 10 August, 2021;
originally announced August 2021.
-
On the detection-to-track association for online multi-object tracking
Authors:
Xufeng Lin,
Chang-Tsun Li,
Victor Sanchez,
Carsten Maple
Abstract:
Driven by recent advances in object detection with deep neural networks, the tracking-by-detection paradigm has gained increasing prevalence in the research community of multi-object tracking (MOT). It has long been known that appearance information plays an essential role in the detection-to-track association, which lies at the core of the tracking-by-detection paradigm. While most existing works consider the appearance distances between the detections and the tracks, they ignore the statistical information implied by the historical appearance distance records in the tracks, which can be particularly useful when a detection has similar distances with two or more tracks. In this work, we propose a hybrid track association (HTA) algorithm that models the historical appearance distances of a track with an incremental Gaussian mixture model (IGMM) and incorporates the derived statistical information into the calculation of the detection-to-track association cost. Experimental results on three MOT benchmarks confirm that HTA effectively improves the target identification performance with a small compromise to the tracking speed. Additionally, compared to many state-of-the-art trackers, the DeepSORT tracker equipped with HTA achieves better or comparable performance in terms of the balance of tracking quality and speed.
Submitted 1 July, 2021;
originally announced July 2021.
-
Deep Learning for Predictive Analytics in Reversible Steganography
Authors:
Ching-Chun Chang,
Xu Wang,
Sisheng Chen,
Isao Echizen,
Victor Sanchez,
Chang-Tsun Li
Abstract:
Deep learning is regarded as a promising solution for reversible steganography. There is an accelerating trend of representing a reversible stego-system by monolithic neural networks, which bypass intermediate operations in traditional pipelines of reversible steganography. This end-to-end paradigm, however, suffers from imperfect reversibility. By contrast, the modular paradigm that incorporates neural networks into modules of traditional pipelines can stably guarantee reversibility with mathematical explainability. Prediction-error modulation is a well-established reversible steganography pipeline for digital images. It consists of a predictive analytics module and a reversible coding module. Given that reversibility is governed independently by the coding module, we narrow our focus to the incorporation of neural networks into the analytics module, which serves the purpose of predicting pixel intensities and plays a pivotal role in determining capacity and imperceptibility. The objective of this study is to evaluate the impacts of different training configurations upon predictive accuracy of neural networks and provide practical insights. In particular, we investigate how different initialisation strategies for input images may affect the learning process and how different training strategies for dual-layer prediction respond to the problem of distributional shift. Furthermore, we compare steganographic performance of various model architectures with different loss functions.
Submitted 7 March, 2023; v1 submitted 13 June, 2021;
originally announced June 2021.
-
Expectation-Maximization Learning for Wireless Channel Modeling of Reconfigurable Intelligent Surfaces
Authors:
José David Vega Sánchez,
Luis Urquiza-Aguiar,
Martha Cecilia Paredes Paredes,
F. Javier López-Martínez
Abstract:
Channel modeling is a critical issue when designing or evaluating the performance of reconfigurable intelligent surface (RIS)-assisted communications. Inspired by the promising potential of learning-based methods for characterizing the radio environment, we present a general approach to model the RIS end-to-end equivalent channel using the unsupervised expectation-maximization (EM) learning algorithm. We show that an EM-based approximation through a simple mixture of two Nakagami-$m$ distributions suffices to accurately approximate the equivalent channel, while allowing for the incorporation of crucial aspects into RIS's channel modeling such as spatial channel correlation, phase-shift errors, arbitrary fading conditions, and coexistence of direct and RIS channels. Based on the proposed analytical framework, we evaluate the outage probability under different settings of RIS's channel features and confirm the superiority of this approach compared to recent results in the literature.
Submitted 10 August, 2021; v1 submitted 24 March, 2021;
originally announced March 2021.
-
Video Anomaly Detection by Estimating Likelihood of Representations
Authors:
Yuqi Ouyang,
Victor Sanchez
Abstract:
Video anomaly detection is a challenging task not only because it involves solving many sub-tasks such as motion representation, object localization and action recognition, but also because it is commonly considered as an unsupervised learning problem that involves detecting outliers. Traditionally, solutions to this task have focused on the mapping between video frames and their low-dimensional features, while ignoring the spatial connections of those features. Recent solutions focus on analyzing these spatial connections by using hard clustering techniques, such as K-Means, or applying neural networks to map latent features to a general understanding, such as action attributes. In order to solve video anomaly detection in the latent feature space, we propose a deep probabilistic model to transfer this task into a density estimation problem where latent manifolds are generated by a deep denoising autoencoder and clustered by expectation maximization. Evaluations on several benchmark datasets show the strengths of our model, achieving outstanding performance on challenging datasets.
Submitted 2 December, 2020;
originally announced December 2020.
-
Physical Layer Security of Large Reflecting Surface Aided Communications with Phase Errors
Authors:
Jose David Vega Sanchez,
Pablo Ramirez-Espinosa,
F. Javier Lopez-Martinez
Abstract:
The physical layer security (PLS) performance of a wireless communication link through a large reflecting surface (LRS) with phase errors is analyzed. Leveraging recent results that express the LRS-based composite channel as an equivalent scalar fading channel, we show that the eavesdropper's link is Rayleigh distributed and independent of the legitimate link. The different scaling laws of the legitimate and eavesdropper's signal-to-noise ratios with the number of reflecting elements, and the reasonably good performance even in the case of coarse phase quantization, show the great potential of LRS-aided communications to enhance PLS in practical wireless set-ups.
Submitted 25 July, 2020;
originally announced July 2020.
-
Age-Oriented Face Synthesis with Conditional Discriminator Pool and Adversarial Triplet Loss
Authors:
Haoyi Wang,
Victor Sanchez,
Chang-Tsun Li
Abstract:
The vanilla Generative Adversarial Networks (GAN) are commonly used to generate realistic images depicting aged and rejuvenated faces. However, the performance of such vanilla GANs in the age-oriented face synthesis task is often compromised by the mode collapse issue, which may result in the generation of faces with minimal variations and a poor synthesis accuracy. In addition, recent age-oriented face synthesis methods use the L1 or L2 constraint to preserve the identity information on synthesized faces, which implicitly limits the identity permanence capabilities when these constraints are associated with a trivial weighting factor. In this paper, we propose a method for the age-oriented face synthesis task that achieves a high synthesis accuracy with strong identity permanence capabilities. Specifically, to achieve a high synthesis accuracy, our method tackles the mode collapse issue with a novel Conditional Discriminator Pool (CDP), which consists of multiple discriminators, each targeting one particular age category. To achieve strong identity permanence capabilities, our method uses a novel Adversarial Triplet loss. This loss, which is based on the Triplet loss, adds a ranking operation to further pull the positive embedding towards the anchor embedding resulting in significantly reduced intra-class variances in the feature space. Through extensive experiments, we show that our proposed method outperforms state-of-the-art methods in terms of synthesis accuracy and identity permanence capabilities, qualitatively and quantitatively.
Submitted 3 July, 2020; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Survey on Physical Layer Security for 5G Wireless Networks
Authors:
José David Vega Sánchez,
Luis Urquiza-Aguiar,
Martha Cecilia Paredes Paredes,
Diana Pamela Moya Osorio
Abstract:
Physical layer security is a promising approach that can benefit traditional encryption methods. The idea of physical layer security is to take advantage of the features of the propagation medium and its impairments to ensure secure communication in the physical layer. This work introduces a comprehensive review of the main information-theoretic metrics used to measure the secrecy performance in physical layer security. Furthermore, a theoretical framework related to the most commonly used physical layer security techniques to improve the secrecy performance is provided. Finally, our work surveys physical layer security research over several enabling 5G technologies, such as massive multiple-input multiple-output, millimeter-wave communications, heterogeneous networks, non-orthogonal multiple access, and full-duplex. Also, we include the key concepts of each of the aforementioned technologies. Future fields of research and technical challenges of physical layer security are also identified.
Submitted 14 June, 2020;
originally announced June 2020.
-
Ensemble Network for Ranking Images Based on Visual Appeal
Authors:
Sachin Singh,
Victor Sanchez,
Tanaya Guha
Abstract:
We propose a computational framework for ranking images (group photos in particular) taken at the same event within a short time span. The ranking is expected to correspond with human perception of overall appeal of the images. We hypothesize and provide evidence through subjective analysis that the factors that appeal to humans are its emotional content, aesthetics and image quality. We propose a network which is an ensemble of three information channels, each predicting a score corresponding to one of the three visual appeal factors. For group emotion estimation, we propose a convolutional neural network (CNN) based architecture for predicting group emotion from images. This new architecture enforces the network to put emphasis on the important regions in the images, and achieves comparable results to the state-of-the-art. Next, we develop a network for the image ranking task that combines group emotion, aesthetics and image quality scores. Owing to the unavailability of suitable databases, we created a new database of manually annotated group photos taken during various social events. We present experimental results on this database and other benchmark databases whenever available. Overall, our experiments show that the proposed framework can reliably predict the overall appeal of images with results closely corresponding to human ranking.
Submitted 6 June, 2020;
originally announced June 2020.
-
HVS-Based Perceptual Color Compression of Image Data
Authors:
Lee Prangnell,
Victor Sanchez
Abstract:
In perceptual image coding applications, the main objective is to decrease, as much as possible, Bits Per Pixel (BPP) while avoiding noticeable distortions in the reconstructed image. In this paper, we propose a novel perceptual image coding technique, named Perceptual Color Compression (PCC). PCC is based on a novel model related to Human Visual System (HVS) spectral sensitivity and CIELAB Just Noticeable Color Difference (JNCD). We utilize this modeling to capitalize on the inability of the HVS to perceptually differentiate photons in very similar wavelength bands (e.g., distinguishing very similar shades of a particular color or different colors that look similar). The proposed PCC technique can be used with RGB (4:4:4) image data of various bit depths and spatial resolutions. In the evaluations, we compare the proposed PCC technique with a set of reference methods including Versatile Video Coding (VVC) and High Efficiency Video Coding (HEVC) in addition to two other recently proposed algorithms. Our PCC method attains considerable BPP reductions compared with all four reference techniques including, on average, 52.6% BPP reductions compared with VVC (VVC in All Intra still image coding mode). Regarding image perceptual reconstruction quality, PCC achieves a score of SSIM = 0.99 in all tests in addition to a score of MS-SSIM = 0.99 in all but one test. Moreover, MOS = 5 is attained in 75% of subjective evaluation assessments conducted.
Submitted 9 February, 2021; v1 submitted 16 May, 2020;
originally announced May 2020.
-
Spatiotemporal Adaptive Quantization for the Perceptual Video Coding of RGB 4:4:4 Data
Authors:
Lee Prangnell,
Victor Sanchez
Abstract:
Due to the spectral sensitivity phenomenon of the Human Visual System (HVS), the color channels of raw RGB 4:4:4 sequences contain significant psychovisual redundancies; these redundancies can be perceptually quantized. The default quantization systems in the HEVC standard are known as Uniform Reconstruction Quantization (URQ) and Rate Distortion Optimized Quantization (RDOQ); URQ and RDOQ are not perceptually optimized for the coding of RGB 4:4:4 video data. In this paper, we propose a novel spatiotemporal perceptual quantization technique named SPAQ. With application for RGB 4:4:4 video data, SPAQ exploits HVS spectral sensitivity-related color masking in addition to spatial masking and temporal masking; SPAQ operates at the Coding Block (CB) level and the Prediction Unit (PU) level. The proposed technique perceptually adjusts the Quantization Step Size (QStep) at the CB level if high variance spatial data in G, B and R CBs is detected and also if high motion vector magnitudes in PUs are detected. Compared with anchor 1 (HEVC HM 16.17 RExt), SPAQ considerably reduces bitrates with a maximum reduction of approximately 80%. The Mean Opinion Score (MOS) in the subjective evaluations, in addition to the SSIM scores, show that SPAQ successfully achieves perceptually lossless compression compared with anchors.
Submitted 16 May, 2020;
originally announced May 2020.
-
Information-Theoretic Security of MIMO Networks under $κ$-$μ$ Shadowed Fading Channels
Authors:
José David Vega Sánchez,
D. P. Moya Osorio,
F. Javier López-Martínez,
Martha Cecilia Paredes Paredes,
Luis Urquiza-Aguiar
Abstract:
This paper investigates the impact of realistic propagation conditions on the achievable secrecy performance of multiple-input multiple-output systems in the presence of an eavesdropper. Specifically, we concentrate on the $κ$-$μ$ shadowed fading model because its physical underpinnings capture a wide range of propagation conditions, while, at the same time, it allows for much better tractability than other state-of-the-art fading models. By considering transmit antenna selection and maximal ratio combining reception at the legitimate and eavesdropper's receiver sides, we study two relevant scenarios: $(i)$ the transmitter does not know the eavesdropper's channel state information (CSI), and $(ii)$ the transmitter has knowledge of the CSI of the eavesdropper link. For this purpose, we first obtain novel and tractable expressions for the statistics of the maximum of independent and identically distributed (i.i.d.) variates related to the legitimate path. Based on these results, we derive novel closed-form expressions for the secrecy outage probability (SOP) and the average secrecy capacity (ASC) to assess the secrecy performance in passive and active eavesdropping scenarios, respectively. Moreover, we develop analytical asymptotic expressions of the SOP and ASC at the high signal-to-noise ratio regime. In all instances, secrecy performance metrics are characterized in closed form, without requiring the evaluation of Meijer or Fox functions. Some useful insights on how the different propagation conditions and the number of antennas impact the secrecy performance are also provided.
Submitted 30 June, 2020; v1 submitted 5 May, 2020;
originally announced May 2020.
-
Multi-Camera Trajectory Forecasting: Pedestrian Trajectory Prediction in a Network of Cameras
Authors:
Olly Styles,
Tanaya Guha,
Victor Sanchez,
Alex Kot
Abstract:
We introduce the task of multi-camera trajectory forecasting (MCTF), where the future trajectory of an object is predicted in a network of cameras. Prior works consider forecasting trajectories in a single camera view. Our work is the first to consider the challenging scenario of forecasting across multiple non-overlapping camera views. This has wide applicability in tasks such as re-identification and multi-target multi-camera tracking. To facilitate research in this new area, we release the Warwick-NTU Multi-camera Forecasting Database (WNMF), a unique dataset of multi-camera pedestrian trajectories from a network of 15 synchronized cameras. To accurately label this large dataset (600 hours of video footage), we also develop a semi-automated annotation method. An effective MCTF model should proactively anticipate where and when a person will re-appear in the camera network. In this paper, we consider the task of predicting the next camera in which a pedestrian will re-appear after leaving the view of another camera, and present several baseline approaches for this. The labeled database is available online: https://github.com/olly-styles/Multi-Camera-Trajectory-Forecasting.
Submitted 1 May, 2020;
originally announced May 2020.
-
Multiple Object Forecasting: Predicting Future Object Locations in Diverse Environments
Authors:
Olly Styles,
Tanaya Guha,
Victor Sanchez
Abstract:
This paper introduces the problem of multiple object forecasting (MOF), in which the goal is to predict future bounding boxes of tracked objects. In contrast to existing works on object trajectory forecasting, which primarily consider the problem from a bird's-eye perspective, we formulate the problem from an object-level perspective and call for the prediction of full object bounding boxes, rather than trajectories alone. Towards solving this task, we introduce the Citywalks dataset, which consists of over 200k high-resolution video frames. Citywalks comprises footage recorded in 21 cities from 10 European countries in a variety of weather conditions and over 3.5k unique pedestrian trajectories. For evaluation, we adapt existing trajectory forecasting methods for MOF and confirm cross-dataset generalizability on the MOT-17 dataset without fine-tuning. Finally, we present STED, a novel encoder-decoder architecture for MOF. STED combines visual and temporal features to model both object-motion and ego-motion, and outperforms existing approaches for MOF. Code & dataset link: https://github.com/olly-styles/Multiple-Object-Forecasting
Submitted 7 January, 2020; v1 submitted 26 September, 2019;
originally announced September 2019.
-
Forecasting Pedestrian Trajectory with Machine-Annotated Training Data
Authors:
Olly Styles,
Arun Ross,
Victor Sanchez
Abstract:
Reliable anticipation of pedestrian trajectory is imperative for the operation of autonomous vehicles and can significantly enhance the functionality of advanced driver assistance systems. While significant progress has been made in the field of pedestrian detection, forecasting pedestrian trajectories remains a challenging problem due to the unpredictable nature of pedestrians and the huge space of potentially useful features. In this work, we present a deep learning approach for pedestrian trajectory forecasting using a single vehicle-mounted camera. Deep learning models that have revolutionized other areas in computer vision have seen limited application to trajectory forecasting, in part due to the lack of richly annotated training data. We address the lack of training data by introducing a scalable machine annotation scheme that enables our model to be trained using a large dataset without human annotation. In addition, we propose Dynamic Trajectory Predictor (DTP), a model for forecasting pedestrian trajectory up to one second into the future. DTP is trained using both human and machine-annotated data, and anticipates dynamic motion that is not captured by linear models. Experimental evaluation confirms the benefits of the proposed model.
Submitted 9 May, 2019;
originally announced May 2019.
-
On the Statistics of the Ratio of Non-Constrained Arbitrary α-μ Random Variables: a General Framework and Applications
Authors:
J. D. Vega Sánchez,
D. P. Moya Osorio,
E. E. Benitez Olivo,
H. Alves,
M. C. P. Paredes,
L. Urquiza-Aguiar
Abstract:
In this paper, we derive closed-form exact expressions for the main statistics of the ratio of squared alpha-mu random variables, which are of interest in many scenarios for future wireless networks where generalized distributions are more suitable to fit with field data. Importantly, different from previous proposals, our expressions are general in the sense that they are valid for non-constrained arbitrary values of the parameters of the alpha-mu distribution. Thus, the probability density function, cumulative distribution function, moment generating function, and higher order moments are given in terms of both (i) the Fox H-function, for which we provide a portable and efficient Wolfram Mathematica code, and (ii) easily computable series expansions. Our expressions can be used straightforwardly in the performance analysis of a number of wireless communication systems, including interference-limited scenarios, spectrum sharing, full-duplex or physical-layer security networks, for which we present the application of the proposed framework. Moreover, closed-form expressions for some classical distributions, derived as special cases from the alpha-mu distribution, are provided as byproducts. The validity of the proposed expressions is confirmed via Monte Carlo simulations.
Submitted 20 February, 2019;
originally announced February 2019.
-
Fusion Network for Face-based Age Estimation
Authors:
Haoyi Wang,
Xingjie Wei,
Victor Sanchez,
Chang-Tsun Li
Abstract:
Convolutional Neural Networks (CNNs) have been applied to age-related research as the core framework. Although faces are composed of numerous facial attributes, most works with CNNs still consider a face as a typical object and do not pay enough attention to facial regions that carry age-specific features for this particular task. In this paper, we propose a novel CNN architecture called Fusion Network (FusionNet) to tackle the age estimation problem. Apart from the whole face image, the FusionNet successively takes several age-specific facial patches as part of the input to emphasize the age-specific features. Through experiments, we show that the FusionNet significantly outperforms other state-of-the-art models on the MORPH II benchmark.
Submitted 26 July, 2018;
originally announced July 2018.
-
A Web Scraping Methodology for Bypassing Twitter API Restrictions
Authors:
A. Hernandez-Suarez,
G. Sanchez-Perez,
K. Toscano-Medina,
V. Martinez-Hernandez,
V. Sanchez,
H. Perez-Meana
Abstract:
Retrieving information from social networks is the first and primordial step in many data analysis fields such as Natural Language Processing, Sentiment Analysis and Machine Learning. Important data science tasks rely on historical data gathering for further predictive results. Most of the recent works use the Twitter API, a public platform for collecting public streams of information, which allows querying chronological tweets no more than three weeks old. In this paper, we present a new methodology for collecting historical tweets within any date range, using web scraping techniques to bypass Twitter API restrictions.
Submitted 26 March, 2018;
originally announced March 2018.
-
Coding Block-Level Perceptual Video Coding for 4:4:4 Data in HEVC
Authors:
Lee Prangnell,
Miguel Hernández-Cabronero,
Victor Sanchez
Abstract:
There is an increasing consumer demand for high bit-depth 4:4:4 HD video data playback due to its superior perceptual visual quality compared with standard 8-bit subsampled 4:2:0 video data. Due to vast file sizes and associated bitrates, it is desirable to compress raw high bit-depth 4:4:4 HD video sequences as much as possible without incurring a discernible decrease in visual quality. In this paper, we propose a Coding Block (CB)-level perceptual video coding technique for HEVC named Full Color Perceptual Quantization (FCPQ). FCPQ is designed to adjust the Quantization Parameter (QP) at the CB level (i.e., the luma CB and the chroma Cb and Cr CBs) according to the variances of pixel data in each CB. FCPQ is based on the default perceptual quantization method in HEVC called AdaptiveQP. AdaptiveQP adjusts the QP of an entire CU based only on the spatial activity of the constituent luma CB. As demonstrated in this paper, by not accounting for the spatial activity of the constituent chroma CBs, as is the case with AdaptiveQP, coding performance can be significantly affected; this is because the variance of pixel data in a luma CB is notably different from the variances of pixel data in chroma Cb and Cr CBs. FCPQ, therefore, addresses this problem. In terms of coding performance, FCPQ achieves BD-Rate improvements of up to 39.5% (Y), 16% (Cb) and 29.9% (Cr) compared with AdaptiveQP.
Submitted 16 February, 2018;
originally announced February 2018.
-
JND-Based Perceptual Video Coding for 4:4:4 Screen Content Data in HEVC
Authors:
Lee Prangnell,
Victor Sanchez
Abstract:
The JCT-VC standardized Screen Content Coding (SCC) extension in the HEVC HM RExt + SCM reference codec offers an impressive coding efficiency performance when compared with HM RExt alone; however, it is not significantly perceptually optimized. For instance, it does not include advanced HVS-based perceptual coding methods, such as JND-based spatiotemporal masking schemes. In this paper, we propose a novel JND-based perceptual video coding technique for HM RExt + SCM. The proposed method is designed to further improve the compression performance of HM RExt + SCM when applied to YCbCr 4:4:4 SC video data. In the proposed technique, luminance masking and chrominance masking are exploited to perceptually adjust the Quantization Step Size (QStep) at the Coding Block (CB) level. Compared with HM RExt 16.10 + SCM 8.0, the proposed method considerably reduces bitrates (Kbps), with a maximum reduction of 48.3%. In addition to this, the subjective evaluations reveal that SC-PAQ achieves visually lossless coding at very low bitrates.
Submitted 12 February, 2018; v1 submitted 26 October, 2017;
originally announced October 2017.
-
Cross-Color Channel Perceptually Adaptive Quantization for HEVC
Authors:
Lee Prangnell,
Miguel Hernández-Cabronero,
Victor Sanchez
Abstract:
HEVC includes a Coding Unit (CU) level luminance-based perceptual quantization technique known as AdaptiveQP. AdaptiveQP perceptually adjusts the Quantization Parameter (QP) at the CU level based on the spatial activity of raw input video data in a luma Coding Block (CB). In this paper, we propose a novel cross-color channel adaptive quantization scheme which perceptually adjusts the CU level QP according to the spatial activity of raw input video data in the constituent luma and chroma CBs; i.e., the combined spatial activity across all three color channels (the Y, Cb and Cr channels). Our technique is evaluated in HM 16 with 4:4:4, 4:2:2 and 4:2:0 YCbCr JCT-VC test sequences. Both subjective and objective visual quality evaluations are undertaken during which we compare our method with AdaptiveQP. Our technique achieves considerable coding efficiency improvements, with maximum BD-Rate reductions of 15.9% (Y), 13.1% (Cr) and 16.1% (Cb) in addition to a maximum decoding time reduction of 11.0%.
Submitted 12 February, 2018; v1 submitted 23 December, 2016;
originally announced December 2016.
-
The Predictive Context Tree: Predicting Contexts and Interactions
Authors:
Alasdair Thomason,
Nathan Griffiths,
Victor Sanchez
Abstract:
With a large proportion of people carrying location-aware smartphones, we have an unprecedented platform from which to understand individuals and predict their future actions. This work builds upon the Context Tree data structure that summarises the historical contexts of individuals from augmented geospatial trajectories, and constructs a predictive model for their likely future contexts. The Predictive Context Tree (PCT) is constructed as a hierarchical classifier, capable of predicting both the future locations that a user will visit and the contexts that a user will be immersed within. The PCT is evaluated over real-world geospatial trajectories, and compared against existing location extraction and prediction techniques, as well as a proposed hybrid approach that uses identified land usage elements in combination with machine learning to predict future interactions. Our results demonstrate that higher predictive accuracies can be achieved using this hybrid approach over traditional extracted location datasets, and the PCT itself matches the performance of the hybrid approach at predicting future interactions, while adding utility in the form of context predictions. Such a prediction system is capable of understanding not only where a user will visit, but also their context, in terms of what they are likely to be doing.
Submitted 5 October, 2016;
originally announced October 2016.
-
Minimizing Compression Artifacts for High Resolutions with Adaptive Quantization Matrices for HEVC
Authors:
Lee Prangnell,
Victor Sanchez
Abstract:
Visual Display Units (VDUs), capable of displaying video data at High Definition (HD) and Ultra HD (UHD) resolutions, are frequently employed in a variety of technological domains. Quantization-induced video compression artifacts, which are usually unnoticeable in low resolution environments, are typically conspicuous on high resolution VDUs and video data. The default quantization matrices (QMs) in HEVC do not take into account specific display resolutions of VDUs or video data to determine the appropriate levels of quantization required to reduce unwanted compression artifacts. Therefore, we propose a novel, adaptive quantization matrix technique for the HEVC standard including Scalable HEVC (SHVC). Our technique, which is based on a refinement of the current QM technique in HEVC, takes into consideration specific display resolutions of the target VDUs in order to minimize compression artifacts. We undertake a thorough evaluation of the proposed technique by utilizing SHVC SHM 9.0 (two-layered bit-stream) and the BD-Rate and SSIM metrics. For the BD-Rate evaluation, the proposed method achieves maximum BD-Rate reductions of 56.5% in the enhancement layer. For the SSIM evaluation, our technique achieves a maximum structural improvement of 0.8660 vs. 0.8538.
Submitted 21 September, 2016;
originally announced September 2016.
-
Color-Based Coding Unit Level Adaptive Quantization for HEVC
Authors:
Lee Prangnell,
Victor Sanchez
Abstract:
HEVC HM 16 includes a Coding Unit (CU) level perceptual quantization technique named AdaptiveQP. AdaptiveQP adjusts the Quantization Parameter (QP) at the CU level based on the spatial activity of samples in the four constituent NxN sub-blocks of the luma Coding Block (CB), which is contained within a 2Nx2N CU. In this paper, we propose C-BAQ, which, in contrast to AdaptiveQP, adjusts the CU level QP according to the spatial activity of samples in the four constituent NxN sub-blocks of both the luma and chroma CBs. By computing the sum of luma, chroma Cb and chroma Cr spatial activity in a CU, a richer reflection of spatial activity in the CU is attained. Therefore, a more appropriate CU level QP can be selected, thus leading to important improvements in terms of coding efficiency. We evaluate the proposed technique in HEVC HM 16.7 using 4:4:4, 4:2:2 and 4:2:0 YCbCr sequences. Both subjective and objective evaluations are undertaken during which we compare C-BAQ with AdaptiveQP. The objective evaluation reveals that C-BAQ attains a maximum BD-Rate reduction of 15.9% (Y), 13.1% (Cr) and 16.1% (Cb) in addition to a maximum decoding time reduction of 11.0%.
Submitted 6 November, 2016; v1 submitted 15 September, 2016;
originally announced September 2016.
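The idea stated in the C-BAQ abstract above, summing luma, Cb and Cr spatial activity over a CU, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and the per-channel activity measure (one plus the minimum variance of the four sub-blocks) is an assumption modelled on how AdaptiveQP-style activity is commonly described. The mapping from activity to a QP offset is omitted.

```python
from statistics import pvariance


def sub_block_variances(block):
    """Return the sample variances of the four NxN quadrants of a 2Nx2N block.

    `block` is a square list of rows holding integer sample values.
    """
    n = len(block) // 2
    variances = []
    for row0 in (0, n):
        for col0 in (0, n):
            samples = [
                block[r][c]
                for r in range(row0, row0 + n)
                for c in range(col0, col0 + n)
            ]
            variances.append(pvariance(samples))
    return variances


def channel_activity(block):
    # Assumption: activity = 1 + minimum quadrant variance (AdaptiveQP-style).
    return 1.0 + min(sub_block_variances(block))


def cu_activity(luma, cb, cr):
    # Sum of luma, Cb and Cr activity, as the abstract describes for C-BAQ.
    return channel_activity(luma) + channel_activity(cb) + channel_activity(cr)


if __name__ == "__main__":
    flat = [[100] * 4 for _ in range(4)]
    textured = [[0, 2, 0, 2], [2, 0, 2, 0], [0, 2, 0, 2], [2, 0, 2, 0]]
    print(cu_activity(textured, flat, flat))
```

Because each channel is handled independently, the chroma blocks may be smaller than the luma block (as in 4:2:0 or 4:2:2 data), provided each block passed in is square with an even side length.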
-
Context Trees: Augmenting Geospatial Trajectories with Context
Authors:
Alasdair Thomason,
Nathan Griffiths,
Victor Sanchez
Abstract:
Exposing latent knowledge in geospatial trajectories has the potential to provide a better understanding of the movements of individuals and groups. Motivated by such a desire, this work presents the context tree, a new hierarchical data structure that summarises the context behind user actions in a single model. We propose a method for context tree construction that augments geospatial trajectories with land usage data to identify such contexts. Through evaluation of the construction method and analysis of the properties of generated context trees, we demonstrate the foundation they afford for understanding and modelling behaviour. Summarising user contexts into a single data structure gives easy access to information that would otherwise remain latent, providing the basis for better understanding and predicting the actions and behaviours of individuals and groups. Finally, we also present a method for pruning context trees, for use in applications where it is desirable to reduce the size of the tree while retaining useful information.
Submitted 14 June, 2016;
originally announced June 2016.
-
Adaptive Quantization Matrices for HD and UHD Display Resolutions in Scalable HEVC
Authors:
Lee Prangnell,
Victor Sanchez
Abstract:
HEVC contains an option to enable custom quantization matrices, which are designed based on the Human Visual System and a 2D Contrast Sensitivity Function. Visual Display Units, capable of displaying video data at High Definition and Ultra HD display resolutions, are frequently utilized on a global scale. Video compression artifacts that are present due to high levels of quantization, which are typically inconspicuous in low display resolution environments, are clearly visible on HD and UHD video data and VDUs. The default QM technique in HEVC does not take into account the video data resolution, nor does it take into consideration the associated display resolution of a VDU to determine the appropriate levels of quantization required to reduce unwanted video compression artifacts. Based on this fact, we propose a novel, adaptive quantization matrix technique for the HEVC standard, including Scalable HEVC. Our technique, which is based on a refinement of the current HVS-CSF QM approach in HEVC, takes into consideration the display resolution of the target VDU for the purpose of minimizing video compression artifacts. In SHVC SHM 9.0, and compared with anchors, the proposed technique yields important quality and coding improvements for the Random Access configuration, with a maximum of 56.5% luma BD-Rate reductions in the enhancement layer. Furthermore, compared with the default QMs and the Sony QMs, our method yields encoding time reductions of 0.75% and 1.19%, respectively.
Submitted 12 June, 2016; v1 submitted 7 June, 2016;
originally announced June 2016.
-
An analysis of social network connect services
Authors:
Antonio Tapiador,
Víctor Sánchez,
Joaquín Salvachúa
Abstract:
Social network platforms are increasingly becoming identity providers and a medium for showing multiple types of activity from third-party web sites. In this article, we analyze the services provided by seven of the most popular social network platforms. Results show OAuth emerging as the authentication and authorization protocol, giving support to three types of APIs: client-side or JavaScript, server-side or representational state transfer (REST), and streaming. JSON is the most popular format, but there is a considerable variety of resource types and a lack of a representation standard, which makes it harder for third-party developers to integrate with several services.
Submitted 23 July, 2012;
originally announced July 2012.