Search | arXiv e-print repository

High-resolution open-vocabulary object 6D pose estimation

Authors: Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Changjae Oh, Andrea Cavallaro, Fabio Poiesi

Abstract: The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming by 12.6 in Average Recall the previous best-performing approach.

Submitted 24 June, 2024; originally announced June 2024.

Comments: Technical report. Extension of CVPR paper "Open-vocabulary object 6D pose estimation". Project page: https://jcorsetti.github.io/oryon

arXiv:2405.01646 [pdf, other]

Explaining models relating objects and privacy

Authors: Alessio Xompero, Myriam Bontonou, Jean-Michel Arbona, Emmanouil Benetos, Andrea Cavallaro

Abstract: Accurately predicting whether an image is private before sharing it online is difficult due to the vast variety of content and the subjective nature of privacy itself. In this paper, we evaluate privacy models that use objects extracted from an image to determine why the image is predicted as private. To explain the decision of these models, we use feature-attribution to identify and quantify which objects (and which of their features) are more relevant to privacy classification with respect to a reference input (i.e., no objects localised in an image) predicted as public. We show that the presence of the person category and its cardinality is the main factor for the privacy decision. Therefore, these models mostly fail to identify private images depicting documents with sensitive data, vehicle ownership, and internet activity, or public images with people (e.g., an outdoor concert or people walking in a public space next to a famous landmark). As baselines for future benchmarks, we also devise two strategies that are based on the person presence and cardinality and achieve comparable classification performance of the privacy models.

Submitted 2 May, 2024; originally announced May 2024.

Comments: 7 pages, 3 figures, 1 table, supplementary material included as Appendix. Paper accepted at the 3rd XAI4CV Workshop at CVPR 2024. Code: https://github.com/graphnex/ig-privacy

arXiv:2405.01353 [pdf, other]

Sparse multi-view hand-object reconstruction for unseen environments

Authors: Yik Lung Pang, Changjae Oh, Andrea Cavallaro

Abstract: Recent works in hand-object reconstruction mainly focus on the single-view and dense multi-view settings. On the one hand, single-view methods can leverage learned shape priors to generalise to unseen objects but are prone to inaccuracies due to occlusions. On the other hand, dense multi-view methods are very accurate but cannot easily adapt to unseen objects without further data collection. In contrast, sparse multi-view methods can take advantage of the additional views to tackle occlusion, while keeping the computational cost low compared to dense multi-view methods. In this paper, we consider the problem of hand-object reconstruction with unseen objects in the sparse multi-view setting. Given multiple RGB images of the hand and object captured at the same time, our model SVHO combines the predictions from each view into a unified reconstruction without optimisation across views. We train our model on a synthetic hand-object dataset and evaluate directly on a real world recorded hand-object dataset with unseen objects. We show that while reconstruction of unseen hands and objects from RGB is challenging, additional views can help improve the reconstruction quality.

Submitted 2 May, 2024; originally announced May 2024.

Comments: Camera-ready version. Paper accepted to CVPRW 2024. 8 pages, 7 figures, 1 table

arXiv:2312.00690 [pdf, other]

Open-vocabulary object 6D pose estimation

Authors: Jaime Corsetti, Davide Boscaini, Changjae Oh, Andrea Cavallaro, Fabio Poiesi

Abstract: We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g., CAD or video sequence) is required at inference, and (iii) the object is imaged from two RGBD viewpoints of different scenes. To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from the scenes and to estimate its relative 6D pose. The key of our approach is a carefully devised strategy to fuse object-level information provided by the prompt with local image features, resulting in a feature space that can generalize to novel concepts. We validate our approach on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, which collectively encompass 34 object instances appearing in four thousand image pairs. The results demonstrate that our approach outperforms both a well-established hand-crafted method and a recent deep learning-based baseline in estimating the relative 6D pose of objects in different scenes. Code and dataset are available at https://jcorsetti.github.io/oryon.

Submitted 25 June, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

Comments: Camera ready version (CVPR 2024, poster highlight). New Oryon version: arXiv:2406.16384

arXiv:2310.19582 [pdf, other]

doi 10.1109/ICIP49359.2023.10222833

Human-interpretable and deep features for image privacy classification

Authors: Darya Baranouskaya, Andrea Cavallaro

Abstract: Privacy is a complex, subjective and contextual concept that is difficult to define. Therefore, the annotation of images to train privacy classifiers is a challenging task. In this paper, we analyse privacy classification datasets and the properties of controversial images that are annotated with contrasting privacy labels by different assessors. We discuss suitable features for image privacy classification and propose eight privacy-specific and human-interpretable features. These features increase the performance of deep learning models and, on their own, improve the image representation for privacy classification compared with much higher dimensional deep features.

Submitted 31 October, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

Journal ref: 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 2023, pp. 3489-3492

arXiv:2310.14447 [pdf, other]

doi 10.13009/AO4ELT7-2023-082

SOUL at LBT: commissioning results, science and future

Authors: Enrico Pinna, Fabio Rossi, Guido Agapito, Alfio Puglisi, Cédric Plantet, Essna Ghose, Matthieu Bec, Marco Bonaglia, Runa Briguglio, Guido Brusa, Luca Carbonaro, Alessandro Cavallaro, Julian Christou, Olivier Durney, Steve Ertel, Simone Esposito, Paolo Grani, Juan Carlos Guerra, Philip Hinz, Michael Lefebvre, Tommaso Mazzoni, Brandon Mechtley, Douglas L. Miller, Manny Montoya, Jennifer Power, et al. (5 additional authors not shown)

Abstract: The SOUL systems at the Large Binocular Telescope can be seen as precursors of the ELT SCAO systems, combining key technologies such as EMCCD, Pyramid WFS and adaptive telescopes. After the first light of the first upgraded system in September 2018, going through COVID and technical stops, we now have all 4 systems working on-sky. Here, we report on some key control improvements and the system performance characterized during commissioning. The upgrade allows us to correct more modes (500) at the bright end and increases the sky coverage, providing SR(K)>20% with reference stars G$_{RP}$<17, opening up extragalactic targets to NGS systems. Finally, we review the first astrophysical results, looking forward to the next-generation instruments (SHARK-NIR, SHARK-Vis and iLocater), to be fed by the SOUL AO correction.

Submitted 22 October, 2023; originally announced October 2023.

Comments: 13 pages, 10 figures, Adaptive Optics for Extremely Large Telescopes 7th Edition, 25-30 Jun 2023 Avignon (France)

Journal ref: AO4ELT7 proceedings 2023

arXiv:2310.00503 [pdf, other]

Black-box Attacks on Image Activity Prediction and its Natural Language Explanations

Authors: Alina Elena Baia, Valentina Poggioni, Andrea Cavallaro

Abstract: Explainable AI (XAI) methods aim to describe the decision process of deep neural networks. Early XAI methods produced visual explanations, whereas more recent techniques generate multimodal explanations that include textual information and visual representations. Visual XAI methods have been shown to be vulnerable to white-box and gray-box adversarial attacks, with an attacker having full or partial knowledge of and access to the target system. As the vulnerabilities of multimodal XAI models have not been examined, in this paper we assess for the first time the robustness to black-box attacks of the natural language explanations generated by a self-rationalizing image-based activity recognition model. We generate unrestricted, spatially variant perturbations that disrupt the association between the predictions and the corresponding explanations to mislead the model into generating unfaithful explanations. We show that we can create adversarial images that manipulate the explanations of an activity recognition model by having access only to its final output.

Submitted 30 September, 2023; originally announced October 2023.

Comments: Accepted at ICCV2023 AROW Workshop

arXiv:2308.11233 [pdf, other]

Affordance segmentation of hand-occluded containers from exocentric images

Authors: Tommaso Apicella, Alessio Xompero, Edoardo Ragusa, Riccardo Berta, Andrea Cavallaro, Paolo Gastaldo

Abstract: Visual affordance segmentation identifies the surfaces of an object an agent can interact with. Common challenges for the identification of affordances are the variety of the geometry and physical properties of these surfaces as well as occlusions. In this paper, we focus on occlusions of an object that is hand-held by a person manipulating it. To address this challenge, we propose an affordance segmentation model that uses auxiliary branches to process the object and hand regions separately. The proposed model learns affordance features under hand-occlusion by weighting the feature map through hand and object segmentation. To train the model, we annotated the visual affordances of an existing dataset with mixed-reality images of hand-held containers in third-person (exocentric) images. Experiments on both real and mixed-reality images show that our model achieves better affordance segmentation and generalisation than existing models.

Submitted 22 August, 2023; originally announced August 2023.

Comments: Paper accepted to Workshop on Assistive Computer Vision and Robotics (ACVR) in International Conference on Computer Vision (ICCV) 2023; 10 pages, 4 figures, 2 tables. Data, code, and trained models are available at https://apicis.github.io/projects/acanet.html

arXiv:2306.10887 [pdf]

doi 10.1021/acsami.2c01379

Ion Intercalation in Lanthanum Strontium Ferrite for Aqueous Electrochemical Energy Storage Devices

Authors: Yunqing Tang, Francesco Chiabrera, Alex Morata, Andrea Cavallaro, Maciej O. Liedke, Hemesh Avireddy, Mar Maller, Maik Butterling, Andreas Wagner, Michel Stchakovsky, Federico Baiutti, Ainara Aguadero, Albert Tarancón

Abstract: Ion intercalation of perovskite oxides in liquid electrolytes is a very promising method for controlling their functional properties while storing charge, which opens the potential application in different energy and information technologies. Although the role of defect chemistry in the oxygen intercalation in a gaseous environment is well established, the mechanism of ion intercalation in liquid electrolytes at room temperature is poorly understood. In this study, the defect chemistry during ion intercalation of La0.5Sr0.5FeO3-δ thin films in alkaline electrolytes is studied. Oxygen and proton intercalation into the LSF perovskite structure is observed at moderate electrochemical potentials (0.5 V to -0.4 V), giving rise to a change in the oxidation state of Fe (as a charge compensation mechanism). The variation of the concentration of holes as a function of the intercalation potential was characterized by in-situ ellipsometry and the concentration of electron holes was indirectly quantified for different electrochemical potentials. Finally, a dilute defect chemistry model that describes the variation of defect species during ionic intercalation was developed.

Submitted 19 June, 2023; originally announced June 2023.

Journal ref: ACS Appl. Mater. Interfaces 2022, 14, 18486

arXiv:2305.07396 [pdf, other]

GMP-selected dual and lensed AGNs: selection function and classification based on near-IR colors and resolved spectra from VLT/ERIS, KECK/OSIRIS, and LBT/LUCI

Authors: F. Mannucci, M. Scialpi, A. Ciurlo, S. Yeh, C. Marconcini, G. Tozzi, G. Cresci, A. Marconi, A. Amiri, F. Belfiore, S. Carniani, C. Cicone, E. Nardini, E. Pancino, K. Rubinur, P. Severgnini, L. Ulivi, G. Venturi, C. Vignali, M. Volonteri, E. Pinna, F. Rossi, A. Puglisi, G. Agapito, C. Plantet, et al. (22 additional authors not shown)

Abstract: The Gaia-Multi-Peak (GMP) technique can be used to identify large numbers of dual or lensed AGN candidates at sub-arcsec separation, allowing us to study both multiple SMBHs in the same galaxy and rare, compact lensed systems. The observed samples can be used to test the predictions of the models of SMBH merging once 1) the selection function of the GMP technique is known, and 2) each system has been classified as dual AGN, lensed AGN, or AGN/star alignment. Here we show that the GMP selection is very efficient for separations above 0.15'' when the secondary (fainter) object has magnitude G<20.5. We present the spectroscopic classification of five GMP candidates using VLT/ERIS and Keck/OSIRIS, and compare them with the classifications obtained from: a) the near-IR colors of 7 systems obtained with LBT/LUCI, and b) the analysis of the total, spatially-unresolved spectra. We conclude that colors and integrated spectra can already provide reliable classifications of many systems. Finally, we summarize the confirmed dual AGNs at z>0.5 selected by the GMP technique, and compare this sample with other such systems from the literature, concluding that GMP can provide a large number of confirmed dual AGNs at separations below 7 kpc.

Submitted 9 October, 2023; v1 submitted 12 May, 2023; originally announced May 2023.

Comments: 14 pages, A&A, in press

arXiv:2211.10470 [pdf, other]

A mixed-reality dataset for category-level 6D pose and size estimation of hand-occluded containers

Authors: Xavier Weber, Alessio Xompero, Andrea Cavallaro

Abstract: Estimating the 6D pose and size of household containers is challenging due to large intra-class variations in the object properties, such as shape, size, appearance, and transparency. The task is made more difficult when these objects are held and manipulated by a person due to varying degrees of hand occlusions caused by the type of grasps and by the viewpoint of the camera observing the person holding the object. In this paper, we present a mixed-reality dataset of hand-occluded containers for category-level 6D object pose and size estimation. The dataset consists of 138,240 images of rendered hands and forearms holding 48 synthetic objects, split into 3 grasp categories over 30 real backgrounds. We re-train and test an existing model for 6D object pose estimation on our mixed-reality dataset. We discuss the impact of the use of this dataset in improving the task of 6D pose and size estimation.

Submitted 18 November, 2022; originally announced November 2022.

Comments: 5 pages, 4 figures, 1 table. Submitted to IEEE ICASSP 2023. Webpage at https://corsmal.eecs.qmul.ac.uk/pose.html

arXiv:2210.11169 [pdf, other]

Content-based Graph Privacy Advisor

Authors: Dimitrios Stoidis, Andrea Cavallaro

Abstract: People may be unaware of the privacy risks of uploading an image online. In this paper, we present Graph Privacy Advisor, an image privacy classifier that uses scene information and object cardinality as cues to predict whether an image is private. Graph Privacy Advisor simplifies a state-of-the-art graph model and improves its performance by refining the relevance of the information extracted from the image. We determine the most informative visual features to be used for the privacy classification task and reduce the complexity of the model by replacing high-dimensional image feature vectors with lower-dimensional, more effective features. We also address the problem of biased prior information by modelling object co-occurrences instead of the frequency of object occurrences in each class.

Submitted 13 November, 2022; v1 submitted 20 October, 2022; originally announced October 2022.

Comments: 8 pages, 3 figures, in Proceedings of IEEE BigMM 2022

arXiv:2209.05077 [pdf, other]

BON: An extended public domain dataset for human activity recognition

Authors: Girmaw Abebe Tadesse, Oliver Bent, Komminist Weldemariam, Md. Abrar Istiak, Taufiq Hasan, Andrea Cavallaro

Abstract: A body-worn first-person vision (FPV) camera enables the extraction of rich information about the environment from the subject's viewpoint. However, research progress in wearable camera-based egocentric office activity understanding is slow compared to other activity environments (e.g., kitchen and outdoor ambulatory), mainly due to the lack of adequate datasets to train more sophisticated (e.g., deep learning) models for human activity recognition in office environments. This paper provides details of a large and publicly available office activity dataset (BON) collected in different office settings across three geographical locations: Barcelona (Spain), Oxford (UK) and Nairobi (Kenya), using a chest-mounted GoPro Hero camera. The BON dataset contains eighteen common office activities that can be categorised into person-to-person interactions (e.g., Chat with colleagues), person-to-object (e.g., Writing on a whiteboard), and proprioceptive (e.g., Walking). Annotations are provided for each 5-second video segment. Overall, BON contains 25 subjects and 2639 segments in total. To facilitate further research in the sub-domain, we also provide results that can be used as baselines for future studies.

Submitted 12 September, 2022; originally announced September 2022.

arXiv:2208.11661 [pdf, other]

Cross-Camera View-Overlap Recognition

Authors: Alessio Xompero, Andrea Cavallaro

Abstract: We propose a decentralised view-overlap recognition framework that operates across freely moving cameras without the need of a reference 3D map. Each camera independently extracts, aggregates into a hierarchical structure, and shares feature-point descriptors over time. A view overlap is recognised by view-matching and geometric validation to discard wrongly matched views. The proposed framework is generic and can be used with different descriptors. We conduct the experiments on publicly available sequences as well as new sequences we collected with hand-held cameras. We show that Oriented FAST and Rotated BRIEF (ORB) features with Bags of Binary Words within the proposed framework lead to higher precision and a higher or similar accuracy compared to NetVLAD, RootSIFT, and SuperGlue.

Submitted 24 August, 2022; originally announced August 2022.

Comments: 17 pages, 5 figures, 2 tables. Accepted to International Workshop on Distributed Smart Cameras (IWDSC) at the 2022 European Conference on Computer Vision (ECCV2022)

arXiv:2207.08726 [pdf, other]

Radiative pulsed L-mode operation in ARC-class reactors

Authors: S. J. Frank, C. J. Perks, A. O. Nelson, T. Qian, S. **, A. J. Cavallaro, A. Rutkowski, A. H. Reiman, J. P. Freidberg, P. Rodriguez-Fernandez, D. G. Whyte

Abstract: A new ARC-class, highly-radiative, pulsed, L-mode, burning plasma scenario is developed and evaluated as a candidate for future tokamak reactors. Pulsed inductive operation alleviates the stringent current drive requirements of steady-state reactors, and operation in L-mode affords ELM-free access to $\sim90\%$ core radiation fractions, significantly reducing the divertor power handling requirements. In this configuration the fusion power density can be maximized despite L-mode confinement by utilizing high-field to increase plasma densities and current. This allows us to obtain high gain in robust scenarios in compact devices with $P_\mathrm{fus} > 1000\,$MW despite low confinement. We demonstrate the feasibility of such scenarios here; first by showing that they avoid violating 0-D tokamak limits, and then by performing self-consistent integrated simulations of flattop operation including neoclassical and turbulent transport, magnetic equilibrium, and RF current drive models. Finally we examine the potential effect of introducing negative triangularity with a 0-D model. Our results show high-field radiative pulsed L-mode scenarios are a promising alternative to the typical steady state advanced tokamak scenarios which have dominated tokamak reactor development.

Submitted 9 September, 2022; v1 submitted 18 July, 2022; originally announced July 2022.

arXiv:2207.05470 [pdf, other]

On the limits of perceptual quality measures for enhanced underwater images

Authors: Chau Yi Li, Andrea Cavallaro

Abstract: The appearance of objects in underwater images is degraded by the selective attenuation of light, which reduces contrast and causes a colour cast. This degradation depends on the water environment, and increases with depth and with the distance of the object from the camera. Despite an increasing volume of works in underwater image enhancement and restoration, the lack of a commonly accepted evaluation measure is hindering the progress as it is difficult to compare methods. In this paper, we review commonly used colour accuracy measures, such as colour reproduction error and CIEDE2000, and no-reference image quality measures, such as UIQM, UCIQE and CCF, which have not yet been systematically validated. We show that none of the no-reference quality measures satisfactorily rates the quality of enhanced underwater images and discuss their main shortcomings. Images and results are available at https://puiqe.eecs.qmul.ac.uk.

Submitted 12 July, 2022; originally announced July 2022.

Comments: Accepted in ICIP 2022

arXiv:2207.01052 [pdf, other]

Generating gender-ambiguous voices for privacy-preserving speech recognition

Authors: Dimitrios Stoidis, Andrea Cavallaro

Abstract: Our voice encodes a uniquely identifiable pattern which can be used to infer private attributes, such as gender or identity, that an individual might wish not to reveal when using a speech recognition service. To prevent attribute inference attacks alongside speech recognition tasks, we present a generative adversarial network, GenGAN, that synthesises voices that conceal the gender or identity of a speaker. The proposed network includes a generator with a U-Net architecture that learns to fool a discriminator. We condition the generator only on gender information and use an adversarial loss between signal distortion and privacy preservation. We show that GenGAN improves the trade-off between privacy and utility compared to privacy-preserving representation learning methods that consider gender information as a sensitive attribute to protect.

Submitted 3 July, 2022; originally announced July 2022.

Comments: 5 pages, 4 figures, submitted to INTERSPEECH

arXiv:2206.00772 [pdf, other]

On the reversibility of adversarial attacks

Authors: Chau Yi Li, Ricardo Sánchez-Matilla, Ali Shahin Shamsabadi, Riccardo Mazzon, Andrea Cavallaro

Abstract: Adversarial attacks modify images with perturbations that change the prediction of classifiers. These modified images, known as adversarial examples, expose the vulnerabilities of deep neural network classifiers. In this paper, we investigate the predictability of the mapping between the classes predicted for original images and for their corresponding adversarial examples. This predictability relates to the possibility of retrieving the original predictions and hence reversing the induced misclassification. We refer to this property as the reversibility of an adversarial attack, and quantify reversibility as the accuracy in retrieving the original class or the true class of an adversarial example. We present an approach that reverses the effect of an adversarial attack on a classifier using a prior set of classification results. We analyse the reversibility of state-of-the-art adversarial attacks on benchmark classifiers and discuss the factors that affect the reversibility.

Submitted 1 June, 2022; originally announced June 2022.
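
Illustrative note: the abstract above quantifies reversibility as the accuracy in retrieving the original class of an adversarial example from a prior set of classification results. The sketch below shows one simple way such a score could be computed. It assumes a majority-vote lookup table from adversarial class to original class, which is a stand-in chosen for illustration and not necessarily the authors' approach; the function names and class labels are hypothetical.

```python
from collections import Counter, defaultdict


def build_reverse_map(prior_pairs):
    """Map each adversarial class to the original class it most often came from.

    prior_pairs: iterable of (original_class, adversarial_class) tuples,
    standing in for the "prior set of classification results" (assumption).
    """
    counts = defaultdict(Counter)
    for original, adversarial in prior_pairs:
        counts[adversarial][original] += 1
    return {adv: c.most_common(1)[0][0] for adv, c in counts.items()}


def reversibility(reverse_map, test_pairs):
    """Fraction of test examples whose original class is retrieved correctly."""
    test_pairs = list(test_pairs)
    if not test_pairs:
        return 0.0
    hits = sum(
        1
        for original, adversarial in test_pairs
        if reverse_map.get(adversarial) == original
    )
    return hits / len(test_pairs)


# Hypothetical toy data: (class predicted for the original image,
# class predicted for its adversarial example).
prior = [("cat", "dog"), ("cat", "dog"), ("fox", "dog"), ("car", "truck")]
test = [("cat", "dog"), ("fox", "dog"), ("car", "truck"), ("bus", "van")]

reverse_map = build_reverse_map(prior)
score = reversibility(reverse_map, test)
print(reverse_map, score)
```

An adversarial class never seen in the prior set (here "van") cannot be reversed by this lookup and counts as a miss.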

arXiv:2203.04027 [pdf, other]

Data augmentation with mixtures of max-entropy transformations for filling-level classification

Authors: Apostolos Modas, Andrea Cavallaro, Pascal Frossard

Abstract: We address the problem of distribution shifts in test-time data with a principled data augmentation scheme for the task of content-level classification. In such a task, properties such as shape or transparency of test-time containers (cup or drinking glass) may differ from those represented in the training data. Dealing with such distribution shifts using standard augmentation schemes is challenging and transforming the training images to cover the properties of the test-time instances requires sophisticated image manipulations. We therefore generate diverse augmentations using a family of max-entropy transformations that create samples with new shapes, colors and spectral characteristics. We show that such a principled augmentation scheme, alone, can replace current approaches that use transfer learning or can be used in combination with transfer learning to improve its performance.

Submitted 8 March, 2022; originally announced March 2022.

arXiv:2203.02635 [pdf, other]

Training privacy-preserving video analytics pipelines by suppressing features that reveal information about private attributes

Authors: Chau Yi Li, Andrea Cavallaro

Abstract: Deep neural networks are increasingly deployed for scene analytics, including to evaluate the attention and reaction of people exposed to out-of-home advertisements. However, the features extracted by a deep neural network that was trained to predict a specific, consensual attribute (e.g. emotion) may also encode and thus reveal information about private, protected attributes (e.g. age or gender). In this work, we focus on such leakage of private information at inference time. We consider an adversary with access to the features extracted by the layers of a deployed neural network and use these features to predict private attributes. To prevent the success of such an attack, we modify the training of the network using a confusion loss that encourages the extraction of features that make it difficult for the adversary to accurately predict private attributes. We validate this training approach on image-based tasks using a publicly available dataset. Results show that, compared to the original network, the proposed PrivateNet can reduce the leakage of private information of a state-of-the-art emotion recognition classifier by 2.88% for gender and by 13.06% for age group, with a minimal effect on task accuracy.

Submitted 1 June, 2022; v1 submitted 4 March, 2022; originally announced March 2022.

arXiv:2203.01977 [pdf, other]

Audio-Visual Object Classification for Human-Robot Collaboration

Authors: A. Xompero, Y. L. Pang, T. Patten, A. Prabhakar, B. Calli, A. Cavallaro

Abstract: Human-robot collaboration requires the contactless estimation of the physical properties of containers manipulated by a person, for example while pouring content in a cup or moving a food box. Acoustic and visual signals can be used to estimate the physical properties of such objects, which may vary substantially in shape, material and size, and also be occluded by the hands of the person. To facilitate comparisons and stimulate progress in solving this problem, we present the CORSMAL challenge and a dataset to assess the performance of the algorithms through a set of well-defined performance scores. The tasks of the challenge are the estimation of the mass, capacity, and dimensions of the object (container), and the classification of the type and amount of its content. A novel feature of the challenge is our real-to-simulation framework for visualising and assessing the impact of estimation errors in human-to-robot handovers.

Submitted 3 March, 2022; originally announced March 2022.

Comments: 5 pages, 2 figures, 1 table; accepted at ICASSP 2022; Challenge webpage, see https://corsmal.eecs.qmul.ac.uk/challenge.html

arXiv:2202.09263 [pdf, other]

Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition?

Authors: Vandana Rajan, Alessio Brutti, Andrea Cavallaro

Abstract: Humans express their emotions via facial expressions, voice intonation and word choices. To infer the nature of the underlying emotion, recognition models may use a single modality, such as vision, audio, and text, or a combination of modalities. Generally, models that fuse complementary information from multiple modalities outperform their uni-modal counterparts. However, a successful model that fuses modalities requires components that can effectively aggregate task-relevant information from each modality. As cross-modal attention is seen as an effective mechanism for multi-modal fusion, in this paper we quantify the gain that such a mechanism brings compared to the corresponding self-attention mechanism. To this end, we implement and compare a cross-attention and a self-attention model. In addition to attention, each model uses convolutional layers for local feature extraction and recurrent layers for global sequential modelling. We compare the models using different modality combinations for a 7-class emotion classification task using the IEMOCAP dataset. Experimental results indicate that albeit both models improve upon the state-of-the-art in terms of weighted and unweighted accuracy for tri- and bi-modal configurations, their performance is generally statistically comparable. The code to replicate the experiments is available at https://github.com/smartcameras/SelfCrossAttn

Submitted 18 February, 2022; originally announced February 2022.

Comments: Accepted at ICASSP 2022

arXiv:2112.02381 [pdf, other]

Active Sensing for Search and Tracking: A Review

Authors: Luca Varotto, Angelo Cenedese, Andrea Cavallaro

Abstract: Active Position Estimation (APE) is the task of localizing one or more targets using one or more sensing platforms. APE is a key task for search and rescue missions, wildlife monitoring, source term estimation, and collaborative mobile robotics. Success in APE depends on the level of cooperation of the sensing platforms, their number, their degrees of freedom and the quality of the information gathered. APE control laws enable active sensing by satisfying either pure-exploitative or pure-explorative criteria. The former minimizes the uncertainty on position estimation; whereas the latter drives the platform closer to its task completion. In this paper, we define the main elements of APE to systematically classify and critically discuss the state of the art in this domain. We also propose a reference framework as a formalism to classify APE-related solutions. Overall, this survey explores the principal challenges and envisages the main research directions in the field of autonomous perception systems for localization tasks. It is also beneficial to promote the development of robust active sensing methods for search and tracking applications.

Submitted 4 December, 2021; originally announced December 2021.

Comments: 26 pages, 5 tables, 3 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2108.03318 [pdf, other]

OHPL: One-shot Hand-eye Policy Learner

Authors: Changjae Oh, Yik Lung Pang, Andrea Cavallaro

Abstract: The control of a robot for manipulation tasks generally relies on object detection and pose estimation. An attractive alternative is to learn control policies directly from raw input data. However, this approach is time-consuming and expensive since learning the policy requires many trials with robot actions in the physical environment. To reduce the training cost, the policy can be learned in simulation with a large set of synthetic images. The limit of this approach is the domain gap between the simulation and the robot workspace. In this paper, we propose to learn a policy for robot reaching movements from a single image captured directly in the robot workspace from a camera placed on the end-effector (a hand-eye camera). The idea behind the proposed policy learner is that view changes seen from the hand-eye camera produced by actions in the robot workspace are analogous to locating a region-of-interest in a single image by performing sequential object localisation. This similar view change enables training of object reaching policies using reinforcement-learning-based sequential object localisation. To facilitate the adaptation of the policy to view changes in the robot workspace, we further present a dynamic filter that learns to bias an input state to remove irrelevant information for an action decision. The proposed policy learner can be used as a powerful representation for robotic tasks, and we validate it on static and moving object reaching tasks.

Submitted 6 August, 2021; originally announced August 2021.

Comments: Camera-ready version. Paper accepted to IROS 2021. 7 pages, 7 figures, 2 tables

arXiv:2108.00809 [pdf, other]

Cross-Modal Knowledge Transfer via Inter-Modal Translation and Alignment for Affect Recognition

Authors: Vandana Rajan, Alessio Brutti, Andrea Cavallaro

Abstract: Multi-modal affect recognition models leverage complementary information in different modalities to outperform their uni-modal counterparts. However, due to the unavailability of modality-specific sensors or data, multi-modal models may not be always employable. For this reason, we aim to improve the performance of uni-modal affect recognition models by transferring knowledge from a better-performing (or stronger) modality to a weaker modality during training. Our proposed multi-modal training framework for cross-modal knowledge transfer relies on two main steps. First, an encoder-classifier model creates task-specific representations for the stronger modality. Then, cross-modal translation generates multi-modal intermediate representations, which are also aligned in the latent space with the stronger modality representations. To exploit the contextual information in temporal sequential affect data, we use Bi-GRU and transformer encoder. We validate our approach on two multi-modal affect datasets, namely CMU-MOSI for binary sentiment classification and RECOLA for dimensional emotion regression. The results show that the proposed approach consistently improves the uni-modal test-time performance of the weaker modalities.

Submitted 2 August, 2021; originally announced August 2021.

Comments: Under review

arXiv:2107.12719 [pdf, other]

doi 10.1109/ACCESS.2022.3166906

The CORSMAL benchmark for the prediction of the properties of containers

Authors: Alessio Xompero, Santiago Donaher, Vladimir Iashin, Francesca Palermo, Gökhan Solak, Claudio Coppola, Reina Ishikawa, Yuichi Nagao, Ryo Hachiuma, Qi Liu, Fan Feng, Chuanlin Lan, Rosa H. M. Chan, Guilherme Christmann, Jyun-Ting Song, Gonuguntla Neeharika, Chinnakotla Krishna Teja Reddy, Dinesh Jain, Bakhtawar Ur Rehman, Andrea Cavallaro

Abstract: The contactless estimation of the weight of a container and the amount of its content manipulated by a person are key pre-requisites for safe human-to-robot handovers. However, opaqueness and transparencies of the container and the content, and variability of materials, shapes, and sizes, make this estimation difficult. In this paper, we present a range of methods and an open framework to benchmark acoustic and visual perception for the estimation of the capacity of a container, and the type, mass, and amount of its content. The framework includes a dataset, specific tasks and performance measures. We conduct an in-depth comparative analysis of methods that used this framework and audio-only or vision-only baselines designed from related works. Based on this analysis, we can conclude that audio-only and audio-visual classifiers are suitable for the estimation of the type and amount of the content using different types of convolutional neural networks, combined with either recurrent neural networks or a majority voting strategy, whereas computer vision methods are suitable to determine the capacity of the container using regression and geometric approaches. Classifying the content type and level using only audio achieves a weighted average F1-score up to 81% and 97%, respectively. Estimating the container capacity with vision-only approaches and estimating the filling mass with audio-visual multi-stage approaches reach up to 65% weighted average capacity and mass scores. These results show that there is still room for improvement on the design of new methods. These new methods can be ranked and compared on the individual leaderboards provided by our open framework.

Submitted 21 April, 2022; v1 submitted 27 July, 2021; originally announced July 2021.

Comments: Authors' post-print accepted for publication in IEEE Access, see https://doi.org/10.1109/ACCESS.2022.3166906 . 14 pages, 6 tables, 7 figures

Journal ref: IEEE Access, vol. 10, 2022, 1-15

arXiv:2107.01309 [pdf, other]

Towards safe human-to-robot handovers of unknown containers

Authors: Yik Lung Pang, Alessio Xompero, Changjae Oh, Andrea Cavallaro

Abstract: Safe human-to-robot handovers of unknown objects require accurate estimation of hand poses and object properties, such as shape, trajectory, and weight. Accurately estimating these properties requires the use of scanned 3D object models or expensive equipment, such as motion capture systems and markers, or both. However, testing handover algorithms with robots may be dangerous for the human and, when the object is an open container with liquids, for the robot. In this paper, we propose a real-to-simulation framework to develop safe human-to-robot handovers with estimations of the physical properties of unknown cups or drinking glasses and estimations of the human hands from videos of a human manipulating the container. We complete the handover in simulation, and we estimate a region that is not occluded by the hand of the human holding the container. We also quantify the safeness of the human and object in simulation. We validate the framework using public recordings of containers manipulated before a handover and show the safeness of the handover when using noisy estimates from a range of perceptual algorithms.

Submitted 2 July, 2021; originally announced July 2021.

Comments: Camera-ready version. Paper accepted to RO-MAN 2021. 8 pages, 8 figures, 1 table

arXiv:2106.04528 [pdf, other]

doi 10.1088/1361-6587/ac2890

Modeling of Particle Transport, Neutrals and Radiation in Magnetically-Confined Plasmas with Aurora

Authors: F. Sciortino, T. Odstrčil, A. Cavallaro, S. Smith, O. Meneghini, R. Reksoatmodjo, O. Linder, J. D. Lore, N. T. Howard, E. S. Marmar, S. Mordijck

Abstract: We present Aurora, an open-source package for particle transport, neutrals and radiation modeling in magnetic confinement fusion plasmas. Aurora's modern multi-language interface enables simulations of 1.5D impurity transport within high-performance computing frameworks, particularly for the inference of particle transport coefficients. A user-friendly Python library allows simple interaction with atomic rates from the Atomic Data and Atomic Structure database as well as other sources. This enables a range of radiation predictions, both for power balance and spectroscopic analysis. We discuss here the superstaging approximation for complex ions, as a way to group charge states and reduce computational cost, demonstrating its wide applicability within the Aurora forward model and beyond. Aurora also facilitates neutral particle analysis, both from experimental spectroscopic data and other simulation codes. Leveraging Aurora's capabilities to interface SOLPS-ITER results, we demonstrate that charge exchange is unlikely to affect the total radiated power from the ITER core during high performance operation. Finally, we describe the ImpRad module in the OMFIT framework, developed to enable experimental analysis and transport inferences on multiple devices using Aurora.

Submitted 7 July, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

Comments: 8 pages + references, 5 figures

arXiv:2104.11051 [pdf, other]

Protecting gender and identity with disentangled speech representations

Authors: Dimitrios Stoidis, Andrea Cavallaro

Abstract: Besides its linguistic content, our speech is rich in biometric information that can be inferred by classifiers. Learning privacy-preserving representations for speech signals enables downstream tasks without sharing unnecessary, private information about an individual. In this paper, we show that protecting gender information in speech is more effective than modelling speaker-identity information only when generating a non-sensitive representation of speech. Our method relies on reconstructing speech by decoding linguistic content along with gender information using a variational autoencoder. Specifically, we exploit disentangled representation learning to encode information about different attributes into separate subspaces that can be factorised independently. We present a novel way to encode gender information and disentangle two sensitive biometric identifiers, namely gender and identity, in a privacy-protecting setting. Experiments on the LibriSpeech dataset show that gender recognition and speaker verification can be reduced to a random guess, protecting against classification-based attacks.

Submitted 16 June, 2021; v1 submitted 22 April, 2021; originally announced April 2021.

Comments: 5 pages, 2 figures

arXiv:2103.15999 [pdf, other]

Audio classification of the content of food containers and drinking glasses

Authors: Santiago Donaher, Alessio Xompero, Andrea Cavallaro

Abstract: Food containers, drinking glasses and cups handled by a person generate sounds that vary with the type and amount of their content. In this paper, we propose a new model for sound-based classification of the type and amount of content in a container. The proposed model is based on the decomposition of the problem into two steps, namely action recognition and content classification. We use the scenario of the recent CORSMAL Containers Manipulation dataset and consider two actions (shaking and pouring), and seven combinations of material and filling level. The first step identifies the action performed by a person with the container. The second step determines the amount and type of content using an action-specific classifier. Experiments show that the proposed model achieves 76.02, 78.24, and 41.89 weighted average F1 score on the three test sets, respectively, and outperforms baselines and existing approaches that classify the content amount and type either independently or jointly.

Submitted 9 June, 2021; v1 submitted 29 March, 2021; originally announced March 2021.

Comments: Camera-ready version. Paper accepted to EUSIPCO21. 5 pages, 4 figures, 3 tables. Minor improvements to the paper presentation
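
Illustrative note: the abstract above reports results as a weighted average F1 score. As a reminder of how that measure is commonly defined, the sketch below computes the per-class F1 and averages it with weights proportional to each class's number of true samples. The class labels and predictions are made-up toy data, and the paper's exact evaluation protocol may differ (for instance, its scores appear to be expressed on a 0 to 100 scale).

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores, in [0, 1]."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical content labels, not the dataset's actual class names.
y_true = ["empty", "empty", "water", "water", "water", "rice"]
y_pred = ["empty", "water", "water", "water", "rice", "rice"]

result = weighted_f1(y_true, y_pred)
print(round(100 * result, 2))
```

Weighting by support keeps a rare class from dominating the average, at the cost of hiding poor performance on that class.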

arXiv:2102.04057 [pdf, other]

Improving filling level classification with adversarial training

Authors: Apostolos Modas, Alessio Xompero, Ricardo Sanchez-Matilla, Pascal Frossard, Andrea Cavallaro

Abstract: We investigate the problem of classifying - from a single image - the level of content in a cup or a drinking glass. This problem is made challenging by several ambiguities caused by transparencies, shape variations and partial occlusions, and by the availability of only small training datasets. In this paper, we tackle this problem with an appropriate strategy for transfer learning. Specifically, we use adversarial training in a generic source dataset and then refine the training with a task-specific dataset. We also discuss and experimentally evaluate several training strategies and their combination on a range of container types of the CORSMAL Containers Manipulation dataset. We show that transfer learning with adversarial training in the source domain consistently improves the classification accuracy on the test set and limits the overfitting of the classifier to specific features of the training data.

Submitted 16 June, 2021; v1 submitted 8 February, 2021; originally announced February 2021.

Comments: Accepted to the 28th IEEE International Conference on Image Processing (ICIP) 2021

arXiv:2101.07091 [pdf, other]

Bringing SOUL on sky

Authors: Enrico Pinna, Fabio Rossi, Alfio Puglisi, Guido Agapito, Marco Bonaglia, Cedric Plantet, Tommaso Mazzoni, Runa Briguglio, Luca Carbonaro, Marco Xompero, Paolo Grani, Armando Riccardi, Simone Esposito, Phil Hinz, Amali Vaz, Steve Ertel, Oscar M. Montoya, Oliver Durney, Julian Christou, Doug L. Miller, Greg Taylor, Alessandro Cavallaro, Michael Lefebvre

Abstract: The SOUL project is upgrading the 4 SCAO systems of LBT, pushing the current guide star limits of about 2 magnitudes fainter thanks to Electron Multiplied CCD detector. This improvement will open the NGS SCAO correction to a wider number of scientific cases from high contrast imaging in the visible to extra-galactic source in the NIR. The SOUL systems are today the unique case where pyramid WFS, adaptive secondary and EMCCD are used together. This makes SOUL a pathfinder for most of the ELT SCAO systems like the one of GMT, MICADO and HARMONI of E-ELT, where the same key technologies will be employed. Today we have 3 SOUL systems installed on the telescope in commissioning phase. The 4th system will be installed in a few months. We will present here the results achieved during daytime testing and commissioning nights up to the present date.

Submitted 18 January, 2021; originally announced January 2021.

Comments: 11 pages, 9 figures, 1 table. AO4ELT6 proceedings

Journal ref: AO4ELT6 proceedings 2019

arXiv:2101.06795 [pdf, other]

An embedded multichannel sound acquisition system for drone audition

Authors: Michael Clayton, Lin Wang, Andrew McPherson, Andrea Cavallaro

Abstract: Microphone array techniques can improve the acoustic sensing performance on drones, compared to the use of a single microphone. However, multichannel sound acquisition systems are not available in current commercial drone platforms. To encourage the research in drone audition, we present an embedded sound acquisition and recording system with eight microphones and a multichannel sound recorder mounted on a quadcopter. In addition to recording and storing locally the sound from multiple microphones simultaneously, the embedded system can connect wirelessly to a remote terminal to transfer audio files for further processing. This will be the first stage towards creating a fully embedded solution for drone audition. We present experimental results obtained by state-of-the-art drone audition algorithms applied to the sound recorded by the embedded system.

Submitted 17 January, 2021; originally announced January 2021.

arXiv:2012.12258 [pdf, other]

Underwater image filtering: methods, datasets and evaluation

Authors: Chau Yi Li, Riccardo Mazzon, Andrea Cavallaro

Abstract: Underwater images are degraded by the selective attenuation of light that distorts colours and reduces contrast. The degradation extent depends on the water type, the distance between an object and the camera, and the depth under the water surface the object is at. Underwater image filtering aims to restore or to enhance the appearance of objects captured in an underwater image. Restoration methods compensate for the actual degradation, whereas enhancement methods improve either the perceived image quality or the performance of computer vision algorithms. The growing interest in underwater image filtering methods--including learning-based approaches used for both restoration and enhancement--and the associated challenges call for a comprehensive review of the state of the art. In this paper, we review the design principles of filtering methods and revisit the oceanology background that is fundamental to identify the degradation causes. We discuss image formation models and the results of restoration methods in various water types. Furthermore, we present task-dependent enhancement methods and categorise datasets for training neural networks and for method evaluation. Finally, we discuss evaluation strategies, including subjective tests and quality assessment measures. We complement this survey with a platform ( https://puiqe.eecs.qmul.ac.uk/ ), which hosts state-of-the-art underwater filtering methods and facilitates comparisons.

Submitted 22 December, 2020; originally announced December 2020.

arXiv:2011.10474 [pdf, other]

Probabilistic Radio-Visual Active Sensing for Search and Tracking

Authors: L. Varotto, A. Cenedese, A. Cavallaro

Abstract: Active Search and Tracking for search and rescue missions or collaborative mobile robotics relies on the actuation of a sensing platform to detect and localize a target. In this paper we focus on visually detecting a radio-emitting target with an aerial robot equipped with a radio receiver and a camera. Visual-based tracking provides high accuracy, but the directionality of the sensing domain may require long search times before detecting the target. Conversely, radio signals have larger coverage, but lower tracking accuracy. Thus, we design a Recursive Bayesian Estimation scheme that uses camera observations to refine radio measurements. To regulate the camera pose, we design an optimal controller whose cost function is built upon a probabilistic map. Theoretical results support the proposed algorithm, while numerical analyses show higher robustness and efficiency with respect to visual and radio-only baselines.

Submitted 11 April, 2021; v1 submitted 20 November, 2020; originally announced November 2020.

Comments: 6 pages, 3 figures, 1 table, accepted at ECC 2021

arXiv:2011.08483 [pdf, other]

FoolHD: Fooling speaker identification by Highly imperceptible adversarial Disturbances

Authors: Ali Shahin Shamsabadi, Francisco Sepúlveda Teixeira, Alberto Abad, Bhiksha Raj, Andrea Cavallaro, Isabel Trancoso

Abstract: Speaker identification models are vulnerable to carefully designed adversarial perturbations of their input signals that induce misclassification. In this work, we propose a white-box steganography-inspired adversarial attack that generates imperceptible adversarial perturbations against a speaker identification model. Our approach, FoolHD, uses a Gated Convolutional Autoencoder that operates in the DCT domain and is trained with a multi-objective loss function, in order to generate and conceal the adversarial perturbation within the original audio files. In addition to hindering speaker identification performance, this multi-objective loss accounts for human perception through a frame-wise cosine similarity between MFCC feature vectors extracted from the original and adversarial audio files. We validate the effectiveness of FoolHD with a 250-speaker identification x-vector network, trained using VoxCeleb, in terms of accuracy, success rate, and imperceptibility. Our results show that FoolHD generates highly imperceptible adversarial audio files (average PESQ scores above 4.30), while achieving a success rate of 99.6% and 99.2% in misleading the speaker identification model, for untargeted and targeted settings, respectively.

Submitted 20 February, 2021; v1 submitted 17 November, 2020; originally announced November 2020.

Comments: https://fsepteixeira.github.io/FoolHD/

arXiv:2011.01631 [pdf, other]

Robust Latent Representations via Cross-Modal Translation and Alignment

Authors: Vandana Rajan, Alessio Brutti, Andrea Cavallaro

Abstract: Multi-modal learning relates information across observation modalities of the same physical phenomenon to leverage complementary information. Most multi-modal machine learning methods require that all the modalities used for training are also available for testing. This is a limitation when the signals from some modalities are unavailable or are severely degraded by noise. To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only. The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment to improve the representations of the weaker modalities. The translation from the weaker to the stronger modality generates a multi-modal intermediate encoding that is representative of both modalities. This encoding is then correlated with the stronger modality representations in a shared latent space. We validate the proposed approach on the AVEC 2016 dataset for continuous emotion recognition and show the effectiveness of the approach that achieves state-of-the-art (uni-modal) performance for weaker modalities.

Submitted 8 March, 2021; v1 submitted 3 November, 2020; originally announced November 2020.

Journal ref: ICASSP 2021

arXiv:2009.14684 [pdf, other]

Benchmark for Anonymous Video Analytics

Authors: Ricardo Sanchez-Matilla, Andrea Cavallaro

Abstract: Out-of-home audience measurement aims to count and characterize the people exposed to advertising content in the physical world. While audience measurement solutions based on computer vision are of increasing interest, no commonly accepted benchmark exists to evaluate and compare their performance. In this paper, we propose the first benchmark for digital out-of-home audience measurement that evaluates the vision-based tasks of audience localization and counting, and audience demographics. The benchmark is composed of a novel dataset captured at multiple locations and a set of performance measures. Using the benchmark, we present an in-depth comparison of eight open-source algorithms on four hardware platforms with GPU and CPU-optimized inferences and of two commercial off-the-shelf solutions for localization, count, age, and gender estimation. This benchmark and related open-source codes are available at http://ava.eecs.qmul.ac.uk.

Submitted 3 October, 2021; v1 submitted 30 September, 2020; originally announced September 2020.

arXiv:2008.06069 [pdf, other]

doi 10.1109/TIP.2021.3112290

Semantically Adversarial Learnable Filters

Authors: Ali Shahin Shamsabadi, Changjae Oh, Andrea Cavallaro

Abstract: We present an adversarial framework to craft perturbations that mislead classifiers by accounting for the image content and the semantics of the labels. The proposed framework combines a structure loss and a semantic adversarial loss in a multi-task objective function to train a fully convolutional neural network. The structure loss helps generate perturbations whose type and magnitude are defined by a target image processing filter. The semantic adversarial loss considers groups of (semantic) labels to craft perturbations that prevent the filtered image from being classified with a label in the same group. We validate our framework with three different target filters, namely detail enhancement, log transformation and gamma correction filters; and evaluate the adversarially filtered images against three classifiers, ResNet50, ResNet18 and AlexNet, pre-trained on ImageNet. We show that the proposed framework generates filtered images with a high success rate, robustness, and transferability to unseen classifiers. We also discuss objective and subjective evaluations of the adversarial perturbations.

Submitted 5 April, 2022; v1 submitted 13 August, 2020; originally announced August 2020.

Comments: 13 pages

Journal ref: IEEE Transactions on Image Processing, 2021

arXiv:2008.02397 [pdf, other]

doi 10.1145/3478074

DANA: Dimension-Adaptive Neural Architecture for Multivariate Sensor Data

Authors: Mohammad Malekzadeh, Richard G. Clegg, Andrea Cavallaro, Hamed Haddadi

Abstract: Motion sensors embedded in wearable and mobile devices allow for dynamic selection of sensor streams and sampling rates, enabling several applications, such as power management and data-sharing control. While deep neural networks (DNNs) achieve competitive accuracy in sensor data classification, DNNs generally process incoming data from a fixed set of sensors with a fixed sampling rate, and changes in the dimensions of their inputs cause considerable accuracy loss, unnecessary computations, or failure in operation. We introduce a dimension-adaptive pooling (DAP) layer that makes DNNs flexible and more robust to changes in sensor availability and in sampling rate. DAP operates on convolutional filter maps of variable dimensions and produces an input of fixed dimensions suitable for feedforward and recurrent layers. We also propose a dimension-adaptive training (DAT) procedure for enabling DNNs that use DAP to better generalize over the set of feasible data dimensions at inference time. DAT comprises the random selection of dimensions during the forward passes and optimization with accumulated gradients of several backward passes. Combining DAP and DAT, we show how to transform non-adaptive DNNs into a Dimension-Adaptive Neural Architecture (DANA), while keeping the same number of parameters. Compared to existing approaches, our solution provides better classification accuracy over the range of possible data dimensions at inference time and does not require up-sampling or imputation, thus reducing unnecessary computations. Experiments on seven datasets (four benchmark real-world datasets for human activity recognition and three synthetic datasets) show that DANA prevents significant losses in classification accuracy of the state-of-the-art DNNs and, compared to baselines, it better captures correlated patterns in sensor data under dynamic sensor availability and varying sampling rates.

Submitted 12 August, 2021; v1 submitted 5 August, 2020; originally announced August 2020.

Comments: Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., Vol. 5, No. 3, Article 120. Publication date: September 2021

arXiv:2007.10115 [pdf, other]

doi 10.1109/MSP.2020.2985363

Towards robust sensing for Autonomous Vehicles: An adversarial perspective

Authors: Apostolos Modas, Ricardo Sanchez-Matilla, Pascal Frossard, Andrea Cavallaro

Abstract: Autonomous Vehicles rely on accurate and robust sensor observations for safety-critical decision-making in a variety of conditions. Fundamental building blocks of such systems are sensors and classifiers that process ultrasound, RADAR, GPS, LiDAR and camera signals [Khan2018]. It is of primary importance that the resulting decisions are robust to perturbations, which can take the form of different types of nuisances and data transformations, and can even be adversarial perturbations (APs). Adversarial perturbations are purposefully crafted alterations of the environment or of the sensory measurements, with the objective of attacking and defeating the autonomous systems. A careful evaluation of the vulnerabilities of their sensing system(s) is necessary in order to build and deploy safer systems in the fast-evolving domain of AVs. To this end, we survey the emerging field of sensing in adversarial settings: after reviewing adversarial attacks on sensing modalities for autonomous systems, we discuss countermeasures and present future research directions.

Submitted 14 July, 2020; originally announced July 2020.

Journal ref: IEEE Signal Processing Magazine, Volume 37, Issue 4, Pages 14 - 23, July 2020

arXiv:2007.09766 [pdf, other]

doi 10.1109/TMM.2020.2987694

Exploiting vulnerabilities of deep neural networks for privacy protection

Authors: Ricardo Sanchez-Matilla, Chau Yi Li, Ali Shahin Shamsabadi, Riccardo Mazzon, Andrea Cavallaro

Abstract: Adversarial perturbations can be added to images to protect their content from unwanted inferences. These perturbations may, however, be ineffective against classifiers that were not seen during the generation of the perturbation, or against defenses based on re-quantization, median filtering or JPEG compression. To address these limitations, we present an adversarial attack that is specifically designed to protect visual content against unseen classifiers and known defenses. We craft perturbations using an iterative process that is based on the Fast Gradient Sign Method and that randomly selects a classifier and a defense at each iteration. This randomization prevents an undesirable overfitting to a specific classifier or defense. We validate the proposed attack in both targeted and untargeted settings on the private classes of the Places365-Standard dataset. Using ResNet18, ResNet50, AlexNet and DenseNet161 as classifiers, the performance of the proposed attack exceeds that of eleven state-of-the-art attacks. The implementation is available at https://github.com/smartcameras/RP-FGSM/.

Submitted 19 July, 2020; originally announced July 2020.

Journal ref: IEEE Transactions on Multimedia 2020

arXiv:2007.02808 [pdf, other]

Novel-View Human Action Synthesis

Authors: Mohamed Ilyes Lakhal, Davide Boscaini, Fabio Poiesi, Oswald Lanz, Andrea Cavallaro

Abstract: Novel-View Human Action Synthesis aims to synthesize the movement of a body from a virtual viewpoint, given a video from a real viewpoint. We present a novel 3D reasoning approach to synthesize the target viewpoint. We first estimate the 3D mesh of the target body and transfer the rough textures from the 2D images to the mesh. As this transfer may generate sparse textures on the mesh due to frame resolution or occlusions, we produce a semi-dense textured mesh by propagating the transferred textures both locally, within local geodesic neighborhoods, and globally, across symmetric semantic parts. Next, we introduce a context-based generator to learn how to correct and complete the residual appearance information. This allows the network to independently focus on learning the foreground and background synthesis tasks. We validate the proposed solution on the public NTU RGB+D dataset. The code and resources are available at https://bit.ly/36u3h4K.

Submitted 8 October, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

Comments: Asian Conference on Computer Vision (ACCV) 2020

arXiv:2004.05703 [pdf, other]

doi 10.1145/3386901.3388946

DarkneTZ: Towards Model Privacy at the Edge using Trusted Execution Environments

Authors: Fan Mo, Ali Shahin Shamsabadi, Kleomenis Katevas, Soteris Demetriou, Ilias Leontiadis, Andrea Cavallaro, Hamed Haddadi

Abstract: We present DarkneTZ, a framework that uses an edge device's Trusted Execution Environment (TEE) in conjunction with model partitioning to limit the attack surface against Deep Neural Networks (DNNs). Increasingly, edge devices (smartphones and consumer IoT devices) are equipped with pre-trained DNNs for a variety of applications. This trend comes with privacy risks as models can leak information about their training data through effective membership inference attacks (MIAs). We evaluate the performance of DarkneTZ, including CPU execution time, memory usage, and accurate power consumption, using two small and six large image classification models. Due to the limited memory of the edge device's TEE, we partition model layers into more sensitive layers (to be executed inside the device TEE), and a set of layers to be executed in the untrusted part of the operating system. Our results show that even if a single layer is hidden, we can provide reliable model privacy and defend against state-of-the-art MIAs, with only 3% performance overhead. When fully utilizing the TEE, DarkneTZ provides model protections with up to 10% overhead.

Submitted 12 April, 2020; originally announced April 2020.

Comments: 13 pages, 8 figures, accepted to ACM MobiSys 2020

arXiv:2004.05574 [pdf, other]

PrivEdge: From Local to Distributed Private Training and Prediction

Authors: Ali Shahin Shamsabadi, Adria Gascon, Hamed Haddadi, Andrea Cavallaro

Abstract: Machine Learning as a Service (MLaaS) operators provide model training and prediction on the cloud. MLaaS applications often rely on centralised collection and aggregation of user data, which could lead to significant privacy concerns when dealing with sensitive personal data. To address this problem, we propose PrivEdge, a technique for privacy-preserving MLaaS that safeguards the privacy of users who provide their data for training, as well as users who use the prediction service. With PrivEdge, each user independently uses their private data to locally train a one-class reconstructive adversarial network that succinctly represents their training data. As sending the model parameters to the service provider in the clear would reveal private information, PrivEdge secret-shares the parameters among two non-colluding MLaaS providers, to then provide cryptographically private prediction services through secure multi-party computation techniques. We quantify the benefits of PrivEdge and compare its performance with state-of-the-art centralised architectures on three privacy-sensitive image-based tasks: individual identification, writer identification, and handwritten letter recognition. Experimental results show that PrivEdge has high precision and recall in preserving privacy, as well as in distinguishing between private and non-private images. Moreover, we show the robustness of PrivEdge to image compression and biased training data. The source code is available at https://github.com/smartcameras/PrivEdge.

Submitted 12 April, 2020; originally announced April 2020.

Comments: IEEE Transactions on Information Forensics and Security (TIFS)

arXiv:1911.12354 [pdf, other]

Multi-view shape estimation of transparent containers

Authors: Alessio Xompero, Ricardo Sanchez-Matilla, Apostolos Modas, Pascal Frossard, Andrea Cavallaro

Abstract: The 3D localisation of an object and the estimation of its properties, such as shape and dimensions, are challenging under varying degrees of transparency and lighting conditions. In this paper, we propose a method for jointly localising container-like objects and estimating their dimensions using two wide-baseline, calibrated RGB cameras. Under the assumption of circular symmetry along the vertical axis, we estimate the dimensions of an object with a generative 3D sampling model of sparse circumferences, iterative shape fitting and image re-projection to verify the sampling hypotheses in each camera using semantic segmentation masks. We evaluate the proposed method on a novel dataset of objects with different degrees of transparency and captured under different backgrounds and illumination conditions. Our method, which is based on RGB images only, outperforms a deep-learning based approach that uses depth maps in terms of localisation success and dimension estimation accuracy.

Submitted 9 March, 2020; v1 submitted 27 November, 2019; originally announced November 2019.

Comments: Accepted to International Conference on Acoustics, Speech, and Signal Processing (ICASSP); 5 pages, 7 figures

arXiv:1911.10891 [pdf, other]

ColorFool: Semantic Adversarial Colorization

Authors: Ali Shahin Shamsabadi, Ricardo Sanchez-Matilla, Andrea Cavallaro

Abstract: Adversarial attacks that generate small L_p-norm perturbations to mislead classifiers have limited success in black-box settings and with unseen classifiers. These attacks are also not robust to defenses that use denoising filters and to adversarial training procedures. Instead, adversarial attacks that generate unrestricted perturbations are more robust to defenses, are generally more successful in black-box settings and are more transferable to unseen classifiers. However, unrestricted perturbations may be noticeable to humans. In this paper, we propose a content-based black-box adversarial attack that generates unrestricted perturbations by exploiting image semantics to selectively modify colors within chosen ranges that are perceived as natural by humans. We show that the proposed approach, ColorFool, outperforms five state-of-the-art adversarial attacks in terms of success rate, robustness to defense frameworks and transferability on two different tasks, scene and object classification, when attacking three state-of-the-art deep neural networks using three standard datasets. The source code is available at https://github.com/smartcameras/ColorFool.

Submitted 12 April, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

Comments: Conference on Computer Vision and Pattern Recognition (CVPR 2020)

arXiv:1911.05996 [pdf, other]

Privacy and Utility Preserving Sensor-Data Transformations

Authors: Mohammad Malekzadeh, Richard G. Clegg, Andrea Cavallaro, Hamed Haddadi

Abstract: Sensitive inferences and user re-identification are major threats to privacy when raw sensor data from wearable or portable devices are shared with cloud-assisted applications. To mitigate these threats, we propose mechanisms to transform sensor data before sharing them with applications running on users' devices. These transformations aim at eliminating patterns that can be used for user re-identification or for inferring potentially sensitive activities, while introducing a minor utility loss for the target application (or task). We show that, on gesture and activity recognition tasks, we can prevent inference of potentially sensitive activities while keeping the reduction in recognition accuracy of non-sensitive activities to less than 5 percentage points. We also show that we can reduce the accuracy of user re-identification and of the potential inference of gender to the level of a random guess, while keeping the accuracy of activity recognition comparable to that obtained on the original data.

Submitted 14 November, 2019; originally announced November 2019.

Comments: Accepted to appear in Pervasive and Mobile Computing (PMC) Journal, Elsevier

arXiv:1910.12227 [pdf, other]

EdgeFool: An Adversarial Image Enhancement Filter

Authors: Ali Shahin Shamsabadi, Changjae Oh, Andrea Cavallaro

Abstract: Adversarial examples are intentionally perturbed images that mislead classifiers. These images can, however, be easily detected using denoising algorithms, when high-frequency spatial perturbations are used, or can be noticed by humans, when perturbations are large. In this paper, we propose EdgeFool, an adversarial image enhancement filter that learns structure-aware adversarial perturbations. EdgeFool generates adversarial images with perturbations that enhance image details via training a fully convolutional neural network end-to-end with a multi-task loss function. This loss function accounts for both image detail enhancement and class misleading objectives. We evaluate EdgeFool on three classifiers (ResNet-50, ResNet-18 and AlexNet) using two datasets (ImageNet and Private-Places365) and compare it with six adversarial methods (DeepFool, SparseFool, Carlini-Wagner, SemanticAdv, Non-targeted and Private Fast Gradient Sign Methods). Code is available at https://github.com/smartcameras/EdgeFool.git.

Submitted 5 March, 2020; v1 submitted 27 October, 2019; originally announced October 2019.

Journal ref: Proceedings of the 45th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020

arXiv:1910.06827 [pdf, other]

Learning Generalisable Omni-Scale Representations for Person Re-Identification

Authors: Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, Tao Xiang

Abstract: An effective person re-identification (re-ID) model should learn feature representations that are both discriminative, for distinguishing similar-looking people, and generalisable, for deployment across datasets without any adaptation. In this paper, we develop novel CNN architectures to address both challenges. First, we present a re-ID CNN termed omni-scale network (OSNet) to learn features that not only capture different spatial scales but also encapsulate a synergistic combination of multiple scales, namely omni-scale features. The basic building block consists of multiple convolutional streams, each detecting features at a certain scale. For omni-scale feature learning, a unified aggregation gate is introduced to dynamically fuse multi-scale features with channel-wise weights. OSNet is lightweight as its building blocks comprise factorised convolutions. Second, to improve generalisable feature learning, we introduce instance normalisation (IN) layers into OSNet to cope with cross-dataset discrepancies. Further, to determine the optimal placements of these IN layers in the architecture, we formulate an efficient differentiable architecture search algorithm. Extensive experiments show that, in the conventional same-dataset setting, OSNet achieves state-of-the-art performance, despite being much smaller than existing re-ID models. In the more challenging yet practical cross-dataset setting, OSNet beats most recent unsupervised domain adaptation methods without using any target data. Our code and models are released at https://github.com/KaiyangZhou/deep-person-reid.

Submitted 29 April, 2021; v1 submitted 15 October, 2019; originally announced October 2019.

Comments: TPAMI 2021. Journal extension of arXiv:1905.00953. Updates: added appendix. arXiv admin note: text overlap with arXiv:1905.00953
