Search | arXiv e-print repository

High-resolution open-vocabulary object 6D pose estimation

Authors: Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Changjae Oh, Andrea Cavallaro, Fabio Poiesi

Abstract: The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between… ▽ More The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming by 12.6 in Average Recall the previous best-performing approach. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: Technical report. Extension of CVPR paper "Open-vocabulary object 6D pose estimation". Project page: https://jcorsetti.github.io/oryon

arXiv:2405.07550 [pdf, other]

Wild Berry image dataset collected in Finnish forests and peatlands using drones

Authors: Luigi Riz, Sergio Povoli, Andrea Caraffa, Davide Boscaini, Mohamed Lamine Mekhalfi, Paul Chippendale, Marjut Turtiainen, Birgitta Partanen, Laura Smith Ballester, Francisco Blanes Noguera, Alessio Franchi, Elisa Castelli, Giacomo Piccinini, Luca Marchesotti, Micael Santos Couceiro, Fabio Poiesi

Abstract: Berry picking has long-standing traditions in Finland, yet it is challenging and can potentially be dangerous. The integration of drones equipped with advanced imaging techniques represents a transformative leap forward, optimising harvests and promising sustainable practices. We propose WildBe, the first image dataset of wild berries captured in peatlands and under the canopy of Finnish forests u… ▽ More Berry picking has long-standing traditions in Finland, yet it is challenging and can potentially be dangerous. The integration of drones equipped with advanced imaging techniques represents a transformative leap forward, optimising harvests and promising sustainable practices. We propose WildBe, the first image dataset of wild berries captured in peatlands and under the canopy of Finnish forests using drones. Unlike previous and related datasets, WildBe includes new varieties of berries, such as bilberries, cloudberries, lingonberries, and crowberries, captured under severe light variations and in cluttered environments. WildBe features 3,516 images, including a total of 18,468 annotated bounding boxes. We carry out a comprehensive analysis of WildBe using six popular object detectors, assessing their effectiveness in berry detection across different forest regions and camera types. We will release WildBe publicly. △ Less

Submitted 15 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

arXiv:2312.00947 [pdf, other]

FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models

Authors: Andrea Caraffa, Davide Boscaini, Amir Hamza, Fabio Poiesi

Abstract: Estimating the 6D pose of objects unseen during training is highly desirable yet challenging. Zero-shot object 6D pose estimation methods address this challenge by leveraging additional task-specific supervision provided by large-scale, photo-realistic synthetic datasets. However, their performance heavily depends on the quality and diversity of rendered data and they require extensive training. I… ▽ More Estimating the 6D pose of objects unseen during training is highly desirable yet challenging. Zero-shot object 6D pose estimation methods address this challenge by leveraging additional task-specific supervision provided by large-scale, photo-realistic synthetic datasets. However, their performance heavily depends on the quality and diversity of rendered data and they require extensive training. In this work, we show how to tackle the same task but without training on specific data. We propose FreeZe, a novel solution that harnesses the capabilities of pre-trained geometric and vision foundation models. FreeZe leverages 3D geometric descriptors learned from unrelated 3D point clouds and 2D visual features learned from web-scale 2D images to generate discriminative 3D point-level descriptors. We then estimate the 6D pose of unseen objects by 3D registration based on RANSAC. We also introduce a novel algorithm to solve ambiguous cases due to geometrically symmetric objects that is based on visual features. We comprehensively evaluate FreeZe across the seven core datasets of the BOP Benchmark, which include over a hundred 3D objects and 20,000 images captured in various scenarios. FreeZe consistently outperforms all state-of-the-art approaches, including competitors extensively trained on synthetic 6D pose estimation data. Code will be publicly available at https://andreacaraffa.github.io/freeze. △ Less

Submitted 3 April, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

arXiv:2312.00690 [pdf, other]

Open-vocabulary object 6D pose estimation

Authors: Jaime Corsetti, Davide Boscaini, Changjae Oh, Andrea Cavallaro, Fabio Poiesi

Abstract: We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g., CAD or video sequence) is required at inference, and (iii) the object is imaged from two RGBD viewpoin… ▽ More We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g., CAD or video sequence) is required at inference, and (iii) the object is imaged from two RGBD viewpoints of different scenes. To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from the scenes and to estimate its relative 6D pose. The key of our approach is a carefully devised strategy to fuse object-level information provided by the prompt with local image features, resulting in a feature space that can generalize to novel concepts. We validate our approach on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, which collectively encompass 34 object instances appearing in four thousand image pairs. The results demonstrate that our approach outperforms both a well-established hand-crafted method and a recent deep learning-based baseline in estimating the relative 6D pose of objects in different scenes. Code and dataset are available at https://jcorsetti.github.io/oryon. △ Less

Submitted 25 June, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

Comments: Camera ready version (CVPR 2024, poster highlight). New Oryon version: arXiv:2406.16384

arXiv:2308.15353 [pdf, other]

Detect, Augment, Compose, and Adapt: Four Steps for Unsupervised Domain Adaptation in Object Detection

Authors: Mohamed L. Mekhalfi, Davide Boscaini, Fabio Poiesi

Abstract: Unsupervised domain adaptation (UDA) plays a crucial role in object detection when adapting a source-trained detector to a target domain without annotated data. In this paper, we propose a novel and effective four-step UDA approach that leverages self-supervision and trains source and target data concurrently. We harness self-supervised learning to mitigate the lack of ground truth in the target d… ▽ More Unsupervised domain adaptation (UDA) plays a crucial role in object detection when adapting a source-trained detector to a target domain without annotated data. In this paper, we propose a novel and effective four-step UDA approach that leverages self-supervision and trains source and target data concurrently. We harness self-supervised learning to mitigate the lack of ground truth in the target domain. Our method consists of the following steps: (1) identify the region with the highest-confidence set of detections in each target image, which serve as our pseudo-labels; (2) crop the identified region and generate a collection of its augmented versions; (3) combine these latter into a composite image; (4) adapt the network to the target domain using the composed image. Through extensive experiments under cross-camera, cross-weather, and synthetic-to-real scenarios, our approach achieves state-of-the-art performance, improving upon the nearest competitor by more than 2% in terms of mean Average Precision (mAP). The code is available at https://github.com/MohamedTEV/DACA. △ Less

Submitted 29 August, 2023; originally announced August 2023.

arXiv:2307.15692 [pdf, other]

PatchMixer: Rethinking network design to boost generalization for 3D point cloud understanding

Authors: Davide Boscaini, Fabio Poiesi

Abstract: The recent trend in deep learning methods for 3D point cloud understanding is to propose increasingly sophisticated architectures either to better capture 3D geometries or by introducing possibly undesired inductive biases. Moreover, prior works introducing novel architectures compared their performance on the same domain, devoting less attention to their generalization to other domains. We argue… ▽ More The recent trend in deep learning methods for 3D point cloud understanding is to propose increasingly sophisticated architectures either to better capture 3D geometries or by introducing possibly undesired inductive biases. Moreover, prior works introducing novel architectures compared their performance on the same domain, devoting less attention to their generalization to other domains. We argue that the ability of a model to transfer the learnt knowledge to different domains is an important feature that should be evaluated to exhaustively assess the quality of a deep network architecture. In this work we propose PatchMixer, a simple yet effective architecture that extends the ideas behind the recent MLP-Mixer paper to 3D point clouds. The novelties of our approach are the processing of local patches instead of the whole shape to promote robustness to partial point clouds, and the aggregation of patch-wise features using an MLP as a simpler alternative to the graph convolutions or the attention mechanisms that are used in prior works. We evaluated our method on the shape classification and part segmentation tasks, achieving superior generalization performance compared to a selection of the most relevant deep architectures. △ Less

Submitted 28 July, 2023; originally announced July 2023.

Comments: Published in the Image and Vision Computing journal

arXiv:2307.15514 [pdf, other]

Revisiting Fully Convolutional Geometric Features for Object 6D Pose Estimation

Authors: Jaime Corsetti, Davide Boscaini, Fabio Poiesi

Abstract: Recent works on 6D object pose estimation focus on learning keypoint correspondences between images and object models, and then determine the object pose through RANSAC-based algorithms or by directly regressing the pose with end-to-end optimisations. We argue that learning point-level discriminative features is overlooked in the literature. To this end, we revisit Fully Convolutional Geometric Fe… ▽ More Recent works on 6D object pose estimation focus on learning keypoint correspondences between images and object models, and then determine the object pose through RANSAC-based algorithms or by directly regressing the pose with end-to-end optimisations. We argue that learning point-level discriminative features is overlooked in the literature. To this end, we revisit Fully Convolutional Geometric Features (FCGF) and tailor it for object 6D pose estimation to achieve state-of-the-art performance. FCGF employs sparse convolutions and learns point-level features using a fully-convolutional network by optimising a hardest contrastive loss. We can outperform recent competitors on popular benchmarks by adopting key modifications to the loss and to the input data representations, by carefully tuning the training strategies, and by employing data augmentations suitable for the underlying problem. We carry out a thorough ablation to study the contribution of each modification. The code is available at https://github.com/jcorsetti/FCGF6D. △ Less

Submitted 3 October, 2023; v1 submitted 28 July, 2023; originally announced July 2023.

Comments: Camera ready version, 18 pages and 13 figures. Published at the 8th International Workshop on Recovering 6D Object Pose

arXiv:2304.05417 [pdf, other]

The MONET dataset: Multimodal drone thermal dataset recorded in rural scenarios

Authors: Luigi Riz, Andrea Caraffa, Matteo Bortolon, Mohamed Lamine Mekhalfi, Davide Boscaini, André Moura, José Antunes, André Dias, Hugo Silva, Andreas Leonidou, Christos Constantinides, Christos Keleshis, Dante Abate, Fabio Poiesi

Abstract: We present MONET, a new multimodal dataset captured using a thermal camera mounted on a drone that flew over rural areas, and recorded human and vehicle activities. We captured MONET to study the problem of object localisation and behaviour understanding of targets undergoing large-scale variations and being recorded from different and moving viewpoints. Target activities occur in two different la… ▽ More We present MONET, a new multimodal dataset captured using a thermal camera mounted on a drone that flew over rural areas, and recorded human and vehicle activities. We captured MONET to study the problem of object localisation and behaviour understanding of targets undergoing large-scale variations and being recorded from different and moving viewpoints. Target activities occur in two different land sites, each with unique scene structures and cluttered backgrounds. MONET consists of approximately 53K images featuring 162K manually annotated bounding boxes. Each image is timestamp-aligned with drone metadata that includes information about attitudes, speed, altitude, and GPS coordinates. MONET is different from previous thermal drone datasets because it features multimodal data, including rural scenes captured with thermal cameras containing both person and vehicle targets, along with trajectory information and metadata. We assessed the difficulty of the dataset in terms of transfer learning between the two sites and evaluated nine object detection algorithms to identify the open challenges associated with this type of data. Project page: https://github.com/fabiopoiesi/monet_dataset. △ Less

Submitted 19 July, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

Comments: Published in Computer Vision and Pattern Recognition (CVPR) Workshops 2023 - 6th Multimodal Learning and Applications Workshop

arXiv:2212.03300 [pdf, other]

Supervised Tractogram Filtering using Geometric Deep Learning

Authors: Pietro Astolfi, Ruben Verhagen, Laurent Petit, Emanuele Olivetti, Silvio Sarubbo, Jonathan Masci, Davide Boscaini, Paolo Avesani

Abstract: A tractogram is a virtual representation of the brain white matter. It is composed of millions of virtual fibers, encoded as 3D polylines, which approximate the white matter axonal pathways. To date, tractograms are the most accurate white matter representation and thus are used for tasks like presurgical planning and investigations of neuroplasticity, brain disorders, or brain networks. However,… ▽ More A tractogram is a virtual representation of the brain white matter. It is composed of millions of virtual fibers, encoded as 3D polylines, which approximate the white matter axonal pathways. To date, tractograms are the most accurate white matter representation and thus are used for tasks like presurgical planning and investigations of neuroplasticity, brain disorders, or brain networks. However, it is a well-known issue that a large portion of tractogram fibers is not anatomically plausible and can be considered artifacts of the tracking procedure. With Verifyber, we tackle the problem of filtering out such non-plausible fibers using a novel fully-supervised learning approach. Differently from other approaches based on signal reconstruction and/or brain topology regularization, we guide our method with the existing anatomical knowledge of the white matter. Using tractograms annotated according to anatomical principles, we train our model, Verifyber, to classify fibers as either anatomically plausible or non-plausible. The proposed Verifyber model is an original Geometric Deep Learning method that can deal with variable size fibers, while being invariant to fiber orientation. Our model considers each fiber as a graph of points, and by learning features of the edges between consecutive points via the proposed sequence Edge Convolution, it can capture the underlying anatomical properties. The output filtering results highly accurate and robust across an extensive set of experiments, and fast; with a 12GB GPU, filtering a tractogram of 1M fibers requires less than a minute. Verifyber implementation and trained models are available at https://github.com/FBK-NILab/verifyber. △ Less

Submitted 6 December, 2022; originally announced December 2022.

Comments: Pre-print. Under review at journal

arXiv:2105.10382 [pdf, other]

doi 10.1109/TPAMI.2022.3175371

Learning general and distinctive 3D local deep descriptors for point cloud registration

Authors: Fabio Poiesi, Davide Boscaini

Abstract: An effective 3D descriptor should be invariant to different geometric transformations, such as scale and rotation, robust to occlusions and clutter, and capable of generalising to different application domains. We present a simple yet effective method to learn general and distinctive 3D local descriptors that can be used to register point clouds that are captured in different domains. Point cloud… ▽ More An effective 3D descriptor should be invariant to different geometric transformations, such as scale and rotation, robust to occlusions and clutter, and capable of generalising to different application domains. We present a simple yet effective method to learn general and distinctive 3D local descriptors that can be used to register point clouds that are captured in different domains. Point cloud patches are extracted, canonicalised with respect to their local reference frame, and encoded into scale and rotation-invariant compact descriptors by a deep neural network that is invariant to permutations of the input points. This design is what enables our descriptors to generalise across domains. We evaluate and compare our descriptors with alternative handcrafted and deep learning-based descriptors on several indoor and outdoor datasets that are reconstructed by using both RGBD sensors and laser scanners. Our descriptors outperform most recent descriptors by a large margin in terms of generalisation, and also become the state of the art in benchmarks where training and testing are performed in the same domain. △ Less

Submitted 12 May, 2022; v1 submitted 21 May, 2021; originally announced May 2021.

Comments: Accepted in IEEE Transactions on Pattern Analysis and Machine Intelligence

arXiv:2009.00258 [pdf, other]

Distinctive 3D local deep descriptors

Authors: Fabio Poiesi, Davide Boscaini

Abstract: We present a simple but yet effective method for learning distinctive 3D local deep descriptors (DIPs) that can be used to register point clouds without requiring an initial alignment. Point cloud patches are extracted, canonicalised with respect to their estimated local reference frame and encoded into rotation-invariant compact descriptors by a PointNet-based deep neural network. DIPs can effect… ▽ More We present a simple but yet effective method for learning distinctive 3D local deep descriptors (DIPs) that can be used to register point clouds without requiring an initial alignment. Point cloud patches are extracted, canonicalised with respect to their estimated local reference frame and encoded into rotation-invariant compact descriptors by a PointNet-based deep neural network. DIPs can effectively generalise across different sensor modalities because they are learnt end-to-end from locally and randomly sampled points. Because DIPs encode only local geometric information, they are robust to clutter, occlusions and missing regions. We evaluate and compare DIPs against alternative hand-crafted and deep descriptors on several indoor and outdoor datasets consisting of point clouds reconstructed using different sensors. Results show that DIPs (i) achieve comparable results to the state-of-the-art on RGB-D indoor scenes (3DMatch dataset), (ii) outperform state-of-the-art by a large margin on laser-scanner outdoor scenes (ETH dataset), and (iii) generalise to indoor scenes reconstructed with the Visual-SLAM system of Android ARCore. Source code: https://github.com/fabiopoiesi/dip. △ Less

Submitted 28 December, 2020; v1 submitted 1 September, 2020; originally announced September 2020.

Comments: IEEE International Conference on Pattern Recognition 2020

arXiv:2008.01589 [pdf, other]

Shape Consistent 2D Keypoint Estimation under Domain Shift

Authors: Levi O. Vasconcelos, Massimiliano Mancini, Davide Boscaini, Samuel Rota Bulo, Barbara Caputo, Elisa Ricci

Abstract: Recent unsupervised domain adaptation methods based on deep architectures have shown remarkable performance not only in traditional classification tasks but also in more complex problems involving structured predictions (e.g. semantic segmentation, depth estimation). Following this trend, in this paper we present a novel deep adaptation framework for estimating keypoints under domain shift}, i.e.… ▽ More Recent unsupervised domain adaptation methods based on deep architectures have shown remarkable performance not only in traditional classification tasks but also in more complex problems involving structured predictions (e.g. semantic segmentation, depth estimation). Following this trend, in this paper we present a novel deep adaptation framework for estimating keypoints under domain shift}, i.e. when the training (source) and the test (target) images significantly differ in terms of visual appearance. Our method seamlessly combines three different components: feature alignment, adversarial training and self-supervision. Specifically, our deep architecture leverages from domain-specific distribution alignment layers to perform target adaptation at the feature level. Furthermore, a novel loss is proposed which combines an adversarial term for ensuring aligned predictions in the output space and a geometric consistency term which guarantees coherent predictions between a target sample and its perturbed version. Our extensive experimental evaluation conducted on three publicly available benchmarks shows that our approach outperforms state-of-the-art domain adaptation methods in the 2D keypoint prediction task. △ Less

Submitted 4 August, 2020; originally announced August 2020.

arXiv:2007.02808 [pdf, other]

Novel-View Human Action Synthesis

Authors: Mohamed Ilyes Lakhal, Davide Boscaini, Fabio Poiesi, Oswald Lanz, Andrea Cavallaro

Abstract: Novel-View Human Action Synthesis aims to synthesize the movement of a body from a virtual viewpoint, given a video from a real viewpoint. We present a novel 3D reasoning to synthesize the target viewpoint. We first estimate the 3D mesh of the target body and transfer the rough textures from the 2D images to the mesh. As this transfer may generate sparse textures on the mesh due to frame resolutio… ▽ More Novel-View Human Action Synthesis aims to synthesize the movement of a body from a virtual viewpoint, given a video from a real viewpoint. We present a novel 3D reasoning to synthesize the target viewpoint. We first estimate the 3D mesh of the target body and transfer the rough textures from the 2D images to the mesh. As this transfer may generate sparse textures on the mesh due to frame resolution or occlusions. We produce a semi-dense textured mesh by propagating the transferred textures both locally, within local geodesic neighborhoods, and globally, across symmetric semantic parts. Next, we introduce a context-based generator to learn how to correct and complete the residual appearance information. This allows the network to independently focus on learning the foreground and background synthesis tasks. We validate the proposed solution on the public NTU RGB+D dataset. The code and resources are available at https://bit.ly/36u3h4K. △ Less

Submitted 8 October, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

Comments: Asian Conference on Computer Vision (ACCV) 2020

arXiv:2004.07392 [pdf, other]

Joint Supervised and Self-Supervised Learning for 3D Real-World Challenges

Authors: Antonio Alliegro, Davide Boscaini, Tatiana Tommasi

Abstract: Point cloud processing and 3D shape understanding are very challenging tasks for which deep learning techniques have demonstrated great potentials. Still further progresses are essential to allow artificial intelligent agents to interact with the real world, where the amount of annotated data may be limited and integrating new sources of knowledge becomes crucial to support autonomous learning. He… ▽ More Point cloud processing and 3D shape understanding are very challenging tasks for which deep learning techniques have demonstrated great potentials. Still further progresses are essential to allow artificial intelligent agents to interact with the real world, where the amount of annotated data may be limited and integrating new sources of knowledge becomes crucial to support autonomous learning. Here we consider several possible scenarios involving synthetic and real-world point clouds where supervised learning fails due to data scarcity and large domain gaps. We propose to enrich standard feature representations by leveraging self-supervision through a multi-task model that can solve a 3D puzzle while learning the main task of shape classification or part segmentation. An extensive analysis investigating few-shot, transfer learning and cross-domain settings shows the effectiveness of our approach with state-of-the-art results for 3D shape classification and part segmentation. △ Less

Submitted 15 April, 2020; originally announced April 2020.

arXiv:2003.11013 [pdf, other]

Tractogram filtering of anatomically non-plausible fibers with geometric deep learning

Authors: Pietro Astolfi, Ruben Verhagen, Laurent Petit, Emanuele Olivetti, Jonathan Masci, Davide Boscaini, Paolo Avesani

Abstract: Tractograms are virtual representations of the white matter fibers of the brain. They are of primary interest for tasks like presurgical planning, and investigation of neuroplasticity or brain disorders. Each tractogram is composed of millions of fibers encoded as 3D polylines. Unfortunately, a large portion of those fibers are not anatomically plausible and can be considered artifacts of the trac… ▽ More Tractograms are virtual representations of the white matter fibers of the brain. They are of primary interest for tasks like presurgical planning, and investigation of neuroplasticity or brain disorders. Each tractogram is composed of millions of fibers encoded as 3D polylines. Unfortunately, a large portion of those fibers are not anatomically plausible and can be considered artifacts of the tracking algorithms. Common methods for tractogram filtering are based on signal reconstruction, a principled approach, but unable to consider the knowledge of brain anatomy. In this work, we address the problem of tractogram filtering as a supervised learning problem by exploiting the ground truth annotations obtained with a recent heuristic method, which labels fibers as either anatomically plausible or non-plausible according to well-established anatomical properties. The intuitive idea is to model a fiber as a point cloud and the goal is to investigate whether and how a geometric deep learning model might capture its anatomical properties. Our contribution is an extension of the Dynamic Edge Convolution model that exploits the sequential relations of points in a fiber and discriminates with high accuracy plausible/non-plausible fibers. △ Less

Submitted 9 July, 2020; v1 submitted 24 March, 2020; originally announced March 2020.

Comments: Accepted at MICCAI2020

arXiv:2002.00397 [pdf, other]

doi 10.1007/978-3-030-30642-7_41

3D Shape Segmentation with Geometric Deep Learning

Authors: Davide Boscaini, Fabio Poiesi

Abstract: The semantic segmentation of 3D shapes with a high-density of vertices could be impractical due to large memory requirements. To make this problem computationally tractable, we propose a neural-network based approach that produces 3D augmented views of the 3D shape to solve the whole segmentation as sub-segmentation problems. 3D augmented views are obtained by projecting vertices and normals of a… ▽ More The semantic segmentation of 3D shapes with a high-density of vertices could be impractical due to large memory requirements. To make this problem computationally tractable, we propose a neural-network based approach that produces 3D augmented views of the 3D shape to solve the whole segmentation as sub-segmentation problems. 3D augmented views are obtained by projecting vertices and normals of a 3D shape onto 2D regular grids taken from different viewpoints around the shape. These 3D views are then processed by a Convolutional Neural Network to produce a probability distribution function (pdf) over the set of the semantic classes for each vertex. These pdfs are then re-projected on the original 3D shape and postprocessed using contextual information through Conditional Random Fields. We validate our approach using 3D shapes of publicly available datasets and of real objects that are reconstructed using photogrammetry techniques. We compare our approach against state-of-the-art alternatives. △ Less

Submitted 2 February, 2020; originally announced February 2020.

arXiv:1611.08402 [pdf, other]

Geometric deep learning on graphs and manifolds using mixture model CNNs

Authors: Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, Michael M. Bronstein

Abstract: Deep learning has achieved a remarkable performance breakthrough in several fields, most notably in speech recognition, natural language processing, and computer vision. In particular, convolutional neural network (CNN) architectures currently produce state-of-the-art performance on a variety of image analysis tasks such as object detection and recognition. Most of deep learning research has so fa… ▽ More Deep learning has achieved a remarkable performance breakthrough in several fields, most notably in speech recognition, natural language processing, and computer vision. In particular, convolutional neural network (CNN) architectures currently produce state-of-the-art performance on a variety of image analysis tasks such as object detection and recognition. Most of deep learning research has so far focused on dealing with 1D, 2D, or 3D Euclidean-structured data such as acoustic signals, images, or videos. Recently, there has been an increasing interest in geometric deep learning, attempting to generalize deep learning methods to non-Euclidean structured data such as graphs and manifolds, with a variety of applications from the domains of network analysis, computational social science, or computer graphics. In this paper, we propose a unified framework allowing to generalize CNN architectures to non-Euclidean domains (graphs and manifolds) and learn local, stationary, and compositional task-specific features. We show that various non-Euclidean CNN methods previously proposed in the literature can be considered as particular instances of our framework. We test the proposed method on standard tasks from the realms of image-, graph- and 3D shape analysis and show that it consistently outperforms previous approaches. △ Less

Submitted 6 December, 2016; v1 submitted 25 November, 2016; originally announced November 2016.

arXiv:1605.06437 [pdf, other]

Learning shape correspondence with anisotropic convolutional neural networks

Authors: Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Michael M. Bronstein

Abstract: Establishing correspondence between shapes is a fundamental problem in geometry processing, arising in a wide variety of applications. The problem is especially difficult in the setting of non-isometric deformations, as well as in the presence of topological noise and missing parts, mainly due to the limited capability to model such deformations axiomatically. Several recent works showed that inva… ▽ More Establishing correspondence between shapes is a fundamental problem in geometry processing, arising in a wide variety of applications. The problem is especially difficult in the setting of non-isometric deformations, as well as in the presence of topological noise and missing parts, mainly due to the limited capability to model such deformations axiomatically. Several recent works showed that invariance to complex shape transformations can be learned from examples. In this paper, we introduce an intrinsic convolutional neural network architecture based on anisotropic diffusion kernels, which we term Anisotropic Convolutional Neural Network (ACNN). In our construction, we generalize convolutions to non-Euclidean domains by constructing a set of oriented anisotropic diffusion kernels, creating in this way a local intrinsic polar representation of the data (`patch'), which is then correlated with a filter. Several cascades of such filters, linear, and non-linear operators are stacked to form a deep neural network whose parameters are learned by minimizing a task-specific cost. We use ACNNs to effectively learn intrinsic dense correspondences between deformable shapes in very challenging settings, achieving state-of-the-art results on some of the most difficult recent correspondence benchmarks. △ Less

Submitted 20 May, 2016; originally announced May 2016.

arXiv:1501.06297 [pdf, other]

Geodesic convolutional neural networks on Riemannian manifolds

Authors: Jonathan Masci, Davide Boscaini, Michael M. Bronstein, Pierre Vandergheynst

Abstract: Feature descriptors play a crucial role in a wide range of geometry analysis and processing applications, including shape correspondence, retrieval, and segmentation. In this paper, we introduce Geodesic Convolutional Neural Networks (GCNN), a generalization of the convolutional networks (CNN) paradigm to non-Euclidean manifolds. Our construction is based on a local geodesic system of polar coordi… ▽ More Feature descriptors play a crucial role in a wide range of geometry analysis and processing applications, including shape correspondence, retrieval, and segmentation. In this paper, we introduce Geodesic Convolutional Neural Networks (GCNN), a generalization of the convolutional networks (CNN) paradigm to non-Euclidean manifolds. Our construction is based on a local geodesic system of polar coordinates to extract "patches", which are then passed through a cascade of filters and linear and non-linear operators. The coefficients of the filters and linear combination weights are optimization variables that are learned to minimize a task-specific cost function. We use GCNN to learn invariant shape features, allowing to achieve state-of-the-art performance in problems such as shape description, retrieval, and correspondence. △ Less

Submitted 8 June, 2018; v1 submitted 26 January, 2015; originally announced January 2015.

arXiv:1406.1925 [pdf, other]

Shape-from-intrinsic operator

Authors: Davide Boscaini, Davide Eynard, Michael M. Bronstein

Abstract: Shape-from-X is an important class of problems in the fields of geometry processing, computer graphics, and vision, attempting to recover the structure of a shape from some observations. In this paper, we formulate the problem of shape-from-operator (SfO), recovering an embedding of a mesh from intrinsic differential operators defined on the mesh. Particularly interesting instances of our SfO prob… ▽ More Shape-from-X is an important class of problems in the fields of geometry processing, computer graphics, and vision, attempting to recover the structure of a shape from some observations. In this paper, we formulate the problem of shape-from-operator (SfO), recovering an embedding of a mesh from intrinsic differential operators defined on the mesh. Particularly interesting instances of our SfO problem include synthesis of shape analogies, shape-from-Laplacian reconstruction, and shape exaggeration. Numerically, we approach the SfO problem by splitting it into two optimization sub-problems that are applied in an alternating scheme: metric-from-operator (reconstruction of the discrete metric from the intrinsic operator) and embedding-from-metric (finding a shape embedding that would realize a given metric, a setting of the multidimensional scaling problem). △ Less

Submitted 7 June, 2014; originally announced June 2014.

Showing 1–20 of 20 results for author: Boscaini, D