Search | arXiv e-print repository

Paved2Paradise: Cost-Effective and Scalable LiDAR Simulation by Factoring the Real World

Authors: Michael A. Alcorn, Noah Schwartz

Abstract: To achieve strong real world performance, neural networks must be trained on large, diverse datasets; however, obtaining and annotating such datasets is costly and time-consuming, particularly for 3D point clouds. In this paper, we describe Paved2Paradise, a simple, cost-effective approach for generating fully labeled, diverse, and realistic lidar datasets from scratch, all while requiring minimal… ▽ More To achieve strong real world performance, neural networks must be trained on large, diverse datasets; however, obtaining and annotating such datasets is costly and time-consuming, particularly for 3D point clouds. In this paper, we describe Paved2Paradise, a simple, cost-effective approach for generating fully labeled, diverse, and realistic lidar datasets from scratch, all while requiring minimal human annotation. Our key insight is that, by deliberately collecting separate "background" and "object" datasets (i.e., "factoring the real world"), we can intelligently combine them to produce a combinatorially large and diverse training set. The Paved2Paradise pipeline thus consists of four steps: (1) collecting copious background data, (2) recording individuals from the desired object class(es) performing different behaviors in an isolated environment (like a parking lot), (3) bootstrap** labels for the object dataset, and (4) generating samples by placing objects at arbitrary locations in backgrounds. To demonstrate the utility of Paved2Paradise, we generated synthetic datasets for two tasks: (1) human detection in orchards (a task for which no public data exists) and (2) pedestrian detection in urban environments. Qualitatively, we find that a model trained exclusively on Paved2Paradise synthetic data is highly effective at detecting humans in orchards, including when individuals are heavily occluded by tree branches. Quantitatively, a model trained on Paved2Paradise data that sources backgrounds from KITTI performs comparably to a model trained on the actual dataset. These results suggest the Paved2Paradise synthetic data pipeline can help accelerate point cloud model development in sectors where acquiring lidar datasets has previously been cost-prohibitive. △ Less

Submitted 22 April, 2024; v1 submitted 2 December, 2023; originally announced December 2023.

Comments: Accepted to the Synthetic Data for Computer Vision workshop at CVPR 2024

arXiv:2301.08838 [pdf, other]

AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions

Authors: Michael A. Alcorn

Abstract: Accurately modeling complex, multimodal distributions is necessary for optimal decision-making, but doing so for rotations in three-dimensions, i.e., the SO(3) group, is challenging due to the curvature of the rotation manifold. The recently described implicit-PDF (IPDF) is a simple, elegant, and effective approach for learning arbitrary distributions on SO(3) up to a given precision. However, inf… ▽ More Accurately modeling complex, multimodal distributions is necessary for optimal decision-making, but doing so for rotations in three-dimensions, i.e., the SO(3) group, is challenging due to the curvature of the rotation manifold. The recently described implicit-PDF (IPDF) is a simple, elegant, and effective approach for learning arbitrary distributions on SO(3) up to a given precision. However, inference with IPDF requires $N$ forward passes through the network's final multilayer perceptron (where $N$ places an upper bound on the likelihood that can be calculated by the model), which is prohibitively slow for those without the computational resources necessary to parallelize the queries. In this paper, I introduce AQuaMaM, a neural network capable of both learning complex distributions on the rotation manifold and calculating exact likelihoods for query rotations in a single forward pass. Specifically, AQuaMaM autoregressively models the projected components of unit quaternions as mixtures of uniform distributions that partition their geometrically-restricted domain of values. When trained on an "infinite" toy dataset with ambiguous viewpoints, AQuaMaM rapidly converges to a sampling distribution closely matching the true data distribution. In contrast, the sampling distribution for IPDF dramatically diverges from the true data distribution, despite IPDF approaching its theoretical minimum evaluation loss during training. When trained on a constructed dataset of 500,000 renders of a die in different rotations, AQuaMaM reaches a test log-likelihood 14% higher than IPDF. Further, compared to IPDF, AQuaMaM uses 24% fewer parameters, has a prediction throughput 52$\times$ faster on a single GPU, and converges in a similar amount of time during training. △ Less

Submitted 20 January, 2023; originally announced January 2023.

arXiv:2106.06989 [pdf, other]

The DEformer: An Order-Agnostic Distribution Estimating Transformer

Authors: Michael A. Alcorn, Anh Nguyen

Abstract: Order-agnostic autoregressive distribution (density) estimation (OADE), i.e., autoregressive distribution estimation where the features can occur in an arbitrary order, is a challenging problem in generative machine learning. Prior work on OADE has encoded feature identity by assigning each feature to a distinct fixed position in an input vector. As a result, architectures built for these inputs m… ▽ More Order-agnostic autoregressive distribution (density) estimation (OADE), i.e., autoregressive distribution estimation where the features can occur in an arbitrary order, is a challenging problem in generative machine learning. Prior work on OADE has encoded feature identity by assigning each feature to a distinct fixed position in an input vector. As a result, architectures built for these inputs must strategically mask either the input or model weights to learn the various conditional distributions necessary for inferring the full joint distribution of the dataset in an order-agnostic way. In this paper, we propose an alternative approach for encoding feature identities, where each feature's identity is included alongside its value in the input. This feature identity encoding strategy allows neural architectures designed for sequential data to be applied to the OADE task without modification. As a proof of concept, we show that a Transformer trained on this input (which we refer to as "the DEformer", i.e., the distribution estimating Transformer) can effectively model binarized-MNIST, approaching the performance of fixed-order autoregressive distribution estimating algorithms while still being entirely order-agnostic. Additionally, we find that the DEformer surpasses the performance of recent flow-based architectures when modeling a tabular dataset. △ Less

Submitted 11 July, 2021; v1 submitted 13 June, 2021; originally announced June 2021.

Comments: Poster at the Third Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models (ICML 2021)

arXiv:2104.11980 [pdf, other]

baller2vec++: A Look-Ahead Multi-Entity Transformer For Modeling Coordinated Agents

Authors: Michael A. Alcorn, Anh Nguyen

Abstract: In many multi-agent spatiotemporal systems, agents operate under the influence of shared, unobserved variables (e.g., the play a team is executing in a game of basketball). As a result, the trajectories of the agents are often statistically dependent at any given time step; however, almost universally, multi-agent models implicitly assume the agents' trajectories are statistically independent at e… ▽ More In many multi-agent spatiotemporal systems, agents operate under the influence of shared, unobserved variables (e.g., the play a team is executing in a game of basketball). As a result, the trajectories of the agents are often statistically dependent at any given time step; however, almost universally, multi-agent models implicitly assume the agents' trajectories are statistically independent at each time step. In this paper, we introduce baller2vec++, a multi-entity Transformer that can effectively model coordinated agents. Specifically, baller2vec++ applies a specially designed self-attention mask to a mixture of location and "look-ahead" trajectory sequences to learn the distributions of statistically dependent agent trajectories. We show that, unlike baller2vec (baller2vec++'s predecessor), baller2vec++ can learn to emulate the behavior of perfectly coordinated agents in a simulated toy dataset. Additionally, when modeling the trajectories of professional basketball players, baller2vec++ outperforms baller2vec by a wide margin. △ Less

Submitted 28 September, 2021; v1 submitted 24 April, 2021; originally announced April 2021.

arXiv:2102.03291 [pdf, other]

baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling

Authors: Michael A. Alcorn, Anh Nguyen

Abstract: Multi-agent spatiotemporal modeling is a challenging task from both an algorithmic design and computational complexity perspective. Recent work has explored the efficacy of traditional deep sequential models in this domain, but these architectures are slow and cumbersome to train, particularly as model size increases. Further, prior attempts to model interactions between agents across time have li… ▽ More Multi-agent spatiotemporal modeling is a challenging task from both an algorithmic design and computational complexity perspective. Recent work has explored the efficacy of traditional deep sequential models in this domain, but these architectures are slow and cumbersome to train, particularly as model size increases. Further, prior attempts to model interactions between agents across time have limitations, such as imposing an order on the agents, or making assumptions about their relationships. In this paper, we introduce baller2vec, a multi-entity generalization of the standard Transformer that can, with minimal assumptions, simultaneously and efficiently integrate information across entities and time. We test the effectiveness of baller2vec for multi-agent spatiotemporal modeling by training it to perform two different basketball-related tasks: (1) simultaneously modeling the trajectories of all players on the court and (2) modeling the trajectory of the ball. Not only does baller2vec learn to perform these tasks well (outperforming a graph recurrent neural network with a similar number of parameters by a wide margin), it also appears to "understand" the game of basketball, encoding idiosyncratic qualities of players in its embeddings, and performing basketball-relevant functions with its attention heads. △ Less

Submitted 28 September, 2021; v1 submitted 5 February, 2021; originally announced February 2021.

arXiv:1910.04760 [pdf, other]

A cost-effective method for improving and re-purposing large, pre-trained GANs by fine-tuning their class-embeddings

Authors: Qi Li, Long Mai, Michael A. Alcorn, Anh Nguyen

Abstract: Large, pre-trained generative models have been increasingly popular and useful to both the research and wider communities. Specifically, BigGANs a class-conditional Generative Adversarial Networks trained on ImageNet---achieved excellent, state-of-the-art capability in generating realistic photos. However, fine-tuning or training BigGANs from scratch is practically impossible for most researchers… ▽ More Large, pre-trained generative models have been increasingly popular and useful to both the research and wider communities. Specifically, BigGANs a class-conditional Generative Adversarial Networks trained on ImageNet---achieved excellent, state-of-the-art capability in generating realistic photos. However, fine-tuning or training BigGANs from scratch is practically impossible for most researchers and engineers because (1) GAN training is often unstable and suffering from mode-collapse; and (2) the training requires a significant amount of computation, 256 Google TPUs for 2 days or 8xV100 GPUs for 15 days. Importantly, many pre-trained generative models both in NLP and image domains were found to contain biases that are harmful to society. Thus, we need computationally-feasible methods for modifying and re-purposing these huge, pre-trained models for downstream tasks. In this paper, we propose a cost-effective optimization method for improving and re-purposing BigGANs by fine-tuning only the class-embedding layer. We show the effectiveness of our model-editing approach in three tasks: (1) significantly improving the realism and diversity of samples of complete mode-collapse classes; (2) re-purposing ImageNet BigGANs for generating images for Places365; and (3) de-biasing or improving the sample diversity for selected ImageNet classes. △ Less

Submitted 8 October, 2020; v1 submitted 10 October, 2019; originally announced October 2019.

Comments: Preprint

arXiv:1811.11553 [pdf, other]

Strike (with) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects

Authors: Michael A. Alcorn, Qi Li, Zhitao Gong, Chengfei Wang, Long Mai, Wei-Shinn Ku, Anh Nguyen

Abstract: Despite excellent performance on stationary test sets, deep neural networks (DNNs) can fail to generalize to out-of-distribution (OoD) inputs, including natural, non-adversarial ones, which are common in real-world settings. In this paper, we present a framework for discovering DNN failures that harnesses 3D renderers and 3D models. That is, we estimate the parameters of a 3D renderer that cause a… ▽ More Despite excellent performance on stationary test sets, deep neural networks (DNNs) can fail to generalize to out-of-distribution (OoD) inputs, including natural, non-adversarial ones, which are common in real-world settings. In this paper, we present a framework for discovering DNN failures that harnesses 3D renderers and 3D models. That is, we estimate the parameters of a 3D renderer that cause a target DNN to misbehave in response to the rendered image. Using our framework and a self-assembled dataset of 3D objects, we investigate the vulnerability of DNNs to OoD poses of well-known objects in ImageNet. For objects that are readily recognized by DNNs in their canonical poses, DNNs incorrectly classify 97% of their pose space. In addition, DNNs are highly sensitive to slight pose perturbations. Importantly, adversarial poses transfer across models and datasets. We find that 99.9% and 99.4% of the poses misclassified by Inception-v3 also transfer to the AlexNet and ResNet-50 image classifiers trained on the same ImageNet dataset, respectively, and 75.5% transfer to the YOLOv3 object detector trained on MS COCO. △ Less

Submitted 18 April, 2019; v1 submitted 28 November, 2018; originally announced November 2018.

Comments: Poster at the 2019 Conference on Computer Vision and Pattern Recognition

Showing 1–7 of 7 results for author: Alcorn, M A