Text-to-Drive: Diverse Driving Behavior Synthesis via Large Language Models

Authors: Phat Nguyen, Tsun-Hsuan Wang, Zhang-Wei Hong, Sertac Karaman, Daniela Rus

Abstract: Generating varied scenarios through simulation is crucial for training and evaluating safety-critical systems, such as autonomous vehicles. Yet, the task of modeling the trajectories of other vehicles to simulate diverse and meaningful close interactions remains prohibitively costly. Adopting language descriptions to generate driving behaviors emerges as a promising strategy, offering a scalable and intuitive method for human operators to simulate a wide range of driving interactions. However, the scarcity of large-scale annotated language-trajectory data makes this approach challenging. To address this gap, we propose Text-to-Drive (T2D) to synthesize diverse driving behaviors via Large Language Models (LLMs). We introduce a knowledge-driven approach that operates in two stages. In the first stage, we employ the embedded knowledge of LLMs to generate diverse language descriptions of driving behaviors for a scene. Then, we leverage LLM's reasoning capabilities to synthesize these behaviors in simulation. At its core, T2D employs an LLM to construct a state chart that maps low-level states to high-level abstractions. This strategy aids in downstream tasks such as summarizing low-level observations, assessing policy alignment with behavior description, and shaping the auxiliary reward, all without needing human supervision. With our knowledge-driven approach, we demonstrate that T2D generates more diverse trajectories compared to other baselines and offers a natural language interface that allows for interactive incorporation of human preference.
Please check our website for more examples: https://text-to-drive.github.io/

Submitted 6 June, 2024; originally announced June 2024.

Comments: 14 pages, 7 figures

arXiv:2405.05956 [pdf, other]

Probing Multimodal LLMs as World Models for Driving

Authors: Shiva Sreeram, Tsun-Hsuan Wang, Alaa Maalouf, Guy Rosman, Sertac Karaman, Daniela Rus

Abstract: We provide a sober look at the application of Multimodal Large Language Models (MLLMs) within the domain of autonomous driving and challenge/verify some common assumptions, focusing on their ability to reason and interpret dynamic driving scenarios through sequences of images/frames in a closed-loop control environment. Despite the significant advancements in MLLMs like GPT-4V, their performance in complex, dynamic driving environments remains largely untested and presents a wide area of exploration. We conduct a comprehensive experimental study to evaluate the capability of various MLLMs as world models for driving from the perspective of a fixed in-car camera. Our findings reveal that, while these models proficiently interpret individual images, they struggle significantly with synthesizing coherent narratives or logical sequences across frames depicting dynamic behavior. The experiments demonstrate considerable inaccuracies in predicting (i) basic vehicle dynamics (forward/backward, acceleration/deceleration, turning right or left), (ii) interactions with other road actors (e.g., identifying speeding cars or heavy traffic), (iii) trajectory planning, and (iv) open-set dynamic scene reasoning, suggesting biases in the models' training data. To enable this experimental study we introduce a specialized simulator, DriveSim, designed to generate diverse driving scenarios, providing a platform for evaluating MLLMs in the realms of driving.
Additionally, we contribute the full open-source code and a new dataset, "Eval-LLM-Drive", for evaluating MLLMs in driving. Our results highlight a critical gap in the current capabilities of state-of-the-art MLLMs, underscoring the need for enhanced foundation models to improve their applicability in real-world dynamic environments.

Submitted 9 May, 2024; originally announced May 2024.

Comments: https://github.com/sreeramsa/DriveSim https://www.youtube.com/watch?v=Fs8jgngOJzU

arXiv:2404.01400 [pdf, other]

NVINS: Robust Visual Inertial Navigation Fused with NeRF-augmented Camera Pose Regressor and Uncertainty Quantification

Authors: Juyeop Han, Lukas Lao Beyer, Guilherme V. Cavalheiro, Sertac Karaman

Abstract: In recent years, Neural Radiance Fields (NeRF) have emerged as a powerful tool for 3D reconstruction and novel view synthesis. However, the computational cost of NeRF rendering and degradation in quality due to the presence of artifacts pose significant challenges for its application in real-time and robust robotic tasks, especially on embedded systems. This paper introduces a novel framework that integrates NeRF-derived localization information with Visual-Inertial Odometry (VIO) to provide a robust solution for robotic navigation in real time. By training an absolute pose regression network with augmented image data rendered from a NeRF and quantifying its uncertainty, our approach effectively counters positional drift and enhances system reliability. We also establish a mathematically sound foundation for combining visual inertial navigation with camera localization neural networks, considering uncertainty under a Bayesian framework. Experimental validation in the photorealistic simulation environment demonstrates significant improvements in accuracy compared to a conventional VIO approach.

Submitted 1 April, 2024; originally announced April 2024.

Comments: 8 pages, 5 figures, 2 tables

arXiv:2403.08152 [pdf, other]

Multi-Fidelity Reinforcement Learning for Time-Optimal Quadrotor Re-planning

Authors: Gilhyun Ryou, Geoffrey Wang, Sertac Karaman

Abstract: High-speed online trajectory planning for UAVs poses a significant challenge due to the need for precise modeling of complex dynamics while also being constrained by computational limitations. This paper presents a multi-fidelity reinforcement learning method (MFRL) that aims to effectively create a realistic dynamics model and simultaneously train a planning policy that can be readily deployed in real-time applications. The proposed method involves the co-training of a planning policy and a reward estimator; the latter predicts the performance of the policy's output and is trained efficiently through multi-fidelity Bayesian optimization. This optimization approach models the correlation between different fidelity levels, thereby constructing a high-fidelity model based on a low-fidelity foundation, which enables the accurate development of the reward model with limited high-fidelity experiments. The framework is further extended to include real-world flight experiments in reinforcement learning training, allowing the reward model to precisely reflect real-world constraints and broadening the policy's applicability to real-world scenarios. We present rigorous evaluations by training and testing the planning policy in both simulated and real-world environments. The resulting trained policy not only generates faster and more reliable trajectories compared to the baseline snap minimization method, but it also achieves trajectory updates in 2 ms on average, while the baseline method takes several minutes.

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2310.17642 [pdf, other]

Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models

Authors: Tsun-Hsuan Wang, Alaa Maalouf, Wei Xiao, Yutong Ban, Alexander Amini, Guy Rosman, Sertac Karaman, Daniela Rus

Abstract: As autonomous driving technology matures, end-to-end methodologies have emerged as a leading strategy, promising seamless integration from perception to control via deep learning. However, existing systems grapple with challenges such as unexpected open set environments and the complexity of black-box models. At the same time, the evolution of deep learning introduces larger, multimodal foundational models, offering multi-modal visual and textual understanding. In this paper, we harness these multimodal foundation models to enhance the robustness and adaptability of autonomous driving systems, enabling out-of-distribution, end-to-end, multimodal, and more explainable autonomy. Specifically, we present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text. To do so, we introduce a method to extract nuanced spatial (pixel/patch-aligned) features from transformers to enable the encapsulation of both spatial and semantic features. Our approach (i) demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations, and (ii) allows the incorporation of latent space simulation (via text) for improved training (data augmentation via text) and policy debugging. We encourage the reader to check our explainer video at https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be and to view the code and demos on our project webpage at https://drive-anywhere.github.io/.

Submitted 26 October, 2023; originally announced October 2023.

Comments: Project webpage: https://drive-anywhere.github.io Explainer video: https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be

arXiv:2306.03740 [pdf, other]

doi 10.1109/TRO.2023.3348305

GMMap: Memory-Efficient Continuous Occupancy Map Using Gaussian Mixture Model

Authors: Peter Zhi Xuan Li, Sertac Karaman, Vivienne Sze

Abstract: Energy consumption of memory accesses dominates the compute energy in energy-constrained robots which require a compact 3D map of the environment to achieve autonomy. Recent mapping frameworks only focused on reducing the map size while incurring significant memory usage during map construction due to multi-pass processing of each depth image. In this work, we present a memory-efficient continuous occupancy map, named GMMap, that accurately models the 3D environment using a Gaussian Mixture Model (GMM). Memory-efficient GMMap construction is enabled by the single-pass compression of depth images into local GMMs which are directly fused together into a globally-consistent map. By extending Gaussian Mixture Regression to model unexplored regions, occupancy probability is directly computed from Gaussians. Using a low-power ARM Cortex A57 CPU, GMMap can be constructed in real-time at up to 60 images per second. Compared with prior works, GMMap maintains high accuracy while reducing the map size by at least 56%, memory overhead by at least 88%, DRAM access by at least 78%, and energy consumption by at least 69%. Thus, GMMap enables real-time 3D mapping on energy-constrained robots.

Submitted 19 January, 2024; v1 submitted 6 June, 2023; originally announced June 2023.

Comments: 17 pages, 12 figures, 3 tables

Journal ref: IEEE Transactions on Robotics 40 (2024) 1339-1355

arXiv:2305.16502 [pdf, other]

Learning When to Ask for Help: Efficient Interactive Navigation via Implicit Uncertainty Estimation

Authors: Ifueko Igbinedion, Sertac Karaman

Abstract: Robots operating alongside humans often encounter unfamiliar environments that make autonomous task completion challenging. Though improving models and increasing dataset size can enhance a robot's performance in unseen environments, data collection and model refinement may be impractical in every environment. Approaches that utilize human demonstrations through manual operation can aid in refinement and generalization, but often require significant data collection efforts to generate enough demonstration data to achieve satisfactory task performance. Interactive approaches allow for humans to provide correction to robot action in real time, but intervention policies are often based on explicit factors related to state and task understanding that may be difficult to generalize. Addressing these challenges, we train a lightweight interaction policy that allows robots to decide when to proceed autonomously or request expert assistance at estimated times of uncertainty. An implicit estimate of uncertainty is learned via evaluating the feature extraction capabilities of the robot's visual navigation policy. By incorporating part-time human interaction, robots recover quickly from their mistakes, significantly improving the odds of task completion. Incorporating part-time interaction yields an increase in success of 0.38 with only a 0.3 expert interaction rate within the Habitat simulation environment using a simulated human expert.
We further show success transferring this approach to a new domain with a real human expert, improving success from less than 0.1 with an autonomous agent to 0.92 with a 0.23 human interaction rate. This approach provides a practical means for robots to interact and learn from humans in real-world settings.

Submitted 7 June, 2024; v1 submitted 25 May, 2023; originally announced May 2023.

ACM Class: I.2.9

Journal ref: 2024 IEEE International Conference on Robotics and Automation (ICRA) 2024

arXiv:2305.14797 [pdf, other]

Multi-Abstractive Neural Controller: An Efficient Hierarchical Control Architecture for Interactive Driving

Authors: Xiao Li, Igor Gilitschenski, Guy Rosman, Sertac Karaman, Daniela Rus

Abstract: As learning-based methods make their way from perception systems to planning/control stacks, robot control systems have started to enjoy the benefits that data-driven methods provide. Because control systems directly affect the motion of the robot, data-driven methods, especially black box approaches, need to be used with caution considering aspects such as stability and interpretability. In this paper, we describe a differentiable and hierarchical control architecture. The proposed representation, called multi-abstractive neural controller, uses the input image to control the transitions within a novel discrete behavior planner (referred to as the visual automaton generative network, or vAGN). The output of a vAGN controls the parameters of a set of dynamic movement primitives which provides the system controls. We train this neural controller with real-world driving data via behavior cloning and show improved explainability, sample efficiency, and similarity to human driving.

Submitted 24 May, 2023; originally announced May 2023.

arXiv:2304.11693 [pdf, other]

doi 10.1109/IV55152.2023.10186563

Studying the Impact of Semi-Cooperative Drivers on Overall Highway Flow

Authors: Noam Buckman, Sertac Karaman, Daniela Rus

Abstract: Semi-cooperative behaviors are intrinsic properties of human drivers and should be considered for autonomous driving. In addition, new autonomous planners can consider the social value orientation (SVO) of human drivers to generate socially-compliant trajectories. Yet the overall impact on traffic flow for this new class of planners remains to be understood. In this work, we present a study of implicit semi-cooperative driving where agents deploy a game-theoretic version of iterative best response assuming knowledge of the SVOs of other agents. We simulate nominal traffic flow and investigate whether the proportion of prosocial agents on the road impacts individual or system-wide driving performance. Experiments show that the proportion of prosocial agents has a minor impact on overall traffic flow and that benefits of semi-cooperation disproportionally affect egoistic and high-speed drivers.

Submitted 23 April, 2023; originally announced April 2023.

Comments: 8 pages. Accepted at IEEE Intelligent Vehicle (IV) Symposium 2023

arXiv:2303.12224 [pdf, other]

doi 10.1109/ICRA48891.2023.10161536

Infrastructure-based End-to-End Learning and Prevention of Driver Failure

Authors: Noam Buckman, Shiva Sreeram, Mathias Lechner, Yutong Ban, Ramin Hasani, Sertac Karaman, Daniela Rus

Abstract: Intelligent intersection managers can improve safety by detecting dangerous drivers or failure modes in autonomous vehicles, warning oncoming vehicles as they approach an intersection. In this work, we present FailureNet, a recurrent neural network trained end-to-end on trajectories of both nominal and reckless drivers in a scaled miniature city. FailureNet observes the poses of vehicles as they approach an intersection and detects whether a failure is present in the autonomy stack, warning cross-traffic of potentially dangerous drivers. FailureNet can accurately identify control failures, upstream perception errors, and speeding drivers, distinguishing them from nominal driving. The network is trained and deployed with autonomous vehicles in the MiniCity. Compared to speed or frequency-based predictors, FailureNet's recurrent neural network structure provides improved predictive power, yielding upwards of 84% accuracy when deployed on hardware.

Submitted 21 March, 2023; originally announced March 2023.

Comments: 8 pages. Accepted to ICRA 2023

arXiv:2302.00243 [pdf, ps, other]

Agility and Target Distribution in the Dynamic Stochastic Traveling Salesman Problem

Authors: Aviv Adler, Oren Gal, Sertac Karaman

Abstract: An important variant of the classic Traveling Salesman Problem (TSP) is the Dynamic TSP, in which a system with dynamic constraints is tasked with visiting a set of n target locations (in any order) in the shortest amount of time. Such tasks arise naturally in many robotic motion planning problems, particularly in exploration, surveillance and reconnaissance, and classical TSP algorithms on graphs are typically inapplicable in this setting. An important question about such problems is: if the target points are random, what is the length of the tour (either in expectation or as a concentration bound) as n grows? This problem is the Dynamic Stochastic TSP (DSTSP), and has been studied both for specific important vehicle models and for general dynamic systems; however, in general only the order of growth is known. In this work, we explore the connection between the distribution from which the targets are drawn and the dynamics of the system, yielding a more precise lower bound on tour length as well as a matching upper bound for the case of symmetric (or driftless) systems. We then extend the symmetric dynamics results to the case when the points are selected by a (non-random) adversary whose goal is to maximize the length, thus showing worst-case bounds on the tour length.

Submitted 31 January, 2023; originally announced February 2023.

Comments: 106 pages

MSC Class: 60D05 (Primary)

arXiv:2212.07013 [pdf, other]

Learning and Predicting Multimodal Vehicle Action Distributions in a Unified Probabilistic Model Without Labels

Authors: Charles Richter, Patrick R. Barragán, Sertac Karaman

Abstract: We present a unified probabilistic model that learns a representative set of discrete vehicle actions and predicts the probability of each action given a particular scenario. Our model also enables us to estimate the distribution over continuous trajectories conditioned on a scenario, representing what each discrete action would look like if executed in that scenario. While our primary objective is to learn representative action sets, these capabilities combine to produce accurate multimodal trajectory predictions as a byproduct. Although our learned action representations closely resemble semantically meaningful categories (e.g., "go straight", "turn left", etc.), our method is entirely self-supervised and does not utilize any manually generated labels or categories. Our method builds upon recent advances in variational inference and deep unsupervised clustering, resulting in full distribution estimates based on deterministic model evaluations.

Submitted 13 December, 2022; originally announced December 2022.

Comments: Presented at the Fresh Perspectives on the Future of Autonomous Driving workshop, ICRA 2022

arXiv:2212.03298 [pdf, other]

WiSwarm: Age-of-Information-based Wireless Networking for Collaborative Teams of UAVs

Authors: Vishrant Tripathi, Igor Kadota, Ezra Tal, Muhammad Shahir Rahman, Alexander Warren, Sertac Karaman, Eytan Modiano

Abstract: The Age-of-Information (AoI) metric has been widely studied in the theoretical communication networks and queuing systems literature. However, experimental evaluation of its applicability to complex real-world time-sensitive systems is largely lacking. In this work, we develop, implement, and evaluate an AoI-based application layer middleware that enables the customization of WiFi networks to the needs of time-sensitive applications. By controlling the storage and flow of information in the underlying WiFi network, our middleware can: (i) prevent packet collisions; (ii) discard stale packets that are no longer useful; and (iii) dynamically prioritize the transmission of the most relevant information. To demonstrate the benefits of our middleware, we implement a mobility tracking application using a swarm of UAVs communicating with a central controller via WiFi. Our experimental results show that, when compared to WiFi-UDP/WiFi-TCP, the middleware can improve information freshness by a factor of 109x/48x and tracking accuracy by a factor of 4x/6x, respectively. Most importantly, our results also show that the performance gains of our approach increase as the system scales and/or the traffic load increases.

Submitted 6 December, 2022; originally announced December 2022.

Comments: To be presented at IEEE INFOCOM 2023

arXiv:2210.03623 [pdf, other]

doi 10.1109/IROS51168.2021.9636603

Efficient Computation of Map-scale Continuous Mutual Information on Chip in Real Time

Authors: Keshav Gupta, Peter Zhi Xuan Li, Sertac Karaman, Vivienne Sze

Abstract: Exploration tasks are essential to many emerging robotics applications, ranging from search and rescue to space exploration. The planning problem for exploration requires determining the best locations for future measurements that will enhance the fidelity of the map, for example, by reducing its total entropy. A widely-studied technique involves computing the Mutual Information (MI) between the current map and future measurements, and utilizing this MI metric to decide the locations for future measurements. However, computing MI for reasonably-sized maps is slow and power hungry, which has been a bottleneck towards fast and efficient robotic exploration. In this paper, we introduce a new hardware accelerator architecture for MI computation that features a low-latency, energy-efficient MI compute core and an optimized memory subsystem that provides sufficient bandwidth to keep the cores fully utilized. The core employs interleaving to counter the recursive algorithm, and workload balancing and numerical approximations to reduce latency and energy consumption. We demonstrate this optimized architecture with a Field-Programmable Gate Array (FPGA) implementation, which can compute MI for all cells in an entire 201-by-201 occupancy grid (e.g., representing a 20.1m-by-20.1m map at 0.1m resolution) in 1.55 ms while consuming 1.7 mJ of energy, thus finally rendering MI computation for the whole map real time and at a fraction of the energy cost of traditional compute platforms.
For comparison, this particular FPGA implementation running on the Xilinx Zynq-7000 platform is two orders of magnitude faster and consumes three orders of magnitude less energy per MI map compute, when compared to a baseline GPU implementation running on an NVIDIA GeForce GTX 980 platform. The improvements are more pronounced when compared to CPU implementations of equivalent algorithms.

Submitted 7 October, 2022; originally announced October 2022.

arXiv:2207.13218 [pdf, other]

Global Incremental Flight Control for Agile Maneuvering of a Tailsitter Flying Wing

Authors: Ezra Tal, Sertac Karaman

Abstract: This paper proposes a novel control law for accurate tracking of agile trajectories using a tailsitter flying wing unmanned aerial vehicle (UAV) that transitions between vertical take-off and landing (VTOL) and forward flight. The global control formulation enables maneuvering throughout the flight envelope, including uncoordinated flight with sideslip. Differential flatness of the nonlinear tailsitter dynamics with a simplified aerodynamics model is shown. Using the flatness transform, the proposed controller incorporates tracking of the position reference along with its derivatives velocity, acceleration and jerk, as well as the yaw reference and yaw rate. The inclusion of jerk and yaw rate references through an angular velocity feedforward term improves tracking of trajectories with fast-changing accelerations. The controller does not depend on extensive aerodynamic modeling but instead uses incremental nonlinear dynamic inversion (INDI) to compute control updates based on only a local input-output relation, resulting in robustness against discrepancies in the simplified aerodynamics equations. Exact inversion of the nonlinear input-output relation is achieved through the derived flatness transform. The resulting control algorithm is extensively evaluated in flight tests, where it demonstrates accurate trajectory tracking and challenging agile maneuvers, such as sideways flight and aggressive transitions while turning.

Submitted 26 July, 2022; originally announced July 2022.

Comments: 24 pages, 11 figures, videos of the experiments at https://aera.mit.edu/projects/TailsitterAerobatics

arXiv:2207.03524 [pdf, other]

Aerobatic Trajectory Generation for a VTOL Fixed-Wing Aircraft Using Differential Flatness

Authors: Ezra Tal, Gilhyun Ryou, Sertac Karaman

Abstract: This paper proposes a novel algorithm for aerobatic trajectory generation for a vertical take-off and landing (VTOL) tailsitter flying wing aircraft. The algorithm differs from existing approaches for fixed-wing trajectory generation, as it considers a realistic six-degree-of-freedom (6DOF) flight dynamics model, including aerodynamics equations. Using a global dynamics model enables the generation of aerobatics trajectories that exploit the entire flight envelope, enabling agile maneuvering through the stall regime, sideways uncoordinated flight, inverted flight etc. The method uses the differential flatness property of the global tailsitter flying wing dynamics, which is derived in this work. By performing snap minimization in the differentially flat output space, a computationally efficient algorithm, suitable for online motion planning, is obtained. The algorithm is demonstrated in extensive flight experiments encompassing six aerobatics maneuvers, a time-optimal drone racing trajectory, and an airshow-like aerobatic sequence for three tailsitter aircraft.

Submitted 7 July, 2022; originally announced July 2022.

Comments: 14 pages, 17 figures, video of experiments available at https://aera.mit.edu/projects/TailsitterAerobatics

arXiv:2206.00726 [pdf, other]

Cooperative Multi-Agent Trajectory Generation with Modular Bayesian Optimization

Authors: Gilhyun Ryou, Ezra Tal, Sertac Karaman

Abstract: We present a modular Bayesian optimization framework that efficiently generates time-optimal trajectories for a cooperative multi-agent system, such as a team of UAVs. Existing methods for multi-agent trajectory generation often rely on overly conservative constraints to reduce the complexity of this high-dimensional planning problem, leading to suboptimal solutions. We propose a novel modular structure for the Bayesian optimization model that consists of multiple Gaussian process surrogate models that represent the dynamic feasibility and collision avoidance constraints. This modular structure alleviates the stark increase in computational cost with problem dimensionality and enables the use of minimal constraints in the joint optimization of the multi-agent trajectories. The efficiency of the algorithm is further improved by introducing a scheme for simultaneous evaluation of the Bayesian optimization acquisition function and random sampling. The modular BayesOpt algorithm was applied to optimize multi-agent trajectories through six unique environments using multi-fidelity evaluations from various data sources. It was found that the resulting trajectories are faster than those obtained from two baseline methods. The optimized trajectories were validated in real-world experiments using four quadcopters that fly within centimeters of each other at speeds up to 7.4 m/s.

Submitted 1 June, 2022; originally announced June 2022.

Comments: Accepted to appear at Robotics: Science and Systems 2022. Video at https://youtu.be/rxQiNeXvLTc

arXiv:2205.09117 [pdf, other]

Neighborhood Mixup Experience Replay: Local Convex Interpolation for Improved Sample Efficiency in Continuous Control Tasks

Authors: Ryan Sander, Wilko Schwarting, Tim Seyde, Igor Gilitschenski, Sertac Karaman, Daniela Rus

Abstract: Experience replay plays a crucial role in improving the sample efficiency of deep reinforcement learning agents. Recent advances in experience replay propose using Mixup (Zhang et al., 2018) to further improve sample efficiency via synthetic sample generation. We build upon this technique with Neighborhood Mixup Experience Replay (NMER), a geometrically-grounded replay buffer that interpolates transitions with their closest neighbors in state-action space. NMER preserves a locally linear approximation of the transition manifold by only applying Mixup between transitions with vicinal state-action features. Under NMER, a given transition's set of state action neighbors is dynamic and episode agnostic, in turn encouraging greater policy generalizability via inter-episode interpolation. We combine our approach with recent off-policy deep reinforcement learning algorithms and evaluate on continuous control environments. We observe that NMER improves sample efficiency by an average 94% (TD3) and 29% (SAC) over baseline replay buffers, enabling agents to effectively recombine previous experiences and learn from limited data.

Submitted 17 May, 2022; originally announced May 2022.

Comments: Accepted to L4DC 2022

arXiv:2202.10433 [pdf, other]

The Role of Heterogeneity in Autonomous Perimeter Defense Problems

Authors: Aviv Adler, Oscar Mickelin, Ragesh K. Ramachandran, Gaurav S. Sukhatme, Sertac Karaman

Abstract: When is heterogeneity in the composition of an autonomous robotic team beneficial and when is it detrimental? We investigate and answer this question in the context of a minimally viable model that examines the role of heterogeneous speeds in perimeter defense problems, where defenders share a total allocated speed budget. We consider two distinct problem settings and develop strategies based on dynamic programming and on local interaction rules. We present a theoretical analysis of both approaches and our results are extensively validated using simulations. Interestingly, our results demonstrate that the viability of heterogeneous teams depends on the amount of information available to the defenders. Moreover, our results suggest a universality property: across a wide range of problem parameters the optimal ratio of the speeds of the defenders remains nearly constant.

Submitted 21 February, 2022; originally announced February 2022.

Comments: 27 pages, 9 figures

arXiv:2111.12137 [pdf, other]

Learning Interactive Driving Policies via Data-driven Simulation

Authors: Tsun-Hsuan Wang, Alexander Amini, Wilko Schwarting, Igor Gilitschenski, Sertac Karaman, Daniela Rus

Abstract: Data-driven simulators promise high data-efficiency for driving policy learning. When used for modelling interactions, this data-efficiency becomes a bottleneck: Small underlying datasets often lack interesting and challenging edge cases for learning interactive driving. We address this challenge by proposing a simulation method that uses in-painted ado vehicles for learning robust driving policies. Thus, our approach can be used to learn policies that involve multi-agent interactions and allows for training via state-of-the-art policy learning methods. We evaluate the approach for learning standard interaction scenarios in driving. In extensive experiments, our work demonstrates that the resulting policies can be directly transferred to a full-scale autonomous vehicle without making use of any traditional sim-to-real transfer techniques such as domain randomization.

Submitted 23 November, 2021; originally announced November 2021.

Comments: The first two authors contributed equally to this work. Code is available here: http://vista.csail.mit.edu/

arXiv:2111.12083 [pdf, other]

VISTA 2.0: An Open, Data-driven Simulator for Multimodal Sensing and Policy Learning for Autonomous Vehicles

Authors: Alexander Amini, Tsun-Hsuan Wang, Igor Gilitschenski, Wilko Schwarting, Zhijian Liu, Song Han, Sertac Karaman, Daniela Rus

Abstract: Simulation has the potential to transform the development of robust algorithms for mobile agents deployed in safety-critical scenarios. However, the poor photorealism and lack of diverse sensor modalities of existing simulation engines remain key hurdles towards realizing this potential. Here, we present VISTA, an open source, data-driven simulator that integrates multiple types of sensors for autonomous vehicles. Using high fidelity, real-world datasets, VISTA represents and simulates RGB cameras, 3D LiDAR, and event-based cameras, enabling the rapid generation of novel viewpoints in simulation and thereby enriching the data available for policy learning with corner cases that are difficult to capture in the physical world. Using VISTA, we demonstrate the ability to train and test perception-to-control policies across each of the sensor types and showcase the power of this approach via deployment on a full scale autonomous vehicle. The policies learned in VISTA exhibit sim-to-real transfer without modification and greater robustness than those trained exclusively on real-world data.

Submitted 23 November, 2021; originally announced November 2021.

Comments: First two authors contributed equally. Code and project website are available here: https://vista.csail.mit.edu

arXiv:2109.02865 [pdf, other]

Journalistic Guidelines Aware News Image Captioning

Authors: Xuewen Yang, Svebor Karaman, Joel Tetreault, Alex Jaimes

Abstract: The task of news article image captioning aims to generate descriptive and informative captions for news article images. Unlike conventional image captions that simply describe the content of the image in general terms, news image captions follow journalistic guidelines and rely heavily on named entities to describe the image content, often drawing context from the whole article they are associated with. In this work, we propose a new approach to this task, motivated by caption guidelines that journalists follow. Our approach, Journalistic Guidelines Aware News Image Captioning (JoGANIC), leverages the structure of captions to improve the generation quality and guide our representation design. Experimental results, including detailed ablation studies, on two large-scale publicly available datasets show that JoGANIC substantially outperforms state-of-the-art methods both on caption generation and named entity related metrics.

Submitted 10 September, 2021; v1 submitted 7 September, 2021; originally announced September 2021.

Journal ref: EMNLP 2021

arXiv:2109.00642 [pdf, other]

Searching for Efficient Multi-Stage Vision Transformers

Authors: Yi-Lun Liao, Sertac Karaman, Vivienne Sze

Abstract: Vision Transformer (ViT) demonstrates that Transformer for natural language processing can be applied to computer vision tasks and result in comparable performance to convolutional neural networks (CNN), which have been studied and adopted in computer vision for years. This naturally raises the question of how the performance of ViT can be advanced with design techniques of CNN. To this end, we propose to incorporate two techniques and present ViT-ResNAS, an efficient multi-stage ViT architecture designed with neural architecture search (NAS). First, we propose residual spatial reduction to decrease sequence lengths for deeper layers and utilize a multi-stage architecture. When reducing lengths, we add skip connections to improve performance and stabilize training deeper networks. Second, we propose weight-sharing NAS with multi-architectural sampling. We enlarge a network and utilize its sub-networks to define a search space. A super-network covering all sub-networks is then trained for fast evaluation of their performance. To efficiently train the super-network, we propose to sample and train multiple sub-networks with one forward-backward pass. After that, evolutionary search is performed to discover high-performance network architectures. Experiments on ImageNet demonstrate that ViT-ResNAS achieves better accuracy-MACs and accuracy-throughput trade-offs than the original DeiT and other strong baselines of ViT. Code is available at https://github.com/yilunliao/vit-search.

Submitted 1 September, 2021; originally announced September 2021.

arXiv:2107.02384 [pdf, other]

Multi-Modal Motion Planning Using Composite Pose Graph Optimization

Authors: L. Lao Beyer, N. Balabanska, E. Tal, S. Karaman

Abstract: In this paper, we present a motion planning framework for multi-modal vehicle dynamics. Our proposed algorithm employs transcription of the optimization objective function, vehicle dynamics, and state and control constraints into sparse factor graphs, which -- combined with mode transition constraints -- constitute a composite pose graph. By formulating the multi-modal motion planning problem in composite pose graph form, we enable utilization of efficient techniques for optimization on sparse graphs, such as those widely applied in dual estimation problems, e.g., simultaneous localization and mapping (SLAM). The resulting motion planning algorithm optimizes the multi-modal trajectory, including the location of mode transitions, and is guided by the pose graph optimization process to eliminate unnecessary transitions, enabling efficient discovery of optimized mode sequences from rough initial guesses. We demonstrate multi-modal trajectory optimization in both simulation and real-world experiments for vehicles with various dynamics models, such as an airplane with taxi and flight modes, and a vertical take-off and landing (VTOL) fixed-wing aircraft that transitions between hover and horizontal flight modes.

Submitted 6 July, 2021; originally announced July 2021.

Comments: 7 pages, 6 figures, to be included in proceedings of IEEE International Conference on Robotics and Automation 2021

arXiv:2105.09932 [pdf, other]

Efficient and Robust LiDAR-Based End-to-End Navigation

Authors: Zhijian Liu, Alexander Amini, Sibo Zhu, Sertac Karaman, Song Han, Daniela Rus

Abstract: Deep learning has been used to demonstrate end-to-end neural network learning for autonomous vehicle control from raw sensory input. While LiDAR sensors provide reliably accurate information, existing end-to-end driving solutions are mainly based on cameras since processing 3D data requires a large memory footprint and computation cost. On the other hand, increasing the robustness of these systems is also critical; however, even estimating the model's uncertainty is very challenging due to the cost of sampling-based methods. In this paper, we present an efficient and robust LiDAR-based end-to-end navigation framework. We first introduce Fast-LiDARNet that is based on sparse convolution kernel optimization and hardware-aware model design. We then propose Hybrid Evidential Fusion that directly estimates the uncertainty of the prediction from only a single forward pass and then fuses the control predictions intelligently. We evaluate our system on a full-scale vehicle and demonstrate lane-stable as well as navigation capabilities. In the presence of out-of-distribution events (e.g., sensor failures), our system significantly improves robustness and reduces the number of takeovers in the real world.

Submitted 20 May, 2021; originally announced May 2021.

Comments: ICRA 2021. The first two authors contributed equally to this work. Project page: https://le2ed.mit.edu/

arXiv:2103.10888 [pdf, other]

Feedback from Pixels: Output Regulation via Learning-Based Scene View Synthesis

Authors: Murad Abu-Khalaf, Sertac Karaman, Daniela Rus

Abstract: We propose a novel controller synthesis involving feedback from pixels, whereby the measurement is a high dimensional signal representing a pixelated image with Red-Green-Blue (RGB) values. The approach neither requires feature extraction, nor object detection, nor visual correspondence. The control policy does not involve the estimation of states or similar latent representations. Instead, tracking is achieved directly in image space, with a model of the reference signal embedded as required by the internal model principle. The reference signal is generated by a neural network with learning-based scene view synthesis capabilities. Our approach does not require an end-to-end learning of a pixel-to-action control policy. The approach is applied to a motion control problem, namely the longitudinal dynamics of a car-following problem. We show how this approach lends itself to a tractable stability analysis with associated bounds critical to establishing trustworthiness and interpretability of the closed-loop dynamics.

Submitted 23 April, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

Comments: Submitted to L4DC on November-20-2020; Accepted on March-5-2021

arXiv:2102.09812 [pdf, other]

Deep Latent Competition: Learning to Race Using Visual Control Policies in Latent Space

Authors: Wilko Schwarting, Tim Seyde, Igor Gilitschenski, Lucas Liebenwein, Ryan Sander, Sertac Karaman, Daniela Rus

Abstract: Learning competitive behaviors in multi-agent settings such as racing requires long-term reasoning about potential adversarial interactions. This paper presents Deep Latent Competition (DLC), a novel reinforcement learning algorithm that learns competitive visual control policies through self-play in imagination. The DLC agent imagines multi-agent interaction sequences in the compact latent space of a learned world model that combines a joint transition function with opponent viewpoint prediction. Imagined self-play reduces costly sample generation in the real world, while the latent representation enables planning to scale gracefully with observation dimensionality. We demonstrate the effectiveness of our algorithm in learning competitive behaviors on a novel multi-agent racing benchmark that requires planning from image observations. Code and videos available at https://sites.google.com/view/deep-latent-competition.

Submitted 19 February, 2021; originally announced February 2021.

Comments: Wilko, Tim, and Igor contributed equally to this work; published in Conference on Robot Learning 2020

arXiv:2102.09661 [pdf, other]

Recovering orthogonal tensors under arbitrarily strong, but locally correlated, noise

Authors: Oscar Mickelin, Sertac Karaman

Abstract: We consider the problem of recovering an orthogonally decomposable tensor with a subset of elements distorted by noise with arbitrarily large magnitude. We focus on the particular case where each mode in the decomposition is corrupted by noise vectors with components that are correlated locally, i.e., with nearby components. We show that this deterministic tensor completion problem has the unusual property that it can be solved in polynomial time if the rank of the tensor is sufficiently large. This is the polar opposite of the low-rank assumptions of typical low-rank tensor and matrix completion settings. We show that our problem can be solved through a system of coupled Sylvester-like equations and show how to accelerate their solution by an alternating solver. This enables recovery even with a substantial number of missing entries, for instance for $n$-dimensional tensors of rank $n$ with up to $40\%$ missing entries.

Submitted 18 February, 2021; originally announced February 2021.

Comments: 20 pages, 6 figures

MSC Class: 65F99; 15A69

arXiv:2012.00928 [pdf]

Low Cost, Educational Internal Combustion Engine Electronic Control Unit Hardware-in-the-Loop Test Systems

Authors: Sertac Karaman, Levent Guvenc

Abstract: Different hardware platforms and their associated real-time operating systems that can be used in an educational laboratory for illustrating engine electronic control unit hardware-in-the-loop testing are presented and compared in this paper. A Matlab graphical user interface prepared for generating synthetic crank and camshaft angular position sensor signals to be fed to the engine electronic control unit during hardware-in-the-loop testing is introduced. This graphical user interface is used to generate faulty sensor signals to check the response of the engine electronic control unit during hardware-in-the-loop simulation. Examples of faulty signals that can be generated with the graphical user interface are illustrated.

Submitted 1 December, 2020; originally announced December 2020.

Comments: 6 pages, 11 figures, 1 table

arXiv:2010.14641 [pdf, other]

Learning to Plan Optimistically: Uncertainty-Guided Deep Exploration via Latent Model Ensembles

Authors: Tim Seyde, Wilko Schwarting, Sertac Karaman, Daniela Rus

Abstract: Learning complex robot behaviors through interaction requires structured exploration. Planning should target interactions with the potential to optimize long-term performance, while only reducing uncertainty where conducive to this objective. This paper presents Latent Optimistic Value Exploration (LOVE), a strategy that enables deep exploration through optimism in the face of uncertain long-term rewards. We combine latent world models with value function estimation to predict infinite-horizon returns and recover associated uncertainty via ensembling. The policy is then trained on an upper confidence bound (UCB) objective to identify and select the interactions most promising to improve long-term performance. We apply LOVE to visual robot control tasks in continuous action spaces and demonstrate on average more than 20% improved sample efficiency in comparison to state-of-the-art and other exploration objectives. In sparse and hard to explore environments we achieve an average improvement of over 30%.

Submitted 11 December, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

arXiv:2006.02513 [pdf, other]

Multi-Fidelity Black-Box Optimization for Time-Optimal Quadrotor Maneuvers

Authors: Gilhyun Ryou, Ezra Tal, Sertac Karaman

Abstract: We consider the problem of generating a time-optimal quadrotor trajectory that attains a set of prescribed waypoints. This problem is challenging since the optimal trajectory is located on the boundary of the set of dynamically feasible trajectories. This boundary is hard to model as it involves limitations of the entire system, including hardware and software, in agile high-speed flight. In this work, we propose a multi-fidelity Bayesian optimization framework that models the feasibility constraints based on analytical approximation, numerical simulation, and real-world flight experiments. By combining evaluations at different fidelities, trajectory time is optimized while keeping the number of required costly flight experiments to a minimum. The algorithm is thoroughly evaluated in both simulation and real-world flight experiments at speeds up to 11 m/s. Resulting trajectories were found to be significantly faster than those obtained through minimum-snap trajectory planning.

Submitted 3 June, 2020; originally announced June 2020.

Comments: Accepted to appear at Robotics: Science and Systems 2020. Video at https://youtu.be/igwULi_H1Kg

arXiv:2005.13986 [pdf, other]

Perception-aware time optimal path parameterization for quadrotors

Authors: Igor Spasojevic, Varun Murali, Sertac Karaman

Abstract: The increasing popularity of quadrotors has given rise to a class of predominantly vision-driven vehicles. This paper addresses the problem of perception-aware time optimal path parametrization for quadrotors. Although many different choices of perceptual modalities are available, the low weight and power budgets of quadrotor systems make a camera ideal for on-board navigation and estimation algorithms. However, this does come with a set of challenges. The limited field of view of the camera can restrict the visibility of salient regions in the environment, which dictates the necessity to consider perception and planning jointly. The main contribution of this paper is an efficient time optimal path parametrization algorithm for quadrotors with limited field of view constraints. We show in a simulation study that a state-of-the-art controller can track planned trajectories, and we validate the proposed algorithm on a quadrotor platform in experiments.

Submitted 28 May, 2020; originally announced May 2020.

Comments: Accepted to appear at ICRA 2020

arXiv:2001.02359 [pdf, other]

Weakly Supervised Visual Semantic Parsing

Authors: Alireza Zareian, Svebor Karaman, Shih-Fu Chang

Abstract: Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images, enabling deep understanding of visual content, with many applications such as visual reasoning and image retrieval. Nevertheless, existing SGG methods require millions of manually annotated bounding boxes for training, and are computationally inefficient, as they exhaustively process all pairs of object proposals to detect predicates. In this paper, we address those two limitations by first proposing a generalized formulation of SGG, namely Visual Semantic Parsing, which disentangles entity and predicate recognition, and enables sub-quadratic performance. Then we propose the Visual Semantic Parsing Network, VSPNet, based on a dynamic, attention-based, bipartite message passing framework that jointly infers graph nodes and edges through an iterative process. Additionally, we propose the first graph-based weakly supervised learning framework, based on a novel graph alignment algorithm, which enables training without bounding box annotations. Through extensive experiments, we show that VSPNet outperforms weakly supervised baselines significantly and approaches fully supervised performance, while being several times faster. We publicly release the source code of our method.

Submitted 31 March, 2020; v1 submitted 7 January, 2020; originally announced January 2020.

Comments: To be presented at CVPR 2020 (oral paper)

arXiv:2001.02314 [pdf, other]

Bridging Knowledge Graphs to Generate Scene Graphs

Authors: Alireza Zareian, Svebor Karaman, Shih-Fu Chang

Abstract: Scene graphs are powerful representations that parse images into their abstract semantic elements, i.e., objects and their interactions, which facilitates visual comprehension and explainable reasoning. On the other hand, commonsense knowledge graphs are rich repositories that encode how the world is structured, and how general concepts interact. In this paper, we present a unified formulation of these two constructs, where a scene graph is seen as an image-conditioned instantiation of a commonsense knowledge graph. Based on this new perspective, we re-formulate scene graph generation as the inference of a bridge between the scene and commonsense graphs, where each entity or predicate instance in the scene graph has to be linked to its corresponding entity or predicate class in the commonsense graph. To this end, we propose a novel graph-based neural network that iteratively propagates information between the two graphs, as well as within each of them, while gradually refining their bridge in each iteration. Our Graph Bridging Network, GB-Net, successively infers edges and nodes, allowing to simultaneously exploit and refine the rich, heterogeneous structure of the interconnected scene and commonsense graphs. Through extensive experimentation, we showcase the superior accuracy of GB-Net compared to the most recent methods, resulting in a new state of the art. We publicly release the source code of our method.

Submitted 18 July, 2020; v1 submitted 7 January, 2020; originally announced January 2020.

Comments: To be presented at ECCV 2020

arXiv:1912.06785 [pdf, other]

doi 10.1109/LRA.2020.3004800

Deep Context Maps: Agent Trajectory Prediction using Location-specific Latent Maps

Authors: Igor Gilitschenski, Guy Rosman, Arjun Gupta, Sertac Karaman, Daniela Rus

Abstract: In this paper, we propose a novel approach for agent motion prediction in cluttered environments. One of the main challenges in predicting agent motion is accounting for location and context-specific information. Our main contribution is the concept of learning context maps to improve the prediction task. Context maps are a set of location-specific latent maps that are trained alongside the predictor. Thus, the proposed maps are capable of capturing location context beyond visual context cues (e.g. usual average speeds and typical trajectories) or predefined map primitives (such as lanes and stop lines). We pose context map learning as a multi-task training problem and describe our map model and its incorporation into a state-of-the-art trajectory predictor. In extensive experiments, it is shown that use of learned maps can significantly improve predictor accuracy. Furthermore, the performance can be additionally boosted by providing partial knowledge of map semantics.

Submitted 19 June, 2020; v1 submitted 14 December, 2019; originally announced December 2019.

arXiv:1912.04462 [pdf, other]

Flow-Distilled IP Two-Stream Networks for Compressed Video Action Recognition

Authors: Shiyuan Huang, Xudong Lin, Svebor Karaman, Shih-Fu Chang

Abstract: Two-stream networks have achieved great success in video recognition. A two-stream network combines a spatial stream of RGB frames and a temporal stream of Optical Flow to make predictions. However, the temporal redundancy of RGB frames as well as the high-cost of optical flow computation creates challenges for both the performance and efficiency. Recent works instead use modern compressed video modalities as an alternative to the RGB spatial stream and improve the inference speed by orders of magnitudes. Previous works create one stream for each modality which are combined with an additional temporal stream through late fusion. This is redundant since some modalities like motion vectors already contain temporal information. Based on this observation, we propose a compressed domain two-stream network IP TSN for compressed video recognition, where the two streams are represented by the two types of frames (I and P frames) in compressed videos, without needing a separate temporal stream. With this goal, we propose to fully exploit the motion information of P-stream through generalized distillation from optical flow, which largely improves the efficiency and accuracy. Our P-stream runs 60 times faster than using optical flow while achieving higher accuracy. Our full IP TSN, evaluated over public action recognition benchmarks (UCF101, HMDB51 and a subset of Kinetics), outperforms other compressed domain methods by large margins while improving the total inference speed by 20%.

Submitted 12 December, 2019; v1 submitted 9 December, 2019; originally announced December 2019.

arXiv:1909.10673 [pdf, other]

A Theory of Uncertainty Variables for State Estimation and Inference

Authors: Rajat Talak, Sertac Karaman, Eytan Modiano

Abstract: We develop a new framework of uncertainty variables to model uncertainty. An uncertainty variable is characterized by an uncertainty set, in which its realization is bound to lie, while the conditional uncertainty is characterized by a set map, from a given realization of a variable to a set of possible realizations of another variable. We prove Bayes' law and the law of total probability equivalents for uncertainty variables. We define a notion of independence, conditional independence, and pairwise independence for a collection of uncertainty variables, and show that this new notion of independence preserves the properties of independence defined over random variables. We then develop a graphical model, namely Bayesian uncertainty network, a Bayesian network equivalent defined over a collection of uncertainty variables, and show that all the natural conditional independence properties, expected out of a Bayesian network, hold for the Bayesian uncertainty network. We also define the notion of point estimate, and show its relation with the maximum a posteriori estimate. Probability theory starts with a distribution function (equivalently a probability measure) as a primitive and builds all other useful concepts, such as law of total probability, Bayes' law, independence, graphical models, point estimate, on it. Our work shows that it is perfectly possible to start with a set, instead of a distribution function, and retain all the useful ideas needed for state estimation and inference.

Submitted 9 December, 2019; v1 submitted 23 September, 2019; originally announced September 2019.

arXiv:1909.06963 [pdf, other]

Stochastic Dynamic Games in Belief Space

Authors: Wilko Schwarting, Alyssa Pierson, Sertac Karaman, Daniela Rus

Abstract: Information gathering while interacting with other agents under sensing and motion uncertainty is critical in domains such as driving, service robots, racing, or surveillance. The interests of agents may be at odds with others, resulting in a stochastic non-cooperative dynamic game. Agents must predict others' future actions without communication, incorporate their actions into these predictions, account for uncertainty and noise in information gathering, and consider what information their actions reveal. Our solution uses local iterative dynamic programming in Gaussian belief space to solve a game-theoretic continuous POMDP. Solving a quadratic game in the backward pass of a game-theoretic belief-space variant of iLQG achieves a runtime polynomial in the number of agents and linear in the planning horizon. Our algorithm yields linear feedback policies for our robot, and predicted feedback policies for other agents. We present three applications: active surveillance, guiding eyes for a blind agent, and autonomous racing. Agents with game-theoretic belief-space planning win 44% more races than without game theory and 34% more than without belief-space planning.

Submitted 12 May, 2021; v1 submitted 15 September, 2019; originally announced September 2019.

Comments: Accepted in IEEE Transactions on Robotics (T-RO) 2021

Journal ref: IEEE Transactions on Robotics (T-RO) 2021

arXiv:1908.11413 [pdf, other]

doi 10.1137/19M1284579

Multi-resolution Low-rank Tensor Formats

Authors: Oscar Mickelin, Sertac Karaman

Abstract: We describe a simple, black-box compression format for tensors with a multiscale structure. By representing the tensor as a sum of compressed tensors defined on increasingly coarse grids, we capture low-rank structures on each grid-scale, and we show how this leads to an increase in compression for a fixed accuracy. We devise an alternating algorithm to represent a given tensor in the multiresolution format and prove local convergence guarantees. In two dimensions, we provide examples that show that this approach can beat the Eckart-Young theorem, and for dimensions higher than two, we achieve higher compression than the tensor-train format on six real-world datasets. We also provide results on the closedness and stability of the tensor format and discuss how to perform common linear algebra operations on the level of the compressed tensors.

Submitted 17 August, 2020; v1 submitted 29 August, 2019; originally announced August 2019.

Comments: 29 pages, 9 figures

MSC Class: 65F99; 15A69

Journal ref: SIAM J. Matrix Anal. Appl., 41(3), 1086-1114. (2020)

arXiv:1907.06515 [pdf, other]

Detecting and Simulating Artifacts in GAN Fake Images

Authors: Xu Zhang, Svebor Karaman, Shih-Fu Chang

Abstract: To detect GAN generated images, conventional supervised machine learning algorithms require collection of a number of real and fake images from the targeted GAN model. However, the specific model used by the attacker is often unavailable. To address this, we propose a GAN simulator, AutoGAN, which can simulate the artifacts produced by the common pipeline shared by several popular GAN models. Additionally, we identify a unique artifact caused by the up-sampling component included in the common GAN pipeline. We show theoretically such artifacts are manifested as replications of spectra in the frequency domain and thus propose a classifier model based on the spectrum input, rather than the pixel input. By using the simulated images to train a spectrum based classifier, even without seeing the fake images produced by the targeted GAN model during training, our approach achieves state-of-the-art performances on detecting fake images generated by popular GAN models such as CycleGAN.

Submitted 15 October, 2019; v1 submitted 15 July, 2019; originally announced July 2019.

Comments: This is an extended version of our original AutoGAN paper, which will appear at WIFS 2019

arXiv:1906.06407 [pdf, ps, other]

Optimal orthogonal approximations to symmetric tensors cannot always be chosen symmetric

Authors: Oscar Mickelin, Sertac Karaman

Abstract: We study the problem of finding orthogonal low-rank approximations of symmetric tensors. In the case of matrices, the approximation is a truncated singular value decomposition which is then symmetric. Moreover, for rank-one approximations of tensors of any dimension, a classical result proven by Banach in 1938 shows that the optimal approximation can always be chosen to be symmetric. In contrast to these results, this article shows that the corresponding statement is no longer true for orthogonal approximations of higher rank. Specifically, for any of the four common notions of tensor orthogonality used in the literature, we show that optimal orthogonal approximations of rank greater than one cannot always be chosen to be symmetric.

Submitted 14 June, 2019; originally announced June 2019.

Comments: 20 pages

MSC Class: 15A18; 15A69; 41A29

arXiv:1905.11524 [pdf, other]

Shared Linear Quadratic Regulation Control: A Reinforcement Learning Approach

Authors: Murad Abu-Khalaf, Sertac Karaman, Daniela Rus

Abstract: We propose controller synthesis for state regulation problems in which a human operator shares control with an autonomy system, running in parallel. The autonomy system continuously improves over human action, with minimal intervention, and can take over full-control. It additively combines user input with an adaptive optimal corrective signal. It is adaptive in that it neither estimates nor requires a model of the human's action policy, or the internal dynamics of the plant, and can adjust to changes in both. Our contribution is twofold; first, a new synthesis for shared control which we formulate as an adaptive optimal control problem for continuous-time linear systems and solve it online as a human-in-the-loop reinforcement learning. The result is an architecture that we call shared linear quadratic regulator (sLQR). Second, we provide new analysis of reinforcement learning for continuous-time linear systems in two parts. In the first analysis part, we avoid learning along a single state-space trajectory which we show leads to data collinearity under certain conditions. We make a clear separation between exploitation of learned policies and exploration of the state-space, and propose an exploration scheme that requires switching to new state-space trajectories rather than injecting noise continuously while learning. This avoidance of continuous noise injection minimizes interference with human action, and avoids bias in the convergence to the stabilizing solution of the underlying algebraic Riccati equation. We show that exploring a minimum number of pairwise distinct state-space trajectories is necessary to avoid collinearity in the learning data. In the second analysis part, we show conditions under which existence and uniqueness of solutions can be established for off-policy reinforcement learning in continuous-time linear systems; namely, prior knowledge of the input matrix.

Submitted 20 September, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: Accepted by IEEE CDC 2019

arXiv:1905.11377 [pdf, other]

doi 10.1109/IROS40897.2019.8968116

FlightGoggles: A Modular Framework for Photorealistic Camera, Exteroceptive Sensor, and Dynamics Simulation

Authors: Winter Guerra, Ezra Tal, Varun Murali, Gilhyun Ryou, Sertac Karaman

Abstract: FlightGoggles is a photorealistic sensor simulator for perception-driven robotic vehicles. The key contributions of FlightGoggles are twofold. First, FlightGoggles provides photorealistic exteroceptive sensor simulation using graphics assets generated with photogrammetry. Second, it provides the ability to combine (i) synthetic exteroceptive measurements generated in silico in real time and (ii) vehicle dynamics and proprioceptive measurements generated in motio by vehicle(s) in a motion-capture facility. FlightGoggles is capable of simulating a virtual-reality environment around autonomous vehicle(s). While a vehicle is in flight in the FlightGoggles virtual reality environment, exteroceptive sensors are rendered synthetically in real time while all complex extrinsic dynamics are generated organically through the natural interactions of the vehicle. The FlightGoggles framework allows for researchers to accelerate development by circumventing the need to estimate complex and hard-to-model interactions such as aerodynamics, motor mechanics, battery electrochemistry, and behavior of other agents. The ability to perform vehicle-in-the-loop experiments with photorealistic exteroceptive sensor simulation facilitates novel research directions involving, e.g., fast and agile autonomous flight in obstacle-rich environments, safe human interaction, and flexible sensor selection. FlightGoggles has been utilized as the main test for selecting nine teams that will advance in the AlphaPilot autonomous drone racing challenge. We survey approaches and results from the top AlphaPilot teams, which may be of independent interest.

Submitted 28 May, 2021; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: Initial version appeared at IROS 2019. Supplementary material can be found at https://flightgoggles.mit.edu. Revision includes description of new FlightGoggles features, such as a photogrammetric model of the MIT Stata Center, new rendering settings, and a Python API

arXiv:1905.02238 [pdf, other]

FSMI: Fast computation of Shannon Mutual Information for information-theoretic mapping

Authors: Zhengdong Zhang, Trevor Henderson, Sertac Karaman, Vivienne Sze

Abstract: Exploration tasks are embedded in many robotics applications, such as search and rescue and space exploration. Information-based exploration algorithms aim to find the most informative trajectories by maximizing an information-theoretic metric, such as the mutual information between the map and potential future measurements. Unfortunately, most existing information-based exploration algorithms are plagued by the computational difficulty of evaluating the Shannon mutual information metric. In this paper, we consider the fundamental problem of evaluating Shannon mutual information between the map and a range measurement. First, we consider 2D environments. We propose a novel algorithm, called the Fast Shannon Mutual Information (FSMI). The key insight behind the algorithm is that a certain integral can be computed analytically, leading to substantial computational savings. Second, we consider 3D environments, represented by efficient data structures, e.g., an OctoMap, such that the measurements are compressed by Run-Length Encoding (RLE). We propose a novel algorithm, called FSMI-RLE, that efficiently evaluates the Shannon mutual information when the measurements are compressed using RLE. For both the FSMI and the FSMI-RLE, we also propose variants that make different assumptions on the sensor noise distribution for the purpose of further computational savings. We evaluate the proposed algorithms in extensive experiments. In particular, we show that the proposed algorithms outperform existing algorithms that compute Shannon mutual information as well as other algorithms that compute the Cauchy-Schwarz Quadratic mutual information (CSQMI). In addition, we demonstrate the computation of Shannon mutual information on a 3D map for the first time.

Submitted 6 May, 2019; originally announced May 2019.

arXiv:1904.04968 [pdf, other]

doi 10.1109/LCSYS.2019.2919809

Asymptotic Optimality of a Time Optimal Path Parametrization Algorithm

Authors: Igor Spasojevic, Varun Murali, Sertac Karaman

Abstract: Time Optimal Path Parametrization is the problem of minimizing the time interval during which an actuation constrained agent can traverse a given path. Recently, an efficient linear-time algorithm for solving this problem was proposed. However, its optimality was proved for only a strict subclass of problems solved optimally by more computationally intensive approaches based on convex programming. In this paper, we prove that the same linear-time algorithm is asymptotically optimal for all problems solved optimally by convex optimization approaches. We also characterize the optimum of the Time Optimal Path Parametrization Problem, which may be of independent interest.

Submitted 9 April, 2019; originally announced April 2019.

arXiv:1903.03273 [pdf, other]

FastDepth: Fast Monocular Depth Estimation on Embedded Systems

Authors: Diana Wofk, Fangchang Ma, Tien-Ju Yang, Sertac Karaman, Vivienne Sze

Abstract: Depth sensing is a critical function for robotic tasks such as localization, mapping and obstacle detection. There has been a significant and growing interest in depth estimation from a single RGB image, due to the relatively low cost and size of monocular cameras. However, state-of-the-art single-view depth estimation algorithms are based on fairly complex deep neural networks that are too slow for real-time inference on an embedded platform, for instance, mounted on a micro aerial vehicle. In this paper, we address the problem of fast depth estimation on embedded systems. We propose an efficient and lightweight encoder-decoder network architecture and apply network pruning to further reduce computational complexity and latency. In particular, we focus on the design of a low-latency decoder. Our methodology demonstrates that it is possible to achieve similar accuracy as prior work on depth estimation, but at inference speeds that are an order of magnitude faster. Our proposed network, FastDepth, runs at 178 fps on an NVIDIA Jetson TX2 GPU and at 27 fps when using only the TX2 CPU, with active power consumption under 10 W. FastDepth achieves close to state-of-the-art accuracy on the NYU Depth v2 dataset. To the best of the authors' knowledge, this paper demonstrates real-time monocular depth estimation using a deep neural network with the lowest latency and highest throughput on an embedded platform that can be carried by a micro aerial vehicle.

Submitted 7 March, 2019; originally announced March 2019.

Comments: Accepted for presentation at ICRA 2019. 8 pages, 6 figures, 7 tables

arXiv:1903.01545 [pdf, other]

Unsupervised Rank-Preserving Hashing for Large-Scale Image Retrieval

Authors: Svebor Karaman, Xudong Lin, Xuefeng Hu, Shih-Fu Chang

Abstract: We propose an unsupervised hashing method which aims to produce binary codes that preserve the ranking induced by a real-valued representation. Such compact hash codes enable the complete elimination of real-valued feature storage and allow for significant reduction of the computation complexity and storage cost of large-scale image retrieval applications. Specifically, we learn a neural network-based model, which transforms the input representation into a binary representation. We formalize the training objective of the network in an intuitive and effective way, considering each training sample as a query and aiming to obtain the same retrieval results using the produced hash codes as those obtained with the original features. This training formulation directly optimizes the hashing model for the target usage of the hash codes it produces. We further explore the addition of a decoder trained to obtain an approximated reconstruction of the original features. At test time, we retrieve the most promising database samples with an efficient graph-based search procedure using only our hash codes and perform re-ranking using the reconstructed features, thus without needing to access the original features at all. Experiments conducted on multiple publicly available large-scale datasets show that our method consistently outperforms all compared state-of-the-art unsupervised hashing methods and that the reconstruction procedure can effectively boost the search accuracy with a minimal constant additional cost.

Submitted 4 March, 2019; originally announced March 2019.

arXiv:1811.11683 [pdf, other]

Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

Authors: Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, Shih-Fu Chang

Abstract: We address the problem of phrase grounding by learning a multi-level common semantic space shared by the textual and visual modalities. We exploit multiple levels of feature maps of a Deep Convolutional Neural Network, as well as contextualized word and sentence embeddings extracted from a character-based language model. Following dedicated non-linear mappings for visual features at each level, word, and sentence embeddings, we obtain multiple instantiations of our common semantic space in which comparisons between any target text and the visual content is performed with cosine similarity. We guide the model by a multi-level multimodal attention mechanism which outputs attended visual features at each level. The best level is chosen to be compared with text content for maximizing the pertinence scores of image-sentence pairs of the ground truth. Experiments conducted on three publicly available datasets show significant performance gains (20%-60% relative) over the state-of-the-art in phrase localization and set a new performance record on those datasets. We provide a detailed ablation study to show the contribution of each element of our approach and release our code on GitHub.

Submitted 29 May, 2019; v1 submitted 28 November, 2018; originally announced November 2018.

Comments: Accepted in CVPR 2019

arXiv:1811.10119 [pdf, other]

doi 10.1109/ICRA.2019.8793579

Variational End-to-End Navigation and Localization

Authors: Alexander Amini, Guy Rosman, Sertac Karaman, Daniela Rus

Abstract: Deep learning has revolutionized the ability to learn "end-to-end" autonomous vehicle control directly from raw sensory data. While there have been recent extensions to handle forms of navigation instruction, these works are unable to capture the full distribution of possible actions that could be taken and to reason about localization of the robot within the environment. In this paper, we extend end-to-end driving networks with the ability to perform point-to-point navigation as well as probabilistic localization using only noisy GPS data. We define a novel variational network capable of learning from raw camera data of the environment as well as higher level roadmaps to predict (1) a full probability distribution over the possible control commands; and (2) a deterministic control command capable of navigating on the route specified within the map. Additionally, we formulate how our model can be used to localize the robot according to correspondences between the map and the observed visual road topology, inspired by the rough localization that human drivers can perform. We test our algorithms on real-world driving data that the vehicle has never driven through before, and integrate our point-to-point navigation algorithms onboard a full-scale autonomous vehicle for real-time performance. Our localization algorithm is also evaluated over a new set of roads and intersections to demonstrate rough pose localization even in situations without any GPS prior.

Submitted 11 June, 2019; v1 submitted 25 November, 2018; originally announced November 2018.

Comments: Published in IEEE International Conference on Robotics and Automation (ICRA) 2019. Best Paper Award Finalist

Journal ref: 2019 International Conference on Robotics and Automation (ICRA)

arXiv:1810.04371 [pdf, other]

Can Determinacy Minimize Age of Information?

Authors: Rajat Talak, Sertac Karaman, Eytan Modiano

Abstract: Age-of-information (AoI) is a newly proposed performance metric of information freshness. It differs from the traditional delay metric, because it is destination centric and measures the time that elapsed since the last received fresh information update was generated at the source. AoI has been analyzed for several queueing models, and the problem of optimizing AoI over arrival and service rates has been studied in the literature. We consider the problem of minimizing AoI over the space of update generation and service time distributions. In particular, we ask whether determinacy, i.e. periodic generation of update packets and/or deterministic service, optimizes AoI. By considering several queueing systems, we show that in certain settings, deterministic service can in fact result in the worst case AoI, while a heavy-tailed distributed service can yield the minimum AoI. This leads to an interesting conclusion that, in some queueing systems, the service time distribution that minimizes expected packet delay, or variance in packet delay can, in fact, result in the worst case AoI. This exposes a fundamental difference between AoI metrics and packet delay.

Submitted 14 January, 2019; v1 submitted 10 October, 2018; originally announced October 2018.

Showing 1–50 of 91 results for author: Karaman, S