Search | arXiv e-print repository

Graph Guided Question Answer Generation for Procedural Question-Answering

Authors: Hai X. Pham, Isma Hadji, Xinnuo Xu, Ziedune Degutyte, Jay Rainey, Evangelos Kazakos, Afsaneh Fazly, Georgios Tzimiropoulos, Brais Martinez

Abstract: In this paper, we focus on task-specific question answering (QA). To this end, we introduce a method for generating exhaustive and high-quality training data, which allows us to train compact (e.g., run on a mobile device), task-specific QA models that are competitive against GPT variants. The key technological enabler is a novel mechanism for automatic question-answer generation from procedural text which can ingest large amounts of textual instructions and produce exhaustive in-domain QA training data. While current QA data generation methods can produce well-formed and varied data, their non-exhaustive nature is sub-optimal for training a QA model. In contrast, we leverage the highly structured aspect of procedural text and represent each step and the overall flow of the procedure as graphs. We then condition on graph nodes to automatically generate QA pairs in an exhaustive and controllable manner. Comprehensive evaluations of our method show that: 1) small models trained with our data achieve excellent performance on the target QA task, even exceeding that of GPT3 and ChatGPT despite being several orders of magnitude smaller. 2) semantic coverage is the key indicator for downstream QA performance. Crucially, while large language models excel at syntactic diversity, this does not necessarily result in improvements on the end QA model. In contrast, the higher semantic coverage provided by our method is critical for QA performance.

Submitted 24 January, 2024; originally announced January 2024.

Comments: Accepted to EACL 2024 as long paper. 25 pages including appendix

MSC Class: I.2.7

arXiv:2310.14030

Visual Tracking Nonlinear Model Predictive Control Method for Autonomous Wind Turbine Inspection

Authors: Abdelhakim Amer, Mohit Mehndiratta, Jonas le Fevre Sejersen, Huy Xuan Pham, Erdal Kayacan

Abstract: Automated visual inspection of on- and offshore wind turbines using aerial robots provides several benefits, namely, a safe working environment by circumventing the need for workers to be suspended high above the ground, reduced inspection time, preventive maintenance, and access to hard-to-reach areas. A novel nonlinear model predictive control (NMPC) framework alongside a global wind turbine path planner is proposed to achieve distance-optimal coverage for wind turbine inspection. Unlike traditional MPC formulations, visual tracking NMPC (VT-NMPC) is designed to track an inspection surface, instead of a position and heading trajectory, thereby circumventing the need to provide an accurate predefined trajectory for the drone. An additional capability of the proposed VT-NMPC method is that by incorporating inspection requirements as visual tracking costs to minimize, it naturally achieves the inspection task successfully while respecting the physical constraints of the drone. Multiple simulation runs and real-world tests demonstrate the efficiency and efficacy of the proposed automated inspection framework, which outperforms the traditional MPC designs, by providing full coverage of the target wind turbine blades as well as its robustness to changing wind conditions. The implementation codes are open-sourced.

Submitted 21 October, 2023; originally announced October 2023.

Comments: 8 pages, accepted for publication at ICAR conference

arXiv:2207.14131

PencilNet: Zero-Shot Sim-to-Real Transfer Learning for Robust Gate Perception in Autonomous Drone Racing

Authors: Huy Xuan Pham, Andriy Sarabakha, Mykola Odnoshyvkin, Erdal Kayacan

Abstract: In autonomous and mobile robotics, one of the main challenges is the robust on-the-fly perception of the environment, which is often unknown and dynamic, like in autonomous drone racing. In this work, we propose a novel deep neural network-based perception method for racing gate detection -- PencilNet -- which relies on a lightweight neural network backbone on top of a pencil filter. This approach unifies predictions of the gates' 2D position, distance, and orientation in a single pose tuple. We show that our method is effective for zero-shot sim-to-real transfer learning that does not need any real-world training samples. Moreover, our framework is highly robust to illumination changes commonly seen under rapid flight compared to state-of-the-art methods. A thorough set of experiments demonstrates the effectiveness of this approach in multiple challenging scenarios, where the drone completes various tracks under different lighting conditions.

Submitted 28 July, 2022; originally announced July 2022.

Comments: accepted for publication by IEEE RA-L/IROS 2022

arXiv:2204.02120

Event-based Navigation for Autonomous Drone Racing with Sparse Gated Recurrent Network

Authors: Kristoffer Fogh Andersen, Huy Xuan Pham, Halil Ibrahim Ugurlu, Erdal Kayacan

Abstract: Event-based vision has already revolutionized the perception task for robots by promising faster response, lower energy consumption, and lower bandwidth without introducing motion blur. In this work, a novel deep learning method based on gated recurrent units utilizing sparse convolutions for detecting gates in a race track is proposed using event-based vision for the autonomous drone racing problem. We demonstrate the efficiency and efficacy of the perception pipeline on a real robot platform that can safely navigate a typical autonomous drone racing track in real-time. Throughout the experiments, we show that the event-based vision with the proposed gated recurrent unit and pretrained models on simulated event data significantly improve the gate detection precision. Furthermore, an event-based drone racing dataset consisting of both simulated and real data sequences is publicly released.

Submitted 5 April, 2022; originally announced April 2022.

Comments: Accepted to present at the 20th European Control Conference (ECC)

arXiv:2102.02547

CHEF: Cross-modal Hierarchical Embeddings for Food Domain Retrieval

Authors: Hai X. Pham, Ricardo Guerrero, Jiatong Li, Vladimir Pavlovic

Abstract: Despite the abundance of multi-modal data, such as image-text pairs, there has been little effort in understanding the individual entities and their different roles in the construction of these data instances. In this work, we endeavour to discover the entities and their corresponding importance in cooking recipes automatically as a visual-linguistic association problem. More specifically, we introduce a novel cross-modal learning framework to jointly model the latent representations of images and text in the food image-recipe association and retrieval tasks. This model allows one to discover complex functional and hierarchical relationships between images and text, and among textual parts of a recipe including title, ingredients and cooking instructions. Our experiments show that by making use of efficient tree-structured Long Short-Term Memory as the text encoder in our computational cross-modal retrieval framework, we are not only able to identify the main ingredients and cooking actions in the recipe descriptions without explicit supervision, but we can also learn more meaningful feature representations of food recipes, appropriate for challenging cross-modal retrieval and recipe adaption tasks.

Submitted 4 February, 2021; originally announced February 2021.

Comments: 22 pages, accepted in AAAI 2021

arXiv:2012.01345

Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning

Authors: Ricardo Guerrero, Hai Xuan Pham, Vladimir Pavlovic

Abstract: Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work we propose a method for food domain cross-modal shared representation learning that preserves the vast semantic richness present in the food data. Our proposed method employs an effective transformer-based multilingual recipe encoder coupled with a traditional image embedding architecture. Here, we propose the use of imperfect multilingual translations to effectively regularize the model while at the same time adding functionality across multiple languages and alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation learned via the proposed method significantly outperforms the current state of the art (SOTA) on retrieval tasks. Furthermore, the representational power of the learned representation is demonstrated through a generative food image synthesis model conditioned on recipe embeddings. Synthesized images can effectively reproduce the visual appearance of paired samples, indicating that the learned representation captures the joint semantics of both the textual recipe and its visual content, thus narrowing the modality gap.

Submitted 30 September, 2021; v1 submitted 2 December, 2020; originally announced December 2020.

arXiv:1811.00690

A Multi-Robotic System for Environmental Cleaning

Authors: Chuong Le, Huy Xuan Pham, Hung Manh La

Abstract: There is a lot of waste in an industrial environment that could cause harmful effects to both the products and the workers, resulting in product defects, itchy eyes or chronic obstructive pulmonary disease, etc. While automatic cleaning robots could be used, the environment is often too big for one robot to clean alone, in addition to the fact that it does not have adequate stored dirt capacity. We present a multi-robotic dirt cleaning system algorithm for multiple automatic iRobot Creates teaming to efficiently clean an environment. Moreover, since some spaces in the environment are clean while others are dirty, our multi-robotic system possesses a path planning algorithm to allow the robot team to clean efficiently by spending more time on the area with higher dirt level. Overall, our multi-robotic system outperforms the single robot system in time efficiency while having almost the same total battery usage and cleaning efficiency result.

Submitted 1 November, 2018; originally announced November 2018.

arXiv:1803.07926

A Distributed Control Framework of Multiple Unmanned Aerial Vehicles for Dynamic Wildfire Tracking

Authors: Huy Xuan Pham, Hung Manh La, David Feil-Seifer, Matthew Dean

Abstract: Wild-land fire fighting is a hazardous job. A key task for firefighters is to observe the "fire front" to chart the progress of the fire and areas that will likely spread next. Lack of information of the fire front causes many accidents. Using Unmanned Aerial Vehicles (UAVs) to cover wildfire is promising because it can replace humans in hazardous fire tracking and significantly reduce operation costs. In this paper we propose a distributed control framework designed for a team of UAVs that can closely monitor a wildfire in open space, and precisely track its development. The UAV team, designed for flexible deployment, can effectively avoid in-flight collisions and cooperate well with neighbors. They can maintain a certain height level to the ground for safe flight above fire. Experimental results are conducted to demonstrate the capabilities of the UAV team in covering a spreading wildfire.

Submitted 19 March, 2018; originally announced March 2018.

Comments: arXiv admin note: substantial text overlap with arXiv:1704.02630

arXiv:1803.07716

Generative Adversarial Talking Head: Bringing Portraits to Life with a Weakly Supervised Neural Network

Authors: Hai X. Pham, Yuting Wang, Vladimir Pavlovic

Abstract: This paper presents Generative Adversarial Talking Head (GATH), a novel deep generative neural network that enables fully automatic facial expression synthesis of an arbitrary portrait with continuous action unit (AU) coefficients. Specifically, our model directly manipulates image pixels to make the unseen subject in the still photo express various emotions controlled by values of facial AU coefficients, while maintaining her personal characteristics, such as facial geometry, skin color and hair style, as well as the original surrounding background. In contrast to prior work, GATH is purely data-driven and it requires neither a statistical face model nor image processing tricks to enact facial deformations. Additionally, our model is trained from unpaired data, where the input image, with its auxiliary identity label taken from abundance of still photos in the wild, and the target frame are from different persons. In order to effectively learn such model, we propose a novel weakly supervised adversarial learning framework that consists of a generator, a discriminator, a classifier and an action unit estimator. Our work gives rise to template-and-target-free expression editing, where still faces can be effortlessly animated with arbitrary AU coefficients provided by the user.

Submitted 28 March, 2018; v1 submitted 20 March, 2018; originally announced March 2018.

Comments: Fix typos, add youtube link of supplementary video

arXiv:1803.07250

Cooperative and Distributed Reinforcement Learning of Drones for Field Coverage

Authors: Huy Xuan Pham, Hung Manh La, David Feil-Seifer, Aria Nefian

Abstract: This paper proposes a distributed Multi-Agent Reinforcement Learning (MARL) algorithm for a team of Unmanned Aerial Vehicles (UAVs). The proposed MARL algorithm allows UAVs to learn cooperatively to provide a full coverage of an unknown field of interest while minimizing the overlapping sections among their fields of view. Two challenges in MARL for such a system are discussed in the paper: firstly, the complex dynamic of the joint-actions of the UAV team, that will be solved using game-theoretic correlated equilibrium, and secondly, the challenge in huge dimensional state space representation will be tackled with efficient function approximation techniques. We also provide our experimental results in detail with both simulation and physical implementation to show that the UAV team can successfully learn to accomplish the task.

Submitted 16 September, 2018; v1 submitted 20 March, 2018; originally announced March 2018.

arXiv:1801.05086

Autonomous UAV Navigation Using Reinforcement Learning

Authors: Huy X. Pham, Hung M. La, David Feil-Seifer, Luan V. Nguyen

Abstract: Unmanned aerial vehicles (UAVs) are commonly used for missions in unknown environments, where an exact mathematical model of the environment may not be available. This paper provides a framework for using reinforcement learning to allow the UAV to navigate successfully in such environments. We conducted our simulation and real implementation to show how the UAVs can successfully learn to navigate through an unknown environment. Technical aspects of applying a reinforcement learning algorithm to a UAV system and UAV flight control were also addressed. This will enable continuing research using a UAV with learning capabilities in more important applications, such as wildfire monitoring, or search and rescue missions.

Submitted 15 January, 2018; originally announced January 2018.

arXiv:1710.00920

End-to-end Learning for 3D Facial Animation from Raw Waveforms of Speech

Authors: Hai X. Pham, Yuting Wang, Vladimir Pavlovic

Abstract: We present a deep learning framework for real-time speech-driven 3D facial animation from just raw waveforms. Our deep neural network directly maps an input sequence of speech audio to a series of micro facial action unit activations and head rotations to drive a 3D blendshape face model. In particular, our deep model is able to learn the latent representations of time-varying contextual information and affective states within the speech. Hence, our model not only activates appropriate facial action units at inference to depict different utterance generating actions, in the form of lip movements, but also, without any assumption, automatically estimates emotional intensity of the speaker and reproduces her ever-changing affective states by adjusting strength of facial unit activations. For example, in a happy speech, the mouth opens wider than normal, while other facial units are relaxed; or in a surprised state, both eyebrows raise higher. Experiments on a diverse audiovisual corpus of different actors across a wide range of emotional states show interesting and promising results of our approach. Being speaker-independent, our generalized model is readily applicable to various tasks in human-machine interaction and animation.

Submitted 7 December, 2017; v1 submitted 2 October, 2017; originally announced October 2017.

arXiv:1704.02630

A Distributed Control Framework for a Team of Unmanned Aerial Vehicles for Dynamic Wildfire Tracking

Authors: Huy X. Pham, Hung M. La, David Feil-Seifer, Matthew Deans

Abstract: Wildland fire fighting is a very dangerous job, and the lack of information about the fire front is one of the main reasons for many accidents. Using unmanned aerial vehicles (UAVs) to cover wildfire is promising because it can replace humans in hazardous fire tracking and save operation costs significantly. In this paper we propose a distributed control framework designed for a team of UAVs that can closely monitor a wildfire in open space, and precisely track its development. The UAV team, designed for flexible deployment, can effectively avoid in-flight collisions as well as cooperate well with other neighbors. Experimental results are conducted to demonstrate the capabilities of the UAV team in covering a spreading wildfire.

Submitted 9 April, 2017; originally announced April 2017.

arXiv:1507.02779

Robust Performance-driven 3D Face Tracking in Long Range Depth Scenes

Authors: Hai X. Pham, Chongyu Chen, Luc N. Dao, Vladimir Pavlovic, Jianfei Cai, Tat-jen Cham

Abstract: We introduce a novel robust hybrid 3D face tracking framework from RGBD video streams, which is capable of tracking head pose and facial actions without pre-calibration or intervention from a user. In particular, we emphasize improving the tracking performance in instances where the tracked subject is at a large distance from the cameras, and the quality of the point cloud deteriorates severely. This is accomplished by the combination of a flexible 3D shape regressor and the joint 2D+3D optimization on shape parameters. Our approach fits facial blendshapes to the point cloud of the human head, while being driven by an efficient and rapid 3D shape regressor trained on generic RGB datasets. As an on-line tracking system, the identity of the unknown user is adapted on-the-fly resulting in improved 3D model reconstruction and consequently better tracking performance. The result is a robust RGBD face tracker, capable of handling a wide range of target scene depths, beyond those that can be afforded by traditional depth or RGB face trackers. Lastly, since the blendshape is not able to accurately recover the real facial shape, we use the tracked 3D face model as a prior in a novel filtering process to further refine the depth map for use in other tasks, such as 3D reconstruction.

Submitted 10 July, 2015; originally announced July 2015.

Comments: 10 pages, 8 figures, 4 tables

Showing 1–14 of 14 results for author: Pham, H X