Search | arXiv e-print repository

Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track

Authors: Ronak Pradeep, Nandan Thakur, Sahel Sharifymoghaddam, Eric Zhang, Ryan Nguyen, Daniel Campos, Nick Craswell, Jimmy Lin

Abstract: Did you try out the new Bing Search? Or maybe you fiddled around with Google AI~Overviews? These might sound familiar because the modern-day search stack has recently evolved to include retrieval-augmented generation (RAG) systems. They allow searching and incorporating real-time data into large language models (LLMs) to provide a well-informed, attributed, concise summary in contrast to the tradi… ▽ More Did you try out the new Bing Search? Or maybe you fiddled around with Google AI~Overviews? These might sound familiar because the modern-day search stack has recently evolved to include retrieval-augmented generation (RAG) systems. They allow searching and incorporating real-time data into large language models (LLMs) to provide a well-informed, attributed, concise summary in contrast to the traditional search paradigm that relies on displaying a ranked list of documents. Therefore, given these recent advancements, it is crucial to have an arena to build, test, visualize, and systematically evaluate RAG-based search systems. With this in mind, we propose the TREC 2024 RAG Track to foster innovation in evaluating RAG systems. In our work, we lay out the steps we've made towards making this track a reality -- we describe the details of our reusable framework, Ragnarök, explain the curation of the new MS MARCO V2.1 collection choice, release the development topics for the track, and standardize the I/O definitions which assist the end user. Next, using Ragnarök, we identify and provide key industrial baselines such as OpenAI's GPT-4o or Cohere's Command R+. Further, we introduce a web-based user interface for an interactive arena allowing benchmarking pairwise RAG systems by crowdsourcing. We open-source our Ragnarök framework and baselines to achieve a unified standard for future RAG systems. △ Less

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2403.16205 [pdf, other]

Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains

Authors: Bang-Dang Pham, Phong Tran, Anh Tran, Cuong Pham, Rang Nguyen, Minh Hoai

Abstract: This paper presents an innovative framework designed to train an image deblurring algorithm tailored to a specific camera device. This algorithm works by transforming a blurry input image, which is challenging to deblur, into another blurry image that is more amenable to deblurring. The transformation process, from one blurry state to another, leverages unpaired data consisting of sharp and blurry… ▽ More This paper presents an innovative framework designed to train an image deblurring algorithm tailored to a specific camera device. This algorithm works by transforming a blurry input image, which is challenging to deblur, into another blurry image that is more amenable to deblurring. The transformation process, from one blurry state to another, leverages unpaired data consisting of sharp and blurry images captured by the target camera device. Learning this blur-to-blur transformation is inherently simpler than direct blur-to-sharp conversion, as it primarily involves modifying blur patterns rather than the intricate task of reconstructing fine image details. The efficacy of the proposed approach has been demonstrated through comprehensive experiments on various benchmarks, where it significantly outperforms state-of-the-art methods both quantitatively and qualitatively. Our code and data are available at https://zero1778.github.io/blur2blur/ △ Less

Submitted 24 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024

arXiv:2311.01650 [pdf, other]

doi 10.18653/v1/2023.crac-main.7

MARRS: Multimodal Reference Resolution System

Authors: Halim Cagri Ates, Shruti Bhargava, Site Li, Jiarui Lu, Siddhardha Maddula, Joel Ruben Antony Moniz, Anil Kumar Nalamalapu, Roman Hoang Nguyen, Melis Ozyildirim, Alkesh Patel, Dhivya Piraviperumal, Vincent Renkens, Ankit Samal, Thy Tran, Bo-Hsiang Tseng, Hong Yu, Yuan Zhang, Rong Zou

Abstract: Successfully handling context is essential for any dialog understanding task. This context maybe be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution Sy… ▽ More Successfully handling context is essential for any dialog understanding task. This context maybe be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system, responsible for handling conversational, visual and background context. In particular, we present different machine learning models to enable handing contextual queries; specifically, one to enable reference resolution, and one to handle context via query rewriting. We also describe how these models complement each other to form a unified, coherent, lightweight system that can understand context while preserving user privacy. △ Less

Submitted 2 November, 2023; originally announced November 2023.

Comments: Sixth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2023)

arXiv:2310.00184 [pdf, other]

NASU -- Novel Actuating Screw Unit: Origami-inspired Screw-based Propulsion on Mobile Ground Robots

Authors: Calvin Joyce, Jason Lim, Roger Nguyen, Michael Owens, Sara Wickenhiser, Elizabeth Peiros, Florian Richter, Michael C. Yip

Abstract: Screw-based locomotion is a robust method of locomotion across a wide range of media including water, sand, and gravel. A challenge with screws is their significant number of impactful design parameters that affect locomotion performance. One crucial parameter is the angle of attack (also called the lead angle), which has been shown to significantly impact the performance of screw propellers in te… ▽ More Screw-based locomotion is a robust method of locomotion across a wide range of media including water, sand, and gravel. A challenge with screws is their significant number of impactful design parameters that affect locomotion performance. One crucial parameter is the angle of attack (also called the lead angle), which has been shown to significantly impact the performance of screw propellers in terms of traveling velocity, force produced, degree of slip, and sinkage. As a result, the optimal design choice may vary significantly depending on application and mission objectives. In this work, we present the Novel Actuating Screw Unit (NASU). It is the first screw-based propulsion design that enables dynamic reconfiguration of the angle of attack for optimized locomotion across multiple media and use cases. The design is inspired by the kresling unit, a mechanism from origami robotics, and the angle of attack is adjusted with a linear actuator. In contrast, the entire unit is spun on its axis to generate propulsion. NASU is integrated into a mobile test bed and experiments are conducted in various media including gravel, grass, and sand. Our experiment results indicate a trade-off between locomotive efficiency and velocity exists regarding angle of attack, and the proposed design is a promising direction for reconfigurable screws by allowing control to optimize for efficiency or velocity. △ Less

Submitted 13 May, 2024; v1 submitted 29 September, 2023; originally announced October 2023.

Comments: 7 pages, 8 Figures, submitted to IROS 2024

arXiv:2304.01686 [pdf, other]

HyperCUT: Video Sequence from a Single Blurry Image using Unsupervised Ordering

Authors: Bang-Dang Pham, Phong Tran, Anh Tran, Cuong Pham, Rang Nguyen, Minh Hoai

Abstract: We consider the challenging task of training models for image-to-video deblurring, which aims to recover a sequence of sharp images corresponding to a given blurry image input. A critical issue disturbing the training of an image-to-video model is the ambiguity of the frame ordering since both the forward and backward sequences are plausible solutions. This paper proposes an effective self-supervi… ▽ More We consider the challenging task of training models for image-to-video deblurring, which aims to recover a sequence of sharp images corresponding to a given blurry image input. A critical issue disturbing the training of an image-to-video model is the ambiguity of the frame ordering since both the forward and backward sequences are plausible solutions. This paper proposes an effective self-supervised ordering scheme that allows training high-quality image-to-video deblurring models. Unlike previous methods that rely on order-invariant losses, we assign an explicit order for each video sequence, thus avoiding the order-ambiguity issue. Specifically, we map each video sequence to a vector in a latent high-dimensional space so that there exists a hyperplane such that for every video sequence, the vectors extracted from it and its reversed sequence are on different sides of the hyperplane. The side of the vectors will be used to define the order of the corresponding sequence. Last but not least, we propose a real-image dataset for the image-to-video deblurring problem that covers a variety of popular domains, including face, hand, and street. Extensive experimental results confirm the effectiveness of our method. Code and data are available at https://github.com/VinAIResearch/HyperCUT.git △ Less

Submitted 5 April, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

Comments: Accepted to CVPR 2023

arXiv:2210.15897 [pdf, other]

Single-Image HDR Reconstruction by Multi-Exposure Generation

Authors: Phuoc-Hieu Le, Quynh Le, Rang Nguyen, Binh-Son Hua

Abstract: High dynamic range (HDR) imaging is an indispensable technique in modern photography. Traditional methods focus on HDR reconstruction from multiple images, solving the core problems of image alignment, fusion, and tone map**, yet having a perfect solution due to ghosting and other visual artifacts in the reconstruction. Recent attempts at single-image HDR reconstruction show a promising alternat… ▽ More High dynamic range (HDR) imaging is an indispensable technique in modern photography. Traditional methods focus on HDR reconstruction from multiple images, solving the core problems of image alignment, fusion, and tone map**, yet having a perfect solution due to ghosting and other visual artifacts in the reconstruction. Recent attempts at single-image HDR reconstruction show a promising alternative: by learning to map pixel values to their irradiance using a neural network, one can bypass the align-and-merge pipeline completely yet still obtain a high-quality HDR image. In this work, we propose a weakly supervised learning method that inverts the physical image formation process for HDR reconstruction via learning to generate multiple exposures from a single image. Our neural network can invert the camera response to reconstruct pixel irradiance before synthesizing multiple exposures and hallucinating details in under- and over-exposed regions from a single input image. To train the network, we propose a representation loss, a reconstruction loss, and a perceptual loss applied on pairs of under- and over-exposure images and thus do not require HDR images for training. Our experiments show that our proposed model can effectively reconstruct HDR images. Our qualitative and quantitative results show that our method achieves state-of-the-art performance on the DrTMO dataset. Our code is available at https://github.com/VinAIResearch/single_image_hdr. △ Less

Submitted 28 October, 2022; originally announced October 2022.

Comments: WACV 2023 paper. 8 pages of content, 2 pages of references, 8 pages of supplementary material

arXiv:2210.00712 [pdf, other]

PSENet: Progressive Self-Enhancement Network for Unsupervised Extreme-Light Image Enhancement

Authors: Hue Nguyen, Diep Tran, Khoi Nguyen, Rang Nguyen

Abstract: The extremes of lighting (e.g. too much or too little light) usually cause many troubles for machine and human vision. Many recent works have mainly focused on under-exposure cases where images are often captured in low-light conditions (e.g. nighttime) and achieved promising results for enhancing the quality of images. However, they are inferior to handling images under over-exposure. To mitigate… ▽ More The extremes of lighting (e.g. too much or too little light) usually cause many troubles for machine and human vision. Many recent works have mainly focused on under-exposure cases where images are often captured in low-light conditions (e.g. nighttime) and achieved promising results for enhancing the quality of images. However, they are inferior to handling images under over-exposure. To mitigate this limitation, we propose a novel unsupervised enhancement framework which is robust against various lighting conditions while does not require any well-exposed images to serve as the ground-truths. Our main concept is to construct pseudo-ground-truth images synthesized from multiple source images that simulate all potential exposure scenarios to train the enhancement network. Our extensive experiments show that the proposed approach consistently outperforms the current state-of-the-art unsupervised counterparts in several public datasets in terms of both quantitative metrics and qualitative results. Our code is available at https://github.com/VinAIResearch/PSENet-Image-Enhancement. △ Less

Submitted 3 October, 2022; originally announced October 2022.

Comments: Accepted to WACV 2023

arXiv:2207.10785 [pdf, other]

Inductive and Transductive Few-Shot Video Classification via Appearance and Temporal Alignments

Authors: Khoi D. Nguyen, Quoc-Huy Tran, Khoi Nguyen, Binh-Son Hua, Rang Nguyen

Abstract: We present a novel method for few-shot video classification, which performs appearance and temporal alignments. In particular, given a pair of query and support videos, we conduct appearance alignment via frame-level feature matching to achieve the appearance similarity score between the videos, while utilizing temporal order-preserving priors for obtaining the temporal similarity score between th… ▽ More We present a novel method for few-shot video classification, which performs appearance and temporal alignments. In particular, given a pair of query and support videos, we conduct appearance alignment via frame-level feature matching to achieve the appearance similarity score between the videos, while utilizing temporal order-preserving priors for obtaining the temporal similarity score between the videos. Moreover, we introduce a few-shot video classification framework that leverages the above appearance and temporal similarity scores across multiple steps, namely prototype-based training and testing as well as inductive and transductive prototype refinement. To the best of our knowledge, our work is the first to explore transductive few-shot video classification. Extensive experiments on both Kinetics and Something-Something V2 datasets show that both appearance and temporal alignments are crucial for datasets with temporal order sensitivity such as Something-Something V2. Our approach achieves similar or better results than previous methods on both datasets. Our code is available at https://github.com/VinAIResearch/fsvc-ata. △ Less

Submitted 21 July, 2022; originally announced July 2022.

Comments: Accepted to ECCV 2022

arXiv:2206.07772 [pdf, other]

Deep Learning and Handheld Augmented Reality Based System for Optimal Data Collection in Fault Diagnostics Domain

Authors: Ryan Nguyen, Rahul Rai

Abstract: Compared to current AI or robotic systems, humans navigate their environment with ease, making tasks such as data collection trivial. However, humans find it harder to model complex relationships hidden in the data. AI systems, especially deep learning (DL) algorithms, impressively capture those complex relationships. Symbiotically coupling humans and computational machines' strengths can simultan… ▽ More Compared to current AI or robotic systems, humans navigate their environment with ease, making tasks such as data collection trivial. However, humans find it harder to model complex relationships hidden in the data. AI systems, especially deep learning (DL) algorithms, impressively capture those complex relationships. Symbiotically coupling humans and computational machines' strengths can simultaneously minimize the collected data required and build complex input-to-output map** models. This paper enables this coupling by presenting a novel human-machine interaction framework to perform fault diagnostics with minimal data. Collecting data for diagnosing faults for complex systems is difficult and time-consuming. Minimizing the required data will increase the practicability of data-driven models in diagnosing faults. The framework provides instructions to a human user to collect data that mitigates the difference between the data used to train and test the fault diagnostics model. The framework is composed of three components: (1) a reinforcement learning algorithm for data collection to develop a training dataset, (2) a deep learning algorithm for diagnosing faults, and (3) a handheld augmented reality application for data collection for testing data. The proposed framework has provided above 100\% precision and recall on a novel dataset with only one instance of each fault condition. Additionally, a usability study was conducted to gauge the user experience of the handheld augmented reality application, and all users were able to follow the provided steps. △ Less

Submitted 15 June, 2022; originally announced June 2022.

arXiv:2206.07762 [pdf, other]

doi 10.1016/j.ymssp.2022.109611

Physics-Infused Fuzzy Generative Adversarial Network for Robust Failure Prognosis

Authors: Ryan Nguyen, Shubhendu Kumar Singh, Rahul Rai

Abstract: Prognostics aid in the longevity of fielded systems or products. Quantifying the system's current health enable prognosis to enhance the operator's decision-making to preserve the system's health. Creating a prognosis for a system can be difficult due to (a) unknown physical relationships and/or (b) irregularities in data appearing well beyond the initiation of a problem. Traditionally, three diff… ▽ More Prognostics aid in the longevity of fielded systems or products. Quantifying the system's current health enable prognosis to enhance the operator's decision-making to preserve the system's health. Creating a prognosis for a system can be difficult due to (a) unknown physical relationships and/or (b) irregularities in data appearing well beyond the initiation of a problem. Traditionally, three different modeling paradigms have been used to develop a prognostics model: physics-based (PbM), data-driven (DDM), and hybrid modeling. Recently, the hybrid modeling approach that combines the strength of both PbM and DDM based approaches and alleviates their limitations is gaining traction in the prognostics domain. In this paper, a novel hybrid modeling approach for prognostics applications based on combining concepts from fuzzy logic and generative adversarial networks (GANs) is outlined. The FuzzyGAN based method embeds a physics-based model in the aggregation of the fuzzy implications. This technique constrains the output of the learning method to a realistic solution. Results on a bearing problem showcases the efficacy of adding a physics-based aggregation in a fuzzy logic model to improve GAN's ability to model health and give a more accurate system prognosis. △ Less

Submitted 15 June, 2022; originally announced June 2022.

arXiv:2206.04679 [pdf, other]

POODLE: Improving Few-shot Learning via Penalizing Out-of-Distribution Samples

Authors: Duong H. Le, Khoi D. Nguyen, Khoi Nguyen, Quoc-Huy Tran, Rang Nguyen, Binh-Son Hua

Abstract: In this work, we propose to use out-of-distribution samples, i.e., unlabeled samples coming from outside the target classes, to improve few-shot learning. Specifically, we exploit the easily available out-of-distribution samples to drive the classifier to avoid irrelevant features by maximizing the distance from prototypes to out-of-distribution samples while minimizing that of in-distribution sam… ▽ More In this work, we propose to use out-of-distribution samples, i.e., unlabeled samples coming from outside the target classes, to improve few-shot learning. Specifically, we exploit the easily available out-of-distribution samples to drive the classifier to avoid irrelevant features by maximizing the distance from prototypes to out-of-distribution samples while minimizing that of in-distribution samples (i.e., support, query data). Our approach is simple to implement, agnostic to feature extractors, lightweight without any additional cost for pre-training, and applicable to both inductive and transductive settings. Extensive experiments on various standard benchmarks demonstrate that the proposed method consistently improves the performance of pretrained networks with different architectures. △ Less

Submitted 8 June, 2022; originally announced June 2022.

Comments: Accepted at NeurIPS 2021 (First two authors contribute equally)

arXiv:2202.06226 [pdf, other]

Feature Construction and Selection for PV Solar Power Modeling

Authors: Yu Yang, Jia Mao, Richard Nguyen, Annas Tohmeh, Hen-Geul Yeh

Abstract: Using solar power in the process industry can reduce greenhouse gas emissions and make the production process more sustainable. However, the intermittent nature of solar power renders its usage challenging. Building a model to predict photovoltaic (PV) power generation allows decision-makers to hedge energy shortages and further design proper operations. The solar power output is time-series data… ▽ More Using solar power in the process industry can reduce greenhouse gas emissions and make the production process more sustainable. However, the intermittent nature of solar power renders its usage challenging. Building a model to predict photovoltaic (PV) power generation allows decision-makers to hedge energy shortages and further design proper operations. The solar power output is time-series data dependent on many factors, such as irradiance and weather. A machine learning framework for 1-hour ahead solar power prediction is developed in this paper based on the historical data. Our method extends the input dataset into higher dimensional Chebyshev polynomial space. Then, a feature selection scheme is developed with constrained linear regression to construct the predictor for different weather types. Several tests show that the proposed approach yields lower mean squared error than classical machine learning methods, such as support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT). △ Less

Submitted 13 February, 2022; originally announced February 2022.

Comments: 6 pages, 8 figures

Journal ref: Adconip 2022

arXiv:2112.01398 [pdf, other]

TISE: Bag of Metrics for Text-to-Image Synthesis Evaluation

Authors: Tan M. Dinh, Rang Nguyen, Binh-Son Hua

Abstract: In this paper, we conduct a study on the state-of-the-art methods for text-to-image synthesis and propose a framework to evaluate these methods. We consider syntheses where an image contains a single or multiple objects. Our study outlines several issues in the current evaluation pipeline: (i) for image quality assessment, a commonly used metric, e.g., Inception Score (IS), is often either miscali… ▽ More In this paper, we conduct a study on the state-of-the-art methods for text-to-image synthesis and propose a framework to evaluate these methods. We consider syntheses where an image contains a single or multiple objects. Our study outlines several issues in the current evaluation pipeline: (i) for image quality assessment, a commonly used metric, e.g., Inception Score (IS), is often either miscalibrated for the single-object case or misused for the multi-object case; (ii) for text relevance and object accuracy assessment, there is an overfitting phenomenon in the existing R-precision (RP) and Semantic Object Accuracy (SOA) metrics, respectively; (iii) for multi-object case, many vital factors for evaluation, e.g., object fidelity, positional alignment, counting alignment, are largely dismissed; (iv) the ranking of the methods based on current metrics is highly inconsistent with real images. To overcome these issues, we propose a combined set of existing and new metrics to systematically evaluate the methods. For existing metrics, we offer an improved version of IS named IS* by using temperature scaling to calibrate the confidence of the classifier used by IS; we also propose a solution to mitigate the overfitting issues of RP and SOA. For new metrics, we develop counting alignment, positional alignment, object-centric IS, and object-centric FID metrics for evaluating the multi-object case. We show that benchmarking with our bag of metrics results in a highly consistent ranking among existing methods that is well-aligned with human evaluation. As a by-product, we create AttnGAN++, a simple but strong baseline for the benchmark by stabilizing the training of AttnGAN using spectral normalization. We also release our toolbox, so-called TISE, for advocating fair and consistent evaluation of text-to-image models. △ Less

Submitted 19 July, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

Comments: Accepted to ECCV 2022; TISE toolbox is available at https://github.com/VinAIResearch/tise-toolbox

arXiv:2112.00719 [pdf, other]

HyperInverter: Improving StyleGAN Inversion via Hypernetwork

Authors: Tan M. Dinh, Anh Tuan Tran, Rang Nguyen, Binh-Son Hua

Abstract: Real-world image manipulation has achieved fantastic progress in recent years as a result of the exploration and utilization of GAN latent spaces. GAN inversion is the first step in this pipeline, which aims to map the real image to the latent code faithfully. Unfortunately, the majority of existing GAN inversion methods fail to meet at least one of the three requirements listed below: high recons… ▽ More Real-world image manipulation has achieved fantastic progress in recent years as a result of the exploration and utilization of GAN latent spaces. GAN inversion is the first step in this pipeline, which aims to map the real image to the latent code faithfully. Unfortunately, the majority of existing GAN inversion methods fail to meet at least one of the three requirements listed below: high reconstruction quality, editability, and fast inference. We present a novel two-phase strategy in this research that fits all requirements at the same time. In the first phase, we train an encoder to map the input image to StyleGAN2 $\mathcal{W}$-space, which was proven to have excellent editability but lower reconstruction quality. In the second phase, we supplement the reconstruction ability in the initial phase by leveraging a series of hypernetworks to recover the missing information during inversion. These two steps complement each other to yield high reconstruction quality thanks to the hypernetwork branch and excellent editability due to the inversion done in the $\mathcal{W}$-space. Our method is entirely encoder-based, resulting in extremely fast inference. Extensive experiments on two challenging datasets demonstrate the superiority of our method. △ Less

Submitted 4 April, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

Comments: Accepted to CVPR 2022; Project page is located at https://di-mi-ta.github.io/HyperInverter/

arXiv:2110.14588 [pdf, other]

Fuzzy Generative Adversarial Networks

Authors: Ryan Nguyen, Shubhendu Kumar Singh, Rahul Rai

Abstract: Generative Adversarial Networks (GANs) are well-known tools for data generation and semi-supervised classification. GANs, with less labeled data, outperform Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) in classification across various tasks, this shows promise for develo** GANs capable of trespassing into the domain of semi-supervised regression. However, develo** GANs… ▽ More Generative Adversarial Networks (GANs) are well-known tools for data generation and semi-supervised classification. GANs, with less labeled data, outperform Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) in classification across various tasks, this shows promise for develo** GANs capable of trespassing into the domain of semi-supervised regression. However, develo** GANs for regression introduce two major challenges: (1) inherent instability in the GAN formulation and (2) performing regression and achieving stability simultaneously. This paper introduces techniques that show improvement in the GANs' regression capability through mean absolute error (MAE) and mean squared error (MSE). We bake a differentiable fuzzy logic system at multiple locations in a GAN because fuzzy logic systems have demonstrated high efficacy in classification and regression settings. The fuzzy logic takes the output of either or both the generator and the discriminator to either or both predict the output, $y$, and evaluate the generator's performance. We outline the results of applying the fuzzy logic system to CGAN and summarize each approach's efficacy. This paper shows that adding a fuzzy logic layer can enhance GAN's ability to perform regression; the most desirable injection location is problem-specific, and we show this through experiments over various datasets. Besides, we demonstrate empirically that the fuzzy-infused GAN is competitive with DNNs. △ Less

Submitted 27 October, 2021; originally announced October 2021.

arXiv:2110.06416 [pdf, other]

MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants

Authors: Alkesh Patel, Joel Ruben Antony Moniz, Roman Nguyen, Nick Tzou, Hadas Kotek, Vincent Renkens

Abstract: In multimodal assistant, where vision is also one of the input modalities, the identification of user intent becomes a challenging task as visual input can influence the outcome. Current digital assistants take spoken input and try to determine the user intent from conversational or device context. So, a dataset, which includes visual input (i.e. images or videos for the corresponding questions ta… ▽ More In multimodal assistant, where vision is also one of the input modalities, the identification of user intent becomes a challenging task as visual input can influence the outcome. Current digital assistants take spoken input and try to determine the user intent from conversational or device context. So, a dataset, which includes visual input (i.e. images or videos for the corresponding questions targeted for multimodal assistant use cases, is not readily available. The research in visual question answering (VQA) and visual question generation (VQG) is a great step forward. However, they do not capture questions that a visually-abled person would ask multimodal assistants. Moreover, many times questions do not seek information from external knowledge. In this paper, we provide a new dataset, MMIU (MultiModal Intent Understanding), that contains questions and corresponding intents provided by human annotators while looking at images. We, then, use this dataset for intent classification task in multimodal digital assistant. We also experiment with various approaches for combining vision and language features including the use of multimodal transformer for classification of image-question pairs into 14 intents. We provide the benchmark results and discuss the role of visual and text features for the intent classification task on our dataset. △ Less

Submitted 30 October, 2021; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: Extended abstract accepted for WeCNLP 2021

arXiv:2101.02637 [pdf, other]

A Large-Scale, Time-Synchronized Visible and Thermal Face Dataset

Authors: Domenick Poster, Matthew Thielke, Robert Nguyen, Srinivasan Rajaraman, Xing Di, Cedric Nimpa Fondje, Vishal M. Patel, Nathaniel J. Short, Benjamin S. Riggan, Nasser M. Nasrabadi, Shuowen Hu

Abstract: Thermal face imagery, which captures the naturally emitted heat from the face, is limited in availability compared to face imagery in the visible spectrum. To help address this scarcity of thermal face imagery for research and algorithm development, we present the DEVCOM Army Research Laboratory Visible-Thermal Face Dataset (ARL-VTF). With over 500,000 images from 395 subjects, the ARL-VTF dataset… ▽ More Thermal face imagery, which captures the naturally emitted heat from the face, is limited in availability compared to face imagery in the visible spectrum. To help address this scarcity of thermal face imagery for research and algorithm development, we present the DEVCOM Army Research Laboratory Visible-Thermal Face Dataset (ARL-VTF). With over 500,000 images from 395 subjects, the ARL-VTF dataset represents, to the best of our knowledge, the largest collection of paired visible and thermal face images to date. The data was captured using a modern long wave infrared (LWIR) camera mounted alongside a stereo setup of three visible spectrum cameras. Variability in expressions, pose, and eyewear has been systematically recorded. The dataset has been curated with extensive annotations, metadata, and standardized protocols for evaluation. Furthermore, this paper presents extensive benchmark results and analysis on thermal face landmark detection and thermal-to-visible face verification by evaluating state-of-the-art models on the ARL-VTF dataset. △ Less

Submitted 7 January, 2021; originally announced January 2021.

arXiv:1809.05477 [pdf, other]

Project AutoVision: Localization and 3D Scene Perception for an Autonomous Vehicle with a Multi-Camera System

Authors: Lionel Heng, Benjamin Choi, Zhaopeng Cui, Marcel Geppert, Sixing Hu, Benson Kuan, Peidong Liu, Rang Nguyen, Ye Chuan Yeo, Andreas Geiger, Gim Hee Lee, Marc Pollefeys, Torsten Sattler

Abstract: Project AutoVision aims to develop localization and 3D scene perception capabilities for a self-driving vehicle. Such capabilities will enable autonomous navigation in urban and rural environments, in day and night, and with cameras as the only exteroceptive sensors. The sensor suite employs many cameras for both 360-degree coverage and accurate multi-view stereo; the use of low-cost cameras keeps… ▽ More Project AutoVision aims to develop localization and 3D scene perception capabilities for a self-driving vehicle. Such capabilities will enable autonomous navigation in urban and rural environments, in day and night, and with cameras as the only exteroceptive sensors. The sensor suite employs many cameras for both 360-degree coverage and accurate multi-view stereo; the use of low-cost cameras keeps the cost of this sensor suite to a minimum. In addition, the project seeks to extend the operating envelope to include GNSS-less conditions which are typical for environments with tall buildings, foliage, and tunnels. Emphasis is placed on leveraging multi-view geometry and deep learning to enable the vehicle to localize and perceive in 3D space. This paper presents an overview of the project, and describes the sensor suite and current progress in the areas of calibration, localization, and perception. △ Less

Submitted 4 March, 2019; v1 submitted 14 September, 2018; originally announced September 2018.

Journal ref: 2019 IEEE International Conference on Robotics and Automation (ICRA)

Showing 1–18 of 18 results for author: Nguyen, R