Search | arXiv e-print repository

Multimodal Learning and Cognitive Processes in Radiology: MedGaze for Chest X-ray Scanpath Prediction

Authors: Akash Awasthi, Ngan Le, Zhigang Deng, Rishi Agrawal, Carol C. Wu, Hien Van Nguyen

Abstract: Predicting human gaze behavior within computer vision is integral for develo** interactive systems that can anticipate user attention, address fundamental questions in cognitive science, and hold implications for fields like human-computer interaction (HCI) and augmented/virtual reality (AR/VR) systems. Despite methodologies introduced for modeling human eye gaze behavior, applying these models… ▽ More Predicting human gaze behavior within computer vision is integral for develo** interactive systems that can anticipate user attention, address fundamental questions in cognitive science, and hold implications for fields like human-computer interaction (HCI) and augmented/virtual reality (AR/VR) systems. Despite methodologies introduced for modeling human eye gaze behavior, applying these models to medical imaging for scanpath prediction remains unexplored. Our proposed system aims to predict eye gaze sequences from radiology reports and CXR images, potentially streamlining data collection and enhancing AI systems using larger datasets. However, predicting human scanpaths on medical images presents unique challenges due to the diverse nature of abnormal regions. Our model predicts fixation coordinates and durations critical for medical scanpath prediction, outperforming existing models in the computer vision community. Utilizing a two-stage training process and large publicly available datasets, our approach generates static heatmaps and eye gaze videos aligned with radiology reports, facilitating comprehensive analysis. We validate our approach by comparing its performance with state-of-the-art methods and assessing its generalizability among different radiologists, introducing novel strategies to model radiologists' search patterns during CXR image diagnosis. Based on the radiologist's evaluation, MedGaze can generate human-like gaze sequences with a high focus on relevant regions over the CXR images. It sometimes also outperforms humans in terms of redundancy and randomness in the scanpaths. △ Less

Submitted 28 June, 2024; originally announced July 2024.

Comments: Submitted to the Journal

arXiv:2406.19686 [pdf]

Enhancing Radiological Diagnosis: A Collaborative Approach Integrating AI and Human Expertise for Visual Miss Correction

Authors: Akash Awasthi, Ngan Le, Zhigang Deng, Carol C. Wu, Hien Van Nguyen

Abstract: Human-AI collaboration to identify and correct perceptual errors in chest radiographs has not been previously explored. This study aimed to develop a collaborative AI system, CoRaX, which integrates eye gaze data and radiology reports to enhance diagnostic accuracy in chest radiology by pinpointing perceptual errors and refining the decision-making process. Using public datasets REFLACX and EGD-CX… ▽ More Human-AI collaboration to identify and correct perceptual errors in chest radiographs has not been previously explored. This study aimed to develop a collaborative AI system, CoRaX, which integrates eye gaze data and radiology reports to enhance diagnostic accuracy in chest radiology by pinpointing perceptual errors and refining the decision-making process. Using public datasets REFLACX and EGD-CXR, the study retrospectively developed CoRaX, employing a large multimodal model to analyze image embeddings, eye gaze data, and radiology reports. The system's effectiveness was evaluated based on its referral-making process, the quality of referrals, and performance in collaborative diagnostic settings. CoRaX was tested on a simulated error dataset of 271 samples with 28% (93 of 332) missed abnormalities. The system corrected 21% (71 of 332) of these errors, leaving 7% (22 of 312) unresolved. The Referral-Usefulness score, indicating the accuracy of predicted regions for all true referrals, was 0.63 (95% CI 0.59, 0.68). The Total-Usefulness score, reflecting the diagnostic accuracy of CoRaX's interactions with radiologists, showed that 84% (237 of 280) of these interactions had a score above 0.40. In conclusion, CoRaX efficiently collaborates with radiologists to address perceptual errors across various abnormalities, with potential applications in the education and training of novice radiologists. △ Less

Submitted 28 June, 2024; originally announced June 2024.

Comments: Under Review in Journal

arXiv:2406.12367 [pdf, other]

Competitive Learning for Achieving Content-specific Filters in Video Coding for Machines

Authors: Honglei Zhang, Jukka I. Ahonen, Nam Le, Ruiying Yang, Francesco Cricri

Abstract: This paper investigates the efficacy of jointly optimizing content-specific post-processing filters to adapt a human oriented video/image codec into a codec suitable for machine vision tasks. By observing that artifacts produced by video/image codecs are content-dependent, we propose a novel training strategy based on competitive learning principles. This strategy assigns training samples to filte… ▽ More This paper investigates the efficacy of jointly optimizing content-specific post-processing filters to adapt a human oriented video/image codec into a codec suitable for machine vision tasks. By observing that artifacts produced by video/image codecs are content-dependent, we propose a novel training strategy based on competitive learning principles. This strategy assigns training samples to filters dynamically, in a fuzzy manner, which further optimizes the winning filter on the given sample. Inspired by simulated annealing optimization techniques, we employ a softmax function with a temperature variable as the weight allocation function to mitigate the effects of random initialization. Our evaluation, conducted on a system utilizing multiple post-processing filters within a Versatile Video Coding (VVC) codec framework, demonstrates the superiority of content-specific filters trained with our proposed strategies, specifically, when images are processed in blocks. Using VVC reference software VTM 12.0 as the anchor, experiments on the OpenImages dataset show an improvement in the BD-rate reduction from -41.3% and -44.6% to -42.3% and -44.7% for object detection and instance segmentation tasks, respectively, compared to independently trained filters. The statistics of the filter usage align with our hypothesis and underscore the importance of jointly optimizing filters for both content and reconstruction quality. Our findings pave the way for further improving the performance of video/image codecs. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: Accepted to be preseneted in ICIP 2024

arXiv:2406.03431 [pdf, other]

CattleFace-RGBT: RGB-T Cattle Facial Landmark Benchmark

Authors: Ethan Coffman, Reagan Clark, Nhat-Tan Bui, Trong Thang Pham, Beth Kegley, Jeremy G. Powell, Jiangchao Zhao, Ngan Le

Abstract: To address this challenge, we introduce CattleFace-RGBT, a RGB-T Cattle Facial Landmark dataset consisting of 2,300 RGB-T image pairs, a total of 4,600 images. Creating a landmark dataset is time-consuming, but AI-assisted annotation can help. However, applying AI to thermal images is challenging due to suboptimal results from direct thermal training and infeasible RGB-thermal alignment due to dif… ▽ More To address this challenge, we introduce CattleFace-RGBT, a RGB-T Cattle Facial Landmark dataset consisting of 2,300 RGB-T image pairs, a total of 4,600 images. Creating a landmark dataset is time-consuming, but AI-assisted annotation can help. However, applying AI to thermal images is challenging due to suboptimal results from direct thermal training and infeasible RGB-thermal alignment due to different camera views. Therefore, we opt to transfer models trained on RGB to thermal images and refine them using our AI-assisted annotation tool following a semi-automatic annotation approach. Accurately localizing facial key points on both RGB and thermal images enables us to not only discern the cattle's respiratory signs but also measure temperatures to assess the animal's thermal state. To the best of our knowledge, this is the first dataset for the cattle facial landmark on RGB-T images. We conduct benchmarking of the CattleFace-RGBT dataset across various backbone architectures, with the objective of establishing baselines for future research, analysis, and comparison. The dataset and models are at https://github.com/UARK-AICV/CattleFace-RGBT-benchmark △ Less

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2406.00307 [pdf, other]

HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model

Authors: Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le

Abstract: Current video-language models (VLMs) rely extensively on instance-level alignment between video and language modalities, which presents two major limitations: (1) visual reasoning disobeys the natural perception that humans do in first-person perspective, leading to a lack of reasoning interpretation; and (2) learning is limited in capturing inherent fine-grained relationships between two modaliti… ▽ More Current video-language models (VLMs) rely extensively on instance-level alignment between video and language modalities, which presents two major limitations: (1) visual reasoning disobeys the natural perception that humans do in first-person perspective, leading to a lack of reasoning interpretation; and (2) learning is limited in capturing inherent fine-grained relationships between two modalities. In this paper, we take an inspiration from human perception and explore a compositional approach for egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grou** mechanism to explicitly assemble dynamically evolving scene entities through time and model their relationship for video representation. By leveraging compositional structure understanding, HENASY possesses strong interpretability via visual grounding with free-form text queries. We further explore a suite of multi-grained contrastive losses to facilitate entity-centric understandings. This comprises three alignment types: video-narration, noun-entity, verb-entities alignments. Our method demonstrates strong interpretability in both quantitative and qualitative experiments; while maintaining competitive performances on five downstream tasks via zero-shot transfer or as video/text representation, including video/text retrieval, action recognition, multi-choice query, natural language query, and moments query. △ Less

Submitted 6 June, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

Comments: under submission

arXiv:2405.16148 [pdf, other]

Accelerating Transformers with Spectrum-Preserving Token Merging

Authors: Hoai-Chau Tran, Duy M. H. Nguyen, Duy M. Nguyen, Trung-Tin Nguyen, Ngan Le, Pengtao Xie, Daniel Sonntag, James Y. Zou, Binh T. Nguyen, Mathias Niepert

Abstract: Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVa), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Pr… ▽ More Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVa), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Prior works have proposed algorithms based on Bipartite Soft Matching (BSM), which divides tokens into distinct sets and merges the top k similar tokens. However, these methods have significant drawbacks, such as sensitivity to token-splitting strategies and damage to informative tokens in later layers. This paper presents a novel paradigm called PiToMe, which prioritizes the preservation of informative tokens using an additional metric termed the energy score. This score identifies large clusters of similar tokens as high-energy, indicating potential candidates for merging, while smaller (unique and isolated) clusters are considered as low-energy and preserved. Experimental findings demonstrate that PiToMe saved from 40-60\% FLOPs of the base models while exhibiting superior off-the-shelf performance on image classification (0.5\% average performance drop of ViT-MAE-H compared to 2.6\% as baselines), image-text retrieval (0.3\% average performance drop of CLIP on Flickr30k compared to 4.5\% as others), and analogously in visual questions answering with LLaVa-7B. Furthermore, PiToMe is theoretically shown to preserve intrinsic spectral properties of the original token space under mild conditions △ Less

Submitted 25 May, 2024; originally announced May 2024.

Comments: Version 1

arXiv:2405.07994 [pdf]

BubbleID: A Deep Learning Framework for Bubble Interface Dynamics Analysis

Authors: Christy Dunlap, Changgen Li, Hari Pandey, Ngan Le, Han Hu

Abstract: This paper presents BubbleID, a sophisticated deep learning architecture designed to comprehensively identify both static and dynamic attributes of bubbles within sequences of boiling images. By amalgamating segmentation powered by Mask R-CNN with SORT-based tracking techniques, the framework is capable of analyzing each bubble's location, dimensions, interface shape, and velocity over its lifetim… ▽ More This paper presents BubbleID, a sophisticated deep learning architecture designed to comprehensively identify both static and dynamic attributes of bubbles within sequences of boiling images. By amalgamating segmentation powered by Mask R-CNN with SORT-based tracking techniques, the framework is capable of analyzing each bubble's location, dimensions, interface shape, and velocity over its lifetime, and capturing dynamic events such as bubble departure. BubbleID is trained and tested on boiling images across diverse heater surfaces and operational settings. This paper also offers a comparative analysis of bubble interface dynamics prior to and post-critical heat flux (CHF) conditions. △ Less

Submitted 20 March, 2024; originally announced May 2024.

Comments: 16 pages, 4 figures

arXiv:2405.04489 [pdf, other]

S3Former: Self-supervised High-resolution Transformer for Solar PV Profiling

Authors: Minh Tran, Adrian De Luis, Haitao Liao, Ying Huang, Roy McCann, Alan Mantooth, Jack Cothren, Ngan Le

Abstract: As the impact of climate change escalates, the global necessity to transition to sustainable energy sources becomes increasingly evident. Renewable energies have emerged as a viable solution for users, with Photovoltaic energy being a favored choice for small installations due to its reliability and efficiency. Accurate map** of PV installations is crucial for understanding the extension of its… ▽ More As the impact of climate change escalates, the global necessity to transition to sustainable energy sources becomes increasingly evident. Renewable energies have emerged as a viable solution for users, with Photovoltaic energy being a favored choice for small installations due to its reliability and efficiency. Accurate map** of PV installations is crucial for understanding the extension of its adoption and informing energy policy. To meet this need, we introduce S3Former, designed to segment solar panels from aerial imagery and provide size and location information critical for analyzing the impact of such installations on the grid. Solar panel identification is challenging due to factors such as varying weather conditions, roof characteristics, Ground Sampling Distance variations and lack of appropriate initialization weights for optimized training. To tackle these complexities, S3Former features a Masked Attention Mask Transformer incorporating a self-supervised learning pretrained backbone. Specifically, our model leverages low-level and high-level features extracted from the backbone and incorporates an instance query mechanism incorporated on the Transformer architecture to enhance the localization of solar PV installations. We introduce a self-supervised learning phase (pretext task) to improve the initialization weights on the backbone of S3Former. We evaluated S3Former using diverse datasets, demonstrate improvement state-of-the-art models. △ Less

Submitted 7 May, 2024; originally announced May 2024.

Comments: Preprint

arXiv:2405.02522 [pdf]

New contexts, old heuristics: How young people in India and the US trust online content in the age of generative AI

Authors: Rachel Xu, Nhu Le, Rebekah Park, Laura Murray, Vishnupriya Das, Devika Kumar, Beth Goldberg

Abstract: We conducted an in-person ethnography in India and the US to investigate how young people (18-24) trusted online content, with a focus on generative AI (GenAI). We had four key findings about how young people use GenAI and determine what to trust online. First, when online, we found participants fluidly shifted between mindsets and emotional states, which we term "information modes." Second, these… ▽ More We conducted an in-person ethnography in India and the US to investigate how young people (18-24) trusted online content, with a focus on generative AI (GenAI). We had four key findings about how young people use GenAI and determine what to trust online. First, when online, we found participants fluidly shifted between mindsets and emotional states, which we term "information modes." Second, these information modes shaped how and why participants trust GenAI and how they applied literacy skills. In the modes where they spent most of their time, they eschewed literacy skills. Third, with the advent of GenAI, participants imported existing trust heuristics from familiar online contexts into their interactions with GenAI. Fourth, although study participants had reservations about GenAI, they saw it as a requisite tool to adopt to keep up with the times. Participants valued efficiency above all else, and used GenAI to further their goals quickly at the expense of accuracy. Our findings suggest that young people spend the majority of their time online not concerned with truth because they are seeking only to pass the time. As a result, literacy interventions should be designed to intervene at the right time, to match users' distinct information modes, and to work with their existing fact-checking practices. △ Less

Submitted 3 May, 2024; originally announced May 2024.

Comments: 14 pages

arXiv:2404.19052 [pdf, other]

Exploring Weighted Property Approaches for RDF Graph Similarity Measure

Authors: Ngoc Luyen Le, Marie-Hélène Abel, Philippe Gouspillou

Abstract: Measuring similarity between RDF graphs is essential for various applications, including knowledge discovery, semantic web analysis, and recommender systems. However, traditional similarity measures often treat all properties equally, potentially overlooking the varying importance of different properties in different contexts. Consequently, exploring weighted property approaches for RDF graph simi… ▽ More Measuring similarity between RDF graphs is essential for various applications, including knowledge discovery, semantic web analysis, and recommender systems. However, traditional similarity measures often treat all properties equally, potentially overlooking the varying importance of different properties in different contexts. Consequently, exploring weighted property approaches for RDF graph similarity measure presents an intriguing avenue for investigation. Therefore, in this paper, we propose a weighted property approach for RDF graph similarity measure to address this limitation. Our approach incorporates the relative importance of properties into the similarity calculation, enabling a more nuanced and context-aware measures of similarity. We evaluate our approach through a comprehensive experimental study on an RDF graph dataset in the vehicle domain. Our results demonstrate that the proposed approach achieves promising accuracy and effectively reflects the perceived similarity between RDF graphs. △ Less

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.11429 [pdf, other]

CarcassFormer: An End-to-end Transformer-based Framework for Simultaneous Localization, Segmentation and Classification of Poultry Carcass Defect

Authors: Minh Tran, Sang Truong, Arthur F. A. Fernandes, Michael T. Kidd, Ngan Le

Abstract: In the food industry, assessing the quality of poultry carcasses during processing is a crucial step. This study proposes an effective approach for automating the assessment of carcass quality without requiring skilled labor or inspector involvement. The proposed system is based on machine learning (ML) and computer vision (CV) techniques, enabling automated defect detection and carcass quality as… ▽ More In the food industry, assessing the quality of poultry carcasses during processing is a crucial step. This study proposes an effective approach for automating the assessment of carcass quality without requiring skilled labor or inspector involvement. The proposed system is based on machine learning (ML) and computer vision (CV) techniques, enabling automated defect detection and carcass quality assessment. To this end, an end-to-end framework called CarcassFormer is introduced. It is built upon a Transformer-based architecture designed to effectively extract visual representations while simultaneously detecting, segmenting, and classifying poultry carcass defects. Our proposed framework is capable of analyzing imperfections resulting from production and transport welfare issues, as well as processing plant stunner, scalder, picker, and other equipment malfunctions. To benchmark the framework, a dataset of 7,321 images was initially acquired, which contained both single and multiple carcasses per image. In this study, the performance of the CarcassFormer system is compared with other state-of-the-art (SOTA) approaches for both classification, detection, and segmentation tasks. Through extensive quantitative experiments, our framework consistently outperforms existing methods, demonstrating remarkable improvements across various evaluation metrics such as AP, AP@50, and AP@75. Furthermore, the qualitative results highlight the strengths of CarcassFormer in capturing fine details, including feathers, and accurately localizing and segmenting carcasses with high precision. To facilitate further research and collaboration, the pre-trained model and source code of CarcassFormer is available for research purposes at: \url{https://github.com/UARK-AICV/CarcassFormer}. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Comments: Accepted to Poultry Science Journal

arXiv:2404.09951 [pdf, other]

Unifying Global and Local Scene Entities Modelling for Precise Action Spotting

Authors: Kim Hoang Tran, Phuc Vuong Do, Ngoc Quoc Ly, Ngan Le

Abstract: Sports videos pose complex challenges, including cluttered backgrounds, camera angle changes, small action-representing objects, and imbalanced action class distribution. Existing methods for detecting actions in sports videos heavily rely on global features, utilizing a backbone network as a black box that encompasses the entire spatial frame. However, these approaches tend to overlook the nuance… ▽ More Sports videos pose complex challenges, including cluttered backgrounds, camera angle changes, small action-representing objects, and imbalanced action class distribution. Existing methods for detecting actions in sports videos heavily rely on global features, utilizing a backbone network as a black box that encompasses the entire spatial frame. However, these approaches tend to overlook the nuances of the scene and struggle with detecting actions that occupy a small portion of the frame. In particular, they face difficulties when dealing with action classes involving small objects, such as balls or yellow/red cards in soccer, which only occupy a fraction of the screen space. To address these challenges, we introduce a novel approach that analyzes and models scene entities using an adaptive attention mechanism. Particularly, our model disentangles the scene content into the global environment feature and local relevant scene entities feature. To efficiently extract environmental features while considering temporal information with less computational cost, we propose the use of a 2D backbone network with a time-shift mechanism. To accurately capture relevant scene entities, we employ a Vision-Language model in conjunction with the adaptive attention mechanism. Our model has demonstrated outstanding performance, securing the 1st place in the SoccerNet-v2 Action Spotting, FineDiving, and FineGym challenge with a substantial performance improvement of 1.6, 2.0, and 1.3 points in avg-mAP compared to the runner-up methods. Furthermore, our approach offers interpretability capabilities in contrast to other deep learning models, which are often designed as black boxes. Our code and models are released at: https://github.com/Fsoft-AIC/unifying-global-local-feature. △ Less

Submitted 15 April, 2024; originally announced April 2024.

Comments: Accepted to IJCNN 2024

arXiv:2403.17606 [pdf, other]

Interactive Identification of Granular Materials using Force Measurements

Authors: Samuli Hynninen, Tran Nguyen Le, Ville Kyrki

Abstract: The ability to identify granular materials facilitates the emergence of various new applications in robotics, ranging from cooking at home to truck loading at mining sites. However, granular material identification remains a challenging and underexplored area. In this work, we present a novel interactive material identification framework that enables robots to identify a wide range of granular mat… ▽ More The ability to identify granular materials facilitates the emergence of various new applications in robotics, ranging from cooking at home to truck loading at mining sites. However, granular material identification remains a challenging and underexplored area. In this work, we present a novel interactive material identification framework that enables robots to identify a wide range of granular materials using only a force-torque sensor for perception. Our framework, comprising interactive exploration, feature extraction, and classification stages, prioritizes simplicity and transparency for seamless integration into various manipulation pipelines. We evaluate the proposed approach through extensive experiments with a real-world dataset comprising 11 granular materials, which we also make publicly available. Additionally, we conducted a comprehensive qualitative analysis of the dataset to offer deeper insights into its nature, aiding future development. Our results show that the proposed method is capable of accurately identifying a wide range of granular materials solely relying on force measurements obtained from direct interaction with the materials. Code and dataset are available at: https://irobotics.aalto.fi/indentify_granular/. △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: Submitted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

arXiv:2403.12685 [pdf, other]

Dynamic Manipulation of Deformable Objects using Imitation Learning with Adaptation to Hardware Constraints

Authors: Eric Hannus, Tran Nguyen Le, David Blanco-Mulero, Ville Kyrki

Abstract: Imitation Learning (IL) is a promising paradigm for learning dynamic manipulation of deformable objects since it does not depend on difficult-to-create accurate simulations of such objects. However, the translation of motions demonstrated by a human to a robot is a challenge for IL, due to differences in the embodiments and the robot's physical limits. These limits are especially relevant in dynam… ▽ More Imitation Learning (IL) is a promising paradigm for learning dynamic manipulation of deformable objects since it does not depend on difficult-to-create accurate simulations of such objects. However, the translation of motions demonstrated by a human to a robot is a challenge for IL, due to differences in the embodiments and the robot's physical limits. These limits are especially relevant in dynamic manipulation where high velocities and accelerations are typical. To address this problem, we propose a framework that first maps a dynamic demonstration into a motion that respects the robot's constraints using a constrained Dynamic Movement Primitive. Second, the resulting object state is further optimized by quasi-static refinement motions to optimize task performance metrics. This allows both efficiently altering the object state by dynamic motions and stable small-scale refinements. We evaluate the framework in the challenging task of bag opening, designing the system BILBO: Bimanual dynamic manipulation using Imitation Learning for Bag Opening. Our results show that BILBO can successfully open a wide range of crumpled bags, using a demonstration with a single bag. See supplementary material at https://sites.google.com/view/bilbo-bag. △ Less

Submitted 19 March, 2024; originally announced March 2024.

Comments: Submitted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024). 8 pages, 8 figures

arXiv:2403.11376 [pdf, other]

ShapeFormer: Shape Prior Visible-to-Amodal Transformer-based Amodal Instance Segmentation

Authors: Minh Tran, Winston Bounsavy, Khoa Vo, Anh Nguyen, Tri Nguyen, Ngan Le

Abstract: Amodal Instance Segmentation (AIS) presents a challenging task as it involves predicting both visible and occluded parts of objects within images. Existing AIS methods rely on a bidirectional approach, encompassing both the transition from amodal features to visible features (amodal-to-visible) and from visible features to amodal features (visible-to-amodal). Our observation shows that the utiliza… ▽ More Amodal Instance Segmentation (AIS) presents a challenging task as it involves predicting both visible and occluded parts of objects within images. Existing AIS methods rely on a bidirectional approach, encompassing both the transition from amodal features to visible features (amodal-to-visible) and from visible features to amodal features (visible-to-amodal). Our observation shows that the utilization of amodal features through the amodal-to-visible can confuse the visible features due to the extra information of occluded/hidden segments not presented in visible display. Consequently, this compromised quality of visible features during the subsequent visible-to-amodal transition. To tackle this issue, we introduce ShapeFormer, a decoupled Transformer-based model with a visible-to-amodal transition. It facilitates the explicit relationship between output segmentations and avoids the need for amodal-to-visible transitions. ShapeFormer comprises three key modules: (i) Visible-Occluding Mask Head for predicting visible segmentation with occlusion awareness, (ii) Shape-Prior Amodal Mask Head for predicting amodal and occluded masks, and (iii) Category-Specific Shape Prior Retriever aims to provide shape prior knowledge. Comprehensive experiments and extensive ablation studies across various AIS benchmarks demonstrate the effectiveness of our ShapeFormer. The code is available at: \url{https://github.com/UARK-AICV/ShapeFormer} △ Less

Submitted 17 April, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

Comments: Accepted to IJCNN2024

arXiv:2403.03435 [pdf, ps, other]

VLSP 2023 -- LTER: A Summary of the Challenge on Legal Textual Entailment Recognition

Authors: Vu Tran, Ha-Thanh Nguyen, Trung Vo, Son T. Luu, Hoang-Anh Dang, Ngoc-Cam Le, Thi-Thuy Le, Minh-Tien Nguyen, Truong-Son Nguyen, Le-Minh Nguyen

Abstract: In this new era of rapid AI development, especially in language processing, the demand for AI in the legal domain is increasingly critical. In the context where research in other languages such as English, Japanese, and Chinese has been well-established, we introduce the first fundamental research for the Vietnamese language in the legal domain: legal textual entailment recognition through the Vie… ▽ More In this new era of rapid AI development, especially in language processing, the demand for AI in the legal domain is increasingly critical. In the context where research in other languages such as English, Japanese, and Chinese has been well-established, we introduce the first fundamental research for the Vietnamese language in the legal domain: legal textual entailment recognition through the Vietnamese Language and Speech Processing workshop. In analyzing participants' results, we discuss certain linguistic aspects critical in the legal domain that pose challenges that need to be addressed. △ Less

Submitted 5 March, 2024; originally announced March 2024.

arXiv:2403.02974 [pdf, other]

Online Learning of Human Constraints from Feedback in Shared Autonomy

Authors: Shibei Zhu, Tran Nguyen Le, Samuel Kaski, Ville Kyrki

Abstract: Real-time collaboration with humans poses challenges due to the different behavior patterns of humans resulting from diverse physical constraints. Existing works typically focus on learning safety constraints for collaboration, or how to divide and distribute the subtasks between the participating agents to carry out the main task. In contrast, we propose to learn a human constraints model that, i… ▽ More Real-time collaboration with humans poses challenges due to the different behavior patterns of humans resulting from diverse physical constraints. Existing works typically focus on learning safety constraints for collaboration, or how to divide and distribute the subtasks between the participating agents to carry out the main task. In contrast, we propose to learn a human constraints model that, in addition, considers the diverse behaviors of different human operators. We consider a type of collaboration in a shared-autonomy fashion, where both a human operator and an assistive robot act simultaneously in the same task space that affects each other's actions. The task of the assistive agent is to augment the skill of humans to perform a shared task by supporting humans as much as possible, both in terms of reducing the workload and minimizing the discomfort for the human operator. Therefore, we propose an augmentative assistant agent capable of learning and adapting to human physical constraints, aligning its actions with the ergonomic preferences and limitations of the human operator. △ Less

Submitted 5 March, 2024; originally announced March 2024.

Comments: Accepted to AAAI-24 Bridge Program on Collaborative AI and Modeling of Humans & AAAI-24 Workshop on Ad Hoc Teamwork

arXiv:2402.18753 [pdf]

Like-minded, like-bodied: How users (18-26) trust online eating and health information

Authors: Rachel Xu, Nhu Le, Rebekah Park, Laura Murray

Abstract: This paper investigates the relationship between social media and eating practices amongst 42 internet users aged 18-26. We conducted an ethnography in the US and India to observe how they navigated eating and health information online. We found that participants portrayed themselves online through a vocabulary we have labeled "the good life": performing holistic health by displaying a socially-id… ▽ More This paper investigates the relationship between social media and eating practices amongst 42 internet users aged 18-26. We conducted an ethnography in the US and India to observe how they navigated eating and health information online. We found that participants portrayed themselves online through a vocabulary we have labeled "the good life": performing holistic health by displaying a socially-ideal body. In doing so, participants unconsciously engaged in behaviors of disordered eating while actively eschewing them. They also valued personal testimonies, and readily tested tips from content creators who shared similar beliefs and bodies to them. In doing so, they discarded probabilistic thinking and opened themselves to harm. Our study found that their social media feeds did not unidirectionally influence participants - they also reflected participants' internalized views of health, in an intertwined, non-linear journey. Reducing the online spread of disordered eating practices requires addressing it within young people's social context. △ Less

Submitted 28 February, 2024; originally announced February 2024.

Comments: 10 pages

arXiv:2401.10761 [pdf, other]

NN-VVC: Versatile Video Coding boosted by self-supervisedly learned image coding for machines

Authors: Jukka I. Ahonen, Nam Le, Honglei Zhang, Antti Hallapuro, Francesco Cricri, Hamed Rezazadegan Tavakoli, Miska M. Hannuksela, Esa Rahtu

Abstract: The recent progress in artificial intelligence has led to an ever-increasing usage of images and videos by machine analysis algorithms, mainly neural networks. Nonetheless, compression, storage and transmission of media have traditionally been designed considering human beings as the viewers of the content. Recent research on image and video coding for machine analysis has progressed mainly in two… ▽ More The recent progress in artificial intelligence has led to an ever-increasing usage of images and videos by machine analysis algorithms, mainly neural networks. Nonetheless, compression, storage and transmission of media have traditionally been designed considering human beings as the viewers of the content. Recent research on image and video coding for machine analysis has progressed mainly in two almost orthogonal directions. The first is represented by end-to-end (E2E) learned codecs which, while offering high performance on image coding, are not yet on par with state-of-the-art conventional video codecs and lack interoperability. The second direction considers using the Versatile Video Coding (VVC) standard or any other conventional video codec (CVC) together with pre- and post-processing operations targeting machine analysis. While the CVC-based methods benefit from interoperability and broad hardware and software support, the machine task performance is often lower than the desired level, particularly in low bitrates. This paper proposes a hybrid codec for machines called NN-VVC, which combines the advantages of an E2E-learned image codec and a CVC to achieve high performance in both image and video coding for machines. Our experiments show that the proposed system achieved up to -43.20% and -26.8% Bjøntegaard Delta rate reduction over VVC for image and video data, respectively, when evaluated on multiple different datasets and machine vision tasks. To the best of our knowledge, this is the first research paper showing a hybrid video codec that outperforms VVC on multiple datasets and multiple machine vision tasks. △ Less

Submitted 19 January, 2024; originally announced January 2024.

Comments: ISM 2023 Best paper award winner version

arXiv:2401.10732 [pdf, other]

doi 10.1109/ICIP46576.2022.9897916

Bridging the gap between image coding for machines and humans

Authors: Nam Le, Honglei Zhang, Francesco Cricri, Ramin G. Youvalari, Hamed Rezazadegan Tavakoli, Emre Aksu, Miska M. Hannuksela, Esa Rahtu

Abstract: Image coding for machines (ICM) aims at reducing the bitrate required to represent an image while minimizing the drop in machine vision analysis accuracy. In many use cases, such as surveillance, it is also important that the visual quality is not drastically deteriorated by the compression process. Recent works on using neural network (NN) based ICM codecs have shown significant coding gains agai… ▽ More Image coding for machines (ICM) aims at reducing the bitrate required to represent an image while minimizing the drop in machine vision analysis accuracy. In many use cases, such as surveillance, it is also important that the visual quality is not drastically deteriorated by the compression process. Recent works on using neural network (NN) based ICM codecs have shown significant coding gains against traditional methods; however, the decompressed images, especially at low bitrates, often contain checkerboard artifacts. We propose an effective decoder finetuning scheme based on adversarial training to significantly enhance the visual quality of ICM codecs, while preserving the machine analysis accuracy, without adding extra bitcost or parameters at the inference phase. The results show complete removal of the checkerboard artifacts at the negligible cost of -1.6% relative change in task performance score. In the cases where some amount of artifacts is tolerable, such as when machine consumption is the primary target, this technique can enhance both pixel-fidelity and feature-fidelity scores without losing task performance. △ Less

Submitted 19 January, 2024; originally announced January 2024.

Journal ref: IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 2022, pp. 3411-3415

arXiv:2401.04474 [pdf, other]

Combining Embedding-Based and Semantic-Based Models for Post-hoc Explanations in Recommender Systems

Authors: Ngoc Luyen Le, Marie-Hélène Abel, Philippe Gouspillou

Abstract: In today's data-rich environment, recommender systems play a crucial role in decision support systems. They provide to users personalized recommendations and explanations about these recommendations. Embedding-based models, despite their widespread use, often suffer from a lack of interpretability, which can undermine trust and user engagement. This paper presents an approach that combines embeddi… ▽ More In today's data-rich environment, recommender systems play a crucial role in decision support systems. They provide to users personalized recommendations and explanations about these recommendations. Embedding-based models, despite their widespread use, often suffer from a lack of interpretability, which can undermine trust and user engagement. This paper presents an approach that combines embedding-based and semantic-based models to generate post-hoc explanations in recommender systems, leveraging ontology-based knowledge graphs to improve interpretability and explainability. By organizing data within a structured framework, ontologies enable the modeling of intricate relationships between entities, which is essential for generating explanations. By combining embedding-based and semantic based models for post-hoc explanations in recommender systems, the framework we defined aims at producing meaningful and easy-to-understand explanations, enhancing user trust and satisfaction, and potentially promoting the adoption of recommender systems across the e-commerce sector. △ Less

Submitted 9 January, 2024; originally announced January 2024.

arXiv:2401.03770 [pdf, other]

Recognizing Similar Crises through the Application of Ontology-based Knowledge Mining

Authors: Ngoc Luyen Le, Marie-Hélène Abel, Elsa Negre

Abstract: Recognizing and learning from similar crisis situations is crucial for the development of effective response strategies. This study addresses the challenge of identifying similarities within a wide range of crisis-related information. To overcome this challenge, we employed an ontology-based crisis situation knowledge base enriched with crisis-related information. Additionally, we implemented a se… ▽ More Recognizing and learning from similar crisis situations is crucial for the development of effective response strategies. This study addresses the challenge of identifying similarities within a wide range of crisis-related information. To overcome this challenge, we employed an ontology-based crisis situation knowledge base enriched with crisis-related information. Additionally, we implemented a semantic similarity measure to assess the degree of similarity between crisis situations. Our investigation specifically focuses on recognizing similar crises through the application of ontology-based knowledge mining. Through our experiments, we demonstrate the accuracy and efficiency of our approach to recognizing similar crises. These findings highlight the potential of ontology-based knowledge mining for enhancing crisis recognition processes and improving overall crisis management strategies. △ Less

Submitted 8 January, 2024; originally announced January 2024.

arXiv:2312.10187 [pdf, other]

TSRNet: Simple Framework for Real-time ECG Anomaly Detection with Multimodal Time and Spectrogram Restoration Network

Authors: Nhat-Tan Bui, Dinh-Hieu Hoang, Thinh Phan, Minh-Triet Tran, Brijesh Patel, Donald Adjeroh, Ngan Le

Abstract: The electrocardiogram (ECG) is a valuable signal used to assess various aspects of heart health, such as heart rate and rhythm. It plays a crucial role in identifying cardiac conditions and detecting anomalies in ECG data. However, distinguishing between normal and abnormal ECG signals can be a challenging task. In this paper, we propose an approach that leverages anomaly detection to identify unh… ▽ More The electrocardiogram (ECG) is a valuable signal used to assess various aspects of heart health, such as heart rate and rhythm. It plays a crucial role in identifying cardiac conditions and detecting anomalies in ECG data. However, distinguishing between normal and abnormal ECG signals can be a challenging task. In this paper, we propose an approach that leverages anomaly detection to identify unhealthy conditions using solely normal ECG data for training. Furthermore, to enhance the information available and build a robust system, we suggest considering both the time series and time-frequency domain aspects of the ECG signal. As a result, we introduce a specialized network called the Multimodal Time and Spectrogram Restoration Network (TSRNet) designed specifically for detecting anomalies in ECG signals. TSRNet falls into the category of restoration-based anomaly detection and draws inspiration from both the time series and spectrogram domains. By extracting representations from both domains, TSRNet effectively captures the comprehensive characteristics of the ECG signal. This approach enables the network to learn robust representations with superior discrimination abilities, allowing it to distinguish between normal and abnormal ECG patterns more effectively. Furthermore, we introduce a novel inference method, termed Peak-based Error, that specifically focuses on ECG peaks, a critical component in detecting abnormalities. The experimental result on the large-scale dataset PTB-XL has demonstrated the effectiveness of our approach in ECG anomaly detection, while also prioritizing efficiency by minimizing the number of trainable parameters. Our code is available at https://github.com/UARK-AICV/TSRNet. △ Less

Submitted 5 March, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted at ISBI 2024

arXiv:2312.09507 [pdf, other]

doi 10.1109/ICASSP48485.2024.10446193

WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge

Authors: Huy Le, Tung Kieu, Anh Nguyen, Ngan Le

Abstract: Text-video retrieval, a prominent sub-field within the domain of multimodal information retrieval, has witnessed remarkable growth in recent years. However, existing methods assume video scenes are consistent with unbiased descriptions. These limitations fail to align with real-world scenarios since descriptions can be influenced by annotator biases, diverse writing styles, and varying textual per… ▽ More Text-video retrieval, a prominent sub-field within the domain of multimodal information retrieval, has witnessed remarkable growth in recent years. However, existing methods assume video scenes are consistent with unbiased descriptions. These limitations fail to align with real-world scenarios since descriptions can be influenced by annotator biases, diverse writing styles, and varying textual perspectives. To overcome the aforementioned problems, we introduce $\texttt{WAVER}$, a cross-domain knowledge distillation framework via vision-language models through open-vocabulary knowledge designed to tackle the challenge of handling different writing styles in video descriptions. $\texttt{WAVER}$ capitalizes on the open-vocabulary properties that lie in pre-trained vision-language models and employs an implicit knowledge distillation approach to transfer text-based knowledge from a teacher model to a vision-based student. Empirical studies conducted across four standard benchmark datasets, encompassing various settings, provide compelling evidence that $\texttt{WAVER}$ can achieve state-of-the-art performance in text-video retrieval task while handling writing-style variations. The code is available at: https://github.com/Fsoft-AIC/WAVER △ Less

Submitted 10 January, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: Accepted to ICASSP 2024

arXiv:2312.07740 [pdf, other]

HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group Activity Scene Graph Generation in Videos

Authors: Naga VS Raviteja Chappa, Pha Nguyen, Thi Hoang Ngan Le, Khoa Luu

Abstract: Group Activity Scene Graph (GASG) generation is a challenging task in computer vision, aiming to anticipate and describe relationships between subjects and objects in video sequences. Traditional Video Scene Graph Generation (VidSGG) methods focus on retrospective analysis, limiting their predictive capabilities. To enrich the scene understanding capabilities, we introduced a GASG dataset extendin… ▽ More Group Activity Scene Graph (GASG) generation is a challenging task in computer vision, aiming to anticipate and describe relationships between subjects and objects in video sequences. Traditional Video Scene Graph Generation (VidSGG) methods focus on retrospective analysis, limiting their predictive capabilities. To enrich the scene understanding capabilities, we introduced a GASG dataset extending the JRDB dataset with nuanced annotations involving \textit{Appearance, Interaction, Position, Relationship, and Situation} attributes. This work also introduces an innovative approach, \textbf{H}ierarchical \textbf{Att}ention-\textbf{Flow} (HAtt-Flow) Mechanism, rooted in flow network theory to enhance GASG performance. Flow-Attention incorporates flow conservation principles, fostering competition for sources and allocation for sinks, effectively preventing the generation of trivial attention. Our proposed approach offers a unique perspective on attention mechanisms, where conventional "values" and "keys" are transformed into sources and sinks, respectively, creating a novel framework for attention-based models. Through extensive experiments, we demonstrate the effectiveness of our Hatt-Flow model and the superiority of our proposed Flow-Attention mechanism. This work represents a significant advancement in predictive video scene understanding, providing valuable insights and techniques for applications that require real-time relationship prediction in video data. △ Less

Submitted 28 November, 2023; originally announced December 2023.

Comments: 11 pages, 5 figures, 6 tables

arXiv:2312.05634 [pdf, other]

PGDS: Pose-Guidance Deep Supervision for Mitigating Clothes-Changing in Person Re-Identification

Authors: Quoc-Huy Trinh, Nhat-Tan Bui, Dinh-Hieu Hoang, Phuoc-Thao Vo Thi, Hai-Dang Nguyen, Debesh Jha, Ulas Bagci, Ngan Le, Minh-Triet Tran

Abstract: Person Re-Identification (Re-ID) task seeks to enhance the tracking of multiple individuals by surveillance cameras. It supports multimodal tasks, including text-based person retrieval and human matching. One of the most significant challenges faced in Re-ID is clothes-changing, where the same person may appear in different outfits. While previous methods have made notable progress in maintaining… ▽ More Person Re-Identification (Re-ID) task seeks to enhance the tracking of multiple individuals by surveillance cameras. It supports multimodal tasks, including text-based person retrieval and human matching. One of the most significant challenges faced in Re-ID is clothes-changing, where the same person may appear in different outfits. While previous methods have made notable progress in maintaining clothing data consistency and handling clothing change data, they still rely excessively on clothing information, which can limit performance due to the dynamic nature of human appearances. To mitigate this challenge, we propose the Pose-Guidance Deep Supervision (PGDS), an effective framework for learning pose guidance within the Re-ID task. It consists of three modules: a human encoder, a pose encoder, and a Pose-to-Human Projection module (PHP). Our framework guides the human encoder, i.e., the main re-identification model, with pose information from the pose encoder through multiple layers via the knowledge transfer mechanism from the PHP module, hel** the human encoder learn body parts information without increasing computation resources in the inference stage. Through extensive experiments, our method surpasses the performance of current state-of-the-art methods, demonstrating its robustness and effectiveness for real-world applications. Our code is available at https://github.com/huyquoctrinh/PGDS. △ Less

Submitted 1 June, 2024; v1 submitted 9 December, 2023; originally announced December 2023.

Comments: Accepted at AVSS 2024

arXiv:2311.11362 [pdf, other]

Symmetry-invariant quantum machine learning force fields

Authors: Isabel Nha Minh Le, Oriel Kiss, Julian Schuhmacher, Ivano Tavernelli, Francesco Tacchino

Abstract: Machine learning techniques are essential tools to compute efficient, yet accurate, force fields for atomistic simulations. This approach has recently been extended to incorporate quantum computational methods, making use of variational quantum learning models to predict potential energy surfaces and atomic forces from ab initio training data. However, the trainability and scalability of such mode… ▽ More Machine learning techniques are essential tools to compute efficient, yet accurate, force fields for atomistic simulations. This approach has recently been extended to incorporate quantum computational methods, making use of variational quantum learning models to predict potential energy surfaces and atomic forces from ab initio training data. However, the trainability and scalability of such models are still limited, due to both theoretical and practical barriers. Inspired by recent developments in geometric classical and quantum machine learning, here we design quantum neural networks that explicitly incorporate, as a data-inspired prior, an extensive set of physically relevant symmetries. We find that our invariant quantum learning models outperform their more generic counterparts on individual molecules of growing complexity. Furthermore, we study a water dimer as a minimal example of a system with multiple components, showcasing the versatility of our proposed approach and opening the way towards larger simulations. Our results suggest that molecular force fields generation can significantly profit from leveraging the framework of geometric quantum machine learning, and that chemical systems represent, in fact, an interesting and rich playground for the development and application of advanced quantum machine learning tools. △ Less

Submitted 19 November, 2023; originally announced November 2023.

Comments: 12 pages, 8 figures

arXiv:2311.11096 [pdf, other]

On the Out of Distribution Robustness of Foundation Models in Medical Image Segmentation

Authors: Duy Minh Ho Nguyen, Tan Ngoc Pham, Nghiem Tuong Diep, Nghi Quoc Phan, Quang Pham, Vinh Tong, Binh T. Nguyen, Ngan Hoang Le, Nhat Ho, Pengtao Xie, Daniel Sonntag, Mathias Niepert

Abstract: Constructing a robust model that can effectively generalize to test samples under distribution shifts remains a significant challenge in the field of medical imaging. The foundational models for vision and language, pre-trained on extensive sets of natural image and text data, have emerged as a promising approach. It showcases impressive learning abilities across different tasks with the need for… ▽ More Constructing a robust model that can effectively generalize to test samples under distribution shifts remains a significant challenge in the field of medical imaging. The foundational models for vision and language, pre-trained on extensive sets of natural image and text data, have emerged as a promising approach. It showcases impressive learning abilities across different tasks with the need for only a limited amount of annotated samples. While numerous techniques have focused on develo** better fine-tuning strategies to adapt these models for specific domains, we instead examine their robustness to domain shifts in the medical image segmentation task. To this end, we compare the generalization performance to unseen domains of various pre-trained models after being fine-tuned on the same in-distribution dataset and show that foundation-based models enjoy better robustness than other architectures. From here, we further developed a new Bayesian uncertainty estimation for frozen models and used them as an indicator to characterize the model's performance on out-of-distribution (OOD) data, proving particularly beneficial for real-world applications. Our experiments not only reveal the limitations of current indicators like accuracy on the line or agreement on the line commonly used in natural image applications but also emphasize the promise of the introduced Bayesian uncertainty. Specifically, lower uncertainty predictions usually tend to higher out-of-distribution (OOD) performance. △ Less

Submitted 18 November, 2023; originally announced November 2023.

Comments: Advances in Neural Information Processing Systems (NeurIPS) 2023, Workshop on robustness of zero/few-shot learning in foundation models

arXiv:2311.00729 [pdf, other]

ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection

Authors: Thinh Phan, Khoa Vo, Duy Le, Gianfranco Doretto, Donald Adjeroh, Ngan Le

Abstract: Temporal action detection (TAD) involves the localization and classification of action instances within untrimmed videos. While standard TAD follows fully supervised learning with closed-set setting on large training data, recent zero-shot TAD methods showcase the promising open-set setting by leveraging large-scale contrastive visual-language (ViL) pretrained models. However, existing zero-shot T… ▽ More Temporal action detection (TAD) involves the localization and classification of action instances within untrimmed videos. While standard TAD follows fully supervised learning with closed-set setting on large training data, recent zero-shot TAD methods showcase the promising open-set setting by leveraging large-scale contrastive visual-language (ViL) pretrained models. However, existing zero-shot TAD methods have limitations on how to properly construct the strong relationship between two interdependent tasks of localization and classification and adapt ViL model to video understanding. In this work, we present ZEETAD, featuring two modules: dual-localization and zero-shot proposal classification. The former is a Transformer-based module that detects action events while selectively collecting crucial semantic embeddings for later recognition. The latter one, CLIP-based module, generates semantic embeddings from text and frame inputs for each temporal unit. Additionally, we enhance discriminative capability on unseen classes by minimally updating the frozen CLIP encoder with lightweight adapters. Extensive experiments on THUMOS14 and ActivityNet-1.3 datasets demonstrate our approach's superior performance in zero-shot TAD and effective knowledge transfer from ViL models to unseen action categories. △ Less

Submitted 4 November, 2023; v1 submitted 31 October, 2023; originally announced November 2023.

arXiv:2310.20057 [pdf, other]

SolarFormer: Multi-scale Transformer for Solar PV Profiling

Authors: Adrian de Luis, Minh Tran, Taisei Hanyu, Anh Tran, Liao Haitao, Roy McCann, Alan Mantooth, Ying Huang, Ngan Le

Abstract: As climate change intensifies, the global imperative to shift towards sustainable energy sources becomes more pronounced. Photovoltaic (PV) energy is a favored choice due to its reliability and ease of installation. Accurate map** of PV installations is crucial for understanding their adoption and informing energy policy. To meet this need, we introduce the SolarFormer, designed to segment solar… ▽ More As climate change intensifies, the global imperative to shift towards sustainable energy sources becomes more pronounced. Photovoltaic (PV) energy is a favored choice due to its reliability and ease of installation. Accurate map** of PV installations is crucial for understanding their adoption and informing energy policy. To meet this need, we introduce the SolarFormer, designed to segment solar panels from aerial imagery, offering insights into their location and size. However, solar panel identification in Computer Vision is intricate due to various factors like weather conditions, roof conditions, and Ground Sampling Distance (GSD) variations. To tackle these complexities, we present the SolarFormer, featuring a multi-scale Transformer encoder and a masked-attention Transformer decoder. Our model leverages low-level features and incorporates an instance query mechanism to enhance the localization of solar PV installations. We rigorously evaluated our SolarFormer using diverse datasets, including GGE (France), IGN (France), and USGS (California, USA), across different GSDs. Our extensive experiments consistently demonstrate that our model either matches or surpasses state-of-the-art models, promising enhanced solar panel segmentation for global sustainable energy initiatives. △ Less

Submitted 30 October, 2023; originally announced October 2023.

Comments: Pre-print

arXiv:2310.18986 [pdf, other]

Controllable Group Choreography using Contrastive Diffusion

Authors: Nhat Le, Tuong Do, Khoa Do, Hien Nguyen, Erman Tjiputra, Quang D. Tran, Anh Nguyen

Abstract: Music-driven group choreography poses a considerable challenge but holds significant potential for a wide range of industrial applications. The ability to generate synchronized and visually appealing group dance motions that are aligned with music opens up opportunities in many fields such as entertainment, advertising, and virtual performances. However, most of the recent works are not able to ge… ▽ More Music-driven group choreography poses a considerable challenge but holds significant potential for a wide range of industrial applications. The ability to generate synchronized and visually appealing group dance motions that are aligned with music opens up opportunities in many fields such as entertainment, advertising, and virtual performances. However, most of the recent works are not able to generate high-fidelity long-term motions, or fail to enable controllable experience. In this work, we aim to address the demand for high-quality and customizable group dance generation by effectively governing the consistency and diversity of group choreographies. In particular, we utilize a diffusion-based generative approach to enable the synthesis of flexible number of dancers and long-term group dances, while ensuring coherence to the input music. Ultimately, we introduce a Group Contrastive Diffusion (GCD) strategy to enhance the connection between dancers and their group, presenting the ability to control the consistency or diversity level of the synthesized group animation via the classifier-guidance sampling technique. Through intensive experiments and evaluation, we demonstrate the effectiveness of our approach in producing visually captivating and consistent group dance motions. The experimental results show the capability of our method to achieve the desired levels of consistency and diversity, while maintaining the overall quality of the generated group choreography. The source code can be found at https://aioz-ai.github.io/GCD △ Less

Submitted 3 November, 2023; v1 submitted 29 October, 2023; originally announced October 2023.

arXiv:2310.03923 [pdf, other]

Open-Fusion: Real-time Open-Vocabulary 3D Map** and Queryable Scene Representation

Authors: Kashu Yamazaki, Taisei Hanyu, Khoa Vo, Thang Pham, Minh Tran, Gianfranco Doretto, Anh Nguyen, Ngan Le

Abstract: Precise 3D environmental map** is pivotal in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, a groundbreaking approach for real-time open-vocabulary 3D map** and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pre-trained vision-language found… ▽ More Precise 3D environmental map** is pivotal in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, a groundbreaking approach for real-time open-vocabulary 3D map** and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and employs the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based embeddings and their associated confidence maps. These are then integrated with 3D knowledge from TSDF using an enhanced Hungarian-based feature-matching mechanism. Notably, Open-Fusion delivers outstanding annotation-free 3D segmentation for open-vocabulary without necessitating additional 3D training. Benchmark tests on the ScanNet dataset against leading zero-shot methods highlight Open-Fusion's superiority. Furthermore, it seamlessly combines the strengths of region-based VLFM and TSDF, facilitating real-time 3D scene comprehension that includes object concepts and open-world semantics. We encourage the readers to view the demos on our project page: https://uark-aicv.github.io/OpenFusion △ Less

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2310.02143 [pdf, other]

CORec-Cri: How collaborative and social technologies can help to contextualize crises?

Authors: Ngoc Luyen Le, **feng Zhong, Elsa Negre, Marie-Hélène Abel

Abstract: Crisis situations can present complex and multifaceted challenges, often requiring the involvement of multiple organizations and stakeholders with varying areas of expertise, responsibilities, and resources. Acquiring accurate and timely information about impacted areas is crucial to effectively respond to these crises. In this paper, we investigate how collaborative and social technologies help t… ▽ More Crisis situations can present complex and multifaceted challenges, often requiring the involvement of multiple organizations and stakeholders with varying areas of expertise, responsibilities, and resources. Acquiring accurate and timely information about impacted areas is crucial to effectively respond to these crises. In this paper, we investigate how collaborative and social technologies help to contextualize crises, including identifying impacted areas and real-time needs. To this end, we define CORec-Cri (Contextulized Ontology-based Recommender system for crisis management) based on existing work. Our motivation for this approach is two-fold: first, effective collaboration among stakeholders is essential for efficient and coordinated crisis response; second, social computing facilitates interaction, information flow, and collaboration among stakeholders. We detail the key components of our system design, highlighting its potential to support decision-making, resource allocation, and communication among stakeholders. Finally, we provide examples of how our system can be applied to contextualize crises to improve crisis management. △ Less

Submitted 3 October, 2023; originally announced October 2023.

arXiv:2309.13550 [pdf, other]

I-AI: A Controllable & Interpretable AI System for Decoding Radiologists' Intense Focus for Accurate CXR Diagnoses

Authors: Trong Thang Pham, Jacob Brecheisen, Anh Nguyen, Hien Nguyen, Ngan Le

Abstract: In the field of chest X-ray (CXR) diagnosis, existing works often focus solely on determining where a radiologist looks, typically through tasks such as detection, segmentation, or classification. However, these approaches are often designed as black-box models, lacking interpretability. In this paper, we introduce Interpretable Artificial Intelligence (I-AI) a novel and unified controllable inter… ▽ More In the field of chest X-ray (CXR) diagnosis, existing works often focus solely on determining where a radiologist looks, typically through tasks such as detection, segmentation, or classification. However, these approaches are often designed as black-box models, lacking interpretability. In this paper, we introduce Interpretable Artificial Intelligence (I-AI) a novel and unified controllable interpretable pipeline for decoding the intense focus of radiologists in CXR diagnosis. Our I-AI addresses three key questions: where a radiologist looks, how long they focus on specific areas, and what findings they diagnose. By capturing the intensity of the radiologist's gaze, we provide a unified solution that offers insights into the cognitive process underlying radiological interpretation. Unlike current methods that rely on black-box machine learning models, which can be prone to extracting erroneous information from the entire input image during the diagnosis process, we tackle this issue by effectively masking out irrelevant information. Our proposed I-AI leverages a vision-language model, allowing for precise control over the interpretation process while ensuring the exclusion of irrelevant features. To train our I-AI model, we utilize an eye gaze dataset to extract anatomical gaze information and generate ground truth heatmaps. Through extensive experimentation, we demonstrate the efficacy of our method. We showcase that the attention heatmaps, designed to mimic radiologists' focus, encode sufficient and relevant information, enabling accurate classification tasks using only a portion of CXR. The code, checkpoints, and data are at https://github.com/UARK-AICV/IAI △ Less

Submitted 9 December, 2023; v1 submitted 24 September, 2023; originally announced September 2023.

Comments: Accepted at WACV 2024

arXiv:2309.12323 [pdf, other]

Evaluating the diversity and utility of materials proposed by generative models

Authors: Alexander New, Michael Pekala, Elizabeth A. Pogue, Nam Q. Le, Janna Domenico, Christine D. Piatko, Christopher D. Stiles

Abstract: Generative machine learning models can use data generated by scientific modeling to create large quantities of novel material structures. Here, we assess how one state-of-the-art generative model, the physics-guided crystal generation model (PGCGM), can be used as part of the inverse design process. We show that the default PGCGM's input space is not smooth with respect to parameter variation, mak… ▽ More Generative machine learning models can use data generated by scientific modeling to create large quantities of novel material structures. Here, we assess how one state-of-the-art generative model, the physics-guided crystal generation model (PGCGM), can be used as part of the inverse design process. We show that the default PGCGM's input space is not smooth with respect to parameter variation, making material optimization difficult and limited. We also demonstrate that most generated structures are predicted to be thermodynamically unstable by a separate property-prediction model, partially due to out-of-domain data challenges. Our findings suggest how generative models might be improved to enable better inverse design. △ Less

Submitted 9 August, 2023; originally announced September 2023.

Comments: 12 pages, 9 figures. Published at SynS & ML @ ICML2023: https://openreview.net/forum?id=2ZYbmYTKoR

arXiv:2309.10932 [pdf, other]

Open-Vocabulary Affordance Detection using Knowledge Distillation and Text-Point Correlation

Authors: Tuan Van Vo, Minh Nhat Vu, Baoru Huang, Toan Nguyen, Ngan Le, Thieu Vo, Anh Nguyen

Abstract: Affordance detection presents intricate challenges and has a wide range of robotic applications. Previous works have faced limitations such as the complexities of 3D object shapes, the wide range of potential affordances on real-world objects, and the lack of open-vocabulary support for affordance understanding. In this paper, we introduce a new open-vocabulary affordance detection method in 3D po… ▽ More Affordance detection presents intricate challenges and has a wide range of robotic applications. Previous works have faced limitations such as the complexities of 3D object shapes, the wide range of potential affordances on real-world objects, and the lack of open-vocabulary support for affordance understanding. In this paper, we introduce a new open-vocabulary affordance detection method in 3D point clouds, leveraging knowledge distillation and text-point correlation. Our approach employs pre-trained 3D models through knowledge distillation to enhance feature extraction and semantic understanding in 3D point clouds. We further introduce a new text-point correlation method to learn the semantic links between point cloud features and open-vocabulary labels. The intensive experiments show that our approach outperforms previous works and adapts to new affordance labels and unseen objects. Notably, our method achieves the improvement of 7.96% mIOU score compared to the baselines. Furthermore, it offers real-time inference which is well-suitable for robotic manipulation applications. △ Less

Submitted 19 September, 2023; originally announced September 2023.

Comments: 8 pages

arXiv:2309.10911 [pdf, other]

Language-Conditioned Affordance-Pose Detection in 3D Point Clouds

Authors: Toan Nguyen, Minh Nhat Vu, Baoru Huang, Tuan Van Vo, Vy Truong, Ngan Le, Thieu Vo, Bac Le, Anh Nguyen

Abstract: Affordance detection and pose estimation are of great importance in many robotic applications. Their combination helps the robot gain an enhanced manipulation capability, in which the generated pose can facilitate the corresponding affordance task. Previous methods for affodance-pose joint learning are limited to a predefined set of affordances, thus limiting the adaptability of robots in real-wor… ▽ More Affordance detection and pose estimation are of great importance in many robotic applications. Their combination helps the robot gain an enhanced manipulation capability, in which the generated pose can facilitate the corresponding affordance task. Previous methods for affodance-pose joint learning are limited to a predefined set of affordances, thus limiting the adaptability of robots in real-world environments. In this paper, we propose a new method for language-conditioned affordance-pose joint learning in 3D point clouds. Given a 3D point cloud object, our method detects the affordance region and generates appropriate 6-DoF poses for any unconstrained affordance label. Our method consists of an open-vocabulary affordance detection branch and a language-guided diffusion model that generates 6-DoF poses based on the affordance text. We also introduce a new high-quality dataset for the task of language-driven affordance-pose joint learning. Intensive experimental results demonstrate that our proposed method works effectively on a wide range of open-vocabulary affordances and outperforms other baselines by a large margin. In addition, we illustrate the usefulness of our method in real-world robotic applications. Our code and dataset are publicly available at https://3DAPNet.github.io △ Less

Submitted 19 September, 2023; originally announced September 2023.

Comments: Project page: https://3DAPNet.github.io

arXiv:2309.03506 [pdf, other]

Towards Robust Natural-Looking Mammography Lesion Synthesis on Ipsilateral Dual-Views Breast Cancer Analysis

Authors: Thanh-Huy Nguyen, Quang Hien Kha, Thai Ngoc Toan Truong, Ba Thinh Lam, Ba Hung Ngo, Quang Vinh Dinh, Nguyen Quoc Khanh Le

Abstract: In recent years, many mammographic image analysis methods have been introduced for improving cancer classification tasks. Two major issues of mammogram classification tasks are leveraging multi-view mammographic information and class-imbalance handling. In the first problem, many multi-view methods have been released for concatenating features of two or more views for the training and inference st… ▽ More In recent years, many mammographic image analysis methods have been introduced for improving cancer classification tasks. Two major issues of mammogram classification tasks are leveraging multi-view mammographic information and class-imbalance handling. In the first problem, many multi-view methods have been released for concatenating features of two or more views for the training and inference stage. Having said that, most multi-view existing methods are not explainable in the meaning of feature fusion, and treat many views equally for diagnosing. Our work aims to propose a simple but novel method for enhancing examined view (main view) by leveraging low-level feature information from the auxiliary view (ipsilateral view) before learning the high-level feature that contains the cancerous features. For the second issue, we also propose a simple but novel malignant mammogram synthesis framework for upsampling minor class samples. Our easy-to-implement and no-training framework has eliminated the current limitation of the CutMix algorithm which is unreliable synthesized images with random pasted patches, hard-contour problems, and domain shift problems. Our results on VinDr-Mammo and CMMD datasets show the effectiveness of our two new frameworks for both multi-view training and synthesizing mammographic images, outperforming the previous conventional methods in our experimental settings. △ Less

Submitted 7 September, 2023; originally announced September 2023.

arXiv:2309.03493 [pdf, other]

SAM3D: Segment Anything Model in Volumetric Medical Images

Authors: Nhat-Tan Bui, Dinh-Hieu Hoang, Minh-Triet Tran, Gianfranco Doretto, Donald Adjeroh, Brijesh Patel, Arabinda Choudhary, Ngan Le

Abstract: Image segmentation remains a pivotal component in medical image analysis, aiding in the extraction of critical information for precise diagnostic practices. With the advent of deep learning, automated image segmentation methods have risen to prominence, showcasing exceptional proficiency in processing medical imagery. Motivated by the Segment Anything Model (SAM)-a foundational model renowned for… ▽ More Image segmentation remains a pivotal component in medical image analysis, aiding in the extraction of critical information for precise diagnostic practices. With the advent of deep learning, automated image segmentation methods have risen to prominence, showcasing exceptional proficiency in processing medical imagery. Motivated by the Segment Anything Model (SAM)-a foundational model renowned for its remarkable precision and robust generalization capabilities in segmenting 2D natural images-we introduce SAM3D, an innovative adaptation tailored for 3D volumetric medical image analysis. Unlike current SAM-based methods that segment volumetric data by converting the volume into separate 2D slices for individual analysis, our SAM3D model processes the entire 3D volume image in a unified approach. Extensive experiments are conducted on multiple medical image datasets to demonstrate that our network attains competitive results compared with other state-of-the-art methods in 3D medical segmentation tasks while being significantly efficient in terms of parameters. Code and checkpoints are available at https://github.com/UARK-AICV/SAM3D. △ Less

Submitted 5 March, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

Comments: Accepted at ISBI 2024

arXiv:2309.03329 [pdf, other]

MEGANet: Multi-Scale Edge-Guided Attention Network for Weak Boundary Polyp Segmentation

Authors: Nhat-Tan Bui, Dinh-Hieu Hoang, Quang-Thuc Nguyen, Minh-Triet Tran, Ngan Le

Abstract: Efficient polyp segmentation in healthcare plays a critical role in enabling early diagnosis of colorectal cancer. However, the segmentation of polyps presents numerous challenges, including the intricate distribution of backgrounds, variations in polyp sizes and shapes, and indistinct boundaries. Defining the boundary between the foreground (i.e. polyp itself) and the background (surrounding tiss… ▽ More Efficient polyp segmentation in healthcare plays a critical role in enabling early diagnosis of colorectal cancer. However, the segmentation of polyps presents numerous challenges, including the intricate distribution of backgrounds, variations in polyp sizes and shapes, and indistinct boundaries. Defining the boundary between the foreground (i.e. polyp itself) and the background (surrounding tissue) is difficult. To mitigate these challenges, we propose Multi-Scale Edge-Guided Attention Network (MEGANet) tailored specifically for polyp segmentation within colonoscopy images. This network draws inspiration from the fusion of a classical edge detection technique with an attention mechanism. By combining these techniques, MEGANet effectively preserves high-frequency information, notably edges and boundaries, which tend to erode as neural networks deepen. MEGANet is designed as an end-to-end framework, encompassing three key modules: an encoder, which is responsible for capturing and abstracting the features from the input image, a decoder, which focuses on salient features, and the Edge-Guided Attention module (EGA) that employs the Laplacian Operator to accentuate polyp boundaries. Extensive experiments, both qualitative and quantitative, on five benchmark datasets, demonstrate that our MEGANet outperforms other existing SOTA methods under six evaluation metrics. Our code is available at https://github.com/UARK-AICV/MEGANet. △ Less

Submitted 4 November, 2023; v1 submitted 6 September, 2023; originally announced September 2023.

Comments: Accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2024)

arXiv:2308.06018 [pdf, other]

Designing a User Contextual Profile Ontology: A Focus on the Vehicle Sales Domain

Authors: Ngoc Luyen Le, Marie-Hélène Abel, Philippe Gouspillou

Abstract: In the digital age, it is crucial to understand and tailor experiences for users interacting with systems and applications. This requires the creation of user contextual profiles that combine user profiles with contextual information. However, there is a lack of research on the integration of contextual information with different user profiles. This study aims to address this gap by designing a us… ▽ More In the digital age, it is crucial to understand and tailor experiences for users interacting with systems and applications. This requires the creation of user contextual profiles that combine user profiles with contextual information. However, there is a lack of research on the integration of contextual information with different user profiles. This study aims to address this gap by designing a user contextual profile ontology that considers both user profiles and contextual information on each profile. Specifically, we present a design and development of the user contextual profile ontology with a focus on the vehicle sales domain. Our designed ontology serves as a structural foundation for standardizing the representation of user profiles and contextual information, enhancing the system's ability to capture user preferences and contextual information of the user accurately. Moreover, we illustrate a case study using the User Contextual Profile Ontology in generating personalized recommendations for vehicle sales domain. △ Less

Submitted 11 August, 2023; originally announced August 2023.

arXiv:2307.10702 [pdf, other]

doi 10.1109/CSCWD57460.2023.10152701

A Constraint-based Recommender System via RDF Knowledge Graphs

Authors: Ngoc Luyen Le, Marie-Hélène Abel, Philippe Gouspillou

Abstract: Knowledge graphs, represented in RDF, are able to model entities and their relations by means of ontologies. The use of knowledge graphs for information modeling has attracted interest in recent years. In recommender systems, items and users can be mapped and integrated into the knowledge graph, which can represent more links and relationships between users and items. Constraint-based recommender… ▽ More Knowledge graphs, represented in RDF, are able to model entities and their relations by means of ontologies. The use of knowledge graphs for information modeling has attracted interest in recent years. In recommender systems, items and users can be mapped and integrated into the knowledge graph, which can represent more links and relationships between users and items. Constraint-based recommender systems are based on the idea of explicitly exploiting deep recommendation knowledge through constraints to identify relevant recommendations. When combined with knowledge graphs, a constraint-based recommender system gains several benefits in terms of constraint sets. In this paper, we investigate and propose the construction of a constraint-based recommender system via RDF knowledge graphs applied to the vehicle purchase/sale domain. The results of our experiments show that the proposed approach is able to efficiently identify recommendations in accordance with user preferences. △ Less

Submitted 20 July, 2023; originally announced July 2023.

Journal ref: The 26th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2023 ), May 2023, Rio de Janeiro, Brazil. pp.849-854

arXiv:2307.10680 [pdf, ps, other]

doi 10.1007/978-3-031-27762-7_35

A Personalized Recommender System Based-on Knowledge Graph Embeddings

Authors: Ngoc Luyen Le, Marie-Hélène Abel, Philippe Gouspillou

Abstract: Knowledge graphs have proven to be effective for modeling entities and their relationships through the use of ontologies. The recent emergence in interest for using knowledge graphs as a form of information modeling has led to their increased adoption in recommender systems. By incorporating users and items into the knowledge graph, these systems can better capture the implicit connections between… ▽ More Knowledge graphs have proven to be effective for modeling entities and their relationships through the use of ontologies. The recent emergence in interest for using knowledge graphs as a form of information modeling has led to their increased adoption in recommender systems. By incorporating users and items into the knowledge graph, these systems can better capture the implicit connections between them and provide more accurate recommendations. In this paper, we investigate and propose the construction of a personalized recommender system via knowledge graphs embedding applied to the vehicle purchase/sale domain. The results of our experimentation demonstrate the efficacy of the proposed method in providing relevant recommendations that are consistent with individual users. △ Less

Submitted 20 July, 2023; originally announced July 2023.

Journal ref: The International Conference on Artificial Intelligence and Computer Vision (AICV2023), Mar 2023, Marrakesh, Morocco. pp.368-378

arXiv:2307.10639 [pdf, other]

doi 10.1007/978-3-031-33258-6_42

Improving Semantic Similarity Measure Within a Recommender System Based-on RDF Graphs

Authors: Ngoc Luyen Le, Marie-Hélène Abel, Philippe Gouspillou

Abstract: In today's era of information explosion, more users are becoming more reliant upon recommender systems to have better advice, suggestions, or inspire them. The measure of the semantic relatedness or likeness between terms, words, or text data plays an important role in different applications dealing with textual data, as in a recommender system. Over the past few years, many ontologies have been d… ▽ More In today's era of information explosion, more users are becoming more reliant upon recommender systems to have better advice, suggestions, or inspire them. The measure of the semantic relatedness or likeness between terms, words, or text data plays an important role in different applications dealing with textual data, as in a recommender system. Over the past few years, many ontologies have been developed and used as a form of structured representation of knowledge bases for information systems. The measure of semantic similarity from ontology has developed by several methods. In this paper, we propose and carry on an approach for the improvement of semantic similarity calculations within a recommender system based-on RDF graphs. △ Less

Submitted 20 July, 2023; originally announced July 2023.

Journal ref: International Conference on Information Technology & Systems. ICITS 2023, Apr 2023, Cusco, Peru. pp.463-474

arXiv:2307.08272 [pdf, other]

ChatGPT is Good but Bing Chat is Better for Vietnamese Students

Authors: Xuan-Quy Dao, Ngoc-Bich Le

Abstract: This study examines the efficacy of two SOTA large language models (LLMs), namely ChatGPT and Microsoft Bing Chat (BingChat), in catering to the needs of Vietnamese students. Although ChatGPT exhibits proficiency in multiple disciplines, Bing Chat emerges as the more advantageous option. We conduct a comparative analysis of their academic achievements in various disciplines, encompassing mathemati… ▽ More This study examines the efficacy of two SOTA large language models (LLMs), namely ChatGPT and Microsoft Bing Chat (BingChat), in catering to the needs of Vietnamese students. Although ChatGPT exhibits proficiency in multiple disciplines, Bing Chat emerges as the more advantageous option. We conduct a comparative analysis of their academic achievements in various disciplines, encompassing mathematics, literature, English language, physics, chemistry, biology, history, geography, and civic education. The results of our study suggest that BingChat demonstrates superior performance compared to ChatGPT across a wide range of subjects, with the exception of literature, where ChatGPT exhibits better performance. Additionally, BingChat utilizes the more advanced GPT-4 technology in contrast to ChatGPT, which is built upon GPT-3.5. This allows BingChat to improve to comprehension, reasoning and generation of creative and informative text. Moreover, the fact that BingChat is accessible in Vietnam and its integration of hyperlinks and citations within responses serve to reinforce its superiority. In our analysis, it is evident that while ChatGPT exhibits praiseworthy qualities, BingChat presents a more apdated solutions for Vietnamese students. △ Less

Submitted 29 July, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: 13 pages; 6 figures

arXiv:2307.04251

ChatGPT in the Age of Generative AI and Large Language Models: A Concise Survey

Authors: Salman Mohamadi, Ghulam Mujtaba, Ngan Le, Gianfranco Doretto, Donald A. Adjeroh

Abstract: ChatGPT is a large language model (LLM) created by OpenAI that has been carefully trained on a large amount of data. It has revolutionized the field of natural language processing (NLP) and has pushed the boundaries of LLM capabilities. ChatGPT has played a pivotal role in enabling widespread public interaction with generative artificial intelligence (GAI) on a large scale. It has also sparked res… ▽ More ChatGPT is a large language model (LLM) created by OpenAI that has been carefully trained on a large amount of data. It has revolutionized the field of natural language processing (NLP) and has pushed the boundaries of LLM capabilities. ChatGPT has played a pivotal role in enabling widespread public interaction with generative artificial intelligence (GAI) on a large scale. It has also sparked research interest in develo** similar technologies and investigating their applications and implications. In this paper, our primary goal is to provide a concise survey on the current lines of research on ChatGPT and its evolution. We considered both the glass box and black box views of ChatGPT, encompassing the components and foundational elements of the technology, as well as its applications, impacts, and implications. The glass box approach focuses on understanding the inner workings of the technology, and the black box approach embraces it as a complex system, and thus examines its inputs, outputs, and effects. This paves the way for a comprehensive exploration of the technology and provides a road map for further research and experimentation. We also lay out essential foundational literature on LLMs and GAI in general and their connection with ChatGPT. This overview sheds light on existing and missing research lines in the emerging field of LLMs, benefiting both public users and developers. Furthermore, the paper delves into the broad spectrum of applications and significant concerns in fields such as education, research, healthcare, finance, etc. △ Less

Submitted 15 July, 2023; v1 submitted 9 July, 2023; originally announced July 2023.

Comments: The paper was uploaded in error, before it was ready for submission. The paper requires a deep revision, and significant changes and modifications before uploading to the archive

arXiv:2307.01844 [pdf, other]

Advancing Wound Filling Extraction on 3D Faces: Auto-Segmentation and Wound Face Regeneration Approach

Authors: Duong Q. Nguyen, Thinh D. Le, Phuong D. Nguyen, Nga T. K. Le, H. Nguyen-Xuan

Abstract: Facial wound segmentation plays a crucial role in preoperative planning and optimizing patient outcomes in various medical applications. In this paper, we propose an efficient approach for automating 3D facial wound segmentation using a two-stream graph convolutional network. Our method leverages the Cir3D-FaIR dataset and addresses the challenge of data imbalance through extensive experimentation… ▽ More Facial wound segmentation plays a crucial role in preoperative planning and optimizing patient outcomes in various medical applications. In this paper, we propose an efficient approach for automating 3D facial wound segmentation using a two-stream graph convolutional network. Our method leverages the Cir3D-FaIR dataset and addresses the challenge of data imbalance through extensive experimentation with different loss functions. To achieve accurate segmentation, we conducted thorough experiments and selected a high-performing model from the trained models. The selected model demonstrates exceptional segmentation performance for complex 3D facial wounds. Furthermore, based on the segmentation model, we propose an improved approach for extracting 3D facial wound fillers and compare it to the results of the previous study. Our method achieved a remarkable accuracy of 0.9999986\% on the test suite, surpassing the performance of the previous method. From this result, we use 3D printing technology to illustrate the shape of the wound filling. The outcomes of this study have significant implications for physicians involved in preoperative planning and intervention design. By automating facial wound segmentation and improving the accuracy of wound-filling extraction, our approach can assist in carefully assessing and optimizing interventions, leading to enhanced patient outcomes. Additionally, it contributes to advancing facial reconstruction techniques by utilizing machine learning and 3D bioprinting for printing skin tissue implants. Our source code is available at \url{https://github.com/SIMOGroup/WoundFilling3D}. △ Less

Submitted 12 July, 2023; v1 submitted 4 July, 2023; originally announced July 2023.

arXiv:2307.01159 [pdf, other]

Soft Grip**: Specifying for Trustworthiness

Authors: Dhaminda B. Abeywickrama, Nguyen Hao Le, Greg Chance, Peter D. Winter, Arianna Manzini, Alix J. Partridge, Jonathan Ives, John Downer, Graham Deacon, Jonathan Rossiter, Kerstin Eder, Shane Windsor

Abstract: Soft robotics is an emerging technology in which engineers create flexible devices for use in a variety of applications. In order to advance the wide adoption of soft robots, ensuring their trustworthiness is essential; if soft robots are not trusted, they will not be used to their full potential. In order to demonstrate trustworthiness, a specification needs to be formulated to define what is tru… ▽ More Soft robotics is an emerging technology in which engineers create flexible devices for use in a variety of applications. In order to advance the wide adoption of soft robots, ensuring their trustworthiness is essential; if soft robots are not trusted, they will not be used to their full potential. In order to demonstrate trustworthiness, a specification needs to be formulated to define what is trustworthy. However, even for soft robotic grippers, which is one of the most mature areas in soft robotics, the soft robotics community has so far given very little attention to formulating specifications. In this work, we discuss the importance of develo** specifications during development of soft robotic systems, and present an extensive example specification for a soft gripper for pick-and-place tasks for grocery items. The proposed specification covers both functional and non-functional requirements, such as reliability, safety, adaptability, predictability, ethics, and regulations. We also highlight the need to promote verifiability as a first-class objective in the design of a soft gripper. △ Less

Submitted 30 October, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

Comments: Updated the Standards subsection of paper. 9 pages, 2 figures, 1 table, 34 references

ACM Class: D.2.1; I.2.9

arXiv:2306.09170 [pdf, ps, other]

Can ChatGPT pass the Vietnamese National High School Graduation Examination?

Authors: Xuan-Quy Dao, Ngoc-Bich Le, Xuan-Dung Phan, Bac-Bien Ngo

Abstract: This research article highlights the potential of AI-powered chatbots in education and presents the results of using ChatGPT, a large language model, to complete the Vietnamese National High School Graduation Examination (VNHSGE). The study dataset included 30 essays in the literature test case and 1,700 multiple-choice questions designed for other subjects. The results showed that ChatGPT was abl… ▽ More This research article highlights the potential of AI-powered chatbots in education and presents the results of using ChatGPT, a large language model, to complete the Vietnamese National High School Graduation Examination (VNHSGE). The study dataset included 30 essays in the literature test case and 1,700 multiple-choice questions designed for other subjects. The results showed that ChatGPT was able to pass the examination with an average score of 6-7, demonstrating the technology's potential to revolutionize the educational landscape. The analysis of ChatGPT performance revealed its proficiency in a range of subjects, including mathematics, English, physics, chemistry, biology, history, geography, civic education, and literature, which suggests its potential to provide effective support for learners. However, further research is needed to assess ChatGPT performance on more complex exam questions and its potential to support learners in different contexts. As technology continues to evolve and improve, we can expect to see the use of AI tools like ChatGPT become increasingly common in educational settings, ultimately enhancing the educational experience for both students and educators. △ Less

Submitted 10 July, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: 9 pages, 13 figures, 4 tables

arXiv:2306.06842 [pdf, other]

AerialFormer: Multi-resolution Transformer for Aerial Image Segmentation

Authors: Kashu Yamazaki, Taisei Hanyu, Minh Tran, Adrian de Luis, Roy McCann, Haitao Liao, Chase Rainwater, Meredith Adkins, Jackson Cothren, Ngan Le

Abstract: Aerial Image Segmentation is a top-down perspective semantic segmentation and has several challenging characteristics such as strong imbalance in the foreground-background distribution, complex background, intra-class heterogeneity, inter-class homogeneity, and tiny objects. To handle these problems, we inherit the advantages of Transformers and propose AerialFormer, which unifies Transformers at… ▽ More Aerial Image Segmentation is a top-down perspective semantic segmentation and has several challenging characteristics such as strong imbalance in the foreground-background distribution, complex background, intra-class heterogeneity, inter-class homogeneity, and tiny objects. To handle these problems, we inherit the advantages of Transformers and propose AerialFormer, which unifies Transformers at the contracting path with lightweight Multi-Dilated Convolutional Neural Networks (MD-CNNs) at the expanding path. Our AerialFormer is designed as a hierarchical structure, in which Transformer encoder outputs multi-scale features and MD-CNNs decoder aggregates information from the multi-scales. Thus, it takes both local and global contexts into consideration to render powerful representations and high-resolution segmentation. We have benchmarked AerialFormer on three common datasets including iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation studies show that our proposed AerialFormer outperforms previous state-of-the-art methods with remarkable performance. Our source code will be publicly available upon acceptance. △ Less

Submitted 1 October, 2023; v1 submitted 11 June, 2023; originally announced June 2023.

Comments: under review

Showing 1–50 of 179 results for author: Le, N