Search | arXiv e-print repository

Surgical Triplet Recognition via Diffusion Model

Authors: Daochang Liu, Axel Hu, Mubarak Shah, Chang Xu

Abstract: Surgical triplet recognition is an essential building block to enable next-generation context-aware operating rooms. The goal is to identify the combinations of instruments, verbs, and targets presented in surgical video frames. In this paper, we propose DiffTriplet, a new generative framework for surgical triplet recognition employing the diffusion model, which predicts surgical triplets via iter… ▽ More Surgical triplet recognition is an essential building block to enable next-generation context-aware operating rooms. The goal is to identify the combinations of instruments, verbs, and targets presented in surgical video frames. In this paper, we propose DiffTriplet, a new generative framework for surgical triplet recognition employing the diffusion model, which predicts surgical triplets via iterative denoising. To handle the challenge of triplet association, two unique designs are proposed in our diffusion framework, i.e., association learning and association guidance. During training, we optimize the model in the joint space of triplets and individual components to capture the dependencies among them. At inference, we integrate association constraints into each update of the iterative denoising process, which refines the triplet prediction using the information of individual components. Experiments on the CholecT45 and CholecT50 datasets show the superiority of the proposed method in achieving a new state-of-the-art performance for surgical triplet recognition. Our codes will be released. △ Less

Submitted 24 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.11409 [pdf, other]

CodeGemma: Open Code Models Based on Gemma

Authors: CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christopher A. Choquette-Choo, **gyue Shen, Joe Kelley, Kshitij Bansal, Luke Vilnis, Mateo Wirth, Paul Michel, Peter Choy, Pratik Joshi, Ravin Kumar, Sarmad Hashmi, Shubham Agrawal, Zhitao Gong, Jane Fine, Tris Warkentin, Ale Jakse Hartman, Bin Ni, Kathy Korevec , et al. (2 additional authors not shown)

Abstract: This paper introduces CodeGemma, a collection of specialized open code models built on top of Gemma, capable of a variety of code and natural language generation tasks. We release three model variants. CodeGemma 7B pretrained (PT) and instruction-tuned (IT) variants have remarkably resilient natural language understanding, excel in mathematical reasoning, and match code capabilities of other open… ▽ More This paper introduces CodeGemma, a collection of specialized open code models built on top of Gemma, capable of a variety of code and natural language generation tasks. We release three model variants. CodeGemma 7B pretrained (PT) and instruction-tuned (IT) variants have remarkably resilient natural language understanding, excel in mathematical reasoning, and match code capabilities of other open models. CodeGemma 2B is a state-of-the-art code completion model designed for fast code infilling and open-ended generation in latency-sensitive settings. △ Less

Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: v1: 11 pages, 4 figures, 5 tables. v2: Update metadata

arXiv:2406.08713 [pdf, other]

Batch-Instructed Gradient for Prompt Evolution:Systematic Prompt Optimization for Enhanced Text-to-Image Synthesis

Authors: Xinrui Yang, Zhuohan Wang, Anthony Hu

Abstract: Text-to-image models have shown remarkable progress in generating high-quality images from user-provided prompts. Despite this, the quality of these images varies due to the models' sensitivity to human language nuances. With advancements in large language models, there are new opportunities to enhance prompt design for image generation tasks. Existing research primarily focuses on optimizing prom… ▽ More Text-to-image models have shown remarkable progress in generating high-quality images from user-provided prompts. Despite this, the quality of these images varies due to the models' sensitivity to human language nuances. With advancements in large language models, there are new opportunities to enhance prompt design for image generation tasks. Existing research primarily focuses on optimizing prompts for direct interaction, while less attention is given to scenarios involving intermediary agents, like the Stable Diffusion model. This study proposes a Multi-Agent framework to optimize input prompts for text-to-image generation models. Central to this framework is a prompt generation mechanism that refines initial queries using dynamic instructions, which evolve through iterative performance feedback. High-quality prompts are then fed into a state-of-the-art text-to-image model. A professional prompts database serves as a benchmark to guide the instruction modifier towards generating high-caliber prompts. A scoring system evaluates the generated images, and an LLM generates new instructions based on calculated gradients. This iterative process is managed by the Upper Confidence Bound (UCB) algorithm and assessed using the Human Preference Score version 2 (HPS v2). Preliminary ablation studies highlight the effectiveness of various system components and suggest areas for future improvements. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2405.18892 [pdf, other]

EVM Analysis of Distributed Massive MIMO with 1-Bit Radio-Over-Fiber Fronthaul

Authors: Anzhong Hu, Lise Aabel, Giuseppe Durisi, Sven Jacobsson, Mikael Coldrey, Christian Fager, Christoph Studer

Abstract: We analyze the uplink performance of a distributed massive multiple-input multiple-output (MIMO) architecture in which the remotely located access points (APs) are connected to a central processing unit via a fiber-optical fronthaul carrying a dithered and 1-bit quantized version of the received radio-frequency (RF) signal. The innovative feature of the proposed architecture is that no down-conver… ▽ More We analyze the uplink performance of a distributed massive multiple-input multiple-output (MIMO) architecture in which the remotely located access points (APs) are connected to a central processing unit via a fiber-optical fronthaul carrying a dithered and 1-bit quantized version of the received radio-frequency (RF) signal. The innovative feature of the proposed architecture is that no down-conversion is performed at the APs. This eliminates the need to equip the APs with local oscillators, which may be difficult to synchronize. Under the assumption that a constraint is imposed on the amount of data that can be exchanged across the fiber-optical fronthaul, we investigate the tradeoff between spatial oversampling, defined in terms of the total number of APs, and temporal oversampling, defined in terms of the oversampling factor selected at the central processing unit, to facilitate the recovery of the transmitted signal from 1-bit samples of the RF received signal. Using the so-called error-vector magnitude (EVM) as performance metric, we shed light on the optimal design of the dither signal, and quantify, for a given number of APs, the minimum fronthaul rate required for our proposed distributed massive MIMO architecture to outperform a standard co-located massive MIMO architecture in terms of EVM. △ Less

Submitted 29 May, 2024; originally announced May 2024.

Comments: To appear in IEEE Transactions on Communications

arXiv:2405.00282 [pdf, ps, other]

MF-OML: Online Mean-Field Reinforcement Learning with Occupation Measures for Large Population Games

Authors: Anran Hu, Junzi Zhang

Abstract: Reinforcement learning for multi-agent games has attracted lots of attention recently. However, given the challenge of solving Nash equilibria for large population games, existing works with guaranteed polynomial complexities either focus on variants of zero-sum and potential games, or aim at solving (coarse) correlated equilibria, or require access to simulators, or rely on certain assumptions th… ▽ More Reinforcement learning for multi-agent games has attracted lots of attention recently. However, given the challenge of solving Nash equilibria for large population games, existing works with guaranteed polynomial complexities either focus on variants of zero-sum and potential games, or aim at solving (coarse) correlated equilibria, or require access to simulators, or rely on certain assumptions that are hard to verify. This work proposes MF-OML (Mean-Field Occupation-Measure Learning), an online mean-field reinforcement learning algorithm for computing approximate Nash equilibria of large population sequential symmetric games. MF-OML is the first fully polynomial multi-agent reinforcement learning algorithm for provably solving Nash equilibria (up to mean-field approximation gaps that vanish as the number of players $N$ goes to infinity) beyond variants of zero-sum and potential games. When evaluated by the cumulative deviation from Nash equilibria, the algorithm is shown to achieve a high probability regret bound of $\tilde{O}(M^{3/4}+N^{-1/2}M)$ for games with the strong Lasry-Lions monotonicity condition, and a regret bound of $\tilde{O}(M^{11/12}+N^{- 1/6}M)$ for games with only the Lasry-Lions monotonicity condition, where $M$ is the total number of episodes and $N$ is the number of agents of the game. As a byproduct, we also obtain the first tractable globally convergent computational algorithm for computing approximate Nash equilibria of monotone mean-field games. △ Less

Submitted 30 April, 2024; originally announced May 2024.

arXiv:2404.18670 [pdf, other]

Enhancing Uncertain Demand Prediction in Hospitals Using Simple and Advanced Machine Learning

Authors: Annie Hu, Samuel Stockman, Xun Wu, Richard Wood, Bangdong Zhi, Oliver Y. Chén

Abstract: Early and timely prediction of patient care demand not only affects effective resource allocation but also influences clinical decision-making as well as patient experience. Accurately predicting patient care demand, however, is a ubiquitous challenge for hospitals across the world due, in part, to the demand's time-varying temporal variability, and, in part, to the difficulty in modelling trends… ▽ More Early and timely prediction of patient care demand not only affects effective resource allocation but also influences clinical decision-making as well as patient experience. Accurately predicting patient care demand, however, is a ubiquitous challenge for hospitals across the world due, in part, to the demand's time-varying temporal variability, and, in part, to the difficulty in modelling trends in advance. To address this issue, here, we develop two methods, a relatively simple time-vary linear model, and a more advanced neural network model. The former forecasts patient arrivals hourly over a week based on factors such as day of the week and previous 7-day arrival patterns. The latter leverages a long short-term memory (LSTM) model, capturing non-linear relationships between past data and a three-day forecasting window. We evaluate the predictive capabilities of the two proposed approaches compared to two naïve approaches - a reduced-rank vector autoregressive (VAR) model and the TBATS model. Using patient care demand data from Rambam Medical Center in Israel, our results show that both proposed models effectively capture hourly variations of patient demand. Additionally, the linear model is more explainable thanks to its simple architecture, whereas, by accurately modelling weekly seasonal trends, the LSTM model delivers lower prediction errors. Taken together, our explorations suggest the utility of machine learning in predicting time-varying patient care demand; additionally, it is possible to predict patient care demand with good accuracy (around 4 patients) three days or a week in advance using machine learning. △ Less

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.16635 [pdf, other]

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Authors: Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin **, Ji Zhang, Fei Huang

Abstract: Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficien… ▽ More Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) reduce the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) reduce lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves SOTA performance on a variety of chart understanding benchmarks including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLM with up to 13B parameters such as ChartLlama and ChartAst, and close-sourced general-purpose MLLM GPT-4V on ChartQA. It also demonstrates its superior efficiency with higher throughput during inference due to a smaller model scale and more efficient vision encoding. Our code and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/TinyChart. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: 13 pages, 11 figures

arXiv:2404.14705 [pdf, other]

Think-Program-reCtify: 3D Situated Reasoning with Large Language Models

Authors: Qingrong He, Kejun Lin, Shizhe Chen, Anwen Hu, Qin **

Abstract: This work addresses the 3D situated reasoning task which aims to answer questions given egocentric observations in a 3D environment. The task remains challenging as it requires comprehensive 3D perception and complex reasoning skills. End-to-end models trained on supervised data for 3D situated reasoning suffer from data scarcity and generalization ability. Inspired by the recent success of levera… ▽ More This work addresses the 3D situated reasoning task which aims to answer questions given egocentric observations in a 3D environment. The task remains challenging as it requires comprehensive 3D perception and complex reasoning skills. End-to-end models trained on supervised data for 3D situated reasoning suffer from data scarcity and generalization ability. Inspired by the recent success of leveraging large language models (LLMs) for visual reasoning, we propose LLM-TPC, a novel framework that leverages the planning, tool usage, and reflection capabilities of LLMs through a ThinkProgram-reCtify loop. The Think phase first decomposes the compositional question into a sequence of steps, and then the Program phase grounds each step to a piece of code and calls carefully designed 3D visual perception modules. Finally, the Rectify phase adjusts the plan and code if the program fails to execute. Experiments and analysis on the SQA3D benchmark demonstrate the effectiveness, interpretability and robustness of our method. Our code is publicly available at https://qingrongh.github.io/LLM-TPC/. △ Less

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.11844 [pdf, ps, other]

Finding A Taxi with Illegal Driver Substitution Activity via Behavior Modelings

Authors: Junbiao Pang, Muhammad Ayub Sabir, Zhuyun Wang, An**g Hu, Xue Yang, Haitao Yu, Qingming Huang

Abstract: In our urban life, Illegal Driver Substitution (IDS) activity for a taxi is a grave unlawful activity in the taxi industry, possibly causing severe traffic accidents and painful social repercussions. Currently, the IDS activity is manually supervised by law enforcers, i.e., law enforcers empirically choose a taxi and inspect it. The pressing problem of this scheme is the dilemma between the limite… ▽ More In our urban life, Illegal Driver Substitution (IDS) activity for a taxi is a grave unlawful activity in the taxi industry, possibly causing severe traffic accidents and painful social repercussions. Currently, the IDS activity is manually supervised by law enforcers, i.e., law enforcers empirically choose a taxi and inspect it. The pressing problem of this scheme is the dilemma between the limited number of law-enforcers and the large volume of taxis. In this paper, motivated by this problem, we propose a computational method that helps law enforcers efficiently find the taxis which tend to have the IDS activity. Firstly, our method converts the identification of the IDS activity to a supervised learning task. Secondly, two kinds of taxi driver behaviors, i.e., the Slee** Time and Location (STL) behavior and the Pick-Up (PU) behavior are proposed. Thirdly, the multiple scale pooling on self-similarity is proposed to encode the individual behaviors into the universal features for all taxis. Finally, a Multiple Component- Multiple Instance Learning (MC-MIL) method is proposed to handle the deficiency of the behavior features and to align the behavior features simultaneously. Extensive experiments on a real-world data set shows that the proposed behavior features have a good generalization ability across different classifiers, and the proposed MC-MIL method suppresses the baseline methods. △ Less

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.07545 [pdf, other]

Stereo-LiDAR Depth Estimation with Deformable Propagation and Learned Disparity-Depth Conversion

Authors: Ang Li, Anning Hu, Wei Xi, Wenxian Yu, Dan** Zou

Abstract: Accurate and dense depth estimation with stereo cameras and LiDAR is an important task for automatic driving and robotic perception. While sparse hints from LiDAR points have improved cost aggregation in stereo matching, their effectiveness is limited by the low density and non-uniform distribution. To address this issue, we propose a novel stereo-LiDAR depth estimation network with Semi-Dense hin… ▽ More Accurate and dense depth estimation with stereo cameras and LiDAR is an important task for automatic driving and robotic perception. While sparse hints from LiDAR points have improved cost aggregation in stereo matching, their effectiveness is limited by the low density and non-uniform distribution. To address this issue, we propose a novel stereo-LiDAR depth estimation network with Semi-Dense hint Guidance, named SDG-Depth. Our network includes a deformable propagation module for generating a semi-dense hint map and a confidence map by propagating sparse hints using a learned deformable window. These maps then guide cost aggregation in stereo matching. To reduce the triangulation error in depth recovery from disparity, especially in distant regions, we introduce a disparity-depth conversion module. Our method is both accurate and efficient. The experimental results on benchmark tests show its superior performance. Our code is available at https://github.com/SJTU-ViSYS/SDG-Depth. △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: Accepted in ICRA 2024. 8 pages, 6 figures

arXiv:2403.12895 [pdf, other]

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Authors: Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin **, Fei Huang, **gren Zhou

Abstract: Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure informatio… ▽ More Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features by merging horizontal adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set DocStruct4M to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset DocReason25K to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points in 5/10 benchmarks. Our codes, models, and datasets are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5. △ Less

Submitted 19 March, 2024; originally announced March 2024.

Comments: 21 pages, 15 figures

arXiv:2403.12693 [pdf, other]

As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?

Authors: Anjun Hu, **dong Gu, Francesco Pinto, Konstantinos Kamnitsas, Philip Torr

Abstract: Foundation models pre-trained on web-scale vision-language data, such as CLIP, are widely used as cornerstones of powerful machine learning systems. While pre-training offers clear advantages for downstream learning, it also endows downstream models with shared adversarial vulnerabilities that can be easily identified through the open-sourced foundation model. In this work, we expose such vulnerab… ▽ More Foundation models pre-trained on web-scale vision-language data, such as CLIP, are widely used as cornerstones of powerful machine learning systems. While pre-training offers clear advantages for downstream learning, it also endows downstream models with shared adversarial vulnerabilities that can be easily identified through the open-sourced foundation model. In this work, we expose such vulnerabilities in CLIP's downstream models and show that foundation models can serve as a basis for attacking their downstream systems. In particular, we propose a simple yet effective adversarial attack strategy termed Patch Representation Misalignment (PRM). Solely based on open-sourced CLIP vision encoders, this method produces adversaries that simultaneously fool more than 20 downstream models spanning 4 common vision-language tasks (semantic segmentation, object detection, image captioning and visual question-answering). Our findings highlight the concerning safety risks introduced by the extensive usage of public foundational models in the development of downstream systems, calling for extra caution in these scenarios. △ Less

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.06284 [pdf]

Develo** an AI-Based Psychometric System for Assessing Learning Difficulties and Adaptive System to Overcome: A Qualitative and Conceptual Framework

Authors: Aaron Hu

Abstract: Learning difficulties pose significant challenges for students, impacting their academic performance and overall educational experience. These difficulties could sometimes put students into a downward spiral that lack of educational resources for personalized support consistently led to under-accommodation of students special needs, and the student lose opportunities in the longer term academic an… ▽ More Learning difficulties pose significant challenges for students, impacting their academic performance and overall educational experience. These difficulties could sometimes put students into a downward spiral that lack of educational resources for personalized support consistently led to under-accommodation of students special needs, and the student lose opportunities in the longer term academic and work development. This research aims to propose a conceptual framework for an adaptive AI-based virtual tutor system that incorporates psychometric assessment to support students with learning difficulties. This process involves the careful selection and integration of validated current mature psychometric scales that assess key dimensions of learning, such as cognitive abilities, learning styles, and academic skills. By incorporating scales that specifically assess these difficulties, the psychometric test will provide a comprehensive understanding of each students unique learning profile and inform targeted interventions within the adaptive tutoring system. The paper also proposes using autoencoders to identify the latent patterns to generate the students profile vector for collection of psychometric data, defining state space and action space representing the students desired combination of images, sound and text engagements, employing extended Bayesian knowledge tracing and hierarchical model and Metropolis-Hastings to continuously estimate and monitor the students performance in various psychometric constructs. The proposed system will leverage the capabilities of LLMs, visual generation models, and psychometric assessments to provide personalized instruction and support tailored to each students unique learning characteristics and needs. △ Less

Submitted 10 March, 2024; originally announced March 2024.

Comments: This is a working paper

arXiv:2401.10703 [pdf, other]

DRAT Proofs of Unsatisfiability for SAT Modulo Monotonic Theories

Authors: Nick Feng, Alan J. Hu, Sam Bayless, Syed M. Iqbal, Patrick Trentin, Mike Whalen, Lee Pike, John Backes

Abstract: Generating proofs of unsatisfiability is a valuable capability of most SAT solvers, and is an active area of research for SMT solvers. This paper introduces the first method to efficiently generate proofs of unsatisfiability specifically for an important subset of SMT: SAT Modulo Monotonic Theories (SMMT), which includes many useful finite-domain theories (e.g., bit vectors and many graph-theoreti… ▽ More Generating proofs of unsatisfiability is a valuable capability of most SAT solvers, and is an active area of research for SMT solvers. This paper introduces the first method to efficiently generate proofs of unsatisfiability specifically for an important subset of SMT: SAT Modulo Monotonic Theories (SMMT), which includes many useful finite-domain theories (e.g., bit vectors and many graph-theoretic properties) and is used in production at Amazon Web Services. Our method uses propositional definitions of the theory predicates, from which it generates compact Horn approximations of the definitions, which lead to efficient DRAT proofs, leveraging the large investment the SAT community has made in DRAT. In experiments on practical SMMT problems, our proof generation overhead is minimal (7.41% geometric mean slowdown, 28.8% worst-case), and we can generate and check proofs for many problems that were previously intractable. △ Less

Submitted 18 April, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

arXiv:2401.10314 [pdf, other]

LangProp: A code optimization framework using Large Language Models applied to driving

Authors: Shu Ishida, Gianluca Corrado, George Fedoseev, Hudson Yeo, Lloyd Russell, Jamie Shotton, João F. Henriques, Anthony Hu

Abstract: We propose LangProp, a framework for iteratively optimizing code generated by large language models (LLMs), in both supervised and reinforcement learning settings. While LLMs can generate sensible coding solutions zero-shot, they are often sub-optimal. Especially for code generation tasks, it is likely that the initial code will fail on certain edge cases. LangProp automatically evaluates the code… ▽ More We propose LangProp, a framework for iteratively optimizing code generated by large language models (LLMs), in both supervised and reinforcement learning settings. While LLMs can generate sensible coding solutions zero-shot, they are often sub-optimal. Especially for code generation tasks, it is likely that the initial code will fail on certain edge cases. LangProp automatically evaluates the code performance on a dataset of input-output pairs, catches any exceptions, and feeds the results back to the LLM in the training loop, so that the LLM can iteratively improve the code it generates. By adopting a metric- and data-driven training paradigm for this code optimization procedure, one could easily adapt findings from traditional machine learning techniques such as imitation learning, DAgger, and reinforcement learning. We show LangProp's applicability to general domains such as Sudoku and CartPole, as well as demonstrate the first proof of concept of automated code optimization for autonomous driving in CARLA. We show that LangProp can generate interpretable and transparent policies that can be verified and improved in a metric- and data-driven way. Our code is available at https://github.com/shuishida/LangProp. △ Less

Submitted 3 May, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

arXiv:2401.08433 [pdf, other]

Autonomous Multiple-Trolley Collection System with Nonholonomic Robots: Design, Control, and Implementation

Authors: Peijia Xie, Bingyi Xia, Anjun Hu, Ziqi Zhao, Lingxiao Meng, Zhirui Sun, Xuheng Gao, Jiankun Wang, Max Q. -H. Meng

Abstract: The intricate and multi-stage task in dynamic public spaces like luggage trolley collection in airports presents both a promising opportunity and an ongoing challenge for automated service robots. Previous research has primarily focused on handling a single trolley or individual functional components, creating a gap in providing cost-effective and efficient solutions for practical scenarios. In th… ▽ More The intricate and multi-stage task in dynamic public spaces like luggage trolley collection in airports presents both a promising opportunity and an ongoing challenge for automated service robots. Previous research has primarily focused on handling a single trolley or individual functional components, creating a gap in providing cost-effective and efficient solutions for practical scenarios. In this paper, we propose a mobile manipulation robot incorporated with an autonomy framework for the collection and transportation of multiple trolleys that can significantly enhance operational efficiency. We address the key challenges in the trolley collection problem through the novel design of the mechanical system and the vision-based control strategy. We design a lightweight manipulator and docking mechanism, optimized for the sequential stacking and transportation of multiple trolleys. Additionally, based on the Control Lyapunov Function and Control Barrier Function, we propose a novel vision-based control with the online Quadratic Programming which significantly improves the accuracy and efficiency of the collection process. The practical application of our system is demonstrated in real world scenarios, where it successfully executes multiple-trolley collection tasks. △ Less

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2401.03346 [pdf, ps, other]

An Investigation of Large Language Models for Real-World Hate Speech Detection

Authors: Keyan Guo, Alexander Hu, Jaden Mu, Ziheng Shi, Ziming Zhao, Nishant Vishwamitra, Hongxin Hu

Abstract: Hate speech has emerged as a major problem plaguing our social spaces today. While there have been significant efforts to address this problem, existing methods are still significantly limited in effectively detecting hate speech online. A major limitation of existing methods is that hate speech detection is a highly contextual problem, and these methods cannot fully capture the context of hate sp… ▽ More Hate speech has emerged as a major problem plaguing our social spaces today. While there have been significant efforts to address this problem, existing methods are still significantly limited in effectively detecting hate speech online. A major limitation of existing methods is that hate speech detection is a highly contextual problem, and these methods cannot fully capture the context of hate speech to make accurate predictions. Recently, large language models (LLMs) have demonstrated state-of-the-art performance in several natural language tasks. LLMs have undergone extensive training using vast amounts of natural language data, enabling them to grasp intricate contextual details. Hence, they could be used as knowledge bases for context-aware hate speech detection. However, a fundamental problem with using LLMs to detect hate speech is that there are no studies on effectively prompting LLMs for context-aware hate speech detection. In this study, we conduct a large-scale study of hate speech detection, employing five established hate speech datasets. We discover that LLMs not only match but often surpass the performance of current benchmark machine learning models in identifying hate speech. By proposing four diverse prompting strategies that optimize the use of LLMs in detecting hate speech. Our study reveals that a meticulously crafted reasoning prompt can effectively capture the context of hate speech by fully utilizing the knowledge base in LLMs, significantly outperforming existing techniques. Furthermore, although LLMs can provide a rich knowledge base for the contextual detection of hate speech, suitable prompting strategies play a crucial role in effectively leveraging this knowledge base for efficient detection. △ Less

Submitted 6 January, 2024; originally announced January 2024.

Comments: Accepted for publication on 22nd International Conference of Machine Learning and Applications, ICMLA 2023

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr… ▽ More This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI. △ Less

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2311.18248 [pdf, other]

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

Authors: Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang

Abstract: Recently, the strong text creation ability of Large Language Models(LLMs) has given rise to many tools for assisting paper reading or even writing. However, the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit their application scenarios, especially for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly fo… ▽ More Recently, the strong text creation ability of Large Language Models(LLMs) has given rise to many tools for assisting paper reading or even writing. However, the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit their application scenarios, especially for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs. By parsing Latex source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper. By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or Latex codes. Besides, to better align the copilot with the user's intention, we introduce the `outline' as the control signal, which could be directly given by the user or revised based on auto-generated ones. Comprehensive experiments with a state-of-the-art Mumtimodal LLM demonstrate that training on our dataset shows stronger scientific diagram understanding performance, including diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl. △ Less

Submitted 9 January, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: 20 pages, 12 figures

arXiv:2311.04257 [pdf, other]

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Authors: Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, **gren Zhou

Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks… ▽ More Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models. △ Less

Submitted 8 November, 2023; v1 submitted 7 November, 2023; originally announced November 2023.

arXiv:2310.17626 [pdf, ps, other]

A Survey on Transferability of Adversarial Examples across Deep Neural Networks

Authors: **dong Gu, Xiaojun Jia, Pau de Jorge, Wenqain Yu, Xinwei Liu, Avery Ma, Yuan Xun, Anjun Hu, Ashkan Khakzar, Zhijiang Li, Xiaochun Cao, Philip Torr

Abstract: The emergence of Deep Neural Networks (DNNs) has revolutionized various domains by enabling the resolution of complex tasks spanning image recognition, natural language processing, and scientific problem-solving. However, this progress has also brought to light a concerning vulnerability: adversarial examples. These crafted inputs, imperceptible to humans, can manipulate machine learning models in… ▽ More The emergence of Deep Neural Networks (DNNs) has revolutionized various domains by enabling the resolution of complex tasks spanning image recognition, natural language processing, and scientific problem-solving. However, this progress has also brought to light a concerning vulnerability: adversarial examples. These crafted inputs, imperceptible to humans, can manipulate machine learning models into making erroneous predictions, raising concerns for safety-critical applications. An intriguing property of this phenomenon is the transferability of adversarial examples, where perturbations crafted for one model can deceive another, often with a different architecture. This intriguing property enables black-box attacks which circumvents the need for detailed knowledge of the target model. This survey explores the landscape of the adversarial transferability of adversarial examples. We categorize existing methodologies to enhance adversarial transferability and discuss the fundamental principles guiding each approach. While the predominant body of research primarily concentrates on image classification, we also extend our discussion to encompass other vision tasks and beyond. Challenges and opportunities are discussed, highlighting the importance of fortifying DNNs against adversarial vulnerabilities in an evolving landscape. △ Less

Submitted 1 May, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

Comments: Accepted to Transactions on Machine Learning Research (TMLR)

arXiv:2310.10656 [pdf, other]

VeriDIP: Verifying Ownership of Deep Neural Networks through Privacy Leakage Fingerprints

Authors: Aoting Hu, Zhigang Lu, Renjie Xie, Minhui Xue

Abstract: Deploying Machine Learning as a Service gives rise to model plagiarism, leading to copyright infringement. Ownership testing techniques are designed to identify model fingerprints for verifying plagiarism. However, previous works often rely on overfitting or robustness features as fingerprints, lacking theoretical guarantees and exhibiting under-performance on generalized models. In this paper, we… ▽ More Deploying Machine Learning as a Service gives rise to model plagiarism, leading to copyright infringement. Ownership testing techniques are designed to identify model fingerprints for verifying plagiarism. However, previous works often rely on overfitting or robustness features as fingerprints, lacking theoretical guarantees and exhibiting under-performance on generalized models. In this paper, we propose a novel ownership testing method called VeriDIP, which verifies a DNN model's intellectual property. VeriDIP makes two major contributions. (1) It utilizes membership inference attacks to estimate the lower bound of privacy leakage, which reflects the fingerprint of a given model. The privacy leakage fingerprints highlight the unique patterns through which the models memorize sensitive training datasets. (2) We introduce a novel approach using less private samples to enhance the performance of ownership testing. Extensive experimental results confirm that VeriDIP is effective and efficient in validating the ownership of deep learning models trained on both image and tabular datasets. VeriDIP achieves comparable performance to state-of-the-art methods on image datasets while significantly reducing computation and communication costs. Enhanced VeriDIP demonstrates superior verification performance on generalized deep learning models, particularly on table-trained models. Additionally, VeriDIP exhibits similar effectiveness on utility-preserving differentially private models compared to non-differentially private baselines. △ Less

Submitted 6 September, 2023; originally announced October 2023.

arXiv:2310.10219 [pdf, other]

Using Global Land Cover Product as Prompt for Cropland Map** via Visual Foundation Model

Authors: Chao Tao, Aoran Hu, Rong Xiao, Haifeng Li, Yuze Wang

Abstract: Data-driven deep learning methods have shown great potential in cropland map**. However, due to multiple factors such as attributes of cropland (topography, climate, crop type) and imaging conditions (viewing angle, illumination, scale), croplands under different scenes demonstrate a great domain gap. This makes it difficult for models trained in the specific scenes to directly generalize to oth… ▽ More Data-driven deep learning methods have shown great potential in cropland map**. However, due to multiple factors such as attributes of cropland (topography, climate, crop type) and imaging conditions (viewing angle, illumination, scale), croplands under different scenes demonstrate a great domain gap. This makes it difficult for models trained in the specific scenes to directly generalize to other scenes. A common way to handle this problem is through the "Pretrain+Fine-tuning" paradigm. Unfortunately, considering the variety of features of cropland that are affected by multiple factors, it is hardly to handle the complex domain gap between pre-trained data and target data using only sparse fine-tuned samples as general constraints. Moreover, as the number of model parameters grows, fine-tuning is no longer an easy and low-cost task. With the emergence of prompt learning via visual foundation models, the "Pretrain+Prompting" paradigm redesigns the optimization target by introducing individual prompts for each single sample. This simplifies the domain adaption from generic to specific scenes during model reasoning processes. Therefore, we introduce the "Pretrain+Prompting" paradigm to interpreting cropland scenes and design the auto-prompting (APT) method based on freely available global land cover product. It can achieve a fine-grained adaptation process from generic scenes to specialized cropland scenes without introducing additional label costs. To our best knowledge, this work pioneers the exploration of the domain adaption problems for cropland map** under prompt learning perspectives. Our experiments using two sub-meter cropland datasets from southern and northern China demonstrated that the proposed method via visual foundation models outperforms traditional supervised learning and fine-tuning approaches in the field of remote sensing. △ Less

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.05126 [pdf, other]

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Authors: Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin **, Liang He, Xin Alex Lin, Fei Huang

Abstract: Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we only finetuned 1.2% parameters and… ▽ More Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we only finetuned 1.2% parameters and the training cost is much lower than previous work following domain-specific pretraining and finetuning paradigms. Concretely, UReader is jointly finetuned on a wide range of Visually-situated Language Understanding tasks via a unified instruction format. To enhance the visual text and semantic understanding, we further apply two auxiliary tasks with the same format, namely text reading and key points generation tasks. We design a shape-adaptive crop** module before the encoder-decoder architecture of MLLM to leverage the frozen low-resolution vision encoder for processing high-resolution images. Without downstream finetuning, our single model achieves state-of-the-art ocr-free performance in 8 out of 10 visually-situated language understanding tasks, across 5 domains: documents, tables, charts, natural images, and webpage screenshots. Codes and instruction-tuning datasets will be released. △ Less

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2309.17080 [pdf, other]

GAIA-1: A Generative World Model for Autonomous Driving

Authors: Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, Gianluca Corrado

Abstract: Autonomous driving promises transformative improvements to transportation, but building systems capable of safely navigating the unstructured complexity of real-world scenarios remains challenging. A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle's actions as the world evolves. To address this challenge, we introduce GAIA… ▽ More Autonomous driving promises transformative improvements to transportation, but building systems capable of safely navigating the unstructured complexity of real-world scenarios remains challenging. A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle's actions as the world evolves. To address this challenge, we introduce GAIA-1 ('Generative AI for Autonomy'), a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios while offering fine-grained control over ego-vehicle behavior and scene features. Our approach casts world modeling as an unsupervised sequence modeling problem by map** the inputs to discrete tokens, and predicting the next token in the sequence. Emerging properties from our model include learning high-level structures and scene dynamics, contextual awareness, generalization, and understanding of geometry. The power of GAIA-1's learned representation that captures expectations of future events, combined with its ability to generate realistic samples, provides new possibilities for innovation in the field of autonomy, enabling enhanced and accelerated training of autonomous driving technology. △ Less

Submitted 29 September, 2023; originally announced September 2023.

Comments: Technical Report

arXiv:2308.10447 [pdf, other]

Explore and Tell: Embodied Visual Captioning in 3D Environments

Authors: Anwen Hu, Shizhe Chen, Liang Zhang, Qin **

Abstract: While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual capt… ▽ More While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints. Specifically, starting at a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT), which comprises of a navigator and a captioner, to tackle this task. The navigator predicts which actions to take in the environment, while the captioner generates a paragraph description based on the whole navigation trajectory. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, codes and models are available at https://aim3-ruc.github.io/ExploreAndTell. △ Less

Submitted 20 August, 2023; originally announced August 2023.

Comments: 12 pages; 10 figures; ICCV 2023

arXiv:2307.02499 [pdf, other]

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Authors: Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang

Abstract: Document understanding refers to automatically extract, analyze and comprehend information from various types of digital documents, such as a web page. Existing Multi-model Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without… ▽ More Document understanding refers to automatically extract, analyze and comprehend information from various types of digital documents, such as a web page. Existing Multi-model Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl based on mPLUG-Owl for OCR-free document understanding. Specifically, we first construct a instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly train the model on language-only, general vision-and-language, and document instruction tuning dataset with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set LLMDoc to better compare models' capabilities on instruct compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong ability of document understanding. Besides, without specific fine-tuning, mPLUG-DocOwl generalizes well on various downstream tasks. Our code, models, training data and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl. △ Less

Submitted 4 July, 2023; originally announced July 2023.

Comments: 10 pages, 8 figures

arXiv:2306.13460 [pdf, other]

Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation

Authors: Zihao Yue, Anwen Hu, Liang Zhang, Qin **

Abstract: Image captioning aims to describe visual content in natural language. As 'a picture is worth a thousand words', there could be various correct descriptions for an image. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction mismatches with the label. For instance, when the model predicts a word expressing richer semantics t… ▽ More Image captioning aims to describe visual content in natural language. As 'a picture is worth a thousand words', there could be various correct descriptions for an image. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction mismatches with the label. For instance, when the model predicts a word expressing richer semantics than the label, it will be penalized and optimized to prefer more concise expressions, referred to as conciseness optimization. In contrast, predictions that are more concise than labels lead to richness optimization. Such conflicting optimization directions could eventually result in the model generating general descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus encouraging the model to generate longer captions with more details. Extensive experiments on two mainstream image captioning datasets MSCOCO and Flickr30K demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works. △ Less

Submitted 28 October, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

Comments: Accepted to NeurIPS 2023

arXiv:2306.09179 [pdf, other]

Neural World Models for Computer Vision

Authors: Anthony Hu

Abstract: Humans navigate in their environment by learning a mental model of the world through passive observation and active interaction. Their world model allows them to anticipate what might happen next and act accordingly with respect to an underlying objective. Such world models hold strong promises for planning in complex environments like in autonomous driving. A human driver, or a self-driving syste… ▽ More Humans navigate in their environment by learning a mental model of the world through passive observation and active interaction. Their world model allows them to anticipate what might happen next and act accordingly with respect to an underlying objective. Such world models hold strong promises for planning in complex environments like in autonomous driving. A human driver, or a self-driving system, perceives their surroundings with their eyes or their cameras. They infer an internal representation of the world which should: (i) have spatial memory (e.g. occlusions), (ii) fill partially observable or noisy inputs (e.g. when blinded by sunlight), and (iii) be able to reason about unobservable events probabilistically (e.g. predict different possible futures). They are embodied intelligent agents that can predict, plan, and act in the physical world through their world model. In this thesis we present a general framework to train a world model and a policy, parameterised by deep neural networks, from camera observations and expert demonstrations. We leverage important computer vision concepts such as geometry, semantics, and motion to scale world models to complex urban driving scenes. First, we propose a model that predicts important quantities in computer vision: depth, semantic segmentation, and optical flow. We then use 3D geometry as an inductive bias to operate in the bird's-eye view space. We present for the first time a model that can predict probabilistic future trajectories of dynamic agents in bird's-eye view from 360° surround monocular cameras only. Finally, we demonstrate the benefits of learning a world model in closed-loop driving. Our model can jointly predict static scene, dynamic scene, and ego-behaviour in an urban driving environment. △ Less

Submitted 15 June, 2023; originally announced June 2023.

Comments: PhD thesis

arXiv:2306.04362 [pdf, other]

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

Authors: Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang

Abstract: To promote the development of Vision-Language Pre-training (VLP) and multimodal Large Language Model (LLM) in the Chinese community, we firstly release the largest public Chinese high-quality video-language dataset named Youku-mPLUG, which is collected from Youku, a well-known Chinese video-sharing website, with strict criteria of safety, diversity, and quality. Youku-mPLUG contains 10 million Chi… ▽ More To promote the development of Vision-Language Pre-training (VLP) and multimodal Large Language Model (LLM) in the Chinese community, we firstly release the largest public Chinese high-quality video-language dataset named Youku-mPLUG, which is collected from Youku, a well-known Chinese video-sharing website, with strict criteria of safety, diversity, and quality. Youku-mPLUG contains 10 million Chinese video-text pairs filtered from 400 million raw videos across a wide range of 45 diverse categories for large-scale pre-training. In addition, to facilitate a comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification. Youku-mPLUG can enable researchers to conduct more in-depth multimodal research and develop better applications in the future. Furthermore, we release popular video-language pre-training models, ALPRO and mPLUG-2, and our proposed modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG. Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1% improvement in video category classification. Besides, mPLUG-video achieves a new state-of-the-art result on these benchmarks with 80.5% top-1 accuracy in video category classification and 68.9 CIDEr score in video captioning, respectively. Finally, we scale up mPLUG-video based on the frozen Bloomz with only 1.7% trainable parameters as Chinese multimodal LLM, and demonstrate impressive instruction and video understanding ability. The zero-shot instruction understanding experiment indicates that pretraining with Youku-mPLUG can enhance the ability to comprehend overall and detailed visual semantics, recognize scene text, and leverage open-domain knowledge. △ Less

Submitted 7 June, 2023; originally announced June 2023.

Comments: Working in progress

arXiv:2305.12817 [pdf, other]

Conservative Physics-Informed Neural Networks for Non-Conservative Hyperbolic Conservation Laws Near Critical States

Authors: Reyna Quita, Yu-Shuo Chen, Hsin-Yi Lee Alex C. Hu, John M. Hong

Abstract: In this paper, a modified version of conservative Physics-informed Neural Networks (cPINN for short) is provided to construct the weak solutions of Riemann problem for the hyperbolic scalar conservation laws in non-conservative form. To demonstrate the results, we use the model of generalized Buckley-Leverett equation (GBL equation for short) with discontinuous porosity in porous media. By inventi… ▽ More In this paper, a modified version of conservative Physics-informed Neural Networks (cPINN for short) is provided to construct the weak solutions of Riemann problem for the hyperbolic scalar conservation laws in non-conservative form. To demonstrate the results, we use the model of generalized Buckley-Leverett equation (GBL equation for short) with discontinuous porosity in porous media. By inventing a new unknown, the GBL equation is transformed into a two-by-two resonant hyperbolic conservation laws in conservative form. The modified method of cPINN is invented to overcome the difficulties due to the discontinuity of the porosity and the appearance of the critical states (near vacuum) in the Riemann data. We experiment with our idea by using a deep learning algorithm to solve the GBL equation in both conservative and non-conservative forms, as well as the cases of critical and non-critical states. This method provides a combination of two different neural networks and corresponding loss functions, one is for the two-by-two resonant hyperbolic system, and the other is for the scalar conservation law with a discontinuous perturbation term in the non-convex flux. The technique of re-scaling to the unknowns is adopted to avoid the oscillation of the Riemann solutions in the cases of critical Riemann data. The solutions constructed by the modified cPINN match the exact solutions constructed by the theoretical analysis for hyperbolic conservation laws. In addition, the solutions are identical in both conservative and non-conservative cases. Finally, we compare the performance of the modified cPINN with numerical method called WENO5. Whereas WENO5 struggles with the highly oscillation of approximate solutions for the Riemann problems of GBL equation in non-conservative form, cPINN works admirably. △ Less

Submitted 22 May, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: 23 pages, 26 figures

MSC Class: 35L03; 35L45; 65M99

arXiv:2305.12140 [pdf, other]

Movie101: A New Movie Understanding Benchmark

Authors: Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang, Qin **

Abstract: To help the visually impaired enjoy movies, automatic movie narrating systems are expected to narrate accurate, coherent, and role-aware plots when there are no speaking lines of actors. Existing works benchmark this challenge as a normal video captioning task via some simplifications, such as removing role names and evaluating narrations with ngram-based metrics, which makes it difficult for auto… ▽ More To help the visually impaired enjoy movies, automatic movie narrating systems are expected to narrate accurate, coherent, and role-aware plots when there are no speaking lines of actors. Existing works benchmark this challenge as a normal video captioning task via some simplifications, such as removing role names and evaluating narrations with ngram-based metrics, which makes it difficult for automatic systems to meet the needs of real application scenarios. To narrow this gap, we construct a large-scale Chinese movie benchmark, named Movie101. Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips where no actors are speaking. External knowledge, such as role information and movie genres, is also provided for better movie understanding. Besides, we propose a new metric called Movie Narration Score (MNScore) for movie narrating evaluation, which achieves the best correlation with human evaluation. Our benchmark also supports the Temporal Narration Grounding (TNG) task to investigate clip localization given text descriptions. For both two tasks, our proposed methods well leverage external knowledge and outperform carefully designed baselines. The dataset and codes are released at https://github.com/yuezih/Movie101. △ Less

Submitted 27 June, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

Comments: Accepted to ACL 2023

arXiv:2305.10403 [pdf, other]

PaLM 2 Technical Report

Authors: Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yan** Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yu**g Zhang, Gustavo Hernandez Abrego , et al. (103 additional authors not shown)

Abstract: We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on… ▽ More We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report. △ Less

Submitted 13 September, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

arXiv:2305.06002 [pdf, other]

InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation

Authors: Anwen Hu, Shizhe Chen, Liang Zhang, Qin **

Abstract: Automatic image captioning evaluation is critical for benchmarking and promoting advances in image captioning research. Existing metrics only provide a single score to measure caption qualities, which are less explainable and informative. Instead, we humans can easily identify the problems of captions in details, e.g., which words are inaccurate and which salient objects are not described, and the… ▽ More Automatic image captioning evaluation is critical for benchmarking and promoting advances in image captioning research. Existing metrics only provide a single score to measure caption qualities, which are less explainable and informative. Instead, we humans can easily identify the problems of captions in details, e.g., which words are inaccurate and which salient objects are not described, and then rate the caption quality. To support such informative feedback, we propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC). Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at fine-grained level, and also provide a text precision score, a vision recall score and an overall quality score at coarse-grained level. The coarse-grained score of InfoMetIC achieves significantly better correlation with human judgements than existing metrics on multiple benchmarks. We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation. Our code and datasets are publicly available at https://github.com/HAWLYQ/InfoMetIC. △ Less

Submitted 10 May, 2023; originally announced May 2023.

Comments: Accepted by ACL 2023 main conference

arXiv:2305.00664 [pdf, other]

EvoluNet: Advancing Dynamic Non-IID Transfer Learning on Graphs

Authors: Haohui Wang, Yuzhen Mao, Yujun Yan, Yaoqing Yang, Jianhui Sun, Kevin Choi, Balaji Veeramani, Alison Hu, Edward Bowen, Tyler Cody, Dawei Zhou

Abstract: Non-IID transfer learning on graphs is crucial in many high-stakes domains. The majority of existing works assume stationary distribution for both source and target domains. However, real-world graphs are intrinsically dynamic, presenting challenges in terms of domain evolution and dynamic discrepancy between source and target domains. To bridge the gap, we shift the problem to the dynamic setting… ▽ More Non-IID transfer learning on graphs is crucial in many high-stakes domains. The majority of existing works assume stationary distribution for both source and target domains. However, real-world graphs are intrinsically dynamic, presenting challenges in terms of domain evolution and dynamic discrepancy between source and target domains. To bridge the gap, we shift the problem to the dynamic setting and pose the question: given the label-rich source graphs and the label-scarce target graphs both observed in previous T timestamps, how can we effectively characterize the evolving domain discrepancy and optimize the generalization performance of the target domain at the incoming T+1 timestamp? To answer it, we propose a generalization bound for dynamic non-IID transfer learning on graphs, which implies the generalization performance is dominated by domain evolution and domain discrepancy between source and target graphs. Inspired by the theoretical results, we introduce a novel generic framework named EvoluNet. It leverages a transformer-based temporal encoding module to model temporal information of the evolving domains and then uses a dynamic domain unification module to efficiently learn domain-invariant representations across the source and target domains. Finally, EvoluNet outperforms the state-of-the-art models by up to 12.1%, demonstrating its effectiveness in transferring knowledge from dynamic source graphs to dynamic target graphs. △ Less

Submitted 31 May, 2024; v1 submitted 1 May, 2023; originally announced May 2023.

Comments: Accepted at ICML 2024

arXiv:2304.14178 [pdf, other]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Authors: Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, **gren Zhou

Abstract: Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstrac… ▽ More Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM while maintaining and even improving the generation abilities of LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaption (LoRA) module on LLM and the abstractor module by freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which makes it possible to leverage it for harder real scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl. △ Less

Submitted 29 March, 2024; v1 submitted 27 April, 2023; originally announced April 2023.

Comments: Working in Process

arXiv:2304.09660 [pdf, other]

MPMQA: Multimodal Question Answering on Product Manuals

Authors: Liang Zhang, Anwen Hu, **g Zhang, Shuo Hu, Qin **

Abstract: Visual contents, such as illustrations and images, play a big role in product manual understanding. Existing Product Manual Question Answering (PMQA) datasets tend to ignore visual contents and only retain textual parts. In this work, to emphasize the importance of multimodal contents, we propose a Multimodal Product Manual Question Answering (MPMQA) task. For each question, MPMQA requires the mod… ▽ More Visual contents, such as illustrations and images, play a big role in product manual understanding. Existing Product Manual Question Answering (PMQA) datasets tend to ignore visual contents and only retain textual parts. In this work, to emphasize the importance of multimodal contents, we propose a Multimodal Product Manual Question Answering (MPMQA) task. For each question, MPMQA requires the model not only to process multimodal contents but also to provide multimodal answers. To support MPMQA, a large-scale dataset PM209 is constructed with human annotations, which contains 209 product manuals from 27 well-known consumer electronic brands. Human annotations include 6 types of semantic regions for manual contents and 22,021 pairs of question and answer. Especially, each answer consists of a textual sentence and related visual regions from manuals. Taking into account the length of product manuals and the fact that a question is always related to a small number of pages, MPMQA can be naturally split into two subtasks: retrieving most related pages and then generating multimodal answers. We further propose a unified model that can perform these two subtasks all together and achieve comparable performance with multiple task-specific models. The PM209 dataset is available at https://github.com/AIM3-RUC/MPMQA. △ Less

Submitted 19 April, 2023; originally announced April 2023.

arXiv:2304.08630 [pdf, ps, other]

MFGLib: A Library for Mean-Field Games

Authors: Xin Guo, Anran Hu, Matteo Santamaria, Mahan Tajrobehkar, Junzi Zhang

Abstract: Mean-field games (MFGs) are limiting models to approximate $N$-player games, with a number of applications. Despite the ever-growing numerical literature on computation of MFGs, there is no library that allows researchers and practitioners to easily create and solve their own MFG problems. The purpose of this document is to introduce MFGLib, an open-source Python library for solving general MFGs w… ▽ More Mean-field games (MFGs) are limiting models to approximate $N$-player games, with a number of applications. Despite the ever-growing numerical literature on computation of MFGs, there is no library that allows researchers and practitioners to easily create and solve their own MFG problems. The purpose of this document is to introduce MFGLib, an open-source Python library for solving general MFGs with a user-friendly and customizable interface. It serves as a handy tool for creating and analyzing generic MFG environments, along with embedded auto-tuners for all implemented algorithms. The package is distributed under the MIT license and the source code and documentation can be found at https://github.com/radar-research-lab/MFGLib/. △ Less

Submitted 17 April, 2023; originally announced April 2023.

arXiv:2303.12455 [pdf, ps, other]

Reconfigurable Intelligent Surface-aided Secret Key Generation in Multi-Cell Systems

Authors: Lei Hu, Chen Sun, Guyue Li, Aiqun Hu, Derrick Wing Kwan Ng

Abstract: Physical-layer key generation (PKG) exploits the reciprocity and randomness of wireless channels to generate a symmetric key between two legitimate communication ends. However, in multi-cell systems, PKG suffers from severe pilot contamination due to the reuse of pilots in different cells. In this paper, we invoke multiple reconfigurable intelligent surfaces (RISs) for adaptively sha** the envir… ▽ More Physical-layer key generation (PKG) exploits the reciprocity and randomness of wireless channels to generate a symmetric key between two legitimate communication ends. However, in multi-cell systems, PKG suffers from severe pilot contamination due to the reuse of pilots in different cells. In this paper, we invoke multiple reconfigurable intelligent surfaces (RISs) for adaptively sha** the environment and enhancing the PKG performance. To this end, we formulate an optimization problem to maximize the weighted sum key rate (WSKR) by jointly optimizing the precoding matrices at the base stations (BSs) and the phase shifts at the RISs. For addressing the non-convexity of the problem, we derive an upper bound of the WSKR and prove its tightness. To tackle the upper bound maximization problem, we apply an alternating optimization (AO)-based algorithm to divide the joint optimization into two sub-problems. We apply the Lagrangian dual approach based on the Karush-Kuhn-Tucker (KKT) conditions for the sub-problem of precoding matrices and adopt a projected gradient ascent (PGA) algorithm for the sub-problem of phase shifts. Simulation results confirm the near-optimal performance of the proposed algorithm and the effectiveness of RISs for improving the WSKR via mitigating pilot contamination. △ Less

Submitted 22 March, 2023; originally announced March 2023.

Comments: 30 pages, 12 figures

arXiv:2303.08947 [pdf, other]

Design, Modeling, and Redundancy Resolution of Soft Robot for Effective Harvesting

Authors: Milad Azizkhani, Anthony L. Gunderman, Alex S. Qiu, Ai-** Hu, Xin Zhang, Yue Chen

Abstract: Blackberry harvesting is a labor-intensive and costly process, consuming up to 50\% of the total annual crop hours. This paper presents a solution for robotic harvesting through the design, manufacturing, integration, and control of a pneumatically actuated, kinematically redundant soft arm with a tendon-driven soft robotic gripper. The hardware design is optimized for durability and modularity fo… ▽ More Blackberry harvesting is a labor-intensive and costly process, consuming up to 50\% of the total annual crop hours. This paper presents a solution for robotic harvesting through the design, manufacturing, integration, and control of a pneumatically actuated, kinematically redundant soft arm with a tendon-driven soft robotic gripper. The hardware design is optimized for durability and modularity for practical use. The harvesting process is divided into four stages: initial placement, fine positioning, grasp, and move back to home position. For initial placement, we propose a real-time, continuous gain-scheduled redundancy resolution algorithm for simultaneous position and orientation control with joint-limit avoidance. The algorithm relies solely on visual feedback from an eye-to-hand camera and achieved a position and orientation tracking error of $0.64\pm{0.27}$ mm and $1.08\pm{1.5}^{\circ}$, respectively, in benchtop settings. Following accurate initial placement of the robotic arm, fine positioning is achieved using a combination of eye-in-hand and eye-to-hand visual feedback, reaching an accuracy of $0.75\pm{0.36}$ mm. The system's hardware, feedback framework, and control methods are thoroughly validated through benchtop and field tests, confirming feasibility for practical applications. △ Less

Submitted 15 March, 2023; originally announced March 2023.

Comments: 11 pages, 10 figures

arXiv:2303.07015 [pdf, ps, other]

RIS-Jamming: Breaking Key Consistency in Channel Reciprocity-based Key Generation

Authors: Guyue Li, Paul Staat, Haoyu Li, Markus Heinrichs, Christian Zenger, Rainer Kronberger, Harald Elders-Boll, Christof Paar, Aiqun Hu

Abstract: Channel Reciprocity-based Key Generation (CRKG) exploits reciprocal channel randomness to establish shared secret keys between wireless terminals. This new security technique is expected to complement existing cryptographic techniques for secret key distribution of future wireless networks. In this paper, we present a new attack, reconfigurable intelligent surface (RIS) jamming, and show that an a… ▽ More Channel Reciprocity-based Key Generation (CRKG) exploits reciprocal channel randomness to establish shared secret keys between wireless terminals. This new security technique is expected to complement existing cryptographic techniques for secret key distribution of future wireless networks. In this paper, we present a new attack, reconfigurable intelligent surface (RIS) jamming, and show that an attacker can prevent legitimate users from agreeing on the same key by deploying a malicious RIS to break channel reciprocity. Specifically, we elaborate on three examples to implement the RIS jamming attack: Using active nonreciprocal circuits, performing time-varying controls, and reducing the signal-to-noise ratio. The attack effect is then studied by formulating the secret key rate with a relationship to the deployment of RIS. To resist such RIS jamming attacks, we propose a countermeasure that exploits wideband signals for multipath separation. The malicious RIS path is distinguished from all separated channel paths, and thus the countermeasure is referred to as contaminated path removal-based CRKG(CRP-CRKG). We present simulation results, showing that legitimate users under RIS jamming are still able to generate secret keys from the remaining paths. We also experimentally demonstrate the RIS jamming attack by using commodity Wi-Fi devices in conjunction with a fabricated RIS prototype. In our experiments, we were able to increase the average bit disagreement ratio (BDR) of raw secret keys by 20%. Further, we successfully demonstrate the proposed CRP-CRKG countermeasure to tackle RIS jamming in wideband systems as long as the source of randomness and the RIS propagation paths are separable. △ Less

Submitted 10 April, 2024; v1 submitted 13 March, 2023; originally announced March 2023.

Comments: 15 pages, 14 figures

arXiv:2303.06591 [pdf, other]

Accommodating Audio Modality in CLIP for Multimodal Processing

Authors: Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin **

Abstract: Multimodal processing has attracted much attention lately especially with the success of pre-training. However, the exploration has mainly focused on vision-language pre-training, as introducing more modalities can greatly complicate model design and optimization. In this paper, we extend the stateof-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio mul… ▽ More Multimodal processing has attracted much attention lately especially with the success of pre-training. However, the exploration has mainly focused on vision-language pre-training, as introducing more modalities can greatly complicate model design and optimization. In this paper, we extend the stateof-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing. Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities in addition to the inner characteristics of the audio modality. Moreover, we further design an audio type token to dynamically learn different audio information type for different scenarios, as both verbal and nonverbal heterogeneous information is conveyed in general audios. Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning, and achieves the state-of-the-art performance on the benchmark datasets of MSR-VTT, VATEX, and Audiocaps. △ Less

Submitted 12 March, 2023; originally announced March 2023.

Comments: Accepted by AAAI2023

arXiv:2302.11000 [pdf, other]

CHA2: CHemistry Aware Convex Hull Autoencoder Towards Inverse Molecular Design

Authors: Mohammad Sajjad Ghaemi, Hang Hu, Anguang Hu, Hsu Kiang Ooi

Abstract: Optimizing molecular design and discovering novel chemical structures to meet certain objectives, such as quantitative estimates of the drug-likeness score (QEDs), is NP-hard due to the vast combinatorial design space of discrete molecular structures, which makes it near impossible to explore the entire search space comprehensively to exploit de novo structures with properties of interest. To addr… ▽ More Optimizing molecular design and discovering novel chemical structures to meet certain objectives, such as quantitative estimates of the drug-likeness score (QEDs), is NP-hard due to the vast combinatorial design space of discrete molecular structures, which makes it near impossible to explore the entire search space comprehensively to exploit de novo structures with properties of interest. To address this challenge, reducing the intractable search space into a lower-dimensional latent volume helps examine molecular candidates more feasibly via inverse design. Autoencoders are suitable deep learning techniques, equipped with an encoder that reduces the discrete molecular structure into a latent space and a decoder that inverts the search space back to the molecular design. The continuous property of the latent space, which characterizes the discrete chemical structures, provides a flexible representation for inverse design in order to discover novel molecules. However, exploring this latent space requires certain insights to generate new structures. We propose using a convex hall surrounding the top molecules in terms of high QEDs to ensnare a tight subspace in the latent representation as an efficient way to reveal novel molecules with high QEDs. We demonstrate the effectiveness of our suggested method by using the QM9 as a training dataset along with the Self- Referencing Embedded Strings (SELFIES) representation to calibrate the autoencoder in order to carry out the Inverse molecular design that leads to unfold novel chemical structure. △ Less

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2302.10952 [pdf]

Machine learning for the prediction of safe and biologically active organophosphorus molecules

Authors: Hang Hu, Hsu Kiang Ooi, Mohammad Sajjad Ghaemi, Anguang Hu

Abstract: Drug discovery is a complex process with a large molecular space to be considered. By constraining the search space, the fragment-based drug design is an approach that can effectively sample the chemical space of interest. Here we propose a framework of Recurrent Neural Networks (RNN) with an attention model to sample the chemical space of organophosphorus molecules using the fragment-based approa… ▽ More Drug discovery is a complex process with a large molecular space to be considered. By constraining the search space, the fragment-based drug design is an approach that can effectively sample the chemical space of interest. Here we propose a framework of Recurrent Neural Networks (RNN) with an attention model to sample the chemical space of organophosphorus molecules using the fragment-based approach. The framework is trained with a ZINC dataset that is screened for high druglikeness scores. The goal is to predict molecules with similar biological action modes as organophosphorus pesticides or chemical warfare agents yet less toxic to humans. The generated molecules contain a starting fragment of PO2F but have a bulky hydrocarbon side chain limiting its binding effectiveness to the targeted protein. △ Less

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2302.03099 [pdf]

Tendon-Driven Soft Robotic Gripper with Integrated Ripeness Sensing for Blackberry Harvesting

Authors: Alex Qiu, Claire Young, Anthony Gunderman, Milad Azizkhani, Yue Chen, Ai-** Hu

Abstract: Growing global demand for food, coupled with continuing labor shortages, motivates the need for automated agricultural harvesting. While some specialty crops (e.g., apples, peaches, blueberries) can be harvested via existing harvesting modalities, fruits such as blackberries and raspberries require delicate handling to mitigate fruit damage that could significantly impact marketability. This motiv… ▽ More Growing global demand for food, coupled with continuing labor shortages, motivates the need for automated agricultural harvesting. While some specialty crops (e.g., apples, peaches, blueberries) can be harvested via existing harvesting modalities, fruits such as blackberries and raspberries require delicate handling to mitigate fruit damage that could significantly impact marketability. This motivates the development of soft robotic solutions that enable efficient, delicate harvesting. This paper presents the design, fabrication, and feasibility testing of a tendon-driven soft grip** system focused on blackberries, which are a fragile fruit susceptible to post-harvest damage. The gripper is both low-cost and small form factor, allowing for the integration of a micro-servo for tendon retraction, a near-infrared (NIR) based blackberry ripeness sensor utilizing the reflectance modality for identifying fully ripe blackberries, and an endoscopic camera for visual servoing with a UR-5. The gripper was used to harvest 139 berries with manual positioning in two separate field tests. Field testing found an average retention force of 2.06 N and 6.08 N for ripe and unripe blackberries, respectively. Sensor tests identified an average reflectance of 16.78 and 21.70 for ripe and unripe blackberries, respectively, indicating a clear distinction between the two ripeness levels. Finally, the soft robotic gripper was integrated onto a UR5 robot arm and successfully harvested fifteen artificial blackberries in a lab setting using visual servoing. △ Less

Submitted 6 February, 2023; originally announced February 2023.

Comments: 7 Pages, 9 figures, submitted to and accepted by ICRA 2023

arXiv:2301.01453 [pdf, other]

Information-Theoretic Secure Key Sharing for Wide-Area Mobile Applications

Authors: Guyue Li, Hongyi Luo, Jiabao Yu, Aiqun Hu, Jiangzhou Wang

Abstract: With the rapid growth of handheld devices in the internet of things (IoT) networks, mobile applications have become ubiquitous in everyday life. As technology is developed, so do also the risks and threats associated with it, especially in the forthcoming quantum era. Existing IoT networks, however, lack a quantum-resistant secret key sharing scheme to meet confidential message transmission demand… ▽ More With the rapid growth of handheld devices in the internet of things (IoT) networks, mobile applications have become ubiquitous in everyday life. As technology is developed, so do also the risks and threats associated with it, especially in the forthcoming quantum era. Existing IoT networks, however, lack a quantum-resistant secret key sharing scheme to meet confidential message transmission demands in wide-area mobile applications. To address this issue, this article proposes a new scheme, channel reciprocity (CR) based quantum key distribution (QKD) CR-QKD, which accomplishes the goal of secret key sharing by combining emerging techniques of QKD and CR-based key generation (CRKG). Exploiting laws of quantum physics and properties of wireless channels, the proposed scheme is able to ensure the secrecy of the key, even against computationally unbounded adversaries. The basic mechanism is elaborated for a single-user case and it is extended into a multi-user case by redesigning a multi-user edge forwarding strategy. In addition, to make CR-QKD more practical, some enhancement strategies are studied to reduce the time delay and to improve the secret key generation rate in a secure manner. A prototype of CR-QKD is demonstrated in a metropolitan area network, where secret keys are shared between two remote IoT devices that are roughly fifteen kilometers apart from each other. The experimental results have verified that CR-QKD allows a secret key rate of 424 bits per second with a retransmission rate of 2.1%. △ Less

Submitted 4 January, 2023; originally announced January 2023.

arXiv:2212.14514 [pdf, other]

The Voronoigram: Minimax Estimation of Bounded Variation Functions From Scattered Data

Authors: Addison J. Hu, Alden Green, Ryan J. Tibshirani

Abstract: We consider the problem of estimating a multivariate function $f_0$ of bounded variation (BV), from noisy observations $y_i = f_0(x_i) + z_i$ made at random design points $x_i \in \mathbb{R}^d$, $i=1,\ldots,n$. We study an estimator that forms the Voronoi diagram of the design points, and then solves an optimization problem that regularizes according to a certain discrete notion of total variation… ▽ More We consider the problem of estimating a multivariate function $f_0$ of bounded variation (BV), from noisy observations $y_i = f_0(x_i) + z_i$ made at random design points $x_i \in \mathbb{R}^d$, $i=1,\ldots,n$. We study an estimator that forms the Voronoi diagram of the design points, and then solves an optimization problem that regularizes according to a certain discrete notion of total variation (TV): the sum of weighted absolute differences of parameters $θ_i,θ_j$ (which estimate the function values $f_0(x_i),f_0(x_j)$) at all neighboring cells $i,j$ in the Voronoi diagram. This is seen to be equivalent to a variational optimization problem that regularizes according to the usual continuum (measure-theoretic) notion of TV, once we restrict the domain to functions that are piecewise constant over the Voronoi diagram. The regression estimator under consideration hence performs (shrunken) local averaging over adaptively formed unions of Voronoi cells, and we refer to it as the Voronoigram, following the ideas in Koenker (2005), and drawing inspiration from Tukey's regressogram (Tukey, 1961). Our contributions in this paper span both the conceptual and theoretical frontiers: we discuss some of the unique properties of the Voronoigram in comparison to TV-regularized estimators that use other graph-based discretizations; we derive the asymptotic limit of the Voronoi TV functional; and we prove that the Voronoigram is minimax rate optimal (up to log factors) for estimating BV functions that are essentially bounded. △ Less

Submitted 29 December, 2022; originally announced December 2022.

Comments: 75 pages, 10 figures

arXiv:2212.05588 [pdf, other]

Towards a Learner-Centered Explainable AI: Lessons from the learning sciences

Authors: Anna Kawakami, Luke Guerdan, Yang Cheng, Anita Sun, Alison Hu, Kate Glazko, Nikos Arechiga, Matthew Lee, Scott Carter, Haiyi Zhu, Kenneth Holstein

Abstract: In this short paper, we argue for a refocusing of XAI around human learning goals. Drawing upon approaches and theories from the learning sciences, we propose a framework for the learner-centered design and evaluation of XAI systems. We illustrate our framework through an ongoing case study in the context of AI-augmented social work. In this short paper, we argue for a refocusing of XAI around human learning goals. Drawing upon approaches and theories from the learning sciences, we propose a framework for the learner-centered design and evaluation of XAI systems. We illustrate our framework through an ongoing case study in the context of AI-augmented social work. △ Less

Submitted 11 December, 2022; originally announced December 2022.

Comments: 7 pages, 2 figures

Journal ref: Human-Centered Explainable AI Workshop at ACM CHI Conference on Human Factors in Computing Systems 2022

arXiv:2211.07820 [pdf, other]

Clinically Plausible Pathology-Anatomy Disentanglement in Patient Brain MRI with Structured Variational Priors

Authors: Anjun Hu, Jean-Pierre R. Falet, Brennan S. Nichyporuk, Changjian Shui, Douglas L. Arnold, Sotirios A. Tsaftaris, Tal Arbel

Abstract: We propose a hierarchically structured variational inference model for accurately disentangling observable evidence of disease (e.g. brain lesions or atrophy) from subject-specific anatomy in brain MRIs. With flexible, partially autoregressive priors, our model (1) addresses the subtle and fine-grained dependencies that typically exist between anatomical and pathological generating factors of an M… ▽ More We propose a hierarchically structured variational inference model for accurately disentangling observable evidence of disease (e.g. brain lesions or atrophy) from subject-specific anatomy in brain MRIs. With flexible, partially autoregressive priors, our model (1) addresses the subtle and fine-grained dependencies that typically exist between anatomical and pathological generating factors of an MRI to ensure the clinical validity of generated samples; (2) preserves and disentangles finer pathological details pertaining to a patient's disease state. Additionally, we experiment with an alternative training configuration where we provide supervision to a subset of latent units. It is shown that (1) a partially supervised latent space achieves a higher degree of disentanglement between evidence of disease and subject-specific anatomy; (2) when the prior is formulated with an autoregressive structure, knowledge from the supervision can propagate to the unsupervised latent units, resulting in more informative latent representations capable of modelling anatomy-pathology interdependencies. △ Less

Submitted 16 November, 2022; v1 submitted 14 November, 2022; originally announced November 2022.

Comments: Extended Abstract presented at Machine Learning for Health (ML4H) symposium 2022, November 28th, 2022, New Orleans, United States & Virtual, http://www.ml4h.cc, 11 pages

arXiv:2211.06313 [pdf]

BioJam Camp: toward justice through bioengineering and biodesign co-learning with youth

Authors: Callie Chappell, Henry A. -A., Elvia B. O., Emily B., Bailey B., Jacqueline C. -M., Caroline Daws, Cristian F., Emiliano G., Page Goddard, Xavier G., Anne Hu, Gabriela J., Kelley Langhans, Briana Martin-Villa, Penny M. -S., Jennifer M., Soyang N., Melissa Ortiz, Aryana P., Trisha S, Corinne Takara, Emily T., Paloma Vazquez, Rolando Perez , et al. (1 additional authors not shown)

Abstract: BioJam is a political, artistic, and educational project in which Bay Area artists, scientists, and educators collaborate with youth and communities of color to address historical exclusion of their communities in STEM fields and reframe what science can be. As an intergenerational collective, we co-learn on topics of culture (social and biological), community (cultural and ecological), and creati… ▽ More BioJam is a political, artistic, and educational project in which Bay Area artists, scientists, and educators collaborate with youth and communities of color to address historical exclusion of their communities in STEM fields and reframe what science can be. As an intergenerational collective, we co-learn on topics of culture (social and biological), community (cultural and ecological), and creativity. We reject the notion that increasing the number of scientists of color requires inculcation in the ways of the dominant culture. Instead, we center cultural practices, traditional ways of knowing, storytelling, art, experiential learning, and community engagement to break down the framing that positions these practices as distinct from science. The goal of this work is to realize a future in which the practice of science is relatable, accessible, and liberatory. △ Less

Submitted 1 November, 2022; originally announced November 2022.

Showing 1–50 of 105 results for author: Hu, A