-
Towards Secure and Efficient Data Scheduling for Vehicular Social Networks
Authors:
Youhua Xia,
Tiehua Zhang,
Jiong **,
Ying He,
Fei Yu
Abstract:
Efficient data transmission scheduling within vehicular environments poses a significant challenge due to the high mobility of such networks. Contemporary research predominantly centers on crafting cooperative scheduling algorithms tailored for vehicular networks. Notwithstanding, the intricacies of orchestrating scheduling in vehicular social networks both effectively and efficiently remain formi…
▽ More
Efficient data transmission scheduling within vehicular environments poses a significant challenge due to the high mobility of such networks. Contemporary research predominantly centers on crafting cooperative scheduling algorithms tailored for vehicular networks. Notwithstanding, the intricacies of orchestrating scheduling in vehicular social networks both effectively and efficiently remain formidable. This paper introduces an innovative learning-based algorithm for scheduling data transmission that prioritizes efficiency and security within vehicular social networks. The algorithm first uses a specifically constructed neural network to enhance data processing capabilities. After this, it incorporates a Q-learning paradigm during the data transmission phase to optimize the information exchange, the privacy of which is safeguarded by differential privacy through the communication process. Comparative experiments demonstrate the superior performance of the proposed Q-learning enhanced scheduling algorithm relative to existing state-of-the-art scheduling algorithms in the context of vehicular social networks.
△ Less
Submitted 28 June, 2024;
originally announced July 2024.
-
ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting
Authors:
Rui Pan,
Jipeng Zhang,
Xingyuan Pan,
Renjie Pi,
Xiaoyu Wang,
Tong Zhang
Abstract:
Bilevel optimization has shown its utility across various machine learning settings, yet most algorithms in practice require second-order information, making it challenging to scale them up. Only recently, a paradigm of first-order algorithms emerged, capable of effectively addressing bilevel optimization problems. Nevertheless, the practical efficiency of this paradigm remains unverified, particu…
▽ More
Bilevel optimization has shown its utility across various machine learning settings, yet most algorithms in practice require second-order information, making it challenging to scale them up. Only recently, a paradigm of first-order algorithms emerged, capable of effectively addressing bilevel optimization problems. Nevertheless, the practical efficiency of this paradigm remains unverified, particularly in the context of large language models (LLMs). This paper introduces the first scalable instantiation of this paradigm called ScaleBiO, focusing on bilevel optimization for large-scale LLM data reweighting. By combining with a recently proposed memory-efficient training technique called LISA, our novel algorithm allows the paradigm to scale to 34-billion-parameter LLMs on eight A40 GPUs, marking the first successful application of bilevel optimization under practical scenarios for large-sized LLMs. Empirically, extensive experiments on data reweighting verify the effectiveness of ScaleBiO for different-scaled models, including GPT-2, LLaMA-3-8B, GPT-NeoX-20B, and Yi-34B, where bilevel optimization succeeds in filtering irrelevant data samples and selecting informative samples. Theoretically, ScaleBiO ensures the optimality of the learned data weights, along with a convergence guarantee matching the conventional first-order bilevel optimization paradigm on smooth and strongly convex objectives.
△ Less
Submitted 28 June, 2024;
originally announced June 2024.
-
Mobile Robot Oriented Large-Scale Indoor Dataset for Dynamic Scene Understanding
Authors:
Yifan Tang,
Cong Tai,
Fangxing Chen,
Wanting Zhang,
Tao Zhang,
Xue** Liu,
Yong** Liu,
Long Zeng
Abstract:
Most existing robotic datasets capture static scene data and thus are limited in evaluating robots' dynamic performance. To address this, we present a mobile robot oriented large-scale indoor dataset, denoted as THUD (Tsinghua University Dynamic) robotic dataset, for training and evaluating their dynamic scene understanding algorithms. Specifically, the THUD dataset construction is first detailed,…
▽ More
Most existing robotic datasets capture static scene data and thus are limited in evaluating robots' dynamic performance. To address this, we present a mobile robot oriented large-scale indoor dataset, denoted as THUD (Tsinghua University Dynamic) robotic dataset, for training and evaluating their dynamic scene understanding algorithms. Specifically, the THUD dataset construction is first detailed, including organization, acquisition, and annotation methods. It comprises both real-world and synthetic data, collected with a real robot platform and a physical simulation platform, respectively. Our current dataset includes 13 larges-scale dynamic scenarios, 90K image frames, 20M 2D/3D bounding boxes of static and dynamic objects, camera poses, and IMU. The dataset is still continuously expanding. Then, the performance of mainstream indoor scene understanding tasks, e.g. 3D object detection, semantic segmentation, and robot relocalization, is evaluated on our THUD dataset. These experiments reveal serious challenges for some robot scene understanding tasks in dynamic scenes. By sharing this dataset, we aim to foster and iterate new mobile robot algorithms quickly for robot actual working dynamic environment, i.e. complex crowded dynamic scenes.
△ Less
Submitted 30 June, 2024; v1 submitted 28 June, 2024;
originally announced June 2024.
-
CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems
Authors:
Ziming Zhao,
Tiehua Zhang,
Zhishu Shen,
Hai Dong,
Xingjun Ma,
Xianhui Liu,
Yun Yang
Abstract:
In recent years, the widespread adoption of distributed microservice architectures within the industry has significantly increased the demand for enhanced system availability and robustness. Due to the complex service invocation paths and dependencies at enterprise-level microservice systems, it is challenging to locate the anomalies promptly during service invocations, thus causing intractable is…
▽ More
In recent years, the widespread adoption of distributed microservice architectures within the industry has significantly increased the demand for enhanced system availability and robustness. Due to the complex service invocation paths and dependencies at enterprise-level microservice systems, it is challenging to locate the anomalies promptly during service invocations, thus causing intractable issues for normal system operations and maintenance. In this paper, we propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data, including traces, logs, and system monitoring metrics. Specifically, related information is encoded into representative embeddings and further modeled by a multimodal invocation graph. Following that, anomaly detection is performed on each instance node with attentive heterogeneous message passing from its adjacent metric and log nodes. Finally, CHASE learns from the constructed hypergraph with hyperedges representing the flow of causality and performs root cause localization. We evaluate the proposed framework on two public microservice datasets with distinct attributes and compare with the state-of-the-art methods. The results show that CHASE achieves the average performance gain up to 36.2%(A@1) and 29.4%(Percentage@1), respectively to its best counterpart.
△ Less
Submitted 28 June, 2024;
originally announced June 2024.
-
A Differentiable Approach to Multi-scale Brain Modeling
Authors:
Chaoming Wang,
Muyang Lyu,
Tianqiu Zhang,
Sichao He,
Si Wu
Abstract:
We present a multi-scale differentiable brain modeling workflow utilizing BrainPy, a unique differentiable brain simulator that combines accurate brain simulation with powerful gradient-based optimization. We leverage this capability of BrainPy across different brain scales. At the single-neuron level, we implement differentiable neuron models and employ gradient methods to optimize their fit to e…
▽ More
We present a multi-scale differentiable brain modeling workflow utilizing BrainPy, a unique differentiable brain simulator that combines accurate brain simulation with powerful gradient-based optimization. We leverage this capability of BrainPy across different brain scales. At the single-neuron level, we implement differentiable neuron models and employ gradient methods to optimize their fit to electrophysiological data. On the network level, we incorporate connectomic data to construct biologically constrained network models. Finally, to replicate animal behavior, we train these models on cognitive tasks using gradient-based learning rules. Experiments demonstrate that our approach achieves superior performance and speed in fitting generalized leaky integrate-and-fire and Hodgkin-Huxley single neuron models. Additionally, training a biologically-informed network of excitatory and inhibitory spiking neurons on working memory tasks successfully replicates observed neural activity and synaptic weight distributions. Overall, our differentiable multi-scale simulation approach offers a promising tool to bridge neuroscience data across electrophysiological, anatomical, and behavioral scales.
△ Less
Submitted 1 July, 2024; v1 submitted 28 June, 2024;
originally announced June 2024.
-
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Authors:
Tao Zhang,
Xiangtai Li,
Hao Fei,
Haobo Yuan,
Shengqiong Wu,
Shun** Ji,
Chen Change Loy,
Shuicheng Yan
Abstract:
Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual p…
▽ More
Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using LLM to connect each specialist, our work aims at end-to-end training on one encoder, one decoder, and one LLM. The code and model have been released for further research.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model
Authors:
Haobo Yuan,
Xiangtai Li,
Lu Qi,
Tao Zhang,
Ming-Hsuan Yang,
Shuicheng Yan,
Chen Change Loy
Abstract:
Transformer-based segmentation methods face the challenge of efficient inference when dealing with high-resolution images. Recently, several linear attention architectures, such as Mamba and RWKV, have attracted much attention as they can process long sequences efficiently. In this work, we focus on designing an efficient segment-anything model by exploring these different architectures. Specifica…
▽ More
Transformer-based segmentation methods face the challenge of efficient inference when dealing with high-resolution images. Recently, several linear attention architectures, such as Mamba and RWKV, have attracted much attention as they can process long sequences efficiently. In this work, we focus on designing an efficient segment-anything model by exploring these different architectures. Specifically, we design a mixed backbone that contains convolution and RWKV operation, which achieves the best for both accuracy and efficiency. In addition, we design an efficient decoder to utilize the multiscale tokens to obtain high-quality masks. We denote our method as RWKV-SAM, a simple, effective, fast baseline for SAM-like models. Moreover, we build a benchmark containing various high-quality segmentation datasets and jointly train one efficient yet high-quality segmentation model using this benchmark. Based on the benchmark results, our RWKV-SAM achieves outstanding performance in efficiency and segmentation quality compared to transformers and other linear attention models. For example, compared with the same-scale transformer model, RWKV-SAM achieves more than 2x speedup and can achieve better segmentation performance on various datasets. In addition, RWKV-SAM outperforms recent vision Mamba models with better classification and semantic segmentation results. Code and models will be publicly available.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
Authors:
Diandian Gu,
Peng Sun,
Qinghao Hu,
Ting Huang,
Xun Chen,
Yingtong Xiong,
Guoteng Wang,
Qiaoling Chen,
Shangchun Zhao,
Jiarui Fang,
Yonggang Wen,
Tianwei Zhang,
Xin **,
Xuanzhe Liu
Abstract:
Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mecha…
▽ More
Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mechanism, which combines both head-parallel and context-parallel techniques to break the scalability constraints while maintaining efficiency. We introduce Double-Ring-Attention and analyze the performance of device placement strategies to further speed up training. We implement LoongTrain with the hybrid ZeRO and Selective Checkpoint++ techniques. Experiment results show that LoongTrain outperforms state-of-the-art baselines, i.e., DeepSpeed-Ulysses and Megatron Context Parallelism, in both end-to-end training speed and scalability, and improves Model FLOPs Utilization (MFU) by up to 2.88x.
△ Less
Submitted 26 June, 2024;
originally announced June 2024.
-
Mamba24/8D: Enhancing Global Interaction in Point Clouds via State Space Model
Authors:
Zhuoyuan Li,
Yubo Ai,
Jiahao Lu,
ChuXin Wang,
Jiacheng Deng,
Hanzhi Chang,
Yanzhe Liang,
Wenfei Yang,
Shifeng Zhang,
Tianzhu Zhang
Abstract:
Transformers have demonstrated impressive results for 3D point cloud semantic segmentation. However, the quadratic complexity of transformer makes computation cost high, limiting the number of points that can be processed simultaneously and impeding the modeling of long-range dependencies. Drawing inspiration from the great potential of recent state space models (SSM) for long sequence modeling, w…
▽ More
Transformers have demonstrated impressive results for 3D point cloud semantic segmentation. However, the quadratic complexity of transformer makes computation cost high, limiting the number of points that can be processed simultaneously and impeding the modeling of long-range dependencies. Drawing inspiration from the great potential of recent state space models (SSM) for long sequence modeling, we introduce Mamba, a SSM-based architecture, to the point cloud domain and propose Mamba24/8D, which has strong global modeling capability under linear complexity. Specifically, to make disorderness of point clouds fit in with the causal nature of Mamba, we propose a multi-path serialization strategy applicable to point clouds. Besides, we propose the ConvMamba block to compensate for the shortcomings of Mamba in modeling local geometries and in unidirectional modeling. Mamba24/8D obtains state of the art results on several 3D point cloud segmentation tasks, including ScanNet v2, ScanNet200 and nuScenes, while its effectiveness is validated by extensive experiments.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking
Authors:
Xiaohao Xu,
Tianyi Zhang,
Sibo Wang,
Xiang Li,
Yongqi Chen,
Ye Li,
Bhiksha Raj,
Matthew Johnson-Roberson,
Xiaonan Huang
Abstract:
Embodied agents require robust navigation systems to operate in unstructured environments, making the robustness of Simultaneous Localization and Map** (SLAM) models critical to embodied agent autonomy. While real-world datasets are invaluable, simulation-based benchmarks offer a scalable approach for robustness evaluations. However, the creation of a challenging and controllable noisy world wit…
▽ More
Embodied agents require robust navigation systems to operate in unstructured environments, making the robustness of Simultaneous Localization and Map** (SLAM) models critical to embodied agent autonomy. While real-world datasets are invaluable, simulation-based benchmarks offer a scalable approach for robustness evaluations. However, the creation of a challenging and controllable noisy world with diverse perturbations remains under-explored. To this end, we propose a novel, customizable pipeline for noisy data synthesis, aimed at assessing the resilience of multi-modal SLAM models against various perturbations. The pipeline comprises a comprehensive taxonomy of sensor and motion perturbations for embodied multi-modal (specifically RGB-D) sensing, categorized by their sources and propagation order, allowing for procedural composition. We also provide a toolbox for synthesizing these perturbations, enabling the transformation of clean environments into challenging noisy simulations. Utilizing the pipeline, we instantiate the large-scale Noisy-Replica benchmark, which includes diverse perturbation types, to evaluate the risk tolerance of existing advanced RGB-D SLAM models. Our extensive analysis uncovers the susceptibilities of both neural (NeRF and Gaussian Splatting -based) and non-neural SLAM models to disturbances, despite their demonstrated accuracy in standard benchmarks. Our code is publicly available at https://github.com/Xiaohao-Xu/SLAM-under-Perturbation.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
A Dual-Channel Particle Swarm Optimization Algorithm Based on Adaptive Balance Search
Authors:
Zhenxing Zhang,
Tianxian Zhang,
Xiangliang Xu,
Lingjiang Kong,
Yi Han,
Zicheng Wang
Abstract:
The balance between exploration (Er) and exploitation (Ei) determines the generalization performance of the particle swarm optimization (PSO) algorithm on different problems. Although the insufficient balance caused by global best being located near a local minimum has been widely researched, few scholars have systematically paid attention to two behaviors about personal best position (P) and glob…
▽ More
The balance between exploration (Er) and exploitation (Ei) determines the generalization performance of the particle swarm optimization (PSO) algorithm on different problems. Although the insufficient balance caused by global best being located near a local minimum has been widely researched, few scholars have systematically paid attention to two behaviors about personal best position (P) and global best position (G) existing in PSO. 1) P's uncontrollable-exploitation and involuntary-exploration guidance behavior. 2) G's full-time and global guidance behavior, each of which negatively affects the balance of Er and Ei. With regards to this, we firstly discuss the two behaviors, unveiling the mechanisms by which they affect the balance, and further pinpoint three key points for better balancing Er and Ei: eliminating the coupling between P and G, empowering P with controllable-exploitation and voluntary-exploration guidance behavior, controlling G's full-time and global guidance behavior. Then, we present a dual-channel PSO algorithm based on adaptive balance search (DCPSO-ABS). This algorithm entails a dual-channel framework to mitigate the interaction of P and G, aiding in regulating the behaviors of P and G, and meanwhile an adaptive balance search strategy for empowering P with voluntary-exploration and controllable-exploitation guidance behavior as well as adaptively controlling G's full-time and global guidance behavior. Finally, three kinds of experiments on 57 benchmark functions are designed to demonstrate that our proposed algorithm has stronger generalization performance than selected state-of-the-art algorithms.
△ Less
Submitted 25 June, 2024; v1 submitted 24 June, 2024;
originally announced June 2024.
-
Exploring Cross-Domain Few-Shot Classification via Frequency-Aware Prompting
Authors:
Tiange Zhang,
Qing Cai,
Feng Gao,
Lin Qi,
Junyu Dong
Abstract:
Cross-Domain Few-Shot Learning has witnessed great stride with the development of meta-learning. However, most existing methods pay more attention to learning domain-adaptive inductive bias (meta-knowledge) through feature-wise manipulation or task diversity improvement while neglecting the phenomenon that deep networks tend to rely more on high-frequency cues to make the classification decision,…
▽ More
Cross-Domain Few-Shot Learning has witnessed great stride with the development of meta-learning. However, most existing methods pay more attention to learning domain-adaptive inductive bias (meta-knowledge) through feature-wise manipulation or task diversity improvement while neglecting the phenomenon that deep networks tend to rely more on high-frequency cues to make the classification decision, which thus degenerates the robustness of learned inductive bias since high-frequency information is vulnerable and easy to be disturbed by noisy information. Hence in this paper, we make one of the first attempts to propose a Frequency-Aware Prompting method with mutual attention for Cross-Domain Few-Shot classification, which can let networks simulate the human visual perception of selecting different frequency cues when facing new recognition tasks. Specifically, a frequency-aware prompting mechanism is first proposed, in which high-frequency components of the decomposed source image are switched either with normal distribution sampling or zeroing to get frequency-aware augment samples. Then, a mutual attention module is designed to learn generalizable inductive bias under CD-FSL settings. More importantly, the proposed method is a plug-and-play module that can be directly applied to most off-the-shelf CD-FLS methods. Experimental results on CD-FSL benchmarks demonstrate the effectiveness of our proposed method as well as robustly improve the performance of existing CD-FLS methods. Resources at https://github.com/tinkez/FAP_CDFSC.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
KEHRL: Learning Knowledge-Enhanced Language Representations with Hierarchical Reinforcement Learning
Authors:
Dongyang Li,
Taolin Zhang,
Longtao Huang,
Chengyu Wang,
Xiaofeng He,
Hui Xue
Abstract:
Knowledge-enhanced pre-trained language models (KEPLMs) leverage relation triples from knowledge graphs (KGs) and integrate these external data sources into language models via self-supervised learning. Previous works treat knowledge enhancement as two independent operations, i.e., knowledge injection and knowledge integration. In this paper, we propose to learn Knowledge-Enhanced language represe…
▽ More
Knowledge-enhanced pre-trained language models (KEPLMs) leverage relation triples from knowledge graphs (KGs) and integrate these external data sources into language models via self-supervised learning. Previous works treat knowledge enhancement as two independent operations, i.e., knowledge injection and knowledge integration. In this paper, we propose to learn Knowledge-Enhanced language representations with Hierarchical Reinforcement Learning (KEHRL), which jointly addresses the problems of detecting positions for knowledge injection and integrating external knowledge into the model in order to avoid injecting inaccurate or irrelevant knowledge. Specifically, a high-level reinforcement learning (RL) agent utilizes both internal and prior knowledge to iteratively detect essential positions in texts for knowledge injection, which filters out less meaningful entities to avoid diverting the knowledge learning direction. Once the entity positions are selected, a relevant triple filtration module is triggered to perform low-level RL to dynamically refine the triples associated with polysemic entities through binary-valued actions. Experiments validate KEHRL's effectiveness in probing factual knowledge and enhancing the model's performance on various natural language understanding tasks.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
UniPSDA: Unsupervised Pseudo Semantic Data Augmentation for Zero-Shot Cross-Lingual Natural Language Understanding
Authors:
Dongyang Li,
Taolin Zhang,
Jiali Deng,
Longtao Huang,
Chengyu Wang,
Xiaofeng He,
Hui Xue
Abstract:
Cross-lingual representation learning transfers knowledge from resource-rich data to resource-scarce ones to improve the semantic understanding abilities of different languages. However, previous works rely on shallow unsupervised data generated by token surface matching, regardless of the global context-aware semantics of the surrounding text tokens. In this paper, we propose an Unsupervised Pseu…
▽ More
Cross-lingual representation learning transfers knowledge from resource-rich data to resource-scarce ones to improve the semantic understanding abilities of different languages. However, previous works rely on shallow unsupervised data generated by token surface matching, regardless of the global context-aware semantics of the surrounding text tokens. In this paper, we propose an Unsupervised Pseudo Semantic Data Augmentation (UniPSDA) mechanism for cross-lingual natural language understanding to enrich the training data without human interventions. Specifically, to retrieve the tokens with similar meanings for the semantic data augmentation across different languages, we propose a sequential clustering process in 3 stages: within a single language, across multiple languages of a language family, and across languages from multiple language families. Meanwhile, considering the multi-lingual knowledge infusion with context-aware semantics while alleviating computation burden, we directly replace the key constituents of the sentences with the above-learned multi-lingual family knowledge, viewed as pseudo-semantic. The infusion process is further optimized via three de-biasing techniques without introducing any neural parameters. Extensive experiments demonstrate that our model consistently improves the performance on general zero-shot cross-lingual natural language understanding tasks, including sequence classification, information extraction, and question answering.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
On the Role of Long-tail Knowledge in Retrieval Augmented Large Language Models
Authors:
Dongyang Li,
Junbing Yan,
Taolin Zhang,
Chengyu Wang,
Xiaofeng He,
Longtao Huang,
Hui Xue,
Jun Huang
Abstract:
Retrieval augmented generation (RAG) exhibits outstanding performance in promoting the knowledge capabilities of large language models (LLMs) with retrieved documents related to user queries. However, RAG only focuses on improving the response quality of LLMs via enhancing queries indiscriminately with retrieved information, paying little attention to what type of knowledge LLMs really need to ans…
▽ More
Retrieval augmented generation (RAG) exhibits outstanding performance in promoting the knowledge capabilities of large language models (LLMs) with retrieved documents related to user queries. However, RAG only focuses on improving the response quality of LLMs via enhancing queries indiscriminately with retrieved information, paying little attention to what type of knowledge LLMs really need to answer original queries more accurately. In this paper, we suggest that long-tail knowledge is crucial for RAG as LLMs have already remembered common world knowledge during large-scale pre-training. Based on our observation, we propose a simple but effective long-tail knowledge detection method for LLMs. Specifically, the novel Generative Expected Calibration Error (GECE) metric is derived to measure the ``long-tailness'' of knowledge based on both statistics and semantics. Hence, we retrieve relevant documents and infuse them into the model for patching knowledge loopholes only when the input query relates to long-tail knowledge. Experiments show that, compared to existing RAG pipelines, our method achieves over 4x speedup in average inference time and consistent performance improvement in downstream tasks.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
Wound Tissue Segmentation in Diabetic Foot Ulcer Images Using Deep Learning: A Pilot Study
Authors:
Mrinal Kanti Dhar,
Chuanbo Wang,
Yash Patel,
Taiyu Zhang,
Jeffrey Niezgoda,
Sandeep Gopalakrishnan,
Keke Chen,
Zeyun Yu
Abstract:
Identifying individual tissues, so-called tissue segmentation, in diabetic foot ulcer (DFU) images is a challenging task and little work has been published, largely due to the limited availability of a clinical image dataset. To address this gap, we have created a DFUTissue dataset for the research community to evaluate wound tissue segmentation algorithms. The dataset contains 110 images with tis…
▽ More
Identifying individual tissues, so-called tissue segmentation, in diabetic foot ulcer (DFU) images is a challenging task and little work has been published, largely due to the limited availability of a clinical image dataset. To address this gap, we have created a DFUTissue dataset for the research community to evaluate wound tissue segmentation algorithms. The dataset contains 110 images with tissues labeled by wound experts and 600 unlabeled images. Additionally, we conducted a pilot study on segmenting wound characteristics including fibrin, granulation, and callus using deep learning. Due to the limited amount of annotated data, our framework consists of both supervised learning (SL) and semi-supervised learning (SSL) phases. In the SL phase, we propose a hybrid model featuring a Mix Transformer (MiT-b3) in the encoder and a CNN in the decoder, enhanced by the integration of a parallel spatial and channel squeeze-and-excitation (P-scSE) module known for its efficacy in improving boundary accuracy. The SSL phase employs a pseudo-labeling-based approach, iteratively identifying and incorporating valuable unlabeled images to enhance overall segmentation performance. Comparative evaluations with state-of-the-art methods are conducted for both SL and SSL phases. The SL achieves a Dice Similarity Coefficient (DSC) of 84.89%, which has been improved to 87.64% in the SSL phase. Furthermore, the results are benchmarked against two widely used SSL approaches: Generative Adversarial Networks and Cross-Consistency Training. Additionally, our hybrid model outperforms the state-of-the-art methods with a 92.99% DSC in performing binary segmentation of DFU wound areas when tested on the Chronic Wound dataset. Codes and data are available at https://github.com/uwm-bigdata/DFUTissueSegNet.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
psPRF:Pansharpening Planar Neural Radiance Field for Generalized 3D Reconstruction Satellite Imagery
Authors:
Tongtong Zhang,
Yuanxiang Li
Abstract:
Most current NeRF variants for satellites are designed for one specific scene and fall short of generalization to new geometry. Additionally, the RGB images require pan-sharpening as an independent preprocessing step. This paper introduces psPRF, a Planar Neural Radiance Field designed for paired low-resolution RGB (LR-RGB) and high-resolution panchromatic (HR-PAN) images from satellite sensors wi…
▽ More
Most current NeRF variants for satellites are designed for one specific scene and fall short of generalization to new geometry. Additionally, the RGB images require pan-sharpening as an independent preprocessing step. This paper introduces psPRF, a Planar Neural Radiance Field designed for paired low-resolution RGB (LR-RGB) and high-resolution panchromatic (HR-PAN) images from satellite sensors with Rational Polynomial Cameras (RPC). To capture the cross-modal prior from both of the LR-RGB and HR-PAN images, for the Unet-shaped architecture, we adapt the encoder with explicit spectral-to-spatial convolution (SSConv) to enhance the multimodal representation ability. To support the generalization ability of psRPF across scenes, we adopt projection loss to ensure strong geometry self-supervision. The proposed method is evaluated with the multi-scene WorldView-3 LR-RGB and HR-PAN pairs, and achieves state-of-the-art performance.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Secure Combination of Untrusted Time information Based on Optimized Dempster-Shafer Theory
Authors:
Yang Li,
Yujie Luo,
Yichen Zhang,
Ao Sun,
Wei Huang,
Shuai Zhang,
Tao Zhang,
Chuang Zhou,
Li Ma,
Jie Yang,
Mei Wu,
Heng Wang,
Yan Pan,
Yun Shao,
Xing Chen,
Ziyang Chen,
Song Yu,
Hong Guo,
Bingjie Xu
Abstract:
Secure precision time synchronization is important for applications of Cyber-Physical Systems. However, several attacks, especially the Time Delay Attack (TDA), deteriorates the performance of time synchronization system seriously. Multiple paths scheme is thought as an effective security countermeasure to decrease the influence of TDA. However, the effective secure combination algorithm is still…
▽ More
Secure precision time synchronization is important for applications of Cyber-Physical Systems. However, several attacks, especially the Time Delay Attack (TDA), deteriorates the performance of time synchronization system seriously. Multiple paths scheme is thought as an effective security countermeasure to decrease the influence of TDA. However, the effective secure combination algorithm is still missed for precision time synchronization. In this paper, a secure combination algorithm based on Dempster-Shafer theory is proposed for multiple paths method. Special optimizations are done for the combination algorithm to solve the potential problems due to untrusted evidence. Theoretical simulation shows that the proposed algorithm works much better than Fault Tolerant Algorithm (FTA) and the attack detection method based on single path. And experimental demonstration proves the feasibility and superiority of the proposed algorithm, where the time stability with 27.97 ps, 1.57 ps, and 1.12 ps at average time 1s, 10s, 100s is achieved under TDA and local clock jump. The proposed algorithm can be used to improve the security and resilience of many importance synchronization protocol, such as NTP, PTP, and TWFTT.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
Large Batch Analysis for Adagrad Under Anisotropic Smoothness
Authors:
Yuxing Liu,
Rui Pan,
Tong Zhang
Abstract:
Adaptive gradient algorithms have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite their huge success in practice, their theoretical advantages over stochastic gradient descent (SGD) have not been fully understood, especially in the large batch-size setting commonly used in practice. This is because the only theoretical result that can d…
▽ More
Adaptive gradient algorithms have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite their huge success in practice, their theoretical advantages over stochastic gradient descent (SGD) have not been fully understood, especially in the large batch-size setting commonly used in practice. This is because the only theoretical result that can demonstrate the benefit of Adagrad over SGD was obtained in the original paper of Adagrad for nonsmooth objective functions. However, for nonsmooth objective functions, there can be a linear slowdown of convergence when batch size increases, and thus a convergence analysis based on nonsmooth assumption cannot be used for large batch algorithms. In this work, we resolve this gap between theory and practice by providing a new analysis of Adagrad on both convex and nonconvex smooth objectives suitable for the large batch setting. It is shown that under the anisotropic smoothness and noise conditions, increased batch size does not slow down convergence for Adagrad, and thus it can still achieve a faster convergence guarantee over SGD even in the large batch setting. We present detailed comparisons between SGD and Adagrad to provide a better understanding of the benefits of adaptive gradient methods. Experiments in logistic regression and instruction following fine-tuning tasks provide strong evidence to support our theoretical analysis.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Embracing Federated Learning: Enabling Weak Client Participation via Partial Model Training
Authors:
Sunwoo Lee,
Tuo Zhang,
Saurav Prakash,
Yue Niu,
Salman Avestimehr
Abstract:
In Federated Learning (FL), clients may have weak devices that cannot train the full model or even hold it in their memory space. To implement large-scale FL applications, thus, it is crucial to develop a distributed learning method that enables the participation of such weak clients. We propose EmbracingFL, a general FL framework that allows all available clients to join the distributed training…
▽ More
In Federated Learning (FL), clients may have weak devices that cannot train the full model or even hold it in their memory space. To implement large-scale FL applications, thus, it is crucial to develop a distributed learning method that enables the participation of such weak clients. We propose EmbracingFL, a general FL framework that allows all available clients to join the distributed training regardless of their system resource capacity. The framework is built upon a novel form of partial model training method in which each client trains as many consecutive output-side layers as its system resources allow. Our study demonstrates that EmbracingFL encourages each layer to have similar data representations across clients, improving FL efficiency. The proposed partial model training method guarantees convergence to a neighbor of stationary points for non-convex and smooth problems. We evaluate the efficacy of EmbracingFL under a variety of settings with a mixed number of strong, moderate (~40% memory), and weak (~15% memory) clients, datasets (CIFAR-10, FEMNIST, and IMDB), and models (ResNet20, CNN, and LSTM). Our empirical study shows that EmbracingFL consistently achieves high accuracy as like all clients are strong, outperforming the state-of-the-art width reduction methods (i.e. HeteroFL and FjORD).
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Harnessing Knowledge Retrieval with Large Language Models for Clinical Report Error Correction
Authors:
**ge Wu,
Zhaolong Wu,
Abul Hasan,
Yunsoo Kim,
Jason P. Y. Cheung,
Teng Zhang,
Honghan Wu
Abstract:
This study proposes an approach for error correction in clinical radiology reports, leveraging large language models (LLMs) and retrieval-augmented generation (RAG) techniques. The proposed framework employs internal and external retrieval mechanisms to extract relevant medical entities and relations from the report and external knowledge sources. A three-stage inference process is introduced, dec…
▽ More
This study proposes an approach for error correction in clinical radiology reports, leveraging large language models (LLMs) and retrieval-augmented generation (RAG) techniques. The proposed framework employs internal and external retrieval mechanisms to extract relevant medical entities and relations from the report and external knowledge sources. A three-stage inference process is introduced, decomposing the task into error detection, localization, and correction subtasks, which enhances the explainability and performance of the system. The effectiveness of the approach is evaluated using a benchmark dataset created by corrupting real-world radiology reports with realistic errors, guided by domain experts. Experimental results demonstrate the benefits of the proposed methods, with the combination of internal and external retrieval significantly improving the accuracy of error detection, localization, and correction across various state-of-the-art LLMs. The findings contribute to the development of more robust and reliable error correction systems for clinical documentation.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
IDentity with Locality: An ideal hash for gene sequence search
Authors:
Aditya Desai,
Gaurav Gupta,
Tianyi Zhang,
Anshumali Shrivastava
Abstract:
Gene sequence search is a fundamental operation in computational genomics. Due to the petabyte scale of genome archives, most gene search systems now use hashing-based data structures such as Bloom Filters (BF). The state-of-the-art systems such as Compact bit-slicing signature index (COBS) and Repeated And Merged Bloom filters (RAMBO) use BF with Random Hash (RH) functions for gene representation…
▽ More
Gene sequence search is a fundamental operation in computational genomics. Due to the petabyte scale of genome archives, most gene search systems now use hashing-based data structures such as Bloom Filters (BF). The state-of-the-art systems such as Compact bit-slicing signature index (COBS) and Repeated And Merged Bloom filters (RAMBO) use BF with Random Hash (RH) functions for gene representation and identification. The standard recipe is to cast the gene search problem as a sequence of membership problems testing if each subsequent gene substring (called kmer) of Q is present in the set of kmers of the entire gene database D. We observe that RH functions, which are crucial to the memory and the computational advantage of BF, are also detrimental to the system performance of gene-search systems. While subsequent kmers being queried are likely very similar, RH, oblivious to any similarity, uniformly distributes the kmers to different parts of potentially large BF, thus triggering excessive cache misses and causing system slowdown. We propose a novel hash function called the Identity with Locality (IDL) hash family, which co-locates the keys close in input space without causing collisions. This approach ensures both cache locality and key preservation. IDL functions can be a drop-in replacement for RH functions and help improve the performance of information retrieval systems. We give a simple but practical construction of IDL function families and show that replacing the RH with IDL functions reduces cache misses by a factor of 5x, thus improving query and indexing times of SOTA methods such as COBS and RAMBO by factors up to 2x without compromising their quality. We also provide a theoretical analysis of the false positive rate of BF with IDL functions. Our hash function is the first study that bridges Locality Sensitive Hash (LSH) and RH to obtain cache efficiency.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Advantage Alignment Algorithms
Authors:
Juan Agustin Duque,
Milad Aghajohari,
Tim Cooijmans,
Tianyu Zhang,
Aaron Courville
Abstract:
The growing presence of artificially intelligent agents in everyday decision-making, from LLM assistants to autonomous vehicles, hints at a future in which conflicts may arise from each agent optimizing individual interests. In general-sum games these conflicts are apparent, where naive Reinforcement Learning agents get stuck in Pareto-suboptimal Nash equilibria. Consequently, opponent sha** has…
▽ More
The growing presence of artificially intelligent agents in everyday decision-making, from LLM assistants to autonomous vehicles, hints at a future in which conflicts may arise from each agent optimizing individual interests. In general-sum games these conflicts are apparent, where naive Reinforcement Learning agents get stuck in Pareto-suboptimal Nash equilibria. Consequently, opponent sha** has been introduced as a method with success at finding socially beneficial equilibria in social dilemmas. In this work, we introduce Advantage Alignment, a family of algorithms derived from first principles that perform opponent sha** efficiently and intuitively. This is achieved by aligning the advantages of conflicting agents in a given game by increasing the probability of mutually-benefiting actions. We prove that existing opponent sha** methods, including LOLA and LOQA, implicitly perform Advantage Alignment. Compared to these works, Advantage Alignment mathematically simplifies the formulation of opponent sha** and seamlessly works for continuous action domains. We also demonstrate the effectiveness of our algorithm in a wide range of social dilemmas, achieving state of the art results in each case, including a social dilemma version of the Negotiation Game.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
RTFormer: Re-parameter TSBN Spiking Transformer
Authors:
Hongzhi Wang,
Xiubo Liang,
Mengjian Li,
Tao Zhang
Abstract:
The Spiking Neural Networks (SNNs), renowned for their bio-inspired operational mechanism and energy efficiency, mirror the human brain's neural activity. Yet, SNNs face challenges in balancing energy efficiency with the computational demands of advanced tasks. Our research introduces the RTFormer, a novel architecture that embeds Re-parameterized Temporal Sliding Batch Normalization (TSBN) within…
▽ More
The Spiking Neural Networks (SNNs), renowned for their bio-inspired operational mechanism and energy efficiency, mirror the human brain's neural activity. Yet, SNNs face challenges in balancing energy efficiency with the computational demands of advanced tasks. Our research introduces the RTFormer, a novel architecture that embeds Re-parameterized Temporal Sliding Batch Normalization (TSBN) within the Spiking Transformer framework. This innovation optimizes energy usage during inference while ensuring robust computational performance. The crux of RTFormer lies in its integration of reparameterized convolutions and TSBN, achieving an equilibrium between computational prowess and energy conservation.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models
Authors:
Tao Zhang,
Ziqian Zeng,
Yuxiang Xiao,
Hui** Zhuang,
Cen Chen,
James Foulds,
Shimei Pan
Abstract:
Large Language Models (LLMs) are prone to generating content that exhibits gender biases, raising significant ethical concerns. Alignment, the process of fine-tuning LLMs to better align with desired behaviors, is recognized as an effective approach to mitigate gender biases. Although proprietary LLMs have made significant strides in mitigating gender bias, their alignment datasets are not publicl…
▽ More
Large Language Models (LLMs) are prone to generating content that exhibits gender biases, raising significant ethical concerns. Alignment, the process of fine-tuning LLMs to better align with desired behaviors, is recognized as an effective approach to mitigate gender biases. Although proprietary LLMs have made significant strides in mitigating gender bias, their alignment datasets are not publicly available. The commonly used and publicly available alignment dataset, HH-RLHF, still exhibits gender bias to some extent. There is a lack of publicly available alignment datasets specifically designed to address gender bias. Hence, we developed a new dataset named GenderAlign, aiming at mitigating a comprehensive set of gender biases in LLMs. This dataset comprises 8k single-turn dialogues, each paired with a "chosen" and a "rejected" response. Compared to the "rejected" responses, the "chosen" responses demonstrate lower levels of gender bias and higher quality. Furthermore, we categorized the gender biases in the "rejected" responses of GenderAlign into 4 principal categories. The experimental results show the effectiveness of GenderAlign in reducing gender bias in LLMs.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
Authors:
Haoxiang Wang,
Wei Xiong,
Tengyang Xie,
Han Zhao,
Tong Zhang
Abstract:
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process typically starts by training a reward model (RM) using human preference data. Conventional RMs are trained on pairwise responses to the same user request, with relative ratings indicating which response humans prefer. The trained RM…
▽ More
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process typically starts by training a reward model (RM) using human preference data. Conventional RMs are trained on pairwise responses to the same user request, with relative ratings indicating which response humans prefer. The trained RM serves as a proxy for human preferences. However, due to the black-box nature of RMs, their outputs lack interpretability, as humans cannot intuitively understand why an RM thinks a response is good or not. As RMs act as human preference proxies, we believe they should be human-interpretable to ensure that their internal decision processes are consistent with human preferences and to prevent reward hacking in LLM alignment. To build RMs with interpretable preferences, we propose a two-stage approach: i) train an Absolute-Rating Multi-Objective Reward Model (ArmoRM) with multi-dimensional absolute-rating data, each dimension corresponding to a human-interpretable objective (e.g., honesty, verbosity, safety); ii) employ a Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most suitable reward objectives based on the context. We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM. Our trained model, ArmoRM-Llama3-8B, obtains state-of-the-art performance on RewardBench, a benchmark evaluating RMs for language modeling. Notably, the performance of our model surpasses the LLM-as-a-judge method with GPT-4 judges by a margin, and approaches the performance of the much larger Nemotron-4 340B reward model.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models
Authors:
Hengyi Wang,
Haizhou Shi,
Shiwei Tan,
Weiyi Qin,
Wenyuan Wang,
Tunyu Zhang,
Akshay Nambi,
Tanuja Ganu,
Hao Wang
Abstract:
Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address these gaps, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-contex…
▽ More
Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address these gaps, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios, but suffers from hallucination problems in negative samples, i.e., when needles are not in the haystacks. Our comprehensive long-context evaluation of MLLMs also sheds lights on the considerable performance gap between API-based and open-source models. All the code, data, and instructions required to reproduce the main results are available at https://github.com/Wang-ML-Lab/multimodal-needle-in-a-haystack.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers
Authors:
Tianhua Zhang,
Kun Li,
Hongyin Luo,
Xixin Wu,
James Glass,
Helen Meng
Abstract:
Query rewriting is a crucial technique for passage retrieval in open-domain conversational question answering (CQA). It decontexualizes conversational queries into self-contained questions suitable for off-the-shelf retrievers. Existing methods attempt to incorporate retriever's preference during the training of rewriting models. However, these approaches typically rely on extensive annotations su…
▽ More
Query rewriting is a crucial technique for passage retrieval in open-domain conversational question answering (CQA). It decontexualizes conversational queries into self-contained questions suitable for off-the-shelf retrievers. Existing methods attempt to incorporate retriever's preference during the training of rewriting models. However, these approaches typically rely on extensive annotations such as in-domain rewrites and/or relevant passage labels, limiting the models' generalization and adaptation capabilities. In this paper, we introduce AdaQR ($\textbf{Ada}$ptive $\textbf{Q}$uery $\textbf{R}$ewriting), a framework for training query rewriting models with limited rewrite annotations from seed datasets and completely no passage label. Our approach begins by fine-tuning compact large language models using only ~$10\%$ of rewrite annotations from the seed dataset training split. The models are then utilized to generate rewrite candidates for each query instance. A novel approach is then proposed to assess retriever's preference for these candidates by the probability of answers conditioned on the conversational query by marginalizing the Top-$K$ passages. This serves as the reward for optimizing the rewriter further using Direct Preference Optimization (DPO), a process free of rewrite and retrieval annotations. Experimental results on four open-domain CQA datasets demonstrate that AdaQR not only enhances the in-domain capabilities of the rewriter with limited annotation requirement, but also adapts effectively to out-of-domain datasets.
△ Less
Submitted 16 June, 2024;
originally announced June 2024.
-
ESCoT: Towards Interpretable Emotional Support Dialogue Systems
Authors:
Tenggan Zhang,
Xinjie Zhang,
**ming Zhao,
Li Zhou,
Qin **
Abstract:
Understanding the reason for emotional support response is crucial for establishing connections between users and emotional support dialogue systems. Previous works mostly focus on generating better responses but ignore interpretability, which is extremely important for constructing reliable dialogue systems. To empower the system with better interpretability, we propose an emotional support respo…
▽ More
Understanding the reason for emotional support response is crucial for establishing connections between users and emotional support dialogue systems. Previous works mostly focus on generating better responses but ignore interpretability, which is extremely important for constructing reliable dialogue systems. To empower the system with better interpretability, we propose an emotional support response generation scheme, named $\textbf{E}$motion-Focused and $\textbf{S}$trategy-Driven $\textbf{C}$hain-$\textbf{o}$f-$\textbf{T}$hought ($\textbf{ESCoT}$), mimicking the process of $\textit{identifying}$, $\textit{understanding}$, and $\textit{regulating}$ emotions. Specially, we construct a new dataset with ESCoT in two steps: (1) $\textit{Dialogue Generation}$ where we first generate diverse conversation situations, then enhance dialogue generation using richer emotional support strategies based on these situations; (2) $\textit{Chain Supplement}$ where we focus on supplementing selected dialogues with elements such as emotion, stimuli, appraisal, and strategy reason, forming the manually verified chains. Additionally, we further develop a model to generate dialogue responses with better interpretability. We also conduct extensive experiments and human evaluations to validate the effectiveness of the proposed ESCoT and generated dialogue responses. Our data and code are available at $\href{https://github.com/TeigenZhang/ESCoT}{https://github.com/TeigenZhang/ESCoT}$.
△ Less
Submitted 16 June, 2024;
originally announced June 2024.
-
An LLM-enhanced Multi-objective Evolutionary Search for Autonomous Driving Test Scenario Generation
Authors:
Haoxiang Tian,
Xingshuo Han,
Guoquan Wu,
Yuan Zhou,
Shuo Li,
Jun Wei,
Dan Ye,
Wei Wang,
Tianwei Zhang
Abstract:
The safety of Autonomous Driving Systems (ADSs) is significantly important for the implementation of autonomous vehicles (AVs). Therefore, ADSs must be evaluated thoroughly before their release and deployment to the public. How to generate diverse safety-critical test scenarios is a key task for ADS testing. This paper proposes LEADE, an LLM-enhanced scenario generation approach for ADS testing, w…
▽ More
The safety of Autonomous Driving Systems (ADSs) is significantly important for the implementation of autonomous vehicles (AVs). Therefore, ADSs must be evaluated thoroughly before their release and deployment to the public. How to generate diverse safety-critical test scenarios is a key task for ADS testing. This paper proposes LEADE, an LLM-enhanced scenario generation approach for ADS testing, which adopts the LLM-enhanced adaptive evolutionary search to generate safety-critical and diverse test scenarios. LEADE leverages LLM's ability in program understanding to better comprehend the scenario generation task, which generates high-quality scenarios of the first generation. LEADE adopts an adaptive multi-objective genetic algorithm to search for diverse safety-critical scenarios. To guide the search away from the local optima, LEADE formulates the evolutionary search into a QA task, which leverages LLM's ability in quantitative reasoning to generate differential seed scenarios to break out of the local optimal solutions. We implement and evaluate LEADE on industrial-grade full-stack ADS platform, Baidu Apollo. Experimental results show that LEADE can effectively and efficiently generate safety-critical scenarios and expose 10 diverse safety violations of Apollo. It outperforms two state-of-the-art search-based ADS testing techniques by identifying 4 new types of safety-critical scenarios on the same roads.
△ Less
Submitted 16 June, 2024;
originally announced June 2024.
-
Technique Report of CVPR 2024 PBDL Challenges
Authors:
Ying Fu,
Yu Li,
Shaodi You,
Boxin Shi,
Jose Alvarez,
Coert van Gemeren,
Linwei Chen,
Yunhao Zou,
Zichun Wang,
Yichen Li,
Yuze Han,
Yingkai Zhang,
Jianan Wang,
Qinglin Liu,
Wei Yu,
Xiaoqian Lv,
Jianing Li,
Sheng** Zhang,
Xiangyang Ji,
Yuanpei Chen,
Yuhan Zhang,
Weihang Peng,
Liwen Zhang,
Zhe Xu,
Dingyong Gou
, et al. (77 additional authors not shown)
Abstract:
The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, a…
▽ More
The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, and medium properties from images. In recent years, deep learning has shown promising improvements for various vision tasks, and when combined with physics-based vision, these approaches can enhance the robustness and accuracy of vision systems. This technical report summarizes the outcomes of the Physics-Based Vision Meets Deep Learning (PBDL) 2024 challenge, held in CVPR 2024 workshop. The challenge consisted of eight tracks, focusing on Low-Light Enhancement and Detection as well as High Dynamic Range (HDR) Imaging. This report details the objectives, methodologies, and results of each track, highlighting the top-performing solutions and their innovative approaches.
△ Less
Submitted 27 June, 2024; v1 submitted 15 June, 2024;
originally announced June 2024.
-
Leveraging Locality to Boost Sample Efficiency in Robotic Manipulation
Authors:
Tong Zhang,
Yingdong Hu,
Jiacheng You,
Yang Gao
Abstract:
Given the high cost of collecting robotic data in the real world, sample efficiency is a consistently compelling pursuit in robotics. In this paper, we introduce SGRv2, an imitation learning framework that enhances sample efficiency through improved visual and action representations. Central to the design of SGRv2 is the incorporation of a critical inductive bias-action locality, which posits that…
▽ More
Given the high cost of collecting robotic data in the real world, sample efficiency is a consistently compelling pursuit in robotics. In this paper, we introduce SGRv2, an imitation learning framework that enhances sample efficiency through improved visual and action representations. Central to the design of SGRv2 is the incorporation of a critical inductive bias-action locality, which posits that robot's actions are predominantly influenced by the target object and its interactions with the local environment. Extensive experiments in both simulated and real-world settings demonstrate that action locality is essential for boosting sample efficiency. SGRv2 excels in RLBench tasks with keyframe control using merely 5 demonstrations and surpasses the RVT baseline in 23 of 26 tasks. Furthermore, when evaluated on ManiSkill2 and MimicGen using dense control, SGRv2's success rate is 2.54 times that of SGR. In real-world environments, with only eight demonstrations, SGRv2 can perform a variety of tasks at a markedly higher success rate compared to baseline models. Project website: http://sgrv2-robot.github.io
△ Less
Submitted 15 June, 2024;
originally announced June 2024.
-
Creating a Lens of Chinese Culture: A Multimodal Dataset for Chinese Pun Rebus Art Understanding
Authors:
Tuo Zhang,
Tiantian Feng,
Yibin Ni,
Mengqin Cao,
Ruying Liu,
Katharine Butler,
Yanjun Weng,
Mi Zhang,
Shrikanth S. Narayanan,
Salman Avestimehr
Abstract:
Large vision-language models (VLMs) have demonstrated remarkable abilities in understanding everyday content. However, their performance in the domain of art, particularly culturally rich art forms, remains less explored. As a pearl of human wisdom and creativity, art encapsulates complex cultural narratives and symbolism. In this paper, we offer the Pun Rebus Art Dataset, a multimodal dataset for…
▽ More
Large vision-language models (VLMs) have demonstrated remarkable abilities in understanding everyday content. However, their performance in the domain of art, particularly culturally rich art forms, remains less explored. As a pearl of human wisdom and creativity, art encapsulates complex cultural narratives and symbolism. In this paper, we offer the Pun Rebus Art Dataset, a multimodal dataset for art understanding deeply rooted in traditional Chinese culture. We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explanations for the conveyed messages. Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations and showing limited improvement through in-context learning. By releasing the Pun Rebus Art Dataset, we aim to facilitate the development of VLMs that can better understand and interpret culturally specific content, promoting greater inclusiveness beyond English-based corpora.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
VeraCT Scan: Retrieval-Augmented Fake News Detection with Justifiable Reasoning
Authors:
Cheng Niu,
Yang Guan,
Yuanhao Wu,
Juno Zhu,
Juntong Song,
Randy Zhong,
Kaihua Zhu,
Siliang Xu,
Shizhe Diao,
Tong Zhang
Abstract:
The proliferation of fake news poses a significant threat not only by disseminating misleading information but also by undermining the very foundations of democracy. The recent advance of generative artificial intelligence has further exacerbated the challenge of distinguishing genuine news from fabricated stories. In response to this challenge, we introduce VeraCT Scan, a novel retrieval-augmente…
▽ More
The proliferation of fake news poses a significant threat not only by disseminating misleading information but also by undermining the very foundations of democracy. The recent advance of generative artificial intelligence has further exacerbated the challenge of distinguishing genuine news from fabricated stories. In response to this challenge, we introduce VeraCT Scan, a novel retrieval-augmented system for fake news detection. This system operates by extracting the core facts from a given piece of news and subsequently conducting an internet-wide search to identify corroborating or conflicting reports. Then sources' credibility is leveraged for information verification. Besides determining the veracity of news, we also provide transparent evidence and reasoning to support its conclusions, resulting in the interpretability and trust in the results. In addition to GPT-4 Turbo, Llama-2 13B is also fine-tuned for news content understanding, information verification, and reasoning. Both implementations have demonstrated state-of-the-art accuracy in the realm of fake news detection.
△ Less
Submitted 24 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs
Authors:
Rui Yang,
Ruomeng Ding,
Yong Lin,
Huan Zhang,
Tong Zhang
Abstract:
Reward models trained on human preference data have been proven to be effective for aligning Large Language Models (LLMs) with human intent within the reinforcement learning from human feedback (RLHF) framework. However, the generalization capabilities of current reward models to unseen prompts and responses are limited. This limitation can lead to an unexpected phenomenon known as reward over-opt…
▽ More
Reward models trained on human preference data have been proven to be effective for aligning Large Language Models (LLMs) with human intent within the reinforcement learning from human feedback (RLHF) framework. However, the generalization capabilities of current reward models to unseen prompts and responses are limited. This limitation can lead to an unexpected phenomenon known as reward over-optimization, where excessive optimization of rewards results in a decline in actual performance. While previous research has advocated for constraining policy optimization, our study proposes a novel approach to enhance the reward model's generalization ability against distribution shifts by regularizing the hidden states. Specifically, we retain the base model's language model head and incorporate a suite of text-generation losses to preserve the hidden states' text generation capabilities, while concurrently learning a reward head behind the same hidden states. Our experimental results demonstrate that the introduced regularization technique markedly improves the accuracy of learned reward models across a variety of out-of-distribution (OOD) tasks and effectively alleviate the over-optimization issue in RLHF, offering a more reliable and robust preference learning paradigm.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
NeST: Neural Stress Tensor Tomography by leveraging 3D Photoelasticity
Authors:
Akshat Dave,
Tianyi Zhang,
Aaron Young,
Ramesh Raskar,
Wolfgang Heidrich,
Ashok Veeraraghavan
Abstract:
Photoelasticity enables full-field stress analysis in transparent objects through stress-induced birefringence. Existing techniques are limited to 2D slices and require destructively slicing the object. Recovering the internal 3D stress distribution of the entire object is challenging as it involves solving a tensor tomography problem and handling phase wrap** ambiguities. We introduce NeST, an…
▽ More
Photoelasticity enables full-field stress analysis in transparent objects through stress-induced birefringence. Existing techniques are limited to 2D slices and require destructively slicing the object. Recovering the internal 3D stress distribution of the entire object is challenging as it involves solving a tensor tomography problem and handling phase wrap** ambiguities. We introduce NeST, an analysis-by-synthesis approach for reconstructing 3D stress tensor fields as neural implicit representations from polarization measurements. Our key insight is to jointly handle phase unwrap** and tensor tomography using a differentiable forward model based on Jones calculus. Our non-linear model faithfully matches real captures, unlike prior linear approximations. We develop an experimental multi-axis polariscope setup to capture 3D photoelasticity and experimentally demonstrate that NeST reconstructs the internal stress distribution for objects with varying shape and force conditions. Additionally, we showcase novel applications in stress analysis, such as visualizing photoelastic fringes by virtually slicing the object and viewing photoelastic fringes from unseen viewpoints. NeST paves the way for scalable non-destructive 3D photoelastic analysis.
△ Less
Submitted 24 June, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
-
Chain-of-Though (CoT) prompting strategies for medical error detection and correction
Authors:
Zhaolong Wu,
Abul Hasan,
**ge Wu,
Yunsoo Kim,
Jason P. Y. Cheung,
Teng Zhang,
Honghan Wu
Abstract:
This paper describes our submission to the MEDIQA-CORR 2024 shared task for automatically detecting and correcting medical errors in clinical notes. We report results for three methods of few-shot In-Context Learning (ICL) augmented with Chain-of-Thought (CoT) and reason prompts using a large language model (LLM). In the first method, we manually analyse a subset of train and validation dataset to…
▽ More
This paper describes our submission to the MEDIQA-CORR 2024 shared task for automatically detecting and correcting medical errors in clinical notes. We report results for three methods of few-shot In-Context Learning (ICL) augmented with Chain-of-Thought (CoT) and reason prompts using a large language model (LLM). In the first method, we manually analyse a subset of train and validation dataset to infer three CoT prompts by examining error types in the clinical notes. In the second method, we utilise the training dataset to prompt the LLM to deduce reasons about their correctness or incorrectness. The constructed CoTs and reasons are then augmented with ICL examples to solve the tasks of error detection, span identification, and error correction. Finally, we combine the two methods using a rule-based ensemble method. Across the three sub-tasks, our ensemble method achieves a ranking of 3rd for both sub-task 1 and 2, while securing 7th place in sub-task 3 among all submissions.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Beyond Recommendations: From Backward to Forward AI Support of Pilots' Decision-Making Process
Authors:
Zelun Tony Zhang,
Sebastian S. Feger,
Lucas Dullenkopf,
Rulu Liao,
Lukas Süsslin,
Yuanting Liu,
Andreas Butz
Abstract:
AI is anticipated to enhance human decision-making in high-stakes domains like aviation, but adoption is often hindered by challenges such as inappropriate reliance and poor alignment with users' decision-making. Recent research suggests that a core underlying issue is the recommendation-centric design of many AI systems, i.e., they give end-to-end recommendations and ignore the rest of the decisi…
▽ More
AI is anticipated to enhance human decision-making in high-stakes domains like aviation, but adoption is often hindered by challenges such as inappropriate reliance and poor alignment with users' decision-making. Recent research suggests that a core underlying issue is the recommendation-centric design of many AI systems, i.e., they give end-to-end recommendations and ignore the rest of the decision-making process. Alternative support paradigms are rare, and it remains unclear how the few that do exist compare to recommendation-centric support. In this work, we aimed to empirically compare recommendation-centric support to an alternative paradigm, continuous support, in the context of diversions in aviation. We conducted a mixed-methods study with 32 professional pilots in a realistic setting. To ensure the quality of our study scenarios, we conducted a focus group with four additional pilots prior to the study. We found that continuous support can support pilots' decision-making in a forward direction, allowing them to think more beyond the limits of the system and make faster decisions when combined with recommendations, though the forward support can be disrupted. Participants' statements further suggest a shift in design goal away from providing recommendations, to supporting quick information gathering. Our results show ways to design more helpful and effective AI decision support that goes beyond end-to-end recommendations.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability,Reproducibility, and Practicality
Authors:
Tianle Zhang,
Langtian Ma,
Yuchen Yan,
Yuchen Zhang,
Kai Wang,
Yue Yang,
Ziyao Guo,
Wenqi Shao,
Yang You,
Yu Qiao,
** Luo,
Kaipeng Zhang
Abstract:
Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen2, Pika, and Sora, have significantly broadened its applicability and popularity. Despite these strides, evaluating these models poses substantial challenges. Primarily, due to the limitations inherent in automatic metrics, manual evaluation is often considered a superior method for assessing T2V generation. H…
▽ More
Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen2, Pika, and Sora, have significantly broadened its applicability and popularity. Despite these strides, evaluating these models poses substantial challenges. Primarily, due to the limitations inherent in automatic metrics, manual evaluation is often considered a superior method for assessing T2V generation. However, existing manual evaluation protocols face reproducibility, reliability, and practicality issues. To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models. The T2VHE protocol includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module. Experimental results demonstrate that this protocol not only ensures high-quality annotations but can also reduce evaluation costs by nearly 50%. We will open-source the entire setup of the T2VHE protocol, including the complete protocol workflow, the dynamic evaluation component details, and the annotation interface code. This will help communities establish more sophisticated human assessment protocols.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Gaussian-Forest: Hierarchical-Hybrid 3D Gaussian Splatting for Compressed Scene Modeling
Authors:
Fengyi Zhang,
Tianjun Zhang,
Lin Zhang,
Helen Huang,
Yadan Luo
Abstract:
The field of novel-view synthesis has recently witnessed the emergence of 3D Gaussian Splatting, which represents scenes in a point-based manner and renders through rasterization. This methodology, in contrast to Radiance Fields that rely on ray tracing, demonstrates superior rendering quality and speed. However, the explicit and unstructured nature of 3D Gaussians poses a significant storage chal…
▽ More
The field of novel-view synthesis has recently witnessed the emergence of 3D Gaussian Splatting, which represents scenes in a point-based manner and renders through rasterization. This methodology, in contrast to Radiance Fields that rely on ray tracing, demonstrates superior rendering quality and speed. However, the explicit and unstructured nature of 3D Gaussians poses a significant storage challenge, impeding its broader application. To address this challenge, we introduce the Gaussian-Forest modeling framework, which hierarchically represents a scene as a forest of hybrid 3D Gaussians. Each hybrid Gaussian retains its unique explicit attributes while sharing implicit ones with its sibling Gaussians, thus optimizing parameterization with significantly fewer variables. Moreover, adaptive growth and pruning strategies are designed, ensuring detailed representation in complex regions and a notable reduction in the number of required Gaussians. Extensive experiments demonstrate that Gaussian-Forest not only maintains comparable speed and quality but also achieves a compression rate surpassing 10 times, marking a significant advancement in efficient scene modeling. Codes are available at https://github.com/Xian-Bei/GaussianForest.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Where Do Large Language Models Fail When Generating Code?
Authors:
Zhijie Wang,
Zijie Zhou,
Da Song,
Yuheng Huang,
Shengmai Chen,
Lei Ma,
Tianyi Zhang
Abstract:
Large Language Models (LLMs) have shown great potential in code generation. However, current LLMs still cannot reliably generate correct code. Moreover, it is unclear what kinds of code generation errors LLMs can make. To address this, we conducted an empirical study to analyze incorrect code snippets generated by six popular LLMs on the HumanEval dataset. We analyzed these errors alongside two di…
▽ More
Large Language Models (LLMs) have shown great potential in code generation. However, current LLMs still cannot reliably generate correct code. Moreover, it is unclear what kinds of code generation errors LLMs can make. To address this, we conducted an empirical study to analyze incorrect code snippets generated by six popular LLMs on the HumanEval dataset. We analyzed these errors alongside two dimensions of error characteristics -- semantic characteristics and syntactic characteristics -- to derive a comprehensive code generation error taxonomy for LLMs through open coding and thematic analysis. We then labeled all 558 incorrect code snippets based on this taxonomy. Our results showed that the six LLMs exhibited different distributions of semantic and syntactic characteristics. Furthermore, we analyzed the correlation between different error characteristics and factors such as prompt length, code length, and test-pass rate. Finally, we highlight the challenges that LLMs may encounter when generating code and propose implications for future research on reliable code generation with LLMs.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer
Authors:
Yitao Xu,
Tong Zhang,
Sabine Süsstrunk
Abstract:
Vision Transformers (ViTs) have demonstrated remarkable performance in image classification tasks, particularly when equipped with local information via region attention or convolutions. While such architectures improve the feature aggregation from different granularities, they often fail to contribute to the robustness of the networks. Neural Cellular Automata (NCA) enables the modeling of global…
▽ More
Vision Transformers (ViTs) have demonstrated remarkable performance in image classification tasks, particularly when equipped with local information via region attention or convolutions. While such architectures improve the feature aggregation from different granularities, they often fail to contribute to the robustness of the networks. Neural Cellular Automata (NCA) enables the modeling of global cell representations through local interactions, with its training strategies and architecture design conferring strong generalization ability and robustness against noisy inputs. In this paper, we propose Adaptor Neural Cellular Automata (AdaNCA) for Vision Transformer that uses NCA as plug-in-play adaptors between ViT layers, enhancing ViT's performance and robustness against adversarial samples as well as out-of-distribution inputs. To overcome the large computational overhead of standard NCAs, we propose Dynamic Interaction for more efficient interaction learning. Furthermore, we develop an algorithm for identifying the most effective insertion points for AdaNCA based on our analysis of AdaNCA placement and robustness improvement. With less than a 3% increase in parameters, AdaNCA contributes to more than 10% absolute improvement in accuracy under adversarial attacks on the ImageNet1K benchmark. Moreover, we demonstrate with extensive evaluations across 8 robustness benchmarks and 4 ViT architectures that AdaNCA, as a plug-in-play module, consistently improves the robustness of ViTs.
△ Less
Submitted 20 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation
Authors:
Lu Li,
Tianyu Zhang,
Zhiqi Bu,
Suyuchen Wang,
Huan He,
Jie Fu,
Yonghui Wu,
Jiang Bian,
Yong Chen,
Yoshua Bengio
Abstract:
Model merging has emerged as an effective approach to combine multiple single-task models, fine-tuned from the same pre-trained model, into a multitask model. This process typically involves computing a weighted average of the model parameters without any additional training. Existing model-merging methods focus on enhancing average task accuracy. However, interference and conflicts between the ob…
▽ More
Model merging has emerged as an effective approach to combine multiple single-task models, fine-tuned from the same pre-trained model, into a multitask model. This process typically involves computing a weighted average of the model parameters without any additional training. Existing model-merging methods focus on enhancing average task accuracy. However, interference and conflicts between the objectives of different tasks can lead to trade-offs during model merging. In real-world applications, a set of solutions with various trade-offs can be more informative, hel** practitioners make decisions based on diverse preferences. In this paper, we introduce a novel low-compute algorithm, Model Merging with Amortized Pareto Front (MAP). MAP identifies a Pareto set of scaling coefficients for merging multiple models to reflect the trade-offs. The core component of MAP is approximating the evaluation metrics of the various tasks using a quadratic approximation surrogate model derived from a pre-selected set of scaling coefficients, enabling amortized inference. Experimental results on vision and natural language processing tasks show that MAP can accurately identify the Pareto front. To further reduce the required computation of MAP, we propose (1) a Bayesian adaptive sampling algorithm and (2) a nested merging scheme with multiple stages.
△ Less
Submitted 18 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions
Authors:
Renjie Pi,
Jianshu Zhang,
Jipeng Zhang,
Rui Pan,
Zhekai Chen,
Tong Zhang
Abstract:
Image description datasets play a crucial role in the advancement of various applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is the scra** of image-text pairs from the web. Despite their abundance, these descriptions are often of low quality and noisy. Another is t…
▽ More
Image description datasets play a crucial role in the advancement of various applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is the scra** of image-text pairs from the web. Despite their abundance, these descriptions are often of low quality and noisy. Another is through human labeling. Datasets such as COCO are generally very short and lack details. Although detailed image descriptions can be annotated by humans, the high annotation cost limits the feasibility. These limitations underscore the need for more efficient and scalable methods to generate accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models in a collaborative manner, which maximally convert the visual information into text. To address the current lack of benchmarks for detailed descriptions, we propose several benchmarks for comprehensive evaluation, which verifies the quality of image descriptions created by our framework. Furthermore, we show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquire improved capability to generate richer image descriptions, substantially increasing the length and detail of their output with less hallucination.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
VCR: Visual Caption Restoration
Authors:
Tianyu Zhang,
Suyuchen Wang,
Lu Li,
Ge Zhang,
Perouz Taslakian,
Sai Rajeswar,
Jie Fu,
Bang Liu,
Yoshua Bengio
Abstract:
We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedde…
▽ More
We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images. While numerous works have integrated text embedded in images into visual question-answering tasks, approaches to these tasks generally rely on optical character recognition or masked language modeling, thus reducing the task to mainly text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct a dataset for VCR called VCR-Wiki using images with captions from Wikipedia, comprising 2.11M English and 346K Chinese entities in both easy and hard split variants. Our results reveal that current vision language models significantly lag behind human performance in the VCR task, and merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-Wiki and the data construction code to facilitate future research.
△ Less
Submitted 24 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Differentiable Discrete Elastic Rods for Real-Time Modeling of Deformable Linear Objects
Authors:
Yizhou Chen,
Yiting Zhang,
Zachary Brei,
Tiancheng Zhang,
Yuzhen Chen,
Julie Wu,
Ram Vasudevan
Abstract:
This paper addresses the task of modeling Deformable Linear Objects (DLOs), such as ropes and cables, during dynamic motion over long time horizons. This task presents significant challenges due to the complex dynamics of DLOs. To address these challenges, this paper proposes differentiable Discrete Elastic Rods For deformable linear Objects with Real-time Modeling (DEFORM), a novel framework that…
▽ More
This paper addresses the task of modeling Deformable Linear Objects (DLOs), such as ropes and cables, during dynamic motion over long time horizons. This task presents significant challenges due to the complex dynamics of DLOs. To address these challenges, this paper proposes differentiable Discrete Elastic Rods For deformable linear Objects with Real-time Modeling (DEFORM), a novel framework that combines a differentiable physics-based model with a learning framework to model DLOs accurately and in real-time. The performance of DEFORM is evaluated in an experimental setup involving two industrial robots and a variety of sensors. A comprehensive series of experiments demonstrate the efficacy of DEFORM in terms of accuracy, computational speed, and generalizability when compared to state-of-the-art alternatives. To further demonstrate the utility of DEFORM, this paper integrates it into a perception pipeline and illustrates its superior performance when compared to the state-of-the-art methods while tracking a DLO even in the presence of occlusions. Finally, this paper illustrates the superior performance of DEFORM when compared to state-of-the-art methods when it is applied to perform autonomous planning and control of DLOs. Project page: https://roahmlab.github.io/DEFORM/.
△ Less
Submitted 14 June, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
Attention Fusion Reverse Distillation for Multi-Lighting Image Anomaly Detection
Authors:
Yiheng Zhang,
Yunkang Cao,
Tianhang Zhang,
Weiming Shen
Abstract:
This study targets Multi-Lighting Image Anomaly Detection (MLIAD), where multiple lighting conditions are utilized to enhance imaging quality and anomaly detection performance. While numerous image anomaly detection methods have been proposed, they lack the capacity to handle multiple inputs for a single sample, like multi-lighting images in MLIAD. Hence, this study proposes Attention Fusion Rever…
▽ More
This study targets Multi-Lighting Image Anomaly Detection (MLIAD), where multiple lighting conditions are utilized to enhance imaging quality and anomaly detection performance. While numerous image anomaly detection methods have been proposed, they lack the capacity to handle multiple inputs for a single sample, like multi-lighting images in MLIAD. Hence, this study proposes Attention Fusion Reverse Distillation (AFRD) to handle multiple inputs in MLIAD. For this purpose, AFRD utilizes a pre-trained teacher network to extract features from multiple inputs. Then these features are aggregated into fused features through an attention module. Subsequently, a corresponding student net-work is utilized to regress the attention fused features. The regression errors are denoted as anomaly scores during inference. Experiments on Eyecandies demonstrates that AFRD achieves superior MLIAD performance than other MLIAD alternatives, also highlighting the benefit of using multiple lighting conditions for anomaly detection.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
On PI Controllers for Updating Lagrange Multipliers in Constrained Optimization
Authors:
Motahareh Sohrabi,
Juan Ramirez,
Tianyue H. Zhang,
Simon Lacoste-Julien,
Jose Gallego-Posada
Abstract:
Constrained optimization offers a powerful framework to prescribe desired behaviors in neural network models. Typically, constrained problems are solved via their min-max Lagrangian formulations, which exhibit unstable oscillatory dynamics when optimized using gradient descent-ascent. The adoption of constrained optimization techniques in the machine learning community is currently limited by the…
▽ More
Constrained optimization offers a powerful framework to prescribe desired behaviors in neural network models. Typically, constrained problems are solved via their min-max Lagrangian formulations, which exhibit unstable oscillatory dynamics when optimized using gradient descent-ascent. The adoption of constrained optimization techniques in the machine learning community is currently limited by the lack of reliable, general-purpose update schemes for the Lagrange multipliers. This paper proposes the $ν$PI algorithm and contributes an optimization perspective on Lagrange multiplier updates based on PI controllers, extending the work of Stooke, Achiam and Abbeel (2020). We provide theoretical and empirical insights explaining the inability of momentum methods to address the shortcomings of gradient descent-ascent, and contrast this with the empirical success of our proposed $ν$PI controller. Moreover, we prove that $ν$PI generalizes popular momentum methods for single-objective minimization. Our experiments demonstrate that $ν$PI reliably stabilizes the multiplier dynamics and its hyperparameters enjoy robust and predictable behavior.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
Authors:
Tianrong Zhang,
Zhaohan Xi,
Ting Wang,
Prasenjit Mitra,
**ghui Chen
Abstract:
Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performances. Meanwhile, the soaring cost to train PLMs as well as their amazing generalizability have jointly contributed to few-shot fine-tuning and prompting as the most popular training paradigms for natural language processing (NLP) models. Nevertheless, existing studies have s…
▽ More
Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performances. Meanwhile, the soaring cost to train PLMs as well as their amazing generalizability have jointly contributed to few-shot fine-tuning and prompting as the most popular training paradigms for natural language processing (NLP) models. Nevertheless, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are presented. In this paper, we propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt-tuning in few-shot settings. Unlike existing NLP backdoor removal methods, which rely on accurate trigger inversion and subsequent model fine-tuning, PromptFix keeps the model parameters intact and only utilizes two extra sets of soft tokens which approximate the trigger and counteract it respectively. The use of soft tokens and adversarial optimization eliminates the need to enumerate possible backdoor configurations and enables an adaptive balance between trigger finding and preservation of performance. Experiments with various backdoor attacks validate the effectiveness of the proposed method and the performances when domain shift is present further shows PromptFix's applicability to models pretrained on unknown data source which is the common case in prompt tuning scenarios.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
reAnalyst: Scalable Analysis of Reverse Engineering Activities
Authors:
Tab Zhang,
Claire Taylor,
Bart Coppens,
Waleed Mebane,
Christian Collberg,
Bjorn De Sutter
Abstract:
This paper introduces reAnalyst, a scalable analysis framework designed to facilitate the study of reverse engineering (RE) practices through the semi-automated annotation of RE activities across various RE tools. By integrating tool-agnostic data collection of screenshots, keystrokes, active processes, and other types of data during RE experiments with semi-automated data analysis and annotation,…
▽ More
This paper introduces reAnalyst, a scalable analysis framework designed to facilitate the study of reverse engineering (RE) practices through the semi-automated annotation of RE activities across various RE tools. By integrating tool-agnostic data collection of screenshots, keystrokes, active processes, and other types of data during RE experiments with semi-automated data analysis and annotation, reAnalyst aims to overcome the limitations of traditional RE studies that rely heavily on manual data collection and subjective analysis. The framework enables more efficient data analysis, allowing researchers to explore the effectiveness of protection techniques and strategies used by reverse engineers more comprehensively and efficiently. Experimental evaluations validate the framework's capability to identify RE activities from a diverse range of screenshots with varied complexities, thereby simplifying the analysis process and supporting more effective research outcomes.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.