-
Evaluating and Analyzing Relationship Hallucinations in LVLMs
Authors:
Mingrui Wu,
Jiayi Ji,
Oucheng Huang,
Jiale Li,
Yuhang Wu,
Xiaoshuai Sun,
Rongrong Ji
Abstract:
The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, which are essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object. The visual instruction tuning dataset's long-tail distribution significantly impacts LVLMs' understanding of visual relationships. Furthermore, our analysis reveals that current LVLMs tend to disregard visual content and overly rely on the common sense knowledge of Large Language Models. They also struggle with reasoning about spatial relationships based on contextual information.
Submitted 2 July, 2024; v1 submitted 24 June, 2024;
originally announced June 2024.
-
Incorporating Worker Perspectives into MTurk Annotation Practices for NLP
Authors:
Olivia Huang,
Eve Fleisig,
Dan Klein
Abstract:
Current practices regarding data collection for natural language processing on Amazon Mechanical Turk (MTurk) often rely on a combination of studies on data quality and heuristics shared among NLP researchers. However, without considering the perspectives of MTurk workers, these approaches are susceptible to issues regarding workers' rights and poor response quality. We conducted a critical literature review and a survey of MTurk workers aimed at addressing open questions regarding best practices for fair payment, worker privacy, data quality, and considering worker incentives. We found that worker preferences are often at odds with received wisdom among NLP researchers. Surveyed workers preferred reliable, reasonable payments over uncertain, very high payments; reported frequently lying on demographic questions; and expressed frustration at having work rejected with no explanation. We also found that workers view some quality control methods, such as requiring minimum response times or Master's qualifications, as biased and largely ineffective. Based on the survey results, we provide recommendations on how future NLP studies may better account for MTurk workers' experiences in order to respect workers' rights and improve data quality.
Submitted 15 November, 2023; v1 submitted 5 November, 2023;
originally announced November 2023.
-
An Introduction to Kernel and Operator Learning Methods for Homogenization by Self-consistent Clustering Analysis
Authors:
Owen Huang,
Sourav Saha,
Jiachen Guo,
Wing Kam Liu
Abstract:
Recent advances in operator learning theory have improved our knowledge about learning maps between infinite-dimensional spaces. However, for large-scale engineering problems such as concurrent multiscale simulation for mechanical properties, the training cost for the current operator learning methods is very high. The article presents a thorough analysis of the mathematical underpinnings of the operator learning paradigm and proposes a kernel learning method that maps between function spaces. We first provide a survey of modern kernel and operator learning theory and discuss recent results and open problems. From there, the article presents an algorithm for analytically approximating the piecewise constant functions on R for operator learning. This implies the potential feasibility of neural operators on clustered functions. Finally, a k-means clustered domain on the basis of a mechanistic response is considered and the Lippmann-Schwinger equation for micro-mechanical homogenization is solved. The article briefly discusses the mathematics of previous kernel learning methods and some preliminary results with those methods. The proposed kernel operator learning method uses graph kernel networks to come up with a mechanistic reduced order method for multiscale homogenization.
Submitted 30 November, 2022;
originally announced December 2022.
-
Deep Learning Discrete Calculus (DLDC): A Family of Discrete Numerical Methods by Universal Approximation for STEM Education to Frontier Research
Authors:
Sourav Saha,
Chanwook Park,
Stefan Knapik,
Jiachen Guo,
Owen Huang,
Wing Kam Liu
Abstract:
The article proposes formulating and codifying a set of applied numerical methods, coined as Deep Learning Discrete Calculus (DLDC), that uses the knowledge from discrete numerical methods to interpret the deep learning algorithms through the lens of applied mathematics. The DLDC methods aim to leverage the flexibility and ever-increasing resources of deep learning and rich literature of numerical analysis to formulate a general class of numerical method that can directly use data with uncertainty to predict the behavior of an unknown system as well as elevate the speed and accuracy of numerical solution of the governing equations for known systems. The article is structured in two major sections. In the first section, the building blocks of the DLDC methods are presented and deep learning structures analogous to traditional numerical methods such as finite difference and finite element methods are constructed with a view to incorporate these techniques in Science, Technology, Engineering, Mathematics (STEM) syllabus for K-12 students. The second section builds upon the building blocks of the previous discussion, and proposes new solution schemes for differential and integral equations pertinent to multiscale mechanics. Each section is accompanied with mathematical formulation of the numerical methods, analogous DLDC formulation, and suitable examples.
Submitted 29 November, 2022;
originally announced November 2022.
-
VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation
Authors:
Yuxing Chen,
Renshu Gu,
Ouhan Huang,
Gangyong Jia
Abstract:
This paper presents Volumetric Transformer Pose estimator (VTP), the first 3D volumetric transformer framework for multi-view multi-person 3D human pose estimation. VTP aggregates features from 2D keypoints in all camera views and directly learns the spatial relationships in the 3D voxel space in an end-to-end fashion. The aggregated 3D features are passed through 3D convolutions before being flattened into sequential embeddings and fed into a transformer. A residual structure is designed to further improve the performance. In addition, the sparse Sinkhorn attention is empowered to reduce the memory cost, which is a major bottleneck for volumetric representations, while also achieving excellent performance. The output of the transformer is again concatenated with 3D convolutional features by a residual design. The proposed VTP framework integrates the high performance of the transformer with volumetric representations, which can be used as a good alternative to the convolutional backbones. Experiments on the Shelf, Campus and CMU Panoptic benchmarks show promising results in terms of both Mean Per Joint Position Error (MPJPE) and Percentage of Correctly estimated Parts (PCP). Our code will be available.
Submitted 25 May, 2022;
originally announced May 2022.
-
Exploring Global Diversity and Local Context for Video Summarization
Authors:
Yingchao Pan,
Ouhan Huang,
Qinghao Ye,
Zhong** Li,
Wenjiang Wang,
Guodun Li,
Yuxing Chen
Abstract:
Video summarization aims to automatically generate a diverse and concise summary which is useful in large-scale video processing. Most of the methods tend to adopt self-attention mechanism across video frames, which fails to model the diversity of video frames. To alleviate this problem, we revisit the pairwise similarity measurement in self-attention mechanism and find that the existing inner-product affinity leads to discriminative features rather than diversified features. In light of this phenomenon, we propose global diverse attention which uses the squared Euclidean distance instead to compute the affinities. Moreover, we model the local contextual information by novel local contextual attention to remove the redundancy in the video. By combining these two attention mechanisms, a video SUMmarization model with Diversified Contextual Attention scheme is developed, namely SUM-DCA. Extensive experiments are conducted on benchmark data sets to verify the effectiveness and the superiority of SUM-DCA in terms of F-score and rank-based evaluation without any bells and whistles.
Submitted 27 March, 2022; v1 submitted 27 January, 2022;
originally announced January 2022.
-
MimickNet, Matching Clinical Post-Processing Under Realistic Black-Box Constraints
Authors:
Ouwen Huang,
Will Long,
Nick Bottenus,
Gregg E. Trahey,
Sina Farsiu,
Mark L. Palmeri
Abstract:
Image post-processing is used in clinical-grade ultrasound scanners to improve image quality (e.g., reduce speckle noise and enhance contrast). These post-processing techniques vary across manufacturers and are generally kept proprietary, which presents a challenge for researchers looking to match current clinical-grade workflows. We introduce a deep learning framework, MimickNet, that transforms raw conventional delay-and-summed (DAS) beams into the approximate post-processed images found on clinical-grade scanners. Training MimickNet only requires post-processed image samples from a scanner of interest without the need for explicit pairing to raw DAS data. This flexibility allows it to hypothetically approximate any manufacturer's post-processing without access to the pre-processed data. MimickNet generates images with an average similarity index measurement (SSIM) of 0.930$\pm$0.0892 on a 300 cineloop test set, and it generalizes to cardiac cineloops outside of our train-test distribution achieving an SSIM of 0.967$\pm$0.002. We also explore the theoretical SSIM achievable by evaluating MimickNet performance when trained under gray-box constraints (i.e., when both pre-processed and post-processed images are available). To our knowledge, this is the first work to establish deep learning models that closely approximate current clinical-grade ultrasound post-processing under realistic black-box constraints where before and after post-processing data is unavailable. MimickNet serves as a clinical post-processing baseline for future works in ultrasound image formation to compare against. To this end, we have made the MimickNet software open source.
Submitted 15 August, 2019;
originally announced August 2019.