-
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
Authors:
Irene Huang,
Wei Lin,
M. Jehanzeb Mirza,
Jacob A. Hansen,
Sivan Doveh,
Victor Ion Butoi,
Roei Herzig,
Assaf Arbelle,
Hilde Kuhene,
Trevor Darrel,
Chuang Gan,
Aude Oliva,
Rogerio Feris,
Leonid Karlinsky
Abstract:
Compositional Reasoning (CR) entails gras** the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmark…
▽ More
Compositional Reasoning (CR) entails gras** the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe -- a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce `hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy, up to 33%, decrease in CR performance compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
Authors:
Ian Huang,
Guandao Yang,
Leonidas Guibas
Abstract:
Graphics design is important for various applications, including movie production and game design. To create a high-quality scene, designers usually need to spend hours in software like Blender, in which they might need to interleave and repeat operations, such as connecting material nodes, hundreds of times. Moreover, slightly different design goals may require completely different sequences, mak…
▽ More
Graphics design is important for various applications, including movie production and game design. To create a high-quality scene, designers usually need to spend hours in software like Blender, in which they might need to interleave and repeat operations, such as connecting material nodes, hundreds of times. Moreover, slightly different design goals may require completely different sequences, making automation difficult. In this paper, we propose a system that leverages Vision-Language Models (VLMs), like GPT-4V, to intelligently search the design action space to arrive at an answer that can satisfy a user's intent. Specifically, we design a vision-based edit generator and state evaluator to work together to find the correct sequence of actions to achieve the goal. Inspired by the role of visual imagination in the human design process, we supplement the visual reasoning capabilities of VLMs with "imagined" reference images from image-generation models, providing visual grounding of abstract language descriptions. In this paper, we provide empirical evidence suggesting our system can produce simple but tedious Blender editing sequences for tasks such as editing procedural materials from text and/or reference images, as well as adjusting lighting configurations for product renderings in complex scenes.
△ Less
Submitted 22 May, 2024; v1 submitted 26 April, 2024;
originally announced April 2024.
-
CAD: Photorealistic 3D Generation via Adversarial Distillation
Authors:
Ziyu Wan,
Despoina Paschalidou,
Ian Huang,
Hongyu Liu,
Bokui Shen,
Xiaoyu Xiang,
**g Liao,
Leonidas Guibas
Abstract:
The increased demand for 3D data in AR/VR, robotics and gaming applications, gave rise to powerful generative pipelines capable of synthesizing high-quality 3D objects. Most of these models rely on the Score Distillation Sampling (SDS) algorithm to optimize a 3D representation such that the rendered image maintains a high likelihood as evaluated by a pre-trained diffusion model. However, finding a…
▽ More
The increased demand for 3D data in AR/VR, robotics and gaming applications, gave rise to powerful generative pipelines capable of synthesizing high-quality 3D objects. Most of these models rely on the Score Distillation Sampling (SDS) algorithm to optimize a 3D representation such that the rendered image maintains a high likelihood as evaluated by a pre-trained diffusion model. However, finding a correct mode in the high-dimensional distribution produced by the diffusion model is challenging and often leads to issues such as over-saturation, over-smoothing, and Janus-like artifacts. In this paper, we propose a novel learning paradigm for 3D synthesis that utilizes pre-trained diffusion models. Instead of focusing on mode-seeking, our method directly models the distribution discrepancy between multi-view renderings and diffusion priors in an adversarial manner, which unlocks the generation of high-fidelity and photorealistic 3D content, conditioned on a single image and prompt. Moreover, by harnessing the latent space of GANs and expressive diffusion model priors, our method facilitates a wide variety of 3D applications including single-view reconstruction, high diversity generation and continuous 3D interpolation in the open domain. The experiments demonstrate the superiority of our pipeline compared to previous works in terms of generation quality and diversity.
△ Less
Submitted 11 December, 2023;
originally announced December 2023.
-
CIMulator: A Comprehensive Simulation Platform for Computing-In-Memory Circuit Macros with Low Bit-Width and Real Memory Materials
Authors:
Hoang-Hiep Le,
Md. Aftab Baig,
Wei-Chen Hong,
Cheng-Hsien Tsai,
Cheng-Jui Yeh,
Fu-Xiang Liang,
I-Ting Huang,
Wei-Tzu Tsai,
Ting-Yin Cheng,
Sourav De,
Nan-Yow Chen,
Wen-Jay Lee,
Ing-Chao Lin,
Da-Wei Chang,
Darsen D. Lu
Abstract:
This paper presents a simulation platform, namely CIMulator, for quantifying the efficacy of various synaptic devices in neuromorphic accelerators for different neural network architectures. Nonvolatile memory devices, such as resistive random-access memory, ferroelectric field-effect transistor, and volatile static random-access memory devices, can be selected as synaptic devices. A multilayer pe…
▽ More
This paper presents a simulation platform, namely CIMulator, for quantifying the efficacy of various synaptic devices in neuromorphic accelerators for different neural network architectures. Nonvolatile memory devices, such as resistive random-access memory, ferroelectric field-effect transistor, and volatile static random-access memory devices, can be selected as synaptic devices. A multilayer perceptron and convolutional neural networks (CNNs), such as LeNet-5, VGG-16, and a custom CNN named C4W-1, are simulated to evaluate the effects of these synaptic devices on the training and inference outcomes. The dataset used in the simulations are MNIST, CIFAR-10, and a white blood cell dataset. By applying batch normalization and appropriate optimizers in the training phase, neuromorphic systems with very low-bit-width or binary weights could achieve high pattern recognition rates that approach software-based CNN accuracy. We also introduce spiking neural networks with RRAM-based synaptic devices for the recognition of MNIST handwritten digits.
△ Less
Submitted 26 June, 2023;
originally announced June 2023.
-
Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions
Authors:
Ian Huang,
Vrishab Krishna,
Omoruyi Atekha,
Leonidas Guibas
Abstract:
What constitutes the "vibe" of a particular scene? What should one find in "a busy, dirty city street", "an idyllic countryside", or "a crime scene in an abandoned living room"? The translation from abstract scene descriptions to stylized scene elements cannot be done with any generality by extant systems trained on rigid and limited indoor datasets. In this paper, we propose to leverage the knowl…
▽ More
What constitutes the "vibe" of a particular scene? What should one find in "a busy, dirty city street", "an idyllic countryside", or "a crime scene in an abandoned living room"? The translation from abstract scene descriptions to stylized scene elements cannot be done with any generality by extant systems trained on rigid and limited indoor datasets. In this paper, we propose to leverage the knowledge captured by foundation models to accomplish this translation. We present a system that can serve as a tool to generate stylized assets for 3D scenes described by a short phrase, without the need to enumerate the objects to be found within the scene or give instructions on their appearance. Additionally, it is robust to open-world concepts in a way that traditional methods trained on limited data are not, affording more creative freedom to the 3D artist. Our system demonstrates this using a foundation model "team" composed of a large language model, a vision-language model and several image diffusion models, which communicate using an interpretable and user-editable intermediate representation, thus allowing for more versatile and controllable stylized asset generation for 3D artists. We introduce novel metrics for this task, and show through human evaluations that in 91% of the cases, our system outputs are judged more faithful to the semantics of the input scene description than the baseline, thus highlighting the potential of this approach to radically accelerate the 3D content creation process for 3D artists.
△ Less
Submitted 9 June, 2023;
originally announced June 2023.
-
Token Imbalance Adaptation for Radiology Report Generation
Authors:
Yuexin Wu,
I-Chan Huang,
Xiaolei Huang
Abstract:
Imbalanced token distributions naturally exist in text documents, leading neural language models to overfit on frequent tokens. The token imbalance may dampen the robustness of radiology report generators, as complex medical terms appear less frequently but reflect more medical information. In this study, we demonstrate how current state-of-the-art models fail to generate infrequent tokens on two…
▽ More
Imbalanced token distributions naturally exist in text documents, leading neural language models to overfit on frequent tokens. The token imbalance may dampen the robustness of radiology report generators, as complex medical terms appear less frequently but reflect more medical information. In this study, we demonstrate how current state-of-the-art models fail to generate infrequent tokens on two standard benchmark datasets (IU X-RAY and MIMIC-CXR) of radiology report generation. % However, no prior study has proposed methods to adapt infrequent tokens for text generators feeding with medical images. To solve the challenge, we propose the \textbf{T}oken \textbf{Im}balance Adapt\textbf{er} (\textit{TIMER}), aiming to improve generation robustness on infrequent tokens. The model automatically leverages token imbalance by an unlikelihood loss and dynamically optimizes generation processes to augment infrequent tokens. We compare our approach with multiple state-of-the-art methods on the two benchmarks. Experiments demonstrate the effectiveness of our approach in enhancing model robustness overall and infrequent tokens. Our ablation analysis shows that our reinforcement learning method has a major effect in adapting token imbalance for radiology report generation.
△ Less
Submitted 18 April, 2023;
originally announced April 2023.
-
DefGraspNets: Grasp Planning on 3D Fields with Graph Neural Nets
Authors:
Isabella Huang,
Yashraj Narang,
Ruzena Bajcsy,
Fabio Ramos,
Tucker Hermans,
Dieter Fox
Abstract:
Robotic gras** of 3D deformable objects is critical for real-world applications such as food handling and robotic surgery. Unlike rigid and articulated objects, 3D deformable objects have infinite degrees of freedom. Fully defining their state requires 3D deformation and stress fields, which are exceptionally difficult to analytically compute or experimentally measure. Thus, evaluating grasp can…
▽ More
Robotic gras** of 3D deformable objects is critical for real-world applications such as food handling and robotic surgery. Unlike rigid and articulated objects, 3D deformable objects have infinite degrees of freedom. Fully defining their state requires 3D deformation and stress fields, which are exceptionally difficult to analytically compute or experimentally measure. Thus, evaluating grasp candidates for grasp planning typically requires accurate, but slow 3D finite element method (FEM) simulation. Sampling-based grasp planning is often impractical, as it requires evaluation of a large number of grasp candidates. Gradient-based grasp planning can be more efficient, but requires a differentiable model to synthesize optimal grasps from initial candidates. Differentiable FEM simulators may fill this role, but are typically no faster than standard FEM. In this work, we propose learning a predictive graph neural network (GNN), DefGraspNets, to act as our differentiable model. We train DefGraspNets to predict 3D stress and deformation fields based on FEM-based grasp simulations. DefGraspNets not only runs up to 1500 times faster than the FEM simulator, but also enables fast gradient-based grasp optimization over 3D stress and deformation metrics. We design DefGraspNets to align with real-world grasp planning practices and demonstrate generalization across multiple test sets, including real-world experiments.
△ Less
Submitted 28 March, 2023;
originally announced March 2023.
-
The Media Inequality, Uncanny Mountain, and the Singularity is Far from Near: Iwaa and Sophia Robot versus a Real Human Being
Authors:
Johan F. Hoorn,
Ivy S. Huang
Abstract:
Design of Artificial Intelligence and robotics habitually assumes that adding more humanlike features improves the user experience, mainly kept in check by suspicion of uncanny effects. Three strands of theorizing are brought together for the first time and empirically put to the test: Media Equation (and in its wake, Computers Are Social Actors), Uncanny Valley theory, and as an extreme of human-…
▽ More
Design of Artificial Intelligence and robotics habitually assumes that adding more humanlike features improves the user experience, mainly kept in check by suspicion of uncanny effects. Three strands of theorizing are brought together for the first time and empirically put to the test: Media Equation (and in its wake, Computers Are Social Actors), Uncanny Valley theory, and as an extreme of human-likeness assumptions, the Singularity. We measured the user experience of real-life visitors of a number of seminars who were checked in either by Smart Dynamics' Iwaa, Hanson's Sophia robot, Sophia's on-screen avatar, or a human assistant. Results showed that human-likeness was not in appearance or behavior but in attributed qualities of being alive. Media Equation, Singularity, and Uncanny hypotheses were not confirmed. We discuss the imprecision in theorizing about human-likeness and rather opt for machines that 'function adequately.'
△ Less
Submitted 16 March, 2023;
originally announced March 2023.
-
Design Project of an Open-Source, Low-Cost, and Lightweight Robotic Manipulator for High School Students
Authors:
Isabella Huang,
Qianwen Zhao,
Maxine Fontaine,
Long Wang
Abstract:
In recent years, there is an increasing interest in high school robotics extracurriculars such as robotics clubs and robotics competitions. The growing demand is a result of more ubiquitous open-source software and affordable off-the-shelf hardware kits, which significantly help lower the barrier for entry-level robotics hobbyists. In this project, we present an open-source, low-cost, and lightwei…
▽ More
In recent years, there is an increasing interest in high school robotics extracurriculars such as robotics clubs and robotics competitions. The growing demand is a result of more ubiquitous open-source software and affordable off-the-shelf hardware kits, which significantly help lower the barrier for entry-level robotics hobbyists. In this project, we present an open-source, low-cost, and lightweight robotic manipulator designed and developed by a high school researcher under the guidance of a university faculty and a Ph.D. student. We believe the presented project is suitable for high school robotics research and educational activities. Our open-source package consists of mechanical design models, mechatronics specifications, and software program source codes. The mechanical design models include CAD (Computer Aided Design) files that are ready for prototy** (3D printing technology) and serve as an assembly guide accommodated with a complete bill of materials. Electrical wiring diagrams and low-level controllers are documented in detail as part of the open-source software package. The educational objective of this project is to enable high school student teams to replicate and build a robotic manipulator. The engineering experience that high school students acquire in the proposed project is full-stack, including mechanical design, mechatronics, and programming. The project significantly enriches their hands-on engineering experience in a project-based environment. Throughout this project, we discovered that the high school researcher was able to apply multidisciplinary knowledge from K-12 STEM courses to build the robotic manipulator. The researcher was able to go through a system engineering design and development process and obtain skills to use professional engineering tools including SolidWorks and Arduino microcontrollers.
△ Less
Submitted 16 March, 2023; v1 submitted 21 February, 2023;
originally announced February 2023.
-
LADIS: Language Disentanglement for 3D Shape Editing
Authors:
Ian Huang,
Panos Achlioptas,
Tianyi Zhang,
Sergey Tulyakov,
Minhyuk Sung,
Leonidas Guibas
Abstract:
Natural language interaction is a promising direction for democratizing 3D shape design. However, existing methods for text-driven 3D shape editing face challenges in producing decoupled, local edits to 3D shapes. We address this problem by learning disentangled latent representations that ground language in 3D geometry. To this end, we propose a complementary tool set including a novel network ar…
▽ More
Natural language interaction is a promising direction for democratizing 3D shape design. However, existing methods for text-driven 3D shape editing face challenges in producing decoupled, local edits to 3D shapes. We address this problem by learning disentangled latent representations that ground language in 3D geometry. To this end, we propose a complementary tool set including a novel network architecture, a disentanglement loss, and a new editing procedure. Additionally, to measure edit locality, we define a new metric that we call part-wise edit precision. We show that our method outperforms existing SOTA methods by 20% in terms of edit locality, and up to 6.6% in terms of language reference resolution accuracy. Our work suggests that by solely disentangling language representations, downstream 3D shape editing can become more local to relevant parts, even if the model was never given explicit part-based supervision.
△ Less
Submitted 9 December, 2022;
originally announced December 2022.
-
DefGraspSim: Physics-based simulation of grasp outcomes for 3D deformable objects
Authors:
Isabella Huang,
Yashraj Narang,
Clemens Eppner,
Balakumar Sundaralingam,
Miles Macklin,
Ruzena Bajcsy,
Tucker Hermans,
Dieter Fox
Abstract:
Robotic gras** of 3D deformable objects (e.g., fruits/vegetables, internal organs, bottles/boxes) is critical for real-world applications such as food processing, robotic surgery, and household automation. However, develo** grasp strategies for such objects is uniquely challenging. Unlike rigid objects, deformable objects have infinite degrees of freedom and require field quantities (e.g., def…
▽ More
Robotic gras** of 3D deformable objects (e.g., fruits/vegetables, internal organs, bottles/boxes) is critical for real-world applications such as food processing, robotic surgery, and household automation. However, develo** grasp strategies for such objects is uniquely challenging. Unlike rigid objects, deformable objects have infinite degrees of freedom and require field quantities (e.g., deformation, stress) to fully define their state. As these quantities are not easily accessible in the real world, we propose studying interaction with deformable objects through physics-based simulation. As such, we simulate grasps on a wide range of 3D deformable objects using a GPU-based implementation of the corotational finite element method (FEM). To facilitate future research, we open-source our simulated dataset (34 objects, 1e5 Pa elasticity range, 6800 grasp evaluations, 1.1M grasp measurements), as well as a code repository that allows researchers to run our full FEM-based grasp evaluation pipeline on arbitrary 3D object models of their choice. Finally, we demonstrate good correspondence between grasp outcomes on simulated objects and their real counterparts.
△ Less
Submitted 21 March, 2022;
originally announced March 2022.
-
PartGlot: Learning Shape Part Segmentation from Language Reference Games
Authors:
Juil Koo,
Ian Huang,
Panos Achlioptas,
Leonidas Guibas,
Minhyuk Sung
Abstract:
We introduce PartGlot, a neural framework and associated architectures for learning semantic part segmentation of 3D shape geometry, based solely on part referential language. We exploit the fact that linguistic descriptions of a shape can provide priors on the shape's parts -- as natural language has evolved to reflect human perception of the compositional structure of objects, essential to their…
▽ More
We introduce PartGlot, a neural framework and associated architectures for learning semantic part segmentation of 3D shape geometry, based solely on part referential language. We exploit the fact that linguistic descriptions of a shape can provide priors on the shape's parts -- as natural language has evolved to reflect human perception of the compositional structure of objects, essential to their recognition and use. For training, we use the paired geometry / language data collected in the ShapeGlot work for their reference game, where a speaker creates an utterance to differentiate a target shape from two distractors and the listener has to find the target based on this utterance. Our network is designed to solve this target discrimination problem, carefully incorporating a Transformer-based attention module so that the output attention can precisely highlight the semantic part or parts described in the language. Furthermore, the network operates without any direct supervision on the 3D geometry itself. Surprisingly, we further demonstrate that the learned part information is generalizable to shape classes unseen during training. Our approach opens the possibility of learning 3D shape parts from language alone, without the need for large-scale part geometry annotations, thus facilitating annotation acquisition.
△ Less
Submitted 30 March, 2022; v1 submitted 12 December, 2021;
originally announced December 2021.
-
IPC-GraspSim: Reducing the Sim2Real Gap for Parallel-Jaw Gras** with the Incremental Potential Contact Model
Authors:
Chung Min Kim,
Michael Danielczuk,
Isabella Huang,
Ken Goldberg
Abstract:
Accurately simulating whether an object will be lifted securely or dropped during gras** is a longstanding Sim2Real challenge. Soft compliant jaw tips are almost universally used with parallel-jaw robot grippers due to their ability to increase contact area and friction between the jaws and the object to be manipulated. However, interactions between the compliant surfaces and rigid objects are n…
▽ More
Accurately simulating whether an object will be lifted securely or dropped during gras** is a longstanding Sim2Real challenge. Soft compliant jaw tips are almost universally used with parallel-jaw robot grippers due to their ability to increase contact area and friction between the jaws and the object to be manipulated. However, interactions between the compliant surfaces and rigid objects are notoriously difficult to model. We introduce IPC-GraspSim, a novel grasp simulator that extends Incremental Potential Contact (IPC) - a highly accurate collision + deformation model developed in 2020 for computer graphics. IPC-GraspSim models both the dynamics and the deformation of compliant jaw tips to reduce Sim2Real gap for robot gras**. We evaluate IPC-GraspSim using a set of 2,000 physical grasps across 16 adversarial objects where analytic models perform poorly. In comparison to both analytic quasistatic contact models (soft point contact, REACH, 6DFC) and dynamic grasp simulators (Isaac Gym with FleX), results suggest IPC-GraspSim can predict robustness with higher precision and recall (F1 = 0.85). IPC-GraspSim increases F1 score by 0.03 to 0.20 over analytic baselines and 0.09 over Isaac Gym, at a cost of 8000x and 1.5x more compute time, respectively. All data, code, videos, and supplementary material are available at https://sites.google.com/berkeley.edu/ipcgraspsim.
△ Less
Submitted 1 March, 2022; v1 submitted 2 November, 2021;
originally announced November 2021.
-
DefGraspSim: Simulation-based gras** of 3D deformable objects
Authors:
Isabella Huang,
Yashraj Narang,
Clemens Eppner,
Balakumar Sundaralingam,
Miles Macklin,
Tucker Hermans,
Dieter Fox
Abstract:
Robotic gras** of 3D deformable objects (e.g., fruits/vegetables, internal organs, bottles/boxes) is critical for real-world applications such as food processing, robotic surgery, and household automation. However, develo** grasp strategies for such objects is uniquely challenging. In this work, we efficiently simulate grasps on a wide range of 3D deformable objects using a GPU-based implement…
▽ More
Robotic gras** of 3D deformable objects (e.g., fruits/vegetables, internal organs, bottles/boxes) is critical for real-world applications such as food processing, robotic surgery, and household automation. However, develo** grasp strategies for such objects is uniquely challenging. In this work, we efficiently simulate grasps on a wide range of 3D deformable objects using a GPU-based implementation of the corotational finite element method (FEM). To facilitate future research, we open-source our simulated dataset (34 objects, 1e5 Pa elasticity range, 6800 grasp evaluations, 1.1M grasp measurements), as well as a code repository that allows researchers to run our full FEM-based grasp evaluation pipeline on arbitrary 3D object models of their choice. We also provide a detailed analysis on 6 object primitives. For each primitive, we methodically describe the effects of different grasp strategies, compute a set of performance metrics (e.g., deformation, stress) that fully capture the object response, and identify simple grasp features (e.g., gripper displacement, contact area) measurable by robots prior to pickup and predictive of these performance metrics. Finally, we demonstrate good correspondence between grasps on simulated objects and their real-world counterparts.
△ Less
Submitted 12 July, 2021;
originally announced July 2021.
-
Nonverbal Robot Feedback for Human Teachers
Authors:
Sandy H. Huang,
Isabella Huang,
Ravi Pandya,
Anca D. Dragan
Abstract:
Robots can learn preferences from human demonstrations, but their success depends on how informative these demonstrations are. Being informative is unfortunately very challenging, because during teaching, people typically get no transparency into what the robot already knows or has learned so far. In contrast, human students naturally provide a wealth of nonverbal feedback that reveals their level…
▽ More
Robots can learn preferences from human demonstrations, but their success depends on how informative these demonstrations are. Being informative is unfortunately very challenging, because during teaching, people typically get no transparency into what the robot already knows or has learned so far. In contrast, human students naturally provide a wealth of nonverbal feedback that reveals their level of understanding and engagement. In this work, we study how a robot can similarly provide feedback that is minimally disruptive, yet gives human teachers a better mental model of the robot learner, and thus enables them to teach more effectively. Our idea is that at any point, the robot can indicate what it thinks the correct next action is, shedding light on its current estimate of the human's preferences. We analyze how useful this feedback is, both in theory and with two user studies---one with a virtual character that tests the feedback itself, and one with a PR2 robot that uses gaze as the feedback mechanism. We find that feedback can be useful for improving both the quality of teaching and teachers' understanding of the robot's capability.
△ Less
Submitted 6 November, 2019;
originally announced November 2019.
-
DeepBase: Deep Inspection of Neural Networks
Authors:
Thibault Sellam,
Kevin Lin,
Ian Yiran Huang,
Yiru Chen,
Michelle Yang,
Carl Vondrick,
Eugene Wu
Abstract:
Although deep learning models perform remarkably well across a range of tasks such as language translation and object recognition, it remains unclear what high-level logic, if any, they follow. Understanding this logic may lead to more transparency, better model design, and faster experimentation. Recent machine learning research has leveraged statistical methods to identify hidden units that beha…
▽ More
Although deep learning models perform remarkably well across a range of tasks such as language translation and object recognition, it remains unclear what high-level logic, if any, they follow. Understanding this logic may lead to more transparency, better model design, and faster experimentation. Recent machine learning research has leveraged statistical methods to identify hidden units that behave (e.g., activate) similarly to human understandable logic, but those analyses require considerable manual effort. Our insight is that many of those studies follow a common analysis pattern, which we term Deep Neural Inspection. There is opportunity to provide a declarative abstraction to easily express, execute, and optimize them.
This paper describes DeepBase, a system to inspect neural network behaviors through a unified interface. We model logic with user-provided hypothesis functions that annotate the data with high-level labels (e.g., part-of-speech tags, image captions). DeepBase lets users quickly identify individual or groups of units that have strong statistical dependencies with desired hypotheses. We discuss how DeepBase can express existing analyses, propose a set of simple and effective optimizations to speed up a standard Python implementation by up to 72x, and reproduce recent studies from the NLP literature.
△ Less
Submitted 7 January, 2019; v1 submitted 13 August, 2018;
originally announced August 2018.
-
Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries
Authors:
Yuting Zhang,
Luyao Yuan,
Yijie Guo,
Zhiyuan He,
I-An Huang,
Honglak Lee
Abstract:
Associating image regions with text queries has been recently explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but achieving somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we pr…
▽ More
Associating image regions with text queries has been recently explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but achieving somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate the proposed DBNet significantly outperforms previous state-of-the-art methods both for localization on single images and for detection on multiple images. We we also establish an evaluation protocol for natural-language visual detection.
△ Less
Submitted 17 April, 2017; v1 submitted 12 April, 2017;
originally announced April 2017.