Showing 1–9 of 9 results for author: Kojima, N

  1. arXiv:2309.02691

    cs.CL cs.CV

    A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models

    Authors: Noriyuki Kojima, Hadar Averbuch-Elor, Yoav Artzi

    Abstract: Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and pr…

    Submitted 30 May, 2024; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: Published in TMLR on 24 January, 2024

  2. arXiv:2211.16492

    cs.CL cs.AI cs.CV cs.LG

    Abstract Visual Reasoning with Tangram Shapes

    Authors: Anya Ji, Noriyuki Kojima, Noah Rush, Alane Suhr, Wai Keen Vong, Robert D. Hawkins, Yoav Artzi

    Abstract: We introduce KiloGram, a resource for studying abstract visual reasoning in humans and machines. Drawing on the history of tangram puzzles as stimuli in cognitive science, we build a richly annotated dataset that, with >1k distinct stimuli, is orders of magnitude larger and more diverse than prior resources. It is both visually and linguistically richer, moving beyond whole shape descriptions to i…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: EMNLP 2022 long paper

  3. arXiv:2211.01994

    cs.LG cs.AI cs.CL

    lilGym: Natural Language Visual Reasoning with Reinforcement Learning

    Authors: Anne Wu, Kianté Brantley, Noriyuki Kojima, Yoav Artzi

    Abstract: We present lilGym, a new benchmark for language-conditioned reinforcement learning in visual environments. lilGym is based on 2,661 highly compositional human-written natural language statements grounded in an interactive visual environment. We introduce a new approach for exact reward computation in every possible world state by annotating all statements with executable Python programs. Each stat…

    Submitted 29 May, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

    Comments: ACL 2023 Long Paper

  4. arXiv:2210.05147

    cs.LG cs.CL cs.CV

    Markup-to-Image Diffusion Models with Scheduled Sampling

    Authors: Yuntian Deng, Noriyuki Kojima, Alexander M. Rush

    Abstract: Building on recent advances in image generation, we present a fully data-driven approach to rendering markup into images. The approach is based on diffusion models, which parameterize the distribution of data using a sequence of denoising operations on top of a Gaussian noise distribution. We view the diffusion denoising process as a sequential decision making process, and show that it exhibits co…

    Submitted 11 October, 2022; originally announced October 2022.

  5. arXiv:2108.04812

    cs.CL cs.AI cs.LG

    Continual Learning for Grounded Instruction Generation by Observing Human Following Behavior

    Authors: Noriyuki Kojima, Alane Suhr, Yoav Artzi

    Abstract: We study continual learning for natural language instruction generation, by observing human users' instruction execution. We focus on a collaborative scenario, where the system both acts and delegates tasks to human users using natural language. We compare user execution of generated instructions to the original system intent as an indication of the system's success in communicating its intent. We sh…

    Submitted 10 August, 2021; originally announced August 2021.

    Comments: To appear in TACL 2021. The arXiv version is a pre-MIT Press publication version

  6. arXiv:2007.13215

    cs.CV

    OASIS: A Large-Scale Dataset for Single Image 3D in the Wild

    Authors: Weifeng Chen, Shengyi Qian, David Fan, Noriyuki Kojima, Max Hamilton, Jia Deng

    Abstract: Single-view 3D is the task of recovering 3D properties such as depth and surface normals from a single image. We hypothesize that a major obstacle to single-image 3D is data. We address this issue by presenting Open Annotations of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images. We train and evaluate lead…

    Submitted 26 July, 2020; originally announced July 2020.

    Comments: Accepted to CVPR 2020

  7. arXiv:2005.01678

    cs.CL

    What is Learned in Visually Grounded Neural Syntax Acquisition

    Authors: Noriyuki Kojima, Hadar Averbuch-Elor, Alexander M. Rush, Yoav Artzi

    Abstract: Visual features are a promising signal for learning bootstrap textual models. However, blackbox learning models make it difficult to isolate the specific contribution of visual components. In this analysis, we consider the case study of the Visually Grounded Neural Syntax Learner (Shi et al., 2019), a recent approach for learning syntax from a visual training signal. By constructing simplified ver…

    Submitted 18 May, 2020; v1 submitted 4 May, 2020; originally announced May 2020.

    Comments: In ACL 2020

  8. arXiv:1907.11770

    cs.CV

    To Learn or Not to Learn: Analyzing the Role of Learning for Navigation in Virtual Environments

    Authors: Noriyuki Kojima, Jia Deng

    Abstract: In this paper we compare learning-based methods and classical methods for navigation in virtual environments. We construct classical navigation agents and demonstrate that they outperform state-of-the-art learning-based agents on two standard benchmarks: MINOS and Stanford Large-Scale 3D Indoor Spaces. We perform detailed analysis to study the strengths and weaknesses of learned agents and classic…

    Submitted 26 July, 2019; originally announced July 2019.

  9. arXiv:1809.08761

    cs.CL cs.CV

    Speaker Naming in Movies

    Authors: Mahmoud Azab, Mingzhe Wang, Max Smith, Noriyuki Kojima, Jia Deng, Rada Mihalcea

    Abstract: We propose a new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in a unified optimization framework. To evaluate the performance of our model, we introduce a new dataset consisting of six episodes of the Big Bang Theory TV show and eighteen full movies covering different genres. Our experiments show that our multimodal model significantly outperforms se…

    Submitted 24 September, 2018; originally announced September 2018.