Showing 1–8 of 8 results for author: Kil, J

  1. arXiv:2407.00087  [pdf, other]

    cs.AI cs.CL cs.LG

    ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback

    Authors: Ju-Seung Byun, Jiyun Chun, Jihyung Kil, Andrew Perrault

    Abstract: Large Multimodal Models (LMMs) excel at comprehending human instructions and demonstrate remarkable results across a broad spectrum of tasks. Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF) further refine LLMs by aligning them with specific preferences. These methods primarily use ranking-based feedback for entire generations. With advanced AI models (Teacher), such as GP…

    Submitted 25 June, 2024; originally announced July 2024.

  2. arXiv:2402.11058  [pdf, other]

    cs.CV cs.CL

    II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering

    Authors: Jihyung Kil, Farideh Tavazoee, Dongyeop Kang, Joo-Kyung Kim

    Abstract: Visual Question Answering (VQA) often involves diverse reasoning scenarios across Vision and Language (V&L). Most prior VQA studies, however, have merely focused on assessing the model's overall accuracy without evaluating it on different reasoning cases. Furthermore, some recent works observe that conventional Chain-of-Thought (CoT) prompting fails to generate effective reasoning for VQA, especia…

    Submitted 2 June, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: Accepted to ACL 2024 Findings

  3. arXiv:2402.04476  [pdf, other]

    cs.CV cs.AI cs.CL

    Dual-View Visual Contextualization for Web Navigation

    Authors: Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao

    Abstract: Automatic web navigation aims to build a web agent that can follow language instructions to execute complex and diverse tasks on real-world websites. Existing work primarily takes HTML documents as input, which define the contents and action spaces (i.e., actionable elements and operations) of webpages. Nevertheless, HTML documents may not provide a clear task-related context for each element, mak…

    Submitted 30 March, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

    Comments: Accepted to CVPR 2024

  4. arXiv:2401.01614  [pdf, other]

    cs.IR cs.AI cs.CL cs.CV

    GPT-4V(ision) is a Generalist Web Agent, if Grounded

    Authors: Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su

    Abstract: The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on a…

    Submitted 12 March, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

  5. arXiv:2209.05534  [pdf, other]

    cs.CV cs.CL

    PreSTU: Pre-Training for Scene-Text Understanding

    Authors: Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, Radu Soricut

    Abstract: The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often failed to include such an ability in their training objective. In this paper, we propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware pre-training objectives…

    Submitted 19 August, 2023; v1 submitted 12 September, 2022; originally announced September 2022.

    Comments: Accepted to ICCV 2023

  6. arXiv:2202.07028  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG cs.RO

    One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones

    Authors: Chan Hee Song, Jihyung Kil, Tai-Yu Pan, Brian M. Sadler, Wei-Lun Chao, Yu Su

    Abstract: We study the problem of developing autonomous agents that can follow human instructions to infer and perform a sequence of actions to complete the underlying task. Significant progress has been made in recent years, especially for tasks with short horizons. However, when it comes to long-horizon tasks with extended sequences of actions, an agent can easily ignore some instructions or get stuck in…

    Submitted 10 June, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

    Comments: 10 pages, 5 figures. Accepted to CVPR 2022

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15482-15491

  7. arXiv:2109.06122  [pdf, other]

    cs.CV cs.CL

    Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering

    Authors: Jihyung Kil, Cheng Zhang, Dong Xuan, Wei-Lun Chao

    Abstract: Visual question answering (VQA) is challenging not only because the model has to handle multi-modal information, but also because it is just so hard to collect sufficient training examples -- there are too many questions one can ask about an image. As a result, a VQA model trained solely on human-annotated examples could easily over-fit specific question styles or image contents that are being ask…

    Submitted 8 November, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

    Comments: Accepted to EMNLP 2021

  8. arXiv:2104.10355  [pdf, other]

    cs.CV cs.AI cs.CL

    Revisiting Document Representations for Large-Scale Zero-Shot Learning

    Authors: Jihyung Kil, Wei-Lun Chao

    Abstract: Zero-shot learning aims to recognize unseen objects using their semantic representations. Most existing works use visual attributes labeled by humans, not suitable for large-scale applications. In this paper, we revisit the use of documents as semantic representations. We argue that documents like Wikipedia pages contain rich visual information, which however can easily be buried by the vast amoun…

    Submitted 21 April, 2021; originally announced April 2021.

    Comments: Accepted to NAACL 2021