Showing 1–50 of 75 results for author: Lee, Y J

Searching in archive cs.

  1. arXiv:2406.20095  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

    Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo

    Abstract: Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with au…

    Submitted 28 June, 2024; originally announced June 2024.

  2. arXiv:2406.09400  [pdf, other]

    cs.CV cs.LG

    Yo'LLaVA: Your Personalized Language and Vision Assistant

    Authors: Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee

    Abstract: Large Multimodal Models (LMMs) have shown remarkable capabilities across a variety of tasks (e.g., image captioning, visual question answering). While broad, their knowledge remains generic (e.g., recognizing a dog), and they are unable to handle personalized subjects (e.g., recognizing a user's pet dog). Human reasoning, in contrast, typically operates within the context of specific subjects in o…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Project page: https://thaoshibe.github.io/YoLLaVA

  3. arXiv:2405.17430  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Matryoshka Multimodal Models

    Authors: Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee

    Abstract: Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design causes an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Project Page: https://matryoshka-mm.github.io/

  4. arXiv:2403.15388  [pdf, other]

    cs.CV cs.AI cs.CL

    LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

    Authors: Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan

    Abstract: Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically take in a fixed and large amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which…

    Submitted 22 May, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

    Comments: Project page: https://llava-prumerge.github.io/

  5. arXiv:2402.16363  [pdf, other]

    cs.CL cs.AI

    LLM Inference Unveiled: Survey and Roofline Model Insights

    Authors: Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer

    Abstract: The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Although the field has expanded and is vibrant, there hasn't been a concise framework that analyzes the various methods of LLM Inference to provide a clear understanding of this domain. Our survey stands out from traditional literature reviews by not only summ…

    Submitted 1 May, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

  6. arXiv:2402.15583  [pdf, other]

    cs.CV cs.LG

    Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-based Autonomous Driving

    Authors: Yichen Xie, Hongge Chen, Gregory P. Meyer, Yong Jae Lee, Eric M. Wolff, Masayoshi Tomizuka, Wei Zhan, Yuning Chai, Xin Huang

    Abstract: Due to the lack of depth cues in images, multi-frame inputs are important for the success of vision-based perception, prediction, and planning in autonomous driving. Observations from different angles enable the recovery of 3D object states from 2D image inputs if we can identify the same instance in different input frames. However, the dynamic nature of autonomous driving scenes leads to signific…

    Submitted 23 February, 2024; originally announced February 2024.

  7. arXiv:2402.13254  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples

    Authors: Jianrui Zhang, Mu Cai, Tengyang Xie, Yong Jae Lee

    Abstract: We propose CounterCurate, a framework to comprehensively improve the visio-linguistic compositional reasoning capability for both contrastive and generative multimodal models. In particular, we identify two critical under-explored problems: the neglect of the physically grounded reasoning (counting and position understanding) and the potential of using highly capable text and image generation mode…

    Submitted 12 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: 15 pages, 6 figures, 12 tables, Project Page: https://countercurate.github.io/

  8. arXiv:2401.10219  [pdf, other]

    cs.CV

    Edit One for All: Interactive Batch Image Editing

    Authors: Thao Nguyen, Utkarsh Ojha, Yuheng Li, Haotian Liu, Yong Jae Lee

    Abstract: In recent years, image editing has advanced remarkably. With increased human control, it is now possible to edit an image in a plethora of ways; from specifying in text what we want to change, to straight up dragging the contents of the image in an interactive point-based manner. However, most of the focus has remained on editing single images at a time. Whether and how we can simultaneously edit…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Project page: https://thaoshibe.github.io/edit-one-for-all/

  9. arXiv:2312.07532  [pdf, other]

    cs.CV cs.AI cs.CL

    Interfacing Foundation Models' Embeddings

    Authors: Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang

    Abstract: We present FIND, a generalized interface for aligning foundation models' embeddings. As shown in teaser figure, a lightweight transformer interface without tuning any foundation model weights is enough for a unified image (segmentation) and dataset-level (retrieval) understanding. The proposed interface has the following favorable attributes: (1) Generalizable. It applies to various tasks spanning…

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: CODE: https://github.com/UX-Decoder/FIND

  10. arXiv:2312.02253  [pdf, other]

    cs.CV cs.AI cs.LG

    Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

    Authors: Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee

    Abstract: Recent advances in generative deep learning have enabled the creation of high-quality synthetic images in text-to-image generation. Prior work shows that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images from the finetuned model can enhance an ImageNet classifier's performance. However, performance degrades as synthetic images outnumber real ones. In thi…

    Submitted 4 December, 2023; originally announced December 2023.

  11. arXiv:2312.00784  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

    Authors: Mu Cai, Haotian Liu, Dennis Park, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Yong Jae Lee

    Abstract: While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual…

    Submitted 26 April, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: Accepted to CVPR2024. Project page: https://vip-llava.github.io/

  12. arXiv:2311.07377  [pdf, other]

    cs.SE cs.AI cs.DC cs.RO

    Testing learning-enabled cyber-physical systems with Large-Language Models: A Formal Approach

    Authors: Xi Zheng, Aloysius K. Mok, Ruzica Piskac, Yong Jae Lee, Bhaskar Krishnamachari, Dakai Zhu, Oleg Sokolsky, Insup Lee

    Abstract: The integration of machine learning (ML) into cyber-physical systems (CPS) offers significant benefits, including enhanced efficiency, predictive capabilities, real-time responsiveness, and the enabling of autonomous operations. This convergence has accelerated the development and deployment of a range of real-world applications, such as autonomous vehicles, delivery drones, service robots, and te…

    Submitted 16 May, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

  13. arXiv:2311.05889  [pdf, other]

    eess.IV cs.CV cs.LG

    Semantic Map Guided Synthesis of Wireless Capsule Endoscopy Images using Diffusion Models

    Authors: Haejin Lee, Jeongwoo Ju, Jonghyuck Lee, Yeoun Joo Lee, Heechul Jung

    Abstract: Wireless capsule endoscopy (WCE) is a non-invasive method for visualizing the gastrointestinal (GI) tract, crucial for diagnosing GI tract diseases. However, interpreting WCE results can be time-consuming and tiring. Existing studies have employed deep neural networks (DNNs) for automatic GI tract lesion detection, but acquiring sufficient training examples, particularly due to privacy concerns, r…

    Submitted 10 November, 2023; originally announced November 2023.

  14. arXiv:2310.03744  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Improved Baselines with Visual Instruction Tuning

    Authors: Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

    Abstract: Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response form…

    Submitted 15 May, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

    Comments: Camera ready, CVPR 2024 (highlight). LLaVA project page: https://llava-vl.github.io

  15. arXiv:2309.12530  [pdf, other]

    cs.CV

    A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance

    Authors: Zeyi Huang, Andy Zhou, Zijian Lin, Mu Cai, Haohan Wang, Yong Jae Lee

    Abstract: Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unsee…

    Submitted 21 September, 2023; originally announced September 2023.

    Comments: to appear at ICCV2023

  16. arXiv:2309.10313  [pdf, other]

    cs.CL cs.AI cs.LG

    Investigating the Catastrophic Forgetting in Multimodal Large Language Models

    Authors: Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma

    Abstract: Following the success of GPT4, there has been a surge in interest in multimodal large language model (MLLM) research. This line of research focuses on developing general-purpose LLMs through fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon where the fine-tuned model fails to retain similar performance compared to the pre-trained model, still…

    Submitted 5 December, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

  17. arXiv:2307.14331  [pdf, other]

    cs.CV

    Visual Instruction Inversion: Image Editing via Visual Prompting

    Authors: Thao Nguyen, Yuheng Li, Utkarsh Ojha, Yong Jae Lee

    Abstract: Text-conditioned image editing has emerged as a powerful tool for editing images. However, in many situations, language can be ambiguous and ineffective in describing specific image edits. When faced with such challenges, visual prompts can be a more informative and intuitive way to convey ideas. We present a method for image editing via visual prompting. Given pairs of examples that represent the…

    Submitted 26 July, 2023; originally announced July 2023.

    Comments: Project page: https://thaoshibe.github.io/visii/

  18. arXiv:2307.13697  [pdf, other]

    cs.CV cs.AI

    Benchmarking and Analyzing Generative Data for Visual Recognition

    Authors: Bo Li, Haotian Liu, Liangyu Chen, Yong Jae Lee, Chunyuan Li, Ziwei Liu

    Abstract: Advancements in large pre-trained generative models have expanded their potential as effective data generators in visual recognition. This work delves into the impact of generative images, primarily comparing paradigms that harness external data (i.e., generative vs. retrieval vs. original). Our key contributions are: 1) GenBench Construction: We devise GenBench, a broad benchmar…

    Submitted 25 July, 2023; originally announced July 2023.

    Comments: Research Report

  19. arXiv:2306.17154  [pdf, other]

    cs.CV

    Generate Anything Anywhere in Any Scene

    Authors: Yuheng Li, Haotian Liu, Yangming Wen, Yong Jae Lee

    Abstract: Text-to-image diffusion models have attracted considerable interest due to their wide applicability across diverse fields. However, challenges persist in creating controllable models for personalized object generation. In this paper, we first identify the entanglement issues in existing personalized generative models, and then propose a straightforward and efficient data augmentation training stra…

    Submitted 29 June, 2023; originally announced June 2023.

  20. arXiv:2306.06094  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

    Authors: Mu Cai, Zeyi Huang, Yuheng Li, Haohan Wang, Yong Jae Lee

    Abstract: Recently, large language models (LLMs) have made significant advancements in natural language understanding and generation. However, their potential in computer vision remains largely unexplored. In this paper, we introduce a new, exploratory approach that enables LLMs to process images using the Scalable Vector Graphics (SVG) format. By leveraging the XML-based textual descriptions of SVG represe…

    Submitted 9 June, 2023; originally announced June 2023.

  21. arXiv:2304.08485  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Visual Instruction Tuning

    Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

    Abstract: Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce…

    Submitted 11 December, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

    Comments: NeurIPS 2023 Oral; project page: https://llava-vl.github.io/

  22. arXiv:2304.06718  [pdf, other]

    cs.CV

    Segment Everything Everywhere All at Once

    Authors: Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, Yong Jae Lee

    Abstract: In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image, as shown in Fig.1. In SEEM, we propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal segmentation interface that behaves like large language models (LLMs). More specifically, SEEM is designed with four…

    Submitted 11 July, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

  23. arXiv:2303.07269  [pdf, other]

    cs.CV cs.LG

    InPL: Pseudo-labeling the Inliers First for Imbalanced Semi-supervised Learning

    Authors: Zhuoran Yu, Yin Li, Yong Jae Lee

    Abstract: Recent state-of-the-art methods in imbalanced semi-supervised learning (SSL) rely on confidence-based pseudo-labeling with consistency regularization. To obtain high-quality pseudo-labels, a high confidence threshold is typically adopted. However, it has been shown that softmax-based confidence scores in deep networks can be arbitrarily high for samples far from the training data, and thus, the ps…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: Accepted by ICLR 2023

  24. arXiv:2302.10174  [pdf, other]

    cs.CV cs.LG

    Towards Universal Fake Image Detectors that Generalize Across Generative Models

    Authors: Utkarsh Ojha, Yuheng Li, Yong Jae Lee

    Abstract: With generative models proliferating at a rapid rate, there is a growing need for general purpose fake image detectors. In this work, we first show that the existing paradigm, which consists of training a deep network for real-vs-fake classification, fails to detect fake images from newer breeds of generative models when trained to detect GAN fake images. Upon analysis, we find that the resulting…

    Submitted 1 April, 2024; v1 submitted 20 February, 2023; originally announced February 2023.

  25. arXiv:2301.07094  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Learning Customized Visual Models with Retrieval-Augmented Knowledge

    Authors: Haotian Liu, Kilho Son, Jianwei Yang, Ce Liu, Jianfeng Gao, Yong Jae Lee, Chunyuan Li

    Abstract: Image-text contrastive learning models such as CLIP have demonstrated strong task transfer ability. The high generality and usability of these visual models is achieved via a web-scale data collection process to ensure broad concept coverage, followed by expensive pre-training to feed all the knowledge into model weights. Alternatively, we propose REACT, REtrieval-Augmented CusTomization, a framew…

    Submitted 17 January, 2023; originally announced January 2023.

  26. arXiv:2301.07093  [pdf, other]

    cs.CV cs.AI cs.CL cs.GR cs.LG

    GLIGEN: Open-Set Grounded Text-to-Image Generation

    Authors: Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee

    Abstract: Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding…

    Submitted 16 April, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

  27. arXiv:2212.11270  [pdf, other]

    cs.CV cs.CL

    Generalized Decoding for Pixel, Image, and Language

    Authors: Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, Jianfeng Gao

    Abstract: We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that…

    Submitted 21 December, 2022; originally announced December 2022.

    Comments: https://x-decoder-vl.github.io

  28. arXiv:2212.04875  [pdf, other]

    cs.CV cs.AI

    Expeditious Saliency-guided Mix-up through Random Gradient Thresholding

    Authors: Minh-Long Luu, Zeyi Huang, Eric P. Xing, Yong Jae Lee, Haohan Wang

    Abstract: Mix-up training approaches have proven to be effective in improving the generalization ability of Deep Neural Networks. Over the years, the research community expands mix-up methods into two directions, with extensive efforts to improve saliency-guided procedures but minimal focus on the arbitrary path, leaving the randomization domain unexplored. In this paper, inspired by the superior qualities…

    Submitted 10 August, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: Accepted Long paper at 2nd Practical-DL Workshop at AAAI 2023

  29. arXiv:2211.02707  [pdf, other]

    cs.CV

    Contrastive Learning for Diverse Disentangled Foreground Generation

    Authors: Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, Krishna Kumar Singh

    Abstract: We introduce a new method for diverse foreground generation with explicit control over various factors. Existing image inpainting based foreground generation methods often struggle to generate diverse results and rarely allow users to explicitly control specific factors of variation (e.g., varying the facial identity or expression for face inpainting results). We leverage contrastive learning with…

    Submitted 4 November, 2022; originally announced November 2022.

    Comments: ECCV 2022

  30. arXiv:2209.06723  [pdf]

    cs.CL

    Toward Improving Health Literacy in Patient Education Materials with Neural Machine Translation Models

    Authors: David Oniani, Sreekanth Sreekumar, Renuk DeAlmeida, Dinuk DeAlmeida, Vivian Hui, Young Ji Lee, Yiye Zhang, Leming Zhou, Yanshan Wang

    Abstract: Health literacy is the central focus of Healthy People 2030, the fifth iteration of the U.S. national goals and objectives. People with low health literacy usually have trouble understanding health information, following post-visit instructions, and using prescriptions, which results in worse health outcomes and serious health disparities. In this study, we propose to leverage natural language pro…

    Submitted 14 September, 2022; originally announced September 2022.

  31. arXiv:2206.06359  [pdf, other]

    cs.CV cs.AI cs.LG

    EnergyMatch: Energy-based Pseudo-Labeling for Semi-Supervised Learning

    Authors: Zhuoran Yu, Yin Li, Yong Jae Lee

    Abstract: Recent state-of-the-art methods in semi-supervised learning (SSL) combine consistency regularization with confidence-based pseudo-labeling. To obtain high-quality pseudo-labels, a high confidence threshold is typically adopted. However, it has been shown that softmax-based confidence scores in deep networks can be arbitrarily high for samples far from the training data, and thus, the pseudo-labels…

    Submitted 13 June, 2022; originally announced June 2022.

  32. arXiv:2205.16004  [pdf, other]

    cs.CV cs.LG

    What Knowledge Gets Distilled in Knowledge Distillation?

    Authors: Utkarsh Ojha, Yuheng Li, Anirudh Sundara Rajan, Yingyu Liang, Yong Jae Lee

    Abstract: Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance for the task at hand. Over the years, there has been a deluge of novel techniques and use cases of knowledge distillation. Yet, despite the various improvements, there seems to be a glaring gap in the community's fundamental understa…

    Submitted 6 November, 2023; v1 submitted 31 May, 2022; originally announced May 2022.

    Comments: NeurIPS 2023 camera ready

  33. arXiv:2204.08790  [pdf, other]

    cs.CV cs.CL cs.LG

    ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

    Authors: Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, Jianfeng Gao

    Abstract: Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate the transferablity of these models due to the lack of easy-to-use evaluation toolkits and public bench…

    Submitted 13 October, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

    Comments: NeurIPS 2022 (Datasets and Benchmarks Track). The first two authors contribute equally. Benchmark page: https://computer-vision-in-the-wild.github.io/ELEVATER/

  34. arXiv:2204.04384  [pdf, other]

    cs.LG cs.CV

    The Two Dimensions of Worst-case Training and the Integrated Effect for Out-of-domain Generalization

    Authors: Zeyi Huang, Haohan Wang, Dong Huang, Yong Jae Lee, Eric P. Xing

    Abstract: Training with an emphasis on "hard-to-learn" components of the data has been proven as an effective method to improve the generalization of machine learning models, especially in the settings where robustness (e.g., generalization across distributions) is valued. Existing literature discussing this "hard-to-learn" concept are mainly expanded either along the dimension of the samples or the dimensi…

    Submitted 9 April, 2022; originally announced April 2022.

    Comments: to appear at CVPR2022

  35. arXiv:2204.02898  [pdf, other]

    cs.CV

    End-to-End Instance Edge Detection

    Authors: Xueyan Zou, Haotian Liu, Yong Jae Lee

    Abstract: Edge detection has long been an important problem in the field of computer vision. Previous works have explored category-agnostic or category-aware edge detection. In this paper, we explore edge detection in the context of object instances. Although object boundaries could be easily derived from segmentation masks, in practice, instance segmentation models are trained to maximize IoU to the ground…

    Submitted 6 April, 2022; originally announced April 2022.

  36. arXiv:2203.14954  [pdf, other]

    cs.CV cs.AI cs.LG

    GIRAFFE HD: A High-Resolution 3D-aware Generative Model

    Authors: Yang Xue, Yuheng Li, Krishna Kumar Singh, Yong Jae Lee

    Abstract: 3D-aware generative models have shown that the introduction of 3D information can lead to more controllable image generation. In particular, the current state-of-the-art model GIRAFFE can control each object's rotation, translation, scale, and scene camera pose without corresponding supervision. However, GIRAFFE only operates well when the image resolution is low. We propose GIRAFFE HD, a high-res…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: CVPR 2022

  37. arXiv:2203.11183  [pdf, other]

    cs.CV

    Masked Discrimination for Self-Supervised Learning on Point Clouds

    Authors: Haotian Liu, Mu Cai, Yong Jae Lee

    Abstract: Masked autoencoding has achieved great success for self-supervised learning in the image and language domains. However, mask based pretraining has yet to show benefits for point cloud understanding, likely due to standard backbones like PointNet being unable to properly handle the training versus testing distribution mismatch introduced by masking during training. In this paper, we bridge this gap…

    Submitted 1 August, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: ECCV 2022; Code: https://github.com/haotian-liu/MaskPoint

  38. arXiv:2111.03740  [pdf, other]

    cs.LG

    Toward Learning Human-aligned Cross-domain Robust Models by Countering Misaligned Features

    Authors: Haohan Wang, Zeyi Huang, Hanlin Zhang, Yong Jae Lee, Eric Xing

    Abstract: Machine learning has demonstrated remarkable prediction accuracy over i.i.d data, but the accuracy often drops when tested with data from another distribution. In this paper, we aim to offer another view of this problem in a perspective assuming the reason behind this accuracy drop is the reliance of models on the features that are not aligned well with how a data annotator considers similar acros…

    Submitted 16 June, 2022; v1 submitted 5 November, 2021; originally announced November 2021.

    Comments: to appear at UAI 2022

  39. arXiv:2110.04281  [pdf, other]

    cs.CV cs.LG

    Collaging Class-specific GANs for Semantic Image Synthesis

    Authors: Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, Krishna Kumar Singh

    Abstract: We propose a new approach for high resolution semantic image synthesis. It consists of one base image generator and multiple class-specific generators. The base generator generates high quality images based on a segmentation map. To further improve the quality of different objects, we create a bank of Generative Adversarial Networks (GANs) by separately training class-specific models. This has sev…

    Submitted 8 October, 2021; originally announced October 2021.

    Comments: ICCV 2021

  40. arXiv:2108.13258  [pdf, other]

    cs.CV

    Equine Pain Behavior Classification via Self-Supervised Disentangled Pose Representation

    Authors: Maheen Rashid, Sofia Broomé, Katrina Ask, Elin Hernlund, Pia Haubro Andersen, Hedvig Kjellström, Yong Jae Lee

    Abstract: Timely detection of horse pain is important for equine welfare. Horses express pain through their facial and body behavior, but may hide signs of pain from unfamiliar human observers. In addition, collecting visual data with detailed annotation of horse behavior and pain state is both cumbersome and not scalable. Consequently, a pragmatic equine pain classification system would use video of the un…

    Submitted 30 August, 2021; originally announced August 2021.

  41. arXiv:2104.06820  [pdf, other]

    cs.CV cs.GR cs.LG

    Few-shot Image Generation via Cross-domain Correspondence

    Authors: Utkarsh Ojha, Yijun Li, Jingwan Lu, Alexei A. Efros, Yong Jae Lee, Eli Shechtman, Richard Zhang

    Abstract: Training generative models, such as GANs, on a target domain containing limited examples (e.g., 10) can easily result in overfitting. In this work, we seek to utilize a large source domain for pretraining and transfer the diversity information from source to target. We propose to preserve the relative similarities and differences between instances in the source via a novel cross-domain distance co…

    Submitted 13 April, 2021; originally announced April 2021.

    Comments: CVPR 2021

  42. arXiv:2104.03507  [pdf, other]

    cs.CV

    Progressive Temporal Feature Alignment Network for Video Inpainting

    Authors: Xueyan Zou, Linjie Yang, Ding Liu, Yong Jae Lee

    Abstract: Video inpainting aims to fill spatio-temporal "corrupted" regions with plausible content. To achieve this goal, it is necessary to find correspondences from neighbouring frames to faithfully hallucinate the unknown content. Current methods achieve this goal through attention, flow-based warping, or 3D temporal convolution. However, flow-based warping can create artifacts when optical flow is not a…

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: Accepted in CVPR2021

  43. arXiv:2104.02052  [pdf, other]

    cs.CV cs.GR cs.LG

    Generating Furry Cars: Disentangling Object Shape & Appearance across Multiple Domains

    Authors: Utkarsh Ojha, Krishna Kumar Singh, Yong Jae Lee

    Abstract: We consider the novel task of learning disentangled representations of object shape and appearance across multiple domains (e.g., dogs and cars). The goal is to learn a generative model that learns an intermediate distribution, which borrows a subset of properties from each domain, enabling the generation of images that did not exist in any domain exclusively. This challenging problem requires an…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: Camera ready version for ICLR 2021

  44. arXiv:2012.12259  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    YolactEdge: Real-time Instance Segmentation on the Edge

    Authors: Haotian Liu, Rafael A. Rivera Soto, Fanyi Xiao, Yong Jae Lee

    Abstract: We propose YolactEdge, the first competitive instance segmentation approach that runs on small edge devices at real-time speeds. Specifically, YolactEdge runs at up to 30.8 FPS on a Jetson AGX Xavier (and 172.7 FPS on an RTX 2080 Ti) with a ResNet-101 backbone on 550x550 resolution images. To achieve this, we make two improvements to the state-of-the-art image-based real-time method YOLACT: (1) ap…

    Submitted 1 April, 2021; v1 submitted 22 December, 2020; originally announced December 2020.

    Comments: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  45. arXiv:2010.07292  [pdf, other]

    cs.CY cs.CL

    My Team Will Go On: Differentiating High and Low Viability Teams through Team Interaction

    Authors: Hancheng Cao, Vivian Yang, Victor Chen, Yu Jin Lee, Lydia Stone, N'godjigui Junior Diarrassouba, Mark E. Whiting, Michael S. Bernstein

    Abstract: Understanding team viability -- a team's capacity for sustained and future success -- is essential for building effective teams. In this study, we aggregate features drawn from the organizational behavior literature to train a viability classification model over a dataset of 669 10-minute text conversations of online teams. We train classifiers to identify teams at the top decile (most viable team…

    Submitted 3 November, 2020; v1 submitted 14 October, 2020; originally announced October 2020.

    Comments: CSCW 2020 Honorable Mention Award

    Journal ref: Proc. ACM Hum.-Comput. Interact. 4, CSCW3, Article 230 (December 2020)

  46. arXiv:2008.09604  [pdf, other]

    cs.CV

    Delving Deeper into Anti-aliasing in ConvNets

    Authors: Xueyan Zou, Fanyi Xiao, Zhiding Yu, Yong Jae Lee

    Abstract: Aliasing refers to the phenomenon that high frequency signals degenerate into completely different ones after sampling. It arises as a problem in the context of deep learning as downsampling layers are widely adopted in deep architectures to reduce parameters and computation. The standard solution is to apply a low-pass filter (e.g., Gaussian blur) before downsampling. However, it can be suboptima…

    Submitted 21 August, 2020; originally announced August 2020.

    Comments: [Accepted in BMVC2020] code: https://maureenzou.github.io/ddac/

  47. arXiv:2004.04725  [pdf, other]

    cs.CV cs.LG eess.IV

    Instance-aware, Context-focused, and Memory-efficient Weakly Supervised Object Detection

    Authors: Zhongzheng Ren, Zhiding Yu, Xiaodong Yang, Ming-Yu Liu, Yong Jae Lee, Alexander G. Schwing, Jan Kautz

    Abstract: Weakly supervised learning has emerged as a compelling tool for object detection by reducing the need for strong supervision during training. However, major challenges remain: (1) differentiation of object instances can be ambiguous; (2) detectors tend to focus on discriminative parts rather than entire objects; (3) without ground truth, object proposals have to be redundant for high recalls, caus…

    Submitted 21 October, 2020; v1 submitted 9 April, 2020; originally announced April 2020.

    Comments: CVPR 2020

  48. arXiv:2003.03047  [pdf, other]

    cs.RO

    Robotic Assembly across Multiple Contact Stiffnesses with Robust Force Controllers

    Authors: Ying Jun Wilson Lee, Quang-Cuong Pham

    Abstract: Active Force Control (AFC) is an important scheme for tackling high-precision robotic assembly. Classical force controllers are highly surface-dependent: the controller must be carefully tuned for each type of surface in contact, in order to avoid instabilities and to achieve a reasonable performance level. Here, we build upon the recently-developed Convex Controller Synthesis (CCS) to enable high…

    Submitted 6 March, 2020; originally announced March 2020.

    Comments: 6 pages, 9 figures

  49. arXiv:2002.01449  [pdf, other]

    cs.CV

    Action Graphs: Weakly-supervised Action Localization with Graph Convolution Networks

    Authors: Maheen Rashid, Hedvig Kjellström, Yong Jae Lee

    Abstract: We present a method for weakly-supervised action localization based on graph convolutions. In order to find and classify video time segments that correspond to relevant action classes, a system must be able to both identify discriminative time segments in each video, and identify the full extent of each action. Achieving this with weak video level labels requires the system to use similarity and d…

    Submitted 4 February, 2020; originally announced February 2020.

    Comments: Accepted at WACV 2020

  50. arXiv:2001.08740  [pdf, other]

    cs.CV

    Audiovisual SlowFast Networks for Video Recognition

    Authors: Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer

    Abstract: We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcom…

    Submitted 8 March, 2020; v1 submitted 23 January, 2020; originally announced January 2020.

    Comments: Technical report