
Showing 1–24 of 24 results for author: Changpinyo, S

Searching in archive cs.
  1. arXiv:2312.11805

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  2. arXiv:2305.18565

    cs.CV cs.CL cs.LG

    PaLI-X: On Scaling up a Multilingual Vision and Language Model

    Authors: Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, et al. (18 additional authors not shown)

    Abstract: We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-sh…

    Submitted 29 May, 2023; originally announced May 2023.

  3. arXiv:2305.10400

    cs.CL cs.CV

    What You See is What You Read? Improving Text-Image Alignment Evaluation

    Authors: Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor

    Abstract: Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to…

    Submitted 26 December, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted to NeurIPS 2023. Website: https://wysiwyr-itm.github.io/

  4. arXiv:2302.11713

    cs.CV cs.AI cs.CL

    Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?

    Authors: Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, Ming-Wei Chang

    Abstract: Pre-trained vision and language models have demonstrated state-of-the-art capabilities over existing tasks involving images and texts, including visual question answering. However, it remains unclear whether these models possess the capability to answer questions that are not only querying visual content but knowledge-intensive and information-seeking. In this study, we introduce InfoSeek, a visua…

    Submitted 17 October, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: EMNLP 2023 (main conference); Our dataset and evaluation is available at https://open-vision-language.github.io/infoseek/

  5. arXiv:2302.11217

    cs.CV

    Connecting Vision and Language with Video Localized Narratives

    Authors: Paul Voigtlaender, Soravit Changpinyo, Jordi Pont-Tuset, Radu Soricut, Vittorio Ferrari

    Abstract: We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language. In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment. However, this is challenging on a video. Our new protocol empowers annotators to tell the story of a video with Localized Narrati…

    Submitted 15 March, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: Accepted at CVPR 2023

  6. arXiv:2212.09898

    cs.CV

    MetaCLUE: Towards Comprehensive Visual Metaphors Research

    Authors: Arjun R. Akula, Brendan Driscoll, Pradyumna Narayana, Soravit Changpinyo, Zhiwei Jia, Suyash Damle, Garima Pruthi, Sugato Basu, Leonidas Guibas, William T. Freeman, Yuanzhen Li, Varun Jampani

    Abstract: Creativity is an indispensable part of human cognition and also an inherent part of how we make sense of the world. Metaphorical abstraction is fundamental in communicating creative ideas through nuanced relationships between abstract concepts such as feelings. While computer vision benchmarks and approaches predominantly focus on understanding and generating literal interpretations of images, met…

    Submitted 2 June, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: Accepted in CVPR 2023. Project page: https://metaclue.github.io/ , Video summary: https://youtu.be/V3TmeNETL-o

  7. arXiv:2209.06794

    cs.CV cs.CL

    PaLI: A Jointly-Scaled Multilingual Language-Image Model

    Authors: Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, et al. (4 additional authors not shown)

    Abstract: Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaL…

    Submitted 5 June, 2023; v1 submitted 14 September, 2022; originally announced September 2022.

    Comments: ICLR 2023 (Notable-top-5%)

  8. arXiv:2209.05534

    cs.CV cs.CL

    PreSTU: Pre-Training for Scene-Text Understanding

    Authors: Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, Radu Soricut

    Abstract: The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often failed to include such an ability in their training objective. In this paper, we propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware pre-training objectives…

    Submitted 19 August, 2023; v1 submitted 12 September, 2022; originally announced September 2022.

    Comments: Accepted to ICCV 2023

  9. arXiv:2209.05401

    cs.CL cs.CV

    MaXM: Towards Multilingual Visual Question Answering

    Authors: Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish V. Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, Radu Soricut

    Abstract: Visual Question Answering (VQA) has been primarily studied through the lens of the English language. Yet, tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose scalable solutions to multilingual visual question answering (mVQA), on both data and modeling fronts. We first propose a translation-based framework to mVQA data gene…

    Submitted 24 October, 2023; v1 submitted 12 September, 2022; originally announced September 2022.

    Comments: EMNLP 2023 (Findings). https://github.com/google-research-datasets/maxm

  10. arXiv:2205.01883

    cs.CV cs.CL

    All You May Need for VQA are Image Captions

    Authors: Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, Radu Soricut

    Abstract: Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resultin…

    Submitted 4 May, 2022; originally announced May 2022.

    Comments: 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2022)

  11. arXiv:2107.02170

    cs.CV cs.AI cs.LG

    On Model Calibration for Long-Tailed Object Detection and Instance Segmentation

    Authors: Tai-Yu Pan, Cheng Zhang, Yandong Li, Hexiang Hu, Dong Xuan, Soravit Changpinyo, Boqing Gong, Wei-Lun Chao

    Abstract: Vanilla models for object detection and instance segmentation suffer from the heavy bias toward detecting frequent objects in the long-tailed setting. Existing methods address this issue mostly during training, e.g., by re-sampling or re-weighting. In this paper, we investigate a largely overlooked approach -- post-processing calibration of confidence scores. We propose NorCal, Normalized Calibrat…

    Submitted 29 November, 2021; v1 submitted 5 July, 2021; originally announced July 2021.

    Comments: Accepted to NeurIPS 2021

  12. arXiv:2104.12727

    cs.CV

    2.5D Visual Relationship Detection

    Authors: Yu-Chuan Su, Soravit Changpinyo, Xiangning Chen, Sathish Thoppay, Cho-Jui Hsieh, Lior Shapira, Radu Soricut, Hartwig Adam, Matthew Brown, Ming-Hsuan Yang, Boqing Gong

    Abstract: Visual 2.5D perception involves understanding the semantics and geometry of a scene through reasoning about object relationships with respect to the viewer in an environment. However, existing works in visual recognition primarily focus on the semantics. To bridge this gap, we study 2.5D visual relationship detection (2.5VRD), in which the goal is to jointly detect objects and predict their relati…

    Submitted 26 April, 2021; originally announced April 2021.

  13. arXiv:2102.08981

    cs.CV cs.CL

    Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

    Authors: Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut

    Abstract: The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overrestrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step…

    Submitted 30 March, 2021; v1 submitted 17 February, 2021; originally announced February 2021.

    Comments: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2021). Our dataset is available at https://github.com/google-research-datasets/conceptual-12m

  14. arXiv:2102.08884

    cs.CV

    MosaicOS: A Simple and Effective Use of Object-Centric Images for Long-Tailed Object Detection

    Authors: Cheng Zhang, Tai-Yu Pan, Yandong Li, Hexiang Hu, Dong Xuan, Soravit Changpinyo, Boqing Gong, Wei-Lun Chao

    Abstract: Many objects do not appear frequently enough in complex scenes (e.g., certain handbags in living rooms) for training an accurate object detector, but are often found frequently by themselves (e.g., in product images). Yet, these object-centric images are not effectively leveraged for improving object detection in scene-centric images. In this paper, we propose Mosaic of Object-centric images as Sc…

    Submitted 13 September, 2021; v1 submitted 17 February, 2021; originally announced February 2021.

    Comments: Accepted to ICCV 2021

  15. arXiv:2102.04980

    cs.CV cs.CL

    Telling the What while Pointing to the Where: Multimodal Queries for Image Retrieval

    Authors: Soravit Changpinyo, Jordi Pont-Tuset, Vittorio Ferrari, Radu Soricut

    Abstract: Most existing image retrieval systems use text queries as a way for the user to express what they are looking for. However, fine-grained image retrieval often requires the ability to also express where in the image the content they are looking for is. The text modality can only cumbersomely express such localization preferences, whereas pointing is a more natural fit. In this paper, we propose an…

    Submitted 24 August, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

    Comments: IEEE/CVF International Conference on Computer Vision (ICCV 2021)

  16. arXiv:2009.05175

    cs.CL cs.CV

    Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models

    Authors: Khyathi Raghavi Chandu, Piyush Sharma, Soravit Changpinyo, Ashish Thapliyal, Radu Soricut

    Abstract: Training large-scale image captioning (IC) models demands access to a rich and diverse set of training examples, gathered from the wild, often from noisy alt-text data. However, recent modeling approaches to IC often fall short in terms of performance in this case, because they assume a clean annotated dataset (as opposed to the noisier alt-text--based annotations), and employ an end-to-end genera…

    Submitted 30 October, 2022; v1 submitted 10 September, 2020; originally announced September 2020.

  17. arXiv:1912.03098

    cs.CV

    Connecting Vision and Language with Localized Narratives

    Authors: Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, Vittorio Ferrari

    Abstract: We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a…

    Submitted 20 July, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

    Comments: ECCV 2020 Camera Ready

  18. arXiv:1909.02097

    cs.CL cs.CV

    Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering

    Authors: Soravit Changpinyo, Bo Pang, Piyush Sharma, Radu Soricut

    Abstract: Object detection plays an important role in current solutions to vision and language tasks like image captioning and visual question answering. However, popular models like Faster R-CNN rely on a costly process of annotating ground-truths for both the bounding boxes and their corresponding semantic labels, making it less amenable as a primitive task for transfer learning. In this paper, we examine…

    Submitted 4 September, 2019; originally announced September 2019.

    Comments: The 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019)

  19. arXiv:1812.06423

    cs.CV

    Classifier and Exemplar Synthesis for Zero-Shot Learning

    Authors: Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, Fei Sha

    Abstract: Zero-shot learning (ZSL) enables solving a task without the need to see its examples. In this paper, we propose two ZSL frameworks that learn to synthesize parameters for novel unseen classes. First, we propose to cast the problem of ZSL as learning manifold embeddings from graphs composed of object classes, leading to a flexible approach that synthesizes "classifiers" for the unseen classes. Then…

    Submitted 18 July, 2019; v1 submitted 16 December, 2018; originally announced December 2018.

    Comments: Extended version of arXiv:1603.00550 (CVPR 2016) and arXiv:1605.08151 (ICCV 2017); Accepted for publication in International Journal of Computer Vision (IJCV)

  20. arXiv:1808.04151

    cs.CL

    Multi-Task Learning for Sequence Tagging: An Empirical Study

    Authors: Soravit Changpinyo, Hexiang Hu, Fei Sha

    Abstract: We study three general multi-task learning (MTL) approaches on 11 sequence tagging tasks. Our extensive empirical results show that in about 50% of the cases, jointly learning all 11 tasks improves upon either independent or pairwise learning of the tasks. We also show that pairwise MTL can inform us what tasks can benefit others or what tasks can be benefited if they are learned jointly. In parti…

    Submitted 13 August, 2018; originally announced August 2018.

    Comments: In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)

  21. arXiv:1702.06257

    cs.CV

    The Power of Sparsity in Convolutional Neural Networks

    Authors: Soravit Changpinyo, Mark Sandler, Andrey Zhmoginov

    Abstract: Deep convolutional networks are well-known for their high computational and memory demands. Given limited resources, how does one design a network that balances its size, training time, and prediction accuracy? A surprisingly effective approach to trade accuracy for size and speed is to simply reduce the number of channels in each convolutional layer by a fixed fraction and retrain the network. In…

    Submitted 20 February, 2017; originally announced February 2017.

  22. arXiv:1605.08151

    cs.CV

    Predicting Visual Exemplars of Unseen Classes for Zero-Shot Learning

    Authors: Soravit Changpinyo, Wei-Lun Chao, Fei Sha

    Abstract: Leveraging class semantic descriptions and examples of known objects, zero-shot learning makes it possible to train a recognition model for an object class whose examples are not available. In this paper, we propose a novel zero-shot learning model that takes advantage of clustering structures in the semantic embedding space. The key idea is to impose the structural constraint that semantic repres…

    Submitted 20 August, 2017; v1 submitted 26 May, 2016; originally announced May 2016.

    Comments: ICCV2017 camera-ready

  23. arXiv:1605.04253

    cs.CV

    An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild

    Authors: Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, Fei Sha

    Abstract: Zero-shot learning (ZSL) methods have been studied in the unrealistic setting where test data are assumed to come from unseen classes only. In this paper, we advocate studying the problem of generalized zero-shot learning (GZSL) where the test data's class memberships are unconstrained. We show empirically that naively using the classifiers constructed by ZSL approaches does not perform well in th…

    Submitted 11 January, 2017; v1 submitted 13 May, 2016; originally announced May 2016.

    Comments: ECCV2016 camera-ready

  24. arXiv:1603.00550

    cs.CV

    Synthesized Classifiers for Zero-Shot Learning

    Authors: Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, Fei Sha

    Abstract: Given semantic descriptions of object classes, zero-shot learning aims to accurately recognize objects of the unseen classes, from which no examples are available at the training stage, by associating them to the seen classes, from which labeled examples are provided. We propose to tackle this problem from the perspective of manifold learning. Our main idea is to align the semantic space that is d…

    Submitted 27 May, 2016; v1 submitted 1 March, 2016; originally announced March 2016.