Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

Cai, Mu; Huang, Zeyi; Li, Yuheng; Wang, Haohan; Lee, Yong Jae

arXiv:2306.06094 (cs)

[Submitted on 9 Jun 2023]

Abstract: Recently, large language models (LLMs) have made significant advancements in natural language understanding and generation. However, their potential in computer vision remains largely unexplored. In this paper, we introduce a new, exploratory approach that enables LLMs to process images using the Scalable Vector Graphics (SVG) format. By leveraging the XML-based textual descriptions of SVG representations instead of raster images, we aim to bridge the gap between the visual and textual modalities, allowing LLMs to directly understand and manipulate images without the need for parameterized visual components. Our method facilitates simple image classification, generation, and in-context learning using only LLM capabilities. We demonstrate the promise of our approach across discriminative and generative tasks, highlighting its (i) robustness against distribution shift, (ii) substantial improvements achieved by tapping into the in-context learning abilities of LLMs, and (iii) image understanding and generation capabilities with human guidance. Our code, data, and models can be found at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2306.06094 [cs.CV]
	(or arXiv:2306.06094v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.06094

Submission history

From: Mu Cai
[v1] Fri, 9 Jun 2023 17:57:01 UTC (1,193 KB)
