
License: CC BY 4.0
arXiv:2312.11556v1 [cs.CV] 17 Dec 2023

StarVector: Generating Scalable Vector Graphics Code from Images

Juan A. Rodriguez (1, 2, 4), Shubham Agarwal (1, 2), Issam H. Laradji (1, 5), Pau Rodriguez (6, *),
David Vazquez (1), Christopher Pal (1, 2, 3), Marco Pedersoli (4)
Abstract

Scalable Vector Graphics (SVGs) have become integral in modern image rendering applications due to their infinite scalability in resolution, versatile usability, and editing capabilities. SVGs are particularly popular in the fields of web development and graphic design. Existing approaches for SVG modeling using deep learning often struggle with generating complex SVGs and are restricted to simpler ones that require extensive processing and simplification. This paper introduces StarVector, a multimodal SVG generation model that effectively integrates Code Generation Large Language Models (CodeLLMs) and vision models. Our approach utilizes a CLIP image encoder to extract visual representations from pixel-based images, which are then transformed into visual tokens via an adapter module. These visual tokens are pre-pended to the SVG token embeddings, and the sequence is modeled by the StarCoder model using next-token prediction, effectively learning to align the visual and code tokens. This enables StarVector to generate unrestricted SVGs that accurately represent pixel images. To evaluate StarVector’s performance, we present SVG-Bench, a comprehensive benchmark for evaluating SVG methods across multiple datasets and relevant metrics. Within this benchmark, we introduce novel datasets including SVG-Stack, a large-scale dataset of real-world SVG examples, and use it to pre-train StarVector as a large foundation model for SVGs. Our results demonstrate significant enhancements in visual quality and complexity handling over current methods, marking a notable advancement in SVG generation technology. Code and models: https://github.com/joanrod/star-vector

(1) ServiceNow Research; (2) Mila - Quebec AI Institute; (3) Canada CIFAR AI Chair; (4) ÉTS, Montréal, Canada; (5) UBC, Vancouver, Canada; (6) Apple MLR, Barcelona, Spain. (*) External collaboration.




Figure 1: Image-to-SVG generation task: Given an input image, generate the corresponding SVG code. On the left, we show test examples of complex SVGs from SVG-Emoji and SVG-Stack datasets. StarVector encodes images and processes them in a multimodal language modeling fashion, to generate executable SVG code that resembles the input image. We show real generated code and rasterized images produced by our StarVector model, showing impressive capabilities at generating appealing SVG designs and using complex syntax.

1 Introduction

Vector Graphics represent an archetypal form of image representation, where visual compositions are constituted by primitive shapes such as vector paths, curves, or polygons, parametrized by mathematical equations [41]. This contrasts starkly with raster graphics, where images are represented as pixels on a grid. The primary advantage of vector graphics lies in their ability to maintain high precision and consistent visual quality across various resolutions, as they can be scaled arbitrarily without any loss of detail [47, 34].

In the realm of modern image rendering, Scalable Vector Graphics (SVGs) [54] have become a standard for encapsulating vector graphics in a code-based format. SVGs are the preferred choice for many artistic use cases like icon creation or typography. This format has gained prominence in applications demanding fast, efficient, and high-quality image rendering. In web design, SVGs contribute to faster rendering and better image compression owing to their inherently small file sizes. They also offer dynamic editability, allowing for straightforward modifications in color and size, which is crucial for accessibility and responsive design. For graphic design and scientific visualization, SVGs are prized for their visual quality, facilitating the creation of versatile designs and ensuring high-quality print outputs.

The SVG format utilizes Extensible Markup Language (XML) [26] syntax to define vector graphics, offering a rich palette for a broad range of graphical properties and effects. Central to SVG is the vector path (or simply path), comprised of points and control points connected by mathematically defined lines or curves, allowing detailed control over the shape and size of the graphics. SVGs can also incorporate a variety of other primitives, such as rectangles, ellipses, and text, along with styles and advanced capabilities.
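As a concrete illustration of the syntax described above, the following sketch writes a tiny SVG that mixes a path with other primitives and checks that it parses as XML. The shapes, coordinates, and colors are invented for illustration and do not come from any dataset in this paper; the helper names `list_primitives` and `path_commands` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# A hand-written SVG document (illustrative values only).
SVG_CODE = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <rect x="10" y="10" width="80" height="80" fill="#ffcc00"/>
  <ellipse cx="50" cy="50" rx="30" ry="20" fill="white"/>
  <path d="M 20 80 C 35 60, 65 60, 80 80" stroke="black" fill="none"/>
  <text x="50" y="55" text-anchor="middle" font-size="10">SVG</text>
</svg>"""

SVG_NS = "{http://www.w3.org/2000/svg}"


def list_primitives(svg_code):
    """Return the tag names of the direct children of the <svg> root."""
    root = ET.fromstring(svg_code)
    # ElementTree prefixes tags with "{namespace}"; keep only the local name.
    return [child.tag.split("}", 1)[-1] for child in root]


def path_commands(svg_code):
    """Return the command letters found in every path's `d` attribute."""
    root = ET.fromstring(svg_code)
    commands = []
    for path in root.iter(SVG_NS + "path"):
        commands.extend(ch for ch in path.get("d", "") if ch.isalpha())
    return commands


print(list_primitives(SVG_CODE))  # ['rect', 'ellipse', 'path', 'text']
print(path_commands(SVG_CODE))    # ['M', 'C']
```

Here the path uses a move command (M) followed by a cubic curve (C), while the rectangle, ellipse, and text are separate primitives with their own attributes.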

Despite the eminent advantages of SVGs, existing deep learning-based generative solutions have limitations in producing high-quality, complex SVGs. Current approaches [13, 61, 12] typically model SVGs by learning a latent variable model over command paths. Such methods predominantly utilize simplified SVGs, limited to path commands and often restricted in complexity, with some focusing solely on elementary fonts or icons [83, 79]. Recent advancements involve using powerful image diffusion models [64] to generate raster images, which are then converted into SVG [34]. Nevertheless, this approach involves a costly iterative refinement process and is also limited to paths. Despite these efforts, a gap remains in systems that can directly generate detailed and complex SVG code, leveraging the full range of SVG primitives necessary for intricate designs.

This paper studies the task of image-to-SVG generation (Figure 1), which has been traditionally approached as a form of image vectorization [42, 85], relying predominantly on image processing and curve fitting techniques [41, 78]. Our research diverges from these methods, posing the task as a code-generation problem building upon recent advancements in Large Language Models (LLMs) [9, 74, 10]. Thanks to the success in scaling up transformers [75], these models have shown outstanding downstream abilities in tasks like language understanding [16], summarization [71], or coding [50, 40, 65]. The emergent capabilities of LLMs in code creation are particularly relevant to our work, as shown by Bubeck et al. [10] in a study using GPT-4 [51] on generating SVG code.

In this work, we propose a novel paradigm, where a multimodal LLM learns SVG synthesis as an image-to-code generation task. In this new setting, we tackle the problem of image-to-SVG generation by learning a CLIP [57] image encoder coupled with an adapter model to project images into visual token embeddings (visual tokens) and use them to condition a StarCoder [40] model to generate the SVG code associated with the input image. The StarVector architecture is shown in Figure 2. Addressing SVG generation with a code generation language model (CodeLLM) allows for preserving the richness and versatility of SVG primitives and syntax, as it can handle unaltered real-world SVGs without the need for simplification. Further, using the SVG code as a representation instead of a latent variable can bring enhanced editing capabilities. We propose the task of image-to-SVG as a pre-training task for building a foundation model [51, 74] for SVG generation.

Figure 2: StarVector architecture: Images in the pixel space are encoded into a set of 2D embeddings using CLIP [56]. The Adapter applies a non-linear transformation to the image embeddings to align them with Code-LLM space, obtaining visual tokens. StarCoder uses the image embeddings as context to generate the SVG. During training the task is supervised by the next token prediction of the SVG tokens. During inference, the model uses the visual tokens from an input image to predict SVG code autoregressively.

Contributions.

In summary, our contributions are the following: i) We introduce StarVector, a Large Multimodal Model for code generation, which leverages image and language modalities for generating executable SVG code from images. ii) We present SVG-Bench, a unified evaluation benchmark for SVG generation methods, which facilitates access to popular SVG datasets and metrics. Within this benchmark, we introduce two new datasets namely SVG-Emoji (composed of 10k complex emoji SVGs) and SVG-Stack (a large-scale dataset with over 2M real-world SVGs). iii) We evaluate StarVector and prior baselines on SVG-Bench which focuses on the image-to-SVG generation task. We showcase the ability of our model to generalize to complex SVGs and demonstrate the effectiveness of pre-training StarVector on the large-scale SVG-Stack dataset.

The paper is structured as follows: Section 2 presents previous methods related to our research on SVG generation while Section 3 explains the StarVector method in detail. We present SVG-Bench (with accompanying datasets and metrics) in Section 4, followed by experimental results in Section 5 and conclusions in Section 6.

2 Related Work

This section presents prior research relevant to our study, encompassing methods in vector graphics and SVG generation, developments in CodeLLMs, and advancements in multimodal models that integrate image and textual data.

SVG Generation Methods.

Early efforts in vector graphics (see https://en.wikipedia.org/wiki/Comparison_of_raster-to-vector_conversion_software) predominantly utilized traditional image processing techniques for tasks like image vectorization [23, 85, 42], often involving segmentation and polynomial curve fitting [41, 78]. With the advent of deep learning, new approaches emerged. SVG-VAE [45], a class-conditional Variational Autoencoder (VAE) [35], predicts a latent style vector and generates SVGs using an LSTM decoder [30]. DeepSVG [13] proposes a hierarchical VAE architecture using transformers to represent SVG paths. Im2Vec [61] translates pixel images into latent representations, which can be decoded into paths via a recurrent neural network (RNN). However, latent-based methods are limited to path primitives, thus restricting their scope to a subset of SVGs. Because of this limitation, they tend to generalize poorly and to overfit on complex-looking SVG datasets.

Recent trends in image generation using diffusion [29, 64] or autoregressive [25, 59, 86] models have also been explored in the SVG space. VectorFusion [34] leverages a strong text-to-image diffusion model to find the SVG via iterative optimization. CLIPasso [77] uses a CLIP distance loss to iteratively refine SVG from sketches. Both these solutions can be slow due to their iterative nature. Similar to ours, IconShop [83] trains a BERT [22] model for text-to-SVG conversion on icons, using path commands as tokens of the language model, while we use the SVG code.

This study addresses these challenges by proposing a new avenue in SVG modeling. We design a model capable of generating unrestricted SVG code, focusing on directly rendering vector graphics within the SVG code space, bypassing the constraints of previous methodologies.

Language Models for Code Generation.

CodeLLMs, or large language models for code generation, have gained significant popularity in recent literature due to advances in natural language processing (NLP) and transformer architectures [75], such as the GPT [55, 9, 51] and Llama [73, 74] families. The extensive availability of code datasets [8, 32, 27, 36] has allowed training powerful CodeLLMs that have changed the way software developers do their work [17]. Codex [14] learns to generate Python functions based on input docstrings and evaluates the correctness of the generated code samples automatically through unit tests. Codegen [50] studies multi-turn program synthesis under scaling laws, offering a family of models trained in several programming languages. StarCoder [40] presents a series of models with various sizes, trained on several coding languages using a fill-in-the-middle objective.

Despite the popularity of SVG, the SVG language has typically been avoided when training large coding models [2, 40] (possibly to prioritize more crowded coding communities). This research seeks to bridge this gap by fine-tuning a proficient CodeLLM specifically on SVG code. Furthermore, we integrate a vision module to facilitate the pre-training task of image-to-SVG conversion.

Multimodal Tasks and Models

In recent years, there have been numerous works at the intersection of vision and language on topics like image captioning [37, 38, 39], visual question answering (VQA) [3], contrastive learning [57, 15] or text-to-image generation [59, 25, 60, 64]. To obtain visual features, some multimodal models [48, 39, 57] use Vision Transformers (ViT) [24]. Convolutional image encoders like ConvNext [44] or VQGAN [25], which aim to preserve more detailed features from images, have also been explored [25, 57, 64]. Some models like Flamingo [1], MAPL [48] or BLIP2 [39] use an intermediate mapping module to convert image features into fixed-size token embeddings. Similar to ours, Llava [43] obtains a set of visual tokens by projecting ViT features directly into the LLM embedding space.

While the majority of multimodal research has primarily been centered around a fusion of images and natural text [57, 39, 1, 59, 48], there has been relatively less attention to the process of translating images into code, except for a few studies that convert web pages to HTML [20], images to LaTeX markup [21], GUI screenshots to code [7, 5], and the generation of scientific vector graphics through TikZ [6]. This progress suggests that handling image generation as a coding task is an appealing solution. Our work explores different image-encoding techniques for the vision part and uses all available visual tokens to condition a StarCoder CodeLLM on images.

3 StarVector

This section describes StarVector, a multimodal code generation model that accepts images and generates compilable SVG code associated with them. We formulate the task of SVG generation as a sequence modeling problem, where sequences of tokens corresponding to the image and SVG domains are concatenated and modeled autoregressively. The proposed architecture is shown in Figure 2. StarVector integrates an Image Encoder, i.e., CLIP, with a CodeLLM, i.e., StarCoder, through an Adapter layer. The Adapter converts image representations into visual tokens aligned with the SVG token embeddings used in the CodeLLM. After fine-tuning, image and text token embeddings share the same representation space, and the CodeLLM acquires SVG generation proficiency through next-token prediction, providing an effective solution for image-to-SVG conversion.

3.1 Image Encoder and Visual Tokens

The efficacy of our model relies heavily on the image encoder, which needs to preserve intricate details and semantic content in the original image because, unlike captioning, where the output is typically short, SVG generation requires generating much longer sequences of code to obtain results of higher complexity. The image encoder projects the input image into a set of 2D embeddings rich in fine-grained details. To choose the best encoder, we draw inspiration from the success of pre-trained encoder backbones in downstream computer vision tasks such as classification [57], retrieval, and generation [25], including both convolutional and transformer-based models. Specifically, we experiment with CLIP ViT-L/14 [57], ConvNext [44] (both pre-trained on LAION-2B [66]), and VQGAN [25], which we pre-train on an image reconstruction task using raster images from SVG-Stack. As the output of the image encoder, we use all the available hidden representations in the last layers to obtain the richest features. We define the output of the encoder $z_v$ as a flattened 2D grid of $L_v$ embeddings. For CLIP, we have $L_v = 257$ embeddings, including the CLS token. For VQGAN, we use the pre-quantization layers and flatten them to obtain $L_v = 196$ embeddings. For ConvNext, we flatten the last activation map to obtain $L_v = 49$ embeddings.
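The sequence lengths quoted above follow from the spatial size of each encoder's output. The short sketch below reproduces the arithmetic under stated assumptions: a 224x224 input (the rasterization resolution reported for SVG-Stack), a 14-pixel patch for CLIP ViT-L/14, and 14x14 and 7x7 output grids for VQGAN and ConvNext. The two grid sizes are inferred from the values 196 and 49 and are not stated in the text; the function names are hypothetical.

```python
def vit_token_count(image_size, patch_size, cls_token=True):
    """Embeddings produced by a ViT: one per patch, plus an optional CLS token."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if cls_token else 0)


def grid_token_count(height, width):
    """Embeddings obtained by flattening an H x W feature map."""
    return height * width


# Assumed input resolution and grid sizes (see the paragraph above).
L_v = {
    "CLIP ViT-L/14": vit_token_count(224, 14),  # 16 * 16 patches + CLS
    "VQGAN": grid_token_count(14, 14),          # assumed 14 x 14 pre-quantization grid
    "ConvNext": grid_token_count(7, 7),         # assumed 7 x 7 last activation map
}

for name, length in L_v.items():
    print(f"{name}: L_v = {length}")
```

Under these assumptions the script prints 257, 196, and 49, matching the values given for the three encoders.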

Adapter. The Adapter module performs a non-linear projection of the image embeddings into the LLM embedding space, producing a set of visual token embeddings (or visual tokens). This transformation matches the embedding dimensionality and aligns the image representations with the language model’s embedding space, effectively bridging the visual and SVG code modalities for the generation task. The Adapter is composed of a sequence of fully connected (FC) layers with Swish [58] activation functions and Batch Normalization [33].
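To make the shapes concrete, here is a minimal pure-Python sketch of such an adapter. It is an illustration rather than the implementation used in this paper: the number of layers (two), their ordering (FC, Swish, normalization, FC), the hidden size, the absence of learnable normalization parameters, and the toy dimensions are all assumptions, since the text only states that FC layers, Swish, and Batch Normalization are used.

```python
import math
import random


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def linear(rows, weight, bias):
    """Apply a fully connected layer to every embedding (row) independently.

    `weight` holds one list of input weights per output unit.
    """
    return [
        [sum(x * w for x, w in zip(row, w_out)) + b for w_out, b in zip(weight, bias)]
        for row in rows
    ]


def batch_norm(rows, eps=1e-5):
    """Normalize each feature to zero mean and unit variance across the rows."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(dims)]
    return [
        [(r[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for r in rows
    ]


def adapter(z_v, w1, b1, w2, b2):
    """Map image embeddings (L_v x D_v) to visual tokens (L_v x D_l)."""
    hidden = [[swish(v) for v in row] for row in linear(z_v, w1, b1)]
    hidden = batch_norm(hidden)
    return linear(hidden, w2, b2)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical): L_v = 5 embeddings, D_v = 4, hidden = 6, D_l = 3.
rng = random.Random(0)
z_v = random_matrix(5, 4, rng)
h_v = adapter(
    z_v,
    random_matrix(6, 4, rng), [0.0] * 6,
    random_matrix(3, 6, rng), [0.0] * 3,
)
print(len(h_v), len(h_v[0]))  # 5 3
```

The sequence length is unchanged by the adapter; only the embedding dimension changes, so that each visual token has the same size as an SVG token embedding.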

3.2 CodeLLM

The CodeLLM generates the complete SVG code conditioned on the visual tokens representing the image. We employ the StarCoder architecture by Li et al. [40] with pre-trained weights, which provide a general model for code completion tasks. StarCoder is a decoder-only architecture that uses Multi-Query Attention [68] for efficient sampling. To address the long sequence lengths and high memory demands, we use flash-attention [18], enabling fine-tuning StarCoder with a context length of 8,192 tokens, the only restriction of our models. This approach mitigates the quadratic complexity typically associated with neural attention in long sequences. The fine-tuning process updates all the model weights to overcome the distribution shift from the original pre-training task of general code generation to our specific task of image-to-SVG conversion. We emphasize that the pre-trained StarCoder is not trained to generate SVG code and thus needs to be fine-tuned end-to-end.

Training Process.

During training, we first encode images $x$ with the image encoder $E$ as $E(x)$, which returns hidden 2D features $z_v$ of dimension $L_v \times D_v$, where $L_v$ is the sequence length and $D_v$ the embedding size. The adapter $A$ projects $z_v$ into the CodeLLM dimensionality as $A(z_v)$, resulting in visual tokens $h_v$ of dimensionality $L_v \times D_l$, where $D_l$ is the internal dimensionality of the CodeLLM. The ground truth SVG code is also tokenized and embedded into the CodeLLM space, as $h_l$, with the same embedding dimensionality as the visual tokens $h_v$. During training, we concatenate visual and SVG token embeddings, and the sequence is modeled using the standard language modeling training objective, i.e., next-token prediction using SVG code as supervision. During inference, we only compute visual tokens from images and decode autoregressively from the CodeLLM with $h_v$ as context.
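The objective can be sketched in a few lines of plain Python. This is a toy illustration under assumptions, not the training code of this paper: positions occupied by visual tokens are excluded from the loss with the label value -100 (a common convention, for example the default `ignore_index` of PyTorch's cross-entropy loss), and the vocabulary size, token ids, and uniform logits are invented.

```python
import math

IGNORE = -100  # assumed label value for positions that carry no supervision


def build_labels(num_visual_tokens, svg_token_ids):
    """Labels for the concatenated sequence [visual tokens ; SVG tokens]."""
    return [IGNORE] * num_visual_tokens + list(svg_token_ids)


def next_token_loss(logits, labels):
    """Mean cross-entropy where logits[t] predicts labels[t + 1]."""
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE:
            continue  # visual positions are context only
        row = logits[t]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
        count += 1
    return total / count


# Toy setup: L_v = 3 visual tokens, 3 SVG tokens, vocabulary of 4 symbols.
labels = build_labels(3, [2, 0, 1])
logits = [[0.0] * 4 for _ in labels]  # an untrained model with uniform predictions
loss = next_token_loss(logits, labels)
print(labels)          # [-100, -100, -100, 2, 0, 1]
print(round(loss, 4))  # 1.3863, i.e. log(4)
```

Note that the first SVG token is predicted from the last visual position, which is how the image conditions the generated code in this formulation.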

4 SVG-Bench: Benchmark for SVG Validation

We propose SVG-Bench, a unified benchmark for assessing SVG synthesis models composed of tasks, datasets, and metrics. Here we evaluate the proposed StarVector and baselines on the task of image-to-SVG generation. This task has been the primary benchmark in previous works and assesses the model’s ability to generate SVG samples that resemble the input image in the pixel space. We aim to define a standardized evaluation setting for SVG generation, building upon popular practices in the recent literature [45, 13, 61]. In summary, we compile the popular datasets and metrics used in prior works and propose new ones, which together define SVG-Bench.

| Dataset | Train | Val | Test | Test (sim) | Source | Avg. Token Length | SVG Primitives |
|---|---|---|---|---|---|---|---|
| SVG-Fonts | 1,831,857 | 91,593 | 4,821 | 3,745 | Glyphazzn [45] | 2,121 ± 1,868 | Vector path |
| SVG-Icons | 80,442 | 6,256 | 2,449 | 1,277 | DeepSVG [13] | 2,449 ± 1,543 | Vector path |
| SVG-Emoji | 8,708 | 667 | 668 | 96 | OpenMoji, NotoEmoji, TweMoji | 2,551 ± 1,805 | All |
| SVG-Stack | 2,169,710 | 108,456 | 5,709 | 1,585 | TheStack [36] | 1,822 ± 1,808 | All |

Table 1: Datasets in SVG-Bench. We show the number of samples per split, with an additional reduced test set composed of simplified SVGs. We list the source from which each dataset was acquired and statistics on the length of the SVG code in tokens, computed with the StarCoder tokenizer [40]. Finally, we show the type of SVG primitives used in the datasets. SVG-Emoji and SVG-Stack are introduced in this paper. See Appendix 8 for statistics and sample visualizations from the datasets.

| Model | Input | Output | Architecture | SVG Simplification | Seq Format | SVG commands | SVG primitives |
|---|---|---|---|---|---|---|---|
| VTracer [78] | Image | SVG | Clustering + curve fitting | | Commands | M, L, C | Vector path |
| DeepSVG [13] | SVG | SVG | Transformer | | Commands | M, L, C | Vector path |
| Im2Vec [61] | Image | SVG | RNN | | Keypoints | M, L, C | Vector path |
| GPT-4 Vision [51] | Image | SVG | Multimodal LLM | | SVG code | All | All |
| StarVector (ours) | Image | SVG | Multimodal LLM | | SVG code | All | All |

Table 2: Baseline comparison. While prior works consider only three simple commands, M (Move), L (Line), and C (Curve), our model can in principle handle all types of complex SVG commands.

4.1 Datasets

To encompass varying degrees of visual complexity across different colors, shapes, and text, we select datasets comprising examples of fonts, emojis, icons, and real-world examples, e.g., the ones seen on websites. The datasets included in SVG-Bench are visually diverse and frequently used by digital artists in real-life scenarios. We use SVG-Fonts, introduced as Glyphazzn [45], and SVG-Icons, introduced in DeepSVG [13]. In the absence of a standard benchmark, we create different splits for all the datasets, which we release to the community for further research. Below we describe the two datasets we propose for SVG generation.

SVG-Emoji

We propose SVG-Emoji, a dataset of 10k image-SVG pairs created by collating multiple smaller emoji datasets from different sources into a unified dataset. Specifically, we include examples from TweMoji (https://github.com/twitter/twemoji), OpenMoji (https://openmoji.org/) and NotoEmoji (https://github.com/googlefonts/noto-emoji), where we also provide information about their class and the corresponding caption.

SVG-Stack

A key contribution of this work is SVG-Stack, a comprehensive, large-scale dataset of real-world SVGs, which we introduce as part of SVG-Bench. SVG-Stack is sourced from The Stack [36], a diverse dataset comprising code samples from various software languages, collated from public GitHub repositories (https://huggingface.co/spaces/bigcode/in-the-stack). Our selection builds upon the initial filtering and de-duplication processes conducted in [36, 2, 40]. We perform additional filtering to remove possible duplicates from other SVG datasets in the benchmark. We extracted SVG code from The Stack and rasterized it at 224x224 resolution to obtain ground truth images. This large collection of images, in conjunction with the SVG code, forms the foundation for pre-training StarVector, enabling it to learn the image-to-SVG conversion task effectively.

Table 1 shows the dataset statistics defined in SVG-Bench. We create splits for train, validation, and test. We also create another test split using a pipeline of filtering and simplification to be consistent with the other baselines.

4.2 Evaluation Protocol

In evaluating SVG models, it is crucial to employ metrics that accurately measure the fidelity of the generated SVG with the original image, considering both vector and raster-pixel representations. Traditional pixel-based metrics may not be entirely adequate for SVG assessment, as the predominance of background colors can skew them. For instance, a simple SVG with a predominantly white background might misleadingly score well in these metrics. To address this, our evaluation framework also includes deep perceptual-based metrics and vector-space metrics, offering a more comprehensive and nuanced assessment of SVG conversion quality. We compute the following metrics:

  • Pixel-based metrics. We employ Mean Squared Error (MSE) and Structural Similarity Index (SSIM) [81, 80]. MSE quantifies the average squared difference between the generated and the original images’ pixels, offering a straightforward measure of pixel-level accuracy. SSIM evaluates image quality based on the understanding of visual perception, measuring changes in structural information, luminance, and contrast.

  • Vector-based metrics. We utilize Chamfer distance (CD), a metric adapted from point cloud literature [84]. This involves discretizing each SVG into a set of points by sampling paths at regular intervals. CD measures the average nearest-point distance between the points of two SVGs, providing a quantitative measure of similarity. A smaller CD indicates that the two SVGs are more similar, while a larger distance suggests they are more distinct. Given two SVGs $s_1$ and $s_2$, each defined as a set of 2D points $p_1 \in s_1$ and $p_2 \in s_2$, the CD is defined as

    $$c(s_1, s_2) = \frac{1}{|s_1|} \sum_{p_1 \in s_1} \min_{p_2 \in s_2} \|p_1 - p_2\|_2^2 + \frac{1}{|s_2|} \sum_{p_2 \in s_2} \min_{p_1 \in s_1} \|p_2 - p_1\|_2^2, \tag{1}$$

    where $|s_i|$ is the cardinality of set $s_i$, and $\|\cdot\|_2^2$ is the squared Euclidean norm.

  • Perceptual-based Metrics. We incorporate the Learned Perceptual Image Patch Similarity (LPIPS) [87] metric, which uses deep learning models trained on human perceptual judgments. This metric is particularly effective in capturing the nuances of human visual perception, providing a more subjective assessment of image quality beyond mere pixel-level comparisons.
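Two of the metrics above are simple enough to sketch directly. The following pure-Python example is illustrative only and is not the evaluation code of SVG-Bench: it computes MSE on toy grayscale images to show how a mostly white background can make a prediction that misses the shape entirely still score a low error, and it computes the Chamfer distance of Eq. (1) on points sampled from two straight segments, which stand in for sampled SVG paths. All names and values are assumptions made for the example.

```python
def mse(img_a, img_b):
    """Mean squared error between two equally sized grayscale images in [0, 1]."""
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(img_a, img_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


def sq_dist(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def chamfer_distance(s1, s2):
    """Symmetric Chamfer distance between two non-empty 2D point sets."""
    term_1 = sum(min(sq_dist(p1, p2) for p2 in s2) for p1 in s1) / len(s1)
    term_2 = sum(min(sq_dist(p2, p1) for p1 in s1) for p2 in s2) / len(s2)
    return term_1 + term_2


def sample_segment(start, end, n):
    """Sample n >= 2 evenly spaced points on a straight segment."""
    return [
        (
            start[0] + (end[0] - start[0]) * i / (n - 1),
            start[1] + (end[1] - start[1]) * i / (n - 1),
        )
        for i in range(n)
    ]


# MSE: a 10 x 10 white image with a 2 x 2 black square vs. an all-white prediction.
target = [[1.0] * 10 for _ in range(10)]
for r in (4, 5):
    for c in (4, 5):
        target[r][c] = 0.0
blank = [[1.0] * 10 for _ in range(10)]
print(mse(target, blank))  # 0.04, although the shape is missing entirely

# Chamfer distance: two parallel horizontal segments one unit apart.
s1 = sample_segment((0, 0), (4, 0), 5)
s2 = sample_segment((0, 1), (4, 1), 5)
print(chamfer_distance(s1, s2))  # 2.0
print(chamfer_distance(s1, s1))  # 0.0
```

The nested loops make this quadratic in the number of sampled points, which is acceptable for a sketch; a practical implementation would typically use a vectorized or tree-based nearest-neighbor search.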

4.3 Baselines

Here we describe the baselines used to compare StarVector’s performance in the task of image-to-SVG conversion. We consider previous deep learning-based methods and rule-based traditional methods. We evaluate the baselines with publicly available code in our proposed setup.

Im2Vec [61] uses an end-to-end VAE, trained using only image supervision to produce vector graphics. The input rasterized image is encoded into a ‘global’ latent vector, which is passed to an RNN to produce a latent code for each path. The path decoder decodes these path codes into Bezier paths to generate the output SVG. We used the publicly available code (https://github.com/preddy5/Im2Vec) to report the results.

DeepSVG [13] was introduced as a hierarchical path-based VAE encoder-decoder transformer architecture. Here, input paths are encoded separately by a path encoder, and the resulting encodings are aggregated by a second encoder to produce a latent vector. The decoder uses this latent vector to output the path representations, which provide the actual draw commands and arguments. We used the open-source code (https://github.com/alexandre01/deepsvg) to reproduce the results on different datasets. However, since the DeepSVG framework only allows simplified SVGs, we report results on the ‘simplified’ test set in Table 5.

VTracer [78] (https://github.com/visioncortex/vtracer) is a rule-based algorithm to convert images to SVGs. This 3-step pipeline algorithm relies on the hierarchical clustering of images, which are traced into vectors. First, pixels are converted into paths, which are simplified into polygons. In the last step, polygons are smoothed and approximated with a Bezier curve fitter.

GPT-4-Vision (Preview) [52] was recently released by OpenAI as a vision-based multimodal model, with a limit of 100 API calls per day in the preview mode. We use GPT-4V by inserting an image and zero-shot prompting to generate SVG code. See Appendix 11 for more details.

5 Experiments and Results

This section presents the main experiments performed with the StarVector model on SVG-Bench. We report results on the simplified test set as well as the full test set for the metrics defined in Section 4.2. We also ablate our model with different image encoders and data augmentation. Finally, we consider the effect of pre-training on SVG-Stack and fine-tuning on other datasets.

We use HuggingFace Transformers [82] and PyTorch [53] for the implementation. We reproduce baseline models from official repositories, respecting the proposed hyperparameters (see Appendix 11 for more detail). All experiments were done using 4 A100 80GB GPUs. We use a batch size of 2 with a gradient accumulation of 8 steps and a learning rate of $5 \times 10^{-4}$ for training. Models trained on SVG-Stack with AdamW optimizer [46] require approximately 5 days of training.

Table 3: Results on simplified (sim) datasets for the task of image-to-SVG conversion for different methods across SVG-Bench. Bold cells display the best model, and underlined cells show the second place (across all tables).

| Method | Fonts (sim) MSE ↓ | Fonts (sim) CD ↓ | Fonts (sim) LPIPS ↓ | Fonts (sim) SSIM ↑ | Emojis (sim) MSE ↓ | Emojis (sim) CD ↓ | Emojis (sim) LPIPS ↓ | Emojis (sim) SSIM ↑ | Icons (sim) MSE ↓ | Icons (sim) CD ↓ | Icons (sim) LPIPS ↓ | Icons (sim) SSIM ↑ | Stack (sim) MSE ↓ | Stack (sim) CD ↓ | Stack (sim) LPIPS ↓ | Stack (sim) SSIM ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VTracer [78] | 0.014 | 5.631 | 0.044 | 0.946 | 0.018 | 4.494 | 0.064 | 0.911 | 0.009 | 3.885 | 0.052 | 0.952 | 0.016 | 4.983 | 0.061 | 0.918 |
| DeepSVG [13] | 0.046 | 3.747 | 0.163 | 0.823 | 0.069 | 5.486 | 0.278 | 0.748 | 0.04 | 2.492 | 0.153 | 0.851 | 0.066 | 4.804 | 0.235 | 0.736 |
| Im2Vec [61] | 0.031 | 133.977 | 0.187 | 0.871 | 0.042 | 26.457 | 0.258 | 0.826 | 0.029 | 146.616 | 0.178 | 0.885 | 0.043 | 138.031 | 0.258 | 0.813 |
| GPT-4 Vision (100 examples) | 0.091 | 65.103 | 0.248 | 0.755 | 0.099 | 52.206 | 0.268 | 0.701 | 0.128 | 50.649 | 0.271 | 0.709 | 0.131 | 55.455 | 0.28 | 0.668 |
| StarVector (ours) | 0.019 | 1.435 | 0.043 | 0.93 | 0.038 | 1.005 | 0.073 | 0.859 | 0.018 | 0.665 | 0.036 | 0.931 | 0.038 | 2.208 | 0.098 | 0.858 |

Table 4: Results on complete datasets for the task of image-to-SVG conversion. Metrics are computed on the full test sets of SVG-Bench. DeepSVG is not included as it does not support complex SVG images.

| Method | Fonts MSE ↓ | Fonts CD ↓ | Fonts LPIPS ↓ | Fonts SSIM ↑ | Emojis MSE ↓ | Emojis CD ↓ | Emojis LPIPS ↓ | Emojis SSIM ↑ | Icons MSE ↓ | Icons CD ↓ | Icons LPIPS ↓ | Icons SSIM ↑ | Stack MSE ↓ | Stack CD ↓ | Stack LPIPS ↓ | Stack SSIM ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VTracer [78] | 0.007 | 4.105 | 0.029 | 0.903 | 0.007 | 8.261 | 0.064 | 0.913 | 0.014 | 3.335 | 0.068 | 0.927 | 0.007 | 6.318 | 0.057 | 0.891 |
| Im2Vec [61] | 0.133 | 144.413 | 0.208 | 0.802 | 0.124 | 39.135 | 0.528 | 0.658 | 0.052 | 145.497 | 0.221 | 0.831 | 0.179 | 141.573 | 0.357 | 0.688 |
| GPT-4 Vision (100 examples) | 0.194 | 27.909 | 0.268 | 0.689 | 0.162 | 21.134 | 0.404 | 0.612 | 0.135 | 49.249 | 0.299 | 0.666 | 0.192 | 16.981 | 0.37 | 0.604 |
| StarVector (ours) | 0.008 | 2.098 | 0.013 | 0.976 | 0.051 | 2.194 | 0.202 | 0.778 | 0.022 | 0.483 | 0.043 | 0.923 | 0.072 | 6.153 | 0.153 | 0.785 |

Figure 3: Results on simplified SVG-Icons and SVG-Fonts test.
Figure 4: Results on SVG-Icons test set
Figure 5: Results on SVG-Emoji test set
Figure 6: Results on SVG-Stack test set

Supplementary Material

In the following, we present a further description of the StarVector architecture, its training process, and how we generate SVG samples from images. We also provide more details about SVG-Bench with the proposed datasets as well as the different baselines within the evaluation setup. We also include additional results and discussions of our method for image-to-SVG generation.
