StarVector: Generating Scalable Vector Graphics Code from Images
Abstract
Scalable Vector Graphics (SVGs) have become integral in modern image rendering applications due to their infinite scalability in resolution, versatile usability, and editing capabilities. SVGs are particularly popular in the fields of web development and graphic design. Existing approaches for SVG modeling using deep learning often struggle with generating complex SVGs and are restricted to simpler ones that require extensive processing and simplification. This paper introduces StarVector, a multimodal SVG generation model that effectively integrates Code Generation Large Language Models (CodeLLMs) and vision models. Our approach utilizes a CLIP image encoder to extract visual representations from pixel-based images, which are then transformed into visual tokens via an adapter module. These visual tokens are pre-pended to the SVG token embeddings, and the sequence is modeled by the StarCoder model using next-token prediction, effectively learning to align the visual and code tokens. This enables StarVector to generate unrestricted SVGs that accurately represent pixel images. To evaluate StarVector’s performance, we present SVG-Bench, a comprehensive benchmark for evaluating SVG methods across multiple datasets and relevant metrics. Within this benchmark, we introduce novel datasets including SVG-Stack, a large-scale dataset of real-world SVG examples, and use it to pre-train StarVector as a large foundation model for SVGs. Our results demonstrate significant enhancements in visual quality and complexity handling over current methods, marking a notable advancement in SVG generation technology. Code and models: https://github.com/joanrod/star-vector
¹ServiceNow Research, ²Mila - Quebec AI Institute, ³Canada CIFAR AI Chair, ⁴ÉTS, Montréal, Canada, ⁵UBC, Vancouver, Canada, ⁶Apple MLR, Barcelona, Spain
* External collaboration
1 Introduction
Vector Graphics represent an archetypal form of image representation, where visual compositions are constituted by primitive shapes such as vector paths, curves, or polygons, parametrized by mathematical equations [41]. This contrasts starkly with raster graphics, where images are represented as pixels on a grid. The primary advantage of vector graphics lies in their ability to maintain high precision and consistent visual quality across various resolutions, as they can be scaled arbitrarily without any loss of detail [47, 34].
In the realm of modern image rendering, Scalable Vector Graphics (SVGs) [54] have become a standard for encapsulating vector graphics in a code-based format. SVGs are the preferred choice for many artistic use cases like icon creation or typography. This format has gained prominence in applications demanding fast, efficient, and high-quality image rendering. In web design, SVGs contribute to faster rendering and better image compression owing to their inherently small file sizes. They also offer dynamic editability, allowing for straightforward modifications in color and size, which is crucial for accessibility and responsive design. For graphic design and scientific visualization, SVGs are prized for their visual quality, facilitating the creation of versatile designs and ensuring high-quality print outputs.
The SVG format utilizes Extensible Markup Language (XML) [26] syntax to define vector graphics, offering a rich palette for a broad range of graphical properties and effects. Central to SVG is the vector path (or simply path), comprised of points and control points connected by mathematically defined lines or curves, allowing detailed control over the shape and size of the graphics. SVGs can also incorporate a variety of other primitives, such as rectangles, ellipses, and text, along with styles and advanced capabilities.
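To make the description of SVG syntax concrete, the following minimal sketch builds a small hand-written SVG that mixes a vector path with other primitives and parses it as XML. The SVG content itself is illustrative and is not taken from any dataset discussed in this paper.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Illustrative SVG: a rectangle, an ellipse, a cubic Bezier path, and text.
# In the path data, "M" moves to a start point and "C" draws a cubic curve
# given two control points and an end point.
svg_code = f"""<svg xmlns="{SVG_NS}" viewBox="0 0 100 100">
  <rect x="10" y="10" width="80" height="80" fill="#eeeeee"/>
  <ellipse cx="50" cy="40" rx="20" ry="12" fill="tomato"/>
  <path d="M 20 80 C 35 60, 65 60, 80 80" stroke="black" fill="none"/>
  <text x="50" y="95" text-anchor="middle" font-size="8">hello</text>
</svg>"""

root = ET.fromstring(svg_code)

# Element tags are namespace-qualified ("{namespace}name"); keep only the name.
primitives = [child.tag.split("}")[1] for child in root]
path_data = root.find(f"{{{SVG_NS}}}path").get("d")

print(primitives)
print(path_data)
```

Because the drawing is plain XML, changing a color or a control point is a text edit on a single attribute, which is the editability property discussed above.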
Despite these advantages, existing deep learning-based generative solutions struggle to produce high-quality, complex SVGs. Current approaches [13, 61, 12] typically model SVGs by learning a latent variable model over command paths. Such methods predominantly utilize simplified SVGs, limited to path commands and often restricted in complexity, with some focusing solely on elementary fonts or icons [83, 79]. Recent advancements involve using powerful image diffusion models [64] to generate raster images, which are then converted into SVG [34]. Nevertheless, this approach involves a costly iterative refinement process and is also limited to paths. Despite these efforts, a gap remains in systems that can directly generate detailed and complex SVG code, leveraging the full range of SVG primitives necessary for intricate designs.
This paper studies the task of image-to-SVG generation (Figure 1), which has been traditionally approached as a form of image vectorization [42, 85], relying predominantly on image processing and curve fitting techniques [41, 78]. Our research diverges from these methods, posing the task as a code-generation problem building upon recent advancements in Large Language Models (LLMs) [9, 74, 10]. Thanks to the success in scaling up transformers [75], these models have shown outstanding downstream abilities in tasks like language understanding [16], summarization [71], or coding [50, 40, 65]. The emergent capabilities of LLMs in code creation are particularly relevant to our work, as shown by Bubeck et al. [10] in a study using GPT-4 [51] on generating SVG code.
In this work, we propose a novel paradigm, where a multimodal LLM learns SVG synthesis as an image-to-code generation task. In this new setting, we tackle the problem of image-to-SVG generation by learning a CLIP [57] image encoder coupled with an adapter model to project images into visual token embeddings (visual tokens) and use them to condition a StarCoder [40] model to generate the SVG code associated with the input image. The StarVector architecture is shown in Figure 2. Addressing SVG generation with a code generation language model (CodeLLM) preserves the richness and versatility of SVG primitives and syntax, as the model can handle unaltered real-world SVGs without any need for simplification. Further, using the SVG code as a representation instead of a latent variable can bring enhanced editing capabilities. We propose the task of image-to-SVG as a pre-training task for building a foundation model [51, 74] for SVG generation.
Contributions.
In summary, our contributions are the following: i) We introduce StarVector, a Large Multimodal Model for code generation, which leverages image and language modalities for generating executable SVG code from images. ii) We present SVG-Bench, a unified evaluation benchmark for SVG generation methods, which facilitates access to popular SVG datasets and metrics. Within this benchmark, we introduce two new datasets namely SVG-Emoji (composed of 10k complex emoji SVGs) and SVG-Stack (a large-scale dataset with over 2M real-world SVGs). iii) We evaluate StarVector and prior baselines on SVG-Bench which focuses on the image-to-SVG generation task. We showcase the ability of our model to generalize to complex SVGs and demonstrate the effectiveness of pre-training StarVector on the large-scale SVG-Stack dataset.
The paper is structured as follows: Section 2 presents previous methods related to our research on SVG generation while Section 3 explains the StarVector method in detail. We present SVG-Bench (with accompanying datasets and metrics) in Section 4, followed by experimental results in Section 5 and conclusions in Section 6.
2 Related Work
This section presents prior research relevant to our study, encompassing methods in vector graphics and SVG generation, developments in CodeLLMs, and advancements in multimodal models that integrate image and textual data.
SVG Generation Methods.
Early efforts in vector graphics (see https://en.wikipedia.org/wiki/Comparison_of_raster-to-vector_conversion_software) predominantly utilized traditional image processing techniques for tasks like image vectorization [23, 85, 42], often involving segmentation and polynomial curve fitting [41, 78]. With the advent of deep learning, new approaches emerged. SVG-VAE [45], a class-conditional Variational Autoencoder (VAE) [35], predicts a latent style vector and generates SVGs using an LSTM decoder [30]. DeepSVG [13] proposes a hierarchical VAE architecture using transformers to represent SVG paths. Im2Vec [61] translates pixel images into latent representations, which can be decoded into paths via a recurrent neural network (RNN). However, latent-based methods are limited to path primitives, thus restricting their scope to a subset of SVGs. Because of this limitation, they tend to generalize poorly and to overfit on complex-looking SVG datasets.
Recent trends in image generation using diffusion [29, 64] or autoregressive [25, 59, 86] models have also been explored in the SVG space. VectorFusion [34] leverages a strong text-to-image diffusion model to find the SVG via iterative optimization. CLIPasso [77] uses a CLIP distance loss to iteratively refine SVG from sketches. Both these solutions can be slow due to their iterative nature. Similar to ours, IconShop [83] trains a BERT [22] model for text-to-SVG conversion on icons, using path commands as tokens of the language model, while we use the SVG code.
This study addresses these challenges by proposing a new avenue in SVG modeling. We design a model capable of generating unrestricted SVG code, focusing on directly rendering vector graphics within the SVG code space, bypassing the constraints of previous methodologies.
Language Models for Code Generation.
CodeLLMs, or large language models for code generation, have gained significant popularity in recent literature due to advances in natural language processing (NLP) and transformer architectures [75], such as the GPT [55, 9, 51] and Llama [73, 74] families. The extensive availability of code datasets [8, 32, 27, 36] has allowed training powerful CodeLLMs that have changed the way software developers do their work [17]. Codex [14] learns to generate Python functions based on input docstrings and evaluates the correctness of the generated code samples automatically through unit tests. Codegen [50] studies multi-turn program synthesis under scaling laws, offering a family of models trained in several programming languages. StarCoder [40] presents a series of models with various sizes, trained on several coding languages using a fill-in-the-middle objective.
Despite the popularity of SVG, the SVG language has typically been left out when training large coding models [2, 40] (possibly to prioritize more crowded coding communities). This research seeks to bridge this gap by fine-tuning a proficient CodeLLM specifically on SVG code. Furthermore, we integrate a vision module to facilitate the pre-training task of image-to-SVG conversion.
Multimodal Tasks and Models
In recent years, there have been numerous works at the intersection of vision and language on topics like image captioning [37, 38, 39], visual question answering (VQA) [3], contrastive learning [57, 15] or text-to-image generation [59, 25, 60, 64]. For obtaining visual features, some multimodal models [48, 39, 57] use Vision Transformers (ViT) [24]. Convolutional image encoders like ConvNext [44] or VQGAN [25], which aim to preserve more detailed features from images, have also been explored [25, 57, 64]. Some models like Flamingo [1], MAPL [48] or BLIP2 [39] use an intermediate mapping module to convert image features into fixed-size token embeddings. Similar to ours, Llava [43] obtains a set of visual tokens by projecting ViT features directly into the LLM embedding space.
While the majority of multimodal research has centered on the fusion of images and natural text [57, 39, 1, 59, 48], relatively little attention has been paid to translating images into code, except for a few studies that convert web pages to HTML [20], images to LaTeX markup [21], GUI screenshots to code [7, 5], and the generation of scientific vector graphics through TikZ [6]. This progress suggests that handling image generation as a coding task is an appealing solution. Our work explores different image-encoding techniques for the vision part and uses all available visual tokens to condition a StarCoder CodeLLM on images.
3 StarVector
This section describes StarVector, a multimodal code generation model that accepts images and generates the compilable SVG code associated with them. We formulate the task of SVG generation as a sequence modeling problem, where sequences of tokens corresponding to the image and SVG domains are concatenated and modeled autoregressively. The proposed architecture is shown in Figure 2. StarVector integrates an Image Encoder (i.e., CLIP) with a CodeLLM (i.e., StarCoder) through an Adapter layer. The Adapter converts image representations into visual tokens aligned with the SVG token embeddings used in the CodeLLM. After fine-tuning, image and text token embeddings share the same representation space, and the CodeLLM acquires SVG generation proficiency through next-token prediction, providing an effective solution for image-to-SVG conversion.
3.1 Image Encoder and Visual Tokens
The efficacy of our model relies heavily on the image encoder, which needs to preserve intricate details and semantic content in the original image. Unlike captioning, where the output is typically short, SVG generation requires generating much longer sequences of code to obtain results of higher complexity. The image encoder projects the input image into a set of 2D embeddings rich in fine-grained details. To choose the best encoder, we draw inspiration from the success of pre-trained encoder backbones in downstream computer vision tasks such as classification [57], retrieval, and generation [25], including both convolutional and transformer-based models. Concretely, we experiment with CLIP ViT-L/14 [57], ConvNext [44] (both pre-trained on LAION-2B [66]), and VQGAN [25], which we pre-train on an image reconstruction task using raster images from SVG-Stack. As the output of the image encoder, we utilize all the available hidden representations in the last layers to obtain the richest features. We define the output of the encoder as a flattened 2D grid of embedding sequences. For CLIP we have embeddings, including the CLS token. For VQGAN, we use the pre-quantization layers and flatten them to obtain embeddings. For ConvNext, we flatten the last activation map to obtain embeddings.
Adapter. The Adapter module performs a non-linear projection of the image embeddings into the LLM embedding space, producing a set of visual token embeddings (or visual tokens). This transformation matches the embedding dimensionality and aligns the image representations with the language model’s embedding space, effectively bridging the visual and SVG code modalities for the generation task. The Adapter is composed of a sequence of fully connected (FC) layers with the Swish [58] activation function and Batch Normalization [33].
3.2 CodeLLM
The CodeLLM generates the complete SVG code conditioned on the visual tokens representing the image. We employ the StarCoder architecture by Li et al. [40] with pre-trained weights, which provide a general model for code completion tasks. StarCoder is a decoder-only architecture that uses Multi-Query Attention [68] for efficient sampling. To address the long sequence lengths and high memory demands, we use flash-attention [18], enabling fine-tuning StarCoder with a context length of 8,192 tokens, the only restriction of our models. This approach mitigates the quadratic memory cost typically associated with attention over long sequences. The fine-tuning process updates all the model weights to overcome the distribution shift from the original pre-training task of general code generation to our specific task of image-to-SVG conversion. We emphasize that the pre-trained StarCoder is not trained to generate SVG code and thus needs to be fine-tuned end-to-end.
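To give a sense of scale for the 8,192-token context, the sketch below computes the size of one attention score matrix if it were materialized in full. The 2-byte element size (half precision) is an assumption for illustration and is not a detail stated in this paper; memory-efficient kernels such as flash-attention avoid storing this matrix.

```python
def attention_scores_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to store one full seq_len x seq_len attention score matrix.

    bytes_per_element=2 assumes half-precision storage (an assumption).
    """
    return seq_len * seq_len * bytes_per_element


# Memory per attention head, per layer, per sequence, in MiB.
sizes_mib = {n: attention_scores_bytes(n) / 2**20 for n in (1024, 2048, 8192)}
for n, mib in sizes_mib.items():
    print(f"{n:>5} tokens: {mib:.0f} MiB")
```

Going from 1,024 to 8,192 tokens multiplies the sequence length by 8 and this matrix by 64, which is the quadratic growth the paragraph above refers to.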
Training Process.
During training, we first encode an image $x$ with the image encoder $E$ as $z_v = E(x)$, which returns hidden 2D features of dimension $S_v \times D_v$, where $S_v$ is the sequence length and $D_v$ the embedding size. The adapter $A$ projects $z_v$ into the CodeLLM dimensionality as $h_v = A(z_v)$, resulting in visual tokens of dimensionality $S_v \times D_l$, where $D_l$ is the internal dimensionality of the CodeLLM. The ground truth SVG code is also tokenized and embedded into the CodeLLM space as $h_l$, with the same embedding dimensionality $D_l$ as the visual tokens $h_v$. During training, we concatenate visual and SVG token embeddings, and the sequence is modeled using the standard language modeling training objective, i.e., next-token prediction using SVG code as supervision. During inference, we only compute visual tokens from images and decode autoregressively from the CodeLLM with $h_v$ as context.
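The bookkeeping behind this objective can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: it assumes that only SVG tokens are used as targets (consistent with "SVG code as supervision"), and it borrows the common convention of marking unsupervised positions with the label -100. The function name and the `<vis>` placeholder are hypothetical.

```python
IGNORE = -100  # assumed convention: positions with this label add no loss


def build_training_example(num_visual_tokens, svg_token_ids, max_len=8192):
    """Lay out inputs and next-token labels for the sequence [visual ; SVG].

    Visual positions carry adapter embeddings rather than vocabulary ids,
    so they are shown here with a placeholder string.
    """
    if num_visual_tokens < 1:
        raise ValueError("at least one visual token is required")
    total = num_visual_tokens + len(svg_token_ids)
    if total > max_len:
        raise ValueError(f"sequence of {total} tokens exceeds context {max_len}")

    inputs = ["<vis>"] * num_visual_tokens + list(svg_token_ids)
    # The label at position t is the token at position t + 1. Visual tokens
    # are never targets; the last visual position predicts the first SVG
    # token; the final position has nothing left to predict.
    labels = [IGNORE] * (num_visual_tokens - 1) + list(svg_token_ids) + [IGNORE]
    return inputs, labels


inputs, labels = build_training_example(3, [11, 12, 13])
print(inputs)
print(labels)
```

At inference time only the visual prefix is supplied, and the SVG tokens are sampled one at a time from the model's predictions.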
4 SVG-Bench: Benchmark for SVG Validation
We propose SVG-Bench, a unified benchmark for assessing SVG synthesis models composed of tasks, datasets, and metrics. Here we evaluate the proposed StarVector and baselines on the task of image-to-SVG generation. This task has been the primary benchmark in previous works and assesses the model’s ability to generate SVG samples that resemble the input image in the pixel space. We aim to define a standardized evaluation setting for SVG generation, building upon popular practices in the recent literature [45, 13, 61]. In summary, we compile the popular datasets and metrics used in prior works and propose new ones; together, these define SVG-Bench.
Table 1: Dataset statistics in SVG-Bench.
Dataset | Train | Val | Test | Test_sim | Source | Avg. Token Length (mean ± std) | SVG Primitives |
SVG-Fonts | 1,831,857 | 91,593 | 4,821 | 3,745 | Glyphazzn [45] | 2,121 ± 1,868 | Vector path |
SVG-Icons | 80,442 | 6,256 | 2,449 | 1,277 | DeepSVG [13] | 2,449 ± 1,543 | Vector path |
SVG-Emoji | 8,708 | 667 | 668 | 96 | OpenMoji, NotoEmoji, TweMoji | 2,551 ± 1,805 | All |
SVG-Stack | 2,169,710 | 108,456 | 5,709 | 1,585 | TheStack [36] | 1,822 ± 1,808 | All |
Model | Input | Output | Architecture | SVG Simplification | Seq Format | SVG commands | SVG primitives |
VTracer [78] | Image | SVG | Clustering + curve fitting | ✓ | Commands | M, L, C | Vector path |
DeepSVG [13] | SVG | SVG | Transformer | ✓ | Commands | M, L, C | Vector path |
Im2Vec [61] | Image | SVG | RNN | ✓ | Keypoints | M, L, C | Vector path |
GPT-4 Vision [51] | Image | SVG | Multimodal LLM | | SVG code | All | All |
StarVector (ours) | Image | SVG | Multimodal LLM | | SVG code | All | All |
4.1 Datasets
To encompass varying degrees of visual complexity across different colors, shapes, and text, we select datasets comprising examples of fonts, emojis, icons, and real-world examples such as the ones seen on websites. The datasets included in SVG-Bench are visually diverse and frequently used by digital artists in real-life scenarios. We use SVG-Fonts, introduced as Glyphazzn [45], and SVG-Icons, introduced in DeepSVG [13]. In the absence of a standard benchmark, we create different splits for all the datasets, which we release to the community for further research. Below we describe the two datasets we propose for SVG generation.
SVG-Emoji
We propose SVG-Emoji, a dataset of 10k image-SVG pairs created by collating multiple smaller emoji datasets from different sources into a unified dataset. Specifically, we include examples from TweMoji (https://github.com/twitter/twemoji), OpenMoji (https://openmoji.org/) and NotoEmoji (https://github.com/googlefonts/noto-emoji), where we also provide information about their class and the corresponding caption.
SVG-Stack
A key contribution of this work is SVG-Stack, a comprehensive, large-scale dataset of real-world SVGs, which we introduce as part of SVG-Bench. SVG-Stack is sourced from The Stack [36], a diverse dataset comprising code samples from various software languages, collated from public GitHub repositories (https://huggingface.co/spaces/bigcode/in-the-stack). Our selection builds upon the initial filtering and de-duplication processes conducted in [36, 2, 40]. We perform additional filtering to remove possible duplicates from other SVG datasets in the benchmark. We extracted SVG code from The Stack and rasterized it at 224x224 resolution to obtain ground truth images. This large number of images, in conjunction with the SVG code, forms the foundation for pre-training StarVector, enabling it to learn the image-to-SVG conversion task effectively.
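The paper does not describe how duplicates against the other benchmark datasets are detected. One simple possibility, shown here purely as an assumption, is exact matching after whitespace normalization; the function names below are hypothetical and the actual pipeline may differ.

```python
import hashlib
import re


def svg_fingerprint(svg_code: str) -> str:
    """Hash of the SVG text with whitespace runs collapsed (assumed normalization)."""
    normalized = re.sub(r"\s+", " ", svg_code).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_benchmark_duplicates(candidates, benchmark_svgs):
    """Keep candidates that match neither a benchmark SVG nor an earlier candidate."""
    seen = {svg_fingerprint(s) for s in benchmark_svgs}
    kept = []
    for svg in candidates:
        fingerprint = svg_fingerprint(svg)
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(svg)
    return kept


benchmark = ['<svg><rect width="4" height="4"/></svg>']
candidates = [
    '<svg>\n  <rect width="4"   height="4"/>\n</svg>',  # differs from benchmark only inside attribute spacing/newlines
    '<svg><rect width="4" height="4"/></svg>',           # exact benchmark duplicate
    '<svg><circle r="2"/></svg>',                        # new
    '<svg><circle r="2"/></svg>',                        # repeated candidate
]
kept = remove_benchmark_duplicates(candidates, benchmark)
print(len(kept))
```

Note that this normalization only collapses whitespace runs to a single space, so the first candidate above (which has whitespace between tags that the benchmark entry lacks) is still treated as distinct; catching such near-duplicates would require parsing the XML or comparing rasterized images.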
Table 1 shows the dataset statistics defined in SVG-Bench. We create splits for train, validation, and test. We also create another test split using a pipeline of filtering and simplification to be consistent with the other baselines.
4.2 Evaluation Protocol
In evaluating SVG models, it is crucial to employ metrics that accurately measure the fidelity of the generated SVG with the original image, considering both vector and raster-pixel representations. Traditional pixel-based metrics may not be entirely adequate for SVG assessment, as the predominance of background colors can skew them. For instance, a simple SVG with a predominantly white background might misleadingly score well in these metrics. To address this, our evaluation framework also includes deep perceptual-based metrics and vector-space metrics, offering a more comprehensive and nuanced assessment of SVG conversion quality. We compute the following metrics:
• Pixel-based metrics. We employ Mean Squared Error (MSE) and Structural Similarity Index (SSIM) [81, 80]. MSE quantifies the average squared difference between the generated and the original images’ pixels, offering a straightforward measure of pixel-level accuracy. SSIM evaluates image quality based on the understanding of visual perception, measuring changes in structural information, luminance, and contrast.
• Vector-based metrics. We utilize Chamfer distance (CD), a metric adapted from point cloud literature [84]. This involves discretizing each SVG into a set of points by sampling paths at regular intervals. CD measures the average nearest-point distance between corresponding points in two SVGs, providing a quantitative measure of similarity. A smaller CD indicates that the two SVGs are more similar, while a larger distance suggests they are more distinct. Having two SVGs $s_1$ and $s_2$, defined with sets of points $p_1 \in s_1$ and $p_2 \in s_2$ in 2D, the CD is defined as
$$c(s_1, s_2) = \frac{1}{|s_1|} \sum_{p_1 \in s_1} \min_{p_2 \in s_2} \|p_1 - p_2\|_2^2 + \frac{1}{|s_2|} \sum_{p_2 \in s_2} \min_{p_1 \in s_1} \|p_2 - p_1\|_2^2 \tag{1}$$
where $|s_i|$ is the cardinality of set $s_i$, and $\|\cdot\|_2^2$ is the squared Euclidean norm.
• Perceptual-based metrics. We incorporate the Learned Perceptual Image Patch Similarity (LPIPS) [87] metric, which uses deep learning models trained on human perceptual judgments. This metric is particularly effective in capturing the nuances of human visual perception, providing a more subjective assessment of image quality beyond mere pixel-level comparisons.
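Two of the metrics above are simple enough to sketch directly. The code below is an illustrative pure-Python version, not the benchmark's implementation: it assumes the symmetric Chamfer distance with squared Euclidean distances, each direction averaged over its own point set, and it takes already-sampled 2D points as input (path sampling is omitted). The MSE example also shows the background effect mentioned earlier.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty sets of 2D points."""
    def sq_dist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    def one_way(src, dst):
        # Average, over src, of the squared distance to the nearest dst point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


def mse(img_a, img_b):
    """Mean squared error between two equally sized grayscale images in [0, 1]."""
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(img_a, img_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


# Two vertical segments, each sampled at two points and 2 units apart.
left = [(0, 0), (0, 1)]
right = [(2, 0), (2, 1)]
cd = chamfer_distance(left, right)

# A 10x10 target with one black pixel, compared against a blank white image.
target = [[1.0] * 10 for _ in range(10)]
target[5][5] = 0.0
blank = [[1.0] * 10 for _ in range(10)]
blank_mse = mse(target, blank)

print(cd, blank_mse)
```

The blank prediction misses the only foreground pixel yet scores an MSE of 0.01, which illustrates why pixel metrics alone can be misleading when the background dominates.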
4.3 Baselines
Here we describe the baselines used to compare StarVector’s performance in the task of image-to-SVG conversion. We consider previous deep learning-based methods and rule-based traditional methods. We evaluate the baselines with publicly available code in our proposed setup.
Im2Vec [61] uses an end-to-end VAE, trained using only image supervision to produce vector graphics. The input rasterized image is encoded into a ‘global’ latent vector, which is passed to an RNN to produce a latent code for each path. The path decoder decodes these path codes into Bézier paths to generate the output SVG. We used the publicly available code (https://github.com/preddy5/Im2Vec) to report the results.
DeepSVG [13] was introduced as a hierarchical path-based VAE encoder-decoder transformer architecture. Input paths are encoded separately by a path encoder, and the resulting encodings are aggregated by a second encoder to produce a latent vector. The decoder uses this latent vector to output the path representations, which provide the actual draw commands and arguments. We used the open-source code (https://github.com/alexandre01/deepsvg) to reproduce the results on different datasets. However, since the DeepSVG framework only allows simplified SVGs, we report results on the ‘simplified’ test set in Table 5.
VTracer (https://github.com/visioncortex/vtracer) [78] is a rule-based algorithm to convert images to SVGs. This 3-step pipeline relies on hierarchical clustering of images, whose clusters are traced into vectors. First, pixels are converted into paths, which are then simplified into polygons. In the last step, polygons are smoothed and approximated with a Bézier curve fitter.
5 Experiments and Results
This section presents the main experiments performed with the StarVector model on SVG-Bench. We report results on the simplified test set as well as the full test set for the metrics defined in Section 4.2. We also ablate our model with different image encoders and data augmentation. Finally, we consider the effect of pre-training on SVG-Stack and fine-tuning on other datasets.
We use HuggingFace Transformers [82] and PyTorch [53] for the implementation. We reproduce baseline models from official repositories, respecting the proposed hyperparameters (see Appendix 11 for more detail). All experiments were done using 4 A100 80GB GPUs. We use a batch size of 2 with a gradient accumulation of 8 steps and a learning rate of for training. Models trained on SVG-Stack with AdamW optimizer [46] require approximately 5 days of training.
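The batch settings above determine the number of samples per optimizer step, but the text does not say whether the batch size of 2 is global or per GPU. The short sketch below works out both readings; which one applies is an assumption either way.

```python
batch_size = 2        # stated in the text
grad_accum_steps = 8  # stated in the text
num_gpus = 4          # stated in the text

# Reading A (assumption): 2 is the global batch size per forward pass.
samples_per_step_global = batch_size * grad_accum_steps

# Reading B (assumption): 2 is the per-GPU batch size under data parallelism.
samples_per_step_per_gpu = batch_size * grad_accum_steps * num_gpus

print(samples_per_step_global, samples_per_step_per_gpu)
```

Under these two readings, each optimizer update sees either 16 or 64 image-SVG pairs.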
| SVG-Fonts_sim | | | | SVG-Emojis_sim | | | | SVG-Icons_sim | | | | SVG-Stack_sim | | | |
Method | MSE | CD | LPIPS | SSIM | MSE | CD | LPIPS | SSIM | MSE | CD | LPIPS | SSIM | MSE | CD | LPIPS | SSIM |
VTracer [78] | 0.014 | 5.631 | 0.044 | 0.946 | 0.018 | 4.494 | 0.064 | 0.911 | 0.009 | 3.885 | 0.052 | 0.952 | 0.016 | 4.983 | 0.061 | 0.918 |
DeepSVG [13] | 0.046 | 3.747 | 0.163 | 0.823 | 0.069 | 5.486 | 0.278 | 0.748 | 0.040 | 2.492 | 0.153 | 0.851 | 0.066 | 4.804 | 0.235 | 0.736 |
Im2Vec [61] | 0.031 | 133.977 | 0.187 | 0.871 | 0.042 | 26.457 | 0.258 | 0.826 | 0.029 | 146.616 | 0.178 | 0.885 | 0.043 | 138.031 | 0.258 | 0.813 |
GPT-4 Vision (100 examples) | 0.091 | 65.103 | 0.248 | 0.755 | 0.099 | 52.206 | 0.268 | 0.701 | 0.128 | 50.649 | 0.271 | 0.709 | 0.131 | 55.455 | 0.280 | 0.668 |
StarVector (ours) | 0.019 | 1.435 | 0.043 | 0.930 | 0.038 | 1.005 | 0.073 | 0.859 | 0.018 | 0.665 | 0.036 | 0.931 | 0.038 | 2.208 | 0.098 | 0.858 |
| SVG-Fonts | | | | SVG-Emojis | | | | SVG-Icons | | | | SVG-Stack | | | |
Method | MSE | CD | LPIPS | SSIM | MSE | CD | LPIPS | SSIM | MSE | CD | LPIPS | SSIM | MSE | CD | LPIPS | SSIM |
VTracer [78] | 0.007 | 4.105 | 0.029 | 0.903 | 0.007 | 8.261 | 0.064 | 0.913 | 0.014 | 3.335 | 0.068 | 0.927 | 0.007 | 6.318 | 0.057 | 0.891 |
Im2Vec [61] | 0.133 | 144.413 | 0.208 | 0.802 | 0.124 | 39.135 | 0.528 | 0.658 | 0.052 | 145.497 | 0.221 | 0.831 | 0.179 | 141.573 | 0.357 | 0.688 |
GPT-4 Vision (100 examples) | 0.194 | 27.909 | 0.268 | 0.689 | 0.162 | 21.134 | 0.404 | 0.612 | 0.135 | 49.249 | 0.299 | 0.666 | 0.192 | 16.981 | 0.370 | 0.604 |
StarVector (ours) | 0.008 | 2.098 | 0.013 | 0.976 | 0.051 | 2.194 | 0.202 | 0.778 | 0.022 | 0.483 | 0.043 | 0.923 | 0.072 | 6.153 | 0.153 | 0.785 |
Supplementary Material
In the following, we further describe the StarVector architecture, its training process, and how we generate SVG samples from images. We also provide more details about SVG-Bench, including the proposed datasets and the different baselines within the evaluation setup, and we include additional results and discussion of our method for image-to-SVG generation.