Better Generalization with Semantic IDs:
A Case Study in Ranking for Recommendations

Anima Singh Trung Vu Nikhil Mehta Raghunandan Keshavan Google Maheswaran Sathiamoorthy Google DeepMind Yilin Zheng Google Lichan Hong Google DeepMind Lukasz Heldt Google Li Wei Google Devansh Tandon Google Ed H. Chi Google DeepMind Xinyang Yi Google DeepMind

Abstract

Randomly-hashed item IDs are used ubiquitously in recommendation models. However, the representations learned from random hashing prevent generalization across similar items, which makes it hard to learn unseen and long-tail items, especially when the item corpus is large, power-law distributed, and evolving dynamically. In this paper, we propose using content-derived features as a replacement for random IDs. We show that simply replacing ID features with content-based embeddings can cause a drop in quality due to reduced memorization capability. To strike a good balance between memorization and generalization, we propose to use Semantic IDs Rajput et al. (2023), a compact discrete item representation learned from frozen content embeddings using RQ-VAE that captures the hierarchy of concepts in items, as a replacement for random item IDs. As with content embeddings, the compactness of Semantic IDs makes them hard to adapt easily in recommendation models. We propose novel methods for adapting Semantic IDs in industry-scale ranking models through hashing sub-pieces of the Semantic ID sequences. In particular, we find that the SentencePiece model Kudo (2018) that is commonly used in LLM tokenization outperforms manually crafted pieces such as N-grams. Finally, we evaluate our approaches in a real-world ranking model for YouTube recommendations. Our experiments demonstrate that Semantic IDs can replace the direct use of video IDs by improving the generalization ability on new and long-tail item slices without sacrificing overall model quality.

*Equal contributions. Correspondence to {animasingh, trungtvu, nikhilmehta}@google.com

1 Introduction

Neural models with large embedding tables are widely used in industry-scale recommender systems for scoring and ranking vast collections of items. These tables, often containing millions or even billions of rows, facilitate rapid memorization of item quality by modeling randomly-hashed item identifiers. Learning good item representations is crucial for personalization, as users are typically modeled as a sequence of items. Concretely, in this paper, we consider a neural ranking model in a video recommendation system at YouTube. In this model, every video gets a unique identifier referred to as video ID, which is a random string devoid of meaning. This approach is widely adopted in numerous industry-scale recommender systems (e.g., Cheng et al. (2016); Kim et al. (2007); Koren et al. (2009); Zhao et al. (2019b)).

In this paper, we study content-based item representations that can improve generalization for new and long-tail item distributions while keeping the model’s power of memorization and without sacrificing overall quality, with a focus on recommendation ranking models. A common technique for encoding item IDs is to learn an embedding for each one-hot encoded ID. However, given an extremely large item corpus with billions of videos, learning one embedding vector per video can be resource-intensive and, more importantly, is vulnerable to the data sparsity of torso and tail items. To use a limited number of embeddings, an alternative approach is the hashing trick Weinberger et al. (2009), which maps many items to the same row. This approach causes random collisions when the original item IDs are not semantically meaningful. When it comes to using content embeddings from pre-trained multimodal item encoders, it is unclear whether a large item ID table can be fully replaced, because item-level memorization is lost. Yuan et al. (2023) show that frozen item embeddings outperformed item ID baselines for SASRec Kang and McAuley (2018), but not for two-tower models Rendle et al. (2020), on datasets with corpora of up to $150k$ items. In our experiments at YouTube with a much larger corpus, we observed a significant quality reduction (Section 4.2) when video IDs are replaced with content embeddings. A recent study Ni et al. (2023) has demonstrated the effectiveness of video encoders that use end-to-end training (VideoRec) to replace video IDs in recommendation models for short videos. However, this approach comes with 10-50x the computational cost of the ID baseline.

We propose a new framework for adapting content embeddings in ranking models with the flexibility of controlling generalization and memorization. Our method is based on item Semantic IDs (SIDs), which were originally proposed in TIGER Rajput et al. (2023) as a hierarchical, sequential and compact representation for generative retrieval. The hierarchical nature of SIDs offers the flexibility of granularity control by using various levels of prefixes, and the sequential property draws the connection to subword tokenization, e.g., the SentencePiece model (SPM) Kudo (2018) in LLMs. Notably, TIGER Rajput et al. (2023) uses SIDs for generative retrieval where efficiency is not a primary consideration, while our work focuses on using Semantic IDs in resource-constrained and latency-sensitive production-scale ranking models, where hashing and adaptation through embeddings are key.

The detailed contributions are: (1) We propose two ways of adapting SIDs in recommendation models as a replacement for item IDs: n-gram and SPM. For both of them, the key idea is to create content-based hashing through sub-pieces of item SIDs, while SPM provides a learnable approach driven by the item distribution by grouping sub-pieces of variable length; (2) We conduct extensive experiments on the YouTube dataset to demonstrate the effectiveness of our approaches. We show that SID-based adaptation outperforms directly using content embeddings. We also demonstrate the superior performance of SPM over n-gram when using large embedding tables with the same number of embedding lookups per item; (3) We also demonstrate the productionization of SIDs for a corpus of billions of videos at YouTube with examples of meaningful and granular hierarchical relationships, along with the success of replacing video IDs in the product scenario.

2 Related Work

Embedding learning

Recommender models rely on learning good representations of categorical features. A common technique to encode categorical features is to train embeddings over one-hot encodings. Word2vec Mikolov et al. (2013) popularized this in the context of language models. The hashing trick Weinberger et al. (2009) is typically used when the cardinality is high, but it causes random collisions. Multiple hashing Zhang et al. (2020) offers some relief but still leads to random collisions. Deep Hash Embedding Kang et al. (2021) circumvents this problem by not maintaining embedding tables, but at the cost of increased computation in the hidden layers. In contrast, we use Semantic IDs, a compute-efficient way to avoid random collisions during embedding learning for item IDs. Semantic IDs improve generalization in recommender models by enabling collisions between semantically related items.
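The random collisions caused by the hashing trick can be illustrated with a minimal sketch. The ID strings, the bucket count, and the use of MD5 as the hash function are illustrative assumptions, not details of any production system.

```python
import hashlib

def hash_bucket(item_id: str, num_buckets: int) -> int:
    """Map an arbitrary item ID string to one of num_buckets embedding rows."""
    digest = hashlib.md5(item_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % num_buckets

# Hypothetical video IDs: the strings carry no semantic meaning.
video_ids = [f"vid_{i}" for i in range(1000)]
num_buckets = 100  # assumed table size, far smaller than the corpus

rows = [hash_bucket(v, num_buckets) for v in video_ids]

# More items than rows means some items must share a row, and which
# items share one is unrelated to their content.
num_distinct_rows = len(set(rows))
```

Because the assignment depends only on the hash of a meaningless string, two videos that share a row are no more likely to be related than any other pair.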

Cold-start and content information

Content-based recommender models have been proposed to combat cold-start issues (e.g., Schein et al. (2002), Volkovs et al. (2017b)) and to enable transferable recommendations (Wang et al. (2022), Hou et al. (2022), Ni et al. (2023)). Recently, embeddings derived from content information are also popular (e.g., DropoutNet Volkovs et al. (2017a), CC-CC Shi et al. (2019) and Du et al. (2020)). PinSage Ying et al. (2018) aggregates visual, text, and engagement information to represent items. Moreover, PinnerFormer Pancha et al. (2022) uses sequences of PinSage embeddings corresponding to item history to build a sequential recommendation model. In contrast to these efforts, our goal is to develop content-derived representations that not only generalize well but also improve performance relative to using item ID features, which is a significantly more challenging task. Ni et al. (2023) introduce a large dataset of short videos and show that existing video encoders do not produce embeddings that are useful for recommendation purposes. They successfully tackle the challenge of replacing video IDs with content embeddings derived from video encoders that are trained end-to-end with the recommendation model for short videos. In a similar vein, TransRec Wang et al. (2022) also trains end-to-end and uses multiple modality information to represent items for enabling transferable recommendations. However, both approaches significantly increase training costs, making them challenging to deploy in production. Furthermore, unlike PinnerFormer Pancha et al. (2022), which is used for offline inference, our focus is to improve generalization of a ranking model used for real-time inference. Therefore, approaches that significantly increase resource costs (including storage, training and serving) are infeasible to deploy in production. Semantic IDs offer an efficient compression of content embeddings into discrete tokens, making it feasible to use content signals in production recommendation systems.

Discrete representations

Several techniques exist to discretize embeddings, including VQ-VAE Van Den Oord et al. (2017), VQ-GAN Esser et al. (2021) and their variants used for generative modeling (e.g., Parti Yu et al. (2022) and SoundStream Zeghidour et al. (2021)). TIGER Rajput et al. (2023) used RQ-VAE in the context of recommender applications. Conventional techniques like Product Quantization Jegou et al. (2010) and its variants are used by many recommender models (e.g., MGQE Kang et al. (2020) and Hou et al. (2022)). However, these do not offer hierarchical semantics, which we leverage in our work.

3 Proposed Approaches

3.1 Overview

Given content embeddings for a corpus of items, in contrast with the approach of directly using the embeddings as input features, we propose an efficient two-stage approach to leverage content signals in downstream recommendation models.

• Stage 1: Efficient compression of content embeddings into discrete Semantic IDs. We use a residual quantization technique called RQ-VAE Rajput et al. (2023); Lee et al. (2022); Zeghidour et al. (2021) to quantize dense content embeddings into discrete tokens that capture semantic information about videos. This compression is crucial for representing a user’s past history efficiently, because each item can be represented as a few integers rather than a high-dimensional embedding. Once trained, we freeze the RQ-VAE model and use it for training the downstream ranking model in Stage 2.

• Stage 2: Training the ranking model with Semantic IDs. We use the model from Stage 1 to map each item to its Semantic ID and then train embeddings for Semantic IDs, along with the rest of the ranking model (Section 3.3). In practice, ranking models are typically trained sequentially on recently logged data.

A key design choice in our proposal is to train and then freeze the RQ-VAE model from Stage 1. The frozen RQ-VAE model generates Semantic IDs for training and serving the ranking model. Recent data may include items that may not exist in the training distribution of the RQ-VAE model. This raises a potential concern from freezing the model, which could hurt the performance of the ranking model over time. As detailed in Appendix A.2, our analysis of YouTube ranking models utilizing Semantic IDs derived from RQ-VAE models trained on both older and recent data reveals comparable performance, indicating the stability of learned semantic representations over time.

3.2 RQ-VAE for Semantic IDs (SIDs)

Figure 1: Illustration of RQ-VAE: The input vector ${\bm{x}}$ is encoded into a latent ${\bm{z}}$ , which is then recursively quantized by looking up the nearest codebook vector of the residual at each level. In this figure, the item represented by ${\bm{x}}$ has $(1,4,6,2)$ as its Semantic ID.

SIDs are generated from item content embeddings using a Residual-Quantized Variational AutoEncoder (RQ-VAE) Lee et al. (2022); Zeghidour et al. (2021); Rajput et al. (2023) that applies quantization on residuals at multiple levels, as shown in Figure 1. There are three jointly-trained components: (1) an encoder ${\mathcal{E}}$ that maps the content embedding ${\bm{x}}\in\mathbb{R}^{D}$ to a latent vector ${\bm{z}}\in\mathbb{R}^{D^{\prime}}$ , (2) a residual quantizer with $L$ levels, each with a codebook ${\mathcal{C}}_{l}:=\{{\bm{e}}^{l}_{k}\}_{k=1}^{K}$ , where ${\bm{e}}^{l}_{k}\in\mathbb{R}^{D^{\prime}}$ and $K$ is the codebook size; the quantizer recursively quantizes the residual ${\bm{r}}_{l}$ at each level $l$ to the nearest codebook vector ${\bm{e}}_{c_{l}}$ , and (3) a decoder ${\mathcal{D}}$ that maps the quantized latent $\hat{{\bm{z}}}$ back to the original embedding space as $\hat{{\bm{x}}}$ . We use the following loss to train the RQ-VAE model: ${\mathcal{L}}={\mathcal{L}}_{recon}+{\mathcal{L}}_{rqvae}$ , where ${\mathcal{L}}_{recon}=\|{\bm{x}}-\hat{{\bm{x}}}\|^{2}$ and ${\mathcal{L}}_{rqvae}=\sum_{l=1}^{L}\left(\beta\|{\bm{r}}_{l}-\text{sg}[{\bm{e}}_{c_{l}}]\|^{2}+\|\text{sg}[{\bm{r}}_{l}]-{\bm{e}}_{c_{l}}\|^{2}\right)$ , and sg denotes the stop-gradient operator. ${\mathcal{L}}_{recon}$ aims to reconstruct the content embedding ${\bm{x}}$ . The first and the second terms in ${\mathcal{L}}_{rqvae}$ encourage the encoder and the codebook vectors, respectively, to be trained such that ${\bm{r}}_{l}$ and ${\bm{e}}_{c_{l}}$ move towards each other.
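The recursive quantization step can be sketched in NumPy as follows. This is a minimal illustration of the quantizer only, under stated assumptions: the encoder and decoder are omitted, codes are 0-indexed, the function names are hypothetical, and $\beta=0.25$ is an assumed value that the text does not specify.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Greedy residual quantization of a latent vector.

    z: array of shape (D',); codebooks: array of shape (L, K, D').
    Returns the code tuple (c_1..c_L), the quantized latent z_hat,
    and the residual r_l seen at each level.
    """
    residual = z.astype(float).copy()
    z_hat = np.zeros_like(residual)
    codes, residuals = [], []
    for codebook in codebooks:
        residuals.append(residual.copy())
        # Nearest codebook vector to the current residual.
        dists = np.sum((codebook - residual) ** 2, axis=1)
        c = int(np.argmin(dists))
        codes.append(c)
        z_hat += codebook[c]
        residual = residual - codebook[c]
    return tuple(codes), z_hat, residuals

def rqvae_loss_value(residuals, codebooks, codes, beta=0.25):
    """Forward value of L_rqvae. Stop-gradient only changes gradients,
    so both terms share the same squared distance in the forward pass."""
    total = 0.0
    for r, codebook, c in zip(residuals, codebooks, codes):
        sq = float(np.sum((r - codebook[c]) ** 2))
        total += beta * sq + sq
    return total

# Toy example with L=2 levels, K=2 codes per level, D'=2.
toy_codebooks = np.array([[[1.0, 0.0], [0.0, 1.0]],
                          [[0.0, 0.0], [0.5, 0.5]]])
toy_codes, toy_z_hat, toy_residuals = residual_quantize(
    np.array([1.0, 0.0]), toy_codebooks)
```

In an actual training setup the two loss terms would need a framework with a stop-gradient primitive so that the first term updates only the encoder and the second only the codebooks; the function above only evaluates the scalar.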

3.3 Semantic ID Representation in Ranking

In this section, we discuss how we model item representations derived from SIDs for use in ranking models. For a given item $v$ , an RQ-VAE model with $L$ levels generates a SID as a sequence $(c^{v}_{1},...,c^{v}_{L})$ . The idea of the adaptation is to create subwords that hash the SID sequence into a number of learnable embeddings. We propose two techniques for the adaptation:

N-gram-based: N-gram item representations leverage SID codes by grouping them into subwords of length N. Each subword is then associated with a learnable embedding, effectively capturing the semantic relationships within the N-gram. The item representation is constructed by summing the embeddings of all N-gram subwords in an item. For instance, a unigram representation would have L subwords, each containing a single code: ${(c^{v}_{1}),...,(c^{v}_{L})}$ . A bigram representation with non-overlapping codes would consist of L/2 subwords, each containing two consecutive codes: ${(c^{v}_{1},c^{v}_{2}),...,(c^{v}_{L-1},c^{v}_{L})}$ . To associate learnable embeddings with these N-gram-based subwords, a separate embedding table is learned for each subgroup. Since each code has a cardinality of K, the embedding table for an N-gram group contains $K^{N}$ rows. These embedding tables are jointly trained with the other parameters of the ranking model, enabling the network to learn representations that effectively capture the relationship between semantic codes within the context of the ranking task.
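The N-gram construction can be sketched as below. The sketch assumes 0-indexed codes in $[0, K)$, a base-$K$ encoding to turn each N-gram into a row index (one natural choice, not necessarily the one used in production), and hypothetical function names; the toy value $K=8$ is for illustration only.

```python
import numpy as np

def ngram_subword_ids(sid, n, K):
    """Split a SID into non-overlapping n-grams.

    Returns one (table_index, row_index) pair per subword, where
    row_index is the base-K encoding of the n codes, so 0 <= row < K**n.
    """
    assert len(sid) % n == 0, "SID length must be divisible by n"
    ids = []
    for g in range(len(sid) // n):
        row = 0
        for code in sid[g * n:(g + 1) * n]:
            row = row * K + code
        ids.append((g, row))
    return ids

def ngram_item_embedding(sid, n, K, tables):
    """Sum the embeddings of all n-gram subwords of an item.

    tables: one array of shape (K**n, dim) per n-gram group.
    """
    return sum(tables[g][row] for g, row in ngram_subword_ids(sid, n, K))

# The SID (1, 4, 6, 2) from Figure 1 with a toy codebook size K=8.
unigram_ids = ngram_subword_ids((1, 4, 6, 2), n=1, K=8)
bigram_ids = ngram_subword_ids((1, 4, 6, 2), n=2, K=8)

# Two bigram tables with deterministic toy values of dimension 2.
toy_tables = [np.arange(64 * 2, dtype=float).reshape(64, 2) for _ in range(2)]
item_vec = ngram_item_embedding((1, 4, 6, 2), n=2, K=8, tables=toy_tables)
```

In a ranking model the tables would be trainable parameters rather than fixed arrays; the lookup-and-sum structure is the part this sketch is meant to show.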

SPM-based: While N-gram-based video representations offer a straightforward approach to capture relationships between sequential codes in a Semantic ID, they suffer from limitations that hinder their effectiveness. First, their reliance on fixed grouping based on predefined N-gram sizes restricts their ability to adapt to the specific characteristics of the Semantic ID corpus, leading to suboptimal embedding table lookups. Second, the number of rows in the N-gram embedding tables grows exponentially with N, imposing a significant memory burden. These challenges motivate adapting Semantic IDs with SentencePiece models (SPM) Kudo (2018), which offer a more adaptive and efficient solution for representing item content. We propose using SPM to dynamically learn Semantic ID subwords based on the distribution of impressed items. This allows variable-length subwords such that popular co-occurring codes are automatically combined into a single subgroup, whereas codes that rarely co-occur may fall back to unigrams. For the SPM-based representation, we learn a single embedding table where each row corresponds to a particular variable-length subpiece. By adaptively constructing subword vocabularies given a fixed embedding table size, the SPM vocabulary allows striking a balance between generalization and memorization.
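The idea of learning variable-length subpieces under a fixed vocabulary budget can be illustrated with a toy sketch. Note the substitution: the code below uses a simple BPE-style greedy merge weighted by impression counts as a stand-in, which is not the SentencePiece training algorithm. The SIDs, impression counts, vocabulary size, and function names are all hypothetical.

```python
from collections import Counter

def merge(seq, a, b):
    """Replace every adjacent occurrence of pieces (a, b) with a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_pieces(sids, counts, vocab_size):
    """Greedily merge frequent adjacent pieces until the budget is reached.

    A piece is a tuple of (level, code) pairs, so the same code value at
    different levels is never confused. counts holds impressions per item.
    """
    seqs = [[((l, c),) for l, c in enumerate(sid)] for sid in sids]
    vocab = {p for seq in seqs for p in seq}
    while len(vocab) < vocab_size:
        pair_freq = Counter()
        for seq, n in zip(seqs, counts):
            for a, b in zip(seq, seq[1:]):
                pair_freq[(a, b)] += n
        if not pair_freq:
            break  # every item is already a single piece
        (a, b), _ = max(pair_freq.items(), key=lambda kv: (kv[1], kv[0]))
        vocab.add(a + b)
        seqs = [merge(seq, a, b) for seq in seqs]
    return vocab, seqs

# Three hypothetical items: two popular ones sharing a prefix, one rare.
sids = [(1, 4, 6, 2), (1, 4, 3, 7), (5, 0, 6, 2)]
impressions = [100, 50, 1]
vocab, pieces = learn_pieces(sids, impressions, vocab_size=10)

# Each piece maps to one row of a single shared embedding table.
piece_to_row = {p: i for i, p in enumerate(sorted(vocab))}
lookups_per_item = [len(seq) for seq in pieces]
```

In this toy run the most impressed item ends up with two pieces while the others keep three, mirroring how frequent code combinations get their own rows and rare ones fall back to shorter pieces.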

4 Experiments

4.1 Experimental Setup

Ranking Model. We conduct our experiments on a multitask production ranking model Tang et al. (2023); Zhao et al. (2019a), which is used for recommending the next video to watch, given a video a user is watching and the user’s past activities. This model uses O(10) million buckets for random hashing to accommodate O(100) million videos in our corpus and is trained sequentially on logged data. In the baseline, random hashing of video IDs is used for three key features: users’ watch history, watch video, and the candidate video to be ranked. We evaluate our methods on data that the trained model has not yet seen, allowing us to understand the performance under the data-distribution shift of the video corpus.

The inherent scale and real-time demands of ranking models necessitate embedding tables with specific characteristics to ensure efficient and effective performance. Firstly, the embedding table needs to fit easily in memory. This was one of our key considerations when deciding N in the N-gram-based Semantic ID representations. Since the number of rows in the embedding tables grows exponentially with N, we limit our analysis to $N\leq 2$ for N-gram-based representations. Secondly, the embedding lookups need to be fast to provide near-instantaneous responses to user requests. Our analysis is grounded in the above two properties.

Content Embeddings. Semantic IDs are generated using dense content embeddings. We use a video encoder to generate dense content embeddings for each YouTube video. The video encoder is a transformer model that uses VideoBERT Sun et al. (2019) as the backbone architecture, takes audio and visual features as inputs, and outputs $2048$ -dimensional embeddings that capture the topicality of the video. This model was trained using techniques described in Lee et al. (2020).

Experimental Settings. We compare the two proposed Semantic ID-based representations with two baseline representation techniques: directly using raw content embeddings, referred to as Dense Input, and the commonly used randomly hashed IDs, referred to as Random Hashing. Since directly using dense input embeddings as the item representation obviates the need for embedding table parameters, we also introduce additional baselines for the Dense Input approach for a fair comparison, where we increase the ranking model layers by 1.5x and 2x to study how increasing the model depth affects the ranking performance. To generate the Semantic IDs, we use a depth of $L=8$ , resulting in 8 codes in the Semantic ID of each video. The codebook size for RQ-VAE was set to $K=2048$ .
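With $L=8$ and $K=2048$ , and one table of $K^{N}$ rows for each of the $L/N$ non-overlapping N-gram groups (Section 3.3), the total number of embedding rows can be checked with a short calculation. The $N=4$ case is included only to illustrate the exponential growth and is not a configuration evaluated here; the function name is hypothetical.

```python
L, K = 8, 2048  # SID depth and codebook size from the experimental settings

def total_ngram_rows(n: int) -> int:
    """Rows summed over the L/n tables of a non-overlapping n-gram scheme."""
    assert L % n == 0, "n must divide the SID depth"
    return (L // n) * K ** n

unigram_rows = total_ngram_rows(1)   # 8 * K
bigram_rows = total_ngram_rows(2)    # 4 * K**2
fourgram_rows = total_ngram_rows(4)  # 2 * K**4, illustrative only
```

Going from unigrams to bigrams multiplies the row count by 1024, and a 4-gram scheme would need about 3.5e13 rows, which is consistent with restricting the analysis to $N\leq 2$ .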

Evaluation Metrics. The ranking model is trained sequentially on the first $N$ days of data, where each day contains logged data generated from user interactions on that day. We evaluate the model’s performance using AUC for CTR on the data from the ( $N+1$ )-th day. We further slice the metric on items introduced on the ( $N+1$ )-th day. We refer to this as CTR/1D. The CTR AUC and CTR/1D AUC metrics evaluate the model’s ability to generalize under data distribution shifts over time and to cold-start items, respectively. A $0.1\%$ change in CTR AUC is considered significant for our ranking model.
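The relationship between the overall and the sliced metric can be sketched as follows. The labels, scores, and the `is_new_item` mask are made-up toy data, and the pairwise AUC below is a small-scale illustration (it uses memory proportional to positives times negatives), not the evaluation pipeline used for the experiments.

```python
import numpy as np

def auc(labels, scores):
    """AUC as the probability that a random positive is scored above a
    random negative, counting ties as one half."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Toy day-(N+1) evaluation data: click labels and predicted CTR.
clicks = np.array([1, 0, 1, 0, 1, 0])
pred = np.array([0.9, 0.2, 0.6, 0.7, 0.8, 0.1])
# Hypothetical mask: True for items first introduced on day N+1.
is_new_item = np.array([False, False, True, True, True, True])

ctr_auc = auc(clicks, pred)                               # all examples
ctr_1d_auc = auc(clicks[is_new_item], pred[is_new_item])  # cold-start slice
```

The slice is computed on the same predictions, restricted to examples whose candidate item is new, so a model can look fine overall while ranking new items poorly.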

4.2 Performance of Semantic ID

Storing content embeddings for each video in users’ watch history is highly resource intensive. Hence, training a baseline large-scale ranking model that uses content embeddings to represent each video in users’ watch history is infeasible. To better understand which representation method performs better, we consider two settings of the ranking model. First, we compare the SID-based representation with raw content embeddings and random hashing based ID such that user history is not used as an input feature (Figure 2). In this setting, two video features (i.e., current and candidate video) are used as input features to the ranking model. In the second setting, we use users’ watch history as the input feature (along with current and candidate video), where the SID-based representation is compared with random hashing (Figure 3).

Dense Content Embedding vs. Random Hashing. We observe that directly using content embeddings (Dense Input) to replace random hashing-based IDs, without additional changes to the model architecture, does not lead to better quality. As shown in Figures 2(a)-2(b), the Dense Input baseline performs worse than the video-ID based baseline. We hypothesize that ranking models rely heavily on memorization from the ID-based embedding tables; replacing the embedding table with fixed dense content embeddings as a feature leads to poor CTR. To test this hypothesis, we also ran experiments with 1.5x-2x layers in the ranking model to increase the model’s memorization ability for the Dense Input baseline. We found that increasing the depth does improve quality for both overall and cold-start items compared to the random hashing baseline. In fact, the increase in CTR is higher for the Dense Input model with 2x layers than for Dense Input with 1.5x layers, indicating that the more layers the model has, the better its memorization (overall CTR) and generalization (cold-start CTR/1D). However, increasing the number of layers can increase the serving cost considerably. As discussed below, SIDs allow retaining the semantic information from raw content embeddings, while still flexibly and efficiently providing memorization via learned embedding tables.

SID vs. Baselines. We compare the two types of SID representations (N-gram and SPM) with the baselines, where for N-gram-SID, we use Unigram (N=1) and Bigram (N=2). When using N-gram, the embedding table size is based on all the possible combinations for the respective N-gram, i.e., Unigram-SID has $8\times K$ rows and Bigram-SID has $4\times K^{2}$ rows. We found that both Unigram-SID and Bigram-SID lead to worse overall CTR compared to Random Hashing when the user history is not used as an input feature (Figure 2). This could be because of skew in the content in the training data, causing sparse usage of the embedding table. This issue does not occur in random hashing, since the embeddings are used uniformly due to the random assignment of videos to rows of the embedding table. On the other hand, when we use the user history as an input feature (Figure 3), both Unigram-SID and Bigram-SID perform much better than random hashing, because the videos in users’ watch history likely cover more diverse content, leading to more uniform usage of the embedding table. Next, we turn to the SPM-SID-based video representations. SPM-SID consistently outperformed N-gram representations when employing larger embedding tables, particularly in the CTR/1D AUC metrics (see Figures 2(b) and 3(b)), suggesting greater generalization capabilities towards cold-start items. A more nuanced picture emerges for smaller embedding table sizes: when the embedding table size is limited ( $8\times K$ or $4\times K^{2}$ ), N-gram methods demonstrate a slight advantage over SPM-SID. This behavior can be attributed to the smaller subword vocabulary learned by SPM within these constrained table sizes, potentially hindering its ability to fully capture complex semantic relationships. Note that for most production ranking models, a large embedding table is necessary for good quality. Hence, the SPM-SID based representation is more beneficial for large-scale production ranking models. Overall, both Bigram-SID and SPM-SID significantly outperformed random hashing in our experiments with large-scale ranking models, highlighting the importance of structured representations that capture semantic relationships for improving cold-start video recommendations.

Efficiency in SPM-SID vs. N-gram-SID. In contrast to N-gram-SID representations, which utilize fixed embedding table sizes, SPM-SID offers the flexibility of adapting to a given embedding table size. This adaptation is achieved by constructing subwords directly from the training data. Given a fixed embedding table, SPM dynamically generates subwords, each mapping to a unique table entry. This optimizes the Semantic ID representation within the size constraint, improving video representation efficiency. Moreover, in terms of embedding table lookups, SPM-SID is more efficient than N-gram-SID. We plot the number of embedding lookups per video vs. the embedding table size in Figure 4. The plot highlights the adaptive nature of SPM, where the number of lookups is dynamically reduced for the head/common videos in the training data, while the average number of lookups is comparable to the fixed number of lookups in N-gram. This adaptive nature of SPM contributes to its enhanced efficiency and scalability, making it a more suitable approach for large-scale ranking models.

5 Conclusion and Future Work

This paper tackles the challenging task of removing reliance on widely used item IDs in recommendation models. Using the YouTube ranking model as a case study, we discuss the disadvantages of using item ID features in large-scale production recommendation models. Using RQ-VAE, we develop Semantic IDs for billions of YouTube videos from frozen content embeddings to capture semantically meaningful hierarchical structures across the corpus. We propose and demonstrate Semantic IDs as an effective method for replacing video IDs to improve generalization by introducing meaningful collisions.

References

Cheng et al. [2016] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pages 7–10, 2016.
Dhariwal et al. [2020] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever. Jukebox: A generative model for music, 2020.
Du et al. [2020] X. Du, X. Wang, X. He, Z. Li, J. Tang, and T.-S. Chua. How to learn item representation for cold-start multimedia recommendation? In Proceedings of the 28th ACM International Conference on Multimedia, pages 3469–3477, 2020.
Esser et al. [2021] P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
Hou et al. [2022] Y. Hou, Z. He, J. McAuley, and W. X. Zhao. Learning vector-quantized item representation for transferable sequential recommenders. arXiv preprint arXiv:2210.12316, 2022.
Jegou et al. [2010] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010.
Kang and McAuley [2018] W. Kang and J. J. McAuley. Self-attentive sequential recommendation. CoRR, abs/1808.09781, 2018. URL http://arxiv.longhoe.net/abs/1808.09781.
Kang et al. [2020] W.-C. Kang, D. Z. Cheng, T. Chen, X. Yi, D. Lin, L. Hong, and E. H. Chi. Learning multi-granular quantized embeddings for large-vocab categorical features in recommender systems. In Companion Proceedings of the Web Conference 2020, pages 562–566, 2020.
Kang et al. [2021] W.-C. Kang, D. Z. Cheng, T. Yao, X. Yi, T. Chen, L. Hong, and E. H. Chi. Learning to embed categorical features without embedding tables for recommendation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 840–850, 2021.
Kim et al. [2007] D. Kim, K.-s. Kim, K.-H. Park, J.-H. Lee, and K. M. Lee. A music recommendation system with a dynamic k-means clustering algorithm. In Sixth international conference on machine learning and applications (ICMLA 2007), pages 399–403. IEEE, 2007.
Koren et al. [2009] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
Kudo [2018] T. Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates, 2018.
Lee et al. [2022] D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022.
Lee et al. [2020] H. Lee, J. Lee, J. Y.-H. Ng, and P. Natsev. Large scale video representation learning via relational graph clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
Mikolov et al. [2013] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013.
Ni et al. [2023] Y. Ni, Y. Cheng, X. Liu, J. Fu, Y. Li, X. He, Y. Zhang, and F. Yuan. A content-driven micro-video recommendation dataset at scale. arXiv preprint arXiv:2309.15379, 2023.
Pancha et al. [2022] N. Pancha, A. Zhai, J. Leskovec, and C. Rosenberg. Pinnerformer: Sequence modeling for user representation at pinterest. arXiv preprint arXiv:2205.04507, 2022.
Rajput et al. [2023] S. Rajput, N. Mehta, A. Singh, R. Keshavan, T. Vu, L. Heldt, L. Hong, Y. Tay, V. Q. Tran, J. Samost, and M. Sathiamoorthy. Recommender systems with generative retrieval. Advances in Neural Information Processing Systems, 2023.
Rendle et al. [2020] S. Rendle, W. Krichene, L. Zhang, and J. Anderson. Neural collaborative filtering vs. matrix factorization revisited, 2020.
Schein et al. [2002] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 253–260, 2002.
Shi et al. [2019] S. Shi, M. Zhang, X. Yu, Y. Zhang, B. Hao, Y. Liu, and S. Ma. Adaptive feature sampling for recommendation with missing content feature values. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1451–1460, 2019.
Sun et al. [2019] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. Videobert: A joint model for video and language representation learning. CoRR, abs/1904.01766, 2019. URL http://arxiv.longhoe.net/abs/1904.01766.
Tang et al. [2023] J. Tang, Y. Drori, D. Chang, M. Sathiamoorthy, J. Gilmer, L. Wei, X. Yi, L. Hong, and E. H. Chi. Improving training stability for multitask ranking models in recommender systems. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, aug 2023. doi: 10.1145/3580305.3599846. URL https://doi.org/10.1145%2F3580305.3599846.
Van Den Oord et al. [2017] A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
Volkovs et al. [2017a] M. Volkovs, G. Yu, and T. Poutanen. Dropoutnet: Addressing cold start in recommender systems. Advances in neural information processing systems, 30, 2017a.
Volkovs et al. [2017b] M. Volkovs, G. W. Yu, and T. Poutanen. Content-based neighbor models for cold start in recommender systems. In Proceedings of the Recommender Systems Challenge 2017, pages 1–6. 2017b.
Wang et al. [2022] J. Wang, F. Yuan, M. Cheng, J. M. Jose, C. Yu, B. Kong, X. He, Z. Wang, B. Hu, and Z. Li. Transrec: Learning transferable recommendation from mixture-of-modality feedback. arXiv preprint arXiv:2206.06190, 2022.
Weinberger et al. [2009] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th annual international conference on machine learning, pages 1113–1120, 2009.
Ying et al. [2018] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, jul 2018. doi: 10.1145/3219819.3219890. URL https://doi.org/10.1145%2F3219819.3219890.
Yu et al. [2022] J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
Yuan et al. [2023] Z. Yuan, F. Yuan, Y. Song, Y. Li, J. Fu, F. Yang, Y. Pan, and Y. Ni. Where to go next for recommender systems? id- vs. modality-based recommender models revisited, 2023.
Zeghidour et al. [2021] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi. Soundstream: An end-to-end neural audio codec. CoRR, abs/2107.03312, 2021. URL https://arxiv.org/abs/2107.03312.
Zhang et al. [2020] C. Zhang, Y. Liu, Y. Xie, S. I. Ktena, A. Tejani, A. Gupta, P. K. Myana, D. Dilipkumar, S. Paul, I. Ihara, et al. Model size reduction using frequency based double hashing for recommender systems. In Proceedings of the 14th ACM Conference on Recommender Systems, pages 521–526, 2020.
Zhao et al. [2019a] Z. Zhao, L. Hong, L. Wei, J. Chen, A. Nath, S. Andrews, A. Kumthekar, M. Sathiamoorthy, X. Yi, and E. Chi. Recommending what video to watch next: A multitask ranking system. In Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19, page 43–51, New York, NY, USA, 2019a. Association for Computing Machinery. ISBN 9781450362436. doi: 10.1145/3298689.3346997. URL https://doi.org/10.1145/3298689.3346997.
Zhao et al. [2019b] Z. Zhao, L. Hong, L. Wei, J. Chen, A. Nath, S. Andrews, A. Kumthekar, M. Sathiamoorthy, X. Yi, and E. Chi. Recommending what video to watch next: a multitask ranking system. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 43–51, 2019b.

Appendix A Appendix

A.1 RQ-VAE Training and Serving Setup

Model Hyperparameters. For the RQ-VAE model, we use a 1-layer encoder-decoder model with dimension 256. We apply $L=8$ levels of quantization, each with a codebook of size $K=2048$.

RQ-VAE Training: We train the RQ-VAE model on a random sample of impressed videos until the reconstruction loss stabilizes (on the order of tens of millions of steps for our corpus). Vector quantization techniques are known to suffer from codebook collapse during training Dhariwal et al. [2020], where the model uses only a small proportion of the codebook vectors. To address this, at each training step we reset unused codebook vectors to the content embeddings of videos sampled at random from within the batch Zeghidour et al. [2021], which significantly improved codebook utilization. We use $\beta=0.25$ to compute the training loss. Once trained, we freeze the RQ-VAE model and use the encoder to produce Semantic IDs for videos.
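The quantization and codebook-reset steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the production implementation: the encoder, decoder, and gradient updates are omitted, the toy sizes are far smaller than $K=2048$ and $L=8$, the input vectors stand in for encoder outputs, and the function names (`nearest_code`, `residual_quantize`, `reset_unused_codes`) are hypothetical.

```python
import random

def nearest_code(vec, codebook):
    """Index of the codebook row closest to vec in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])))

def residual_quantize(vec, codebooks):
    """Greedy residual quantization: one code per level, where each level
    quantizes whatever the previous levels left unexplained."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        k = nearest_code(residual, codebook)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tuple(codes), residual

def reset_unused_codes(codebook, used, batch, rng):
    """Overwrite rows that no batch item selected with randomly sampled batch
    vectors (in place). Returns the indices that were reset."""
    dead = [k for k in range(len(codebook)) if k not in used]
    for k in dead:
        codebook[k] = list(rng.choice(batch))
    return dead

rng = random.Random(0)
DIM, K, L, BATCH = 4, 8, 3, 16  # toy sizes chosen for illustration only
codebooks = [[[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(K)]
             for _ in range(L)]
batch = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(BATCH)]

# One "training step": assign codes, then revive unused codes at level 0.
semantic_ids = [residual_quantize(v, codebooks)[0] for v in batch]
used_level0 = {sid[0] for sid in semantic_ids}
dead = reset_unused_codes(codebooks[0], used_level0, batch, rng)
print(semantic_ids[0], "| reset", len(dead), "unused codes at level 0")
```

The reset is shown for the first level only to keep the sketch short; applying the same step to every level is a natural extension, but which vectors to sample at deeper levels is an assumption the text above does not settle.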

RQ-VAE Serving/Inference: As new videos get introduced into the corpus, we generate the Semantic IDs using the frozen RQ-VAE model. Semantic IDs are then stored and served similarly to other features used for ranking.

A.2 Stability of Semantic IDs over time

To study the stability of Semantic IDs, we train two RQ-VAE models, RQ-VAE_v0 and RQ-VAE_v1, using data 6 months apart. Figure 5 shows that the performance of the production ranking model trained on recent engagement data (using SID-3Bigram-sum) is comparable for Semantic IDs derived from RQ-VAE_v0 and from RQ-VAE_v1. This confirms that the semantic token space for videos learned via RQ-VAE is stable enough over time for use in the downstream production ranking model.

A.3 Semantic IDs as hierarchy of concepts

We illustrate the hierarchy of concepts captured by Semantic IDs from the videos in our corpus. Section 4.1 details the hyper-parameters used to train the RQ-VAE model. Intuitively, we can think of Semantic IDs as forming a trie over videos, with higher levels representing coarser concepts and lower levels representing more fine-grained concepts. Figures 6 and 7 show two example sub-tries from our trained RQ-VAE model with 4 tokens that capture a hierarchy of concepts within sports and food vlogging videos.
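The trie view can be made concrete with a small prefix index that maps every Semantic ID prefix to the videos beneath it. The sketch below is illustrative only: the video keys and token values are invented, and `build_trie` is a hypothetical helper rather than part of our system.

```python
from collections import defaultdict

def build_trie(semantic_ids):
    """Map every Semantic ID prefix to the set of videos in its sub-trie."""
    trie = defaultdict(set)
    for video, sid in semantic_ids.items():
        for n in range(1, len(sid) + 1):
            trie[sid[:n]].add(video)
    return trie

# Hypothetical videos and 4-token Semantic IDs, invented for illustration.
semantic_ids = {
    "video_a": (1, 2, 3, 4),
    "video_b": (1, 2, 6, 7),
    "video_c": (1, 5, 0, 2),
    "video_d": (9, 2, 3, 4),
}
trie = build_trie(semantic_ids)

print(sorted(trie[(1,)]))    # coarse concept: every video whose first token is 1
print(sorted(trie[(1, 2)]))  # finer concept: the sub-trie under prefix (1, 2)
```

Longer prefixes select smaller sub-tries, which matches the coarse-to-fine reading of the levels. Note that `video_d` shares tokens 2, 3, 4 with `video_a` at later positions but falls in a different sub-trie, because only the prefix determines the position in the trie.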

A.4 Similarity Analysis with Semantic ID

Table 1 shows the average pairwise cosine similarity in the content embedding space for all videos with a shared Semantic ID prefix of length $n$, together with the corresponding sub-trie sizes. For example, two videos with Semantic IDs $(1,2,3,4)$ and $(1,2,6,7)$ have a shared prefix of length $2$. We observe that as the shared prefix length increases, the average pairwise cosine similarity increases while the sub-trie size decreases. This suggests that Semantic ID prefixes represent increasingly granular concepts as their lengths increase.

Shared prefix length	Average pairwise cosine similarity	Typical sub-trie size
1	0.41	150,000-450,000
2	0.68	20-150
3	0.91	1-5
4	0.97	1

Table 1: Aggregate metrics for videos sharing a Semantic ID prefix of length $n$. The typical sub-trie size refers to the 25th-75th percentile range (with rounding).
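The aggregate in Table 1 can be sketched as follows. This is a hedged illustration, not the analysis code: it assumes that videos are grouped by their first $n$ tokens and that similarities are pooled over all within-group pairs, which is one possible reading of the procedure. The embeddings and video keys are invented, and the helper names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def shared_prefix_length(a, b):
    """Number of leading tokens on which two Semantic IDs agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def avg_similarity_by_prefix(semantic_ids, embeddings, n):
    """Mean cosine similarity over all video pairs whose Semantic IDs share the
    same first n tokens. Returns None when no such pair exists."""
    groups = defaultdict(list)
    for video, sid in semantic_ids.items():
        groups[sid[:n]].append(video)
    sims = [cosine(embeddings[a], embeddings[b])
            for members in groups.values()
            for a, b in combinations(members, 2)]
    return sum(sims) / len(sims) if sims else None

# Hypothetical Semantic IDs and 2-d content embeddings, invented for illustration.
semantic_ids = {"a": (1, 2, 3, 4), "b": (1, 2, 6, 7), "c": (1, 5, 0, 2)}
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.2, 1.0]}

assert shared_prefix_length(semantic_ids["a"], semantic_ids["b"]) == 2
for n in (1, 2, 3):
    print(n, avg_similarity_by_prefix(semantic_ids, embeddings, n))
```

In this toy data no two videos share a prefix of length 3, so the function returns `None` there. On a real corpus, exhaustive pairwise comparison inside large sub-trie groups would be expensive, so sampling pairs is a likely practical substitute.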
