-
Learning to Collide: Recommendation System Model Compression with Learned Hash Functions
Authors:
Benjamin Ghaemmaghami,
Mustafa Ozdal,
Rakesh Komuravelli,
Dmitriy Korchev,
Dheevatsa Mudigere,
Krishnakumar Nair,
Maxim Naumov
Abstract:
A key characteristic of deep recommendation models is the immense memory requirements of their embedding tables. These embedding tables can often reach hundreds of gigabytes which increases hardware requirements and training cost. A common technique to reduce model size is to hash all of the categorical variable identifiers (ids) into a smaller space. This hashing reduces the number of unique repr…
▽ More
A key characteristic of deep recommendation models is the immense memory requirements of their embedding tables. These embedding tables can often reach hundreds of gigabytes which increases hardware requirements and training cost. A common technique to reduce model size is to hash all of the categorical variable identifiers (ids) into a smaller space. This hashing reduces the number of unique representations that must be stored in the embedding table; thus decreasing its size. However, this approach introduces collisions between semantically dissimilar ids that degrade model quality. We introduce an alternative approach, Learned Hash Functions, which instead learns a new map** function that encourages collisions between semantically similar ids. We derive this learned map** from historical data and embedding access patterns. We experiment with this technique on a production model and find that a map** informed by the combination of access frequency and a learned low dimension embedding is the most effective. We demonstrate a small improvement relative to the hashing trick and other collision related compression techniques. This is ongoing work that explores the impact of categorical id collisions on recommendation model quality and how those collisions may be controlled to improve model performance.
△ Less
Submitted 28 March, 2022;
originally announced March 2022.
-
Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
Authors:
Mark Zhao,
Niket Agarwal,
Aarti Basant,
Bugra Gedik,
Satadru Pan,
Mustafa Ozdal,
Rakesh Komuravelli,
Jerry Pan,
Tianshu Bao,
Haowei Lu,
Sundaram Narayanan,
Jack Langman,
Kevin Wilfong,
Harsha Rastogi,
Carole-Jean Wu,
Christos Kozyrakis,
Parik Pol
Abstract:
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSA) are used to train increasingly-complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipe…
▽ More
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSA) are used to train increasingly-complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipeline is becoming the dominating factor that constrains the overall training performance and capacity. Innovations that improve the efficiency and performance of DSI systems and hardware are urgent, demanding a deep understanding of DSI characteristics and infrastructure at scale.
This paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that scales to eliminate data stalls. We characterize how hundreds of models are collaboratively trained across geo-distributed datacenters via diverse and continuous training jobs. These training jobs read and heavily filter massive and evolving datasets, resulting in popular features and samples used across training jobs. We measure the intense network, memory, and compute resources required by each training job to preprocess samples during training. Finally, we synthesize key takeaways based on our production infrastructure characterization. These include identifying hardware bottlenecks, discussing opportunities for heterogeneous DSI hardware, motivating research in datacenter scheduling and benchmark datasets, and assimilating lessons learned in optimizing DSI infrastructure.
△ Less
Submitted 22 April, 2022; v1 submitted 20 August, 2021;
originally announced August 2021.
-
Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models
Authors:
Dheevatsa Mudigere,
Yuchen Hao,
Jianyu Huang,
Zhihao Jia,
Andrew Tulloch,
Srinivas Sridharan,
Xing Liu,
Mustafa Ozdal,
Jade Nie,
Jongsoo Park,
Liang Luo,
Jie Amy Yang,
Leon Gao,
Dmytro Ivchenko,
Aarti Basant,
Yuxi Hu,
Jiyan Yang,
Ehsan K. Ardestani,
Xiaodong Wang,
Rakesh Komuravelli,
Ching-Hsiang Chu,
Serhat Yilmaz,
Huayu Li,
Jiyuan Qian,
Zhuobo Feng
, et al. (28 additional authors not shown)
Abstract:
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pa…
▽ More
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 Trillion parameters and show that we can attain 40X speedup in terms of time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism (iii) develo** sharding algorithms capable of hierarchical partitioning of the embedding tables along row, column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates (v) leveraging reduced precision communications, multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for the robust and efficient end-to-end training in production environments.
△ Less
Submitted 26 February, 2023; v1 submitted 11 April, 2021;
originally announced April 2021.
-
Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems
Authors:
Maxim Naumov,
John Kim,
Dheevatsa Mudigere,
Srinivas Sridharan,
Xiaodong Wang,
Whitney Zhao,
Serhat Yilmaz,
Changkyu Kim,
Hector Yuen,
Mustafa Ozdal,
Krishnakumar Nair,
Isabel Gao,
Bor-Yiing Su,
Jiyan Yang,
Mikhail Smelyanskiy
Abstract:
Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers. Recommendation models present uniq…
▽ More
Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers. Recommendation models present unique challenges in training because they exercise not only compute but also memory capacity as well as memory and network bandwidth. As model size and complexity increase, efficiently scaling training becomes a challenge. To address it we design Zion - Facebook's next-generation large-memory training platform that consists of both CPUs and accelerators. Also, we discuss the design requirements of future scale-out training systems.
△ Less
Submitted 18 August, 2020; v1 submitted 20 March, 2020;
originally announced March 2020.