-
DataComp-LM: In search of the next generation of training sets for language models
Authors:
Jeffrey Li,
Alex Fang,
Georgios Smyrnis,
Maor Ivgi,
Matt Jordan,
Samir Gadre,
Hritik Bansal,
Etash Guha,
Sedrick Keh,
Kushal Arora,
Saurabh Garg,
Rui Xin,
Niklas Muennighoff,
Reinhard Heckel,
Jean Mercat,
Mayee Chen,
Suchin Gururangan,
Mitchell Wortsman,
Alon Albalak,
Yonatan Bitton,
Marianna Nezhurina,
Amro Abbas,
Cheng-Yu Hsieh,
Dhruba Ghosh,
Josh Gardner
, et al. (34 additional authors not shown)
Abstract:
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with dat…
▽ More
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
△ Less
Submitted 20 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Authors:
Hadi Pouransari,
Chun-Liang Li,
Jen-Hao Rick Chang,
Pavan Kumar Anasosalu Vasu,
Cem Koc,
Vaishaal Shankar,
Oncel Tuzel
Abstract:
Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length. However, this method of concatenation can lead to cross-document attention within a sequence, which is neither a desirable learning signal n…
▽ More
Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length. However, this method of concatenation can lead to cross-document attention within a sequence, which is neither a desirable learning signal nor computationally efficient. Additionally, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable sequence length training technique, to tackle these challenges. We decompose a dataset into a union of buckets, each containing sequences of the same size extracted from a unique document. During training, we use variable sequence length and batch size, sampling simultaneously from all buckets with a curriculum. In contrast to the concat-and-chunk baseline, which incurs a fixed attention cost at every step of training, our proposed method incurs a penalty proportional to the actual document lengths at each step, resulting in significant savings in training time. We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks, reaching target accuracy 3x faster compared to the baseline. Our method not only enables efficient pretraining on long sequences but also scales effectively with dataset size. Lastly, we shed light on a critical yet less studied aspect of training large language models: the distribution and curriculum of sequence lengths, which results in a non-negligible difference in performance.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
CLIP with Quality Captions: A Strong Pretraining for Vision Tasks
Authors:
Pavan Kumar Anasosalu Vasu,
Hadi Pouransari,
Fartash Faghri,
Oncel Tuzel
Abstract:
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models was introduced to mitigate the weak performance of CLIP on downstream tasks.…
▽ More
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models was introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvement on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods. We show that when CLIP model with ViT-B/16 as image encoder is trained on well aligned image-text pairs it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks over recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains similar performance as Swin-L, pretrained on ImageNet-22k for semantic segmentation task while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in $10\times$ data efficiency when finetuning for dense prediction tasks.
△ Less
Submitted 14 May, 2024;
originally announced May 2024.
-
Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models
Authors:
Raviteja Vemulapalli,
Hadi Pouransari,
Fartash Faghri,
Sachin Mehta,
Mehrdad Farajtabar,
Mohammad Rastegari,
Oncel Tuzel
Abstract:
Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high inference compute cost, these models cannot be deployed for many real-world applications. Motivated by this, we ask the following important question, "How can we leverage the knowledge from a large VFM to…
▽ More
Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high inference compute cost, these models cannot be deployed for many real-world applications. Motivated by this, we ask the following important question, "How can we leverage the knowledge from a large VFM to train a small task-specific model for a new target task with limited labeled training data?", and propose a simple task-oriented knowledge transfer approach as a highly effective solution to this problem. Our experimental results on five target tasks show that the proposed approach outperforms task-agnostic VFM distillation, web-scale CLIP pretraining, supervised ImageNet pretraining, and self-supervised DINO pretraining by up to 11.6%, 22.1%, 13.7%, and 29.8%, respectively. Furthermore, the proposed approach also demonstrates up to 9x, 4x and 15x reduction in pretraining compute cost when compared to task-agnostic VFM distillation, ImageNet pretraining and DINO pretraining, respectively, while outperforming them. We also show that the dataset used for transferring knowledge has a significant effect on the final target task performance, and introduce a retrieval-augmented knowledge transfer strategy that uses web-scale image retrieval to curate effective transfer sets.
△ Less
Submitted 1 July, 2024; v1 submitted 29 November, 2023;
originally announced November 2023.
-
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Authors:
Pavan Kumar Anasosalu Vasu,
Hadi Pouransari,
Fartash Faghri,
Raviteja Vemulapalli,
Oncel Tuzel
Abstract:
Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work, we introduce MobileCLIP -- a new family of ef…
▽ More
Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work, we introduce MobileCLIP -- a new family of efficient image-text models optimized for runtime performance along with a novel and efficient training approach, namely multi-modal reinforced training. The proposed training approach leverages knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset. MobileCLIP sets a new state-of-the-art latency-accuracy tradeoff for zero-shot classification and retrieval tasks on several datasets. Our MobileCLIP-S2 variant is 2.3$\times$ faster while more accurate compared to previous best CLIP model based on ViT-B/16. We further demonstrate the effectiveness of our multi-modal reinforced training by training a CLIP model based on ViT-B/16 image backbone and achieving +2.9% average performance improvement on 38 evaluation benchmarks compared to the previous best. Moreover, we show that the proposed approach achieves 10$\times$-1000$\times$ improved learning efficiency when compared with non-reinforced CLIP training. Code and models are available at https://github.com/apple/ml-mobileclip .
△ Less
Submitted 1 April, 2024; v1 submitted 28 November, 2023;
originally announced November 2023.
-
TiC-CLIP: Continual Training of CLIP Models
Authors:
Saurabh Garg,
Mehrdad Farajtabar,
Hadi Pouransari,
Raviteja Vemulapalli,
Sachin Mehta,
Oncel Tuzel,
Vaishaal Shankar,
Fartash Faghri
Abstract:
Kee** large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language mode…
▽ More
Kee** large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps. TiC-DataComp, our largest dataset, contains over 12.7B timestamped image-text pairs spanning 9 years (2014-2022). We first use our benchmarks to curate various dynamic evaluations to measure temporal robustness of existing models. We show OpenAI's CLIP (trained on data up to 2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from 2021-2022 compared with more recently trained models in OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by $2.5\times$ when compared to the standard practice of retraining from scratch. Code is available at https://github.com/apple/ml-tic-clip.
△ Less
Submitted 21 March, 2024; v1 submitted 24 October, 2023;
originally announced October 2023.
-
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
Authors:
Haoxiang Wang,
Pavan Kumar Anasosalu Vasu,
Fartash Faghri,
Raviteja Vemulapalli,
Mehrdad Farajtabar,
Sachin Mehta,
Mohammad Rastegari,
Oncel Tuzel,
Hadi Pouransari
Abstract:
The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficient…
▽ More
The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual learning, and distillation. Further, it demands significantly less computational cost compared to traditional multi-task training from scratch, and it only needs a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer. Compared with deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs for inference, making it well-suited for edge device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.
△ Less
Submitted 10 June, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
Authors:
Mohammadreza Salehi,
Mehrdad Farajtabar,
Maxwell Horton,
Fartash Faghri,
Hadi Pouransari,
Raviteja Vemulapalli,
Oncel Tuzel,
Ali Farhadi,
Mohammad Rastegari,
Sachin Mehta
Abstract:
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual represent…
▽ More
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated and noisy image-text dataset. Subsequently, we train CLIP models on these pseudo-labels in addition to the contrastive training on image and text pairs. This simple setup shows substantial improvements of up to 16.3% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these enhancements are achieved without compromising CLIP's existing capabilities, including its proficiency in promptable zero-shot classification.
△ Less
Submitted 21 October, 2023;
originally announced October 2023.
-
Frequency-Aware Masked Autoencoders for Multimodal Pretraining on Biosignals
Authors:
Ran Liu,
Ellen L. Zippi,
Hadi Pouransari,
Chris Sandino,
**g** Nie,
Hanlin Goh,
Erdrin Azemi,
Ali Moin
Abstract:
Leveraging multimodal information from biosignals is vital for building a comprehensive representation of people's physical and mental states. However, multimodal biosignals often exhibit substantial distributional shifts between pretraining and inference datasets, stemming from changes in task specification or variations in modality compositions. To achieve effective pretraining in the presence o…
▽ More
Leveraging multimodal information from biosignals is vital for building a comprehensive representation of people's physical and mental states. However, multimodal biosignals often exhibit substantial distributional shifts between pretraining and inference datasets, stemming from changes in task specification or variations in modality compositions. To achieve effective pretraining in the presence of potential distributional shifts, we propose a frequency-aware masked autoencoder ($\texttt{bio}$FAME) that learns to parameterize the representation of biosignals in the frequency space. $\texttt{bio}$FAME incorporates a frequency-aware transformer, which leverages a fixed-size Fourier-based operator for global token mixing, independent of the length and sampling rate of inputs. To maintain the frequency components within each input channel, we further employ a frequency-maintain pretraining strategy that performs masked autoencoding in the latent space. The resulting architecture effectively utilizes multimodal information during pretraining, and can be seamlessly adapted to diverse tasks and modalities at test time, regardless of input size and order. We evaluated our approach on a diverse set of transfer experiments on unimodal time series, achieving an average of $\uparrow$5.5% improvement in classification accuracy over the previous state-of-the-art. Furthermore, we demonstrated that our architecture is robust in modality mismatch scenarios, including unpredicted modality dropout or substitution, proving its practical utility in real-world applications. Code is available at https://github.com/apple/ml-famae .
△ Less
Submitted 18 April, 2024; v1 submitted 11 September, 2023;
originally announced September 2023.
-
Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement
Authors:
Fartash Faghri,
Hadi Pouransari,
Sachin Mehta,
Mehrdad Farajtabar,
Ali Farhadi,
Mohammad Rastegari,
Oncel Tuzel
Abstract:
We propose Dataset Reinforcement, a strategy to improve a dataset once such that the accuracy of any model architecture trained on the reinforced dataset is improved at no additional training cost for users. We propose a Dataset Reinforcement strategy based on data augmentation and knowledge distillation. Our generic strategy is designed based on extensive analysis across CNN- and transformer-base…
▽ More
We propose Dataset Reinforcement, a strategy to improve a dataset once such that the accuracy of any model architecture trained on the reinforced dataset is improved at no additional training cost for users. We propose a Dataset Reinforcement strategy based on data augmentation and knowledge distillation. Our generic strategy is designed based on extensive analysis across CNN- and transformer-based models and performing large-scale study of distillation with state-of-the-art models with various data augmentations. We create a reinforced version of the ImageNet training dataset, called ImageNet+, as well as reinforced datasets CIFAR-100+, Flowers-102+, and Food-101+. Models trained with ImageNet+ are more accurate, robust, and calibrated, and transfer well to downstream tasks (e.g., segmentation and detection). As an example, the accuracy of ResNet-50 improves by 1.7% on the ImageNet validation set, 3.5% on ImageNetV2, and 10.0% on ImageNet-R. Expected Calibration Error (ECE) on the ImageNet validation set is also reduced by 9.9%. Using this backbone with Mask-RCNN for object detection on MS-COCO, the mean average precision improves by 0.8%. We reach similar gains for MobileNets, ViTs, and Swin-Transformers. For MobileNetV3 and Swin-Tiny, we observe significant improvements on ImageNet-R/A/C of up to 20% improved robustness. Models pretrained on ImageNet+ and fine-tuned on CIFAR-100+, Flowers-102+, and Food-101+, reach up to 3.4% improved accuracy. The code, datasets, and pretrained models are available at https://github.com/apple/ml-dr.
△ Less
Submitted 22 September, 2023; v1 submitted 15 March, 2023;
originally announced March 2023.
-
FastFill: Efficient Compatible Model Update
Authors:
Florian Jaeckle,
Fartash Faghri,
Ali Farhadi,
Oncel Tuzel,
Hadi Pouransari
Abstract:
In many retrieval systems the original high dimensional data (e.g., images) is mapped to a lower dimensional feature through a learned embedding model. The task of retrieving the most similar data from a gallery set to a given query data is performed through a similarity comparison on features. When the embedding model is updated, it might produce features that are not comparable/compatible with f…
▽ More
In many retrieval systems the original high dimensional data (e.g., images) is mapped to a lower dimensional feature through a learned embedding model. The task of retrieving the most similar data from a gallery set to a given query data is performed through a similarity comparison on features. When the embedding model is updated, it might produce features that are not comparable/compatible with features already in the gallery computed with the old model. Subsequently, all features in the gallery need to be re-computed using the new embedding model -- a computationally expensive process called backfilling. Recently, compatible representation learning methods have been proposed to avoid backfilling. Despite their relative success, there is an inherent trade-off between the new model performance and its compatibility with the old model. In this work, we introduce FastFill: a compatible model update process using feature alignment and policy based partial backfilling to promptly elevate retrieval performance. We show that previous backfilling strategies suffer from decreased performance and demonstrate the importance of both the training objective and the ordering in online partial backfilling. We propose a new training method for feature alignment between old and new embedding models using uncertainty estimation. Compared to previous works, we obtain significantly improved backfilling results on a variety of datasets: mAP on ImageNet (+4.4\%), Places-365 (+2.7\%), and VGG-Face2 (+1.3\%). Further, we demonstrate that when updating a biased model with FastFill, the minority subgroup accuracy gap promptly vanishes with a small fraction of partial backfilling.
△ Less
Submitted 8 March, 2023;
originally announced March 2023.
-
APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations
Authors:
Elan Rosenfeld,
Preetum Nakkiran,
Hadi Pouransari,
Oncel Tuzel,
Fartash Faghri
Abstract:
Recent advances in learning aligned multimodal representations have been primarily driven by training large neural networks on massive, noisy paired-modality datasets. In this work, we ask whether it is possible to achieve similar results with substantially less training time and data. We achieve this by taking advantage of existing pretrained unimodal encoders and careful curation of alignment da…
▽ More
Recent advances in learning aligned multimodal representations have been primarily driven by training large neural networks on massive, noisy paired-modality datasets. In this work, we ask whether it is possible to achieve similar results with substantially less training time and data. We achieve this by taking advantage of existing pretrained unimodal encoders and careful curation of alignment data relevant to the downstream task of interest. We study a natural approach to aligning existing encoders via small auxiliary functions, and we find that this method is competitive with (or outperforms) state of the art in many settings while being less prone to overfitting, less costly to train, and more robust to distribution shift. With a properly chosen alignment distribution, our method surpasses prior state of the art for ImageNet zero-shot classification on public data while using two orders of magnitude less time and data and training 77% fewer parameters.
△ Less
Submitted 8 October, 2022;
originally announced October 2022.
-
Forward Compatible Training for Large-Scale Embedding Retrieval Systems
Authors:
Vivek Ramanujan,
Pavan Kumar Anasosalu Vasu,
Ali Farhadi,
Oncel Tuzel,
Hadi Pouransari
Abstract:
In visual retrieval systems, updating the embedding model requires recomputing features for every piece of data. This expensive process is referred to as backfilling. Recently, the idea of backward compatible training (BCT) was proposed. To avoid the cost of backfilling, BCT modifies training of the new model to make its representations compatible with those of the old model. However, BCT can sign…
▽ More
In visual retrieval systems, updating the embedding model requires recomputing features for every piece of data. This expensive process is referred to as backfilling. Recently, the idea of backward compatible training (BCT) was proposed. To avoid the cost of backfilling, BCT modifies training of the new model to make its representations compatible with those of the old model. However, BCT can significantly hinder the performance of the new model. In this work, we propose a new learning paradigm for representation learning: forward compatible training (FCT). In FCT, when the old model is trained, we also prepare for a future unknown version of the model. We propose learning side-information, an auxiliary feature for each sample which facilitates future updates of the model. To develop a powerful and flexible framework for model compatibility, we combine side-information with a forward transformation from old to new embeddings. Training of the new model is not modified, hence, its accuracy is not degraded. We demonstrate significant retrieval accuracy improvement compared to BCT for various datasets: ImageNet-1k (+18.1%), Places-365 (+5.4%), and VGG-Face2 (+8.3%). FCT obtains model compatibility when the new and old models are trained across different datasets, losses, and architectures.
△ Less
Submitted 29 March, 2022; v1 submitted 6 December, 2021;
originally announced December 2021.
-
Extracurricular Learning: Knowledge Transfer Beyond Empirical Distribution
Authors:
Hadi Pouransari,
Mojan Javaheripi,
Vinay Sharma,
Oncel Tuzel
Abstract:
Knowledge distillation has been used to transfer knowledge learned by a sophisticated model (teacher) to a simpler model (student). This technique is widely used to compress model complexity. However, in most applications the compressed student model suffers from an accuracy gap with its teacher. We propose extracurricular learning, a novel knowledge distillation method, that bridges this gap by (…
▽ More
Knowledge distillation has been used to transfer knowledge learned by a sophisticated model (teacher) to a simpler model (student). This technique is widely used to compress model complexity. However, in most applications the compressed student model suffers from an accuracy gap with its teacher. We propose extracurricular learning, a novel knowledge distillation method, that bridges this gap by (1) modeling student and teacher output distributions; (2) sampling examples from an approximation to the underlying data distribution; and (3) matching student and teacher output distributions over this extended set including uncertain samples. We conduct rigorous evaluations on regression and classification tasks and show that compared to the standard knowledge distillation, extracurricular learning reduces the gap by 46% to 68%. This leads to major accuracy improvements compared to the empirical risk minimization-based training for various recent neural network architectures: 16% regression error reduction on the MPIIGaze dataset, +3.4% to +9.1% improvement in top-1 classification accuracy on the CIFAR100 dataset, and +2.9% top-1 improvement on the ImageNet dataset.
△ Less
Submitted 20 November, 2020; v1 submitted 30 June, 2020;
originally announced July 2020.
-
Least squares binary quantization of neural networks
Authors:
Hadi Pouransari,
Zhucheng Tu,
Oncel Tuzel
Abstract:
Quantizing weights and activations of deep neural networks results in significant improvement in inference efficiency at the cost of lower accuracy. A source of the accuracy gap between full precision and quantized models is the quantization error. In this work, we focus on the binary quantization, in which values are mapped to -1 and 1. We provide a unified framework to analyze different scaling…
▽ More
Quantizing weights and activations of deep neural networks results in significant improvement in inference efficiency at the cost of lower accuracy. A source of the accuracy gap between full precision and quantized models is the quantization error. In this work, we focus on the binary quantization, in which values are mapped to -1 and 1. We provide a unified framework to analyze different scaling strategies. Inspired by the pareto-optimality of 2-bits versus 1-bit quantization, we introduce a novel 2-bits quantization with provably least squares error. Our quantization algorithms can be implemented efficiently on the hardware using bitwise operations. We present proofs to show that our proposed methods are optimal, and also provide empirical error analysis. We conduct experiments on the ImageNet dataset and show a reduced accuracy gap when using the proposed least squares quantization algorithms.
△ Less
Submitted 13 June, 2020; v1 submitted 8 January, 2020;
originally announced January 2020.
-
Democratizing Production-Scale Distributed Deep Learning
Authors:
Minghuang Ma,
Hadi Pouransari,
Daniel Chao,
Saurabh Adya,
Santiago Akle Serrano,
Yi Qin,
Dan Gimnicher,
Dominic Walsh
Abstract:
The interest and demand for training deep neural networks have been experiencing rapid growth, spanning a wide range of applications in both academia and industry. However, training them distributed and at scale remains difficult due to the complex ecosystem of tools and hardware involved. One consequence is that the responsibility of orchestrating these complex components is often left to one-off…
▽ More
The interest and demand for training deep neural networks have been experiencing rapid growth, spanning a wide range of applications in both academia and industry. However, training them distributed and at scale remains difficult due to the complex ecosystem of tools and hardware involved. One consequence is that the responsibility of orchestrating these complex components is often left to one-off scripts and glue code customized for specific problems. To address these restrictions, we introduce \emph{Alchemist} - an internal service built at Apple from the ground up for \emph{easy}, \emph{fast}, and \emph{scalable} distributed training. We discuss its design, implementation, and examples of running different flavors of distributed training. We also present case studies of its internal adoption in the development of autonomous systems, where training times have been reduced by 10x to keep up with the ever-growing data collection.
△ Less
Submitted 3 November, 2018; v1 submitted 31 October, 2018;
originally announced November 2018.
-
A distributed-memory hierarchical solver for general sparse linear systems
Authors:
Chao Chen,
Hadi Pouransari,
Sivasankaran Rajamanickam,
Erik G. Boman,
Eric Darve
Abstract:
We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct s…
▽ More
We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by every processor. We present various numerical results to demonstrate the versatility and scalability of the parallel algorithm.
△ Less
Submitted 19 December, 2017;
originally announced December 2017.
-
Particle-to-fluid heat transfer in particle-laden turbulence
Authors:
Hadi Pouransari,
Ali Mani
Abstract:
Preferential concentration of inertial particles by turbulence is a well recognized phenomenon. This study investigates how this phenomenon impacts the mean heat transfer between the fluid phase and the particle phase. Using direct numerical simulations of homogeneous and isotropic turbulent flows coupled with Lagrangian point particle tracking, we explore this phenomenon over wide range of input…
▽ More
Preferential concentration of inertial particles by turbulence is a well recognized phenomenon. This study investigates how this phenomenon impacts the mean heat transfer between the fluid phase and the particle phase. Using direct numerical simulations of homogeneous and isotropic turbulent flows coupled with Lagrangian point particle tracking, we explore this phenomenon over wide range of input parameters. Among the nine independent dimensionless numbers defining this problem, we show that particle Stokes number, defined based on large eddy time, and a new identified number called heat mixing parameter have the most significant effect on particle to gas heat transfer, while variation in other non-dimensional numbers can be ignored. An investigation of regimes with significant particle mass loading, suggests that the mean heat transfer from particles to gas is hardly affected by momentum two-way coupling. Using our numerical results we propose an algebraic reduced order model for heat transfer in particle-laden turbulence.
△ Less
Submitted 6 July, 2018; v1 submitted 3 October, 2017;
originally announced October 2017.
-
Sparse Hierarchical Solvers with Guaranteed Convergence
Authors:
Kai Yang,
Hadi Pouransari,
Eric Darve
Abstract:
Solving sparse linear systems from discretized PDEs is challenging. Direct solvers have in many cases quadratic complexity (depending on geometry), while iterative solvers require problem dependent preconditioners to be robust and efficient. Approximate factorization preconditioners, such as incomplete LU factorization, provide cheap approximations to the system matrix. However, even a highly accu…
▽ More
Solving sparse linear systems from discretized PDEs is challenging. Direct solvers have in many cases quadratic complexity (depending on geometry), while iterative solvers require problem dependent preconditioners to be robust and efficient. Approximate factorization preconditioners, such as incomplete LU factorization, provide cheap approximations to the system matrix. However, even a highly accurate preconditioner may have deteriorating performance when the condition number of the system matrix increases. By increasing the accuracy on low-frequency errors, we propose a novel hierarchical solver with improved robustness with respect to the condition number of the linear system. This solver retains the linear computational cost and memory footprint of the original algorithm.
△ Less
Submitted 12 March, 2017; v1 submitted 10 November, 2016;
originally announced November 2016.
-
Parallel variable-density particle-laden turbulence simulation
Authors:
Hadi Pouransari,
Milad Mortazavi,
Ali Mani
Abstract:
We have developed a fully parallel C++/MPI based simulation code for variable-density particle-laden turbulent flows. The fluid is represented through a uniform Eulerian staggered grid, while particles are modeled using a Lagrangian point-particle framework. Spatial discretization is second-order accurate, and time integration has a fourth-order accuracy. Two-way coupling of the particles with the…
▽ More
We have developed a fully parallel C++/MPI based simulation code for variable-density particle-laden turbulent flows. The fluid is represented through a uniform Eulerian staggered grid, while particles are modeled using a Lagrangian point-particle framework. Spatial discretization is second-order accurate, and time integration has a fourth-order accuracy. Two-way coupling of the particles with the background flow is considered in both momentum and energy equations. The code is fully modular and abstracted, and easily can be extended or modified. We have considered two different boundary conditions. We have also developed a novel parallel linear solver for the variable density Poisson equation that arises in the calculation.
△ Less
Submitted 20 January, 2016;
originally announced January 2016.
-
Fast hierarchical solvers for sparse matrices using extended sparsification and low-rank approximation
Authors:
Hadi Pouransari,
Pieter Coulier,
Eric Darve
Abstract:
Inversion of sparse matrices with standard direct solve schemes is robust, but computationally expensive. Iterative solvers, on the other hand, demonstrate better scalability; but, need to be used with an appropriate preconditioner (e.g., ILU, AMG, Gauss-Seidel, etc.) for proper convergence. The choice of an effective preconditioner is highly problem dependent. We propose a novel fully algebraic s…
▽ More
Inversion of sparse matrices with standard direct solve schemes is robust, but computationally expensive. Iterative solvers, on the other hand, demonstrate better scalability; but, need to be used with an appropriate preconditioner (e.g., ILU, AMG, Gauss-Seidel, etc.) for proper convergence. The choice of an effective preconditioner is highly problem dependent. We propose a novel fully algebraic sparse matrix solve algorithm, which has linear complexity with the problem size. Our scheme is based on the Gauss elimination. For a given matrix, we approximate the LU factorization with a tunable accuracy determined a priori. This method can be used as a stand-alone direct solver with linear complexity and tunable accuracy, or it can be used as a black-box preconditioner in conjunction with iterative methods such as GMRES. The proposed solver is based on the low-rank approximation of fill-ins generated during the elimination. Similar to H-matrices, fill-ins corresponding to blocks that are well-separated in the adjacency graph are represented via a hierarchical structure. The linear complexity of the algorithm is guaranteed if the blocks corresponding to well-separated clusters of variables are numerically low-rank.
△ Less
Submitted 14 December, 2016; v1 submitted 26 October, 2015;
originally announced October 2015.
-
Optimizing the adaptive fast multipole method for fractal sets
Authors:
Hadi Pouransari,
Eric Darve
Abstract:
We have performed a detailed analysis of the fast multipole method (FMM) in the adaptive case, in which the depth of the FMM tree is non-uniform. Previous works in this area have focused mostly on special types of adaptive distributions, for example when points accumulate on a 2D manifold or accumulate around a few points in space. Instead, we considered a more general situation in which fractal s…
▽ More
We have performed a detailed analysis of the fast multipole method (FMM) in the adaptive case, in which the depth of the FMM tree is non-uniform. Previous works in this area have focused mostly on special types of adaptive distributions, for example when points accumulate on a 2D manifold or accumulate around a few points in space. Instead, we considered a more general situation in which fractal sets, e.g., Cantor sets and generalizations, are used to create adaptive sets of points. Such sets are characterized by their dimension, a number between 0 and 3. We introduced a mathematical framework to define a converging sequence of octrees, and based on that, demonstrated how to increase $N \to \infty$.
A new complexity analysis for the adaptive FMM is introduced. It is shown that the ${\cal{O}}(N)$ complexity is achievable for any distribution of particles, when a modified adaptive FMM is exploited. We analyzed how the FMM performs for fractal point distributions, and how optimal parameters can be picked, e.g., the criterion used to stop the subdivision of an FMM cell. A new subdividing double-threshold method is introduced, and better performance demonstrated. Parameters in the FMM are modeled as a function of particle distribution dimension, and the optimal values are obtained. A three dimensional kernel independent black box adaptive FMM is implemented and used for all calculations.
△ Less
Submitted 11 August, 2015;
originally announced August 2015.
-
The inverse fast multipole method: using a fast approximate direct solver as a preconditioner for dense linear systems
Authors:
Pieter Coulier,
Hadi Pouransari,
Eric Darve
Abstract:
Although some preconditioners are available for solving dense linear systems, there are still many matrices for which preconditioners are lacking, in particular in cases where the size of the matrix $N$ becomes very large. There remains hence a great need to develop general purpose preconditioners whose cost scales well with the matrix size $N$. In this paper, we propose a preconditioner with broa…
▽ More
Although some preconditioners are available for solving dense linear systems, there are still many matrices for which preconditioners are lacking, in particular in cases where the size of the matrix $N$ becomes very large. There remains hence a great need to develop general purpose preconditioners whose cost scales well with the matrix size $N$. In this paper, we propose a preconditioner with broad applicability and with cost $\mathcal{O}(N)$ for dense matrices, when the matrix is given by a smooth kernel. Extending the method using the same framework to general $\mathcal{H}^2$-matrices is relatively straightforward. These preconditioners have a controlled accuracy (machine accuracy can be achieved if needed) and scale linearly with $N$. They are based on an approximate direct solve of the system. The linear scaling of the algorithm is achieved by means of two key ideas. First, the $\mathcal{H}^2$-structure of the dense matrix is exploited to obtain an extended sparse system of equations. Second, fill-ins arising when performing the elimination are compressed as low-rank matrices if they correspond to well-separated interactions. This ensures that the sparsity pattern of the extended sparse matrix is preserved throughout the elimination, hence resulting in a very efficient algorithm with $\mathcal{O}(N \log(1/\varepsilon)^2 )$ computational cost and $\mathcal{O}(N \log 1/\varepsilon )$ memory requirement, for an error tolerance $0 < \varepsilon < 1$. The solver is inexact, although the error can be controlled and made as small as needed. These solvers are related to ILU in the sense that the fill-in is controlled. However, in ILU, most of the fill-in is simply discarded whereas here it is approximated using low-rank blocks, with a prescribed tolerance. Numerical examples are discussed to demonstrate the linear scaling of the method and to illustrate its effectiveness as a preconditioner.
△ Less
Submitted 4 February, 2016; v1 submitted 7 August, 2015;
originally announced August 2015.