Search | arXiv e-print repository

Simulated Overparameterization

Authors: Hanna Mazzawi, Pranjal Awasthi, Xavi Gonzalvo, Srikumar Ramalingam

Abstract: In this work, we introduce a novel paradigm called Simulated Overparametrization (SOP). SOP merges the computational efficiency of compact models with the advanced learning proficiencies of overparameterized models. SOP proposes a unique approach to model training and inference, where a model with a significantly larger number of parameters is trained in such a way that a smaller, efficient subset… ▽ More In this work, we introduce a novel paradigm called Simulated Overparametrization (SOP). SOP merges the computational efficiency of compact models with the advanced learning proficiencies of overparameterized models. SOP proposes a unique approach to model training and inference, where a model with a significantly larger number of parameters is trained in such a way that a smaller, efficient subset of these parameters is used for the actual computation during inference. Building upon this framework, we present a novel, architecture agnostic algorithm called "majority kernels", which seamlessly integrates with predominant architectures, including Transformer models. Majority kernels enables the simulated training of overparameterized models, resulting in performance gains across architectures and tasks. Furthermore, our approach adds minimal overhead to the cost incurred (wall clock time) at training time. The proposed approach shows strong performance on a wide variety of datasets and models, even outperforming strong baselines such as combinatorial optimization methods based on submodular optimization. △ Less

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2306.11903 [pdf, other]

Deep Fusion: Efficient Network Training via Pre-trained Initializations

Authors: Hanna Mazzawi, Xavi Gonzalvo, Michael Wunder, Sammy Jerome, Benoit Dherin

Abstract: In recent years, deep learning has made remarkable progress in a wide range of domains, with a particularly notable impact on natural language processing tasks. One of the challenges associated with training deep neural networks in the context of LLMs is the need for large amounts of computational resources and time. To mitigate this, network growing algorithms offer potential cost savings, but th… ▽ More In recent years, deep learning has made remarkable progress in a wide range of domains, with a particularly notable impact on natural language processing tasks. One of the challenges associated with training deep neural networks in the context of LLMs is the need for large amounts of computational resources and time. To mitigate this, network growing algorithms offer potential cost savings, but their underlying mechanisms are poorly understood. We present two notable contributions in this paper. First, we present Deep Fusion, an efficient approach to network training that leverages pre-trained initializations of smaller networks. Second, we propose a theoretical framework using backward error analysis to illustrate the dynamics of mid-training network growth. Our experiments show how Deep Fusion is a practical and effective approach that not only accelerates the training process but also reduces computational requirements, maintaining or surpassing traditional training methods' performance in various NLP tasks and T5 model sizes. Finally, we validate our theoretical framework, which guides the optimal use of Deep Fusion, showing that with carefully optimized training dynamics, it significantly reduces both training time and resource consumption. △ Less

Submitted 26 June, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

arXiv:1906.01550 [pdf, other]

Towards Task and Architecture-Independent Generalization Gap Predictors

Authors: Scott Yak, Javier Gonzalvo, Hanna Mazzawi

Abstract: Can we use deep learning to predict when deep learning works? Our results suggest the affirmative. We created a dataset by training 13,500 neural networks with different architectures, on different variations of spiral datasets, and using different optimization parameters. We used this dataset to train task-independent and architecture-independent generalization gap predictors for those neural net… ▽ More Can we use deep learning to predict when deep learning works? Our results suggest the affirmative. We created a dataset by training 13,500 neural networks with different architectures, on different variations of spiral datasets, and using different optimization parameters. We used this dataset to train task-independent and architecture-independent generalization gap predictors for those neural networks. We extend Jiang et al. (2018) to also use DNNs and RNNs and show that they outperform the linear model, obtaining $R^2=0.965$. We also show results for architecture-independent, task-independent, and out-of-distribution generalization gap prediction tasks. Both DNNs and RNNs consistently and significantly outperform linear models, with RNNs obtaining $R^2=0.584$. △ Less

Submitted 4 June, 2019; originally announced June 2019.

Comments: 8 pages, 6 figures, 2 tables. To be presented at ICML 2019 "Understanding and Improving Generalization in Deep Learning" Workshop (poster)

arXiv:1905.00080 [pdf, other]

AdaNet: A Scalable and Flexible Framework for Automatically Learning Ensembles

Authors: Charles Weill, Javier Gonzalvo, Vitaly Kuznetsov, Scott Yang, Scott Yak, Hanna Mazzawi, Eugen Hotaj, Ghassen Jerfel, Vladimir Macko, Ben Adlam, Mehryar Mohri, Corinna Cortes

Abstract: AdaNet is a lightweight TensorFlow-based (Abadi et al., 2015) framework for automatically learning high-quality ensembles with minimal expert intervention. Our framework is inspired by the AdaNet algorithm (Cortes et al., 2017) which learns the structure of a neural network as an ensemble of subnetworks. We designed it to: (1) integrate with the existing TensorFlow ecosystem, (2) offer sensible de… ▽ More AdaNet is a lightweight TensorFlow-based (Abadi et al., 2015) framework for automatically learning high-quality ensembles with minimal expert intervention. Our framework is inspired by the AdaNet algorithm (Cortes et al., 2017) which learns the structure of a neural network as an ensemble of subnetworks. We designed it to: (1) integrate with the existing TensorFlow ecosystem, (2) offer sensible default search spaces to perform well on novel datasets, (3) present a flexible API to utilize expert information when available, and (4) efficiently accelerate training with distributed CPU, GPU, and TPU hardware. The code is open-source and available at: https://github.com/tensorflow/adanet. △ Less

Submitted 30 April, 2019; originally announced May 2019.

arXiv:1903.06236 [pdf, other]

Improving Neural Architecture Search Image Classifiers via Ensemble Learning

Authors: Vladimir Macko, Charles Weill, Hanna Mazzawi, Javier Gonzalvo

Abstract: Finding the best neural network architecture requires significant time, resources, and human expertise. These challenges are partially addressed by neural architecture search (NAS) which is able to find the best convolutional layer or cell that is then used as a building block for the network. However, once a good building block is found, manual design is still required to assemble the final archi… ▽ More Finding the best neural network architecture requires significant time, resources, and human expertise. These challenges are partially addressed by neural architecture search (NAS) which is able to find the best convolutional layer or cell that is then used as a building block for the network. However, once a good building block is found, manual design is still required to assemble the final architecture as a combination of multiple blocks under a predefined parameter budget constraint. A common solution is to stack these blocks into a single tower and adjust the width and depth to fill the parameter budget. However, these single tower architectures may not be optimal. Instead, in this paper we present the AdaNAS algorithm, that uses ensemble techniques to compose a neural network as an ensemble of smaller networks automatically. Additionally, we introduce a novel technique based on knowledge distillation to iteratively train the smaller networks using the previous ensemble as a teacher. Our experiments demonstrate that ensembles of networks improve accuracy upon a single neural network while kee** the same number of parameters. Our models achieve comparable results with the state-of-the-art on CIFAR-10 and sets a new state-of-the-art on CIFAR-100. △ Less

Submitted 14 March, 2019; originally announced March 2019.

arXiv:1502.04137 [pdf, ps, other]

Non-Adaptive Learning a Hidden Hipergraph

Authors: Hasan Abasi, Nader H. Bshouty, Hanna Mazzawi

Abstract: We give a new deterministic algorithm that non-adaptively learns a hidden hypergraph from edge-detecting queries. All previous non-adaptive algorithms either run in exponential time or have non-optimal query complexity. We give the first polynomial time non-adaptive learning algorithm for learning hypergraph that asks almost optimal number of queries. We give a new deterministic algorithm that non-adaptively learns a hidden hypergraph from edge-detecting queries. All previous non-adaptive algorithms either run in exponential time or have non-optimal query complexity. We give the first polynomial time non-adaptive learning algorithm for learning hypergraph that asks almost optimal number of queries. △ Less

Submitted 13 February, 2015; originally announced February 2015.

arXiv:1405.0792 [pdf, ps, other]

On Exact Learning Monotone DNF from Membership Queries

Authors: Hasan Abasi, Nader H. Bshouty, Hanna Mazzawi

Abstract: In this paper, we study the problem of learning a monotone DNF with at most $s$ terms of size (number of variables in each term) at most $r$ ($s$ term $r$-MDNF) from membership queries. This problem is equivalent to the problem of learning a general hypergraph using hyperedge-detecting queries, a problem motivated by applications arising in chemical reactions and genome sequencing. We first pres… ▽ More In this paper, we study the problem of learning a monotone DNF with at most $s$ terms of size (number of variables in each term) at most $r$ ($s$ term $r$-MDNF) from membership queries. This problem is equivalent to the problem of learning a general hypergraph using hyperedge-detecting queries, a problem motivated by applications arising in chemical reactions and genome sequencing. We first present new lower bounds for this problem and then present deterministic and randomized adaptive algorithms with query complexities that are almost optimal. All the algorithms we present in this paper run in time linear in the query complexity and the number of variables $n$. In addition, all of the algorithms we present in this paper are asymptotically tight for fixed $r$ and/or $s$. △ Less

Submitted 5 May, 2014; originally announced May 2014.

arXiv:1001.0405 [pdf, ps, other]

Optimal Query Complexity for Reconstructing Hypergraphs

Authors: Nader H. Bshouty, Hanna Mazzawi

Abstract: In this paper we consider the problem of reconstructing a hidden weighted hypergraph of constant rank using additive queries. We prove the following: Let $G$ be a weighted hidden hypergraph of constant rank with n vertices and $m$ hyperedges. For any $m$ there exists a non-adaptive algorithm that finds the edges of the graph and their weights using $$ O(\frac{m\log n}{\log m}) $$ additive querie… ▽ More In this paper we consider the problem of reconstructing a hidden weighted hypergraph of constant rank using additive queries. We prove the following: Let $G$ be a weighted hidden hypergraph of constant rank with n vertices and $m$ hyperedges. For any $m$ there exists a non-adaptive algorithm that finds the edges of the graph and their weights using $$ O(\frac{m\log n}{\log m}) $$ additive queries. This solves the open problem in [S. Choi, J. H. Kim. Optimal Query Complexity Bounds for Finding Graphs. {\em STOC}, 749--758,~2008]. When the weights of the hypergraph are integers that are less than $O(poly(n^d/m))$ where $d$ is the rank of the hypergraph (and therefore for unweighted hypergraphs) there exists a non-adaptive algorithm that finds the edges of the graph and their weights using $$ O(\frac{m\log \frac{n^d}{m}}{\log m}). $$ additive queries. Using the information theoretic bound the above query complexities are tight. △ Less

Submitted 3 January, 2010; originally announced January 2010.

Showing 1–8 of 8 results for author: Mazzawi, H