-
SCoRe: Submodular Combinatorial Representation Learning
Authors:
Anay Majee,
Suraj Kothawade,
Krishnateja Killamsetty,
Rishabh Iyer
Abstract:
In this paper we introduce the SCoRe (Submodular Combinatorial Representation Learning) framework, a novel approach in representation learning that addresses inter-class bias and intra-class variance. SCoRe provides a new combinatorial viewpoint to representation learning by introducing a family of loss functions based on set-based submodular information measures. We develop two novel combinatorial formulations for loss functions, using the Total Information and Total Correlation, that naturally minimize intra-class variance and inter-class bias. Several commonly used metric/contrastive learning loss functions, such as supervised contrastive loss, orthogonal projection loss, and N-pairs loss, are all instances of SCoRe, thereby underlining the versatility and applicability of SCoRe in a broad spectrum of learning scenarios. Novel objectives in SCoRe naturally model class-imbalance with up to 7.6% improvement in classification on CIFAR-10-LT, CIFAR-100-LT, MedMNIST, 2.1% on ImageNet-LT, and 19.4% in object detection on IDD and LVIS (v1.0), demonstrating its effectiveness over existing approaches.
Submitted 6 June, 2024; v1 submitted 29 September, 2023;
originally announced October 2023.
-
Beyond Active Learning: Leveraging the Full Potential of Human Interaction via Auto-Labeling, Human Correction, and Human Verification
Authors:
Nathan Beck,
Krishnateja Killamsetty,
Suraj Kothawade,
Rishabh Iyer
Abstract:
Active Learning (AL) is a human-in-the-loop framework to interactively and adaptively label data instances, thereby enabling significant gains in model performance compared to random sampling. AL approaches function by selecting the hardest instances to label, often relying on notions of diversity and uncertainty. However, we believe that these current paradigms of AL do not leverage the full potential of human interaction granted by automated label suggestions. Indeed, we show that for many classification tasks and datasets, most people verifying if an automatically suggested label is correct take $3\times$ to $4\times$ less time than they do changing an incorrect suggestion to the correct label (or labeling from scratch without any suggestion). Utilizing this result, we propose CLARIFIER (aCtive LeARnIng From tIEred haRdness), an Interactive Learning framework that admits more effective use of human interaction by leveraging the reduced cost of verification. By targeting the hard (uncertain) instances with existing AL methods, the intermediate instances with a novel label suggestion scheme using submodular mutual information functions on a per-class basis, and the easy (confident) instances with highest-confidence auto-labeling, CLARIFIER can improve over the performance of existing AL approaches on multiple datasets -- particularly on those that have a large number of classes -- by almost 1.5$\times$ to 2$\times$ in terms of relative labeling cost.
Submitted 2 June, 2023;
originally announced June 2023.
-
INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models
Authors:
H S V N S Kowndinya Renduchintala,
Krishnateja Killamsetty,
Sumit Bhatia,
Milan Aggarwal,
Ganesh Ramakrishnan,
Rishabh Iyer,
Balaji Krishnamurthy
Abstract:
A salient characteristic of pre-trained language models (PTLMs) is a remarkable improvement in their generalization capability and emergence of new capabilities with increasing model capacity and pre-training dataset size. Consequently, we are witnessing the development of enormous models pushing the state-of-the-art. It is, however, imperative to realize that this inevitably leads to prohibitively long training times, extortionate computing costs, and a detrimental environmental impact. Significant efforts are underway to make PTLM training more efficient through innovations in model architectures, training pipelines, and loss function design, with scant attention being paid to optimizing the utility of training data. The key question that we ask is whether it is possible to train PTLMs by employing only highly informative subsets of the training data while maintaining downstream performance? Building upon the recent progress in informative data subset selection, we show how we can employ submodular optimization to select highly representative subsets of the training corpora and demonstrate that the proposed framework can be applied to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of data. Further, we perform a rigorous empirical evaluation to show that the resulting models achieve up to $\sim99\%$ of the performance of the fully-trained models. We made our framework publicly available at https://github.com/Efficient-AI/ingenious.
Submitted 19 October, 2023; v1 submitted 11 May, 2023;
originally announced May 2023.
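The submodular subset selection mentioned in the INGENIOUS abstract can be illustrated with a minimal sketch. The abstract does not say which submodular function is optimized, so the facility-location objective below, the function name `greedy_facility_location`, and the toy similarity matrix are all assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np


def greedy_facility_location(similarity, budget):
    """Greedily pick up to `budget` indices that maximize the facility-location
    value f(S) = sum_i max_{j in S} similarity[i, j].

    Assumes `similarity` is a square matrix with non-negative entries.
    """
    similarity = np.asarray(similarity, dtype=float)
    n = similarity.shape[0]
    selected = []
    chosen = np.zeros(n, dtype=bool)
    # best_cover[i] is the similarity of point i to its closest selected point.
    best_cover = np.zeros(n)
    for _ in range(min(budget, n)):
        # Marginal gain of each candidate j: f(S + j) - f(S).
        gains = np.maximum(similarity, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[chosen] = -np.inf  # never pick the same element twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best_cover = np.maximum(best_cover, similarity[:, j])
    return selected


# Toy usage: two tight clusters; a budget of 2 should cover both clusters.
points = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
subset = greedy_facility_location(np.exp(-sq_dists), budget=2)
```

Because facility location is monotone submodular, this greedy loop carries the classical (1 - 1/e) approximation guarantee; that guarantee is a property of the assumed objective, not a claim about the paper.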
-
MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning
Authors:
Krishnateja Killamsetty,
Alexandre V. Evfimievski,
Tejaswini Pedapati,
Kiran Kate,
Lucian Popa,
Rishabh Iyer
Abstract:
Training deep networks and tuning hyperparameters on large datasets is computationally intensive. One of the primary research directions for efficient training is to reduce training costs by selecting well-generalizable subsets of training data. Compared to simple adaptive random subset selection baselines, existing intelligent subset selection approaches are not competitive due to the time-consuming subset selection step, which involves computing model-dependent gradients and feature embeddings and applies greedy maximization of submodular objectives. Our key insight is that removing the reliance on downstream model parameters enables subset selection as a pre-processing step and enables one to train multiple models at no additional cost. In this work, we propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training while enabling superior model convergence and performance by using an easy-to-hard curriculum. Our empirical results indicate that MILO can train models $3\times - 10 \times$ faster and tune hyperparameters $20\times - 75 \times$ faster than full-dataset training or tuning without compromising performance.
Submitted 16 June, 2023; v1 submitted 30 January, 2023;
originally announced January 2023.
-
AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning
Authors:
Krishnateja Killamsetty,
Guttu Sai Abhishek,
Aakriti,
Alexandre V. Evfimievski,
Lucian Popa,
Ganesh Ramakrishnan,
Rishabh Iyer
Abstract:
Deep neural networks have seen great success in recent years; however, training a deep model is often challenging as its performance heavily depends on the hyper-parameters used. In addition, finding the optimal hyper-parameter configuration, even with state-of-the-art (SOTA) hyper-parameter optimization (HPO) algorithms, can be time-consuming, requiring multiple training runs over the entire dataset for different possible sets of hyper-parameters. Our central insight is that using an informative subset of the dataset for model training runs involved in hyper-parameter optimization, allows us to find the optimal hyper-parameter configuration significantly faster. In this work, we propose AUTOMATA, a gradient-based subset selection framework for hyper-parameter tuning. We empirically evaluate the effectiveness of AUTOMATA in hyper-parameter tuning through several experiments on real-world datasets in the text, vision, and tabular domains. Our experiments show that using gradient-based data subsets for hyper-parameter tuning achieves significantly faster turnaround times and speedups of 3$\times$-30$\times$ while achieving comparable performance to the hyper-parameters found using the entire dataset.
Submitted 15 March, 2022;
originally announced March 2022.
-
Multimodal Anti-Reflective Coatings for Perfecting Anomalous Reflection from Arbitrary Periodic Structures
Authors:
Sherman W. Marcus,
Vinay K. Killamsetty,
Ariel Epstein
Abstract:
Metasurfaces possess vast wave-manipulation capabilities, including reflection and refraction of a plane wave into non-standard directions. This requires meticulously-designed sub-wavelength meta-atoms in each period of the metasurface which guarantee unitary coupling to the desired Floquet-Bloch mode or, equivalently, suppression of the coupling to other modes. Herein, we propose an entirely different scheme to achieve such suppression, alleviating the need to devise and realize such dense scrupulously-engineered polarizable particles. Extending the concept of anti-reflective coatings to enable simultaneous manipulation of multiple modes, we show theoretically and experimentally that a simple superstrate consisting of only several uniform dielectric layers can be modularly applied to arbitrary periodic structures to yield perfect anomalous reflection. This multimodal anti-reflective coating (MARC), designed based on an analytical model, presents a conceptually and practically simpler paradigm for wave-control across a wide range of physical branches, from electromagnetics and acoustics to seismics and beyond.
Submitted 13 January, 2022;
originally announced January 2022.
-
GCR: Gradient Coreset Based Replay Buffer Selection For Continual Learning
Authors:
Rishabh Tiwari,
Krishnateja Killamsetty,
Rishabh Iyer,
Pradeep Shenoy
Abstract:
Continual learning (CL) aims to develop techniques by which a single model adapts to an increasing number of tasks encountered sequentially, thereby potentially leveraging learnings across tasks in a resource-efficient manner. A major challenge for CL systems is catastrophic forgetting, where earlier tasks are forgotten while learning a new task. To address this, replay-based CL approaches maintain and repeatedly retrain on a small buffer of data selected across encountered tasks. We propose Gradient Coreset Replay (GCR), a novel strategy for replay buffer selection and update using a carefully designed optimization criterion. Specifically, we select and maintain a "coreset" that closely approximates the gradient of all the data seen so far with respect to current model parameters, and discuss key strategies needed for its effective application to the continual learning setting. We show significant gains (2%-4% absolute) over the state-of-the-art in the well-studied offline continual learning setting. Our findings also effectively transfer to online/streaming CL settings, showing up to 5% gains over existing approaches. Finally, we demonstrate the value of supervised contrastive loss for continual learning, which yields a cumulative gain of up to 5% accuracy when combined with our subset selection strategy.
Submitted 15 April, 2022; v1 submitted 18 November, 2021;
originally announced November 2021.
-
Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming
Authors:
Ayush Maheshwari,
Krishnateja Killamsetty,
Ganesh Ramakrishnan,
Rishabh Iyer,
Marina Danilevsky,
Lucian Popa
Abstract:
A critical bottleneck in supervised machine learning is the need for large amounts of labeled data which is expensive and time consuming to obtain. However, it has been shown that a small amount of labeled data, while insufficient to re-train a model, can be effectively used to generate human-interpretable labeling functions (LFs). These LFs, in turn, have been used to generate a large amount of additional noisy labeled data, in a paradigm that is now commonly referred to as data programming. However, previous approaches to automatically generate LFs make no attempt to further use the given labeled data for model training, thus giving up opportunities for improved performance. Moreover, since the LFs are generated from a relatively small labeled dataset, they are prone to being noisy, and naively aggregating these LFs can lead to very poor performance in practice. In this work, we propose an LF based reweighting framework \ouralgo{} to solve these two critical limitations. Our algorithm learns a joint model on the (same) labeled dataset used for LF induction along with any unlabeled data in a semi-supervised manner, and more critically, reweighs each LF according to its goodness, influencing its contribution to the semi-supervised loss using a robust bi-level optimization algorithm. We show that our algorithm significantly outperforms prior approaches on several text classification datasets.
Submitted 10 March, 2022; v1 submitted 23 September, 2021;
originally announced September 2021.
-
SIMILAR: Submodular Information Measures Based Active Learning In Realistic Scenarios
Authors:
Suraj Kothawade,
Nathan Beck,
Krishnateja Killamsetty,
Rishabh Iyer
Abstract:
Active learning has proven to be useful for minimizing labeling costs by selecting the most informative samples. However, existing active learning methods do not work well in realistic scenarios such as imbalance or rare classes, out-of-distribution data in the unlabeled set, and redundancy. In this work, we propose SIMILAR (Submodular Information Measures based actIve LeARning), a unified active learning framework using recently proposed submodular information measures (SIM) as acquisition functions. We argue that SIMILAR not only works in standard active learning, but also easily extends to the realistic settings considered above and acts as a one-stop solution for active learning that is scalable to large real-world datasets. Empirically, we show that SIMILAR significantly outperforms existing active learning algorithms by as much as ~5% - 18% in the case of rare classes and ~5% - 10% in the case of out-of-distribution data on several image classification tasks like CIFAR-10, MNIST, and ImageNet. SIMILAR is available as a part of the DISTIL toolkit: "https://github.com/decile-team/distil".
Submitted 3 November, 2021; v1 submitted 1 July, 2021;
originally announced July 2021.
-
RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning
Authors:
Krishnateja Killamsetty,
Xujiang Zhao,
Feng Chen,
Rishabh Iyer
Abstract:
Semi-supervised learning (SSL) algorithms have had great success in recent years in limited labeled data regimes. However, the current state-of-the-art SSL algorithms are computationally expensive and entail significant compute time and energy requirements. This can prove to be a huge limitation for many smaller companies and academic groups. Our main insight is that training on a subset of unlabeled data instead of entire unlabeled data enables the current SSL algorithms to converge faster, significantly reducing computational costs. In this work, we propose RETRIEVE, a coreset selection framework for efficient and robust semi-supervised learning. RETRIEVE selects the coreset by solving a mixed discrete-continuous bi-level optimization problem such that the selected coreset minimizes the labeled set loss. We use a one-step gradient approximation and show that the discrete optimization problem is approximately submodular, enabling simple greedy algorithms to obtain the coreset. We empirically demonstrate on several real-world datasets that existing SSL algorithms like VAT, Mean-Teacher, FixMatch, when used with RETRIEVE, achieve a) faster training times, b) better performance when unlabeled data consists of Out-of-Distribution (OOD) data and imbalance. More specifically, we show that with minimal accuracy degradation, RETRIEVE achieves a speedup of around $3\times$ in the traditional SSL setting and achieves a speedup of $5\times$ compared to state-of-the-art (SOTA) robust SSL algorithms in the case of imbalance and OOD data. RETRIEVE is available as a part of the CORDS toolkit: https://github.com/decile-team/cords.
Submitted 27 October, 2021; v1 submitted 14 June, 2021;
originally announced June 2021.
-
Metagratings for Perfect Mode Conversion in Rectangular Waveguides: Theory and Experiment
Authors:
Vinay Kumar Killamsetty,
Ariel Epstein
Abstract:
We present a complete design scheme, from theoretical formulation to experimental validation, exploiting the versatility of metagratings (MGs) for designing a rectangular waveguide (RWG) $\mbox{TE}_{10}$ - $\mbox{TE}_{20}$ mode converter (MC). MG devices, formed by sparse periodically positioned polarizable particles (meta-atoms), were mostly used to date for beam manipulation applications. In this paper, we show that the appealing diffraction engineering features of the MGs in such typical free-space periodic scenarios can be utilized to efficiently mould fields inside waveguides (WGs). In particular, we derive an analytical model allowing harnessing of the MG concept for realization of perfect mode conversion in RWGs. Conveniently, the formalism considers a printed-circuit-board (PCB) MG terminating the RWG, operating as a reflect-mode MC. Following the typical MG synthesis approach, the model directly ties the meta-atom position and geometry with the modal reflection coefficients, enabling resolution of the detailed fabrication-ready design by enforcement of the functionality constraints: elimination of the fundamental $\mbox{TE}_{10}$ reflection and power conservation (passive lossless MG). This reliable semianalytical scheme, verified via full-wave simulations and laboratory measurements, establishes a simple and efficient alternative to common RWG MCs, typically requiring challenging deformation of the WG designed through time-consuming full-wave optimization. In addition, it highlights the immense potential MGs encompass for a wide variety of applications beyond beam manipulation.
Submitted 19 March, 2021;
originally announced March 2021.
-
Semianalytical synthesis scheme for multifunctional metasurfaces on demand
Authors:
Vinay K. Killamsetty,
Ariel Epstein
Abstract:
We propose a comprehensive field-based semianalytical method for designing fabrication-ready multifunctional periodic metasurfaces (MSs). Harnessing recent work on multielement metagratings based on capacitively-loaded strips, we have extended our previous meta-atom design formulation to generate realistic substrate-supported printed-circuit-board layouts for anomalous refraction MSs. Subsequently, we apply a greedy algorithm for iteratively optimizing individual scatterers across the entire macroperiod to achieve multiple design goals for corresponding multiple incidence angles with a single MS structure. As verified with commercial solvers, the proposed semianalytical scheme, properly accounting for near-field coupling between the various scatterers, can reliably produce highly efficient multifunctional MSs on demand, without requiring time-consuming full-wave optimization.
Submitted 7 March, 2021;
originally announced March 2021.
-
GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training
Authors:
Krishnateja Killamsetty,
Durga Sivasubramanian,
Ganesh Ramakrishnan,
Abir De,
Rishabh Iyer
Abstract:
The great success of modern machine learning models on large datasets is contingent on extensive computational resources with high financial and environmental costs. One way to address this is by extracting subsets that generalize on par with the full data. In this work, we propose a general framework, GRAD-MATCH, which finds subsets that closely match the gradient of the training or validation set. We find such subsets effectively using an orthogonal matching pursuit algorithm. We show rigorous theoretical and convergence guarantees of the proposed algorithm and, through our extensive experiments on real-world datasets, show the effectiveness of our proposed framework. We show that GRAD-MATCH significantly and consistently outperforms several recent data-selection algorithms and achieves the best accuracy-efficiency trade-off. GRAD-MATCH is available as a part of the CORDS toolkit: https://github.com/decile-team/cords.
Submitted 11 June, 2021; v1 submitted 26 February, 2021;
originally announced March 2021.
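The gradient-matching idea in the GRAD-MATCH abstract (pick a small weighted subset whose combined gradient approximates a target gradient, using orthogonal matching pursuit) can be sketched as follows. This is plain textbook OMP with an unconstrained least-squares refit; any regularization, weight constraints, or per-batch details the paper may use are not reproduced, and the name `omp_gradient_match` and the toy data are hypothetical.

```python
import numpy as np


def omp_gradient_match(per_sample_grads, target_grad, budget, tol=1e-10):
    """Select up to `budget` rows of `per_sample_grads` and weights so that
    the weighted sum of the selected rows approximates `target_grad`.

    Plain orthogonal matching pursuit: repeatedly add the row most correlated
    with the current residual, then refit all weights by least squares.
    """
    grads = np.asarray(per_sample_grads, dtype=float)  # shape (n, d)
    target = np.asarray(target_grad, dtype=float)      # shape (d,)
    residual = target.copy()
    selected = []
    weights = np.zeros(0)
    chosen = np.zeros(grads.shape[0], dtype=bool)
    for _ in range(min(budget, grads.shape[0])):
        if np.linalg.norm(residual) <= tol:
            break  # target already matched
        scores = np.abs(grads @ residual)
        scores[chosen] = -1.0  # exclude rows that are already selected
        j = int(np.argmax(scores))
        selected.append(j)
        chosen[j] = True
        basis = grads[selected].T  # shape (d, k)
        weights, *_ = np.linalg.lstsq(basis, target, rcond=None)
        residual = target - basis @ weights
    return selected, weights


# Toy usage: the target is an exact combination of rows 0 and 2.
toy_grads = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [0.5, 0.5, 0.0]])
toy_target = 2.0 * toy_grads[0] + 3.0 * toy_grads[2]
subset, subset_weights = omp_gradient_match(toy_grads, toy_target, budget=2)
```

In a training loop the rows would be per-sample (or per-batch) gradients and the target would be the full training or validation gradient, as the abstract describes; how those gradients are computed is left out of this sketch.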
-
GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning
Authors:
Krishnateja Killamsetty,
Durga Sivasubramanian,
Ganesh Ramakrishnan,
Rishabh Iyer
Abstract:
Large scale machine learning and deep models are extremely data-hungry. Unfortunately, obtaining large amounts of labeled data is expensive, and training state-of-the-art models (with hyperparameter tuning) requires significant computing resources and time. Secondly, real-world data is noisy and imbalanced. As a result, several recent papers try to make the training process more efficient and robust. However, most existing work either focuses on robustness or efficiency, but not both. In this work, we introduce Glister, a GeneraLIzation based data Subset selecTion for Efficient and Robust learning framework. We formulate Glister as a mixed discrete-continuous bi-level optimization problem to select a subset of the training data, which maximizes the log-likelihood on a held-out validation set. Next, we propose an iterative online algorithm Glister-Online, which performs data selection iteratively along with the parameter updates and can be applied to any loss-based learning algorithm. We then show that for a rich class of loss functions including cross-entropy, hinge-loss, squared-loss, and logistic-loss, the inner discrete data selection is an instance of (weakly) submodular optimization, and we analyze conditions for which Glister-Online reduces the validation loss and converges. Finally, we propose Glister-Active, an extension to batch active learning, and we empirically demonstrate the performance of Glister on a wide range of tasks including, (a) data selection to reduce training time, (b) robust learning under label noise and imbalance settings, and (c) batch-active learning with several deep and shallow models. We show that our framework improves upon state of the art both in efficiency and accuracy (in cases (a) and (c)) and is more efficient compared to other state-of-the-art robust learning algorithms in case (b).
Submitted 11 June, 2021; v1 submitted 19 December, 2020;
originally announced December 2020.
-
A Nested Bi-level Optimization Framework for Robust Few Shot Learning
Authors:
Krishnateja Killamsetty,
Changbin Li,
Chen Zhao,
Rishabh Iyer,
Feng Chen
Abstract:
Model-Agnostic Meta-Learning (MAML), a popular gradient-based meta-learning framework, assumes that the contribution of each task or instance to the meta-learner is equal. Hence, it fails to address the domain shift between base and novel classes in few-shot learning. In this work, we propose a novel robust meta-learning algorithm, NestedMAML, which learns to assign weights to training tasks or instances. We consider weights as hyper-parameters and iteratively optimize them using a small set of validation tasks set in a nested bi-level optimization approach (in contrast to the standard bi-level optimization in MAML). We then apply NestedMAML in the meta-training stage, which involves (1) several tasks sampled from a distribution different from the meta-test task distribution, or (2) some data samples with noisy labels. Extensive experiments on synthetic and real-world datasets demonstrate that NestedMAML efficiently mitigates the effects of "unwanted" tasks or instances, leading to significant improvement over the state-of-the-art robust meta-learning methods.
Submitted 1 December, 2021; v1 submitted 13 November, 2020;
originally announced November 2020.
-
Semi-Supervised Data Programming with Subset Selection
Authors:
Ayush Maheshwari,
Oishik Chatterjee,
KrishnaTeja Killamsetty,
Ganesh Ramakrishnan,
Rishabh Iyer
Abstract:
The paradigm of data programming, which uses weak supervision in the form of rules/labelling functions, and semi-supervised learning, which augments small amounts of labelled data with a large unlabelled dataset, have shown great promise in several text classification scenarios. In this work, we argue that by not using any labelled data, data programming based approaches can yield sub-optimal performances, particularly when the labelling functions are noisy. The first contribution of this work is an introduction of a framework, \model which is a semi-supervised data programming paradigm that learns a \emph{joint model} that effectively uses the rules/labelling functions along with semi-supervised loss functions on the feature space. Next, we also study \modelss which additionally does subset selection on top of the joint semi-supervised data programming objective and \emph{selects} a set of examples that can be used as the labelled set by \model. The goal of \modelss is to ensure that the labelled data can \emph{complement} the labelling functions, thereby benefiting from both data-programming as well as appropriately selected data for human labelling. We demonstrate that by effectively combining semi-supervision, data-programming, and subset selection paradigms, we significantly outperform the current state-of-the-art on seven publicly available datasets. \footnote{The source code is available at \url{https://github.com/ayushbits/Semi-Supervised-LFs-Subset-Selection}}
Submitted 12 June, 2021; v1 submitted 22 August, 2020;
originally announced August 2020.