-
Deep Submodular Peripteral Networks
Authors:
Gantavya Bhatt,
Arnav Das,
Jeff Bilmes
Abstract:
Submodular functions, crucial for various applications, often lack practical learning methods for their acquisition. Seemingly unrelated, learning a scaling from oracles offering graded pairwise preferences (GPC) is underexplored, despite a rich history in psychometrics. In this paper, we introduce deep submodular peripteral networks (DSPNs), a novel parametric family of submodular functions, and methods for their training using a contrastive-learning inspired GPC-ready strategy to connect and then tackle both of the above challenges. We introduce a newly devised GPC-style "peripteral" loss, which leverages numerically graded relationships between pairs of objects (sets in our case). Unlike traditional contrastive learning, our method utilizes graded comparisons, extracting more nuanced information than just binary-outcome comparisons, and contrasts sets of any size (not just two). We also define a novel suite of automatic sampling strategies for training, including active-learning inspired submodular feedback. We demonstrate DSPNs' efficacy in learning submodularity from a costly target submodular function, showing superiority in downstream tasks such as experimental design and streaming applications.
Submitted 15 March, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
Many-Objective Multi-Solution Transport
Authors:
Ziyue Li,
Tian Li,
Virginia Smith,
Jeff Bilmes,
Tianyi Zhou
Abstract:
Optimizing the performance of many objectives (instantiated by tasks or clients) jointly with a few Pareto stationary solutions (models) is critical in machine learning. However, previous multi-objective optimization methods often focus on a small number of objectives and cannot scale to many objectives that outnumber the solutions, leading to either subpar performance or ignored objectives. We introduce Many-objective multi-solution Transport (MosT), a framework that finds multiple diverse solutions in the Pareto front of many objectives. Our insight is to seek multiple solutions, each performing as a domain expert and focusing on a specific subset of objectives while collectively covering all of them. MosT formulates the problem as a bi-level optimization of weighted objectives for each solution, where the weights are defined by an optimal transport between the objectives and solutions. Our algorithm ensures convergence to Pareto stationary solutions for complementary subsets of objectives. On a range of applications in federated learning, multi-task learning, and mixture-of-prompt learning for LLMs, MosT distinctly outperforms strong baselines, delivering high-quality, diverse solutions that profile the entire Pareto frontier, thus ensuring balanced trade-offs across many objectives.
Submitted 6 March, 2024;
originally announced March 2024.
-
An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models
Authors:
Gantavya Bhatt,
Yifang Chen,
Arnav M. Das,
Jifan Zhang,
Sang T. Truong,
Stephen Mussmann,
Yinglun Zhu,
Jeffrey Bilmes,
Simon S. Du,
Kevin Jamieson,
Jordan T. Ash,
Robert D. Nowak
Abstract:
Supervised finetuning (SFT) on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities observed in modern large language models (LLMs). However, the annotation efforts required to produce high-quality responses for instructions are becoming prohibitively expensive, especially as the number of tasks spanned by instruction datasets continues to increase. Active learning is effective in identifying useful subsets of samples to annotate from an unlabeled pool, but its high computational cost remains a barrier to its widespread applicability in the context of LLMs. To mitigate the annotation cost of SFT and circumvent the computational bottlenecks of active learning, we propose using experimental design. Experimental design techniques select the most informative samples to label, and typically maximize some notion of uncertainty and/or diversity. In our work, we implement a framework that evaluates several existing and novel experimental design techniques and find that these methods consistently yield significant gains in label efficiency with little computational overhead. On generative tasks, our methods achieve the same generalization performance with only 50% of the annotation cost required by random sampling.
Submitted 6 May, 2024; v1 submitted 12 January, 2024;
originally announced January 2024.
-
Effective Backdoor Mitigation Depends on the Pre-training Objective
Authors:
Sahil Verma,
Gantavya Bhatt,
Avi Schwarzschild,
Soumye Singhal,
Arnav Mohanty Das,
Chirag Shah,
John P Dickerson,
Jeff Bilmes
Abstract:
Despite the advanced capabilities of contemporary machine learning (ML) models, they remain vulnerable to adversarial and backdoor attacks. This vulnerability is particularly concerning in real-world deployments, where compromised models may exhibit unpredictable behavior in critical scenarios. Such risks are heightened by the prevalent practice of collecting massive, internet-sourced datasets for pre-training multimodal models, as these datasets may harbor backdoors. Various techniques have been proposed to mitigate the effects of backdooring in these models, such as CleanCLIP, which is the current state-of-the-art approach. In this work, we demonstrate that the efficacy of CleanCLIP in mitigating backdoors is highly dependent on the particular objective used during model pre-training. We observe that stronger pre-training objectives correlate with harder-to-remove backdoor behaviors. We show this by training multimodal models on two large datasets consisting of 3 million (CC3M) and 6 million (CC6M) datapoints, under various pre-training objectives, followed by poison removal using CleanCLIP. We find that CleanCLIP is ineffective when stronger pre-training objectives are used, even with extensive hyperparameter tuning. Our findings underscore critical considerations for ML practitioners who pre-train models using large-scale web-curated data and are concerned about potential backdoor threats. Notably, our results suggest that simpler pre-training objectives are more amenable to effective backdoor removal. This insight is pivotal for practitioners seeking to balance the trade-offs between using stronger pre-training objectives and security against backdoor attacks.
Submitted 5 December, 2023; v1 submitted 25 November, 2023;
originally announced November 2023.
-
LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning
Authors:
Jifan Zhang,
Yifang Chen,
Gregory Canal,
Stephen Mussmann,
Arnav M. Das,
Gantavya Bhatt,
Yinglun Zhu,
Jeffrey Bilmes,
Simon Shaolei Du,
Kevin Jamieson,
Robert D Nowak
Abstract:
Labeled data are critical to modern machine learning applications, but obtaining labels can be expensive. To mitigate this cost, machine learning methods, such as transfer learning, semi-supervised learning and active learning, aim to be label-efficient: achieving high predictive performance from relatively few labeled examples. While obtaining the best label-efficiency in practice often requires combinations of these techniques, existing benchmark and evaluation frameworks do not capture a concerted combination of all such techniques. This paper addresses this deficiency by introducing LabelBench, a new computationally-efficient framework for joint evaluation of multiple label-efficient learning techniques. As an application of LabelBench, we introduce a novel benchmark of state-of-the-art active learning methods in combination with semi-supervised learning for fine-tuning pretrained vision transformers. Our benchmark demonstrates better label-efficiencies than previously reported in active learning. LabelBench's modular codebase is open-sourced for the broader community to contribute label-efficient learning methods and benchmarks. The repository can be found at: https://github.com/EfficientTraining/LabelBench.
Submitted 1 March, 2024; v1 submitted 16 June, 2023;
originally announced June 2023.
-
Accelerating Batch Active Learning Using Continual Learning Techniques
Authors:
Arnav Das,
Gantavya Bhatt,
Megh Bhalerao,
Vianne Gao,
Rui Yang,
Jeff Bilmes
Abstract:
A major problem with Active Learning (AL) is high training costs since models are typically retrained from scratch after every query round. We start by demonstrating that standard AL on neural networks with warm starting fails, both to accelerate training and to avoid catastrophic forgetting when using fine-tuning over AL query rounds. We then develop a new class of techniques, circumventing this problem, by biasing further training towards previously labeled sets. We accomplish this by employing existing, and developing novel, replay-based Continual Learning (CL) algorithms that are effective at quickly learning the new without forgetting the old, especially when data comes from an evolving distribution. We call this paradigm Continual Active Learning (CAL). We show CAL achieves significant speedups using a plethora of replay schemes that use model distillation and that select diverse, uncertain points from the history. We conduct experiments across many data domains, including natural language, vision, medical imaging, and computational biology, each with different neural architectures and dataset sizes. CAL consistently provides a 3x reduction in training time, while retaining performance.
Submitted 12 December, 2023; v1 submitted 10 May, 2023;
originally announced May 2023.
-
Online SuBmodular + SuPermodular (BP) Maximization with Bandit Feedback
Authors:
Adhyyan Narang,
Omid Sadeghi,
Lillian J Ratliff,
Maryam Fazel,
Jeff Bilmes
Abstract:
In the context of online interactive machine learning with combinatorial objectives, we extend purely submodular prior work to more general non-submodular objectives. This includes: (1) those that are additively decomposable into a sum of two terms (a monotone submodular and monotone supermodular term, known as a BP decomposition); and (2) those that are only weakly submodular. In both cases, this allows representing not only competitive (submodular) but also complementary (supermodular) relationships between objects, enhancing this setting to a broader range of applications (e.g., movie recommendations, medical treatments, etc.) where this is beneficial. In the two-term case, moreover, we study not only the more typical monolithic feedback approach but also a novel framework where feedback is available separately for each term. With real-world practicality and scalability in mind, we integrate Nystrom sketching techniques to significantly reduce the computational cost, including for the purely submodular case. In the Gaussian process contextual bandits setting, we show sub-linear theoretical regret bounds in all cases. We also empirically show good applicability to recommendation systems and data subset selection.
Submitted 12 May, 2024; v1 submitted 7 July, 2022;
originally announced July 2022.
-
High Resolution Point Clouds from mmWave Radar
Authors:
Akarsh Prabhakara,
Tao Jin,
Arnav Das,
Gantavya Bhatt,
Lilly Kumari,
Elahe Soltanaghaei,
Jeff Bilmes,
Swarun Kumar,
Anthony Rowe
Abstract:
This paper explores a machine learning approach for generating high resolution point clouds from a single-chip mmWave radar. Unlike lidar and vision-based systems, mmWave radar can operate in harsh environments and see through occlusions like smoke, fog, and dust. Unfortunately, current mmWave processing techniques offer poor spatial resolution compared to lidar point clouds. This paper presents RadarHD, an end-to-end neural network that constructs lidar-like point clouds from low resolution radar input. Enhancing radar images is challenging due to the presence of specular and spurious reflections. Radar data also doesn't map well to traditional image processing techniques due to the signal's sinc-like spreading pattern. We overcome these challenges by training RadarHD on a large volume of raw I/Q radar data paired with lidar point clouds across diverse indoor settings. Our experiments show the ability to generate rich point clouds even in scenes unobserved during training and in the presence of heavy smoke occlusion. Further, RadarHD's point clouds are high-quality enough to work with existing lidar odometry and mapping workflows.
Submitted 16 July, 2023; v1 submitted 18 June, 2022;
originally announced June 2022.
-
Submodularity In Machine Learning and Artificial Intelligence
Authors:
Jeff Bilmes
Abstract:
In this manuscript, we offer a gentle review of submodularity and supermodularity and their properties. We offer a plethora of submodular definitions; a full description of a number of example submodular functions and their generalizations; example discrete constraints; a discussion of basic algorithms for maximization, minimization, and other operations; a brief overview of continuous submodular extensions; and some historical applications. We then turn to how submodularity is useful in machine learning and artificial intelligence. This includes summarization, and we offer a complete account of the differences between and commonalities amongst sketching, coresets, extractive and abstractive summarization in NLP, data distillation and condensation, and data subset selection and feature selection. We discuss a variety of ways to produce a submodular function useful for machine learning, including heuristic hand-crafting, learning or approximately learning a submodular function or aspects thereof, and some advantages of the use of a submodular function as a coreset producer. We discuss submodular combinatorial information functions, and how submodularity is useful for clustering, data partitioning, parallel machine learning, active and semi-supervised learning, probabilistic modeling, and structured norms and loss functions.
Submitted 4 October, 2022; v1 submitted 31 January, 2022;
originally announced February 2022.
-
Independence Properties of Generalized Submodular Information Measures
Authors:
Himanshu Asnani,
Jeff Bilmes,
Rishabh Iyer
Abstract:
Recently a class of generalized information measures was defined on sets of items parametrized by submodular functions. In this paper, we propose and study various notions of independence between sets with respect to such information measures, and connections thereof. Since entropy can also be used to parametrize such measures, we derive interesting independence properties for the entropy of sets of random variables. We also study the notion of multi-set independence and its properties. Finally, we present optimization algorithms for obtaining a set that is independent of another given set, and also discuss the implications and applications of combinatorial independence.
Submitted 6 August, 2021;
originally announced August 2021.
-
An Effective Baseline for Robustness to Distributional Shift
Authors:
Sunil Thulasidasan,
Sushil Thapa,
Sayera Dhaubhadel,
Gopinath Chennupati,
Tanmoy Bhattacharya,
Jeff Bilmes
Abstract:
Refraining from confidently predicting when faced with categories of inputs different from those seen during training is an important requirement for the safe deployment of deep learning systems. While simple to state, this has been a particularly challenging problem in deep learning, where models often end up making overconfident predictions in such situations. In this work we present a simple but highly effective approach to out-of-distribution detection that uses the principle of abstention: when encountering a sample from an unseen class, the desired behavior is to abstain from predicting. Our approach uses a network with an extra abstention class and is trained on a dataset that is augmented with an uncurated set that consists of a large number of out-of-distribution (OoD) samples that are assigned the label of the abstention class; the model is then trained to learn an effective discriminator between in- and out-of-distribution samples. We compare this relatively simple approach against a wide variety of more complex methods that have been proposed both for out-of-distribution detection and for uncertainty modeling in deep learning, and empirically demonstrate its effectiveness on a wide variety of benchmarks and deep architectures for image recognition and text classification, often outperforming existing approaches by significant margins. Given the simplicity and effectiveness of this method, we propose that this approach be used as a new additional baseline for future work in this domain.
Submitted 14 May, 2021;
originally announced May 2021.
-
Submodular Mutual Information for Targeted Data Subset Selection
Authors:
Suraj Kothawade,
Vishal Kaushal,
Ganesh Ramakrishnan,
Jeff Bilmes,
Rishabh Iyer
Abstract:
With the rapid growth of data, it is becoming increasingly difficult to train or improve deep learning models with the right subset of data. We show that this problem can be effectively solved at an additional labeling cost by targeted data subset selection (TSS), where a subset of unlabeled data points similar to an auxiliary set is added to the training data. We do so by using a rich class of Submodular Mutual Information (SMI) functions and demonstrate its effectiveness for image classification on the CIFAR-10 and MNIST datasets. Lastly, we compare the performance of SMI functions for TSS with other state-of-the-art methods for closely related problems like active learning. Using SMI functions, we observe a ~20-30% gain over the model's performance before re-training with the added targeted subset; ~12% more than other methods.
Submitted 30 April, 2021;
originally announced May 2021.
-
PRISM: A Rich Class of Parameterized Submodular Information Measures for Guided Subset Selection
Authors:
Suraj Kothawade,
Vishal Kaushal,
Ganesh Ramakrishnan,
Jeff Bilmes,
Rishabh Iyer
Abstract:
With ever-increasing dataset sizes, subset selection techniques are becoming increasingly important for a plethora of tasks. It is often necessary to guide the subset selection to achieve certain desiderata, which include focusing on or targeting certain data points while avoiding others. Examples of such problems include: i) targeted learning, where the goal is to find subsets with rare classes or rare attributes on which the model is underperforming, and ii) guided summarization, where data (e.g., image collection, text, document or video) is summarized for quicker human consumption with specific additional user intent. Motivated by such applications, we present PRISM, a rich class of PaRameterIzed Submodular information Measures. Through novel functions and their parameterizations, PRISM offers a variety of modeling capabilities that enable a trade-off between desired qualities of a subset like diversity or representation and similarity/dissimilarity with a set of data points. We demonstrate how PRISM can be applied to the two real-world problems mentioned above, which require guided subset selection. In doing so, we show that PRISM interestingly generalizes some past work, thereby reinforcing its broad utility. Through extensive experiments on diverse datasets, we demonstrate the superiority of PRISM over the state-of-the-art in targeted learning and in guided image-collection summarization.
Submitted 8 March, 2022; v1 submitted 26 February, 2021;
originally announced March 2021.
-
A Unified Framework for Generic, Query-Focused, Privacy Preserving and Update Summarization using Submodular Information Measures
Authors:
Vishal Kaushal,
Suraj Kothawade,
Ganesh Ramakrishnan,
Jeff Bilmes,
Himanshu Asnani,
Rishabh Iyer
Abstract:
We study submodular information measures as a rich framework for generic, query-focused, privacy-sensitive, and update summarization tasks. While past work generally treats these problems differently (e.g., different models are often used for generic and query-focused summarization), the submodular information measures allow us to study each of these problems via a unified approach. We first show that several previous query-focused and update summarization techniques have, unknowingly, used various instantiations of the aforesaid submodular information measures, providing evidence for the benefit and naturalness of these models. We then carefully study and demonstrate the modelling capabilities of the proposed functions in different settings and empirically verify our findings on both a synthetic dataset and an existing real-world image collection dataset, which has been extended by adding concept annotations to each image (making it suitable for this task) and will be publicly released. We employ a max-margin framework to learn a mixture model built using the proposed instantiations of submodular information measures and demonstrate the effectiveness of our approach. While our experiments are in the context of image summarization, our framework is generic and can be easily extended to other summarization settings (e.g., videos or documents).
Submitted 12 October, 2020;
originally announced October 2020.
-
Concave Aspects of Submodular Functions
Authors:
Rishabh Iyer,
Jeff Bilmes
Abstract:
Submodular functions are a special class of set functions, which generalize several information-theoretic quantities such as entropy and mutual information [1]. Submodular functions have subgradients and subdifferentials [2] and admit polynomial-time algorithms for minimization, both of which are fundamental characteristics of convex functions. Submodular functions also show signs similar to concavity. Submodular function maximization, though NP-hard, admits constant-factor approximation guarantees, and concave functions composed with modular functions are submodular. In this paper, we try to provide a more complete picture of the relationship between submodularity and concavity. We characterize the super-differentials and polyhedra associated with upper bounds and provide optimality conditions for submodular maximization using the super-differentials. This paper is a concise and shorter version of our longer preprint [3].
Submitted 27 June, 2020;
originally announced June 2020.
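The abstract's remark that concave functions composed with modular functions are submodular can be checked by brute force on a small ground set. The following is a minimal sketch, not taken from the paper: the weights, the choice of the square root as the concave function, and the helper names are all illustrative assumptions, and the modular function is assumed to be non-negative.

```python
import math
from itertools import combinations

# Hypothetical non-negative modular weights on a four-element ground set.
weights = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}


def concave_of_modular(subset):
    """f(A) = sqrt(sum of weights in A): a concave function of a modular one."""
    return math.sqrt(sum(weights[x] for x in subset))


def all_subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_submodular(func, ground, tol=1e-12):
    """Check diminishing returns: for all A <= B and v outside B,
    f(A + v) - f(A) >= f(B + v) - f(B)."""
    ground = frozenset(ground)
    for big in all_subsets(ground):
        for small in all_subsets(big):
            for v in ground - big:
                gain_small = func(small | {v}) - func(small)
                gain_big = func(big | {v}) - func(big)
                if gain_small < gain_big - tol:
                    return False
    return True


print(is_submodular(concave_of_modular, weights))
```

Swapping the square root for a convex function such as squaring makes the same check fail, which is a quick way to see that concavity is doing the work here.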
-
Submodular Combinatorial Information Measures with Applications in Machine Learning
Authors:
Rishabh Iyer,
Ninad Khargonkar,
Jeff Bilmes,
Himanshu Asnani
Abstract:
Information-theoretic quantities like entropy and mutual information have found numerous uses in machine learning. It is well known that there is a strong connection between these entropic quantities and submodularity since entropy over a set of random variables is submodular. In this paper, we study combinatorial information measures that generalize independence, (conditional) entropy, (conditional) mutual information, and total correlation defined over sets of (not necessarily random) variables. These measures strictly generalize the corresponding entropic measures since they are all parameterized via submodular functions that themselves strictly generalize entropy. Critically, we show that, unlike entropic mutual information in general, the submodular mutual information is actually submodular in one argument, holding the other fixed, for a large class of submodular functions whose third-order partial derivatives satisfy a non-negativity property. This turns out to include a number of practically useful cases such as the facility location and set-cover functions. We study specific instantiations of the submodular information measures on these, as well as the probabilistic coverage, graph-cut, and saturated coverage functions, and see that they all have mathematically intuitive and practically useful expressions. Regarding applications, we connect the maximization of submodular (conditional) mutual information to problems such as mutual-information-based, query-based, and privacy-preserving summarization -- and we connect optimizing the multi-set submodular mutual information to clustering and robust partitioning.
Submitted 2 March, 2021; v1 submitted 27 June, 2020;
originally announced June 2020.
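To make the submodular mutual information concrete, here is a small sketch under stated assumptions. It assumes the usual definition I_f(A; B) = f(A) + f(B) - f(A ∪ B), which mirrors the entropic identity, and instantiates f as an unweighted set-cover function. The item names, the concepts they cover, and the function names are hypothetical.

```python
# Hypothetical mapping from items to the concepts each item covers.
covers = {
    "x1": {"cat", "dog"},
    "x2": {"dog", "bird"},
    "x3": {"fish"},
    "x4": {"cat", "fish"},
}


def set_cover(items):
    """f(A) = number of distinct concepts covered by the items in A."""
    covered = set()
    for item in items:
        covered |= covers[item]
    return len(covered)


def submodular_mutual_information(f, a, b):
    """I_f(A; B) = f(A) + f(B) - f(A | B), assumed definition (see lead-in)."""
    a, b = set(a), set(b)
    return f(a) + f(b) - f(a | b)


# x1 and x2 share exactly one concept ("dog").
print(submodular_mutual_information(set_cover, {"x1"}, {"x2"}))
# x1 and x3 share no concepts.
print(submodular_mutual_information(set_cover, {"x1"}, {"x3"}))
```

For this unweighted set-cover instantiation the quantity reduces to the number of concepts covered by both sets, which matches the intuition of shared information.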
-
apricot: Submodular selection for data summarization in Python
Authors:
Jacob Schreiber,
Jeffrey Bilmes,
William Stafford Noble
Abstract:
We present apricot, an open source Python package for selecting representative subsets from large data sets using submodular optimization. The package implements an efficient greedy selection algorithm that offers strong theoretical guarantees on the quality of the selected set. Two submodular set functions are implemented in apricot: facility location, which is broadly applicable but requires memory quadratic in the number of examples in the data set, and a feature-based function that is less broadly applicable but can scale to millions of examples. Apricot is extremely efficient, using both algorithmic speedups such as the lazy greedy algorithm and code optimizers such as numba. We demonstrate the use of subset selection by training machine learning models to comparable accuracy using either the full data set or a representative subset thereof. This paper presents an explanation of submodular selection, an overview of the features in apricot, and an application to several data sets. The code and tutorial Jupyter notebooks are available at https://github.com/jmschrei/apricot
Submitted 8 June, 2019;
originally announced June 2019.
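The greedy facility-location selection described in this abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not apricot's implementation or API: the function names, the similarity `1 / (1 + |x_i - x_j|)`, and the toy points are all hypothetical, and the sketch uses the naive greedy loop rather than the lazy greedy speedup.

```python
def facility_location(sim, subset):
    """f(S) = sum over all items i of max_{j in S} sim[i][j]; f(empty) = 0."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim, k):
    """Pick k indices, each time adding the item with the largest marginal gain."""
    selected = []
    for _ in range(k):
        current = facility_location(sim, selected)
        best_gain, best_item = -1.0, None
        for j in range(len(sim)):
            if j in selected:
                continue
            gain = facility_location(sim, selected + [j]) - current
            if gain > best_gain:
                best_gain, best_item = gain, j
        selected.append(best_item)
    return selected


# Hypothetical toy data: two well-separated clusters on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
# The full pairwise similarity matrix is what makes memory quadratic in n.
sim = [[1.0 / (1.0 + abs(p - q)) for q in points] for p in points]
chosen = greedy_select(sim, 2)
```

On this toy input the first pick is the centre of the larger cluster and the second pick comes from the other cluster, which is the "representative subset" behaviour the abstract describes.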
-
Coresets for Data-efficient Training of Machine Learning Models
Authors:
Baharan Mirzasoleiman,
Jeff Bilmes,
Jure Leskovec
Abstract:
Incremental gradient (IG) methods, such as stochastic gradient descent and its variants, are commonly used for large scale optimization in machine learning. Despite the sustained effort to make IG methods more data-efficient, it remains an open question how to select a training data subset that can theoretically and practically perform on par with the full dataset. Here we develop CRAIG, a method to select a weighted subset (or coreset) of training data that closely estimates the full gradient by maximizing a submodular function. We prove that applying IG to this subset is guaranteed to converge to the (near)optimal solution with the same convergence rate as that of IG for convex optimization. As a result, CRAIG achieves a speedup that is inversely proportional to the size of the subset. To our knowledge, this is the first rigorous method for data-efficient training of general machine learning models. Our extensive set of experiments shows that CRAIG, while achieving practically the same solution, speeds up various IG methods by up to 6x for logistic regression and 3x for training deep neural networks.
Submitted 16 November, 2020; v1 submitted 5 June, 2019;
originally announced June 2019.
-
On Mixup Training: Improved Calibration and Predictive Uncertainty for Deep Neural Networks
Authors:
Sunil Thulasidasan,
Gopinath Chennupati,
Jeff Bilmes,
Tanmoy Bhattacharya,
Sarah Michalak
Abstract:
Mixup (Zhang et al., 2017) is a recently proposed method for training deep neural networks where additional samples are generated during training by convexly combining random pairs of images and their associated labels. While simple to implement, it has been shown to be a surprisingly effective method of data augmentation for image classification: DNNs trained with mixup show noticeable gains in classification performance on a number of image classification benchmarks. In this work, we discuss a hitherto untouched aspect of mixup training -- the calibration and predictive uncertainty of models trained with mixup. We find that DNNs trained with mixup are significantly better calibrated -- i.e., the predicted softmax scores are much better indicators of the actual likelihood of a correct prediction -- than DNNs trained in the regular fashion. We conduct experiments on a number of image classification architectures and datasets -- including large-scale datasets like ImageNet -- and find this to be the case. Additionally, we find that merely mixing features does not result in the same calibration benefit and that the label smoothing in mixup training plays a significant role in improving calibration. Finally, we also observe that mixup-trained DNNs are less prone to over-confident predictions on out-of-distribution and random-noise data. We conclude that the typical overconfidence seen in neural networks, even on in-distribution data, is likely a consequence of training with hard labels, suggesting that mixup be employed for classification tasks where predictive uncertainty is a significant concern.
Submitted 6 January, 2020; v1 submitted 27 May, 2019;
originally announced May 2019.
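The convex combination of inputs and labels that the abstract describes can be sketched as follows. The function names are hypothetical, and drawing the mixing weight from a Beta(alpha, alpha) distribution is an assumption based on the original mixup formulation rather than on this abstract.

```python
import random


def sample_lambda(alpha, rng):
    """Draw a mixing weight in [0, 1] from Beta(alpha, alpha) (assumed choice)."""
    return rng.betavariate(alpha, alpha)


def mixup_pair(x1, y1, x2, y2, lam):
    """Convexly combine two inputs and their one-hot labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


# Hypothetical two-feature, two-class example with a fixed mixing weight.
mixed_x, mixed_y = mixup_pair([0.0, 2.0], [1.0, 0.0], [2.0, 0.0], [0.0, 1.0], 0.25)
```

The mixed label is a soft distribution over classes rather than a hard one-hot vector, which is the label-smoothing effect the abstract credits for the calibration gains.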
-
Combating Label Noise in Deep Learning Using Abstention
Authors:
Sunil Thulasidasan,
Tanmoy Bhattacharya,
Jeff Bilmes,
Gopinath Chennupati,
Jamal Mohd-Yusof
Abstract:
We introduce a novel method to combat label noise when training deep neural networks for classification. We propose a loss function that permits abstention during training thereby allowing the DNN to abstain on confusing samples while continuing to learn and improve classification performance on the non-abstained samples. We show how such a deep abstaining classifier (DAC) can be used for robust learning in the presence of different types of label noise. In the case of structured or systematic label noise -- where noisy training labels or confusing examples are correlated with underlying features of the data-- training with abstention enables representation learning for features that are associated with unreliable labels. In the case of unstructured (arbitrary) label noise, abstention during training enables the DAC to be used as an effective data cleaner by identifying samples that are likely to have label noise. We provide analytical results on the loss function behavior that enable dynamic adaption of abstention rates based on learning progress during training. We demonstrate the utility of the deep abstaining classifier for various image classification tasks under different types of label noise; in the case of arbitrary label noise, we show significant improvements over previously published results on multiple image benchmarks. Source code is available at https://github.com/thulas/dac-label-noise
Submitted 1 August, 2019; v1 submitted 27 May, 2019;
originally announced May 2019.
-
A Memoization Framework for Scaling Submodular Optimization to Large Scale Problems
Authors:
Rishabh Iyer,
Jeff Bilmes
Abstract:
We are motivated by large scale submodular optimization problems, where standard algorithms that treat the submodular functions in the \emph{value oracle model} do not scale. In this paper, we present a model called the \emph{precomputational complexity model}, along with a unifying memoization based framework, which looks at the specific form of the given submodular function. A key ingredient in this framework is the notion of a \emph{precomputed statistic}, which is maintained in the course of the algorithms. We show that we can easily integrate this idea into a large class of submodular optimization problems including constrained and unconstrained submodular maximization, minimization, difference of submodular optimization, optimization with submodular constraints and several other related optimization problems. Moreover, memoization can be integrated in both discrete and continuous relaxation flavors of algorithms for these problems. We demonstrate this idea for several commonly occurring submodular functions, and show how the precomputational model provides significant speedups compared to the value oracle model. Finally, we empirically demonstrate this for large scale machine learning problems of data subset selection and summarization.
Submitted 26 February, 2019;
originally announced February 2019.
-
Near Optimal Algorithms for Hard Submodular Programs with Discounted Cooperative Costs
Authors:
Rishabh Iyer,
Jeff Bilmes
Abstract:
In this paper, we investigate a class of submodular problems which in general are very hard. These include minimizing a submodular cost function under combinatorial constraints, which include cuts, matchings, paths, etc., optimizing a submodular function under submodular cover and submodular knapsack constraints, and minimizing a ratio of submodular functions. All these problems appear in several real world problems but have hardness factors of $Ω(\sqrt{n})$ for general submodular cost functions. We show how we can achieve constant approximation factors when we restrict the cost functions to low rank sums of concave over modular functions. A wide variety of machine learning applications are very naturally modeled via this subclass of submodular functions. Our work therefore provides a tighter connection between theory and practice by enabling theoretically satisfying guarantees for a rich class of expressible, natural, and useful submodular cost models. We empirically demonstrate the utility of our models on real world problems of cooperative image matching and sensor placement with cooperative costs.
Submitted 26 February, 2019;
originally announced February 2019.
-
Greed is Still Good: Maximizing Monotone Submodular+Supermodular Functions
Authors:
Wenruo Bai,
Jeffrey A. Bilmes
Abstract:
We analyze the performance of the greedy algorithm, and also a discrete semi-gradient based algorithm, for maximizing the sum of a suBmodular and suPermodular (BP) function (both of which are non-negative monotone non-decreasing) under two types of constraints, either a cardinality constraint or $p\geq 1$ matroid independence constraints. These problems occur naturally in several real-world applications in data science, machine learning, and artificial intelligence. The problems are ordinarily inapproximable to any factor (as we show). Using the curvature $κ_f$ of the submodular term, and introducing $κ^g$ for the supermodular term (a natural dual curvature for supermodular functions), however, both of which are computable in linear time, we show that BP maximization can be efficiently approximated by both the greedy and the semi-gradient based algorithm. The algorithms yield multiplicative guarantees of $\frac{1}{κ_f}\left[1-e^{-(1-κ^g)κ_f}\right]$ and $\frac{1-κ^g}{(1-κ^g)κ_f + p}$ for the two types of constraints respectively. For pure monotone supermodular constrained maximization, these yield $1-κ^g$ and $(1-κ^g)/p$ for the two types of constraints respectively. We also analyze the hardness of BP maximization and show that our guarantees match hardness by a constant factor and by $O(\ln(p))$ respectively. Computational experiments are also provided supporting our analysis.
Submitted 23 January, 2018;
originally announced January 2018.
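The two multiplicative guarantees quoted in the abstract are easy to evaluate numerically. The sketch below simply codes those two formulas; the function names are hypothetical, and the treatment of the curvature of the submodular term equal to zero as the limiting value is an assumption made so the purely supermodular case can be evaluated.

```python
import math


def bp_cardinality_bound(kappa_f, kappa_g):
    """(1 / kappa_f) * (1 - exp(-(1 - kappa_g) * kappa_f)) for a cardinality constraint."""
    if kappa_f == 0.0:
        # Limit as kappa_f -> 0, matching the pure supermodular case in the abstract.
        return 1.0 - kappa_g
    # expm1 avoids cancellation error when kappa_f is very small.
    return -math.expm1(-(1.0 - kappa_g) * kappa_f) / kappa_f


def bp_matroid_bound(kappa_f, kappa_g, p):
    """(1 - kappa_g) / ((1 - kappa_g) * kappa_f + p) for p matroid constraints."""
    return (1.0 - kappa_g) / ((1.0 - kappa_g) * kappa_f + p)


# With kappa_f = 1 and kappa_g = 0 the first formula evaluates to 1 - 1/e.
classic = bp_cardinality_bound(1.0, 0.0)
```

Setting the supermodular curvature to zero and the submodular curvature to one gives 1 - 1/e under a cardinality constraint and 1/(1 + p) under p matroid constraints, while setting the submodular curvature to zero gives the 1 - κ^g and (1 - κ^g)/p values stated in the abstract.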
-
Deep Submodular Functions
Authors:
Jeffrey Bilmes,
Wenruo Bai
Abstract:
We start with an overview of a class of submodular functions called SCMMs (sums of concave composed with non-negative modular functions plus a final arbitrary modular). We then define a new class of submodular functions we call {\em deep submodular functions} or DSFs. We show that DSFs are a flexible parametric family of submodular functions that share many of the properties and advantages of deep neural networks (DNNs). DSFs can be motivated by considering a hierarchy of descriptive concepts over ground elements and where one wishes to allow submodular interaction throughout this hierarchy. Results in this paper show that DSFs constitute a strictly larger class of submodular functions than SCMMs. We show that, for any integer $k>0$, there are $k$-layer DSFs that cannot be represented by a $k'$-layer DSF for any $k'<k$. This implies that, like DNNs, there is a utility to depth, but unlike DNNs, the family of DSFs strictly increase with depth. Despite this, we show (using a "backpropagation" like method) that DSFs, even with arbitrarily large $k$, do not comprise all submodular functions. In offering the above results, we also define the notion of an antitone superdifferential of a concave function and show how this relates to submodular functions (in general), DSFs (in particular), negative second-order partial derivatives, continuous submodularity, and concave extensions. To further motivate our analysis, we provide various special case results from matroid theory, comparing DSFs with forms of matroid rank, in particular the laminar matroid. Lastly, we discuss strategies to learn DSFs, and define the classes of deep supermodular functions, deep difference of submodular functions, and deep multivariate submodular functions, and discuss where these can be useful in applications.
Submitted 31 January, 2017;
originally announced January 2017.
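To make the layered construction concrete, the sketch below builds a small two-layer function by nesting square roots over non-negative weighted sums, then checks the diminishing-returns property by brute force. The weights `W1` and `W2`, the choice of the square root as the concave function, and the helper names are illustrative assumptions, not the paper's notation.

```python
import math
from itertools import combinations

# Hypothetical non-negative weights: ground element -> two first-layer features.
W1 = {"a": [1.0, 0.0], "b": [0.5, 2.0], "c": [0.0, 1.0], "d": [3.0, 0.5]}
# Hypothetical non-negative weights from the two features to the output unit.
W2 = [1.0, 2.0]


def dsf(subset):
    """Two-layer sketch: sqrt of a weighted sum of sqrt-of-modular features."""
    hidden = [math.sqrt(sum(W1[v][j] for v in subset)) for j in range(len(W2))]
    return math.sqrt(sum(w * h for w, h in zip(W2, hidden)))


def is_submodular(f, ground, tol=1e-12):
    """Brute-force check: gain of v at A is at least its gain at any superset B."""
    ground = list(ground)
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for a in subsets:
        for b in subsets:
            if not a <= b:
                continue
            for v in ground:
                if v in b:
                    continue
                if f(a | {v}) - f(a) < f(b | {v}) - f(b) - tol:
                    return False
    return True


dsf_ok = is_submodular(dsf, W1)
```

The brute-force check is exponential in the ground-set size, so it is only a sanity test for toy examples; the first layer alone is an SCMM-style function, and the outer square root adds the second layer.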
-
Semi-Supervised Phone Classification using Deep Neural Networks and Stochastic Graph-Based Entropic Regularization
Authors:
Sunil Thulasidasan,
Jeffrey Bilmes
Abstract:
We describe a graph-based semi-supervised learning framework in the context of deep neural networks that uses a graph-based entropic regularizer to favor smooth solutions over a graph induced by the data. The main contribution of this work is a computationally efficient, stochastic graph-regularization technique that uses mini-batches that are consistent with the graph structure, but also provides enough stochasticity (in terms of mini-batch data diversity) for convergence of stochastic gradient descent methods to good solutions. For this work, we focus on results of frame-level phone classification accuracy on the TIMIT speech corpus but our method is general and scalable to much larger data sets. Results indicate that our method significantly improves classification accuracy compared to the fully-supervised case when the fraction of labeled data is low, and it is competitive with other methods in the fully labeled case.
Submitted 30 May, 2018; v1 submitted 14 December, 2016;
originally announced December 2016.
-
Efficient Distributed Semi-Supervised Learning using Stochastic Regularization over Affinity Graphs
Authors:
Sunil Thulasidasan,
Jeffrey Bilmes,
Garrett Kenyon
Abstract:
We describe a computationally efficient, stochastic graph-regularization technique that can be utilized for the semi-supervised training of deep neural networks in a parallel or distributed setting. We utilize a technique, first described in [13] for the construction of mini-batches for stochastic gradient descent (SGD) based on synthesized partitions of an affinity graph that are consistent with the graph structure, but also preserve enough stochasticity for convergence of SGD to good local minima. We show how our technique allows a graph-based semi-supervised loss function to be decomposed into a sum over objectives, facilitating data parallelism for scalable training of machine learning models. Empirical results indicate that our method significantly improves classification accuracy compared to the fully-supervised case when the fraction of labeled data is low, and in the parallel case, achieves significant speed-up in terms of wall-clock time to convergence. We show the results for both sequential and distributed-memory semi-supervised DNN training on a speech corpus.
Submitted 30 May, 2018; v1 submitted 14 December, 2016;
originally announced December 2016.
-
Scaling Submodular Maximization via Pruned Submodularity Graphs
Authors:
Tianyi Zhou,
Hua Ouyang,
Yi Chang,
Jeff Bilmes,
Carlos Guestrin
Abstract:
We propose a new random pruning method (called "submodular sparsification (SS)") to reduce the cost of submodular maximization. The pruning is applied via a "submodularity graph" over the $n$ ground elements, where each directed edge is associated with a pairwise dependency defined by the submodular function. In each step, SS prunes a $1-1/\sqrt{c}$ (for $c>1$) fraction of the nodes using weights on edges computed based on only a small number ($O(\log n)$) of randomly sampled nodes. The algorithm requires $\log_{\sqrt{c}}n$ steps with a small and highly parallelizable per-step computation. An accuracy-speed tradeoff parameter $c$, set as $c = 8$, leads to a fast shrink rate $\sqrt{2}/4$ and small iteration complexity $\log_{2\sqrt{2}}n$. Analysis shows that w.h.p., the greedy algorithm on the pruned set of size $O(\log^2 n)$ can achieve a guarantee similar to that of processing the original dataset. In news and video summarization tasks, SS is able to substantially reduce both computational costs and memory usage, while maintaining (or even slightly exceeding) the quality of the original (and much more costly) greedy algorithm.
Submitted 1 June, 2016;
originally announced June 2016.
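The arithmetic behind the quoted shrink rate and step count can be checked directly. The sketch below only evaluates the expressions stated in the abstract; the function names are hypothetical.

```python
import math


def keep_fraction(c):
    """Fraction of nodes kept per step when a 1 - 1/sqrt(c) fraction is pruned."""
    return 1.0 / math.sqrt(c)


def num_steps(n, c):
    """log base sqrt(c) of n, the step count quoted in the abstract."""
    return math.log(n) / math.log(math.sqrt(c))


# For c = 8 the kept fraction is 1/sqrt(8) = sqrt(2)/4 and the base is 2*sqrt(2).
rate = keep_fraction(8)
steps = num_steps(8 ** 5, 8)
```

Since sqrt(8) = 2*sqrt(2), each step keeps roughly 35% of the remaining nodes, and a ground set of 8^5 = 32768 elements corresponds to 10 steps under this formula.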
-
Stream Clipper: Scalable Submodular Maximization on Stream
Authors:
Tianyi Zhou,
Jeff Bilmes
Abstract:
We propose a streaming submodular maximization algorithm "stream clipper" that performs as well as the offline greedy algorithm on document/video summarization in practice. It adds elements from a stream either to a solution set $S$ or to an extra buffer $B$ based on two adaptive thresholds, and improves $S$ by a final greedy step that starts from $S$ adding elements from $B$. During this process, swapping elements out of $S$ can occur if doing so yields improvements. The thresholds adapt based on whether current memory utilization exceeds a budget; e.g., the algorithm increases the lower threshold and removes from the buffer $B$ elements below the new lower threshold. We show that, while our approximation factor in the worst case is $1/2$ (like in previous work, and corresponding to the tight bound), there are data-dependent conditions where our bound falls within the range $[1/2, 1-1/e]$. In news and video summarization experiments, the algorithm consistently outperforms other streaming methods, and, while using significantly less computation and memory, performs similarly to the offline greedy algorithm.
Submitted 12 February, 2018; v1 submitted 1 June, 2016;
originally announced June 2016.
-
On Deep Multi-View Representation Learning: Objectives and Optimization
Authors:
Weiran Wang,
Raman Arora,
Karen Livescu,
Jeff Bilmes
Abstract:
We consider learning representations (features) in the setting in which we have access to multiple unlabeled views of the data for learning while only one view is available for downstream tasks. Previous work on this problem has proposed several techniques based on deep neural networks, typically involving either autoencoder-like networks with a reconstruction objective or paired feedforward networks with a batch-style correlation-based objective. We analyze several techniques based on prior work, as well as new variants, and compare them empirically on image, speech, and text tasks. We find an advantage for correlation-based representation learning, while the best results on most tasks are obtained with our new variant, deep canonically correlated autoencoders (DCCAE). We also explore a stochastic optimization procedure for minibatch correlation-based objectives and discuss the time/performance trade-offs for kernel-based and neural network-based implementations.
Submitted 2 February, 2016;
originally announced February 2016.
-
Submodular Hamming Metrics
Authors:
Jennifer Gillenwater,
Rishabh Iyer,
Bethany Lusch,
Rahul Kidambi,
Jeff Bilmes
Abstract:
We show that there is a largely unexplored class of functions (positive polymatroids) that can define proper discrete metrics over pairs of binary vectors and that are fairly tractable to optimize over. By exploiting submodularity, we are able to give hardness results and approximation algorithms for optimizing over such metrics. Additionally, we demonstrate empirically the effectiveness of these metrics and associated algorithms on both a metric minimization task (a form of clustering) and also a metric maximization task (generating diverse k-best lists).
Submitted 6 November, 2015;
originally announced November 2015.
-
Mixed Robust/Average Submodular Partitioning: Fast Algorithms, Guarantees, and Applications to Parallel Machine Learning and Multi-Label Image Segmentation
Authors:
Kai Wei,
Rishabh Iyer,
Shengjie Wang,
Wenruo Bai,
Jeff Bilmes
Abstract:
We study two mixed robust/average-case submodular partitioning problems that we collectively call Submodular Partitioning. These problems generalize both purely robust instances of the problem (namely max-min submodular fair allocation (SFA) and min-max submodular load balancing (SLB)) and also generalize average-case instances (that is, the submodular welfare problem (SWP) and submodular multiway partition (SMP)). While the robust versions have been studied in the theory community, existing work has focused on tight approximation guarantees, and the resultant algorithms are not, in general, scalable to very large real-world applications. This is in contrast to the average case, where most of the algorithms are scalable. In the present paper, we bridge this gap by proposing several new algorithms (including those based on greedy, majorization-minimization, minorization-maximization, and relaxation algorithms) that not only scale to large sizes but also achieve theoretical approximation guarantees close to the state-of-the-art, and in some cases achieve new tight bounds. We also provide new scalable algorithms that apply to additive combinations of the robust and average-case extreme objectives. We show that these problems have many applications in machine learning (ML). These include: 1) data partitioning and load balancing for distributed machine algorithms on parallel machines; 2) data clustering; and 3) multi-label image segmentation with (only) Boolean submodular functions via pixel partitioning. We empirically demonstrate the efficacy of our algorithms on real-world problems involving data partitioning for distributed optimization of standard machine learning objectives (including both convex and deep neural network objectives), and also on purely unsupervised (i.e., no supervised or semi-supervised learning, and no interactive segmentation) image segmentation.
Submitted 16 August, 2016; v1 submitted 29 October, 2015;
originally announced October 2015.
-
Polyhedral aspects of Submodularity, Convexity and Concavity
Authors:
Rishabh Iyer,
Jeff Bilmes
Abstract:
Seminal work by Edmonds and Lovasz shows the strong connection between submodularity and convexity. Submodular functions have tight modular lower bounds and subdifferentials, in a manner akin to convex functions. They also admit poly-time algorithms for minimization and satisfy the Fenchel duality theorem and the Discrete Separation Theorem, both of which are fundamental characteristics of convex functions. Submodular functions also show signs similar to concavity. Submodular maximization, though NP hard, admits constant factor approximation guarantees. Concave functions composed with modular functions are submodular, and they also satisfy the diminishing returns property. This manuscript provides a more complete picture of the relationship of submodularity to convexity and concavity, by extending many of the results connecting submodularity with convexity to the concave aspects of submodularity. We first show the existence of superdifferentials, and efficiently computable tight modular upper bounds of a submodular function. While we show that it is hard to characterize this polyhedron, we obtain inner and outer bounds on the superdifferential along with certain specific and useful supergradients. We then investigate forms of concave extensions of submodular functions and show interesting relationships to submodular maximization. We next show connections between optimality conditions over the superdifferentials and submodular maximization, and show how forms of approximate optimality conditions translate into approximation factors for maximization. We end this paper by studying versions of the discrete separation theorem and the Fenchel duality theorem when seen from the concave point of view. In every case, we relate our results to the existing results from the convex point of view, thereby improving the analysis of the relationship between submodularity, convexity, and concavity.
Submitted 8 September, 2015; v1 submitted 24 June, 2015;
originally announced June 2015.
-
The Lovasz-Bregman Divergence and connections to rank aggregation, clustering, and web ranking
Authors:
Rishabh Iyer,
Jeff A. Bilmes
Abstract:
We extend the recently introduced theory of Lovasz-Bregman (LB) divergences (Iyer & Bilmes 2012) in several ways. We show that they represent a distortion between a "score" and an "ordering", thus providing a new view of rank aggregation and order based clustering with interesting connections to web ranking. We show how the LB divergences have a number of properties akin to many permutation based metrics, and in fact have as special cases forms very similar to the Kendall-tau metric. We also show how the LB divergences subsume a number of commonly used ranking measures in information retrieval, like NDCG and AUC. Unlike the traditional permutation based metrics, however, the LB divergence naturally captures a notion of "confidence" in the orderings, thus providing a new representation to applications involving aggregating scores as opposed to just orderings. We show how a number of recently used web ranking models are forms of Lovasz-Bregman rank aggregation and also observe that a natural form of Mallow's model using the LB divergence has been used as conditional ranking models for the "Learning to Rank" problem.
Submitted 9 August, 2014;
originally announced August 2014.
-
Algorithms for Approximate Minimization of the Difference Between Submodular Functions, with Applications
Authors:
Rishabh Iyer,
Jeff A. Bilmes
Abstract:
We extend the work of Narasimhan and Bilmes [30] for minimizing set functions representable as a dierence between submodular functions. Similar to [30], our new algorithms are guaranteed to monotonically reduce the objective function at every step. We empirically and theoretically show that the per-iteration cost of our algorithms is much less than [30], and our algorithms can be used to efficient…
▽ More
We extend the work of Narasimhan and Bilmes [30] for minimizing set functions representable as a dierence between submodular functions. Similar to [30], our new algorithms are guaranteed to monotonically reduce the objective function at every step. We empirically and theoretically show that the per-iteration cost of our algorithms is much less than [30], and our algorithms can be used to efficiently minimize a dierence between submodular functions under various combinatorial constraints, a problem not previously addressed. We provide computational bounds and a hardness result on the multiplicative inapproximability of minimizing the dierence between submodular functions. We show, however, that it is possible to give worst-case additive bounds by providing a polynomial time computable lower-bound on the minima. Finally we show how a number of machine learning problems can be modeled as minimizing the dierence between submodular functions. We experimentally show the validity of our algorithms by testing them on the problem of feature selection with submodular cost features.
Submitted 9 August, 2014;
originally announced August 2014.
-
Divide-and-Conquer Learning by Anchoring a Conical Hull
Authors:
Tianyi Zhou,
Jeff Bilmes,
Carlos Guestrin
Abstract:
We reduce a broad class of machine learning problems, usually addressed by EM or sampling, to the problem of finding the $k$ extremal rays spanning the conical hull of a data point set. These $k$ "anchors" lead to a global solution and a more interpretable model that can even outperform EM and sampling on generalization error. To find the $k$ anchors, we propose a novel divide-and-conquer learning scheme "DCA" that distributes the problem to $\mathcal O(k\log k)$ same-type sub-problems on different low-D random hyperplanes, each of which can be solved by any solver. For the 2D sub-problem, we present a non-iterative solver that only needs to compute an array of cosine values and its max/min entries. DCA also provides a faster subroutine for other methods to check whether a point is covered in a conical hull, which improves algorithm design in multiple dimensions and brings significant speedup to learning. We apply our method to GMM, HMM, LDA, NMF and subspace clustering, then show its competitive performance and scalability over other methods on rich datasets.
Submitted 22 June, 2014;
originally announced June 2014.
-
Graph Cuts with Interacting Edge Costs - Examples, Approximations, and Algorithms
Authors:
Stefanie Jegelka,
Jeff Bilmes
Abstract:
We study an extension of the classical graph cut problem, wherein we replace the modular (sum of edge weights) cost function by a submodular set function defined over graph edges. Special cases of this problem have appeared in different applications in signal processing, machine learning, and computer vision. In this paper, we connect these applications via the generic formulation of "cooperative graph cuts", for which we study complexity, algorithms, and connections to polymatroidal network flows. Finally, we compare the proposed algorithms empirically.
Submitted 26 March, 2016; v1 submitted 2 February, 2014;
originally announced February 2014.
-
Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions
Authors:
Rishabh Iyer,
Stefanie Jegelka,
Jeff Bilmes
Abstract:
We investigate three related and important problems connected to machine learning: approximating a submodular function everywhere, learning a submodular function (in a PAC-like setting [53]), and constrained minimization of submodular functions. We show that the complexity of all three problems depends on the 'curvature' of the submodular function, and provide lower and upper bounds that refine and improve previous results [3, 16, 18, 52]. Our proof techniques are fairly generic. We either use a black-box transformation of the function (for approximation and learning), or a transformation of algorithms to use an appropriate surrogate function (for minimization). Curiously, curvature has been known to influence approximations for submodular maximization [7, 55], but its effect on minimization, approximation and learning has hitherto been open. We complete this picture, and also support our theoretical claims by empirical results.
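As a small illustration of the quantity this abstract refers to, the sketch below computes the total curvature of a monotone submodular function, assuming the standard definition $\kappa_f = 1 - \min_j f(j \mid V \setminus \{j\}) / f(\{j\})$. The abstract itself does not state the definition, and the coverage function and the names `coverage` and `total_curvature` are hypothetical choices made for this example only.

```python
def coverage(sets, S):
    """Coverage function: number of elements covered by the union of sets[i], i in S."""
    covered = set()
    for i in S:
        covered |= sets[i]
    return len(covered)


def total_curvature(f, ground):
    """Total curvature, assuming the definition 1 - min_j f(j | V minus {j}) / f({j}).

    Assumes f is monotone submodular and f({j}) > 0 for every j.
    """
    V = frozenset(ground)
    f_V = f(V)
    ratios = []
    for j in V:
        gain_last = f_V - f(V - {j})          # marginal gain of j added last
        ratios.append(gain_last / f(frozenset({j})))
    return 1.0 - min(ratios)


# Hypothetical toy instance: items 0 and 1 overlap on element "b".
sets = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"d"}}
kappa = total_curvature(lambda S: coverage(sets, S), sets.keys())

# Disjoint sets give a modular function, whose curvature is 0.
disjoint = {0: {"a"}, 1: {"b"}, 2: {"c"}}
kappa_modular = total_curvature(lambda S: coverage(disjoint, S), disjoint.keys())
```

Under this definition a curvature of 0 corresponds to a modular (additive) function, and values closer to 1 indicate stronger diminishing returns.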
Submitted 8 November, 2013;
originally announced November 2013.
-
Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints
Authors:
Rishabh Iyer,
Jeff Bilmes
Abstract:
We investigate two new optimization problems -- minimizing a submodular function subject to a submodular lower bound constraint (submodular cover) and maximizing a submodular function subject to a submodular upper bound constraint (submodular knapsack). We are motivated by a number of real-world applications in machine learning including sensor placement and data subset selection, which require maximizing a certain submodular function (like coverage or diversity) while simultaneously minimizing another (like cooperative cost). These problems are often posed as minimizing the difference between submodular functions [14, 35] which is in the worst case inapproximable. We show, however, that by phrasing these problems as constrained optimization, which is more natural for many applications, we achieve a number of bounded approximation guarantees. We also show that both these problems are closely related and an approximation algorithm solving one can be used to obtain an approximation guarantee for the other. We provide hardness results for both problems thus showing that our approximation factors are tight up to log-factors. Finally, we empirically demonstrate the performance and good scalability properties of our algorithms.
Submitted 8 November, 2013;
originally announced November 2013.
-
The Lovasz-Bregman Divergence and connections to rank aggregation, clustering, and web ranking
Authors:
Rishabh Iyer,
Jeff Bilmes
Abstract:
We extend the recently introduced theory of Lovasz-Bregman (LB) divergences (Iyer & Bilmes, 2012) in several ways. We show that they represent a distortion between a 'score' and an 'ordering', thus providing a new view of rank aggregation and order based clustering with interesting connections to web ranking. We show how the LB divergences have a number of properties akin to many permutation based metrics, and in fact have as special cases forms very similar to the Kendall-$τ$ metric. We also show how the LB divergences subsume a number of commonly used ranking measures in information retrieval, like the NDCG and AUC. Unlike the traditional permutation based metrics, however, the LB divergence naturally captures a notion of "confidence" in the orderings, thus providing a new representation to applications involving aggregating scores as opposed to just orderings. We show how a number of recently used web ranking models are forms of Lovasz-Bregman rank aggregation and also observe that a natural form of Mallow's model using the LB divergence has been used as conditional ranking models for the 'Learning to Rank' problem.
Submitted 23 August, 2013;
originally announced August 2013.
-
Fast Semidifferential-based Submodular Function Optimization
Authors:
Rishabh Iyer,
Stefanie Jegelka,
Jeff Bilmes
Abstract:
We present a practical and powerful new framework for both unconstrained and constrained submodular function optimization based on discrete semidifferentials (sub- and super-differentials). The resulting algorithms, which repeatedly compute and then efficiently optimize submodular semigradients, offer new and generalize many old methods for submodular optimization. Our approach, moreover, takes steps towards providing a unifying paradigm applicable to both submodular minimization and maximization, problems that historically have been treated quite distinctly. The practicality of our algorithms is important since interest in submodularity, owing to its natural and wide applicability, has recently been in ascendance within machine learning. We analyze theoretical properties of our algorithms for minimization and maximization, and show that many state-of-the-art maximization algorithms are special cases. Lastly, we complement our theoretical analyses with supporting empirical experiments.
Submitted 5 August, 2013;
originally announced August 2013.
-
Dynamic Bayesian Multinets
Authors:
Jeff A. Bilmes
Abstract:
In this work, dynamic Bayesian multinets are introduced where a Markov chain state at time t determines conditional independence patterns between random variables lying within a local time window surrounding t. It is shown how information-theoretic criterion functions can be used to induce sparse, discriminative, and class-conditional network structures that yield an optimal approximation to the class posterior probability, and therefore are useful for the classification task. Using a new structure learning heuristic, the resulting models are tested on a medium-vocabulary isolated-word speech recognition task. It is demonstrated that these discriminatively structured dynamic Bayesian multinets, when trained in a maximum likelihood setting using EM, can outperform both HMMs and other dynamic Bayesian networks with a similar number of parameters.
Submitted 16 January, 2013;
originally announced January 2013.
-
On Triangulating Dynamic Graphical Models
Authors:
Jeff A. Bilmes,
Chris Bartels
Abstract:
This paper introduces new methodology to triangulate dynamic Bayesian networks (DBNs) and dynamic graphical models (DGMs). While most methods to triangulate such networks use some form of constrained elimination scheme based on properties of the underlying directed graph, we find it useful to view triangulation and elimination using properties only of the resulting undirected graph, obtained after the moralization step. We first briefly introduce the Graphical model toolkit (GMTK) and its notion of dynamic graphical models, one that slightly extends the standard notion of a DBN. We next introduce the 'boundary algorithm', a method to find the best boundary between partitions in a dynamic model. We find that using this algorithm, the notions of forward- and backward-interface become moot - namely, the size and fill-in of the best forward- and backward-interface are identical. Moreover, we observe that finding a good partition boundary allows for constrained elimination orders (and therefore graph triangulations) that are not possible using standard slice-by-slice constrained eliminations. More interestingly, with certain boundaries it is possible to obtain constrained elimination schemes that lie outside the space of possible triangulations using only unconstrained elimination. Lastly, we report triangulation results on invented graphs, standard DBNs from the literature, novel DBNs used in speech recognition research systems, and also random graphs. Using a number of different triangulation quality measures (max clique size, state-space, etc.), we find that with our boundary algorithm the triangulation quality can dramatically improve.
Submitted 19 October, 2012;
originally announced December 2012.
-
Spectrum Identification using a Dynamic Bayesian Network Model of Tandem Mass Spectra
Authors:
Ajit P. Singh,
John Halloran,
Jeff A. Bilmes,
Katrin Kirchoff,
William S. Noble
Abstract:
Shotgun proteomics is a high-throughput technology used to identify unknown proteins in a complex mixture. At the heart of this process is a prediction task, the spectrum identification problem, in which each fragmentation spectrum produced by a shotgun proteomics experiment must be mapped to the peptide (protein subsequence) which generated the spectrum. We propose a new algorithm for spectrum identification, based on dynamic Bayesian networks, which significantly outperforms the de-facto standard tools for this task: SEQUEST and Mascot.
Submitted 16 October, 2012;
originally announced October 2012.
-
Learning Mixtures of Submodular Shells with Application to Document Summarization
Authors:
Hui Lin,
Jeff A. Bilmes
Abstract:
We introduce a method to learn a mixture of submodular "shells" in a large-margin setting. A submodular shell is an abstract submodular function that can be instantiated with a ground set and a set of parameters to produce a submodular function. A mixture of such shells can then also be so instantiated to produce a more complex submodular function. What our algorithm learns are the mixture weights over such shells. We provide a risk bound guarantee when learning in a large-margin structured-prediction setting using a projected subgradient method when only approximate submodular optimization is possible (such as with submodular function maximization). We apply this method to the problem of multi-document summarization and produce the best results reported so far on the widely used NIST DUC-05 through DUC-07 document summarization corpora.
Submitted 16 October, 2012;
originally announced October 2012.
-
PAC-learning bounded tree-width Graphical Models
Authors:
Mukund Narasimhan,
Jeff A. Bilmes
Abstract:
We show that the class of strongly connected graphical models with treewidth at most k can be properly efficiently PAC-learnt with respect to the Kullback-Leibler Divergence. Previous approaches to this problem, such as those of Chow ([1]), and Hoffgen ([7]) have shown that this class is PAC-learnable by reducing it to a combinatorial optimization problem. However, for k > 1, this problem is NP-complete ([15]), and so unless P=NP, these approaches will take exponential amounts of time. Our approach differs significantly from these, in that it first attempts to find approximate conditional independencies by solving (polynomially many) submodular optimization problems, and then using a dynamic programming formulation to combine the approximate conditional independence information to derive a graphical model with underlying graph of the tree-width specified. This gives us an efficient (polynomial time in the number of random variables) PAC-learning algorithm which requires only a polynomial number of samples of the true distribution, and only polynomial running time.
Submitted 11 July, 2012;
originally announced July 2012.
-
A submodular-supermodular procedure with applications to discriminative structure learning
Authors:
Mukund Narasimhan,
Jeff A. Bilmes
Abstract:
In this paper, we present an algorithm for minimizing the difference between two submodular functions using a variational framework which is based on (an extension of) the concave-convex procedure [17]. Because several commonly used metrics in machine learning, like mutual information and conditional mutual information, are submodular, the problem of minimizing the difference of two submodular problems arises naturally in many machine learning applications. Two such applications are learning discriminatively structured graphical models and feature selection under computational complexity constraints. A commonly used metric for measuring discriminative capacity is the EAR measure which is the difference between two conditional mutual information terms. Feature selection taking complexity considerations into account also falls into this framework because both the information that a set of features provides and the cost of computing and using the features can be modeled as submodular functions. This problem is NP-hard, and we give a polynomial time heuristic for it. We also present results on synthetic data to show that classifiers based on discriminative graphical models using this algorithm can significantly outperform classifiers based on generative graphical models.
Submitted 4 July, 2012;
originally announced July 2012.
-
Algorithms for Approximate Minimization of the Difference Between Submodular Functions, with Applications
Authors:
Rishabh Iyer,
Jeff Bilmes
Abstract:
We extend the work of Narasimhan and Bilmes [30] for minimizing set functions representable as a difference between submodular functions. Similar to [30], our new algorithms are guaranteed to monotonically reduce the objective function at every step. We empirically and theoretically show that the per-iteration cost of our algorithms is much less than [30], and our algorithms can be used to efficiently minimize a difference between submodular functions under various combinatorial constraints, a problem not previously addressed. We provide computational bounds and a hardness result on the multiplicative inapproximability of minimizing the difference between submodular functions. We show, however, that it is possible to give worst-case additive bounds by providing a polynomial time computable lower-bound on the minima. Finally we show how a number of machine learning problems can be modeled as minimizing the difference between submodular functions. We experimentally show the validity of our algorithms by testing them on the problem of feature selection with submodular cost features.
Submitted 24 August, 2013; v1 submitted 2 July, 2012;
originally announced July 2012.
-
Recognizing Activities and Spatial Context Using Wearable Sensors
Authors:
Amarnag Subramanya,
Alvin Raj,
Jeff A. Bilmes,
Dieter Fox
Abstract:
We introduce a new dynamic model with the capability of recognizing both activities that an individual is performing as well as where that individual is located. Our model is novel in that it utilizes a dynamic graphical model to jointly estimate both activity and spatial context over time based on the simultaneous use of asynchronous observations consisting of GPS measurements, and measurements from a small mountable sensor board. Joint inference is quite desirable as it has the ability to improve accuracy of the model. A key goal, however, in designing our overall system is to be able to perform accurate inference decisions while minimizing the amount of hardware an individual must wear. This minimization leads to greater comfort and flexibility, decreased power requirements and therefore increased battery life, and reduced cost. We show results indicating that our joint measurement model outperforms measurements from either the sensor board or GPS alone, using two types of probabilistic inference procedures, namely particle filtering and pruned exact inference.
Submitted 27 June, 2012;
originally announced June 2012.
-
Non-Minimal Triangulations for Mixed Stochastic/Deterministic Graphical Models
Authors:
Chris Bartels,
Jeff A. Bilmes
Abstract:
We observe that certain large-clique graph triangulations can be useful to reduce both computational and space requirements when making queries on mixed stochastic/deterministic graphical models. We demonstrate that many of these large-clique triangulations are non-minimal and are thus unattainable via the variable elimination algorithm. We introduce ancestral pairs as the basis for novel triangulation heuristics and prove that no more than the addition of edges between ancestral pairs need be considered when searching for state space optimal triangulations in such graphs. Empirical results on random and real world graphs show that the resulting triangulations that yield significant speedups are almost always non-minimal. We also give an algorithm and correctness proof for determining if a triangulation can be obtained via elimination, and we show that the decision problem associated with finding optimal state space triangulations in this mixed stochastic/deterministic setting is NP-complete.
Submitted 27 June, 2012;
originally announced June 2012.
-
Consensus ranking under the exponential model
Authors:
Marina Meila,
Kapil Phadnis,
Arthur Patterson,
Jeff A. Bilmes
Abstract:
We analyze the generalized Mallows model, a popular exponential model over rankings. Estimating the central (or consensus) ranking from data is NP-hard. We obtain the following new results: (1) We show that search methods can estimate both the central ranking pi0 and the model parameters theta exactly. The search is n! in the worst case, but is tractable when the true distribution is concentrated around its mode; (2) We show that the generalized Mallows model is jointly exponential in (pi0; theta), and introduce the conjugate prior for this model class; (3) The sufficient statistics are the pairwise marginal probabilities that item i is preferred to item j. Preliminary experiments confirm the theoretical predictions and compare the new algorithm and existing heuristics.
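Result (3) above names the pairwise marginal probabilities as the sufficient statistics. A minimal sketch of how such a matrix could be tabulated from observed rankings follows; the function name `pairwise_marginals` and the toy data are hypothetical, and the sketch covers only the statistics, not the paper's search or estimation procedure.

```python
from itertools import combinations


def pairwise_marginals(rankings, n_items):
    """Q[i][j] = fraction of rankings in which item i is placed before item j.

    Each ranking lists item ids (0..n_items-1) from most to least preferred.
    """
    counts = [[0] * n_items for _ in range(n_items)]
    for ranking in rankings:
        # combinations preserves list order, so a always precedes b in the ranking
        for a, b in combinations(ranking, 2):
            counts[a][b] += 1
    total = len(rankings)
    return [[c / total for c in row] for row in counts]


# Hypothetical toy sample of three rankings over three items.
rankings = [[0, 1, 2], [0, 2, 1], [1, 0, 2]]
Q = pairwise_marginals(rankings, 3)
```

For complete rankings, `Q[i][j] + Q[j][i]` equals 1 for every pair of distinct items, so only the entries above the diagonal carry independent information.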
Submitted 20 June, 2012;
originally announced June 2012.