
Selectively Contextual Bandits

Authors: Claudia Roberts, Maria Dimakopoulou, Qifeng Qiao, Ashok Chandrashekhar, Tony Jebara

Abstract: Contextual bandits are widely used in industrial personalization systems. These online learning frameworks learn a treatment assignment policy in the presence of treatment effects that vary with the observed contextual features of the users. While personalization creates a rich user experience that reflects individual interests, there are benefits of a shared experience across a community that enable participation in the zeitgeist. Such benefits are emergent through network effects and are not captured in regret metrics typically employed in evaluating bandits. To balance these needs, we propose a new online learning algorithm that preserves benefits of personalization while increasing the commonality in treatments across users. Our approach selectively interpolates between a contextual bandit algorithm and a context-free multi-arm bandit and leverages the contextual information for a treatment decision only if it promises significant gains. Apart from helping users of personalization systems balance their experience between the individualized and shared, simplifying the treatment assignment policy by making it selectively reliant on the context can help improve the rate of learning in some cases. We evaluate our approach in a classification setting using public datasets and show the benefits of the hybrid policy.

Submitted 9 May, 2022; originally announced May 2022.

arXiv:2103.13420 [pdf, other]

Active Multitask Learning with Committees

Authors: Jingxi Xu, Da Tang, Tony Jebara

Abstract: The cost of annotating training data has traditionally been a bottleneck for supervised learning approaches. The problem is further exacerbated when supervised learning is applied to a number of correlated tasks simultaneously since the amount of labels required scales with the number of tasks. To mitigate this concern, we propose an active multitask learning algorithm that achieves knowledge transfer between tasks. The approach forms a so-called committee for each task that jointly makes decisions and directly shares data across similar tasks. Our approach reduces the number of queries needed during training while maintaining high accuracy on test data. Empirical results on benchmark datasets show significant improvements on both accuracy and number of query requests.

Submitted 24 March, 2021; originally announced March 2021.

arXiv:1906.06419 [pdf, other]

Learning Correlated Latent Representations with Adaptive Priors

Authors: Da Tang, Dawen Liang, Nicholas Ruozzi, Tony Jebara

Abstract: Variational Auto-Encoders (VAEs) have been widely applied for learning compact, low-dimensional latent representations of high-dimensional data. When the correlation structure among data points is available, previous work proposed Correlated Variational Auto-Encoders (CVAEs), which employ a structured mixture model as prior and a structured variational posterior for each mixture component to enforce that the learned latent representations follow the same correlation structure. However, as we demonstrate in this work, such a choice cannot guarantee that CVAEs capture all the correlations. Furthermore, it prevents us from obtaining a tractable joint and marginal variational distribution. To address these issues, we propose Adaptive Correlated Variational Auto-Encoders (ACVAEs), which apply an adaptive prior distribution that can be adjusted during training and can learn a tractable joint variational distribution. Its tractable form also enables further refinement with belief propagation. Experimental results on link prediction and hierarchical clustering show that ACVAEs significantly outperform CVAEs among other benchmarks.

Submitted 18 December, 2019; v1 submitted 14 June, 2019; originally announced June 2019.

Comments: 16 pages, 1 figure, 5 tables

arXiv:1905.12052 [pdf, other]

A New Distribution on the Simplex with Auto-Encoding Applications

Authors: Andrew Stirn, Tony Jebara, David A Knowles

Abstract: We construct a new distribution for the simplex using the Kumaraswamy distribution and an ordered stick-breaking process. We explore and develop the theoretical properties of this new distribution and prove that it exhibits symmetry under the same conditions as the well-known Dirichlet. Like the Dirichlet, the new distribution is adept at capturing sparsity but, unlike the Dirichlet, has an exact and closed form reparameterization--making it well suited for deep variational Bayesian modeling. We demonstrate the distribution's utility in a variety of semi-supervised auto-encoding tasks. In all cases, the resulting models achieve competitive performance commensurate with their simplicity, use of explicit probability models, and abstinence from adversarial training.

Submitted 14 December, 2019; v1 submitted 28 May, 2019; originally announced May 2019.

Comments: 15 pages, 6 figures, 1 tables
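
The abstract above combines Kumaraswamy draws with a stick-breaking process. A minimal sketch of that general idea follows, assuming plain (unordered) stick-breaking and hypothetical shape parameters; it is not the paper's ordered construction. The function names `sample_kumaraswamy` and `stick_breaking_sample` are illustrative only. The Kumaraswamy CDF, F(x) = 1 - (1 - x^a)^b, inverts in closed form, which is what makes a reparameterized sampler possible.

```python
import random


def sample_kumaraswamy(a, b, rng):
    """Draw from Kumaraswamy(a, b) by inverting its CDF F(x) = 1 - (1 - x**a)**b."""
    u = rng.random()  # uniform noise that does not depend on (a, b)
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)


def stick_breaking_sample(params, rng):
    """Turn K-1 Kumaraswamy draws into K weights that sum to one.

    `params` is a list of (a, b) pairs; the values used below are
    hypothetical and chosen only for illustration.
    """
    remaining = 1.0  # length of the stick that has not been assigned yet
    weights = []
    for a, b in params:
        v = sample_kumaraswamy(a, b, rng)
        weights.append(v * remaining)  # break off a fraction v of the rest
        remaining *= 1.0 - v
    weights.append(remaining)  # the last component takes what is left
    return weights


rng = random.Random(0)
pi = stick_breaking_sample([(1.0, 3.0), (1.0, 2.0), (1.0, 1.0)], rng)
```

Because the sample is a deterministic function of the uniform noise and the parameters, gradients can in principle flow through the parameters in an automatic-differentiation framework; this sketch uses only the standard library and does not compute gradients.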

arXiv:1905.05335 [pdf, other]

Correlated Variational Auto-Encoders

Authors: Da Tang, Dawen Liang, Tony Jebara, Nicholas Ruozzi

Abstract: Variational Auto-Encoders (VAEs) are capable of learning latent representations for high dimensional data. However, due to the i.i.d. assumption, VAEs only optimize the singleton variational distributions and fail to account for the correlations between data points, which might be crucial for learning latent representations from datasets where a priori we know correlations exist. We propose Correlated Variational Auto-Encoders (CVAEs) that can take the correlation structure into consideration when learning latent representations with VAEs. CVAEs apply a prior based on the correlation structure. To address the intractability introduced by the correlated prior, we develop an approximation by average of a set of tractable lower bounds over all maximal acyclic subgraphs of the undirected correlation graph. Experimental results on matching and link prediction on public benchmark rating datasets and spectral clustering on a synthetic dataset show the effectiveness of the proposed method over baseline algorithms.

Submitted 17 April, 2020; v1 submitted 13 May, 2019; originally announced May 2019.

Comments: International Conference on Machine Learning (ICML), 2019

arXiv:1905.03818 [pdf, other]

Beta Survival Models

Authors: David Hubbard, Benoit Rostykus, Yves Raimond, Tony Jebara

Abstract: This article analyzes the problem of estimating the time until an event occurs, also known as survival modeling. We observe through substantial experiments on large real-world datasets and use-cases that populations are largely heterogeneous. Sub-populations have different mean and variance in their survival rates requiring flexible models that capture heterogeneity. We leverage a classical extension of the logistic function into the survival setting to characterize unobserved heterogeneity using the beta distribution. This yields insights into the geometry of the problem as well as efficient estimation methods for linear, tree and neural network models that adjust the beta distribution based on observed covariates. We also show that the additional information captured by the beta distribution leads to interesting ranking implications as we determine who is most-at-risk. We show theoretically that the ranking is variable as we forecast forward in time and prove that pairwise comparisons of survival remain transitive. Empirical results using large-scale datasets across two use-cases (online conversions and retention modeling) demonstrate the competitiveness of the method. The simplicity of the method and its ability to capture skew in the data make it a viable alternative to standard techniques, particularly when we are interested in the time to event and when the underlying probabilities are heterogeneous.

Submitted 9 May, 2019; originally announced May 2019.

Comments: 11 pages, 9 figures

arXiv:1812.00856 [pdf, other]

Thompson Sampling for Noncompliant Bandits

Authors: Andrew Stirn, Tony Jebara

Abstract: Thompson sampling, a Bayesian method for balancing exploration and exploitation in bandit problems, has theoretical guarantees and exhibits strong empirical performance in many domains. Traditional Thompson sampling, however, assumes perfect compliance, where an agent's chosen action is treated as the implemented action. This article introduces a stochastic noncompliance model that relaxes this assumption. We prove that any noncompliance in a 2-armed Bernoulli bandit increases existing regret bounds. With our noncompliance model, we derive Thompson sampling variants that explicitly handle both observed and latent noncompliance. With extensive empirical analysis, we demonstrate that our algorithms either match or outperform traditional Thompson sampling in both compliant and noncompliant environments.

Submitted 3 December, 2018; originally announced December 2018.

Comments: 21 pages, 5 figures
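
For readers unfamiliar with the baseline that the abstract above builds on, here is a minimal sketch of traditional Thompson sampling for a Bernoulli bandit under perfect compliance. It is the standard Beta-Bernoulli version, not the paper's noncompliant variants; the arm means, the uniform Beta(1, 1) prior, and the function name `thompson_bernoulli` are assumptions made for illustration.

```python
import random


def thompson_bernoulli(true_means, rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return pull counts per arm.

    Assumes perfect compliance: the arm the agent chooses is the arm
    that is actually played.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(rounds):
        # Sample a plausible mean for each arm from its Beta posterior
        # (uniform Beta(1, 1) prior), then act greedily on the samples.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)
        ]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


# Hypothetical 2-armed problem: the second arm is clearly better.
pulls = thompson_bernoulli([0.2, 0.8], rounds=2000)
```

In this compliant setting the posterior update is attributed to the chosen arm. The noncompliance model described in the abstract relaxes exactly that attribution step.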

arXiv:1807.06651 [pdf, other]

doi 10.1145/3270323.327032

Item Recommendation with Variational Autoencoders and Heterogenous Priors

Authors: Giannis Karamanolakis, Kevin Raji Cherian, Ananth Ravi Narayan, Jie Yuan, Da Tang, Tony Jebara

Abstract: In recent years, Variational Autoencoders (VAEs) have been shown to be highly effective in both standard collaborative filtering applications and extensions such as incorporation of implicit feedback. We extend VAEs to collaborative filtering with side information, for instance when ratings are combined with explicit text feedback from the user. Instead of using a user-agnostic standard Gaussian prior, we incorporate user-dependent priors in the latent VAE space to encode users' preferences as functions of the review text. Taking into account both the rating and the text information to represent users in this multimodal latent space is promising to improve recommendation quality. Our proposed model is shown to outperform the existing VAE models for collaborative filtering (up to 29.41% relative improvement in ranking metric) along with other baselines that incorporate both user ratings and text for item recommendation.

Submitted 6 October, 2018; v1 submitted 17 July, 2018; originally announced July 2018.

Comments: Accepted for the 3rd Workshop on Deep Learning for Recommender Systems (DLRS 2018), held in conjunction with the 12th ACM Conference on Recommender Systems (RecSys 2018) in Vancouver, Canada

arXiv:1804.07855 [pdf, other]

Subgoal Discovery for Hierarchical Dialogue Policy Learning

Authors: Da Tang, Xiujun Li, Jianfeng Gao, Chong Wang, Lihong Li, Tony Jebara

Abstract: Developing agents to engage in complex goal-oriented dialogues is challenging partly because the main learning signals are very sparse in long conversations. In this paper, we propose a divide-and-conquer approach that discovers and exploits the hidden structure of the task to enable efficient policy learning. First, given successful example dialogues, we propose the Subgoal Discovery Network (SDN) to divide a complex goal-oriented task into a set of simpler subgoals in an unsupervised fashion. We then use these subgoals to learn a multi-level policy by hierarchical reinforcement learning. We demonstrate our method by building a dialogue agent for the composite task of travel planning. Experiments with simulated and real users show that our approach performs competitively against a state-of-the-art method that requires human-defined subgoals. Moreover, we show that the learned subgoals are often human comprehensible.

Submitted 22 September, 2018; v1 submitted 20 April, 2018; originally announced April 2018.

Comments: 11 pages, 6 figures, EMNLP 2018

arXiv:1804.05454 [pdf, ps, other]

A refinement of Bennett's inequality with applications to portfolio optimization

Authors: Tony Jebara

Abstract: A refinement of Bennett's inequality is introduced which is strictly tighter than the classical bound. The new bound establishes the convergence of the average of independent random variables to its expected value. It also carefully exploits information about the potentially heterogeneous mean, variance, and ceiling of each random variable. The bound is strictly sharper in the homogeneous setting and very often significantly sharper in the heterogeneous setting. The improved convergence rates are obtained by leveraging Lambert's W function. We apply the new bound in a portfolio optimization setting to allocate a budget across investments with heterogeneous returns.

Submitted 15 April, 2018; originally announced April 2018.
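
For context, the classical Bennett bound that the abstract above refers to can be stated as follows. This is the textbook form, written under the usual assumptions (independent, zero-mean variables sharing a common ceiling $b$); the symbols $b$, $\sigma^2$, and $h$ are notation chosen here, and the refined bound from the paper is not reproduced.

```latex
% Classical Bennett inequality (textbook form, not the paper's refinement).
% Assumptions: X_1, ..., X_n independent, E[X_i] = 0, X_i <= b almost surely.
\[
  \Pr\!\left( \sum_{i=1}^{n} X_i \ge t \right)
  \;\le\;
  \exp\!\left( -\frac{\sigma^2}{b^2}\, h\!\left( \frac{b\,t}{\sigma^2} \right) \right),
  \qquad t \ge 0,
\]
\[
  \text{where}\quad
  \sigma^2 = \sum_{i=1}^{n} \mathbb{E}\!\left[ X_i^2 \right]
  \quad\text{and}\quad
  h(u) = (1+u)\log(1+u) - u .
\]
```

Note that this form uses a single ceiling $b$ and only the summed variance, whereas the abstract describes a bound that exploits per-variable means, variances, and ceilings.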

arXiv:1802.05814 [pdf, other]

Variational Autoencoders for Collaborative Filtering

Authors: Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, Tony Jebara

Abstract: We extend variational autoencoders (VAEs) to collaborative filtering for implicit feedback. This non-linear probabilistic model enables us to go beyond the limited modeling capacity of linear factor models which still largely dominate collaborative filtering research. We introduce a generative model with multinomial likelihood and use Bayesian inference for parameter estimation. Despite widespread use in language modeling and economics, the multinomial likelihood receives less attention in the recommender systems literature. We introduce a different regularization parameter for the learning objective, which proves to be crucial for achieving competitive performance. Remarkably, there is an efficient way to tune the parameter using annealing. The resulting model and learning algorithm have information-theoretic connections to maximum entropy discrimination and the information bottleneck principle. Empirically, we show that the proposed approach significantly outperforms several state-of-the-art baselines, including two recently-proposed neural network approaches, on several real-world datasets. We also provide extended experiments comparing the multinomial likelihood with other commonly used likelihood functions in the latent factor collaborative filtering literature and show favorable results. Finally, we identify the pros and cons of employing a principled Bayesian inference approach and characterize settings where it provides the most significant improvements.

Submitted 15 February, 2018; originally announced February 2018.

Comments: 10 pages, 3 figures. WWW 2018
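
The abstract above mentions a multinomial likelihood, a regularization parameter on the learning objective, and annealing. The sketch below shows one way these pieces could fit together for a single user. It assumes the regularization parameter `beta` weights a Gaussian KL term and that annealing is a linear ramp; both are assumptions, since the abstract does not give the formula. All function names are hypothetical, and the decoder is stood in for by a fixed list of item logits.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of item scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def multinomial_log_likelihood(clicks, logits):
    """Sum over items of x_i * log pi_i, with pi = softmax(logits)."""
    return sum(x * lp for x, lp in zip(clicks, log_softmax(logits)))


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def annealed_objective(clicks, logits, mu, log_var, beta):
    """Reconstruction term minus a beta-weighted KL term (assumed form)."""
    return multinomial_log_likelihood(clicks, logits) - beta * gaussian_kl(mu, log_var)


def linear_anneal(step, total_steps, beta_max):
    """Assumed schedule: ramp beta linearly from 0 up to beta_max, then hold."""
    return beta_max * min(1.0, step / total_steps)


# Toy user who clicked the first of two items; decoder logits are made up.
beta = linear_anneal(step=5, total_steps=10, beta_max=0.2)
value = annealed_objective([1, 0], [0.0, 0.0], mu=[0.0, 0.0], log_var=[0.0, 0.0], beta=beta)
```

With a posterior equal to the prior the KL term vanishes, so `value` here reduces to the multinomial log-likelihood of the toy click vector.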

arXiv:1611.00838 [pdf, other]

Initialization and Coordinate Optimization for Multi-way Matching

Authors: Da Tang, Tony Jebara

Abstract: We consider the problem of consistently matching multiple sets of elements to each other, which is a common task in fields such as computer vision. To solve the underlying NP-hard objective, existing methods often relax or approximate it, but end up with unsatisfying empirical performance due to a misaligned objective. We propose a coordinate update algorithm that directly optimizes the target objective. By using pairwise alignment information to build an undirected graph and initializing the permutation matrices along the edges of its Maximum Spanning Tree, our algorithm successfully avoids bad local optima. Theoretically, with high probability our algorithm guarantees an optimal solution under reasonable noise assumptions. Empirically, our algorithm consistently and significantly outperforms existing methods on several benchmark tasks on real datasets.

Submitted 18 July, 2019; v1 submitted 2 November, 2016; originally announced November 2016.

Comments: Artificial Intelligence and Statistics (AISTATS), 2017

arXiv:1610.07797 [pdf, other]

Frank-Wolfe Algorithms for Saddle Point Problems

Authors: Gauthier Gidel, Tony Jebara, Simon Lacoste-Julien

Abstract: We extend the Frank-Wolfe (FW) optimization algorithm to solve constrained smooth convex-concave saddle point (SP) problems. Remarkably, the method only requires access to linear minimization oracles. Leveraging recent advances in FW optimization, we provide the first proof of convergence of a FW-type saddle point solver over polytopes, thereby partially answering a 30-year-old conjecture. We also survey other convergence results and highlight gaps in the theoretical underpinnings of FW-style algorithms. Motivating applications without known efficient alternatives are explored through structured prediction with combinatorial penalties as well as games over matching polytopes involving an exponential number of constraints.

Submitted 3 March, 2017; v1 submitted 25 October, 2016; originally announced October 2016.

Comments: Appears in: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017). 39 pages

MSC Class: 90C52; 90C90; 68T05 ACM Class: G.1.6; I.2.6
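
To illustrate what "only requires access to linear minimization oracles" means in the Frank-Wolfe abstract above, here is a minimal sketch of the standard FW method for plain convex minimization over the probability simplex. It is not the saddle-point extension from the paper; the quadratic objective, the step size 2/(t+2), and the function name `frank_wolfe_simplex` are choices made for illustration.

```python
def frank_wolfe_simplex(grad, dim, iters=2000):
    """Minimize a smooth convex function over the probability simplex.

    `grad` maps a point (list of floats) to its gradient. The only access
    to the feasible set is the linear minimization oracle: over the
    simplex, a linear function is minimized at the vertex whose gradient
    coordinate is smallest.
    """
    x = [1.0 / dim] * dim  # start at the barycenter
    for t in range(iters):
        g = grad(x)
        s = min(range(dim), key=lambda i: g[i])  # oracle returns vertex e_s
        gamma = 2.0 / (t + 2.0)  # standard open-loop step size
        # Convex combination x <- (1 - gamma) * x + gamma * e_s stays feasible.
        x = [(1.0 - gamma) * xi for xi in x]
        x[s] += gamma
    return x


# Toy objective f(x) = ||x - target||^2 with the target inside the simplex.
target = [0.2, 0.3, 0.5]


def grad(x):
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]


x_star = frank_wolfe_simplex(grad, dim=3)
```

No projection step appears anywhere: every iterate is a convex combination of vertices, which is why FW-style methods are attractive when projections are expensive but linear minimization is cheap.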

arXiv:1511.05212 [pdf, other]

Binary embeddings with structured hashed projections

Authors: Anna Choromanska, Krzysztof Choromanski, Mariusz Bojarski, Tony Jebara, Sanjiv Kumar, Yann LeCun

Abstract: We consider the hashing mechanism for constructing binary embeddings, that involves pseudo-random projections followed by nonlinear (sign function) mappings. The pseudo-random projection is described by a matrix, where not all entries are independent random variables but instead a fixed "budget of randomness" is distributed across the matrix. Such matrices can be efficiently stored in sub-quadratic or even linear space, provide reduction in randomness usage (i.e. number of required random values), and very often lead to computational speed ups. We prove several theoretical results showing that projections via various structured matrices followed by nonlinear mappings accurately preserve the angular distance between input high-dimensional vectors. To the best of our knowledge, these results are the first that give theoretical ground for the use of general structured matrices in the nonlinear setting. In particular, they generalize previous extensions of the Johnson-Lindenstrauss lemma and prove the plausibility of the approach that was so far only heuristically confirmed for some special structured matrices. Consequently, we show that many structured matrices can be used as an efficient information compression mechanism. Our findings build a better understanding of certain deep architectures, which contain randomly weighted and untrained layers, and yet achieve high performance on different learning tasks. We empirically verify our theoretical findings and show the dependence of learning via structured hashed projections on the performance of neural network as well as nearest neighbor classifier.

Submitted 1 July, 2016; v1 submitted 16 November, 2015; originally announced November 2015.

Comments: arXiv admin note: text overlap with arXiv:1505.03190
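
The hashing mechanism in the abstract above (a pseudo-random projection followed by a sign function) can be sketched as below. A circulant matrix is used as one example of a structured matrix with a limited "budget of randomness" (n random values for an n-by-n matrix). That choice, the dimension, and the function names are assumptions for illustration, and the sketch makes no claim about the paper's guarantees.

```python
import random


def circulant_sign_hash(x, first_row):
    """Binary embedding: sign of a circulant projection of x.

    Row i of the circulant matrix is `first_row` cyclically shifted by i,
    so the whole n-by-n matrix is defined by only n random values.
    """
    n = len(x)
    bits = []
    for i in range(n):
        proj = sum(first_row[(j - i) % n] * x[j] for j in range(n))
        bits.append(1 if proj >= 0 else 0)
    return bits


def hamming_fraction(a, b):
    """Fraction of positions where two bit strings differ."""
    return sum(ai != bi for ai, bi in zip(a, b)) / len(a)


rng = random.Random(0)
n = 64
first_row = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
neg_x = [-v for v in x]

same = hamming_fraction(circulant_sign_hash(x, first_row), circulant_sign_hash(x, first_row))
opposite = hamming_fraction(circulant_sign_hash(x, first_row), circulant_sign_hash(neg_x, first_row))
```

For fully random Gaussian projections, the expected fraction of differing bits between two vectors equals their angle divided by pi; the two extreme cases computed here (angle 0 and angle pi) are the ones that can be checked without any probabilistic argument.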

arXiv:1504.01119 [pdf, ps, other]

Coloring tournaments with forbidden substructures

Authors: Krzysztof Choromanski, Tony Jebara

Abstract: Coloring graphs is an important algorithmic problem in combinatorics with many applications in computer science. In this paper we study coloring tournaments. The chromatic number of a random tournament is of order $Ω(\frac{n}{\log(n)})$. The question arises whether the chromatic number can be proven to be smaller for more structured nontrivial classes of tournaments. We analyze the class of tournaments defined by a forbidden subtournament $H$. This paper gives a first quasi-polynomial algorithm running in time $e^{O(\log(n)^{2})}$ that constructs colorings of $H$-free tournaments using only $O(n^{1-ε(H)}\log(n))$ colors, where $ε(H) \geq 2^{-2^{50|H|^{2}+1}}$ for many forbidden tournaments $H$. To the best of our knowledge all previously known related results required at least sub-exponential time and relied on the regularity lemma. Since we do not use the regularity lemma, we obtain the first known lower bounds on $ε(H)$ that can be given by a closed-form expression. As a corollary, we give a constructive proof of the celebrated open Erdős-Hajnal conjecture with explicitly given lower bounds on the EH coefficients for all classes of prime tournaments for which the conjecture is known. Such a constructive proof was not known before. Thus we significantly reduce the gap between best lower and upper bounds on the EH coefficients from the conjecture for all known prime tournaments that satisfy it. We also briefly explain how our methods may be used for coloring $H$-free tournaments under the following conditions: $H$ is any tournament with $\leq 5$ vertices or: $H$ is any but one tournament of six vertices.

Submitted 5 April, 2015; originally announced April 2015.

arXiv:1503.01228 [pdf, other]

Bethe Learning of Conditional Random Fields via MAP Decoding

Authors: Kui Tang, Nicholas Ruozzi, David Belanger, Tony Jebara

Abstract: Many machine learning tasks can be formulated in terms of predicting structured outputs. In frameworks such as the structured support vector machine (SVM-Struct) and the structured perceptron, discriminative functions are learned by iteratively applying efficient maximum a posteriori (MAP) decoding. However, maximum likelihood estimation (MLE) of probabilistic models over these same structured spaces requires computing partition functions, which is generally intractable. This paper presents a method for learning discrete exponential family models using the Bethe approximation to the MLE. Remarkably, this problem also reduces to iterative (MAP) decoding. This connection emerges by combining the Bethe approximation with a Frank-Wolfe (FW) algorithm on a convex dual objective which circumvents the intractable partition function. The result is a new single loop algorithm MLE-Struct, which is substantially more efficient than previous double-loop methods for approximate maximum likelihood estimation. Our algorithm outperforms existing methods in experiments involving image segmentation, matching problems from vision, and a new dataset of university roommate assignments.

Submitted 4 March, 2015; originally announced March 2015.

Comments: 19 pages (9 supplementary), 10 figures (3 supplementary)

arXiv:1402.5902 [pdf, ps, other]

On Learning from Label Proportions

Authors: Felix X. Yu, Krzysztof Choromanski, Sanjiv Kumar, Tony Jebara, Shih-Fu Chang

Abstract: Learning from Label Proportions (LLP) is a learning setting, where the training data is provided in groups, or "bags", and only the proportion of each class in each bag is known. The task is to learn a model to predict the class labels of the individual instances. LLP has broad applications in political science, marketing, healthcare, and computer vision. This work answers the fundamental question, when and why LLP is possible, by introducing a general framework, Empirical Proportion Risk Minimization (EPRM). EPRM learns an instance label classifier to match the given label proportions on the training data. Our result is based on a two-step analysis. First, we provide a VC bound on the generalization error of the bag proportions. We show that the bag sample complexity is only mildly sensitive to the bag size. Second, we show that under some mild assumptions, good bag proportion prediction guarantees good instance label prediction. The results together provide a formal guarantee that the individual labels can indeed be learned in the LLP setting. We discuss applications of the analysis, including justification of LLP algorithms, learning with population proportions, and a paradigm for learning algorithms with privacy guarantees. We also demonstrate the feasibility of LLP based on a case study in real-world setting: predicting income based on census data.

Submitted 11 February, 2015; v1 submitted 24 February, 2014; originally announced February 2014.

arXiv:1401.0044 [pdf, ps, other]

Approximating the Bethe partition function

Authors: Adrian Weller, Tony Jebara

Abstract: When belief propagation (BP) converges, it does so to a stationary point of the Bethe free energy $F$, and is often strikingly accurate. However, it may converge only to a local optimum or may not converge at all. An algorithm was recently introduced for attractive binary pairwise MRFs which is guaranteed to return an $ε$-approximation to the global minimum of $F$ in polynomial time provided the maximum degree $Δ=O(\log n)$, where $n$ is the number of variables. Here we significantly improve this algorithm and derive several results including a new approach based on analyzing first derivatives of $F$, which leads to performance that is typically far superior and yields a fully polynomial-time approximation scheme (FPTAS) for attractive models without any degree restriction. Further, the method applies to general (non-attractive) models, though with no polynomial time guarantee in this case, leading to the important result that approximating $\log$ of the Bethe partition function, $\log Z_B=-\min F$, for a general model to additive $ε$-accuracy may be reduced to a discrete MAP inference problem. We explore an application to predicting equipment failure on an urban power network and demonstrate that the Bethe approximation can perform well even when BP fails to converge.

Submitted 30 December, 2013; originally announced January 2014.

Report number: cucs-031-13

arXiv:1309.6872 [pdf]

On MAP Inference by MWSS on Perfect Graphs

Authors: Adrian Weller, Tony S. Jebara

Abstract: Finding the most likely (MAP) configuration of a Markov random field (MRF) is NP-hard in general. A promising, recent technique is to reduce the problem to finding a maximum weight stable set (MWSS) on a derived weighted graph, which if perfect, allows inference in polynomial time. We derive new results for this approach, including a general decomposition theorem for MRFs of any order and number of labels, extensions of results for binary pairwise models with submodular cost functions to higher order, and an exact characterization of which binary pairwise MRFs can be efficiently solved with this method. This defines the power of the approach on this class of models, improves our toolbox and expands the range of tractable models.

Submitted 26 September, 2013; originally announced September 2013.

Comments: Appears in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI2013)

Report number: UAI-P-2013-PG-684-693

arXiv:1309.5605 [pdf, ps, other]

Stochastic Bound Majorization

Authors: Anna Choromanska, Tony Jebara

Abstract: Recently a majorization method for optimizing partition functions of log-linear models was proposed alongside a novel quadratic variational upper-bound. In the batch setting, it outperformed state-of-the-art first- and second-order optimization methods on various learning tasks. We propose a stochastic version of this bound majorization method as well as a low-rank modification for high-dimensional datasets. The resulting stochastic second-order method outperforms stochastic gradient descent (across variations and various tunings) both in terms of the number of iterations and computation time until convergence while finding a better quality parameter setting. The proposed method bridges first- and second-order stochastic optimization methods by maintaining a computational complexity that is linear in the data dimension while exploiting second-order information about the pseudo-global curvature of the objective function (as opposed to the local curvature in the Hessian).

Submitted 22 September, 2013; originally announced September 2013.

arXiv:1309.1369 [pdf, other]

Semistochastic Quadratic Bound Methods

Authors: Aleksandr Y. Aravkin, Anna Choromanska, Tony Jebara, Dimitri Kanevsky

Abstract: Partition functions arise in a variety of settings, including conditional random fields, logistic regression, and latent Gaussian models. In this paper, we consider semistochastic quadratic bound (SQB) methods for maximum likelihood inference based on partition function optimization. Batch methods based on the quadratic bound were recently proposed for this class of problems, and performed favorably in comparison to state-of-the-art techniques. Semistochastic methods fall in between batch algorithms, which use all the data, and stochastic gradient type methods, which use small random selections at each iteration. We build semistochastic quadratic bound-based methods, and prove both global convergence (to a stationary point) under very weak assumptions, and linear convergence rate under stronger assumptions on the objective. To make the proposed methods faster and more stable, we consider inexact subproblem minimization and batch-size selection schemes. The efficacy of SQB methods is demonstrated via comparison with several state-of-the-art techniques on commonly used datasets.

Submitted 17 February, 2014; v1 submitted 5 September, 2013; originally announced September 2013.

Comments: 11 pages, 1 figure

MSC Class: 90C55; 90C15; 62H30

arXiv:1306.0886 [pdf, other]

$\propto$SVM for learning with label proportions

Authors: Felix X. Yu, Dong Liu, Sanjiv Kumar, Tony Jebara, Shih-Fu Chang

Abstract: We study the problem of learning with label proportions in which the training data is provided in groups and only the proportion of each class in each group is known. We propose a new method called proportion-SVM, or $\propto$SVM, which explicitly models the latent unknown instance labels together with the known group label proportions in a large-margin framework. Unlike the existing works, our approach avoids making restrictive assumptions about the data. The $\propto$SVM model leads to a non-convex integer programming problem. In order to solve it efficiently, we propose two algorithms: one based on simple alternating optimization and the other based on a convex relaxation. Extensive experiments on standard datasets show that $\propto$SVM outperforms the state-of-the-art, especially for larger group sizes.

Submitted 4 June, 2013; originally announced June 2013.

Comments: Appears in Proceedings of the 30th International Conference on Machine Learning (ICML 2013)

arXiv:1301.3865 [pdf]

Feature Selection and Dualities in Maximum Entropy Discrimination

Authors: Tony S. Jebara, Tommi S. Jaakkola

Abstract: Incorporating feature selection into a classification or regression method often carries a number of advantages. In this paper we formalize feature selection specifically from a discriminative perspective of improving classification/regression accuracy. The feature selection method is developed as an extension to the recently proposed maximum entropy discrimination (MED) framework. We describe MED as a flexible (Bayesian) regularization approach that subsumes, e.g., support vector classification, regression and exponential family models. For brevity, we restrict ourselves primarily to feature selection in the context of linear classification/regression methods and demonstrate that the proposed approach indeed carries substantial improvements in practice. Moreover, we discuss and develop various extensions of feature selection, including the problem of dealing with example specific but unobserved degrees of freedom -- alignments or invariants.

Submitted 16 January, 2013; originally announced January 2013.

Comments: Appears in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI2000)

Report number: UAI-P-2000-PG-291-300

arXiv:1301.0015 [pdf, ps, other]

Bethe Bounds and Approximating the Global Optimum

Authors: Adrian Weller, Tony Jebara

Abstract: Inference in general Markov random fields (MRFs) is NP-hard, though identifying the maximum a posteriori (MAP) configuration of pairwise MRFs with submodular cost functions is efficiently solvable using graph cuts. Marginal inference, however, even for this restricted class, is in #P. We prove new formulations of derivatives of the Bethe free energy, provide bounds on the derivatives and bracket the locations of stationary points, introducing a new technique called Bethe bound propagation. Several results apply to pairwise models whether associative or not. Applying these to discretized pseudo-marginals in the associative case we present a polynomial time approximation scheme for global optimization provided the maximum degree is $O(\log n)$, and discuss several extensions.

Submitted 31 December, 2012; originally announced January 2013.

arXiv:1207.4148 [pdf]

Dynamical Systems Trees

Authors: Andrew Howard, Tony S. Jebara

Abstract: We propose dynamical systems trees (DSTs) as a flexible class of models for describing multiple processes that interact via a hierarchy of aggregating parent chains. DSTs extend Kalman filters, hidden Markov models and nonlinear dynamical systems to an interactive group scenario. Various individual processes interact as communities and sub-communities in a tree structure that is unrolled in time. To accommodate nonlinear temporal activity, each individual leaf process is modeled as a dynamical system containing discrete and/or continuous hidden states with discrete and/or Gaussian emissions. Subsequent higher level parent processes act like hidden Markov models and mediate the interaction between leaf processes or between other parent processes in the hierarchy. Aggregator chains are parents of child processes that they combine and mediate, yielding a compact overall parameterization. We provide tractable inference and learning algorithms for arbitrary DST topologies via an efficient structured mean-field algorithm. The diverse applicability of DSTs is demonstrated by experiments on gene expression data and by modeling group behavior in the setting of an American football game.

Submitted 11 July, 2012; originally announced July 2012.

Comments: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

Report number: UAI-P-2004-PG-260-267

arXiv:1206.3269 [pdf]

Bayesian Out-Trees

Authors: Tony S. Jebara

Abstract: A Bayesian treatment of latent directed graph structure for non-iid data is provided where each child datum is sampled with a directed conditional dependence on a single unknown parent datum. The latent graph structure is assumed to lie in the family of directed out-tree graphs which leads to efficient Bayesian inference. The latent likelihood of the data and its gradients are computable in closed form via Tutte's directed matrix tree theorem using determinants and inverses of the out-Laplacian. This novel likelihood subsumes iid likelihood, is exchangeable and yields efficient unsupervised and semi-supervised learning algorithms. In addition to handling taxonomy and phylogenetic datasets, the out-tree assumption performs surprisingly well as a semi-parametric density estimator on standard iid datasets. Experiments with unsupervised and semi-supervised learning are shown on various UCI and taxonomy datasets.

Submitted 13 June, 2012; originally announced June 2012.

Comments: Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI2008)

Report number: UAI-P-2008-PG-315-324

arXiv:1205.2639 [pdf]

MAP Estimation, Message Passing, and Perfect Graphs

Authors: Tony S. Jebara

Abstract: Efficiently finding the maximum a posteriori (MAP) configuration of a graphical model is an important problem which is often implemented using message passing algorithms. The optimality of such algorithms is only well established for singly-connected graphs and other limited settings. This article extends the set of graphs where MAP estimation is in P and where message passing recovers the exact solution to so-called perfect graphs. This result leverages recent progress in defining perfect graphs (the strong perfect graph theorem), linear programming relaxations of MAP estimation and recent convergent message passing schemes. The article converts graphical models into nand Markov random fields which are straightforward to relax into linear programs. Therein, integrality can be established in general by testing for graph perfection. This perfection test is performed efficiently using a polynomial time algorithm. Alternatively, known decomposition tools from perfect graph theory may be used to prove perfection for certain families of graphs. Thus, a general graph framework is provided for determining when MAP estimation in any graphical model is in P, has an integral linear programming relaxation and is exactly recoverable by message passing.

Submitted 9 May, 2012; originally announced May 2012.

Comments: Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009)

Report number: UAI-P-2009-PG-258-267

arXiv:0908.1769 [pdf, ps, other]

Approximating the Permanent with Belief Propagation

Authors: Bert Huang, Tony Jebara

Abstract: This work describes a method of approximating matrix permanents efficiently using belief propagation. We formulate a probability distribution whose partition function is exactly the permanent, then use Bethe free energy to approximate this partition function. After deriving some speedups to standard belief propagation, the resulting algorithm requires $O(n^2)$ time per iteration. Finally, we demonstrate the advantages of using this approximation.

Submitted 12 August, 2009; originally announced August 2009.

Showing 1–28 of 28 results for author: Jebara, T