-
Low-Distortion Clustering in Bounded Growth Graphs
Authors:
Yi-Jun Chang,
Varsha Dani,
Thomas P. Hayes
Abstract:
The well-known clustering algorithm of Miller, Peng, and Xu (SPAA 2013) is useful for many applications, including low-diameter decomposition and low-energy distributed algorithms. One nice property of their clustering, shown in previous work by Chang, Dani, Hayes, and Pettie (PODC 2020), is that distances in the cluster graph are rescaled versions of distances in the original graph, up to an $O(\log n)$ distortion factor and rounding issues. Minimizing this distortion factor is important for efficiency in computing the clustering, as well as in other applications.
We prove that there exist graphs for which an $Ω((\log n)^{1/3})$ distortion factor is necessary for any clustering. We also consider a class of nice graphs which we call uniformly bounded independence graphs. These include, for example, paths, lattice graphs, and "dense" unit disk graphs. For these graphs, we prove that clusterings of distortion $O(1)$ always exist, and moreover, we give new efficient distributed algorithms to construct them. This clustering is based on Voronoi cells centered at the vertices of a maximal independent set in a suitable power graph.
Applications include low-energy simulation of distributed algorithms in the LOCAL, CONGEST, and RADIO-CONGEST models and efficient approximate solutions to distributed combinatorial optimization problems. We also investigate related lower bounds.
Submitted 8 May, 2024;
originally announced May 2024.
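The clustering idea named in the abstract above (Voronoi cells around a maximal independent set of a power graph) can be illustrated with a small centralized sketch. This is not the authors' distributed algorithm: the greedy sequential MIS, the function names, and the tie-breaking rule are all assumptions made for illustration only.

```python
from collections import deque


def bfs_distances(adj, source, limit=None):
    """Hop distances from `source`; vertices farther than `limit` are not explored."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if limit is not None and dist[u] == limit:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def greedy_mis_in_power_graph(adj, k):
    """Sequential greedy MIS of G^k (assumed stand-in for a distributed MIS).

    Chosen centers are pairwise more than k hops apart, and every vertex
    lies within k hops of some center.
    """
    centers, blocked = [], set()
    for v in sorted(adj):
        if v in blocked:
            continue
        centers.append(v)
        blocked.update(bfs_distances(adj, v, limit=k))
    return centers


def voronoi_clusters(adj, centers):
    """Assign each vertex to a nearest center with one multi-source BFS.

    Ties are broken by BFS discovery order, an arbitrary choice.
    """
    owner = {c: c for c in centers}
    queue = deque(centers)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in owner:
                owner[v] = owner[u]
                queue.append(v)
    return owner


# Usage on a 10-vertex path with k = 2.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
centers = greedy_mis_in_power_graph(path, 2)
clusters = voronoi_clusters(path, centers)
print(centers)  # [0, 3, 6, 9]
```

On a path, the cells are contiguous intervals of bounded length, which matches the intuition that such graphs admit well-behaved clusterings.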
-
Mafin: Enhancing Black-Box Embeddings with Model Augmented Fine-Tuning
Authors:
Mingtian Zhang,
Shawn Lan,
Peter Hayes,
David Barber
Abstract:
Retrieval Augmented Generation (RAG) has emerged as an effective solution for mitigating hallucinations in Large Language Models (LLMs). The retrieval stage in RAG typically involves a pre-trained embedding model, which converts queries and passages into vectors to capture their semantics. However, a standard pre-trained embedding model may exhibit sub-optimal performance when applied to specific domain knowledge, necessitating fine-tuning. This paper addresses scenarios where the embeddings are only available from a black-box model. We introduce Model augmented fine-tuning (Mafin) -- a novel approach for fine-tuning a black-box embedding model by augmenting it with a trainable embedding model. Our results demonstrate that Mafin significantly enhances the performance of the black-box embeddings by only requiring the training of a small augmented model. We validate the effectiveness of our method on both labeled and unlabeled datasets, illustrating its broad applicability and efficiency.
Submitted 12 March, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
-
Active Preference Learning for Large Language Models
Authors:
William Muldrew,
Peter Hayes,
Mingtian Zhang,
David Barber
Abstract:
As large language models (LLMs) become more capable, fine-tuning techniques for aligning with human intent are increasingly important. A key consideration for aligning these models is how to most effectively use human resources, or model resources in the case where LLMs themselves are used as oracles. Reinforcement learning from Human or AI preferences (RLHF/RLAIF) is the most prominent example of such a technique, but is complex and often unstable. Direct Preference Optimization (DPO) has recently been proposed as a simpler and more stable alternative. In this work, we develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by DPO. We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.
Submitted 28 June, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
Optimal Mixing via Tensorization for Random Independent Sets on Arbitrary Trees
Authors:
Charilaos Efthymiou,
Thomas P. Hayes,
Daniel Stefankovic,
Eric Vigoda
Abstract:
We study the mixing time of the single-site update Markov chain, known as the Glauber dynamics, for generating a random independent set of a tree. Our focus is obtaining optimal convergence results for arbitrary trees. We consider the more general problem of sampling from the Gibbs distribution in the hard-core model where independent sets are weighted by a parameter $λ>0$; the special case $λ=1$ corresponds to the uniform distribution over all independent sets. Previous work of Martinelli, Sinclair and Weitz (2004) obtained optimal mixing time bounds for the complete $Δ$-regular tree for all $λ$. However, Restrepo et al. (2014) showed that for sufficiently large $λ$ there are bounded-degree trees where optimal mixing does not hold. Recent work of Eppstein and Frishberg (2022) proved a polynomial mixing time bound for the Glauber dynamics for arbitrary trees, and more generally for graphs of bounded tree-width.
We establish an optimal bound on the relaxation time (i.e., inverse spectral gap) of $O(n)$ for the Glauber dynamics for unweighted independent sets on arbitrary trees. We stress that our results hold for arbitrary trees and there is no dependence on the maximum degree $Δ$. Interestingly, our results extend (far) beyond the uniqueness threshold which is on the order $λ=O(1/Δ)$. Our proof approach is inspired by recent work on spectral independence. In fact, we prove that spectral independence holds with a constant independent of the maximum degree for any tree, but this does not imply mixing for general trees as the optimal mixing results of Chen, Liu, and Vigoda (2021) only apply for bounded degree graphs. We instead utilize the combinatorial nature of independent sets to directly prove approximate tensorization of variance via a non-trivial inductive proof.
Submitted 18 February, 2024; v1 submitted 15 July, 2023;
originally announced July 2023.
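For readers unfamiliar with the chain studied in the abstract above, the following is a minimal sketch of the single-site Glauber dynamics for the hard-core model with fugacity $λ$. The function names and the adjacency-dict representation are assumptions; the sketch only simulates the chain and says nothing about its mixing time.

```python
import random


def glauber_step(adj, vertices, occupied, lam, rng):
    """One single-site update of the hard-core Glauber dynamics.

    Pick a uniformly random vertex v. With probability lam / (1 + lam),
    try to occupy v (allowed only if no neighbor is occupied);
    otherwise make v unoccupied.
    """
    v = rng.choice(vertices)
    if rng.random() < lam / (1.0 + lam):
        if not any(u in occupied for u in adj[v]):
            occupied.add(v)
    else:
        occupied.discard(v)


def run_glauber(adj, lam, steps, seed=0):
    """Run the chain from the empty independent set for `steps` updates."""
    rng = random.Random(seed)
    vertices = sorted(adj)
    occupied = set()
    for _ in range(steps):
        glauber_step(adj, vertices, occupied, lam, rng)
    return occupied


# Usage: a star on 4 vertices (a small tree), lam = 1 is the uniform case.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
sample = run_glauber(star, lam=1.0, steps=1000, seed=42)
print(sorted(sample))
```

Every state visited is an independent set, because a vertex is only added when none of its neighbors is occupied.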
-
Towards Healing the Blindness of Score Matching
Authors:
Mingtian Zhang,
Oscar Key,
Peter Hayes,
David Barber,
Brooks Paige,
François-Xavier Briol
Abstract:
Score-based divergences have been widely used in machine learning and statistics applications. Despite their empirical success, a blindness problem has been observed when using these for multi-modal distributions. In this work, we discuss the blindness problem and propose a new family of divergences that can mitigate the blindness problem. We illustrate our proposed divergence in the context of density estimation and report improved performance compared to traditional approaches.
Submitted 15 October, 2022; v1 submitted 15 September, 2022;
originally announced September 2022.
-
Integrated Weak Learning
Authors:
Peter Hayes,
Mingtian Zhang,
Raza Habib,
Jordan Burgess,
Emine Yilmaz,
David Barber
Abstract:
We introduce Integrated Weak Learning, a principled framework that integrates weak supervision into the training process of machine learning models. Our approach jointly trains the end-model and a label model that aggregates multiple sources of weak supervision. We introduce a label model that can learn to aggregate weak supervision sources differently for different datapoints and takes into consideration the performance of the end-model during training. We show that our approach outperforms existing weak learning techniques across a set of 6 benchmark classification datasets. When both a small amount of labeled data and weak supervision are present the increase in performance is both consistent and large, reliably getting a 2-5 point test F1 score gain over non-integrated methods.
Submitted 19 June, 2022;
originally announced June 2022.
-
How to Wake Up Your Neighbors: Safe and Nearly Optimal Generic Energy Conservation in Radio Networks
Authors:
Varsha Dani,
Thomas P. Hayes
Abstract:
Recent work has shown that it is sometimes feasible to significantly reduce the energy usage of some radio-network algorithms by adaptively powering down the radio receiver when it is not needed. Although past work has focused on modifying specific network algorithms in this way, we now ask the question of whether this problem can be solved in a generic way, treating the algorithm as a kind of black box.
We are able to answer this question in the affirmative, presenting a new general way to modify arbitrary radio-network algorithms in an attempt to save energy. At the expense of a small increase in the time complexity, we can reduce the energy usage to an extent that is provably nearly optimal within a certain class of general-purpose algorithms.
As an application, we show that our algorithm reduces the energy cost of breadth-first search in radio networks from the previous best bound of $2^{O(\sqrt{\log n})}$ to $\mathrm{polylog}(n)$, where $n$ is the number of nodes in the network.
A key ingredient in our algorithm is hierarchical clustering based on additive Voronoi decomposition done at multiple scales. Similar clustering algorithms have been used in other recent work on energy-aware computation in radio networks, but we believe the specific approach presented here may be of independent interest.
Submitted 25 May, 2022;
originally announced May 2022.
-
Generalization Gap in Amortized Inference
Authors:
Mingtian Zhang,
Peter Hayes,
David Barber
Abstract:
The ability of likelihood-based probabilistic models to generalize to unseen data is central to many machine learning applications such as lossless compression. In this work, we study the generalization of a popular class of probabilistic model - the Variational Auto-Encoder (VAE). We discuss the two generalization gaps that affect VAEs and show that overfitting is usually dominated by amortized inference. Based on this observation, we propose a new training objective that improves the generalization of amortized inference. We demonstrate how our method can improve performance in the context of image modeling and lossless compression.
Submitted 15 October, 2022; v1 submitted 23 May, 2022;
originally announced May 2022.
-
Sample Efficient Model Evaluation
Authors:
Emine Yilmaz,
Peter Hayes,
Raza Habib,
Jordan Burgess,
David Barber
Abstract:
Labelling data is a major practical bottleneck in training and testing classifiers. Given a collection of unlabelled data points, we address how to select which subset to label to best estimate test metrics such as accuracy, $F_1$ score or micro/macro $F_1$. We consider two sampling based approaches, namely the well-known Importance Sampling and we introduce a novel application of Poisson Sampling. For both approaches we derive the minimal error sampling distributions and how to approximate and use them to form estimators and confidence intervals. We show that Poisson Sampling outperforms Importance Sampling both theoretically and experimentally.
Submitted 24 September, 2021;
originally announced September 2021.
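To make the Poisson Sampling idea in the abstract above concrete, here is a minimal sketch of estimating accuracy from a labelled subset. It assumes the standard Horvitz-Thompson style estimator (each sampled item weighted by the inverse of its inclusion probability); the abstract does not say which estimator the paper uses, and the function names are hypothetical. The paper's minimal-error sampling distributions are not reproduced here.

```python
import random


def poisson_sample(inclusion_probs, rng):
    """Include item i independently with probability inclusion_probs[i]."""
    return [i for i, p in enumerate(inclusion_probs) if rng.random() < p]


def estimate_accuracy(sampled, correct, inclusion_probs):
    """Inverse-probability-weighted accuracy estimate (assumed estimator).

    `correct[i]` is 1 if the classifier is right on item i, else 0; it only
    needs to be known (i.e. labelled) for the sampled items.
    """
    n = len(inclusion_probs)
    return sum(correct[i] / inclusion_probs[i] for i in sampled) / n


# Usage: label roughly half of 1000 items and estimate accuracy.
rng = random.Random(0)
truth = [1 if rng.random() < 0.8 else 0 for _ in range(1000)]
probs = [0.5] * len(truth)
chosen = poisson_sample(probs, rng)
print(estimate_accuracy(chosen, truth, probs))
```

Unlike fixed-size sampling, the number of labelled items is itself random under Poisson sampling; the inverse-probability weights keep the estimate unbiased.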
-
CeMux: Maximizing the Accuracy of Stochastic Mux Adders and an Application to Filter Design
Authors:
Timothy J. Baker,
John P. Hayes
Abstract:
Stochastic computing (SC) is a low-cost computational paradigm that has promising applications in digital filter design, image processing and neural networks. Fundamental to these applications is the weighted addition operation which is most often implemented by a multiplexer (mux) tree. Mux-based adders have very low area but typically require long bit-streams to reach practical accuracy thresholds when the number of summands is large. In this work, we first identify the main contributors to mux adder error. We then demonstrate with analysis and experiment that two new techniques, precise sampling and full correlation, can target and mitigate these error sources. Implementing these techniques in hardware leads to the design of CeMux (Correlation-enhanced Multiplexer), a stochastic mux adder that is significantly more accurate and uses much less area than traditional weighted adders. We compare CeMux to other SC and hybrid designs for an electrocardiogram filtering case study that employs a large digital filter. One major result is that CeMux is shown to be accurate even for large input sizes. CeMux's higher accuracy leads to a latency reduction of 4x to 16x over other designs. Further, CeMux uses about 35% less area than existing designs, and we demonstrate that a small amount of accuracy can be traded for a further 50% reduction in area. Finally, we compare CeMux to a conventional binary design and we show that CeMux can achieve a 50 to 73% area reduction for similar power and latency as the conventional design, but at a slightly higher level of error.
Submitted 30 August, 2021; v1 submitted 27 August, 2021;
originally announced August 2021.
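The abstract above builds on the basic mux adder, which can be simulated in a few lines. The sketch below is the plain two-input baseline with independent random bit-streams, not CeMux itself: it uses neither precise sampling nor full correlation, and all names are assumptions for illustration.

```python
import random


def to_bitstream(value, length, rng):
    """Unipolar encoding: each bit is 1 with probability `value` in [0, 1]."""
    return [1 if rng.random() < value else 0 for _ in range(length)]


def mux_add(stream_a, stream_b, select):
    """Two-input mux: output bit comes from b when select is 1, else from a."""
    return [b if s else a for a, b, s in zip(stream_a, stream_b, select)]


def stream_value(stream):
    """Decode a bit-stream as the fraction of ones."""
    return sum(stream) / len(stream)


# Usage: with a fair select stream the output encodes (a + b) / 2.
rng = random.Random(0)
n_bits = 4096
a = to_bitstream(0.2, n_bits, rng)
b = to_bitstream(0.6, n_bits, rng)
sel = to_bitstream(0.5, n_bits, rng)
print(stream_value(mux_add(a, b, sel)))  # close to 0.4, up to sampling noise
```

The sampling noise in this baseline shrinks only with longer bit-streams, which is the accuracy limitation the abstract says CeMux targets.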
-
Reconstruction of Random Geometric Graphs: Breaking the Omega(r) distortion barrier
Authors:
Varsha Dani,
Josep Díaz,
Thomas P. Hayes,
Cristopher Moore
Abstract:
Embedding graphs in a geographical or latent space, i.e., inferring locations for vertices in Euclidean space or on a smooth manifold or submanifold, is a common task in network analysis, statistical inference, and graph visualization. We consider the classic model of random geometric graphs where $n$ points are scattered uniformly in a square of area $n$, and two points have an edge between them if and only if their Euclidean distance is less than $r$. The reconstruction problem then consists of inferring the vertex positions, up to the symmetries of the square, given only the adjacency matrix of the resulting graph. We give an algorithm that, if $r=n^α$ for any $α>0$, with high probability reconstructs the vertex positions with a maximum error of $O(n^β)$ where $β=1/2-(4/3)α$, until $α\ge 3/8$ where $β=0$ and the error becomes $O(\sqrt{\log n})$. This improves over earlier results, which were unable to reconstruct with error less than $r$. Our method estimates Euclidean distances using a hybrid of graph distances and short-range estimates based on the number of common neighbors. We extend our results to the surface of the sphere in $\mathbb{R}^3$ and to hypercubes in any constant fixed dimension. Additionally, we examine the extent to which reconstruction is still possible when the original adjacency lists have had a subset of the edges independently deleted at random.
Submitted 17 May, 2022; v1 submitted 29 July, 2021;
originally announced July 2021.
-
Estimating the Uncertainty of Neural Network Forecasts for Influenza Prevalence Using Web Search Activity
Authors:
Michael Morris,
Peter Hayes,
Ingemar J. Cox,
Vasileios Lampos
Abstract:
Influenza is an infectious disease with the potential to become a pandemic, and hence, forecasting its prevalence is an important undertaking for planning an effective response. Research has found that web search activity can be used to improve influenza models. Neural networks (NN) can provide state-of-the-art forecasting accuracy but do not commonly incorporate uncertainty in their estimates, something essential for using them effectively during decision making. In this paper, we demonstrate how Bayesian Neural Networks (BNNs) can be used to both provide a forecast and a corresponding uncertainty without significant loss in forecasting accuracy compared to traditional NNs. Our method accounts for two sources of uncertainty: data and model uncertainty, arising due to measurement noise and model specification, respectively. Experiments are conducted using 14 years of data for England, assessing the model's accuracy over the last 4 flu seasons in this dataset. We evaluate the performance of different models including competitive baselines with conventional metrics as well as error functions that incorporate uncertainty estimates. Our empirical analysis indicates that considering both sources of uncertainty simultaneously is superior to considering either one separately. We also show that a BNN with recurrent layers that models both sources of uncertainty yields superior accuracy for these metrics for forecasting horizons greater than 7 days.
Submitted 26 May, 2021;
originally announced May 2021.
-
Wake Up and Join Me! An Energy-Efficient Algorithm for Maximal Matching in Radio Networks
Authors:
Varsha Dani,
Aayush Gupta,
Thomas P. Hayes,
Seth Pettie
Abstract:
We consider networks of small, autonomous devices that communicate with each other wirelessly. Minimizing energy usage is an important consideration in designing algorithms for such networks, as battery life is a crucial and limited resource. Working in a model where both sending and listening for messages deplete energy, we consider the problem of finding a maximal matching of the nodes in a radio network of arbitrary and unknown topology.
We present a distributed randomized algorithm that produces, with high probability, a maximal matching. The maximum energy cost per node is $O(\log^2 n)$, where $n$ is the size of the network. The total latency of our algorithm is $O(n \log n)$ time steps. We observe that there exist families of network topologies for which both of these bounds are simultaneously optimal up to polylog factors, so any significant improvement will require additional assumptions about the network topology.
We also consider the related problem of assigning, for each node in the network, a neighbor to back up its data in case of node failure. Here, a key goal is to minimize the maximum load, defined as the number of nodes assigned to a single node. We present a decentralized low-energy algorithm that finds a neighbor assignment whose maximum load is at most a polylog($n$) factor bigger than the optimum.
Submitted 16 April, 2022; v1 submitted 19 April, 2021;
originally announced April 2021.
-
The Energy Complexity of BFS in Radio Networks
Authors:
Yi-Jun Chang,
Varsha Dani,
Thomas P. Hayes,
Seth Pettie
Abstract:
We consider a model of energy complexity in Radio Networks in which transmitting or listening on the channel costs one unit of energy and computation is free. This simplified model captures key aspects of battery-powered sensors: that battery life is most influenced by transceiver usage, and that at low transmission powers, the actual cost of transmitting and listening are very similar.
The energy complexity of tasks in single-hop networks is well understood. Recent work of Chang et al. considered energy complexity in multi-hop networks and showed that $\mathsf{Broadcast}$ admits an energy-efficient protocol, by which we mean each of the $n$ nodes in the network spends $O(\text{polylog}(n))$ energy. This work left open the strange possibility that all natural problems in multi-hop networks might admit such an energy-efficient solution.
In this paper we prove that the landscape of energy complexity is rich enough to support a multitude of problem complexities. Whereas $\mathsf{Broadcast}$ can be solved by an energy-efficient protocol, exact computation of $\mathsf{Diameter}$ cannot, requiring $Ω(n)$ energy. Our main result is that $\mathsf{Breadth First Search}$ has sub-polynomial energy complexity at most $2^{O(\sqrt{\log n\log\log n})}=n^{o(1)}$; whether it admits an efficient $O(\text{polylog}(n))$-energy protocol is an open problem.
Our main algorithm involves recursively solving a generalized BFS problem on a cluster graph introduced by Miller, Peng, and Xu. In this application, we make crucial use of a close relationship between distances in this cluster graph, and distances in the original network. This relationship is new and may be of independent interest.
Submitted 19 July, 2020;
originally announced July 2020.
-
Improved Strong Spatial Mixing for Colorings on Trees
Authors:
Charilaos Efthymiou,
Andreas Galanis,
Thomas P. Hayes,
Daniel Stefankovic,
Eric Vigoda
Abstract:
Strong spatial mixing (SSM) is a form of correlation decay that has played an essential role in the design of approximate counting algorithms for spin systems. A notable example is the algorithm of Weitz (2006) for the hard-core model on weighted independent sets. We study SSM for the $q$-colorings problem on the infinite $(d+1)$-regular tree. Weak spatial mixing (WSM) captures whether the influence of the leaves on the root vanishes as the height of the tree grows. Jonasson (2002) established WSM when $q>d+1$. In contrast, in SSM, we first fix a coloring on a subset of internal vertices, and we again ask if the influence of the leaves on the root is vanishing. It was known that SSM holds on the $(d+1)$-regular tree when $q>αd$ where $α\approx 1.763...$ is a constant that has arisen in a variety of results concerning random colorings. Here we improve on this bound by showing SSM for $q>1.59d$. Our proof establishes an $L^2$ contraction for the BP operator. For the contraction we bound the norm of the BP Jacobian by exploiting combinatorial properties of the coloring of the tree.
Submitted 16 September, 2019;
originally announced September 2019.
-
Distributed Metropolis Sampler with Optimal Parallelism
Authors:
Weiming Feng,
Thomas P. Hayes,
Yitong Yin
Abstract:
The Metropolis-Hastings algorithm is a fundamental Markov chain Monte Carlo (MCMC) method for sampling and inference. With the advent of Big Data, distributed and parallel variants of MCMC methods are attracting increased attention. In this paper, we give a distributed algorithm that can correctly simulate sequential single-site Metropolis chains without any bias in a fully asynchronous message-passing model. Furthermore, if a natural Lipschitz condition is satisfied by the Metropolis filters, our algorithm can simulate $N$-step Metropolis chains within $O(N/n+\log n)$ rounds of asynchronous communications, where $n$ is the number of variables. For sequential single-site dynamics, whose mixing requires $Ω(n\log n)$ steps, this achieves an optimal linear speedup. For several well-studied important graphical models, including proper graph coloring, hardcore model, and Ising model, our condition for linear speedup is weaker than the respective uniqueness (mixing) conditions.
The novel idea in our algorithm is to resolve updates in advance: the local Metropolis filters can often be executed correctly before the full information about neighboring spins is available. This achieves optimal parallelism without introducing any bias.
Submitted 14 July, 2019; v1 submitted 1 April, 2019;
originally announced April 2019.
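The sequential single-site Metropolis chain that the abstract above sets out to simulate can be sketched for proper graph coloring, one of the models it mentions. This is only the sequential baseline, not the distributed algorithm; the function names and the adjacency-dict representation are assumptions.

```python
import random


def metropolis_coloring_step(adj, vertices, coloring, q, rng):
    """One single-site Metropolis update for proper q-colorings.

    Propose a uniformly random color for a uniformly random vertex; the
    Metropolis filter accepts only if no neighbor already has that color.
    """
    v = rng.choice(vertices)
    proposal = rng.randrange(q)
    if all(coloring[u] != proposal for u in adj[v]):
        coloring[v] = proposal


def run_metropolis_coloring(adj, coloring, q, steps, seed=0):
    """Run `steps` sequential updates starting from a proper coloring."""
    rng = random.Random(seed)
    vertices = sorted(adj)
    state = dict(coloring)
    for _ in range(steps):
        metropolis_coloring_step(adj, vertices, state, q, rng)
    return state


# Usage: a 6-cycle with q = 4 colors, started from a proper 2-coloring.
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
start = {i: i % 2 for i in range(6)}
print(run_metropolis_coloring(cycle, start, q=4, steps=1000, seed=1))
```

Each update reads only the colors of a vertex's neighbors, which is the locality that a distributed simulation of the chain can exploit.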
-
Spread Divergence
Authors:
Mingtian Zhang,
Peter Hayes,
Tom Bird,
Raza Habib,
David Barber
Abstract:
For distributions $\mathbb{P}$ and $\mathbb{Q}$ with different supports or undefined densities, the divergence $\textrm{D}(\mathbb{P}||\mathbb{Q})$ may not exist. We define a Spread Divergence $\tilde{\textrm{D}}(\mathbb{P}||\mathbb{Q})$ on modified $\mathbb{P}$ and $\mathbb{Q}$ and describe sufficient conditions for the existence of such a divergence. We demonstrate how to maximize the discriminatory power of a given divergence by parameterizing and learning the spread. We also give examples of using a Spread Divergence to train implicit generative models, including linear models (Independent Components Analysis) and non-linear models (Deep Generative Networks).
Submitted 4 December, 2022; v1 submitted 21 November, 2018;
originally announced November 2018.
-
Distributed Symmetry Breaking in Sampling (Optimal Distributed Randomly Coloring with Fewer Colors)
Authors:
Weiming Feng,
Thomas P. Hayes,
Yitong Yin
Abstract:
We examine the problem of almost-uniform sampling of proper $q$-colorings of a graph whose maximum degree is $Δ$. A famous result, discovered independently by Jerrum (1995) and Salas and Sokal (1997), is that, assuming $q > (2+δ) Δ$, the Glauber dynamics (a.k.a. single-site dynamics) for this problem has mixing time $O(n \log n)$, where $n$ is the number of vertices, and thus provides a nearly linear time sampling algorithm for this problem. A natural question is the extent to which this algorithm can be parallelized. Previous work of Feng, Sun and Yin [PODC'17] has shown that an $O(Δ\log n)$ time parallelized algorithm is possible, and that $Ω(\log n)$ time is necessary.
We give a distributed sampling algorithm, which we call the Lazy Local Metropolis Algorithm, that achieves an optimal parallelization of this classic algorithm. It improves its predecessor, the Local Metropolis algorithm of Feng, Sun and Yin [PODC'17], by introducing a step of distributed symmetry breaking that helps the mixing of the distributed sampling algorithm.
For sampling almost-uniform proper $q$-colorings of graphs $G$ on $n$ vertices, we show that the Lazy Local Metropolis algorithm achieves an optimal $O(\log n)$ mixing time if either of the following conditions is true for an arbitrary constant $δ>0$:
$\bullet$ $q\ge(2+δ)Δ$, on general graphs with maximum degree $Δ$;
$\bullet$ $q \geq (α^* + δ)Δ$, where $α^* \approx 1.763$ satisfies $α^* = \mathrm{e}^{1/α^*}$, on graphs with sufficiently large maximum degree $Δ\ge Δ_0(δ)$ and girth at least $9$.
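For orientation, the sequential single-site Glauber dynamics that this abstract takes as its baseline can be sketched as follows. This is a minimal illustration, not the Lazy Local Metropolis algorithm from the paper; the function names, the adjacency-dictionary input format, and the greedy initial coloring are assumptions made for the example.

```python
import random

def glauber_coloring(adj, q, steps, seed=0):
    """Run sequential single-site Glauber dynamics on proper q-colorings.

    adj maps each vertex to a list of its neighbours (hypothetical format).
    Assumes q exceeds the maximum degree, so a free color always exists.
    """
    rng = random.Random(seed)
    # Greedy proper coloring as the starting state.
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(q) if c not in used)
    vertices = list(adj)
    for _ in range(steps):
        v = rng.choice(vertices)  # pick a uniformly random vertex
        blocked = {color[u] for u in adj[v]}
        allowed = [c for c in range(q) if c not in blocked]
        color[v] = rng.choice(allowed)  # recolor uniformly among free colors
    return color

def is_proper(adj, color):
    """Check that no edge has both endpoints in the same color."""
    return all(color[u] != color[v] for u in adj for v in adj[u])

# Usage: a 6-cycle (maximum degree 2) with q = 5 colors.
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
sample = glauber_coloring(cycle, q=5, steps=200, seed=1)
```

Each step touches one vertex, which is why the sequential chain needs on the order of $n \log n$ steps; the distributed algorithms discussed in the abstract update many vertices per round.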
Submitted 21 June, 2018; v1 submitted 19 February, 2018;
originally announced February 2018.
-
The Energy Complexity of Broadcast
Authors:
Yi-Jun Chang,
Varsha Dani,
Thomas P. Hayes,
Qizheng He,
Wenzheng Li,
Seth Pettie
Abstract:
Energy is often the most constrained resource in networks of battery-powered devices, and as devices become smaller, they spend a larger fraction of their energy on communication (transceiver usage) not computation. As an imperfect proxy for true energy usage, we define energy complexity to be the number of time slots a device transmits/listens; idle time and computation are free.
In this paper we investigate the energy complexity of fundamental communication primitives such as broadcast in multi-hop radio networks. We consider models with collision detection (CD) and without (No-CD), as well as both randomized and deterministic algorithms. Some take-away messages from this work include:
1. The energy complexity of broadcast in a multi-hop network is intimately connected to the time complexity of leader election in a single-hop (clique) network. Many existing lower bounds on time complexity immediately transfer to energy complexity. For example, in the CD and No-CD models, we need $Ω(\log n)$ and $Ω(\log^2 n)$ energy, respectively.
2. The energy lower bounds above can almost be achieved, given sufficient ($Ω(n)$) time. In the CD and No-CD models we can solve broadcast using $O(\frac{\log n\log\log n}{\log\log\log n})$ energy and $O(\log^3 n)$ energy, respectively.
3. The complexity measures of Energy and Time are in conflict, and it is an open problem whether both can be minimized simultaneously. We give a tradeoff showing it is possible to be nearly optimal in both measures simultaneously. For any constant $ε>0$, broadcast can be solved in $O(D^{1+ε}\log^{O(1/ε)} n)$ time with $O(\log^{O(1/ε)} n)$ energy, where $D$ is the diameter of the network.
Submitted 4 October, 2017;
originally announced October 2017.
-
Sampling Random Colorings of Sparse Random Graphs
Authors:
Charilaos Efthymiou,
Thomas P. Hayes,
Daniel Stefankovic,
Eric Vigoda
Abstract:
We study the mixing properties of the single-site Markov chain known as the Glauber dynamics for sampling $k$-colorings of a sparse random graph $G(n,d/n)$ for constant $d$. The best known rapid mixing results for general graphs are in terms of the maximum degree $Δ$ of the input graph $G$ and hold when $k>11Δ/6$ for all $G$. Improved results hold when $k>αΔ$ for graphs with girth $\geq 5$ and $Δ$ sufficiently large where $α\approx 1.7632\ldots$ is the root of $α=\exp(1/α)$; further improvements on the constant $α$ hold with stronger girth and maximum degree assumptions. For sparse random graphs the maximum degree is a function of $n$ and the goal is to obtain results in terms of the expected degree $d$. The following rapid mixing results for $G(n,d/n)$ hold with high probability over the choice of the random graph for sufficiently large constant $d$. Mossel and Sly (2009) proved rapid mixing for constant $k$, and Efthymiou (2014) improved this to $k$ linear in $d$. The condition was improved to $k>3d$ by Yin and Zhang (2016) using non-MCMC methods. Here we prove rapid mixing when $k>αd$ where $α\approx 1.7632\ldots$ is the same constant as above. Moreover we obtain $O(n^{3})$ mixing time of the Glauber dynamics, while in previous rapid mixing results the exponent was an increasing function in $d$. As in previous results for random graphs our proof analyzes an appropriately defined block dynamics to "hide" high-degree vertices. One new aspect in our improved approach is utilizing so-called local uniformity properties for the analysis of block dynamics. To analyze the "burn-in" phase we prove a concentration inequality for the number of disagreements propagating in large blocks.
Submitted 12 July, 2017;
originally announced July 2017.
-
Energy-Efficient Hybrid Stochastic-Binary Neural Networks for Near-Sensor Computing
Authors:
Vincent T. Lee,
Armin Alaghi,
John P. Hayes,
Visvesh Sathe,
Luis Ceze
Abstract:
Recent advances in neural networks (NNs) exhibit unprecedented success at transforming large, unstructured data streams into compact higher-level semantic information for tasks such as handwriting recognition, image classification, and speech recognition. Ideally, systems would employ near-sensor computation to execute these tasks at sensor endpoints to maximize data reduction and minimize data movement. However, near-sensor computing presents its own set of challenges such as operating power constraints, energy budgets, and communication bandwidth capacities. In this paper, we propose a stochastic-binary hybrid design which splits the computation between the stochastic and binary domains for near-sensor NN applications. In addition, our design uses a new stochastic adder and multiplier that are significantly more accurate than existing adders and multipliers. We also show that retraining the binary portion of the NN computation can compensate for precision losses introduced by shorter stochastic bit-streams, allowing faster run times at minimal accuracy losses. Our evaluation shows that our hybrid stochastic-binary design can achieve 9.8x energy efficiency savings, and application-level accuracies within 0.05% compared to conventional all-binary designs.
Submitted 7 June, 2017;
originally announced June 2017.
-
Distributed Computing with Channel Noise
Authors:
Abhinav Aggarwal,
Varsha Dani,
Thomas P. Hayes,
Jared Saia
Abstract:
A group of $n$ users want to run a distributed protocol $π$ over a network where communication occurs via private point-to-point channels. Unfortunately, an adversary, who knows $π$, is able to maliciously flip bits on the channels. Can we efficiently simulate $π$ in the presence of such an adversary? We show that this is possible, even when $L$, the number of bits sent in $π$, and $T$, the number of bits flipped by the adversary, are not known in advance. In particular, we show how to create a robust version of $π$ that 1) fails with probability at most $δ$, for any $δ>0$; and 2) sends $\tilde{O}(L + T)$ bits, where the $\tilde{O}$ notation hides a $\log (nL/ δ)$ term multiplying $L$. Additionally, we show how to improve this result when the average message size $α$ is not constant. In particular, we give an algorithm that sends $O( L (1 + (1/α) \log (n L/δ)) + T)$ bits. This algorithm is adaptive in that it does not require a priori knowledge of $α$. We note that if $α$ is $Ω\left( \log (n L/δ) \right)$, then this improved algorithm sends only $O(L+T)$ bits, and is therefore within a constant factor of optimal.
Submitted 24 July, 2017; v1 submitted 18 December, 2016;
originally announced December 2016.
-
Codes, Lower Bounds, and Phase Transitions in the Symmetric Rendezvous Problem
Authors:
Varsha Dani,
Thomas P. Hayes,
Cristopher Moore,
Alexander Russell
Abstract:
In the rendezvous problem, two parties with different labelings of the vertices of a complete graph are trying to meet at some vertex at the same time. It is well-known that if the parties have predetermined roles, then the strategy where one of them waits at one vertex, while the other visits all $n$ vertices in random order is optimal, taking at most $n$ steps and averaging about $n/2$. Anderson and Weber considered the symmetric rendezvous problem, where both parties must use the same randomized strategy. They analyzed strategies where the parties repeatedly play the optimal asymmetric strategy, determining their role independently each time by a biased coin-flip. By tuning the bias, Anderson and Weber achieved an expected meeting time of about $0.829 n$, which they conjectured to be asymptotically optimal.
We change perspective slightly: instead of minimizing the expected meeting time, we seek to maximize the probability of meeting within a specified time $T$. The Anderson-Weber strategy, which fails with constant probability when $T= Θ(n)$, is not asymptotically optimal for large $T$ in this setting. Specifically, we exhibit a symmetric strategy that succeeds with probability $1-o(1)$ in $T=4n$ steps. This is tight: for any $α< 4$, any symmetric strategy with $T = αn$ fails with constant probability. Our strategy uses a new combinatorial object that we dub a "rendezvous code," which may be of independent interest.
When $T \le n$, we show that the probability of meeting within $T$ steps is indeed asymptotically maximized by the Anderson-Weber strategy. Our results imply new lower bounds, showing that the best symmetric strategy takes at least $0.638 n$ steps in expectation. We also present some partial results for the symmetric rendezvous problem on other vertex-transitive graphs.
Submitted 6 September, 2016;
originally announced September 2016.
-
Evaluation System for a Bayesian Optimization Service
Authors:
Ian Dewancker,
Michael McCourt,
Scott Clark,
Patrick Hayes,
Alexandra Johnson,
George Ke
Abstract:
Bayesian optimization is an elegant solution to the hyperparameter optimization problem in machine learning. Building a reliable and robust Bayesian optimization service requires careful testing methodology and sound statistical analysis. In this talk we will outline our development of an evaluation framework to rigorously test and measure the impact of changes to the SigOpt optimization service. We present an overview of our evaluation system and discuss how this framework empowers our research engineers to confidently and quickly make changes to our core optimization engine.
Submitted 19 May, 2016;
originally announced May 2016.
-
Convergence of MCMC and Loopy BP in the Tree Uniqueness Region for the Hard-Core Model
Authors:
Charilaos Efthymiou,
Thomas P. Hayes,
Daniel Stefankovic,
Eric Vigoda,
Yitong Yin
Abstract:
We study the hard-core model defined on independent sets of an input graph where the independent sets are weighted by a parameter $λ>0$. For constant $Δ$, previous work of Weitz (2006) established an FPTAS for the partition function for graphs of maximum degree $Δ$ when $λ< λ_c(Δ)$. The threshold $λ_c(Δ)$ is the critical point for the phase transition for uniqueness/non-uniqueness on the infinite $Δ$-regular trees. Sly (2010) showed that there is no FPRAS, unless NP=RP, when $λ>λ_c(Δ)$. The running time of Weitz's algorithm is exponential in $\log(Δ)$. Here we present an FPRAS for the partition function whose running time is $O^*(n^2)$. We analyze the simple single-site Glauber dynamics for sampling from the associated Gibbs distribution. We prove there exists a constant $Δ_0$ such that for all graphs with maximum degree $Δ\geqΔ_0$ and girth $\geq 7$, the mixing time of the Glauber dynamics is $O(n\log(n))$ when $λ<λ_c(Δ)$. Our work complements that of Weitz which applies for constant $Δ$ whereas our work applies for all $Δ\geq Δ_0$.
We utilize loopy BP (belief propagation), a widely-used inference algorithm. A novel aspect of our work is using the principal eigenvector for the BP operator to design a distance function which contracts in expectation for pairs of states that behave like the BP fixed point. We also prove that the Glauber dynamics behaves locally like loopy BP. As a byproduct we obtain that the Glauber dynamics converges, after a short burn-in period, close to the BP fixed point, and this implies that the fixed point of loopy BP is a close approximation to the Gibbs distribution. Using these connections we establish that loopy BP quickly converges to the Gibbs distribution when the girth $\geq 6$ and $λ<λ_c(Δ)$.
Submitted 29 August, 2016; v1 submitted 5 April, 2016;
originally announced April 2016.
-
A Stratified Analysis of Bayesian Optimization Methods
Authors:
Ian Dewancker,
Michael McCourt,
Scott Clark,
Patrick Hayes,
Alexandra Johnson,
George Ke
Abstract:
Empirical analysis serves as an important complement to theoretical analysis for studying practical Bayesian optimization. Often empirical insights expose strengths and weaknesses inaccessible to theoretical analysis. We define two metrics for comparing the performance of Bayesian optimization methods and propose a ranking mechanism for summarizing performance within various genres or strata of test functions. These test functions serve to mimic the complexity of hyperparameter optimization problems, the most prominent application of Bayesian optimization, but with a closed form which allows for rapid evaluation and more predictable behavior. This offers a flexible and efficient way to investigate functions with specific properties of interest, such as oscillatory behavior or an optimum on the domain boundary.
Submitted 30 March, 2016;
originally announced March 2016.
-
Interactive Communication with Unknown Noise Rate
Authors:
Varsha Dani,
Thomas P. Hayes,
Mahnush Movahedi,
Jared Saia,
Maxwell Young
Abstract:
Alice and Bob want to run a protocol over a noisy channel, where a certain number of bits are flipped adversarially. Several results take a protocol requiring $L$ bits of noise-free communication and make it robust over such a channel. In a recent breakthrough result, Haeupler described an algorithm that sends a number of bits that is conjectured to be near optimal in such a model. However, his algorithm critically requires a priori knowledge of the number of bits that will be flipped by the adversary.
We describe an algorithm requiring no such knowledge. If an adversary flips $T$ bits, our algorithm sends $L + O\left(\sqrt{L(T+1)\log L} + T\right)$ bits in expectation and succeeds with high probability in $L$. It does so without any a priori knowledge of $T$. Assuming a conjectured lower bound by Haeupler, our result is optimal up to logarithmic factors.
Our algorithm critically relies on the assumption of a private channel. We show that privacy is necessary when the amount of noise is unknown.
Submitted 13 August, 2015; v1 submitted 23 April, 2015;
originally announced April 2015.
-
Lower Bounds on the Critical Density in the Hard Disk Model via Optimized Metrics
Authors:
Thomas P. Hayes,
Cristopher Moore
Abstract:
We prove a new lower bound on the critical density $ρ_c$ of the hard disk model, i.e., the density below which it is possible to efficiently sample random configurations of $n$ non-overlapping disks in a unit torus. We use a classic Markov chain which moves one disk at a time, but with an improved path coupling analysis. Our main tool is an optimized metric on neighboring pairs of configurations, i.e., configurations that differ in the position of a single disk: we define a metric that depends on the difference in these positions, and which approaches zero continuously as they coincide. This improves the previous lower bound $ρ_c \ge 1/8$ to $ρ_c \ge 0.154$.
Submitted 7 July, 2014;
originally announced July 2014.
-
Block Coordinate Descent for Sparse NMF
Authors:
Vamsi K. Potluru,
Sergey M. Plis,
Jonathan Le Roux,
Barak A. Pearlmutter,
Vince D. Calhoun,
Thomas P. Hayes
Abstract:
Nonnegative matrix factorization (NMF) has become a ubiquitous tool for data analysis. An important variant is the sparse NMF problem which arises when we explicitly require the learnt features to be sparse. A natural measure of sparsity is the L$_0$ norm; however, its optimization is NP-hard. Mixed norms, such as the L$_1$/L$_2$ measure, have been shown to model sparsity robustly, based on intuitive attributes that such measures need to satisfy. This is in contrast to computationally cheaper alternatives such as the plain L$_1$ norm. However, present algorithms designed for optimizing the mixed norm L$_1$/L$_2$ are slow, and other formulations for sparse NMF have been proposed such as those based on L$_1$ and L$_0$ norms. Our proposed algorithm allows us to solve the mixed norm sparsity constraints while not sacrificing computation time. We present experimental evidence on real-world datasets that shows our new algorithm performs an order of magnitude faster compared to the current state-of-the-art solvers optimizing the mixed norm and is suitable for large-scale datasets.
Submitted 18 March, 2013; v1 submitted 15 January, 2013;
originally announced January 2013.
-
How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman
Authors:
Thomas P. Hayes
Abstract:
Consider a gambling game in which we are allowed to repeatedly bet a portion of our bankroll at favorable odds. We investigate the question of how to minimize the expected number of rounds needed to increase our bankroll to a given target amount.
Specifically, we disprove a 50-year-old conjecture of L. Breiman, that there exists a threshold strategy that optimizes the expected number of rounds; that is, a strategy that always bets to try to win in one round whenever the bankroll is at least a certain threshold, and that makes Kelly bets (a simple proportional betting scheme) whenever the bankroll is below the threshold.
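To make the shape of such a threshold strategy concrete, here is a minimal sketch. It assumes a simple game in which each bet wins with probability `p` at net odds `b` to 1; these parameters, the function names, and the use of the standard Kelly fraction $p - (1-p)/b$ are assumptions for illustration, not details taken from the paper, which shows that no strategy of this form is optimal in general.

```python
def kelly_fraction(p, b):
    """Standard Kelly fraction for win probability p and net odds b to 1."""
    return max(0.0, p - (1.0 - p) / b)

def threshold_bet(bankroll, target, threshold, p, b):
    """Stake chosen by a threshold strategy (hypothetical parameterization).

    At or above the threshold: stake just enough to reach the target with
    one win, capped at the whole bankroll. Below the threshold: stake the
    Kelly fraction of the bankroll.
    """
    if bankroll >= threshold:
        needed = (target - bankroll) / b
        return min(bankroll, max(0.0, needed))
    return kelly_fraction(p, b) * bankroll

# Usage: even-money bets won with probability 0.6, target bankroll 100.
high = threshold_bet(80.0, 100.0, 50.0, 0.6, 1.0)  # tries to finish in one round
low = threshold_bet(10.0, 100.0, 50.0, 0.6, 1.0)   # proportional Kelly stake
```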
Submitted 4 December, 2011;
originally announced December 2011.
-
Checking Equivalence of Quantum Circuits and States
Authors:
George F. Viamontes,
Igor L. Markov,
John P. Hayes
Abstract:
Quantum computing promises exponential speed-ups for important simulation and optimization problems. It also poses new CAD problems that are similar to, but more challenging than, the related problems in classical (non-quantum) CAD, such as determining if two states or circuits are functionally equivalent. While differences in classical states are easy to detect, quantum states, which are represented by complex-valued vectors, exhibit subtle differences leading to several notions of equivalence. This provides flexibility in optimizing quantum circuits, but leads to difficult new equivalence-checking issues for simulation and synthesis. We identify several different equivalence-checking problems and present algorithms for practical benchmarks, including quantum communication and search circuits, which are shown to be very fast and robust for hundreds of qubits.
Submitted 1 May, 2007; v1 submitted 1 May, 2007;
originally announced May 2007.
-
How to Beat the Adaptive Multi-Armed Bandit
Authors:
Varsha Dani,
Thomas P. Hayes
Abstract:
The multi-armed bandit is a concise model for the problem of iterated decision-making under uncertainty. In each round, a gambler must pull one of $K$ arms of a slot machine, without any foreknowledge of their payouts, except that they are uniformly bounded. A standard objective is to minimize the gambler's regret, defined as the largest payout which would have been achieved by any fixed arm, in hindsight, minus the gambler's total payout. Note that the gambler is only told the payout for the arm actually chosen, not for the unchosen arms.
Almost all previous work on this problem assumed the payouts to be non-adaptive, in the sense that the distribution of the payout of arm $j$ in round $i$ is completely independent of the choices made by the gambler on rounds $1, \dots, i-1$. In the more general model of adaptive payouts, the payouts in round $i$ may depend arbitrarily on the history of past choices made by the algorithm.
We present a new algorithm for this problem, and prove nearly optimal guarantees for the regret against both non-adaptive and adaptive adversaries. After $T$ rounds, our algorithm has regret $O(\sqrt{T})$ with high probability (the tail probability decays exponentially). This dependence on $T$ is best possible, and matches that of the full-information version of the problem, in which the gambler is told the payouts for all $K$ arms after each round.
Previously, even for non-adaptive payouts, the best high-probability bounds known were $O(T^{2/3})$, due to Auer, Cesa-Bianchi, Freund and Schapire. The expected regret of their algorithm is $O(T^{1/2})$ for non-adaptive payouts, but as we show, $Ω(T^{2/3})$ for adaptive payouts.
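As a small illustration of the regret measure discussed in this abstract, the sketch below computes regret against the best fixed arm in hindsight for a fully known, non-adaptive payout table. It uses the usual sign convention (best fixed arm minus the gambler's total), and the function name and input format are assumptions for the example; it is not the algorithm proposed in the paper.

```python
def regret(payouts, choices):
    """Regret versus the best fixed arm in hindsight.

    payouts[t][j] is the payout of arm j in round t (a hypothetical,
    non-adaptive table); choices[t] is the arm the gambler pulled.
    """
    earned = sum(payouts[t][arm] for t, arm in enumerate(choices))
    num_arms = len(payouts[0])
    best_fixed = max(sum(row[j] for row in payouts) for j in range(num_arms))
    return best_fixed - earned

# Usage: two arms, three rounds; the gambler always pulls arm 1.
table = [[1, 0], [0, 1], [1, 0]]
r = regret(table, [1, 1, 1])  # arm 0 would have earned 2, the gambler earned 1
```

With adaptive payouts the table itself depends on the gambler's past choices, so a fixed table like this one only captures the non-adaptive setting.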
Submitted 14 February, 2006;
originally announced February 2006.